
Archive for the ‘Commentary’ Category

The sad story of Xerox

February 25th, 2010

I stumbled upon the news today that Xerox is suing Google and Yahoo! for patent infringement on search technology.

Did they just find out about that patent? I mean, was it hidden in a drawer somewhere for all these years? Weren’t they supposed to be the ‘document company’?

That got me to dig a little deeper: from a very superficial look at their market valuation over time, the company hasn’t been doing this badly since the early ’80s (yeah, you read that right, the ’80s); not even the various bubbles and recessions in between did as much damage.

A little more digging shows that S&P recently downgraded Xerox’s rating to “brink of junk” territory (and I’m quoting the Wall Street Journal here, not my words) early this month (Feb 2010), following their acquisition of Affiliated Computer Services Inc. the week before (for $5.6B... and Xerox’s market cap today is $8B... yeah, fishy).

So, let me get this straight: Xerox has one of the most advanced and prolific research labs in the history of innovation (PARC), where things like window-based graphical user interfaces, the mouse, Ethernet and the laser printer were invented.

Yet, they failed to capitalize on *any* of those inventions (not even laser printing, which really feels like a no-brainer to me).

And now they sue Google and Yahoo (but hey, not Microsoft! Go figure) for patent infringement? On search? 10 years later?

I have a generally moderate position on software patents (I think there are a few genius ones that do deserve their temporary monopoly), but I feel the problem is not in the concept of rewarding innovation (which I strongly support) but in the way the system has been turned around and is now used to abuse and harass far more than to protect investment.

Xerox is nothing but the poster child of failure to capitalize on its own innovation and, frankly, resorting to the judiciary system to compensate for it shouts managerial incapacity to my ears.

Not only that, but it gives off a sense of utter desperation: it is one thing for two directly competing businesses to sue each other trying to get any minimal advantage. It’s not pretty, but it’s understandable (it’s a prisoner’s dilemma scenario where cease-fire and the moral high ground are inherently unstable).

It is a completely different case when a company facing difficulties tries to compensate by milking somebody else’s cash cow for no other reason than that they thought of it too but could not (or did not want to) profit from it when they did. This is no prisoner’s dilemma, this is no unavoidable escalation: this feels no different from any other patent troll that feeds, parasite-like, on the fact that having “filed” an idea first can give them a temporary monopoly on it, no matter how obvious it is for others to come up with the same solution when faced with the same problem.

Shame on you, Xerox: you were a company that I have admired and respected greatly over the years. So sad now to think of you as a desperate patent troll.

UPDATE: apparently, they are not new to waking up late in the game and resorting to lawsuits to compensate for their managerial inability to capitalize on their own inventions. Still, pretty sad overall.


More Thoughts on Reconciliation

July 25th, 2009

My previous post about the importance of reconciliation and relational density in the web of data managed to attract a lot of interest, along with stimulating comments and criticism. I’ll try to address them here.

LOD and the Curse of Incoherence

Ed Summers (from the Library of Congress) really liked the notion of relational density as a differentiator (and asked if there is a way to calculate it), but disagrees with my view that Linking Open Data (LOD) and Freebase differ in their model of reconciliation and that the reason why LOD is making a bigger impact than other semantic web efforts is precisely because of the switch from a posteriori to a priori identifier reconciliation.

Ed certainly has a point and it is true that maybe my criticism of a posteriori reconciliation is better suited for something like Sig.ma than for LOD.

At the same time, the part about relational density remains valid: LOD itself claims that only 1 out of 33 triples actually ‘interlinks’ data between datasets; the others are internal relationships or literal values.
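As for how one might calculate such a ratio, here is a rough sketch, not LOD's own method. It assumes every identifier carries a dataset prefix (the `dataset_of` helper, the prefixes and the sample triples are all invented for illustration) and that anything without a prefix is a literal value. The ratio is simply the share of triples whose subject and object live in different datasets.

```python
def dataset_of(term):
    """Hypothetical convention: 'dbpedia:Chicago' belongs to dataset 'dbpedia'.
    Terms without a prefix are treated as literal values (returns None)."""
    if ":" not in term:
        return None
    return term.split(":", 1)[0]


def interlink_ratio(triples):
    """Fraction of triples whose subject and object belong to different datasets."""
    if not triples:
        return 0.0
    links = 0
    for subject, _predicate, obj in triples:
        source, target = dataset_of(subject), dataset_of(obj)
        if source is not None and target is not None and source != target:
            links += 1
    return links / len(triples)


# Invented sample data: one literal, one internal link, one cross-dataset link.
triples = [
    ("dbpedia:Chicago", "label", "Chicago"),
    ("dbpedia:Chicago", "locatedIn", "dbpedia:Illinois"),
    ("dbpedia:Chicago", "sameAs", "geonames:chicago"),
]
print(interlink_ratio(triples))
```

On this toy sample one triple out of three crosses a dataset boundary. A real measure of relational density would probably also need to weigh internal links, which this sketch deliberately ignores.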

It is worth mentioning here the experience we had in SIMILE when creating a browsing interface on top of the union of three different high-quality digital libraries: the surprising and painful result was that even with a powerful faceted-browsing user interface, a small number of datasets and a tailored UI, the perception of our users was that merging many high-quality datasets resulted in a bigger but lower-quality one. The items from the different sets mixed, but didn’t merge (even if equivalences between various identifiers were found and explicitly integrated in the system).

Basically, a-posteriori sameAs identifier equivalences are enough to reconcile if and only if the two items being merged are described using the exact same data model. I see this as another form of ‘abstraction leakage’: even if the identifiers of items and schemas were mapped and aligned, this operation might not be enough to perform a real reconciliation, as I explained in a previous post on the quality of metadata.
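A toy illustration of the point, under heavy assumptions: the two records, their property names and the `merge_same_as` helper below are invented, and the merge is the most naive one imaginable (union the properties of two items declared equivalent). It is not how SIMILE or any LOD tool actually works.

```python
def merge_same_as(record_a, record_b):
    """Naive a-posteriori merge: union the properties of two records declared sameAs."""
    merged = {}
    for record in (record_a, record_b):
        for prop, value in record.items():
            merged.setdefault(prop, set()).add(value)
    return merged


# Two libraries describing the same item with different (hypothetical) data models.
library_a = {"creator": "Eco, Umberto", "date": "1980"}
library_b = {"author": "Umberto Eco", "published": "1980-01-01"}

merged = merge_same_as(library_a, library_b)
# The identifiers are reconciled, but the facts sit side by side under four
# different property names instead of collapsing into two.
print(sorted(merged))
```

The merged item is bigger but no more coherent: a facet on "author" would still miss everything filed under "creator". Only when both sources share the same properties and value conventions does the union actually collapse.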

So, in short, while it is fair to say that LOD is indeed nudging people to link to one another’s datasets a priori and make an effort to do so, the result is relationally sparse and ontologically inconsistent... and I don’t think this is because LOD is doing a bad job but because their model revolves around the idea that the more data the better, no matter how relationally dense, which is in striking contrast to what Freebase does, focusing more on higher relational density than on higher item or domain counts.

Reconciliation Costs

John Giannandrea, Metaweb’s CTO and former Chief Technologist at Netscape, wrote me privately in an email (which I quote here with permission) challenging my assumption that there is no evidence that the reconciliation costs don’t grow with the amount of data and that it might lead to saturation of resources:

Actually I think there is some evidence already that a-priori costs are an inverse function of graph size, assuming that the graph is not getting more sparse.  That is, the more you know about stuff you are trying to match, the better you can do.

I added the emphasis in the quote above because I think that’s the key to the whole process: in order to attract large quantities of information you can decide to punt on the problem of reconciliation, but those costs will grow at least linearly (and probably faster, given that the number of potential relationships grows quadratically) with the amount of information to reconcile. On the other hand, by focusing on high relational density from the very beginning, the velocity of data acquisition will be small but the reconciliation costs will get smaller over time, creating a positive network effect.

I found it interesting to spot the same concept in this article, in a related but different context (the article was mentioned in Tim O’Reilly’s latest web re-branding campaign):

Consider what happens when there are two records describing two different people as they appear to share the same name. “What happens is a third record shows up in the future that works like glue, which causes them to collapse,” he said. Eventually, “the more data we loaded, the fewer number of people there were.”
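The ‘glue’ effect can be shown with a toy sketch. This is not the system described in the quoted article; it is a minimal union-find grouping under the assumption that only a phone number or an email identifies a person (the `MATCH_FIELDS` choice, the `count_entities` helper and the sample records are all made up), while a shared name alone proves nothing.

```python
# Assumption: only these fields identify a person; a shared name alone does not.
MATCH_FIELDS = ("phone", "email")


def count_entities(records):
    """Count distinct entities, linking records that share any identifying value."""
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for index, record in enumerate(records):
        for field in MATCH_FIELDS:
            if field not in record:
                continue
            key = (field, record[field])
            if key in first_seen:
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index
    return len({find(i) for i in range(len(records))})


a = {"name": "J. Smith", "phone": "555-0100"}
b = {"name": "J. Smith", "email": "js@example.org"}
glue = {"name": "J. Smith", "phone": "555-0100", "email": "js@example.org"}

print(count_entities([a, b]))        # two records, no shared identifier
print(count_entities([a, b, glue]))  # the third record links the first two
```

With only the first two records the sketch counts two people; loading the third record, which shares the phone with one and the email with the other, collapses them into one. More data, fewer people.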

Big vs. Small

Prof. David Karger flatters me by writing his first (quoting him) ‘blog rebuttal’ and asking whether or not RDF is any good without a web of linked data:

[..] there are some tremendous advantages that accrue when even a single individual decides to create a blob of structured data with no reference to anyone else’s. The first is interaction [...] The second is portability [...]

What I find interesting about his argument is that David uses “RDF” and “structured data” interchangeably, and uses Exhibit to illustrate the interaction value you get from adding such structure to your data, even though Exhibit itself does not properly use RDF but rather a data model that is equivalent in its relational structure but has lost the notion of globally unique identifiers (and in that regard is much more similar to Excel than to Tabulator):

It doesn’t matter if I call those properties latitude, longitude, date and color, or northSouth, eastWest, sinceTheCreation and elementOfTheRainbow, and whether I decide that my city is Chicago or the Windy City—as long as I have my own internally consistent names for them, I can use them to hook my data into interesting visualizations and interactions.

I added the emphasis to what I consider the key here: however big or small the dataset, consistency is a conditio sine qua non for getting any benefit out of your data. While Exhibit’s interaction design is drastically different from that of spreadsheets, their data models are very similar, and while it’s true that all Exhibits can output their data in valid RDF, their ‘isolated consistency’ model is not nearly as demanding and ambitious as the RDF one (and indeed, because of this, the user experience on Tabulator is vastly inferior, at least with the current state of the web of data, to that of Exhibit).

I think there is a subtle but important difference: namely, exporting RDF might help surface more data, but it does nothing to help bring that internal consistency to the outside and grow the relational density of aggregations of exhibits.

David (Huynh)’s work on Potluck was specifically targeted at trying to find a way to build an agglomerative strategy from a collection of exhibits, something that would be required to be able to condense a network effect value from something like Citeline.

In the end, I think we both agree that whether the dataset is small or large, its modeling subjective or objective, consistency and relational density are key to more advanced and sophisticated uses of data, and hardly something marginal that can be ignored while trying to surface its bits and pieces.

 

First Impressions on Sig.ma

July 22nd, 2009

Last week I went out for lunch with fellow Italians Giovanni Tummarello and Paolo Bouquet, who happened to be in LA for a conference, and they mentioned their respective projects, Sig.ma and Okkam.

Some of the things we talked about inspired me to write the last post about reconciliation, but I couldn’t yet link to Sig.ma because it hadn’t been released. But Giovanni posted about it today and officially released it to the public, so I can now talk about it.

Sig.ma is a sort of metacrawler with an a-posteriori reconciler: when you search for something, say “Barack Obama”, it queries a series of underlying search engines for all data fragments found on the web about that query (currently it uses mostly Yahoo! BOSS and Sindice). Then it tries hard to merge them all into a single topic. Because this operation can result in unwanted merges, it gives you options to remove certain sources of data (those you don’t agree with or that are not exactly relevant) and solidify that collection of data fragments (which they call a ‘sigma’) into a permanent URL which you can then send around or embed as an iframe in another web page.

First let me say that compared to most semweb-related academic research projects, this stands out as one of those rare cases where the scientists care about providing useful services and not just publishing papers about potential Utopian scenarios. For that alone, Sig.ma needs to be mentioned and Giovanni and his team praised.

Just like Google Squared, Sig.ma follows a model where the user searches for something and gathers a vast, noisy collection of more-or-less related resources, then spends a considerable amount of time and effort cleaning up, evaluating the results and pruning dead branches, and finally condenses those reconciliation efforts into a particular URL or set of rules.

Unfortunately, the reconciliation energy spent by each individual on the data periphery (at least for now) can’t easily be used to simplify the job of the next person looking to clean up this data (unlike, for example, when you edit a wiki page or commit a patch to an open source project).

While it’s not hard to imagine ways to surface such information from usage patterns or to harvest it further, my principal worry is that a-posteriori reconciliation efforts clash pretty badly with the cognitive effort a person is already making when looking for something.

When you look for something, you have neither the time nor the will to do an ‘editing’ job and clean up somebody else’s mess. You might be willing to do those things, but at a separate time, not when finding useful information is your immediate goal (which, if you think about it, is why Google managed to wipe out the entire set of search engines that existed when it surfaced: PageRank was more effective, and the cognitive effort perceived when using Google to sieve through crap was much lower than when using other search engines like Altavista or Lycos).

This is also the reason why the vast majority of people who land on a Wikipedia page from a search engine don’t stop and edit it, nor do they stop to rearrange the ranking of Google search results even if they can: those activities would get in the way of what the user is currently doing.

This is not really a criticism of Sig.ma or Google Squared, which are both fine examples of much needed and fresh innovation in the field of web search, but a criticism of the general approach of solutions that force users down paths that don’t match their state of mind and that have a hard time collecting human activity simply because of this. Understanding user intent and creating an interaction design that flows harmoniously with it cannot be an afterthought; it needs to be a firm, a-priori driver for the design of the service.

That said, I’m happy to see Sig.ma surface, if only because its non-purist approach comforts me and is refreshing in a world of semantic web research that is often so purist as to become effectively blind.
