
Archive for the ‘Commentary’ Category

More Thoughts on Reconciliation

July 25th, 2009

My previous post about the importance of reconciliation and relational density in the web of data attracted a lot of interest, along with stimulating comments and criticism. I’ll try to address them here.

LOD and the Curse of Incoherence

Ed Summers (from the Library of Congress) really liked the notion of relational density as a differentiator (and asked if there is a way to calculate it), but disagrees with my view that Linking Open Data (LOD) and Freebase differ in their model of reconciliation. He argues that the reason why LOD is making a bigger impact than other semantic web efforts is precisely the switch from a posteriori to a priori identifier reconciliation.

Ed certainly has a point and it is true that maybe my criticism of a posteriori reconciliation is better suited for something like Sig.ma than for LOD.

At the same time, the part about relational density remains valid: LOD itself claims that 1 out of 33 triples actually ‘interlinks’ data between datasets; the others are internal relationships or literal values.
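As a rough illustration of how such a ratio could be computed, here is a minimal Python sketch. It is not LOD's own method: the `dataset_of` helper, the rule that a dataset is identified by its URI host, and the example.org-style URIs are all assumptions made for the example.

```python
from urllib.parse import urlparse


def dataset_of(uri):
    """Hypothetical rule: the URI host stands in for the dataset name."""
    return urlparse(uri).netloc


def interlink_ratio(triples):
    """Fraction of triples whose subject and object live in different datasets."""
    total = 0
    interlinks = 0
    for subject, _predicate, obj in triples:
        total += 1
        # Objects that are not URIs are treated as literal values, never as links.
        if obj.startswith("http") and dataset_of(subject) != dataset_of(obj):
            interlinks += 1
    return interlinks / total if total else 0.0


# Made-up triples: one literal, one internal relationship, one cross-dataset link.
triples = [
    ("http://a.example/item/1", "http://a.example/p/name", "Chicago"),
    ("http://a.example/item/1", "http://a.example/p/partOf", "http://a.example/item/2"),
    ("http://a.example/item/1", "http://www.w3.org/2002/07/owl#sameAs", "http://b.example/city/chicago"),
]
ratio = interlink_ratio(triples)
print(f"{ratio:.2f} of the triples interlink datasets")
```

On real data the interesting part is the definition of a dataset boundary, which this sketch deliberately oversimplifies.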

It is worth mentioning here the experience that we had in SIMILE when creating a browsing interface on top of the union of three different high-quality digital libraries: the surprising and painful result was that even with a powerful faceted-browsing user interface, a small number of datasets and a tailored UI, the perception of our users was that merging many high-quality datasets resulted in a bigger but lower-quality one. The items from the different sets mixed, but didn’t merge (even if equivalences between various identifiers were found and explicitly integrated in the system).

Basically, a-posteriori sameAs identifier equivalences are enough to reconcile if and only if the two items being merged are described using the exact same data model. I see this as another form of ‘abstraction leakage’: even if the identifiers of items and schemas were mapped and aligned, this operation might not be enough to perform a real reconciliation, as I explained in a previous post on the quality of metadata.
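The "mixed but didn’t merge" effect can be sketched in a few lines of Python. Everything here is hypothetical: the two toy libraries, their property names and the `same_as` mapping are invented for illustration and are not taken from the SIMILE datasets.

```python
from collections import defaultdict

# Two made-up libraries describing the same book with different data models.
library_a = {"a:book42": {"author": ["Italo Calvino"], "year": ["1972"]}}
library_b = {"b:rec-7": {"creator": ["Calvino, Italo"], "issued": ["1972-01-01"]}}

# A-posteriori identifier equivalence: b:rec-7 is the same item as a:book42.
same_as = {"b:rec-7": "a:book42"}


def merge(datasets, same_as):
    """Union the datasets, collapsing identifiers that are declared equivalent."""
    merged = defaultdict(lambda: defaultdict(list))
    for dataset in datasets:
        for item_id, props in dataset.items():
            canonical = same_as.get(item_id, item_id)
            for prop, values in props.items():
                merged[canonical][prop].extend(values)
    return merged


merged = merge([library_a, library_b], same_as)
item = merged["a:book42"]

# One item, as hoped, but its description is a pile of unaligned properties.
for prop in sorted(item):
    print(prop, item[prop])
```

The identifiers collapse into a single item, yet a facet on `author` still sees only one of the two name values, and nothing says that `year` and `issued` describe the same thing.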

So, in short, while it is fair to say that LOD is indeed nudging people to link to one another’s datasets a priori and make an effort to do so, the result is relationally sparse and ontologically inconsistent… and I don’t think this is because LOD is doing a bad job but because their model revolves around the idea that the more data the better, no matter how relationally dense. This is in striking contrast to what Freebase does, focusing more on higher relational density than on higher item or domain counts.

Reconciliation Costs

John Giannandrea, Metaweb’s CTO and former Chief Technologist at Netscape, wrote me privately in an email (which I quote here with permission) challenging my assumption that there is no evidence that the reconciliation costs don’t grow with the amount of data and that it might lead to saturation of resources:

Actually I think there is some evidence already that a-priori costs are an inverse function of graph size, assuming that the graph is not getting more sparse.  That is, the more you know about stuff you are trying to match, the better you can do.

I added the emphasis in the quote above because I think that’s the key to the whole process: in order to attract large quantities of information you can decide to punt on the problem of reconciliation, but those costs will grow at least linearly (and probably faster, given that the number of potential relationships grows quadratically) with the amount of information to reconcile. On the other hand, by focusing on high relational density from the very beginning, the velocity of data acquisition will be small but the reconciliation costs will get smaller over time, creating a positive network effect.

I found it interesting to spot the same concept mentioned in this article in a related but different context (mentioned in Tim O’Reilly’s latest web re-branding campaign):

Consider what happens when there are two records describing two different people as they appear to share the same name. “What happens is a third record shows up in the future that works like glue, which causes them to collapse,” he said. Eventually, “the more data we loaded, the fewer number of people there were.”
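The "glue" effect described in the quote can be sketched with a small union-find structure in Python. This is only an illustration under stated assumptions: the records, the `passport` and `phone` fields and the rule that two records match when they share one of those values are all invented here, not taken from the article.

```python
def find(parent, x):
    """Return the representative of x's group, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the groups containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def count_people(records, keys=("passport", "phone")):
    """Count distinct people, assuming a shared key value means the same person."""
    parent = list(range(len(records)))
    seen = {}  # (field, value) -> index of the first record carrying it
    for i, record in enumerate(records):
        for field in keys:
            value = record.get(field)
            if value is None:
                continue
            if (field, value) in seen:
                union(parent, seen[(field, value)], i)
            else:
                seen[(field, value)] = i
    return len({find(parent, i) for i in range(len(records))})


# Two records share a name, which alone is not treated as evidence of identity.
records = [
    {"name": "J. Smith", "passport": "P123"},
    {"name": "J. Smith", "phone": "555-0100"},
]
before = count_people(records)

# A third record arrives carrying both identifiers and glues the first two together.
records.append({"name": "John Smith", "passport": "P123", "phone": "555-0100"})
after = count_people(records)

print(before, "people before,", after, "after")
```

With three records loaded instead of two, the number of distinct people drops, which is the same inversion the quote describes.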

Big vs. Small

Prof. David Karger flatters me by writing his first (quoting him) ‘blog rebuttal’ and asking whether or not RDF is any good without a web of linked data:

[..] there are some tremendous advantages that accrue when even a single individual decides to create a blob of structured data with no reference to anyone else’s. The first is interaction [...] The second is portability [...]

What I find interesting about his argument is that David uses “RDF” and “structured data” interchangeably and uses Exhibit to illustrate the interaction value that you get from adding such structure to your data, even though Exhibit itself does not properly use RDF but a data model that is equivalent in its relational structure but has lost the notion of globally unique identifiers (and in that regard, is much more similar to Excel than to Tabulator):

It doesn’t matter if I call those properties latitude, longitude, date and color, or northSouth, eastWest, sinceTheCreation and elementOfTheRainbow, and whether I decide that my city is Chicago or the Windy City—as long as I have my own internally consistent names for them, I can use them to hook my data into interesting visualizations and interactions.

I added the emphasis to what I consider the key here: big or small the dataset, consistency is a conditio sine qua non for getting any benefit out of your data. While Exhibit’s interaction design is drastically different from that of spreadsheets, their data models are very similar, and while it’s true that all Exhibits can output their data in valid RDF, their ‘isolated consistency’ model is not nearly as demanding and ambitious as the RDF one (and indeed, because of this, the user experience on Tabulator is vastly inferior, at least with the current state of the web of data, to that of Exhibit).

I think there is a subtle but important difference: namely, that exporting RDF might help surface more data but does nothing to help bring the internal consistency to the outside and grow the relational density of aggregations of exhibits.

David (Huynh)’s work on Potluck was specifically targeted at trying to find a way to build an agglomerative strategy from a collection of exhibits, something that would be required to be able to condense a network effect value from something like Citeline.

In the end, I think we both agree that small or large the dataset, subjective or objective its modeling, consistency and relational density are key to more advanced and sophisticated uses of data and hardly something marginal that can be ignored while trying to surface its bits and pieces.

 

First Impressions on Sig.ma

July 22nd, 2009

Last week I went out for lunch with fellow Italians Giovanni Tummarello and Paolo Bouquet, who happened to be in LA for a conference, and they mentioned their respective projects Sig.ma and Okkam.

Some of the things we talked about inspired me to write the last post about reconciliation, but I couldn’t yet link to Sig.ma because it wasn’t released. But Giovanni posted about it today and officially released it to the public, so I can now talk about it.

Sig.ma is a sort of metacrawler with an a-posteriori reconciler: when you search for something, say “Barack Obama”, it queries a series of underlying search engines for all data fragments found on the web about that query (currently it uses mostly Yahoo! BOSS and Sindice). Then it tries hard to merge them all into a single topic. Because this operation can result in unwanted merges, it gives you options to remove certain sources of data (those you don’t agree with or that are not exactly relevant) and solidify that collection of data fragments (which they call a ‘sigma’) into a permanent URL which you can then send around or embed as an iframe in another web page.

First let me say that compared to most semweb-related academic research projects, this stands out as one of those rare cases where the scientists care about providing useful services and not just publishing papers about potential Utopian scenarios. For that alone, Sig.ma needs to be mentioned and Giovanni and his team praised.

Just like Google Squared, Sig.ma follows a model where the user searches for something and gathers a vast, noisy collection of more-or-less related resources, then spends a considerable amount of time and effort cleaning up, evaluating the results and pruning dead branches, and finally condenses the reconciliation efforts into a particular URL or set of rules.

Unfortunately, the reconciliation energy spent by each individual on the data periphery (at least for now) can’t be easily used to simplify the job of the next person looking to clean up this data (unlike, for example, when you edit a wiki page or commit a patch into an open source project).

While it’s not hard to imagine ways to surface such information from usage patterns or to harvest it further, my principal worry is that a-posteriori reconciliation efforts clash pretty badly with the cognitive effort that a person expends when looking for something.

When you look for something, you have neither the time nor the will to do an ‘editing’ job and clean up somebody else’s mess. You might be willing to do those things, but at a separate time, not when finding useful information is your immediate goal (which, if you think about it, is why Google managed to wipe out the entire set of search engines that existed when it surfaced: PageRank was more effective and the cognitive effort perceived when using Google to sieve through crap was much less than when using other search engines like Altavista or Lycos).

This is also the reason why the vast majority of people who land on a Wikipedia page from a search engine don’t stop and edit it, and don’t stop to rearrange the ranking of Google search results even if they can: those activities would get in the way of what the user is currently doing.

This is not really a criticism of Sig.ma or Google Squared, which are both fine examples of much needed and fresh innovation in the field of web search, but a criticism of the general approach of solutions that force users down paths that don’t match their state of mind and that have a hard time collecting human activity simply because of this. Understanding user intent and creating an interaction design that flows harmoniously with it cannot be an afterthought; it needs to be a firm and a-priori driver for the design of the service.

This said, I’m happy to see Sig.ma surface, if only because its non-purist approach comforts me and it’s refreshing in a world of semantic web research that is often so purist as to become effectively blind.

Permalink | Posted in Commentary
 

Theory vs. Practice

June 11th, 2009

It’s a little bit of a truism really, but what’s good in theory, even in complicated and brilliant theories, does not always work in practice.

The latest of such failures sits right here on my desk and has the shape of three large pieces of paper, in three different colors, that the Repubblica Italiana (aka the government of my country of citizenship) delivered to my house. They are ballots. I’m asked to make a decision on an ‘abrogative referendum’, which is a complicated way to say that they want me to say yes or no to a patch that the people of Italy want to apply to the law of the Republic.

The Italian constitution provides these powers: the Italian people are allowed to prepare a patch for the law, collect a number of signatures (500,000, according to Article 75) and then ask the rest of the population how they feel about it.

Only two limitations: the patch can only be removing (which is why it’s called ‘abrogative’) and it can’t touch taxes.

On paper, the idea is great: if congress goes too far and comes up with a law that goes against the people, the people can bypass congress and remove it themselves. It was meant as a last resort, a safeguard… and after an empire, 1200 years of invasions, a pope-powered state built right inside the capital, one Mussolini and a civil war, it is very much understandable that the post-WWII constitutional engineers built a pretty considerable set of checks and balances (and sometimes even went too far, but that’s another story).

So here I am, with these three ballots, each with two big boxes, yes and no, and an even bigger box, a *huge* box, that contains, I’m not kidding, probably 4000 words that read like this: remove word ‘in’ at paragraph 3, comma 23 of Legislative Decree #533 of December 23, 1993… remove ‘of coalition’ paragraph 2, comma 12….

You get the idea.

This means that, basically, I’m being asked to evaluate the effects of this patch. The title of the patch is “removing the possibility of linking electoral lists and the attribution of a majority bonus for an electoral coalition”…. which, if I understand it right, is supposed to prevent small parties from linking up together to form a bigger party and get a ‘majority bonus’, which later turns out to be toxic because as soon as they get their seats in congress, these coalitions fragment into a bunch of shards (in the best case) or they hold the majority hostage to their will (as happens regularly).

Ideally, one would think that voting yes (remember, yes here means ‘go ahead, apply the patch and remove’) would imply that fewer electoral coalitions get formed, which hopefully would mean that small political parties get less representation in congress, which would lead to a less unstable political system (even if less representative of minorities).

So, in theory, we have this awesome constitutional power to ‘stick it to the man’ and we have this set of patches that, in theory, would enable a more stable political system (which is something Italy seriously needs).

Yet, in practice, this means reading a title and hoping that what the patch actually does is in line with what the title of the patch says. In another country, you might take this for granted; in Italy, not so much: knee-jerk distrust for everything governmental goes so deep that even when I’m asked to stick it to the man, I’m wondering if there isn’t one of those men using me as a tool to stick it to some other men. Actually, no, you can count on that being the case.

So, ideally, one should ignore the patch titles and just read what they say… but this is a patch and it looks like a patch… it only has deltas and differences, it doesn’t tell me how the law works, it doesn’t show me where to find the law (best I can do is to get here… then what?) … and I’m no lawyer and I’m no jurist and I’m nowhere near capable of understanding the far-reaching dynamic implications of patching anything of any law.

What in theory is a ‘stick it to the man’ power, in practice turns me into a political sock puppet. What in theory was designed as a tool to empower ends up increasing distrust and amplifying fear of action.

I have a few days to vote (Italians who live abroad vote by mail earlier) but I have no idea what I’ll vote… and not because I don’t know where I stand on the issue (I do: I want more stable Italian governments even if this means less minority representation), but because I don’t know whether what I’m voting on is actually going to do what I’m being told it’s going to do.

Oh, and if you think this ‘referendum’ thing is weird and not that significant, it is worth mentioning that the Italian people used it to change their form of government from a monarchy to a republic in 1946, decided to allow women to divorce in 1974 and to abort in 1981, and to stop the use of nuclear power in 1987 (if you’re curious, check out the full list of Italian referendums).

Anyway, this might look like democracy and walk like democracy, but ultimately it doesn’t feel like one to me at all.
