More Thoughts on Reconciliation
July 25th, 2009
My previous post about the importance of reconciliation and relational density in the web of data managed to attract a lot of interest and stimulating comments and criticism. I’ll try to address them here.
LOD and the Curse of Incoherence
Ed Summers (from the Library of Congress) really liked the notion of relational density as a differentiator (and asked if there is a way to calculate it), but disagreed with my view that Linking Open Data (LOD) and Freebase differ in their model of reconciliation, and that the reason why LOD is making a bigger impact than other semantic web efforts is precisely the switch from a posteriori to a priori identifier reconciliation.
Ed certainly has a point, and it is true that my criticism of a posteriori reconciliation may be better suited to something like Sig.ma than to LOD.
At the same time, the part about relational density remains valid: LOD itself claims that only 1 out of 33 triples actually ‘interlinks’ data between datasets; the others are internal relationships or literal values.
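As for Ed’s question about whether relational density can be calculated: one crude proxy is the fraction of triples whose object is a resource on a different host than the subject, which is essentially what the 1-in-33 figure measures. A minimal sketch (the host-based heuristic and the sample triples are my own illustrative assumptions, not anything LOD publishes):

```python
from urllib.parse import urlparse

def relational_density(triples):
    """Fraction of triples linking a subject to a resource on a
    *different* host -- a rough proxy for cross-dataset interlinking."""
    def host(uri):
        return urlparse(uri).netloc
    links = sum(
        1 for s, p, o in triples
        if o.startswith("http") and host(s) != host(o)
    )
    return links / len(triples)

triples = [
    # literal value (internal)
    ("http://a.org/item/1", "name", "Alice"),
    # internal relationship within the same dataset
    ("http://a.org/item/1", "knows", "http://a.org/item/2"),
    # cross-dataset link
    ("http://a.org/item/1", "sameAs", "http://b.org/item/9"),
]
print(relational_density(triples))  # 1 of the 3 triples interlinks
```

A real metric would need to be smarter than a host comparison (mirrors, multiple namespaces per dataset), but even this crude count captures the internal-vs-interlinking distinction.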
It is worth mentioning here the experience we had in SIMILE when creating a browsing interface on top of the union of three different high-quality digital libraries: the surprising and painful result was that even with a powerful faceted-browsing user interface, a small number of datasets and a tailored UI, the perception of our users was that merging many high-quality datasets resulted in a bigger but lower-quality one. The items from the different sets mixed, but didn’t merge (even if equivalences between the various identifiers were found and explicitly integrated into the system).
Basically, a posteriori sameAs identifier equivalence is enough to reconcile if and only if the two items being merged are described using the exact same data model. I see this as another form of ‘abstraction leakage’: even if the identifiers of items and schemas were mapped and aligned, this operation might not be enough to perform a real reconciliation, as I explained in a previous post on the quality of metadata.
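To make the leakage concrete, here is a toy sketch (the record shapes are hypothetical) of what happens when two descriptions of the same item, modeled differently, are merged on identifier equivalence alone:

```python
# Two descriptions of the same book, linked by a sameAs
# equivalence but modeled with different vocabularies:
library_a = {"creator": "Melville, Herman", "date": "1851"}
library_b = {"author": "Herman Melville",
             "published": {"year": 1851, "place": "London"}}

# Merging on identifier equivalence alone just unions the keys:
merged = {**library_a, **library_b}
print(merged)
# The item now carries both 'creator' and 'author', both 'date'
# and 'published': the identifiers reconciled, the models did not.
```

This is exactly the mixed-but-not-merged effect we saw in SIMILE: the record is bigger, not better.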
So, in short, while it is fair to say that LOD is indeed nudging people to link to one another’s datasets a priori and to make an effort to do so, the result is relationally sparse and ontologically inconsistent… and I don’t think this is because LOD is doing a bad job, but because their model revolves around the idea that the more data the better, regardless of relational density, which is in striking contrast to what Freebase does, focusing on higher relational density rather than on higher item or domain counts.
John Giannandrea, Metaweb’s CTO and former Chief Technologist at Netscape, wrote me privately in an email (which I quote here with permission) challenging my assumption that reconciliation costs grow with the amount of data and might lead to a saturation of resources:
Actually I think there is some evidence already that a-priori costs are an inverse function of graph size, assuming that the graph is not getting more sparse. That is, the more you know about stuff you are trying to match, the better you can do.
I added the emphasis in the quote above because I think that’s the key to the whole process: in order to attract large quantities of information you can decide to punt on the problem of reconciliation, but those costs will grow at least linearly (and probably faster, given that the number of potential relationships grows quadratically) with the amount of information to reconcile. On the other hand, by focusing on high relational density from the very beginning, the velocity of data acquisition will be lower, but the reconciliation costs will shrink over time, creating a positive network effect.
Consider what happens when two records appear to describe two different people who share the same name. “What happens is a third record shows up in the future that works like glue, which causes them to collapse,” he said. Eventually, “the more data we loaded, the fewer number of people there were.”
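The glue-record effect can be sketched with a toy union-find merge. To be clear, the records and the shared-attribute-value matching heuristic below are my own illustrative assumptions, not Freebase’s actual reconciliation logic:

```python
# Toy 'glue record' illustration: records that share an attribute
# value are merged, and merges propagate transitively.
class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

records = {
    "r1": {"name": "J. Smith", "email": "js@example.com"},
    "r2": {"name": "John Smith", "phone": "555-0100"},
    # the 'glue' record shares one value with each of the others
    "r3": {"email": "js@example.com", "phone": "555-0100"},
}

uf = UnionFind()
ids = list(records)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        # merge when any attribute value coincides (toy heuristic)
        if set(records[a].values()) & set(records[b].values()):
            uf.union(a, b)

people = {uf.find(r) for r in records}
print(len(people))  # r3 glued r1 and r2: three records, one person
```

Before r3 arrives, r1 and r2 share nothing and stay apart; loading more data is what collapses them, which is the inverse-cost dynamic Giannandrea describes.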
Big vs. Small
[..] there are some tremendous advantages that accrue when even a single individual decides to create a blob of structured data with no reference to anyone else’s. The first is interaction [...] The second is portability [...]
What I find interesting about his argument is that David uses “RDF” and “structured data” interchangeably, and uses Exhibit to illustrate the interaction value you get from adding such structure to your data, even though Exhibit itself does not properly use RDF but a data model that is equivalent in its relational structure yet lacks the notion of globally unique identifiers (and in that regard is much more similar to Excel than to Tabulator):
It doesn’t matter if I call those properties latitude, longitude, date and color, or northSouth, eastWest, sinceTheCreation and elementOfTheRainbow, and whether I decide that my city is Chicago or the Windy City—as long as I have my own internally consistent names for them, I can use them to hook my data into interesting visualizations and interactions.
I added the emphasis to what I consider the key point: big or small the dataset, consistency is a conditio sine qua non for getting benefit out of your data. While Exhibit’s interaction design is drastically different from that of spreadsheets, their data models are very similar; and while it’s true that all Exhibits can output their data in valid RDF, their ‘isolated consistency’ model is not nearly as demanding and ambitious as RDF’s (and indeed, because of this, the user experience of Tabulator is vastly inferior, at least in the current state of the web of data, to that of Exhibit).
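David’s point about internally consistent names can be made concrete: a view parameterized by property names works with whatever vocabulary the author chose, as long as it is used consistently. A minimal sketch (the function and field names below are hypothetical, not Exhibit’s actual API):

```python
# The visualization only needs to know *which* of my property
# names play which role; the names themselves are arbitrary.
def plot_points(items, lat_field, lon_field):
    return [(item[lat_field], item[lon_field]) for item in items]

data_a = [{"latitude": 41.88, "longitude": -87.63,
           "label": "Chicago"}]
data_b = [{"northSouth": 41.88, "eastWest": -87.63,
           "label": "Windy City"}]

# Both datasets drive the same interaction, each with its own
# internally consistent vocabulary:
print(plot_points(data_a, "latitude", "longitude"))
print(plot_points(data_b, "northSouth", "eastWest"))
```

Note what this buys and what it doesn’t: each dataset works in isolation, but nothing here tells you that `latitude` and `northSouth` mean the same thing across datasets, which is precisely the aggregation problem discussed next.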
I think there is a subtle but important difference here: exporting RDF might help surface more data, but it does nothing to bring that internal consistency to the outside or to grow the relational density of aggregations of exhibits.
David (Huynh)’s work on Potluck was specifically targeted at finding an agglomerative strategy for a collection of exhibits, something that would be required to condense network-effect value out of something like Citeline.
In the end, I think we both agree that whether the dataset is small or large, and whether its modeling is subjective or objective, consistency and relational density are key to more advanced and sophisticated uses of data, and hardly something marginal that can be ignored while trying to surface its bits and pieces.