On Data Reconciliation Strategies and Their Impact on the Web of Data
July 18th, 2009
Suppose that you are given two fragments of data, each representing the same objective fact about the same thing (say, the fact that Paris is the capital of France and that the Eiffel Tower is located in Paris) but using different models (aka schemas/ontologies) and different identifiers for the entities described in the data.
Reconciling these fragments means aligning the different identifiers given to the same ‘entity’ (in this case ‘Paris’) and folding them together so that the two facts are now related to the same thing (how this is done in practice is not important for now).
This reconciliation activity seems mechanical and artificial at first, but digging deeper into the way natural languages emerge sheds some light on the fact that reconciliation can be seen as a form of categorization: we are lumping together all things that indicate “Paris”, just like we do naturally for synonyms or for words with different sounds (imagine “Paris” pronounced in French, ‘pah-ree’).
Now, not a lot of people actually understand this but the only and exclusive benefit of RDF compared to other data models (say XML or JSON) is the automatic and transparent mixability of independent fragments.
RDF’s unique property is that you can get two fragments and merge them together to form a bigger data model. The important part is that you can always do this and you don’t need any other logic or strategy to perform this merging: it’s inherent in the way the RDF model is designed. By virtue of using a graph model and globally unique identifiers, two RDF models merged together are basically just a longer list of triples, that is, relationships between uniquely identified entities.
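As a minimal sketch of that property, two fragments can be modeled as plain Python sets of (subject, predicate, object) tuples, and merging them is nothing more than a set union. No RDF library is used here, and every identifier under http://example.org/ is made up for illustration. Note that the two fragments happen to use the same identifier for Paris, which is the lucky case.

```python
# Each fragment is a set of (subject, predicate, object) triples.
# All identifiers are hypothetical and exist only for this example.
PARIS = "http://example.org/id/Paris"

fragment_a = {
    (PARIS, "http://example.org/schema/capitalOf", "http://example.org/id/France"),
}
fragment_b = {
    ("http://example.org/id/EiffelTower", "http://example.org/schema/locatedIn", PARIS),
}

# Merging needs no extra logic: the result is just the union of the triples.
merged = fragment_a | fragment_b

# Because both fragments use the same identifier, both facts attach to one node.
about_paris = {t for t in merged if PARIS in (t[0], t[2])}
print(len(merged), len(about_paris))
```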
This property does not exist in most other data models: two HTML or XML or JSON files need additional information to know how they should be merged or transformed into a bigger piece. The only way to make them merge is to parse them, tokenize their content and create an inverted index, which is how the search engines manage to get all sorts of incoherent pieces of data to fit together in the same container and be searchable by a single interface no matter their original format.
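The tokenize-and-index approach can be sketched in a few lines. This is only an illustration of the general idea of an inverted index, not a description of how any particular search engine is built; the two documents and the function name build_inverted_index are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Crude tokenizer: markup and punctuation are simply treated as separators.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


# Two documents in unrelated formats (hypothetical content).
documents = {
    "page.html": "<p>Paris is the capital of France</p>",
    "data.json": '{"name": "Eiffel Tower", "city": "Paris"}',
}

index = build_inverted_index(documents)
print(sorted(index["paris"]))
```

The index can tell you that both documents mention the token "paris", but it records nothing about how the tower, the city and the country relate to each other.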
The idea behind RDF (and all syntactic forms of the RDF model like RDFa, Turtle, N-Triples, RDF/XML, etc.) is that describing data fragments on the web with it (or other things like Microformats that could be easily and mechanically RDFized) allows harvesters to merge data naturally since RDF is, in a sense, already liquid.
There is one problem though: two RDF models always merge… but not necessarily in the way that you would want them to. In the example above, if I had two RDF fragments, written by different people and harvested from different URLs, it is very likely that their identifiers for Paris would be globally unique, but different.
Which means that you don’t know two assertions about Paris, you know one assertion about “urn:france:paris” and another assertion about “http://wikipedia.com/en/paris”… but the RDF engine doesn’t know, unless you load another piece of information that explicitly says so, that these two identifiers are equivalent and they mean to identify the same exact entity in real life.
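To make this concrete, here is a small sketch that uses the two identifiers from the paragraph above. The predicates and the other identifiers under urn:example: are invented, and using owl:sameAs as the equivalence statement plus a union-find to fold identifiers together is one possible approach, assumed here for illustration only.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def fold_equivalents(triples):
    """Rewrite every identifier to one representative per sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            a, b = sorted((find(s), find(o)))
            parent[b] = a  # the smaller string becomes the representative
    return {(find(s), p, find(o)) for s, p, o in triples if p != SAME_AS}


def facts_about(triples, node):
    return {t for t in triples if node in (t[0], t[2])}


facts = {
    ("urn:france:paris", "urn:example:capitalOf", "urn:example:france"),
    ("urn:example:eiffel-tower", "urn:example:locatedIn", "http://wikipedia.com/en/paris"),
}
# Without the mapping, each identifier carries only one assertion.
print(len(facts_about(facts, "urn:france:paris")))

# The extra piece of information that says the two identifiers are equivalent.
mapping = ("urn:france:paris", SAME_AS, "http://wikipedia.com/en/paris")
folded = fold_equivalents(facts | {mapping})
print(len(facts_about(folded, "http://wikipedia.com/en/paris")))
```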
The biggest difference between data integration semweb-style and data integration datawarehouse-style is when reconciliation happens: the semantic web model assumes that reconciliation must happen a posteriori, when the data is consumed, while data warehousing assumes that reconciliation must happen a priori, when the data enters the system.
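The difference in timing can be sketched as two small functions over the same toy data. Everything here (the function names, the dictionaries of known equivalences, the prefixed identifiers) is hypothetical and only meant to show where the reconciliation work lands in each style.

```python
def ingest_a_priori(store, fragment, canonical):
    """Warehouse style: rewrite identifiers to canonical ones before storing."""
    for s, p, o in fragment:
        store.add((canonical.get(s, s), p, canonical.get(o, o)))


def read_a_posteriori(store, equivalents, node):
    """Semantic web style: store raw triples, expand aliases only when reading."""
    aliases = {node} | equivalents.get(node, set())
    return {t for t in store if t[0] in aliases or t[2] in aliases}


fragment_a = {("a:Paris", "capitalOf", "a:France")}
fragment_b = {("b:EiffelTower", "locatedIn", "b:Paris")}

# A priori: the cost is paid once, at load time, and the store stays compact.
warehouse = set()
for fragment in (fragment_a, fragment_b):
    ingest_a_priori(warehouse, fragment, canonical={"b:Paris": "a:Paris"})
paris_in_warehouse = {t for t in warehouse if "a:Paris" in (t[0], t[2])}

# A posteriori: loading is free, but every read needs the equivalence mappings.
raw_store = fragment_a | fragment_b
paris_at_read = read_a_posteriori(raw_store, {"a:Paris": {"b:Paris"}}, "a:Paris")

print(len(paris_in_warehouse), len(paris_at_read))
```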
The semantic web architects correctly identified a priori reconciliation as the biggest scalability impediment for a world-scale data integration effort and decided to avoid worrying about it (so much so that it took years for the concept of ‘identifier equivalence’ to even surface in the semantic web architecture, and it was included in OWL, which feels horribly overdesigned to be used simply for that particular purpose).
For years, I’ve been a fairly vocal advocate for the elegance and scalability of a-posteriori reconciliation via equivalence mappings as a superior mechanism (scale-wise) to a-priori reconciliation efforts… but this started to change very rapidly once I started working for Metaweb and saw firsthand how much more effective a-priori reconciliation can be, even if drastically more expensive and limiting on the data acquisition front.
Freebase spends a considerable amount of resources performing a priori reconciliation of all the bulk loads of data to try to have the most compact and densest possible graph, even at the cost of limiting the rate at which new data can be acquired. On the other hand, Linking Open Data follows the a posteriori reconciliation model where it is assumed that identifier reconciliation is a low-energy point and the world-wide web of data will, once big enough, tend to naturally reconcile identifiers and schemas toward an increased graph density.
Both are huge bets: there is no indication that a priori reconciliation costs are not a function of the quantity of data already contained in the graph (which would eventually saturate its ability to grow); and there is no indication that a denser graph is naturally a lower energy point for unreconciled agglomerations of datasets and that an increase in relational density would happen naturally and spontaneously.
It’s important that I mention explicitly the reason why I stress ‘relational density’ as a critically important property for a web of data: without it there would be very little value in it compared to what traditional search engines are already doing. The problem text-based search engines have is that they have a really hard time extracting even the most trivial of the relationships between data fragments from the token soup of their inverted indices (here it is worth mentioning that while Google Squared inspires awe and admiration from data geeks, myself included, it is still a largely useless tool for any low-tech end user given how noisy its results are).
A soup of triples that hardly ever connects is never going to be relevant and valuable enough to provide services that existing text-tokenizing search engines cannot match (and Google Squared’s biggest merit is to show precisely that): the fact that RDF merges naturally simply matches the fact that tokenized text merges naturally too. Without a dense network of relationships, which can’t happen without identifier reconciliation, a web of data remains a Babel of identifiers and ontologies that is only marginally more useful than the web pages that contained the same information.
No matter what user interfaces will drive the user interaction, the dream of being able to search the web of data following relational connections (say, somehow looking for “the height of all towers located in Paris”) dies miserably when it’s powered by a vastly sparse and unconnected graph. This is mostly the reason why, while both Google and Yahoo have already acquired all the billions of RDF triples that can currently be found on the web and that Linking Open Data helped surface, very little of that information ever gets used in their search results. The graph of harvested RDF is too sparse to be more useful than the data they already have. And while more and more RDF gets liberated and exposed on the web every day, there is no indication that the aggregated world-wide web of data is getting any denser or that a natural and decentralized tendency to get denser will ever surface spontaneously.
If it is not natural for the relational density of a web of data to increase over time (which is what I’m more and more led to believe as time passes), I cannot help but think that any effort tasked with promoting and increasing such density will look and feel just like Freebase: carefully selecting datasets, painfully trying to reconcile as much data as possible right when it enters the system, and enticing a community of volunteers to maintain it, curate it and clean up any mistakes.
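The “height of all towers located in Paris” query makes the point easy to demonstrate. In the sketch below the prefixed identifiers, the predicate names and the function towers_in are all made up, and the height value is only an illustrative literal. The same three triples answer the query only after the identifiers have been reconciled.

```python
def towers_in(triples, city):
    """Return {tower: height} for every tower located in the given city."""
    located = {s for s, p, o in triples if p == "locatedIn" and o == city}
    towers = {s for s, p, o in triples if p == "type" and o == "Tower"}
    return {s: o for s, p, o in triples if p == "height" and s in located & towers}


# Two sources describe the same tower and city under different identifiers.
sparse = {
    ("a:EiffelTower", "type", "Tower"),
    ("a:EiffelTower", "height", "324 m"),  # illustrative value only
    ("b:TourEiffel", "locatedIn", "b:Paris"),
}
print(towers_in(sparse, "a:Paris"), towers_in(sparse, "b:Paris"))

# After reconciliation the relational path type -> locatedIn -> height connects.
canonical = {"b:TourEiffel": "a:EiffelTower", "b:Paris": "a:Paris"}
dense = {(canonical.get(s, s), p, canonical.get(o, o)) for s, p, o in sparse}
print(towers_in(dense, "a:Paris"))
```

On the unreconciled graph both lookups come back empty, even though all the necessary facts are technically present.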