
More Thoughts on Reconciliation

July 25th, 2009

My previous post about the importance of reconciliation and relational density in the web of data managed to attract a lot of interest, along with stimulating comments and criticism. I’ll try to address them here.

LOD and the Curse of Incoherence

Ed Summers (from the Library of Congress) really liked the notion of relational density as a differentiator (and asked if there is a way to calculate it), but disagrees with my view that Linking Open Data (LOD) and Freebase differ in their model of reconciliation and that the reason why LOD is making a bigger impact than other semantic web efforts is precisely because of the switch from a posteriori to a priori identifier reconciliation.

Ed certainly has a point and it is true that maybe my criticism of a posteriori reconciliation is better suited for something like Sig.ma than for LOD.

At the same time, the part about relational density remains valid: LOD itself claims that only 1 out of 33 triples actually ‘interlinks’ data between datasets; the others are internal relationships or literal values.
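Ed asked whether relational density can be calculated. One rough way to put a number on a figure like "1 out of 33" is to count the triples whose subject and object live in different datasets. The sketch below is only an illustration under loud assumptions: it treats each URI host as one dataset, the function names `dataset_of` and `interlink_ratio` are made up for this example, and the sample triples are hypothetical. It is not how LOD computes its own statistic.

```python
from urllib.parse import urlparse

def dataset_of(term):
    """Return the host of a URI term, or None for a literal.
    Assumption for this sketch: one host corresponds to one dataset."""
    if isinstance(term, str) and term.startswith(("http://", "https://")):
        return urlparse(term).netloc
    return None

def interlink_ratio(triples):
    """Fraction of triples whose subject and object are URIs from different datasets."""
    if not triples:
        return 0.0
    interlinks = 0
    for subject, _predicate, obj in triples:
        source, target = dataset_of(subject), dataset_of(obj)
        if source and target and source != target:
            interlinks += 1
    return interlinks / len(triples)

# Hypothetical sample: one cross-dataset link, one internal link, two literals.
triples = [
    ("http://a.example/paris", "http://a.example/name", "Paris"),
    ("http://a.example/paris", "http://a.example/capitalOf", "http://a.example/france"),
    ("http://a.example/paris", "http://www.w3.org/2002/07/owl#sameAs", "http://b.example/Paris"),
    ("http://b.example/Paris", "http://b.example/label", "Paris, France"),
]
print(interlink_ratio(triples))  # 0.25
```

By this measure, 1 interlinking triple out of 33 corresponds to a ratio of roughly 0.03.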

It is worth mentioning here the experience that we had in SIMILE when creating a browsing interface on top of the union of three different high-quality digital libraries: the surprising and painful result was that even with a powerful faceted-browsing user interface, a small number of datasets and a tailored UI, the perception of our users was that merging many high-quality datasets resulted in a bigger but lower-quality one. The items from the different sets mixed, but didn’t merge (even if equivalences between various identifiers were found and explicitly integrated in the system).

Basically, a-posteriori sameAs identifier equivalences are enough to reconcile if and only if the two items being merged are described using the exact same data model. I see this as another form of ‘abstraction leakage’: even if the identifiers of items and schemas were mapped and aligned, this operation might not be enough to perform a real reconciliation, as I explained in a previous post on the quality of metadata.
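To make the "mixed, but didn’t merge" effect concrete, here is a minimal sketch of a faceted count over two libraries that describe the same kind of item with different models. Everything in it is hypothetical (the `facet_counts` helper, the property names `author` and `creator`, the sample records); it only illustrates the general problem, not the actual SIMILE data.

```python
from collections import defaultdict

def facet_counts(items, prop):
    """Count how many items carry each value of the given property."""
    counts = defaultdict(int)
    for item in items:
        for value in item.get(prop, []):
            counts[value] += 1
    return dict(counts)

# Library A stores the author as a literal name under "author".
library_a = [{"id": "a:book1", "author": ["Jane Doe"]}]
# Library B links to a person entity under "creator" instead.
library_b = [{"id": "b:book7", "creator": ["b:person/jane-doe"]}]

merged = library_a + library_b

# Both books are in the merged collection, but a facet on "author"
# only sees the book from library A.
print(len(merged))                     # 2
print(facet_counts(merged, "author"))  # {'Jane Doe': 1}
```

Mapping `b:person/jane-doe` to some identifier in library A would not help here, because library A never used an identifier for the author in the first place: the models differ, not just the names.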

So, in short, while it is fair to say that LOD is indeed nudging people to link to one another’s datasets a priori and to make an effort to do so, the result is relationally sparse and ontologically inconsistent. I don’t think this is because LOD is doing a bad job, but because its model revolves around the idea that the more data the better, no matter how relationally dense. This is in striking contrast to what Freebase does, which is to focus on higher relational density rather than on higher item or domain counts.

Reconciliation Costs

John Giannandrea, Metaweb’s CTO and former Chief Technologist at Netscape, wrote to me privately in an email (which I quote here with permission) challenging my claim that there is no evidence that a priori reconciliation costs don’t grow with the amount of data, which might lead to a saturation of resources:

Actually I think there is some evidence already that a-priori costs are an inverse function of graph size, assuming that the graph is not getting more sparse.  That is, the more you know about stuff you are trying to match, the better you can do.

I added the emphasis in the quote above because I think that’s the key to the whole process: in order to attract large quantities of information you can decide to punt on the problem of reconciliation, but those costs will grow at least linearly with the amount of information to reconcile (and probably faster, given that the number of potential relationships grows quadratically). On the other hand, by focusing on high relational density from the very beginning, the velocity of data acquisition will be small but the reconciliation costs will get smaller over time, creating a positive network effect.
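The "quadratically" remark can be checked with simple arithmetic: if, in the worst case, every item might have to be compared with every other item, n items give n(n-1)/2 candidate pairs. The helper name `potential_pairs` is mine, and this is a naive upper bound under that assumption; a real reconciler would prune most of these pairs.

```python
def potential_pairs(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, potential_pairs(n))
# 10 45
# 100 4950
# 1000 499500
```

Growing the data 100-fold (from 10 to 1000 items) grows the candidate pairs roughly 11,000-fold, which is why deferring reconciliation gets more expensive the longer you defer it.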

I found it interesting to spot the same concept mentioned in this article in a related but different context (mentioned in Tim O’Reilly’s latest web re-branding campaign):

Consider what happens when there are two records describing two different people as they appear to share the same name. “What happens is a third record shows up in the future that works like glue, which causes them to collapse,” he said. Eventually, “the more data we loaded, the fewer number of people there were.”
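The ‘glue’ effect in that quote can be sketched with a small union-find. The matching rule here is an assumption made purely for illustration: two records are taken to describe the same person when they share a phone number or an email address, and a shared name alone is not enough. The function name `count_entities` and the sample records are hypothetical.

```python
def count_entities(records, keys=("phone", "email")):
    """Count distinct entities, merging records that share a value for any key."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_seen = {}  # (key, value) -> index of the first record carrying it
    for i, record in enumerate(records):
        for key in keys:
            value = record.get(key)
            if value is None:
                continue
            if (key, value) in first_seen:
                union(i, first_seen[(key, value)])
            else:
                first_seen[(key, value)] = i
    return len({find(i) for i in range(len(records))})

records = [
    {"name": "J. Smith", "phone": "555-0100"},
    {"name": "J. Smith", "email": "js@example.org"},
]
print(count_entities(records))  # 2: same name, but nothing strong links them

# A third record arrives carrying both the phone and the email: the glue.
records.append({"name": "John Smith", "phone": "555-0100", "email": "js@example.org"})
print(count_entities(records))  # 1: more data, fewer people
```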

Big vs. Small

Prof. David Karger flatters me by writing his first (quoting him) ‘blog rebuttal’ and asking whether or not RDF is any good without a web of linked data:

[..] there are some tremendous advantages that accrue when even a single individual decides to create a blob of structured data with no reference to anyone else’s. The first is interaction [...] The second is portability [...]

What I find interesting about his argument is that David uses “RDF” and “structured data” interchangeably and uses Exhibit to illustrate the interaction value that you get from adding such structure to your data, even though Exhibit itself does not properly use RDF but a data model that is equivalent in its relational structure yet has lost the notion of globally unique identifiers (and in that regard, is much more similar to Excel than to Tabulator):

It doesn’t matter if I call those properties latitude, longitude, date and color, or northSouth, eastWest, sinceTheCreation and elementOfTheRainbow, and whether I decide that my city is Chicago or the Windy City—as long as I have my own internally consistent names for them, I can use them to hook my data into interesting visualizations and interactions.

I added the emphasis to what I consider the key here: big or small the dataset, consistency is a conditio sine qua non for getting benefit out of your data. While Exhibit’s interaction design is drastically different from that of spreadsheets, their data models are very similar, and while it’s true that all Exhibits can output their data in valid RDF, their ‘isolated consistency’ model is not nearly as demanding and ambitious as the RDF one (and indeed, because of this, the user experience on Tabulator is vastly inferior, at least with the current state of the web of data, to that of Exhibit).

I think there is a subtle but important difference here: exporting RDF might help surface more data, but it does nothing to help bring that internal consistency to the outside and grow the relational density of aggregations of exhibits.

David (Huynh)’s work on Potluck was specifically aimed at finding a way to build an agglomerative strategy from a collection of exhibits, something that would be required to condense network-effect value out of something like Citeline.

In the end, I think we both agree that, small or large the dataset, subjective or objective its modeling, consistency and relational density are key to more advanced and sophisticated uses of data, and hardly something marginal that can be ignored while trying to surface its bits and pieces.

 

First Impressions on Sig.ma

July 22nd, 2009

Last week I went out for lunch with fellow Italians Giovanni Tummarello and Paolo Bouquet, who happened to be in LA for a conference, and they mentioned their respective projects, Sig.ma and Okkam.

Some of the things we talked about inspired me to write the last post about reconciliation, but I couldn’t yet link to Sig.ma because it hadn’t been released. Giovanni posted about it today and officially released it to the public, so I can now talk about it.

Sig.ma is a sort of metacrawler with an a-posteriori reconciler: when you search for something, say “Barack Obama”, it queries a series of underlying search engines for all data fragments found on the web about that query (currently it uses mostly Yahoo! BOSS and Sindice). Then it tries hard to merge them all into a single topic. Because this operation can result in unwanted merges, it gives you options to remove certain sources of data (those that you don’t agree with or that are not exactly relevant) and to solidify that collection of data fragments (which they call a ‘sigma’) into a permanent URL, which you can then send around or embed as an iframe in another web page.

First let me say that compared to most semweb-related academic research projects, this stands out as one of those rare cases where the scientists care about providing useful services and not just publishing papers about potential Utopian scenarios. For that alone, Sig.ma needs to be mentioned and Giovanni and his team praised.

Just like Google Squared, Sig.ma follows a model where the user searches for something and gathers a vast, noisy collection of more-or-less related resources, then spends a considerable amount of time and effort cleaning up, evaluating the results and pruning dead branches, and finally condenses the reconciliation effort into a particular URL or set of rules.

Unfortunately, the reconciliation energy spent by each individual on the data periphery can’t (at least for now) be easily used to simplify the job of the next person looking to clean up this data (unlike, for example, when you edit a wiki page or commit a patch to an open source project).

While it’s not hard to imagine ways to extract such information from usage patterns or to harvest it further, my principal worry is that a-posteriori reconciliation efforts clash pretty badly with the cognitive effort that a person expends when looking for something.

When you look for something, you have neither the time nor the will to do an ‘editing’ job and clean up somebody else’s mess. You might be willing to do those things, but at a separate time, not when finding useful information is your immediate goal (which, if you think about it, is why Google managed to wipe out the entire set of search engines that existed when it surfaced: PageRank was more effective, and the perceived cognitive effort of sifting through crap with Google was much lower than with other search engines like Altavista or Lycos).

This is also the reason why the vast majority of people who land on a Wikipedia page from a search engine don’t stop and edit it, and why they don’t stop and rearrange the ranking of Google search results even if they can: those activities would get in the way of what the user is currently doing.

This is not really a criticism of Sig.ma or Google Squared, which are both fine examples of much-needed and fresh innovation in the field of web search, but a criticism of the general approach of solutions that force users down paths that don’t match their state of mind and that have a hard time collecting human activity simply because of this. Understanding user intent and creating an interaction design that flows harmoniously with it cannot be an afterthought: it needs to be a firm, a-priori driver for the design of the service.

That said, I’m happy to see Sig.ma surface, if only because its non-purist approach comforts me and is refreshing in a world of semantic web research that is often so purist as to become effectively blind.

Permalink | Posted in Commentary
 

On Data Reconciliation Strategies and Their Impact on the Web of Data

July 18th, 2009

Suppose that you are given two fragments of data, each representing an objective fact about the same thing (say, the fact that Paris is the capital of France and the fact that the Eiffel Tower is located in Paris) but using different models (aka schemas/ontologies) and different identifiers for the entities described in the data.

Reconciling these fragments means aligning the different identifiers given to the same ‘entity’ (in this case ‘Paris’) and folding them together so that the two facts are now related to the same thing (how this is done in practice is not important for now).

This reconciliation activity seems mechanical and artificial at first, but digging deeper into the way natural languages emerge sheds some light on the fact that reconciliation can be seen as a form of categorization: we are lumping together all things that indicate “Paris”, just like we do naturally for synonyms or for words with different sounds (imagine “Paris” pronounced in French, ‘pah-ree’).

Now, not a lot of people actually understand this but the only and exclusive benefit of RDF compared to other data models (say XML or JSON) is the automatic and transparent mixability of independent fragments.

RDF’s unique property is that you can take two fragments and merge them together to form a bigger data model. The important part is that you can always do this and you don’t need any other logic or strategy to perform this merging: it’s inherent in the way the RDF model is designed. By virtue of using a graph model and globally unique identifiers, two merged RDF models are basically just a longer list of triples, that is, relationships between uniquely identified entities.
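As a minimal sketch of that property, treat each fragment as a set of (subject, predicate, object) triples; merging is then nothing more than a set union. The `urn:example:` identifiers and the `facts_about` helper are made up for illustration, and the sketch ignores blank nodes, which a real RDF merge has to keep apart.

```python
# Two independently written fragments that happen to use the same identifier for Paris.
fragment_a = {
    ("urn:example:paris", "urn:example:capitalOf", "urn:example:france"),
}
fragment_b = {
    ("urn:example:eiffel-tower", "urn:example:locatedIn", "urn:example:paris"),
}

# Merging needs no extra logic: the result is just a longer list of triples.
merged = fragment_a | fragment_b

def facts_about(triples, entity):
    """Return every triple that mentions the entity as subject or object."""
    return {t for t in triples if entity in (t[0], t[2])}

print(len(merged))                                    # 2
print(len(facts_about(merged, "urn:example:paris")))  # 2
```

Because both fragments use the same identifier, the merged graph knows two things about one Paris without anyone having told it how to combine them.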

This property does not exist in most other data models: two HTML or XML or JSON files need additional information to know how they should be merged or transformed into a bigger piece. The only way to make them merge is to parse them, tokenize their content and create an inverted index, which is how the search engines manage to get all sorts of incoherent pieces of data to fit together in the same container and be searchable by a single interface no matter their original format.
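The tokenize-and-index approach can be sketched in a few lines. This is a toy under obvious assumptions (a crude regular-expression tokenizer, made-up document names, no stripping of markup), not how any real search engine is built; the function name `build_inverted_index` is mine.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index

# Two documents in different formats; the index does not care.
documents = {
    "page.html": "<p>Paris is the capital of France</p>",
    "data.json": '{"name": "Eiffel Tower", "city": "Paris"}',
}
index = build_inverted_index(documents)
print(sorted(index["paris"]))  # ['data.json', 'page.html']
```

The two files now sit in the same container and a search for "paris" finds both, but the index has no idea that one of them says a tower is located in Paris: the relationship is lost in the token soup.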

The idea behind RDF (and all syntactic forms of the RDF model like RDFa, Turtle, ntriples, RDF/XML etc) is that describing data fragments on the web with it (or other things like Microformats that could be easily and mechanically RDFized) allows harvesters to merge data naturally since RDF is, in a sense, already liquid.

There is one problem though: two RDF models always merge… but not necessarily in the way that you would want them to. In the example above, if I had two RDF fragments, written by different people and harvested from different URLs, it is very likely that their identifiers for Paris would be globally unique, but different.

Which means that you don’t know two assertions about Paris, you know one assertion about “urn:france:paris” and another assertion about “http://wikipedia.com/en/paris”… but the RDF engine doesn’t know, unless you load another piece of information that explicitly says so, that these two identifiers are equivalent and they mean to identify the same exact entity in real life.
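Here is a small sketch of that situation, using the two identifiers from the text. The `ex:` terms, the helper names `facts_about` and `canonicalize`, and the one-level equivalence mapping are assumptions made for illustration; a real system would also have to handle chains of equivalences and choose a canonical identifier.

```python
def facts_about(triples, entity):
    """Return every triple that mentions the entity as subject or object."""
    return {t for t in triples if entity in (t[0], t[2])}

def canonicalize(triples, same_as):
    """Rewrite subjects and objects through a one-level equivalence mapping."""
    return {(same_as.get(s, s), p, same_as.get(o, o)) for s, p, o in triples}

# The naive merge: both triples are present, but they talk about different identifiers.
merged = {
    ("urn:france:paris", "ex:capitalOf", "ex:France"),
    ("ex:EiffelTower", "ex:locatedIn", "http://wikipedia.com/en/paris"),
}
print(len(facts_about(merged, "urn:france:paris")))  # 1

# The extra piece of information that says the two identifiers are equivalent.
same_as = {"http://wikipedia.com/en/paris": "urn:france:paris"}
reconciled = canonicalize(merged, same_as)
print(len(facts_about(reconciled, "urn:france:paris")))  # 2
```

Only after the equivalence is loaded does the graph know two assertions about Paris rather than one assertion each about two unrelated identifiers.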

The biggest difference between data integration semweb-style and data integration datawarehouse-style is when reconciliation happens: the semantic web model assumes that reconciliation must happen a posteriori, when the data is consumed, while data warehousing assumes that reconciliation must happen a priori, when the data enters the system.

The semantic web architects correctly identified a priori reconciliation as the biggest scalability impediment for a world-scale data integration effort and decided to avoid worrying about it (so much so that it took years for the concept of ‘identifier equivalence’ to even surface in the semantic web architecture, and when it did it was included in OWL, which feels horribly overdesigned for that particular purpose alone).

For years, I’ve been a fairly vocal advocate for the elegance and scalability of a-posteriori reconciliation via equivalence mappings as a superior mechanism (scale-wise) to a-priori reconciliation efforts… but this started to change very rapidly once I started working for Metaweb and saw first hand how much more effective a-priori reconciliation can be, even if drastically more expensive and limiting on the data acquisition front.

The difference between efforts like Freebase and efforts like Linking Open Data hinges on their model for reconciliation.

Freebase spends a considerable amount of resources performing a priori reconciliation of all the bulk loads of data to try to have the most compact and densest possible graph, even at the cost of limiting the rate at which new data can be acquired. On the other hand, Linking Open Data follows the a posteriori reconciliation model, where it is assumed that identifier reconciliation is a low-energy point and that the world-wide web of data will, once big enough, tend to naturally reconcile identifiers and schemas toward an increased graph density.

Both are huge bets: there is no indication that a priori reconciliation costs are not a function of the quantity of data already contained in the graph (which would eventually saturate its ability to grow); and there is no indication that a denser graph is naturally a lower energy point for unreconciled agglomerations of datasets and that an increase in relational density would happen naturally and spontaneously.

It’s important that I mention explicitly the reason why I stress ‘relational density’ as a critically important property for a web of data: without it there would be very little value in it compared to what traditional search engines are already doing. The problem text-based search engines have is that they have a really hard time extracting even the most trivial of the relationships between data fragments from the token soup of their inverted indices (it is worth mentioning here that while Google Squared inspires awe and admiration from data geeks, myself included, it is still a largely useless tool for any low-tech end user given how noisy its results are).

A soup of triples that hardly ever connects is never going to be relevant and valuable enough to provide services that existing text-tokenizing search engines cannot match (and Google Squared’s biggest merit is to show precisely that): the fact that RDF merges naturally simply matches the fact that tokenized text merges naturally too. Without a dense network of relationships, which can’t happen without identifier reconciliation, a web of data remains a Babel of identifiers and ontologies that is only marginally more useful than the web pages that contained the same information.

No matter what user interfaces will drive the user interaction, the dream of being able to search the web of data following relational connections (say, somehow looking for “the height of all towers located in Paris”) dies miserably when it’s powered by a vastly sparse and unconnected graph. This is mostly the reason why, while both Google and Yahoo have already acquired all the billions of RDF triples that can currently be found on the web and that Linking Open Data helped surface, very little of that information ever gets used in their search results. The graph of harvested RDF is too sparse to be more useful than the data they already have. And while more and more RDF gets liberated and exposed on the web every day, there is no indication that the aggregated world-wide web of data is getting any denser or that a natural and decentralized tendency to get denser will ever surface spontaneously.

If it is not natural for the relational density of a web of data to increase over time (which is what I’m more and more led to believe as time passes), I cannot help but think that any effort tasked with promoting and increasing such density will look and feel just like Freebase: carefully selecting datasets, painfully trying to reconcile as much data as possible right when it enters the system, and enticing a community of volunteers to maintain it, curate it and clean up any mistakes.

Permalink | Posted in Article