Continuing Thoughts on Reconciliation

September 23rd, 2009

My thoughts about data reconciliation (and later updates) seem to have hit a nerve. After taking some time to digest the comments I received publicly and privately about it, I want to address some of the issues that surfaced.

I want to draw some attention to two different blog posts that appeared recently on the Haystack Blog: one by Prof. David Karger that continues our blog debate on the dynamics around RDF, and one by one of his grad students, Edward Benson, about the differences between Microdata and RDFa in the HTML5 embedded data debate.

They seem to be about different things, but I saw a common underlying thread between the two: the concept of ‘RDF without URIs’. Karger considers the RDF model valuable independently of the scope or aspiration of the identifiers used: not only is the graph model both natural and very malleable, it’s also very ‘liquid’ in the sense that it can adapt to its container better than other data models.

In a sense, both Exhibit and the Microdata proposal in HTML5 go in the same direction: let’s focus on technology that can give users immediate benefits and avoid speculating about other aspects.

Benson suggests that the ‘namespace traffic jam’ might be the reason why people failed to use RDF in its full namespaced potential, but I disagree: the problem, IMO, is the lack of a clear and globally applicable answer to the questions “what’s in it for me?” and “what does using one identifier scheme vs. my own buy me?”. Google telling the world they will index ontology A and show such content differently in their search results will do more for the adoption of A than any W3C spec or ‘call of duty’ from web of data evangelists.

The difference between RDFa and Microdata (syntactic differences aside) is basically that the proponents of the first believe that once everybody starts reusing existing ID schemes and ontologies, a densely connected web of semantically reconciled information will come together naturally. The proponents of the second just want to focus on immediate value and avoid speculating on what’s going to happen next.

This is not different from the Exhibit vs. Tabulator debate: the first is useful (and in use) today and promotes the surfacing of structured data but does little to promote linkage between isolated datasets; the second is much less useful for end users but acted as a catalyst for the concept of “linkable data”, a methodology where identifiers don’t just identify but can also be used, as-is, as web locators.

They both use the same underlying model (and can even read and write the same syntax)… yet they serve completely different purposes and have radically different aspirations and social dynamics around them: I see the same issue for RDFa vs. Microdata.

The RDFa camp sees it as a vector to promote the growth of the web of data, while the Microdata camp focuses on solving the practical problems of embedding richer machine-processable information in web pages: the models they use are isomorphic (meaning that, in a closed world scenario, you can always translate one into the other), but their aspirations and the social dynamics they expect around them are different.

It’s no secret that I tend to side with pragmatism and paving-cow-paths strategies in these debates, and I find it frankly disheartening that purists still believe that the secret to a useful web of data is already there in the guts of the architecture of the web and that simply turning a URI into a URL will cause enough social pressure to solve the other issues.

Let me show you why I think this is not the case.

Suppose you are given (or you discover yourself) these two data models (avoid looking at the syntax but know that both these fragments can be embedded in an HTML page using RDFa or Microdata, or serialized in any of the various RDF syntaxes):

<document1> -(a:author)-> <stefanom@mit.edu> -(a:name)-> "Stefano Mazzocchi"
<document1> -(a:publisher)-> "MIT Press, Cambridge, MA"

and

<document2> -(b:author)-> <stefano@metaweb.com>
<stefano@metaweb.com> -(b:first_name) -> "Stefano"
<stefano@metaweb.com> -(b:last_name) -> "Mazzocchi"
<document2> -(b:publisher)-> <http://press.mit.edu/>
<http://press.mit.edu/> -(b:name)-> "MIT Press"
<http://press.mit.edu/> -(b:location)-> "Cambridge, MA, USA"

This data says that I wrote both ‘document1’ and ‘document2’ and that they were published by the same publisher.

A first issue is model mixability: I can take the two above models and make one out of the two. Always. This is achieved by the graph model that lies behind both RDFa and Microdata (but not behind XML or JSON, for example). Both RDFa and Microdata can do this.
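As a rough illustration of mixability, here is a minimal Python sketch. It assumes each model is held as a plain set of (subject, predicate, object) tuples, with strings standing in for both nodes and literals; that representation is chosen here for brevity and is not something RDFa or Microdata prescribe.

```python
# Each graph is a set of (subject, predicate, object) triples.
# Plain strings stand in for nodes and literals (an illustrative simplification).
model_a = {
    ("document1", "a:author", "stefanom@mit.edu"),
    ("stefanom@mit.edu", "a:name", "Stefano Mazzocchi"),
    ("document1", "a:publisher", "MIT Press, Cambridge, MA"),
}

model_b = {
    ("document2", "b:author", "stefano@metaweb.com"),
    ("stefano@metaweb.com", "b:first_name", "Stefano"),
    ("stefano@metaweb.com", "b:last_name", "Mazzocchi"),
    ("document2", "b:publisher", "http://press.mit.edu/"),
    ("http://press.mit.edu/", "b:name", "MIT Press"),
    ("http://press.mit.edu/", "b:location", "Cambridge, MA, USA"),
}

# Mixing two graphs is just set union: there is no schema to align first,
# so the operation cannot fail.
merged = model_a | model_b


def terms(triples):
    """Collect every subject and object appearing in a set of triples."""
    return {t for s, _, o in triples for t in (s, o)}


# The union succeeds, but the two fragments do not share a single term:
# they sit side by side in one graph without touching.
shared_terms = terms(model_a) & terms(model_b)
```

Note that the merge is purely mechanical: the result holds all nine statements, yet nothing in it says that the two authors or the two publishers are the same thing.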

A second issue is that we don’t want spurious identifier collisions: ‘document1’ is too weak an identifier globally, as it might be reused in different parts of the web to mean different things… and once two nodes are considered the same in a model, you can’t take them apart anymore. The solution to this is to use bigger identifiers, prefixing them with unique namespaces. This is normally achieved by using URIs and web domain names that are guaranteed to be unique.
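A small sketch of the collision problem and the namespace fix, again using plain Python tuples. The two sites, their titles and the `example.org` / `example.com` namespaces are all hypothetical, invented only for this illustration.

```python
# Two unrelated sites that both happen to call something "document1".
# The titles and namespaces below are made up for illustration.
site_x = {("document1", "title", "Thoughts on Reconciliation")}
site_y = {("document1", "title", "Quarterly Sales Report")}

# Naive merge: the two "document1" nodes silently become one node,
# which now appears to have two unrelated titles.
naive = site_x | site_y
titles_of_document1 = {o for s, p, o in naive if s == "document1" and p == "title"}


def qualify(namespace, triples):
    """Prefix every subject with a namespace so local names cannot collide."""
    return {(namespace + s, p, o) for s, p, o in triples}


# Namespaced merge: each site's identifiers live under its own prefix.
safe = qualify("http://example.org/x/", site_x) | qualify("http://example.com/y/", site_y)
subjects = {s for s, _, _ in safe}
```

With the prefixes in place the merged graph keeps two distinct subjects, one per site, instead of a single node carrying both titles.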

But to solve the second issue, we made it very easy for someone to come up with their own identifiers for things, which leads us to the third issue: how do we condense a data model knowing that some of its identifiers really mean the same thing? In the above example “stefanom@mit.edu” and “stefano@metaweb.com” refer to the same entity: me. We can agree on a way to encode identifier equivalence like this:

<stefanom@mit.edu> <-(same_as)-> <stefano@metaweb.com>
<a:author> <-(same_as)-> <b:author>

but what do we do about the rest? <a:publisher> links a document with a literal while <b:publisher> links a document with a node… sure, they mean the same thing, but can they be treated the same way? Would the queries that run on the first model still work if run on the second with the properties changed?
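To make the question concrete, here is a hedged Python sketch of condensing the merged graph with an equivalence table. The `SAME_AS` dictionary and the `condense` helper are illustrative names, not part of any RDF toolkit, and mapping `b:publisher` onto `a:publisher` is an assumption standing in for "the properties changed".

```python
model_a = {
    ("document1", "a:author", "stefanom@mit.edu"),
    ("stefanom@mit.edu", "a:name", "Stefano Mazzocchi"),
    ("document1", "a:publisher", "MIT Press, Cambridge, MA"),
}
model_b = {
    ("document2", "b:author", "stefano@metaweb.com"),
    ("stefano@metaweb.com", "b:first_name", "Stefano"),
    ("stefano@metaweb.com", "b:last_name", "Mazzocchi"),
    ("document2", "b:publisher", "http://press.mit.edu/"),
    ("http://press.mit.edu/", "b:name", "MIT Press"),
    ("http://press.mit.edu/", "b:location", "Cambridge, MA, USA"),
}

# Equivalences, written as "rewrite the key to the value".
SAME_AS = {
    "stefano@metaweb.com": "stefanom@mit.edu",
    "b:author": "a:author",
    "b:publisher": "a:publisher",  # assumption: treat the two properties as one
}


def condense(triples):
    """Rewrite every term to its canonical identifier, if it has one."""
    def canon(term):
        return SAME_AS.get(term, term)
    return {(canon(s), canon(p), canon(o)) for s, p, o in triples}


merged = condense(model_a | model_b)

# This query now works across both fragments.
my_documents = {s for s, p, o in merged
                if p == "a:author" and o == "stefanom@mit.edu"}

# This one "works" too, but the values are no longer the same kind of thing:
# a display string for document1, a node identifier for document2.
publisher_of = {s: o for s, p, o in merged if p == "a:publisher"}
```

The author query finds both documents, but any code that expected `a:publisher` to yield a printable publisher string now gets `http://press.mit.edu/` for document2: renaming the property did not make the two shapes compatible.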

The fourth and biggest issue of all, and one that most semweb advocates flat out ignore, is that the first of the two models above is ‘undersampled’, meaning that there is a lot of information that is still encoded in strings inside the literals. There is no way to reconcile the two models above with symbolic operations alone: a re-sampling process has to happen for these two fragments to be merged both symbolically and semantically.
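A deliberately naive Python sketch of what re-sampling might look like for the publisher literal. The `resample_publisher` function and its "name comes before the first comma" rule are assumptions made up for this example; real data would need something far more robust.

```python
def resample_publisher(literal):
    """Split a 'Name, Location' string into structured fields.

    Naive heuristic (an assumption of this sketch): everything before the
    first ", " is the name, everything after it is the location.
    """
    name, _, location = literal.partition(", ")
    return {"name": name, "location": location}


# The undersampled value from the first model...
from_a = resample_publisher("MIT Press, Cambridge, MA")

# ...and the already structured values from the second model.
from_b = {"name": "MIT Press", "location": "Cambridge, MA, USA"}

names_match = from_a["name"] == from_b["name"]
locations_match = from_a["location"] == from_b["location"]

# The same applies to the author: one model has a single name string,
# the other has first and last name as separate values.
name_in_a = "Stefano Mazzocchi"
name_in_b = " ".join(["Stefano", "Mazzocchi"])
author_names_match = name_in_a == name_in_b
```

Even after the split, the names line up but the locations do not (“Cambridge, MA” vs. “Cambridge, MA, USA”), so plain string equality is still not enough: deciding that the two publishers are the same takes interpretation, not just identifier rewriting.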

Having a world-wide web of data where reconciliation is not considered a priority will lead to something that is dramatically more expensive to build and maintain but only marginally more useful than the collection of interlinked HTML pages we already have today, which seems to imply, at least to me, that it won’t happen.

Karger likes RDF better than XML because he thinks he doesn’t have to learn XSLT (or any other way of performing adaptation transformations) in order to perform valuable data integration. I think he is confusing model mixability with semantic reconciliation and that the first can happen without transformations but the second is really what’s useful and can’t happen (in general) without adaptation, transformations and/or model resampling.

If the above is true, a lot of the existing semantic web ‘best practices’, which directly or indirectly influence debates like the RDFa vs. Microdata one, might not look so good under this light and might have to be reconsidered or at least re-evaluated.

Errata Corrige: I’ve used the word ‘idempotent’ incorrectly when I meant to say ‘isomorphic’ instead. Kudos go to David Karger for correcting me on that.