
My Second Appearance on IT Conversations

September 26th, 2009

Jon Udell interviews me a second time for his Interviews with Innovators series over at IT Conversations.

Enjoy.

 

Continuing Thoughts on Reconciliation

September 23rd, 2009

My thoughts about data reconciliation (and later updates) seem to have hit a nerve. After taking some time to digest the comments I received publicly and privately about it, I want to address some of the issues that surfaced.

I want to draw some attention to two different blog posts that appeared recently on the Haystack Blog: one by Prof. David Karger that continues our blog debate on the dynamics around RDF, and one by one of his grad students, Edward Benson, about the differences between Microdata and RDFa in the HTML5 embedded data debate.

They seem to be about different things, but I saw a common underlying thread between the two: the concept of ‘RDF without URIs’. Karger considers the RDF model valuable independently of the scope or aspiration of the identifiers used: not only is the graph model both natural and very malleable, it’s also very ‘liquid’, in the sense that it can adapt to its container better than other data models.

In a sense, both Exhibit and the Microdata proposal in HTML5 go in the same direction: let’s focus on technology that can give users immediate benefits and avoid speculating about other aspects.

Benson suggests that the ‘namespace traffic jam’ might be the reason why people failed to use RDF in its full namespaced potential, but I disagree: the problem, IMO, is the lack of a clear and globally applicable answer to the questions “what’s in it for me?” and “what does using one identifier scheme vs. my own buy me?”. Google telling the world they will index ontology A and show such content differently in their search results will do more for the adoption of A than any W3C spec or ‘call of duty’ from web of data evangelists.

The difference between RDFa and Microdata (syntactic differences aside) is basically that the proponents of the first believe that once everybody starts reusing existing ID schemes and ontologies, a densely connected web of semantically reconciled information will come together naturally. The proponents of the second just want to focus on immediate value and avoid speculating on what’s going to happen next.

This is no different from the Exhibit vs. Tabulator debate: the first is useful (and in use) today and promotes the surfacing of structured data but does little to promote linkage between isolated datasets; the second is much less useful for end users but acted as a catalyst for the concept of “linkable data”, a methodology where identifiers don’t just identify but can also be used, as-is, as web locators.

They both use the same underlying model (and can even read and write the same syntax)… yet they serve completely different purposes and have radically different aspirations and social dynamics around them: I see the same issue for RDFa vs. Microdata.

The RDFa camp sees it as a vector to promote the growth of the web of data, while the Microdata camp focuses on solving the practical problems of embedding richer machine-processable information in web pages: the models they use are isomorphic (meaning that, in a closed world scenario, you can always translate one into the other), but their aspirations and the social dynamics they expect around them are different.

It’s no secret that I tend to side with pragmatism and paving-cow-paths strategies in these debates, and I find it frankly disheartening that purists still believe that the secret to a useful web of data is already there in the guts of the architecture of the web and that simply turning a URI into a URL will cause enough social pressure to solve the other issues.

Let me show you why I think this is not the case.

Suppose you are given (or you discover yourself) these two data models (avoid looking at the syntax but know that both these fragments can be embedded in an HTML page using RDFa or Microdata, or serialized in any of the various RDF syntaxes):

<document1> -(a:author)-> <stefanom@mit.edu> -(a:name)-> "Stefano Mazzocchi"
<document1> -(a:publisher)-> "MIT Press, Cambridge, MA"

and

<document2> -(b:author)-> <stefano@metaweb.com>
<stefano@metaweb.com> -(b:first_name)-> "Stefano"
<stefano@metaweb.com> -(b:last_name)-> "Mazzocchi"
<document2> -(b:publisher)-> <http://press.mit.edu/>
<http://press.mit.edu/> -(b:name)-> "MIT Press"
<http://press.mit.edu/> -(b:location)-> "Cambridge, MA, USA"

This data says that I wrote both ‘document1’ and ‘document2’ and that they were published by the same publisher.

A first issue is model mixability: I can take the two models above and make one out of the two. Always. This is made possible by the graph model that lies behind both RDFa and Microdata (but not behind XML or JSON, for example).
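As a minimal sketch of what mixability means in practice, the two fragments above can be written as sets of (subject, property, value) triples, and mixing them is then plain set union. The tuple encoding is an assumption made for illustration (it does not distinguish nodes from literals) and is not how RDFa or Microdata are actually serialized.

```python
# Each statement is a (subject, property, value) triple; a model is a set of them.
# Simplification: nodes and literals are both plain strings in this sketch.
model_a = {
    ("document1", "a:author", "stefanom@mit.edu"),
    ("stefanom@mit.edu", "a:name", "Stefano Mazzocchi"),
    ("document1", "a:publisher", "MIT Press, Cambridge, MA"),
}

model_b = {
    ("document2", "b:author", "stefano@metaweb.com"),
    ("stefano@metaweb.com", "b:first_name", "Stefano"),
    ("stefano@metaweb.com", "b:last_name", "Mazzocchi"),
    ("document2", "b:publisher", "http://press.mit.edu/"),
    ("http://press.mit.edu/", "b:name", "MIT Press"),
    ("http://press.mit.edu/", "b:location", "Cambridge, MA, USA"),
}

# Mixing two graph models never fails: it is just the union of their statements.
mixed = model_a | model_b

print(len(mixed), "statements in the mixed model")
```

Note that the union succeeds without knowing anything about what the statements mean, which is exactly why mixing alone says nothing yet about reconciliation.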

A second issue is that we don’t want spurious identifier collisions: ‘document1’ is too weak an identifier globally, as it might be reused in different parts of the web to mean different things… and once two nodes are considered the same in a model, you can’t take them apart anymore. The solution to this is to use bigger identifiers, prefixing them with unique namespaces. This is normally achieved by using URIs built on web domain names, which are guaranteed to be unique.
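The collision problem can be illustrated with a small sketch. The two sites, their titles and the `http://a.example/` and `http://b.example/` namespaces are hypothetical, and `qualify` is a helper invented here for illustration.

```python
# Two unrelated sites that both happen to use the local identifier "document1".
site_a = {("document1", "title", "Annual report")}
site_b = {("document1", "title", "Holiday photos")}

# Naive mixing: the two nodes collapse into one, which now has two titles.
naive = site_a | site_b
titles_of_document1 = {v for s, p, v in naive if s == "document1" and p == "title"}

def qualify(triples, namespace):
    """Prefix every subject with a namespace that only its publisher controls."""
    return {(namespace + s, p, v) for s, p, v in triples}

# Namespaced mixing: the two nodes stay apart.
safe = qualify(site_a, "http://a.example/") | qualify(site_b, "http://b.example/")
subjects = {s for s, _, _ in safe}

print(len(titles_of_document1), "titles on the collapsed node")
print(len(subjects), "distinct nodes after namespacing")
```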

But to solve the second issue, we made it very easy for someone to come up with their own identifiers for things, which leads us to the third issue: how do we condense a data model knowing that some of its identifiers really mean the same thing? In the above example “stefanom@mit.edu” and “stefano@metaweb.com” refer to the same entity: me. We can agree on a way to encode identifier equivalence like this

<stefanom@mit.edu> <-(same_as)-> <stefano@metaweb.com>
<a:author> <-(same_as)-> <b:author>

but what do we do about the rest? <a:publisher> links a document with a literal while <b:publisher> links a document with a node… sure, they mean the same thing, but can they be treated the same way? Would the queries that run on the first model still work if run on the second with the properties changed?
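To make the question concrete, here is a hedged sketch that applies identifier equivalences to the mixed model and then runs two queries written against the first model. The `Lit` wrapper, the helper names and the extra `b:publisher` to `a:publisher` mapping are assumptions introduced for illustration only.

```python
from typing import NamedTuple

class Lit(NamedTuple):
    """Marks a value as a literal string rather than a node (illustrative)."""
    text: str

model_a = {
    ("document1", "a:author", "stefanom@mit.edu"),
    ("stefanom@mit.edu", "a:name", Lit("Stefano Mazzocchi")),
    ("document1", "a:publisher", Lit("MIT Press, Cambridge, MA")),
}
model_b = {
    ("document2", "b:author", "stefano@metaweb.com"),
    ("stefano@metaweb.com", "b:first_name", Lit("Stefano")),
    ("stefano@metaweb.com", "b:last_name", Lit("Mazzocchi")),
    ("document2", "b:publisher", "http://press.mit.edu/"),
    ("http://press.mit.edu/", "b:name", Lit("MIT Press")),
    ("http://press.mit.edu/", "b:location", Lit("Cambridge, MA, USA")),
}

# Equivalences, written as alias -> canonical identifier.
# The publisher mapping is an assumption added to test the question above.
same_as = {
    "stefano@metaweb.com": "stefanom@mit.edu",
    "b:author": "a:author",
    "b:publisher": "a:publisher",
}

def apply_same_as(triples, mapping):
    """Rewrite every identifier to its canonical form; literals pass through."""
    canon = lambda x: mapping.get(x, x)
    return {(canon(s), canon(p), canon(o)) for s, p, o in triples}

merged = apply_same_as(model_a | model_b, same_as)

def documents_by(graph, person):
    return sorted(s for s, p, o in graph if p == "a:author" and o == person)

def documents_published_by(graph, publisher_text):
    # Written against the first model, where the publisher is a literal.
    return sorted(s for s, p, o in graph
                  if p == "a:publisher" and o == Lit(publisher_text))

authored = documents_by(merged, "stefanom@mit.edu")
published = documents_published_by(merged, "MIT Press, Cambridge, MA")
print(authored)   # the author query benefits from the equivalences
print(published)  # the publisher query still misses document2
```

Under these assumptions the author query finds both documents, while the publisher query finds only the first one: renaming the property is not enough when one model points to a literal and the other to a node.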

The fourth and biggest issue of all, and one that most semweb advocates flat out ignore, is that the first of the two models above is ‘undersampled’, meaning that there is a lot of information that is still encoded in strings inside the literals. There is no way to reconcile the two models above with symbolic operations alone: a re-sampling process has to happen for these two fragments to be merged both symbolically and semantically.
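A deliberately naive sketch of what such a re-sampling step might look like for the literals of the first model follows. The splitting rules and function names are assumptions chosen for this one example and would break on most real-world data, which is part of the point.

```python
def resample_publisher(literal):
    """Split 'Name, Location' on the first comma (naive, illustrative rule)."""
    name, _, location = literal.partition(",")
    return {"name": name.strip(), "location": location.strip()}

def resample_person_name(full_name):
    """Split 'First Last' on the first space (naive, illustrative rule)."""
    first, _, last = full_name.partition(" ")
    return {"first_name": first, "last_name": last}

publisher = resample_publisher("MIT Press, Cambridge, MA")
person = resample_person_name("Stefano Mazzocchi")

print(publisher)
print(person)

# Even after re-sampling, the location strings of the two models differ,
# so a purely symbolic comparison still fails on them.
locations_match = publisher["location"] == "Cambridge, MA, USA"
print(locations_match)
```

The re-sampled name and publisher name now line up with the second model, but the location ("Cambridge, MA" vs. "Cambridge, MA, USA") still needs further interpretation before the two can be treated as the same value.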

Having a world-wide web of data where reconciliation is not considered a priority will lead to something that is dramatically more expensive to build and maintain but only marginally more useful than the collection of interlinked HTML pages we already have today, which seems to imply, at least to me, that it won’t happen.

Karger likes RDF better than XML because he thinks he doesn’t have to learn XSLT (or any other way of performing adaptation transformations) in order to perform valuable data integration. I think he is confusing model mixability with semantic reconciliation and that the first can happen without transformations but the second is really what’s useful and can’t happen (in general) without adaptation, transformations and/or model resampling.

If the above is true, a lot of the existing semantic web ‘best practices’, which directly or indirectly influence debates like the RDFa vs. Microdata one, might not look so good under this light and might have to be reconsidered or at least re-evaluated.

Errata Corrige: I’ve used the word ‘idempotent’ incorrectly when I meant to say ‘isomorphic’ instead. Kudos go to David Karger for correcting me on that.

 

More Thoughts on Reconciliation

July 25th, 2009

My previous post about the importance of reconciliation and relational density in the web of data managed to attract a lot of interest and stimulating comments and criticism. I’ll try to address them here.

LOD and the Curse of Incoherence

Ed Summers (from the Library of Congress) really liked the notion of relational density as a differentiator (and asked if there is a way to calculate it), but disagrees with my view that Linking Open Data (LOD) and Freebase differ in their model of reconciliation and that the reason why LOD is making a bigger impact than other semantic web efforts is precisely because of the switch from a posteriori to a priori identifier reconciliation.

Ed certainly has a point and it is true that maybe my criticism of a posteriori reconciliation is better suited for something like Sig.ma than for LOD.

At the same time, the part about relational density remains valid: LOD itself claims that 1 out of 33 triples actually ‘interlink’ data between datasets, the others are internal relationships or literal values.
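One possible way to compute a figure like the ‘1 out of 33’ above is to count the triples whose subject and value live in different datasets. This is only a rough proxy and not a definition of relational density from this post; treating the URI host as the dataset, and the `a.example` and `b.example` identifiers, are assumptions made for illustration.

```python
from urllib.parse import urlparse

def dataset_of(value):
    """Return the host of an http(s) URI, or None for anything else.

    Assumption: one host equals one dataset, and any string that does not
    look like an http(s) URI is treated as a literal.
    """
    if isinstance(value, str) and value.startswith(("http://", "https://")):
        return urlparse(value).netloc
    return None

def interlink_ratio(triples):
    """Fraction of triples linking a node in one dataset to a node in another."""
    triples = list(triples)
    if not triples:
        return 0.0
    cross = 0
    for subject, _, value in triples:
        source, target = dataset_of(subject), dataset_of(value)
        if source and target and source != target:
            cross += 1
    return cross / len(triples)

sample = [
    ("http://a.example/doc1", "author", "http://b.example/person7"),  # interlink
    ("http://a.example/doc1", "cites", "http://a.example/doc2"),      # internal
    ("http://a.example/doc1", "title", "Some title"),                 # literal
]
ratio = interlink_ratio(sample)
print(f"{ratio:.2f}")
```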

It is worth mentioning here the experience that we had in SIMILE when creating a browsing interface on top of the union of three different high-quality digital libraries: the surprising and painful result was that even with a powerful faceted-browsing user interface, a small number of datasets and a tailored UI, the perception of our users was that merging many high-quality datasets resulted in a bigger but lower-quality one. The items from the different sets mixed, but didn’t merge (even if equivalences between various identifiers were found and explicitly integrated in the system).

Basically, a posteriori sameAs identifier equivalences are enough to reconcile if and only if the two items being merged are described using the exact same data model. I see this as another form of ‘abstraction leakage’: even if the identifiers of items and schemas were mapped and aligned, this operation might not be enough to perform a real reconciliation, as I explained in a previous post on the quality of metadata.

So, in short, while it is fair to say that LOD is indeed nudging people to link to one another’s datasets a priori and make an effort to do so, the result is relationally sparse and ontologically inconsistent… and I don’t think this is because LOD is doing a bad job but because their model revolves around the idea that the more data the better, no matter how relationally dense, which is in striking contrast to what Freebase does, focusing more on higher relational density than on higher item or domain counts.

Reconciliation Costs

John Giannandrea, Metaweb’s CTO and former Chief Technologist at Netscape, wrote to me privately in an email (which I quote here with permission) challenging my assumption that reconciliation costs grow with the amount of data and that this might lead to a saturation of resources:

Actually I think there is some evidence already that a-priori costs are an inverse function of graph size, assuming that the graph is not getting more sparse. That is, the more you know about stuff you are trying to match, the better you can do.

I added the emphasis in the quote above because I think that’s the key to the whole process: in order to attract large quantities of information you can decide to punt the problem of reconciliation, but those costs will grow at least linearly (and probably faster, given that the number of potential relationships grows quadratically) with the amount of information to reconcile. On the other hand, by focusing on high relational density from the very beginning, the velocity of data acquisition will be low but the reconciliation costs will get smaller over time, creating a positive network effect.
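The quadratic growth mentioned above is simple arithmetic: among n items there are n(n-1)/2 pairs that could potentially be related. A tiny sketch follows, with `candidate_pairs` as an illustrative name and the simplifying assumption that every pair is a candidate.

```python
def candidate_pairs(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Ten times more items means roughly a hundred times more pairs to consider.
for n in (10, 100, 1000):
    print(n, candidate_pairs(n))
```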

I found it interesting to spot the same concept mentioned in this article in a related but different context (mentioned in Tim O’Reilly’s latest web re-branding campaign):

Consider what happens when there are two records describing two different people as they appear to share the same name. “What happens is a third record shows up in the future that works like glue, which causes them to collapse,” he said. Eventually, “the more data we loaded, the fewer number of people there were.”
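The ‘glue’ effect in the quote can be sketched with a toy matcher. The records, the attribute names and the rule that two records describe the same person when they share at least two attribute values are all assumptions made for illustration, not the system the quote describes.

```python
def count_entities(records, min_shared=2):
    """Cluster records that share enough attribute values; return cluster count."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            shared = set(records[i].items()) & set(records[j].items())
            if len(shared) >= min_shared:
                parent[find(i)] = find(j)  # merge the two clusters

    return len({find(i) for i in range(len(records))})

# Two records that share only a name: not enough evidence to merge them.
first = {"name": "J. Smith", "phone": "555-0100"}
second = {"name": "J. Smith", "email": "js@example.org"}
# A third record that overlaps with both of them acts as glue.
glue = {"name": "J. Smith", "phone": "555-0100", "email": "js@example.org"}

before = count_entities([first, second])
after = count_entities([first, second, glue])
print(before, after)
```

With these hypothetical records, loading the third record takes the number of distinct people from two down to one, which is the ‘more data, fewer people’ effect in miniature.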

Big vs. Small

Prof. David Karger flatters me by writing his first (citing him) ‘blog rebuttal’ and asking whether or not RDF is any good without a web of linked data:

[..] there are some tremendous advantages that accrue when even a single individual decides to create a blob of structured data with no reference to anyone else’s. The first is interaction [...] The second is portability [...]

What I find interesting about his argument is that David uses “RDF” and “structured data” interchangeably and uses Exhibit to illustrate the interaction value that you get from adding such structure to your data, even if Exhibit itself is not properly using RDF but a data model that is equivalent in its relational structure but has lost the notion of globally unique identifiers (and in that regard, is much more similar to Excel than to Tabulator):

It doesn’t matter if I call those properties latitude, longitude, date and color, or northSouth, eastWest, sinceTheCreation and elementOfTheRainbow, and whether I decide that my city is Chicago or the Windy City—as long as I have my own internally consistent names for them, I can use them to hook my data into interesting visualizations and interactions.

I added the emphasis to what I consider the key here: big or small the dataset, consistency is a conditio sine qua non for getting any benefit out of your data. While Exhibit’s interaction design is drastically different from that of spreadsheets, their data models are very similar, and while it’s true that all Exhibits can output their data in valid RDF, their ‘isolated consistency’ model is not nearly as demanding and ambitious as the RDF one (and indeed, because of this, the user experience on Tabulator is vastly inferior, at least with the current state of the web of data, to that of Exhibit).

I think there is a subtle but important difference here: exporting RDF might help surface more data, but it does nothing to help bring that internal consistency to the outside and grow the relational density of aggregations of exhibits.

David (Huynh)’s work on Potluck was specifically targeted at trying to find a way to build an agglomerative strategy from a collection of exhibits, something that would be required to be able to condense a network effect value from something like Citeline.

In the end, I think we both agree that, small or large the dataset, subjective or objective its modeling, consistency and relational density are key to more advanced and sophisticated uses of data and are hardly something marginal that can be ignored while trying to surface its bits and pieces.