
On Data Integration with Semantic Web Technologies

March 29th, 2007

It took a while, but it’s becoming more and more obvious to people that the mixability of RDF makes it a great candidate for integrating data from various independent sources. This puts RDF on a direct collision course with data warehousing technologies, OLAP, business intelligence tools, multi-dimensional databases and the like, even if I’m sure very few people in those markets are actually aware that RDF is not just another XML markup.

The concept of data integration is even older than computers. It’s as old as the idea of data itself: all datasets that we use today (from a book to a collection of pictures, from a library catalog to census data) had to be collected by different people or entities and then integrated.

I’ve been preaching the use of RDF and friends as a way to help with data integration, mostly because of the unified graph model, the global identification space, the syntax independence and the intrinsic mixability of unordered statements. I’ve also praised the use of parts of OWL as ‘glue’ to bring together data that is modeled using different ontologies, or identified by different identifiers.

While such a vision is appealing, it’s also incomplete: the data that I’ve had to deal with in my day job reveals that while equivalences help, satisfactory data integration can hardly be achieved with equivalences alone.

An example will help me show my point.

Suppose that we are given two digital libraries about paintings.

In one, the data is modeled with three types “person”, “work” and “image”, while the other has only “person” and “image”.

In the first, a certain person is an author of certain works and a certain work is depicted by certain images. This seems artificially complex but allows one to model things like ‘a picture of the back of a painting’. It also allows one to give information about the image without giving it to the work itself (for example, the resolution is a property of the image, not of the work the image depicts, while the material is a property of the work, not of the digital picture, and so on).

In the second, a simpler modeling approach has been taken and images and works are considered the same thing.

Now, following the established semantic web practices, one would want a system that allows one to discover or edit equivalences between items, so that different identifiers for the same thing can be equated and their data smooshed together.

But in the first dataset we have identifiers for works and identifiers for images, while in the second we have a single identifier for both. No matter which equivalence we choose, we would link a type “work” or “image” in the first to an incompatible type “work/image” in the second.

This might seem academic, but in practice it is a problem. Say Picasso’s Guernica appears in both datasets: if you search for it, you get two separate items, which is clearly not what users want. So we can draw an equivalence between the “work” Guernica and the “image” Guernica… but in doing so, we have smooshed the ‘resolution’ property of the digital image onto the work, which clearly doesn’t have one, being a physical object.

The alternative is to link the “images”, but now we might have merged the author of the image (who took the picture of the Guernica painting!) with the author of the painting depicted by the image (that is Pablo Picasso). So John Smith ends up being the co-creator of Guernica alongside Picasso, even if he just took a picture of it!
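The two failure modes can be reproduced with a few lines of Python. This is only a minimal sketch: triples are plain tuples rather than a real RDF store, and every identifier in it (the `a:` and `b:` prefixes, the property names, the resolutions) is made up for illustration.

```python
# Triples are plain (subject, predicate, object) tuples.
# Every identifier below is hypothetical, invented for this illustration.
dataset_a = {  # models "person", "work" and "image"
    ("a:work/guernica", "a:author", "a:person/picasso"),
    ("a:work/guernica", "a:material", "oil on canvas"),
    ("a:work/guernica", "a:depictedBy", "a:image/guernica-front"),
    ("a:image/guernica-front", "a:author", "a:person/john-smith"),
    ("a:image/guernica-front", "a:resolution", "1024x768"),
}
dataset_b = {  # models only "person" and "image" (work and image collapsed)
    ("b:item/guernica", "b:author", "b:person/picasso"),
    ("b:item/guernica", "b:resolution", "800x600"),
}

# Equivalences that look uncontroversial in this toy example.
shared = {
    "b:author": "a:author",
    "b:resolution": "a:resolution",
    "b:person/picasso": "a:person/picasso",
}


def smoosh(triples, equivalences):
    """Rewrite every term to its canonical identifier, merging equated nodes."""
    def canon(term):
        return equivalences.get(term, term)
    return {(canon(s), canon(p), canon(o)) for s, p, o in triples}


def properties_of(triples, subject):
    """Return the sorted (predicate, object) pairs attached to one subject."""
    return sorted((p, o) for s, p, o in triples if s == subject)


everything = dataset_a | dataset_b

# Option 1: equate the collapsed item with the "work".
as_work = smoosh(everything, {**shared, "b:item/guernica": "a:work/guernica"})

# Option 2: equate the collapsed item with the "image".
as_image = smoosh(everything, {**shared, "b:item/guernica": "a:image/guernica-front"})

print(properties_of(as_work, "a:work/guernica"))
print(properties_of(as_image, "a:image/guernica-front"))
```

With the first option the physical work acquires a resolution; with the second, the image node lists both Picasso and John Smith as authors. Neither equivalence produces a clean merge.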

Let me give you another example.

Again, you have the same two datasets about paintings. In both, a property “location” is given that identifies the official geographical location of the painting. In one you are given the name of the museum, such as “Museum of Fine Art, Boston, MA, USA” or “Getty Museum, Los Angeles, CA, USA”, while in the other you are given the location as geographical coordinates, such as “+34.39484,-123.34248”.

The two “location” properties have different URIs, as they come from different ontologies, but they carry the same information… the problem is that they express it in different ways. If we draw an equivalence between the two properties and try, for example, to use that property to locate paintings on a map, the paintings from one collection will show up while those from the other won’t.
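As a rough illustration, assume the two properties have already been declared equivalent, so every painting carries a single “location” value. The record layout, the painting titles and the `plottable` helper below are assumptions made for this sketch, not part of either dataset.

```python
import re

# After equating the two properties, both collections expose one "location"
# field, but the values are still encoded in two different ways.
paintings = [
    {"title": "Painting A", "location": "Museum of Fine Art, Boston, MA, USA"},
    {"title": "Painting B", "location": "+34.39484,-123.34248"},
]

# Matches "latitude,longitude" as signed decimal numbers.
COORDINATES = re.compile(r"^([+-]?\d+(?:\.\d+)?),([+-]?\d+(?:\.\d+)?)$")


def plottable(items):
    """Return (title, lat, lon) for items whose location parses as coordinates."""
    points = []
    for item in items:
        match = COORDINATES.match(item["location"])
        if match:
            points.append((item["title"], float(match.group(1)), float(match.group(2))))
    return points


print(plottable(paintings))
```

Only the painting with coordinates makes it onto the map. The property equivalence told the map code where to look, but not how to read what it found there.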

This is another example where equivalences, by themselves, are not enough.

There is an implicit assumption among semantic web practitioners that once the data is in RDF and is using different ontologies, all that’s left to do is to find a way to map the various ontologies together and voilà, data integration at a global scale!

RDF might help simplify certain operations, but the problem of integration is not just about the identifiers used by the data models; it is also about the act of modeling itself!

If you only have one image per painting, there is very little need to model works and images independently and it’s left to the reader to understand that the metadata “resolution” applies to the image and not to the actual painting.

There are modeling mismatches that simply cannot be solved with ontological mappings alone.

This is a form of ‘undermodeling’, similar to the concept of aliasing and artifacts introduced by sampling: a data model is a way of sampling an information space. In audio processing, it’s obvious that mixing two samples with different sampling resolutions would result in total garbage no matter how one decides to align them in time.

We have a similar problem here: given a set of images of paintings where only one image per painting exists, the data model ‘undersamples’ that information space, collapsing the work and the image into one concept.

Following the same sampling rationale, we need to ‘resample’ the data model and decouple the ‘works’ from the ‘images’, or convert the Museum location into coordinates. But this is far from trivial and clearly not something that can be done without intimate domain knowledge.

So what’s missing?

This ‘resampling’ facility, in its most abstract shape, is nothing but a transformation stage: RDF comes in, RDF comes out, possibly different and more aligned with the rest we have to integrate.
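To make the idea concrete, here is a minimal sketch of one such stage, which splits the collapsed “work/image” items of the second dataset into separate works and images. Triples are plain tuples, and all identifiers, type names and the `#work` / `#image` suffixes are hypothetical. The `IMAGE_PROPERTIES` set stands in for the domain knowledge that a person would have to supply.

```python
# Assumption: a human with domain knowledge decided which properties
# describe the digital image rather than the physical work.
IMAGE_PROPERTIES = {"b:resolution"}


def split_work_and_image(triples, item_type="b:WorkImage"):
    """Triples in, triples out: split each collapsed item into a work and an image."""
    items = {s for s, p, o in triples if p == "rdf:type" and o == item_type}
    out = set()
    for s, p, o in triples:
        if s not in items:
            out.add((s, p, o))             # untouched: not a collapsed item
        elif p == "rdf:type":
            continue                        # the old type is replaced below
        elif p in IMAGE_PROPERTIES:
            out.add((s + "#image", p, o))  # property of the digital image
        else:
            out.add((s + "#work", p, o))   # property of the physical work
    for s in items:
        out.add((s + "#work", "rdf:type", "a:Work"))
        out.add((s + "#image", "rdf:type", "a:Image"))
        out.add((s + "#work", "a:depictedBy", s + "#image"))
    return out


collapsed = {
    ("b:item/guernica", "rdf:type", "b:WorkImage"),
    ("b:item/guernica", "b:author", "b:person/picasso"),
    ("b:item/guernica", "b:resolution", "800x600"),
    ("b:person/picasso", "b:name", "Pablo Picasso"),
}

resampled = split_work_and_image(collapsed)
for triple in sorted(resampled):
    print(triple)
```

The sketch deliberately ignores triples that point at a collapsed item from elsewhere, and it assumes every property falls cleanly on one side of the split. Real data will not be that polite, which is exactly why supervision is needed.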

While some of these transformation operations can be done unsupervised, in general human intervention (either by an individual or by a voting group) will be required.

The ability to turn that supervision into a ‘scripted’ set of transformation operations will be required as well. For XML one would use XSLT, but for RDF there is no such thing (there are rule languages on the horizon, but they hardly seem to fit these needs… even if XSLT wasn’t properly designed for XML transformations either).

I honestly don’t know how these tools will work, what shape they’ll have and how much automation they will be able to provide to human users. But one thing is for sure: while declarative equivalences help data integration with RDF, they are far from being enough, and we can no longer afford to ignore this problem.