It’s All About Graphs
February 11th, 2004
For ages, the battle of abstract data modeling has been between trees and tables. The people who gave us relational databases and their wonderfully elegant query algebras worked hard to allow people to store massive quantities of information and query them in a reasonable time and with a reasonable flexibility. Today, RDBMS are so widespread that managers believe that if the data is not in the database it doesn’t even exist!
Somebody once said that 80% of the world’s business logic is in Excel spreadsheets. In the content management world, everybody and their dog knows that if you can’t talk to Word it was good to meet you and bye bye.
Data needed to be structured so that relational algebra could work on it and databases could access it in reasonable time, especially 20 years ago when storing a few more digits was a problem. Today, only a few of the biggest relational databases in the world won’t fit in my iPod! How is that for a paradigm shift?
All the realms where data could be happily forced to be structured are already saturated. Ask Oracle how that went.
But the great majority of the data out there is not structured, and there is no way in the world you can force people to structure it (remember when you were teaching your mom how wonderful styles are in Word and how much easier it would be to restyle everything if the data was more structured? Remember the look on her face? Yeah, exactly).
Yesterday W3C issued RDF and OWL as recommendations.
For years, I thought that XML was the king and RDF was its knight. Well, I got it all wrong: it’s the other way around, it’s just that it’s very hard to realize it.
People in all sorts of communities realize how important semi-structured data is and how much more important it will be in the future. These people tend to think that XML will solve the problem for them and once we have a serious XML query language, we’ll be set forever.
Well, wrong. Relations are tables, XML documents are trees, and, guess what, RDF models are graphs. Yep, you got it, RDF can describe both.
The more I discover RDF and RDF query languages, the more it seems to me that it was that stupid RDF/XML syntax that prevented people from getting what RDF really was. RDF is a model for describing labelled directed pseudo-graphs. That’s it. You can add typing (RDF Schema) or inference (OWL), but the real deal is that you now have a way to describe the most complex types of graphs.
Many people tend to think of RDF as a uselessly complex way to write markup more formally. Some people tend to think that if XML is data, RDF is metadata. Wrong and wrong. An RDF model is a graph. Period. And since all trees and tables are graphs, you can have an RDF representation of any kind of data you want.
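To make the "tables are graphs" half of that claim concrete, here is a minimal sketch in plain Python. The `people` table, its column names and the `table_to_triples` helper are all made up for illustration; the only point is that every cell of a row becomes one labelled arc from the row to the value.

```python
# Hypothetical one-row table; the table and column names are illustrative only.
rows = [{"id": 1, "given": "Stefano", "family": "Mazzocchi"}]


def table_to_triples(table_name, rows, key="id"):
    """Turn every non-key cell into a (subject, predicate, object) arc."""
    triples = []
    for row in rows:
        subject = f"{table_name}/{row[key]}"  # one node per row
        for column, value in row.items():
            if column != key:
                # one labelled arc per cell: row --column--> value
                triples.append((subject, f"{table_name}#{column}", value))
    return triples


table_triples = table_to_triples("people", rows)
for triple in table_triples:
    print(triple)
```

The row turns into a node, the columns turn into arc labels, and the values turn into the nodes those arcs point at. Nothing is lost, it is just the same table seen as a graph.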
My favorite thought of the day, still resonating on Jon’s confessions, is the fact that an XML document can be written as an RDF document. Here is how. Take the following XML document:
    <body>
      <p>This is a paragraph</p>
      <div>This is a block</div>
    </body>
now think about its tree representation
    <body>
      +--- <p>
             +---- [this is a paragraph]
      +--- <div>
             +---- [this is a block]
now think of it as a graph where the elements are the nodes and the parent/child relationships are the arcs
    (body) ----(includes)---> p ---(includes)---> [this is a paragraph]
       |
       +-------(includes)---> div ---(includes)---> [this is a block]
and now, we can write this as RDF (using N3 which is much easier to understand than RDF/XML syntax):
    @prefix : <http://www.w3.org/1999/xhtml#> .
    @prefix tree: <http://whatever/tree#> .

    :body has tree:child :p .
    :body has tree:child :div .
    :p has tree:child "this is a paragraph" .
    :div has tree:child "this is a block" .
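The same walk from tree to arcs can be done mechanically. What follows is only a sketch, using Python's standard `xml.etree.ElementTree`; the `tree_to_triples` helper is hypothetical, and it names each node after its tag just as the example above does, so a real converter would need unique node identifiers as soon as an element appears twice.

```python
import xml.etree.ElementTree as ET

XML = "<body><p>This is a paragraph</p><div>This is a block</div></body>"


def tree_to_triples(element):
    """Emit one tree:child arc for every parent/child edge of the tree."""
    triples = []
    node = f":{element.tag}"  # assumption: tag names are unique in the document
    text = (element.text or "").strip()
    if text:
        # text content becomes a literal child of the element
        triples.append((node, "tree:child", text))
    for child in element:
        triples.append((node, "tree:child", f":{child.tag}"))
        triples.extend(tree_to_triples(child))
    return triples


xml_triples = tree_to_triples(ET.fromstring(XML))
for subject, predicate, obj in xml_triples:
    print(subject, predicate, obj)
```

Each tuple corresponds to one of the N3 statements: the subject is the parent, `tree:child` is the arc, and the object is either another element or a piece of text.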
Now, if I know my readers well enough, you are thinking “oh, c’mon, that’s pure academic talk, how is that supposed to help me?” Well, let us suppose, for example, that on top of the RDF-ized version of your XML document, you have an RDF Schema for your XHTML RDF representation that says:
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix typesetting: <http://whatever/typesetting#> .

    typesetting:block rdf:type rdfs:Class .
    :p rdf:type typesetting:block .
    :div rdf:type typesetting:block .
This allows us to declare a class of semantic equivalence to which both P and DIV belong. It also gives us a magic separation of concerns: if tomorrow we add BLOCKQUOTE and consider it a block, neither the query nor the data model has to change, only the aggregating schema.
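Here is a rough sketch of that separation, with triples held as plain Python tuples rather than in a real RDF store. The `block_texts` query and the blockquote sample text are invented for the example; the thing to notice is that the query function never mentions P, DIV or BLOCKQUOTE.

```python
# The document, as tree:child arcs (same triples as the N3 above).
data = {
    (":body", "tree:child", ":p"),
    (":body", "tree:child", ":div"),
    (":p", "tree:child", "this is a paragraph"),
    (":div", "tree:child", "this is a block"),
}

# The aggregating schema: which elements count as blocks.
schema = {
    (":p", "rdf:type", "typesetting:block"),
    (":div", "rdf:type", "typesetting:block"),
}


def block_texts(data, schema):
    """Return the children of every node the schema types as a block."""
    blocks = {s for (s, p, o) in schema
              if p == "rdf:type" and o == "typesetting:block"}
    return sorted(o for (s, p, o) in data
                  if s in blocks and p == "tree:child")


# Hypothetical new document that also contains a blockquote.
data_with_quote = data | {
    (":body", "tree:child", ":blockquote"),
    (":blockquote", "tree:child", "this is a quote"),
}

# Only the schema gains a line; the query stays untouched.
extended_schema = schema | {(":blockquote", "rdf:type", "typesetting:block")}

before = block_texts(data_with_quote, schema)
after = block_texts(data_with_quote, extended_schema)
print(before)
print(after)
```

With the old schema the quote is ignored; with the extended one the very same query returns it alongside the paragraph and the block.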
Are you starting to see it? But wait, there is more: suppose that I was the author of the text included in one of those blocks. You could write:
    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix asf: <http://apache.org/people#> .
    @prefix vcard: <http://www.w3.org/2001/vcard-rdf/3.0#> .

    :p has tree:attribute [ dc:author asf:stefano ] .
    asf:stefano has vcard:N [ vcard:Family "Mazzocchi" ] , [ vcard:Given "Stefano" ] .
But again, you may feel that this is a very complex way to do what a simple attribute with my first and last name would have done. First of all, this approach guarantees uniqueness, while the attribute approach doesn’t (unless you use a URI as the attribute value, but then you lose the ability to do a textual query for my name). But most important is the ability to do basic inferencing:
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix mit: <http://web.mit.edu/people#> .

    mit:stefanom owl:sameAs asf:stefano .
This means that if you search for all blocks of text whose author is me, not only does it not matter which tag was used to mark up that block of text, but you can also search for my name and get the text authored under any of the URIs that identify me as a person.
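A toy version of that search might look like the sketch below. It is not how an OWL reasoner works internally; it only treats `owl:sameAs` as a symmetric, transitive link and follows it before matching authors. Two simplifications are mine: `dc:author` is attached directly to the element instead of going through a `tree:attribute` node, and the `:div` authored under the MIT URI is invented for the example.

```python
triples = {
    (":p", "dc:author", "asf:stefano"),
    (":div", "dc:author", "mit:stefanom"),  # hypothetical second block
    ("mit:stefanom", "owl:sameAs", "asf:stefano"),
}


def aliases(node, triples):
    """Collect every URI linked to node by owl:sameAs, in either direction."""
    seen, todo = {node}, [node]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if p != "owl:sameAs":
                continue
            if s == current:
                neighbour = o
            elif o == current:
                neighbour = s
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                todo.append(neighbour)
    return seen


def authored_by(person, triples):
    """Find everything authored by person under any of their URIs."""
    same = aliases(person, triples)
    return sorted(s for (s, p, o) in triples
                  if p == "dc:author" and o in same)


found = authored_by("asf:stefano", triples)
print(found)
```

Asking for `asf:stefano` or for `mit:stefanom` gives back the same two blocks, whatever tag each of them happens to use.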
It’s all about graphs.