A No-nonsense Guide to Semantic Web Specs for XML People [Part II]

November 5th, 2004

In Part I, I introduced you to the wonders of RDF, RDF Schema and OWL, hoping to give you as little nonsense as possible. Now that we have scratched the surface, let’s dive a little deeper into the subject.

Sparql

I was going to talk about RDQL in this article, but on Oct. 12, the Data Access Working Group (DAWG, pronounced so that it rhymes with ‘dog’) released the first working draft of Sparql, the query language for RDF.

First of all, why a new query language if there are so many out there? What about SQL? Or XQuery?

The nature of RDF makes it very hard (if not impossible) to adapt those query languages to the needs of the semantic web. First of all, the highly distributed nature of RDF models requires identifiers to be URIs rather than table names. Sure, one could use table names as URIs, but SQL lacks the namespace-like ability to prefix names.

Moreover, SQL has no notion of statements (the ‘subject -(predicate)-> object’ that RDF is based on): relations between rows are just foreign-key references and carry no name of their own.

The same thing can be said about XML: the nesting of elements has a very specific semantic value, but that value is implicit in the schema used and is not accessible to the algorithms processing the data.

Just as RDF aims to be portable across systems and as self-explanatory as possible, its query language aims to match that level of semantic portability. XQuery follows the implicit semantic model of XML (don’t worry if you don’t get why now, I will describe this in more depth below) and therefore every query is strongly tied to the particular schema that your data exhibits.

But an example is worth a thousand words and you’ll see what I mean:

PREFIX  dc:  <http://purl.org/dc/elements/1.1/>
PREFIX  ns:  <http://example.org/ns#>

SELECT  ?title ?price
WHERE   ( ?x dc:title ?title ), ( ?x ns:price ?price ) AND ?price < 30

First of all, note the SQL-like nature of the syntax: it was a conscious decision (and IMO, a very good one) to stop the “golden hammer” antipattern of wanting to use the XML syntax for everything, including those things where it doesn’t make any sense at all, like a query or a transformation language (yes, XSLT, I’m looking at you!).

Also, note how the namespace-like prefixing ability is introduced: this allows you to use fully-qualified URIs as names without making the query any more verbose.

Second, note the “?blah” notation: this means that ‘blah’ is going to be considered a “variable”. You can think of it as a named wildcard that will hold the value of whatever matches the query constraints. Here, we are looking for the title and the price of particular items that satisfy the constraints.

The constraints are introduced by specifying particular statements that have one or more variables. Note how the ?x variable is used only inside the constraint area, to link the two statements.

So, as a result, the query will return the title and the price of items where the price is less than 30.
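To make the matching less abstract, here is a minimal Python sketch of what such a query does conceptually: each statement pattern is matched against the stored statements, and a variable that appears twice (here ?x) forces both patterns to agree on the same subject. This is only an illustration, not how a real Sparql engine is built. The sample books, their URIs and the helper names (match, solve, cheap_items) are all assumptions made up for this example.

```python
# Hypothetical sample data: (subject, predicate, object) statements.
DC = "http://purl.org/dc/elements/1.1/"
NS = "http://example.org/ns#"

TRIPLES = [
    ("http://example.org/book/1", DC + "title", "SPARQL Tutorial"),
    ("http://example.org/book/1", NS + "price", 42),
    ("http://example.org/book/2", DC + "title", "The Semantic Web"),
    ("http://example.org/book/2", NS + "price", 23),
]


def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` equals `triple`; None on failure."""
    candidate = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it against its earlier binding.
            if candidate.setdefault(term, value) != value:
                return None
        elif term != value:
            return None  # a constant term that does not match
    return candidate


def solve(patterns, triples):
    """Return every binding that satisfies all patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


def cheap_items(triples, limit=30):
    """Mimic the query above: title and price of items cheaper than `limit`."""
    patterns = [("?x", DC + "title", "?title"), ("?x", NS + "price", "?price")]
    return [
        (b["?title"], b["?price"])
        for b in solve(patterns, triples)
        if b["?price"] < limit
    ]


print(cheap_items(TRIPLES))
```

Note that ?x never shows up in the result: it exists only to tie the two statements together, exactly as in the query.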

Big deal, I hear you saying. I can do that today in SQL.

True, you can. If your data is local and you control it. But what if you want a software agent to do the queries for you? How are you going to find out, across different databases, how to adapt your query to their own internal logic, to their tables and to the way they modeled the information in their relational model?

On the other hand, I hear you saying, well, that’s why XQuery was created, so that semantics were more explicit. Well, it is true that the namespaced XML model allows for better identification of elements, but the problem with XQuery, which it inherits from the XML model, is that semantics are encoded (implicitly!) in the XPaths, and there is no guarantee that the same query will work transparently across different databases, since the position of those elements might change, for example, when several XML sources are harvested and aggregated into one.
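The fragility of position-based paths is easy to reproduce. The following Python sketch uses the standard library's ElementTree and two made-up documents (the catalog, item and title element names are assumptions for illustration only): the same record, first on its own and then wrapped by a hypothetical aggregator.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the same <item> record, before and after
# being aggregated under new wrapper elements.
ORIGINAL = "<catalog><item><title>Hello</title></item></catalog>"
AGGREGATED = (
    "<feeds><source><catalog>"
    "<item><title>Hello</title></item>"
    "</catalog></source></feeds>"
)

# A path written against the original layout, relative to the root element.
FIXED_PATH = "./item/title"


def titles(document, path):
    """Return the text of every element that `path` selects in `document`."""
    root = ET.fromstring(document)
    return [element.text for element in root.findall(path)]


print(titles(ORIGINAL, FIXED_PATH))         # the path fits the original layout
print(titles(AGGREGATED, FIXED_PATH))       # same data, but nothing is found
print(titles(AGGREGATED, ".//item/title"))  # a descendant search still works
```

The data did not change at all, only its position did, and yet the fixed path silently returns nothing.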

But let me give you another example.

Let’s start with an RDF model (this time I’m going to use the RDF/XML syntax so that you get exposed to different syntaxes and you understand that, unlike XML, the value of RDF is in the model, not in the syntax):

<rdf:RDF
   xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#"
   xmlns:foaf="http://xmlns.com/foaf/0.1/"
   xmlns:ex="http://work.example/ns/people#"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>

  <ex:Person rdf:about="http://work1.example/people/2334234">
   <foaf:name>Alice</foaf:name>
   <foaf:mbox rdf:resource="mailto:alice@work.example"/>
   <vcard:N>
     <vcard:Family>Hacker</vcard:Family>
     <vcard:Given>Alice</vcard:Given>
   </vcard:N>
  </ex:Person>

  <ex:Person rdf:about="http://work2.example/people/34234">
   <foaf:name>Bob</foaf:name>
   <foaf:mbox rdf:resource="mailto:bob@work2.example"/>
  </ex:Person>

  <rdf:Description rdf:about="http://work3.example/people/2334234">
   <foaf:name>Eve</foaf:name>
   <vcard:N>
     <vcard:Given>Eve</vcard:Given>
     <vcard:Family>Hacker</vcard:Family>
   </vcard:N>
  </rdf:Description>

</rdf:RDF>

and if you ignore the rdf: namespaced things, it looks like a regular namespaced XML file. Now, let’s run a query on top of it:

PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX vcard:   <http://www.w3.org/2001/vcard-rdf/3.0#>
SELECT ?foafName ?mbox ?fname ?gname
WHERE  ( ?x foaf:name ?foafName )
  [ ( ?x foaf:mbox ?mbox ) ]
  [ ( ?x  vcard:N  ?vc )
     [ ( ?vc vcard:Family ?fname ) ( ?vc vcard:Given  ?gname ) ]
  ]

and this gives the following result:

+----------+-----------------------------+----------+---------+
| foafName |             mbox            |  fname   |  gname  |
+----------+-----------------------------+----------+---------+
| "Alice"  | <mailto:alice@work.example> | "Hacker" | "Alice" |
| "Bob"    | <mailto:bob@work2.example>  |          |         |
| "Eve"    |                             | "Hacker" | "Eve"   |
+----------+-----------------------------+----------+---------+

A few important things to note:

  1. Sparql is designed to work effectively with real life data models, where data is often aggregated from different sources, uses different vocabularies, and is normally incomplete and inconsistent. Following the basic principles of RDF, consistency is enforced not in the data model but in how vocabularies are defined, used and mapped.
  2. The query is against the RDF model, not against the particular syntax used to serialize it. It is entirely possible to run an XQuery against the above RDF data by considering its RDF/XML representation and using the underlying XML model, but obtaining the same result set will require a lot more complexity, especially when it comes to handling missing data gracefully.
  3. The output of the query is a table. The DAWG hasn’t yet decided what syntax will be used to represent this result set.
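The graceful handling of missing data can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm the DAWG specifies: the statements are the ones from the RDF/XML above (minus the type statements), the blank nodes behind vcard:N are given the made-up labels _:n1 and _:n2, and the helper names (match, solve, optional, run_query) are assumptions. The idea is that a block in square brackets extends a row when it matches and leaves the row untouched when it does not.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
VCARD = "http://www.w3.org/2001/vcard-rdf/3.0#"
ALICE = "http://work1.example/people/2334234"
BOB = "http://work2.example/people/34234"
EVE = "http://work3.example/people/2334234"

TRIPLES = [
    (ALICE, FOAF + "name", "Alice"),
    (ALICE, FOAF + "mbox", "mailto:alice@work.example"),
    (ALICE, VCARD + "N", "_:n1"),  # "_:n1" and "_:n2" are made-up blank node labels
    ("_:n1", VCARD + "Family", "Hacker"),
    ("_:n1", VCARD + "Given", "Alice"),
    (BOB, FOAF + "name", "Bob"),
    (BOB, FOAF + "mbox", "mailto:bob@work2.example"),
    (EVE, FOAF + "name", "Eve"),
    (EVE, VCARD + "N", "_:n2"),
    ("_:n2", VCARD + "Given", "Eve"),
    ("_:n2", VCARD + "Family", "Hacker"),
]


def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` equals `triple`; None on failure."""
    candidate = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if candidate.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return candidate


def solve(patterns, bindings):
    """Required block: keep only the bindings that every pattern can extend."""
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in TRIPLES:
                candidate = match(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


def optional(patterns, bindings, then=None):
    """Optional block: extend a binding if possible, otherwise keep it as is."""
    result = []
    for binding in bindings:
        extended = solve(patterns, [binding])
        if extended and then is not None:
            extended = then(extended)  # nested block, only tried on a match
        result.extend(extended if extended else [binding])
    return result


def run_query():
    rows = solve([("?x", FOAF + "name", "?foafName")], [{}])
    rows = optional([("?x", FOAF + "mbox", "?mbox")], rows)
    name_parts = [("?vc", VCARD + "Family", "?fname"), ("?vc", VCARD + "Given", "?gname")]
    rows = optional([("?x", VCARD + "N", "?vc")], rows,
                    then=lambda inner: optional(name_parts, inner))
    columns = ("?foafName", "?mbox", "?fname", "?gname")
    return [tuple(row.get(column) for column in columns) for row in rows]


for row in run_query():
    print(row)
```

Unbound variables come back as None here, which corresponds to the empty cells in the table above.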

XML vs. RDF

It took a while, but here we come to the real question: if I already know XML, why should I bother with RDF? In short, what is this hype about RDF solving the problems of XML? And what problems, by the way?

First of all, allow me to take a step back and to use the above RDF model (the one about “Alice”, “Bob” and “Eve”) as a starting point. A good XML programmer, if asked to encode that information in XML, could have come up with something like this:

<addressbook
  xmlns="http://work.example/ns/people/1.0"
  xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
>

 <person id="1">
   <foaf:name>Alice</foaf:name>
   <foaf:mbox>mailto:alice@work.example</foaf:mbox>
   <vcard:Family>Hacker</vcard:Family>
   <vcard:Given>Alice</vcard:Given>
 </person>

 <person id="2">
   <foaf:name>Bob</foaf:name>
   <foaf:mbox>mailto:bob@work2.example</foaf:mbox>
 </person>

 <person id="3">
   <foaf:name>Eve</foaf:name>
   <vcard:Given>Eve</vcard:Given>
   <vcard:Family>Hacker</vcard:Family>
 </person>

</addressbook>

This is a reasonable and well-conceived XML representation of the above data, where the author chose to prefer elements over attributes. Now let us compare this with the above RDF model:

  1. the XML model needs a root element and since this looks like an addressbook, the person doing the schema decided to use such an element as the container. It could have been anything really, but the point here is that if we take those <person> elements and move them into another context, we need to rewrite the XQuery or the XPaths that lead to them, unless, of course, we started our XPaths with //.
  2. the XML model is able to identify a particular element inside the document space, but that ID is not guaranteed to be unique across documents (the impact of this could be reduced if the practice of using URIs for ids was more widespread but it’s really not the case, also because very few XML people care about absolute identification of elements, considering XPaths a much more flexible way to address parts of a document).
  3. the XML model does not, on its own, have a native distinction between URIs and Literals. This means that “Bob” and “mailto:bob@work2.example” are treated equivalently by the XML parser, unlike in RDF.
  4. last, but not least, the XML model does not make the relationships between elements explicit and uniquely addressable.

This last point is important enough to require its own section.

Explicit vs. Implicit Semantics

Let us consider a really simple example that everybody is familiar with:

<head>
  <title>Hello World!</title>
  <meta name="dc:author" value="Stefano Mazzocchi"/>
</head>

now, let me remove the syntax and get to the model

[head]
   +---[title]
   |      +---- "Hello World!"
   +---[meta]
          +---- "dc:author"@name
          +---- "Stefano Mazzocchi"@value

don’t worry if you are not familiar with this notation, I just made it up :-) the point is to show that those angle brackets are just one way to encode the above XML fragment and that others are equally possible, for example, an XPath-ish representation:

/head/title/text()="Hello World!"
/head/meta/@name="dc:author"
/head/meta/@value="Stefano Mazzocchi"

or a SAX-ish one:

start_element(head)
start_element(title)
text("Hello World!")
end_element(title)
start_element(meta)
attribute(name,"dc:author")
attribute(value,"Stefano Mazzocchi")
end_element(meta)
end_element(head)

these are all equivalent, and now that I made you forget those angle brackets for a second, let us analyze the implicit semantics that the above XHTML fragment carries.
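To show that nothing is lost when the angle brackets go away, here is a small Python sketch that parses the fragment with the standard library's ElementTree and prints the XPath-ish representation back out. The function name path_statements is made up for this example.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<head>
  <title>Hello World!</title>
  <meta name="dc:author" value="Stefano Mazzocchi"/>
</head>"""


def path_statements(element, prefix=""):
    """Flatten an element tree into one line per text node or attribute."""
    path = f"{prefix}/{element.tag}"
    lines = []
    text = (element.text or "").strip()
    if text:  # skip whitespace that is only there for indentation
        lines.append(f'{path}/text()="{text}"')
    for name, value in element.attrib.items():
        lines.append(f'{path}/@{name}="{value}"')
    for child in element:
        lines.extend(path_statements(child, path))
    return lines


for line in path_statements(ET.fromstring(FRAGMENT)):
    print(line)
```

The parser hands over the structure and nothing more: what it means for a title to sit inside a head is not part of what it returns.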

The first thing this model tells me is that this page has a title and that this title is “Hello World!”. You would think this is really not a big deal to understand and not even a big deal to code. Now, a general algorithm could be written that says that every time an element is nested inside another element, the content of the child enriches the information of the parent element.

That seems pretty straightforward, doesn’t it?

Well, look at the second element: interpreted with the above reasoning, the meta element would not add any information to the head element, since it’s empty! Of course, you say, the information is encoded in attributes this time. Correct. So we can think about extending the above rule so that every attribute found is treated as if it were a child element.

If that is the case, head ends up having a title, a name and a value. No, wait, the name and the value are associated to the meta tag. Ok, so we need to change our interpreting rule: every time you find two attributes, one name and one value, they represent a tuple and this gets associated with the parent of the element.

That seems to be working, right? Cool, now let’s take a look at a bunch of different nested elements and indicate what they are meant to signify:

head/title ---> this page has this title
p/img ---> this image is enclosed in this paragraph
xsl:choose/xsl:when ---> this conditional contains this test
xs:complexType/xs:sequence ---> this complex type is composed of this sequence
...

As you can see, no rule can be general enough to describe all the possible semantic meanings of element nesting, and this is exactly why RDF makes the predicates explicit.

That’s it!

That’s all there is as a difference, but it’s a huge one nevertheless: if this implicit semantic information is somehow made available, it is entirely possible to transform XML into RDF, for example, through the use of an XSLT stylesheet. As I showed above, this stylesheet cannot be a general one-size-fits-all one, but must be tuned for the specific schema and/or for the specific requirements that the data consumer might have (for example, what data should be given a literal and what should be given a URI). Here is a paper that part of our group wrote to describe what we have done in the migration of XML data into RDF.
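As a rough idea of what such a schema-specific mapping looks like, here is a minimal sketch in Python rather than XSLT (a stand-in chosen for brevity, not what our group used). It handles only the XHTML head fragment from above, and several things in it are assumptions: the page URI is invented, since the fragment never says which page it describes, and reading head/title as a dc:title statement is my own choice of predicate.

```python
import xml.etree.ElementTree as ET

XHTML_HEAD = """<head>
  <title>Hello World!</title>
  <meta name="dc:author" value="Stefano Mazzocchi"/>
</head>"""

# Hypothetical subject URI for the page that carries this <head>.
PAGE = "http://example.org/hello.html"
DC = "http://purl.org/dc/elements/1.1/"
PREFIXES = {"dc": DC}  # prefixes this mapping knows how to expand


def head_to_triples(xml_text, page_uri):
    """Turn the implicit semantics of one specific schema into explicit statements."""
    head = ET.fromstring(xml_text)
    triples = []
    # Rule for <title>: nesting here means "this page has this title".
    title = head.find("title")
    if title is not None:
        triples.append((page_uri, DC + "title", title.text))
    # Rule for <meta>: the name/value attribute pair is a predicate/object
    # pair about the page, not about the <meta> element itself.
    for meta in head.findall("meta"):
        prefix, _, local = meta.get("name", "").partition(":")
        if prefix in PREFIXES and local:
            triples.append((page_uri, PREFIXES[prefix] + local, meta.get("value")))
    return triples


for triple in head_to_triples(XHTML_HEAD, PAGE):
    print(triple)
```

Both rules are knowledge about XHTML that lives in the code, not in the data, and a different schema would need different rules. This mapping also emits every object as a literal; deciding which values deserve a URI instead is one more schema-specific choice.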

So, in short: should you care about RDF? For now, you are safe if you care about keeping your own data valid and coherent. The semantic web is trying hard to break the chicken-and-egg problem of “no killer app until data, no data until killer app”, and automatic transformation of existing data into RDF is what I think is going to break it. Also, the tools we are building, which you can already use to operate on your RDF data (for example, to browse and search it), will show you what you can gain by making those relationships explicit.

Enough for today. The next part will deal with the unresolved issues: identification, provenance and trust.