A No-Nonsense Guide to Semantic Web Specs for XML People [Part I]

April 14th, 2004

The Semantic Web has a serious problem: the XML people don’t understand it.

They think it’s an utterly complex way to write metadata that you could handle with simple namespaces. The two worlds (despite both being hosted inside the W3C) don’t talk very much. Many (if not all) W3C folks are in the RDF camp (and have been there for a while), and they see XML as a half-baked attempt to solve issues that RDF already solves. Unfortunately, never having been in the XML camp, they have no way to communicate with the other side.

The XML camp, on the other hand, thinks that they know how to build things that work, while the RDF people are all sitting in their ivory towers telling them that what they are doing is wrong, but without understanding their real-world needs.

As normally happens in a debate, both are right and both are wrong.

I find myself in a very lucky position: I sit right in the middle and talk to both camps.

So, this is an RDF guide for XML people. A much needed one, IMO.

RDF

RDF (Resource Description Framework) is a model for describing directed labelled pseudo-graphs. This means:

  • directed -> every arc has a direction
  • labelled -> every arc has a label
  • pseudo-graph -> there can be more than one arc between the same two nodes

So, for example:

stefano ---(has designed)---> this blog
stefano ---(writes)---> this blog

An RDF model is an unordered collection of statements. A statement is also known as a triple because it’s composed of three things: a subject, a predicate and an object. Subjects and objects are nodes, while a predicate is an arc. In the example above, “stefano” is the subject, “this blog” is the object for both statements, while “has designed” and “writes” are predicates.
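
If you think better in code than in arrows, here is a minimal sketch of the same idea in Python (a language this guide does not otherwise use). It assumes plain strings for nodes and arcs, and the helper name `predicates_between` is made up for illustration; it is not part of any RDF library.

```python
# An RDF model as an unordered collection of (subject, predicate, object) triples.
# Plain strings stand in for nodes and arcs here.
model = {
    ("stefano", "has designed", "this blog"),
    ("stefano", "writes", "this blog"),
}


def predicates_between(model, subject, obj):
    """Return the labels of every arc that goes from subject to obj."""
    return {p for (s, p, o) in model if s == subject and o == obj}


# Two arcs connect the same pair of nodes: that is the "pseudo-graph" part.
arcs = predicates_between(model, "stefano", "this blog")
print(sorted(arcs))

# Arcs are directed, so nothing goes the other way around.
print(predicates_between(model, "this blog", "stefano"))
```

Using a set rather than a list mirrors the "unordered collection" part: there is no first or last statement, and stating the same thing twice adds nothing.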

RDF is designed with a strongly decentralized system in mind and for this reason, there must be a way to identify those parts of the statements so that they can be reused, either in the same model or in other models. For identification, RDF uses URIs. So, let’s rewrite the above example using URIs:

urn:stefano:betaversion.org -(dc:creator)-> http://www.betaversion.org/~stefano/linotype
urn:stefano:betaversion.org -(dc:author)--> http://www.betaversion.org/~stefano/linotype

where “dc:” is the shorthand for the Dublin Core namespace. The concept is exactly the same, but we have used URIs to reduce the collision of concepts (we have also made it much harder to find overlap, but that’s another story, and it’s much more scalable this way). For readability, I’ll keep using strings instead of URIs here, but you should be aware that URIs are vital for the large-scale use of these concepts.

RDF has two other very important concepts: literals and reification.

  • literals are objects that are not URIs but actual content. Literals can be typed. And, hear hear, there is an “xml content” datatype. This is important because, while RDF is powerful enough to describe XML content, its lack of order makes that very expensive (for every RDF statement that represents an XML element, another statement indicating its original order must be written and, later, processed). Note how XML attributes would be much easier to encode in RDF since they are naturally unordered. Here is an example (note the use of [] to indicate the type):
this blog ---(contains)---> this news
this news ---(has title)---> "Semantic Web 101"
this news ---(was written on)---> "20040404"[date]
this news ---(has content)---> "<html><body>...</body></html>"[XML]
  • reification is the action of using a statement as the subject for another statement. Here is an example of such use:
this news ---(has category)---> "semantic web"
[this news ---(has category)---> "semantic web"] ---(added by)---> stefano
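
Both ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Literal` and `Statement` are invented here, and nesting one statement inside another is a shortcut (the RDF vocabulary itself describes a reified statement with rdf:Statement, rdf:subject, rdf:predicate and rdf:object).

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Actual content rather than a node; the datatype is optional."""
    value: str
    datatype: Optional[str] = None


class Statement(NamedTuple):
    subject: object   # a node name or, for reification, another Statement
    predicate: str
    obj: object       # a node name or a Literal


categorized = Statement("this news", "has category", Literal("semantic web"))
dated = Statement("this news", "was written on", Literal("20040404", "date"))

# Reification: the whole statement becomes the subject of a new statement.
provenance = Statement(categorized, "added by", "stefano")

model = {categorized, dated, provenance}

# Find every statement that talks about another statement.
reified = [st for st in model if isinstance(st.subject, Statement)]
for st in reified:
    print(st.obj, "said:", st.subject.subject, st.subject.predicate,
          st.subject.obj.value)
```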

The number one criticism of the Semantic Web is that, in order to work, it has to refer to a gigantic, hardcoded and centralized vocabulary; people tend to believe that such a centralized notion is intrinsically bad and against the very principle of the web.

As you can see from the description so far, nothing in RDF imposes such a centralized taxonomy. Moreover, reification provides a way to create layers of meta-metadata (data about ‘data about data’), which can be applied recursively, to as many layers as required, to indicate, for example, disagreement or annotations. The problem, at that point, will be computational feasibility, but that’s another story.

So, RDF is a graph model, but how do I transfer this model to somebody else? How can I serialize it? There are several options for this, so let’s go thru the most important ones.

RDF/XML

The “official” serialization of an RDF model is thru the RDF/XML format, which, as the name suggests, is a way to encode an RDF model into an XML tree. Now, the RDF/XML language is very weird for an XML person. The XML model encodes nodes in elements and leaves the properties between those nodes implicit. The RDF model makes those properties explicit (which is its main benefit over XML), but it needs a way to encode this into an XML model (which is intrinsically poorer). This generates lots of confusion, also because XML schema best practices use lowercase names for elements and attributes, while the RDF/XML best practices use a Java-like naming convention: so “dc:title” will be a property while “rdf:Description” (note the capital “D”) will be a class.

Let’s show an example. Let us take the previous model and serialize it in RDF/XML:

1 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
2  <rdf:Description rdf:about="http://www.betaversion.org/~stefano/linotype/">
3    <dc:title>Stefano's Linotype</dc:title>
4    <dc:creator rdf:resource="urn:betaversion.org:stefano_mazzocchi"/>
5  </rdf:Description>
6 </rdf:RDF>

So, let’s go thru this line by line:

  1. This is an RDF model and this is its namespace. Note the # at the end: this is the first thing that confuses XML people. The second is the use of dates in URIs. Both will become clear later on.
  2. This is where it starts to get confusing: rdf:Description doesn’t really mean anything and does not have an equivalent in the RDF model. It is basically there to say “hey, there is an RDF statement coming up”, but XML people get confused because they see an element and they expect this to be the serialization of some data object. The subject of the RDF statement is the content of the rdf:about attribute!
  3. Here it gets even more confusing: an XML person would see dc:title and think of a node that contains a literal. Wrong! dc:title encodes the property “has title” and “Stefano’s Linotype” becomes the object of the predicate (which is a literal, so it goes in the element payload).
  4. Same thing here, but now the object is a URI, so we need to indicate that with rdf:resource.
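
The line-by-line reading above can be turned into a tiny extractor. What follows is a rough sketch in Python using only the standard library's xml.etree.ElementTree; it assumes the document uses nothing but the simple rdf:Description pattern shown here (real RDF/XML allows many more forms), it declares the dc: prefix so that the XML parser accepts the document, and the function name `triples_from_rdfxml` is made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.betaversion.org/~stefano/linotype/">
    <dc:title>Stefano's Linotype</dc:title>
    <dc:creator rdf:resource="urn:betaversion.org:stefano_mazzocchi"/>
  </rdf:Description>
</rdf:RDF>
"""


def triples_from_rdfxml(text):
    """Extract triples from the simple rdf:Description pattern only."""
    triples = []
    root = ET.fromstring(text)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        # The subject is the value of rdf:about, not the element itself.
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree spells tags as "{namespace}local"; gluing the two
            # parts together gives the URI of the predicate.
            predicate = prop.tag.replace("{", "").replace("}", "")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            if resource is not None:
                obj = ("uri", resource)              # object is another node
            else:
                obj = ("literal", prop.text or "")   # object is the payload
            triples.append((subject, predicate, obj))
    return triples


triples = triples_from_rdfxml(RDF_XML)
for s, p, o in triples:
    print(s, p, o)
```

Notice that rdf:Description never shows up in the output: each child element becomes one statement, and the wrapper only carries the subject.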

So, let’s write it in another way:

<http://www.betaversion.org/~stefano/linotype/>
 +--(dc:title)--> "Stefano's Linotype"
 +--(dc:creator)--> <urn:betaversion.org:stefano_mazzocchi>

Much more readable, isn’t it? Wouldn’t it be better if there was a way to write and read RDF in such a simple way? Follow me into TimBL’s own little basement of magic RDF tricks.

N3

N3 is the Semantic Web community’s best kept secret. TimBL himself, one day, decided he had had enough of writing verbose XML and started to write statements very similar to the ones above. So, here it is, the same model in N3:

@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://www.betaversion.org/~stefano/linotype/>
dc:title      "Stefano's Linotype";
dc:creator    <urn:betaversion.org:stefano_mazzocchi> .

In my day job I end up writing tons of RDF, some by hand, some via scripting and some by transforming XML. I find RDF/XML pretty nice to use as the output of XSLT when I have to transform XML datasets into an equivalent RDF representation, while I find it horrible when I have to write stuff by hand or when I have to generate RDF via scripting.

Tools like Jena (Java) or CWM (Python) can do the transformations from one syntax to another for you, since both syntaxes encode the same underlying model.
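
To give a feeling for what generating N3 via scripting looks like, here is a hand-rolled sketch in Python. It is not the API of Jena or CWM: the function `to_n3` is invented for this example, it assumes objects arrive already written as N3 terms (a quoted literal or a URI in angle brackets), and it does no escaping at all.

```python
from collections import defaultdict


def to_n3(triples, prefixes):
    """Group triples by subject and emit N3, using ';' between predicates."""

    def shorten(uri):
        # Use a prefix when the URI starts with a known namespace.
        for prefix, ns in prefixes.items():
            if uri.startswith(ns):
                return f"{prefix}:{uri[len(ns):]}"
        return f"<{uri}>"

    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))

    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in prefixes.items()]
    for s, pairs in by_subject.items():
        lines.append("")
        lines.append(f"<{s}>")
        body = [f"    {shorten(p)} {o}" for p, o in pairs]
        lines.append(" ;\n".join(body) + " .")
    return "\n".join(lines)


DC = "http://purl.org/dc/elements/1.1/"
BLOG = "http://www.betaversion.org/~stefano/linotype/"

n3_text = to_n3(
    [
        (BLOG, DC + "title", '"Stefano\'s Linotype"'),
        (BLOG, DC + "creator", "<urn:betaversion.org:stefano_mazzocchi>"),
    ],
    {"dc": DC},
)
print(n3_text)
```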

RDFSchema

We have seen how literals can be typed (using the XMLSchema datatypes), but what about statements? Take a look at this fragment of the Dublin Core RDFSchema translated for your convenience to N3.

@prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc:        <http://purl.org/dc/elements/1.1/> .
@prefix dct:       <http://purl.org/dc/terms/> .
@prefix dcp:       <http://dublincore.org/usage/documents/principles#> .
@prefix dch:       <http://dublincore.org/usage/terms/history#> . 

dc:creator
  rdf:type       rdf:Property ;
  rdfs:label     "Creator"@en-US ;
  rdfs:comment   "An entity primarily responsible for making the content of the resource."@en-US ;
  dc:description "Examples of a Creator include a person, an organisation, or a service. "@en-US ;
  rdfs:isDefinedBy  <http://purl.org/dc/elements/1.1/> ;
  dct:issued     "1999-07-02" ;
  dct:modified   "2002-10-04" ;
  dc:type        dcp:element ;
  dct:hasVersion dch:creator-004 .

...

RDFSchema is basically a set of RDF statements that define classes and properties. You can think of RDFSchema as metadata for your statements. The kind of things you can say with a mix of RDF and RDFSchema vocabularies are:

  • this URI should be considered (rdf:type) a class (rdfs:Class) or a property (rdf:Property)
  • this URI has a human readable label (rdfs:label) or comment (rdfs:comment). These are very useful for visualizing RDF in more presentation-friendly ways.
  • this URI is defined by (rdfs:isDefinedBy)
  • this class is a subclass of this other (rdfs:subClassOf)
  • this property is a subproperty of this other (rdfs:subPropertyOf)
  • this property connects this class of subjects (rdfs:domain) with this class of objects (rdfs:range)
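
What does a program do with this kind of metadata? Here is a minimal sketch in Python that applies just two of the rules implied by the list above, over and over, until nothing new comes out. The ex: names are hypothetical vocabulary invented for the example, prefixed strings stand in for full URIs, and a real RDFSchema reasoner handles many more rules than these two.

```python
def rdfs_closure(triples):
    """Apply two RDFSchema rules until no new statement appears.

    rule 1: (p rdfs:subPropertyOf q) and (s p o)  =>  (s q o)
    rule 2: (p rdfs:domain C)        and (s p o)  =>  (s rdf:type C)
    """
    inferred = set(triples)
    while True:
        new = set()
        for (p, rel, target) in inferred:
            if rel == "rdfs:subPropertyOf":
                new |= {(s, target, o)
                        for (s, p2, o) in inferred if p2 == p}
            elif rel == "rdfs:domain":
                new |= {(s, "rdf:type", target)
                        for (s, p2, o) in inferred if p2 == p}
        if new <= inferred:       # nothing new: we are done
            return inferred
        inferred |= new


closure = rdfs_closure({
    # schema (hypothetical vocabulary)
    ("ex:designed", "rdfs:subPropertyOf", "ex:contributedTo"),
    ("ex:contributedTo", "rdfs:domain", "ex:Person"),
    # data
    ("stefano", "ex:designed", "this blog"),
})

for statement in sorted(closure):
    print(statement)
```

Nobody ever stated that stefano is an ex:Person or that he contributed to this blog; both statements fall out of the schema.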

For more information, refer to the SchemaWeb classes/properties description page of RDFSchema (which also shows an example of how you can make use of metadata for metadata schemas).

OWL

RDFSchema is helpful for creating classifications and groupings of concepts and for typing statements, but it’s far from being enough for some real-world needs. So the W3C went on, leveraging the work on autonomous agents done by the DAML (USA-driven) and OIL (Europe-driven) efforts, and created OWL (pronounced “owl” like the bird, not as an acronym).

I understood what XML was all about when XSL came around and showed what you could do with it. I felt very much the same with OWL and RDF.

The problem with RDF (and with XML to start with) is its generality: they are so abstract that they feel like shiny empty boxes and leave too much to you to be of any help. XML and RDF give you a syntax and a model, which allows general purpose parsers and memory models to be developed, but the entire “semantics” resides somewhere else, and since that is normally the hard problem, people don’t see much value besides not having to write a parser.

Just like XSLT allows you to actually do something with your XML with minimal effort (in that case, transforming and manipulating XML into something else), OWL allows you to “do something” with your RDF. In the case of OWL, as we’ll see, it’s extending the graph.

An ontology, in our context, is a synonym of “RDF vocabulary”. Basically, it’s a collection of classes of nodes and classes of properties. With this definition, RDFSchema is itself an ontology and also OWL is an ontology itself! More formally, an ontology is defined as:

An ontology is a specification of a conceptualization.

That’s useful, huh? Right, not really, so let me start with examples of the kind of statements you can make with the OWL vocabulary:

  • this property is transitive (owl:TransitiveProperty) [An example of a transitive property is "belongs to the same organization", while "is friend of" is not transitive]
  • this property is symmetric (owl:SymmetricProperty) [An example of a symmetric property is "works with", while "being father of" is not]
  • this property is the inverse of this other one (owl:inverseOf) [An example of two inverse properties is "being parent of" and "being child of"]
  • this property is equivalent to that one (owl:equivalentProperty)
  • this node is the same as that one (owl:sameAs)
  • this relationship can appear only this many times (owl:cardinality) [For example, one can have only one biological father]
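
Here is a rough Python sketch of how the first three kinds of statement let a program add arcs to the graph. Everything named ex: is hypothetical vocabulary invented for the example, prefixed strings stand in for full URIs, and the naive loop is only meant to show the idea; it is not how a production reasoner works.

```python
def owl_closure(triples):
    """Add the arcs implied by symmetric, transitive and inverse properties."""
    facts = set(triples)
    while True:
        symmetric = {s for (s, p, o) in facts
                     if p == "rdf:type" and o == "owl:SymmetricProperty"}
        transitive = {s for (s, p, o) in facts
                      if p == "rdf:type" and o == "owl:TransitiveProperty"}
        inverse = {}
        for (a, p, b) in facts:
            if p == "owl:inverseOf":
                inverse[a] = b
                inverse[b] = a

        new = set()
        for (s, p, o) in facts:
            if p in symmetric:
                new.add((o, p, s))
            if p in inverse:
                new.add((o, inverse[p], s))
            if p in transitive:
                new |= {(s, p, o2) for (s2, p2, o2) in facts
                        if p2 == p and s2 == o}
        if new <= facts:          # nothing new: we are done
            return facts
        facts |= new


extended = owl_closure({
    # what we know about the properties (hypothetical vocabulary)
    ("ex:sameOrgAs", "rdf:type", "owl:TransitiveProperty"),
    ("ex:worksWith", "rdf:type", "owl:SymmetricProperty"),
    ("ex:parentOf", "owl:inverseOf", "ex:childOf"),
    # what we know about the nodes
    ("alice", "ex:sameOrgAs", "bob"),
    ("bob", "ex:sameOrgAs", "carol"),
    ("alice", "ex:worksWith", "bob"),
    ("alice", "ex:parentOf", "dave"),
})

for statement in sorted(extended):
    print(statement)
```

The nested scans over the whole set of statements also hint at why reasoning gets expensive quickly as the graph grows.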

I think you start to understand what OWL is for: add even more metadata to the nodes, classes and properties so that you can “reason” upon them.

Suppose you are given this RDF model:

<http://www.betaversion.org/~stefano/> -(is author of)-> <http://www.betaversion.org/~stefano/linotype/>
<http://www.apache.org/~stefano/> -(is author of)-> <http://www.apache.org/~stefano/agora/>
<http://web.mit.edu/people/stefanom/> -(is author of)-> <http://simile.mit.edu/gadget/>

These are statements that will normally be found in different locations and aggregated by a statement harvester (say, a crawler/spider that looks for RDF statements embedded in web pages). The problem is that these statements refer to three of the things I created, but since each one identifies me with a different URI, they are completely unrelated.

We need a way to “map” these URIs together, and here is where OWL starts to come in handy. Knowing all my online personalities, I will publish another RDF model that indicates how to map them together:

<http://www.apache.org/~stefano/> -(owl:sameAs)-> <http://www.betaversion.org/~stefano/>
<http://web.mit.edu/people/stefanom/> -(owl:sameAs)-> <http://www.betaversion.org/~stefano/>

At this point, aggregating the two models together and running a reasoner on top of them would generate two new statements:

<http://www.betaversion.org/~stefano/> -(is author of)-> <http://www.apache.org/~stefano/agora/>
<http://www.betaversion.org/~stefano/> -(is author of)-> <http://simile.mit.edu/gadget/>

which would allow people to ask the question “tell me everything that <http://www.betaversion.org/~stefano/> authored” and obtain a much more helpful answer.
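
As a sketch of the idea in Python: instead of materializing new statements the way a reasoner would, the code below groups the URIs that owl:sameAs declares to be the same thing and answers the question directly. The helper names `alias_groups` and `objects_of` are invented for this example and do not come from Jena or any other tool.

```python
def alias_groups(same_as):
    """Merge owl:sameAs pairs into sets of URIs that name the same thing."""
    groups = []
    for a, b in same_as:
        merged = {a, b}
        rest = []
        for group in groups:
            if group & merged:
                merged |= group
            else:
                rest.append(group)
        groups = rest + [merged]
    return groups


def objects_of(statements, same_as, subject, predicate):
    """Everything that subject, under any of its aliases, points to."""
    aliases = {subject}
    for group in alias_groups(same_as):
        if subject in group:
            aliases = group
    return {o for (s, p, o) in statements if s in aliases and p == predicate}


statements = {
    ("http://www.betaversion.org/~stefano/", "is author of",
     "http://www.betaversion.org/~stefano/linotype/"),
    ("http://www.apache.org/~stefano/", "is author of",
     "http://www.apache.org/~stefano/agora/"),
    ("http://web.mit.edu/people/stefanom/", "is author of",
     "http://simile.mit.edu/gadget/"),
}
same_as = [
    ("http://www.apache.org/~stefano/", "http://www.betaversion.org/~stefano/"),
    ("http://web.mit.edu/people/stefanom/", "http://www.betaversion.org/~stefano/"),
]

authored = objects_of(statements, same_as,
                      "http://www.betaversion.org/~stefano/", "is author of")
for uri in sorted(authored):
    print(uri)
```

Without the sameAs mappings the same question returns only the linotype URI; with them it returns all three.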

I’ve shown how RDF starts to make sense after you layer enough functional vocabularies on top of it, but is OWL enough? Or too much? Or useless?

So far, the only thing I’ve used of OWL is equivalences between properties, classes and nodes. Equivalences between properties and classes are incredibly useful, especially to start mapping vocabularies coming from different communities. One example of a useful mapping is between the vCard ontology and the FOAF ontology, which both describe people, but with different goals (one is PIM info, the other is social networks).

OWL is divided into three layers: Lite, DL (which stands for Description Logic) and Full. There is now a group that wants to add a new Tiny layer (even smaller than Lite), which is exactly what I’m using (OWL Lite is even too much for me).

But why would I want a smaller version of OWL? Well, the problem with OWL (or any kind of reasoning engine on a graph) is the computational requirements. In one of our demo models we have some 500k statements (and that’s just a fraction of the data we have!) and the Jena OWL reasoner runs out of memory even with a few GB of RAM!

Anyway, enough for part I. Enjoy.