Folksologies: de-idealizing ontologies

April 5th, 2005

I came across Clay Shirky’s talk at the O’Reilly Emerging Technology conference, entitled “Ontology is Overrated: Links, Tags and Post-hoc Metadata”. It’s worth listening to.

Just like me, Shirky is a Lakoffian (excuse the neologism): categories are embodied, an expression of humanity, not abstract metaphysical entities (Plato’s ideas) that we aim to obtain. I wrote about this already.

Still, Shirky misses one important point: ontologies are not overrated, they are just contracts, a (more or less explicit) agreement between different parties. Language is a contract as well. So are categories. So is metadata. So are APIs, protocols, plug shapes and their voltage, meters… you name it! Many make the mistake of associating an ‘ontology’ with Plato’s metaphysical ideas; I think Shirky is one of them.

The semantic web is a bad name for an attempt to make data interoperability scale at a web level. Ontology is a bad name for a description of relationships between symbols. That’s all there is, really.

Now, you use tags to categorize things for yourself, but instead of using a ‘controlled vocabulary’, taxonomy or ontology (depending on what field you come from, you will call them differently… which is also a metaproof of the point, but let’s move on), you invent your own.

People have been doing this forever. I mention Borges’s essays about this in another post.

Now, the real breakthrough of folksonomy-based systems like del.icio.us or flickr is not the lack of structure or of committee-based design in the ontological space, but the idea that if two people use the same term, it’s more probable that they meant the same thing than that they meant different things.

That’s the secret sauce: it’s unlikely that a farmer would use del.icio.us to bookmark a page on how to grow apples, so “apple” in that sociological context means Apple Computers, not the fruit. What happens if it doesn’t? Who cares!

This is the point where librarians exit the room screaming and I’m left there staring at the wall, thinking about how to enable ontologies to emerge out of the power law foam, but without librarians puking on it and without people telling me to stop thinking like a librarian!

The problem is rather simple, really: words are not unique identifiers for concepts. Everybody knows this very well: synonyms exist in every language. So, all you need to start is to create unique identifiers for your tags, but if you don’t do it well enough, it doesn’t scale globally.

Let’s start with a tag that I use a lot in my bookmarks: semweb. In my use, this string (contraction of “semantic web”) refers to things that are related to the “semantic web”. So, in order to promote the exchange of this tag, I create an identifier (URI) for it:

urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==

Now, let me decode the above:

urn:tag:[MD5 hash of my email address]/[base64 encoding of the tag]

There: this is a unique identifier that is connected to my email address, and therefore reasonably unique, because domain names are kept unique by registration authorities and mail protocols don’t allow two distinct accounts to share the same name. Also, the email address is hashed to avoid abuse (by spammers, for example): only its uniqueness is required, the rest of the information can be discarded. The base64 encoding of the tag is there to ‘obfuscate’ its originating string, yet base64 (unlike hashing) is a lossless encoding, which means that even if we lose information about the label connected to that identifier, we can reconstruct it later. This is redundant, but enables better digital preservation in the long term.
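As a rough sketch of that recipe (not necessarily how the identifiers in this post were generated), the construction and its reverse could look like the Python below. The function names and the example.org address are made up for illustration. One thing worth noticing: the base64 parts used in this post decode to the tag followed by a newline character, while the sketch encodes the bare string, so it yields `c2Vtd2Vi` rather than `c2Vtd2ViCg==`.

```python
import base64
import hashlib


def tag_urn(email: str, tag: str) -> str:
    """Build urn:tag:[MD5 hash of email]/[base64 encoding of tag]."""
    owner = hashlib.md5(email.encode("utf-8")).hexdigest()
    encoded = base64.b64encode(tag.encode("utf-8")).decode("ascii")
    return f"urn:tag:{owner}/{encoded}"


def tag_label(urn: str) -> str:
    """Recover the tag string from the base64 part of the URN.

    The base64 alphabet may itself contain '/', so split on the first
    slash only (neither the prefix nor the hex hash contains one).
    """
    encoded = urn.split("/", 1)[1]
    return base64.b64decode(encoded).decode("utf-8")


# Hypothetical address, for illustration only.
urn = tag_urn("someone@example.org", "semweb")
print(urn)
print(tag_label(urn))  # semweb
```

The hash only goes one way (the address cannot be read back from it), while the label always can be, which is exactly the asymmetry described above.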

But why obfuscate the tag at all? Well, humans can’t avoid parsing textual information, even in URIs. This is bad because it might produce unwanted side effects. For example, two ethnic groups in conflict might not like to use a URI that was created from the textual representation of a concept in the rival language, which is unlikely to promote the reuse of that URI. Sure, we could have used an incremental counter for those tags and forgotten about reuse (we will see why in a moment), but that requires the different systems you might use to be kept synchronized. It’s way easier to assume that you are unlikely to use the same string with different meanings in the same context.

So, now that we have an identifier, we can start making statements about it:

<urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==>
    a tags:Tag ;
    rdfs:label "semweb"@en .

Great, now we know this “thing” is a tag and has the label “semweb” in the English language.

Now, let’s say that a friend of mine uses “semweb” as well but, having a different email address, the identifier of his tag will be different, even if he uses the same string. Well, nobody ever said that inferencing on RDF statements should always follow description logics: if we have two statements that share the same literal, then we can say they are folksonomically “colliding”. So, after the inferencing, we have a model that says:

<urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==>
    a tags:Tag ;
    rdfs:label "semweb"@en .

<urn:tag:8f5d07638061eb9e3b60172c59b107b9/c2Vtd2ViCg==>
    a tags:Tag ;
    rdfs:label "semweb"@en .

<urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==>
    tags:collidesWith <urn:tag:8f5d07638061eb9e3b60172c59b107b9/c2Vtd2ViCg==> .

Note how a syntactic collision does not automatically imply a semantic one! It’s easy to tell whether two tokens are syntactically the same, but it’s a lot harder to understand whether or not they refer to the same ‘concept’, or whether such a thing is even remotely possible, given how deeply subjective semantic meaning is.

But, as folksonomical systems do, we can assume that, for reasons of linguistic efficiency and unless otherwise noted, two colliding unique tags are meant to reference the same semantic notion.
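To make the collision rule concrete, here is a minimal sketch in Python that works on plain tuples instead of a real RDF store. The function name and the tuple layout are assumptions made for this illustration; only the URIs, the labels and the `tags:collidesWith` property come from the statements above.

```python
from collections import defaultdict
from itertools import combinations

# (tag URI, label, language), mirroring the rdfs:label statements.
labels = [
    ("urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==", "semweb", "en"),
    ("urn:tag:8f5d07638061eb9e3b60172c59b107b9/c2Vtd2ViCg==", "semweb", "en"),
    ("urn:tag:3f7d0330e767ddab5b2826371e2d21ff/YXBwbGUK", "apple", "en"),
]


def infer_collisions(labels):
    """Emit one tags:collidesWith triple per pair of tags sharing a literal."""
    by_literal = defaultdict(list)
    for uri, label, lang in labels:
        by_literal[(label, lang)].append(uri)

    triples = []
    for uris in by_literal.values():
        # Sorting keeps the output stable; each pair is emitted once.
        for a, b in combinations(sorted(uris), 2):
            triples.append((a, "tags:collidesWith", b))
    return triples


collisions = infer_collisions(labels)
for triple in collisions:
    print(triple)
```

With the three labels above, only the two “semweb” tags collide; the lone “apple” tag has nothing to collide with yet.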

At this point, we just cloned a folksonomy with the semantic web, but we have just increased (a lot!) the complexity. Where is the gain, I hear you asking?

Well, let’s go back to the ‘apple’ example. Say I use “apple” to mean “Apple Computers” and my friend, a Beatles fan, means “Apple Records”. So, we have:

<urn:tag:3f7d0330e767ddab5b2826371e2d21ff/YXBwbGUK>
    a tags:Tag ;
    rdfs:label "apple"@en .

<urn:tag:8f5d07638061eb9e3b60172c59b107b9/YXBwbGUK>
    a tags:Tag ;
    rdfs:label "apple"@en .

<urn:tag:3f7d0330e767ddab5b2826371e2d21ff/YXBwbGUK>
    tags:collidesWith <urn:tag:8f5d07638061eb9e3b60172c59b107b9/YXBwbGUK> .

Just like the above. But then my friend, who also merges my tags with his, notes that my use of the “apple” tag is semantically different from his, so he “disambiguates” by adding the following statement to his model (don’t worry, not by hand, a UI will guide him):

<urn:tag:8f5d07638061eb9e3b60172c59b107b9/YXBwbGUK>
    owl:differentFrom <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/YXBwbGUK> .

With this statement in place, whenever the system re-inferences over the statements, it can understand that, for my friend (!), my “apple” tag and his “apple” tag mean different things, so his system will not cluster my data tagged with that tag in the same category as his.
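A hedged sketch of that clustering decision, again in Python over plain sets rather than a real inferencing engine: two tags land in the same category when they collide, unless the model at hand also says they are different. The helper name and the set-of-pairs representation are assumptions for illustration.

```python
MINE = "urn:tag:3f7d0330e767ddab5b2826371e2d21ff/YXBwbGUK"
HIS = "urn:tag:8f5d07638061eb9e3b60172c59b107b9/YXBwbGUK"

# Inferred from the shared "apple" literal.
collides = {(MINE, HIS)}

# Statement my friend added to *his* model; mine may not contain it.
his_different_from = {(HIS, MINE)}


def same_cluster(a, b, collides, different_from):
    """Colliding tags share a category unless declared different.

    Both relations are treated as symmetric, so the order of each
    pair does not matter.
    """
    collide = (a, b) in collides or (b, a) in collides
    differ = (a, b) in different_from or (b, a) in different_from
    return collide and not differ


# In my friend's model the two "apple" tags stay apart...
print(same_cluster(MINE, HIS, collides, his_different_from))  # False
# ...while a model without his statement still merges them.
print(same_cluster(MINE, HIS, collides, set()))  # True
```

The point of passing `different_from` in as a parameter is that the disambiguation lives in one person’s model: nobody else is forced to agree with it.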

Now librarians can breathe again :-)

But there are other benefits: say my friend also used “semantic_web” along with “semweb”, because he’s a messier tagger or simply because he never noticed. I can notice it for him and produce the following statement:

<urn:tag:8f5d07638061eb9e3b60172c59b107b9/c2Vtd2ViCg==>
    owl:sameAs <urn:tag:8f5d07638061eb9e3b60172c59b107b9/c2VtYW50aWNfd2ViCg==> .

So, ironically, using an ontology, and without reducing functionality, we solved the two biggest problems that current folksonomies have:

  1. syntactic collisions can be differentiated
  2. syntactic differences can be equated
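For the second point, one possible way to act on `owl:sameAs` statements is to pick a single representative for each group of equated tags. The sketch below uses a small union-find for that; this is a technique chosen for the illustration, not something the post prescribes, and the function and variable names are made up.

```python
def canonicalize(tags, same_as):
    """Map every tag to one representative per owl:sameAs group."""
    parent = {t: t for t in tags}

    def find(t):
        # Follow parent links to the root, shortening the path as we go.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in same_as:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Arbitrary but deterministic choice of representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {t: find(t) for t in tags}


FRIEND = "urn:tag:8f5d07638061eb9e3b60172c59b107b9/"
SEMWEB = FRIEND + "c2Vtd2ViCg=="
SEMANTIC_WEB = FRIEND + "c2VtYW50aWNfd2ViCg=="
APPLE = FRIEND + "YXBwbGUK"

canonical = canonicalize(
    tags=[SEMWEB, SEMANTIC_WEB, APPLE],
    same_as=[(SEMWEB, SEMANTIC_WEB)],
)

# Both spellings now resolve to the same representative.
print(canonical[SEMWEB] == canonical[SEMANTIC_WEB])  # True
print(canonical[APPLE] == APPLE)  # True
```

Anything filed under either spelling can then be grouped by its representative, without anyone having to retag a thing.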

But there is a lot more!

Now that all tags are uniquely identified and can be discriminated, we can also make statements about their own relationships! So, for example, if I have the tags “RDF” and “semweb”, I might want to link them as

"RDF" -(technology of)-> "semweb"

and we could use the same URI creation process not only for the tags (the nodes) but also for the link (the arc between the nodes):

<urn:tag:3f7d0330e767ddab5b2826371e2d21ff/dGVjaG5vbG9neSBvZgo=>
    a tags:Link ;
    rdfs:label "technology of"@en .

<urn:tag:3f7d0330e767ddab5b2826371e2d21ff/UkRGCg==>
    <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/dGVjaG5vbG9neSBvZgo=>
        <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==> .

Note that the lack of readability of that statement is a feature, not a bug: these statements are meant to be processed by machines, not by humans.
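And since base64 is lossless, a machine can always turn such a statement back into something a human can read. A minimal Python sketch, assuming the URN layout described earlier (the helper name is made up, and the whitespace stripping is there because the encoded strings in this post carry a trailing newline):

```python
import base64


def label_of(urn: str) -> str:
    """Decode the base64 part of a tag URN back into its label."""
    encoded = urn.split("/", 1)[1]
    return base64.b64decode(encoded).decode("utf-8").strip()


ME = "urn:tag:3f7d0330e767ddab5b2826371e2d21ff/"

# Subject, predicate and object of the statement above.
triple = (
    ME + "UkRGCg==",
    ME + "dGVjaG5vbG9neSBvZgo=",
    ME + "c2Vtd2ViCg==",
)

readable = tuple(label_of(term) for term in triple)
print(readable)  # ('RDF', 'technology of', 'semweb')
```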

Now, imagine a system where disambiguating two tags yields a return on your metadata investment that you consider worth it (if you own an iPod, you know what I mean by return on your metadata investment!). The ability to share this information across systems (say between flickr and del.icio.us, or even between your own blog, your email client and your calendar system!) will very likely revolutionize the way we do things. And being able to pick statements from the people, organizations and entities that you like will allow you to disagree, and to avoid feeling locked into a platonic semantic cage.

There is nothing in semantic web technologies that states that ontologies cannot be created by individuals for their own benefit, then shared and mapped according to their individual or group tastes. There is nothing that states that the only way to make data interoperate is through uber conceptual models (CIDOC CRM) or through common denominator sets (Dublin Core).

And, to prove the point, we have built a system on this :-)

Stay tuned.

Many thanks to David Huynh and Prof. David Karger for their invaluable support and help in the discussion of the folksology concept as laid out in this note.