Data Smoke and Mirrors
November 7th, 2009
I wrote previously about the fact that without a high relational density, having a dataset in RDF doesn’t give any practical advantage over having it in its original format.
Yet, from a marketing/political point of view, the simple act of “triplifying” a dataset and making it available on the web as linked data seems to make it appear all the more powerful and all the more useful, and it’s being used a lot as a way to promote the idea that the web of data is finally getting traction.
Today, I stumbled upon this page which contains all the datasets made available by data.gov triplified as RDF. The result yields an eyebrow-raising 5 billion triples, more than the entire LOD effort today.
Having tried to import some of that data into Freebase myself, I looked a little deeper to see if I could build on their effort and make my import a little easier… what I found didn’t please me.
Turns out that the data is now a lot more granular and a lot of dereferenceable URIs were minted in the process, but let’s follow the trail: say I give you this URI
http://data-gov.tw.rpi.edu/raw/347/data-347-03051.rdf#entry9053451
which you can dereference as a URL (meaning: you can click on it) and obtain some more machine-readable information for it
<rdf:Description rdf:about="#entry9053451">
  <value>2.7</value>
  <period>M01</period>
  <year>1995</year>
  <series_id>SMU55225408000000001</series_id>
  <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
</rdf:Description>
Now you know that the URI I gave you is a “data entry” that has a value, a period, a year and a series_id.
Great, so? How is this useful?
I’m a human and I can’t figure out what this is from or for or what it means in practice, and I have no way to figure out what M01 stands for, what series_id SMU55225408000000001 means, or what unit that numerical value is in. Do you seriously think a machine would do better after it managed (successfully) to parse this stuff?
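To make that concrete, here is a rough sketch in Python of what a program gets out of that entry. It assumes the fragment sits inside an rdf:RDF wrapper that declares only the standard rdf namespace (the wrapper and the entry_fields helper are mine, for illustration, not something taken from the actual file), and it uses nothing but the standard library's XML parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The entry from above. The <rdf:RDF> wrapper is an assumption added so the
# fragment parses on its own; the real file may declare other namespaces.
SNIPPET = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}">
  <rdf:Description rdf:about="#entry9053451">
    <value>2.7</value>
    <period>M01</period>
    <year>1995</year>
    <series_id>SMU55225408000000001</series_id>
    <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
  </rdf:Description>
</rdf:RDF>
"""


def entry_fields(xml_text):
    """Return the properties of the first rdf:Description as a plain dict."""
    root = ET.fromstring(xml_text)
    entry = root.find(f"{{{RDF_NS}}}Description")
    fields = {}
    for child in entry:
        if child.tag == f"{{{RDF_NS}}}type":
            # rdf:type carries its value in the rdf:resource attribute
            fields["type"] = child.get(f"{{{RDF_NS}}}resource")
        else:
            # everything else is a bare literal with no unit, label or link
            fields[child.tag] = child.text
    return fields


fields = entry_fields(SNIPPET)
for name, value in fields.items():
    print(f"{name}: {value!r}")
```

Parsing succeeds, and all the program ends up holding is a handful of opaque strings plus a type URI. Nothing in there says what the series measures, what M01 encodes, or what unit 2.7 is in.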
By grinding all those rectangular datasets into triples, they’ve actually managed to make it *less* useful than in its original form. In the original form at least I had a little context of what this data was for and from, which is lost here. A surprising achievement, but I bet you won’t read about it at semantic web conferences any time soon.
Now, will this gigantic hairball of triples enter the LOD map of Middle Earth and double it in size overnight with a big “data.gov” stamp of self-validation?
In practice, this might help promote the web of data in the short term, but it’s doing an incredible disservice in the longer run: if you manage to lure serious data practitioners into this game, they will not run into a few bumps they have to adjust to (which would be natural and something to expect); they will run at full speed into mile-high mountains. Ultimately, they will fail to deliver in practice what less technical people hired them to do, convinced it was possible in theory by all the hype generated by such big numbers.
This places everybody trying to make the web of data happen in danger of over-promising and under-delivering, which is not a recipe for success: it’s a recipe for abuse, mistrust and anger… which ultimately leads to disaster and failure.
The interests of semantic web advocates and people pushing for more government transparency align perfectly with efforts like these: government pressures agencies to get data out (whatever data they have in whatever form) and semweb advocates triplify it (doesn’t matter how, just get it out in valid RDF and follow LOD rules). Result: more machine-readable data out there, thumbs up all around.
Too bad the people who are supposed to benefit from this effort are left with a pile of cryptic, disconnected and artificially inflated information which serves the interests of those involved in the process a lot more than the interests of those being served.
Weren’t government transparency and the semantic web both supposed to be a way to achieve the exact opposite?