GRDDL and Its Supposed Language Neutrality
March 1st, 2007
Danny poked me about GRDDL and I failed to resist, so I feel I need to elaborate a little more after his blog post.
GRDDL was designed as another way to help solve the semweb chicken-and-egg problem (no data without a killer app, no killer app without data). It does so by describing programmatic transformations that the data consumer can use to obtain data from regular HTML pages, without requiring substantial changes to the web site itself.
This is a different approach from metamarkups such as microformats, RDF/A or eRDF, which instead embed more structured information directly inside the HTML pages and thus require more changes to the web publishing process itself.
But for it to work at all, all three of these conditions must hold:
- the transformation instructions need to be accessible by the data consumer
- they need to be executable portably and reproducibly
- they need to output data in a portable and reproducible way
By default, GRDDL uses XSLT to describe transformation instructions. This satisfies all three conditions because:
- as long as there is a URL pointing to the XSLT file, the data consumer can access it (provided it has the right permissions, of course)
- XSLT is a Turing-complete language, but it is also a very well-defined platform, with no APIs other than the ones it comes with and no access to external data other than what is fed to it as input (I’m ignoring extensions here because, at least for version 1.0, they are not portable anyway and very few people rely on them)
- the input and output streams are implicit in the XSLT workflow, and GRDDL transformations are mandated to output RDF/XML data
If this were all GRDDL described, there would be nothing substantial to criticize. Unfortunately, realizing that not everybody likes XSLT for describing programmatic transformations, the working group claims that it is just as possible to define GRDDL transformation instructions in any other language.
As I mentioned in my email, I strongly disagree: it is not generally possible to implement GRDDL transformation instructions in any other Turing-complete language and still satisfy the above three operational conditions without more help from the spec.
GRDDL bases its theory of operation on a very constrained language (XSLT) and simply assumes that all other programming languages exhibit the same portability and reproducibility of execution, security and data workflow. This is simply false.
Without getting into exotic configurations, just picking Javascript shows how impossible it is to implement a GRDDL-capable data consuming client without more help from the spec.
As Javascript is an interpreted language (just like XSLT), we can assume that by placing the .js file on the web and giving it a URL, we satisfy the first condition (the implementor now needs a Javascript engine instead of an XSLT one, which is complex but fine, as the spec cannot help any further there).
The second condition is much harder: unlike XSLT, which has no APIs, Javascript does (think of “document” or “window”, but also “java”, “Packages” or “XMLHttpRequest”). The spec needs to describe, at the very least, the list of objects, if any, that a GRDDL Javascript program can expect to find in its execution environment. If not, a GRDDL Javascript transformation might work on my GRDDL client and not on yours… while both of us can claim to be fully compliant with the spec!
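To make that failure mode concrete, here is a minimal sketch. It assumes Node.js and uses its built-in `vm` module as a stand-in for the script environment of a GRDDL client; `clientA`, `clientB`, `runTransform` and the one-line transformation are hypothetical names invented for illustration, not anything the spec defines.

```javascript
const vm = require("node:vm");

// Hypothetical transformation script: it silently assumes that a global
// `document` object exists in whatever environment runs it.
const transformSource = "document.title.toUpperCase()";

// Two hypothetical clients exposing different global objects to the script.
const clientA = { document: { title: "hello" } }; // provides a minimal `document`
const clientB = {};                               // provides nothing at all

// Run the script with only the given globals and report what happened.
function runTransform(source, globals) {
  try {
    return { ok: true, value: vm.runInNewContext(source, globals) };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

const resultA = runTransform(transformSource, clientA);
const resultB = runTransform(transformSource, clientB);

console.log("client A:", resultA);
console.log("client B:", resultB);
```

The same script succeeds on one client and fails with a `ReferenceError` on the other, and nothing in a language-neutral spec says which of the two clients is wrong.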
And last but not least, the third condition is hard to satisfy without help from the spec. Javascript has no notion of standard input and output streams, so both the input (the DOM) and the output (the RDF data generated by the script) cannot be operated upon if they are left undefined for the implementor to guess.
So, is it impossible to define GRDDL in Javascript? No, it’s not, but it requires work: every language/platform needs its own GRDDL “flavor” that defines language-specific constraints. For XSLT, the constraints might be no extensions and valid RDF/XML as the output DOM. For Javascript, they might be a list of objects available to the scraper (say “document”, “window”, “XMLHttpRequest” and “data”), and an object that encapsulates a way for the script to send its output (“data” can have methods such as .addStatement(s,p,o,c) and such).
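A Javascript profile along these lines could be sketched as follows. This is only an illustration under stated assumptions, not the actual API of any existing tool: it assumes Node.js and its built-in `vm` module as the host, a fake two-property `document` in place of a real DOM, and a hypothetical `createDataSink` helper whose `addStatement(s, p, o, c)` takes a subject, predicate, object and a context (here assumed to be the source page URL).

```javascript
const vm = require("node:vm");

// Hypothetical output sink: the script reports statements through it
// instead of writing to an output stream, which Javascript lacks.
function createDataSink() {
  const statements = [];
  return {
    addStatement(s, p, o, c) {
      statements.push({ s, p, o, c });
    },
    getStatements() {
      return statements.slice(); // copy, so callers cannot mutate the sink
    },
  };
}

// Minimal stand-in for the page DOM (an assumption; a real client would
// expose a full DOM here).
const fakeDocument = {
  URL: "http://example.org/post",
  title: "GRDDL and Its Supposed Language Neutrality",
};

// Hypothetical scraper script: it may use only the objects the profile lists.
const scraperSource = `
  data.addStatement(
    document.URL,
    "http://purl.org/dc/elements/1.1/title",
    document.title,
    document.URL
  );
`;

// The client decides exactly which globals the script can see.
const data = createDataSink();
vm.runInNewContext(scraperSource, { document: fakeDocument, data: data });

console.log(data.getStatements());
```

Because both the input (`document`) and the output (`data`) are named by the profile, two independent clients have a shared contract to implement rather than something to guess.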
We have shown this is possible because this is how we do it in Piggy Bank.
Would it really be portable? Well, it depends on many factors, most importantly on how ‘equally’ those Javascript objects behave across platforms. This is something that the spec cannot fix, but web browsers show that convergence is indeed possible.
So, my suggestions for the GRDDL WG are the following:
- stop saying that GRDDL transformations can be defined in any language; it might be true in theory, but in practice it is hardly useful if those transformations cannot respect the above three conditions;
- decide which languages you want to support and create profiles for them. If you have no time for that and you just want to do XSLT, that’s fine, but if you decouple the profiles from the GRDDL spec you can follow independent editing paths (like XSLT did with XPath and XSL-FO, for example) and divide the work.
I hope that my criticism is not misinterpreted as a lack of support: I fully believe GRDDL to be a useful and important step to unlock the semweb potential (not the only one, though, mind you), but I also think that specifications should be designed with implementation and practical constraints in mind and not just distilled out of theory and hope.