Unreasonable Hypocrisy

March 31st, 2009

I recently came across a paper entitled “The Unreasonable Effectiveness of Data” by Alon Halevy, Peter Norvig and Fernando Pereira, published in IEEE Intelligent Systems. The paper outlines many of the ways that structure can be inferred statistically from very large quantities of data. The authors explicitly present this approach as the antithesis of the semantic web’s, which holds that more explicit structure needs to be added to data to improve the ability of machines to extract information from it.

The paper left me with a bitter taste but I couldn’t put my finger on why until this morning.

As I wrote before, Google built its empire on the <a> tag. Not on statistical methods but on fully deterministic topological analysis of the graph of hyperlinks. They did so while everybody else in the field tried all they could to derive rank from a better understanding of the content of pages using statistical methods, and while everybody else thought that the search engine field was a done deal (because the fields of text mining and machine learning were already old and very established).

Unpredictably, HITS (or Google’s own flavor, PageRank) became the de facto standard for state-of-the-art ranking in hyperlinked corpora, and it’s now considered a milestone in that field.
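To make “topological analysis of the graph of hyperlinks” concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is an illustration under stated assumptions, not Google’s actual implementation: the function name, the tiny three-page graph and the damping factor of 0.85 are all chosen for the example. The point is that the only input is who links to whom; the content of the pages is never looked at.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages using only the link graph.

    `links` maps each page to the list of pages it links to
    (a hypothetical, hand-written stand-in for crawled <a> tags).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Assumed toy graph: "a" and "b" both link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_rank = pagerank(toy_links)
best_page = max(toy_rank, key=toy_rank.get)
```

In this toy graph the page with the most incoming links ends up on top, and the page nobody links to ends up at the bottom, without a single word of text being analyzed.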

What upset me about that paper is not that they say “oh sure, structure is great, but look over here: there is a goldmine in all the sand” (which is something I fully agree with) but that they phrased it as a fight, deterministic vs. statistical, trying to convince people that adding structure is not the way to go, that it’s basically a global waste of research resources.

And yet, without the <a> tag (that is: machine-readable imposed structure), they wouldn’t be where they are, nor would they be able to speak from such a tall soapbox.

Sure, Google uses all sorts of techniques, statistical and not, and they are very good at mixing them together, but that’s not what you get from the paper. What you get is an undertone of criticism for those who believe that what’s needed is a lot more explicit structure.

What’s weird is that I fully agree with them there: the web of <a> tags and URLs is a very tiny increase in structure compared to the full-on ontological utopia that most semantic web advocates dream of, yet such a small change in the data ecosystem provides enough latent information to obtain staggering new insights into the corpus as a whole.

The same thing could be said for n-gram analysis and character encodings: without a minimal world-wide agreement on how to turn characters into streams of bits, there would be no way for you to parse a human word into machine-processable n-grams, and therefore no way for you to build corpus spectra or to work with them to gain insights.
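A small sketch of why the encoding agreement comes first. The helper name and the sample phrase are made up for illustration; the only thing it relies on is Python’s standard `bytes.decode`. The same bytes are decoded once with the encoding they were written in (UTF-8) and once with a different guess (Latin-1), and the character bigrams that come out are not the same.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count the character n-grams of a string (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# The bytes as they would travel over the wire, written as UTF-8.
raw_bytes = "na\u00efve caf\u00e9".encode("utf-8")  # "naïve café"

# Reader and writer agree on the encoding: the n-grams match the text.
agreed = char_ngrams(raw_bytes.decode("utf-8"))

# Reader guesses Latin-1 instead: every accented letter turns into two
# unrelated characters, and the n-grams describe text nobody wrote.
mismatched = char_ngrams(raw_bytes.decode("latin-1"))

has_expected_bigram = "f\u00e9" in agreed             # "fé" is found
lost_without_agreement = "f\u00e9" not in mismatched  # "fé" is gone
```

Multiply that small corruption by a few billion documents and the corpus statistics stop describing any real language, which is the sense in which the agreement on encoding is structure that the statistics quietly depend on.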

A better title would have been “The Surprising Payoffs of Small Distributed Increases in Data Structure”, and the paper would outline how the introduction of UTF-8 massively simplified n-gram analysis, how the introduction of the <a> tag in HTML allowed the creation of PageRank, how the massive adoption of TCP/IP and HTTP made it possible for Google to harvest the entire web as if it were in their basement (where it now practically is), how the widespread support of RSS/Atom massively simplifies the analysis of ‘data in faster motion’, or even how their own ‘sitemap’ format (a very highly structured XML document) is now advocated as a favorite SEO tool for advertising which content you want harvested ahead of the rest.

There is absolutely nothing statistical about all those things, yet each and every one of them is now vital for Google to function.

There is no clear-cut distinction between data and metadata, and there is no clear-cut distinction between structured and unstructured data either. Unstructured data is by definition ‘noise’, because structure is precisely the information we’re seeking to work with and make use of.

It is indeed true that the return on the global aggregation of tiny local increases in data structure is surprisingly large. It is larger than many people believe, even those researching in this space (and clearly much larger than their funders or even PIs believe).

It is spot-on that researchers should spend more time thinking about how to aggregate and surface such latent derivatives of structure than dreaming of standardized knowledge representations, distributed symbolic reasoning or computational knowledge.

But this confrontational undertone comes across at best as hypocritical and at worst as toxic, especially when coming from the research heads of an entity that has benefited so much from non-statistical amplification of minor distributed increases in data structure.