Meet me in SF at Freebase HackDay?

June 26th, 2009

Did you ever wonder what I saw in Freebase that made me join Metaweb, and that you still can’t see?

Or do you think Freebase is cool and all, but have no idea how you could make use of it or why you should care in practice?

Or are you simply curious about what I’ve been working on for the last 9 months, which involves something so tricky that a standard Java virtual machine wasn’t enough and we had to patch it and run our own modified version in production? (yeah, I know how crazy that sounds... believe me, I know)

So, if you’re reading this and find yourself close to downtown San Francisco on Saturday, July 11, 2009, consider showing up at the free Freebase HackDay event that we are organizing to show off and promote all the developer-related activities we’ve been working on to make Freebase a useful and interesting platform to integrate with your web applications.

This day is also very important for me because we’ll be releasing Acre 1.0, our server-side JavaScript-powered hosted web application platform, which my team and I have been working on since I joined Metaweb.

The event is free and Metaweb will be offering food, drinks and plenty of whiteboards and wireless connectivity, but if you plan on coming, please RSVP.

See you there!

 

Theory vs. Practice

June 11th, 2009

It’s a little bit of a truism really, but what’s good in theory, even in complicated and brilliant theories, does not always work in practice.

The latest of such failures sits right here on my desk in the shape of three large pieces of paper, in three different colors, that the Repubblica Italiana (aka the government of my country of citizenship) delivered to my house. They are ballots. I’m asked to make a decision on an ‘abrogative referendum’, which is a complicated way to say that they want me to say yes or no to a patch that the people of Italy want to apply to the law of the Republic.

The Italian constitution provides these powers: the Italian people are allowed to prepare a patch for the law, collect a number of signatures (don’t remember the actual number but it has to be 250k or so) and then ask the rest of the population how they feel about it.

Only two limitations: the patch can only remove text (which is why it’s called ‘abrogative’) and it can’t touch taxes.

On paper, the idea is great: if congress goes too far and comes up with a law that goes against the people, the people can bypass congress and remove it themselves. It was meant as a last resort, a safeguard… and after an empire, 1200 years of invasions, a pope-powered state built right inside the capital, one Mussolini and a civil war, it is very much understandable that the post-WWII constitutional engineers built a pretty considerable set of checks and balances (and sometimes even went too far, but that’s another story).

So here I am, with these three ballots, each with two big boxes, yes and no, and an even bigger box, a *huge* box, that contains, I’m not kidding, probably 4000 words that read like this: remove word ‘in’ at paragraph 3, comma 23 of Legislative Decree #533 of December 23, 1993… remove ‘of coalition’ paragraph 2, comma 12….

You get the idea.

This means that, basically, I’m being asked to evaluate the effects of this patch. The title of the patch is “removing the possibility of linking electoral lists and the attribution of a majority bonus for an electoral coalition”…. which, if I understand it right, is supposed to prevent small parties from linking up together to form a bigger party and get a ‘majority bonus’, which later turns out to be toxic because as soon as they get their seats in congress, these coalitions fragment into a bunch of shards (in the best case) or they hold the majority hostage to their will (as happens regularly).

Ideally, one would think that voting yes (remember, yes here means ‘go ahead, apply the patch and remove’) would imply that fewer electoral coalitions get formed, which hopefully would mean that small political parties get less representation in congress, which would lead to a less unstable political system (even if less representative of minorities).

So, in theory, we have this awesome constitutional power to ‘stick it to the man’ and we have this set of patches that, in theory, would enable a more stable political system (which is something Italy seriously needs).

Yet, in practice, this means reading a title and hoping that what the patch actually does is in line with what the title of the patch says. In another country, you might take this for granted; in Italy, not so much: knee-jerk distrust for everything governmental goes so deep that even when I’m asked to stick it to the man, I’m wondering if there isn’t one of those men using me as a tool to stick it to some other men. Actually, no, you can count on that being the case.

So, ideally, one should ignore the patch titles and just read what they say… but this is a patch and it looks like a patch… it only has deltas and differences, it doesn’t tell me how the law works, it doesn’t show me where to find the law (best I can do is to get here… then what?) … and I’m no lawyer and I’m no jurist and I’m nowhere near capable of understanding the far-reaching dynamic implications of patching anything of any law.

What in theory is a ‘stick it to the man’ power, in practice turns me into a political sock puppet. What in theory was designed as a tool to empower ends up increasing distrust and amplifying fear of action.

I have a few days to vote (Italians who live abroad vote by mail earlier) but I have no idea what I’ll vote… and not because I don’t know where I stand on the issue (I do: I want more stable Italian governments even if this means less minority representation), but because I don’t know whether what I’m voting on is actually going to do what I’m being told it’s going to do.

Oh, and if you think this ‘referendum’ thing is weird and not that significant, it is worth mentioning that the Italian people changed their form of government from a monarchy to a republic with it in 1946, decided to allow women to divorce in 1974 and to abort in 1981 and to stop the use of nuclear power in 1987 (if you’re curious, check out the full list of Italian referendums).

Anyway, this might look like democracy and walk like democracy, but ultimately it doesn’t feel like one to me at all.

Permalink | Posted in Commentary
 

Unreasonable Hypocrisy

March 31st, 2009

I recently came across this paper entitled “The Unreasonable Effectiveness of Data” by Alon Halevy, Peter Norvig and Fernando Pereira, published in the IEEE Intelligent Systems journal. The paper outlines many of the ways that structure can be inferred statistically from very large quantities of data. They explicitly describe this approach as antithetical to the semantic web’s, where it is believed that more explicit structure needs to be added to data as a way to improve the ability of machines to emerge information from it.

The paper left me with a bitter taste but I couldn’t put my finger on why until this morning.

As I wrote before, Google built its empire on the <a> tag. Not on statistical methods but on fully deterministic topological analysis of the graph of hyperlinks. They did so while everybody else in the field tried all they could to emerge rank out of a better understanding of the content of pages using statistical methods and while everybody else thought that the search engine field was a done deal (because the fields of text mining and machine learning were already old and very established).

Unpredictably, HITS (or Google’s own flavor, PageRank) became the de-facto standard in state-of-the-art rank emergence in hyperlinked corpora and it’s now considered a milestone in that field.
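
To make the "topological analysis" point concrete, here is a minimal sketch of link-based ranking by power iteration. It is the textbook formulation and not Google's production algorithm; the tiny link graph, the function name `pagerank`, the damping value and the iteration count are all assumptions I made up for illustration. Note that the only input is who links to whom: no page content is looked at.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages using only the link graph.

    links: dict mapping each page to the list of pages it links to
    (a hypothetical stand-in for the <a> tags found on that page).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank everywhere.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked by everybody, "d" by nobody.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page everybody links to ends up on top and the page nobody links to ends up at the bottom, purely as a consequence of the imposed link structure.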

What upset me about that paper is not how they say “oh sure, structure is great, but look over here: there is a goldmine in all the sand” (which is something I fully resonate with) but that they phrased it as a fight, deterministic vs. statistical, trying to convince people that adding structure is not the way to go, that it’s basically a global waste of research resources.

And yet, without the <a> tag (that is: machine-readable imposed structure), they wouldn’t be where they are, nor would they be able to speak from such a tall soapbox.

Sure, Google uses all sorts of techniques, statistical and not, and they are very good at mixing them together, but that’s not what you get from the paper. What you get is an undertone of criticism for those who believe that what’s needed is a lot more explicit structure.

What’s weird is that I fully agree with them there: the web of <a> tags and URLs is a very tiny increase of structure compared to the full-on ontological utopia that most semantic web advocates dream of, yet such a small change in the data ecosystem provides sufficient latent information to obtain staggering new insights on the corpus as a whole.

The same thing could be said for n-gram analysis and character encodings: without a minimal world-wide agreement on how to turn characters into streams of bits, there would be no way for you to parse a human word into machine processable n-grams, therefore no way for you to build corpus spectra and no way for you to work with them to gain insights.
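
A small sketch of that dependency, assuming character trigrams and a made-up two-word sample (the function name `char_ngrams` and the sample text are mine, not from the paper): the same bytes yield sensible n-grams only when producer and consumer agree on the encoding.

```python
from collections import Counter


def char_ngrams(raw_bytes, n=3, encoding="utf-8"):
    """Decode a byte stream and count its character n-grams."""
    # Without an agreed encoding there are no characters to count, only bytes.
    text = raw_bytes.decode(encoding)
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# The same word twice, stored as UTF-8 bytes.
raw = "città città".encode("utf-8")

# Decoding with the agreed encoding gives the n-grams a human would expect.
spectrum = char_ngrams(raw, n=3)
print(spectrum.most_common(3))

# Decoding the same bytes under a different assumption splits each 'à'
# into two unrelated characters and pollutes the spectrum.
garbled = char_ngrams(raw, n=3, encoding="latin-1")
print("ttà" in spectrum, "ttà" in garbled)
```

The statistics only start after the structural agreement has done its job.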

A good title would have been “The Surprising Payoffs of Small Distributed Increases in Data Structure” and the paper would outline how the introduction of UTF-8 massively simplified n-gram analysis, or how the introduction of the <a> tag in HTML allowed the creation of PageRank, or how the massive adoption of TCP/IP and HTTP made it possible for Google to harvest the entire web as if it were in their basement (and now it practically is), or how the widespread support of RSS/Atom massively simplifies the analysis of ‘data in faster motion’, or even how their own ‘sitemap’ format (a very highly structured XML document) is now advocated as a favorite SEO tool for you to advertise what content you want harvested more than others.

There is absolutely nothing statistical about all those things, yet each and every one of them is now vital for Google to function.

There is no such thing as a clear-cut distinction between data and metadata and there is no such clear-cut distinction between structured and un-structured data. Unstructured data is by definition ‘noise’ because structure is precisely the information we’re seeking to work with and make use of.

It is indeed true that the return on the global emerging of tiny local increases in data structure is surprisingly large. And larger than many people believe, even while researching in this space (and clearly much larger than their funders or even PIs believe).

It is spot-on that researchers should spend more time thinking about how to aggregate and emerge such latent derivatives of structure than dreaming of obtaining standardized knowledge representations, distributed symbolic reasoning or computational knowledge.

But this confrontational undertone comes across at best as hypocritical and at worst as toxic, especially when coming from the research heads of an entity that benefited so much from non-statistical amplification of minor distributed increases in data structure.

Permalink | Posted in Commentary