Archive for the ‘Commentary’ Category

Unreasonable Hypocrisy

March 31st, 2009

I recently came across this paper entitled “The Unreasonable Effectiveness of Data” by Alon Halevy, Peter Norvig and Fernando Pereira, published in IEEE Intelligent Systems. The paper outlines many of the ways that structure can be inferred statistically from very large quantities of data. The authors explicitly present this approach as antithetical to the semantic web’s, where it is believed that more explicit structure needs to be added to data as a way to improve the ability of machines to emerge information from it.

The paper left me with a bitter taste but I couldn’t put my finger on why until this morning.

As I wrote before, Google built its empire on the <a> tag. Not on statistical methods but on fully deterministic topological analysis of the graph of hyperlinks. They did so while everybody else in the field tried all they could to emerge rank out of a better understanding of the content of pages using statistical methods, and while everybody else thought that the search engine field was a done deal (because the fields of text mining and machine learning were already old and very established).

Unpredictably, HITS (or Google’s own flavor, PageRank) became the de facto standard in state-of-the-art rank emergence in hyperlinked corpora and is now considered a milestone in that field.
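To make the link analysis described above concrete, here is a minimal sketch of a textbook power iteration over a toy link graph. It is an illustration under stated assumptions, not Google's production algorithm: the page names, the `pagerank` function and the iteration count are hypothetical, and the 0.85 damping factor is simply the value commonly quoted for PageRank.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration over a dict of page -> list of outgoing links."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: only the <a> links are used, never the page text.
toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Note that nothing in the sketch looks at the content of a page: the ranking comes entirely from who links to whom, which is the point of the argument.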

What upset me about that paper is not that they say “oh sure, structure is great, but look over here: there is a goldmine in all the sand” (which is something I fully resonate with) but that they phrased it as a fight, deterministic vs. statistical, trying to convince people that adding structure is not the way to go, that it’s basically a global waste of research resources.

And yet, without the <a> tag (that is: machine-readable imposed structure), they wouldn’t be where they are, nor would they be able to speak from such a tall soapbox.

Sure, Google uses all sorts of techniques, statistical and not, and they are very good at mixing them together, but that’s not what you get from the paper. What you get is an undertone of criticism for those who believe that what’s needed is a lot more explicit structure.

What’s weird is that I fully agree with them there: the web of <a> tags and URLs is a very tiny increase of structure compared to the full-on ontological utopia that most semantic web advocates dream of, yet such a small change in the data ecosystem provides sufficient latent information to obtain staggering new insights on the corpus as a whole.

The same thing could be said for n-gram analysis and character encodings: without a minimal world-wide agreement on how to turn characters into streams of bits, there would be no way for you to parse a human word into machine processable n-grams, therefore no way for you to build corpus spectra and no way for you to work with them to gain insights.
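The dependency of n-gram analysis on an agreed character encoding can be shown with a small sketch. It assumes character-level trigrams (the argument applies to word n-grams as well); the `char_ngrams` helper and the sample text are hypothetical and only meant as an illustration.

```python
from collections import Counter


def char_ngrams(raw_bytes, n=3, encoding="utf-8"):
    """Decode a byte stream with the given encoding and count its character n-grams."""
    text = raw_bytes.decode(encoding)
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# The same bytes on the wire, read under two different encoding assumptions.
raw = "naïve café".encode("utf-8")

agreed = char_ngrams(raw)                         # reader and writer agree on UTF-8
mismatched = char_ngrams(raw, encoding="latin-1")  # reader guesses the wrong encoding

print("naï" in agreed)      # the trigram a human would expect
print("naï" in mismatched)  # missing: the accented letter was split into two characters
```

When writer and reader agree on the encoding, the n-grams line up with the characters people actually typed; when they do not, the counts describe decoding debris instead of language.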

A better title would have been “The Surprising Payoffs of Small Distributed Increases in Data Structure”, and the paper would outline how the introduction of UTF-8 massively simplified n-gram analysis, or how the introduction of the <a> tag in HTML allowed the creation of PageRank, or how the massive adoption of TCP/IP and HTTP made it possible for Google to harvest the entire web as if it were in their basement (and now it practically is), or how the widespread support of RSS/Atom massively simplifies the analysis of ‘data in faster motion’, or even how their own ‘sitemap’ format (a very highly structured XML document) is now advocated as a favorite SEO tool for advertising which content you want harvested more than the rest.

There is absolutely nothing statistical about all those things, yet each and every one of them is now vital for Google to function.

There is no such thing as a clear-cut distinction between data and metadata and there is no such clear-cut distinction between structured and un-structured data. Unstructured data is by definition ‘noise’ because structure is precisely the information we’re seeking to work with and make use of.

It is indeed true that the return on the global emerging of tiny local increases in data structure is surprisingly large. And larger than many people believe, even while researching in this space (and clearly much larger than their funders or even PIs believe).

It is spot-on that researchers should spend more time thinking on how to aggregate and emerge such latent derivatives of structure than to dream of obtaining standardized knowledge representations, distributed symbolic reasoning or computational knowledge.

But this confrontational undertone is coming across at best as hypocritical and at worst as toxic, especially when coming from the research heads of an entity that benefited so much from non-statistical amplification of minor distributed increases in data structure.

Permalink | Posted in Commentary
 

The arXiv blog joins MIT TechReview

March 13th, 2009

I’m not sure how many people understand how huge this is, so let me explain how I see it.

arXiv is an e-print archive built and run by the Cornell University Library. Basically, a big repository of scientific articles (more than 500k right now), all in digital form (PDFs), all with permanent identifiers and all freely available.

arXiv has existed for 17 years and the number of article contributions per month has been increasing linearly over all 17 years(!).

500k items in a collection is a lot but it’s relatively small compared to, say, the web, scientific publishers or even any decent size university library catalog.

What’s fascinating about arXiv (and citeseer and other similar repositories) is that submissions are not peer reviewed: storing and distributing 10k articles or 500k has a very minor difference in cost, which is why the need for up-front filtering drops dramatically.

The knee-jerk reaction to this from old-school scientists and librarians is normally something between horror and disgust: if anybody is allowed to publish, the result is that quantity will increase and quality will drop.

This is a reaction to a model that sees books on shelves and articles in journals in stacks, located by searching an index catalog (electronic these days, cards when I was born) by metadata and subject term. In that world, yes, a lower filter on quality yields substantially lower precision and recall for any search and vastly diluted cataloging and curating efforts.

But arXiv is a highly automated system and runs on full text analysis and self-submitted article metadata and subject classification. It gives every article a permanent identifier and it also links to “citebase” which tracks references to it from other papers.

There is no a priori peer review, subject analysis or librarian curation: it’s all done a posteriori, by analyzing the content of the article and the behavior of people around it.

Yes, there is tons of crap in there but there are also incredible gems (such as this one, one of my favorite articles).

We all agree that focusing on distinguishing between the good and the bad is valuable, but the reshaping of social landscapes and economies of scale makes the old tools feel inadequate and suboptimal, including printed publishing and the politics- and affiliation-driven peer review system that keeps it alive.

One such new tool has been the arXiv blog: a human curator watched over the stream of articles entering arXiv (mostly to spot and flag abuses of the system, I would guess) and decided to blog about the gems that were found in the process.

The blog was always very interesting and witty, and one of my favorite ways to find interesting new scientific discoveries, coming from all sorts of places and without the need for a big university affiliation to make it into the establishment.

The news of today is that from now on the arXiv blog will be exclusively hosted inside the MIT TechReview web site. This is huge not only because it will bring exposure to arXiv and to more peripheral scientific research, but because it sets a small but substantial milestone in the acceptance of a commit-then-review scientific publishing world (as opposed to the de facto standard review-then-commit model that has been in place for centuries).

Don’t get me wrong: this is an important step but it’s still a pretty small one. What makes me happy about this is that at least we seem to be walking in the right direction, a direction that by giving a chance to anybody to publish their thoughts without having to convince others of their value a priori will hopefully spark more variety of thought, more diversity in research and will focus more on publishing to improve scientific merit rather than publishing to improve your position in the network of academic influence.


Post-Mortem of a Dissonant Keynote

March 5th, 2009

Last week, I was given the honor of being the keynote speaker at the Code4Lib 2009 conference in Providence, RI.

This is a conference self-organized by a community of alpha-techy librarians trying really hard to drag the library world (kicking and screaming) into a more modern alignment of both technological infrastructure and vision. I was never a librarian, but I worked 5 years for the MIT Libraries (trying to do the above myself) and now work for a company that can very well become part of that future technological infrastructure that could be highly beneficial for a modern version of the libraries.

Both SIMILE and Metaweb are considered valuable things in this space and watched with great interest, and I know many in this community read this blog (or at least have read some of my ramblings in passing). Which is why I was invited.

So it’s particularly painful to admit that my keynote sucked.

Plagued by network problems that slowed my live demos to a crawl and made the whole experience downright pathetic, my timing got off and I was forced off the stage by the bell (the conference ran on a surprisingly rigorous schedule) without delivering the entire ending of the presentation.

This wouldn’t have been that bad if the ending was just a wrap up… but that was, in fact, the part that I wanted most to tell the audience about (and everything else was a way to lead to it).

My keynote wanted to highlight all the things that libraries and librarians are really good at and that will NOT die even if books become obsolete media of information transfer…. but I didn’t have time for that, so I just told them all the things that might displace them, and even demoed a few live.

And left it at that.

It was supposed to be a “fear then hope” speech, but time forced me to cut the ‘hope’ out of it.

Yeah, right.

So, you can imagine how eager I was to go through the conference IRC backlog from my keynote that Erik was so prompt to send me. Now, to be frank, conference backlogs are never for the faint of heart, but I’ve done this before, I have a pretty thick skin and I knew already it was bad, so I dove into it, hoping to find more value than just “booo”.

The first part of the keynote was about showing how the variations of marginal costs have driven all the innovations in information transfer, and that following that curve would seem to predict that as soon as “ipods for books” arrive with a user experience reasonably similar to that of books (no, we’re so not there yet, kindle is barely scratching that surface), most information would become digital and libraries would split between ‘museums of books’ and ‘something else’. To which the backchannel replied

[09:27:26]          bess |  umm… he knows this is CODE 4lib, right? This is reminding me of mid-1990s “the book is dead” hysteria

True. The people at code4lib already sense there is some value in there (and they all work on digital technologies already). Although when (a little earlier) I pointed out that books will probably feel one day like vinyl LPs feel today, there was a slew of

[09:26:12]           rjw |  vinyl LPs++
[09:26:17]           gsf |  fanatics++
[09:26:22]          kat3 |  bibliophiles++

which indicates, as I knew already, that there is a love/hate relationship going on in the libraries about information technology as a displacer. Many want the benefits of the new tools, without the disruption that these end up causing. This turns out to be a very common pattern across many industries that have to deal with information distribution.

Then I went on to ask: if there is no limitation on storage (or its cost curves are radically altered from those of today’s library shelves), what is the justification for filtering?

[09:29:11]      scolford |  I would like to filter faulty or dangerous medical or financial information. Wouldn’t you?

Actually, no, I wouldn’t.

In order to do this, you must know at the time of filtering how to evaluate properties like ‘faulty’ or ‘dangerous’ and assume they are both generally applicable and don’t change over time. Faced with a problem of shelf-space limitation, it’s easy to evaluate that storing faulty or dangerous information at the expense of valuable and useful information isn’t a benefit, no matter how important this filtered out information turns out to be 100 years in the future. But I don’t find it to be a valid argument today.

[note, less input filtering does not imply less output filtering: I too care about usefulness of information, I just disagree that pre-emptive filtering is the best way to achieve it, as it was forced to be done in the past by shelf-space constraints]

Technology might displace the very reason why filtering was done in the libraries, but the mindset will be much harder to displace. This inertia is way more dangerous than it might seem at first, if only because it opens the doors to other institutions that have far fewer problems accepting all information and deciding on its value a posteriori, but that don’t necessarily share the libraries’ core values.

The next step was, obviously, to deal with the other side: fewer reasons to filter input imply more reasons for better output filters, and metadata has historically been the solution to this problem for libraries. But if the entire text was available, would we still need metadata? (at which many in the audience yelled “yeah!”)

[09:30:57]           gsf |  attack books all you want, but leave metadata alone

and also

[09:29:27]      rosy1280 |  metadata doesn’t exist
[09:29:31]      rosy1280 |  its just data

and to sum it up

[09:30:15] *     jtgorman is guessing the keynote is striking dissonance tones

which was precisely the point and then continues

[09:32:48] *     jtgorman thinks non-library people as keynotes is always a gamble

which is a very diplomatic way of saying that I was not pleasing

[09:32:52]          bess |  hi, you must be new here. Yes, in fact, people have thought about some of this stuff before. Even in libraries.

precisely, but very few ever focus on what libraries would still be useful for (or good at) after all this electronic dust settles (but neither did I, since I ran out of time, boo).

[09:32:58]         dchud |  ok maybe now he’s getting to metaweb.

then

[09:33:49]    timmcgeary |  demoing that books are dead?

Not quite. I’m not the one killing books, nor do I have an incentive to do so. But my work for years has all focused on showing how to emerge information out of other information, and for that you need the ability to re-purpose and integrate data from various sources. And you can’t (easily) do that with books. I demoed all the stuff that we’re working on at Metaweb that shows the power of that approach.

[09:34:21]      akorphan |  So… the semantic web is good for Jeopardy answers?

At a superficial level, yes, it really feels like all it’s good for: like one of those nerdy kids that know everything about every subject but fail miserably to distill/emerge valuable knowledge from that sea of facts.

It’s not surprising really: many societies tend to equate knowing lots of things to being very smart, while most teachers (or individuals in general) know better.

The current state of affairs is that all efforts on the web of data are, at least at this stage, in the ‘notionistic’ phase. I’m the first to admit that and I’m the first to want to improve on it… unfortunately, emerging latent properties from datasets requires a considerable volume of information and dense networks of relationships; these are surprisingly rare and hard to build.

The backlog gems continued:

[09:35:17]      akorphan |  I think the assertion that the degranularization from book to data leaves out the notion that *narrative* is a necessity for all kinds of reading, research, etc.

This is very smart criticism and I wish I had had Q&A time to discuss this live: even if the web of data turns out to be all its proponents want it to be, narrative still won’t be part of it, but it will be something to put on top.

There is no question that transforming research, random thoughts and ideas, structured queries and infoviz charts into a coherent and understandable narrative is absolutely necessary for all this web of data infrastructure to be worth anything.

But we really don’t know what this ‘narrative’ will turn out to be or if it will be substantially different from today’s. For example, we don’t know if the availability of interactivity will change the way people write papers or books or decide to visualize and present their findings or their arguments.

There were also nice comments during the demos (despite the network slowness)

[09:35:52]   anarchivist |  freebasing++

I also gave the audience very hard questions about factual information and asked them where they would look for the answers first, at which

[09:36:09]  mib_nch3tkl0 |  omg…I thought of freebase when he asked!

would certainly make some people at Metaweb really happy but also

[09:37:49]         dchud |  for every one of these questions, i know multiple librarians who would know hte answers off the top of their heads
[09:38:25]      jbrinley |  dchud: can I have copies of those librarians?

This next one got me laughing and scared me at the same time

[09:38:32]       mbklein |  Parallax is awesome, but all I really use it for is scaring the technophobic librarians.

we clearly have a lot of work to do to make all this useful. Then a bunch of funny comments about the network being slow

[09:40:37]          epoz |  interesting. digital heckling burns bandwidth causing presenter demo grief
[09:40:58]      akorphan |  Quick, everyone start playing youtube clips!
[09:41:11]          bess |  should i not be bittorrenting right now?
[09:42:03]   anarchivist |  *crickets*
[09:42:15]         MrDys |  looks like my plan to do a live demo was not a great one…
[09:42:16]    BillDueber |  Note to self: be prepared to talk over slow network.
[09:42:31]          BigD |  freebase ate the tubes
[09:43:02]      harmless |  anyone have a fast data plan on their cell?
[09:43:19]    paulalbert |  I wonder if it’s so slow because I’m downloading the entire second season of Hannah Montana in HD.

and the last drop

[09:43:30]       rsinger |  you know, a librarian wouldn’t have these problems

which really hurt.

But there are also useful things hidden in there:

[09:45:36]       mikeybe |  can you make private freebase data sets?

(no, you can’t… not at the moment at least) or

[09:43:46]          edsu |  MikeTaylor: how do you build a distributed database on the web?
[09:44:47]    MikeTaylor |  edsu: something fuzzier.  Any solution that begins with “Hey, let’s just get everyone to input everything with rigour!” is not a solution.

and a very interesting conversation  (which is indeed the kind of thing that I wanted to spark):

[09:45:41]  mib_nch3tkl0 |  also, librariny q…since freebase queries things like wikipedia, how do we verify info?
[09:46:11]      akorphan |  What if you get two sources that provide different dates for King Lear?
[09:46:17]      akorphan |  how do you discriminate?
[09:48:14]      jbrinley |  akorphan: you look at the sources used for those sources, same as you would with books/encyclopedias
[09:48:23]      jbrinley |  akorphan: ultimately, it comes down to trust
[09:49:10]      akorphan |  Sure, but the predominant convention for trust on the web at this time is amount of linking, which is a bit suspect.
[09:50:29]      jbrinley |  akorphan: you’re welcome to use your own methods to determine who you trust on the web
[09:50:56]    MikeTaylor |  Any solution that begins “you are welcome to use your own methods to …” is not a solution.
[09:51:21]      harmless |  books have that problem too.  a lot of questionable stuff gets printed.  a lot of journalists write articles from too few research papers with too small samples or too preliminary, etc.
[09:51:24]      harmless |  do we have any standard metric for information quality?
[09:51:26]      jbrinley |  MikeTaylor: but it’s the same solution that we already have for print
[09:51:57]      akorphan |  jbrinley: But the notion here is that you’re trying to make some kind of automated knowledge aggregator
[09:52:05]    MikeTaylor |  jbrinley, I thought we were trying to IMPROVE on what we already have.
[09:53:28]      jbrinley |  MikeTaylor: yes, improve it. But don’t say that the Internet is worse than print just because you don’t have better solutions for certain problems

Then I showed the audience what we’re doing to improve on the rate of contribution to Freebase, things like Typewriter, Genderizer and Geographer. These are ‘games with a purpose’ built and powered by Acre (Freebase’s application platform) that aim to make it easy for people to contribute data to Freebase. Here another set of dissonance tones was struck

[09:53:24]         dchud |  i’m not convinced that crowdsourcing is necessarily different from Gale or whoever paying staff to edit reference sources
[09:54:17]    BillDueber |  Great. Votes by people who don’t know what the hell they’re talking about. I can get that now on IRC.
[09:55:14]    paulalbert |  wisdom of crowds only works when there’s no echo chamber effect
[09:57:42]       skoczko |   no one is gonnna spend time writing down relations and dependencies that seem obvious to him, unfortunately those relations are not obvious for the machine
[09:58:18]      Baroquem |  Not boredom. It’s fun to bring order to chaos.
[09:58:35] *        JodiS has a great desire to bring info together

and last but not least

[09:58:19]      akorphan |  I just can’t get 100% behind the “factual by majority assertion” model of authority.
[09:58:55]         JodiS |  akorphan: yup, that’s a big deal. “The majority is always wrong” (or whatever Ibsen said)

which deserves a separate blog post.

So, partially to set the record straight and partially to apologize, I have made my slides available, all three parts, including the last one that I couldn’t deliver.

Enjoy.
