
Archive for 2006

The MIT Lectures Search Engine

September 15th, 2006

One of the drawbacks of working for MIT is that your “wow factor” gets spoiled, meaning that it’s really hard to make me go “wow” when I see something new around computers and information technologies since I’ve been immersed in so much cool stuff and cool people.

A few months ago, David pointed me to an MIT internal prototype of a web-based search engine for video lectures, for which he had been collaborating on the UI part, and it made my “wow meter” go off the scale. He had to pressure me not to blog about it back then because they were still working on the system and didn’t want to deal with the side effects of early exposure, but today I received the ‘ok, go’ and I’m hitting the web press.

So, here it is: the MIT video lectures search engine.

NOTE: To enjoy the full experience, you need to have Real Player installed. I know, it’s a bummer, but believe me, it’s worth it.

I remember seeing something like this (a video search engine for the German parliament) at the Fraunhofer Institute in Germany a few years ago, but they were, ehm, ‘cheating’ by correcting the speech recognizers with human-generated transcripts and they needed a special client to access the database. This one is completely automated. No human intervention. And, best of all, it works with your browser, has a simple Google-like textbox, shows you the contexts around what you searched for in a clickable timeline view and does karaoke-like word highlighting when the part of the video you selected is playing.

So, in other words, you feed hundreds of hours of video to a computer and, hours of crunching later, you get this.

But what’s really important to understand is not just how much more useful and more exposed the hundreds of MIT video lectures now are with a service like this (and how much professors will want to appear there too if they aren’t there already!). As a non-native English speaker, I can only begin to imagine how useful this is as a spoken-English training platform for millions of students around the world! MIT is widely known for projects such as OpenCourseWare and One Laptop per Child, but this lecture search engine goes right up there for usefulness for humanity and not only as a kick-ass demo of decades of speech recognition research.

This paper explains part of the system (they are still in the process of submitting papers for this system).

I don’t have to be reminded to feel proud about working for MIT, but there is something that deeply touches my engineering soul in seeing decades of research finally condensing into something that truly delivers on the promise in an easy to use, easy to understand (and delightfully addictive, I might add) way.

To Jim Glass and his group: “chapeau”.

 

Toolmaking in SoCal

September 6th, 2006

I last blogged 7 months ago, around the time I moved to Los Angeles. It cannot be a coincidence and maybe the saying “I don’t have a blog, I have a life” has some real value… or maybe I need to sit down and fix all those bugs in this linotype of mine… or maybe I’ve just been busy working on tools instead of talking.

While deep inside I’m a hardware guy (with a degree in opto-electronics), since I lacked the millions of dollars that are required to do any serious research in those fields (or the patience to climb the social ladder to obtain the access to properly equipped labs), I turned to software where all you need is a text editor, a compiler/interpreter, time and patience. Sprinkle a little bit of OCD and ADD over it, early access to the internet and to like-minded people and you’re almost there.

Then you need a purpose.

McLuhan once said: “if it works, it’s obsolete”. It’s my motto, basically, or my curse, depending on what day you ask me. So I always decide to work on something that is new and fresh… not that crowded… where you can make an impact… a bleeding edge.

The problem with the ‘bleeding edge’ is not only that it is bleeding but that you have no tools. You can decide to use the existing tools in different ways and “make it work” or you can write new tools yourself. I consider myself lazy. So lazy, in fact, that instead of doing a boring action twice, I would spend a lot of time and energy to build a tool to do it for me. Pretty much every piece of software that I’ve created or helped write was done to save me from doing boring and mindless tasks.

I spend pretty much all my time building tools now. I wrote and released two recently and both try to exercise the power of ‘data emergence’: the whole is way more than the sum of its parts, especially when you can get your hands on a lot of data.

Gadget

The first one is Gadget, which I released a few months ago. I’m particularly proud of this tool even if not many people have the problem that it tries to solve. It’s an XML inspector that helps you understand the structure of one or more XML documents, no matter how big. When I say “no matter how big”, I’m not kidding. I’ve used it with several gigabytes of XML.

Gadget was written because when you are given a few database dumps in XML, each a few GB big, you can’t really open them in your text editor and look at them. Sometimes you are given a schema or a DTD, but most of the time the dump won’t even validate against the schema. Moreover, if you are tasked with writing a program that transforms this XML dump into something else (for example RDF/XML), the schema is not enough to tell you, for example, whether a particular value is unique in the dataset or not.

So I wrote Gadget, which basically is an XML parser on BerkeleyDB steroids. It fragments the XML and saves the XPath projections of it. Then you can reconstruct the “skeleton” of the XML, which is the list of all the XPaths that were ever encountered by the parser.
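To make the “skeleton” idea concrete, here is a minimal Python sketch. It is not Gadget’s code (Gadget is written in Java and keeps its indexes in BerkeleyDB); it assumes the paths fit in memory, and the function name `xml_skeleton` and the sample document are made up for illustration. It streams through the XML and records every element and attribute path it encounters, together with a count.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def xml_skeleton(source):
    """Stream through an XML document and count every element/attribute path seen."""
    paths = Counter()
    stack = []  # tag names of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            path = "/" + "/".join(stack)
            paths[path] += 1
            # Attributes are already available when the "start" event fires.
            for attr in elem.attrib:
                paths[path + "/@" + attr] += 1
        else:
            stack.pop()
            elem.clear()  # release the text and children of finished elements
    return paths

# Hypothetical sample standing in for a multi-gigabyte dump.
sample = io.BytesIO(
    b"<db><rec id='1'><name>a</name></rec>"
    b"<rec id='2'><name>b</name><year>1999</year></rec></db>"
)
skeleton = xml_skeleton(sample)
for path in sorted(skeleton):
    print(path, skeleton[path])
```

The counts already hint at the shape of the data: here every `rec` has a `name`, but only one has a `year`. Storing the values seen under each path, as Gadget does, is what makes the later search and distribution views possible.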

So, you can take all the XML you want and throw it at Gadget. It will digest it, producing indexes that a web application will later use to present you with that data. You can then search for particular values, see the value distribution charts, browse the XML skeleton tree and, last but not least, apply clustering functions to the values found in the same XPath to spot things like spelling mistakes. The clustering function I use is very simple and very efficient (its algorithmic complexity is linear in the number of values being clustered) but it is surprisingly effective, at least for the datasets I’ve used it on.
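The post does not say which clustering function Gadget uses, so the following Python sketch shows just one well-known way to get linear behaviour: key collision. Each value is reduced to a normalized key and values that share a key end up in the same cluster. The normalization rules and the names `fingerprint` and `cluster` are assumptions chosen for illustration.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Reduce a string to a key: strip accents, lowercase, drop punctuation, sort unique tokens."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group values that share a fingerprint; a single pass over the values."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only keys that collected more than one distinct spelling are interesting.
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Hypothetical values that might all sit under the same XPath.
names = ["Getty Museum", "getty museum.", "Museum, Getty", "ARTstor", "Artstor", "MIT"]
clusters = cluster(names)
for group in clusters:
    print(group)
```

A key like this catches differences in case, punctuation, accents and word order. Catching swapped or missing letters would need a different key (a phonetic one, for instance), but the one-pass grouping stays the same.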

So it’s a discovery tool and it helps the data analyst create a mental model of the XML dataset being analyzed, but also allows for data quality control and error management.

Of course, the next step is to build the equivalent of Gadget for RDF, but the graph data model makes it more complicated since you can’t (easily) have paths. I’ll blog more about this in the future.

Gadget has saved me endless hours of frustration and I even received a book as a gift from somebody that loved it so much. I’ve demoed it at the Getty Museum in Los Angeles and at ARTstor in New York and got everybody very excited.

In Gadget, I also like the fact that I decided to build the presentation software as a web application, even if most of the time you are the only one looking at the data that Gadget generates. Also, it’s written in Java but I didn’t use any framework. I just wrote a simple servlet, following the same pattern as the Cocoon sitemap but orders of magnitude simpler, and used the Jetty6 Maven plugin, which autorestarts the context when you make a change. So, Maven + Jetty + Eclipse + a simple servlet + Velocity + some JavaScript on the client and you don’t need much else. You get the power of a serious IDE, the solidity of the Java web stack, the fast round trip of interpreted languages, the set of libraries that Java has to offer and the simplicity of downloading them through Maven.

For a guy that spent 7 years working on an XML publishing framework, it was indeed refreshing to find out how fast and simple it was to do just a very specialized web application.

Referee

The other tool was released today and it’s called Referee.

Referee is a command line application that reads your web server logs and automatically finds out what other web pages have to say about your own. Unlike trackbacks, Referee is a completely automated tool: all you need is your server logs and a network connection, and Referee will do the rest. It works not only for blogs, but for any URL of your web site, no matter what program generated it.
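As a rough illustration of the first step, here is a Python sketch that pulls external referrers out of server logs. It is not Referee’s code: it assumes the logs are in the Apache “combined” format, and the function name `external_referrers`, the host names and the sample lines are all made up.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Apache "combined" format:
# host ident user [time] "METHOD path PROTO" status bytes "referrer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"(?P<referrer>[^"]*)" "[^"]*"$'
)

def external_referrers(lines, own_host):
    """Map each requested path to the set of external pages that sent visitors to it."""
    found = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # not a combined-format line
        referrer = match.group("referrer")
        host = urlparse(referrer).netloc
        if not host or host == own_host:
            continue  # missing referrer ("-") or a link from our own site
        found[match.group("path")].add(referrer)
    return found

# Hypothetical log lines.
log = [
    '203.0.113.7 - - [06/Sep/2006:10:00:00 -0700] "GET /blog/gadget HTTP/1.1" 200 5120 "http://example.org/post/42" "Mozilla/5.0"',
    '203.0.113.8 - - [06/Sep/2006:10:01:00 -0700] "GET /blog/gadget HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [06/Sep/2006:10:02:00 -0700] "GET /blog/referee HTTP/1.1" 200 2048 "http://www.example.com/blog/" "Mozilla/5.0"',
]
referrers = external_referrers(log, own_host="www.example.com")
for path, pages in referrers.items():
    print(path, sorted(pages))
```

The network connection comes in at the next step, which is not shown here: fetching each referring page and extracting the text around the link.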

Referee was born out of my curiosity to know who links to my stuff and what they have to say about it. I don’t have comments on my blog, because I think that if you really care to say something about this, you’ll blog on your own (or you’ll write me an email) [it's a social filter, so to speak]. So I needed a way to harvest all that content, and using Bloglines or Technorati or Google news search is ineffective because they keep feeding me stuff that they have already found, just because it’s at a new URL or because the content of the page has changed.

Referee takes care of the “generation” part of the data, which is then saved as RDF/N3 files for you to consume the way you prefer.

And I hear you ask in frustration: “Why RDF? Atom would have been way easier!”.

Atom is a step forward from general XML because it allows you to split the data model into many tree fragments, each with a unique identifier. But there is a problem with using Atom as a general data modelling language for many separate items: it lacks the ability to model relationships between such items.

Referee is unique because it treats ‘comments’ and ‘pages’ differently. A comment is a piece of text that surrounds the <a> tag that links to your page and it’s identified by its SHA-1 value, while a page is the URL that contains that comment. There are cases where the same comment is contained in different pages (different in that they have different URLs) and cases where the same page contains different comments. There is no way to model that as a tree; you need a graph.
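A small Python sketch of that many-to-many model follows. It is an illustration only: the whitespace normalization before hashing, the `ex:containedIn` property, the `http://example.org/...` identifiers and the sample data are all assumptions, not Referee’s actual vocabulary or output.

```python
import hashlib
from collections import defaultdict

def comment_id(text):
    """Identify a comment by the SHA-1 of its text (whitespace-normalized, an assumption)."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Hypothetical (page URL, text around the link) pairs, as a crawler might collect them.
observations = [
    ("http://example.org/a", "Nice tool for digging through XML dumps"),
    ("http://example.org/a?page=2", "Nice tool for  digging through XML dumps"),
    ("http://example.org/a", "Referee looks handy too"),
]

pages_of = defaultdict(set)     # comment id -> pages that contain it
comments_of = defaultdict(set)  # page URL -> ids of the comments it contains
for url, text in observations:
    cid = comment_id(text)
    pages_of[cid].add(url)
    comments_of[url].add(cid)

def to_n3(pages_of):
    """Write the comment/page edges as N3 triples using a made-up vocabulary."""
    lines = ["@prefix ex: <http://example.org/vocab#> ."]
    for cid in sorted(pages_of):
        for url in sorted(pages_of[cid]):
            lines.append(f"<http://example.org/comment/{cid}> ex:containedIn <{url}> .")
    return "\n".join(lines)

print(to_n3(pages_of))
```

The first comment shows up under two URLs and the first page holds two comments, so neither side can be nested inside the other. A flat list of edges handles both cases without duplication.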

So, while Atom might have allowed me to model the single items (pages, comments and feeds), I would have had to extend it with my own markup to model the relationships between these items. I would have ended up reinventing the RDF wheel anyway, in a way that would be incompatible with RDF tools and ignored by Atom tools.

Note that Referee deals only with the production side of the data cycle; the consumption is left to the user. You can either browse it with RDF tools (such as Longwell or Piggy Bank) or you can feed it into a triple store and run SPARQL queries on top of it (for example to generate an Atom feed of the new comments about a specific URL).

What I find fascinating about Referee is that, just like my spam filter, it’s a great example of a software agent that I use that is smarter than me (at least at doing repetitive jobs without getting numb or bored and making mistakes because of it). The best example of this is a list of comments in character sets that I’m not even able to interpret!

I continue to be amazed by the power of the right combination of tool, itch to scratch and good architectural model.

Ah, forgot to mention: both tools are open sourced using a BSD license.

Enjoy.

Update: The first tool I wrote and released as open source (8 years ago!) is Apache JMeter, which I had written to test the thread pool code that I had written for Apache JServ. Today I find out that Google uses JMeter as their profiling tool of choice [via Steve Loughran]. A good ego massage.

 

Manhattan Project 2.0?

February 20th, 2006

My friends and I are often debating about the impact on society that technology makes (or could make), especially since some of us are driven by the idea (idealism?) that technology (more specifically information technology) can be an enabler, a way to give opportunities to those who don’t have them.

Rekha sends me this short movie by Chris Oakley called “the catalogue” and it touches me deeply because it depicts a scenario that is both appealing and devastating at the same time: a future of networked information where privacy is traded for convenience and data is aggregated, collected, bought, mined and exchanged, until it becomes a sort of currency on its own.

Information is power. But gathering the information is not enough, you have to be able to do something with it. That’s why Google’s bought 30 acres of land next to a huge hydroelectric power plant in Oregon: 150MW could power a lot of computers.

I work every day on software that deals with large quantities of semi-structured data and tries to surface relationships that were not obvious before. Books and academic papers for now, but tomorrow it will be the web, then images, then email, then DNA sequences… we are still very far away from a scenario like the one depicted in the movie, but what’s scary is that somebody would be willing to invest billions of dollars to make that happen. I won’t cry wolf adding myself to the list of those who fear that Orwell was right: what concerns me is that I’m part of this, I’m actively helping out.

It’s not that we are blind about it. Privacy, identity, society, education, fair use, access control, rights management: these are words that get mentioned in every meeting I go to, and I know people who are actively trying to make progress in information exchange without sacrificing the individual on the altar of the greater good of society.

But I wonder: do we really understand what we are playing with? “The House in the Middle” is a movie made by the US government in 1954 that uses the fear of atomic heat to convince people to clean up their back yards and renovate their houses. It comes across as a dry and sad sort of humor now to see a tidy house resist atomic heat, only to have its inhabitants die of nuclear fallout a few days later. I wonder if what we are saying/thinking isn’t equally off the mark when we debate the dangers of the evil use of what we are creating.

Which makes me wonder about what people will feel when they look back at us, chasers of Vannevar Bush, from 2054.

Or maybe I’ve just been working too hard lately.

Update: Karl Dubost points me to this, a mash-up that uses Amazon wishlists in order to spot people based on ‘subversive’ book choices. What amazes me is that I find it both scary and lame: lame in the sense that we can do so much better than that, scary in the sense that I wouldn’t want to, but others might!
