Archive for the ‘Announcement’ Category

Leaving MIT

March 11th, 2008

When I joined MIT, in January 2004, very few people knew about the SIMILE Project, even if a few suspected that my presence would change that.

Now, a little more than 4 years later, SIMILE is known inside and outside academia, has produced software that is used by thousands of people, in many different environments and, most of all, has pioneered many innovations in data integration, data visualization and the relationship between open development practices and academic environments.

I’m extremely proud of the work we have done. I’ve had the unique and wonderful opportunity to work with amazing people, bringing all sorts of different skills, experience and points of view to the table. It has been both a humbling and an empowering experience, and I consider myself incredibly fortunate to have had the opportunity to belong to this group and to this truly unique and world-renowned institution.

But Phase 2 of SIMILE is coming to an end and it’s time for me to re-evaluate my position and my aspirations.

While the academic environment is a wonderful opportunity for vast and deep research and for mental stimulation, it is not the place where it’s easy to make the rubber meet the road and get real traction. I’ve tried to change that, merging my decade-long Apache experience with academic dynamics and its funding; I’ve had the fortune of having wise funders and wise bosses, but I sense that I’ve personally peaked and that it’s time to re-evaluate where to invest my energy.

My contract with MIT expires in January 2009 and there is still plenty of work to do around SIMILE and friends, so I’m in no hurry. My visa status also won’t change before the end of the summer, so I couldn’t change jobs before then even if I wanted to… but what some of you already know is now public information: I’m officially on the market for a new job.

So what would I want to do next?

First of all, I bet that the future of IT is in data, not in software, so I won’t be following Ted into a job that tries to port a particular language to a particular platform. No thanks, not interested.

Second, we just bought a house in Los Angeles and quite simply, I’m not going to move. I’ve worked remotely for MIT for the last 2 years and for 6 years with Apache before that, so I’m used to working remotely. I’m not afraid of traveling (flying makes me even more productive at times), but relocation is not an option.

Third, I’m way more productive if I’m passionate about my job, so if you think you can lure me into a job that I won’t feel passionate about with a higher salary, don’t: you won’t get what you’d pay for and we’ll both be unhappy.

That said, it doesn’t mean that I will work for peanuts if I like the job; that’s part of the reason to leave academia. The job needs to be something that I feel passionate about, something that I want my name publicly associated with and something that I feel I can invest time and energy in to make a strong impact and feel proud of. And I need to be properly compensated for it. In this order, but all pieces are important.

I already have several offers on the table, a few of which are very promising, but no done deal yet, so if you think you have a job for me and it meets the above requirements, send me an email and we’ll take it from there.

And if you worry about the future of SIMILE and its software, don’t: it’s not going away anytime soon, although there are already discussions on how to move forward and ‘spin off’ some of the most used software to more neutral locations. But that’s another story and this is not even the right place to discuss it.

 

Toolmaking in SoCal

September 6th, 2006

I last blogged 7 months ago, around the time I moved to Los Angeles. It cannot be a coincidence and maybe the saying “I don’t have a blog, I have a life” has some real value… or maybe I need to sit down and fix all those bugs in this linotype of mine… or maybe I’ve just been busy working on tools instead of talking.

While deep inside I’m a hardware guy (with a degree in opto-electronics), since I lacked the millions of dollars that are required to do any serious research in those fields (or the patience to climb the social ladder to obtain the access to properly equipped labs), I turned to software where all you need is a text editor, a compiler/interpreter, time and patience. Sprinkle a little bit of OCD and ADD over it, early access to the internet and to like-minded people and you’re almost there.

Then you need a purpose.

McLuhan once said: “if it works, it’s obsolete”. It’s my motto, basically, or my curse, depending on what day you ask me. So I always decide to work on something that is new and fresh… not that crowded… where you can make an impact… a bleeding edge.

The problem with the ‘bleeding edge’ is not only that it is bleeding but that you have no tools. You can decide to use the existing tools in different ways and “make it work” or you can write new tools yourself. I consider myself lazy. So lazy, in fact, that instead of doing a boring action twice, I would spend a lot of time and energy to build a tool to do it for me. Pretty much every piece of software that I’ve created or helped write was done to save me from doing boring and mindless tasks.

I spend pretty much all my time building tools now. I wrote and released two recently and both try to exercise the power of ‘data emergence’: the whole is way more than the sum of its parts, especially when you can get your hands on a lot of data.

Gadget

The first one is Gadget, which I released a few months ago. I’m particularly proud of this tool even if not many people have the problem that it tries to solve. It’s an XML inspector that helps you understand the structure of one or more XML documents, no matter how big. When I say “no matter how big”, I’m not kidding. I’ve used it with several gigabytes of XML.

Gadget was written because when you are given a few database dumps in XML, each a few GB big, you can’t really open them in your text editor and look at them. Sometimes you are given a schema or a DTD, but most of the time the dump won’t even validate against the schema. Moreover, if you are tasked with writing a program that transforms this XML dump into something else (for example RDF/XML), the schema is not enough to understand, for example, whether a particular value is unique in the dataset or not.

So I wrote Gadget, which basically is an XML parser on BerkeleyDB steroids. It fragments the XML and saves the XPath projections of it. Then you can reconstruct the “skeleton” of the XML, which is the list of all the XPaths that were ever encountered by the parser.
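To make the “skeleton” idea concrete, here is a minimal sketch in Python. It is not Gadget’s actual code (Gadget is written in Java and stores its indexes in BerkeleyDB); it only assumes that a skeleton can be approximated by streaming the document and counting every element path seen. The function name `xml_skeleton` and the sample data are made up for illustration, and attributes and text values are ignored.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def xml_skeleton(source):
    """Stream an XML document and count every element path encountered."""
    counts = Counter()
    stack = []  # tag names of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            counts["/" + "/".join(stack)] += 1
        else:
            stack.pop()
            elem.clear()  # discard text and children of the finished element
    return counts


# Hypothetical sample standing in for a database dump.
sample = b"<db><rec><name>Ann</name></rec><rec><name>Bob</name><age>7</age></rec></db>"
skeleton = xml_skeleton(io.BytesIO(sample))
for path in sorted(skeleton):
    print(path, skeleton[path])
```

Because the document is read as a stream, the sketch never needs the whole file in a text editor or in one string; a real multi-gigabyte run would still want the counts (and the values per path) kept on disk, which is the role BerkeleyDB plays in Gadget.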

So, you can take all the XML you want, and throw it at Gadget. It will digest it, producing indexes that a web application will later use to present you with that data. You can search for particular values, see value distribution charts, browse the XML skeleton tree and, last but not least, apply clustering functions to the values found in the same XPath to spot things like spelling mistakes. The clustering function I use is very simple and very efficient (linear in algorithmic complexity with the number of values being clustered) but it is surprisingly effective, at least for the datasets I’ve used it on.
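The post does not say which clustering function Gadget uses. As an illustration of how a clustering pass can be linear in the number of values, here is a sketch of one well-known approach, key-collision clustering: each value is reduced to a normalised “fingerprint” and values sharing a fingerprint land in the same group. The normalisation rules, the function names and the sample values are all assumptions made for this example.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Normalise a value so that trivially different spellings collide."""
    # Lowercase, drop punctuation, then sort the unique tokens.
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group values by fingerprint in a single pass over the input."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only groups with more than one distinct spelling are interesting.
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical values found under the same XPath.
values = ["Getty Museum", "getty museum", "Museum, Getty", "ARTstor", "Artstor ", "MIT"]
clusters = cluster(values)
for group in clusters:
    print(group)
```

Each value is hashed once, so the cost grows linearly with the number of values. The trade-off is that this only catches differences in case, punctuation, spacing and word order; real misspellings such as a dropped letter would need a fuzzier key.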

So it’s a discovery tool and it helps the data analyst create a mental model of the XML dataset being analyzed, but also allows for data quality control and error management.

Of course, the next step is to build the equivalent of Gadget for RDF, but the graph data model makes it more complicated since you can’t (easily) have paths. I’ll blog more about this in the future.

Gadget has saved me endless hours of frustration and I even received a book as a gift from somebody that loved it so much. I’ve demoed it at the Getty Museum in Los Angeles and at ARTstor in New York and got everybody very excited.

In Gadget, I also like the fact that I decided to build the presentation software as a web application, even if most of the time you are the only one looking at the data that Gadget generates. Also, it’s written in Java but I didn’t use any framework. I just wrote a simple servlet, following the same pattern as the Cocoon sitemap but orders of magnitude simpler, and used the Jetty 6 Maven plugin, which autorestarts the context when you make a change. So, Maven + Jetty + Eclipse + a simple servlet + Velocity + some JavaScript on the client and you don’t need much else. You get the power of a serious IDE, the solidity of the Java web stack, the fast round trip of interpreted languages, the set of libraries that Java has to offer and the simplicity of downloading them through Maven.

For a guy that spent 7 years working on an XML publishing framework, it was indeed refreshing to find out how fast and simple it was to do just a very specialized web application.

Referee

The other tool was released today and it’s called Referee.

Referee is a command line application that reads your web server logs and automatically finds out what other web pages have to say about your own. Unlike trackbacks, Referee is a completely automated tool: all you need is your server logs and a network connection, and Referee will do the rest. And it will work not only for blogs, but for any URL of your web site, no matter what program generated it.
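The first step, pulling referring pages out of the logs, can be sketched as follows. This is an illustration in Python, not Referee’s own code, and it assumes the logs are in the Apache “combined” format, which the post does not specify. The host names, paths and the `external_referrers` function are invented for the example; fetching each referring page over the network is left out.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed layout: host ident user [date] "METHOD path proto" status bytes "referrer" ...
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"'
)


def external_referrers(lines, own_host):
    """Map each requested path to the set of external pages that referred to it."""
    found = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        referrer = match.group("referrer")
        host = urlparse(referrer).netloc
        if host and host != own_host:  # ignore "-" and links from our own site
            found[match.group("path")].add(referrer)
    return found


# Hypothetical log lines.
log = [
    '1.2.3.4 - - [06/Sep/2006:10:00:00 -0700] "GET /blog/gadget HTTP/1.1" 200 5120 "http://other.example.org/post/1" "Mozilla/5.0"',
    '1.2.3.5 - - [06/Sep/2006:10:01:00 -0700] "GET /blog/gadget HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '1.2.3.6 - - [06/Sep/2006:10:02:00 -0700] "GET /blog/referee HTTP/1.1" 200 2048 "http://www.example.com/blog/" "Mozilla/5.0"',
]
referrers = external_referrers(log, own_host="www.example.com")
for path, pages in referrers.items():
    print(path, sorted(pages))
```

Each referring URL found this way is a candidate page to download and inspect for the text around the link.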

Referee was born out of my curiosity to know who links to my stuff and what they have to say about it. I don’t have comments on my blog, because I think that if you really care to say something about this, you’ll blog on your own (or you’ll write me an email) [it’s a social filter, so to speak]. So I needed a way to harvest all that content, and using Bloglines or Technorati or Google News search is ineffective because they keep feeding me stuff that they have already found, just because it’s at a new URL or because the content of the page has changed.

Referee takes care of the “generation” part of the data, which is then saved as RDF/N3 files for you to consume the way you prefer.

And I hear you ask in frustration: “Why RDF? Atom would have been way easier!”.

Atom is a step forward from general XML because it allows you to split the data model into many tree fragments, each with a unique identifier. But there is a problem with using Atom as a general data modelling language for many separate items: it lacks the ability to model relationships between such items.

Referee is unique because it treats ‘comments’ and ‘pages’ differently. A comment is a piece of text that surrounds the <a> tag that links to your page and it’s identified by its SHA-1 value, while a page is the URL that contains that comment. There are cases where the same content is contained in different pages (different as they have different URLs) or cases where the same page contains different comments. There is no way to model that as a tree; you need a graph.
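A small sketch shows why this is a graph rather than a tree. It is written in Python for illustration and is not Referee’s code: the `CommentGraph` class, the URLs and the whitespace normalisation applied before hashing are all assumptions. Only the use of SHA-1 to identify a comment comes from the description above.

```python
import hashlib
from collections import defaultdict


def comment_id(text):
    """Identify a comment by the SHA-1 of its text."""
    # Collapsing whitespace first is an assumption made for this sketch.
    normalised = " ".join(text.split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


class CommentGraph:
    """Many-to-many relationship between pages (URLs) and comments (hashes)."""

    def __init__(self):
        self.pages_by_comment = defaultdict(set)
        self.comments_by_page = defaultdict(set)

    def add(self, page_url, comment_text):
        cid = comment_id(comment_text)
        self.pages_by_comment[cid].add(page_url)
        self.comments_by_page[page_url].add(cid)
        return cid


graph = CommentGraph()
# The same text appearing under two URLs is one comment on two pages.
first = graph.add("http://a.example.org/post", "Gadget is a neat XML inspector")
second = graph.add("http://a.example.org/archive/2006", "Gadget is a neat  XML inspector")
# One page can also carry several different comments.
third = graph.add("http://a.example.org/post", "Referee reads your server logs")
print(len(graph.pages_by_comment[first]), len(graph.comments_by_page["http://a.example.org/post"]))
```

A comment can belong to many pages and a page can hold many comments, so neither can be nested inside the other without duplication. That is exactly the shape RDF triples express directly.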

So, while Atom might have allowed me to model the single items (pages, comments and feeds), I would have had to extend it with my own markup to model the relationships between these items. I would have ended up reinventing the RDF wheel anyway, and in a way that would be incompatible with RDF tools and ignored by Atom tools.

Note that Referee deals only with the production side of the data cycle; the consumption is left to the user. You can either browse it with RDF tools (such as Longwell or Piggy Bank) or you can feed it into a triple store and run SPARQL queries on top of it (for example to generate an Atom feed of the new comments about a specific URL).

What I find fascinating about Referee is that, just like my spam filter, it’s a great example of a software agent that I use that is smarter than me (at least at doing repetitive jobs without getting numb or bored and making mistakes because of that). The best example of this is a list of comments in character sets that I’m not even able to interpret!

I continue to be amazed by the power of the right combination of tool, itch to scratch and good architectural model.

Ah, forgot to mention: both tools are open sourced using a BSD license.

Enjoy.

Update: The first tool I wrote and released as open source (8 years ago!) is Apache JMeter, which I wrote to test the thread pool code that I had written for Apache JServ. Today I found out that Google uses JMeter as their profiling tool of choice [via Steve Loughran]. A good ego massage.

 

Piggy Bank, Cocoon and the Future of the Web

October 2nd, 2005

Today, after many months of work and a ton of source code inspected, traced and evaluated, we released Piggy Bank 2.1.0.

I’m very proud of this work and very happy to be part of it: it is, to me, the first significant step into a bright future and it got me closer to the mozilla architecture, which, I have to admit, is a pleasure to work with (especially now that we found a way to write XPCOM components in java and therefore we have a ton of existing libraries to use pretty much for free).

Just before last weekend, during my final Piggy Bank wrap-ups, I sent an email to the Cocoon development mailing list airing my concerns: the web is slowly but surely changing. Some call it the Web 2.0, some call it Ajax, some call it “told you!” and some call it “so what?”, but the truth of the matter is that web services are coming and their impact has very little to do with what protocols or architectural decisions you make, and a lot to do with the number of people you manage to catalyze.

Sylvain was the only one who explicitly uncloaked my intent: Cocoon is clearly not obsolete and it won’t be for a while, but it’s fat and sleepy, kinda watching TV (if you allow me) instead of going out exercising. Before I move on, I wanted to trigger a wake-up call.

At the heart of Piggy Bank, there is a web server running inside your web browser. It’s running a servlet, a minimal RESTful framework that David wrote called Flair (modeled after what Mark did for Longwell 1). It’s so simple it’s actually (to my web framework architect’s eyes) embarrassing, yet it does the job: Piggy Bank’s webapp (actually Longwell 2) is fully RESTful, no session, no state, no continuations, everything is passed back and forth urlencoded and urldecoded (yes, this creates issues, but that’s another story).

Since we know that Piggy Bank runs only on Firefox, we can go crazy with DHTML and know that it will work. Ajax is used as client side include, and you can even do templating and XML pipelining directly on the client. With Firefox 1.5, even the need for graphics on the server side is gone, SVG and canvas are embedded, scriptable and fully merged with the browser, no need for the amazing SVG->PNG functionality cocoon offers… also because, guess what, David cloned it with a little servlet called Picto that we now use to have our own color-coded Google Maps placeholders.

All of this, in a fraction of the space and complexity: the entire Piggy Bank, web server + database + full text indexer + webapp framework + template system + RDFizing framework + Firefox extension + icons is 4.5MB… and it’s not even stripped down (if I really wanted to, I could get it down to 1.5MB by using ProGuard but I really don’t see the reason for it).

One thing that I miss from Cocoon is the sitemap, but its cost (in terms of megabytes and complexity) is way too high. Besides, since we use Jetty’s own APIs and not the Servlet web.xml (yes, I know, you think I’ve completely lost my mind at this point, being one of the people who designed that Servlet API web.xml in the first place), we were able to reuse a lot of Jetty’s internals as a web server, reducing the amount of work we have to handle ourselves.

So, in short: all REST; state is never temporarily saved but always transferred until persisted; AJAX pretty much everywhere; a minimal servlet that translates a request into a different action handler, doing the urlencoding and decoding (the controllers, one per command, in Java); RDF as the model and Velocity templates as the view. No pipeline, no multimodality, no XML awareness, no continuations, no sessions. Piggy Bank has, on the server side, the architectural appeal of a CGI-BIN and yes, after 7 years spent designing web application frameworks, I know that to be an insult.
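The shape of that minimal servlet can be sketched in a few lines. This is an illustration in Python, not Flair’s Java code: the command names, the template file names and the `dispatch` function are all hypothetical. It only shows the pattern described above, where one handler per command receives URL-decoded parameters, nothing is kept in a session, and each handler returns a template name plus the model to render.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def browse(params):
    """Hypothetical controller for a 'browse' command."""
    return "browse.vt", {"facet": params.get("facet", "")}


def save(params):
    """Hypothetical controller for a 'save' command."""
    return "saved.vt", {"uri": params.get("uri", "")}


# One controller per command; adding a command means adding an entry here.
COMMANDS = {"browse": browse, "save": save}


def dispatch(url):
    """Translate a request URL into a (template, model) pair, keeping no state."""
    parts = urlsplit(url)
    command = parts.path.strip("/")
    # parse_qs performs the URL decoding; keep the first value of each parameter.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    handler = COMMANDS.get(command)
    if handler is None:
        return "error.vt", {"message": "unknown command: " + command}
    return handler(params)


# All state travels in the URL, so it has to be encoded on the way out.
request = "/browse?" + urlencode({"facet": "type=Person & more"})
print(request)
print(dispatch(request))
```

Since every request carries everything the handler needs, any request can be replayed or bookmarked on its own. The price, as noted above, is that all state has to survive the round trip through URL encoding.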

But the overall result is incredible: light and simple on the server, light and simple on the client. Very easy to learn, very easy to adjust incrementally (once you polish up all the memory leaks and fine tune the database indexes for performance, as we now did).

As many on the (long) thread that started on dev@cocoon.apache.org mentioned, pretty much nobody has the luxury today of owning both the client and the server in a web application. But with Piggy Bank, we do! And, let me tell you, not only does it feel great and refreshing, but it makes you rethink the entire web, a web where the ability to influence the clients is not locked in some vault in Redmond.

Yeah yeah, sure, IE is a long way from gone, and so is Microsoft, but it’s not just the Mozilla Foundation and radical web standards activists pushing for Firefox anymore: it’s already on the radar of too many web sites, forcing IE to compete. And competition is good, especially when there are players like Yahoo, Google, Amazon and eBay that will not just watch from the window as the battle for the new web-service-powered ‘man in the middle’ unfolds.

Netscape’s plan of transforming the browser into a desktop, making the operating system basically an obsolete concept, was called “constellation”. Many believe this plan was what made Microsoft kill Netscape by all possible means. When the Mozilla developers decided to rewrite Netscape Navigator from scratch, many thought they were crazy to build their own browser inside the HTML rendering environment itself. The ultimate flexibility syndrome.

But today, years and thousands of bugs later, Mozilla is a platform capable of delivering a client side tool used by millions. I can’t think of another client side application framework that was able to achieve such tremendous success, other than MFC. Java JFC (aka Swing) failed long ago to deliver that promise on the client (its look and feel is awkward, unreasonably slow and massively overdesigned) and Eclipse RCP needs to take the Mac seriously before anybody will start using it for real (and yeah, the JVM needs to go to the gym with Cocoon too!). Cocoa is great, but it’s another lock-in and the only reasonable answer to whoever gives kids cigarettes for free is “No, thanks”, especially since Apple will very likely kill the need for your little cool app in the next release of the OS anyway, and, no, you don’t get a piece of that money.

So, what does it mean for the future of the web?

I honestly don’t know, that thread was just a way to shake people up. All I know is that I’m proud of what I did for the web, first on JServ for Java and Servlets, then on Cocoon for XML and web frameworks that could deliver the web promise and scale to the number of people involved, and now for SIMILE and the semantic web, of which, if you ask me, this Web 2.0 buzz is just the very beginning.

When people ask me what I do for a living, I say that I research what the web of the future could be. At that point, they ask me to give them an example of what that would mean for them. My usual reply is “if we are successful, the only difference you’ll perceive is that you won’t feel as constantly lost as you feel today”. At that point they smile, happy to meet a technologist who thinks it’s his fault, not theirs, if they can’t do something with his software.

No matter what technology or platform we build the future of the web upon, we need to learn how to write the software that delivers those smiles: anything short of that will be a failure.