I last blogged 7 months ago, around the time I moved to Los Angeles. It cannot be a coincidence, and maybe the saying “I don’t have a blog, I have a life” has some real value… or maybe I need to sit down and fix all those bugs in this linotype of mine… or maybe I’ve just been busy working on tools instead of talking.
While deep inside I’m a hardware guy (with a degree in opto-electronics), I lacked the millions of dollars required to do any serious research in that field (and the patience to climb the social ladder to get access to properly equipped labs). So I turned to software, where all you need is a text editor, a compiler/interpreter, time and patience. Sprinkle a little bit of OCD and ADD over it, add early access to the internet and to like-minded people, and you’re almost there.
Then you need a purpose.
McLuhan once said: “if it works, it’s obsolete”. It’s my motto, basically, or my curse, depending on what day you ask me. So I always decide to work on something that is new and fresh… not that crowded… where you can make an impact… a bleeding edge.
The problem with the ‘bleeding edge’ is not only that it is bleeding but that you have no tools. You can decide to use the existing tools in different ways and “make it work”, or you can write new tools yourself. I consider myself lazy. So lazy, in fact, that instead of doing a boring action twice, I would spend a lot of time and energy building a tool to do it for me. Pretty much every piece of software that I’ve created or helped write was done to save me from doing boring and mindless tasks.
I spend pretty much all my time building tools now. I wrote and released two recently and both try to exercise the power of ‘data emergence’: the sum is way more than the parts, especially when you can get your hands on a lot of data.
Gadget
The first one is Gadget, which I released a few months ago. I’m particularly proud of this tool even if not many people have the problem that it tries to solve. It’s an XML inspector that helps you understand the structure of one or more XML documents, no matter how big. When I say “no matter how big”, I’m not kidding. I’ve used it with several gigabytes of XML.
Gadget was written because when you are given a few database dumps in XML, each a few GB big, you can’t really open them in your text editor and look at them. Sometimes you are given a schema or a DTD, but most of the time the dump won’t even validate against the schema. Moreover, if you are tasked with writing a program that transforms this XML dump into something else (for example RDF/XML), the schema is not enough to tell you, for example, whether a particular value is unique in the dataset or not.
So I wrote Gadget, which basically is an XML parser on BerkeleyDB steroids. It fragments the XML and saves the XPath projections of it. Then you can reconstruct the “skeleton” of the XML, which is the list of all the XPaths that were ever encountered by the parser.
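To make the “skeleton” idea concrete, here is a minimal Python sketch of collecting every path a streaming parser encounters. It is not Gadget’s actual code (Gadget is written in Java and persists its indexes in BerkeleyDB); the function name `xml_skeleton`, the in-memory counter and the sample document are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def xml_skeleton(source):
    """Stream through an XML source and count each distinct path seen.

    Hypothetical helper: paths are kept in memory here, whereas a tool
    meant for multi-gigabyte dumps would write them to an on-disk index.
    """
    paths = Counter()
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            path = "/" + "/".join(stack)
            paths[path] += 1
            # Attributes are already available on the "start" event.
            for name in elem.attrib:
                paths[path + "/@" + name] += 1
        else:
            stack.pop()
            elem.clear()  # drop text and children we are done with
    return paths

sample = (b"<db><rec id='1'><name>Ann</name></rec>"
          b"<rec id='2'><name>Bob</name><city>LA</city></rec></db>")
skeleton = xml_skeleton(io.BytesIO(sample))
for path in sorted(skeleton):
    print(path, skeleton[path])
```

The sorted list of paths is the skeleton; the counts already hint at which elements are optional (here `city` shows up in only one of the two records).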
So, you can take all the XML you want and throw it at Gadget. It will digest it, producing indexes that a web application will later use to present you with that data. You can search for particular values, see the value distribution charts, browse the XML skeleton tree and, last but not least, apply clustering functions to the values found in the same XPath to spot things like spelling mistakes. The clustering function I use is very simple and very efficient (linear in the number of values being clustered) but it is surprisingly effective, at least for the datasets I’ve used it on.
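The post does not say which clustering function Gadget uses. One well-known technique that runs in a single linear pass is key collision: reduce each value to a normalized key and group the values that share a key. The Python sketch below illustrates that technique under this assumption; the names `fingerprint` and `cluster` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Normalize a value into a key (assumed scheme, not Gadget's own)."""
    # Split accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and strip punctuation.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Sort and deduplicate the tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values whose keys collide; one pass over the input."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only groups with more than one spelling are worth reviewing.
    return [sorted(group) for group in groups.values() if len(group) > 1]

names = ["Los Angeles", "los angeles.", "Angeles, Los", "New York"]
print(cluster(names))
```

This kind of key catches differences in case, punctuation, accents and word order. It will not catch a transposed or missing letter, which needs a different key or a distance-based method.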
So it’s a discovery tool and it helps the data analyst create a mental model of the XML dataset being analyzed, but also allows for data quality control and error management.
Of course, the next step is to build the equivalent of Gadget for RDF, but the graph data model makes it more complicated since you can’t (easily) have paths. I’ll blog more about this in the future.
Gadget has saved me endless hours of frustration and I even received a book as a gift from somebody that loved it so much. I’ve demoed it at the Getty Museum in Los Angeles and at ARTstor in New York and got everybody very excited.
In Gadget, I also like the fact that I decided to build the presentation software as a web application, even if most of the time you are the only one looking at the data that Gadget generates. Also, it’s written in Java but I didn’t use any framework. I just wrote a simple servlet, following the same pattern as the Cocoon sitemap but orders of magnitude simpler, and used the Jetty6 Maven plugin that auto-restarts the context when you make a change. So, Maven + Jetty + Eclipse + a simple servlet + Velocity + some JavaScript on the client and you don’t need much else. You get the power of a serious IDE, the solidity of the Java web stack, the fast round trip of interpreted languages, the set of libraries that Java has to offer and the simplicity of downloading them through Maven.
For a guy that spent 7 years working on an XML publishing framework, it was indeed refreshing to find out how fast and simple it was to do just a very specialized web application.
Referee
The other tool was released today and it’s called Referee.
Referee is a command line application that reads your web server logs and automatically finds out what other web pages have to say about your own. Unlike trackbacks, Referee is a completely automated tool: all you need is your server logs and a network connection, and Referee will do the rest. And it will work not only for blogs, but for any URL of your web site, no matter what program generated it.
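The first step of that process, pulling external referrers out of the logs, can be sketched in a few lines of Python. This is not Referee’s code: the regular expression assumes the Apache “combined” log format, and the function name, host names and sample lines are made up for illustration. Fetching each referring page and extracting the text around the link would come afterwards.

```python
import re
from urllib.parse import urlparse

# Assumes the "combined" log format: request, status, size, referrer, agent.
LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"'
)

def external_referrers(lines, own_host):
    """Return the set of (referring URL, local path) pairs found in the log."""
    found = set()
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue
        referrer = match.group("referrer")
        host = urlparse(referrer).hostname  # None for "-" or empty values
        if host and host != own_host:
            found.add((referrer, match.group("path")))
    return found

log = [
    '1.2.3.4 - - [10/Oct/2006:13:55:36 -0700] "GET /blog/gadget HTTP/1.1" '
    '200 2326 "http://other.example/post" "Mozilla/5.0"',
    '1.2.3.4 - - [10/Oct/2006:13:55:40 -0700] "GET /style.css HTTP/1.1" '
    '200 512 "http://my.example/blog/gadget" "Mozilla/5.0"',
    '5.6.7.8 - - [10/Oct/2006:14:01:02 -0700] "GET /blog/gadget HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0"',
]
referrers = external_referrers(log, "my.example")
print(referrers)
```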
Referee was born out of my curiosity to know who links to my stuff and what they have to say about it. I don’t have comments on my blog, because I think that if you really care to say something about this, you’ll blog on your own (or you’ll write me an email) [it’s a social filter, so to speak]. So I needed a way to harvest all that content. Using Bloglines or Technorati or Google News search is ineffective, because they keep feeding me stuff that they have found already, just because it’s at a new URL or because the content of the page has changed.
Referee takes care of the “generation” part of the data, which is then saved as RDF/N3 files for you to consume the way you prefer.
And I hear you ask in frustration: “Why RDF? Atom would have been way easier!”.
Atom is a step forward from general XML because it allows you to split the data model into many tree fragments, each with a unique identifier. But there is a problem with using Atom as a general data modelling language for many separate items: it lacks the ability to model relationships between such items.
Referee is unique because it treats ‘comments’ and ‘pages’ differently. A comment is a piece of text that surrounds the <a> tag that links to your page and it’s identified by its SHA-1 value, while a page is the URL that contains that comment. There are cases where the same content is contained in different pages (different as they have different URLs) or cases where the same page contains different comments. There is no way to model that as a tree; you need a graph.
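A small Python sketch shows why this ends up being a graph. Comments are keyed by the SHA-1 of their text, pages by their URL, and the links between the two are many-to-many. The class and function names are hypothetical, and collapsing whitespace before hashing is an assumption made for this example, not something Referee is known to do.

```python
import hashlib
from collections import defaultdict

def comment_id(text):
    """Identify a comment by the SHA-1 of its text."""
    # Assumption: runs of whitespace are collapsed before hashing.
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

class LinkGraph:
    """Many-to-many relation between page URLs and comment identifiers."""

    def __init__(self):
        self.pages_of = defaultdict(set)     # comment id -> page URLs
        self.comments_of = defaultdict(set)  # page URL -> comment ids

    def add(self, page_url, comment_text):
        cid = comment_id(comment_text)
        self.pages_of[cid].add(page_url)
        self.comments_of[page_url].add(cid)
        return cid

graph = LinkGraph()
a = graph.add("http://one.example/post", "Gadget is a  great XML inspector")
b = graph.add("http://two.example/copy", "Gadget is a great XML inspector")
c = graph.add("http://one.example/post", "Referee looks handy too")

print(a == b)                  # same text, same comment, two pages
print(sorted(graph.pages_of[a]))
print(len(graph.comments_of["http://one.example/post"]))
```

One comment belongs to two pages and one page holds two comments, so neither side can be the parent of the other in a tree.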
So, while Atom might have allowed me to model the single items (pages, comments and feeds), I would have had to extend it with my own markup to model the relationships between these items. I would have ended up reinventing the RDF wheel anyway, and in a way that would be incompatible with RDF tools and ignored by Atom tools.
Note that Referee deals only with the production side of the data cycle; the consumption is left to the user. You can either browse it with RDF tools (such as Longwell or Piggy Bank) or you can feed it into a triple store and run SPARQL queries on top of it (for example to generate an Atom feed of the new comments about a specific URL).
What I find fascinating about Referee is that, just like my spam filter, it’s a great example of a software agent that I use that is smarter than me (if only in doing repetitive jobs without getting numb or bored or making mistakes doing it because of that, at least). The best example of this is a list of comments in character sets that I’m not even able to interpret!
I continue to be amazed by the power of the right combination of tool, itch to scratch and good architectural model.
Ah, forgot to mention: both tools are open sourced using a BSD license.
Enjoy.
Update: The first tool I wrote and released as open source (8 years ago!) is Apache JMeter, which I had written to test the thread pool code that I had written for Apache JServ. Today I find out that Google uses JMeter as their profiling tool of choice [via Steve Loughran]. A good ego massage.