
Freebase Gridworks, Data-Journalism and Open Data Network Effects

May 24th, 2010

Earlier this year, David pinged me over IRC and prodded me to look at a new software prototype he had just created. Just as had happened many times before, I was blown away: what I had in front of me was a game changer. Not only was it a wonderfully executed prototype of obvious usefulness (a rare thing on its own), but it was solid yet flexible in design, and it allowed me to plug in many of my own ideas and code prototypes that had been lying around disconnected in various random projects over the years.

The months after became a wonderful and exciting development collaboration between the two of us (much like in the good old days of SIMILE while we were both at MIT) to take the outstanding ideas and foundation he had built, sprinkle a few of mine and bake it together into a software product that we could be proud of and that we could try to use to bootstrap a network effect around the problem of enticing substantial data contributions to Freebase.

Several months later, that early prototype became Freebase Gridworks.

We knew we were onto something valuable here because while developing we started using the tool itself for daily situations that had nothing to do with our development effort. We were writing software we wanted to use ourselves, for daily tasks, and that’s the best (and rarest!) kind of software.

What we didn’t expect was how much it would resonate with people.

In Praise of Gridworks

Jon Udell wrote a post entitled “PowerPivot + Gridworks = Wow!” where he marveled at the possibilities of mixing the latest data power tool from Microsoft with Gridworks, and in another post wrote (emphasis is mine):

[...] Freebase Gridworks will make you weep with joy.

As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it. As a new wave of visualization tools arrives, there will be more eyeballs on more data, and that’s a great thing. But we’ll also need to be able to lay hands on the data and clean up the messes we can begin to see. As we do, we’ll want to be using tools that do the kinds of things shown in the Gridworks screencasts.

The News Applications team at the Chicago Tribune wrote on their blog:

The genius of Gridworks is that it is generic enough to work for a wide variety of datasets without the need to write any code at all [...] We really can’t say enough about what a great application Gridworks is and about its myriad uses for hacker journalists and data-nerds of all stripe.

Chris Amico of PBS NewsHour tweeted “Gridworks is like crack for data junkies”; Scott Klein of ProPublica tweeted “I think @thejefflarson is going to name a dog after Gridworks.”, speaking of his colleague Jeff Larson; Rich Vázquez of ImpactNews tweeted “I just got to know old data all over again using Freebase Gridworks”; and there are many others that we have collected from the Gridworks Twitter stream.

Data-Journalism

From The Guardian Simon Rogers asks this question:

is data journalism? If you need to ask yourself the question then you are about to miss out on an information bonanza.

Unfortunately, as I have written before, people will soon realize (as Jon also warned above) that the ‘information bonanza’ that Simon is talking about looks a lot more like somebody’s gigantic basement than the well-ordered shelves of a library or the heavily curated archives of a museum. Most importantly, they won’t find any mention of this coming from governments or open data advocates, since they have every intention (and every incentive) to make you believe otherwise.

At the same time, a new breed of electronic investigative journalism is emerging, and it feeds on the perception that there must be golden stories buried in the giant pile of digital ore that open data advocates have helped surface. The problem is now shifting: before, it was getting your hands on the data; now it is surviving information overload, or sifting through all that noise to find the golden digital nuggets worthy of a story.

It’s also important to realize that gold is not only a metaphor here: ProPublica, for example, won no less than the Pulitzer Prize this year (first time ever for an independent, non-profit newsroom that produces investigative journalism in the public interest) for their investigative journalism in partnership with the New York Times. The advantages and rewards for digital journalists are real and tangible, especially in an era when anybody can publish something to the world with a click of a button and individual bloggers can’t afford a “news application development team”.

But there are tons of data manipulation tools out there including free ones: why are people (and especially data journalists) so excited about Gridworks? What is so special about it?

We don’t know for sure (it’s hard to reconstruct resonance, even after it happens), but this is my personal take:

  1. Most data tools assume that data inconsistencies are mistakes and therefore rare. Gridworks was designed around the concept that data inconsistencies are a fundamental property of any dataset; things like alignment, consistency and curation are first-order tasks that need to be done every time a dataset is used for an application different from its original one. Quality is not an absolute property of a dataset, and it’s misleading to assume it is. Gridworks makes complex data manipulation and curation operations natural and uniform, while in other tools, even the most popular ones like Excel, there is a huge gap between trivial search-and-replace operations and fully programmable scripting solutions. Most curation operations require complexity that lies in that no man’s land, and in most tools they are available only through complex scripting and programming; but not in Gridworks.
  2. Gridworks follows the principle that data-first designs are better aligned with natural human cognitive abilities and are also easier to bootstrap, because the return on the invested effort is easier to predict each step of the way. Coupled with the previous point, this means that it should be easy and natural for users to restructure data to follow whatever mental model fits them best and feels most empowering, rewarding and liberating, in contrast with other tools that claim to know better, tell them what to do, and for that reason feel rigid and taxing.
  3. Unlike most data mining tools out there, which focus on creating summarized executive reports or spotting numerical trends or correlations, Gridworks focuses less on numbers and more on relations. Numbers and dates are not the focus of the data model; they are decorations of a relational model between more abstract data points. This makes Gridworks fit a special (and in our opinion extremely fertile but mostly unexplored) functional space between a spreadsheet and a relational database, retaining the data-first incremental familiarity of the first and the querying and filtering capacity of the second.
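To make the first point a bit more concrete, here is a minimal sketch of the kind of curation step that sits between search-and-replace and full scripting: grouping cell values that are probably variants of the same name. This is an illustration only, not Gridworks code; the function names `fingerprint` and `cluster`, the normalization rules and the sample cells are all assumptions made up for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Hypothetical normalization key: ASCII-folded, lowercased,
    punctuation stripped, tokens de-duplicated and sorted."""
    folded = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = re.sub(r"[^\w\s]", "", folded.lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group raw values that share a key; keep only keys with several variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Made-up sample column with the usual mess: case, spacing, word order.
cells = [
    "Chicago Tribune",
    "chicago tribune ",
    "Tribune, Chicago",
    "ProPublica",
    "Pro Publica",
]
clusters = cluster(cells)
for key, variants in clusters.items():
    print(key, "<-", variants)
```

Note that the sketch groups the three spellings of the Tribune but leaves “ProPublica” and “Pro Publica” apart, because their keys differ. Deciding whether those two should be merged is exactly the kind of human judgment call that curation involves, and no single search-and-replace captures it.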

Open Data Network Effects

Resonance and traction are great rewarding properties of a successful product launch but they usually only paint the picture of individual interest, at least at first.

As we have seen with the emphasis on the social aspect of the web in recent years, even the simplest and most trivial of services (say, microblogging services like Twitter) can assume a completely different scope of impact and importance once sustainable network effects come into play. So how does Gridworks fare in the realm of Open Data network effects, and what are the obstacles in its path?

First of all, it’s worth noting that all successful and sustainable network effects share a unique property: the system needs to be beneficial for the individual independently of how many others use it. If this is not the case, a ‘chicken/egg’ problem surfaces where the system is beneficial for them only if many people use it but nobody wants to use it until it’s beneficial for them.

The regular web, the blogosphere and microblogging all share this fundamental property: people find expressing themselves rewarding, independently of how many others read what they write. But these systems naturally create self-sustaining network effects: once other people read what you wrote, they often want to write something too; if it’s easy/cheap enough for them to do so, this starts a chain reaction that sustains the network effect.

Because David and I have been working on untangling chicken/egg problems of the web of data for years (more than 7, now that I think about it) and have gained a lot of experience with previous tools (Timeline, Exhibit and Timeplot) that data lovers really liked, we knew that first and foremost a data tool should feel immediately powerful and rewarding even just for individual use, and that was our major focus for the first phase of Gridworks.

At the same time, no network effect will emerge unless Gridworks becomes even more useful when others use it too.

It is in that spirit that this tweet today from @delineator made me stop and ponder (emphasis is mine):

@symroe I’m making a lot of use of gridworks too – are you uploading your data back into freebase? not sure if I want to give them the scoop

This is something that was in the back of my mind but I had not put in such clear terms before: the people digging for open data gold might be keen to praise and support all efforts that make more free data and free tools available (as they feel it makes it easier for them to find their digital gold), but while they have clear and established incentives to reveal their findings (what is the story and where they found it, which is the foundation of their credibility as journalists), they do not (yet) have incentives to reveal how they got to it or to share the result of the data curation effort with others. This is because they worry that it might only make it easier for others to find other stories from that pile of already cleaned data and thus, de-facto, ‘steal’ it from them.

This is not much different, for example, from what happened with the human genome project when public and private institutions started to race to compile the entire map of human DNA: only when the costs of DNA sequencing became so low as to make their proprietary advantage in data hoarding marginal did private institutions start to share their data with public efforts.

The principal network effect attractor for Gridworks is the notion that internal consistency, external reconciliation and data integration between heterogeneous datasets are surprisingly expensive even for the most trivial and well covered data domain (this is something Metaweb learned the hard way while building Freebase).

This fact makes “curated open data hoarding” an unstable equilibrium: all it takes to disrupt the proprietary advantage of hoarding is one person who is a little less selfish and shares their partially curated datasets in an open shared space in order to split the curation cost with others. This is very similar to the idea of creating a vendor branch of an open source project and making money off the proprietary fork: it works only if the vendor branch is as effective as the open community at keeping up with the innovation and evolution of the ecosystem (and the history of open software shows this is hardly ever a sustainable business model if the underlying community is healthy and vibrant).

Unfortunately, another chicken/egg problem surfaces here: Metaweb and the people relying on Freebase data for their applications won’t be keen on letting people enter badly or partially curated data into the main shared data pool, to avoid diluting the perception of data quality for all other users. On the other hand, curated data hoarding dynamics will remain stable unless Gridworks provides simple and effective ways for people to collaborate on the curation of datasets in an incremental and immediately rewarding way (just like proprietary software development models were perfectly stable before the internet and open development processes lowered coordination costs enough for sharing network effects to become sustainable).

Unlocking this conundrum and lowering the coordination costs enough to make open data curation sharing processes sustainable is what the Gridworks team (and the user-facing side of Metaweb) is going to focus on next.

In the meantime, we can’t wait to see what kind of digital gold people will be able to extract from the open data piles using Gridworks and how they will decorate it and augment it with the data coming from Freebase.