On The Impact of Damage non-locality in Incentive Economies around Data Sharing

June 17th, 2010

For centuries, it was common for scientists to exchange ideas through epistolary discussions. These days, remotely located scientists collaborate via email, or exchange digital documents when they don’t meet face to face. These are far faster and easier to exchange than hand-written letters sent via postal services. Unfortunately, they still retain that ‘after the fact’ property: they are often revealed only when some scholar decides later that they were important enough to dig out and organize.

With that in mind, I find myself excited every time I get the chance to participate in ‘blog rebuttals’ like the ones that David Karger and I have been having lately about requirements, motives and incentives for people to share structured data on the web. Both of us care a great deal about this problem and we still cross paths and cross-pollinate ideas even after I left MIT. We also have very different backgrounds but they overlap enough that we can understand each other’s language even when we try to explain our own (sometimes still foggy) thinking.

It is a rare situation when people from different backgrounds cross paths and earn each other’s respect. It is even rarer when their discussions are aired publicly as they are happening; this creates a very healthy and stimulating environment not only for those participating but also for eventual readers.

In any case, the point of contention in the current discussion is the reasons why people would want to share structured data and what can facilitate it.

It seems to me that the basic (and implicit) assumption of David’s thinking is that because a web of hyperlinked web pages came to exist, it would be enough to understand why it did, replicate the technological substrate (and its social lubrication properties) and the same growth property would apply to different kinds of content.

I question that assumption and I’m frankly surprised that questioning whether the nature of the content can influence the growth dynamics of a sharing ecosystem makes him dismiss it as being related to a particular class of people (programmers) or to a particular class of business models (my employer’s).

It might well be that David is right and the same exact principles apply… but it seems a rather risky thing to take for granted. People post pictures on public sites, write public tweets, contribute to Wikipedia, write public blogs, or create personal web sites; all this is shared and all this is public. These are facts. They don’t publish nearly as much structured data, and this is another fact. But believing that people would do the same with structured data if only there was technology that made it easier or made it transparent is an assumption, not a fact. It implicitly assumes that the nature of the content being contributed has no impact on the incentive economies around it.

And it seems to me a rather strong assumption considering, for example, that it doesn’t hold true for open sharing of software code.

Is it because software programmers are more capricious about sharing? Is it because what’s being shared is considered more valuable? Or is it because the incentive economies around sharing change dramatically when collaboration becomes a necessary condition to sustainability?

Could it be that sharing for independent and dispersed consumption (say, a picture, a tweet, a blog post) is governed by economies of incentives that are different from sharing for collaborative and reciprocal consumption (say, software source code, Wikipedia, designs for Lego Mindstorms robots or electronic circuitry)?

I am the first to admit that it is reasonable to dismiss my questioning for being philosophical or academic, or too ephemeral to provide valuable practical benefits, but recent insights that crystallized collectively inside Metaweb (my employer) make me think otherwise. The trivial, yet far-reaching insight is this:

the impact of mistakes in hypertext is localized,
while the impact of mistakes in structured data or software is not

If somebody writes something false, misleading or spammy on a web page, that action impacts the perceived value of that page but it doesn’t impact any other. Pages have different relevance depending on their location or rank so the negative impact of that action changes depending on the page importance. But the ‘locality of negative impact’ property remains the same: no other page is directly influenced by that action.

This is not true for data or software: a change in one line of code, or one structured assertion, could potentially trigger a cascading effect of damage.
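To make the contrast concrete, here is a minimal sketch of my own, not something taken from Metaweb or Freebase. It assumes that damage spreads along 'depends on' edges, and every page and property name in it is hypothetical. A mistake on a page reaches only that page, while a mistake in one assertion reaches everything derived from it.

```python
from collections import deque

def affected_by(changed, dependents):
    """Return every node reachable from `changed` by following dependency edges."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical web pages: a bad edit on one page does not change what
# any other page says, so no page depends on another here.
pages = {"page_a": [], "page_b": [], "page_c": []}

# Hypothetical structured assertions: derived facts depend on base facts.
facts = {
    "city.country": ["city.continent", "person.nationality"],
    "city.continent": ["person.continent_of_birth"],
    "person.nationality": [],
    "person.continent_of_birth": [],
}

local_damage = affected_by("page_a", pages)
cascading_damage = affected_by("city.country", facts)
print(sorted(local_damage))
print(sorted(cascading_damage))
```

In this toy setup, the mistake on `page_a` touches one node, while the mistake in `city.country` touches all four assertions.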

This explains very clearly, for example, why there are no successful software projects that use a Wikipedia model for collaboration and allow anybody who shows up to modify the central code repository.

Is that prospect equally unstable for collaborative development over structured data? Or is there something in between, some hybrid collaboration model that takes the best practices from the wiki model (which shines in lowering the barrier to entry) and the open software development model (which manages to distill quality in an organic way)?

I understand these questions don’t necessarily apply to the economy of incentives of individuals wanting to publish their structured datasets without the need for collaboration, but I present them here as a cautionary tale about taking the applicability of models for granted.

More than programmers vs. professors, I think the tension between David and me is about the nature of our work: he’s focusing on facilitating the sharing of results from individual entities (including groups), I’m focusing on fostering collaboration and catalyzing network effects between such entities.

Still, I believe that understanding the motives and the incentive economies around sharing, even for purely individualistic reasons, is the only way to provide solutions that meet people’s real needs. Taking them for granted is a very risky thing to do.


Drivers vs. Enablers

June 5th, 2010

I’ve often heard people say that the web exists because of “view source”.

“view source”, if you don’t know what I mean, is the ability that web browsers have to show you the source HTML content of the web page you are currently browsing. If you ask around, pretty much everybody who worked on the web early on will tell you that they learned HTML by example, by viewing the source of other people’s pages. Tricks and techniques were found by somebody, applied, and spread quickly.

There is wide and general consensus that ‘view source’ was a very instrumental tool to easily propagate knowledge and simplify adopting the web as a platform, yet its role is often confused.

“view source” was an enabler, a catalyst; something that makes it easier for a reaction or a process to take place and thus increases rate, effectiveness, adoption, or whatever metric you want to use.

But it is misleading to confuse “view source” for a driver: something that makes it beneficial and sustainable for the process to take place. The principal driver for the web was the ability for people to publish something to the entire world with dramatically reduced startup costs and virtually zero marginal costs. “view source” made it easier and reduced such startup costs, but had nothing to do with lowering marginal costs and certainly had very little to do with the intrinsic world-wide publishing features of the web.

You might think that the current HTML5 vs. Flash diatribe is what’s sparking these considerations, but it’s not: it’s something that Prof. David Karger wrote about my previous post (we deeply enjoy these blog-based conversations). He’s suggesting that while my approach of looking for sustainable models for open data contributions is good and worthwhile, he believes that a more effective strategy can be the one of convincing the tool builders to basically add a “view source” for data and that once that is in place, we wouldn’t have to care as the data would be revealed simply by people using the tools.

It’s easy to see the appeal for such a strategy: the coordination costs are greatly reduced as you have to talk and convince a much smaller population and all composed of people that already care about surfacing data and see potential benefits for further adoption of their toolsets.

On the other hand, it feels to me that it’s confusing enablers for drivers.

The order in which I pose questions in my mind when engineering adoption strategies is normally “why” then “how”: taking for granted that because you have a driver, everybody else must share it or have a similar one can easily lead you astray. The question of motive, of “what’s in it for me?”, might feel materialistic, un-intellectual and limiting, but an understandable and predictable reward is the basis for behavioral sustainability.

David is basing his thoughts around Exhibit and I assume he considers the driver to be the tool itself and its usefulness: it can take your data and present it neatly and interactively without you having to do much work or bother your IT administrators to set up and maintain server-side software. That’s appealing, that’s valuable and that’s easy to explain.

The enabler for the network effect is that “cut/paste data” icon that people can click to obtain the underlying data representation of the model… and do whatever they want with it.

But here is where things start to get interesting when you consider drivers and enablers separately: ‘view source’ was a great enabler for the web because it was useful for other people’s adoption but didn’t impact your own adoption drivers. The fact that others had access to the HTML code of your pages didn’t hurt you in any way… mostly because the complexity of the system was locked on your end in your servers and your domain name is something you control and they can’t replicate. What you had access to was a thin surface of a much more complicated system running on somebody else’s servers. It was convenient to you and your developers to have that view-source and the fact that others benefited from it posed no threats to you.

This is dramatically different in the Exhibit situation (or in many other open data scenarios): not only can you take the data with you, but you can take the entire exhibit. Some people are not bothered by this fact, but you can assume that normal people get a weird feeling when they think that others can just take their entire work and run with it.

This need of ‘preventing people from benefitting from your work without you benefitting from theirs’ is precisely the leverage used by reciprocal copyright licenses (the GPL first, the CC-share-alike later) to promote themselves, but there is nothing in the Exhibit adoption model that addresses this issue explicitly.

If your business is to tell or synthesize stories that emerge from piles of data (journalists, historians, researchers, politicians, teachers, curators, analysts, etc.), we need to think about a contribution ecosystem where sharing your data benefits you in a way that is obvious for you to understand (and to explain to your boss!). Or, as David suggests, a ‘view source’-style model where the individualistic driver is clear and obvious and the collaborative enabler is transparent, meaning that it doesn’t require them to do work and is not perceived as a threat to their individualistic driver.

The thing is: with Exhibit, or with any other system that makes the entire data available (this includes Freebase), the immediate perception that people have is that making their entire dataset available to others is clearly benefiting others and doesn’t seem to offer clear benefits for them (which was the central issue of my previous post).

Sure, you can try to guilt-trip them into releasing their data (cultural pressure) or use reciprocal licensing models (legal pressure), but really, the driver that works best is when people want to collaborate with one another (or are not bothered by others doing it on their own work) because they immediately perceive value in doing so.

Both Exhibit and Gridworks were designed with the explicit goal to be at first drivers for individual adoption (so that you have a social platform to work with) and potential enablers for collaborative action later (so that you can experiment with trying to build these network effects); but a critical condition for the collaborative enabler is that it must not reduce the benefit of individual adoption or otherwise it will reduce its ability to drive network effects.

Think for a second about a web where a ‘view source’ command in a browser pulled the entire codebase out of a website you’re visiting: do you really think it would have survived this long? Remember how heated the debate was when the GPLv3 wanted to contain reciprocal constraints even for software that was just executed and not redistributed (which would have impacted all web sites and web services which are now exempt)?

It is incredibly valuable to be inspired by systems and strategies that worked in the past and by the dynamics that made them sustainable… but we must do so by appreciating both the similarities and the differences if we want to be successful in replicating their impact.

Counterintuitively, what might be required to bootstrap a more sustainable open data ecosystem is not being more open but less: building tools that focus first on protecting individual investments, and then on fostering selective disclosure and collaboration over the disclosed parts.

We sure can (and did) engineer systems that act as Trojan horses for openness (Exhibit is one obvious example), but they have failed so far to create sustainable network effects because, I think, we have not yet identified the dynamics that entice stable and sustainable collaborative models around data sharing.

 

Freebase Gridworks, Data-Journalism and Open Data Network Effects

May 24th, 2010

Earlier this year, David pinged me over IRC and prodded me to look at a new software prototype he had just created. Just as had happened many times before, I was blown away: what I had in front of me was a game changer. Not only was it a wonderfully executed prototype of obvious usefulness (a rare thing on its own), but it was solid yet flexible in design and it allowed me to plug in many of my own ideas and code prototypes that had been lying around disconnected in various random projects over the years.

The months after became a wonderful and exciting development collaboration between the two of us (much like in the good old days of SIMILE while we were both at MIT) to take the outstanding ideas and foundation he had built, sprinkle a few of mine and bake it together into a software product that we could be proud of and that we could try to use to bootstrap a network effect around the problem of enticing substantial data contributions to Freebase.

Several months later, that early prototype became Freebase Gridworks.

We knew we were onto something valuable here because while developing we started using the tool itself for daily situations that had nothing to do with our development effort. We were writing software we wanted to use ourselves, for daily tasks, and that’s the best (and rarest!) kind of software.

What we didn’t expect is how much it resonated with people.

In Praise of Gridworks

Jon Udell wrote a post entitled “PowerPivot + Gridworks = Wow!” where he marveled at the possibilities of mixing the latest data powertool from Microsoft with Gridworks and in another post wrote (emphasis is mine)

[...] Freebase Gridworks will make you weep with joy.

As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it. As a new wave of visualization tools arrives, there will be more eyeballs on more data, and that’s a great thing. But we’ll also need to be able to lay hands on the data and clean up the messes we can begin to see. As we do, we’ll want to be using tools that do the kinds of things shown in the Gridworks screencasts.

The News Applications team at the Chicago Tribune wrote on their blog

The genius of Gridworks is that it is generic enough to work for a wide variety of datasets without the need to write any code at all [...] We really can’t say enough about what a great application Gridworks is and about its myriad uses for hacker journalists and data-nerds of all stripe.

Chris Amico of PBS NewsHour tweeted “Gridworks is like crack for data junkies”;  Scott Klein of ProPublica tweeted “I think @thejefflarson is going to name a dog after Gridworks.” speaking of his colleague Jeff Larson; Rich Vázquez of ImpactNews tweeted “I just got to know old data all over again using Freebase Gridworks” and many others we have collected from the Gridworks twitter stream.

Data-Journalism

At The Guardian, Simon Rogers asks this question:

is data journalism? If you need to ask yourself the question then you are about to miss out on an information bonanza.

Unfortunately, as I have written before, people will soon realize (as Jon also warned above) that the ‘information bonanza’ that Simon is talking about looks a lot more like somebody’s gigantic basement than the well ordered shelves of a library or the heavily curated archives of a museum. Most importantly, they won’t find any mention of this coming from governments or open data advocates, since these have every intention (and every incentive) to make you believe otherwise.

At the same time, a new breed of electronic investigative journalism is emerging and it feeds on the perception that there must be golden stories buried in the giant pile of digital ore that open data advocates have helped surface. The problem now is shifting: before, it was getting your hands on the data; now it is surviving information overload, or sifting through all that noise to find the golden digital nuggets worthy of a story.

It’s also important to realize that gold is not only a metaphor here: ProPublica, for example, won no less than the Pulitzer Prize this year (first time ever for an independent, non-profit newsroom that produces investigative journalism in the public interest) for their investigative journalism in partnership with the New York Times. The advantages and rewards for digital journalists are real and tangible, especially in an era when anybody can publish something to the world with a click of a button and individual bloggers can’t afford a “news application development team”.

But there are tons of data manipulation tools out there including free ones: why are people (and especially data journalists) so excited about Gridworks? What is so special about it?

We don’t know for sure (it’s hard to reconstruct resonance, even after it happens), but this is my personal take:

  1. Unlike most data tools, which assume that data inconsistencies are mistakes and therefore rare, Gridworks was designed around the concept that data inconsistencies are a fundamental property of any dataset; things like alignment, consistency and curation are first-order tasks that need to be done every time a dataset is used for an application different from its original one. Quality is not an absolute property of a dataset, and it’s misleading to assume so. Gridworks makes complex data manipulation and curation operations natural and uniform, while in other tools, even the most popular ones like Excel, there is a huge gap between trivial search-replace operations and fully programmable scripting solutions. Most curation operations require complexity that lies in that no man’s land and in most tools are available only with complex scripting and programming; but not in Gridworks.
  2. Gridworks follows the principle that data-first designs are more aligned with natural human cognitive abilities and are also easier to bootstrap because the return on the invested effort is easier to predict (and forecast) each step of the way. Coupled with the previous point, this means that it should be easy and natural for the user to re-structure data to follow whatever mental model fits them best and feels most empowering, rewarding and liberating, in contrast with other tools that ‘know better’, tell them what to do and for that reason feel rigid and taxing.
  3. Unlike most data mining tools out there that focus on creating summarized executive reports or spotting numerical trends or correlations, Gridworks focuses less on numbers and more on relations. Numbers and dates are not the focus of the data model but they are decorations of a relational model between more abstract data points. This makes Gridworks fit a special (and in our opinion extremely fertile but mostly unexplored) functional space between a spreadsheet and a relational database, retaining the data-first incremental familiarity of the first and the querying and filtering capacity of the second.
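As an illustration of the kind of curation operation that sits in the gap between search-replace and full scripting (point 1 above), here is a minimal sketch of grouping variant spellings by a normalized key. It is my own example and makes no claim about how Gridworks implements anything; the `fingerprint` and `cluster` helpers and the sample values are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Normalize a string so that trivially different spellings share one key."""
    # Split accented characters into base letter + accent, then drop the accents.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and strip punctuation.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Sort and de-duplicate tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values by fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

names = [
    "Chicago Tribune",
    "chicago tribune",
    "Tribune, Chicago",
    "ProPublica",
    "Pro Publica",
]
clusters = cluster(names)
print(clusters)
```

The three ‘Chicago Tribune’ variants collapse into one group that a person can review and merge in a single step. ‘ProPublica’ and ‘Pro Publica’ do not, which shows why this kind of curation still needs a human in the loop.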

Open Data Network Effects

Resonance and traction are great rewarding properties of a successful product launch but they usually only paint the picture of individual interest, at least at first.

As we have seen with the accent on the social aspect of the web in recent years, even the simplest and most trivial of services (say, microblogging services like Twitter) can assume a completely different scope of impact and importance once sustainable network effects come into play. So how does Gridworks fare in the realm of Open Data network effects and what are the obstacles on its path?

First of all, it’s worth noting that all successful and sustainable network effects share a common property: the system needs to be beneficial for the individual independently of how many others use it. If this is not the case, a ‘chicken/egg’ problem surfaces where the system is beneficial for them only if many people use it but nobody wants to use it until it’s beneficial for them.
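The chicken/egg problem can be sketched with a toy adoption model. This is my own simplification, not a measured result: each person adopts when their benefit exceeds their personal cost, the benefit is assumed to be a standalone value plus a network value scaled by the fraction of people already on board, and all numbers are made up.

```python
def equilibrium_adopters(standalone_value, network_value, costs, max_rounds=1000):
    """Iterate adoption until the number of adopters stops changing.

    benefit = standalone_value + network_value * (fraction already adopted)
    A person adopts when that benefit exceeds their personal cost.
    """
    population = len(costs)
    adopters = 0
    for _ in range(max_rounds):
        benefit = standalone_value + network_value * adopters / population
        new_adopters = sum(1 for cost in costs if cost < benefit)
        if new_adopters == adopters:
            break
        adopters = new_adopters
    return adopters

# Hypothetical population of ten people with adoption costs from 0.1 to 1.0.
costs = [i / 10 for i in range(1, 11)]

# No standalone value: nobody moves first, however large the network value.
stuck = equilibrium_adopters(standalone_value=0.0, network_value=2.0, costs=costs)

# A modest standalone value brings in the first few, and the rest follow.
bootstrapped = equilibrium_adopters(standalone_value=0.25, network_value=2.0, costs=costs)

print(stuck, bootstrapped)
```

In this toy model the network value is identical in both runs; only the standalone value decides whether adoption ever starts.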

The regular web, the blogosphere and microblogging all share this fundamental property: people find expressing themselves rewarding, independently of how many others read what they write. But these systems naturally create self-sustaining network effects: once other people read what you wrote, they often want to write something too; if it’s easy/cheap enough for them to do so, this starts a chain reaction that sustains the network effect.

Because David and I have been working on untangling chicken/egg problems of the web of data for years (more than 7 now that I think about it) and gained a lot of experience with previous tools (Timeline, Exhibit and Timeplot) that data lovers really liked, we knew that first and foremost a data tool should feel immediately powerful and rewarding even just for individual use and that was our major focus for the first phase of Gridworks.

At the same time, no network effect will emerge unless Gridworks becomes even more useful when others use it too.

It is in that spirit that this tweet today from @delineator made me stop and ponder (emphasis is mine):

@symroe I’m making a lot of use of gridworks too – are you uploading your data back into freebase? not sure if I want to give them the scoop

This is something that was in the back of my mind but I had not put in such clear terms before: the people digging for open data gold might be keen to praise and support all efforts that make more free data and free tools available (as they feel it makes it easier for them to find their digital gold), but while they have clear and established incentives to reveal their findings (what is the story and where they found it, which is the foundation of their credibility as journalists), they do not (yet) have incentives to reveal how they got to it or to share the result of the data curation effort with others. This is because they worry that it might only make it easier for others to find other stories from that pile of already cleaned data and thus, de facto, ‘steal’ it from them.

This is not much different, for example, from what happened with the human genome project when public and private institutions started to race to compile the entire map of the human DNA: only when the costs of DNA sequencing became so low as to make their proprietary advantage in data hoarding marginal did private institutions start to share their data with public efforts.

The principal network effect attractor for Gridworks is the notion that internal consistency, external reconciliation and data integration between heterogeneous datasets are surprisingly expensive even for the most trivial and well covered data domain (this is something Metaweb learned the hard way while building Freebase).

This fact makes “curated open data hoarding” an unstable equilibrium: all it takes to disrupt the proprietary advantage of hoarding is one person being a little less selfish and sharing their partially curated datasets in an open shared space in order to share the curation cost with others. This is very similar to the idea of creating a vendor branch of an open source project and making money off the proprietary fork: it works only if the vendor branch is as effective as the open community at keeping up with the innovation and evolution of the ecosystem (and the history of open software shows this is hardly ever a sustainable business model if the underlying community is healthy and vibrant).

Unfortunately, another chicken/egg problem surfaces here: Metaweb and the people relying on Freebase data for their applications won’t be keen on letting people enter badly or partially curated data into the main shared data pool, to avoid diluting the perception of data quality for all other users. On the other hand, curated data hoarding dynamics will remain stable unless Gridworks provides simple and effective ways for people to collaborate on the curation of datasets in an incremental and immediately rewarding way (just like proprietary software development models were perfectly stable before the internet and open development processes lowered coordination costs enough for sharing network effects to become sustainable).

Unlocking this conundrum and lowering the coordination costs enough to make open data curation sharing processes sustainable is what the Gridworks team (and the user-facing side of Metaweb) is going to focus on next.

In the meantime, we can’t wait to see what kind of digital gold people will be able to extract from the open data piles using Gridworks and how they will decorate it and augment it with the data coming from Freebase.
