The arXiv blog joins MIT TechReview
March 13th, 2009
I’m not sure how many people understand how huge this is, so let me explain how I see it.
arXiv is an e-print archive built and run by the Cornell University Library. Basically, it is a big repository of scientific articles (more than 500k right now), all in digital form (PDFs), all with permanent identifiers and all freely available.
arXiv has existed for 17 years and the number of article contributions per month has been increasing linearly over all 17 years(!).
500k items in a collection is a lot, but it’s relatively small compared to, say, the web, scientific publishers or even any decent-sized university library catalog.
What’s fascinating about arXiv (and citeseer and other similar repositories) is that submissions are not peer reviewed: storing and distributing 500k articles costs only marginally more than storing and distributing 10k, which is why the need for up-front filtering drops dramatically.
The knee-jerk reaction to this from old-school scientists and librarians is normally something between horror and disgust: if anybody is allowed to publish, the result is that quantity will increase and quality will drop.
This reaction comes from a model of books on shelves and journal articles in stacks, located by searching an index catalog (electronic these days, cards when I was born) by metadata and subject term. In that world, yes, a lower quality filter yields substantially lower precision and recall for any search, and vastly dilutes cataloging and curating efforts.
But arXiv is a highly automated system and runs on full-text analysis and self-submitted article metadata and subject classification. It gives every article a permanent identifier and it also links to “citebase”, which tracks references to it from other papers.
There is no a priori peer review, subject analysis or librarian curation: it’s all done a posteriori, by analyzing the content of the article and the behavior of people around it.
Yes, there is tons of crap in there but there are also incredible gems (such as this one, one of my favorite articles).
We all agree that distinguishing between the good and the bad is valuable, but the reshaping of social landscapes and economies of scale makes the old tools feel inadequate and suboptimal, including print publishing and the politics- and affiliation-driven peer review system that keeps it alive.
One such new tool has been the arXiv blog: a human curator who watches over the stream of articles entering arXiv (mostly to spot and flag abuses of the system, I would guess) decided to blog about the gems found in the process.
The blog has always been very interesting and witty, and one of my favorite ways to find new scientific discoveries, coming from all sorts of places and without the need for a big university affiliation to make it into the establishment.
The news of today is that from now on the arXiv blog will be exclusively hosted inside the MIT TechReview web site. This is huge not only because it will bring exposure to arXiv and to more peripheral scientific research, but because it sets a small but substantial milestone in the acceptance of a commit-then-review scientific publishing world (as opposed to the de facto standard review-then-commit model that has been in place for centuries).
Don’t get me wrong: this is an important step but it’s still a pretty small one. What makes me happy about this is that at least we seem to be walking in the right direction. It is a direction that, by giving anybody a chance to publish their thoughts without having to convince others of their value a priori, will hopefully spark more variety of thought and more diversity in research, and will focus more on publishing to improve scientific merit rather than publishing to improve your position in the network of academic influence.