Open Library – Jean-Etienne's blog

Aaron Swartz, a 24-year-old hacker, was recently indicted on data theft charges for downloading over 4 million documents from JSTOR, a US-based online system for archiving academic journals. Mainstream media (Reuters, Guardian, NYT, Time, …) reported this with a mix of fact and fiction. I guess that the recent attacks by hacking groups on well-known websites, and the release on the internet of the data they stole, gave this story some spice.

First, I really appreciate what Aaron Swartz has done and is currently doing. From the Open Library, web.py and RSS to the Guerilla Open Access Manifesto and Demand Progress, he has brought a lot to the computing world and to public awareness of how knowledge is distributed.

Other blogs around the world are already talking about this, and some are standing up for him. I especially liked The Economics of JSTOR (John Levin), The difference between Google and Aaron Swartz (Kevin Webb) and Careless language and poor analogies (Kevin Smith). I also encourage you to show your support for Aaron, as I think he’s only the scapegoat for a bigger process …

I also think Aaron Swartz went too fast. If you do the maths (see appendix below), the download speed was approximately 49 MB per second. Even on a network as busy as MIT's, this sustained volume of traffic coming from a single computer (or a few, if you forge your addresses) is easily spotted. I understand he might have been in a hurry given that his access was not fully legal (although I think it initially was). Downloading that fast was the best thing to do if he wanted to collect as many files as possible in the shortest period of time.

This led me to wonder what the goal behind this act was.

Some people stated it was his second attempt at downloading large amounts of data (which is not exactly true), depicting him as a serial offender. Others stated that his motives were purely academic (text-mining research, JSTOR Data For Research being somewhat limited). One can also think of an act similar to those of Anonymous or LulzSec, which were in the press recently. Or money, maybe (4*10^6 articles at an average of $15 per article makes $60 million), although this seems highly unlikely. Or was it simply the application of his Guerrilla Open Access Manifesto?

What also puzzles me is the goal of JSTOR. It constantly repeats that it supports scholarly work and access to knowledge around the world. In its news statement, it says that the decision to prosecute Aaron Swartz was not its own but the US Attorney’s Office’s. But at the same time, it stresses that it secured “the content” and made sure it will not be distributed. And the indictment doesn’t contain anything related to intellectual property theft. The only portion related to the content concerns fraudulent access to “things of value”.

I think one of the issues JSTOR has is that it doesn’t actually own the material it sells to scientists. The actual publishers dictate what JSTOR can digitize and what it can’t. And unfortunately, they only see these papers as “things of monetary value”.

However, these things are actual scientific knowledge, usually from a distant past and usually no longer under copyright. Apart from the cost of digitizing and of building the search engine database (both of which Google Books and Google Scholar provide for free, as does the Gutenberg project in another area), all the costs related to the dissemination of these papers have already been covered, usually for a long time. The irony is that some of the papers behind the JSTOR paywall are sometimes even freely available elsewhere (for example, in institutions’ and societies’ repositories).

It wouldn’t have cost much to put all these articles under an Open Access license while transferring them to JSTOR. JSTOR would then charge for the actual digitizing work but wouldn’t have to “secure the content” against redistribution, since redistribution would then be allowed. The not-for-profit service provided by JSTOR would then benefit knowledge instead of being one additional roadblock to it.

JSTOR, don’t become the RIAA or the MPAA of old scholarly content!

Appendix. The maths

In “retaliation”, Gregory Maxwell posted 32 GB of data containing 18,592 JSTOR articles on the internet. This is an average of 1.762 MB per JSTOR article. Aaron Swartz downloaded 4*10^6 articles from JSTOR, which represents approximately 6.723 TB of data. That took him 4 days (September 25th and 26th, and October 8th and 9th, 2010), at an average of 1,721.17 GB per day. If we assume the computer was working 10 hours per day (he had to plug and unplug the computer during working hours), the average download speed is 172 GB per hour, or 2.869 GB per minute, or 48.958 MB per second.
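
For anyone who wants to check or tweak these numbers, here is a minimal sketch of the same arithmetic in Python. The function and parameter names are made up for illustration. The 10 hours per day is the assumption stated above, and binary units (1 GB = 1024 MB) are assumed because that is what reproduces the figures in this appendix.

```python
# Back-of-the-envelope estimate of the average download rate.
# Assumption: binary units (1 TB = 1024 GB, 1 GB = 1024 MB).

MB_PER_GB = 1024
GB_PER_TB = 1024
SECONDS_PER_HOUR = 3600


def estimate_download_rate(sample_gb, sample_articles, total_articles,
                           days, hours_per_day):
    """Extrapolate from a sample of articles to the whole download.

    sample_gb / sample_articles give the average article size, which is
    then scaled up to total_articles and spread over the active hours.
    """
    mb_per_article = sample_gb * MB_PER_GB / sample_articles
    total_gb = mb_per_article * total_articles / MB_PER_GB
    gb_per_day = total_gb / days
    gb_per_hour = gb_per_day / hours_per_day
    return {
        "mb_per_article": mb_per_article,
        "total_tb": total_gb / GB_PER_TB,
        "gb_per_day": gb_per_day,
        "gb_per_hour": gb_per_hour,
        "gb_per_minute": gb_per_hour / 60,
        "mb_per_second": gb_per_hour * MB_PER_GB / SECONDS_PER_HOUR,
    }


if __name__ == "__main__":
    # Figures taken from the appendix: 32 GB for 18,592 articles,
    # 4 million articles in total, 4 days at an assumed 10 hours per day.
    rates = estimate_download_rate(
        sample_gb=32,
        sample_articles=18_592,
        total_articles=4_000_000,
        days=4,
        hours_per_day=10,
    )
    for name, value in rates.items():
        print(f"{name}: {value:.3f}")
```

Changing `hours_per_day` shows how sensitive the result is to that guess: if the computer had been running 24 hours a day, the average rate would be much lower.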

Photo credit: Boston Wiki Meetup by Sage Ross on Flickr (CC-by-sa).