Recently at work we've been making some minor changes to our handling of "auxiliary files" - movies, supplementary information, or datasets provided by authors that go beyond the normal article text, figures, and tables (all in XML or PDF format) that we usually publish. The issue of archiving datasets in particular has been on my mind. One motivation is my own past experience of wondering what to do with large collections of (in my case computer-simulated) data generated in the course of doing research. I probably still have some of it - whatever I thought most significant - stored somewhere on the laptop I'm writing from now, though I'm not sure what I would do with it after 20 or more years of neglect. Would it even be worth anything, to me or to anybody else, to make it available? Recently, advocates of scientific openness - Michael Nielsen's Physics World article, for example - have made a strong case for sharing with the world. But something still seems to be missing before I, or I suspect most active researchers, would feel motivated enough to put all our data out there.
These questions, and many other issues around scientific datasets, are impressively tackled in a paper posted last fall: To share or not to share: Publication and quality assurance of research data outputs, by Alma Swan and Sheridan Brown, commissioned by the UK Research Information Network. They have a large number of findings and recommendations - lots of good stuff in there.
Along similar lines, but considerably narrower in scope, is a recent paper from the OECD: "We Need Publishing Standards for Datasets and Data Tables". It presents detailed examples of problems with the citation and use of OECD tables, and makes specific recommendations for improving the situation, though these seem to me a little too tightly coupled to the sort of large-scale economic data the OECD collects. Nevertheless, it's an interesting, detailed specification of two components of the dataset problem - metadata and citation standards - applicable at least in their context. One of the issues is dynamic vs. static datasets: if a dataset is constantly updated, how do you indicate which version you used in your research? And perhaps more importantly, how do you ensure that a copy of that version remains retrievable into the future?
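One lightweight way to pin down which version of a dynamic dataset you used - my own sketch, not anything the OECD paper prescribes - is to record a cryptographic checksum of the snapshot you actually downloaded, alongside the access date (the file contents below are a made-up stand-in):

```python
import datetime
import hashlib

def dataset_fingerprint(path: str) -> dict:
    """Record enough metadata to identify the exact snapshot used."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large datasets don't need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            sha256.update(chunk)
    return {
        "file": path,
        "sha256": sha256.hexdigest(),
        "accessed": datetime.date.today().isoformat(),
    }

# A tiny stand-in for a downloaded data table (hypothetical content).
with open("snapshot.csv", "w") as f:
    f.write("year,gdp\n2007,14.48\n2008,14.72\n")

info = dataset_fingerprint("snapshot.csv")
print(info["sha256"])
```

The checksum identifies the version unambiguously, but of course it only helps with retrievability if some archive actually keeps a copy of that exact snapshot.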
The OECD has been assigning Digital Object Identifiers (DOIs) to its datasets, pointing not to the actual data but to a "homepage" for each dataset, which contains the metadata and other descriptive information, with links from there to the actual data in various file formats. For dynamic datasets there is only one DOI, and their citation recommendation adds an "(accessed on Date)" line to indicate which version was used. There's obviously a bit of a dilemma here: it makes a lot of sense to point people to a generic page for a given data collection, but for reproducibility there ought to be a unique pointer of some sort to an immutable "digital object" associated with each version of the dataset, and we're given no promise of that here.
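An OECD-style citation with an access date is easy to generate mechanically. A minimal sketch - the DOI and dataset title below are fabricated placeholders, though doi.org is the standard resolver proxy that turns a DOI into its landing page:

```python
import datetime

def cite_dataset(authors: str, title: str, doi: str,
                 accessed: datetime.date) -> str:
    """Build a dataset citation carrying both the stable DOI and
    the access date that disambiguates which version was consulted."""
    return (f'{authors}, "{title}", https://doi.org/{doi} '
            f"(accessed on {accessed.isoformat()})")

# Fabricated example identifier, not a real OECD DOI.
citation = cite_dataset("OECD", "Example Data Table",
                        "10.0000/example-dataset",
                        datetime.date(2009, 2, 1))
print(citation)
```

The access date does the work of version selection here, which is exactly the weak point: nothing in the scheme guarantees that the version visible on that date is still retrievable later.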
It was with much excitement that we heard in early 2008 about Google's Palimpsest project (later Google Research Datasets), which was to house large scientific datasets on Google's servers and provide worldwide access to them. But by the end of the year the project was being shut down, apparently a victim of financial belt-tightening. Amazon has a similar public datasets project, not limited to scientific data (it currently hosts an interesting variety, from street maps to Wikipedia dumps to genomics). Amazon is taking advantage of its experiments in "cloud computing" here, and in fact the datasets are accessible only through an account within their cloud (so they are not providing them for free - but a real revenue stream is not a bad thing for this sort of project). Both the Google and Amazon projects were focused on very large datasets - tens of gigabytes to terabytes or more - while much useful research data comes in considerably smaller packages.
So none of these yet seems suited to the research data of the individual or small-group scientist who might be inclined to share their results in machine-readable form with the world. There is a definite need for standards, and something more cross-disciplinary than the current mish-mash of field-dependent approaches would be welcome. Centralized storage on some sort of cloud computing system seems like the logical way to go as well - preferably a distributed, inter-operable collection, with redundant data storage to guard against failures (something like the LOCKSS idea). A way to guarantee some form of immutability once a dataset is published would be an important step too (with processes for additions and errata). The Swan-Brown study provides a good overview of the challenges here. Doing something about dataset publishing really seems an essential step forward.
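As a rough illustration of the redundancy idea - my own sketch, not the actual LOCKSS protocol - the core of it is keeping multiple copies and periodically auditing them against each other, so that a copy that disagrees with the majority can be flagged and repaired:

```python
import hashlib
from collections import Counter

def audit_replicas(replicas: dict) -> dict:
    """Compare content hashes across replicas; the majority wins.

    `replicas` maps a storage-node name to the bytes it holds.
    Returns the consensus hash and the list of damaged nodes.
    """
    digests = {node: hashlib.sha256(data).hexdigest()
               for node, data in replicas.items()}
    consensus, _ = Counter(digests.values()).most_common(1)[0]
    damaged = [node for node, d in digests.items() if d != consensus]
    return {"consensus": consensus, "damaged": damaged}

# Three copies of a published dataset; one has silently corrupted.
good = b"temperature,1998,14.6\n"
report = audit_replicas({
    "node-a": good,
    "node-b": good,
    "node-c": b"temperature,1998,14.0\n",  # bit rot
})
print(report["damaged"])  # prints ['node-c']
```

The same content hashes that detect damage could also serve as the immutable per-version identifiers argued for above: once a version is published, its hash never changes.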