Whither Scientific Datasets?

Recently at work we've been making some minor changes to the handling of "auxiliary files" - movies, additional information, or datasets provided by the authors that go beyond the normal article text, figures and tables (all XML or PDF format) that we usually publish. The issue of archiving datasets in particular has been on my mind. One motivation is my own past experience of wondering what to do with large collections of (in my case computer-simulated) data generated in the process of doing research. I probably still have some of it, the part I thought most significant, stored somewhere on the laptop I'm writing from now, though I'm not sure what I would do with it after 20 or more years of neglect. Would making it available even be worth anything to me or anybody else? Recently advocates of scientific openness, for example Michael Nielsen in his Physics World article, have made a strong case for sharing with the world. But there seems to be something still missing before I, or I suspect most active researchers, would feel the necessary motivation to put all their data out there.

These questions and many other issues for scientific datasets are impressively tackled in a paper posted last fall: "To share or not to share: Publication and quality assurance of research data outputs", by Alma Swan and Sheridan Brown, commissioned by the UK Research Information Network. They have a large number of findings and recommendations, of which I'd like to highlight a few:

  • Research datasets come in three main types with different values and constraints: observational, experimental, and the output of computer simulations
  • Any given type has an original "raw" state; analysis results in "derived" or "reduced" data
  • Many datasets are stored by researchers themselves in a more or less haphazard manner
  • Centralized data centres have significant advantages, but are so far not able to handle all the data generated in research. Distributed data storage may be more robust, but standards, expertise and resources are a problem.
  • Researchers don't generally receive much recognition or reward for publishing datasets, as opposed to more standard peer-reviewed publications
  • Some research data is subject to legal or ethical constraints on sharing; additionally researchers may have reason to reserve exclusive access to their own data for a time
  • Accessibility and usability are important: citation standards and permanent locations for the data are needed; PDF format is not easily re-usable (machine-readable formats are needed)
  • Metadata or explanatory text regarding the data is essential, but may be difficult to create
  • Quality and trust are significant issues - how should peer review work for datasets?
  • There is a need for long-term funding for dataset preservation, where that data is considered of potential long term value
  • Specific recommendations for funding agencies:
    • promoting more actively through the use of case studies the benefits and the value to
      researchers of data publishing
    • providing visible top level support, and offering career-related rewards, to researchers who
      publish high-quality data
    • providing expert support to enable researchers to produce sound data management plans, and
      closely reviewing the quality of those plans when they assess grant applications
    • making clear to applicants for grants and to reviewers that including a budget to cover data
      management – including the provision of a dedicated data manager where appropriate – will
      not adversely affect a grant application
    • providing better information about and access to sources of expert advice on how most
      effectively to publish and to re-use data
    • developing strategies to address the current skills gaps in data management
    • promoting and providing better information about the mechanisms available to data creators to
      control access to and use of their data (e.g. embargoes, restricted access, licence conditions)
    • promoting improved access to research data through better discovery tools and metadata
    • identifying and documenting by subject area the barriers to effective re-use of data, and
      promoting guidance on good practice
    • promoting the “freeze and build” approach to dynamic datasets, where original data may be
      amended, added to, or replaced by newer data at a later date.

Lots of good stuff in there.

Along similar lines, but considerably narrower in scope, was a recent paper from the OECD: "We Need Publishing Standards for Datasets and Data Tables". This presents some detailed examples of issues with citation and use of OECD tables, and has specific recommendations for improving the situation, but they seem to me to be a little too tightly coupled to the sort of large-scale economic data the OECD collects. Nevertheless, it's an interesting, detailed specification of two components of the dataset problem - metadata and citation standards - applicable at least to their context. One of the issues is dynamic vs. static datasets - if a dataset is constantly updated, how do you indicate which version you used in your research? And perhaps more importantly, how do you ensure that a copy of that version of the dataset remains retrievable into the future?

The OECD has been assigning Digital Object Identifiers (DOIs) to their datasets. These point not to the actual data but to a "homepage" for the dataset that contains the metadata and other descriptive information, with links from there to the actual data in various file formats. For dynamic datasets there would be only one DOI, and their citation recommendation includes adding an "(accessed on Date)" line to indicate which version was used. There's obviously a bit of a dilemma here - it makes a lot of sense to want to point people to a generic page for a given data collection, but for reproducibility it seems like there ought to be a unique pointer of some sort to an immutable "digital object" associated with each version of the dataset, and we're given no promise of that here.
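To make the dilemma concrete, here is a minimal sketch of one way a citation could pin down the exact version used: keep the single DOI and the access date, and add a cryptographic fingerprint of the bytes that were actually downloaded. This is my own illustration, not something the OECD paper proposes; the DOI, the table contents, and the names DatasetCitation, fingerprint and cite are all made up for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetCitation:
    """A citation that pins down one exact version of a dataset."""
    doi: str        # identifier of the dataset's landing page
    accessed: date  # the "(accessed on Date)" part of the citation
    sha256: str     # fingerprint of the exact bytes that were used

    def text(self) -> str:
        return (f"doi:{self.doi} (accessed on {self.accessed.isoformat()}, "
                f"sha256:{self.sha256})")


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset contents."""
    return hashlib.sha256(data).hexdigest()


def cite(doi: str, data: bytes, accessed: date) -> DatasetCitation:
    """Build a citation from a DOI, the downloaded bytes, and the access date."""
    return DatasetCitation(doi=doi, accessed=accessed, sha256=fingerprint(data))


if __name__ == "__main__":
    # Hypothetical DOI and made-up table contents, for illustration only.
    doi = "10.0000/example-dataset"
    version_1 = b"year,value\n2007,1.0\n"
    version_2 = b"year,value\n2007,1.0\n2008,1.2\n"

    first = cite(doi, version_1, date(2009, 1, 15))
    second = cite(doi, version_2, date(2009, 3, 1))

    print(first.text())
    print(second.text())
    # Same DOI, but the fingerprints differ because the data changed.
    print(first.sha256 != second.sha256)
```

A fingerprint like this only lets a later reader check whether the copy they have matches the one that was cited. It does nothing to guarantee that the cited version can still be retrieved, which is the harder half of the problem.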

It was with much excitement that we heard in early 2008 about Google's Palimpsest Project (later Google Research Datasets), to house large scientific datasets on the Google servers and provide worldwide access to them. But by the end of the year, the project was being shut down, apparently a victim of financial belt-tightening. Amazon has a similar public datasets project, not limited to scientific data (currently hosting an interesting variety, from street maps to Wikipedia dumps to genomics). Amazon is taking advantage of their experiments in "cloud computing" here, and in fact the datasets can only be accessed through an account within their cloud (so they are not providing them for free - but a real revenue stream is not a bad thing for this sort of project). Both the Google and Amazon projects were focused on very large datasets - tens of gigabytes to terabytes or more - while much useful research data comes in considerably smaller packages.

So none of these efforts yet seems suited to the research data of the individual or small-group scientist who might be inclined to share their results with the world in a machine-readable format. There is definitely a need for standards, and something more cross-disciplinary than the current mishmash of field-dependent approaches would be nice. Centralized storage on some sort of cloud computing system seems like the logical way to go as well - preferably a distributed, inter-operable collection, with redundancy in data storage to guard against failures (something like the LOCKSS idea). A way to guarantee some form of immutability once published would be an important step as well (with processes for additions and errata). The Swan-Brown study provides a good overview of the challenges here. Doing something about dataset publishing really seems an essential step forward.