There is a giant and rapidly growing wild-west-like expanse in scholarly communications. It has few boundaries and few rules, and it appears as expansive as the Big Sky country of Montana. I’m speaking of the world of research data, which has exploded in both size and scope since the turn of the millennium. An often-quoted report by IDC in 2008 (http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf) concluded that the pace of data creation had exceeded the capacity to store that information, and with the rapid implementation of sensors and data creation tools of every type, this trend is unlikely to abate. Diverse and complex problems exist in managing all this data.
External factors are also driving this growth in data availability and distribution. In 2007, President Bush signed the America COMPETES Act (PL 110–69) into law, which among many other things requires civilian federal agencies that conduct scientific research to “develop and issue an overarching set of principles to ensure the communication and open exchange of data and results to other agencies, policymakers, and the public.” This led various organizations, both within and outside the federal government, to review their policies on data management. In October, the National Science Foundation amended its grant proposal submission guidelines to require the inclusion of a detailed Data Management Plan. This change supports NSF’s new Data Sharing Policy (http://www.nsf.gov/bfa/dias/policy/dmp.jsp), which states that recipients of grants are “expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.” NSF is not the only grant-funding organization to expect awardees to facilitate and participate in data sharing. The National Institutes of Health has been a leader in promoting data sharing (http://grants.nih.gov/grants/policy/data_sharing/) since 2003. Other sponsors of research, such as the Wellcome Trust (http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/wtx035043.htm), a global charitable foundation that sponsors research in bioscience, medicine, and the environment, and the Australian Research Council (http://ands.org.au/guides/code-awareness.html), have implemented policies on data sharing. These are only a few examples among many throughout the world.
While the number of organizations demanding that scholars share their data is increasing, there is not yet a clear understanding of how to accomplish all the sharing that is being mandated. The political, legal, technical, curatorial, and publication aspects of data sharing are problems our community will be addressing for a considerable time to come. Several organizations have begun addressing aspects of the complexity, including CODATA (http://www.codata.org/taskgroups/), ICSTI (http://www.icsti.org/documents/Numeric_Data_FINAL_report.pdf), Science Commons (http://neurocommons.org/report/data-publication.pdf), the Dataverse Network Project (http://thedata.org/citation/standard), NISO (http://www.niso.org/workrooms/supplemental), and the UK’s Digital Curation Centre (http://www.dcc.ac.uk/resources/policy-and-legal/policytools-and-guidance). As the problems have grown in complexity, the number and scope of organizations investing time and energy in this space are increasing rapidly.
This growth in interest by organizations around the world makes the issue of coordination increasingly important. A favorite joke regarding standards is particularly relevant to the current situation regarding data distribution. Connie Morella, former congresswoman and ambassador to the Organisation for Economic Co-operation and Development, said during an ANSI World Standards Day gala, “Standards are like toothbrushes. Everybody wants one, but nobody wants to use anybody else’s.” This is especially true in the area of research data, which spans such a broad swath of the research community. What is taking place on one end of the earth in a particular discipline is often at odds with another project halfway around the globe or even next-door in a different discipline. While some of the challenges are domain-specific, many of the problems span all fields. CODATA is one organization that is stepping up to the coordination question and some of the thornier questions of citation.
CODATA is an interdisciplinary Scientific Committee of the International Council for Science (ICSU) that works to improve the quality, reliability, management, and accessibility of data of importance to all fields of science and technology. Last October, during its biennial conference in South Africa, CODATA launched a Task Group on Data Citation. This international group, organized jointly by several CODATA committees and the International Council for Scientific and Technical Information (ICSTI), will explore the technical, scientific, socio-cultural, institutional, legal, and sustainability questions regarding data use and citation, including references to portions or subsets of data. The group is also well aware that citing a dataset has further implications for the ability to reliably identify, locate, access, interpret, and verify the version, integrity, and provenance of the dataset. The goal is to help coordinate activities in this area internationally and promote common practices and standards in the scientific community. The group hopes to organize a summit next fall to build awareness and to promote better cooperation among the various leading organizations at work on these topics.
The joint NISO-NFAIS project on Supplemental Journal Article Materials is another project that touches on this space. In scope, however, it is both larger and more tightly focused than the CODATA effort. It is larger in that it covers any type of supplemental material: not only research data, but also digital notebooks, textual supplementary data, software applications, audio, video, or any of the other supporting content that authors submit along with their articles for publication in scholarly journals. From the perspective of data, however, it is much more tightly focused on the publication-related questions, avoiding the more complex questions of provenance, copyright, security, data integration, packaging, and sharing. The project has begun by defining terms, such as what content is supplemental, ancillary, or core to understanding. It is also looking at metadata questions and how to effectively link journal content and supplementary component elements. By working with the publishing community, the Supplemental Journal Article Materials project can help to codify and promote recognition and use of these materials in the publication stream, as well as to ensure that libraries and researchers can effectively access and use them.
The Science Commons group, a sister organization aligned with the Creative Commons, is another organization with work underway. Their project, led by John Wilbanks, is looking at the legal structures necessary to share data among researchers. As is usually the case, copyright and legal protections regarding intellectual property are among the most challenging issues for distribution of content. While U.S. Copyright Law doesn’t protect factual items, there are protections for the organization and representation of data. Where the lines are drawn in scholarly data has not yet been determined by case law or regulation and will likely not be easily decided. In addition, different laws or regulations apply outside the U.S., where, in some cases, copyright in data can be asserted. If data is shared across international boundaries, a case can be made that the data that is returned will retain the more stringent legal strictures. Science Commons hopes to promote an open license solution to data sharing based on a similar structure to the Creative Commons licenses for publications and other creative works.
Existing work conducted by the Open Archives Initiative on Object Reuse and Exchange (ORE) (http://www.openarchives.org/ore/) could play a significant role in the packaging and distribution of datasets. The OAI-ORE specification presents a model for describing how elements within a compound digital object are identified, described, and related to one another. Although originally developed to deal with aggregations of Web resources, such as Web pages or whole Websites, the specification has potential to be applied to scientific datasets. ORE has seen implementation in a few testing environments such as the Chronicling America Historic American Newspapers project (http://groups.google.com/group/oai-ore/browse_thread/thread/4a71d09b6b5a6feb?pli=1) and the oreChem project (http://www.openarchives.org/oreChem/). While ORE provides a semantic and logistical framework for packaging and distributing datasets, significant work remains before it can provide the needed tools for the scientific community.
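To make the idea of a compound object more concrete, here is a minimal sketch in Python of how a dataset might be described as an ORE-style aggregation. It is an illustration only, not a conformant ORE serialization. The vocabulary terms (ResourceMap, Aggregation, describes, aggregates) come from the ORE terms namespace, but the example.org URIs, the "label" key, and the build_resource_map helper are hypothetical names invented for this example.

```python
import json

# Namespace of the OAI-ORE vocabulary terms.
ORE = "http://www.openarchives.org/ore/terms/"


def build_resource_map(rem_uri, aggregation_uri, parts):
    """Describe an aggregation and its parts as a plain dictionary.

    `parts` is a list of (uri, label) pairs. The "label" key is a
    hypothetical, informal annotation, not an ORE term.
    """
    return {
        "@id": rem_uri,
        "@type": ORE + "ResourceMap",
        # The resource map describes exactly one aggregation...
        ORE + "describes": {
            "@id": aggregation_uri,
            "@type": ORE + "Aggregation",
            # ...which in turn lists the resources it aggregates.
            ORE + "aggregates": [
                {"@id": uri, "label": label} for uri, label in parts
            ],
        },
    }


# Hypothetical dataset made of a data file and its codebook.
resource_map = build_resource_map(
    "http://example.org/dataset/1/rem",
    "http://example.org/dataset/1",
    [
        ("http://example.org/dataset/1/readings.csv", "sensor readings"),
        ("http://example.org/dataset/1/codebook.txt", "codebook"),
    ],
)

print(json.dumps(resource_map, indent=2))
```

The point of the sketch is the separation of roles: the dataset as a whole, each of its component files, and the document that describes them all get their own identifiers, which is what would allow a citation or a link to point at the whole package or at one part of it.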
One of the most critical success factors for the rapid adoption of the standards that are developed is making changes within the social and political environment. In the early- to mid-20th century, the publication of scholarly journal articles took off as tenure systems were developed that required the publication of research results for promotion consideration (the “publish or perish” mantra). The new government and non-government requirements for sharing of data, mentioned earlier in this article, are having a similar impact. However, these sharing mandates are only the beginning of what is needed to support a long-term infrastructure for data management. Beyond legislation and policies, a major concern is where the funding for all of this data management will come from. The biggest inhibitor of adoption of data sharing is of course social, not technical or political. Some researchers are reluctant to share data, and some of their organizations have created restrictions on sharing or developed incentives (like the promotion and tenure system) that could result in a mind-set of hoarding one’s data. Both the organizational and the individual tendencies to limit sharing will need to be overcome for large-scale data projects to succeed.
Each of these elements (legislation, organizational policy, individual behaviors, intellectual property, funding, technical infrastructure, technology, and information management standards) will need to be addressed for the data sharing vision to be realized. These issues are large and interwoven and cannot be solved without significant collaboration between the affected parties and the many organizations that represent them. But the recognition of the value of research data seems to have become pervasive enough that now is the right time to facilitate this collaboration. The new government and non-government requirements for sharing of research data may just be the “tipping point” that is needed to ensure that standards are developed and adopted for the identification, citation, curation, and provenance of datasets.