There’s little question that research data is big today. Last year an advisory committee to the director of the National Institutes of Health laid the foundations for a project—the Brain Research Through Advancing Innovative Neurotechnologies (BRAIN) Initiative—which promised to “break new ground in the scale and speed of data collection in neurosciences requiring tools to handle data in the magnitude of yottabytes (10^24 bytes).” A single yottabyte would “fill the states of Delaware and Rhode Island with a million data centers,” it’s been estimated. Now, that’s Big Data.
Given the scale and pervasiveness of scientific work today—and the impact of technological advances—projects like this will only continue and grow in the future. A critical need has arisen to be able to track this research, find some type of bibliographic control to the massive stores of data being created, and potentially share this research data with other researchers. With Open Access mandates growing and data repositories being established at colleges and universities across the world, information professionals have new issues to solve and new challenges to face.
As far back as 1979, Sue Dodd suggested a format for bibliographic references for numeric social science data files in JASIS, noting that, “the development of information technology and the ability to produce data have progressed much more rapidly than our capacity to organize, classify, and reference its availability.” In 1995, psychologist Joan Sieber and statistician Bruce Trumbo argued that proper citation of datasets was “no less important than proper citation of books and articles, and for the same reasons. It helps to provide open access to the work of others, it acknowledges the actual creators of the data, and it documents their scientific legacy in a permanent, public record.”
Now, some 35 years later, we are seeing the realization of efforts not only to bring datasets into the bibliographic fold but also to archive, access, and even reuse, revise, and expand them, in the same way that ideas, theories, treatments, and observations from articles can be accessed and folded into new research to advance scientific progress.
In a 1998 letter in Nature, researcher John Helly suggested that, “recurring and fundamental issues limited ‘full and open access’ to data… [due to] the cultural mind-set that fosters, permits and even necessitates the withholding of research data.” Today these issues are front and center in every scholarly association and research institution across the globe.
The Need for Data Citation
In their 2007 article for D-Lib Magazine, Micah Altman and Gary King saw standardized citation for data as responding to “issues of confidentiality, verification, authentication, access, technology changes, existing subfield-specific practices, and possible future extensions, among others.” As they put it, “the simple idea of scholarly citation of printed works on scientific progress has been extraordinary…science is not merely about behaving scientifically, but also requires a community of scholars competing and cooperating to pursue common goals, scholarly citation of printed matter can be viewed as an instantiation of a central feature of the whole enterprise.”
In her introduction to the 2013 National Academies publication For Attribution—Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop, Christine Borgman noted that today’s emphasis on data citation and attribution is based on “the growth in data volume relative to storage and analytic capacities…another factor is advances in the technical infrastructure for generating, managing, analyzing and distributing data.” If one considers data a research product on par with publications, then “data deserve attribution similar to that of publications,” she believes.
At the same time, Borgman notes that, “data are very different entities than publications. They take many more forms, both physical and digital, are far more malleable than publications, and practices vary immensely by individual, by research team, and by research area. Institutional practices to assure stewardship of data are far less mature than are practices to sustain access to publication. All of these factors contribute to the complexity of data citation and attribution.”
Today, led by the work of a strong and growing collaboration of research institutions, governmental bodies, associations and private companies, we are seeing the development of a strong foundation to support and define the development of not only standards for citing data, but systems for guaranteeing that these key research products will continue to exist, and be findable and usable by current and future researchers.
Data Citation Principles
Earlier this year a “community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing,” called Force11, completed its Joint Declaration of Data Citation Principles, which begins with the affirmative statement that: “Sound, reproducible scholarship rests upon a foundation of robust, accessible data. For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record. In other words, data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice and is part of the scholarly ecosystem supporting data reuse.”
The statement goes on to discuss the eight core principles of this manifesto:
- Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
- Credit and Attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.
- Evidence: In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.
- Unique Identification: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
- Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.
- Persistence: Unique identifiers, and metadata describing the data, and its disposition, should persist—even beyond the lifespan of the data they describe.
- Specificity and Verifiability: Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to be able to verify that the specific timeslice, version, and/or granular portion of data retrieved subsequently is the same as was originally cited.
- Interoperability and Flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.
Organizations and individuals around the world have signed on to support these principles, which provide a single theoretical yet workable set of guidelines that can be applied across communities and technologies. Although actual citations may vary in style, Force11 is seeking broad support for the fundamental agreements that data must be included in the full reference list along with citations to other types of works, and that in-text citations should provide enough information to identify the cited data in the reference list.
Citing Data Sets
With all of this activity, you’d think it would be easy to see how the different style manuals—from the Chicago Manual of Style to individual disciplinary standards—handle the citing of datasets. However, there are still significant differences in how these are handled; a guide from Michigan State University provides some examples. Scholarly databases lag further behind, with few even allowing datasets to be marked for easy export to citation managers. Even Web of Science doesn’t allow for marking cited datasets for easy referencing, although it does provide citation reference information. Today we have no universally accepted format for citing data, let alone clear access paths for future researchers.
Key elements in any useful scheme would include the following:
- Authorship/Creation: Who created the dataset, or which institution holds responsibility for its creation? Even a dataset generated from larger sets should be clearly identifiable, and its ultimate source may be important.
- Name of dataset: The title should be apparent; if it is not, the name of the study for which the data were created can be used.
- Dates & Versions: There should be evidence of the date of collection/creation and a history of versions, additions, and changes to the dataset.
- Creator: Who is the creator of the dataset? This can be an individual, a group of individuals, or an organization.
- Keeper/Editor of the data: The name of the institution, department, or individual with initial or ongoing responsibility for maintaining the data.
- Distributor/Archive/Publisher and Location: The parties responsible for access, along with their contact information (physical location, URLs, etc.).
- Format of the data: Is this a computer file, a CD-ROM, or some other format?
- Unique identification and/or location: Is there a Digital Object Identifier (DOI) or another type of persistent identifier for the data? A DOI is preferred, but both an identifier and a location can be useful to future researchers.
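The elements above can be strung together into a human-readable citation. Here is a minimal sketch in Python, loosely following DataCite’s recommended “Creator (PublicationYear). Title. Publisher. Identifier” form; the function name, record fields, and DOI below are all hypothetical, chosen for illustration only.

```python
def format_data_citation(creator, year, title, publisher, identifier, version=None):
    """Assemble a citation string from the key elements; the identifier
    is rendered as a resolvable doi.org URL."""
    parts = [f"{creator} ({year}).", f"{title}."]
    if version:
        parts.append(f"Version {version}.")  # versions matter for verifiability
    parts.append(f"{publisher}.")
    parts.append(f"https://doi.org/{identifier}")
    return " ".join(parts)

# Hypothetical dataset record, for illustration only
citation = format_data_citation(
    creator="Smith, J.",
    year=2014,
    title="Example Survey Dataset",
    publisher="Example Data Archive",
    identifier="10.1234/abcd",
    version="2.1",
)
print(citation)
# Smith, J. (2014). Example Survey Dataset. Version 2.1. Example Data Archive. https://doi.org/10.1234/abcd
```

Note how the version and the persistent identifier carry the burden of the Persistence and Specificity principles discussed above: a reader of the citation can locate exactly the slice of data that was used.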
Beyond these basic data elements, there are schemes that currently provide unique citation information for datasets, much as DOIs do for published articles. Today there are two major efforts aimed at providing data citation information. The first model “favors the citation of a ‘data paper’ or ‘data publication’ describing the data set,” as described by Christopher Belter in a recent paper in PLoS ONE. “In this model, the metadata necessary for using a dataset and a link to the dataset is presented in a paper published either in a traditional journal or in a specialized data journal. Data papers differ from more traditional publications in that no analyses or findings resulting from the data set are required. Researchers wishing to cite the data set would then cite the data paper, rather than the data set.”
The second model, first developed for nucleotide sequence datasets in GenBank and later adopted in other applications, focuses on a direct link or citation to the dataset in its repository. This method is used by the California Digital Library, DataONE, the Dataverse Network, Dryad, NOAA’s National Climatic Data Center, ICPSR, and Pangaea. “One of the fundamental components of this model is the creation and citation of an identifier that uniquely identifies the data set being cited,” such as the DOIs issued through DataCite.
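As a concrete illustration of such identifiers: a DOI consists of a prefix beginning with “10.” (the registrant code) and a registrant-assigned suffix, separated by a slash. The sketch below is only a loose syntactic check, not a definitive validator (real verification means resolving the identifier through doi.org), and the sample DOI strings are illustrative.

```python
import re

# Loose DOI shape: "10." + registrant code + "/" + non-empty suffix.
# This is a simplified pattern for illustration, not an authoritative one.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(identifier: str) -> bool:
    """Return True if the string is shaped like a DOI; does not
    check that the DOI is actually registered or resolvable."""
    return bool(DOI_PATTERN.match(identifier))

print(looks_like_doi("10.5061/dryad.1234"))  # True  (well-formed, illustrative)
print(looks_like_doi("not-a-doi"))           # False (no 10. prefix)
print(looks_like_doi("10.1234/"))            # False (empty suffix)
```

A check like this is useful only at the margins of a citation workflow; the real guarantee of persistence comes from the registration and resolution infrastructure behind the identifier, which is exactly what organizations like DataCite operate.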
Clearly, having a unique dataset identifier is just the start. “In order for a dataset to be cited,” Belter continues, “it must first have been deposited in a repository, preserved in an interoperable format, adequately described by a formal set of metadata attached to the data set and made available to other researchers for reuse.” No small set of tasks.
“With members and affiliates in 18 countries, DataCite represents a global, nonprofit effort to help researchers find, access, and reuse data,” explains Michael Witt, Head of the Distributed Data Curation Center at Purdue University, a key U.S. DataCite partner. “This includes operating a DOI infrastructure to assign unique, persistent identifiers for datasets as well as working towards standards and tools to support data citation, enabling new services such as linking documents and datasets, and raising awareness and doing outreach. For example, the dataset that supports the measurement of the Higgs boson was published with a DataCite DOI; and by the end of the year, DataCite will have issued more than 4 million DOIs.”
DataCite is an international, non-profit organization aiming to:
- Establish easier access to research data on the Internet
- Increase acceptance of research data as legitimate, citable contributions to the scholarly record
- Support data archiving that will permit results to be verified and re-purposed for future study
DataCite’s member institutions offer services and advice directly where they are needed by researchers. The German National Library of Science and Technology acts as the Managing Agent for DataCite, coordinating communication, activities, and citation processes. Member organizations in countries across the planet (full members and Associate Members) coordinate the efforts in their geographic areas. A Board provides oversight and governance. Through collaborations with ORCID, the ICSU World Data System, companies like Thomson Reuters and Elsevier, and major research institutions across the globe, DataCite is becoming a major player in the data citation arena and an important component in the evolving global research infrastructure.
The international collaborative Force11—along with many others—is also working to make data more accessible. “ICPSR has been linking publications to the underlying data since 2000,” explains Elizabeth Moss, librarian for the Inter-university Consortium for Political and Social Research. “In the past few years, data citation has become more discussed, no doubt because of the increased interest in open data and data sharing. Recently, the various groups who have been advocating better data standards across the academic disciplines have tried to synthesize their efforts via Force11. It formed a Data Citation Synthesis Working Group, which came out with a Joint Declaration of Data Citation Principles last year. Many stakeholders, from publishers to archives to tool developers, have endorsed it, including ICPSR. Force11 has now added both a dissemination and an implementation group to come up with ways to promote the principles of data citation and help change practice.” Force11, ICPSR, and even major private-sector groups are supporting these efforts.
“Microsoft Research sees great potential in improving scientific reproducibility and accelerating scientific discovery through public access to research data, and has long promoted Jim Gray’s vision of seeing all research data available online,” Alex Wade, Director of Scholarly Communications at Microsoft Research, explains to ATG readers. “Data citation is still in the early stages and is not yet widely adopted by authors or publishers. Incentives, in the form of funding agency mandates and author credit for data sharing, will likely be necessary components of a cultural shift to broad research data sharing,” Wade continues. “Thomson Reuters’ Data Citation Index is one such effort to quantify the impact and reach of cited instances of research data. Improved discovery mechanisms and integration into researchers’ workflows will also facilitate buy-in and adoption by the research community.”
Data Citation Index
In 2012, Thomson Reuters released its Data Citation Index as a companion to its various disciplinary citation products. The database is sold integrated with those products and generates data citations that can be linked to the repositories or other sites holding the data. A paper presented this month at the STI Conference in Leiden analyzed the database’s content, finding that it included “a large number of data repositories…[however] most of its records have no citation related with them, showing a high rate of uncitedness (88%).” At this early stage, getting datasets indexed remains a major barrier to subsequent access. The authors concluded that “data sharing practices are not common to all areas of scientific knowledge and only certain fields have developed an infrastructure that allows one to use and share data.”
Thomson Reuters has worked with major repositories, such as ICPSR, to establish a strong foundation of significant datasets. Its data studies provide descriptions of experiments or studies held in repositories, together with the associated data when subsequently used by others. Until we have a truly open system of research data, data citation and reuse of existing datasets will lag in this emerging area. More recently, Elsevier has initiated an effort to link papers in ScienceDirect to datasets used or deposited in repositories. Both of these schemes, however, rely on clear, unique identifiers and links to the datasets.
DataCite’s efforts to standardize data citation represent a strong foundation for this growing area. However, achieving an accepted, workable system for citing data is only a first step: citation works as a retrieval system only if the cited object is actually findable through search systems and accessible through some established means. One good index to repositories is the independent website Databib, which covers nearly 1,000 repositories and allows searching by subject, title, description, and other indicators. Together with the Data Citation Index, these resources provide important initial access to publicly available datasets.
Other Key Issues
In addition to standardized citation and clear retrieval options, data need to be maintained and kept available for future researchers hoping to replicate or build on existing research. One key issue that remains understudied is how to give users and researchers enough trust in the integrity of repositories that they can confidently rely on them for archiving and managing their research materials. “While repositories’ efforts to build trustworthy digital repositories (TDRs) led to the establishment of ISO standards,” noted Ayoung Yoon in a recent analysis, “much less research has been done regarding the user’s side, despite calls for an understanding of users’ trust in TDRs.” Trust is just one issue.
One interesting development is the use of provenance techniques—and even crowdsourcing—to ensure that the basic integrity of scientific data is maintained, and any defects detected, through all of its transformations, analyses, interpretations, and format changes. In one recent example, data citation graphs were crowdsourced as a form of data validation.
Other issues of equal importance are:
- The requirement to make significant changes to the workflow of researchers in order to accommodate data citation, access, and reuse.
- Providing incentives and recognition to researchers to share their research materials in an open environment.
- Providing secure, state-of-the-art repositories with clear policies, technological capability and oversight to guarantee the ongoing archiving and access to information.
- The development of clear standards and tools to make datasets and related materials truly useful, allowing for linkages to revisions, new datasets based on existing research, etc., just as we can follow links to articles and ideas through traditional citation indexes.
- Clear professional recognition and standards for this new system of scholarly publication.
- Systems that provide sound measures to protect intellectual property rights and subject privacy, and to address other ethical issues that may arise.
How are libraries and research centers responding to these issues and needs? Part Two will look at how various players are working together to create a wholly new research structure for the 21st century.
(Click here to view the second part of this series: ATG Original: A Fourth Paradigm: Creating a 21st Century Research Infrastructure (Part 2 of a 2 part series) )
Nancy K. Herther is librarian for American Studies, Anthropology, Asian American Studies and Sociology at the University of Minnesota Libraries, Twin Cities campus.