By Nancy K. Herther
“Science has entered a ‘fourth paradigm,’” noted Carol Tenopir and fellow authors earlier this year, one “that is more collaborative, more computational, and more data intensive than the previous experimental, theoretical, and computational paradigms.” Their research found that academic research libraries were increasingly positioning themselves to be central players in this access/repository endeavor. “A small, but growing, number of libraries are becoming more involved with research data, from helping with data management plans to preparing and preserving research data for deposit in data repositories.”
Elsevier’s Article of the Future Project is intended to help “define an optimal way for the dissemination of science in the digital age.” In a recent article, Elsevier researchers presented information on such “key dimensions” as interlinking scientific articles and research data stored in domain-specific data repositories, 3-D visualization tools embedded in online articles, and “automatic enrichment of journal articles through text-mining or other methods.” “We make datasets more discoverable through deep-linking to them from our articles,” Elsevier SVP for Journal and Data Solutions, IJsbrand Jan Aalbersberg, tells ATG. “We currently link to datasets in 45 different databases (with 10 more in the immediate pipeline), through banner links, entity links, or application links. In many ways we see that an infrastructure for research data is emerging, and that many stakeholders are coming together to develop and manage the bits and pieces that are necessary. That includes technical work on development and standardization, but also more social aspects like policies, incentives, and developing best practices. Many of these aspects are relevant to data citation as well; there’s the technical side of long-term data storage and assigning persistent identifiers, there’s a standardization aspect in making sure that all the players recognize and support the same identifiers, and there’s a policy aspect in defining best-practices and incentives for data sharing and data citation.”
Unidata Program Center’s Ben Domenico is working on methods that would allow “authors to create online publications that enable readers to access, analyze, and display the data discussed in the publication” as they read it. These articles would include fully dynamic data, linked to the holding data repository for access, along with other key tools and enhancements that turn them into living documents: research systems that give readers a jumping-off point for their own analysis and work. Just as technology is changing other forms of publication, research is also seeing radical change due to opportunities brought by technological advances.
In May 2013, President Obama signed an Executive Order “Making Open and Machine Readable the New Default for Government Information.” In this action, Obama ordered that “government information resources shall be open and machine readable. Government information shall be managed as an asset throughout its life cycle to promote interoperability and openness, and, wherever possible and legally permissible, to ensure that data are released to the public in ways that make the data easy to find, accessible, and usable. In making this the new default state, executive departments and agencies (agencies) shall ensure that they safeguard individual privacy, confidentiality, and national security.”
As early as 1955, the International Council for Science established a series of World Data Centers in order to protect and preserve key data. In 2003, the European Union’s Parliament passed a directive for all EU member states to remove any barriers that would inhibit re-use of public sector information, resulting in the establishment of the European Union Open Data Portal as a physical entity to link or house such data. The Organization for Economic Co-operation and Development member organizations signed a 2004 statement (followed in 2007 by written guidelines) that publicly funded data would be archived and made publicly available.
In the U.S., mandates for data plans that include open formats and access have come from such major research funders as the National Science Foundation, U.S. Department of Agriculture, USDA Forest Service, National Institutes of Health, National Oceanic & Atmospheric Administration (NOAA), U.S. Geological Survey, National Endowment for the Humanities, Department of Defense, and Centers for Disease Control & Prevention (CDC), among many others. Many journals now have data-sharing mandates, and research institutions are also focusing on this key initiative. ROARMAP, the Registry of Open Access Repositories Mandatory Archiving Policies, keeps tabs on the number of institutions with these policies and links to the policies themselves.
At the University of Minnesota, Twin Cities campus, work is underway to determine the extent of researchers’ knowledge about Open Access and their needs. Alicia Hofelich Mohr of the College of Liberal Arts’ Research Support Services explains that “one of the things the U is doing to help researchers is to provide a high level of support for data management across service areas. In addition to the support provided by the Libraries for data sharing and archival, our office in the College of Liberal Arts provides data management support integrated with and in addition to primary research support for data planning, storage, collection, and analysis. With expertise in social science data and analysis, as well as strong ties to IT and to the libraries, we can help CLA researchers navigate the resources available to them and help them implement best data management practices throughout the research lifecycle.”
“Among repositories, I believe there has been considerable movement in the support for data citation. Many repositories provide separate Digital Object Identifiers (DOIs) for deposited datasets, and encourage users to cite the data separately from (and in addition to) the published articles that use the data. Additionally, professional organizations, such as the International Association for Social Science Information Services & Technology (IASSIST), are promoting best practices and guidelines for data citation,” Mohr continues.
“However, in practice, the adoption of data citation is slow,” Mohr cautions. “While federal agencies like NSF and NIH require data management or sharing plans with grant applications, it has yet to be seen how these plans will be enforced. Few would disagree that open data and materials are important for the integrity of research, but putting this into practice is challenging. The extent to which data sharing and citation are currently embedded in academic culture varies widely by discipline, and even sub-fields within a discipline. Many disciplines do not recognize data as stand-alone intellectual contributions to their field, and open data practices currently do not carry much, if any, weight in academic promotions or the academic job market. This is further complicated by the fact many datasets in the social and health sciences have privacy or confidentiality restrictions that can make sharing difficult. Researchers need to be incentivized to put the effort into data sharing, which can be time-consuming (for example, making sure their data are well documented, useable, and de-identified), and to incorporate existing data in their work if data citation is to be widely practiced. This not only requires the support of repositories and granting agencies, but also a shift in academic culture.”
Creating Repositories for Permanent Storage and Access
Carol Tenopir led a recent study, published in PLoS ONE, of scientists’ use of repositories for their research materials. “We not surprisingly found that lack of time was the major barrier for sharing or depositing data sets. Most respondents were willing to share at least some of their data, but time to get it ready, create metadata, and so on, are barriers. Therefore, the role of institutions is crucial and research libraries are in a good position to take a leadership role in this. Libraries have the long-term picture in mind, plus serve all parts and disciplines of the university, so, in collaboration with research offices and cyberinfrastructure, are key. Subject repositories may come from a professional society or collaboration and are often also based in universities. As we found in our study of libraries, the library role can vary from helping researchers find an appropriate place to put their data, to leading them to the data management planning tool from UC, all the way to creating an institutional repository and helping with metadata.”
“Citation standards are a collaborative effort,” Tenopir continues. “Data repositories need to provide citations when data is downloaded and the citation standards need to be easily found. DataONE.org has information on citation standards for data, as do other sites. We have just finished follow-up assessments of scientists, libraries and librarians and are working on analysis right now. One other issue to put into the mix is if the data to be preserved and shared involves human subjects, IRBs need to be open to the new paradigm of data sharing. Researchers may need to put language in informed consents that the cleaned anonymous data set will be deposited into an open repository.”
Zoologist Tim Vines from the University of British Columbia led a research effort, published in Current Biology, to study the importance of citing datasets and the rapid decline in the availability of research data over time. The authors concluded that their results show that “in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.”
Vines believes that moving to Open Access will begin with journals, which “are in the unique position where authors come to them with a readily defined dataset (the one underlying their article), and the journal can make public release of the dataset a condition of publication. Even more powerful is making acceptance contingent on high quality data sharing, as nobody wants to have their otherwise acceptable paper rejected on the grounds that the data were not being shared. After all, it’s self-evident that papers that have accessible data are more reproducible and more useful for future research than papers that don’t.”
Funding agencies may not have the same authority to mandate change, Vines believes, because they “do not have such a clearly defined leverage point, as researchers will collect many datasets across a three or five year grant, and not all of them can be attributed to a particular funding source. However, now that agencies have indicated that sharing data is something they encourage, applicants will presumably have to make lots of their data public to get funded,” Vines notes. “Universities seem to have a lot less influence here—they can provide infrastructure, such as a university data repository, but it would be very time consuming for them to try to enforce data sharing by their researchers. It’s possible that tenure committees could include sharing efforts in their faculty evaluations, but that’s not something I’ve heard of departments doing.”
“With respect to individuals in the community, a greater proportion of people now care about data sharing and make their data public than five years ago. They’re a minority, though, and the incentives to share (possible future citations, possible collaborative publications) are still outweighed by the benefits to keeping your data to yourself (no hassle with curation prior to release, no risk of being scooped, control over future projects). I don’t see that the balance of incentives is going to change in future, even if datasets are always cited, so journals and funding agencies need to impose the ‘public good’ over the selfish needs of individual researchers. It’s possible that attitudes will change in time, so that not sharing data is a black mark against the paper, and this might happen more quickly if journals base editorial decisions on the authors’ data sharing efforts.”
University of Sheffield professor Stephen Pinfield recently co-authored a report on research data management in U.K. universities, finding that “libraries are currently involved in developing new institutional research data management policies and services, and see this as an important part of their future role.” ATG asked Pinfield about the role of libraries in this process. “Researchers themselves need to adopt new approaches but they can only do so if adequately supported by institutions, funders, and subject communities. The precise roles of each are still unclear and may ultimately vary across different countries or subject groups. However, there is good work going on in a number of different contexts to try to identify the way forward. This requires collaboration between different players.”
“In my view,” Pinfield continues, “greater consideration needs to be given to the incentives for researchers to share and cite data. At the moment, the incentives are often not clear, apart from in some disciplinary communities where data sharing is becoming accepted practice. There is still a great deal to do in order to get new approaches into the mainstream. However, a great deal has been achieved already in terms of developing standards and requirements. Once again, coordination is essential.”
Today, re-using data relies on clear and findable data citation, which relies on data citation standards and the availability of that data in usable formats; which relies upon institutions and staff that have the ability and resources to store and shepherd these materials; which, of course, depends on researchers depositing their data in the first place. Like a game of dominoes, the reality of data citation as a research tool and data as a re-usable resource for the future relies on a structure of data management that we are still seeing evolve. “In order for a dataset to be cited,” Belter says, “it must first have been deposited in a repository, preserved in an interoperable format, adequately described by a formal set of metadata attached to the data set and made available to other researchers for reuse.” The task ahead is truly daunting, but essential.
(The first part of this series: Data Citation, Open Access & Reusability—Key Issues for Research Institutions (Part 1 of a 2 part series))