Text mining, the automated extraction of information from textual materials by using specialized software to find patterns in structured digital text files, has emerged alongside high-speed computer processing and ever-increasing stores of information on virtually every topic. Whether analyzing ancient texts or working through some of the nearly two million research reports and articles created each year, text mining allows not only speedy information retrieval but also completely new types of analysis and study that were never possible before. As JISC’s Torsten Reimer noted in 2012: “For the last ten years alone, the UK PubMed Central database lists 3,112,308 citations with the word ‘cancer’ in the title; browsing them at the leisurely pace of 85 per day will take you about ten years. And by that time, ten years’ worth of new articles on cancer will have appeared.”
However, getting access to the textual metadata in research documents (articles, books, dissertations, and other sources) isn’t an easy task today. At North Carolina State University (NCSU) Libraries, Darby Orcutt, Assistant Head of Collection Management, worked with Unlimited Priorities to set up special agreements with Accessible Archives to allow for text and data mining for their clients. “This fits into a larger effort here that I initiated more than a year ago,” Orcutt explained, “to examine ways to make our collections available for new forms of research, especially computationally-intensive, non-traditional types of research. Text and data mining is one of these areas. A major part of this has been rolling up our sleeves to work out the business terms and agreements with our vendors. One of the reasons that I started working with commercial historical archives was because of faculty interest and knowing that we needed to build our resources and capabilities in this area for the future. We’ve noticed increased interest, especially by vendors of historical research sets, which offer libraries collections of high quality. Libraries need to build strong corpora of high quality data from archives to meet the needs of future researchers.”
Building Capacity Vendor by Vendor
Rather than going it alone, NCSU partnered with Unlimited Priorities, a Cape Coral, Florida, consultancy, in order to expedite the process. “This experience is a good opportunity for NCSU to build experience,” Unlimited Priorities’ president Iris Hanney points out, “however there are so many variations in the existing contracts with publishers, aggregators, and vendors. This is still a ‘cutting edge’ technology application; not all academic libraries today have the bandwidth, the facilities, or services to take this on. But there are a number that are and that number is growing.” For libraries, this involves technical issues inherent in the file structures of each product, supporting the needed software for analysis, and working with campus partners to provide the computing capabilities needed. All of these issues are still in the early stages of assessment and adoption, and they need to be balanced against other critical responsibilities in this era of austerity.
“Publishers have a learning curve as well,” Hanney continues. “Working with some publishers is actually harder today than working with libraries. In reality, if they understand the model, publishers really don’t have much to lose. Mining really just gives them an additional potential marketing edge. The mindset and fear level of the individual publisher is key; this is really just another technology transition. Today, publishers use different formats, standards, and modifications in developing their products, which complicates the process. Text mining isn’t really another revenue stream but another service vendors can offer their clients. If handled correctly, this isn’t going to injure any revenue streams if they follow some simple rules. There are lessons to be learned, and changes that will need to be made in how they prepare and manage their content. I don’t think that copyright is really a major issue because everyone who can mine data already has a license agreement with the publisher which outlines rights and obligations. This is really just another way to use research data.” In addition to the existing contracts to access content from databases or archival collections, however, accessing and using text and metadata for data mining requires additional permissions.
“This is our second agreement for this type of text mining,” Orcutt notes. “The first was with Gale and we were the first to conclude a negotiation with them. They have since made this same basic agreement available to other client libraries. The major issue in doing this is that too many people are confused by or are confusing the issues involved with text mining. Copyright is a red herring because, in reality, people are less likely to cite significant text with mining than with traditional close reading research that is more common today. With text mining, you are doing a type of distance reading, looking for trends and patterns in the documents as a whole.”
“It’s important for libraries to focus on the basics,” Orcutt believes. “We want our researchers to have access to large sets of information in order to use them for research. We need to help vendors and publishers see that mining is really just another tool to help value their content products in academic libraries. Mining is really an add-on service or capability that should just require adding a few more terms to sales agreements. Most of the real issues are logistical. There shouldn’t be any requirement to report back to the vendor on this type of use. Although vendors may be concerned about the types of sharing that researchers do, this is really a learning curve issue for vendors.”
TDM Development Doesn’t Require “For-Profit Publisher’s Fire”
University of Notre Dame Digital Initiatives Librarian Eric Lease Morgan believes that “the problem to solve does not surround license agreements. Yes, license agreements are impediments to taking full advantage of TDM against scholarly materials, but you don’t need the scholarly material to improve and demonstrate the usefulness of TDM. What I would do instead is:
- Systematically cache non-licensed materials such as things from the HathiTrust, Internet Archive, the ‘Net in general, and open access repositories.
- Organize the cached materials into local library collections and curate them by applying the traditional library processes which includes description (cataloging), preservation (archiving), and re-distribution (public service).
- Enhance the public services by providing TDM against the collections, thus demonstrating increasingly useful services.
- Go to Step #1.
We don’t need the traditionally published scholarly materials to demonstrate the usefulness of TDM. By going after the traditional scholarly materials, we are only feeding for-profit publisher’s fire for increasing prices; we are only continuing to demonstrate how we are at their beck and call. The problem to solve is not getting access to traditional scholarly materials. The problem to solve is providing useful services against materials needed to do research. All research does not necessarily come from scholarly output. I provide research services against non-scholarly, non-peer reviewed content all the time. Needing the license agreements is simply an excuse for not doing the really hard work—creating the services.”
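The kind of service Morgan describes can be illustrated with a minimal sketch. This is not Morgan's implementation; it assumes a small, locally cached collection held in memory, and the collection contents, field names, and function names are all hypothetical. It shows two simple "trends and patterns" queries: how many documents per year mention a term, and which terms are most frequent overall.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for a locally cached, curated collection
# (for example, open access texts described with a title and year).
CACHED_COLLECTION = [
    {"title": "Letter A", "year": 1861,
     "text": "The railroad reached the town. The railroad changed trade."},
    {"title": "Letter B", "year": 1861,
     "text": "Cotton prices fell again this season."},
    {"title": "Report C", "year": 1875,
     "text": "A new railroad and a new telegraph line opened."},
]

WORD_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return WORD_RE.findall(text.lower())


def documents_mentioning_by_year(collection, term):
    """Count, per year, how many documents contain `term` at least once."""
    counts = defaultdict(int)
    term = term.lower()
    for doc in collection:
        if term in set(tokenize(doc["text"])):
            counts[doc["year"]] += 1
    return dict(counts)


def top_terms(collection, n=3, stopwords=frozenset({"the", "a", "and", "this"})):
    """Return the n most frequent non-stopword terms across the collection."""
    totals = Counter()
    for doc in collection:
        totals.update(t for t in tokenize(doc["text"]) if t not in stopwords)
    return totals.most_common(n)


if __name__ == "__main__":
    print(documents_mentioning_by_year(CACHED_COLLECTION, "railroad"))
    print(top_terms(CACHED_COLLECTION, n=1))
```

A real service would read from curated files or a repository rather than an in-memory list, but the shape of the work is the same: describe the items, then compute over the collection as a whole.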
Still, we have many red flags in meeting researcher expectations.
Removing the Red Flags
Academics aren’t the only ones concerned about the mounting piles of key research data that beg for better forms of identification and sophisticated search tools. At a 2012 British meeting on the issues raised by the massive increase in research publication and communication, GlaxoSmithKline’s Philip Ditchfield noted that “there are about 7,000 diseases out there and we can cure about 1% as an industry at the moment. We’re all patients at the end of the day and we need to discover medicines. That’s the priority. We’re a very compliant industry and we want to work with publishers, not undermine their intellectual property. Publishers often say you can mine our content; you just have to ask us. That’s very easy to say and very hard to achieve. It is like in the early days of motor cars when you were allowed to drive down the road but you had to have a man with a red flag running in front of you.”
Last year, researchers at the University of Colorado, Denver, wrote about their efforts to support text mining from their Health Sciences Library by seeking to “leverage the library’s existing journal licenses to obtain a large collection of full-text journal articles in XML format, the right to text mine the collection, and the right to use the collection and data mined from it for grant-funded research to develop biomedical natural language processing tools.” Their challenges and frustrations led them to conclude that “text-mining rights of full-text articles in XML format should routinely be included in the negotiation of the library’s licenses.” Vendors report that requests for text mining permissions are still rare but growing.
For librarians, the issue of access is as critical and complicated as it is for researchers themselves. The University of Colorado, Denver, librarians point to the “natural opportunities for collaboration including negotiating rights to content more efficiently through expanded licensing arrangements and facilitating the secure transfer and storage of data to protect researchers and publishers.”
The Publishers’ Perspective
Publishers see many issues from their perspective as well. A study commissioned by the British Publishing Research Consortium in 2011, titled Journal Article Mining: A Research Study into Practices, Policies, Plans … and Promises, found that although requests for content were “widespread but still at low levels of frequency,” publishers needed to establish standards for cross-publisher text formats to allow for “one shared content mining platform to commonly agreed access terms for mining and standardization of mining-friendly content formats….The overall aim of these cross-publisher solutions should be to facilitate and stimulate the fascinating possibilities of content mining of the ever-growing digital corpus of scholarly journal articles for researchers and publishers alike.”
The 2011 review known as the Hargreaves Report, which supported changing British copyright law to create a copyright exception for TDM, was referred to as “removing this red flag.” The exception was approved and took effect in June 2014. The European Union has been examining TDM as well. In its report, the EU noted that “Traditional publishers distinguish between ‘access’ and ‘mining’, arguing that they are two different activities that require their own license and may bring with them different terms and conditions. Providing researchers with ongoing, reliable access to high quality content for text and data mining is said to involve a significant institutional investment in validation, correction, and refinements to content, plus investment in systems to hold that content in a secure manner. At the same time, there is some acceptance among scientific publishers that the present arrangements are inefficient and costly and would not scale if demand for TDM were to grow as predicted.” In the U.S., research institutions have been working more towards defining TDM as an application of the copyright doctrine of fair use.
“The UK exception, having been enacted, is worth examining,” explains Copyright Clearance Center’s Roy Kaufman. “It applies to already purchased and licensed content, so there is no requirement that works or data be made available to non-customers. Likewise, data unavailable for sale or license is not subject to the exception. The exception is limited to non-commercial text and data mining initiatives, so corporations and corporate-university partnerships are not covered under the exception.”
The University of Northampton’s Charles Oppenheim isn’t sure that the British exception will have the intended effect: “There is no question that the new UK non-commercial TDM exception has aroused a lot of interest—and controversy. One project in particular, Cambridge University-based ContentMine, has taken an aggressive approach, both carrying out extensive TDM and offering its services to others to run such exercises, all based on the new exception. A lot of other researchers are interested, but cautious. The caution is understandable because publishers have been making warning noises. There is a bit in the wording of the exception which allows publishers to restrict access if large scale TDM seriously degrades their system performance, and they are relying on that to try the line that researchers only download small amounts of data (in terms of number of items, and/or the extent of copying of each item) and/or try to insist that researchers use a publisher-created API to do the downloading rather than the researchers’ own ones. On the other hand, the new exception says no contract may over-ride it.”
“So the new exception includes a contradiction,” Oppenheim continues, “which publishers are trying to exploit. They are also urging at an EU level that the EU does not adopt a TDM exception in a Directive, saying publishers’ licenses will solve the problem of researchers wishing to do TDM. Julia Reda, a Member of European Parliament, recently produced an important report recommending that the EU adopts, amongst other things, an EU-wide exception for TDM, but publishers have lobbied MEPs hard to get this recommendation overturned. So overall, it is a political hot potato right now. My own guess is that the publishers will be successful in stopping an EU-wide exception, but ironically that will benefit UK researchers because they can enter into joint TDM projects with EU partners, knowing that they can lead on the TDM angle—so UK TDM practitioners will be popular with researchers in continental Europe anxious to see what TDM can do in their subject field.”
Major Publishers Begin to Respond
Gemma Hersh, Elsevier’s Policy Director, explains that Text and Data Mining (TDM) is still “a niche activity, but the rhetoric around TDM is both heated and vociferous. Unfortunately, TDM is often used as the Trojan horse with which to argue for copyright reform, and by that I mean weakening of the copyright framework. We want to support the development of this new research tool based on researcher feedback, and our policy was developed following a TDM pilot with the research community. We also continue to solicit feedback and evolve our policy in response to this. We want to make sure we have tools and services at researchers’ disposal, as and when they want to use them.” However, LIBER, SPARC, and others see major trepidation on the part of publishers, especially in an era of growing frustration with commercial journal publishing and widespread criticism of journal costs.
In an open letter to Elsevier last July, signed by an impressive group of eighteen European research and library organizations, these advocates advised that, “Europe is falling behind in the exploitation of TDM because the lack of clarity in the current European copyright framework is disincentivising the uptake of TDM by researchers. In the UK, an exception for TDM has been introduced into legislation. What this means is that TDM will no longer be an activity that is subject to license in the UK; any researcher will be free to mine content to which their institution has legal access. We see no reason that researchers across Europe and beyond should not have equal rights to mine content to which they have legal access. Restrictive licenses provided by publishers for access to content for the purpose of TDM have the potential to further disadvantage the research community by enforcing strict parameters around how content can be mined and under what conditions the results may be made available.”
Springer announced a partnership with the Copyright Clearance Center earlier this year to make TDM easier for corporate and biomedical clients. “This partnership with CCC targets the corporate market. For non-commercial research, Springer grants text- and data-mining rights to subscribed content to researchers via their institutions.”
Orcutt is sympathetic to the issues that this shift causes publishers. “My basic idea is that we start by securing blanket rights and capacity for high-end computational research. Very pragmatically, it doesn’t make sense to me that we hold off on securing content for our advanced researchers until we can get some sort of perfect package that will address the (presumed) desires of the many as well. On the flip side, we can’t expect vendors (especially commercial vendors) to proactively develop TDM infrastructure without reasonable expectations for compensation. My fear here is that if we only demand Cadillacs, that is exactly what we’ll get—along with the price tags that come with them. If we’re going to see any ‘standardization,’ then it’s got to be along the lines of a minimal level of acceptability, recognizing the reality that some vendors will never be able to provide more than that, that many vendors are a long way from being able to provide more than that, and that libraries would not be able to bear the costs of always driving more car than will generally be needed (to perhaps belabor the metaphor).”
CrossRef’s Efforts to Publish Publisher License Standard Contracts
Last May, CrossRef, a not-for-profit organization of nearly 2,000 global publishers, formed CrossRef Text and Data Mining services, which allows “publishers to provide information that will simplify access arrangements for researchers who desire to mine and analyze scholarly publisher sites.” The service provides access through “appropriate full text URLs for the full text content so such activity does not inadvertently impact the performance of its main site for more typical readers.”
“On the license issues,” CrossRef’s Carol Anne Meyer says, “we are providing an easy way for researchers to check the publisher license by requiring participating publishers to deposit a URL for their licenses. It could be a URL to a Creative Commons license or to a proprietary license. Even if it is a link to a proprietary license, that standard license may very well include permission for text and data mining….CrossRef isn’t saying what the license terms should and should not be, but we are reducing the time spent for researchers and publishers negotiating individual access arrangements….What CrossRef Text and Data Mining services does NOT include is a standard set of license terms that publishers need to adopt. We stay out of the business model decisions of our members.” These agreements, made between vendors and their clients, are probably monitored by each side: by libraries making sure access is maintained and by vendors guaranteeing that their intellectual property is secure.
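As a rough illustration of what checking a deposited license might look like in practice, the sketch below reads the license URLs and text-mining full-text links out of a work's metadata record. It assumes the record has the general shape returned by the Crossref REST API (for example from `https://api.crossref.org/works/{DOI}`), with a `license` list and a `link` list whose entries may carry an `intended-application` of `text-mining`. The sample record, its DOI, and its URLs are invented for illustration, and the function names are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical metadata record, shaped like the "message" object of a
# Crossref REST API work response. The DOI and URLs are made up.
SAMPLE_WORK = {
    "DOI": "10.5555/12345678",
    "license": [
        {"URL": "https://creativecommons.org/licenses/by/4.0/",
         "content-version": "vor"},
        {"URL": "https://publisher.example.org/tdm-license",
         "content-version": "tdm"},
    ],
    "link": [
        {"URL": "https://publisher.example.org/fulltext/12345678.xml",
         "content-type": "application/xml",
         "intended-application": "text-mining"},
        {"URL": "https://publisher.example.org/article/12345678",
         "content-type": "unspecified",
         "intended-application": "similarity-checking"},
    ],
}


def text_mining_info(work):
    """Collect deposited license URLs and full-text links flagged for mining."""
    license_urls = [lic["URL"] for lic in work.get("license", []) if "URL" in lic]
    full_text_urls = [
        link["URL"]
        for link in work.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]
    return {"license_urls": license_urls, "full_text_urls": full_text_urls}


def is_creative_commons(url):
    """Heuristic check: does the license URL point at creativecommons.org?"""
    return urlparse(url).netloc.lower().endswith("creativecommons.org")


if __name__ == "__main__":
    info = text_mining_info(SAMPLE_WORK)
    for url in info["license_urls"]:
        kind = "Creative Commons" if is_creative_commons(url) else "proprietary or other"
        print(f"{url} ({kind})")
    print(info["full_text_urls"])
```

A proprietary license URL still has to be read by a person to learn whether mining is permitted; the sketch only locates the license and separates Creative Commons links from everything else.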
“CrossRef’s compilation of text mining agreements is useful,” Hanney believes, “but I don’t see that we will ever have a truly standard contract for text mining—or for many other items. Negotiations between individual libraries and vendors are critical to share expectations and needs and foster communication. Researchers generally mine data for long-term projects, so these are normally negotiated for multi-year or permanent access. This is fair to everyone. Different vendors are different. Some don’t have the ability to license the content beyond a certain term. Different vendors operate in different ways with different types of content as well.”
Removing More Red Flags
“Once we agreed in principle,” Orcutt notes, “I turned it over to the licensing team at NCSU and it was quickly handled by them; the agreement was done before the purchase agreement was signed. Text mining isn’t driven by libraries or vendors, but by researchers themselves. So the ability for text mining Accessible Archives’ databases at NCSU is available as soon as any researcher wants to use this.” “What you learn in all of this,” Hanney believes, “is who is a good partner and who isn’t. For NCSU the whole thing was finished in about a week; in other cases that might take a few months. This is all dependent on comfort levels, knowledge, familiarity with license agreements and goals, and what other restrictions any party may have.”
Removing those last “red flags” will enable a whole new era of research: mining the huge amounts of collected data from mobile devices, the web, and the huge stores of the world’s research and political or popular literature; doing sentiment and deception analysis and humanities analysis; or making better strategic assessments. This nascent research area will take data, tools, and time to develop. Without the data, however, we can never establish the techniques and methodologies that will lead us into this new age of research and discovery.
George Mason University historian Tom Scheinfeldt noted recently that “like 18th century natural philosophers confronted with a deluge of strange new tools like microscopes, air pumps, and electrical machines, maybe we need time to articulate our digital apparatus, to produce new phenomena that we can neither anticipate nor explain immediately.”
Researchers are approaching this with enthusiasm that bridges the humanities and sciences. TDM promises to be a key research methodology for the 21st century.
Nancy K. Herther is Librarian for American Studies, Anthropology, Asian American Studies & Sociology at the University of Minnesota, Twin Cities campus.