This workshop had four panelists, one of whom, Maryann Martone, spoke via Skype. Jennifer Kemp introduced the topic by discussing what we are missing when we don’t have good metadata. Without good metadata, what is the cost to society, and what discoveries are we missing? Jennifer noted that publishers reuse metadata.
Metadata 2020 is a collaboration that advocates richer, connected, and reusable metadata for all participants: richer metadata fuels discovery and innovation; connected metadata bridges the gaps between systems and communities; and reusable metadata eliminates duplication of effort. A good summary of insights is available on the Metadata 2020 blog. The goal of the collaboration, drawn from community feedback, is shown here:
Publishers and libraries are involved, as are community working groups, which include:
- Funders, platforms, and tools;
- Data publishers and repositories; and
- Librarians.
Phase 1 of the collaboration is ongoing: gathering stories, good and bad, and continuing workshops to create resources. In Phase 2, the stories will be used to build business cases and to develop a Metadata Maturity Model against which content creators will be able to measure themselves and improve. Throughout, the focus is on richer, connected, and reusable metadata.
Metadata: The current state of affairs
Maryann Martone (speaking via Skype), Professor Emerita, University of California, San Diego, who also works with hypothes.is, discussed the researcher perspective on metadata: What-Who-When-How? Nobody ever says they have too much metadata!
The FAIR principles (Findable, Accessible, Interoperable, Reusable), shown below, were formulated through Force11.
Many of its principles are compatible with Metadata 2020. Rich metadata is required for reuse. Here are the characteristics of interoperability and reusability of metadata.
FAIR is a multilayered problem involving core descriptors, community vocabularies, domain-specific vocabularies, specialized vocabularies, and information models. Unfortunately, we are still reliant on paper laboratory notebooks: labs are dirty, messy places that do not lend themselves to electronic technology, and the tools do not work together across the workflow.
See http://www.scientificpaperofthefuture.org/gpf/what-is-a-gpf for a representation of the scientific paper of the future; metadata is a core piece of it. Scientists are being asked to provide more and more. We must get rid of “stupid,” such as having to re-enter the same information repeatedly! Platforms are coming that will capture structured metadata, but the process is still cumbersome and manual.
What do authors need to provide? In over 50% of published papers, authors do not supply enough metadata for the research resources they used to be uniquely identified. To help solve this problem, a resource identification initiative, part of the Force11 project, has developed a unique identifier for research resources: the RRID (Research Resource Identifier). The number of papers with RRIDs has been growing steadily, as shown here:
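RRIDs are designed to be machine-findable in full text: each one is a short alphabetic prefix, an underscore, and an accession string. As a rough sketch of how a tool might harvest them from a methods section (the specific identifiers and the regular expression below are illustrative, not drawn from the talk):

```python
import re

# Hypothetical excerpt from a methods section; the RRIDs here are
# illustrative, not checked against the registry.
text = (
    "Antibodies were obtained commercially (RRID:AB_123456). "
    "Analysis used an imaging toolkit (RRID:SCR_002823) and a "
    "cultured cell line (RRID:CVCL_0045)."
)

# An RRID is an uppercase alphabetic prefix, an underscore, and an
# alphanumeric accession string, introduced by the literal "RRID:".
RRID_PATTERN = re.compile(r"RRID:\s?([A-Z]+_[A-Za-z0-9]+)")

rrids = RRID_PATTERN.findall(text)
print(rrids)  # → ['AB_123456', 'SCR_002823', 'CVCL_0045']
```

A text-mining pipeline of this kind is one way the growth of RRIDs in published papers can be measured at all.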
We must be selective in what we ask for. Metadata matters:
Better and richer metadata requires better tools and interoperable workflows.
- We need to take the burden of entering simple metadata away from authors so they can devote more energy and time to what matters for the science.
- This should be a human-machine partnership.
Jennifer Kemp, Outreach Manager at Crossref, discussed the role of infrastructure in the metadata ecosystem and asked whether we can, or should, discuss the infrastructure separately from the metadata. Infrastructure must be built for the most complex situation, but the simplest is the most common. Standards are a given; independent infrastructure organizations must collaborate, or at least be interoperable, for maintenance, let alone innovation.
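Crossref exposes its registered metadata through a public REST API at api.crossref.org. The sketch below pulls bibliographic fields out of a record shaped like the API's /works/{doi} JSON; the record is hard-coded and hypothetical rather than fetched live:

```python
# A trimmed, hard-coded record shaped like a Crossref /works/{doi}
# response; in practice you would fetch it with an HTTP GET to
# https://api.crossref.org/works/<doi>. The DOI, title, author, and
# journal below are made up for illustration.
record = {
    "message": {
        "DOI": "10.5555/12345678",
        "title": ["Toward Richer Metadata"],
        "author": [{"given": "Josiah", "family": "Carberry"}],
        "container-title": ["Journal of Psychoceramics"],
    }
}

msg = record["message"]
citation = "{authors}. {title}. {journal}. doi:{doi}".format(
    authors="; ".join(f"{a['family']}, {a['given']}" for a in msg["author"]),
    title=msg["title"][0],
    journal=msg["container-title"][0],
    doi=msg["DOI"],
)
print(citation)
```

Even this tiny example shows why interoperability matters: every downstream consumer has to agree on the same field names and structures to reuse the record at all.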
The common ground between library metadata and ORCID metadata is the OpenURL. Metadata is messy; don’t try to make it minimal. Are we ready for a new “minimum,” and to start talking about non-bibliographic information? ORCID iDs might bridge this gap; they could be the new bibliographic data. An institutional ORCID connect pilot can update records for authors, and it also works with vendor systems.
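One practical reason ORCID iDs can serve this bridging role is that they are self-validating: the final character is an ISO 7064 MOD 11-2 check digit, so a system can reject a mistyped iD before linking records to it. A minimal sketch of the checksum:

```python
def orcid_checksum(orcid: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for an ORCID iD.

    Accepts the iD with or without hyphens; the check character is
    computed from the first 15 digits, and a value of 10 is written
    as 'X'.
    """
    digits = orcid.replace("-", "")[:15]
    total = 0
    for ch in digits:
        total = (total + int(ch)) * 2
    remainder = total % 11
    check = (12 - remainder) % 11
    return "X" if check == 10 else str(check)

# 0000-0002-1825-0097 is ORCID's widely cited sample iD; its final
# digit should match the computed check character.
sample = "0000-0002-1825-0097"
print(orcid_checksum(sample))  # → '7', matching the iD's last digit
```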
Lettie Conrad, a consultant, presented the publisher perspective on metadata. Metadata and standards require expertise, resourcing, and funding; fitness for purpose; applications and use cases; conformance; and maintenance. Publishers are thinking about metadata, but they also need to think about formats and standards, and to invest in in-house resources so that publishing and metadata strategies align. A metadata strategy must be clear on what systems are involved and must match the applications publishers intend to serve. Publishers need to ensure that the metadata they distribute is compatible with existing standards.
The act of publication is a leap of faith: it risks change, fitness for purpose, and obsolescence, and it influences speed and accuracy. Some things are beyond our immediate control; what is published will continue to change. Publishers are often surprised at the varying uses to which metadata is put, and they must be aware of them even though they have limited control over making changes. Metadata takes myriad journeys, as shown here, and there can be dozens of hosts for it.
These applications can help publishers determine how their metadata is being used.
Metadata and AI Technology
Michelle Brewer, Librarian and Market Intelligence Planner at Wolters Kluwer, concluded the session with a review of perspectives on metadata and AI technology. She noted that there is no single universally accepted definition of AI, but one that is widely used is:
Artificial intelligence is the theory and development of computer systems able to perform tasks that normally require human intelligence.
The field of AI was founded in 1956. After languishing for several years, it has recently resurged because of improved hardware, the emergence of cloud services and open source technology, cheap storage for large data sets, and improved machine learning algorithms. This chart shows some of the many sub-disciplines or applications of AI.
AI has been used to generate metadata; for example, the Royal Botanic Gardens, Kew used it to compile botanical terms, Amazon has used it to generate book recommendations, and some database producers (such as EMBASE and INSPEC) use it in their indexing processes. Google Translate is now an AI-based system.
In a fascinating application, one researcher used AI to develop a voice interface between papers in the arXiv system and Amazon’s Echo that provides a daily update of the most recent papers added to arXiv. The system reads the title of each article and gives the user the option of hearing the abstract as well. A description of this research and a video clip of the system in operation are available online.
For those who wish to embark on an AI-based project, Brewer listed the following suggestions. Following them will provide an avenue to learning how AI can be used in one’s organization.
- Don’t reinvent the wheel.
- Understand that AI involves a steep learning curve.
- Be suspicious of “one size fits all” algorithms.
- Metadata is best done at its point of origin; subject specialists know best.
- Algorithms can make systems smarter, but common sense must be added as well.
- Plan the use of metadata well.
- Have patience and experiment; many projects do not work well at the start but improve over time.
Don Hawkins blogs about conferences for Information Today and Against The Grain. He also maintains the Conference Calendar on the Information Today website and is the Editor of Personal Archiving: Preserving Our Digital Heritage, published by Information Today in 2013, and Co-Editor of Public Knowledge: Access and Benefits, published by Information Today in 2016. He received his Ph.D. degree from the University of California, Berkeley, and has worked in the information industry for over 45 years.