Blog

The anatomy of metadata matching

Crossref logo icon https://doi-org.pluma.sjfc.edu/10.13003/zie7reeg

In our previous blog post about metadata matching, we discussed what it is and why we need it (tl;dr: to discover more relationships within the scholarly record). Here, we will describe some basic matching-related terminology and the components of a matching process. We will also pose some typical product questions to consider when developing or integrating matching solutions.

Basic terminology

Metadata matching is a high-level concept, with many different problems falling into this category. Indeed, no matter how much we like to focus on the similarities between different forms of matching, matching affiliation strings to ROR IDs or matching preprints to journal papers are still different in several important ways. At Crossref and ROR, we call these problems matching tasks.

Drawing on the Research Nexus with Policy documents: Overton’s use of Crossref API

Update 2024-07-01: This post is based on an interview with Euan Adie, founder and director of Overton._

What is Overton?

Overton is a big database of government policy documents, also including sources like intergovernmental organizations, think tanks, and big NGOs and in general anyone who’s trying to influence a government policy maker. What we’re interested in is basically, taking all the good parts of the scholarly record and applying some of that to the policy world. By this we mean finding all the documents, finding what’s out there, collecting metadata for them consistently, fitting to our schema, extracting references from all the policy documents we find, adding links between them, and then we also do citation analysis.

Rebalancing our REST API traffic

Since we first launched our REST API around 2013 as a Labs project, it has evolved well beyond a prototype into arguably Crossref’s most visible and valuable service. It is the result of 20,000 organisations around the world that have worked for many years to curate and share metadata about their various resources, from research grants to research articles and other component inputs and outputs of research.

The REST API is relied on by a large part of the research information community and beyond, seeing around 1.8 billion requests each month. Just five years ago, that average monthly number was 600 million. Our members are the heaviest users, using it for all kinds of information about their own records or picking up connections like citations and other relationships. Databases, discovery tools, libraries, and governments all use the API. Research groups use it for all sorts of things such as analysing trends in science or recording retractions and corrections.

Metadata matching 101: what is it and why do we need it?

Crossref logo icon https://doi-org.pluma.sjfc.edu/10.13003/aewi1cai

At Crossref and ROR, we develop and run processes that match metadata at scale, creating relationships between millions of entities in the scholarly record. Over the last few years, we’ve spent a lot of time diving into details about metadata matching strategies, evaluation, and integration. It is quite possibly our favourite thing to talk and write about! But sometimes it is good to step back and look at the problem from a wider perspective. In this blog, the first one in a series about metadata matching, we will cover the very basics of matching: what it is, how we do it, and why we devote so much effort to this problem.

2024 public data file now available, featuring new experimental formats

This year’s public data file is now available, featuring over 156 million metadata records deposited with Crossref through the end of April 2024 from over 19,000 members. A full breakdown of Crossref metadata statistics is available here.

Like last year, you can download all of these records in one go via Academic Torrents or directly from Amazon S3 via the “requester pays” method.

Download the file: The torrent download can be initiated here. Instructions for downloading via the “requester pays” method, along with other tips for using these files, can be found on the “Tips for working with Crossref public data files and Plus snapshots” page.

Integrity of the Scholarly Record (ISR): what do research institutions think?

Earlier this year, we reported on the roundtable discussion event that we had organised in Frankfurt on the heels of the Frankfurt Book Fair 2023. This event was the second in the series of roundtable events that we are holding with our community to hear from you how we can all work together to preserve the integrity of the scholarly record - you can read more about insights from these events and about ISR in this series of blogs.

Seeking consultancy: understanding joining obstacles for non-member journals

Crossref is undertaking a large program, dubbed 'RCFS' (Resourcing Crossref for Future Sustainability) that will initially tackle five specific issues with our fees. We haven’t increased any of our fees in nearly two decades, and while we’re still okay financially and do not have a revenue growth goal, we do have inclusion and simplification goals. This report from Research Consulting helped to narrow down the five priority projects for 2024-2025 around these three core goals:

This year’s call for expressions of interest to join our board

The Crossref Nominating Committee is inviting expressions of interest to join the Board of Directors of Crossref for the term starting in January 2025. The committee will gather responses from those interested and create the slate of candidates that our membership will vote on in an election in September.

Expressions of interest will be due Monday, May 27th, 2024

This is an exciting time to join the board, as we have a number of active projects underway: We are considering resourcing Crossref for a sustainable future and board members will be part of deciding any changes to our fees scheme and overseeing its implementation. We’re focusing on how our community and metadata can contribute to ensuring the integrity of the scholarly record. We’re broadening our metadata record to capture richer funding and institutional affiliations. We’re working towards a future where the scholarly record prioritizes relationships between research outputs to build a holistic research nexus. The board helps guide this work.

Common views and questions about metadata across Africa

This past year has been a captivating journey of immersion within the Crossref community, a mix of online interactions and meaningful in-person experiences. From the engaging Sustainability Research and Innovation Conference in Port Elizabeth, South Africa, to the impactful webinars conducted globally, this has been more than just a professional endeavour; it has been a personal exploration of collaboration, insights, and a shared commitment to pushing the boundaries of scholarly communication.

Testing times

Martin Eve

Martin Eve – 2024 April 03

In R&D

One of the challenges that we face in Labs and Research at Crossref is that, as we prototype various tools, we need the community to be able to test them. Often, this involves asking for deposit to a different endpoint or changing the way that a platform works to incorporate a prototype.

The problem is that our community is hugely varied in its technical capacity and level of ability when it comes to modifying their platform. Some mega-publishers, for instance, outsource their platforms and so are dependent on third party developers/organizations when they want to make a change. Many smaller publishers, by contrast, use systems such as OJS, which come with Crossref plugins that make life very easy… but that require hard code changes to accommodate prototypes. Such changes are way beyond the technical capacity of most journal editors.