This blog post was co-authored by Paloma Marín Arraiza and Gabriela Mejias.
Last Friday, ORCID turned eight, and we are about to reach another important milestone: 10 million ORCID iDs! As we do every year, we are celebrating our anniversary and Open Access Week by releasing our Public Data File.
The 2020 Public Data File contains a snapshot of all public record data in the ORCID Registry, is published under a CC0 waiver, and is free for everyone to use. Openness is one of our foundational values, and as part of our commitment to remove barriers to access, we release the file to ensure that all stakeholders have broad access to a vital part of the scholarly communication infrastructure. At the time of writing, the 2019 Public Data File was downloaded more than 35,000 times.
The file has been used in different projects as a data source for the analysis of relationships and individual trajectories within the research community, scientific migrations, collaboration networks, and the adoption of ORCID across disciplines and locations.
How is the community using the file?
We would like to present three examples of Public Data File uses to help enrich scholarly metadata/records and visualize connections.
dblp – Computer Science Bibliography
dblp provides open bibliographic information on major computer science journals and proceedings. In 2017, they started displaying ORCID iDs in bibliographies and individual publications. The metadata enrichment is done by harvesting data directly from publishers and combining it with the data obtained from the public data file. Currently, 12% of their entries have an ORCID iD. The coverage goes up to 18% for 2020 publications. For the journal IEEE Control Systems Letters, it reaches 75%. It is also important to highlight here the outreach work carried out by the German ORCID consortium to promote the use of ORCID in this bibliography.
Digital Humanities Lab – Leibniz Institute for European History
To visualize the connections between authors of the DHd 2020 (i.e., the conference of digital humanities in the German-speaking space), the Digital Humanities Lab used the names of the authors extracted from the Book of Abstracts, the ORCID Reconcile tool of OpenRefine, and the affiliations of the ORCID iDs according to the Public Data File. After data processing and cleaning (the full description is available in German on this blog), they reached the following person-affiliate network:
Graph representation of the person-affiliation-network based on the Book of Abstracts 2020 and ORCID iDs. 204 nodes (person: 110, red / institution: 94, blue) and 183 edges (“affiliated with”).
Source: https://github.com/ieg-dhr/orcidgraph/blob/master/Orcidgraph.png and https://dhlab.hypotheses.org/1467.
The source code of the script can be found in GitHub.
OpenAIRE
The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices. Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back in the hands of the scientific community.
For the past ten years, OpenAIRE has been working to assemble the OpenAIRE Research Graph collection of metadata and links between scientific products such as articles, datasets, software, and other research products; entities like organizations, funders, funding streams, projects, communities, and data sources. As of today, this massive collection aggregates around 450Mi metadata records with links collecting from 10,000+ data sources trusted by scientists. After cleaning, fine-grained classification processes, deduplication, and enrichment via full-text mining (~13Mi full-texts), today the Graph counts ~110Mi publications, ~14Mi datasets, ~200K software research products, 8Mi other products linked together with ~1Bi semantic relations.
ORCID data is used by OpenAIRE to enrich the research product records of the graph. OpenAire is using our public data file and lambda file—generated daily, this file contains a list of all ORCID iDs and their last date of modification. It then uses our Member API to call records that have been modified to import new and updated metadata from those records.
This integration consists of: (i) adding ORCID iDs to Crossref records that are part of the graph, (ii) importing metadata records from ORCID that do not have a DOI, (iii) propagating iDs from products to products when semantic relationships between products justify the action (e.g. if article metadata record with an ORCID iD is linked to a dataset metadata record via a DataCite semantic relationship “supplementedBy/isSupplementTo”). OpenAIRE is capable of brokering to all data sources contributing metadata to the graph (e.g., repositories, publishers, data repositories) the ORCID iDs associated with the related records.
Openaire has been an ORCID institutional member since early 2020 and is planning to establish a bi-directional data exchange by completing an ORCID Search & Link wizard (currently in development).
Interested in using the Public Data File?
If you are interested in using the file, you can download it from the ORCID repository. This year’s file is available in XML format and is further divided into separate files for easier management. One file contains the full record summary for each record. The rest of the data is divided into 11 files which contain the activities for each record including full work data. If you prefer JSON, you can use our ORCID Conversion Library available in our Github repository. The converter is a Java application and enables the generation of JSON from XML in the default version ORCID message schema format (v2.0 and v2.1).
We release the public data file under a CC0 1.0 Public Domain Dedication, and use of the public data is in accordance with our Privacy Policy. We have also created recommended community norms to use the file.
If you are already using the file, or are planning to and have questions, please let us know about your use case. We’d love to hear from you!