OpenCitations (http://opencitations.net) employs Semantic Web technologies to publish scholarly citation data in an open repository. The main service developed by OpenCitations is the OpenCitations Corpus (OCC), in which RDF-based scholarly citation data (i.e. bibliographic references) are stored and made freely available under a Creative Commons public domain dedication so that others may use and build upon them. The other resources of OpenCitations – namely the ontologies describing the data and the software developed to build the OCC – are also available with open licenses. OpenCitations follows the FAIR principles for data-intensive science, namely that the data should be findable, accessible, interoperable and re-usable, and employs the three basic criteria promoted by the Initiative for Open Citations, namely that the citation data must be structured, separable and open.
In this presentation, we will explain how OpenCitations harvests citations and associated metadata from several external APIs (those of Europe PubMed Central, Crossref and ORCID), describes them using the SPAR Ontologies, the Provenance Ontology and FOAF according to the OCC metadata model, and stores all the data in JSON-LD. The stored citations are made accessible via the OCC SPARQL interface, by means of the HTTP requests (using content negotiation, e.g. https://w3id.org/oc/corpus/br/1), and as monthly data dumps. As of 5th August 2017, the OCC had ingested the references from 211,196 citing bibliographic resources and contained information about 8,986,030 citation links to 5,106,542 cited resources.
We will exemplify how OpenCitations can benefit the scholarly community at large, being used by international initiatives and projects including Wikidata/WikiCite, OpenAIRE, LOC-DB and eLife, and will describe future plans, made possible by funding from the Sloan Foundation, to increase the rate of citation ingest thirty-fold to 15 million citations per month, and to develop novel user interfaces to access and visualize the citation data.
A citation is a conceptual directional link from a citing entity to a cited entity, instantiated by a bibliographic reference + an in-text reference pointer denoting such reference
A citation index is
a kind of bibliographic index, an index of citations between publications, allowing the user to easily establish which later documents cite which earlier documents – from Wikipedia
Some are freely accessible but not downloadable
Others can be accessed only by paying significant access fees
Semantic Publishing concerns the use of Web and Semantic Web technologies and standards for enhancing a scholarly work semantically (by means of plain RDF statements, nano-publications, etc.) so as to improve its discoverability, interactivity, openness and (re-)usability...
... for both humans and machines
David Shotton (2009). Semantic publishing: the coming revolution in scientific journal publishing. Learned Publishing, 22 (2): 85–94.
Back to 2010: Semantic Publishing and Referencing (SPAR) Ontologies form a suite of orthogonal and complementary OWL 2 DL ontology modules for the creation of comprehensive machine-readable RDF metadata for every aspect of semantic publishing and referencing
The OpenCitations Corpus (OCC, http://opencitations.net) is a Linked Open Data (LOD) repository of CC0 citation data described with SPAR Ontologies (BiRO, C4O, CiTO, DataCite Ontology, FaBiO, PRO), FOAF, PROV-O, DCAT, VOID
It provides >11M citation links from ~260,000 citing articles to ~6M cited resources + provenance information
OCC data are FAIR:
We developed several scripts for implementing the ingestion workflow that populates the OpenCitations Corpus
These scripts implement a live and iterative process
Live: it's working while I'm speaking. It doesn't sleep, never. It's like a sentient, relentless, fast zombie – watch out!
Iterative: the ingestion workflow continuously calls several external APIs to obtain new reference lists and clean metadata of the citing and cited papers
At present, all the reference lists are taken by processing the XML sources of the papers in the PubMed Central Open Access subset
Europe PubMed Central API for retrieving the XML sources
We ask for the most recent papers first
Citing papers includes articles published in 2016 and 2017
There are 1.75M OA articles available in PubMed, according to their API. We have harvested 15% so far...
|Entity type||What it describes||Count in the OCC|
|Bibliographic resource (br)||Conference papers, book chapters, journal articles, academic proceedings, books, journals, etc.||~7.3M|
|Resource embodiment (re)||Digital vs. print, first and ending pages, etc.||~4.3M|
|Bibliographic entry (be)||Textual content of a reference in a reference list||~10.5M|
|Resposible agent (ra)||Given name, family name and ORCID of the agent involved||~22.5M|
|Agent role (ar)||Author, publisher, etc.||~28.6M|
|Identifier (id)||DOI, PubMed ID, PubMed Central ID, ORCID, ISSN, etc.||~15.1M|
|Data statements||Provenance statements|
|Entity type||Kinds of entity||Entities with identifiers|
Academic Proceedings: 13,804
Book Chapter: 68,798
Book Series: 15
Data File: 4,155
Expression Collection: 10,321
Journal Article: 4,481,224
Journal Issue: 1,029,369
Journal Volume: 365,286
Proceedings Paper: 28,125
Reference Book: 66
Reference Entry: 7,917
Report Document: 1,946
Specification Document: 600
Generic kind: 1,281,627
Direct access to bibliographic resources by means of their HTTP URIs (via content negotiation, e.g. https://w3id.org/oc/corpus/br/1)
SPARQL endpoint (https://w3id.org/oc/sparql)
Dump views and downloads
Use by country: United States (33.7%), France (31.3%), Bulgaria (14.3%)
OpenCitations Blog on Wordpress at https://opencitations.wordpress.com
OpenCitations Twitter account at
It's a virtual machine - it doesn't exist!
Geoffrey Bilder, Crossref
Jonathan Dugan, Berkeley Institute for Data Science
Catriona MacCallum, Hindawi
Johanna McEntyre, Europe PubMed Central
Cassidy Sugimoto, Indiana University
Dario Taraborelli, Wikimedia Foundation
one year's salary for a postdoc (Ivan Heibi, who will start in November) to develop new user interfaces
new hardware to enhance the OCC performance
⇒ from 8M citations per year to 240M
DELL PowerEdge R730xd V4, 512GB RAM, 22TB HD + 30 Raspberry Pi 3