Works are scholarly documents like journal articles, books, datasets, and theses.
OpenAlex indexes about 239M works, with about 50,000 added daily. The canonical PID for works is DOI; about half of works have one.
We collect new works from many sources, including Crossref, PubMed, and institutional and discipline-specific repositories (e.g., arXiv). Many older works come from the now-defunct Microsoft Academic Graph.
The same work can be hosted in multiple venues, often with slight differences. So, we cluster works together, using an algorithm that does fuzzy matching based on each work’s publication date, title, and author list. For example: https://doi.org/10.1364/PRJ.433188 and https://arxiv.org/abs/2102.11388 are two versions of the same paper, so they appear in OpenAlex as one Work, https://openalex.org/W3184470535.
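The clustering idea above can be illustrated with a minimal sketch. This is not OpenAlex's actual algorithm: the helper names, the thresholds, and the record layout (`title`, `authors`, `publication_date`) are all assumptions chosen for illustration, and the sample records are made up.

```python
from datetime import date
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(cleaned.split())


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def author_overlap(authors_a, authors_b):
    """Jaccard overlap of author surnames (last whitespace-separated token)."""
    sa = {name.split()[-1].lower() for name in authors_a if name.strip()}
    sb = {name.split()[-1].lower() for name in authors_b if name.strip()}
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def same_work(w1, w2, title_min=0.9, author_min=0.5, max_days=730):
    """Hypothetical rule: similar title, overlapping authors, nearby dates."""
    days_apart = abs((w1["publication_date"] - w2["publication_date"]).days)
    return (
        title_similarity(w1["title"], w2["title"]) >= title_min
        and author_overlap(w1["authors"], w2["authors"]) >= author_min
        and days_apart <= max_days
    )


# Made-up records standing in for a preprint and its published version.
preprint = {
    "title": "Deep Learning for Photonic Design: A Survey",
    "authors": ["Ana Silva", "Li Wei"],
    "publication_date": date(2021, 2, 22),
}
published = {
    "title": "Deep learning for photonic design - a survey",
    "authors": ["A. Silva", "L. Wei"],
    "publication_date": date(2021, 9, 1),
}
unrelated = {
    "title": "Graph methods for citation analysis",
    "authors": ["Ana Silva"],
    "publication_date": date(2021, 3, 1),
}
```

With these thresholds, `same_work(preprint, published)` is true and `same_work(preprint, unrelated)` is false. A real system would need to tune the thresholds and handle missing dates or author lists.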
There are three component objects that are only used as part of a
Occasionally, a work has more than one DOI. For example, there might be one DOI for a preprint version hosted on bioRxiv and another DOI for the published version. However, this field always has just one DOI, the DOI for the published work. If you want DOIs for other versions, you can find them in the
String: The title of this work.
title: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles",
display_name: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles",
Integer: The year this work was published.
Where different publication dates exist, we select the earliest available date of electronic publication.
Object: All the external identifiers that we know about for this work. IDs are expressed as URIs whenever possible. Possible ID types:
host_venue is where you can find the best (closest to the version of record) copy of this work. For a peer-reviewed journal article, the best host_venue would be a full text published version, hosted by the publisher at the article's DOI URL.
// this top stuff is the same as a dehydrated Venue object
// this stuff is extra, and relates to this work at this venue
String: The type or genre of the work.
Where possible, we just pass along Crossref's type value for each work. When that's impossible (e.g., the work isn't in Crossref), we do our best to figure out the type ourselves. Unfortunately, the accuracy of Crossref's data for this isn't great, and ours isn't much better. We're working to develop better type classification.
// first authorship object:
display_name: "Heather A. Piwowar",
// more authorship objects go here, omitted for space.
Integer: The number of citations to this work. These are the times that other works have cited this work: Other works ➞ This work.
Object: Old-timey bibliographic info for this work. This is mostly useful only in citation/reference contexts. These are all strings because sometimes you'll get fun values like "Spring" and "Inside cover."
Boolean: True if we know this work has been retracted.
This field has high precision but low recall. In other words, if it is true, the article is definitely retracted. But if it is false, the article might still be retracted; we just don't know. This is because, unfortunately, the open sources for retraction data aren't currently very comprehensive, and the more comprehensive ones aren't sufficiently open for us to use here.
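One practical consequence of the precision/recall note is that a false value is better read as "unknown" than as "not retracted". A small sketch, assuming the field is named `is_retracted` (the name is an assumption here) and that a work is available as a plain dictionary:

```python
def retraction_label(work):
    """Map the retraction flag to a label that does not overstate a false value.

    "is_retracted" is an assumed field name; a missing value is treated
    the same way as false.
    """
    if work.get("is_retracted") is True:
        return "retracted"
    return "unknown"
```

Downstream code can then exclude works labeled "retracted" without claiming that the remaining works are known to be sound.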
In our context, paratext is stuff that's in a scholarly venue (like a journal) but is about the venue rather than a scholarly work properly speaking. Some examples and nonexamples:
- yep it's paratext: front cover, back cover, table of contents, editorial board listing, issue information, masthead.
- no, not paratext: research paper, dataset, letters to the editor, figures
Turns out there is a lot of paratext in registries like Crossref. That's not a bad thing... but we've found that it's good to have a way to filter it out.
is_paratext is determined algorithmically using title heuristics.
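To give a feel for what a title heuristic might look like, here is a minimal sketch. The pattern list is drawn from the paratext examples above, but the function name and the matching rule (title starts with one of the phrases) are assumptions, not the heuristics OpenAlex actually uses.

```python
import re

# Phrases taken from the paratext examples in this section.
PARATEXT_PHRASES = [
    "front cover",
    "back cover",
    "table of contents",
    "editorial board",
    "issue information",
    "masthead",
]

# Match any of the phrases at the start of the title, ignoring case.
_PARATEXT_RE = re.compile(
    "|".join("^" + re.escape(phrase) for phrase in PARATEXT_PHRASES),
    re.IGNORECASE,
)


def looks_like_paratext(title):
    """Return True if the title starts with a known paratext phrase."""
    if not title:
        return False
    return _PARATEXT_RE.search(title.strip()) is not None
```

A prefix match keeps false positives down: a research paper whose title merely mentions an editorial board somewhere in the middle would not be flagged.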
Each Concept object in the list also has one additional property:
score (Float): The strength of the connection between the work and this concept (higher is stronger). This number is produced by AWS Sagemaker, in the last layer of the machine learning model that assigns concepts.
display_name: "Citation impact",
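Because higher scores mean a stronger connection, one common use is to keep only the strongest concepts for a work. The sketch below assumes the list lives under a `concepts` key and uses an arbitrary cutoff of 0.5; both are assumptions for illustration, and the sample scores are made up.

```python
def top_concepts(work, min_score=0.5):
    """Return concepts with score >= min_score, strongest first.

    The "concepts" key and the default cutoff are illustrative assumptions.
    """
    concepts = work.get("concepts") or []
    kept = [c for c in concepts if c.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda c: c["score"], reverse=True)


# Made-up scores, for illustration only.
sample_work = {
    "concepts": [
        {"display_name": "Citation impact", "score": 0.41},
        {"display_name": "Open access", "score": 0.87},
        {"display_name": "Library science", "score": 0.55},
    ]
}
```

The right cutoff depends on the use case, so it is worth inspecting the score distribution before fixing one.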
descriptor_name: "Peer Review, Research",
descriptor_name: "Peer Review, Research",
display_name: "Europe PMC",
display_name: "Simon Fraser University - Summit",
// others omitted for brevity.
Newer works are more likely to have an abstract inverted index. For example, over 60% of works from 2022 have abstract data, compared to 45% for works published before 2000. Full chart is below:
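Separately from the coverage numbers, the inverted index itself can be turned back into readable text. The sketch below assumes the index is a mapping from each word to the list of positions where it occurs; the function name is hypothetical.

```python
def abstract_from_inverted_index(index):
    """Rebuild plain text from a {word: [positions]} inverted index."""
    if not index:
        return ""
    word_at = {}
    for word, positions in index.items():
        for position in positions:
            word_at[position] = word
    # Join the words in position order; gaps in positions are skipped.
    return " ".join(word_at[p] for p in sorted(word_at))


sample_index = {"the": [0, 3], "cat": [1], "saw": [2], "dog": [4]}
```

For `sample_index`, the function returns "the cat saw the dog". Works without abstract data yield an empty string.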
Citations more than ten years old aren't included. Years with zero citations have been removed, so you will need to add those back in if you need them.
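Adding the zero-citation years back in is a small amount of code. This sketch assumes each entry is a dictionary with `year` and `cited_by_count` keys and that the window is the ten most recent years ending at `current_year`; the key names, the function name, and the exact window boundary are assumptions.

```python
def fill_missing_years(counts_by_year, current_year, window=10):
    """Return one entry per year in the window, newest first, with zeros filled in."""
    by_year = {row["year"]: row["cited_by_count"] for row in counts_by_year}
    years = range(current_year, current_year - window, -1)
    return [{"year": y, "cited_by_count": by_year.get(y, 0)} for y in years]


sparse = [
    {"year": 2022, "cited_by_count": 5},
    {"year": 2020, "cited_by_count": 2},
]
```

Calling `fill_missing_years(sparse, current_year=2022, window=3)` yields entries for 2022, 2021, and 2020, with a count of 0 for 2021.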
The Authorship object represents a single author and her institutional affiliations in the context of a given work. It is only found as part of a
String: A summarized description of this author's position in the work's author list. Possible values are
It's not strictly necessary, because author order is already implicitly recorded by the list order of Authorship objects; however, it's useful in some contexts to have this as a categorical value.
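The relationship between list order and the categorical value can be sketched as follows. The labels "first", "middle", and "last" are assumptions here, as is the choice to label a sole author "first"; the function name is hypothetical.

```python
def author_position(index, total):
    """Derive a categorical position from a zero-based index in the author list.

    Assumed labels: "first", "middle", "last". A sole author is labeled "first".
    """
    if index < 0 or index >= total:
        raise ValueError("index out of range for author list")
    if index == 0:
        return "first"
    if index == total - 1:
        return "last"
    return "middle"
```

For a four-author work this gives "first", "middle", "middle", "last", which is handy for filters such as "works where this person is first author".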
display_name: "Juan Pablo Alperin",
List: The institutional affiliations this author claimed in the context of this work, as dehydrated
display_name: "Simon Fraser University",
display_name: "Public Knowledge Project",
String: This author's affiliation as it originally came to us (on a webpage or in an API), as a raw unformatted string. Multiple affiliations are separated by a semicolon.
raw_affiliation_string: "Canadian Institute for Studies in Publishing, Simon Fraser University, Vancouver, BC, Canada."
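Since multiple affiliations are separated by a semicolon, the raw string can be split into individual affiliations. A minimal sketch with a hypothetical function name and made-up sample data; it assumes no affiliation contains a semicolon of its own.

```python
def split_affiliations(raw_affiliation_string):
    """Split a raw affiliation string on semicolons and trim whitespace."""
    if not raw_affiliation_string:
        return []
    parts = raw_affiliation_string.split(";")
    return [part.strip() for part in parts if part.strip()]


# Made-up example with two affiliations.
sample_raw = "Dept. of Physics, University A; Institute B, City C"
```

The pieces are still unformatted strings, so matching them to institutions remains a separate step.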
The HostVenue object describes a given work hosted on a given venue (you can think of it as a WorkVenue bridging table). It's only found as part of the Work object. It's got two parts:
1. a dehydrated Venue object, and
2. some extra stuff about the work.
The extra stuff is important because a given work can be hosted in different ways and in different forms, depending on where it's living.
String: The URL where you can access this work.
Boolean: Set to true if the work hosted here can be read for free, without registration.
publishedVersion: The document’s version of record. This is the most authoritative version.
acceptedVersion: The document after having completed peer review and being officially accepted for publication. It will lack publisher formatting, but the content should be interchangeable with that of the
submittedVersion: The document as submitted to the publisher by the authors, but before peer review. Its content may differ significantly from that of the accepted article.
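The three version values have a natural order by closeness to the version of record, which makes it easy to pick the most authoritative copy from a set of hosted copies. In this sketch the `version` key, the function name, and the sample records are assumptions for illustration.

```python
# Lower rank means closer to the version of record.
VERSION_RANK = {
    "publishedVersion": 0,
    "acceptedVersion": 1,
    "submittedVersion": 2,
}


def best_copy(copies):
    """Return the copy closest to the version of record, or None if there are none.

    Copies with a missing or unrecognized version sort last.
    """
    unknown_rank = len(VERSION_RANK)
    return min(
        copies,
        key=lambda copy: VERSION_RANK.get(copy.get("version"), unknown_rank),
        default=None,
    )


# Hypothetical hosted copies of one work.
sample_copies = [
    {"url": "https://example.org/preprint", "version": "submittedVersion"},
    {"url": "https://example.org/article", "version": "publishedVersion"},
    {"url": "https://example.org/repo", "version": None},
]
```

For `sample_copies`, `best_copy` returns the entry with `publishedVersion`.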
String: The license applied to this work at this host. Most toll-access works don't have an explicit license (they're under "all rights reserved" copyright), so this field generally has content only if
The OpenAccess object describes access options for a given work. It's only found as part of the
True if this work is Open Access (OA).
There are many ways to define OA. OpenAlex uses a broad definition: having a URL where you can read the fulltext of this work without needing to pay money or log in. You can use the
oa_status fields to narrow your results further, accommodating any definition of OA you like.
String: The Open Access (OA) status of this work. Possible values are:
bronze: Free to read on the publisher landing page, but without any identifiable license.
closed: All other articles.
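As an example of narrowing the broad definition, the sketch below treats a work as OA only if it is flagged as OA and its status is not in an excluded set, here `bronze` and `closed`, so that unlicensed free-to-read copies are left out. The `open_access` and `is_oa` key names, the function name, and the `gold` value in the sample data are assumptions not shown in this section.

```python
def is_oa_strict(work, excluded_statuses=("bronze", "closed")):
    """Return True if the work is OA under a stricter, license-aware definition."""
    oa = work.get("open_access") or {}
    return bool(oa.get("is_oa")) and oa.get("oa_status") not in excluded_statuses


# Hypothetical works; "gold" is an assumed status value.
bronze_work = {"open_access": {"is_oa": True, "oa_status": "bronze"}}
gold_work = {"open_access": {"is_oa": True, "oa_status": "gold"}}
closed_work = {"open_access": {"is_oa": False, "oa_status": "closed"}}
```

Passing a different `excluded_statuses` tuple adapts the same function to other definitions of OA.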
String: The best Open Access (OA) URL for this work.
Although there are many ways to define OA, in this context an OA URL is one where you can read the fulltext of this work without needing to pay money or log in. The "best" such URL is the one closest to the version of record.
This URL might be a direct link to a PDF, or it might be to a landing page that links to the free PDF.
Ngram objects are only available via the API.
String: Group of words (or numbers, letters, etc) that exist together in the work. This can be a five-gram, four-gram, trigram, bigram, or unigram.
ngram: "energy formula into a functional"
Integer: How many tokens are in the ngram.
Integer: How many times this ngram occurred in the work.
Float: How often the ngram occurred in the work.
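To make the three numeric properties concrete, here is a sketch that computes them for a piece of text. The property names `ngram_tokens`, `ngram_count`, and `term_frequency` are assumptions, as is the definition of frequency used here (occurrences divided by the total number of ngrams of the same length); the service may tokenize and normalize differently.

```python
from collections import Counter


def ngram_stats(text, n):
    """Count ngrams of length n using simple lowercase whitespace tokenization."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return [
        {
            "ngram": gram,
            "ngram_tokens": n,                # how many tokens are in the ngram
            "ngram_count": count,             # how many times it occurred
            "term_frequency": count / total,  # assumed: share of all n-length ngrams
        }
        for gram, count in counts.most_common()
    ]
```

For example, `ngram_stats("a b a b a", 2)` finds two distinct bigrams, "a b" and "b a", each occurring twice out of four bigrams in total.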