Newer works are more likely to have an abstract inverted index. For example, over 60% of works in 2022 have abstract data, compared to 45% for works older than 2000. Full chart is below:
display_name: "Europe PMC",
display_name: "Simon Fraser University - Summit",
// others omitted for brevity.
// first authorship object:
display_name: "Heather A. Piwowar",
// more authorship objects go here, omitted for space.
Object: Old-timey bibliographic info for this work. This is mostly useful only in citation/reference contexts. These are all strings because sometimes you'll get fun values like "Spring" and "Inside cover."
Integer: The number of citations to this work. These are the times that other works have cited this work: Other works ➞ This work.
Each Concept object in the list also has one additional property:
score (Float): The strength of the connection between the work and this concept (higher is stronger). This number is produced by AWS Sagemaker, in the last layer of the machine learning model that assigns concepts.
Concepts with a score of at least 0.3 are assigned to the work. However, ancestors of an assigned concept are also added to the work, even if the ancestor scores are below 0.3.
display_name: "Citation impact",
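The threshold rule above can be sketched in a few lines of Python. This is only an illustration: the sample data is made up, and the assumption is that each concept is a plain dict with `display_name` and `score` keys. A concept scoring below 0.3 can only be on the work because it is an ancestor of an assigned concept.

```python
# Threshold stated in the text above.
ASSIGNMENT_THRESHOLD = 0.3


def split_concepts(concepts):
    """Separate concepts that meet the threshold from ancestor-only concepts."""
    assigned = [c for c in concepts if c["score"] >= ASSIGNMENT_THRESHOLD]
    ancestors_only = [c for c in concepts if c["score"] < ASSIGNMENT_THRESHOLD]
    return assigned, ancestors_only


# Hypothetical sample data, not taken from a real API response.
sample_concepts = [
    {"display_name": "Citation impact", "score": 0.72},
    {"display_name": "Library science", "score": 0.41},
    {"display_name": "Computer science", "score": 0.12},
]

assigned, ancestors_only = split_concepts(sample_concepts)
print([c["display_name"] for c in assigned])
print([c["display_name"] for c in ancestors_only])
```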
Citations more than ten years old aren't included. Years with zero citations have been removed, so you will need to add those back in if you need them.
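If you need the zero-citation years back, something like the following sketch would do it. It assumes each entry is a dict with `year` and `cited_by_count` keys, and that the window is the current year plus the nine before it; both are assumptions for illustration, so adjust them to match what you actually receive.

```python
def fill_missing_years(counts_by_year, current_year, window=10):
    """Return one entry per year in the window, using 0 where a year is absent.

    The key names and the exact window bounds are assumptions.
    """
    by_year = {row["year"]: row["cited_by_count"] for row in counts_by_year}
    years = range(current_year, current_year - window, -1)
    return [{"year": y, "cited_by_count": by_year.get(y, 0)} for y in years]


# Hypothetical sample data: 2021 is missing because it had zero citations.
sparse = [
    {"year": 2022, "cited_by_count": 5},
    {"year": 2020, "cited_by_count": 2},
]

filled = fill_missing_years(sparse, current_year=2022)
for row in filled[:3]:
    print(row)
```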
display_name: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles",
Occasionally, a work has more than one DOI--for example, there might be one DOI for a preprint version hosted on bioRxiv, and another DOI for the published version. However, this field always has just one DOI, the DOI for the published work. If you want DOIs for other versions, you can find them in the
host_venue is where you can find the best (closest to the version of record) copy of this work. For a peer-reviewed journal article, the best host_venue would be a full text published version, hosted by the publisher at the article's DOI URL.
// this top stuff is the same as a dehydrated Venue object
// this stuff is extra, and relates to this work at this venue
Object: All the external identifiers that we know about for this work. IDs are expressed as URIs whenever possible. Possible ID types:
In our context, paratext is stuff that's in a scholarly venue (like a journal) but is about the venue rather than a scholarly work properly speaking. Some examples and nonexamples:
- yep it's paratext: front cover, back cover, table of contents, editorial board listing, issue information, masthead.
- no, not paratext: research paper, dataset, letters to the editor, figures
Turns out there is a lot of paratext in registries like Crossref. That's not a bad thing... but we've found that it's good to have a way to filter it out.
is_paratext is determined algorithmically using title heuristics.
Boolean: True if we know this work has been retracted.
This field has high precision but low recall. In other words, if the value is true, the article is definitely retracted. But if it is false, the article still might be retracted; we just don't know. This is because, unfortunately, the open sources for retraction data aren't currently very comprehensive, and the more comprehensive ones aren't sufficiently open for us to use here.
descriptor_name: "Peer Review, Research",
descriptor_name: "Peer Review, Research",
Where different publication dates exist, we select the earliest available date of electronic publication.
Integer: The year this work was published.
String: The title of this work.
title: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles",
String: The type or genre of the work.
Where possible, we just pass along Crossref's type value for each work. When that's impossible (e.g., the work isn't in Crossref), we do our best to figure out the type ourselves. Unfortunately, the accuracy of Crossref's data for this isn't great, and ours isn't much better. We're working to develop better type classification.
The Authorship object represents a single author and her institutional affiliations in the context of a given work. It is only found as part of a
display_name: "Juan Pablo Alperin",
String: A summarized description of this author's position in the work's author list. Possible values are
It's not strictly necessary, because author order is already implicitly recorded by the list order of Authorship objects; however, it's useful in some contexts to have this as a categorical value.
List: The institutional affiliations this author claimed in the context of this work, as dehydrated
display_name: "Simon Fraser University",
display_name: "Public Knowledge Project",
String: This author's affiliation as it originally came to us (on a webpage or in an API), as a raw unformatted string. Multiple affiliations are separated by a semicolon.
raw_affiliation_string: "Canadian Institute for Studies in Publishing, Simon Fraser University, Vancouver, BC, Canada."
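Since multiple affiliations are separated by a semicolon, a simple split gets you the individual strings. A minimal sketch (the second sample string is made up for illustration); note that an affiliation that itself contains a semicolon can't be told apart from two affiliations this way.

```python
def split_affiliations(raw_affiliation_string):
    """Split a raw affiliation string on semicolons and trim whitespace."""
    parts = raw_affiliation_string.split(";")
    return [part.strip() for part in parts if part.strip()]


single = "Canadian Institute for Studies in Publishing, Simon Fraser University, Vancouver, BC, Canada."
# Hypothetical example of a string carrying two affiliations.
double = "Simon Fraser University, Vancouver, BC, Canada; Public Knowledge Project"

print(split_affiliations(single))
print(split_affiliations(double))
```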
The HostVenue object describes a given work hosted on a given venue (you can think of it as a WorkVenue bridging table). It's only found as part of the Work object. It's got two parts:
1. a dehydrated Venue object, and
2. some extra stuff about the work.
The extra stuff is important because a given work can be hosted in different ways and in different forms, depending on where it's living.
Boolean: Set to true if the work hosted here can be read for free, without registration.
String: The license applied to this work at this host. Most toll-access works don't have an explicit license (they're under "all rights reserved" copyright), so this field generally has content only if
String: The URL where you can access this work.
publishedVersion: The document’s version of record. This is the most authoritative version.
acceptedVersion: The document after having completed peer review and being officially accepted for publication. It will lack publisher formatting, but the content should be interchangeable with that of the
submittedVersion: The document as submitted to the publisher by the authors, but before peer review. Its content may differ significantly from that of the accepted article.
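If you have several copies of a work and want the one closest to the version of record, you can rank them by the three values above. This is a sketch, not how OpenAlex itself picks a copy: it assumes each copy is a dict with a `version` key, and the `url` values are placeholders.

```python
# Lower rank means closer to the version of record, following the order above.
VERSION_RANK = {"publishedVersion": 0, "acceptedVersion": 1, "submittedVersion": 2}


def best_copy(copies):
    """Return the copy closest to the version of record, or None if there are none.

    Copies with a missing or unrecognized version sort last.
    """
    if not copies:
        return None
    worst = len(VERSION_RANK)
    return min(copies, key=lambda c: VERSION_RANK.get(c.get("version"), worst))


# Hypothetical copies of one work; the URLs are placeholders.
copies = [
    {"url": "https://example.org/preprint", "version": "submittedVersion"},
    {"url": "https://example.org/unknown", "version": None},
    {"url": "https://example.org/article", "version": "publishedVersion"},
]

print(best_copy(copies)["url"])
```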
String: Group of words (or numbers, letters, etc) that exist together in the work. This can be a five-gram, four-gram, trigram, bigram, or unigram.
ngram: "energy formula into a functional"
Integer: How many times this ngram occurred in the work.
Integer: How many tokens are in the ngram.
Float: How often the ngram occurred in the work.
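To make the ngram fields concrete, here is a rough sketch that builds unigrams through five-grams from a string and records the count and token length of each. It is not the pipeline that produces the real data: whitespace tokenization and the key names `ngram_count` and `ngram_tokens` are assumptions, and the frequency value is left out because its normalization isn't described here.

```python
from collections import Counter


def count_ngrams(text, max_n=5):
    """Count every ngram of 1..max_n whitespace-separated tokens in text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return [
        {"ngram": ngram, "ngram_count": count, "ngram_tokens": len(ngram.split())}
        for ngram, count in counts.items()
    ]


# Hypothetical input text built around the example ngram above.
rows = count_ngrams("energy formula into a functional energy")
by_ngram = {row["ngram"]: row for row in rows}
print(by_ngram["energy"])
print(by_ngram["energy formula into a functional"])
```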
The OpenAccess object describes access options for a given work. It's only found as part of the
True if this work is Open Access (OA).
There are many ways to define OA. OpenAlex uses a broad definition: having a URL where you can read the fulltext of this work without needing to pay money or log in. You can use the
oa_status fields to narrow your results further, accommodating any definition of OA you like.
String: The Open Access (OA) status of this work. Possible values are:
bronze: Free to read on the publisher landing page, but without any identifiable license.
closed: All other articles.
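As an example of narrowing the broad definition, suppose your definition of OA requires an identifiable license, so bronze works shouldn't count. A minimal sketch, assuming each work is a dict with an `open_access` object holding `is_oa` and `oa_status` (the key names and sample works are assumptions; "gold" is another OpenAlex status value not described in the list above):

```python
def is_oa_excluding(work, excluded_statuses):
    """True if the work is OA and its oa_status is not one you exclude."""
    oa = work.get("open_access") or {}
    return bool(oa.get("is_oa")) and oa.get("oa_status") not in excluded_statuses


# Hypothetical works, not real API responses.
works = [
    {"id": "W1", "open_access": {"is_oa": True, "oa_status": "bronze"}},
    {"id": "W2", "open_access": {"is_oa": True, "oa_status": "gold"}},
    {"id": "W3", "open_access": {"is_oa": False, "oa_status": "closed"}},
]

broad_ids = [w["id"] for w in works if is_oa_excluding(w, set())]
licensed_ids = [w["id"] for w in works if is_oa_excluding(w, {"bronze"})]
print(broad_ids)
print(licensed_ids)
```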
String: The best Open Access (OA) URL for this work.
Although there are many ways to define OA, in this context an OA URL is one where you can read the fulltext of this work without needing to pay money or log in. The "best" such URL is the one closest to the version of record.
This URL might be a direct link to a PDF, or it might be to a landing page that links to the free PDF.