Work object
There's a lot of useful data inside a work. When you use the API to get a single work or lists of works, this is what's returned.
Object: The abstract of the work, as an inverted index, which encodes information about the abstract's words and their positions within the text. Like Microsoft Academic Graph, OpenAlex doesn't include plaintext abstracts due to legal constraints.
abstract_inverted_index: {
Despite: [
0
],
growing: [
1
],
interest: [
2
],
in: [
3,
57,
73,
110,
122
],
Open: [
4,
201
],
Access: [
5
],
...
}
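If you need readable text, you can rebuild it by sorting the words by position. The following is a minimal sketch, not an official helper: the function name `abstract_from_inverted_index` is made up for this example, and the sample is a trimmed copy of the index above that keeps only positions 0 through 5.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plaintext from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""
    # Flatten to (position, word) pairs so each occurrence is placed once.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


# Trimmed sample: only the first six positions of the example above.
sample = {
    "Despite": [0],
    "growing": [1],
    "interest": [2],
    "in": [3],
    "Open": [4],
    "Access": [5],
}

print(abstract_from_inverted_index(sample))
# prints: Despite growing interest in Open Access
```

A word that appears several times simply contributes one pair per position, so repeated words land back in the right places.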
Newer works are more likely to have an abstract inverted index. For example, over 60% of works in 2022 have abstract data, compared to 45% for works older than 2000. Full chart is below:
List: List of HostVenue objects describing places this work lives. They're called "alternate" because the list doesn't include the work's canonical location; that's in host_venue.
alternate_host_venues: [
{
id: null,
display_name: "Europe PMC",
type: "repository",
url: "http://europepmc.org/articles/pmc5815332?pdf=render",
is_oa: true,
version: "publishedVersion",
license: "cc-by"
},
{
id: null,
display_name: "Simon Fraser University - Summit",
type: "repository",
url: "https://summit.sfu.ca/item/17691",
is_oa: true,
version: "submittedVersion",
license: "cc-by"
},
// others omitted for brevity.
]
List: List of Authorship objects, each representing an author and their institution. Limited to the first 100 authors to maintain API performance.
authorships: [
// first authorship object:
{
author_position: "first",
author: {
id: "https://openalex.org/A1969205032",
display_name: "Heather A. Piwowar",
orcid: "https://orcid.org/0000-0003-1613-5981"
},
institutions: [
{
id: "https://openalex.org/I4200000001",
display_name: "OurResearch",
ror: "https://ror.org/02nr0ka47",
country_code: "US",
type: "nonprofit"
}
]
},
// more authorship objects go here, omitted for space.
]
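As a rough illustration of walking this list, the sketch below flattens each authorship into a small summary row. The function name `summarize_authorships` and the row keys are assumptions made for this example; the sample reuses the first authorship shown above.

```python
def summarize_authorships(authorships):
    """Return one small dict per authorship, tolerating missing parts."""
    rows = []
    for authorship in authorships or []:
        author = authorship.get("author") or {}
        institutions = authorship.get("institutions") or []
        rows.append(
            {
                "position": authorship.get("author_position"),
                "name": author.get("display_name"),
                "institutions": [i.get("display_name") for i in institutions],
            }
        )
    return rows


sample_authorships = [
    {
        "author_position": "first",
        "author": {
            "id": "https://openalex.org/A1969205032",
            "display_name": "Heather A. Piwowar",
            "orcid": "https://orcid.org/0000-0003-1613-5981",
        },
        "institutions": [
            {
                "id": "https://openalex.org/I4200000001",
                "display_name": "OurResearch",
                "ror": "https://ror.org/02nr0ka47",
                "country_code": "US",
                "type": "nonprofit",
            }
        ],
    }
]

rows = summarize_authorships(sample_authorships)
print(rows[0]["name"], rows[0]["institutions"])
```

Because the list is capped at 100 authors, the number of rows may be smaller than the real author count for very large collaborations.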
Object: Old-timey bibliographic info for this work. This is mostly useful only in citation/reference contexts. These are all strings because sometimes you'll get fun values like "Spring" and "Inside cover."
- volume (String)
- issue (String)
- first_page (String)
- last_page (String)
biblio: {
volume: "495",
issue: "7442",
first_page: "437",
last_page: "440"
}
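Since these values are strings and are not guaranteed to be numeric, any arithmetic on them should expect failures. Here is a small sketch of a defensive page count; `page_count` is a hypothetical helper, not part of the API.

```python
def page_count(biblio):
    """Return the number of pages, or None when the pages aren't numeric."""
    biblio = biblio or {}
    try:
        first = int(biblio.get("first_page"))
        last = int(biblio.get("last_page"))
    except (TypeError, ValueError):
        # Missing keys, nulls, or values like "Inside cover" end up here.
        return None
    count = last - first + 1
    return count if count > 0 else None


biblio = {"volume": "495", "issue": "7442", "first_page": "437", "last_page": "440"}
print(page_count(biblio))  # prints: 4
print(page_count({"first_page": "Inside cover", "last_page": "440"}))  # prints: None
```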
String: A URL that uses the cites filter to display a list of works that cite this work. This is a way to expand cited_by_count into an actual list of works.
Integer: The number of citations to this work. These are the times that other works have cited this work: Other works ➞ This work.
cited_by_count: 382
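To make the relationship between the two fields concrete, the sketch below builds a URL of the kind the cites filter uses. The exact URL shape (`/works?filter=cites:<id>`) and the helper name `build_cites_url` are assumptions for illustration; prefer the URL returned in the work itself when it is available.

```python
def build_cites_url(openalex_id, base="https://api.openalex.org/works"):
    """Build a works-list URL filtered to works citing the given work.

    Assumes the filter syntax is `filter=cites:<short id>`.
    """
    # Accept either a full OpenAlex URI or a bare ID like "W2741809807".
    short_id = openalex_id.rstrip("/").rsplit("/", 1)[-1]
    return f"{base}?filter=cites:{short_id}"


print(build_cites_url("https://openalex.org/W2741809807"))
# prints: https://api.openalex.org/works?filter=cites:W2741809807
```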
Each Concept object in the list also has one additional property:
score (Float): The strength of the connection between the work and this concept (higher is stronger). This number is produced by AWS Sagemaker, in the last layer of the machine learning model that assigns concepts.
Concepts with a score of at least 0.3 are assigned to the work. However, ancestors of an assigned concept are also added to the work, even if the ancestor scores are below 0.3.
concepts: [
{
id: "https://openalex.org/C2778793908",
wikidata: "https://www.wikidata.org/wiki/Q5122404",
display_name: "Citation impact",
level: 3,
score: 0.459309
},
{
id: "https://openalex.org/C2778805511",
wikidata: "https://www.wikidata.org/wiki/Q1713",
display_name: "Citation",
level: 2,
score: 0.447306
}
]
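Because ancestors can appear with scores below the cutoff, it can be handy to separate the two groups. This is only a sketch: `split_concepts` is a made-up name, and treating every below-threshold concept as "present only as an ancestor" is an inference from the rule above, not something the API states per concept.

```python
SCORE_THRESHOLD = 0.3  # cutoff described above


def split_concepts(concepts, threshold=SCORE_THRESHOLD):
    """Split concepts into (at or above threshold, below threshold)."""
    at_or_above = [c for c in concepts if c.get("score", 0.0) >= threshold]
    below = [c for c in concepts if c.get("score", 0.0) < threshold]
    return at_or_above, below


concepts = [
    {
        "id": "https://openalex.org/C2778793908",
        "display_name": "Citation impact",
        "level": 3,
        "score": 0.459309,
    },
    {
        "id": "https://openalex.org/C2778805511",
        "display_name": "Citation",
        "level": 2,
        "score": 0.447306,
    },
]

direct, ancestors_only = split_concepts(concepts)
print([c["display_name"] for c in direct])
# prints: ['Citation impact', 'Citation']
```

Passing a stricter `threshold` (for example 0.45) is a simple way to keep only the strongest matches.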
List: Works.cited_by_count for each of the last ten years, binned by year. To put it another way: each year, you can see how many times this work was cited. Citations more than ten years old aren't included. Years with zero citations have been removed, so you will need to add those in if you need them.
counts_by_year: [
{
year: 2022,
cited_by_count: 8
},
{
year: 2021,
cited_by_count: 252
},
...
{
year: 2012,
cited_by_count: 79
}
]
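Adding the zero-citation years back in might look something like the sketch below. The helper name `fill_missing_years` is hypothetical. The sample keeps only the three entries shown above, so the years elided there come out as zero here purely because the sample omits them.

```python
def fill_missing_years(counts_by_year, first_year, last_year):
    """Return one entry per year, newest first, using 0 for absent years."""
    by_year = {entry["year"]: entry["cited_by_count"] for entry in counts_by_year}
    return [
        {"year": year, "cited_by_count": by_year.get(year, 0)}
        for year in range(last_year, first_year - 1, -1)
    ]


# Only the entries visible in the example above; the rest were elided there.
sample_counts = [
    {"year": 2022, "cited_by_count": 8},
    {"year": 2021, "cited_by_count": 252},
    {"year": 2012, "cited_by_count": 79},
]

filled = fill_missing_years(sample_counts, first_year=2012, last_year=2022)
print(len(filled))  # prints: 11
```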
String: The date this Work object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.
created_date: "2017-08-08"
String: Exactly the same as Work.title. It's useful for Works to include a display_name property, since all the other entities have one.
display_name: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles",
Occasionally, a work has more than one DOI--for example, there might be one DOI for a preprint version hosted on bioRxiv, and another DOI for the published version. However, this field always has just one DOI, the DOI for the published work. If you want DOIs for other versions, you can find them in the Work.alternate_host_venues list.
doi: "https://doi.org/10.7717/peerj.4375"
The host_venue is where you can find the best (closest to the version of record) copy of this work. For a peer-reviewed journal article, the best host_venue would be a full text published version, hosted by the publisher at the article's DOI URL.
Some records don't have a host_venue, because they were inherited from MAG, which implemented a less complete provenance chain. We're gradually filling in these missing host venues.
host_venue: {
// this top stuff is the same as a dehydrated Venue object
id: "https://openalex.org/V1983995261",
issn_l: "2167-8359",
issn: [
"2167-8359"
],
display_name: "PeerJ",
publisher: "PeerJ",
type: "journal",
// this stuff is extra, and relates to this work at this venue
url: "https://doi.org/10.7717/peerj.4375",
is_oa: null,
version: null,
license: null
}
id: "https://openalex.org/W2741809807"
Object: All the external identifiers that we know about for this work. IDs are expressed as URIs whenever possible. Possible ID types:
Most works are missing one or more ID types (either because we don't know the ID, or because it was never assigned). Keys for null IDs are not displayed.
ids: {
openalex: "https://openalex.org/W2741809807",
doi: "https://doi.org/10.7717/peerj.4375",
mag: 2741809807,
pmid: "https://pubmed.ncbi.nlm.nih.gov/29456894"
}
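Since keys for unknown IDs are simply absent, lookups should not assume any particular key exists. The sketch below shows one way to read the object safely and to strip the URI prefixes. The helper names are made up, and `pmcid` is used only as an example of a key that is missing from this sample; treat that key name as an assumption.

```python
def short_openalex_id(uri):
    """Return the trailing ID from an OpenAlex URI."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def bare_doi(uri):
    """Strip the https://doi.org/ prefix if present."""
    prefix = "https://doi.org/"
    return uri[len(prefix):] if uri.startswith(prefix) else uri


ids = {
    "openalex": "https://openalex.org/W2741809807",
    "doi": "https://doi.org/10.7717/peerj.4375",
    "mag": 2741809807,
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/29456894",
}

print(short_openalex_id(ids["openalex"]))  # prints: W2741809807
print(bare_doi(ids["doi"]))  # prints: 10.7717/peerj.4375
print(ids.get("pmcid"))  # prints: None (key absent, not null)
```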
In our context, paratext is stuff that's in a scholarly venue (like a journal) but is about the venue rather than a scholarly work properly speaking. Some examples and nonexamples:
- yep it's paratext: front cover, back cover, table of contents, editorial board listing, issue information, masthead.
- no, not paratext: research paper, dataset, letters to the editor, figures
Turns out there is a lot of paratext in registries like Crossref. That's not a bad thing... but we've found that it's good to have a way to filter it out.
We determine is_paratext algorithmically using title heuristics.
is_paratext: false
Boolean: True if we know this work has been retracted.
This field has high precision but low recall. In other words, if is_retracted is true, the article is definitely retracted. But if is_retracted is false, it still might be retracted, but we just don't know. This is because unfortunately, the open sources for retraction data aren't currently very comprehensive, and the more comprehensive ones aren't sufficiently open for us to use here.
is_retracted: false
mesh: [
{
descriptor_ui: "D017712",
descriptor_name: "Peer Review, Research",
qualifier_ui: "Q000379",
qualifier_name: "methods",
is_major_topic: false
},
{
descriptor_ui: "D017712",
descriptor_name: "Peer Review, Research",
qualifier_ui: "Q000592",
qualifier_name: "standards",
is_major_topic: true
}
]
String: A URL that lists the groups of words and phrases (n-grams) that make up a work, as obtained from the Internet Archive. See The Ngram object and Get N-grams for background on n-grams, how we use them, and what this API call returns.
ngrams_url: "https://api.openalex.org/works/W2023271753/ngrams"
open_access: {
is_oa: true,
oa_status: "gold",
oa_url: "https://peerj.com/articles/4375.pdf"
},
Where different publication dates exist, we select the earliest available date of electronic publication.
This date applies to the version found at Work.url. The other versions, found in Work.alternate_host_venues, may have been published at different (earlier) dates.
publication_date: "2018-02-13"
Integer: The year this work was published.
This year applies to the version found at Work.url. The other versions, found in Work.alternate_host_venues, may have been published in different (earlier) years.
publication_year: 2018
List: OpenAlex IDs for works that this work cites. These are citations that go from this work out to another work: This work ➞ Other works.
referenced_works: [
"https://openalex.org/W2753353163",
"https://openalex.org/W2785823074",
"https://openalex.org/W2511661767",
"https://openalex.org/W2115339903",
"https://openalex.org/W2031754690"
]
List: OpenAlex IDs for works related to this work. Related works are computed algorithmically; the algorithm finds recent papers with the most concepts in common with the current paper.
related_works: [
"https://openalex.org/W2753353163",
"https://openalex.org/W2785823074",
"https://openalex.org/W2511661767",
"https://openalex.org/W2115339903",
"https://openalex.org/W2031754690",
]
String: The title of this work.
title: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles",
String: The type or genre of the work.
This field uses Crossref's "type" controlled vocabulary; you can see all possible values via the Crossref api here: https://api.crossref.org/types.
Where possible, we just pass along Crossref's type value for each work. When that's impossible (e.g., the work isn't in Crossref), we do our best to figure out the type ourselves. Unfortunately the accuracy of Crossref's data for this isn't great, and ours isn't much better. We're working to develop better type classification.
type: "journal-article"
String: The last time anything in this Work object changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.
updated_date: "2022-01-02T00:22:35.180390"
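Both date fields are ISO 8601 strings, so Python's standard library can parse them directly. A brief sketch, assuming the two formats shown in the examples (a plain date for created_date and a timestamp with microseconds for updated_date):

```python
from datetime import date, datetime

# created_date is a plain calendar date.
created = date.fromisoformat("2017-08-08")

# updated_date carries a time component with microseconds and no timezone.
updated = datetime.fromisoformat("2022-01-02T00:22:35.180390")

# Compare on the date part to see how long the record has existed.
days_between = (updated.date() - created).days
print(created.year, updated.year, days_between > 0)
# prints: 2017 2022 True
```

Since updated_date changes whenever counts change, it is better suited to incremental syncing than to judging how recently the bibliographic metadata itself was edited.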
The Authorship object represents a single author and her institutional affiliations in the context of a given work. It is only found as part of a Work object.
author: {
id: "https://openalex.org/A2790141563",
display_name: "Juan Pablo Alperin",
orcid: "https://orcid.org/0000-0002-9344-7439"
}
String: A summarized description of this author's position in the work's author list. Possible values are first, middle, and last. It's not strictly necessary, because author order is already implicitly recorded by the list order of Authorship objects; however it's useful in some contexts to have this as a categorical value.
author_position: "first"
List: The institutional affiliations this author claimed in the context of this work, as dehydrated Institution objects.
institutions: [
{
id: "https://openalex.org/I18014758",
display_name: "Simon Fraser University",
ror: "https://ror.org/0213rcc28",
country_code: "CA",
type: "education"
},
{
id: "https://openalex.org/I209863525",
display_name: "Public Knowledge Project",
ror: null,
country_code: null,
type: null
}
]
String: This author's affiliation as it originally came to us (on a webpage or in an API), as a raw unformatted string. Multiple affiliations are separated by a semicolon.
raw_affiliation_string: "Canadian Institute for Studies in Publishing, Simon Fraser University, Vancouver, BC, Canada."
The HostVenue object describes a given work hosted on a given venue (you can think of it as a WorkVenue bridging table). It's only found as part of the Work object. It's got two parts:
1. a dehydrated Venue object, and
2. some extra stuff about the work.
The extra stuff is important because a given work can be hosted in different ways and in different forms, depending on where it's living.
To learn more about the dehydrated Venue object part, see the DehydratedVenue docs. To learn more about the other stuff, read below:
Boolean: Set to true if the work hosted here can be read for free, without registration.
is_oa: true
String: The license applied to this work at this host. Most toll-access works don't have an explicit license (they're under "all rights reserved" copyright), so this field generally has content only if is_oa is true.
license: "cc-by"
String: The URL where you can access this work.
url: "https://doi.org/10.7717/peerj.4375"
String: The version of the work, based on the DRIVER Guidelines versioning scheme. Possible values are:
- publishedVersion: The document’s version of record. This is the most authoritative version.
- acceptedVersion: The document after having completed peer review and being officially accepted for publication. It will lack publisher formatting, but the content should be interchangeable with that of the publishedVersion.
- submittedVersion: The document as submitted to the publisher by the authors, but before peer review. Its content may differ significantly from that of the accepted article.
version: "publishedVersion"
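One way to use these values is to rank free copies of a work by how close they are to the version of record. The sketch below is an assumption-laden illustration: the ranking order, the `VERSION_RANK` table, and the `best_host_venue` helper are all invented here, and OpenAlex may pick its own "best" copy differently. The sample reuses the two alternate host venues from the example earlier on this page.

```python
# Lower rank means closer to the version of record (an assumed ordering).
VERSION_RANK = {"publishedVersion": 0, "acceptedVersion": 1, "submittedVersion": 2}


def best_host_venue(host_venues):
    """Pick the free-to-read copy with the most authoritative version."""
    candidates = [v for v in host_venues if v.get("is_oa") and v.get("url")]
    if not candidates:
        return None
    # Unknown or null versions sort after every known version.
    return min(
        candidates,
        key=lambda v: VERSION_RANK.get(v.get("version"), len(VERSION_RANK)),
    )


sample_venues = [
    {
        "display_name": "Simon Fraser University - Summit",
        "url": "https://summit.sfu.ca/item/17691",
        "is_oa": True,
        "version": "submittedVersion",
    },
    {
        "display_name": "Europe PMC",
        "url": "http://europepmc.org/articles/pmc5815332?pdf=render",
        "is_oa": True,
        "version": "publishedVersion",
    },
]

best = best_host_venue(sample_venues)
print(best["display_name"])  # prints: Europe PMC
```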
String: Group of words (or numbers, letters, etc) that exist together in the work. This can be a five-gram, four-gram, trigram, bigram, or unigram.
ngram: "energy formula into a functional"
Integer: How many times this ngram occurred in the work.
ngram_count: 1
Integer: How many tokens are in the ngram.
ngram_tokens: 5
Float: How often the ngram occurred in the work.
Caution: This data was taken directly from the General Index and we've not tested term_frequency against actual articles. You can read about their data extraction process on the Internet Archive website. If you compare term_frequency against articles we would like to hear how it went!
term_frequency: 0.0005452562704471102
The OpenAccess object describes access options for a given work. It's only found as part of the Work object.
Boolean: True if this work is Open Access (OA). There are many ways to define OA. OpenAlex uses a broad definition: having a URL where you can read the fulltext of this work without needing to pay money or log in. You can use the alternate_host_venues and oa_status fields to narrow your results further, accommodating any definition of OA you like.
is_oa: true
String: The Open Access (OA) status of this work. Possible values are:
- bronze: Free to read on the publisher landing page, but without any identifiable license.
- closed: All other articles.
oa_status: "gold"
String: The best Open Access (OA) URL for this work.
Although there are many ways to define OA, in this context an OA URL is one where you can read the fulltext of this work without needing to pay money or log in. The "best" such URL is the one closest to the version of record.
This URL might be a direct link to a PDF, or it might be to a landing page that links to the free PDF.
oa_url: "https://peerj.com/articles/4375.pdf"