Work

Works are scholarly documents like journal articles, books, datasets, and theses.
OpenAlex indexes about 239M works, with about 50,000 added daily. The canonical PID for works is DOI; about half of works have one.
We collect new works from many sources, including Crossref, PubMed, and institutional and discipline-specific repositories (eg, arXiv). Many older works come from the now-defunct Microsoft Academic Graph.
The same work can be hosted in multiple venues, often with slight differences. So, we cluster works together, using an algorithm that does fuzzy matching based on each work’s publication date, title, and author list. For example: https://doi.org/10.1364/PRJ.433188 and https://arxiv.org/abs/2102.11388 are two versions of the same paper, so they appear in OpenAlex as one Work, https://openalex.org/W3184470535.
Works are linked to other works via the referenced_works (outgoing citations), cited_by_api_url (incoming citations), and related_works properties.
There are three component objects that are only used as part of a Work: the Authorship object, the HostVenue object, and the OpenAccess object.
Most of the examples below are drawn from a single work. You can view this work in its entirety via the website or API.

The Work object

id

String: The OpenAlex ID for this work.
id: "https://openalex.org/W2741809807"

doi

String: The DOI for the work. This is the Canonical External ID for works.
Occasionally, a work has more than one DOI--for example, there might be one DOI for a preprint version hosted on bioRxiv, and another DOI for the published version. However, this field always has just one DOI, the DOI for the published work. If you want DOIs for other versions, you can find them in the Work.alternate_host_venues list.
doi: "https://doi.org/10.7717/peerj.4375"

title

String: The title of this work.
title: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles",

display_name

String: Exactly the same as Work.title. It's useful for Works to include a display_name property, since all the other entities have one.
display_name: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles",

publication_year

Integer: The year this work was published.
This year applies to the version found at Work.url. The other versions, found in Work.alternate_host_venues, may have been published in different (earlier) years.
publication_year: 2018

publication_date

String: The day when this work was published, formatted as an ISO 8601 date.
Where different publication dates exist, we select the earliest available date of electronic publication.
This date applies to the version found at Work.url. The other versions, found in Work.alternate_host_venues, may have been published at different (earlier) dates.
publication_date: "2018-02-13"

ids

Object: All the external identifiers that we know about for this work. IDs are expressed as URIs whenever possible. ID types include openalex, doi, mag, and pmid.
Most works are missing one or more ID types (either because we don't know the ID, or because it was never assigned). Keys for null IDs are not displayed.
ids: {
openalex: "https://openalex.org/W2741809807",
doi: "https://doi.org/10.7717/peerj.4375",
mag: 2741809807,
pmid: "https://pubmed.ncbi.nlm.nih.gov/29456894"
}
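Because keys for null IDs are omitted, code that reads ids should treat every key as optional. The following is a minimal Python sketch using the example above; the helper name short_id and the PREFIXES table are hypothetical and not part of OpenAlex.

```python
from typing import Optional

# The ids object from the example above, as a Python dict.
ids = {
    "openalex": "https://openalex.org/W2741809807",
    "doi": "https://doi.org/10.7717/peerj.4375",
    "mag": 2741809807,
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/29456894",
}

# URI prefixes as seen in the example (an assumption for illustration;
# other ID types may use other prefixes).
PREFIXES = {
    "openalex": "https://openalex.org/",
    "doi": "https://doi.org/",
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/",
}


def short_id(ids: dict, key: str) -> Optional[str]:
    """Return the ID without its URI prefix, or None if the key is absent."""
    value = ids.get(key)  # a missing key means unknown or never assigned
    if value is None:
        return None
    value = str(value)  # mag is an integer in the example, not a URI
    prefix = PREFIXES.get(key, "")
    if prefix and value.startswith(prefix):
        return value[len(prefix):]
    return value
```

Using ids.get rather than ids[key] avoids a KeyError for works that lack a given ID type.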

host_venue

Object: A HostVenue object describing how and where this work is being hosted online.
The host_venue is where you can find the best (closest to the version of record) copy of this work. For a peer-reviewed journal article, the best host_venue would be a full text published version, hosted by the publisher at the article's DOI URL.
Some records don't have a host_venue, because they were inherited from MAG, which implemented a less complete provenance chain. We're gradually filling in these missing host venues.
host_venue: {
// this top stuff is the same as a dehydrated Venue object
id: "https://openalex.org/V1983995261",
issn_l: "2167-8359",
issn: [
"2167-8359"
],
display_name: "PeerJ",
publisher: "PeerJ",
type: "journal",
// this stuff is extra, and relates to this work at this venue
url: "https://doi.org/10.7717/peerj.4375",
is_oa: null,
version: null,
license: null
}

type

String: The type or genre of the work.
This field uses Crossref's "type" controlled vocabulary; you can see all possible values via the Crossref api here: https://api.crossref.org/types.
Where possible, we just pass along Crossref's type value for each work. When that's impossible (eg the work isn't in Crossref), we do our best to figure out the type ourselves. Unfortunately the accuracy of Crossref's data for this isn't great, and ours isn't much better. We're working to develop better type classification.
type: "journal-article"

open_access

Object: Information about the access status of this work, as an OpenAccess object.
open_access: {
is_oa: true,
oa_status: "gold",
oa_url: "https://peerj.com/articles/4375.pdf"
},

authorships

List: List of Authorship objects, each representing an author and their institution.
authorships: [
// first authorship object:
{
author_position: "first",
author: {
id: "https://openalex.org/A1969205032",
display_name: "Heather A. Piwowar",
orcid: "https://orcid.org/0000-0003-1613-5981"
},
institutions: [
{
id: "https://openalex.org/I4200000001",
display_name: "OurResearch",
ror: "https://ror.org/02nr0ka47",
country_code: "US",
type: "nonprofit"
}
]
},
// more authorship objects go here, omitted for space.
]
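As a rough illustration of reading this list, the Python sketch below flattens authorships into simple rows. It uses the first authorship from the example above; the function name author_rows is hypothetical.

```python
# The first authorship object from the example above, as Python data.
authorships = [
    {
        "author_position": "first",
        "author": {
            "id": "https://openalex.org/A1969205032",
            "display_name": "Heather A. Piwowar",
            "orcid": "https://orcid.org/0000-0003-1613-5981",
        },
        "institutions": [
            {
                "id": "https://openalex.org/I4200000001",
                "display_name": "OurResearch",
                "ror": "https://ror.org/02nr0ka47",
                "country_code": "US",
                "type": "nonprofit",
            }
        ],
    },
]


def author_rows(authorships):
    """Yield (position, author name, [institution names]) in list order."""
    for authorship in authorships:
        author = authorship.get("author") or {}
        institutions = authorship.get("institutions") or []
        yield (
            authorship.get("author_position"),
            author.get("display_name"),
            [inst.get("display_name") for inst in institutions],
        )


rows = list(author_rows(authorships))
```

The rows come out in the same order as the authorships list.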

cited_by_count

Integer: The number of citations to this work. These are the times that other works have cited this work: Other works ➞ This work.
cited_by_count: 382

biblio

Object: Old-timey bibliographic info for this work. This is mostly useful only in citation/reference contexts. These are all strings because sometimes you'll get fun values like "Spring" and "Inside cover."
  • volume (String)
  • issue (String)
  • first_page (String)
  • last_page (String)
biblio: {
volume: "495",
issue: "7442",
first_page: "437",
last_page: "440"
}

is_retracted

Boolean: True if we know this work has been retracted.
This field has high precision but low recall. In other words, if is_retracted is true, the article is definitely retracted. But if is_retracted is false, it might still be retracted; we just don't know. This is because the open sources for retraction data aren't currently very comprehensive, and the more comprehensive ones aren't sufficiently open for us to use here.
is_retracted: false

is_paratext

Boolean: True if we think this work is paratext.
In our context, paratext is stuff that's in a scholarly venue (like a journal) but is about the venue rather than a scholarly work properly speaking. Some examples and nonexamples:
  • yep it's paratext: front cover, back cover, table of contents, editorial board listing, issue information, masthead.
  • no, not paratext: research paper, dataset, letters to the editor, figures
Turns out there is a lot of paratext in registries like Crossref. That's not a bad thing... but we've found that it's good to have a way to filter it out.
We determine is_paratext algorithmically using title heuristics.
is_paratext: false

concepts

List: List of dehydrated Concept objects.
Each Concept object in the list also has one additional property:
  • score (Float): The strength of the connection between the work and this concept (higher is stronger). This number is produced by AWS Sagemaker, in the last layer of the machine learning model that assigns concepts.
concepts: [
{
id: "https://openalex.org/C2778793908",
wikidata: "https://www.wikidata.org/wiki/Q5122404",
display_name: "Citation impact",
level: 3,
score: 0.459309
},
{
id: "https://openalex.org/C2778805511",
wikidata: "https://www.wikidata.org/wiki/Q1713",
display_name: "Citation",
level: 2,
score: 0.447306
}
]
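Since a higher score means a stronger connection, one common use is to keep only the stronger concepts and sort them. The sketch below assumes the two concepts from the example above; the function name top_concepts and the 0.45 cutoff are arbitrary choices for illustration, not recommended values.

```python
# The two concepts from the example above (wikidata links left out for brevity).
concepts = [
    {
        "id": "https://openalex.org/C2778793908",
        "display_name": "Citation impact",
        "level": 3,
        "score": 0.459309,
    },
    {
        "id": "https://openalex.org/C2778805511",
        "display_name": "Citation",
        "level": 2,
        "score": 0.447306,
    },
]


def top_concepts(concepts, min_score=0.45):
    """Keep concepts scoring at least min_score, strongest first."""
    kept = [c for c in concepts if c["score"] >= min_score]
    return sorted(kept, key=lambda c: c["score"], reverse=True)


names = [c["display_name"] for c in top_concepts(concepts)]
```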

mesh

List: List of MeSH tag objects. Only works found in PubMed have MeSH tags; for all other works, this is an empty list.
mesh: [
{
descriptor_ui: "D017712",
descriptor_name: "Peer Review, Research",
qualifier_ui: "Q000379",
qualifier_name: "methods",
is_major_topic: false
},
{
descriptor_ui: "D017712",
descriptor_name: "Peer Review, Research",
qualifier_ui: "Q000592",
qualifier_name: "standards",
is_major_topic: true
}
]

alternate_host_venues

List: List of HostVenue objects describing places this work lives. They're called "alternate" because the list doesn't include the work's canonical location; that's in host_venue.
alternate_host_venues: [
{
id: null,
display_name: "Europe PMC",
type: "repository",
url: "http://europepmc.org/articles/pmc5815332?pdf=render",
is_oa: true,
version: "publishedVersion",
license: "cc-by"
},
{
id: null,
display_name: "Simon Fraser University - Summit",
type: "repository",
url: "https://summit.sfu.ca/item/17691",
is_oa: true,
version: "submittedVersion",
license: "cc-by"
},
// others omitted for brevity.
]

referenced_works

List: OpenAlex IDs for works that this work cites. These are citations that go from this work out to another work: This work ➞ Other works.
referenced_works: [
"https://openalex.org/W2753353163",
"https://openalex.org/W2785823074",
"https://openalex.org/W2511661767",
"https://openalex.org/W2115339903",
"https://openalex.org/W2031754690"
]

related_works

List: OpenAlex IDs for works related to this work. Related works are computed algorithmically; the algorithm finds recent papers with the most concepts in common with the current paper.
related_works: [
"https://openalex.org/W2753353163",
"https://openalex.org/W2785823074",
"https://openalex.org/W2511661767",
"https://openalex.org/W2115339903",
"https://openalex.org/W2031754690",
]

ngrams_url

String: This field is only available in the API. It lists groups of words and phrases (n-grams) that make up a work, as obtained from the General Index. See The Ngram object and Get N-grams for background on n-grams, how we use them, and what this API call returns.
ngrams_url: "https://api.openalex.org/works/W2023271753/ngrams"

abstract_inverted_index

Object: The abstract of the work, as an inverted index, which encodes information about the abstract's words and their positions within the text. Like Microsoft Academic Graph, OpenAlex doesn't include plaintext abstracts due to legal constraints.
abstract_inverted_index: {
Despite: [
0
],
growing: [
1
],
interest: [
2
],
in: [
3,
57,
73,
110,
122
],
Open: [
4,
201
],
Access: [
5
],
...
}
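If you need readable text, the inverted index can be turned back into a word sequence by placing each word at its recorded positions. The Python sketch below does this for a trimmed sample containing only positions 0 through 5 of the example above; the function name is hypothetical.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild a word sequence from a {word: [positions]} mapping."""
    if not inverted_index:
        return None  # no abstract data for this work
    words_by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            words_by_position[position] = word
    # Join words in position order; any gaps in the positions are skipped.
    return " ".join(words_by_position[p] for p in sorted(words_by_position))


# Trimmed sample: only the first six positions of the example above.
sample = {
    "Despite": [0],
    "growing": [1],
    "interest": [2],
    "in": [3],
    "Open": [4],
    "Access": [5],
}
text = abstract_from_inverted_index(sample)
# text is "Despite growing interest in Open Access"
```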

Abstract inverted index coverage

Newer works are more likely to have an abstract inverted index. For example, over 60% of works from 2022 have abstract data, compared to 45% of works published before 2000. Full chart is below:

cited_by_api_url

String: A URL that uses the cites filter to display a list of works that cite this work. This is a way to expand cited_by_count into an actual list of works.

counts_by_year

List: Work.cited_by_count for each of the last ten years, binned by year. To put it another way: for each year, you can see how many times this work was cited.
Citations more than ten years old aren't included. Years with zero citations are omitted, so you will need to add them back in if you need them.
counts_by_year: [
{
year: 2022,
cited_by_count: 8
},
{
year: 2021,
cited_by_count: 252
},
...
{
year: 2012,
cited_by_count: 79
}
]
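Because years with zero citations are left out, a complete series has to be rebuilt on the client side. The sketch below is one way to do that in Python. The function name fill_missing_years is hypothetical, the year range is passed in explicitly rather than assumed, and the sample counts are made up for illustration.

```python
def fill_missing_years(counts_by_year, first_year, last_year):
    """Return one row per year from last_year down to first_year,
    using 0 for years that are absent from counts_by_year."""
    by_year = {row["year"]: row["cited_by_count"] for row in counts_by_year}
    return [
        {"year": year, "cited_by_count": by_year.get(year, 0)}
        for year in range(last_year, first_year - 1, -1)
    ]


# Made-up sparse input: 2021 and 2019 had no citations, so they are absent.
sparse = [
    {"year": 2022, "cited_by_count": 3},
    {"year": 2020, "cited_by_count": 1},
]
filled = fill_missing_years(sparse, first_year=2019, last_year=2022)
```

The result keeps the newest-first ordering used in the example above.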

updated_date

String: The last time anything in this Work object changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.
updated_date: "2022-01-02T00:22:35.180390"

created_date

String: The date this Work object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.
created_date: "2017-08-08"
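The date fields in the examples above can be parsed with Python's standard library, as sketched below. Note that updated_date in the example carries a time component while publication_date and created_date are plain dates; the variable names simply mirror the field names.

```python
from datetime import date, datetime

# Values copied from the examples above.
publication_year = 2018
publication_date = date.fromisoformat("2018-02-13")
created_date = date.fromisoformat("2017-08-08")

# updated_date includes a time of day, so it needs datetime, not date.
# The example string has no UTC offset, so the result is a naive datetime.
updated_date = datetime.fromisoformat("2022-01-02T00:22:35.180390")

# publication_year and publication_date agree for this example.
years_match = publication_date.year == publication_year

# Time between the record's creation and its last update.
record_age = updated_date.date() - created_date
```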

The Authorship object

The Authorship object represents a single author and her institutional affiliations in the context of a given work. It is only found as part of a Work object.

author_position

String: A summarized description of this author's position in the work's author list. Possible values are first, middle, and last.
It's not strictly necessary, because author order is already implicitly recorded by the list order of Authorship objects; however it's useful in some contexts to have this as a categorical value.
author_position: "first"

author

Object: An author of this work, as a dehydrated Author object.
author: {
id: "https://openalex.org/A2790141563",
display_name: "Juan Pablo Alperin",
orcid: "https://orcid.org/0000-0002-9344-7439"
}

institutions

List: The institutional affiliations this author claimed in the context of this work, as dehydrated Institution objects.
institutions: [
{
id: "https://openalex.org/I18014758",
display_name: "Simon Fraser University",
ror: "https://ror.org/0213rcc28",
country_code: "CA",
type: "education"
},
{
id: "https://openalex.org/I209863525",
display_name: "Public Knowledge Project",
ror: null,
country_code: null,
type: null
}
]

raw_affiliation_string

String: This author's affiliation as it originally came to us (on a webpage or in an API), as a raw unformatted string. Multiple affiliations are separated by a semicolon.
raw_affiliation_string: "Canadian Institute for Studies in Publishing, Simon Fraser University, Vancouver, BC, Canada."

The HostVenue object

The HostVenue object describes a given work hosted on a given venue (you can think of it as a WorkVenue bridging table). It's only found as part of the Work object. It's got two parts:
  1. a dehydrated Venue object, and
  2. some extra stuff about the work.
The extra stuff is important because a given work can be hosted in different ways and in different forms, depending on where it's living.
To learn more about the dehydrated Venue object part, see the DehydratedVenue docs. To learn more about the other stuff, read below:

url

String: The URL where you can access this work.
url: "https://doi.org/10.7717/peerj.4375"

is_oa

Boolean: Set to true if the work hosted here can be read for free, without registration.
is_oa: true

version

String: The version of the work, based on the DRIVER Guidelines versioning scheme. Possible values are:
  • publishedVersion: The document’s version of record. This is the most authoritative version.
  • acceptedVersion: The document after having completed peer review and being officially accepted for publication. It will lack publisher formatting, but the content should be interchangeable with that of the publishedVersion.
  • submittedVersion: The document as submitted to the publisher by the authors, but before peer review. Its content may differ significantly from that of the accepted article.
version: "publishedVersion"
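The three values form a natural ordering, from the version of record down to the pre-review submission. The Python sketch below uses that ordering to pick the most authoritative copy from a list of HostVenue objects. The sample data comes from the alternate_host_venues example; the names VERSION_RANK and best_host, and the choice to rank a null version last, are assumptions for illustration.

```python
# Lower rank means closer to the version of record.
VERSION_RANK = {
    "publishedVersion": 0,
    "acceptedVersion": 1,
    "submittedVersion": 2,
}


def best_host(hosts):
    """Return the host with the most authoritative version, or None if empty.

    Hosts with a null or unrecognized version rank last (an assumption).
    """
    return min(
        hosts,
        key=lambda host: VERSION_RANK.get(host.get("version"), len(VERSION_RANK)),
        default=None,
    )


# Trimmed HostVenue objects from the alternate_host_venues example.
hosts = [
    {"display_name": "Simon Fraser University - Summit", "version": "submittedVersion"},
    {"display_name": "Europe PMC", "version": "publishedVersion"},
]
best = best_host(hosts)
```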

license

String: The license applied to this work at this host. Most toll-access works don't have an explicit license (they're under "all rights reserved" copyright), so this field generally has content only if is_oa is true.
license: "cc-by"

The OpenAccess object

The OpenAccess object describes access options for a given work. It's only found as part of the Work object.

is_oa

Boolean: True if this work is Open Access (OA).
There are many ways to define OA. OpenAlex uses a broad definition: having a URL where you can read the fulltext of this work without needing to pay money or log in. You can use the alternate_host_venues and oa_status fields to narrow your results further, accommodating any definition of OA you like.
is_oa: true

oa_status

String: The Open Access (OA) status of this work. Possible values are:
  • gold: Published in an OA journal that is indexed by the DOAJ.
  • green: Toll-access on the publisher landing page, but there is a free copy in an OA repository.
  • hybrid: Free under an open license in a toll-access journal.
  • bronze: Free to read on the publisher landing page, but without any identifiable license.
  • closed: All other articles.
oa_status: "gold"
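These statuses, together with alternate_host_venues, can be combined to express a narrower definition of OA than is_oa alone. The Python sketch below shows two possible narrower tests; both definitions and the function names are illustrative assumptions, not official OpenAlex categories.

```python
# A trimmed Work, built from the examples on this page.
work = {
    "open_access": {
        "is_oa": True,
        "oa_status": "gold",
        "oa_url": "https://peerj.com/articles/4375.pdf",
    },
    "alternate_host_venues": [
        {
            "display_name": "Europe PMC",
            "is_oa": True,
            "version": "publishedVersion",
            "license": "cc-by",
        },
    ],
}

# Statuses where the free copy is the publisher's own (green and closed are not).
PUBLISHER_FREE = {"gold", "hybrid", "bronze"}


def free_at_publisher(work):
    """True if the work is free to read at the publisher."""
    return work["open_access"]["oa_status"] in PUBLISHER_FREE


def has_licensed_copy(work):
    """True if any alternate host offers a free copy with an explicit license."""
    return any(
        host.get("is_oa") and host.get("license")
        for host in work.get("alternate_host_venues", [])
    )
```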

oa_url

String: The best Open Access (OA) URL for this work.
Although there are many ways to define OA, in this context an OA URL is one where you can read the fulltext of this work without needing to pay money or log in. The "best" such URL is the one closest to the version of record.
This URL might be a direct link to a PDF, or it might be to a landing page that links to the free PDF.
oa_url: "https://peerj.com/articles/4375.pdf"

The Ngram object

Ngram objects are only available via the API.

ngram

String: Group of words (or numbers, letters, etc) that exist together in the work. This can be a five-gram, four-gram, trigram, bigram, or unigram.
ngram: "energy formula into a functional"

ngram_tokens

Integer: How many tokens are in the ngram.
ngram_tokens: 5

ngram_count

Integer: How many times this ngram occurred in the work.
ngram_count: 1

term_frequency

Float: How often the ngram occurred in the work.
Caution: This data was taken directly from the General Index, and we've not tested term_frequency against actual articles. You can read about their data extraction process on the General Index website. If you compare term_frequency against articles, we would like to hear how it went!
term_frequency: 0.0005452562704471102