Work
Works are scholarly documents like journal articles, books, datasets, and theses.
OpenAlex indexes about 209M works, with about 50,000 added daily. The canonical PID for works is DOI; about half of works have one.
We collect new works from many sources, including Crossref, PubMed, institutional and discipline-specific repositories (eg, arXiv). Many older works come from the now-defunct Microsoft Academic Graph.
The same work can be hosted in multiple venues, often with slight differences. So, we cluster works together, using an algorithm that does fuzzy matching based on each work's publication date, title, and author list. For example: https://doi.org/10.1364/PRJ.433188 and https://arxiv.org/abs/2102.11388 are two versions of the same paper, so they appear in OpenAlex as one Work, https://openalex.org/W3184470535.
Works are linked to other works via the referenced_works (outgoing citations), cited_by_api_url (incoming citations), and related_works properties.
There are three component objects that are only used as part of a Work: the Authorship object, the HostVenue object, and the OpenAccess object.
Most of the examples below are drawn from a single work. You can view this work in its entirety via the website or API.

The Work object

id

String: The OpenAlex ID for this work.
id: "https://openalex.org/W2741809807"

doi

String: The DOI for the work. This is the Canonical External ID for works.
Occasionally, a work has more than one DOI--for example, there might be one DOI for a preprint version hosted on bioRxiv, and another DOI for the published version. However, this field always has just one DOI, the DOI for the published work. If you want DOIs for other versions, you can find them in the Work.alternate_host_venues list.
doi: "https://doi.org/10.7717/peerj.4375"
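Since only about half of works have a DOI, code that reads this field generally has to cope with a missing value. The following is a minimal sketch, assuming the field arrives as either a URL string or null (None once parsed in Python). The helper name bare_doi is hypothetical and not part of OpenAlex.

```python
DOI_PREFIX = "https://doi.org/"


def bare_doi(doi_url):
    """Return the DOI without the resolver prefix, or None if there is no DOI."""
    if not doi_url:
        return None
    if doi_url.lower().startswith(DOI_PREFIX):
        return doi_url[len(DOI_PREFIX):]
    # Already a bare DOI (or some other form): pass it through unchanged.
    return doi_url


print(bare_doi("https://doi.org/10.7717/peerj.4375"))
print(bare_doi(None))
```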

title

String: The title of this work.
title: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles"

display_name

String: Exactly the same as Work.title. It's useful for Works to include a display_name property, since all the other entities have one.
display_name: "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles"

publication_year

Integer: The year this work was published.
This year applies to the version found at Work.url. The other versions, found in Work.alternate_host_venues, may have been published in different (earlier) years.
publication_year: 2018

publication_date

String: The day when this work was published, formatted as an ISO 8601 date.
Where different publication dates exist, we select the earliest available date of electronic publication.
This date applies to the version found at Work.url. The other versions, found in Work.alternate_host_venues, may have been published at different (earlier) dates.
publication_date: "2018-02-13"

ids

Object: All the external identifiers that we know about for this work. IDs are expressed as URIs whenever possible. Possible ID types include openalex, doi, mag, and pmid.
Most works are missing one or more ID types (either because we don't know the ID, or because it was never assigned). Keys for null IDs are not displayed.
ids: {
  openalex: "https://openalex.org/W2741809807",
  doi: "https://doi.org/10.7717/peerj.4375",
  mag: 2741809807,
  pmid: "https://pubmed.ncbi.nlm.nih.gov/29456894"
}
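Because keys for unknown IDs are left out entirely, a direct lookup on a missing ID type will fail. Here is a small sketch of a tolerant lookup, assuming the Work has already been parsed into a Python dict. The helper name external_id is hypothetical.

```python
def external_id(work, id_type):
    """Return the requested external ID, or None when the key is absent."""
    return (work.get("ids") or {}).get(id_type)


work = {
    "ids": {
        "openalex": "https://openalex.org/W2741809807",
        "doi": "https://doi.org/10.7717/peerj.4375",
        "mag": 2741809807,
        "pmid": "https://pubmed.ncbi.nlm.nih.gov/29456894",
    }
}

print(external_id(work, "pmid"))
# A work with no known DOI simply has no "doi" key.
print(external_id({"ids": {"openalex": "https://openalex.org/W2741809807"}}, "doi"))
```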

host_venue

Object: A HostVenue object describing how and where this work is being hosted online.
The host_venue is important because it describes where you can find the "best" (closest to the version of record) copy of this work.
However, some records don't have a host_venue, because they were inherited from MAG, which implemented a less complete provenance chain. We're gradually filling in these missing host venues.
host_venue: {
  // this top stuff is the same as a dehydrated Venue object
  id: "https://openalex.org/V1983995261",
  issn_l: "2167-8359",
  issn: [
    "2167-8359"
  ],
  display_name: "PeerJ",
  publisher: "PeerJ",
  type: "journal",

  // this stuff is extra, and relates to this work at this venue
  url: "https://doi.org/10.7717/peerj.4375",
  is_oa: null,
  version: null,
  license: null
}

type

String: The type or genre of the work.
This field uses Crossref's "type" controlled vocabulary; you can see all possible values via the Crossref API: https://api.crossref.org/types.
Where possible, we just pass along Crossref's type value for each work. When that's impossible (eg the work isn't in Crossref), we do our best to figure out the type ourselves. Unfortunately the accuracy of Crossref's data for this isn't great, and ours isn't much better. We're working to develop better type classification.
type: "journal-article"

open_access

Object: Information about the access status of this work, as an OpenAccess object.
open_access: {
  is_oa: true,
  oa_status: "gold",
  oa_url: "https://peerj.com/articles/4375.pdf"
}

authorships

List: List of Authorship objects, each representing an author and their institution.
authorships: [
  // first authorship object:
  {
    author_position: "first",
    author: {
      id: "https://openalex.org/A1969205032",
      display_name: "Heather A. Piwowar",
      orcid: "https://orcid.org/0000-0003-1613-5981"
    },
    institutions: [
      {
        id: "https://openalex.org/I4200000001",
        display_name: "OurResearch",
        ror: "https://ror.org/02nr0ka47",
        country_code: "US",
        type: "nonprofit"
      }
    ]
  },

  // more authorship objects go here, omitted for space.
]

cited_by_count

Integer: The number of citations to this work. These are the times that other works have cited this work: Other works ➞ This work.
cited_by_count: 382

biblio

Object: Old-timey bibliographic info for this work. This is mostly useful only in citation/reference contexts. These are all strings because sometimes you'll get fun values like "Spring" and "Inside cover."
  • volume (String)
  • issue (String)
  • first_page (String)
  • last_page (String)
biblio: {
  volume: "495",
  issue: "7442",
  first_page: "437",
  last_page: "440"
}
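Because every biblio value is a string, page ranges are best built by string formatting rather than arithmetic. Here is a rough sketch, assuming the object has been parsed into a Python dict and that missing values show up as None or absent keys. The helper name page_range is hypothetical.

```python
def page_range(biblio):
    """Build a display string such as '437-440' without assuming numeric pages."""
    first = biblio.get("first_page")
    last = biblio.get("last_page")
    if first and last and first != last:
        return f"{first}-{last}"
    # Fall back to whichever value exists (e.g. "Inside cover"), or an empty string.
    return first or last or ""


biblio = {"volume": "495", "issue": "7442", "first_page": "437", "last_page": "440"}
print(page_range(biblio))
print(page_range({"first_page": "Inside cover", "last_page": None}))
```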

is_retracted

Boolean: True if we know this work has been retracted.
This field has high precision but low recall. In other words, if is_retracted is true, the article is definitely retracted. But if is_retracted is false, the article might still be retracted; we just don't know. This is because the open sources for retraction data aren't currently very comprehensive, and the more comprehensive ones aren't sufficiently open for us to use here.
is_retracted: false

is_paratext

Boolean: True if we think this work is paratext.
In our context, paratext is stuff that's in a scholarly venue (like a journal) but is about the venue rather than a scholarly work properly speaking. Some examples and nonexamples:
  • yep it's paratext: front cover, back cover, table of contents, editorial board listing, issue information, masthead.
  • no, not paratext: research paper, dataset, letters to the editor, figures
Turns out there is a lot of paratext in registries like Crossref. That's not a bad thing... but we've found that it's good to have a way to filter it out.
We determine is_paratext algorithmically using title heuristics.
is_paratext: false
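As a sketch of how the two boolean flags might be used together to filter a result set, assume works have been parsed into Python dicts. The sample records and the helper name is_usable are made up for illustration. Keep in mind that, as noted for is_retracted, a false value does not guarantee the work was never retracted.

```python
def is_usable(work):
    """Keep works that are neither flagged as paratext nor known to be retracted."""
    return not work.get("is_paratext", False) and not work.get("is_retracted", False)


# Made-up sample records for illustration only.
works = [
    {"id": "W1", "is_paratext": False, "is_retracted": False},
    {"id": "W2", "is_paratext": True, "is_retracted": False},
    {"id": "W3", "is_paratext": False, "is_retracted": True},
]

kept = [w["id"] for w in works if is_usable(w)]
print(kept)
```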

concepts

List: List of dehydrated Concept objects.
Each Concept object in the list also has one additional property:
  • score (Float): The strength of the connection between the work and this concept (higher is stronger).
concepts: [
  {
    id: "https://openalex.org/C2778793908",
    wikidata: "https://www.wikidata.org/wiki/Q5122404",
    display_name: "Citation impact",
    level: 3,
    score: 0.459309
  },
  {
    id: "https://openalex.org/C2778805511",
    wikidata: "https://www.wikidata.org/wiki/Q1713",
    display_name: "Citation",
    level: 2,
    score: 0.447306
  }
]
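Since a higher score means a stronger connection, one simple use of this list is to rank concepts and keep only the strongest. This is a minimal sketch over the two concepts shown above. The helper name top_concepts and the 0.45 cutoff are arbitrary choices for illustration, not OpenAlex recommendations.

```python
def top_concepts(work, min_score=0.0, limit=None):
    """Return concepts sorted by descending score, filtered by a minimum score."""
    ranked = sorted(work.get("concepts", []), key=lambda c: c["score"], reverse=True)
    kept = [c for c in ranked if c["score"] >= min_score]
    return kept[:limit]  # limit=None keeps everything


work = {
    "concepts": [
        {"display_name": "Citation", "level": 2, "score": 0.447306},
        {"display_name": "Citation impact", "level": 3, "score": 0.459309},
    ]
}

print([c["display_name"] for c in top_concepts(work)])
print([c["display_name"] for c in top_concepts(work, min_score=0.45)])
```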

mesh

List: List of MeSH tag objects. Only works found in PubMed have MeSH tags; for all other works, this is an empty list.
mesh: [
  {
    descriptor_ui: "D017712",
    descriptor_name: "Peer Review, Research",
    qualifier_ui: "Q000379",
    qualifier_name: "methods",
    is_major_topic: false
  },
  {
    descriptor_ui: "D017712",
    descriptor_name: "Peer Review, Research",
    qualifier_ui: "Q000592",
    qualifier_name: "standards",
    is_major_topic: true
  }
]
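The same descriptor can appear several times with different qualifiers, so it can help to pull out just the major-topic pairs. Here is a small sketch using the two tags from the example above. The helper name major_topics is hypothetical.

```python
def major_topics(work):
    """Return sorted (descriptor, qualifier) pairs flagged as major topics."""
    return sorted(
        {
            (tag["descriptor_name"], tag["qualifier_name"])
            for tag in work.get("mesh", [])
            if tag.get("is_major_topic")
        }
    )


work = {
    "mesh": [
        {"descriptor_ui": "D017712", "descriptor_name": "Peer Review, Research",
         "qualifier_ui": "Q000379", "qualifier_name": "methods", "is_major_topic": False},
        {"descriptor_ui": "D017712", "descriptor_name": "Peer Review, Research",
         "qualifier_ui": "Q000592", "qualifier_name": "standards", "is_major_topic": True},
    ]
}

print(major_topics(work))
# Works outside PubMed have an empty mesh list, which yields an empty result.
print(major_topics({"mesh": []}))
```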

alternate_host_venues

List: List of HostVenue objects describing places this work lives. This work's primary hosting venue isn't in this list; it's at host_venue.
Known Issue: Some venues in this list are missing the id field! This should be fixed by February 2022.
alternate_host_venues: [
  {
    id: null,
    display_name: "Europe PMC",
    type: "repository",
    url: "http://europepmc.org/articles/pmc5815332?pdf=render",
    is_oa: true,
    version: "publishedVersion",
    license: "cc-by"
  },
  {
    id: null,
    display_name: "Simon Fraser University - Summit",
    type: "repository",
    url: "https://summit.sfu.ca/item/17691",
    is_oa: true,
    version: "submittedVersion",
    license: "cc-by"
  },
  // others omitted for brevity.
]
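Since the primary venue lives at host_venue and the rest live here, code that wants every known copy of a work has to combine the two. This is a sketch under the assumption that the Work is a parsed Python dict and that host_venue may be missing, as noted for MAG-inherited records. The helper names all_host_venues and oa_urls are hypothetical.

```python
def all_host_venues(work):
    """Return the primary host venue (if any) followed by the alternates."""
    venues = []
    if work.get("host_venue"):
        venues.append(work["host_venue"])
    venues.extend(work.get("alternate_host_venues") or [])
    return venues


def oa_urls(work):
    """Return URLs of copies marked is_oa; null or false is_oa values are skipped."""
    return [v["url"] for v in all_host_venues(work) if v.get("is_oa") and v.get("url")]


work = {
    "host_venue": {"display_name": "PeerJ",
                   "url": "https://doi.org/10.7717/peerj.4375", "is_oa": None},
    "alternate_host_venues": [
        {"id": None, "display_name": "Europe PMC",
         "url": "http://europepmc.org/articles/pmc5815332?pdf=render", "is_oa": True},
        {"id": None, "display_name": "Simon Fraser University - Summit",
         "url": "https://summit.sfu.ca/item/17691", "is_oa": True},
    ],
}

print(len(all_host_venues(work)))
print(oa_urls(work))
```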

referenced_works

List: OpenAlex IDs for works that this work cites. These are citations that go from this work out to another work: This work ➞ Other works.
referenced_works: [
  "https://openalex.org/W2753353163",
  "https://openalex.org/W2785823074",
  "https://openalex.org/W2511661767",
  "https://openalex.org/W2115339903",
  "https://openalex.org/W2031754690"
]

related_works

List: OpenAlex IDs for works related to this work. Related works are computed algorithmically; the algorithm finds recent papers with the most concepts in common with the current paper.
related_works: [
  "https://openalex.org/W2753353163",
  "https://openalex.org/W2785823074",
  "https://openalex.org/W2511661767",
  "https://openalex.org/W2115339903",
  "https://openalex.org/W2031754690"
]

abstract_inverted_index

Object: The abstract of the work, as an inverted index, which encodes information about the abstract's words and their positions within the text. Like Microsoft Academic Graph, OpenAlex doesn't include plaintext abstracts due to legal constraints.
abstract_inverted_index: {
  Despite: [
    0
  ],
  growing: [
    1
  ],
  interest: [
    2
  ],
  in: [
    3,
    57,
    73,
    110,
    122
  ],
  Open: [
    4,
    201
  ],
  Access: [
    5
  ],
  ...
}
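An inverted index maps each word to the positions where it occurs, so the original word order can be recovered by inverting that mapping and sorting by position. This is a minimal sketch, assuming the index is a parsed Python dict. The function name abstract_from_inverted_index is hypothetical, and because the example above is truncated, the sample keeps only positions 0 through 5.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild a plain-text string from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""
    word_at = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word
    # Join words in position order; any gaps in the positions are simply skipped.
    return " ".join(word_at[position] for position in sorted(word_at))


# Truncated sample: only the first six positions from the example above.
sample_index = {
    "Despite": [0],
    "growing": [1],
    "interest": [2],
    "in": [3],
    "Open": [4],
    "Access": [5],
}

print(abstract_from_inverted_index(sample_index))
```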

cited_by_api_url

TODO: documentation coming soon!

counts_by_year

TODO: documentation coming soon!

updated_date

String: The last time anything in this Work object changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.
updated_date: "2022-01-02T00:22:35.180390"

created_date

String: The date this Work object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.
created_date: "2017-08-08"
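Note that the two examples differ in shape: updated_date carries a time component while created_date is a plain date. A short sketch of parsing both with Python's standard library (3.7 or later), using the example values shown above:

```python
from datetime import date, datetime

work = {
    "updated_date": "2022-01-02T00:22:35.180390",
    "created_date": "2017-08-08",
}

# updated_date includes a time, so it parses as a datetime.
updated = datetime.fromisoformat(work["updated_date"])
# created_date is date-only, so it parses as a date.
created = date.fromisoformat(work["created_date"])

# Compare like with like by dropping the time component.
days_since_creation = (updated.date() - created).days
print(updated, created, days_since_creation)
```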

The Authorship object

The Authorship object represents a single author and her institutional affiliations in the context of a given work. It is only found as part of a Work object.

author_position

String: A summarized description of this author's position in the work's author list. Possible values are first, middle, and last.
It's not strictly necessary, because author order is already implicitly recorded by the list order of Authorship objects; however it's useful in some contexts to have this as a categorical value.
author_position: "first"
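To illustrate how the categorical value relates to list order, here is one plausible mapping from an index in the authorships list to a position label. This is an assumption for illustration, not a description of how OpenAlex computes the field. In particular, labelling a sole author as "first" is a guess, and the function name position_label is hypothetical.

```python
def position_label(index, total):
    """Map a zero-based index in the author list to first, middle, or last."""
    if index == 0:
        return "first"  # assumption: a sole author is also labelled "first"
    if index == total - 1:
        return "last"
    return "middle"


labels = [position_label(i, 4) for i in range(4)]
print(labels)
```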

author

Object: An author of this work, as a dehydrated Author object.
author: {
  id: "https://openalex.org/A2790141563",
  display_name: "Juan Pablo Alperin",
  orcid: "https://orcid.org/0000-0002-9344-7439"
}

institutions

List: The institutional affiliations this author claimed in the context of this work, as dehydrated Institution objects.
institutions: [
  {
    id: "https://openalex.org/I18014758",
    display_name: "Simon Fraser University",
    ror: "https://ror.org/0213rcc28",
    country_code: "CA",
    type: "education"
  },
  {
    id: "https://openalex.org/I209863525",
    display_name: "Public Knowledge Project",
    ror: null,
    country_code: null,
    type: null
  }
]

raw_affiliation_string

String: This author's affiliation as it originally came to us (on a webpage or in an API), as a raw unformatted string.
raw_affiliation_string: "Canadian Institute for Studies in Publishing, Simon Fraser University, Vancouver, BC, Canada."

The HostVenue object

The HostVenue object describes a given work hosted on a given venue (you can think of it as a WorkVenue bridging table). It's only found as part of the Work object. It's got two parts:
  1. a dehydrated Venue object, and
  2. some extra stuff about the work.
The extra stuff is important because a given work can be hosted in different ways and in different forms, depending on where it's living.
To learn more about the dehydrated Venue object part, see the DehydratedVenue docs. To learn more about the other stuff, read below:

url

String: The URL where you can access this work.
url: "https://doi.org/10.7717/peerj.4375"

is_oa

Boolean: Set to true if the work hosted here can be read for free, without registration.
is_oa: true

version

String: The version of the work, based on the DRIVER Guidelines versioning scheme. Possible values are:
  • publishedVersion: The document's version of record. This is the most authoritative version.
  • acceptedVersion: The document after having completed peer review and being officially accepted for publication. It will lack publisher formatting, but the content should be interchangeable with that of the publishedVersion.
  • submittedVersion: The document as submitted to the publisher by the authors, but before peer review. Its content may differ significantly from that of the accepted article.
version: "publishedVersion"
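The three version values form a natural ordering from most to least authoritative, which can be used to choose among several copies of a work. This is a rough sketch that ranks open copies by version. The ranking table, the helper name best_open_copy, and the choice to rank unknown or null versions last are all assumptions for illustration.

```python
# Lower rank means closer to the version of record.
VERSION_RANK = {"publishedVersion": 0, "acceptedVersion": 1, "submittedVersion": 2}


def best_open_copy(host_venues):
    """Return the is_oa venue whose version is closest to the version of record."""
    candidates = [v for v in host_venues if v.get("is_oa")]
    if not candidates:
        return None
    # Unknown or null versions are ranked after all known ones.
    return min(candidates, key=lambda v: VERSION_RANK.get(v.get("version"), len(VERSION_RANK)))


venues = [
    {"display_name": "Simon Fraser University - Summit", "is_oa": True,
     "version": "submittedVersion"},
    {"display_name": "Europe PMC", "is_oa": True, "version": "publishedVersion"},
]

print(best_open_copy(venues)["display_name"])
```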

license

String: The license applied to this work at this host. Most toll-access works don't have an explicit license (they're under "all rights reserved" copyright), so this field generally has content only if is_oa is true.
license: "cc-by"

The OpenAccess object

The OpenAccess object describes access options for a given work. It's only found as part of the Work object.

is_oa

Boolean: True if this work is Open Access (OA).
There are many ways to define OA. OpenAlex uses a broad definition: having a URL where you can read the fulltext of this work without needing to pay money or log in. You can use the alternate_host_venues and oa_status fields to narrow your results further, accommodating any definition of OA you like.
is_oa: true

oa_status

String: The Open Access (OA) status of this work. Possible values are:
  • gold: Published in an OA journal that is indexed by the DOAJ.
  • green: Toll-access on the publisher landing page, but there is a free copy in an OA repository.
  • hybrid: Free under an open license in a toll-access journal.
  • bronze: Free to read on the publisher landing page, but without any identifiable license.
  • closed: All other articles.
oa_status: "gold"
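Because oa_status takes one of a small set of values, it lends itself to simple tallies across a set of works. This is a minimal sketch, assuming each Work is a parsed Python dict. The sample records are made up, and the helper name oa_status_counts is hypothetical.

```python
from collections import Counter


def oa_status_counts(works):
    """Count works per oa_status, skipping records without an open_access object."""
    return Counter(
        w["open_access"]["oa_status"] for w in works if w.get("open_access")
    )


# Made-up sample records for illustration only.
works = [
    {"open_access": {"is_oa": True, "oa_status": "gold"}},
    {"open_access": {"is_oa": True, "oa_status": "green"}},
    {"open_access": {"is_oa": False, "oa_status": "closed"}},
    {"open_access": {"is_oa": True, "oa_status": "gold"}},
]

counts = oa_status_counts(works)
print(counts.most_common())
```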

oa_url

String: The best Open Access (OA) URL for this work.
Although there are many ways to define OA, in this context an OA URL is one where you can read the fulltext of this work without needing to pay money or log in. The "best" such URL is the one closest to the version of record.
This URL might be a direct link to a PDF, or it might be to a landing page that links to the free PDF.
oa_url: "https://peerj.com/articles/4375.pdf"
