Snapshot data format
Last updated
Last updated
Here are the details on where the OpenAlex data lives and how it's structured.
The data files are gzip-compressed JSON Lines, one row per entity.
Records are partitioned by updated_date. Within each entity type prefix, each object (file) is further prefixed by this date. For example, if an Author
has an updated_date of 2021-12-30 it will be prefixed/data/authors/updated_date=2021-12-30/
.
If you're initializing a fresh snapshot, the updated_date
partitions aren't important yet. You need all the entities, so for Authors
you would get /data/authors
/*/*.gz
There are multiple objects under each updated_date
partition. Each is under 2GB.
The manifest file is JSON (in redshift manifest format) and lists all the data files for each object type - /data/works/manifest
lists all the works.
The gzip-compressed snapshot takes up about 330 GB and decompresses to about 1.6 TB.
The structure of each entity type is documented here: Work, Author, Source, Institution, Concept, and Publisher.
We have recently added folders for new entities topics
, fields
, subfields
, and domains
, and we will be adding others soon. This documentation will soon be updated to reflect these changes.
This is a screenshot showing the "leaf" nodes of one entity type, updated date folder. You can also click around the browser links above to get a sense of the snapshot's structure.
Once you have a copy of the snapshot, you'll probably want to keep it up to date. The updated_date
partitions make this easy, but the way they work may be unfamiliar. Unlike a set of dated snapshots that each contain the full dataset as of a certain date, each partition contains the records that last changed on that date.
If we imagine launching OpenAlex on 2021-12-30 with 1000 Authors
, each being newly created on that date, /data/authors/
looks like this:
If, on 2022-01-04, we made changes to 50 of those Authors
, they would come out of one of the files in /data/authors/updated_date=2021-12-30
and go into one in /data/authors/updated_date=2022-01-04:
If we also discovered 50 new Authors, they would go in that same partition, so the totals would look like this:
So if you made your copy of the snapshot on 2021-12-30, you would only need to download /data/authors/updated_date=2022-01-04
to get everything that was changed or added since then.
To update a snapshot copy that you created or updated on date X
, insert or update the records in objects where updated_date
> X
.
You never need to go back for a partition you've already downloaded. Anything that changed isn't there anymore, it's in a new partition.
At the time of writing, these are the Author
partitions and the number of records in each (in the actual dataset):
updated_date=2021-12-30/
- 62,573,099
updated_date=2022-12-31/
- 97,559,192
updated_date=2022-01-01/
- 46,766,699
updated_date=2022-01-02/
- 1,352,773
This reflects the creation of the dataset on 2021-12-30 and 145,678,664 combined updates and inserts since then - 1,352,773 of which were on 2022-01-02. Over time, the number of partitions will grow. If we make a change that affects all records, the partitions before the date of the change will disappear.
See Merged Entities for an explanation of what Entity merging is and why we do it.
Alongside the folders for the six Entity types - work, author, source, institution, concept, and publisher - you'll find a seventh folder: merged_ids. Within this folder you'll find the IDs of Entities that have been merged away, along with the Entity IDs they were merged into.
Keep in mind that merging an Entity ID is a way of deleting the Entity while persisting its ID in OpenAlex. In practice, you can just delete the Entity it belongs to. It's not necessary to keep track of the date or which entity it was merged into.
Merge operations are separated into files by date. Each file lists the IDs of Entities that were merged on that date, and names the Entities they were merged into.
For example, data/merged_ids/authors/2022-06-07.csv.gz
begins:
When processing this file, all you need to do is delete A2257618939. The effects of merging these authors, like crediting A2208157607 with their Works, are already reflected in the affected Entities.
Like the Entities' updated_date partitions, you only ever need to download merged_ids files that are new to you. Any later merges will appear in new files with later dates.
manifest
fileWhen we start writing a new updated_date
partition for an entity, we'll delete that entity's manifest
file. When we finish writing the partition, we'll recreate the manifest, including the newly-created objects. So if manifest
is there, all the entities are there too.
The file is in redshift manifest format. To use it as part of the update process for an Entity type (we'll keep using Authors as an example):
Download s3://openalex/data/authors/manifest
.
Get the file list from the url
property of each item in the entries
list.
Download any objects with an updated_date
you haven't seen before.
Download s3://openalex/data/authors/manifest
again. If it hasn't changed since (1), no records moved around and any date partitions you downloaded are valid.
Decompress the files you downloaded and parse one JSON Author
per line. Insert or update into your database of choice, using each entity's ID as a primary key.
If you’ve worked with dataset like this before and have a toolchain picked out, this may be all you need to know. If you want more detailed steps, proceed to download the data.