OpenAlex snapshot

For most use cases, the REST API is your best option. However, you can also download (instructions here) and install a complete copy of the OpenAlex database on your own server, using the database snapshot. The snapshot consists of seven files (split into smaller files for convenience), with one file for each of our seven entity types. The files are in the JSON Lines format; each line is a JSON object, exactly the same as you'd get from our API. The properties of these JSON objects are documented in each entity's object section (for example, the Work object).
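As a rough illustration of the JSON Lines format, the sketch below parses a stream one line at a time. The two sample records are made up for the example, and `iter_records` is a hypothetical helper name, not part of any OpenAlex tooling; real records carry many more properties than the `id` and `display_name` shown here.

```python
import io
import json

def iter_records(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Two made-up records standing in for lines of a snapshot file.
sample = io.StringIO(
    '{"id": "https://openalex.org/W0000000001", "display_name": "Example work A"}\n'
    '{"id": "https://openalex.org/W0000000002", "display_name": "Example work B"}\n'
)

records = list(iter_records(sample))
print(len(records), records[0]["display_name"])
```

Because the helper accepts any iterable of text lines, an open file handle can be passed in place of the in-memory sample. If the files you download are compressed, something like `gzip.open(path, "rt", encoding="utf-8")` should work as the input, though that depends on how your copy is stored.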

The snapshot is updated about once per month; you can read release notes for each new update here.

If you've worked with a dataset like this before, the snapshot data format may be all you need to get going. If not, read on.

The rest of this guide will tell you how to (a) download the snapshot and (b) upload it to your own database. We’ll cover two general approaches:

1. Load the intact OpenAlex records into a data warehouse (we’ll use BigQuery as an example) and use native JSON functions to query the Work, Author, Source, Institution, Concept, and Publisher objects directly.
2. Flatten the records into a normalized schema in a relational database (we’ll use PostgreSQL) while preserving the relationships between objects.
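To give a feel for what the second approach involves, here is a minimal sketch of flattening one nested Work record into rows for two tables. The function name `flatten_work`, the row layouts, and the sample record are assumptions made for illustration and are not the schema used later in this guide; the point is only that nested lists such as `authorships` become separate rows that keep the work's ID as a foreign key.

```python
def flatten_work(work):
    """Split one nested Work record into flat rows keyed by the work ID."""
    # One row for the work itself (hypothetical column choice).
    work_row = (work["id"], work.get("display_name"), work.get("publication_year"))
    # One row per authorship, linking the work to an author ID.
    authorship_rows = [
        (work["id"], (a.get("author") or {}).get("id"), a.get("author_position"))
        for a in work.get("authorships", [])
    ]
    return work_row, authorship_rows

# Made-up record with only the properties this sketch uses.
sample_work = {
    "id": "https://openalex.org/W0000000001",
    "display_name": "Example work",
    "publication_year": 2020,
    "authorships": [
        {"author_position": "first", "author": {"id": "https://openalex.org/A0000000001"}},
        {"author_position": "last", "author": {"id": "https://openalex.org/A0000000002"}},
    ],
}

work_row, authorship_rows = flatten_work(sample_work)
print(work_row)
print(authorship_rows)
```

Rows in this shape can be written to CSV and bulk-loaded, or inserted directly with a database driver. With the first approach none of this is needed, since the records are loaded intact and the nesting is handled at query time.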

We'll assume you're initializing a fresh snapshot. To keep it up to date, you'll need to combine the information in Downloading updated Entities with the steps in this guide and adapt them to your setup.

This is hard. Working with such a big and complicated dataset hardly ever goes according to plan. If it gets scary, try the REST API. In fact, try the REST API first. It can answer most of your questions and has a much lower barrier to entry.

There’s more than one way to do everything. We’ve tried to pick one reasonable default for each step. If something doesn’t work in your environment or with the tools you have available, let us know.

Up next: the snapshot data format, downloading the data and getting it into your database.
