Load to a relational database
Compared to using a data warehouse, loading the dataset into a relational database takes more work up front but lets you write simpler queries and run them on less powerful machines. One important caveat is that this is a lot of data, and exploration will be very slow in most relational databases.
By using a relational database, you trade flexibility for efficiency in a selected set of operations. The tables, columns, and indexes we have chosen in this guide represent only one of many ways the entity objects could be stored. It may not be the best way to store them given the queries you want to run. Some queries will be fast; others will be painfully slow.
We’re going to use PostgreSQL as an example and skip the database server setup itself. We’ll assume you have a working postgres 13+ installation on which you can create schemas and tables and run queries. With that as a starting point, we'll take you through these steps:
1. Define the tables the data will be stored in and some key relationships between them (the "schema").
2. Convert the JSON Lines files you downloaded to CSV files that can be read by the database application. We'll flatten them to fit the relational model.
3. Load the CSV data into the tables you created.
4. Run some queries on the data you loaded.
Step 1: Create the schema
Running this SQL on your database (in the psql client, for example) will initialize a schema for you.
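For example, if you save that file locally, you can run it from your shell with psql. The filename and connection URI below are placeholders; substitute whatever you saved the linked schema file as and your own database:

```bash
# "openalex-pg-schema.sql" is a placeholder name for the schema file linked above
psql 'postgresql://user:password@localhost:5432/openalex' -f openalex-pg-schema.sql
```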
Run it and you'll be set up to follow the next steps. To show you what it's doing, we'll explain some excerpts here, using the concept entity as an example.
SQL in this section isn't anything additional you need to run. It's part of the schema we already defined in the file above.
The key thing we're doing is "flattening" the nested JSON data. Some parts of this are easy. Concept.id is just a string, so it goes in a text column called "id":
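Here's an abbreviated sketch of what that looks like; the linked schema file is the authoritative version and has more columns than shown here:

```sql
CREATE TABLE openalex.concepts (
    id text NOT NULL,    -- Concept.id is just a string
    display_name text
    -- ... the other scalar Concept fields become columns here
);
```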
But Concept.related_concepts isn't so simple. You could store the JSON array intact in a postgres JSON or JSONB column, but you would lose much of the benefit of a relational database. It would be hard to answer questions about related concepts with more than one degree of separation, for example. So we make a separate table to hold these relationships:
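A sketch of that relationship table and its indexes (again abbreviated; check the schema file above for the exact definition):

```sql
CREATE TABLE openalex.concepts_related_concepts (
    concept_id text,          -- the concept on the "owning" side of the relationship
    related_concept_id text,  -- the dehydrated related concept
    score real                -- strength of the relationship
);

-- index both sides of the relationship
CREATE INDEX ON openalex.concepts_related_concepts (concept_id);
CREATE INDEX ON openalex.concepts_related_concepts (related_concept_id);
```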
We can preserve `score` in this relationship table and look up any other attributes of the dehydrated related concepts in the main `concepts` table. Creating indexes on `concept_id` and `related_concept_id` lets us look up concepts on both sides of the relationship quickly.
Step 2: Convert the JSON Lines files to CSV
This python script will turn the JSON Lines files you downloaded into CSV files that can be copied to the tables you created in step 1.
This script assumes your downloaded snapshot is in `openalex-snapshot` and you've made a directory `csv-files` to hold the CSV files. Edit `SNAPSHOT_DIR` and `CSV_DIR` at the top of the script to read or write the files somewhere else.
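In other words, the two constants near the top of the script will look something like this; change them if your paths differ:

```python
# Paths the flatten script reads from and writes to -- edit to match your setup.
SNAPSHOT_DIR = 'openalex-snapshot'
CSV_DIR = 'csv-files'
```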
This script has only been tested using python 3.9.5.
Copy the script to the directory above your snapshot (if the snapshot is in `/home/yourname/openalex/openalex-snapshot/`, name it something like `/home/yourname/openalex/flatten-openalex-jsonl.py`) and run it like this:
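```bash
# run from the directory above openalex-snapshot and csv-files
python3 flatten-openalex-jsonl.py
```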
This script is slow. Exactly how slow depends on the machine you run it on, but think hours, not minutes.
If you're familiar with python, there are two big improvements you can make:
* Run `flatten_authors` and `flatten_works` at the same time, either by using threading in python or just running two copies of the script with the appropriate lines commented out.
* Flatten multiple `.gz` files within each entity type at the same time. This means parallelizing the `for jsonl_file_name ...` loop in each `flatten_` function and writing multiple CSV files per entity type.
You should now have a directory full of nice, flat CSV files:
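The exact names depend on the schema and the script version, but you should see roughly one file per table, something like:

```text
$ ls csv-files/
authors.csv
concepts.csv
concepts_related_concepts.csv
works.csv
...
```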
Step 3: Load the CSV files to the database
Now we run one postgres copy command to load each CSV file to its corresponding table. Each command looks like this:
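For example, loading the concepts CSV might look like this; the table and file names follow the schema and paths above, so adjust them if yours differ:

```sql
-- psql meta-command: load one CSV (with a header row) into its table
\copy openalex.concepts from 'csv-files/concepts.csv' csv header
```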
This script will run all the copy commands in the right order. Here's how to run it:
1. Copy it to the same place as the python script from step 2, right above the folder with your CSV files.
2. Set the environment variable `OPENALEX_SNAPSHOT_DB` to the connection URI for your database.
3. If your CSV files aren't in `csv-files`, replace each occurrence of 'csv-files/' in the script with the correct path.
4. Run it from your shell prompt or from within psql, as shown below.
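Here's a sketch of both options; `copy-openalex-csv.sql` is a placeholder for whatever you named the load script, and the connection URI is an example:

```bash
# Option 1: from your shell prompt
export OPENALEX_SNAPSHOT_DB='postgresql://user:password@localhost:5432/openalex'
psql $OPENALEX_SNAPSHOT_DB -f copy-openalex-csv.sql
```

```sql
-- Option 2: from inside psql, connected to the same database
\i copy-openalex-csv.sql
```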
There are a bunch of ways you can do this - just run the copy commands from the script above in the right order in whatever client you're familiar with.
Step 4: Run your queries!
Now you have all the OpenAlex data in your database and can run queries in your favorite client.
Here’s a simple one, getting the OpenAlex ID and OA status for each work:
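The exact table and column names depend on the schema file above; assuming works' open access info landed in a table like `openalex.works_open_access` keyed by `work_id`, it might look like this:

```sql
-- OpenAlex ID and OA status for every work
select w.id, oa.oa_status
from openalex.works w
join openalex.works_open_access oa on oa.work_id = w.id;
```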
You'll get results like this (truncated, the actual result will be millions of rows):
| id | oa_status |
|---|---|
| … | closed |
| … | gold |
| … | bronze |
Here’s an example of a more complex query - finding the author with the most open access works of all time:
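Again, the table and column names here are assumptions based on the schema above (authorships in `openalex.works_authorships`, an `is_oa` flag in `openalex.works_open_access`); one way to write it:

```sql
-- author with the most open access works
select wa.author_id, count(distinct wa.work_id) as num_oa_works
from openalex.works_authorships wa
join openalex.works_open_access oa on oa.work_id = wa.work_id
where oa.is_oa
group by wa.author_id
order by num_oa_works desc
limit 1;
```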
We get the one row we asked for:
| author_id | num_oa_works |
|---|---|
| https://openalex.org/A2798520857 | 3297 |
Checking out https://api.openalex.org/authors/A2798520857, we see that this is Ashok Kumar at Manipal University Jaipur. We could also have found this directly in the query, through `openalex.authors`.