Load to a relational database
Compared to using a data warehouse, loading the dataset into a relational database takes more work up front but lets you write simpler queries and run them on less powerful machines. One important caveat is that this is a lot of data, and exploration will be very slow in most relational databases.
By using a relational database, you trade flexibility for efficiency in certain selected operations. The tables, columns, and indexes we have chosen in this guide represent only one of many ways the entity objects could be stored. It may not be the best way to store them given the queries you want to run. Some queries will be fast, others will be painfully slow.
We’re going to use PostgreSQL as an example and skip the database server setup itself. We'll assume you have a working postgres 13+ installation on which you can create schemas and tables and run queries. With that as a starting point, we'll take you through these steps:
1. Define the tables the data will be stored in and some key relationships between them (the "schema").
2. Flatten the snapshot's JSON Lines files into CSV files that match those tables.
3. Load the CSV data into the tables you created.
4. Run some queries on the data you loaded.
The SQL in this section isn't anything additional you need to run; it's part of the schema we already defined in the file above.
Most of a concept's attributes map straightforwardly onto the columns of a single table:
CREATE TABLE openalex.concepts (
    id text NOT NULL,
    -- plus some other columns ...
);
But Concept.related_concepts isn't so simple. You could store the JSON array intact in a postgres JSON or JSONB column, but you would lose much of the benefit of a relational database. It would be hard to answer questions about related concepts with more than one degree of separation, for example. So we make a separate table to hold these relationships:
CREATE TABLE openalex.concepts_related_concepts (
    concept_id text,
    related_concept_id text,
    score real
);
We can preserve `score` in this relationship table and look up any other attributes of the dehydrated related concepts in the main table `concepts`. Creating indexes on `concept_id` and `related_concept_id` lets us look up concepts on both sides of the relationship quickly.
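A minimal sketch of those indexes (the exact DDL in the schema file may differ; the names are left for postgres to generate):
-- Index both sides of the relationship so either direction is fast.
CREATE INDEX ON openalex.concepts_related_concepts (concept_id);
CREATE INDEX ON openalex.concepts_related_concepts (related_concept_id);
With the relationships in their own table, the multi-degree questions mentioned above become ordinary self-joins. For example (the concept ID here is just an illustration):
-- Concepts two degrees of separation from a starting concept.
select distinct r2.related_concept_id
from openalex.concepts_related_concepts r1
join openalex.concepts_related_concepts r2
  on r1.related_concept_id = r2.concept_id
where r1.concept_id = 'https://openalex.org/C41008148';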
Step 2 is a python script, `flatten-openalex-jsonl.py`, that converts the snapshot's JSON Lines files into one CSV file per table. It assumes your downloaded snapshot is in `openalex-snapshot` and that you've made a directory `csv-files` to hold the CSV files. Edit `CSV_DIR` at the top of the script to read or write the files somewhere else.
This script has only been tested using python 3.9.5.
Copy the script to the directory above your snapshot (if the snapshot is in `/home/yourname/openalex/openalex-snapshot/`, name it something like `/home/yourname/openalex/flatten-openalex-jsonl.py`) and run it like this:
mkdir -p csv-files
python flatten-openalex-jsonl.py
This script is slow. Exactly how slow depends on the machine you run it on, but think hours, not minutes.
If you're familiar with python, there are two big improvements you can make:
- Run the flatten functions concurrently. There's one flatten function per entity type, and each runs independently of the others.
- Flatten multiple `.gz` files within each entity type at the same time. This means parallelizing the `for jsonl_file_name ...` loop in each flatten function and writing multiple CSV files per entity type.
You should now have a directory full of nice, flat CSV files:
$ tree csv-files/
$ cat csv-files/concepts_related_concepts.csv
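Each file starts with a header row naming the columns of its target table; based on the table defined above, `concepts_related_concepts.csv` should begin:
concept_id,related_concept_id,score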
Now we run one postgres copy command to load each CSV file to its corresponding table. Each command looks like this:
\copy openalex.concepts_ancestors (concept_id, ancestor_id) from csv-files/concepts_ancestors.csv csv header
The copy commands for every table are collected in a script, `copy-openalex-csv.sql`. To use it:
1. Copy it to the same place as the python script from step 2, right above the folder with your CSV files.
2. Set the environment variable OPENALEX_SNAPSHOT_DB to the connection URI for your database.
3. If your CSV files aren't in `csv-files`, replace each occurrence of 'csv-files/' in the script with the correct path.
4. Run it like this (from your shell prompt):
psql $OPENALEX_SNAPSHOT_DB < copy-openalex-csv.sql
or like this (from psql):
\i copy-openalex-csv.sql
There are a bunch of ways you can do this - just run the copy commands from the script above in the right order in whatever client you're familiar with.
Now you have all the OpenAlex data in your database and can run queries in your favorite client.
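Before anything fancy, a quick sanity check on the load is worthwhile - a sketch counting one of the big tables from the schema above:
select count(*) as num_works from openalex.works;
If the count is zero or far smaller than you expect, revisit the copy step above.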
Here’s a simple one, getting the OpenAlex ID and OA status for each work:
select w.id, oa.oa_status
from openalex.works w
join openalex.works_open_access oa
on w.id = oa.work_id;
You'll get back one row per work, pairing its ID with its OA status - the full result runs to millions of rows.
Here’s an example of a more complex query - finding the author with the most open access works of all time:
select author_id, count(distinct work_id) as num_oa_works
from (
    select
        a.id as author_id,
        w.id as work_id
    from openalex.authors a
    join openalex.works_authorships wa on a.id = wa.author_id
    join openalex.works w on wa.work_id = w.id
    join openalex.works_open_access oa on w.id = oa.work_id
    where oa.is_oa = true
) author_works
group by 1
order by 2 desc
limit 1;
We get the one row we asked for: the author_id with the largest num_oa_works.
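To put a name to that ID, you can join through to the authors table - a sketch equivalent to the query above, assuming `display_name` is among the author columns as in the OpenAlex schema:
select a.display_name, count(distinct wa.work_id) as num_oa_works
from openalex.authors a
join openalex.works_authorships wa on a.id = wa.author_id
join openalex.works_open_access oa on wa.work_id = oa.work_id
where oa.is_oa = true
group by 1
order by 2 desc
limit 1;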