Load to a data warehouse
In many data warehouse and document store applications, you can load the OpenAlex entities as-is and query them directly. We’ll use BigQuery as an example here. (Elasticsearch docs coming soon). To follow along you’ll need the Google Cloud SDK. You’ll also need a Google account that can make BigQuery tables that are, well… big. Which means it probably won’t be free.
We'll show you how to do this in 4 steps:
Create a BigQuery Project and Dataset to hold your tables
Create the tables that will hold your entity JSON records
Copy the data files to the tables you created
Run some queries on the data you loaded
This guide will have you load each entity to a single text column, then use BigQuery's JSON functions to parse them when you run your queries. This is convenient but inefficient since each object has to be parsed every time you run a query.
This project, kindly shared by @DShvadron, takes a more efficient approach: https://github.com/DrorSh/openalex_to_gbq
Separating the Entity data into multiple columns takes more work up front but lets you write queries that are faster, simpler, and often cheaper.
Snowflake users can connect to a ready-to-query data set on the marketplace, helpfully maintained by Util - https://app.snowflake.com/marketplace/listing/GZT0ZOMX4O7
Step 1: Create a BigQuery Project and Dataset
In BigQuery, you need a Project and Dataset to hold your tables. We’ll call the project “openalex-demo” and the dataset “openalex”. Follow the linked instructions to create the Project, then create the dataset inside it:
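A minimal sketch of that dataset-creation command with the bq CLI, assuming the Google Cloud project openalex-demo already exists and the SDK is authenticated:

```bash
# create the "openalex" dataset inside the "openalex-demo" project
bq mk --dataset openalex-demo:openalex
```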
Dataset 'openalex-demo:openalex' successfully created
Step 2: Create tables for each entity type
Now, we’ll create tables inside the dataset. There will be 6 tables, one for each entity type. Since we’re using JSON, each table will have just one text column, named after the entity it holds.
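For example, here is a sketch of the table-creation commands for the first two entity types, assuming the dataset from step 1 (the remaining tables follow the same pattern):

```bash
# one single-column table per entity type; the column holds each raw JSON record
bq mk --table openalex-demo:openalex.works work:string
bq mk --table openalex-demo:openalex.authors author:string
```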
Table 'openalex-demo:openalex.works' successfully created.
Table 'openalex-demo:openalex.authors' successfully created
and so on for sources, institutions, concepts, and publishers.
Step 3: Load the data files
We’ll load each table’s data from the JSON Lines files we downloaded earlier. For works, the files were:
openalex-snapshot/data/works/updated_date=2021-12-28/0000_part_00.gz
openalex-snapshot/data/works/updated_date=2021-12-28/0001_part_00.gz
Here’s a command to load one works file (don’t run it yet):
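A sketch of what that command can look like, using the flags explained below (the table is the one created in step 2, and the file path is from the example snapshot above):

```bash
bq load --source_format=CSV -F '\t' --schema 'work:string' \
  openalex-demo:openalex.works \
  'openalex-snapshot/data/works/updated_date=2021-12-28/0000_part_00.gz'
```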
See the full documentation for the bq load command here: https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_load
This part of the command may need some explanation:
--source_format=CSV -F '\t' --schema 'work:string'
BigQuery is expecting multiple columns with predefined datatypes (a “schema”). We’re tricking it into accepting a single text column (--schema 'work:string') by specifying CSV format (--source_format=CSV) with a column delimiter that isn’t present in the file (-F '\t', where \t means “tab”).
bq load can only handle one file at a time, so you must run this command once per file. But remember that the real dataset will have many more files than this example does, so it's impractical to copy, edit, and rerun the command each time. It's easier to handle all the files in a loop, like this:
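One way to write that loop, assuming the snapshot still sits in ./openalex-snapshot and the works table from step 2 exists:

```bash
# load every works file in the snapshot, one bq load invocation per file
for data_file in openalex-snapshot/data/works/*/*.gz; do
  bq load --source_format=CSV -F '\t' --schema 'work:string' \
    openalex-demo:openalex.works "$data_file"
done
```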
This step is slow. How slow depends on your upload speed, but for Author and Work we're talking hours, not minutes.
You can speed this up by using parallel or other tools to run multiple upload commands at once. If you do, watch out for errors caused by hitting BigQuery quota limits.
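For example, a rough sketch using xargs instead of parallel (the -P 4 concurrency is an arbitrary choice; raising it makes quota errors more likely):

```bash
# run up to four bq load jobs at a time, one file per job
ls openalex-snapshot/data/works/*/*.gz | xargs -n 1 -P 4 \
  bq load --source_format=CSV -F '\t' --schema 'work:string' \
  openalex-demo:openalex.works
```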
Do this once per entity type, substituting each entity name for work/works as needed. When you’re finished, you’ll have six single-column tables, one for each entity type.
Step 4: Run your queries!
Now you have all the OpenAlex data in a place where you can do anything you want with it, using BigQuery JSON functions through bq query or the BigQuery console.
Here’s a simple one, extracting the OpenAlex ID and OA status for each work:
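One way to write such a query, assuming the works table from step 2 (JSON_EXTRACT_SCALAR pulls a single value out of the JSON text column):

```sql
SELECT
  JSON_EXTRACT_SCALAR(work, '$.id') AS id,
  -- is_oa is a boolean inside each work's open_access object
  CAST(JSON_EXTRACT_SCALAR(work, '$.open_access.is_oa') AS BOOL) AS is_oa
FROM
  openalex.works
```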
It will give you each work’s ID and OA status (this is a truncated sample with the IDs elided; the real result will be millions of rows):
id    is_oa
…     TRUE
…     FALSE
…     FALSE
You can run queries like this directly in your shell:
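For instance, a sketch of the same query run through bq query (--nouse_legacy_sql selects standard SQL, and the LIMIT, added here just to keep the sample small, is optional):

```bash
bq query --nouse_legacy_sql \
  'SELECT
     JSON_EXTRACT_SCALAR(work, "$.id") AS id,
     CAST(JSON_EXTRACT_SCALAR(work, "$.open_access.is_oa") AS BOOL) AS is_oa
   FROM openalex.works
   LIMIT 10'
```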
But even simple queries are hard to read and edit this way. It’s better to write them in a file than directly on the command line. Here’s an example of a slightly more complex query - finding the author with the most open access works of all time:
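A sketch of one way to write that query, assuming the works table from step 2: it unnests each work’s authorships array and counts open access works per author ID.

```sql
SELECT
  JSON_EXTRACT_SCALAR(authorship, '$.author.id') AS author_id,
  COUNT(*) AS num_oa_works
FROM
  openalex.works,
  -- each element of the authorships array is one author's contribution to the work
  UNNEST(JSON_EXTRACT_ARRAY(work, '$.authorships')) AS authorship
WHERE
  CAST(JSON_EXTRACT_SCALAR(work, '$.open_access.is_oa') AS BOOL)
GROUP BY author_id
ORDER BY num_oa_works DESC
LIMIT 1
```

Saved in a file (say, most_oa_author.sql, a name chosen here for illustration), it can be run with bq query --nouse_legacy_sql < most_oa_author.sql.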
We get one result:
https://openalex.org/A2798520857    3297
Checking out https://api.openalex.org/authors/A2798520857, we see that this is Ashok Kumar at Manipal University Jaipur.