Download to your machine
First off: anyone can get the data for free. While the files are hosted on S3 and we’ll be using Amazon tools in these instructions, you don’t need an Amazon account.
Many thanks to the AWS Open Data program. They cover the data-transfer fees (about $70 per download!) so users don't have to.
Before you load the snapshot contents into your database, you’ll need to get the files that make it up onto your own computer. There are exceptions, like loading to Redshift from S3 or using an ETL product like Xplenty with an S3 connector. If either of these applies to you, see if the snapshot data format is enough to get you started.
The easiest way to get the files is with the Amazon Web Services Command Line Interface (AWS CLI). Sample commands in this documentation will use the AWS CLI. You can find instructions for installing it on your system here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
You can also browse the snapshot files using the AWS console here: https://s3.console.aws.amazon.com/s3/buckets/openalex. You need to be logged into an Amazon account to use the console, but it’s still free. The CLI will work without an account.
This shell command will copy everything in the openalex S3 bucket to a local folder named openalex-snapshot. It'll take up about 350GB of disk space.
aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request
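Because the full sync needs about 350GB, it can help to confirm that the target drive has room before starting. The following is a minimal sketch, not part of the official instructions: `has_free_space` is a hypothetical helper name, and it assumes a POSIX-style `df` where the fourth column of `df -Pk` output is the available space in 1024-byte blocks.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# REQUIRED_GB gigabytes free.
# Assumption: `df -Pk` prints one header line, then one data line whose
# fourth column is the available space in 1024-byte blocks.
has_free_space() {
  dir=$1
  required_gb=$2
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  required_kb=$((required_gb * 1024 * 1024))
  [ "$avail_kb" -ge "$required_kb" ]
}

# Example use (commented out so that defining the helper downloads nothing):
# has_free_space . 350 && aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request
```

The 350 in the example comes from the size estimate above; the snapshot may grow over time, so leaving some headroom is a reasonable precaution.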
You should get a file structure like this (edited for length - there are more objects in the actual bucket):
openalex-snapshot/
├── LICENSE.txt
├── RELEASE_NOTES.txt
└── data
    ├── authors
    │   ├── manifest
    │   └── updated_date=2021-12-28
    │       ├── 0000_part_00.gz
    │       └── 0001_part_00.gz
    ├── concepts
    │   ├── manifest
    │   └── updated_date=2021-12-28
    │       ├── 0000_part_00.gz
    │       └── 0001_part_00.gz
    ├── institutions
    │   ├── manifest
    │   └── updated_date=2021-12-28
    │       ├── 0000_part_00.gz
    │       └── 0001_part_00.gz
    ├── venues
    │   ├── manifest
    │   └── updated_date=2021-12-28
    │       ├── 0000_part_00.gz
    │       └── 0001_part_00.gz
    └── works
        ├── manifest
        └── updated_date=2021-12-28
            ├── 0000_part_00.gz
            └── 0001_part_00.gz
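Once the sync finishes, a quick sanity check is to count the part files under each entity folder and confirm that each one is a valid gzip archive. This is a minimal sketch that assumes the directory layout shown above; `count_parts` and `check_parts` are hypothetical helper names, not part of the AWS CLI or any OpenAlex tooling.

```shell
# Hypothetical helper: for each entity folder under SNAPSHOT_DIR/data,
# print "<entity> <number of .gz part files>".
count_parts() {
  for entity_dir in "$1"/data/*/; do
    [ -d "$entity_dir" ] || continue
    n=$(find "$entity_dir" -type f -name '*.gz' | wc -l | tr -d ' ')
    echo "$(basename "$entity_dir") $n"
  done
}

# Hypothetical helper: run `gzip -t` on every part file and list the ones
# that fail (truncated or corrupt). Returns non-zero if any file fails.
check_parts() {
  bad=$(find "$1/data" -type f -name '*.gz' ! -exec gzip -t {} \; -print 2>/dev/null)
  if [ -n "$bad" ]; then
    echo "$bad" | sed 's/^/corrupt: /'
    return 1
  fi
  return 0
}

# Example use (commented out; run from the folder that contains the snapshot):
# count_parts openalex-snapshot
# check_parts openalex-snapshot
```

Testing every archive reads the whole snapshot from disk, so on the full download `check_parts` may take a while. Rerunning the same `aws s3 sync` command should fetch any files that are missing or differ from the bucket.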