First off: anyone can get the data for free. While the files are hosted on S3 and we'll be using Amazon tools in these instructions, you don't need an Amazon account.
Many thanks to the AWS Open Data program. They cover the data-transfer fees (about $70 per download!) so users don't have to.
Before you load the snapshot contents into your database, you'll need to get the files that make it up onto your own computer. There are exceptions, like loading into Redshift directly from S3 or using an ETL product like Xplenty with an S3 connector. If either of these applies to you, see if the snapshot data format is enough to get you started.
This shell command will copy everything in the openalex S3 bucket to a local folder named openalex-snapshot. It'll take up roughly 300GB of disk space.
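A minimal sketch of that copy, assuming the AWS CLI is installed and that the bucket is addressed as s3://openalex. The function name download_snapshot is a hypothetical wrapper, not part of any tool; the aws command inside it is the part that does the work.

```shell
# Hypothetical helper: copy the whole bucket into a local folder.
# The optional first argument overrides the default destination folder.
download_snapshot() {
  # --no-sign-request sends anonymous requests, so no AWS credentials are needed.
  aws s3 sync "s3://openalex" "${1:-openalex-snapshot}" --no-sign-request
}
```

Run download_snapshot with no arguments to copy into openalex-snapshot, or pass a path to copy somewhere else.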
If you download the snapshot into an existing folder, you'll need to use the aws s3 sync --delete flag to remove files left over from any previous downloads. You can also remove the contents of the destination folder manually. If you don't, you will see duplicate Entities that have moved from one file to another between snapshot updates.
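As a sketch of refreshing an existing download, the same sync can be run with --delete added. The function name refresh_snapshot is hypothetical, and the bucket URL s3://openalex is assumed from the bucket name above. Note that --delete removes every local file in the destination that is not in the bucket, so point it only at a folder dedicated to the snapshot.

```shell
# Hypothetical helper: bring an existing local copy in line with the bucket.
# The optional first argument overrides the default destination folder.
refresh_snapshot() {
  # --delete removes local files that no longer exist in the bucket,
  # which is what prevents duplicate Entities from stale files.
  aws s3 sync "s3://openalex" "${1:-openalex-snapshot}" --no-sign-request --delete
}
```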
The size of the snapshot will change over time. You can check the current size before downloading by looking at the output of:
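One way to get that output, sketched under the assumption that the AWS CLI is installed and the bucket is addressed as s3://openalex. The function name snapshot_size is hypothetical.

```shell
# Hypothetical helper: list every object in the bucket with a summary.
snapshot_size() {
  # --recursive walks all prefixes; --summarize appends the object count and
  # total size; --human-readable prints sizes in KiB/MiB/GiB instead of bytes.
  aws s3 ls --summarize --human-readable --no-sign-request --recursive "s3://openalex/"
}
```

The listing is long because it names every file; the object count and total size are printed at the very end.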