OpenAlex technical documentation

Download to your machine


Last updated 1 year ago

First off: anyone can get the data for free. While the files are hosted on S3 and we'll be using Amazon tools in these instructions, you don't need an Amazon account.

Many thanks to the AWS Open Data program. They cover the data-transfer fees (about $70 per download!) so users don't have to.

Before you load the snapshot contents to your database, you'll need to get the files that make it up onto your own computer. There are exceptions, like loading to Redshift from S3 or using an ETL product like Xplenty with an S3 connector. If either of these applies to you, see if the snapshot data format is enough to get you started.

The easiest way to get the files is with the Amazon Web Services Command Line Interface (AWS CLI). Sample commands in this documentation will use the AWS CLI. You can find instructions for installing it on your system here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

You can also browse the snapshot files using the AWS console here: https://openalex.s3.amazonaws.com/browse.html. This browser and the CLI will work without an account.

This shell command will copy everything in the openalex S3 bucket to a local folder named openalex-snapshot. It'll take up roughly 300GB of disk space.

aws s3 sync "s3://openalex" "openalex-snapshot" --no-sign-request
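
If you only need one kind of entity, the same command can be pointed at a single prefix instead of the whole bucket. The sketch below assumes the bucket keeps the data/<entity>/ layout shown in the file structure further down this page; download_entity is a hypothetical helper name, not part of the AWS CLI or OpenAlex tooling.

```shell
# Hypothetical helper: sync a single entity type rather than the full snapshot.
# Assumes the bucket is organised as s3://openalex/data/<entity>/ (see the tree on this page).
download_entity() (
    entity="$1"                       # e.g. authors, works, sources
    dest="${2:-openalex-snapshot}"    # local snapshot root; defaults to the folder used above
    aws s3 sync "s3://openalex/data/${entity}" "${dest}/data/${entity}" --no-sign-request
)
```

For example, `download_entity authors` would place the authors files under openalex-snapshot/data/authors, mirroring the bucket layout so a later full sync can reuse them.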

If you download the snapshot into an existing folder, you'll need to use the --delete flag to remove files left over from any previous downloads. You can also remove the contents of the destination folder manually. If you don't, you will see duplicate entities that have moved from one file to another between snapshot updates.
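
As a rough sketch of a repeat download into an existing folder, the function below adds --delete to the sync command and optionally --dryrun, which makes the AWS CLI list the planned copies and deletions without performing them. The name refresh_snapshot and its arguments are assumptions made for illustration.

```shell
# Hypothetical helper: refresh an existing local copy of the snapshot.
# --delete removes local files that no longer exist in the bucket.
refresh_snapshot() (
    dest="${1:-openalex-snapshot}"    # existing local snapshot folder
    mode="${2:-}"                     # pass --dryrun to preview without changing anything
    # ${mode:+"$mode"} expands to nothing when no second argument is given.
    aws s3 sync "s3://openalex" "$dest" --no-sign-request --delete ${mode:+"$mode"}
)
```

Running `refresh_snapshot openalex-snapshot --dryrun` first and then `refresh_snapshot openalex-snapshot` is one cautious way to see what --delete would remove before it does so.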

The size of the snapshot will change over time. You can check the current size before downloading by looking at the output of:

aws s3 ls --summarize --human-readable --no-sign-request --recursive "s3://openalex/"
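
The listing above prints one line per object before the totals, which can be a lot of output. A minimal sketch that keeps only the summary lines follows; it assumes the totals are labelled "Total Objects:" and "Total Size:", as the AWS CLI prints them with --summarize, and snapshot_size is a hypothetical helper name.

```shell
# Hypothetical helper: print only the object count and total size of the bucket.
snapshot_size() (
    aws s3 ls --summarize --human-readable --no-sign-request --recursive "s3://openalex/" \
        | grep -E 'Total (Objects|Size):'    # drop the per-object lines, keep the two totals
)
```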

You should get a file structure like this (edited for length - there are more objects in the actual bucket):

openalex-snapshot/
├── LICENSE.txt
├── RELEASE_NOTES.txt
└── data
    ├── authors
    │   ├── manifest
    │   └── updated_date=2021-12-28
    │       ├── 0000_part_00.gz
    │       └── 0001_part_00.gz
    ├── concepts
    │   ├── manifest
    │   └── updated_date=2021-12-28
    │       ├── 0000_part_00.gz
    │       └── 0001_part_00.gz
    ├── institutions
    │   ├── manifest
    │   └── updated_date=2021-12-28
    │       ├── 0000_part_00.gz
    │       └── 0001_part_00.gz
    ├── sources
    │   ├── manifest
    │   └── updated_date=2021-12-28
    │       ├── 0000_part_00.gz
    │       └── 0001_part_00.gz
    └── works
        ├── manifest
        └── updated_date=2021-12-28
            ├── 0000_part_00.gz
            └── 0001_part_00.gz
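
To check a finished download against the structure above, something like the following sketch can list each entity folder with its number of .gz part files and whether a manifest is present. It assumes the data/<entity>/updated_date=.../ layout shown in the tree; summarize_snapshot is a hypothetical helper and only inspects local files.

```shell
# Hypothetical helper: summarise a local snapshot folder, one line per entity type.
summarize_snapshot() (
    root="${1:-openalex-snapshot}"
    for entity_dir in "$root"/data/*/; do
        [ -d "$entity_dir" ] || continue    # skip when the glob matched nothing
        entity="$(basename "$entity_dir")"
        # Count part files in every updated_date=... subfolder.
        parts="$(find "$entity_dir" -type f -name '*.gz' | wc -l | tr -d ' ')"
        if [ -f "${entity_dir}manifest" ]; then manifest=yes; else manifest=no; fi
        printf '%s\tparts=%s\tmanifest=%s\n' "$entity" "$parts" "$manifest"
    done
)
```

An entity reported with parts=0 or manifest=no suggests the download was interrupted or pointed at the wrong folder.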