Full-text PDFs

OpenAlex has cached copies of full-text content for about 60 million works:

  • 60M PDFs: original PDF files

  • 43M TEI XML: machine-readable structured text parsed by Grobid

This page covers bulk download options. For downloading individual files via the API, see Download PDF content.

Storage details

The full-text archive is approximately 270 TB total:

Format     Files   Size
PDF        60M     ~250 TB
TEI XML    43M     ~20 TB

Download options

Option 1: API (up to ~10K files)

Use the content API to download files one at a time. Each download costs 100 credits.

With a free API key (100K credits/day), you can download about 1,000 files per day. Good for research projects, building small corpora, or sampling.
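The arithmetic behind that estimate, as a small sketch. The two constants come from the figures above; the function names are illustrative only.

```python
CREDITS_PER_DOWNLOAD = 100      # cost per file, from the text above
FREE_DAILY_CREDITS = 100_000    # free API key allowance per day, from the text above


def files_per_day(daily_credits: int, cost: int = CREDITS_PER_DOWNLOAD) -> int:
    """Number of whole files a daily credit allowance pays for."""
    return daily_credits // cost


def days_needed(n_files: int, daily_credits: int) -> int:
    """Days required to download n_files, rounded up to whole days."""
    per_day = files_per_day(daily_credits)
    return -(-n_files // per_day)  # ceiling division


print(files_per_day(FREE_DAILY_CREDITS))        # 1000
print(days_needed(10_000, FREE_DAILY_CREDITS))  # 10
```

On a free key, the ~10K-file ceiling for this option therefore corresponds to about ten days of downloading.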

Option 2: Script (up to a few million files)

For larger downloads, write a script that iterates through work IDs and downloads each file via the API. Standard credit rates apply (100 credits per download).

Here's a basic Python example:
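A minimal sketch using only the standard library. The URL pattern in CONTENT_URL and the api_key query parameter are assumptions, not confirmed by this page; check Download PDF content for the actual endpoint. The function names and the example work ID are illustrative.

```python
import os
import urllib.parse
import urllib.request

# ASSUMPTION: base URL and path layout of the content API. Verify against
# the "Download PDF content" page before relying on it.
CONTENT_URL = "https://content.openalex.org/works/{work_id}.pdf"


def build_url(work_id: str, api_key: str) -> str:
    """Build the download URL for one work (URL pattern is assumed)."""
    query = urllib.parse.urlencode({"api_key": api_key})
    return CONTENT_URL.format(work_id=work_id) + "?" + query


def fetch(url: str, timeout: float = 60.0) -> bytes:
    """Fetch the raw bytes at url; urlopen raises HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(work_ids, api_key, out_dir, fetch=fetch):
    """Download one PDF per work ID into out_dir; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for work_id in work_ids:
        path = os.path.join(out_dir, f"{work_id}.pdf")
        data = fetch(build_url(work_id, api_key))  # costs 100 credits per file
        with open(path, "wb") as fh:
            fh.write(data)
        written.append(path)
    return written


# Example call (not run here; needs a real key and network access):
# download_all(["W2741809807"], os.environ["OPENALEX_API_KEY"], "pdfs")
```

The fetch argument is injectable so the loop can be exercised without network access.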

Add your own error handling, retries, and resume logic. At 100 downloads/second, you can download a few million files in a few days.
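One possible shape for the retry and resume logic, as a sketch. fetch_pdf is a hypothetical callable that wraps the content API request and returns the file bytes; the backoff parameters are arbitrary starting points, not recommended values.

```python
import os
import random
import time


def with_retries(func, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, back off exponentially and retry.

    Re-raises the last exception once all attempts are used up. A real
    script would retry only transient errors (timeouts, HTTP 429/5xx).
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter so parallel workers do not
            # all retry at the same moment
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def download_resumable(work_ids, out_dir, fetch_pdf, sleep=time.sleep):
    """Download each work's PDF, skipping files already on disk.

    fetch_pdf(work_id) -> bytes is a hypothetical wrapper around the API.
    Returns a (downloaded, skipped) pair of counts.
    """
    os.makedirs(out_dir, exist_ok=True)
    downloaded = skipped = 0
    for work_id in work_ids:
        path = os.path.join(out_dir, f"{work_id}.pdf")
        if os.path.exists(path):
            skipped += 1  # finished on a previous run, costs no credits
            continue
        data = with_retries(lambda: fetch_pdf(work_id), sleep=sleep)
        tmp = path + ".part"
        with open(tmp, "wb") as fh:
            fh.write(data)
        os.replace(tmp, path)  # atomic rename: no half-written .pdf files
        downloaded += 1
    return downloaded, skipped
```

Writing to a .part file and renaming it means an interrupted run never leaves a truncated PDF that a later run would mistake for a finished download.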

Option 3: Complete archive sync

For downloading the complete archive (all 60M files), we provide direct access to the storage bucket. You get time-limited credentials and use standard S3 tools to sync.

Files are stored on Cloudflare R2, which is fully compatible with the S3 API. You can use the AWS CLI, boto3, or any S3-compatible tool.

One-time download: 30-day R2 read access to sync the complete archive.

Ongoing sync: Persistent R2 read access is included with our enterprise subscription.

See the pricing page for details, or contact us to get started.

How it works:

  1. We generate R2 API credentials with read-only access

  2. You sync using the AWS CLI.

  3. For ongoing sync, run the sync periodically to get new files.
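The same steps can be sketched in Python with boto3, which this page lists as an alternative to the AWS CLI. The account ID, bucket name, and credentials are placeholders for the values issued with your access; the function names are illustrative. The skip rule compares file size only, which is a simplification of what a full sync tool checks.

```python
import os


def make_r2_client(account_id, access_key_id, secret_access_key):
    """Create an S3 client pointed at Cloudflare R2 (requires boto3)."""
    import boto3  # imported lazily so sync_bucket has no hard dependency

    return boto3.client(
        "s3",
        endpoint_url=f"https://{account_id}.r2.cloudflarestorage.com",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
        region_name="auto",
    )


def sync_bucket(client, bucket, dest_dir, prefix=""):
    """Copy objects that are missing locally or differ in size.

    Returns the number of objects downloaded. Running it again later
    fetches only new or changed files, which covers the ongoing-sync case.
    """
    copied = 0
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip directory placeholder objects
            local_path = os.path.join(dest_dir, *key.split("/"))
            if (os.path.exists(local_path)
                    and os.path.getsize(local_path) == obj["Size"]):
                continue  # already synced
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            client.download_file(bucket, key, local_path)
            copied += 1
    return copied


# Example call (placeholders, not real values):
# client = make_r2_client("ACCOUNT_ID", "ACCESS_KEY_ID", "SECRET_ACCESS_KEY")
# sync_bucket(client, "BUCKET_NAME", "/data/openalex-fulltext")
```

This version downloads sequentially; at the scale of the full archive you would want parallel workers or a dedicated sync tool.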

At typical network speeds, expect 1-2 weeks to download the full archive.
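To see what that estimate implies for your own connection, here is a back-of-the-envelope calculation. It assumes decimal terabytes (1 TB = 10^12 bytes) and a fully sustained transfer rate, so real downloads will take somewhat longer.

```python
ARCHIVE_TB = 270  # total archive size, from "Storage details"


def required_gbps(size_tb: float, days: float) -> float:
    """Sustained throughput in Gbit/s needed to move size_tb in the given days."""
    bits = size_tb * 1e12 * 8          # decimal terabytes -> bits
    seconds = days * 24 * 60 * 60
    return bits / seconds / 1e9


def days_required(size_tb: float, gbps: float) -> float:
    """Days needed to move size_tb at a sustained rate of gbps Gbit/s."""
    return size_tb * 1e12 * 8 / (gbps * 1e9) / 86_400


print(round(required_gbps(ARCHIVE_TB, 7), 2))    # 3.57
print(round(required_gbps(ARCHIVE_TB, 14), 2))   # 1.79
print(round(days_required(ARCHIVE_TB, 1.0), 1))  # 25.0
```

The 1-2 week figure corresponds to roughly 2 to 4 Gbit/s sustained; on a 1 Gbit/s link the same transfer takes about 25 days.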

File naming

Files are named by UUID, not by work ID.

To map work IDs to file UUIDs, use the snapshot data. The locations array in each work contains pdf_url fields that include the UUID.
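A sketch of that mapping step, assuming the UUID appears in each pdf_url in the canonical 8-4-4-4-12 hexadecimal form. The exact URL layout is not specified on this page, so the pattern simply searches for a UUID anywhere in the URL. The sample record is made up to show the shape of the data.

```python
import re

# Canonical UUID pattern: 8-4-4-4-12 hexadecimal digits
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def file_uuids(work: dict) -> list:
    """Return the UUIDs found in the pdf_url fields of a work's locations."""
    found = []
    for location in work.get("locations") or []:
        pdf_url = location.get("pdf_url") or ""  # pdf_url may be null
        match = UUID_RE.search(pdf_url)
        if match:
            found.append(match.group(0).lower())
    return found


def build_mapping(works) -> dict:
    """Map each work ID to its list of file UUIDs, skipping works with none."""
    mapping = {}
    for work in works:
        uuids = file_uuids(work)
        if uuids:
            mapping[work["id"]] = uuids
    return mapping


# Hypothetical record: the ID and URL below are invented for illustration.
sample = {
    "id": "https://openalex.org/W0000000001",
    "locations": [
        {"pdf_url": "https://example.org/files/123e4567-e89b-12d3-a456-426614174000.pdf"},
        {"pdf_url": None},
    ],
}
print(build_mapping([sample]))
```

Here works can be any iterable of parsed work records, for example the result of calling json.loads on each line of a snapshot file.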

Licensing

The PDFs themselves retain their original copyright. OpenAlex does not grant any additional rights to the content. Many files are Open Access, but not all; check each work's open_access field for licensing information.
