Full-text PDFs
OpenAlex has cached copies of full-text content for about 60 million works:
60M PDFs: original PDF files
43M TEI XML: machine-readable structured text parsed by Grobid
This page covers bulk download options. For downloading individual files via the API, see Download PDF content.
Storage details
The full-text archive is approximately 270 TB total:
PDF: 60M files, ~250 TB
TEI XML: 43M files, ~20 TB
Download options
Option 1: API (up to ~10K files)
Use the content API to download files one at a time. Each download costs 100 credits.
With a free API key (100K credits/day), you can download about 1,000 files per day. Good for research projects, building small corpora, or sampling.
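The credit arithmetic above can be checked with a short calculation. This is a sketch that uses only the figures quoted on this page (100 credits per download, 100K credits per day on a free key); the function names are illustrative.

```python
import math

CREDITS_PER_DOWNLOAD = 100  # per-download cost quoted on this page


def files_per_day(daily_credits):
    """Number of whole downloads a daily credit allowance covers."""
    return daily_credits // CREDITS_PER_DOWNLOAD


def days_needed(n_files, daily_credits):
    """Days required to download n_files at the given daily allowance."""
    return math.ceil(n_files / files_per_day(daily_credits))


free_key_credits = 100_000
print(files_per_day(free_key_credits))        # files per day on a free key
print(days_needed(10_000, free_key_credits))  # days for a 10K-file corpus
```

On a free key, a corpus near the 10K upper bound of this option takes about ten days.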
Option 2: Script (up to a few million files)
For larger downloads, write a script that iterates through work IDs and downloads each file via the API. Standard credit rates apply (100 credits per download).
Here's a basic Python example:
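A minimal sketch of such a script, using only the standard library. The URL pattern in URL_TEMPLATE is an assumption that is not taken from this page, so confirm it against Download PDF content before use. The function names and the work IDs in the usage comment are illustrative.

```python
import os
import urllib.request

# Assumption: placeholder URL pattern for the content API.
# Check the "Download PDF content" page for the actual endpoint.
URL_TEMPLATE = "https://content.openalex.org/works/{work_id}.pdf?api_key={api_key}"


def fetch_pdf(work_id, api_key):
    """Download one PDF and return its bytes (each call costs 100 credits)."""
    url = URL_TEMPLATE.format(work_id=work_id, api_key=api_key)
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def download_all(work_ids, out_dir, api_key, fetch=fetch_pdf):
    """Download each work's PDF into out_dir and return the written paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for work_id in work_ids:
        data = fetch(work_id, api_key)
        path = os.path.join(out_dir, f"{work_id}.pdf")
        with open(path, "wb") as fh:
            fh.write(data)
        written.append(path)
    return written


# Example usage (performs real downloads and spends credits):
# download_all(["W2741809807"], "pdfs", api_key="YOUR_API_KEY")
```

The fetch parameter exists so the loop can be exercised with a stub before spending credits on real downloads.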
Add your own error handling, retries, and resume logic. At a sustained 100 downloads/second (about 8.6 million files per day), you can download a few million files within a few days, even allowing for retries.
Option 3: Complete archive sync
For downloading the complete archive (all 60M files), we provide direct access to the storage bucket. You get time-limited credentials and use standard S3 tools to sync.
Files are stored on Cloudflare R2, which is fully compatible with the S3 API. You can use the AWS CLI, boto3, or any S3-compatible tool.
One-time download: 30-day R2 read access to sync the complete archive.
Ongoing sync: Persistent R2 read access is included with our enterprise subscription.
See the pricing page for details, or contact us to get started.
How it works:
We generate R2 API credentials with read-only access
You sync using the AWS CLI:
For ongoing sync, run periodically to get new files
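The sync step above can be sketched as a small Python helper that assembles the AWS CLI command. The account ID, bucket name, and destination directory are hypothetical placeholders; the real values come with the credentials you receive. The endpoint follows Cloudflare's standard R2 form, and the AWS CLI reads the keys from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

```python
import shlex


def build_sync_command(account_id, bucket, dest_dir):
    """Return the argument list for an `aws s3 sync` against an R2 bucket."""
    # Standard R2 S3-compatible endpoint; account_id is a placeholder here.
    endpoint = f"https://{account_id}.r2.cloudflarestorage.com"
    return [
        "aws", "s3", "sync",
        f"s3://{bucket}",   # source: the bucket named in your credentials email
        dest_dir,           # destination: local directory
        "--endpoint-url", endpoint,
    ]


# Hypothetical values, for illustration only.
cmd = build_sync_command("ACCOUNT_ID", "BUCKET_NAME", "./fulltext")
print(shlex.join(cmd))

# To run it from Python instead of a shell:
# import subprocess; subprocess.run(cmd, check=True)
```

Because sync copies only objects that are missing or changed locally, re-running the same command later picks up newly added files.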
At typical network speeds, expect 1-2 weeks to download the full archive.
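To see what that estimate implies for your connection, the sustained throughput can be computed from the archive size. This sketch assumes decimal units (1 TB = 10^12 bytes) and ignores protocol overhead and retries.

```python
def required_gbps(total_tb, days):
    """Sustained throughput in Gbit/s needed to move total_tb in `days` days."""
    bits = total_tb * 1e12 * 8   # decimal terabytes to bits (assumption)
    seconds = days * 86_400
    return bits / seconds / 1e9


for days in (7, 14):
    print(f"{days} days: {required_gbps(270, days):.1f} Gbit/s sustained")
```

For the ~270 TB archive this works out to roughly 2 to 4 Gbit/s sustained, so slower links should plan for a proportionally longer sync.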
File naming
Files are named by UUID, not work ID.
To map work IDs to file UUIDs, use the snapshot data. The locations array in each work contains pdf_url fields that include the UUID.
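The mapping can be sketched as follows, assuming the UUID appears somewhere in each pdf_url in the standard 8-4-4-4-12 hexadecimal form. The sample record and its URL are hypothetical and only illustrate the shape of the data.

```python
import re

# Standard textual UUID pattern: 8-4-4-4-12 hexadecimal digits.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def file_uuids(work):
    """Return the UUIDs found in the pdf_url fields of a work's locations."""
    uuids = []
    for location in work.get("locations") or []:
        url = location.get("pdf_url") or ""
        match = UUID_RE.search(url)
        if match:
            uuids.append(match.group(0).lower())
    return uuids


# Hypothetical snapshot record; the URL layout is an assumption.
sample_work = {
    "id": "https://openalex.org/W0000000001",
    "locations": [
        {"pdf_url": "https://example.org/files/123E4567-e89b-12d3-a456-426614174000.pdf"},
        {"pdf_url": None},
    ],
}

mapping = {sample_work["id"]: file_uuids(sample_work)}
print(mapping)
```

Locations without a pdf_url, or with a URL that contains no UUID, are skipped rather than raising an error.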
Licensing
The PDFs themselves retain their original copyright. OpenAlex does not grant any additional rights to the content. Many files are Open Access, but not all; check each work's open_access field for licensing information.