Full-text PDFs
OpenAlex has cached copies of full-text content for about 60 million works:
60M PDFs: Original PDF files
43M TEI XML: Machine-readable structured text parsed by Grobid
This page covers bulk download options. For downloading individual files via the API, see Download PDF content.
Storage details
The full-text archive is approximately 270 TB total:
PDF: 60M files, ~250 TB
TEI XML: 43M files, ~20 TB
Download options
Option 1: API (up to ~10K files)
Use the content API to download files one at a time. Each download costs $0.01.
With a free API key ($1/day budget), you can download about 100 files per day. Good for research projects, building small corpora, or sampling.
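As a rough planning aid, the budget arithmetic above can be sketched in Python. The two prices come from this page; the function names are illustrative and not part of any OpenAlex library.

```python
import math

COST_PER_FILE_USD = 0.01      # per-file price stated above
FREE_DAILY_BUDGET_USD = 1.00  # free API key budget stated above


def files_per_day(daily_budget_usd: float,
                  cost_per_file_usd: float = COST_PER_FILE_USD) -> int:
    """Number of files a daily budget covers."""
    # Work in integer cents to avoid floating-point rounding surprises.
    budget_cents = round(daily_budget_usd * 100)
    cost_cents = round(cost_per_file_usd * 100)
    return budget_cents // cost_cents


def days_needed(n_files: int,
                daily_budget_usd: float = FREE_DAILY_BUDGET_USD) -> int:
    """Days required to download n_files at the given daily budget."""
    return math.ceil(n_files / files_per_day(daily_budget_usd))


print(files_per_day(FREE_DAILY_BUDGET_USD))  # 100
print(days_needed(10_000))                   # 100
```

On the free budget, a 10K-file corpus therefore takes about 100 days, which is why larger jobs are better served by the options below.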
Option 2: OpenAlex CLI (up to a few million files)
For larger downloads, use the OpenAlex CLI. It handles parallel downloads, automatic retries, checkpointing, and resume, so you don't have to build all that yourself.
Install:
Example: Download PDFs for a specific topic
Example: Download 2026 works with Creative Commons licenses
Example: Download metadata + PDFs + TEI XML
See the OpenAlex CLI page for more examples and full documentation.
Standard rates apply ($0.01 per content file download; metadata is free). At full speed, you can download a few million files in a few days.
Option 3: Complete archive sync
For downloading the complete archive (all 60M files), we provide direct access to the storage bucket. You get time-limited credentials and use standard S3 tools to sync.
Files are stored on Cloudflare R2, which is fully compatible with the S3 API. You can use the AWS CLI, boto3, or any S3-compatible tool.
One-time download: 30-day R2 read access to sync the complete archive.
Ongoing sync: Persistent R2 read access is included with our enterprise subscription.
See the pricing page for details, or contact us to get started.
How it works:
We generate R2 API credentials with read-only access
You sync using the AWS CLI:
For ongoing sync, run periodically to get new files
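The steps above could be scripted roughly as follows. This is a minimal sketch, not the official procedure: the account ID, bucket name, and destination directory are placeholders that would come with your credentials. It assumes the AWS CLI reads the R2 keys from the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, and that the bucket is reached through Cloudflare's per-account S3 endpoint.

```python
import shlex


def build_sync_command(account_id: str, bucket: str, dest_dir: str) -> list[str]:
    """Build an `aws s3 sync` invocation that targets an R2 bucket."""
    # Cloudflare R2 exposes its S3-compatible API at a per-account endpoint.
    endpoint = f"https://{account_id}.r2.cloudflarestorage.com"
    return [
        "aws", "s3", "sync",
        f"s3://{bucket}", dest_dir,
        "--endpoint-url", endpoint,
    ]


# Placeholder values: substitute the ones issued with your credentials.
cmd = build_sync_command("ACCOUNT_ID", "example-bucket", "./openalex-fulltext")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Because `aws s3 sync` only copies objects that are missing or changed locally, re-running the same command periodically covers the ongoing-sync case.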
At typical network speeds, expect 1-2 weeks to download the full archive.
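To check whether that estimate fits your own connection, the sustained throughput it implies can be worked out from the archive size. This sketch assumes decimal terabytes and an uninterrupted transfer; real runs will have some overhead.

```python
ARCHIVE_TB = 270  # approximate archive size stated above


def required_gbps(size_tb: float, days: float) -> float:
    """Sustained throughput (Gbit/s) needed to move size_tb in the given days."""
    bits = size_tb * 1e12 * 8        # decimal terabytes -> bits
    seconds = days * 24 * 60 * 60
    return bits / seconds / 1e9      # bits per second -> gigabits per second


for days in (7, 14):
    print(f"{days} days: {required_gbps(ARCHIVE_TB, days):.1f} Gbit/s sustained")
# 7 days: 3.6 Gbit/s sustained
# 14 days: 1.8 Gbit/s sustained
```

A slower link scales the duration proportionally: at 1 Gbit/s sustained, the same transfer takes roughly 25 days.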
File naming
Files are named by UUID, not work ID:
To map work IDs to file UUIDs, use the snapshot data. The locations array in each work contains pdf_url fields that include the UUID.
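One way to build that mapping is to pull UUID-shaped substrings out of each pdf_url. The sketch below assumes work records have already been parsed into Python dicts. The sample record and its URL shape are invented for illustration, so check them against real snapshot data before relying on the pattern.

```python
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID anywhere in a string.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def file_uuids(work: dict) -> list[str]:
    """Return the UUIDs found in the pdf_url of each location of a work."""
    uuids = []
    for location in work.get("locations") or []:
        pdf_url = location.get("pdf_url")
        if not pdf_url:
            continue  # this location has no PDF
        match = UUID_RE.search(pdf_url)
        if match:
            uuids.append(match.group(0).lower())
    return uuids


# Hypothetical record: the ID and URL are placeholders, not real data.
work = {
    "id": "https://openalex.org/W0000000000",
    "locations": [
        {"pdf_url": "https://example.org/files/123e4567-e89b-12d3-a456-426614174000.pdf"},
        {"pdf_url": None},
    ],
}
print(file_uuids(work))
```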
Licensing
The PDFs retain their original copyright. OpenAlex does not grant any additional rights to the content; we're just providing access to files we've collected.
To check the license for a specific work, use the best_oa_location.license field in the API. This tells you the license associated with the work's best open access location (e.g., cc-by, cc-by-nc, cc0, or null if unknown).
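When reading that field from a work record in code, it helps to guard against a missing location. A minimal sketch, assuming the record is a parsed JSON dict and that best_oa_location can itself be null when a work has no open access copy:

```python
from typing import Optional


def best_oa_license(work: dict) -> Optional[str]:
    """Return the license of the best OA location, or None if unknown."""
    # Treat a null or absent best_oa_location as "no license information".
    location = work.get("best_oa_location") or {}
    return location.get("license")


print(best_oa_license({"best_oa_location": {"license": "cc-by"}}))  # cc-by
print(best_oa_license({"best_oa_location": None}))                  # None
```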
Filter by license:
You can use the API to find works with a specific license:
This returns works that have downloadable PDFs and are licensed under CC BY.
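A query of that shape could be assembled as in the sketch below. The best_oa_location.license filter follows the field named above. The has_content.pdf filter name is an assumption about how "has a downloadable PDF" is expressed, so confirm it against the API documentation before use.

```python
from urllib.parse import urlencode

API_BASE = "https://api.openalex.org/works"


def license_filter_url(license_id: str, per_page: int = 25) -> str:
    """Build a works query for a given license, restricted to works with PDFs."""
    filters = [
        "has_content.pdf:true",  # assumed filter name, not confirmed here
        f"best_oa_location.license:{license_id}",
    ]
    # Keep ':' and ',' unescaped so the filter expression stays readable.
    query = urlencode(
        {"filter": ",".join(filters), "per-page": per_page},
        safe=":,",
    )
    return f"{API_BASE}?{query}"


print(license_filter_url("cc-by"))
```

The function only builds the URL; fetching it with an HTTP client and paging through results is left to the caller.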