Full-text PDFs

OpenAlex has cached copies of full-text content for about 60 million works:

  • 60M PDFs: original PDF files

  • 43M TEI XML: machine-readable structured text parsed by Grobid

This page covers bulk download options. For downloading individual files via the API, see Download PDF content.

Storage details

The full-text archive is approximately 270 TB total:

Format     Files    Size
PDF        60M      ~250 TB
TEI XML    43M      ~20 TB

Download options

Option 1: API (up to ~10K files)

Use the content API to download files one at a time. Each download costs $0.01.

With a free API key ($1/day budget), you can download about 100 files per day. Good for research projects, building small corpora, or sampling.
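As a rough check on these numbers, the arithmetic can be sketched in a few lines. The function names are illustrative only; the price and the daily budget are the figures quoted above.

```python
import math

PRICE_PER_FILE_USD = 0.01     # per-download price quoted above
FREE_DAILY_BUDGET_USD = 1.00  # free API key budget quoted above


def files_per_day(daily_budget_usd: float, price_usd: float = PRICE_PER_FILE_USD) -> int:
    """Number of whole files a daily budget covers."""
    # Convert to whole cents so the division is exact integer arithmetic.
    return round(daily_budget_usd * 100) // round(price_usd * 100)


def days_needed(n_files: int, daily_budget_usd: float = FREE_DAILY_BUDGET_USD) -> int:
    """Days required to download n_files without exceeding the daily budget."""
    return math.ceil(n_files / files_per_day(daily_budget_usd))


print(files_per_day(FREE_DAILY_BUDGET_USD))  # 100
print(days_needed(10_000))                   # 100
```

On the free budget, the ~10K upper bound for this option works out to roughly 100 days of downloading, which is why larger jobs are better served by the options below.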

Option 2: OpenAlex CLI (up to a few million files)

For larger downloads, use the OpenAlex CLI. It handles parallel downloads, automatic retries, checkpointing, and resume, so you don't have to build all that yourself.

Install:

Example: Download PDFs for a specific topic

Example: Download 2026 works with Creative Commons licenses

Example: Download metadata + PDFs + TEI XML

See the OpenAlex CLI page for more examples and full documentation.

Standard rates apply ($0.01 per content file download; metadata is free). At full speed, you can download a few million files in a few days.

Option 3: Complete archive sync

For downloading the complete archive (all 60M files), we provide direct access to the storage bucket. You get time-limited credentials and use standard S3 tools to sync.

Files are stored on Cloudflare R2, which is fully compatible with the S3 API. You can use the AWS CLI, boto3, or any S3-compatible tool.

One-time download: 30-day R2 read access to sync the complete archive.

Ongoing sync: Persistent R2 read access is included with our enterprise subscription.

See the pricing page for details, or contact us to get started.

How it works:

  1. We generate R2 API credentials with read-only access

  2. You sync using the AWS CLI:

  3. For ongoing sync, rerun the sync periodically to get new files
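Since boto3 is listed as a supported tool, the steps above could also be sketched in Python. This is a minimal illustration, not the official sync procedure: the bucket name, account ID, and credentials are placeholders you would receive with your access, and the helper names are made up for this example. It assumes the usual Cloudflare R2 endpoint form (`https://<account_id>.r2.cloudflarestorage.com`) and skips any object that already exists locally.

```python
from pathlib import Path


def keys_to_fetch(remote_keys, dest_dir):
    """Return the remote keys that do not yet exist under dest_dir."""
    dest = Path(dest_dir)
    return [key for key in remote_keys if not (dest / key).exists()]


def sync_bucket(bucket, dest_dir, endpoint_url, access_key, secret_key, prefix=""):
    """Download every object under prefix that is missing locally."""
    import boto3  # third-party dependency: pip install boto3

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,  # assumed form: https://<account_id>.r2.cloudflarestorage.com
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name="auto",  # region value used with R2
    )
    # list_objects_v2 returns at most 1,000 keys per call, so paginate.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        remote = [obj["Key"] for obj in page.get("Contents", [])]
        for key in keys_to_fetch(remote, dest_dir):
            target = Path(dest_dir) / key
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, key, str(target))


# Hypothetical usage (every value below is a placeholder):
# sync_bucket(
#     bucket="<bucket-name>",
#     dest_dir="./openalex-fulltext",
#     endpoint_url="https://<account_id>.r2.cloudflarestorage.com",
#     access_key="<access-key-id>",
#     secret_key="<secret-access-key>",
# )
```

Rerunning the function picks up new files because existing ones are skipped. It only checks that a file exists, not that its contents match, and it downloads sequentially, so for tens of millions of files a parallel tool is likely the more practical choice.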

At typical network speeds, expect 1-2 weeks to download the full archive.
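To see what that estimate implies for your own connection, the sustained throughput can be worked out from the archive size quoted above. This sketch assumes decimal terabytes (1 TB = 10^12 bytes) and ignores protocol overhead and per-request latency.

```python
def required_gbps(size_tb: float, days: float) -> float:
    """Sustained throughput in Gbit/s needed to move size_tb terabytes in the given days."""
    bits = size_tb * 1e12 * 8        # decimal terabytes to bits
    seconds = days * 24 * 60 * 60
    return bits / seconds / 1e9


for days in (7, 14):
    print(f"{days} days: {required_gbps(270, days):.1f} Gbit/s")
# 7 days: 3.6 Gbit/s
# 14 days: 1.8 Gbit/s
```

In other words, the 1-2 week figure corresponds to holding roughly 2 to 4 Gbit/s for the whole transfer; on a slower link the download takes proportionally longer.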

File naming

Files are named by UUID, not work ID:

To map work IDs to file UUIDs, use the snapshot data. The locations array in each work contains pdf_url fields that include the UUID.
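A minimal sketch of that mapping, assuming the UUID appears in the `pdf_url` in the standard hyphenated form. The function name and the sample record (including its URL) are hypothetical and only illustrate the shape of the data.

```python
import re

# Standard 8-4-4-4-12 hexadecimal UUID pattern.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def file_uuids(work: dict) -> list[str]:
    """Collect the distinct UUIDs found in the pdf_url of each location of a work."""
    uuids: list[str] = []
    for location in work.get("locations") or []:
        match = UUID_RE.search(location.get("pdf_url") or "")
        if match and match.group(0).lower() not in uuids:
            uuids.append(match.group(0).lower())
    return uuids


# Hypothetical work record; the ID and URL are made up for illustration.
work = {
    "id": "https://openalex.org/W0000000000",
    "locations": [
        {"pdf_url": "https://example.org/files/123e4567-e89b-12d3-a456-426614174000.pdf"},
        {"pdf_url": None},
    ],
}
print(file_uuids(work))  # ['123e4567-e89b-12d3-a456-426614174000']
```

Running this over the snapshot yields a work ID to UUID lookup that can then be matched against the file names in the bucket.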

Licensing

The PDFs retain their original copyright. OpenAlex does not grant any additional rights to the content; we're just providing access to files we've collected.

To check the license for a specific work, use the best_oa_location.license field in the API. This tells you the license associated with the work's best open access location (e.g., cc-by, cc-by-nc, cc0, or null if unknown).
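If you already have work records in hand (from the API or the snapshot), the same field can be read directly. The helper names and the default allow-list below are illustrative assumptions; the sketch treats a missing `best_oa_location` the same as an unknown license.

```python
def best_license(work: dict):
    """Return best_oa_location.license, or None if the location or license is missing."""
    location = work.get("best_oa_location") or {}
    return location.get("license")


def is_allowed(work: dict, allowed=frozenset({"cc-by", "cc0"})) -> bool:
    """True if the work's best OA license is in the allow-list (an example set)."""
    return best_license(work) in allowed


print(best_license({"best_oa_location": {"license": "cc-by"}}))  # cc-by
print(best_license({"best_oa_location": None}))                  # None
print(is_allowed({"best_oa_location": {"license": "cc-by-nc"}})) # False
```

Treating `None` as "not allowed" is the conservative choice here, since an unknown license gives no basis for reuse.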

Filter by license:

You can use the API to find works with a specific license:

This returns works that have downloadable PDFs and are licensed under CC BY.
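Such a request is a filtered query against the works endpoint. The sketch below only builds the URL and does not send it. The `best_oa_location.license` filter follows the field named above; the filter that restricts results to works with a downloadable PDF is passed in as a parameter, and the name used in the example (`has_content.pdf:true`) is an assumption that should be checked against the API documentation.

```python
from urllib.parse import urlencode

API_WORKS_URL = "https://api.openalex.org/works"


def works_query_url(license_id: str, extra_filters: tuple = ()) -> str:
    """Build a works query URL filtered by license plus any extra filters."""
    filters = [f"best_oa_location.license:{license_id}", *extra_filters]
    # Keep ':' and ',' unescaped so the filter string stays readable.
    query = urlencode({"filter": ",".join(filters)}, safe=":,")
    return f"{API_WORKS_URL}?{query}"


print(works_query_url("cc-by"))
# https://api.openalex.org/works?filter=best_oa_location.license:cc-by

# The second filter name is an assumption, not confirmed by this page.
print(works_query_url("cc-by", ("has_content.pdf:true",)))
# https://api.openalex.org/works?filter=best_oa_location.license:cc-by,has_content.pdf:true
```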
