Get content

Download PDFs and machine-readable XML for OpenAlex works

OpenAlex includes links to publisher-hosted and repository-hosted full text for about 60 million Open Access works. But downloading from all those different sources can be inconvenient.

So we've cached copies of these files. We've got:

  • PDF (60M): You can download PDFs directly from us.

  • XML (43M): We've also parsed the PDFs (using Grobid) into TEI XML, a format for representing the sections and semantics of scholarly papers.

  • Markdown: coming soon.

Download content

Single work

Let's say you have an OpenAlex work ID and you want to download its full-text PDF. The URL pattern is simple:

https://content.openalex.org/works/{work_id}.pdf?api_key=YOUR_KEY

Replace {work_id} with any OpenAlex work ID (like W2741809807), and you'll download the PDF. Use .grobid-xml instead of .pdf to get the TEI XML version. If you don't specify an extension, it'll default to .pdf.

Examples:

https://content.openalex.org/works/W2741809807.pdf?api_key=YOUR_KEY
https://content.openalex.org/works/W2741809807.grobid-xml?api_key=YOUR_KEY

Multiple works

If you have a list of OpenAlex work IDs, you can iterate through them and download each file one at a time using the endpoint above.
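A minimal sketch of that loop in Python, assuming plain urllib (which follows the 302 redirect to the presigned URL automatically); the work IDs, output directory, and YOUR_KEY placeholder are assumptions:

```python
import urllib.request
from pathlib import Path

BASE = "https://content.openalex.org/works"

def content_url(work_id, ext="pdf", api_key="YOUR_KEY"):
    """Build the content-endpoint URL for one work (each request costs 100 credits)."""
    return f"{BASE}/{work_id}.{ext}?api_key={api_key}"

def download_all(work_ids, out_dir="pdfs", api_key="YOUR_KEY"):
    """Download each work's PDF, one at a time."""
    Path(out_dir).mkdir(exist_ok=True)
    for wid in work_ids:
        # urlopen follows the 302 to the presigned Cloudflare URL automatically
        with urllib.request.urlopen(content_url(wid, "pdf", api_key)) as resp:
            Path(out_dir, f"{wid}.pdf").write_bytes(resp.read())
```

Call download_all(["W2741809807", ...]) with your own ID list; for large batches, the command-line tool described next is a better fit.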

For higher volume (thousands to millions of downloads), use our command-line tool, openalex-content.

The tool handles parallel downloads, automatic retries, and checkpointing, so you can resume interrupted downloads. See Full-text PDFs for usage examples.

For the complete archive (all 60M files), you can sync directly from our storage bucket. See Full-text PDFs for details.

How it works

When you request content, here's what happens:

  1. We check if we have the requested file. If not, you get a 404.

  2. If we have the file, we verify your API key has enough credits.

  3. We generate a presigned URL: a temporary, authenticated link that grants access to the file on Cloudflare R2, where it's stored.

  4. We return a 302 redirect to that presigned URL. Your browser or HTTP client follows the redirect automatically.

  5. Cloudflare verifies the signature and serves the file directly from their global edge network.

This approach is more scalable than streaming files through our servers. Since content is served directly from Cloudflare's edge infrastructure, downloads are fast regardless of where you are.

The presigned URL expires after 5 minutes. If you need to download the same file again, just hit the content endpoint again to get a fresh URL (but it will cost another 100 credits).
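Since every hit on the content endpoint costs 100 credits, it can pay to hold on to a presigned URL and reuse it within its 5-minute window. A minimal sketch of that idea (the cache class is our own construction; you'd capture the Location header of the 302 yourself, e.g. by disabling auto-redirects in your HTTP client):

```python
import time

PRESIGNED_TTL = 300  # presigned URLs expire after 5 minutes

class PresignedCache:
    """Remember presigned URLs so a re-download within 5 minutes
    doesn't cost another 100 credits."""

    def __init__(self, ttl=PRESIGNED_TTL, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock          # injectable for testing
        self._urls = {}             # work_id -> (presigned_url, issued_at)

    def put(self, work_id, presigned_url):
        self._urls[work_id] = (presigned_url, self.clock())

    def get(self, work_id):
        entry = self._urls.get(work_id)
        if entry is None:
            return None
        url, issued = entry
        if self.clock() - issued >= self.ttl:
            del self._urls[work_id]  # expired; you'll have to pay for a fresh one
            return None
        return url
```

If get() returns None, hit the content endpoint again for a fresh URL.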

Find content to download

It's easy to download content if you already know which works you want. But what if you need to find works that have downloadable content? There are three methods:

The YOLO method

Just plug a work ID into the URL template and see what happens. If we have it, great. If not, you'll get a 404. Not recommended, but it works.

Check the work object

If you already have a work object from the API, look for the content_url field. If it's present, we have content available. Just append .pdf or .grobid-xml and add your API key:
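A small sketch, assuming content_url holds the bare https://content.openalex.org/works/{work_id} link with no extension (the helper name is ours):

```python
def download_link(work, ext="pdf", api_key="YOUR_KEY"):
    """Turn a work object's content_url into a ready-to-use download link,
    or return None if the work has no cached content."""
    base = work.get("content_url")
    if base is None:
        return None
    return f"{base}.{ext}?api_key={api_key}"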

This is convenient when you're already working with work objects; no need to construct URLs yourself.

Use the has_content filter

This is the most powerful approach. Use the has_content filter to find works with downloadable content, combined with any other filters you want.

For example, find works about frogs that have PDFs:
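Borrowing the filter syntax from the CLI example later on this page, such a query might look like:

```
https://api.openalex.org/works?filter=default.search:frogs,has_content.pdf:true
```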

Or works with CC-BY licenses published since 2024:
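Assuming OpenAlex's license and from_publication_date filters, something like:

```
https://api.openalex.org/works?filter=license:cc-by,from_publication_date:2024-01-01,has_content.pdf:true
```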

Then iterate through the results, grab each content_url, append .pdf, add your API key, and download. You can run 100 requests in parallel without any issues.
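That parallel fan-out can be sketched with a thread pool; download_one is a placeholder fetcher, and remember each URL still costs 100 credits:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def download_one(url):
    """Fetch one file; urllib follows the 302 redirect for us."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def download_parallel(urls, fetch=download_one, workers=100):
    """Download up to `workers` files at a time; returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

download_parallel(urls) with a list of content-endpoint URLs returns the file bytes in the same order as the input.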

Downloading more than a few thousand files? Use the command-line tool; it handles parallel downloads, retries, and checkpointing automatically.

Example: Build a corpus for AI synthesis

Say you want to use an LLM to synthesize research on microplastics in drinking water. Here's how to collect the PDFs:

Step 1: Find relevant works with PDFs
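Reconstructing the request from the CLI example later on this page, the query looks something like this (the search phrase needs URL-encoding when actually sent over HTTP):

```
https://api.openalex.org/works?filter=default.search:microplastics drinking water,has_content.pdf:true&select=id,title,content_url&per-page=100&cursor=*
```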

This returns ~800 works. Page through using cursor=* to collect all the content_url values. We use select=id,title,content_url to minimize response size, and per-page=100 to get 100 works per page, which means fewer API calls (faster and cheaper).

Step 2: Download and convert to text
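One way to sketch this step is to fetch the TEI XML variant (already parsed by Grobid) and flatten its paragraphs to plain text with the standard library; fetch_text and the YOUR_KEY placeholder are assumptions, while the TEI namespace is the standard one:

```python
import urllib.request
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def tei_to_text(xml_bytes):
    """Flatten a TEI XML document into plain paragraphs for an LLM corpus."""
    root = ET.fromstring(xml_bytes)
    paras = root.iter(f"{TEI_NS}p")
    return "\n\n".join("".join(p.itertext()).strip() for p in paras)

def fetch_text(content_url, api_key="YOUR_KEY"):
    """Download the cached TEI XML for a work and convert it to text."""
    with urllib.request.urlopen(f"{content_url}.grobid-xml?api_key={api_key}") as resp:
        return tei_to_text(resp.read())
```

Run fetch_text over each collected content_url and write the results to disk.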

Now you have a text corpus ready for RAG or LLM synthesis. Vibe-code a query interface and you've got your own real-time semantic search engine with results synthesis.

Want to download these faster? Use the command-line tool: openalex-content download --filter "default.search:microplastics drinking water,has_content.pdf:true"

Credit costs

Action                     Credits
Get a work by ID           0
List/filter works          1
Download PDF or TEI XML    100
