Get content

Download PDFs and machine-readable XML for OpenAlex works

OpenAlex includes links to publisher-hosted and repository-hosted full text for about 60 million Open Access works. But downloading from all those different sources can be inconvenient.

So we've cached copies of these files. We've got:

  • PDF (60M): You can download PDFs directly from us.

  • XML (43M): We've also parsed the PDFs (using Grobid) into TEI XML, a format for representing the sections and semantics of scholarly papers.

  • Markdown: coming soon.

Getting content

Get content for a single work

The URL pattern is simple:

https://content.openalex.org/works/{work_id}.pdf?api_key=YOUR_KEY

Replace {work_id} with any OpenAlex work ID (like W2741809807), and you'll download the PDF. Use .grobid-xml instead of .pdf to get the TEI XML version. If you don't specify an extension, it'll default to .pdf.

Examples:
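Using the sample work ID above:

https://content.openalex.org/works/W2741809807.pdf?api_key=YOUR_KEY
https://content.openalex.org/works/W2741809807.grobid-xml?api_key=YOUR_KEY

The first returns the PDF; the second returns the Grobid-parsed TEI XML.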

How it works

When you request content, here's what happens:

  1. We check if we have the requested file. If not, you get a 404 and are charged just 1 credit.

  2. If we have the file, we verify your API key has enough credits.

  3. We generate a presigned URL, a temporary, authenticated link that grants access to the file on Cloudflare R2, where it's stored.

  4. We return a 302 redirect to that presigned URL. Your browser or HTTP client follows the redirect automatically.

  5. Cloudflare verifies the signature and serves the file directly from their global edge network.

This approach is more scalable than streaming files through our servers. Since content is served directly from Cloudflare's edge infrastructure, downloads are fast regardless of where you are.

The presigned URL expires after 5 minutes. If you need to download the same file again, just hit the content endpoint again to get a fresh URL (but it will cost another 100 credits).
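Here's a minimal sketch of that flow in Python, using the requests library, which follows the 302 redirect to the presigned URL automatically:

```python
import requests

url = "https://content.openalex.org/works/W2741809807.pdf"

# requests follows the 302 redirect to the presigned R2 URL automatically.
resp = requests.get(url, params={"api_key": "YOUR_KEY"}, timeout=60)
resp.raise_for_status()
with open("W2741809807.pdf", "wb") as f:
    f.write(resp.content)

# To see the presigned URL itself, turn off redirect-following
# (note: each successful request costs 100 credits).
hop = requests.get(url, params={"api_key": "YOUR_KEY"}, allow_redirects=False)
print(hop.status_code)           # 302
print(hop.headers["Location"])   # temporary presigned URL, valid for 5 minutes
```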

Finding works with content

There are three ways to find works that have downloadable content:

The YOLO method

Just plug a work ID into the URL template and see what happens. If we have it, great. If not, you'll get a 404 (and pay 1 credit for the lookup). Not recommended, but it works.

Check the work object

If you already have a work object from the API, look for the content_url field. If it's present, we have content available. Just append .pdf or .grobid-xml and add your API key:
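For example, a sketch in Python, assuming content_url holds the base URL for the work's content per the pattern above:

```python
import requests

# Fetch a work object and check for cached content.
work = requests.get("https://api.openalex.org/works/W2741809807").json()

if work.get("content_url"):
    resp = requests.get(f"{work['content_url']}.pdf", params={"api_key": "YOUR_KEY"})
    resp.raise_for_status()
    with open("W2741809807.pdf", "wb") as f:
        f.write(resp.content)
```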

This is convenient when you're already working with work objects; no need to construct URLs yourself.

Use the has_content filter

This is the most powerful approach. Use the has_content filter to find works with downloadable content, combined with any other filters you want.

For example, find works about frogs that have PDFs:
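One plausible query (the exact has_content filter value here is an assumption; adjust it to whatever the filter reference documents):

https://api.openalex.org/works?search=frogs&filter=has_content:pdf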

Or works with CC-BY licenses published since 2024:
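Again with the assumed has_content:pdf value, combined with standard works filters:

https://api.openalex.org/works?filter=has_content:pdf,best_oa_location.license:cc-by,from_publication_date:2024-01-01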

Then iterate through the results, grab each content_url, append .pdf, add your API key, and download. You can run 100 requests in parallel without any issues.

Examples

Example: Build a corpus for AI synthesis

Say you want to use an LLM to synthesize research on microplastics in drinking water. Here's how to collect the PDFs:

Step 1: Find relevant works with PDFs
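One plausible request (the has_content value and the exact search phrasing are assumptions):

https://api.openalex.org/works?search=microplastics%20drinking%20water&filter=has_content:pdf&select=id,title,content_url&per_page=100&cursor=*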

This returns ~800 works. Page through using cursor=* to collect all content_url values. We use select=id,title,content_url to minimize response size. We also use per_page=100 to get 100 works per page, which means fewer API calls (faster and cheaper).
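A cursor-paging sketch in Python that collects the content_url values (same assumptions as above):

```python
import requests

params = {
    "search": "microplastics drinking water",
    "filter": "has_content:pdf",        # assumed filter value
    "select": "id,title,content_url",
    "per_page": 100,
    "cursor": "*",
}

content_urls = []
while params["cursor"]:
    page = requests.get("https://api.openalex.org/works", params=params).json()
    content_urls += [w["content_url"] for w in page["results"] if w.get("content_url")]
    params["cursor"] = page["meta"]["next_cursor"]  # None after the last page

print(f"Collected {len(content_urls)} content URLs")
```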

Step 2: Download and convert to text
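A sketch of the download-and-extract loop, using the content_urls list from Step 1 and pypdf for text extraction (pypdf is just one option; any PDF-to-text tool works):

```python
import pathlib
import requests
from pypdf import PdfReader

corpus = pathlib.Path("corpus")
corpus.mkdir(exist_ok=True)

for url in content_urls:  # collected in Step 1
    work_id = url.rstrip("/").rsplit("/", 1)[-1]
    pdf_path = corpus / f"{work_id}.pdf"

    resp = requests.get(f"{url}.pdf", params={"api_key": "YOUR_KEY"}, timeout=120)
    resp.raise_for_status()
    pdf_path.write_bytes(resp.content)

    # Extract plain text for downstream RAG / LLM use.
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    (corpus / f"{work_id}.txt").write_text(text, encoding="utf-8")
```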

Now you have a text corpus ready for RAG or LLM synthesis. Vibe-code a query interface and you've got your own real-time semantic search engine with results synthesis.

Example: Download millions of PDFs

Downloading millions of PDFs requires a lot of credits. You'll need a one-time credit pack; contact us for pricing.

Once you have credits, here's the approach:

Step 1: Page through all works with PDFs
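The query is the same shape as in the previous example, just without a search term (filter value assumed):

https://api.openalex.org/works?filter=has_content:pdf&select=id,content_url&per_page=100&cursor=*&api_key=YOUR_KEY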

Here's where you can limit your downloads to Creative Commons-licensed works, certain topics, certain years: anything that our powerful filter syntax allows.

Step 2: Download in parallel
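A sketch using a thread pool capped at 100 workers, which keeps you at or under the rate guidance below (content_urls collected in Step 1; content_url shape assumed as before):

```python
import concurrent.futures as cf
import requests

API_KEY = "YOUR_KEY"

def download(content_url: str) -> str:
    """Download one PDF, named after its work ID."""
    work_id = content_url.rstrip("/").rsplit("/", 1)[-1]
    resp = requests.get(f"{content_url}.pdf", params={"api_key": API_KEY}, timeout=120)
    resp.raise_for_status()
    with open(f"{work_id}.pdf", "wb") as f:
        f.write(resp.content)
    return work_id

# 100 concurrent workers keeps throughput near 100 requests/second
# for typical download times.
with cf.ThreadPoolExecutor(max_workers=100) as pool:
    for work_id in pool.map(download, content_urls):
        print("done:", work_id)
```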

Performance: At 100 downloads/second, all 60M PDFs take about 600,000 seconds, or roughly a week. We recommend staying under 100 requests/second for reliable performance.

Credit costs

| Action | Credits |
| --- | --- |
| Download PDF or TEI XML (success) | 100 |
| Query for unavailable content (404) | 1 |
| List works with content (via /works API) | 10 per page |
