Get content
Download PDFs and machine-readable XML for OpenAlex works
OpenAlex includes links to publisher-hosted and repository-hosted full text for about 60 million Open Access works. But downloading from all those different sources can be inconvenient.
So we've cached copies of these files. We've got:
PDF (60M): You can download PDFs directly from us.
Markdown: coming soon.
Content downloads require an API key and cost 100 credits per file. See rate limits for details.
Download content
Single work
Let's say you have an OpenAlex work ID and you want to download its full-text PDF. The URL pattern is simple:
https://content.openalex.org/works/{work_id}.pdf?api_key=YOUR_KEY
Replace {work_id} with any OpenAlex work ID (like W2741809807), and you'll download the PDF. Use .grobid-xml instead of .pdf to get the TEI XML version. If you don't specify an extension, it'll default to .pdf.
Examples:
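As a rough illustration of the URL pattern above, the sketch below builds the download URL for a work and saves the file using only the Python standard library. The helper names build_content_url and download_work and the YOUR_KEY placeholder are illustrative; they are not part of any official OpenAlex client.

```python
import shutil
import urllib.request

CONTENT_BASE = "https://content.openalex.org/works"


def build_content_url(work_id, api_key, extension="pdf"):
    """Return the download URL for a work; extension is "pdf" or "grobid-xml"."""
    return f"{CONTENT_BASE}/{work_id}.{extension}?api_key={api_key}"


def download_work(work_id, api_key, dest_path, extension="pdf"):
    """Fetch one file and write it to dest_path (each call costs credits)."""
    url = build_content_url(work_id, api_key, extension)
    # urlopen follows the 302 redirect automatically; a 404 raises HTTPError.
    with urllib.request.urlopen(url, timeout=60) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


if __name__ == "__main__":
    # Building URLs is free; only download_work() touches the network.
    print(build_content_url("W2741809807", "YOUR_KEY"))
    print(build_content_url("W2741809807", "YOUR_KEY", extension="grobid-xml"))
```

Calling download_work("W2741809807", "YOUR_KEY", "W2741809807.pdf") would then save the PDF next to your script.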
Multiple works
If you have a list of OpenAlex work IDs, you can iterate through them and download each file one at a time using the endpoint above.
For higher volume (thousands to millions of downloads), use our command-line tool, openalex-content.
The tool handles parallel downloads, automatic retries, and checkpointing so you can resume interrupted downloads. See Full-text PDFs for usage examples.
For the complete archive (all 60M files), you can sync directly from our storage bucket. See Full-text PDFs for details.
How it works
When you request content, here's what happens:
1. We check if we have the requested file. If not, you get a 404.
2. If we have the file, we verify your API key has enough credits.
3. We generate a presigned URL (a temporary, authenticated link that grants access to the file on Cloudflare R2, where it's stored).
4. We return a 302 redirect to that presigned URL. Your browser or HTTP client follows the redirect automatically.
5. Cloudflare verifies the signature and serves the file directly from their global edge network.
This approach is more scalable than streaming files through our servers. Since content is served directly from Cloudflare's edge infrastructure, downloads are fast regardless of where you are.
The presigned URL expires after 5 minutes. If you need to download the same file again, just hit the content endpoint again to get a fresh URL (but it will cost another 100 credits).
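To make the redirect flow concrete, here is a minimal sketch that asks the content endpoint for a file but declines to follow the redirect, so you can see the presigned URL itself. It assumes the presigned URL arrives in the standard Location header of the 302 response; the names NoRedirect, outcome_for_status, and resolve_presigned_url are made up for this example. Hitting the endpoint this way presumably still costs 100 credits.

```python
import urllib.error
import urllib.request

CONTENT_BASE = "https://content.openalex.org/works"


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # the 302 then surfaces as an HTTPError


def outcome_for_status(status):
    """Label the two status codes described in the steps above."""
    return {302: "redirect", 404: "missing"}.get(status, "other")


def resolve_presigned_url(work_id, api_key):
    """Return the presigned URL for a work's PDF, or None if there is no file."""
    url = f"{CONTENT_BASE}/{work_id}.pdf?api_key={api_key}"
    opener = urllib.request.build_opener(NoRedirect)
    try:
        opener.open(url, timeout=30)
    except urllib.error.HTTPError as err:
        outcome = outcome_for_status(err.code)
        if outcome == "redirect":
            # Assumption: the presigned URL is in the Location header.
            return err.headers.get("Location")
        if outcome == "missing":
            return None
        raise  # anything else (for example, a credits problem) is passed on
    raise RuntimeError("expected a 302 redirect or a 404")
```

Most clients never need this, because following the redirect automatically is the default. It is mainly useful for debugging or for handing the short-lived URL to another downloader within the 5-minute window.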
Find content to download
It's easy to download content if you already know which works you want. But what if you need to find works that have downloadable content? There are three methods:
The YOLO method
Just plug a work ID into the URL template and see what happens. If we have it, great. If not, you'll get a 404. Not recommended, but it works.
Check the work object
If you already have a work object from the API, look for the content_url field. If it's present, we have content available. Just append .pdf or .grobid-xml and add your API key:
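A small sketch of that check, assuming the work object has already been parsed into a Python dict. The function name content_download_url is hypothetical, and the sample content_url value is an assumed example of what the field might contain.

```python
def content_download_url(work, api_key, extension="pdf"):
    """Build a download URL from a work object, or return None if it has no content_url."""
    base = work.get("content_url")
    if not base:
        return None
    return f"{base}.{extension}?api_key={api_key}"


# Illustrative work objects; the content_url value is an assumed example.
work_with_content = {
    "id": "https://openalex.org/W2741809807",
    "content_url": "https://content.openalex.org/works/W2741809807",
}
work_without_content = {"id": "https://openalex.org/W1"}  # made-up ID

pdf_url = content_download_url(work_with_content, "YOUR_KEY")
xml_url = content_download_url(work_with_content, "YOUR_KEY", extension="grobid-xml")
no_url = content_download_url(work_without_content, "YOUR_KEY")  # None
```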
This is convenient when you're already working with work objects: no need to construct URLs yourself.
Use the has_content filter
This is the most powerful approach. Use the has_content filter to find works with downloadable content, combined with any other filters you want.
For example, find works about frogs that have PDFs:
Or works with CC-BY licenses published since 2024:
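The two queries just described might be built along these lines. This is a sketch, not the official examples: has_content.pdf and default.search are taken from the command-line example on this page, while best_oa_location.license and from_publication_date are assumptions about which filters express "CC-BY" and "since 2024". The works_query_url helper is hypothetical.

```python
from urllib.parse import urlencode

WORKS_ENDPOINT = "https://api.openalex.org/works"


def works_query_url(filters, **params):
    """Join filter clauses with commas and return a works API URL."""
    query = {"filter": ",".join(filters), **params}
    # Keep ':' and ',' unescaped so the filter string stays readable.
    return f"{WORKS_ENDPOINT}?{urlencode(query, safe=':,')}"


# Works about frogs that have PDFs.
frogs_url = works_query_url(["default.search:frogs", "has_content.pdf:true"])

# CC-BY works published since 2024; the last two filter names are assumptions.
cc_by_url = works_query_url([
    "has_content.pdf:true",
    "best_oa_location.license:cc-by",
    "from_publication_date:2024-01-01",
])
```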
Then iterate through the results, grab each content_url, append .pdf, add your API key, and download. You can run 100 requests in parallel without any issues.
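One way to run those downloads in parallel is a thread pool capped at 100 workers, matching the limit mentioned above. The sketch below is deliberately generic: download_many is a hypothetical helper, and you pass in your own fetch function (anything that takes a URL and returns or saves the file), so failures such as a 404 are collected rather than stopping the whole batch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_many(urls, fetch, max_workers=100):
    """Run fetch(url) for every URL in parallel; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # e.g. an HTTP 404 for missing content
                errors[url] = exc
    return results, errors
```

A simple fetch could be lambda url: urllib.request.urlopen(url, timeout=60).read(), though for large files you would probably want to stream to disk instead of holding bytes in memory.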
Downloading more than a few thousand files? Use the command-line tool, which handles parallel downloads, retries, and checkpointing automatically.
Example: Build a corpus for AI synthesis
Say you want to use an LLM to synthesize research on microplastics in drinking water. Here's how to collect the PDFs:
Step 1: Find relevant works with PDFs
This returns ~800 works. Page through the results, starting with cursor=*, to collect all content_url values. We use select=id,title,content_url to minimize response size and per_page=100 to get 100 works per page, which means fewer API calls (faster and cheaper).
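The paging loop could look roughly like this. It assumes the works API returns JSON with a results list and a meta.next_cursor value that becomes null on the last page; fetch_page and collect_content_urls are hypothetical helper names, and the filter string is borrowed from the command-line example on this page.

```python
import json
import urllib.request
from urllib.parse import urlencode

WORKS_ENDPOINT = "https://api.openalex.org/works"


def fetch_page(filter_string, cursor, api_key):
    """Fetch one page of works as parsed JSON (makes a real API request)."""
    params = {
        "filter": filter_string,
        "select": "id,title,content_url",
        "per_page": 100,
        "cursor": cursor,
        "api_key": api_key,
    }
    with urllib.request.urlopen(f"{WORKS_ENDPOINT}?{urlencode(params)}", timeout=60) as resp:
        return json.load(resp)


def collect_content_urls(get_page):
    """Follow cursors from '*' until none is returned; get_page(cursor) returns one page."""
    urls, cursor = [], "*"
    while cursor:
        page = get_page(cursor)
        urls.extend(w["content_url"] for w in page["results"] if w.get("content_url"))
        cursor = page["meta"].get("next_cursor")
    return urls
```

For the microplastics query, that would be something like collect_content_urls(lambda c: fetch_page("default.search:microplastics drinking water,has_content.pdf:true", c, "YOUR_KEY")).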
Step 2: Download and convert to text
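Converting PDFs to text needs a third-party library, so this sketch swaps in a dependency-free alternative: download the .grobid-xml version instead and pull the body paragraphs out of the TEI XML with the standard library. It assumes the file uses the standard TEI namespace with paragraphs under text/body; tei_to_text is a hypothetical helper, and this is not necessarily how the original example did the conversion.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_text(tei_xml):
    """Return the body paragraphs of a TEI document as plain text, one per line."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    scope = body if body is not None else root  # fall back to the whole document
    paragraphs = []
    for p in scope.iterfind(".//tei:p", TEI_NS):
        text = " ".join("".join(p.itertext()).split())  # collapse whitespace
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Feed it the contents of each downloaded .grobid-xml file and write the result to a .txt file per work.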
Now you have a text corpus ready for RAG or LLM synthesis. Vibe a query interface and you've got your own real-time semantic search engine with results synthesis.
Want to download these faster? Use the command-line tool: openalex-content download --filter "default.search:microplastics drinking water,has_content.pdf:true"
Credit costs
Get a work by ID: 0 credits
List/filter works: 1 credit
Download PDF or TEI XML: 100 credits
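To budget a job before you start, you can plug these costs into a quick estimate. The sketch below assumes the list/filter cost is charged once per page request and ignores any extra final request the cursor loop might make; estimate_credits is a hypothetical helper.

```python
import math

CREDITS_PER_LIST_REQUEST = 1
CREDITS_PER_DOWNLOAD = 100


def estimate_credits(n_works, per_page=100):
    """Estimate credits to list n_works and download one file for each."""
    list_requests = math.ceil(n_works / per_page)
    return list_requests * CREDITS_PER_LIST_REQUEST + n_works * CREDITS_PER_DOWNLOAD


# The ~800-work microplastics example: 8 list requests plus 800 downloads.
microplastics_cost = estimate_credits(800)  # 80008
```

Downloads dominate the total, which is why re-fetching a file you already have is worth avoiding.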