Get content
Download PDFs and machine-readable XML for OpenAlex works
OpenAlex includes links to publisher-hosted and repository-hosted full text for about 60 million Open Access works. But downloading from all those different sources can be inconvenient.
So we've cached copies of these files. We've got:
PDF (60M): You can download PDFs directly from us.
Markdown: coming soon.
Content downloads require an API key and cost 100 credits per file. See rate limits for details.
Getting content
Get content for a single work
The URL pattern is simple:
https://content.openalex.org/works/{work_id}.pdf?api_key=YOUR_KEY

Replace {work_id} with any OpenAlex work ID (like W2741809807), and you'll download the PDF. Use .grobid-xml instead of .pdf to get the TEI XML version. If you don't specify an extension, it defaults to .pdf.
Examples:
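For instance, a minimal Python sketch that builds these URLs (YOUR_KEY is a placeholder for your real API key):

```python
# Build content URLs for a work; YOUR_KEY is a placeholder, not a real key.
BASE = "https://content.openalex.org/works"

def content_url(work_id: str, ext: str = "pdf", api_key: str = "YOUR_KEY") -> str:
    """URL for a work's cached file; ext is 'pdf' or 'grobid-xml'."""
    return f"{BASE}/{work_id}.{ext}?api_key={api_key}"

pdf_url = content_url("W2741809807")                # the PDF
xml_url = content_url("W2741809807", "grobid-xml")  # the GROBID TEI XML

# Any HTTP client can then fetch these, e.g.:
#   urllib.request.urlretrieve(pdf_url, "W2741809807.pdf")
```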
How it works
When you request content, here's what happens:
1. We check whether we have the requested file. If not, you get a 404 and are charged just 1 credit.
2. If we have the file, we verify your API key has enough credits.
3. We generate a presigned URL: a temporary, authenticated link that grants access to the file on Cloudflare R2, where it's stored.
4. We return a 302 redirect to that presigned URL. Your browser or HTTP client follows the redirect automatically.
5. Cloudflare verifies the signature and serves the file directly from its global edge network.
This approach is more scalable than streaming files through our servers. Since content is served directly from Cloudflare's edge infrastructure, downloads are fast regardless of where you are.
The presigned URL expires after 5 minutes. If you need to download the same file again, just hit the content endpoint again to get a fresh URL (but it will cost another 100 credits).
Finding works with content
There are three ways to find works that have downloadable content:
The YOLO method
Just plug a work ID into the URL template and see what happens. If we have it, great. If not, you'll get a 404 (and pay 1 credit for the lookup). Not recommended, but it works.
Check the work object
If you already have a work object from the API, look for the content_url field. If it's present, we have content available. Just append .pdf or .grobid-xml and add your API key:
This is convenient when you're already working with work objects: no need to construct URLs yourself.
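In code, that looks roughly like this (we assume content_url has the form "https://content.openalex.org/works/{work_id}", without an extension):

```python
# Sketch: turn a work object into a download URL. Assumed content_url shape:
# "https://content.openalex.org/works/{work_id}" (no extension).
def download_url(work: dict, ext: str = "pdf", api_key: str = "YOUR_KEY"):
    base = work.get("content_url")
    if base is None:  # no cached content for this work
        return None
    return f"{base}.{ext}?api_key={api_key}"

# A trimmed example work object:
work = {"id": "https://openalex.org/W2741809807",
        "content_url": "https://content.openalex.org/works/W2741809807"}
```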
Use the has_content filter
This is the most powerful approach. Use the has_content filter to find works with downloadable content, combined with any other filters you want.
For example, find works about frogs that have PDFs:
Or works with CC-BY licenses published since 2024:
Then iterate through the results, grab each content_url, append .pdf, add your API key, and download. You can run 100 requests in parallel without any issues.
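The two filter queries above can be sketched as URLs; the filter spellings below (best_oa_location.license, from_publication_date) are standard OpenAlex work filters, but double-check them against the filter documentation:

```python
from urllib.parse import urlencode

API = "https://api.openalex.org/works"

def works_url(filter_str: str, **params) -> str:
    """Build a /works query URL from a filter string and extra parameters."""
    return f"{API}?{urlencode({'filter': filter_str, **params})}"

# Works about frogs that have downloadable content:
frogs = works_url("has_content:true", search="frogs")

# Works with content, CC-BY licensed, published since 2024:
ccby = works_url(
    "has_content:true,best_oa_location.license:cc-by,from_publication_date:2024-01-01"
)
```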
Examples
Example: Build a corpus for AI synthesis
Say you want to use an LLM to synthesize research on microplastics in drinking water. Here's how to collect the PDFs:
Step 1: Find relevant works with PDFs
This returns ~800 works. Page through using cursor=* to collect all content_url values. We use select=id,title,content_url to minimize response size. We also use per_page=100 to get 100 works per page, which means fewer API calls (faster and cheaper).
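The paging loop can be sketched like this. The HTTP fetch is injected (pass any callable mapping a URL to a decoded JSON dict) so the cursor logic stands alone; per-page, cursor, and meta.next_cursor follow OpenAlex's standard cursor-pagination conventions:

```python
from urllib.parse import urlencode

def collect_content_urls(get_json, filter_str: str, per_page: int = 100) -> list:
    """Page through /works with cursor pagination, collecting content_url values.
    get_json: any callable mapping a URL to a decoded JSON dict."""
    urls, cursor = [], "*"
    while cursor:
        query = urlencode({
            "filter": filter_str,
            "select": "id,title,content_url",
            "per-page": per_page,
            "cursor": cursor,
        })
        page = get_json(f"https://api.openalex.org/works?{query}")
        urls += [w["content_url"] for w in page["results"] if w.get("content_url")]
        cursor = page["meta"].get("next_cursor")  # None on the last page
    return urls
```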
Step 2: Download and convert to text
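A sketch of this step, with the PDF-to-text conversion injected: pdf_to_text can be any callable taking a file path and returning a string (for example, pypdf's text extraction; not shown here, since it's a third-party choice):

```python
import os
import urllib.request

def work_id_from_url(url: str) -> str:
    """'https://content.openalex.org/works/W123' -> 'W123'"""
    return url.rstrip("/").rsplit("/", 1)[-1]

def build_corpus(content_urls, api_key, pdf_to_text, out_dir="corpus"):
    """Download each PDF and write the extracted text alongside it."""
    os.makedirs(out_dir, exist_ok=True)
    for url in content_urls:
        work_id = work_id_from_url(url)
        pdf_path = os.path.join(out_dir, f"{work_id}.pdf")
        urllib.request.urlretrieve(f"{url}.pdf?api_key={api_key}", pdf_path)
        with open(os.path.join(out_dir, f"{work_id}.txt"), "w") as f:
            f.write(pdf_to_text(pdf_path))
```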
Now you have a text corpus ready for RAG or LLM synthesis. Vibe a query interface and you've got your own real-time semantic search engine with results synthesis.
Example: Download millions of PDFs
Downloading millions of PDFs requires a lot of credits. You'll need a one-time credit pack; contact us for pricing.
Once you have credits, here's the approach:
Step 1: Page through all works with PDFs
Here's where you can limit your downloads to Creative Commons licensed works, certain topics, certain years, or anything else that our powerful filter syntax allows.
Step 2: Download in parallel
Performance: At 100 downloads/second, all 60M PDFs take about a week of continuous downloading (60,000,000 ÷ 100/s ≈ 7 days). We recommend staying under 100 requests/second for reliable performance.
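A bounded thread pool keeps you under that ceiling. A sketch (directory and function names are ours, and the skip-if-present check makes long runs resumable):

```python
import concurrent.futures
import os
import urllib.request

def download_one(work_id: str, api_key: str, out_dir: str = "pdfs") -> str:
    """Download one PDF, skipping files we already have (resume-friendly)."""
    path = os.path.join(out_dir, f"{work_id}.pdf")
    if os.path.exists(path):
        return path
    url = f"https://content.openalex.org/works/{work_id}.pdf?api_key={api_key}"
    urllib.request.urlretrieve(url, path)
    return path

def download_all(work_ids, api_key, workers: int = 100):
    """Download PDFs in parallel with at most `workers` concurrent requests."""
    os.makedirs("pdfs", exist_ok=True)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda w: download_one(w, api_key), work_ids))
```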
Credit costs
Download PDF or TEI XML (success)
100
Query for unavailable content (404)
1
List works with content (via /works API)
10 per page