API Guide for LLMs
Are you an LLM? Start here.
OpenAlex API Guide for LLM Agents and AI Applications
OpenAlex is a fully open catalog of scholarly works, authors, sources, institutions, topics, publishers, and funders.
Base URL: https://api.openalex.org
Documentation: https://docs.openalex.org
No authentication required | 100,000 requests/day limit
CRITICAL GOTCHAS - Read These First!
❌ DON'T: Create ad-hoc sampling by using random page numbers
WRONG: ?page=5, ?page=17, ?page=42 to get "random" results. This is NOT random sampling and will bias your results!
✅ DO: Use the ?sample parameter for random sampling
CORRECT: https://api.openalex.org/works?sample=20
For consistent results, add a seed: ?sample=20&seed=123
❌ DON'T: Try to sample large datasets (10k+) in one request
The sample parameter is capped at 10,000 results per request, so one call cannot return a larger random sample.
✅ DO: Use multiple samples with different seeds, then deduplicate
For large random samples (10k+ records):
1. Make multiple sample requests with different seeds
2. Combine the results
3. Deduplicate by ID
Example:
?sample=1000&seed=1
?sample=1000&seed=2
?sample=1000&seed=3
Then deduplicate the combined results by checking work IDs (a Python sketch follows below).
See: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/sample-entity-lists
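A minimal Python sketch of the multi-seed approach, using the requests library. The three seeds and 1,000-per-seed sample size mirror the example above; the email address is a placeholder, and basic paging is used because, as far as I know, cursor paging is not available together with sample.

import requests

BASE = "https://api.openalex.org/works"
SAMPLE_SIZE = 1000      # per-seed sample size, as in the example above
PER_PAGE = 200          # maximum page size

seen = {}               # OpenAlex work ID -> work record (deduplication)
for seed in (1, 2, 3):
    for page in range(1, SAMPLE_SIZE // PER_PAGE + 1):
        resp = requests.get(BASE, params={
            "sample": SAMPLE_SIZE,
            "seed": seed,
            "per-page": PER_PAGE,
            "page": page,
            "mailto": "you@example.com",   # placeholder email (polite pool)
        })
        resp.raise_for_status()
        for work in resp.json()["results"]:
            seen[work["id"]] = work        # later duplicates overwrite earlier ones

print(len(seen), "unique works after combining 3 seeds")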
❌ DON'T: Search/filter by entity names directly
WRONG: /works?filter=author_name:Einstein
Entity names are ambiguous and this won't work!
✅ DO: Use the two-step lookup pattern for related entities
CORRECT two-step process:
1. Find the entity ID: /authors?search=einstein
   The response shows an ID like "A5023888391" or the full URI.
2. Use the ID to filter: /works?filter=authorships.author.id:A5023888391
Why? Names are ambiguous ("MIT" could be many institutions); IDs are unique. This applies to: authors, institutions, sources, topics, publishers, and funders.
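A minimal Python sketch of the two-step pattern (requests library; the search term and filter field are taken from the example above, and the email is a placeholder):

import requests

MAILTO = "you@example.com"   # placeholder; use your own email

# Step 1: resolve the name to an OpenAlex ID
authors = requests.get("https://api.openalex.org/authors",
                       params={"search": "einstein", "mailto": MAILTO}).json()
author_id = authors["results"][0]["id"]   # e.g. "https://openalex.org/A5023888391"
                                          # inspect the results to pick the right match

# Step 2: filter works by that ID (the full URI and the short form both work)
works = requests.get("https://api.openalex.org/works",
                     params={"filter": f"authorships.author.id:{author_id}",
                             "mailto": MAILTO}).json()
print(works["meta"]["count"], "works found")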
❌ DON'T: Try to group by multiple dimensions in one query
WRONG: You cannot do SQL-style "GROUP BY topic, year" in a single API call.
✅ DO: Make multiple queries and combine results client-side
To analyze by topic AND year (or any two dimensions):
1. Make one query per year: ?filter=publication_year:2020&group_by=topics.id
2. Repeat for 2021, 2022, etc.
3. Combine the results in your code (see the sketch below).
The API only supports one group_by per request.
❌ DON'T: Ignore API errors or retry immediately on failure
API errors are common, especially at scale, and immediate retries can make things worse.
✅ DO: Implement exponential backoff for retries
When you get errors (429 rate limit, 500 server error, timeouts):
1. Catch the error
2. Wait before retrying (1s, 2s, 4s, 8s, etc.)
3. Include a max retry limit (e.g., 5 attempts)
4. Log failures for debugging
A Python sketch appears under "Exponential Backoff Pattern" below.
❌ DON'T: Use default page sizes for bulk extraction
The default is only 25 results per page, which is slow for large extracts.
✅ DO: Use the maximum page size (200) for bulk data extraction
FAST: ?per-page=200
This reduces the number of API calls needed by 8x compared to the default.
❌ DON'T: Make sequential API calls for lists of known IDs
SLOW: Looping through 100 DOIs means 100 separate API calls.
✅ DO: Use the OR filter (pipe |) for batch ID lookups
FAST: Combine up to 50 IDs in one query using the pipe separator:
/works?filter=doi:https://doi.org/10.1371/journal.pone.0266781|https://doi.org/10.1371/journal.pone.0267149|...
You can include up to 50 values per filter. Use per-page=50 to get all results in one page (a Python batching sketch follows below).
See: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists#addition-or
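A minimal Python sketch that chunks a DOI list into batches of 50. The DOI list and email are placeholders, and passing bare DOIs without the https://doi.org/ prefix is an assumption of mine; prefix them if in doubt:

import requests

MAILTO = "you@example.com"    # placeholder email
dois = ["10.1371/journal.pone.0266781", "10.1371/journal.pone.0267149"]  # your DOI list

works = []
for i in range(0, len(dois), 50):                  # at most 50 values per filter
    batch = dois[i:i + 50]
    resp = requests.get("https://api.openalex.org/works",
                        params={"filter": "doi:" + "|".join(batch),
                                "per-page": 50,
                                "mailto": MAILTO})
    resp.raise_for_status()
    works.extend(resp.json()["results"])

print(len(works), "works retrieved for", len(dois), "DOIs")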
❌ DON'T: Ignore rate limits when using concurrency/threading
Using multiple threads WITHOUT respecting rate limits will get you rate-limited or banned.
✅ DO: Respect rate limits even across concurrent requests
Default pool: 1 request/second
Polite pool (with email): 10 requests/second
Daily limit: 100,000 requests
When using threading/async:
Implement rate limiting across ALL threads
Track requests per second globally
Add your email to requests for 10x higher limits
A Python sketch appears under "Concurrent Requests Strategy" below.
Quick Reference
Base URL and Authentication
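All requests start from the same base URL, and no API key or authentication is required (see the intro above). For example:
https://api.openalex.org/works
https://api.openalex.org/works/W2741809807 (a single work, looked up by its OpenAlex ID)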
Get Higher Rate Limits (Polite Pool)
Add your email to any request using the mailto query parameter.
This increases your rate limit from 1 req/sec to 10 req/sec. Always do this for production applications!
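For example (the address is a placeholder; use your own):
https://api.openalex.org/works?filter=publication_year:2023&mailto=you@example.com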
Entity Endpoints
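Each entity type listed in the intro has its own list endpoint; append an ID for a single record:
/works   /authors   /sources   /institutions   /topics   /publishers   /funders
Single records: /works/W2741809807, /authors/A5023888391, etc.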
Essential Query Parameters
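The query parameters used throughout this guide:
filter= - apply one or more filters (commas mean AND)
search= - full-text search
sort= - e.g. sort=cited_by_count:desc
page= and per-page= - basic paging (per-page up to 200)
cursor= - cursor paging for deep result sets
sample= and seed= - random sampling
select= - return only the listed fields
group_by= - aggregate counts by a single field
mailto= - your email, for the polite pool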
Filter Syntax
Basic Filtering
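Filters are attribute:value pairs passed in the filter parameter; commas join them with AND. Representative examples:
/works?filter=publication_year:2020
/works?filter=open_access.is_oa:true
/works?filter=publication_year:2020,open_access.is_oa:true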
Comparison Operators
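Numeric and date filters accept > and < prefixes on the value:
/works?filter=cited_by_count:>100
/works?filter=publication_year:<2015
/works?filter=from_publication_date:2023-01-01 (date filters have dedicated from_/to_ forms; see "Filtering by Date Ranges" below)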
Multiple Values in Same Attribute
You can express AND within a single attribute in two ways, shown below.
Both mean: "works with an author from the US AND an author from GB"
OR Queries (Pipe Separator)
You can combine up to 50 values with pipes.
Important: OR only works WITHIN a filter, not BETWEEN filters
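For example, works from any of three years, or by either of two authors (the second author ID is a placeholder):
/works?filter=publication_year:2020|2021|2022
/works?filter=authorships.author.id:A5023888391|A9999999999
A pipe between two different filters is invalid; separate filters are always joined with commas (AND).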
Common Patterns
Get Random Sample of Works
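For example, 100 random works returned in a single page:
/works?sample=100&seed=42&per-page=100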
Search Works by Title/Abstract
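The search parameter matches titles, abstracts, and fulltext; search filters restrict it to one field:
/works?search=machine learning
/works?filter=title.search:neural networks
/works?filter=abstract.search:climate policy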
Find Works by Author (Two-Step Pattern)
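Using the example from the gotchas above:
Step 1: /authors?search=einstein (note the ID of the right match, e.g. A5023888391)
Step 2: /works?filter=authorships.author.id:A5023888391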
Find Works by Institution (Two-Step Pattern)
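Same two-step pattern; the institution ID below is a placeholder for whatever step 1 returns:
Step 1: /institutions?search=university of oxford
Step 2: /works?filter=authorships.institutions.id:I1234567890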
Get Highly Cited Recent Papers
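For example, 2023 works with more than 100 citations, most-cited first:
/works?filter=publication_year:2023,cited_by_count:>100&sort=cited_by_count:desc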
Get Open Access Works Only
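For example:
/works?filter=open_access.is_oa:true
To restrict to a specific OA status: /works?filter=open_access.oa_status:gold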
Filter by Multiple Criteria
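Commas between filters mean AND, so criteria simply accumulate:
/works?filter=publication_year:2020,open_access.is_oa:true,type:article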
Bulk Lookup by DOIs
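Same pipe pattern as the batching gotcha above (up to 50 DOIs per request):
/works?filter=doi:https://doi.org/10.1371/journal.pone.0266781|https://doi.org/10.1371/journal.pone.0267149&per-page=50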
Get Works from Specific Journal
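Two-step, as with authors and institutions; the source ID below is a placeholder for whatever step 1 returns:
Step 1: /sources?search=plos one
Step 2: /works?filter=primary_location.source.id:S1234567890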
Aggregate/Group Data
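For example:
/works?group_by=publication_year
/works?group_by=open_access.oa_status
/works?filter=publication_year:2023&group_by=topics.id
The counts come back in the group_by key of the response (see "Group By Responses" below).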
Pagination for Large Result Sets
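Basic paging (page=) only reaches the first 10,000 results; for larger extractions use cursor paging: start with cursor=* and follow meta.next_cursor until it is null. A minimal Python sketch (the filter and email are placeholders):

import requests

params = {"filter": "publication_year:2023",
          "per-page": 200,
          "cursor": "*",                    # start cursor paging
          "mailto": "you@example.com"}      # placeholder email

all_works = []
while params["cursor"]:
    resp = requests.get("https://api.openalex.org/works", params=params)
    resp.raise_for_status()
    data = resp.json()
    all_works.extend(data["results"])
    params["cursor"] = data["meta"]["next_cursor"]   # None after the last page

print(len(all_works), "works fetched")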
Select Specific Fields Only (Faster Responses)
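For example, return only five fields per work:
/works?select=id,doi,display_name,publication_year,cited_by_count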
Autocomplete for Type-Ahead
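Each entity type has an autocomplete endpoint that returns a short list of lightweight matches, meant for type-ahead UIs:
/autocomplete/works?q=tranquility
/autocomplete/authors?q=ronald sw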
Tag Your Own Text (/text endpoint)
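The /text endpoint tags arbitrary text with OpenAlex topics, keywords, and concepts. The parameter name below (title=) is my assumption; verify it against the documentation link above before relying on it:
https://api.openalex.org/text?title=machine+learning+for+protein+structure+prediction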
Response Structure
List Endpoints
All list endpoints (/works, /authors, etc.) return:
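The overall shape (field values here are illustrative):
{
  "meta": { "count": 123456, "db_response_time_ms": 12, "page": 1, "per_page": 25 },
  "results": [ ...entity objects... ],
  "group_by": []
}
meta also carries next_cursor when cursor paging is used, and group_by stays empty unless the group_by parameter is set.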
Single Entity Endpoints
Getting a single entity returns the object directly:
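For example, /works/W2741809807 returns one work object whose top-level fields include id, doi, title, display_name, publication_year, authorships, open_access, cited_by_count, and so on; there is no meta/results wrapper.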
Group By Responses
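Grouped queries put the counts under group_by and leave results empty (values below are illustrative):
"group_by": [
  { "key": "https://openalex.org/T10102", "key_display_name": "...", "count": 12345 },
  ...
]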
Performance Optimization Tips
1. Use Maximum Page Size
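As in the gotcha above: append per-page=200 (the maximum) instead of relying on the default of 25.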
2. Use Batch ID Lookups
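Combine up to 50 IDs per request with the pipe operator, e.g. filter=doi:10.1371/journal.pone.0266781|10.1371/journal.pone.0267149 (see "Bulk Lookup by DOIs" above).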
3. Select Only Fields You Need
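For example: select=id,display_name,publication_year (see "Select Specific Fields Only" above).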
4. Use Concurrent Requests with Rate Limiting
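Combine a small thread pool with a single global rate limiter; a Python sketch appears under "Concurrent Requests Strategy" below.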
5. Add Email for 10x Speed Boost
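Append mailto=you@example.com (a placeholder; use your own address) to every request, as described under "Get Higher Rate Limits (Polite Pool)" above.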
Handling Errors
Common HTTP Status Codes
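Codes you are likely to see (treat this list as a working guide rather than an exhaustive specification):
200 - success
400 - malformed filter or parameter
403 - forbidden, e.g. the daily quota has been exceeded
404 - no entity with that ID
429 - per-second rate limit exceeded
500 / 503 - transient server errors; retry with backoff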
Exponential Backoff Pattern
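A minimal Python sketch of the retry-with-backoff steps from the gotchas section. The set of retryable status codes and the 30-second timeout are my choices; the email is a placeholder:

import time
import requests

RETRYABLE = {429, 500, 502, 503}     # statuses worth retrying

def fetch_with_backoff(url, params=None, max_retries=5, timeout=30):
    """GET with exponential backoff on rate limits, server errors, and timeouts."""
    delay = 1
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, params=params, timeout=timeout)
            if resp.status_code not in RETRYABLE:
                resp.raise_for_status()          # other 4xx errors fail immediately
                return resp.json()
            reason = f"HTTP {resp.status_code}"
        except (requests.Timeout, requests.ConnectionError) as err:
            reason = str(err)
        if attempt == max_retries:
            raise RuntimeError(f"giving up after {max_retries} attempts: {reason}")
        print(f"attempt {attempt} failed ({reason}); retrying in {delay}s")
        time.sleep(delay)
        delay *= 2                               # 1s, 2s, 4s, 8s, ...

data = fetch_with_backoff("https://api.openalex.org/works",
                          params={"per-page": 200, "mailto": "you@example.com"})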
Entity-Specific Filter Examples
Works Filters (Most Common)
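Frequently used works filters (values are examples; IDs marked as placeholders should be replaced with real ones):
publication_year:2023
from_publication_date:2023-01-01 / to_publication_date:2023-12-31
cited_by_count:>10
open_access.is_oa:true
type:article
authorships.author.id:A5023888391
authorships.institutions.country_code:us
primary_location.source.id:S1234567890 (placeholder)
topics.id:T12345 (placeholder)
language:en
has_doi:true
is_retracted:false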
Authors Filters
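Common authors filters (the institution IDs are placeholders; double-check field names against the authors docs):
works_count:>100
cited_by_count:>1000
has_orcid:true
last_known_institutions.id:I1234567890 (placeholder)
affiliations.institution.id:I1234567890 (placeholder)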
Sources Filters
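Common sources filters (double-check field names against the sources docs):
issn:2041-1723
type:journal
is_oa:true
is_in_doaj:true
works_count:>1000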
Institutions Filters
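Common institutions filters:
country_code:us
type:education
works_count:>10000
cited_by_count:>100000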
External ID Support
You can use external IDs directly in the API:
Works
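Works can be fetched by DOI (with or without the https://doi.org/ prefix) and, I believe, by PubMed ID:
/works/https://doi.org/10.7717/peerj.4375
/works/doi:10.7717/peerj.4375
/works/pmid:12345678 (placeholder PMID)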
Authors
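Authors can be fetched by ORCID (the ORCID shown is a format example; substitute a real one):
/authors/orcid:0000-0002-1825-0097
/authors/https://orcid.org/0000-0002-1825-0097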
Institutions
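Institutions can be fetched by ROR ID (the value below is a placeholder):
/institutions/ror:https://ror.org/012345678
/institutions/ror:012345678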
Sources
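Sources can be fetched by ISSN (2041-1723 is Nature Communications):
/sources/issn:2041-1723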
Advanced Tips
Reproducible Random Samples
Always use a seed for reproducible sampling:
Same seed = same results every time.
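For example, running this twice returns the same 50 works:
/works?sample=50&seed=42&per-page=50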
Finding Related Works
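Every work object carries a related_works array of OpenAlex IDs. There are also citation-link filters (as I recall: cites returns works that reference the given work, cited_by returns the works it references, and related_to returns algorithmically related works); the examples use the sample work ID from earlier:
/works?filter=cites:W2741809807
/works?filter=cited_by:W2741809807
/works?filter=related_to:W2741809807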
Filtering by Date Ranges
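For example, everything published during 2022:
/works?filter=from_publication_date:2022-01-01,to_publication_date:2022-12-31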
Complex Boolean Searches
The search parameter supports boolean operators:
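AND, OR, and NOT must be uppercase, and quotation marks force an exact phrase (remember to URL-encode spaces and quotes in code):
/works?search="climate change" AND policy NOT editorial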
Rate Limiting Best Practices
Without Email (Default Pool)
1 request per second
100,000 requests per day
Sequential processing recommended
With Email (Polite Pool)
10 requests per second
100,000 requests per day
Parallel processing viable
Always include your email for production use
Concurrent Requests Strategy
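A minimal Python sketch of rate-limited concurrent fetching: a small thread pool shares one global limiter so the whole process stays within the polite-pool budget. The email, worker count, and year range are placeholders/choices of mine.

import threading
import time
from concurrent.futures import ThreadPoolExecutor
import requests

MAILTO = "you@example.com"     # placeholder email (polite pool)
MAX_PER_SEC = 10               # polite-pool budget, shared by ALL threads

_lock = threading.Lock()
_last_request = [0.0]

def rate_limited_get(url, params=None):
    """Space requests globally, then perform the GET outside the lock."""
    with _lock:
        wait = _last_request[0] + 1.0 / MAX_PER_SEC - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        _last_request[0] = time.monotonic()
    resp = requests.get(url, params=dict(params or {}, mailto=MAILTO), timeout=30)
    resp.raise_for_status()
    return resp.json()

# Example: fetch one filtered page per year, concurrently
jobs = [("https://api.openalex.org/works",
         {"filter": f"publication_year:{year}", "per-page": 200})
        for year in range(2015, 2024)]

with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(lambda job: rate_limited_get(*job), jobs))

print(sum(len(p["results"]) for p in pages), "works fetched")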
Daily Limit Management
With 100k/day limit:
~4,166 requests per hour average
~69 requests per minute average
Plan accordingly for large jobs
Consider OpenAlex Premium for higher limits
Common Mistakes to Avoid
❌ Using page numbers for sampling → ✅ Use ?sample=
❌ Filtering by entity names → ✅ Get IDs first, then filter
❌ Default page size → ✅ Use per-page=200
❌ Sequential ID lookups → ✅ Batch with pipe (|) operator
❌ No error handling → ✅ Implement retry with backoff
❌ Ignoring rate limits in threads → ✅ Global rate limiting
❌ Trying to group by multiple fields → ✅ Multiple queries + combine
❌ Not including email → ✅ Add mailto= for 10x speed
❌ Fetching all fields → ✅ Use select= for needed fields only
❌ Assuming instant responses → ✅ Add timeouts (30s recommended)
Need More Info?
Full documentation: https://docs.openalex.org
API Overview: https://docs.openalex.org/how-to-use-the-api/api-overview
Entity schemas: https://docs.openalex.org/api-entities
Help: https://openalex.org/help
User group: https://groups.google.com/g/openalex-users
For Premium Features
If you need:
More than 100k requests/day
Faster than daily snapshot updates
Commercial support
SLA guarantees
See: https://openalex.org/pricing
Last updated: 2025-10-13
Maintained for: LLM agents, AI applications, and automated tools