API Guide for LLMs
Are you an LLM? Start here.
OpenAlex API Guide for LLM Agents and AI Applications
OpenAlex is a fully open catalog of scholarly works, authors, sources, institutions, topics, publishers, and funders. Base URL: https://api.openalex.org Documentation: https://docs.openalex.org No authentication required | 100,000 requests/day limit
CRITICAL GOTCHAS - Read These First!
❌ DON'T: Create ad-hoc sampling by using random page numbers
WRONG: ?page=5, ?page=17, ?page=42 to get "random" results This is NOT random sampling and will bias your results!
✅ DO: Use the ?sample parameter for random sampling
CORRECT: https://api.openalex.org/works?sample=20 For consistent results, add a seed: ?sample=20&seed=123
❌ DON'T: Try to sample large datasets (10k+) in one request
The sample parameter maxes out at reasonable sizes for a single request.
✅ DO: Use multiple samples with different seeds, then deduplicate
For large random samples (10k+ records):
Make multiple sample requests with different seeds
Combine results
Deduplicate by ID Example:
?sample=1000&seed=1
?sample=1000&seed=2
?sample=1000&seed=3 Then deduplicate the combined results by checking work IDs. See: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/sample-entity-lists
❌ DON'T: Search/filter by entity names directly
WRONG: /works?filter=author_name:Einstein Entity names are ambiguous and this won't work!
✅ DO: Use two-step lookup pattern for related entities
CORRECT two-step process:
Find the entity ID: /authors?search=einstein Response shows ID like "A5023888391" or full URI
Use ID to filter: /works?filter=authorships.author.id:A5023888391
Why? Names are ambiguous. "MIT" could be many institutions. IDs are unique. This applies to: authors, institutions, sources, topics, publishers, funders.
❌ DON'T: Try to group by multiple dimensions in one query
WRONG: You cannot do SQL-style "GROUP BY topic, year" in a single API call.
✅ DO: Make multiple queries and combine results client-side
To analyze by topic AND year (or any two dimensions):
Make one query per year: ?filter=publication_year:2020&group_by=topics.id
Repeat for 2021, 2022, etc.
Combine results in your code The API only supports one group_by per request.
❌ DON'T: Ignore API errors or retry immediately on failure
API errors are common, especially at scale. Immediate retries can make things worse.
✅ DO: Implement exponential backoff for retries
When you get errors (429 rate limit, 500 server error, timeouts):
Catch the error
Wait before retrying (1s, 2s, 4s, 8s, etc.)
Include a max retry limit (e.g., 5 attempts)
Log failures for debugging
❌ DON'T: Use default page sizes for bulk extraction
Default is only 25 results per page. Slow for large extracts!
✅ DO: Use maximum page size (200) for bulk data extraction
FAST: ?per-page=200 This reduces the number of API calls needed by 8x compared to default.
❌ DON'T: Make sequential API calls for lists of known IDs
SLOW: Loop through 100 DOIs making 100 separate API calls.
✅ DO: Use the OR filter (pipe |) for batch ID lookups
FAST: Combine up to 50 IDs in one query using pipe separator: /works?filter=doi:https://doi.org/10.1371/journal.pone.0266781|https://doi.org/10.1371/journal.pone.0267149|... You can include up to 50 values per filter. Use per-page=50 to get all results. See: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists#addition-or
❌ DON'T: Ignore rate limits when using concurrency/threading
Using multiple threads WITHOUT respecting rate limits will get you rate-limited or banned.
✅ DO: Respect rate limits even across concurrent requests
Default pool: 1 request/second
Polite pool (with email): 10 requests/second
Daily limit: 100,000 requests When using threading/async:
Implement rate limiting across ALL threads
Track requests per second globally
Add your email to requests for 10x higher limits
Quick Reference
Base URL and Authentication
Base: https://api.openalex.org
Auth: None required
Rate: 100k requests/day
Get Higher Rate Limits (Polite Pool)
Add your email to ANY parameter:
https://api.openalex.org/[email protected]
This increases your rate limit from 1 req/sec → 10 req/sec Always do this for production applications!
Entity Endpoints
/works - 240M+ scholarly documents (articles, books, datasets)
/authors - Researcher profiles with disambiguated identities
/sources - Journals, repositories, conferences
/institutions - Universities, research organizations
/topics - Subject classifications (3-level hierarchy)
/publishers - Publishing organizations
/funders - Funding agencies
/text - Tag your own text with OpenAlex topics/keywords (POST)
Essential Query Parameters
filter= - Filter results (see filter syntax below)
search= - Full-text search across title/abstract/fulltext
sort= - Sort results (e.g., cited_by_count:desc)
per-page= - Results per page (default: 25, max: 200)
page= - Page number for pagination
sample= - Get random results (e.g., sample=50)
seed= - Seed for reproducible sampling
select= - Limit returned fields (e.g., select=id,title)
group_by= - Aggregate results by a field
mailto= - Your email (gets you into polite pool)
Filter Syntax
Basic Filtering
Single filter: ?filter=publication_year:2020
Multiple (AND): ?filter=publication_year:2020,is_oa:true
Values (OR): ?filter=type:journal-article|book
Negation: ?filter=type:!journal-article
Comparison Operators
Greater than: ?filter=cited_by_count:>100
Less than: ?filter=publication_year:<2020
Range: ?filter=publication_year:2020-2023
Multiple Values in Same Attribute
You can express AND within a single attribute two ways:
Repeat filter: ?filter=institutions.country_code:us,institutions.country_code:gb
Use + symbol: ?filter=institutions.country_code:us+gb
Both mean: "works with author from US AND author from GB"
OR Queries (Pipe Separator)
Any of these: ?filter=institutions.country_code:us|gb|ca
Batch IDs: ?filter=doi:10.1/abc|10.2/def|10.3/ghi
You can combine up to 50 values with pipes.
Important: OR only works WITHIN a filter, not BETWEEN filters
POSSIBLE: ?filter=type:article|book (article OR book)
NOT POSSIBLE: Cannot do "(year=2020 OR year=2021) AND (type=article)"
WORKAROUND: Make separate queries and combine results
Common Patterns
Get Random Sample of Works
Small sample:
https://api.openalex.org/works?sample=20
Reproducible sample:
https://api.openalex.org/works?sample=20&seed=42
Large sample (10k+):
1. https://api.openalex.org/works?sample=1000&seed=1
2. https://api.openalex.org/works?sample=1000&seed=2
3. https://api.openalex.org/works?sample=1000&seed=3
...then deduplicate by ID
Search Works by Title/Abstract
Simple search:
https://api.openalex.org/works?search=machine+learning
Search specific field:
https://api.openalex.org/works?filter=title.search:CRISPR
Search + filter:
https://api.openalex.org/works?search=climate&filter=publication_year:2023
Find Works by Author (Two-Step Pattern)
Step 1 - Get author ID:
https://api.openalex.org/authors?search=Heather+Piwowar
Response includes: "id": "https://openalex.org/A5023888391"
Step 2 - Get their works:
https://api.openalex.org/works?filter=authorships.author.id:A5023888391
Alternative - Use ORCID directly:
https://api.openalex.org/works?filter=authorships.author.id:https://orcid.org/0000-0003-1613-5981
Find Works by Institution (Two-Step Pattern)
Step 1 - Get institution ID:
https://api.openalex.org/institutions?search=MIT
Response includes: "id": "https://openalex.org/I136199984"
Step 2 - Get their works:
https://api.openalex.org/works?filter=authorships.institutions.id:I136199984
Alternative - Use ROR directly:
https://api.openalex.org/works?filter=authorships.institutions.id:https://ror.org/042nb2s44
Get Highly Cited Recent Papers
https://api.openalex.org/works?filter=publication_year:>2020&sort=cited_by_count:desc&per-page=200
Get Open Access Works Only
All OA:
https://api.openalex.org/works?filter=is_oa:true
Gold OA only:
https://api.openalex.org/works?filter=open_access.oa_status:gold
Published OA version:
https://api.openalex.org/works?filter=has_oa_published_version:true
Filter by Multiple Criteria
Recent OA works about COVID from top institutions:
https://api.openalex.org/works?filter=publication_year:2022,is_oa:true,title.search:covid,authorships.institutions.id:I136199984|I27837315
Breaking down the filters:
- publication_year:2022 (recent)
- is_oa:true (open access)
- title.search:covid (about COVID)
- authorships.institutions.id:I136199984|I27837315 (MIT or Harvard)
Bulk Lookup by DOIs
Get specific works by DOI (efficient batch method):
https://api.openalex.org/works?filter=doi:https://doi.org/10.1371/journal.pone.0266781|https://doi.org/10.1371/journal.pone.0267149|https://doi.org/10.1371/journal.pone.0267890&per-page=50
Up to 50 DOIs per request. Use per-page=50 to ensure you get all results.
Get Works from Specific Journal
Step 1 - Get source ID:
https://api.openalex.org/sources?search=Nature
Response includes: "id": "https://openalex.org/S137773608"
Step 2 - Get works from that source:
https://api.openalex.org/works?filter=primary_location.source.id:S137773608
Aggregate/Group Data
Top topics by work count:
https://api.openalex.org/works?group_by=topics.id
Papers per year:
https://api.openalex.org/works?group_by=publication_year
Most prolific institutions:
https://api.openalex.org/works?group_by=authorships.institutions.id
Group with filters:
https://api.openalex.org/works?filter=publication_year:>2020&group_by=topics.id
Pagination for Large Result Sets
First page:
https://api.openalex.org/works?filter=publication_year:2023&per-page=200
Next pages:
https://api.openalex.org/works?filter=publication_year:2023&per-page=200&page=2
https://api.openalex.org/works?filter=publication_year:2023&per-page=200&page=3
...
The meta.count field tells you total results.
Calculate pages needed: ceil(meta.count / per-page)
Select Specific Fields Only (Faster Responses)
Just IDs and titles:
https://api.openalex.org/works?select=id,title&per-page=200
Multiple fields:
https://api.openalex.org/works?select=id,title,publication_year,cited_by_count
Autocomplete for Type-Ahead
Fast autocomplete endpoint for building search UIs:
Authors:
https://api.openalex.org/autocomplete/authors?q=einst
Institutions:
https://api.openalex.org/autocomplete/institutions?q=stanford
Works:
https://api.openalex.org/autocomplete/works?q=neural+networks
Typically returns in ~200ms
Tag Your Own Text (/text endpoint)
POST or GET to classify your own content:
https://api.openalex.org/text?title=Machine+learning+for+drug+discovery
Returns topics, keywords, and concepts for your text.
Limited to 1 req/sec and 1000 req/day.
Text must be 20-2000 characters.
Response Structure
List Endpoints
All list endpoints (/works, /authors, etc.) return:
{
"meta": {
"count": 240523418,
"db_response_time_ms": 42,
"page": 1,
"per_page": 25
},
"results": [
{ /* entity object */ },
{ /* entity object */ },
...
]
}
Single Entity Endpoints
Getting a single entity returns the object directly:
https://api.openalex.org/works/W2741809807
→ Returns a Work object directly (no meta/results wrapper)
Group By Responses
{
"meta": { "count": 100, ... },
"group_by": [
{
"key": "https://openalex.org/T10001",
"key_display_name": "Artificial Intelligence",
"count": 15234
},
...
]
}
Performance Optimization Tips
1. Use Maximum Page Size
SLOW: Default 25 per page = more API calls
FAST: ?per-page=200 (8x fewer API calls)
2. Use Batch ID Lookups
SLOW: Loop through 50 DOIs, 50 API calls
FAST: One call with pipe-separated DOIs (up to 50)
3. Select Only Fields You Need
SLOW: Full objects with all fields
FAST: ?select=id,title,publication_year
4. Use Concurrent Requests with Rate Limiting
# Pseudo-code
from concurrent.futures import ThreadPoolExecutor
import time
rate_limiter = RateLimiter(10) # 10 req/sec with polite pool
def fetch_page(page_num):
rate_limiter.wait() # Ensure we don't exceed rate limit
return requests.get(f"...&page={page_num}&[email protected]")
with ThreadPoolExecutor(max_workers=10) as executor:
results = executor.map(fetch_page, range(1, 101))
5. Add Email for 10x Speed Boost
WITHOUT: 1 request/second max
WITH: 10 requests/second max
Just add: &[email protected]
Handling Errors
Common HTTP Status Codes
200 OK - Success
400 Bad Request - Invalid parameter (check your filter syntax)
403 Forbidden - Rate limit exceeded (slow down, implement backoff)
404 Not Found - Entity doesn't exist
500 Server Error - Temporary issue (retry with backoff)
Exponential Backoff Pattern
def fetch_with_retry(url, max_retries=5):
for attempt in range(max_retries):
try:
response = requests.get(url, timeout=30)
if response.status_code == 200:
return response.json()
elif response.status_code == 403:
# Rate limited
wait_time = 2 ** attempt # 1s, 2s, 4s, 8s, 16s
time.sleep(wait_time)
elif response.status_code >= 500:
# Server error
wait_time = 2 ** attempt
time.sleep(wait_time)
else:
# Other error, don't retry
response.raise_for_status()
except requests.exceptions.Timeout:
if attempt < max_retries - 1:
time.sleep(2 ** attempt)
else:
raise
raise Exception(f"Failed after {max_retries} retries")
Entity-Specific Filter Examples
Works Filters (Most Common)
authorships.author.id - Author's OpenAlex ID
authorships.institutions.id - Institution's OpenAlex ID
cited_by_count - Number of citations
is_oa - Is open access (true/false)
publication_year - Year published
primary_location.source.id - Source (journal) ID
topics.id - Topic ID
type - article, book, dataset, etc.
has_doi - Has a DOI (true/false)
has_fulltext - Has fulltext available (true/false)
Authors Filters
last_known_institution.id - Current/last institution
works_count - Number of works authored
cited_by_count - Total citations
orcid - ORCID identifier
Sources Filters
host_organization - Publisher/host
type - journal, repository, etc.
is_oa - Is open access
Institutions Filters
type - education, healthcare, company, etc.
country_code - Two-letter country code
continent - africa, asia, europe, etc.
External ID Support
You can use external IDs directly in the API:
Works
DOI: /works/https://doi.org/10.7717/peerj.4375
PMID: /works/pmid:29844763
Authors
ORCID: /authors/https://orcid.org/0000-0003-1613-5981
Institutions
ROR: /institutions/https://ror.org/02y3ad647
Sources
ISSN: /sources/issn:0028-0836
Advanced Tips
Reproducible Random Samples
Always use a seed for reproducible sampling:
https://api.openalex.org/works?sample=100&seed=42
Same seed = same results every time.
Finding Related Works
Get cited works:
1. Get work: /works/W2741809807
2. Response includes: "referenced_works": [...]
3. Fetch those: /works?filter=openalex_id:W123|W456|W789
Get citing works:
1. Get work: /works/W2741809807
2. Response includes: "cited_by_api_url"
3. Follow that URL
Filtering by Date Ranges
Exact year: ?filter=publication_year:2020
After: ?filter=publication_year:>2020
Before: ?filter=publication_year:<2020
Range: ?filter=publication_year:2018-2022
Complex Boolean Searches
The search parameter supports boolean operators:
AND: ?search=climate+AND+change
OR: ?search=climate+OR+weather
NOT: ?search=climate+NOT+politics
Rate Limiting Best Practices
Without Email (Default Pool)
1 request per second
100,000 requests per day
Sequential processing recommended
With Email (Polite Pool)
10 requests per second
100,000 requests per day
Parallel processing viable
Always include your email for production use
Concurrent Requests Strategy
1. Track requests per second globally (not per thread)
2. Use a semaphore or rate limiter across threads
3. Add delays between batches if needed
4. Monitor for 403 responses (rate limit exceeded)
5. Back off if you hit limits
Daily Limit Management
With 100k/day limit:
~4,166 requests per hour average
~69 requests per minute average
Plan accordingly for large jobs
Consider OpenAlex Premium for higher limits
Common Mistakes to Avoid
❌ Using page numbers for sampling → ✅ Use ?sample=
❌ Filtering by entity names → ✅ Get IDs first, then filter
❌ Default page size → ✅ Use per-page=200
❌ Sequential ID lookups → ✅ Batch with pipe (|) operator
❌ No error handling → ✅ Implement retry with backoff
❌ Ignoring rate limits in threads → ✅ Global rate limiting
❌ Trying to group by multiple fields → ✅ Multiple queries + combine
❌ Not including email → ✅ Add mailto= for 10x speed
❌ Fetching all fields → ✅ Use select= for needed fields only
❌ Assuming instant responses → ✅ Add timeouts (30s recommended)
Need More Info?
Full documentation: https://docs.openalex.org
API Overview: https://docs.openalex.org/how-to-use-the-api/api-overview
Entity schemas: https://docs.openalex.org/api-entities
Help: https://openalex.org/help
User group: https://groups.google.com/g/openalex-users
For Premium Features
If you need:
More than 100k requests/day
Faster than daily snapshot updates
Commercial support
SLA guarantees
See: https://openalex.org/pricing
Last updated: 2025-10-13 Maintained for: LLM agents, AI applications, and automated tools
Last updated