API Guide for LLMs

Are you an LLM? Start here.

OpenAlex API Guide for LLM Agents and AI Applications

OpenAlex is a fully open catalog of scholarly works, authors, sources, institutions, topics, publishers, and funders.

Base URL: https://api.openalex.org

Documentation: https://docs.openalex.org

No authentication required | 100,000 requests/day limit

CRITICAL GOTCHAS - Read These First!

❌ DON'T: Create ad-hoc sampling by using random page numbers

WRONG: ?page=5, ?page=17, ?page=42 to get "random" results

This is NOT random sampling and will bias your results!

✅ DO: Use the ?sample parameter for random sampling

CORRECT: https://api.openalex.org/works?sample=20

For consistent results, add a seed: ?sample=20&seed=123

❌ DON'T: Try to sample large datasets (10k+) in one request

The sample parameter maxes out at reasonable sizes for a single request.

✅ DO: Use multiple samples with different seeds, then deduplicate

For large random samples (10k+ records):

  1. Make multiple sample requests with different seeds

  2. Combine results

  3. Deduplicate by ID

Example:

  • ?sample=1000&seed=1

  • ?sample=1000&seed=2

  • ?sample=1000&seed=3

Then deduplicate the combined results by checking work IDs.

See: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/sample-entity-lists
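The multi-seed approach above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper names `sample_urls` and `dedupe_by_id` are hypothetical, and actually fetching each URL (including any paging the API requires) is left out.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org"


def sample_urls(n_batches, batch_size=1000, first_seed=1):
    """Build one /works sample URL per seed, using consecutive integer seeds."""
    return [
        f"{BASE_URL}/works?" + urlencode({"sample": batch_size, "seed": seed})
        for seed in range(first_seed, first_seed + n_batches)
    ]


def dedupe_by_id(batches):
    """Merge batches of work records, keeping the first record seen for each ID."""
    unique = {}
    for batch in batches:
        for work in batch:
            # Each record is assumed to be a dict carrying its OpenAlex ID under "id".
            unique.setdefault(work["id"], work)
    return list(unique.values())
```

Different seeds can return overlapping works, so the merged list is usually smaller than the number of batches times the batch size. Request a few extra batches if you need an exact count.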

❌ DON'T: Search/filter by entity names directly

WRONG: /works?filter=author_name:Einstein

Entity names are ambiguous and this won't work!

CORRECT two-step process:

  1. Find the entity ID: /authors?search=einstein (the response shows an ID like "A5023888391", or the full URI)

  2. Use ID to filter: /works?filter=authorships.author.id:A5023888391

Why? Names are ambiguous. "MIT" could be many institutions. IDs are unique. This applies to: authors, institutions, sources, topics, publishers, funders.
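A minimal Python sketch of the two-step pattern, under stated assumptions: list responses are assumed to carry their matches in a `results` array, and taking the first hit is a simplification (in practice, inspect the candidates before choosing one). The names `fetch_json`, `short_id`, and `works_url_for_author` are hypothetical helpers, not part of any OpenAlex library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.openalex.org"


def fetch_json(url):
    """GET a URL and decode the JSON body (30 second timeout)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def short_id(openalex_id):
    """Reduce a full URI such as https://openalex.org/A5023888391 to A5023888391."""
    return openalex_id.rsplit("/", 1)[-1]


def works_url_for_author(name, fetch=fetch_json):
    """Step 1: search /authors by name. Step 2: build a /works filter on the ID."""
    search_url = f"{BASE_URL}/authors?" + urlencode({"search": name})
    results = fetch(search_url)["results"]  # assumed response key
    if not results:
        raise LookupError(f"no author found for {name!r}")
    author_id = short_id(results[0]["id"])  # simplification: take the top hit
    return f"{BASE_URL}/works?filter=authorships.author.id:{author_id}"
```

The `fetch` argument is injectable so the lookup can be exercised without network access. The same shape applies to institutions, sources, topics, publishers, and funders, with the matching filter attribute.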

❌ DON'T: Try to group by multiple dimensions in one query

WRONG: You cannot do SQL-style "GROUP BY topic, year" in a single API call.

✅ DO: Make multiple queries and combine results client-side

To analyze by topic AND year (or any two dimensions):

  1. Make one query per year: ?filter=publication_year:2020&group_by=topics.id

  2. Repeat for 2021, 2022, etc.

  3. Combine results in your code

The API only supports one group_by per request.
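The per-year loop can be sketched as below. It assumes each grouped response contains a `group_by` array of objects with `key` and `count` fields; `fetch` is a caller-supplied function that returns decoded JSON for a URL, and both function names are hypothetical.

```python
from collections import defaultdict

BASE_URL = "https://api.openalex.org"


def group_by_url(year, dimension="topics.id"):
    """Build the single-dimension group_by query for one publication year."""
    return f"{BASE_URL}/works?filter=publication_year:{year}&group_by={dimension}"


def topic_counts_by_year(years, fetch):
    """Run one group_by query per year and merge into {topic: {year: count}}."""
    table = defaultdict(dict)
    for year in years:
        # Assumed response shape: {"group_by": [{"key": ..., "count": ...}, ...]}
        for group in fetch(group_by_url(year))["group_by"]:
            table[group["key"]][year] = group["count"]
    return dict(table)
```

The result is a nested dictionary, which is the client-side equivalent of a two-column GROUP BY.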

❌ DON'T: Ignore API errors or retry immediately on failure

API errors are common, especially at scale. Immediate retries can make things worse.

✅ DO: Implement exponential backoff for retries

When you get errors (429 rate limit, 500 server error, timeouts):

  1. Catch the error

  2. Wait before retrying (1s, 2s, 4s, 8s, etc.)

  3. Include a max retry limit (e.g., 5 attempts)

  4. Log failures for debugging
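The four steps above could look like this in Python. It is a sketch built on the standard library's `urllib` exceptions; treating 429 and every 5xx status as retryable is an assumption that goes slightly beyond the codes listed here, and `with_backoff` is a hypothetical helper name.

```python
import logging
import time
from urllib.error import HTTPError, URLError


def is_retryable(status):
    """Assumption: retry on 429 (rate limit) and on any 5xx server error."""
    return status == 429 or status >= 500


def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying failures with waits of 1s, 2s, 4s, 8s, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except HTTPError as err:
            if not is_retryable(err.code):
                raise  # client errors such as 404 will not improve on retry
            error = err
        except (URLError, TimeoutError) as err:
            error = err  # connection problems and timeouts
        if attempt == max_attempts:
            logging.error("giving up after %d attempts: %s", attempt, error)
            raise error
        delay = base_delay * 2 ** (attempt - 1)
        logging.warning("attempt %d failed (%s); retrying in %.0fs", attempt, error, delay)
        sleep(delay)
```

Wrap any request in a zero-argument callable, for example `with_backoff(lambda: urlopen(url, timeout=30))` with `urlopen` imported from `urllib.request`. The `sleep` parameter exists only so the waiting can be replaced in tests.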

❌ DON'T: Use default page sizes for bulk extraction

Default is only 25 results per page. Slow for large extracts!

✅ DO: Use maximum page size (200) for bulk data extraction

FAST: ?per-page=200

This reduces the number of API calls needed by 8x compared to the default (200 / 25 = 8).
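To make the 8x figure concrete, here is the arithmetic as a small Python helper. The function name `pages_needed` and the 10,000-record example are illustrative only.

```python
import math


def pages_needed(total_results, per_page):
    """Number of requests required to page through total_results records."""
    return math.ceil(total_results / per_page)


default_calls = pages_needed(10_000, 25)   # 400 requests at the default page size
bulk_calls = pages_needed(10_000, 200)     # 50 requests at the maximum page size
```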

❌ DON'T: Make sequential API calls for lists of known IDs

SLOW: Loop through 100 DOIs making 100 separate API calls.

✅ DO: Use the OR filter (pipe |) for batch ID lookups

FAST: Combine up to 50 IDs in one query using the pipe separator:

/works?filter=doi:https://doi.org/10.1371/journal.pone.0266781|https://doi.org/10.1371/journal.pone.0267149|...

You can include up to 50 values per filter. Use per-page=50 to get all results in one page.

See: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists#addition-or
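One way to turn a long DOI list into batched queries is sketched below. The helper names `chunked` and `doi_batch_urls` are hypothetical; the 50-value limit and the per-page=50 setting come from the text above.

```python
BASE_URL = "https://api.openalex.org"
MAX_VALUES_PER_FILTER = 50


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def doi_batch_urls(dois):
    """Build one /works URL per group of up to 50 DOIs, joined with pipes (OR)."""
    urls = []
    for chunk in chunked(dois, MAX_VALUES_PER_FILTER):
        joined = "|".join(chunk)
        urls.append(
            f"{BASE_URL}/works?filter=doi:{joined}&per-page={MAX_VALUES_PER_FILTER}"
        )
    return urls
```

With this approach, 100 DOIs cost 2 requests instead of 100.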

❌ DON'T: Ignore rate limits when using concurrency/threading

Using multiple threads WITHOUT respecting rate limits will get you rate-limited or banned.

✅ DO: Respect rate limits even across concurrent requests

  • Default pool: 1 request/second

  • Polite pool (with email): 10 requests/second

  • Daily limit: 100,000 requests

When using threading/async:

  1. Implement rate limiting across ALL threads

  2. Track requests per second globally

  3. Add your email to requests for 10x higher limits
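A global limiter shared by all threads might be sketched like this. It is one possible design (evenly spaced request slots handed out under a lock), not something the API prescribes, and the class name `RateLimiter` is hypothetical.

```python
import threading
import time


class RateLimiter:
    """Spaces out calls from all threads to at most `rate` per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def acquire(self):
        """Block until this caller's reserved time slot arrives."""
        with self._lock:
            # Reserve the next free slot while holding the lock, so two
            # threads can never claim the same slot.
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        wait = slot - now
        if wait > 0:
            self._sleep(wait)  # sleep outside the lock so other threads can queue
```

Create a single instance, for example `RateLimiter(10)` for the polite pool, and have every worker call `limiter.acquire()` immediately before each request. The `clock` and `sleep` parameters exist only to make the class testable.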

Quick Reference

Base URL and Authentication

Get Higher Rate Limits (Polite Pool)

Add your email to ANY request using the mailto= parameter.

This increases your rate limit from 1 req/sec → 10 req/sec.

Always do this for production applications!
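A small helper for attaching the email could look like the following. The function name `with_mailto` and the address `you@example.org` are placeholders; the `mailto` parameter name is the one this guide refers to.

```python
from urllib.parse import quote


def with_mailto(url, email):
    """Append mailto=<email> to a URL, whether or not it already has a query."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}mailto={quote(email, safe='@')}"


url = with_mailto("https://api.openalex.org/works?sample=20", "you@example.org")
```

Applying the helper in one place, such as the function that performs every request, avoids forgetting the parameter on individual calls.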

Entity Endpoints

Essential Query Parameters

Filter Syntax

Basic Filtering

Comparison Operators

Multiple Values in Same Attribute

You can express AND within a single attribute two ways:

Both mean: "works with author from US AND author from GB"

OR Queries (Pipe Separator)

You can combine up to 50 values with pipes.

Important: OR only works WITHIN a filter, not BETWEEN filters
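The rule can be illustrated with a small filter builder. This sketch assumes that separate filters are joined with commas and combine as AND, while values inside one filter are joined with pipes and combine as OR. `build_filter` is a hypothetical helper; the attribute names are taken from earlier examples in this guide.

```python
MAX_OR_VALUES = 50


def build_filter(criteria):
    """Turn {attribute: [values]} into a filter string.

    Values of one attribute are ORed with "|"; attributes are ANDed with ","
    (assumed separator). There is no way to express OR between attributes.
    """
    parts = []
    for attribute, values in criteria.items():
        if len(values) > MAX_OR_VALUES:
            raise ValueError(f"{attribute}: at most {MAX_OR_VALUES} values per filter")
        parts.append(f"{attribute}:" + "|".join(str(value) for value in values))
    return ",".join(parts)


example = build_filter({
    "publication_year": [2020, 2021],            # 2020 OR 2021
    "authorships.author.id": ["A5023888391"],    # AND this author
})
```

A query such as "year is 2020 OR author is X" cannot be written as one filter string. Run two queries and merge the results client-side instead.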

Common Patterns

Get Random Sample of Works

Search Works by Title/Abstract

Find Works by Author (Two-Step Pattern)

Find Works by Institution (Two-Step Pattern)

Get Highly Cited Recent Papers

Get Open Access Works Only

Filter by Multiple Criteria

Bulk Lookup by DOIs

Get Works from Specific Journal

Aggregate/Group Data

Pagination for Large Result Sets

Select Specific Fields Only (Faster Responses)

Autocomplete for Type-Ahead

Tag Your Own Text (/text endpoint)

Response Structure

List Endpoints

All list endpoints (/works, /authors, etc.) return:

Single Entity Endpoints

Getting a single entity returns the object directly:

Group By Responses

Performance Optimization Tips

1. Use Maximum Page Size

2. Use Batch ID Lookups

3. Select Only Fields You Need

4. Use Concurrent Requests with Rate Limiting

5. Add Email for 10x Speed Boost

Handling Errors

Common HTTP Status Codes

Exponential Backoff Pattern

Entity-Specific Filter Examples

Works Filters (Most Common)

Authors Filters

Sources Filters

Institutions Filters

External ID Support

You can use external IDs directly in the API:

Works

Authors

Institutions

Sources

Advanced Tips

Reproducible Random Samples

Always use a seed for reproducible sampling:

Same seed = same results every time.

Filtering by Date Ranges

Complex Boolean Searches

The search parameter supports boolean operators:

Rate Limiting Best Practices

Without Email (Default Pool)

  • 1 request per second

  • 100,000 requests per day

  • Sequential processing recommended

With Email (Polite Pool)

  • 10 requests per second

  • 100,000 requests per day

  • Parallel processing viable

  • Always include your email for production use

Concurrent Requests Strategy

Daily Limit Management

With 100k/day limit:

  • ~4,166 requests per hour average

  • ~69 requests per minute average

  • Plan accordingly for large jobs

  • Consider OpenAlex Premium for higher limits
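For planning, the daily cap usually binds before the per-second limit does. The sketch below works through that arithmetic; the function names are hypothetical and the figures are the limits stated in this guide.

```python
import math

DAILY_LIMIT = 100_000


def seconds_to_exhaust_quota(requests_per_second):
    """How long the daily quota lasts when requesting at a constant rate."""
    return DAILY_LIMIT / requests_per_second


def days_needed(total_requests):
    """Minimum number of days a job needs under the daily request cap."""
    return math.ceil(total_requests / DAILY_LIMIT)


polite_pool_seconds = seconds_to_exhaust_quota(10)   # 10,000 s, about 2.8 hours
large_job_days = days_needed(1_000_000)              # 10 days
```

At the full polite-pool rate the day's quota is gone in under three hours, so a job of a million requests takes at least ten days however fast each request completes.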

Common Mistakes to Avoid

  1. ❌ Using page numbers for sampling → ✅ Use ?sample=

  2. ❌ Filtering by entity names → ✅ Get IDs first, then filter

  3. ❌ Default page size → ✅ Use per-page=200

  4. ❌ Sequential ID lookups → ✅ Batch with pipe (|) operator

  5. ❌ No error handling → ✅ Implement retry with backoff

  6. ❌ Ignoring rate limits in threads → ✅ Global rate limiting

  7. ❌ Trying to group by multiple fields → ✅ Multiple queries + combine

  8. ❌ Not including email → ✅ Add mailto= for 10x speed

  9. ❌ Fetching all fields → ✅ Use select= for needed fields only

  10. ❌ Assuming instant responses → ✅ Add timeouts (30s recommended)

Need More Info?

  • Full documentation: https://docs.openalex.org

  • API Overview: https://docs.openalex.org/how-to-use-the-api/api-overview

  • Entity schemas: https://docs.openalex.org/api-entities

  • Help: https://openalex.org/help

  • User group: https://groups.google.com/g/openalex-users

For Premium Features

If you need:

  • More than 100k requests/day

  • Faster than daily snapshot updates

  • Commercial support

  • SLA guarantees

See: https://openalex.org/pricing


Last updated: 2025-10-13

Maintained for: LLM agents, AI applications, and automated tools
