Get N-grams

N-grams are groups of sequential words that occur in the text of a Work.

N-grams list the words and phrases that occur in the full text of a Work. We obtain them from Internet Archive's publicly (and generously 👏) available General Index and use them to enable fulltext searches on the Works that have them, both through the fulltext.search filter and as one element of the broader search parameter.

Note that while n-grams are derived from the fulltext of a Work, the presence of n-grams for a given Work doesn't imply that the fulltext is available to you, the reader. It only means the fulltext was available to Internet Archive for indexing. Work.open_access is the place to go for information on public fulltext availability.

API Endpoint

In addition to enabling fulltext search capabilities, a Work's n-grams are viewable directly through an endpoint that accepts either an OpenAlex ID or a DOI.

Unlike other API endpoints, the n-grams endpoint is cached via CDN, which makes it very fast. You can call it as often as you like; rate limits don't apply.
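As a rough sketch of how a request URL could be built from either kind of identifier: the helper name `ngrams_url`, the base URL `https://api.openalex.org`, and the `/works/<id>/ngrams` path are assumptions made for illustration, since the text above does not spell out the endpoint path. Check them against the live API before relying on them.

```python
from urllib.parse import quote

# Assumed base URL and path layout; neither is stated in the text above.
API_BASE = "https://api.openalex.org"


def ngrams_url(work_id):
    """Build an n-grams URL from an OpenAlex ID or a DOI (hypothetical helper)."""
    work_id = work_id.strip()
    # Reduce full-URL forms of either identifier to the bare key.
    for prefix in ("https://openalex.org/", "https://doi.org/"):
        if work_id.startswith(prefix):
            work_id = work_id[len(prefix):]
    # Keep "/" unescaped so a DOI such as 10.1103/physrevb.37.785 stays readable.
    return f"{API_BASE}/works/{quote(work_id, safe='/')}/ngrams"


print(ngrams_url("https://openalex.org/W2023271753"))
print(ngrams_url("https://doi.org/10.1103/physrevb.37.785"))
```

Both calls use the identifiers that appear in the example response on this page, so the two URLs should point at the same Work if the assumed path is right.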

The response is a list of Ngram objects, sorted from 5-grams down to unigrams:

{
  meta: {
    count: 1068,
    doi: "https://doi.org/10.1103/physrevb.37.785",
    openalex_id: "https://openalex.org/W2023271753"
  },
  ngrams: [
    {
      ngram: "energy formula into a functional",
      ngram_tokens: 5,
      ngram_count: 1,
      term_frequency: 0.0005452562704471102
    },
    {
      ngram: "functional of the electron density",
      ngram_tokens: 5,
      ngram_count: 1,
      term_frequency: 0.0005452562704471102
    },
    ...
  ]
}
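To illustrate how a response of this shape might be consumed, here is a minimal sketch that groups n-grams by their token count. It assumes the endpoint returns standard JSON with quoted keys. `group_by_length` is a hypothetical helper name, and the sample payload reuses only the two entries shown above.

```python
import json
from collections import defaultdict


def group_by_length(payload):
    """Map n-gram length (ngram_tokens) to the n-grams of that length."""
    groups = defaultdict(list)
    for item in payload.get("ngrams", []):
        groups[item["ngram_tokens"]].append(item["ngram"])
    return dict(groups)


# The two entries from the example response, rewritten as strict JSON.
sample = json.loads("""
{
  "meta": {
    "count": 1068,
    "doi": "https://doi.org/10.1103/physrevb.37.785",
    "openalex_id": "https://openalex.org/W2023271753"
  },
  "ngrams": [
    {"ngram": "energy formula into a functional", "ngram_tokens": 5,
     "ngram_count": 1, "term_frequency": 0.0005452562704471102},
    {"ngram": "functional of the electron density", "ngram_tokens": 5,
     "ngram_count": 1, "term_frequency": 0.0005452562704471102}
  ]
}
""")

by_length = group_by_length(sample)
print(by_length[5])
```

Because the list is sorted from 5-grams down to unigrams, the grouping preserves that order within each length.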

The ID-based link is provided in Work.ngrams_url if n-grams are available. Works with n-grams can be found using the Work.has_ngrams filter, which can be combined with other filters using logical expressions.
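A small sketch of how such a combined filter could be expressed as a query URL. The `/works` URL, the `has_ngrams:true` value syntax, the comma as a logical AND, and the `publication_year` filter are all assumptions here, as is the helper name `works_filter_url`. None of them are specified in the text above.

```python
from urllib.parse import urlencode

# Assumed works listing endpoint; not stated in the text above.
WORKS_URL = "https://api.openalex.org/works"


def works_filter_url(filters):
    """Join key:value filters with commas (assumed to mean AND) into one URL."""
    filter_value = ",".join(f"{key}:{value}" for key, value in filters.items())
    # Keep ":" and "," unescaped so the filter expression stays readable.
    return WORKS_URL + "?" + urlencode({"filter": filter_value}, safe=":,")


url = works_filter_url({"has_ngrams": "true", "publication_year": "2020"})
print(url)
```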

Fulltext Coverage

About 57 million works have n-gram coverage through Internet Archive. OurResearch is the first organization to host this data in a highly usable way, and we are proud to integrate it into OpenAlex!

Curious about n-grams used in search? Browse them all via the API. Highly cited works and older works are more likely to have n-grams, as shown by the coverage charts below:
