Get N-grams

N-grams are groups of sequential words that occur in the text of a Work.

N-grams list the words and phrases that occur in the full text of a Work. We obtain them from Internet Archive's publicly (and generously 👏) available General Index and use them to enable fulltext searches on the Works that have them, both through the fulltext.search filter and as one element of the more holistic search parameter.

Note that while n-grams are derived from the fulltext of a Work, the presence of n-grams for a given Work doesn't imply that the fulltext is available to you, the reader. It only means the fulltext was available to Internet Archive for indexing. Work.open_access is the place to go for information on public fulltext availability.

API Endpoint

In addition to enabling fulltext search capabilities, a Work's n-grams are viewable directly through an endpoint that accepts either an OpenAlex ID or a DOI.

Unlike other API endpoints, this one is cached via CDN, which means it is super fast and you can call it as often as you want: rate limits don't apply.

The response is a list of Ngram objects, sorted from 5-grams down to unigrams:

{
  meta: {
    count: 1068,
    doi: "https://doi.org/10.1103/physrevb.37.785",
    openalex_id: "https://openalex.org/W2023271753"
  },
  ngrams: [
    {
      ngram: "energy formula into a functional",
      ngram_tokens: 5,
      ngram_count: 1,
      term_frequency: 0.0005452562704471102
    },
    {
      ngram: "functional of the electron density",
      ngram_tokens: 5,
      ngram_count: 1,
      term_frequency: 0.0005452562704471102
    },
    ...
  ]
}
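As a rough illustration of calling this endpoint, here is a minimal Python sketch. The base URL and the /works/<id>/ngrams path layout are assumptions (the text above only says the endpoint accepts an OpenAlex ID or a DOI), and the helper names ngrams_url, fetch_ngrams and group_by_length are hypothetical. The field names come from the sample response above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL and path layout are not stated in the text above.
API_BASE = "https://api.openalex.org"


def ngrams_url(work_id: str) -> str:
    """Build the n-grams URL for an OpenAlex ID or a DOI (assumed layout)."""
    # Accept full URLs such as https://openalex.org/W... or https://doi.org/10...
    for prefix in ("https://openalex.org/", "https://doi.org/"):
        if work_id.startswith(prefix):
            work_id = work_id[len(prefix):]
    # Keep the slash inside a DOI; percent-encode anything unusual.
    return f"{API_BASE}/works/{quote(work_id, safe='/')}/ngrams"


def fetch_ngrams(work_id: str, timeout: float = 30.0) -> dict:
    """Download and decode the n-grams response for one Work."""
    with urlopen(ngrams_url(work_id), timeout=timeout) as resp:
        return json.load(resp)


def group_by_length(response: dict) -> dict:
    """Map each n-gram length (ngram_tokens) to the list of n-gram strings."""
    grouped = {}
    for item in response.get("ngrams", []):
        grouped.setdefault(item["ngram_tokens"], []).append(item["ngram"])
    return grouped
```

With these helpers, group_by_length(fetch_ngrams("W2023271753")) would return the phrases keyed by their token count, provided the assumed URL layout is right.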

The ID-based link is provided in Work.ngrams_url if n-grams are available. Works with n-grams can be found using the Work.has_ngrams filter, which can be combined with other filters using logical expressions.
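To make the filter combination concrete, the sketch below builds a works query that requires n-grams and also applies a fulltext search. The /works URL, the comma-separated key:value filter syntax and the helper name works_query are assumptions for illustration. Only the filter names has_ngrams and fulltext.search come from this page.

```python
from urllib.parse import urlencode

# Assumption: the works listing lives at this URL.
WORKS_URL = "https://api.openalex.org/works"


def works_query(filters: dict) -> str:
    """Join filters as key:value pairs separated by commas (assumed syntax)."""
    parts = []
    for key, value in filters.items():
        # Booleans are written in lowercase, e.g. has_ngrams:true.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    # Leave ':' and ',' unescaped so the filter expression stays readable.
    return f"{WORKS_URL}?{urlencode({'filter': ','.join(parts)}, safe=':,')}"


url = works_query({"has_ngrams": True, "fulltext.search": "electron density"})
print(url)
```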

Fulltext Coverage

You can see which works we have full text for using the has_fulltext filter. This does not necessarily mean that the full text is available to you, dear reader; rather, it means that we have indexed the full text and can use it to help power searches. If you are trying to find the full text for yourself, try looking in open_access.oa_url.

We get access to the full text in one of two ways: either from an open-access PDF, or from n-grams obtained from the Internet Archive. You can learn where a work's full text came from at fulltext_origin.
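The difference between "indexed for search" and "readable by you" can be sketched as a small helper over a Work record that has already been decoded into a dict. The function name describe_fulltext is hypothetical, and the fulltext_origin values "pdf" and "ngrams" used in the example are assumptions. The field names has_fulltext, fulltext_origin and open_access.oa_url are the ones mentioned above.

```python
def describe_fulltext(work: dict) -> str:
    """Summarize whether a Work's full text is indexed and where to read it."""
    if not work.get("has_fulltext"):
        return "not indexed"
    # Where the indexed text came from (assumed values: "pdf" or "ngrams").
    origin = work.get("fulltext_origin") or "unknown"
    # Being indexed does not imply a readable copy; check open_access.oa_url.
    oa_url = (work.get("open_access") or {}).get("oa_url")
    readable = f"readable at {oa_url}" if oa_url else "no open-access copy listed"
    return f"indexed via {origin}; {readable}"


# Hypothetical record: indexed from n-grams, but with no open-access link.
print(describe_fulltext({
    "has_fulltext": True,
    "fulltext_origin": "ngrams",
    "open_access": {"oa_url": None},
}))
```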

About 57 million works have n-gram coverage through Internet Archive. OurResearch is the first organization to host this data in a highly usable way, and we are proud to integrate it into OpenAlex!

Curious about n-grams used in search? Browse them all via the API. Highly cited works and less recent works are more likely to have n-grams, as shown by the coverage charts below:
