Counting n-grams with Python and with Pandas

python, data, nlp

Published April 7, 2020

Sequences of words are useful for characterising and understanding text. If two texts share many of the same sequences of 6 or 7 words, it’s very likely they have a common origin. When splitting apart text it can be useful to keep common phrases like “New York” together rather than treating them as the separate words “New” and “York”. To do this we need a way of extracting and counting sequences of words.

To find all n-grams, that is, contiguous subsequences of length n, of a sequence xs we can use the following function:

def seq_ngrams(xs, n):
    return [xs[i:i+n] for i in range(len(xs)-n+1)]

For example:

>>> seq_ngrams([1,2,3,4,5], 3)
[[1,2,3], [2,3,4], [3,4,5]]

This works by iterating over all possible starting indices with range(len(xs)-n+1), and then extracting the slice of length n using xs[i:i+n].

In the specific case of splitting text into sequences of words this is called w-shingling, and can be done by splitting on spaces:

def shingle(text, w):
    tokens = text.split(' ')
    return [' '.join(xs) for xs in seq_ngrams(tokens, w)]
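
For example, on one of the job titles that appears later in this article:

>>> shingle('Stress Engineer Glasgow', 2)
['Stress Engineer', 'Engineer Glasgow']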

Then to count the w-shingles in a corpus you can simply use Counter from the standard library:

from collections import Counter
def count_shingles(corpus, w):
    return Counter(ngram for text in corpus for ngram in shingle(text, w))
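
For example, with a small made-up corpus (note the bigram 'new york' appears in both texts):

>>> count_shingles(['new york city', 'new york state'], 2)
Counter({'new york': 2, 'york city': 1, 'york state': 1})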

If you’re dealing with very large collections you can swap Counter for bounter, an approximate memory-bounded counter, as a drop-in replacement.
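
Something like the following should work; this is a minimal sketch assuming bounter's dictionary-style API (size_mb bounds the memory the counter uses, and count_shingles_bounded is a hypothetical name):

from bounter import bounter

def count_shingles_bounded(corpus, w, size_mb=128):
    # approximate, memory-bounded counter with a Counter-like interface
    counts = bounter(size_mb=size_mb)
    counts.update(ngram for text in corpus for ngram in shingle(text, w))
    return counts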

The rest of this article explores a slower way to do this with Pandas; I don’t advocate using it but it’s an interesting alternative.

Counting n-grams with Pandas

Suppose we have some text in the text column of a Pandas DataFrame df and want to find the w-shingles.

                                                         text
0                                 Engineering Systems Analyst
1                                     Stress Engineer Glasgow
2                            Modelling and simulation analyst
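
To follow along, the DataFrame above can be constructed like this:

import pandas as pd

df = pd.DataFrame({'text': [
    'Engineering Systems Analyst',
    'Stress Engineer Glasgow',
    'Modelling and simulation analyst',
]})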

Each text can be turned into a list of words using str.split and then unnested with explode.

words = (
    df.text
    .str.split(' ')
    .explode()
)

This results in one word per row. The index is preserved, so you can realign it with the original series.

0         Engineering 
0         Systems
0         Analyst
1         Stress
1         Engineer
1         Glasgow
2         Modelling
2         and
2         simulation
2         analyst

To get sequences of words you can use the shift method, which is like LEAD and LAG in SQL. Grouping by the first index level ensures the shift doesn’t cross over from one original text into the next.

next_word = words.groupby(level=0).shift(-1)

Resulting in:

0         Systems
0         Analyst
0         NaN 
1         Engineer
1         Glasgow
1         NaN
2         and
2         simulation
2         analyst
2         NaN

and these can be recombined into bigrams with (words + ' ' + next_word).dropna():

0         Engineering Systems
0         Systems Analyst
1         Stress Engineer
1         Engineer Glasgow
2         Modelling and
2         and simulation
2         simulation analyst

Finally you can get the counts with value_counts.
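
Putting the steps together (bigram_counts is just an illustrative name):

bigram_counts = (words + ' ' + next_word).dropna().value_counts()

For this small example every bigram occurs exactly once.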

While this is a bit messier and slower than the pure Python method, it may be useful if you need to realign the counts with the original dataframe. The approach can be abstracted to arbitrary n-grams:

import pandas as pd

def count_ngrams(series: pd.Series, n: int) -> pd.Series:
    words = series.str.split(' ').explode()
    ngrams = words.copy()
    for i in range(1, n):
        # append the word i positions ahead within each original row;
        # rows that run off the end become NaN and are dropped at the end
        ngrams = ngrams + ' ' + words.groupby(level=0).shift(-i)
    return ngrams.dropna().value_counts()
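
For example, count_ngrams(df.text, 2) reproduces the bigram counts computed step by step above, and count_ngrams(df.text, 3) counts the trigrams.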

This is similar to the approach of the R tidytext library, whose unnest_tokens function can produce n-grams of arbitrary length.