Counting n-grams with Python and with Pandas
Sequences of words are useful for characterising and understanding text. If two texts share many of the same sequences of 6 or 7 words, it’s very likely they have a similar origin. When splitting text apart it can be useful to keep common phrases like “New York” together rather than treating them as the separate words “New” and “York”. To do this we need a way of extracting and counting sequences of words.
To find all the n-grams of a sequence xs, that is all its contiguous subsequences of length n, we can use the following function:
def seq_ngrams(xs, n):
    return [xs[i:i+n] for i in range(len(xs)-n+1)]
For example:
> seq_ngrams([1,2,3,4,5], 3)
[[1,2,3], [2,3,4], [3,4,5]]
This works by iterating over all possible starting indices in the list with range, and then extracting the sequence of length n using xs[i:i+n].
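For a list of length 5 and n = 3 the starting indices are 0, 1 and 2; if n is larger than the list there are no valid starting indices and the result is simply an empty list:

> list(range(len([1,2,3,4,5]) - 3 + 1))
[0, 1, 2]
> seq_ngrams([1,2], 3)
[]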
In the specific case of splitting text into sequences of words this is called w-shingling, and can be done by splitting the text on spaces:
def shingle(text, w):
    tokens = text.split(' ')
    return [' '.join(xs) for xs in seq_ngrams(tokens, w)]
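For example, on one of the job titles that appears in the dataframe below:

> shingle('Stress Engineer Glasgow', 2)
['Stress Engineer', 'Engineer Glasgow']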
Then to count the w-shingles in a corpus you can simply use the inbuilt Counter:
from collections import Counter
def count_shingles(corpus, w):
    return Counter(ngram for text in corpus for ngram in shingle(text, w))
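Running this over the three job titles used as examples below gives each bigram a count of one:

> corpus = ['Engineering Systems Analyst', 'Stress Engineer Glasgow', 'Modelling and simulation analyst']
> count_shingles(corpus, 2)
Counter({'Engineering Systems': 1, 'Systems Analyst': 1, 'Stress Engineer': 1, 'Engineer Glasgow': 1, 'Modelling and': 1, 'and simulation': 1, 'simulation analyst': 1})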
If you’re dealing with very large collections you can swap Counter for its approximate drop-in replacement bounter.
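A minimal sketch of what that swap could look like, assuming bounter’s documented interface of a size_mb memory bound and an update method (check the library’s README for the exact API):

from bounter import bounter

def count_shingles_bounter(corpus, w, size_mb=128):
    # Approximate counts within a fixed memory budget; rare items may be slightly off
    counts = bounter(size_mb=size_mb)
    for text in corpus:
        counts.update(shingle(text, w))
    return counts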
The rest of this article explores a slower way to do this with Pandas; I don’t advocate using it but it’s an interesting alternative.
Counting n-grams with Pandas
Suppose we have some text in the column text of a Pandas dataframe df and want to find the w-shingles.
text
0 Engineering Systems Analyst
1 Stress Engineer Glasgow
2 Modelling and simulation analyst
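A dataframe like this can be built directly from the job titles above:

import pandas as pd

df = pd.DataFrame({'text': [
    'Engineering Systems Analyst',
    'Stress Engineer Glasgow',
    'Modelling and simulation analyst',
]})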
This can be turned into an array using split and then unnested with explode.
words = (df
    .text
    .str.split(' ')
    .explode()
)
This would result in one word per line. The index is preserved so you can realign it with the original series.
0 Engineering
0 Systems
0 Analyst
1 Stress
1 Engineer
1 Glasgow
2 Modelling
2 and
2 simulation
2 analyst
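Since the index is preserved, the words can be joined straight back onto the original dataframe, for example:

# Attach each word to the row it came from; rename avoids clashing with the text column
df.join(words.rename('word'))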
To get sequences of words you can use the shift operator, which is like lead and lag in SQL.
next_word = words.groupby(level=0).shift(-1)
Resulting in:
0 Systems
0 Analyst
0 NaN
1 Engineer
1 Glasgow
1 NaN
2 and
2 simulation
2 analyst
2 NaN
and these can be recombined with (words + ' ' + next_word).dropna():
0 Engineering Systems
0 Systems Analyst
1 Stress Engineer
1 Engineer Glasgow
2 Modelling and
2 and simulation
2 simulation analyst
Finally you can get the total counts with value_counts.
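Putting those steps together for bigrams:

bigrams = (words + ' ' + next_word).dropna()
bigram_counts = bigrams.value_counts()
# Each two-word phrase with the number of times it occurs in the corpus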
While this is a bit messier and slower than the pure Python method, it may be useful if you need to realign the n-grams with the original dataframe. This can be abstracted to arbitrary n-grams:
import pandas as pd

def count_ngrams(series: pd.Series, n: int) -> pd.Series:
    # One word per row; the original index is kept so words stay grouped by row
    tokens = series.copy().str.split(' ').explode()
    ngrams = tokens.copy()
    for i in range(1, n):
        # Append the word i positions further along within the same row
        ngrams += ' ' + tokens.groupby(level=0).shift(-i)
    ngrams = ngrams.dropna()
    return ngrams.value_counts()
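Applied to the example dataframe for trigrams (the exact display depends on your pandas version):

count_ngrams(df['text'], 3)
# Engineering Systems Analyst    1
# Stress Engineer Glasgow        1
# Modelling and simulation       1
# and simulation analyst         1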
This is similar to the approach of the R tidytext library for extracting n-grams, which has the function unnest_tokens that can produce ngrams of arbitrary length.