Sequences of words are useful for characterising text and for understanding text. If two texts have many similar sequences of 6 or 7 words it's very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like "New York" together rather than treating them as the separate words "New" and "York". To do this we need a way of extracting and counting sequences of words.
To find all the n-grams of a sequence xs, that is, its contiguous subsequences of length n, we can use the following function:
    def seq_ngrams(xs, n):
        return [xs[i:i+n] for i in range(len(xs)-n+1)]

    > seq_ngrams([1,2,3,4,5], 3)
    [[1,2,3], [2,3,4], [3,4,5]]
This works by iterating over all possible starting indices in the list with range, and then extracting the sequence of length n that starts at each index with the slice xs[i:i+n].
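The behaviour at the edges follows directly from the range of starting indices. As a minimal sketch (the inputs below are made up for illustration), the same function works on any sliceable sequence, and returns nothing when n is longer than the input because the range is then empty:

```python
def seq_ngrams(xs, n):
    # One window per valid starting index; range() is empty when n > len(xs).
    return [xs[i:i+n] for i in range(len(xs) - n + 1)]

# Slicing preserves the input type: lists give lists, strings give strings.
print(seq_ngrams([1, 2, 3, 4, 5], 3))  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(seq_ngrams("abcd", 2))           # ['ab', 'bc', 'cd']
print(seq_ngrams((1, 2), 3))           # []
```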
When the sequence is the words of a text, this is called w-shingling. It can be done by splitting the text on spaces and joining each n-gram of tokens back into a string:
    def shingle(text, w):
        tokens = text.split(' ')
        return [' '.join(xs) for xs in seq_ngrams(tokens, w)]
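As a rough illustration of the earlier point that texts sharing many word sequences probably share an origin, the sketch below compares the shingle sets of two made-up sentences. The helper shingle_overlap and the choice of Jaccard similarity (shared shingles divided by all distinct shingles) are assumptions added for this example, not something the method above requires.

```python
def seq_ngrams(xs, n):
    return [xs[i:i+n] for i in range(len(xs) - n + 1)]

def shingle(text, w):
    tokens = text.split(' ')
    return [' '.join(xs) for xs in seq_ngrams(tokens, w)]

def shingle_overlap(a, b, w):
    """Hypothetical helper: Jaccard similarity of the w-shingle sets of a and b."""
    sa, sb = set(shingle(a, w)), set(shingle(b, w))
    if not sa and not sb:
        return 0.0  # neither text is long enough to have a w-shingle
    return len(sa & sb) / len(sa | sb)

print(shingle("the quick brown fox", 2))
# ['the quick', 'quick brown', 'brown fox']

# The two sentences share 2 of their 6 distinct 2-shingles.
similarity = shingle_overlap("the quick brown fox jumps",
                             "the quick brown dog jumps", 2)
print(similarity)
```

With longer shingles (the 6 or 7 words mentioned above) even a small overlap becomes strong evidence, since long sequences rarely coincide by chance.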
Then to count the w-shingles in a corpus you can simply use the inbuilt Counter:
    from collections import Counter

    def count_shingles(corpus, w):
        return Counter(ngram for text in corpus for ngram in shingle(text, w))
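A short usage sketch, on a made-up corpus of job titles, shows what the counter contains. The corpus is an assumption for illustration only.

```python
from collections import Counter

def seq_ngrams(xs, n):
    return [xs[i:i+n] for i in range(len(xs) - n + 1)]

def shingle(text, w):
    tokens = text.split(' ')
    return [' '.join(xs) for xs in seq_ngrams(tokens, w)]

def count_shingles(corpus, w):
    return Counter(ngram for text in corpus for ngram in shingle(text, w))

# Illustrative corpus: every title contains the phrase "Stress Engineer".
corpus = [
    "Senior Stress Engineer",
    "Stress Engineer Glasgow",
    "Graduate Stress Engineer",
]
counts = count_shingles(corpus, 2)
print(counts.most_common(1))  # [('Stress Engineer', 3)]
```

Because the result is an ordinary Counter, methods such as most_common are available for picking out the frequent phrases.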
If you're dealing with very large collections you can use the approximate counter bounter as a drop-in replacement for Counter.
The rest of this article explores a slower way to do this with Pandas; I don't advocate using it but it's an interesting alternative.
Counting n-grams with Pandas
Suppose we have some text in the text column of a Pandas dataframe df and want to find its w-shingles:
                                   text
    0       Engineering Systems Analyst
    1           Stress Engineer Glasgow
    2  Modelling and simulation analyst
Each text can be turned into an array of words using split and then unnested with explode:
    words = (
        df.text
        .str.split(' ')
        .explode()
    )
This would result in one word per line. The index is preserved so you can realign it with the original series.
    0    Engineering
    0        Systems
    0        Analyst
    1         Stress
    1       Engineer
    1        Glasgow
    2      Modelling
    2            and
    2     simulation
    2        analyst
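To get sequences of words you can use the shift operator, which is like lag in SQL (or lead, when the offset is negative as it is here). Grouping by the index first stops the last word of one text from being paired with the first word of the next: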
    next_word = words.groupby(level=0).shift(-1)

    0       Systems
    0       Analyst
    0           NaN
    1      Engineer
    1       Glasgow
    1           NaN
    2           and
    2    simulation
    2       analyst
    2           NaN
These can be recombined, with a space in between and the incomplete pairs dropped, using (words + ' ' + next_word).dropna():
    0    Engineering Systems
    0        Systems Analyst
    1        Stress Engineer
    1       Engineer Glasgow
    2          Modelling and
    2         and simulation
    2     simulation analyst
Finally you can find the total count of each bigram by calling value_counts() on the result.
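Putting the steps above together, a minimal end-to-end sketch for bigrams might look like this. It assumes the example dataframe shown earlier; the names bigrams, bigram_counts and aligned are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "Engineering Systems Analyst",
    "Stress Engineer Glasgow",
    "Modelling and simulation analyst",
]})

# One word per row; the index still points at the source row.
words = df.text.str.split(' ').explode()
# Next word within the same text (NaN after the last word of each text).
next_word = words.groupby(level=0).shift(-1)
# Join each word to its successor and drop the incomplete pairs.
bigrams = (words + ' ' + next_word).dropna()
# Count how often each bigram occurs across all texts.
bigram_counts = bigrams.value_counts()
print(bigram_counts)

# Because the index is preserved, each bigram can be lined up
# with the row of the dataframe it came from.
aligned = df.join(bigrams.rename("bigram"))
print(aligned)
```

Every bigram in this small example occurs once, so the counts are only interesting on a larger corpus; the join at the end is one way to use the preserved index.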
While this is a bit messier and slower than the pure Python method, it may be useful if you need to realign the n-grams with the original dataframe. It can be generalised to arbitrary n-grams:
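    import pandas as pd

    def count_ngrams(series: pd.Series, n: int) -> pd.Series:
        # One word per row, keeping the index of the source text
        words = series.str.split(' ').explode()
        ngrams = words
        for i in range(1, n):
            # Append the word i positions ahead within the same text
            ngrams = ngrams + ' ' + words.groupby(level=0).shift(-i)
        # Windows that run past the end of a text contain NaN; drop them
        return ngrams.dropna().value_counts()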
This is similar to the approach of the R tidytext library, whose unnest_tokens function can produce n-grams of arbitrary length.