# Counting n-grams with Python and with Pandas

Sequences of words are useful for characterising and understanding text. If two texts share many sequences of 6 or 7 words, it’s very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like “New York” together rather than treating them as the separate words “New” and “York”. To do this we need a way of extracting and counting sequences of words.

To find all n-grams of a sequence `xs`, that is, all contiguous subsequences of length `n`, we can use the following function:

```
def seq_ngrams(xs, n):
    return [xs[i:i+n] for i in range(len(xs)-n+1)]
```

For example:

```
> seq_ngrams([1,2,3,4,5], 3)
[[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

This works by iterating over all possible starting indices in the list with `range`, and then extracting the sequence of length `n` using `xs[i:i+n]`.
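Because the function only relies on `len` and slicing, it should behave the same way on any Python sequence. The sketch below restates `seq_ngrams` and tries it on a string, a tuple, and an input shorter than `n`; the variable names are only illustrative.

```python
def seq_ngrams(xs, n):
    """Return every contiguous length-n slice of the sequence xs."""
    return [xs[i:i+n] for i in range(len(xs) - n + 1)]

# Slicing a string gives strings, so this yields character trigrams
char_trigrams = seq_ngrams("hello", 3)  # ['hel', 'ell', 'llo']

# Slicing a tuple gives tuples
pairs = seq_ngrams((1, 2, 3), 2)  # [(1, 2), (2, 3)]

# If n is longer than the sequence, range() is empty and so is the result
too_long = seq_ngrams([1, 2], 3)  # []
```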

In the specific case of splitting text into sequences of words this is called w-shingling, and it can be done by first splitting the text on spaces:

```
def shingle(text, w):
    tokens = text.split(' ')
    return [' '.join(xs) for xs in seq_ngrams(tokens, w)]
```
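As a quick illustration, here is `shingle` applied to a short title, along with one caveat of `split(' ')`: repeated spaces produce empty tokens. The example strings are made up for this sketch.

```python
def seq_ngrams(xs, n):
    return [xs[i:i+n] for i in range(len(xs) - n + 1)]

def shingle(text, w):
    tokens = text.split(' ')
    return [' '.join(xs) for xs in seq_ngrams(tokens, w)]

bigrams = shingle("Modelling and simulation analyst", 2)
# ['Modelling and', 'and simulation', 'simulation analyst']

# split(' ') keeps an empty string between two consecutive spaces,
# so the shingles pick up stray leading or trailing spaces
double_space = shingle("New  York", 2)
# ['New ', ' York']
```

If the input may contain irregular whitespace, calling `text.split()` with no argument is one possible alternative, since it splits on any run of whitespace.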

Then to count the w-shingles in a corpus you can simply use `Counter` from the standard library:

```
from collections import Counter

def count_shingles(corpus, w):
    return Counter(ngram for text in corpus for ngram in shingle(text, w))
```
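For instance, on a small made-up corpus of job titles (an assumption for this sketch, not data from a real source), the counting could be used like this:

```python
from collections import Counter

def seq_ngrams(xs, n):
    return [xs[i:i+n] for i in range(len(xs) - n + 1)]

def shingle(text, w):
    tokens = text.split(' ')
    return [' '.join(xs) for xs in seq_ngrams(tokens, w)]

def count_shingles(corpus, w):
    return Counter(ngram for text in corpus for ngram in shingle(text, w))

# Illustrative corpus: any iterable of strings works
corpus = [
    "Engineering Systems Analyst",
    "Systems Analyst",
    "Stress Engineer Glasgow",
]

counts = count_shingles(corpus, 2)

# most_common returns (shingle, count) pairs, highest count first
top = counts.most_common(1)  # [('Systems Analyst', 2)]
```

Since the `Counter` is fed from a generator expression, the shingles are never all held in a list at once; only the counts are kept in memory.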

If you’re dealing with very large collections you can swap `Counter` for bounter, an approximate counter intended as a drop-in replacement.

The rest of this article explores a slower way to do this with Pandas; I don’t advocate using it but it’s an interesting alternative.

# Counting n-grams with Pandas

Suppose we have some text in the column `text` of a Pandas dataframe `df` and want to find the w-shingles.

```
                               text
0       Engineering Systems Analyst
1           Stress Engineer Glasgow
2  Modelling and simulation analyst
```

Each string can be turned into a list of words using `str.split`, and the lists can then be unnested into one row per word with `explode`.

```
words = (df
         .text
         .str.split(' ')
         .explode()
)
```

This would result in one word per line. The index is preserved so you can realign it with the original series.

```
0 Engineering
0 Systems
0 Analyst
1 Stress
1 Engineer
1 Glasgow
2 Modelling
2 and
2 simulation
2 analyst
```
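To make the index behaviour concrete, the following sketch rebuilds the example dataframe, explodes it, and then uses the repeated index to attach a per-row word count back onto `df`. The column name `n_words` is a hypothetical name chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "Engineering Systems Analyst",
    "Stress Engineer Glasgow",
    "Modelling and simulation analyst",
]})

# One word per row; each word keeps the index label of its source row
words = df["text"].str.split(" ").explode()
index_labels = list(words.index)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

# Aggregating on the index gives one value per original row,
# and assignment aligns it with df on that same index
df["n_words"] = words.groupby(level=0).size()
```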

To get sequences of words you can use the `shift` method, which is like `lead` and `lag` in SQL. Grouping by the index first stops the shift from pulling a word across from the next text.

`next_word = words.groupby(level=0).shift(-1)`

Resulting in:

```
0 Systems
0 Analyst
0 NaN
1 Engineer
1 Glasgow
1 NaN
2 and
2 simulation
2 analyst
2 NaN
```

These can be recombined with `(words + ' ' + next_word).dropna()`, where `dropna` discards the last word of each text because it has no successor:

```
0 Engineering Systems
0 Systems Analyst
1 Stress Engineer
1 Engineer Glasgow
2 Modelling and
2 and simulation
2 simulation analyst
```

Finally you can count how often each sequence occurs with `value_counts`.
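Putting the steps above together, a minimal end-to-end sketch for bigrams on the example dataframe might look like this. It assumes words should be joined with a single space, matching the output shown earlier.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "Engineering Systems Analyst",
    "Stress Engineer Glasgow",
    "Modelling and simulation analyst",
]})

# One word per row, index preserved
words = df["text"].str.split(" ").explode()

# The following word within the same text (NaN for each text's last word)
next_word = words.groupby(level=0).shift(-1)

# Concatenating with NaN gives NaN, so dropna removes the incomplete pairs
bigrams = (words + " " + next_word).dropna()

# Frequency of each distinct bigram across the whole column
bigram_counts = bigrams.value_counts()
```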

While this is a bit messier and slower than the pure Python method, it may be useful if you needed to realign it with the original dataframe. This can be abstracted to arbitrary n-grams:

```
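import pandas as pd

def count_ngrams(series: pd.Series, n: int) -> pd.Series:
    # One word per row; the index still identifies the source text
    words = series.str.split(' ').explode()
    ngrams = words
    for i in range(1, n):
        # Append the word i places ahead within the same text
        ngrams = ngrams + ' ' + words.groupby(level=0).shift(-i)
    # Positions with fewer than n words left in their text are NaN
    return ngrams.dropna().value_counts()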
```
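The realignment with the original dataframe mentioned above could be sketched as follows. `ngrams_per_row` is a hypothetical helper written for this example: it builds the n-grams by shifting the exploded words, but returns them per row instead of counting them, so they can be joined back to `df` on the index.

```python
import pandas as pd

def ngrams_per_row(series: pd.Series, n: int) -> pd.Series:
    # Hypothetical helper: n-grams indexed by their source row, not counted
    words = series.str.split(" ").explode()
    ngrams = words
    for i in range(1, n):
        # Always shift the original words, never the partial n-grams
        ngrams = ngrams + " " + words.groupby(level=0).shift(-i)
    return ngrams.dropna()

df = pd.DataFrame({"text": [
    "Engineering Systems Analyst",
    "Stress Engineer Glasgow",
    "Modelling and simulation analyst",
]})

trigrams = ngrams_per_row(df["text"], 3)

# Each trigram keeps the index of its source row, so a plain index join
# puts the original text (or any other column) next to it
aligned = trigrams.to_frame("ngram").join(df)
```

Here `aligned` has one row per trigram, with the source row's `text` repeated alongside it.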

This is similar to the approach of the R `tidytext` library for extracting n-grams, whose `unnest_tokens` function can produce n-grams of arbitrary length.