# Finding Exact Duplicate Text

Finding exact duplicates texts is quite straightforward and fast in Python. This can be useful for removing duplicate entries in a dataset. I tried this on the Adzuna Job Salary Predictions Kaggle Competition job ad texts and found it worked well.

Naively finding exact duplicates by comparing every pair would be `O(N^2)`

, but if we sort the input, which is `O(N log(N))`

, then duplicate items are adjacent. This scales really well to big datasets, and then the duplicate entries can be handled efficiently with itertools `groupby`

to do something like `uniq`

. In this case we can get the indices of the exact duplicates from a list (adding them with `enumerate`

).

```
from itertools import groupby
def second(x):
return x[1]
def exact_duplicate_indices(items)
= []
exact_duplicates for _key, group in groupby(sorted(enumerate(items), key=second), key=second):
= list(group)
group if len(group) > 1:
0] for item in group])
exact_duplicates.append([item[return exact_duplicates
```

If memory was an issue because the items are really large we could even hash them to make the `sorted`

possible. With the 400,000 ad descriptions, on average 1600 characters long, this took 1.3s on my laptop.

I could then look at the size of the duplicate clusters and investigate them:

```
from collections import counter
duplicates = exact_duplicate_indices(ads)
Counter(len(cluster) for cluster in duplicates)
```

This gave a table of frequencies; most clusters only have one duplicate, but there are a few with many duplicates.

Cluster Size | Frequency |
---|---|

2 | 5836 |

3 | 293 |

4 | 71 |

5 | 26 |

6 | 7 |

7 | 8 |

8 | 5 |

9 | 7 |

10 | 1 |

12 | 1 |

13 | 1 |

14 | 2 |

15 | 1 |

24 | 1 |

I then inspected the largest clusters. In my case I kept them in the same order as a dataframe they came from and could index into them with `iloc`

.

```
# Get the 20 largest clusters
= sorted(exact_duplicates, key=len, reverse=True)[:20]
megaclusters
# Look it up in the original dataframe
0]] df.iloc[megaclusters[
```

This let me discover that in most of these cases it’s due to being posted in multiple locations. I verified this with a quick analysis that the size of the cluster was almost the number of unique locations.

`[(len(megacluster), len(df.iloc[megacluster].LocationRaw.unique())) for megacluster in megaclusters]`

This is useful for exact duplicat, but what about slight variations? In this case the largest two clusters were in fact the same ad posted to different sites with a slight variation in the footer text. I’ll look at this in the next part of the series.