Finding Common Substrings
I’ve found pairs of near duplicates texts in the Adzuna Job Salary Predictions Kaggle Competition using Minhash. One thing that would be useful to know is what the common sections of the ads are. Typically if they have a high 3-Jaccard similarity it’s because they have some text in common.
The most asymptotically efficient to find the longest common substring would be to build a suffix tree, but for experimentation the heuristics in Python’s DiffLib work well enough.
I define a function that gets all common strings above a certain length. I look for all pairs difflib considers equal and print them out; this won’t get all common substrings but works well enough on the job ads I tried them on. A benefit of using difflib is that if we want to find the longest common string of tokens we can just pass in a
and b
as lists of tokens.
def common_substrings(a, b, min_length=7):
= []
seqs = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
seqmatcher for tag, a0, a1, b0, b1 in seqmatcher.get_opcodes():
if tag == 'equal' and a1 - a0 >= min_length:
seqs.append(a[a0:a1])return seqs
Once we have a way to find common substrings of pairs it’s straightforward to extend it to a list of substrings. In particular we just look to see for each existing common substrings whether it has any common substrings with the next text.
def all_common_substrings(args, min_length=7):
= None
seqs for arg in args:
if seqs is None:
= [arg]
seqs continue
= []
new_seqs for seq in seqs:
+= common_substrings(arg, seq, min_length)
new_seqs = new_seqs
seqs return seqs
This could easily be extended to allow a few mismatches by collecting across difflib tags other than equal
up to some length of tokens.
Exact brute force approach
It’s good to compare this with an exact solution to make sure the difflib heuristics are actually working. I always find it’s good to start with a simple slow obviously correct solution before trying to build a more complex efficient algorithm.
In particular we could start by producing all substrings of a string by iterating over each possible starting point and length:
def all_substrings(s):
for i in range(len(s)):
for j in range(len(s)-i):
yield s[i:i+j+1]
To know whether one string is a substring of another we can just check whether it matches a any position:
def contains_substring(a, b):
"""Does a constrin substring b"""
for i in range(len(a) - len(b) + 1):
if a[i:i+len(b)] == b:
return True
Then we could find the common substrings by just checking if any of the substrings of one are in the other:
def naive_common_substrings(a, b):
for substring in all_substrings(a):
if contains_substring(b, substring):
yield substring
This will output a lot of substrings because any substring of a common substring is also a common substring. For example if “the” is in common, then so is “t”, “th”, “h”, “he” and “e”. We can filter this down to the “proper substrings”, those that aren’t contained in a larger substring.
def proper_substrings(a):
= []
proper for s in a:
if any(contains_substring(p, s) for p in proper):
continue
= [contains_substring(s, p) for p in proper]
supersequence if any(supersequence):
= s
val for idx, value in enumerate(supersequence):
if value:
= val
proper[idx] = None
val = [p for p in proper if p]
proper else:
proper.append(s)return proper
Note that this implementation is awfully slow; the operations in calcualting proper_substrings(naive_common_substrings(a, b))
is quadratic in the length of a
and roughly linear in the length of b
. But it’s good for a sanity check on some simple strings, and using it I find the difflib
captures most of the common substrings on the job ads I tried.