Offline Translation in Python
nlp

Suppose you want to translate text from one language to another. Most people's first port of call is an online translation service from one of the big cloud providers, and most translation libraries in Python wrap Google Translate. However, the free services have rate limits, the paid services can quickly get expensive, and sometimes you have private data you don't want to upload online. An alternative is to run a machine translation model locally, and thanks to Hugging Face it's pretty simple to do.

Language Models as Classifiers
nlp

Probabilistic language models can be used directly as classifiers. I'm not sure this is a good idea; it seems less efficient than building a dedicated classifier, but it's an interesting one. A language model can give the probability of a given text under the model. Suppose we have multiple language models, each trained on a distinct corpus representing a class (e.g. genre, author, or even sentiment). Then we can calculate the probability of a text conditional on each model and compare them to determine the class.
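As a minimal sketch of the idea, here are Laplace-smoothed unigram "language models" (hypothetical toy corpora; a real setup would use stronger models) compared by log-probability:

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Build a Laplace-smoothed unigram language model from a list of texts."""
    counts = Counter(word for text in corpus for word in text.split())
    total = sum(counts.values())
    vocab = len(counts)
    # Smoothing gives unseen words a small non-zero probability
    return lambda word: (counts[word] + 1) / (total + vocab + 1)

def log_prob(model, text):
    """Log probability of the text under the model (independent words)."""
    return sum(math.log(model(word)) for word in text.split())

def classify(models, text):
    """Pick the class whose language model assigns the text highest probability."""
    return max(models, key=lambda label: log_prob(models[label], text))

models = {
    "positive": train_unigram(["great fun", "really great movie"]),
    "negative": train_unigram(["terrible plot", "really boring movie"]),
}
print(classify(models, "a great movie"))  # -> positive
```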

Sentence Boundaries in N-gram Language Models
nlp

An N-gram language model guesses the next word by looking at how frequently it has previously occurred after the preceding N-1 words. I think this is how my mobile phone suggests completions of text; if I type "I am" it suggests "glad", "not" or "very", which are likely continuations. To make the probabilities add up you need special markers for the start and end of each sentence, and I think the best way is to make them the same marker.
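A small sketch of the bigram (N=2) case, using a single marker for both sentence boundaries (the toy sentences are made up for illustration):

```python
from collections import Counter, defaultdict

BOUNDARY = "<s>"  # one marker serves as both end of one sentence and start of the next

def bigram_counts(sentences):
    """Count which word follows which across a corpus of sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [BOUNDARY] + sentence.split() + [BOUNDARY]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def suggest(counts, prev, n=3):
    """The most frequent words seen after `prev`."""
    return [word for word, _ in counts[prev].most_common(n)]

counts = bigram_counts(["i am glad", "i am not sure", "i am very happy"])
print(suggest(counts, "am"))
```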

Language Through Prism
nlp

Neural language models, which have advanced the state of the art for Natural Language Processing by a huge leap over previous methods, represent the individual tokens as a sequence of vectors. This sequence can be thought of as a discrete time-varying signal in each dimension, which you could decompose into low frequency components, representing information at the document level, and high frequency components, representing information at the token level.

Building NLP Datasets from Scratch
nlp

There's a common misconception that the best way to build up an NLP dataset is to first define a rigorous annotation schema and then crowdsource the annotations. The problem is that it's actually really hard to guess the right annotation schema up front, and this is often the hardest part on the modelling side (as opposed to the business side). This is explained wonderfully by spaCy's Matthew Honnibal at PyData 2018.

Topic Modelling to Bootstrap a Classifier
data

Sometimes you want to classify documents, but you don't have an existing classification. Building one that is mutually exclusive and completely exhaustive is harder than it sounds. Think about novels: is a Sherlock Holmes novel a mystery novel or a crime novel (or both)? Or do we go more granular and call it a detective novel, or even more specifically a whodunit? Topic modelling is a great way to quickly bootstrap a basic classification.

Finding Australian Locations with Placeholder
python

People write locations in many different ways. This makes them really hard to analyse, so we need a way to normalise them. I've already discussed how Placeholder is useful for coarse geocoding. Now I'm trying to apply it to normalising locations from Australian Job Ads in Common Crawl. The best practices when using Placeholder are: Go from the most specific location information (e.g. street address) to the most general (e.

Converting HTML to Text
python

I've been thinking about how to convert HTML to text for NLP. We want to at least extract the text, but preserving some of the formatting can make it easier to extract information down the line. Unfortunately it's a little tricky to get the segmentation right. The standard answer on Stack Overflow is to use Beautiful Soup's getText method, but that just replaces every tag with the separator argument, whether the tag is block level or inline.
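One way to respect the block/inline distinction with only the standard library is to subclass `html.parser.HTMLParser` and insert newlines only at block-level tags (the tag set below is a simplified assumption, not an exhaustive list):

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "ul", "ol"}

class TextExtractor(HTMLParser):
    """Collect text, breaking lines at block-level tags but not inline ones."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
    def handle_data(self, data):
        self.parts.append(data)
    def text(self):
        # Collapse the blank lines left by adjacent or nested block tags
        lines = "".join(self.parts).split("\n")
        return "\n".join(line.strip() for line in lines if line.strip())

parser = TextExtractor()
parser.feed("<p>A <b>bold</b> word.</p><p>Next paragraph.</p>")
print(parser.text())
```

Note the inline `<b>` tag doesn't split "A bold word." into fragments, which is exactly what a naive separator-based approach gets wrong.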

NLP Learning Resources in 2020
nlp

There are a lot of great freely available resources in NLP right now, and the field is moving quickly with the recent success of neural models. I wanted to mention a few that look interesting to me. Jurafsky and Martin's Speech and Language Processing: the third edition is a free ebook, still in progress, that covers a lot of the basic ideas in NLP. It has a great reputation in the NLP community and is nearly complete now.

Don't Stop Pretraining
nlp

In the past two years the best performing NLP models have been based on transformer models trained on an enormous corpus of text. By understanding how language in general works they are much more effective at detecting sentiment, classifying documents, answering questions and translating documents. However in any particular case we are solving a particular task in a certain domain. Can we get a better performing model by further training the language model on the specific domain or task?

Tangled up in BLEU
nlp

How can we evaluate how good a machine generated translation is? We could get bilingual readers to score the translation and average their scores, but this is expensive and time consuming. If we need hours of human time to evaluate every experiment, evaluation becomes a bottleneck for experimentation. This motivates automatic metrics for evaluating machine translation, and one of the oldest is the BiLingual Evaluation Understudy (BLEU).
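The excerpt doesn't give the formula, but a minimal sketch of BLEU's two core ideas — clipped n-gram precision and a brevity penalty — might look like this (simplified to a single reference and up to bigrams, rather than the standard four):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clip counts so repeating a reference word can't inflate precision
        overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Penalise candidates shorter than the reference
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # identical -> 1.0
```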

A Checklist for NLP models
data

When training machine learning models you typically get a training dataset for fitting the model and a test dataset for evaluating it (on small datasets techniques like cross-validation are common). You typically assume performance on your chosen metric on the test dataset is the best way of judging the model. However it's really easy for systematic biases or leakage to creep into the datasets, meaning that your evaluation will differ significantly from real world usage.

Sequential Weak Labelling for NER
data

The traditional way to train an NER model on a new domain is to annotate a whole bunch of data. Techniques like active learning can speed this up, but neural models starting from random weights in particular require a ton of data. A more modern approach is to take a large pretrained model and fine-tune it on your dataset. This is the approach of AdaptaBERT (paper), using BERT. However this takes a large amount of GPU compute and finicky regularisation techniques to get right.

pyBART: Better Dependencies for Information Extraction
python

Dependency trees are a remarkably powerful tool for information extraction. Neural based taggers are very good, and Universal Dependencies means the approach can be used for almost any language (although the rules are language specific). However syntax can get really strange, requiring increasingly complex rules to extract information. The pyBART system solves this by rewriting the dependency parse to be half a step closer to semantics than syntax. I've seen that dependency based rules are useful for extracting skills from noun phrases and adpositions.

Discovering Job Titles
jobs

A job ad title can contain a lot of things like location, skills or benefits. I want a list of just the job titles, without the rest of those things. This is a key piece of information extraction that can be used to better understand jobs, and built on by understanding how different job titles relate, for example with salary. To do this we first normalise the words in the ad title, doing things like removing plurals and expanding acronyms.

Making Words Singular
nlp

To normalise text in job titles I need a way to convert plural words into their singular form. For example a job for "nurses" is about a "nurse", a job for "salespeople" is about a "salesperson", a job for "workmen" is about a "workman" and a job about "midwives" is about a "midwife". I developed an algorithm that works well enough for converting plural words to singular without changing singular words like "sous chef", "business" or "gas".
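A rule-based sketch in this spirit (the specific rules and exception list below are my illustration, not the post's exact algorithm) handles the examples above:

```python
EXCEPTIONS = {"business", "gas", "sous"}  # words ending in s that aren't plurals here

def singularise(word):
    """Convert a plural word to singular with simple suffix rules."""
    lower = word.lower()
    if lower in EXCEPTIONS:
        return word
    if lower.endswith("people"):
        return word[:-6] + "person"   # salespeople -> salesperson
    if lower.endswith("men"):
        return word[:-3] + "man"      # workmen -> workman
    if lower.endswith("wives"):
        return word[:-5] + "wife"     # midwives -> midwife
    if lower.endswith("ies") and len(lower) > 4:
        return word[:-3] + "y"        # salaries -> salary
    if lower.endswith(("xes", "ches", "shes", "sses")):
        return word[:-2]              # boxes -> box, classes -> class
    if lower.endswith("s") and not lower.endswith(("ss", "us", "is")):
        return word[:-1]              # nurses -> nurse
    return word

print([singularise(w) for w in ["nurses", "salespeople", "workmen", "midwives", "gas"]])
```

Rules like these inevitably have exceptions ("buses" would come out wrong), which is why the exception list matters in practice.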

Summary of Finding Near Duplicates in Job Ads
nlp

I've been trying to find near duplicate job ads in the Adzuna Job Salary Predictions Kaggle Competition. Job ads can be duplicated because a hirer posts the same ad multiple times to a job board, or to multiple job boards. Finding exact duplicates is easy by sorting the job ads or a hash of them. But the job board may mangle the text in some way, or add its own footer, or the hirer might change a word or two in different posts.

Finding Duplicate Companies with Cliques
jobs

We've found pairs of near duplicate texts in 400,000 job ads from the Adzuna Job Salary Predictions Kaggle Competition. We tried to extract groups of similar ads by finding connected components in the graph of similar pairs. Unfortunately with a low similarity threshold we ended up with a chain of ads that were each similar to the next, but where the first and last ad were totally unrelated. One way to work around this is to find cliques: groups of job ads where every ad is similar to all of the others.
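The difference between components and cliques can be seen on a toy similarity graph; a brute-force clique search like the sketch below (fine for the small components that come out of deduplication, though real graph libraries scale better) excludes the chained-but-unrelated ads:

```python
from itertools import combinations

def largest_clique(nodes, edges):
    """Brute-force largest clique: try group sizes from largest down."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(nodes), 1, -1):
        for group in combinations(sorted(nodes), size):
            # A clique requires every pair in the group to be connected
            if all(frozenset(pair) in edge_set for pair in combinations(group, 2)):
                return set(group)
    return set()

# A chain 1-2-3-4 plus the extra edge 1-3: ads 1 and 4 are only chained together
edges = [(1, 2), (2, 3), (3, 4), (1, 3)]
nodes = {n for e in edges for n in e}
print(largest_clique(nodes, edges))  # -> {1, 2, 3}
```

The connected component here is all four ads, but the largest clique drops ad 4, which is only similar to its neighbour in the chain.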

Finding Common Substrings
data

I've found pairs of near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using MinHash. One thing that would be useful to know is which sections the ads have in common. Typically if they have a high 3-Jaccard similarity it's because they share some text. The most asymptotically efficient way to find the longest common substring would be to build a suffix tree, but for experimentation the heuristics in Python's difflib work well enough.
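The difflib approach is short; here's a sketch at the word level, using made-up job ad boilerplate as the example:

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Heuristic longest common run of words via difflib's SequenceMatcher."""
    a_words, b_words = a.split(), b.split()
    matcher = SequenceMatcher(None, a_words, b_words, autojunk=False)
    match = matcher.find_longest_match(0, len(a_words), 0, len(b_words))
    return " ".join(a_words[match.a:match.a + match.size])

ad1 = "great role apply now via our website terms and conditions apply"
ad2 = "exciting position apply now via our website call us today"
print(longest_common_substring(ad1, ad2))  # -> "apply now via our website"
```

Disabling `autojunk` matters on long texts, where difflib otherwise discards frequent elements as noise.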

Minhash Sets
jobs

We've found pairs of near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using MinHash. But many pairs will be part of the same group; in an extreme case a group of 5 job ads with identical texts produces 10 pairs. Both for interpretability and usability it makes sense to extract these groups from the pairs. One approach is to extract the groups directly with union-find: each band of the LSH consists of buckets of items that may be similar, and you could view the buckets as a partition of the corpus of all documents.
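Grouping pairs with union-find is a few lines of Python; this is a generic sketch with made-up ad ids, not the post's exact code:

```python
from collections import defaultdict

class UnionFind:
    """Disjoint sets with path halving, for merging duplicate pairs into groups."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

pairs = [(1, 2), (2, 3), (4, 5)]  # near-duplicate ad id pairs
uf = UnionFind()
for a, b in pairs:
    uf.union(a, b)

# Collect each ad under its root to recover the groups
groups = defaultdict(set)
for x in uf.parent:
    groups[uf.find(x)].add(x)
print(sorted(map(sorted, groups.values())))  # -> [[1, 2, 3], [4, 5]]
```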

Searching for Near Duplicates with Minhash
nlp

I'm trying to find near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. In the last article I built a collection of MinHashes of the 400,000 job ads in half an hour, stored in a 200MB file. Now I need to efficiently search through these MinHashes to find the near duplicates, because brute force search through them would take a couple of days on my laptop. MinHash was designed for this problem, as outlined in the original paper.

Detecting Near Duplicates with Minhash
nlp

I'm trying to find near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. I've found that the Jaccard index on n-grams is effective for finding them. Unfortunately it would take about 8 days to calculate the Jaccard index on all pairs of the 400,000 ads, and about 640GB of memory to store the results. While this is tractable, we can find almost all pairs with a significant overlap in half an hour in-memory using MinHash.
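The core trick is that the fraction of matching minimum hash values between two signatures estimates the Jaccard similarity of the underlying sets. A minimal sketch (using random linear hash functions; production code would use a library like datasketch):

```python
import random

def minhash_signature(tokens, num_hashes=64, seed=0):
    """MinHash signature: the minimum of each random hash function over the tokens."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1
    # Random hash functions h(x) = (a*hash(x) + b) mod a large prime
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    hashes = [hash(t) for t in tokens]
    return [min((a * h + b) % prime for h in hashes) for a, b in coeffs]

def estimate_jaccard(sig1, sig2):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox leaps over a lazy dog".split())
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
print(estimate_jaccard(sig_a, sig_b))  # close to the true Jaccard of 0.7
```

With 64 hash functions the estimate is typically within a few percent of the true value; more hashes trade memory for accuracy.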

Near Duplicates with TF-IDF and Jaccard
nlp

I've looked at finding near duplicate job ads using the Jaccard index on n-grams. I wanted to see whether using TF-IDF to weight the n-grams would result in a clearer separation. It works, but the results aren't much better, and there are some complications in using it in practice. When trying to find similar ads with the Jaccard index we looked at the proportion of n-grams they have in common relative to all the n-grams between them.
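The weighted version is usually defined as the generalised Jaccard index: the sum of per-term minimum weights over the sum of per-term maximum weights. A sketch with hypothetical IDF-style weights:

```python
def weighted_jaccard(weights_a, weights_b):
    """Generalised Jaccard: sum of min weights over sum of max weights."""
    keys = set(weights_a) | set(weights_b)
    minimum = sum(min(weights_a.get(k, 0), weights_b.get(k, 0)) for k in keys)
    maximum = sum(max(weights_a.get(k, 0), weights_b.get(k, 0)) for k in keys)
    return minimum / maximum if maximum else 0.0

# Toy IDF-style weights: rarer terms count for more than common ones
a = {"nurse": 3.0, "london": 2.0, "apply": 0.5}
b = {"nurse": 3.0, "leeds": 2.0, "apply": 0.5}
print(weighted_jaccard(a, b))
```

With unit weights this reduces to the ordinary Jaccard index, which makes it easy to compare the two directly.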

Near Duplicates with Jaccard
nlp

Finding near-duplicate texts is a hard problem, but the Jaccard index on n-grams is an effective measure that's efficient on small sets. I've tried it on the Adzuna Job Salary Predictions Kaggle Competition with good success: it works well at finding near-duplicates and even ads from the same company, although by itself it can't confirm which pairs are true duplicates. I've looked before at using the edit distance, which measures the minimum number of changes to transform one text into another, but it's slow to calculate.
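The measure itself is a few lines: shingle each text into word n-grams and take the intersection over the union (the sample ads here are invented for illustration):

```python
def shingles(text, n=3):
    """Set of word n-grams (shingles) in the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard index: size of intersection over size of union."""
    return len(a & b) / len(a | b) if a | b else 0.0

ad1 = "we are looking for a registered nurse in london"
ad2 = "we are looking for a registered nurse in leeds"
print(jaccard(shingles(ad1), shingles(ad2)))  # -> 0.75
```

A single changed word knocks out n shingles, so longer shingles make the measure stricter about near-duplicates.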

Edit Distance
nlp

Edit distance, also known as Levenshtein distance, is a useful way of measuring the similarity of two sequences. It counts the minimum number of substitutions, insertions and deletions needed to transform one sequence into another. I had a look at using this to compare duplicate ads, with reasonable results, but it's a little slow to run on many ads. I've previously looked at finding ads with exactly the same text in the Adzuna Job Salary Predictions Kaggle Competition, but there are a lot of ads that are slight variations.
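The standard dynamic programming solution makes the quadratic cost (and hence the slowness on many long ads) concrete:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    previous = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion from a
                current[j - 1] + 1,            # insertion into a
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))  # -> 3
```

Keeping only two rows means the memory use is linear even though the time is quadratic.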

Jaccard Shingle Inequality
maths

Two similar documents are likely to share many phrases relative to the number of words in the document. This approach could be useful if you're concerned with plagiarism and copyright, getting the same data through multiple sources, or finding versions of the same document. MinHash can quickly find pairs of items with a high Jaccard index, which we can run on sequences of w tokens. A hard question is: what's the right number for w?

Finding Exact Duplicate Text
python

Finding exact duplicate texts is quite straightforward and fast in Python, and can be useful for removing duplicate entries from a dataset. I tried this on the Adzuna Job Salary Predictions Kaggle Competition job ad texts and found it worked well. Naively finding exact duplicates by comparing every pair would be O(N^2), but if we sort the input, which is O(N log(N)), then duplicate items are adjacent. This scales really well to big datasets, and the duplicate entries can then be handled efficiently with itertools' groupby, much like uniq.
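The sort-then-groupby approach in full, on some invented ad texts:

```python
from itertools import groupby

ads = [
    "Nurse needed in London",
    "Sales assistant wanted",
    "Nurse needed in London",
    "Chef required immediately",
    "Sales assistant wanted",
]

# Sorting puts identical texts next to each other, so one pass finds all duplicates
duplicates = {}
for text, group in groupby(sorted(ads)):
    count = sum(1 for _ in group)
    if count > 1:
        duplicates[text] = count

print(duplicates)
```

This is effectively `sort | uniq -d` in Python; hashing each text first saves memory when the texts are long.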

Showing Side-by-Side Diffs in Jupyter
python

When comparing two texts it's useful to have a side-by-side comparison highlighting the differences. This is straightforward using HTML in Jupyter notebooks with Python and the built-in difflib. I used this to display job ads duplicated between different sites. For a long document it's important to align the sentences (otherwise it's hard to compare the differences) and highlight the individual differences at a word level. Overall the problems are breaking a text into sentences and words, aligning the sentences, finding word level differences and displaying them side-by-side.

Creating a Diff Recipe in Prodigy
nlp

I created a simple custom recipe to show diffs between two texts in Prodigy, which I intend to use to annotate near-duplicates. The process was pretty easy, but I got tripped up a little. I've been extracting job titles and skills from the job ads in the Adzuna Job Salary Predictions Kaggle Competition. One thing I noticed is there are a lot of job ads that are almost exactly the same; sometimes between the train and test set, which is a data leak.

Counting n-grams with Python and with Pandas
python

Sequences of words are useful for characterising and understanding text. If two texts share many sequences of 6 or 7 words it's very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like "New York" together rather than treating them as the separate words "New" and "York". To do this we need a way of extracting and counting sequences of words.
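In plain Python this is a one-liner with Counter and zip, shifting the word list against itself:

```python
from collections import Counter

def count_ngrams(words, n):
    """Count n-grams by zipping the word list against its shifted copies."""
    return Counter(zip(*(words[i:] for i in range(n))))

words = "new york is not the same as new york state".split()
bigrams = count_ngrams(words, 2)
print(bigrams.most_common(1))  # -> [(('new', 'york'), 2)]
```

The `zip(*(words[i:] ...))` trick works for any n: for n=2 it's equivalent to `zip(words, words[1:])`.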

Not using NER for extracting Job Titles
nlp

I've been trying to use a Named Entity Recogniser (NER) to extract job titles from the titles of job ads, to better understand a collection of job ads. While NER is great, it's not the right tool for this job, and I'm going to switch to a counting based approach. NER models try to extract things like the names of people, places or products. spaCy's NER model, which I used, is optimised for these cases (looking at things like the capitalisation of words).

Rules, Pipelines and Models
nlp

Over the past decade deep neural networks have revolutionised dealing with unstructured data. Problems that were once intractable, from identifying the objects in a video, through generating realistic text, to translating speech between languages, are now used in real-time production systems. You might think that today all problems on text, audio and images should be solved by training end-to-end neural networks. However rules and pipelines are still extremely valuable in building systems, and can leverage the information extracted from black-box neural networks.

Training a job title NER with Prodigy
nlp

In a couple of hours I trained a reasonable job title Named Entity Recogniser for job ad titles using Prodigy, with over 70% accuracy. While 70% doesn't sound great it's a bit ambiguous what a job title is, and getting exactly the bounds of the job title can be a hard problem. It's definitely good enough to be useful, and could be improved. After thinking through an annotation scheme for job titles I wanted to try annotating and training a model.

Annotating Job Titles
nlp

When doing Named Entity Recognition it's important to think about how to set up the problem. There's a balance between what you're trying to achieve and what the algorithm can do easily. Coming up with an annotation scheme is hard, because as soon as you start annotating you notice lots of edge cases. This post will go through an example with extracting job titles from job ads. In our previous post we looked at what was in a job ad title and a way of extracting some common job titles from the ads.

What's in a Job Ad Title?
nlp

The job title should succinctly summarise what the role is about, so it should tell you a lot about the role. However in practice job titles can range from very broad to very narrow, be obscure or acronym-laden and even hard to nail down. They're even hard to extract from a job ad's title - which is what I'll focus on in this series. In a previous series of posts I developed a method that could extract skills written in a very particular way.

Extracting Skills from Job Ads: Part 3 Conjugations
nlp

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "experience in telesales" using spaCy's dependency parse, but it wouldn't extract many types of experience from a job ad. Here we will extend these rules to extract lists of skills (for example extracting "telesales" and "receptionist" from "experience in telesales or receptionist"), which will let us analyse which experiences are related.

Extracting Skills from Job Ads: Part 2 - Adpositions
nlp

Extracting Experience in a Field I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "subsea cable engineering experience". This worked well, but extracted a lot of qualifiers that aren't skills (like "previous experience in", or "any experience in"). Here we will write rules to extract experience from phrases like "experience in subsea cable engineering", with much better results.

Extracting Skills from Job Ads: Part 1 - Noun Phrases
nlp

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. Using rules to extract noun phrases ending in "experience" (e.g. "subsea cable engineering experience") we can extract many skills, but there are a lot of false positives (e.g. "previous experience"). You can see the Jupyter notebook for the full analysis. Extracting noun phrases: it's common for ads to write something like "have this kind of experience":