Suppose you want to translate text from one language to another. Most people's first port of call is an online translation service from one of the big cloud providers, and most translation libraries in Python wrap Google Translate. However the free services have rate limits, the paid services can quickly get expensive, and sometimes you have private data you don't want to upload online. An alternative is to run a machine translation model locally, and thanks to Hugging Face it's pretty simple to do.
Probabilistic language models can be used directly as a classifier. I'm not sure this is a good idea (it seems less efficient than training a classifier directly), but it's an interesting one. A language model can give the probability of a given text under the model. Suppose we have multiple language models, each trained on a distinct corpus representing a class (e.g. genre, author, or even sentiment). Then we can calculate the probability of a text conditional on each model and compare them to predict the class.
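A minimal sketch of the idea, using Laplace-smoothed unigram language models over tiny invented corpora (the class names and example texts are purely illustrative):

```python
import math
from collections import Counter

def train_unigram(corpus_tokens):
    """A unigram language model is just word counts for one class's corpus."""
    return Counter(corpus_tokens)

def log_prob(tokens, counts, vocab_size, alpha=1.0):
    """Log-probability of the tokens under a Laplace-smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[t] + alpha) / (total + alpha * vocab_size))
        for t in tokens
    )

def classify(tokens, models):
    """Pick the class whose language model assigns the text highest probability."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    return max(models, key=lambda c: log_prob(tokens, models[c], len(vocab)))

models = {
    "positive": train_unigram("great movie loved the acting great fun".split()),
    "negative": train_unigram("terrible movie hated the plot awful boring".split()),
}
print(classify("loved the fun acting".split(), models))  # → positive
```

Real N-gram or neural language models would give much better probability estimates, but the comparison of per-class log-probabilities works the same way.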
An N-gram language model guesses the next possible word by looking at how frequently it has previously occurred after the previous N-1 words. I think this is how my mobile phone suggests completions of text; if I type "I am" it suggests "glad", "not" or "very", which are likely continuations. To make everything add up you have to have special markers for the start and end of the sentence, and I think the best way is to make them the same marker.
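A bigram (N=2) version of this suggester can be sketched with a couple of Counters; the training sentences are invented, and I use distinct `<s>`/`</s>` boundary markers here where a single shared marker would also work:

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"   # sentence boundary markers

def train_bigrams(sentences):
    """Count which words follow each word, padding with boundary markers."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def suggest(model, prev_word, k=3):
    """Suggest the k most frequent continuations of prev_word."""
    return [w for w, _ in model[prev_word].most_common(k)]

model = train_bigrams([
    "i am glad to see you",
    "i am not sure",
    "i am very happy",
    "i am glad you came",
])
print(suggest(model, "am"))  # most frequent words seen after "am"
```

A trigram model would condition on the previous two words instead of one, trading sparser counts for more context.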
Neural language models, which have advanced the state of the art for Natural Language Processing by a huge leap over previous methods, represent the individual tokens as a sequence of vectors. This sequence of vectors can be thought of as a discrete time-varying signal in each dimension, and you could decompose this signal into low frequency components, representing information at the document level, and high frequency components, representing information at the token level. Discarding the high frequency components would then keep just the higher level information.
There's a common misconception that the best way to build up an NLP dataset is to first define a rigorous annotation schema and then crowdsource the annotations. The problem is that it's actually really hard to guess the right annotation schema up front, and this is often the hardest part on the modelling side (as opposed to the business side). This is explained wonderfully by spaCy's Matthew Honnibal at PyData 2018.
Sometimes you want to classify documents, but you don't have an existing classification. Building a classification that is mutually exclusive and completely exhaustive is actually very hard, and topic modelling is a great way to quickly get started with a basic one. Creating a classification may sound easy until you try to do it. Think about novels; is a Sherlock Holmes novel a mystery novel or a crime novel (or both)? Or do we go more granular and call it a detective novel, or even more specifically a whodunit?
People write locations in many different ways. This makes them really hard to analyse, so we need a way to normalise them. I've already discussed how Placeholder is useful for coarse geocoding. Now I'm trying to apply it to normalising locations from Australian Job Ads in Common Crawl. The best practices when using Placeholder are: Go from the most specific location information (e.g. street address) to the most general (e.
I've been thinking about how to convert HTML to text for NLP. We want to at least extract the text, but if we can preserve some of the formatting it can make it easier to extract information down the line. Unfortunately it's a little tricky to get the segmentation right. The standard answer on Stack Overflow is to use Beautiful Soup's get_text method. Unfortunately this inserts its separator argument at every tag, whether the tag is block level or inline.
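A small example of the problem, using Beautiful Soup's `get_text` on an invented snippet with one inline tag and two block-level tags:

```python
from bs4 import BeautifulSoup

html = "<p>We sell <b>fresh</b> fruit</p><p>Open daily</p>"
soup = BeautifulSoup(html, "html.parser")

# With no separator, text from adjacent block tags runs together.
print(soup.get_text())      # → 'We sell fresh fruitOpen daily'

# A separator fixes the paragraph boundary, but it is also inserted
# around the inline <b> tag, splitting up a single sentence.
print(soup.get_text("\n"))
```

Getting segmentation right means treating block-level tags (like `<p>`) as boundaries while leaving inline tags (like `<b>`) alone, which `get_text` alone can't do.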
There are a lot of great freely available resources in NLP right now, and the field is moving quickly with the recent success of neural models. I wanted to mention a few that look interesting to me. Jurafsky and Martin's Speech and Language Processing: the in-progress third edition is a free ebook that covers a lot of the basic ideas in NLP. It's got a great reputation in the NLP community and is nearly complete now.
In the past two years the best performing NLP models have been based on transformer models trained on an enormous corpus of text. By understanding how language in general works they are much more effective at detecting sentiment, classifying documents, answering questions and translating documents. However in any particular case we are solving a particular task in a certain domain. Can we get a better performing model by further training the language model on the specific domain or task?
How can we evaluate how good a machine generated translation is? We could get bilingual readers to score the translation, and average their scores. However this is expensive and time consuming: if we need hours of human time to evaluate each experiment, evaluation becomes a bottleneck for experimentation. This motivates automatic metrics for evaluating machine translation. One of the oldest examples is the BiLingual Evaluation Understudy (BLEU).
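A simplified sketch of BLEU for a single reference, restricted to 1- and 2-grams (real BLEU uses up to 4-grams, multiple references, and smoothing; the example sentences are invented):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Clipped n-gram precision (geometric mean over 1..max_n)
    multiplied by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum((cand & ref).values())   # matches, clipped by reference counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
print(bleu("the cat sat on the mat".split(), ref))  # → 1.0 (perfect match)
print(bleu("a cat sat on a mat".split(), ref))      # lower score
```

In practice you'd use an implementation like sacrebleu rather than rolling your own, since tokenisation details matter a lot for comparability.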
When training machine learning models typically you get a training dataset for fitting the model and a test dataset for evaluating the model (on small datasets techniques like cross-validation are common). You typically assume the performance on your chosen metric on the test dataset is the best way of judging the model. However it's really easy for systematic biases or leakage to creep into the datasets, meaning that your evaluation will differ significantly to real world usage.
The traditional way to train an NER model on a new domain is to annotate a whole bunch of data. Techniques like active learning can speed this up, but especially neural models with random weights require a ton of data. A more modern approach is to take a large pretrained NER model and fine tune it on your dataset. This is the approach of AdaptaBERT (paper), using BERT. However this takes a large amount of GPU compute and finicky regularisation techniques to get right.
Working with unstructured text is much easier if we add structure to it. Stanza is a state of the art library for doing this in over 60 languages. Given some text it will tokenize, sentencize, tag parts of speech and morphological features, parse syntactic dependencies and in a few languages perform NER. It's easy to use and gets extremely good results on benchmarks for each of these tasks on a large number of languages.
Dependency trees are a remarkably powerful tool for information extraction. Neural based taggers are very good, and Universal Dependencies means the approach can be used for almost any language (although the rules are language specific). However syntax can get really strange, requiring increasingly complex rules to extract information. The pyBART system addresses this by rewriting the dependency trees into a representation half a step closer to semantics than syntax. I've seen that dependency based rules are useful for extracting skills from noun phrases and adpositions.
Many documents available on the web have meaningful markup. Headers, paragraph breaks, links, emphasis and lists all change the meaning of the text. A common way to deal with HTML documents in NLP is to strip away all the markup, e.g. with Beautiful Soup's .get_text. This is fine for a bag of words approach, but for more structured text extraction or language model this seems like throwing away a lot of information.
A job ad title can contain a lot of things like location, skills or benefits. I want a list of just the job titles, without the rest of those things. This is a key piece of information extraction that can be used to better understand jobs, and built on by understanding how different job titles relate, for example with salary. To do this we first normalise the words in the ad title, doing things like removing plurals and expanding acronyms.
I'm trying to find job titles in job ads, but the same title can be written lots of different ways. An "RN" is the same as a "Registered Nurse", and broadly the same role as "Registered nurses". As a preprocessing step to job title discovery I need to normalise the text. The process I use is simple: rewrite terms containing "of", e.g. "Director of Sales" to "Sales Director"; expand punctuation with whitespace; e.
Trying to normalise text in job titles I need a way to convert plural words into their singular form. For example a job for "nurses" is about a "nurse", a job for "salespeople" is about a "salesperson", a job for "workmen" is about a "workman" and a job about "midwives" is about a "midwife". I developed an algorithm that works well enough for converting plural words to singular without changing singular words in the text like "sous chef", "business" or "gas".
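The algorithm itself isn't reproduced here, but a rule-based sketch in the same spirit might look like this; the irregular table and the stop list of already-singular words are illustrative, not the post's actual lists:

```python
IRREGULAR = {"people": "person", "salespeople": "salesperson",
             "men": "man", "women": "woman", "children": "child"}
# Words ending in s that are already singular (assumed stop list).
KEEP = {"business", "gas", "sous"}

def singularise(word):
    w = word.lower()
    if w in KEEP:
        return word
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("men"):
        return word[:-2] + "an"          # workmen → workman
    if w.endswith("wives"):
        return word[:-3] + "fe"          # midwives → midwife
    if w.endswith("ies") and len(w) > 4:
        return word[:-3] + "y"           # companies → company
    if w.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]                 # coaches → coach
    if w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return word[:-1]                 # nurses → nurse
    return word

for word in "nurses salespeople workmen midwives business gas".split():
    print(word, "→", singularise(word))
```

The rule order matters: irregulars and the stop list have to be checked before the generic strip-the-s rule fires.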
When examining words in job titles I noticed that it was common to see titles written as "head of ..." or "director of ...". This is unusual because most role titles go from specific to general (e.g. finance director), so you look backwards from the role word. In the "A of B" format the title goes from general to specific, so you have to reverse the search order. One solution is to rewrite "director of finance" to "finance director".
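A minimal sketch of that rewrite; the `ROLE_WORDS` whitelist is a hypothetical stand-in for a real list of role words:

```python
ROLE_WORDS = {"head", "director", "manager", "chief"}  # assumed whitelist

def rewrite_of(title):
    """Rewrite 'Director of Finance' style titles to 'Finance Director',
    so the general role word comes last like in most titles."""
    tokens = title.lower().split()
    if len(tokens) >= 3 and tokens[1] == "of" and tokens[0] in ROLE_WORDS:
        return " ".join(tokens[2:] + tokens[:1])
    return " ".join(tokens)

print(rewrite_of("Director of Finance"))  # → 'finance director'
print(rewrite_of("Head of Marketing"))    # → 'marketing head'
print(rewrite_of("Finance Director"))     # already in role-last order; unchanged
```

After this rewrite the same backwards-from-the-role-word search works for both title formats.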
I found NER wasn't the right tool for extracting job titles, and a frequency based approach is going to work better. The first step for this is to identify words that signify a job title, like "manager", "nurse" or "accountant". I develop a whitelist of these terms and start moving towards a process for detecting role titles. I had already developed a method for identifying duplicate job ads, and used it here to remove duplicates.
I've been trying to find near duplicate job ads in the Adzuna Job Salary Predictions Kaggle Competition. Job ads can be duplicated because a hirer posts the same ad multiple times to a job board, or to multiple job boards. Finding exact duplicates is easy by sorting the job ads or a hash of them. But the job board may mangle the text in some way, or add its own footer, or the hirer might change a word or two in different posts.
We've found pairs of near duplicate texts in 400,000 job ads from the Adzuna Job Salary Predictions Kaggle Competition. We then tried to extract groups of similar ads by finding connected components in the graph of similar ads. Unfortunately with a low threshold of similarity we ended up with a chain of ads that were each similar to the next, but the first and last ad were totally unrelated. One way to work around this is to find cliques: groups of job ads where every job ad is similar to all of the others.
I've found pairs of near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using MinHash. One thing that would be useful to know is what the common sections of the ads are. Typically if they have a high 3-Jaccard similarity it's because they have some text in common. The most asymptotically efficient way to find the longest common substring would be to build a suffix tree, but for experimentation the heuristics in Python's difflib work well enough.
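For example, difflib's `SequenceMatcher.find_longest_match` pulls out the longest common run of characters between two texts (the ad texts here are invented):

```python
from difflib import SequenceMatcher

ad_a = "Join our team. Great salary and benefits. Apply via our website today."
ad_b = "Exciting role! Great salary and benefits. Send your CV to apply."

# autojunk=False stops difflib discarding frequent characters as "junk",
# which matters on longer documents.
matcher = SequenceMatcher(None, ad_a, ad_b, autojunk=False)
match = matcher.find_longest_match(0, len(ad_a), 0, len(ad_b))
print(repr(ad_a[match.a:match.a + match.size]))
```

Repeating this on the remainder either side of the match gives all the large shared sections, which is roughly what `SequenceMatcher.get_matching_blocks` does.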
We've found pairs of near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using MinHash. But many pairs will be part of the same group; in an extreme case a group of 5 job ads with identical texts produces 10 pairs. Both for interpretability and usability it makes sense to extract these groups from the pairs. Extracting the Groups Directly with Union Find Each band of the LSH consists of buckets of items that may be similar; you could view the buckets as a partition of the corpus of all documents.
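A minimal union-find sketch for merging pairs into groups (the integer ad ids are invented):

```python
def find(parent, x):
    """Find the root representative of x, with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_pairs(pairs):
    """Merge near-duplicate pairs into groups using union-find."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b        # union the two groups
    groups = {}
    for x in parent:
        groups.setdefault(find(parent, x), set()).add(x)
    return list(groups.values())

print(group_pairs([(1, 2), (2, 3), (4, 5)]))  # → [{1, 2, 3}, {4, 5}]
```

Note this gives connected components, not cliques, so a chain of pairwise-similar ads still ends up in one group.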
I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. In the last article I built a collection of MinHashes of the 400,000 job ads in half an hour, stored in a 200MB file. Now I need to efficiently search through these MinHashes to find the near duplicates, because brute force search through them would take a couple of days on my laptop. MinHash was designed for this problem, as outlined in the original paper.
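The standard trick is locality-sensitive hashing by banding: split each signature into bands and bucket documents by each band, so only documents sharing a whole band become candidate pairs. A plain-Python sketch, with tiny invented signatures (real ones would have ~128 values):

```python
from collections import defaultdict

def candidate_pairs(signatures, bands=4):
    """Bucket each MinHash signature by its bands; documents that share
    any complete band become candidate near-duplicate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        docs = sorted(docs)
        for i, first in enumerate(docs):
            for second in docs[i + 1:]:
                pairs.add((first, second))
    return pairs

signatures = {
    "ad1": [1, 5, 2, 9, 4, 7, 3, 8],
    "ad2": [1, 5, 2, 9, 6, 7, 3, 0],   # shares the first band with ad1
    "ad3": [2, 6, 1, 0, 5, 8, 4, 1],
}
print(candidate_pairs(signatures))  # → {('ad1', 'ad2')}
```

More bands with fewer rows each catches lower-similarity pairs at the cost of more false candidates; libraries like datasketch package this tuning up for you.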
I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. I've found that the Jaccard index on n-grams is effective for finding these. Unfortunately it would take about 8 days to calculate the Jaccard index on all pairs of the 400,000 ads, and about 640GB of memory to store the results. While this is tractable, we can find almost all pairs with a significant overlap in half an hour in-memory using MinHash.
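The core of MinHash: for each of many random hash functions, keep only the minimum hash over the document's tokens; the fraction of positions where two signatures agree estimates the Jaccard index. A toy sketch using Python's built-in `hash` with random salts (a real implementation would use fixed hash functions so signatures are comparable across runs):

```python
import random

def minhash_signature(tokens, num_hashes=128, seed=0):
    """One minimum per salted hash function over the token set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, t)) for t in set(tokens)) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions ≈ Jaccard index of the token sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "senior software engineer wanted in sydney".split()
b = "software engineer wanted in sydney cbd".split()
true_jaccard = len(set(a) & set(b)) / len(set(a) | set(b))
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(round(true_jaccard, 2), round(est, 2))  # estimate should be close
```

With 128 hashes the estimate's standard error is around 0.04, plenty for screening candidate pairs before computing exact Jaccard on the survivors.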
I've looked at finding near duplicate job ads using the Jaccard index on n-grams. I wanted to see whether using the TF-IDF to weight the ads would result in a clearer separation. It works, but the results aren't much better, and there are some complications in using it in practice. When trying to find similar ads with the Jaccard index we looked at the proportion of n-grams they have in common relative to all the n-grams between them.
Finding near-duplicate texts is a hard problem, but the Jaccard index for n-grams is an effective measure that's efficient on small sets. I've tried it on the Adzuna Job Salary Predictions Kaggle Competition with good success. This works pretty well at finding near-duplicates and even ads from the same company, although by itself it can't confirm which pairs are true duplicates. I've looked before at using the edit distance, which measures the minimum number of changes to transform one text into another, but it's slow to calculate.
Edit distance, also known as Levenshtein distance, is a useful way of measuring the similarity of two sequences. It counts the minimum number of substitutions, insertions and deletions you need to transform one sequence into another. I had a look at using this to compare duplicate ads, with reasonable results, but it's a little slow to run on many ads. I've previously looked at finding ads with exactly the same text in the Adzuna Job Salary Predictions Kaggle Competition, but there are a lot of ads that are slight variations.
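The textbook dynamic programming solution, keeping only the previous row so memory stays linear in the shorter string:

```python
def levenshtein(a, b):
    """Minimum substitutions, insertions and deletions to turn a into b."""
    prev = list(range(len(b) + 1))       # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb), # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

The quadratic O(len(a) × len(b)) cost per pair is exactly why it gets slow across hundreds of thousands of ads.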
Two similar documents are likely to have many phrases in common relative to the number of words in the document. If you're concerned with plagiarism and copyright, getting the same data through multiple sources, or finding versions of the same document, this approach could be useful. In particular MinHash can quickly find pairs of items with a high Jaccard index, which we can run on sequences of w tokens. A hard question is what's the right number for w?
Finding exact duplicate texts is quite straightforward and fast in Python. This can be useful for removing duplicate entries in a dataset. I tried this on the Adzuna Job Salary Predictions Kaggle Competition job ad texts and found it worked well. Naively finding exact duplicates by comparing every pair would be O(N^2), but if we sort the input, which is O(N log(N)), then duplicate items are adjacent. This scales really well to big datasets, and then the duplicate entries can be handled efficiently with itertools groupby to do something like uniq.
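The whole sort-then-group approach fits in a few lines of stdlib Python (the ad texts are invented):

```python
from itertools import groupby

ads = ["Chef wanted", "Sales rep", "Chef wanted", "Nurse", "Sales rep", "Chef wanted"]

# Sorting puts identical texts next to each other: O(N log N) overall
# instead of comparing all O(N^2) pairs.
duplicates = [(text, len(list(group))) for text, group in groupby(sorted(ads))]
print(duplicates)  # → [('Chef wanted', 3), ('Nurse', 1), ('Sales rep', 2)]
```

In practice you'd usually sort a hash of each text rather than the full text, which keeps the sort cheap when documents are long.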
When comparing two texts it's useful to have a side-by-side comparison highlighting the differences. This is straightforward using HTML in Jupyter Notebooks with Python, and the inbuilt DiffLib. I used this to display job ads duplicated between different sites. For a long document it's important to align the sentences (otherwise it's hard to compare the differences), and highlight the individual differences at a word level. Overall the problems are breaking up a text into sentences and words, aligning the sentences, finding word level differences and displaying them side-by-side.
I created a simple custom recipe to show diffs between two texts in Prodigy. I intend to use this to annotate near-duplicates. The process was pretty easy, but I got tripped up a little. I've been extracting job titles and skills from the job ads in the Adzuna Job Salary Predictions Kaggle Competition. One thing I noticed is there are a lot of job ads that are almost exactly the same; sometimes between the train and test set, which is a data leak.
Sequences of words are useful for characterising text and for understanding text. If two texts have many similar sequences of 6 or 7 words it's very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like "New York" together rather than treating them as the separate words "New" and "York". To do this we need a way of extracting and counting sequences of words.
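Extracting and counting word n-grams is a one-liner each with the stdlib (the example sentence is invented):

```python
from collections import Counter

def word_ngrams(tokens, n):
    """All contiguous sequences of n words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "new york is big but new york is far".split()
counts = Counter(word_ngrams(text, 2))
print(counts.most_common(2))  # ('new', 'york') and ('york', 'is') occur twice
```

Frequent bigrams like ("new", "york") are exactly the candidates for keeping together as phrases, and longer n-grams (6 or 7 words) shared between documents point to a common origin.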
I've been trying to use a Named Entity Recogniser (NER) to extract job titles from the titles of job ads to better understand a collection of job ads. While NER is great, it's not the right tool for this job, and I'm going to switch to a counting based approach. NER models try to extract things like the names of people, places or products. SpaCy's NER model, which I used, is optimised for these cases (looking at things like capitalisation of words).
Over the past decade deep neural networks have revolutionised dealing with unstructured data. Problems that were intractable, from identifying the objects in a video through generating realistic text to translating speech between languages, are now solved in real-time production systems. You might think that today all problems on text, audio and images should be solved by training end-to-end neural networks. However rules and pipelines are still extremely valuable in building systems, and can leverage the information extracted from the black-box neural networks.
Active learning reduces the number of annotations you have to make by selecting for annotation the items that will have the biggest impact on model retraining. Active learning for NER is built into Prodigy, but I failed to use it to improve my job title recogniser. Having built a reasonable NER model for recognising job titles I wanted to see if I could easily improve it with Prodigy's active learning.
In a couple of hours I trained a reasonable job title Named Entity Recogniser for job ad titles using Prodigy, with over 70% accuracy. While 70% doesn't sound great it's a bit ambiguous what a job title is, and getting exactly the bounds of the job title can be a hard problem. It's definitely good enough to be useful, and could be improved. After thinking through an annotation scheme for job titles I wanted to try annotating and training a model.
When doing Named Entity Recognition it's important to think about how to set up the problem. There's a balance between what you're trying to achieve and what the algorithm can do easily. Coming up with an annotation scheme is hard, because as soon as you start annotating you notice lots of edge cases. This post will go through an example with extracting job titles from job ads. In our previous post we looked at what was in a job ad title and a way of extracting some common job titles from the ads.
The job title should succinctly summarise what the role is about, so it should tell you a lot about the role. However in practice job titles can range from very broad to very narrow, be obscure or acronym-laden and even hard to nail down. They're even hard to extract from a job ad's title - which is what I'll focus on in this series. In a previous series of posts I developed a method that could extract skills written in a very particular way.
I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "experience in telesales" using spaCy's dependency parse, but it wouldn't extract many types of experience from a job ad. Here we will extend these rules to extract lists of skills (for example extracting "telesales" and "callcentre" from "experience in telesales or callcentre"), which will let us analyse which experiences are related.
Extracting Experience in a Field I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "subsea cable engineering experience". This worked well, but extracted a lot of qualifiers that aren't skills (like "previous experience in", or "any experience in"). Here we will write rules to extract experience from phrases like "experience in subsea cable engineering", with much better results.
I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. Using rules to extract noun phrases ending in "experience" (e.g. subsea cable engineering experience) we can extract many skills, but there's a lot of false positives (e.g. previous experience). You can see the Jupyter notebook for the full analysis. Extracting Noun Phrases It's common for ads to write something like "have this kind of experience":