Composition Over Inheritance
programming

I have been trying to extract job ads from Common Crawl, and have designed a pipeline with three phases: fetching the data, extracting the content and normalising the data. However, I implemented this using inheritance to remove some of the duplication between the steps. While this made it quick to start, it's hard to understand, and extending it means introducing more duplication. My current design started pragmatically: a data source needed to be able to fetch, extract and normalise.
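
A minimal sketch of the composed alternative, with illustrative names (not the repository's actual classes): each step is a plain callable and a data source is assembled from them instead of inheriting them.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DataSource:
    # Hypothetical composition: each phase is injected as a callable,
    # so sources share steps by reuse rather than inheritance.
    fetch: Callable[[], bytes]               # get the raw pages
    extract: Callable[[bytes], List[dict]]   # pull job ads out of the raw data
    normalise: Callable[[dict], dict]        # map fields onto a common schema

    def run(self) -> List[dict]:
        raw = self.fetch()
        return [self.normalise(ad) for ad in self.extract(raw)]

# A new source just swaps in different steps; no subclassing needed.
demo = DataSource(
    fetch=lambda: b"raw data",
    extract=lambda raw: [{"title": "Engineer", "where": "Sydney"}],
    normalise=lambda ad: {"title": ad["title"], "location": ad["where"]},
)
print(demo.run())   # [{'title': 'Engineer', 'location': 'Sydney'}]
```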

A First Cut of Job Extraction
jobs

I've finally built a first iteration of a job extraction pipeline in my job-advert-analysis repository. There's nothing in there that I haven't written about before; it simply does the work of bringing it all together. I'm really happy to have a full pipeline that extracts lots of interesting features to analyse and is easy to extend. I've already talked about how to extract jobs from Common Crawl and the architecture for extracting the data.

Building a Job Extraction Pipeline
jobs

I've been trying to extract job ads from Common Crawl. However, I was stuck for some time on how to actually write transforms for all the different data sources. I've finally come up with an architecture that works: download, extract and normalise. I need a way to extract the job ads from heterogeneous sources that lets me pull out different kinds of data, such as the title, location and salary. I was stuck in the code for a long time trying to do all of this at once, getting confused about how to make changes.
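
As a rough sketch of the three stages, with stub bodies standing in for the real transforms and a hypothetical file layout:

```python
import json
from pathlib import Path

def download(source: str) -> bytes:
    """Fetch the raw pages for a source (stubbed for illustration)."""
    return b"<html>...</html>"

def extract(raw: bytes) -> list:
    """Pull job ads out of the raw data (stubbed for illustration)."""
    return [{"title": "Data Engineer", "where": "Melbourne"}]

def normalise(ad: dict) -> dict:
    """Map source-specific fields onto one common schema."""
    return {"title": ad["title"], "location": ad["where"]}

def run(source: str, workdir: Path) -> list:
    raw = download(source)
    # Keeping each stage's output on disk makes it easy to rerun
    # or debug one stage at a time.
    (workdir / f"{source}.raw").write_bytes(raw)
    ads = [normalise(ad) for ad in extract(raw)]
    (workdir / f"{source}.jsonl").write_text(
        "\n".join(json.dumps(ad) for ad in ads)
    )
    return ads
```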

Finding Australian Locations with Placeholder
python

People write locations in many different ways. This makes them really hard to analyse, so we need a way to normalise them. I've already discussed how Placeholder is useful for coarse geocoding. Now I'm trying to apply it to normalising locations from Australian job ads in Common Crawl. The best practice when using Placeholder is to go from the most specific location information (e.g. street address) to the most general (e.g. country).
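
A query loop might look like the following: a sketch assuming Placeholder's default port and /parser/search endpoint, with made-up location strings and response fields from memory.

```python
from typing import Optional
import requests

def coarse_geocode(text: str) -> Optional[dict]:
    """Return Placeholder's best match for a location string, if any."""
    resp = requests.get(
        "http://localhost:3000/parser/search", params={"text": text}
    )
    resp.raise_for_status()
    results = resp.json()
    return results[0] if results else None

# Try the most specific description first, then fall back to broader ones.
for query in ["Richmond, Melbourne VIC", "Melbourne VIC", "VIC"]:
    match = coarse_geocode(query)
    if match:
        print(query, "->", match.get("name"))
        break
```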

Tips for Extracting Data with Beautiful Soup
python

Beautiful Soup can be a useful library for extracting information from HTML. Unfortunately there are a lot of little issues I hit working with it to extract data from a careers webpage using Common Crawl. The library is still useful enough to work with, but the issues make me want to look at alternatives like lxml (via html5-parser). The source data can be obtained at the end of the article. Use a good HTML parser: Python has an inbuilt html.parser, but a parser like html5lib handles messy real-world HTML the way a browser would.
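
For example, the choice of parser can change the tree you get for malformed markup (a small sketch; html5lib builds the tree much as a browser would):

```python
from bs4 import BeautifulSoup

html = "<p>Senior Engineer<p>Melbourne"   # malformed: unclosed <p> tags

# The same document can parse differently depending on the backend.
print(BeautifulSoup(html, "html.parser").find_all("p"))
print(BeautifulSoup(html, "html5lib").find_all("p"))
```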

Extracting Australian Job Postings with SPARQL
jobs

I am trying to extract Australian Job Postings from Web Data Commons, which extracts structured data from Common Crawl. I have previously written scripts to read in the graphs, explore the JobPosting schema and analyse the schema using SPARQL. Now we can use these to find some Australian Job Postings in the data. For this analysis I used 15,000 pages containing job postings from different domains in the 2019 Web Data Commons extract.
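
The query was along these lines: a sketch using rdflib, where the file name is hypothetical and addressCountry is assumed to appear as a literal like "AU" or "Australia".

```python
from rdflib import ConjunctiveGraph

g = ConjunctiveGraph()
g.parse("job_postings.nq", format="nquads")   # hypothetical extract file

query = """
PREFIX sdo: <http://schema.org/>
SELECT ?posting ?title WHERE {
  ?posting a sdo:JobPosting ;
           sdo:title ?title ;
           sdo:jobLocation/sdo:address/sdo:addressCountry ?country .
  FILTER(UCASE(STR(?country)) IN ("AU", "AUSTRALIA"))
}
"""
for posting, title in g.query(query):
    print(posting, title)
```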

Schemas for JobPostings in Practice
jobs

A job posting has a description, a company, sometimes a salary, ... and what else? Schema.org has a detailed JobPosting schema, but it's not immediately obvious what is important and how to use it. However, Web Data Commons has extracted JobPostings from hundreds of thousands of webpages from Common Crawl. By parsing the data we can see how the schema is actually used in practice, which helps show what is actually useful in describing a job posting.
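
A rough way to do this is to count predicate frequencies straight off the N-Quads lines (a sketch; the file name is hypothetical and the whitespace split is only approximate for N-Quads):

```python
from collections import Counter

counts = Counter()
with open("jobpostings.nq") as f:
    for line in f:
        parts = line.split(" ", 3)   # subject, predicate, object, graph
        if len(parts) > 2 and "schema.org" in parts[1]:
            counts[parts[1].strip("<>")] += 1

# The most common properties suggest what people actually fill in.
for prop, n in counts.most_common(10):
    print(n, prop)
```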

Extracting Job Ads from Common Crawl
commoncrawl

I've been using data from the Adzuna Job Salary Predictions Kaggle Competition to extract skills, find near-duplicate job ads and understand the seniority of job titles. But the dataset's ad text is heavily processed, which makes it harder to do natural language processing on. Instead I'm going to find job ads in Common Crawl, a dataset containing over a billion webpages each month. The Common Crawl data is much better because it's longitudinal over several years, international, broad and continually updated.
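
Pages can be located through the Common Crawl index API, along these lines (a sketch; the crawl id is real but the job board domain is made up):

```python
import json
import requests

resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2019-35-index",
    params={"url": "jobs.example.com/*", "output": "json"},
)
# Each line is a JSON record locating a capture in the WARC archives.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["url"], record["filename"], record["offset"])
```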

Understanding Job Ad Titles with Salary
jobs

Different industries have different ways of distinguishing seniority in a job title. Is an HR Officer more senior than an HR Administrator? Is a PHP web developer more skilled than a PHP developer? How different is a medical sales executive from general sales roles? Using the jobs from the Adzuna Job Salary Predictions Kaggle Competition I've found common job titles and can use the advertised salary to help understand them. Note that since the data is from the UK from several years ago a lot of the details aren't really applicable, but the techniques are.
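
The core of this kind of analysis is a group-by over titles (a sketch, assuming the competition's Train_rev1.csv with its Title and SalaryNormalized columns):

```python
import pandas as pd

ads = pd.read_csv("Train_rev1.csv", usecols=["Title", "SalaryNormalized"])
summary = (
    ads.groupby(ads["Title"].str.lower())["SalaryNormalized"]
       .agg(["count", "median"])
       .query("count >= 100")        # only common titles
       .sort_values("median")
)
print(summary.tail(10))               # highest-paying common titles
```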

Discovering Job Titles
jobs

A job ad title can contain a lot of things besides the role, like location, skills or benefits. I want a list of just the job titles, without the rest of those things. This is a key piece of information extraction that can be used to better understand jobs, and built on by understanding how different job titles relate, for example through salary. To do this we first normalise the words in the ad title, doing things like removing plurals and expanding acronyms.
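
The frequent normalised titles then float to the top of a simple count (a sketch, with a toy normaliser standing in for the real one):

```python
from collections import Counter

def normalise_title(title: str) -> str:
    # Toy normalisation: lowercase, drop non-word tokens, strip plural 's'.
    words = [w for w in title.lower().split() if w.isalpha()]
    return " ".join(w.rstrip("s") for w in words)

titles = ["Registered Nurses", "registered nurse", "Registered Nurse - London"]
print(Counter(normalise_title(t) for t in titles).most_common())
```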

Making Words Singular
nlp

Trying to normalise text in job titles, I need a way to convert plural words into their singular form. For example a job for "nurses" is about a "nurse", a job for "salespeople" is about a "salesperson", a job for "workmen" is about a "workman" and a job about "midwives" is about a "midwife". I developed an algorithm that works well enough for converting plural words to singular without changing singular words in the text like "sous chef", "business" or "gas".
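
A toy version of the idea, with suffix rules plus an exception list (the real rules and exception lists are more extensive):

```python
EXCEPTIONS = {"business", "gas", "sous"}
IRREGULAR = {"salespeople": "salesperson", "workmen": "workman"}

def singularise(word: str) -> str:
    if word in EXCEPTIONS:               # already singular, leave alone
        return word
    if word in IRREGULAR:                # no suffix rule fits these
        return IRREGULAR[word]
    if word.endswith("ives"):
        return word[:-4] + "ife"         # midwives -> midwife
    if word.endswith("ies"):
        return word[:-3] + "y"           # secretaries -> secretary
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                 # nurses -> nurse
    return word

assert singularise("nurses") == "nurse"
assert singularise("midwives") == "midwife"
assert singularise("business") == "business"
```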

Summary of Finding Near Duplicates in Job Ads
nlp

I've been trying to find near-duplicate job ads in the Adzuna Job Salary Predictions Kaggle Competition. Job ads can be duplicated because a hirer posts the same ad multiple times to a job board, or to multiple job boards. Finding exact duplicates is easy by sorting the job ads or hashing them. But the job board may mangle the text in some way, or add its own footer, or the hirer might change a word or two between posts.

Finding Duplicate Companies with Cliques
jobs

We've found pairs of near-duplicate texts in 400,000 job ads from the Adzuna Job Salary Predictions Kaggle Competition. We then tried to extract groups of similar ads by finding connected components in the graph of similar ads. Unfortunately, with a low threshold of similarity we ended up with a chain of ads that were each similar to the next, but where the first and last ad were totally unrelated. One way to work around this is to find cliques: groups of job ads where every job ad is similar to all of the others.
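
With the pairs as graph edges this is a few lines in networkx (a sketch; the pairs are made up):

```python
import networkx as nx

similar_pairs = [(1, 2), (2, 3), (1, 3), (3, 4)]   # edges between similar ads

g = nx.Graph(similar_pairs)
# find_cliques yields maximal cliques: groups where every ad is
# similar to every other, so chains can't sneak unrelated ads in.
groups = [clique for clique in nx.find_cliques(g) if len(clique) >= 3]
print(groups)   # [[1, 2, 3]]; ad 4 only chains off ad 3, so it's excluded
```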

Finding Common Substrings
data

I've found pairs of near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using MinHash. One thing that would be useful to know is what the common sections of the ads are. Typically if they have a high 3-Jaccard similarity it's because they have some text in common. The most asymptotically efficient way to find the longest common substring would be to build a suffix tree, but for experimentation the heuristics in Python's difflib work well enough.
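
For instance (a sketch; autojunk is disabled so the longest match is exact):

```python
from difflib import SequenceMatcher

a = "We are seeking a senior engineer. Apply via our website."
b = "We are seeking a junior analyst. Apply via our website."

matcher = SequenceMatcher(None, a, b, autojunk=False)
match = matcher.find_longest_match(0, len(a), 0, len(b))
print(repr(a[match.a : match.a + match.size]))   # '. Apply via our website.'
```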

Minhash Sets
jobs

We've found pairs of near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using MinHash. But many pairs will be part of the same group; in an extreme case there could be a group of 5 job ads with identical texts, which produces 10 pairs. Both for interpretability and usability it makes sense to extract these groups from the pairs. One approach is to extract the groups directly with union find: each band of the LSH consists of buckets of items that may be similar, and you could view the buckets as a partition of the corpus of all documents.
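
A minimal union-find over the buckets might look like this (a sketch; the buckets are made up):

```python
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

buckets = [["ad1", "ad2"], ["ad2", "ad3"], ["ad4"]]
for bucket in buckets:
    for item in bucket:
        union(bucket[0], item)          # merge everything in a bucket

groups = defaultdict(set)
for x in parent:
    groups[find(x)].add(x)
print(list(groups.values()))            # [{'ad1', 'ad2', 'ad3'}, {'ad4'}]
```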

Searching for Near Duplicates with Minhash
nlp

I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. In the last article I built a collection of MinHashes of the 400,000 job ads in half an hour, in a 200MB file. Now I need to search through these MinHashes efficiently to find the near duplicates, because a brute force search through them would take a couple of days on my laptop. MinHash was designed for this problem, as outlined in the original paper.
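
This is what the datasketch library's LSH index does (a sketch with toy documents; results are probabilistic, so a query can occasionally miss):

```python
from datasketch import MinHash, MinHashLSH

def minhash(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

docs = {
    "ad1": "senior python developer london",
    "ad2": "senior python developer in london",
    "ad3": "retail assistant manchester",
}

# Buckets candidates so lookup is far cheaper than all-pairs comparison.
lsh = MinHashLSH(threshold=0.6, num_perm=128)
for key, text in docs.items():
    lsh.insert(key, minhash(text.split()))

print(lsh.query(minhash(docs["ad1"].split())))   # likely ['ad1', 'ad2']
```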

Detecting Near Duplicates with Minhash
nlp

I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. I've found that the Jaccard index on n-grams is effective for finding these. Unfortunately it would take about 8 days to calculate the Jaccard index on all pairs of the 400,000 ads, and about 640GB of memory to store the results. While this is tractable, we can find almost all pairs with a significant overlap in half an hour in-memory using MinHash.
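
The gist with the datasketch library (a sketch over word 3-grams; num_perm trades accuracy against speed and size):

```python
from datasketch import MinHash

def shingles(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}

def signature(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf8"))
    return m

a = signature("we need an experienced python developer to join our team")
b = signature("we need an experienced java developer to join our team")
print(a.jaccard(b))   # estimate of the true 3-gram Jaccard similarity
```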

Near Duplicates with TF-IDF and Jaccard
nlp

I've looked at finding near-duplicate job ads using the Jaccard index on n-grams. I wanted to see whether weighting the n-grams with TF-IDF would give a clearer separation. It works, but the results aren't much better, and there are some complications in using it in practice. When trying to find similar ads with the Jaccard index we looked at the proportion of n-grams they have in common relative to all the n-grams between them.
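
The weighted variant replaces set overlap with elementwise min and max of the weights (a sketch; the n-grams and TF-IDF values are made up):

```python
def weighted_jaccard(a: dict, b: dict) -> float:
    """Weighted Jaccard: sum of min weights over sum of max weights."""
    keys = a.keys() | b.keys()
    mins = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    maxs = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return mins / maxs if maxs else 0.0

ad1 = {"experience in": 0.1, "sales role": 0.8}
ad2 = {"experience in": 0.1, "sales manager": 0.9}
print(weighted_jaccard(ad1, ad2))   # ~0.056: rare n-grams dominate
```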

Near Duplicates with Jaccard
nlp

Finding near-duplicate texts is a hard problem, but the Jaccard index on n-grams is an effective measure that's efficient on small sets. I've tried it on the Adzuna Job Salary Predictions Kaggle Competition with good success. It works pretty well at finding near-duplicates and even ads from the same company, although by itself it can't determine which are actual duplicates. I've looked before at using the edit distance, which measures the minimum number of changes to transform one text into another, but it's slow to calculate.
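
The measure itself is only a few lines (a sketch over word 3-grams):

```python
def ngrams(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Shared n-grams as a fraction of all n-grams in either text."""
    x, y = ngrams(a, n), ngrams(b, n)
    return len(x & y) / len(x | y) if x | y else 0.0

print(jaccard(
    "an exciting opportunity for a senior sales executive",
    "an exciting opportunity for a junior sales executive",
))   # ~0.33
```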

Edit Distance
nlp

Edit distance, also known as Levenshtein distance, is a useful measure of the similarity of two sequences. It counts the minimum number of substitutions, insertions and deletions you need to transform one sequence into another. I had a look at using this to compare duplicate ads, with reasonable results, but it's a little slow to run on many ads. I've previously looked at finding ads with exactly the same text in the Adzuna Job Salary Predictions Kaggle Competition, but there are a lot of ads that are slight variations.
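
A minimal dynamic-programming version (two rows of the classic table; O(len(a) * len(b)) time, which is why all-pairs comparison is slow):

```python
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

assert edit_distance("kitten", "sitting") == 3
```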

Finding Exact Duplicate Text
python

Finding exact duplicate texts is quite straightforward and fast in Python. This can be useful for removing duplicate entries from a dataset. I tried this on the Adzuna Job Salary Predictions Kaggle Competition job ad texts and found it worked well. Naively finding exact duplicates by comparing every pair would be O(N^2), but if we sort the input, which is O(N log(N)), then duplicate items are adjacent. This scales really well to big datasets, and the duplicate entries can then be handled efficiently with itertools' groupby to do something like uniq.
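
For example (a sketch; the ad texts are made up):

```python
from itertools import groupby

ads = ["Sales Executive ...", "Nurse ...", "Sales Executive ..."]

# After sorting, identical texts are adjacent; groupby then acts like uniq.
for text, group in groupby(sorted(ads)):
    count = len(list(group))
    if count > 1:
        print(count, repr(text))   # 2 'Sales Executive ...'
```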

Showing Side-by-Side Diffs in Jupyter
python

When comparing two texts it's useful to have a side-by-side comparison highlighting the differences. This is straightforward using HTML in Jupyter notebooks with Python and the inbuilt difflib. I used this to display job ads duplicated between different sites. For a long document it's important to align the sentences (otherwise it's hard to compare the differences), and highlight the individual differences at a word level. Overall the problems are breaking up a text into sentences and words, aligning the sentences, finding word level differences and displaying them side-by-side.
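
The inbuilt class gets you most of the way (a sketch; in the post the texts are also split into aligned sentences first):

```python
from difflib import HtmlDiff
from IPython.display import HTML

a = ["We are hiring a senior engineer.", "Apply on our website."]
b = ["We are hiring a junior engineer.", "Apply via our jobs page."]

# make_table renders a side-by-side HTML diff that Jupyter can display.
HTML(HtmlDiff().make_table(a, b, fromdesc="Site A", todesc="Site B"))
```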

Creating a Diff Recipe in Prodigy
nlp

I created a simple custom recipe to show diffs between two texts in Prodigy. I intend to use this to annotate near-duplicates. The process was pretty easy, but I got tripped up a little. I've been extracting job titles and skills from the job ads in the Adzuna Job Salary Predictions Kaggle Competition. One thing I noticed is that there are a lot of job ads that are almost exactly the same; sometimes between the train and test set, which is a data leak.
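
The skeleton of such a recipe might look like this: a sketch for Prodigy 1.x, assuming pairs stored as JSONL with hypothetical "text1" and "text2" fields, rendering a pre-built diff with Prodigy's html interface.

```python
import prodigy
from prodigy.components.loaders import JSONL
from difflib import HtmlDiff

@prodigy.recipe("diff-pairs")
def diff_pairs(dataset: str, source: str):
    def add_diff_html(stream):
        for task in stream:
            # Render a side-by-side diff of the two texts for each task.
            task["html"] = HtmlDiff().make_table(
                [task["text1"]], [task["text2"]]
            )
            yield task

    return {
        "dataset": dataset,                      # where annotations are saved
        "stream": add_diff_html(JSONL(source)),  # tasks to annotate
        "view_id": "html",                       # render the html field
    }
```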

Not using NER for extracting Job Titles
nlp

I've been trying to use a Named Entity Recogniser (NER) to extract job titles from the titles of job ads to better understand a collection of job ads. While NER is great, it's not the right tool for this job, and I'm going to switch to a counting-based approach. NER models try to extract things like the names of people, places or products. spaCy's NER model, which I used, is optimised for these cases (looking at things like the capitalisation of words).

Annotating Job Titles
nlp

When doing Named Entity Recognition it's important to think about how to set up the problem. There's a balance between what you're trying to achieve and what the algorithm can do easily. Coming up with an annotation scheme is hard, because as soon as you start annotating you notice lots of edge cases. This post will go through an example of extracting job titles from job ads. In our previous post we looked at what was in a job ad title and a way of extracting some common job titles from the ads.

What's in a Job Ad Title?
nlp

The job title should succinctly summarise what the role is about, so it should tell you a lot about the role. However in practice job titles can range from very broad to very narrow, be obscure or acronym-laden and even hard to nail down. They're even hard to extract from a job ad's title - which is what I'll focus on in this series. In a previous series of posts I developed a method that could extract skills written in a very particular way.

Extracting Skills from Job Ads: Part 3 - Conjugations
nlp

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "experience in telesales" using spaCy's dependency parse, but it wouldn't extract many types of experience from a job ad. Here we will extend these rules to extract lists of skills (for example extracting "telesales" and "callcentre" from "experience in telesales or callcentre"), which will let us analyse which experiences are related.
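
With spaCy this is a small extension of the previous rule (a sketch; the parse, and so the output, depends on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("They need experience in telesales or reception.")

for tok in doc:
    if tok.lemma_ == "experience":
        for prep in tok.children:
            if prep.dep_ == "prep" and prep.lower_ == "in":
                for obj in prep.children:
                    if obj.dep_ == "pobj":
                        # conjuncts picks up the rest of the "or"/"and" list
                        skills = [obj, *obj.conjuncts]
                        print([t.text for t in skills])
```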

Extracting Skills from Job Ads: Part 2 - Adpositions
nlp

Extracting experience in a field: I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "subsea cable engineering experience". This worked well, but extracted a lot of qualifiers that aren't skills (like "previous experience in" or "any experience in"). Here we will write rules to extract experience from phrases like "experience in subsea cable engineering", with much better results.
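
The rule walks the dependency parse from "experience" to the object of "in" (a sketch; the full phrase comes out of the object's subtree):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("You will have experience in subsea cable engineering.")

for tok in doc:
    if tok.lemma_ == "experience":
        for child in tok.children:
            if child.dep_ == "prep" and child.lower_ == "in":
                for obj in child.children:
                    if obj.dep_ == "pobj":
                        # the subtree includes modifiers like "subsea cable"
                        print(" ".join(t.text for t in obj.subtree))
```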

Extracting Skills from Job Ads: Part 1 - Noun Phrases
nlp

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. Using rules to extract noun phrases ending in "experience" (e.g. "subsea cable engineering experience") we can extract many skills, but there are a lot of false positives (e.g. "previous experience"). You can see the Jupyter notebook for the full analysis. Extracting noun phrases: It's common for ads to write something like "have this kind of experience":
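
With spaCy's noun chunks the first cut is very short (a sketch; it keeps chunks headed by "experience", false positives and all):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("You must have subsea cable engineering experience and previous experience.")

for chunk in doc.noun_chunks:
    if chunk.root.lemma_ == "experience":
        print(chunk.text)   # includes false positives like "previous experience"
```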