I'm trying to extract skills from job ads, using the job ads from the Adzuna Job Salary Predictions Kaggle Competition. Using rules to extract noun phrases ending in "experience" (e.g. subsea cable engineering experience) we can extract many skills, but there are a lot of false positives (e.g. previous experience).

You can see the Jupyter notebook for the full analysis.

Extracting Noun Phrases

It's common for ads to write something like "have this kind of experience":

  • They will need someone who has at least 10-15 years of subsea cable engineering experience
  • This position is ideally suited to high calibre engineering graduate with significant and appropriate post graduate experience.
  • Aerospace industry experience would be advantageous covering aerostructures and/or aero engines.
  • A sufficient and appropriate level of building services and controls experience gained within a client organisation, engineering consultancy or equipment supplier.

We can try to extract the type of experience using spaCy's noun_chunks iterator, which uses linguistic rules on the parse tree to extract noun phrases (each noun chunk is shown in square brackets):

[They] will need [someone] [who] has [at least 10-15 years] of [subsea cable engineering experience]
[A sufficient and appropriate level] of [building services and controls experience] gained within [a client organisation], [engineering consultancy] or [equipment supplier].

We can just look for all noun_chunks that end in "experience", and grab every token leading up to "experience":

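def extract_noun_phrase_experience(doc):
    """Yield (label, start, end) for each noun chunk ending in 'experience'."""
    for chunk in doc.noun_chunks:
        # Skip a bare "experience" with nothing in front of it
        if chunk[-1].lower_ == 'experience' and len(chunk) > 1:
            # The end index is exclusive, so the span omits "experience" itself
            yield 'EXPERIENCE', chunk[0].i, chunk[-1].i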
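
To make the yielded indices concrete, here is a minimal sketch that turns (label, start, end) tuples back into text. It uses a plain list of whitespace-split words in place of a spaCy Doc so that it runs on its own; the helper name spans_to_text and the example span are assumptions for illustration, and spaCy's tokenizer may split the sentence differently.

```python
def spans_to_text(tokens, spans):
    """Turn (label, start, end) tuples into (label, text) pairs."""
    return [(label, " ".join(tokens[start:end])) for label, start, end in spans]


# Whitespace-split words standing in for spaCy tokens (an assumption:
# spaCy's own tokenizer may split this sentence differently).
tokens = ("They will need someone who has at least 10-15 years "
          "of subsea cable engineering experience").split()

# What the extractor would yield if the last four words formed one noun chunk:
# start is the first word of the chunk, end is the index of "experience".
spans = [("EXPERIENCE", 11, 14)]

print(spans_to_text(tokens, spans))
# [('EXPERIENCE', 'subsea cable engineering')]
```

With a real spaCy Doc the same slice is available as doc[start:end].text.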

Analysing the results

Looking at the results of running this extraction over the top fifty thousand job ads, the most common things it extracts aren't skills but qualifiers like "previous experience", "Proven experience", "some experience", and "demonstrable experience".
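
A rough sketch of how the most common extractions could be tallied, using only the standard library. The function name most_common_phrases and the example list are hypothetical, and lowercasing before counting is an assumption rather than something the analysis is known to do.

```python
from collections import Counter


def most_common_phrases(phrases, n=10):
    """Count phrases case-insensitively and return the n most frequent."""
    counts = Counter(phrase.lower() for phrase in phrases)
    return counts.most_common(n)


# Hypothetical extractor output (the words in front of "experience"),
# made up for illustration; these are not real counts from the dataset.
example = ["previous", "Proven", "sales", "previous", "some", "proven", "previous"]

print(most_common_phrases(example, n=2))
# [('previous', 3), ('proven', 2)]
```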

By filtering out the most common qualifying words with a blacklist, along with stop words (the, this, an), we start to get actual fields of expertise:

  • sales
  • management
  • supervisory
  • customer service
  • development
  • supervisory
  • technical
  • managment
  • telesales
  • financial services
  • design
  • project managment
  • retail
  • business sales
  • SQL
  • marketing
  • people management
  • SAP
  • engineering
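
One possible way to apply such a blacklist is sketched below. The word sets contain only terms mentioned in this post, not the full list used in the notebook, and stripping just the leading words is an assumption about how the filter works; the name strip_leading_qualifiers is hypothetical.

```python
# Illustrative word lists only: terms taken from this post, not the
# complete blacklist used in the analysis.
QUALIFIERS = {"previous", "proven", "some", "demonstrable", "relevant"}
STOP_WORDS = {"the", "this", "an"}
BLOCKED = QUALIFIERS | STOP_WORDS


def strip_leading_qualifiers(phrase, blocked=BLOCKED):
    """Remove blocked words from the front; return None if nothing is left."""
    words = phrase.split()
    while words and words[0].lower() in blocked:
        words.pop(0)
    return " ".join(words) or None


for phrase in ["previous sales", "Proven", "some customer service", "SQL"]:
    print(repr(phrase), "->", repr(strip_leading_qualifiers(phrase)))
```

Phrases made up entirely of blocked words come back as None, so they can be dropped before counting.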

While this is a good start, building a longer list requires a much longer blacklist of qualifier terms (e.g. proven, demonstrable, demonstrated, relevant, significant, practical, essential, desirable, ...). These qualifier terms are so common because job ads often contain phrases like "previous experience in ..." or "some experience as ...".

In the next post in the series we look at extracting from these types of phrases, and get much better results.