I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. Using rules to extract noun phrases ending in experience (e.g. subsea cable engineering experience) we can extract many skills, but there are a lot of false positives (e.g. previous experience).
You can see the Jupyter notebook for the full analysis.
Extracting Noun Phrases
It's common for ads to write something like "have this kind of experience":
- They will need someone who has at least 10-15 years of subsea cable engineering experience
- This position is ideally suited to high calibre engineering graduate with significant and appropriate post graduate experience.
- Aerospace industry experience would be advantageous covering aerostructures and/or aero engines.
- A sufficient and appropriate level of building services and controls experience gained within a client organisation, engineering consultancy or equipment supplier.
We can just look for all noun_chunks that end in experience, and grab every token leading up to experience:
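    def extract_noun_phrase_experience(doc):
        # Yield (label, start, end) token offsets for noun chunks ending in "experience"
        for np in doc.noun_chunks:
            if np[-1].lower_ == 'experience':
                # Skip a bare "experience" with nothing in front of it
                if len(np) > 1:
                    # np[0].i is the first token; np[-1].i is "experience" itself
                    yield 'EXPERIENCE', np[0].i, np[-1].i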
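To make the offsets concrete, here is a minimal sketch of turning the yielded (label, start, end) tuples back into text. It uses a whitespace-split list of strings as a stand-in for a spaCy Doc, so the token indices are illustrative rather than what spaCy's tokenizer would produce, and the span is an assumed parse rather than real parser output. The helper name spans_to_text is hypothetical.

```python
def spans_to_text(tokens, spans):
    """Map (label, start, end) token offsets to (label, text) pairs.

    `end` is the index of the final "experience" token, so the
    half-open slice tokens[start:end] leaves that token out.
    """
    return [(label, " ".join(tokens[start:end])) for label, start, end in spans]


# Whitespace tokens standing in for a parsed document (assumption)
tokens = "at least 10-15 years of subsea cable engineering experience".split()

# Offsets as the extractor would yield them if the noun chunk were
# "subsea cable engineering experience" (assumed, not real output)
spans = [("EXPERIENCE", 5, 8)]

print(spans_to_text(tokens, spans))
# [('EXPERIENCE', 'subsea cable engineering')]
```

With a real spaCy Doc the same slice would be doc[start:end].text, since a Doc can be sliced by token index into a Span.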
Analysing the results
Looking at the results of extracting from the top fifty thousand job ads, the most common things it extracts aren't skills but qualifiers like "previous experience", "Proven experience", "some experience", and "demonstrable experience".
By filtering with a blacklist of the most common qualifying words and stop words (the, this, an), we start to get actual fields of expertise:
- customer service
- financial services
- project management
- business sales
- people management
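The filtering step might be sketched roughly as follows. This is one possible reading, not necessarily what the notebook does: it assumes the blacklist is applied by stripping blacklisted words from the front of each extracted phrase and discarding whatever ends up empty. The function names and the sample phrases are made up for illustration; the qualifier and stop words are the ones mentioned in this post.

```python
from collections import Counter

# Qualifier and stop words mentioned in the post; a real list would be longer
QUALIFIERS = {"previous", "proven", "some", "demonstrable", "demonstrated",
              "relevant", "significant", "practical", "essential", "desirable"}
STOP_WORDS = {"the", "this", "an"}
BLACKLIST = QUALIFIERS | STOP_WORDS


def strip_blacklisted(phrase, blacklist=BLACKLIST):
    """Drop blacklisted words from the front of a phrase ('' if none remain)."""
    words = phrase.lower().split()
    while words and words[0] in blacklist:
        words.pop(0)
    return " ".join(words)


def count_skills(phrases):
    """Count the phrases that still have content after stripping."""
    cleaned = (strip_blacklisted(p) for p in phrases)
    return Counter(c for c in cleaned if c)


# Made-up extractions (the text before "experience"), for illustration only
phrases = ["previous", "Proven", "customer service",
           "previous customer service", "the project management", "some"]

for skill, n in count_skills(phrases).most_common():
    print(n, skill)
```

Stripping only leading words keeps phrases like "customer service" intact while collapsing "previous customer service" into the same count.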
While this is a good start, building a longer list requires building a much longer blacklist of qualifier terms (e.g. proven, demonstrable, demonstrated, relevant, significant, practical, essential, desirable, ...). These qualifier terms are so common because job ads often contain phrases like "previous experience in ..." or "some experience as ...".
In the next post in the series we look at extracting from these types of phrases, and get much better results.