A job posting has a description, a company, sometimes a salary, ... and what else? Schema.org has a detailed JobPosting schema, but it's not immediately obvious which properties are important or how to use them. However, Web Data Commons has extracted JobPostings from hundreds of thousands of webpages in Common Crawl. Parsing this data shows how the schema is actually used in practice, which in turn shows which properties are actually useful for describing a job posting.

The website schema.org contains schemas for representing everything from Anatomical Structures to a How-To Tip to types of beds. While a lot of work goes into making these consistent and complete, they are only as useful as they are fit for purpose and adopted in practice. For example, a JobPosting contains some very unusual properties like sensoryRequirement.

sensoryRequirement

A description of any sensory requirements and levels necessary to function on the job, including hearing and vision. Defined terms such as those in O*net may be used, but note that there is no way to specify the level of ability as well as its nature when using a defined term.

To understand how people use the schema in practice I analysed the 2019 Web Data Commons Job Postings subsets. There are two datasets, split by source format: JSON-LD and Microdata. To understand the breadth of common usage I took a single JobPosting from each of a sample of domains (1840 JSON-LD and 2820 Microdata) and examined the fields. Because each domain is counted once, a property used only by a few domains with lots of JobPostings will show low coverage here. Often the same information is also contained in the job description, so good examples could be used as training data for a supervised learning model.
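As a rough illustration of what "coverage" means here, the sketch below computes the fraction of sampled postings that set each property. It assumes each posting has already been converted to a plain dict; `property_coverage` and the sample data are hypothetical and not taken from the actual analysis.

```python
from collections import Counter


def property_coverage(postings):
    """Return {property: fraction of postings that set it to a non-empty value}."""
    counts = Counter()
    for posting in postings:
        # Dict keys are unique, so each property counts at most once per posting
        for key, value in posting.items():
            if key.startswith("@"):
                continue  # skip JSON-LD keywords such as @type
            if value not in (None, "", [], {}):
                counts[key] += 1
    total = len(postings)
    return {key: n / total for key, n in counts.items()} if total else {}


# Hypothetical sample: one posting per domain, already converted to dicts
sample = [
    {"@type": "JobPosting", "title": "Nurse", "datePosted": "2019-11-20"},
    {"@type": "JobPosting", "title": "Analyst", "baseSalary": ""},
    {"@type": "JobPosting", "title": "Engineer", "datePosted": "2019-11-21"},
    {"@type": "JobPosting", "title": "Driver", "datePosted": "20/11/2019"},
]
coverage = property_coverage(sample)
```

Empty strings are treated as missing, so a property that is present but blank does not count towards coverage.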

In general there is more Microdata, but it's less consistent and of lower quality than JSON-LD. Consistency varies a lot, so even this "structured" data requires some processing to work with. There are also many different ways to structure the same thing; for example a JobPosting can have a salaryCurrency, but also a baseSalary which is a MonetaryAmount and can have its own currency.
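Because the currency can live in two places, any code reading it has to check both. Here is a minimal sketch, assuming a posting represented as a JSON-LD style dict; `salary_currency` and the symbol mapping are my own illustrative names, not part of schema.org or the original analysis.

```python
# Illustrative mapping for currency symbols seen in place of ISO 4217 codes
SYMBOL_TO_CODE = {"€": "EUR"}


def _clean_currency(value):
    """Normalise a raw currency value to an upper-case code, or None."""
    if not isinstance(value, str) or not value.strip():
        return None
    code = value.strip().upper()
    return SYMBOL_TO_CODE.get(code, code)


def salary_currency(posting):
    """Best-effort currency for a JobPosting given as a dict."""
    base = posting.get("baseSalary")
    # Prefer the currency nested inside a MonetaryAmount, if there is one
    if isinstance(base, dict):
        nested = _clean_currency(base.get("currency"))
        if nested:
            return nested
    # Otherwise fall back to the top-level salaryCurrency property
    return _clean_currency(posting.get("salaryCurrency"))


print(salary_currency({"baseSalary": {"@type": "MonetaryAmount", "currency": "gbp"}}))
print(salary_currency({"baseSalary": "40 000", "salaryCurrency": "€"}))
```

Preferring the nested value is an arbitrary choice; when both are present and disagree, it may be worth logging the posting rather than silently picking one.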

Here are the most common properties of each schema type, with their coverage, types, descriptions and some examples.

| Schema | Property | JSON-LD Coverage | Microdata Coverage | Types | Description | Example | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| JobPosting | datePosted | 99% | 64% (7% Date/57% Text) | Date | Publication date of listing | 11/20/2019 08:55:24 AM | Should be an ISO-8601 date, often is another type of date(time) |
| JobPosting | title | 99% | 84% | Text | Title of the job | Psychiatric Nurse Practitioner | |
| JobPosting | description | 99% | 76% | Text (HTML) | Description of the job | `<p><strong>Category Manager ...` | Sometimes HTML is double encoded, sometimes not |
| JobPosting | hiringOrganization | 98% | 59% (35% Organization/24% Text) | Organization | Organization offering the job | NTT DATA Services (See also Organization) | |
| JobPosting | jobLocation | 98% | 63% (49% Place/14% Text) | Place | Location associated to job | See Place | |
| JobPosting | employmentType | 82% | 39% | Text | Type of employment | FULL_TIME | When it's multiple it tends to be alternatives |
| JobPosting | validThrough | 60% | 23% (3% Date, 20% Text) | DateTime (90%), Date (10%) | Date after when offer is not valid | 2019-11-28 | Should be an ISO-8601 date/datetime, often is another type of date(time) |
| JobPosting | baseSalary | 47% | 21% (12% MonetaryAmount/9% Text) | MonetaryAmount, Text, PriceSpecification (Rarely) | The base salary of the job | 40 000 60 000 руб. (Also see MonetaryAmount) | |
| JobPosting | identifier | 41% | 7% | PropertyValue, Text, URL | Any kind of identifier | 39576074 | The value is often a path or an id |
| JobPosting | industry | 39% | 21% | Text | Industry associated with job | Engineering | Doesn't seem to be standard, can be in different languages |
| JobPosting | url | 23% | 20% (9% URL, 11% Text) | URL | URL of the item | https://nmhc.selectleaders.com/job/90569/senior-debt-analyst/ | |
| JobPosting | salaryCurrency | 15% | 4% | Text | ISO 4217 currency for salary | GBP | Occasionally € instead of EUR |
| JobPosting | educationRequirements | 10% | 7% | EducationalOccupationalCredential (Rarely), Text | Educational background needed | MBO | Often it's "UNAVAILABLE" or "Not Applicable", etc |
| JobPosting | occupationalCategory | 9% | 8% | CategoryCode, Text | A category describing the job | Information Technology | Very little consistency |
| JobPosting | experienceRequirements | 9% | 9% | Text | Skills and experience required | 1 - 2 Year(s) | Sometimes years, sometimes level (senior), sometimes text description |
| JobPosting | workHours | 8% | 8% | Text | Typical working hours | 11:00~24:00 週2日 | Variable formats, sometimes specifying "various" |
| JobPosting | jobBenefits | 8% | 2% | Text | Benefits associated with job | Job Security, HRA, TA, DA | Inconsistent, often comma-separated list of text |
| JobPosting | skills | 8% | 4% | DefinedTerm, Text | Competency needed to fill this role | JavaScript, Apple iOS, Android | Often comma-separated list |
| JobPosting | qualifications | 7% | 6% | EducationalOccupationalCredential (Rarely), Text | Qualifications required for this role | Sie müssen Personaler eines Unternehmens sein | Typically long text, inconsistent |
| JobPosting | image | 6% | 7% | URL (mostly), ImageObject (sometimes) | An image of the item | https://images.rigzone.com/images/rz-logo.jpg | Mostly logos |
| JobPosting | jobLocationType | 3% | <1% | Text | A description of the job location | TELECOMMUTE | When it's present it's almost always TELECOMMUTE |
| JobPosting | incentiveCompensation | 3% | <1% | Text | Bonuses and commissions | Provides Equity | Highly variable |
| Place (jobLocation) | address | 94% | 44% | PostalAddress (mostly), Text | Physical address | Amsterdam (Also see PostalAddress) | |
| Place | geo | 5% | <1% | GeoCoordinates (mostly), Text | Geo coordinates of place | 54.727,55.955 (See GeoCoordinates) | Sometimes 0,0 (garbage) |
| Place | name | 3% | 1% | Text | Name of the place | Southwark | |
| PostalAddress (address) | addressLocality | 89% | 34% | Text | Location within region | Philadelphia | |
| PostalAddress | addressRegion | 81% | 30% | Text | Region within country | California | Sometimes country-specific abbreviation |
| PostalAddress | addressCountry | 77% | 16% | Text (95%), Country (5%) | Country | United States | Sometimes name, sometimes country code |
| PostalAddress | postalCode | 54% | 12% | Text, Int | Postal code | J3A1B6 | |
| PostalAddress | streetAddress | 34% | 7% | Text | Street address | 21 Lassell Gardens | Can be junk like '-', UNKNOWN |
| GeoCoordinates (geo) | latitude | 5% | <1% | Float, Text | Latitude (WGS 84) | 54.727615356462 | |
| GeoCoordinates | longitude | 5% | <1% | Float, Text | Longitude (WGS 84) | 55.955778063477 | |
| Country (addressCountry) | name | 4% | 0% | Text | Name of the country | Italia | Sometimes name, sometimes country code |
| MonetaryAmount (baseSalary) | currency | 38% | 11% | Text | Currency (e.g. ISO 4217 code) | GBP | |
| MonetaryAmount | value | 44% | 9% | QuantitativeValue (95%), Text (5%) | Quantity of salary | 25000 (See QuantitativeValue) | When text contains odd things like 'Hourly' |
| MonetaryAmount | minValue | 2% | 2% | Text, Int, Float | Lower value of salary | 9.94 | |
| MonetaryAmount | maxValue | 2% | 2% | Text, Int, Float | Upper value of salary | 13,500,000 | |
| QuantitativeValue (baseSalary value) | unitText | 35% | 5% | Text | Unit of measurement | YEAR | Should be 3 letter UN/CEFACT Common Code (e.g. HUR, DAY, WEE, MON, ANN) |
| QuantitativeValue | minValue | 7% | 4% | Text, Int, Float | Lower value of salary | 400 | Sometimes 0 |
| QuantitativeValue | maxValue | 6% | 3% | Text, Int, Float | Upper value of salary | 10000 | Sometimes 0 |
| QuantitativeValue | value | 29% | 2% | Text, Int, Float | Quantity of salary | 300 | Sometimes NULL, sometimes text range |
| Organization (hiringOrganization) | name | 94% | 32% | Text | Name of the company | Anixter International | |
| Organization (hiringOrganization) | sameAs | 60% | 6% | URL | URL that identifies company | https://www.socialdeal.nl | Appear to be URLs |
| Organization (hiringOrganization) | logo | 54% | 10% | URL (usually), ImageObject (sometimes) | Associated logo | https://kaigoworker.jp/img/gfjimg_kaigo.png | |
| Organization (hiringOrganization) | url | 5% | 8% | URL | URL of company | http://www.lgcassociates.com | |
| ImageObject (image, logo) | url | | | URL | URL of image | https://www.hiq.se/globalassets/bilder/hiq_bg_bild_some.jpg | |
| ImageObject | contentUrl | | | URL | URL of image | https://media.rabota.ru/processor/logo/small/2010/04/08/silajjn.gif | |
| ImageObject | width | | | Int | Width of image | 1043 | |
| ImageObject | height | | | Int | Height of image | 1800 | |
| ImageObject | name | | | Text | Name of image | TRN Logo with Website | |
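The notes on datePosted and validThrough mean that any consumer of this data needs a tolerant date parser. Here is a minimal sketch that tries ISO-8601 first and then a few fallbacks. `parse_posting_date` and the list of fallback formats are my own assumptions for illustration; in particular, formats like 11/20/2019 are ambiguous between month-first and day-first, and this sketch simply assumes month-first.

```python
from datetime import datetime

# Non-ISO formats to try; an illustrative guess, not an exhaustive list
FALLBACK_FORMATS = ["%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y", "%d.%m.%Y"]


def parse_posting_date(text):
    """Parse a datePosted/validThrough string, returning a datetime or None."""
    text = text.strip()
    try:
        # Handles the expected ISO-8601 forms such as 2019-11-28
        return datetime.fromisoformat(text)
    except ValueError:
        pass
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None  # unrecognised format


print(parse_posting_date("2019-11-28"))
print(parse_posting_date("11/20/2019 08:55:24 AM"))
```

Returning None rather than raising makes it easy to count how many postings have an unparseable date, which is itself a useful quality measure.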

Getting the data

The N-Quads data was turned into graphs with parse_nquads, and a single JobPosting was kept from the first graph seen for each domain.

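import gzip
from itertools import islice

from tqdm import tqdm

# parse_nquads and get_job_postings are helper functions not shown here;
# graph_domain is defined below
f = gzip.open('ndquads.gz', 'rt')
all_graphs = parse_nquads(f)

seen_domains = set()
graphs = []
skipped = []

# Look at up to the first 100,000 graphs, keeping one job posting per domain
for graph in tqdm(islice(all_graphs, 100_000), total=100_000):
    dom = graph_domain(graph)
    if dom in seen_domains:
        continue

    try:
        jp = list(get_job_postings(graph))[0]
    except IndexError:
        # A page's data can be split into disjoint graphs,
        # and this one contains no job posting
        skipped.append((graph.identifier, dom))
        continue
    graphs.append((graph, jp))
    seen_domains.add(dom)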

The domain of the graph is extracted with a simple function:

import urllib.parse

def graph_domain(graph):
    # graph.identifier holds the URL of the page the graph came from
    return urllib.parse.urlparse(graph.identifier).netloc
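One thing to keep in mind is that netloc keeps subdomains (and any port) exactly as written, so www.example.com and example.com count as different domains. The sketch below shows what netloc returns for a URL from the table above, plus a hypothetical `url_domain` variant that lowercases and strips a leading "www."; that variant is my own illustration, not what the analysis used.

```python
from urllib.parse import urlparse


def url_domain(url):
    """Network location of a URL, lowercased, without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


example = "https://nmhc.selectleaders.com/job/90569/senior-debt-analyst/"
plain = urlparse(example).netloc
normalised = url_domain("https://WWW.Example.com/jobs/1")
print(plain)
print(normalised)
```

Whether to normalise depends on the goal: for sampling one posting per site, merging the www and bare variants avoids counting the same site twice.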

You can view the whole laborious analysis in Jupyter.