Schemas for JobPostings in Practice
A job posting has a description, a company, sometimes a salary, … and what else? Schema.org has a detailed JobPosting schema, but it’s not immediately obvious which parts are important or how to use them. However, the Web Data Commons project has extracted JobPostings from hundreds of thousands of webpages in Common Crawl. By parsing that data we can see how the schema is actually used in practice, which helps show what is actually useful for describing a job posting.
The website schema.org contains schemas for representing everything from Anatomical Structures to a How-to Tip to types of beds. While a lot of work goes into making these consistent and complete, they are only as useful as they are fit for purpose and adopted in practice. For example, a JobPosting contains some very unusual properties like sensoryRequirement.
sensoryRequirement
A description of any sensory requirements and levels necessary to function on the job, including hearing and vision. Defined terms such as those in O*net may be used, but note that there is no way to specify the level of ability as well as its nature when using a defined term.
To understand how people use the schema in practice I analysed the 2019 Web Data Commons Job Postings subsets. There are two different datasets, split by source format: JSON-LD and Microdata. To understand the breadth of common usage I took a single JobPosting from each of a sample of domains (1840 JSON-LD and 2820 Microdata) and examined the fields. Because each domain counts only once, a property that looks rare here may still be used by some domains with lots of JobPostings. In many cases the same information is also contained in the job description, so good structured examples could be used as training data for a supervised learning model.
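As a rough illustration of what "examining the fields" means, the sketch below computes the share of sampled postings in which each top-level property appears. It assumes each posting has already been flattened into a plain Python dict (as JSON-LD would be after `json.loads`); the function name `property_coverage` and the sample data are hypothetical and not taken from the original analysis.

```python
from collections import Counter


def property_coverage(postings):
    """Return the fraction of postings in which each top-level property appears."""
    counts = Counter()
    for posting in postings:
        # Dict keys are unique, so each property counts once per posting.
        # JSON-LD keywords such as @type and @context are skipped.
        counts.update(key for key in posting if not key.startswith("@"))
    total = len(postings)
    if total == 0:
        return {}
    return {prop: n / total for prop, n in counts.items()}


# Hypothetical sample: one posting per domain
sample = [
    {"@type": "JobPosting", "title": "Nurse", "datePosted": "2019-11-20"},
    {"@type": "JobPosting", "title": "Analyst"},
]
coverage = property_coverage(sample)
```

Here `coverage["title"]` is 1.0 and `coverage["datePosted"]` is 0.5. Nested objects such as Place or Organization would need the same treatment applied one level down.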
In general there is more Microdata, but it’s less consistent and of lower quality than JSON-LD. There’s a lot of variation in consistency, so even this “structured” data requires some processing to work with. There are also many different ways to structure the same thing; for example a JobPosting can have a salaryCurrency but also a baseSalary, which is a MonetaryAmount and can have its own currency.
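To make the salaryCurrency versus baseSalary example concrete, here is a minimal sketch of reading the currency from whichever place it was put. It assumes the posting is a plain dict parsed from JSON-LD; the function name `salary_currency` and the one-entry symbol map are illustrative assumptions, not part of the original analysis.

```python
def salary_currency(posting):
    """Return a currency string from salaryCurrency or baseSalary.currency, else None."""
    currency = posting.get("salaryCurrency")
    if not currency:
        base = posting.get("baseSalary")
        # baseSalary is sometimes plain Text rather than a MonetaryAmount object
        if isinstance(base, dict):
            currency = base.get("currency")
    if not isinstance(currency, str):
        return None
    currency = currency.strip()
    # Assumed, deliberately tiny symbol map: "€" occasionally appears instead of "EUR"
    symbols = {"€": "EUR"}
    return symbols.get(currency, currency.upper()) or None


print(salary_currency({"salaryCurrency": "€"}))
print(salary_currency({"baseSalary": {"@type": "MonetaryAmount", "currency": "gbp"}}))
print(salary_currency({"baseSalary": "40 000 60 000 руб."}))
```

Preferring salaryCurrency over the nested currency is an arbitrary choice here; a real pipeline might want to flag postings where the two disagree.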
Here are the most common objects, their types, descriptions and some examples.
Schema | Property | JSON-LD Coverage | Microdata Coverage | Types | Description | Example | Notes |
---|---|---|---|---|---|---|---|
JobPosting | datePosted | 99% | 64% (7% Date/57% Text) | Date | Publication date of listing | 11/20/2019 08:55:24 AM | Should be an ISO-8601 date, often is another type of date(time) |
JobPosting | title | 99% | 84% | Text | Title of the job | Psychiatric Nurse Practitioner | |
JobPosting | description | 99% | 76% | Text (HTML) | Description of the job | <p><strong>Category Manager … | Sometimes HTML is double encoded, sometimes not |
JobPosting | hiringOrganization | 98% | 59% (35% Organization/24% Text) | Organization | Organization offering the job | NTT DATA Services (See also Organization) | |
JobPosting | jobLocation | 98% | 63% (49% Place/14% Text) | Place | Location associated to job | See Place | |
JobPosting | employmentType | 82% | 39% | Text | Type of employment | FULL_TIME | When it’s multiple it tends to be alternatives |
JobPosting | validThrough | 60% | 23% (3% Date, 20% Text) | DateTime (90%), Date (10%) | Date after when offer is not valid | 2019-11-28 | Should be an ISO-8601 date/datetime, often is another type of date(time) |
JobPosting | baseSalary | 47% | 21% (12% MonetaryAmount/9% Text) | MonetaryAmount, Text, PriceSpecification (Rarely) | The base salary of the job | 40 000 60 000 руб. (Also see MonetaryAmount) | |
JobPosting | identifier | 41% | 7% | PropertyValue, Text, URL | Any kind of identifier | 39576074 | The value is often a path or an id |
JobPosting | industry | 39% | 21% | Text | Industry associated with job | Engineering | Doesn’t seem to be standard, can be in different languages |
JobPosting | url | 23% | 20% (9% URL, 11% Text) | URL | URL of the item | https://nmhc.selectleaders.com/job/90569/senior-debt-analyst/ | |
JobPosting | salaryCurrency | 15% | 4% | Text | ISO 4217 currency for salary | GBP | Occasionally € instead of EUR |
JobPosting | educationRequirements | 10% | 7% | EducationalOccupationalCredential (Rarely), Text | Educational background needed | MBO | Often it’s “UNAVAILABLE” or “Not Applicable”, etc |
JobPosting | occupationalCategory | 9% | 8% | CategoryCode, Text | A category describing the job | Information Technology | Very little consistency |
JobPosting | experienceRequirements | 9% | 9% | Text | Skills and experience required | 1 - 2 Year(s) | Sometimes years, sometimes level (senior), sometimes text description |
JobPosting | workHours | 8% | 8% | Text | Typical working hours | 11:00~24:00 週2日 | Variable formats, sometimes specifying “various” |
JobPosting | jobBenefits | 8% | 2% | Text | Benefits associated with job | Job Security, HRA, TA, DA | Inconsistent, often a comma-separated list of text |
JobPosting | skills | 8% | 4% | DefinedTerm, Text | Competency needed to fill this role | JavaScript, Apple iOS, Android | Often comma separated list |
JobPosting | qualifications | 7% | 6% | EducationalOccupationalCredential (Rarely), Text | Qualifications required for this role | Sie müssen Personaler eines Unternehmens sein | Typically long text, inconsistent |
JobPosting | image | 6% | 7% | URL (mostly), ImageObject (sometimes) | An image of the item | https://images.rigzone.com/images/rz-logo.jpg | Mostly logos |
JobPosting | jobLocationType | 3% | <1% | Text | A description of the job location | TELECOMMUTE | When it’s present it’s almost always TELECOMMUTE |
JobPosting | incentiveCompensation | 3% | <1% | Text | Bonuses and commissions | Provides Equity | Highly variable |
Place (JobLocation) | address | 94% | 44% | PostalAddress (mostly), Text | Physical Address | Amsterdam (Also see PostalAddress) | |
Place | geo | 5% | <1% | GeoCoordinates (mostly), Text | Geo coordinates of place | 54.727,55.955 (See GeoCoordinates) | Sometimes 0,0 (garbage) |
Place | name | 3% | 1% | Text | Name of the place | Southwark | |
PostalAddress (address) | addressLocality | 89% | 34% | Text | Location within Region | Philadelphia | |
PostalAddress | addressRegion | 81% | 30% | Text | Region within Country | California | Sometimes country specific abbreviation |
PostalAddress | addressCountry | 77% | 16% | Text (95%), Country (5%) | Country | United States | Sometimes name, sometimes country code |
PostalAddress | postalCode | 54% | 12% | Text, Int | Postal Code | J3A1B6 | |
PostalAddress | streetAddress | 34% | 7% | Text | Street Address | 21 Lassell Gardens | Can be junk like ‘-’, UNKNOWN |
GeoCoordinates (geo) | latitude | 5% | <1% | Float,Text | Latitude (WGS 84) | 54.727615356462 | |
GeoCoordinates | longitude | 5% | <1% | Float,Text | Longitude (WGS 84) | 55.955778063477 | |
Country (addressCountry) | name | 4% | 0% | Text | Name of the country | Italia | Sometimes name, sometimes country code |
MonetaryAmount (baseSalary) | currency | 38% | 11% | Text | Currency (e.g. ISO 4217 code) | GBP | |
MonetaryAmount | value | 44% | 9% | QuantitativeValue (95%), Text (5%) | Quantity of salary | 25000 (See QuantitativeValue) | When text contains odd things like ‘Hourly’ |
MonetaryAmount | minValue | 2% | 2% | Text, Int, Float | Lower value of salary | 9.94 | |
MonetaryAmount | maxValue | 2% | 2% | Text, Int, Float | Upper Value of salary | 13,500,000 | |
QuantitativeValue (baseSalary value) | unitText | 35% | 5% | Text | Unit of measurement | YEAR | Should be 3 letter UN/CEFACT Common Code (e.g. HUR, DAY, WEE, MON, ANN) |
QuantitativeValue | minValue | 7% | 4% | Text, Int, Float | Lower value of salary | 400 | Sometimes 0 |
QuantitativeValue | maxValue | 6% | 3% | Text, Int, Float | Upper value of salary | 10000 | Sometimes 0 |
QuantitativeValue | value | 29% | 2% | Text, Int, Float | Quantity of salary | 300 | Sometimes NULL, sometimes text range |
Organization (hiringOrganization) | name | 94% | 32% | Text | Name of the company | Anixter International | |
Organization (hiringOrganization) | sameAs | 60% | 6% | URL | URL that identifies company | https://www.socialdeal.nl | Appear to be URLs |
Organization (hiringOrganization) | logo | 54% | 10% | URL (usually), ImageObject (sometimes) | Associated logo | https://kaigoworker.jp/img/gfjimg_kaigo.png | |
Organization (hiringOrganization) | url | 5% | 8% | URL | URL of company | http://www.lgcassociates.com | |
ImageObject (image, logo) | url | | | URL | URL of image | https://www.hiq.se/globalassets/bilder/hiq_bg_bild_some.jpg | |
ImageObject | contentUrl | | | URL | URL of image | https://media.rabota.ru/processor/logo/small/2010/04/08/silajjn.gif | |
ImageObject | width | | | Int | Width of image | 1043 | |
ImageObject | height | | | Int | Height of image | 1800 | |
ImageObject | name | | | Text | Name of image | TRN Logo with Website | |
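Several notes in the table point out that dates are often not ISO-8601. A minimal sketch of normalising them is below: it tries ISO parsing first, then a short list of fallback formats. The fallback list and the function name `to_iso_date` are assumptions for illustration; real data will need more formats, and a month-first reading of `11/20/2019` is a guess that would be wrong for day-first sites.

```python
from datetime import datetime

# Assumed fallback formats, tried in order; extend for real data
FALLBACK_FORMATS = [
    "%m/%d/%Y %I:%M:%S %p",  # e.g. 11/20/2019 08:55:24 AM (month-first assumed)
    "%d.%m.%Y",
]


def to_iso_date(value):
    """Return an ISO-8601 date string (YYYY-MM-DD), or None if nothing matches."""
    text = value.strip()
    try:
        return datetime.fromisoformat(text).date().isoformat()
    except ValueError:
        pass
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("11/20/2019 08:55:24 AM"))
print(to_iso_date("2019-11-28"))
```

Returning None rather than raising keeps the unparseable values easy to count, which is useful when the goal is to measure how messy a field is.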
Getting the data
The nquad data was turned into Graphs with parse_nquads.
    import gzip
    from tqdm import tqdm

    # parse_nquads and get_job_postings are helper functions not shown here
    f = gzip.open('ndquads.gz', 'rt')
    all_graphs = parse_nquads(f)

    seen_domains = set()
    graphs = []
    skipped = []
    for _ in tqdm(range(100_000)):
        graph = next(all_graphs)
        dom = graph_domain(graph)
        # Keep only the first JobPosting seen for each domain
        if dom in seen_domains:
            continue
        try:
            jp = list(get_job_postings(graph))[0]
        except IndexError:
            # This can happen because a disjoint graph from the
            # page without a job posting is split
            skipped.append((graph.identifier, dom))
            continue
        graphs.append((graph, jp))
        seen_domains.add(dom)
The domain of the graph is extracted with a simple function:
    import urllib.parse

    def graph_domain(graph):
        return urllib.parse.urlparse(graph.identifier).netloc
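For a quick check of what this returns, the sketch below repeats the function and calls it on a stand-in object. The `SimpleNamespace` is a hypothetical substitute for a parsed graph, used only because the function reads nothing but the `identifier` attribute.

```python
import urllib.parse
from types import SimpleNamespace


def graph_domain(graph):
    return urllib.parse.urlparse(graph.identifier).netloc


# Stand-in for a parsed graph whose identifier is the page URL
fake_graph = SimpleNamespace(
    identifier="https://nmhc.selectleaders.com/job/90569/senior-debt-analyst/"
)
domain = graph_domain(fake_graph)
print(domain)
```

Note that `netloc` keeps subdomains and any port, so `www.example.com` and `example.com` would be counted as separate domains when sampling one posting per domain.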
You can view the whole laborious analysis in Jupyter.