Extracting Australian Job Postings with SPARQL

jobs

commoncrawl

rdf

Published

June 25, 2020

I am trying to extract Australian Job Postings from Web Data Commons which extracts structured data from Common Crawl. I have previously written scripts to read in the graphs, explore JobPosting schema and analyst the schema using SPARQL. Now we can use these to find some Austrlian Job Postings in the data.

For this analysis I used 15,000 pages containing job postings with different domains from the 2019 Web Data Commons Extract. Here’s the final query that extracts 285 domains; the rest of this article will explain what it’s doing.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>
PREFIX sdo_mv: <http://schema.org/MonetaryValue/>
PREFIX sdos_mv: <https://schema.org/MonetaryValue/>


SELECT distinct ?src
WHERE {
 { 
  GRAPH ?src
  {[] a sdo:JobPosting .}
  BIND (replace(str(?src), 
                'https?://([^?/]+).*',
                '\\1') AS ?domain)
    FILTER (strends(?domain, '.au'))
 }
 UNION
 {
  GRAPH ?src
  {
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country) && 
                 lcase(replace(str(?country),
                               '[ \n\t]*(.*)[ \n\t]*',
                               '\\1')) in ('au', 'australia'))
    }
    UNION
    {[] a sdo:JobPosting ;
        ((sdo:salaryCurrency|sdo_jp:salaryCurrency)|
         (sdo:baseSalary|sdo_jp:baseSalary)/
         (sdo:currency|sdo_mv:currency|sdos_mv:currency)) ?currency .
    BIND (replace(str(?currency), '[ \n\t]+', '') as ?curr)
    FILTER (lcase(?curr) = 'aud')}
  }
 }
}

Finding Australian Domains

The fact there is a URL in the JobPostings extract in Web Data Commons tells you that the URL contains a structured Job Posting. One heuristic for finding Australian job listings is looking for domains ending in .au.

We can get the URL from Common Crawl containing the data by searching for the graph identifier, which we’ll call ?src, filtering to graphs containing a JobPosting.

PREFIX sdo: <http://schema.org/>

SELECT ?src
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting .}
}
LIMIT 10

This gives a list of 10 URLs like https://tire-factory.hiringthing.com/job/17125/warehouse-associate.

We can extract the domain with a regular expression using the replace function, and just get results that end in .au using strends (string-ends).

PREFIX sdo: <http://schema.org/>

SELECT DISTINCT ?src
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting .}
    BIND (replace(str(?src), 'https?://([^?/]+).*', '\\1') AS ?domain)
    FILTER (strends(?domain, '.au'))
}

This gets 233 URLs containing job postings; but we can do a little better if we use the structured data in the job postings.

Finding Job Postings located in Australia

The JobPosting Schema contains a jobLocation, which can be a Place which can contain an address, which can be a PostalAddress, which can contain an addressCountry, which can be a Country which can have a name. Phew!

Extracting Country Name with property paths

We can express this succinctly using SPARQL property paths.

PREFIX sdo: <http://schema.org/>

SELECT ?country (COUNT(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:jobLocation/sdo:address/sdo:addressCountry/sdo:name ?country .}
}
GROUP BY ?country
ORDER BY DESC(?count)
LIMIT 15

Country	count
US	127
CA	20
DE	20
GB	18
IL	14

Note that we count distinct graph identifiers; because a page can contain multiple job listings (which in turn can contain multiple jobLocations) it may contribute to multiple countries.

Extracting plain text addressCountry

The addressCountry can also be plain text, and in fact that’s much more common. We can filter out the cases where it’s a structured value (and so ?country is a blank node) using isLiteral.

PREFIX sdo: <http://schema.org/>

SELECT ?country (COUNT(distinct ?src) AS ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:jobLocation/sdo:address/sdo:addressCountry ?country .}
    FILTER (isLiteral(?country))
}
GROUP BY ?country
ORDER BY DESC(?count)
LIMIT 10

In the results we can see 86 jobs with AU in the countries.

Country	Count
United States	385
JP	358
GB	345
US	320
DE	270
NL	253
Deutschland	179
United Kingdom	139
FR	110
AU	86

Matching Country and text at the same time

We should be able to combine the two by making name optional with the ZeroOrOnePath operator ?.

PREFIX sdo: <http://schema.org/>

SELECT ?country (count(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:jobLocation/sdo:address/sdo:addressCountry/(sdo:name?) ?country .}
}
GROUP BY ?country
ORDER BY DESC(?total)
LIMIT 15

However for some strange reason we end up with some URIs in the results:

Country	Count
US	447
United States	385
GB	363
JP	359
DE	290
NL	257
Deutschland	179
United Kingdom	140
FR	116
AU	91
CA	81
India	60
	60
http://schema.org/JobPosting	56
http://schema.org/Place	56

Oddly enough this doesn’t happen if we rewrite it as an alternation:

PREFIX sdo: <http://schema.org/>

SELECT ?country (count(?src) as ?total)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:jobLocation/sdo:address/(sdo:addressCountry|sdo:addressCountry/sdo:name) ?country .}
}
GROUP BY ?country
ORDER BY DESC(?total)
LIMIT 15

Country	Total
US	609
JP	444
United States	395
GB	363
DE	304
NL	258
Deutschland	179
United Kingdom	140
FR	116
AU	91
CA	84
India	70
Canada	66
	60
IN	57

I would expect these to be the same; but I don’t know if my understanding of SPARQL is wrong or it’s a bug in rdflib. When we filter to literal nodes we get the same results, so I’m not going to dwell on it.

Fully qualified paths

In the microdata extract the properties are specified by fully qualified paths, for example <http://schema.org/Place/address> instead of just <http://schema.org/address>. So we need to match these patterns too, which means adding a whole heap more prefixes.

We can check the property it’s binding on, but have to be careful to filter out common strings to reduce false positives (e.g. if ?country is the empty string then this will extract all properties with an empty string).

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>

SELECT ?property (count(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
         sdo_jp:jobLocation/sdo_pl:address/sdo_pa:addressCountry/(sdo_co:name?) ?country .
         [] ?property ?country .
         FILTER (isliteral(?country) &&
                (lcase(str(?country)) not in ('', 'na', 'n/a', 'unavailable', ' ', 'null')))
         }
}
GROUP BY ?property
ORDER BY DESC(?count)
LIMIT 10

Having a Country is very rare in microdata, but this looks about right.

Property	Count
http://schema.org/PostalAddress/addressCountry	1351
http://schema.org/Country/name	4
http://schema.org/PostalAddress/addressLocality	3
http://schema.org/PostalAddress/streetAddress	1
http://schema.org/PostalAddress/addressRegion	1

Combining the patterns

We can combine the two possible schema paths using alternations.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>

SELECT ?country (count(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country))
         }
}
GROUP BY ?country
ORDER BY DESC(?count)
LIMIT 10

Unfortunately sometimes the country has a language tag and this means the results are treated differently.

Country	Count
United States (Lang=EN)	469
US	463
United States	393
GB	365
JP	359
DE	295
RU	283
NL	257
Deutschland	186
United Kingdom	142

We can strip away the language tags by converting it to a plain string with str. Furthermore we can remove any leading/trailing whitespace with a regular expression.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>

SELECT ?countryplain (count(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country))
         BIND (replace(str(?country), '[ \n\t]*(.*)[ \n\t]*', '\\1') as ?countryplain)
         }
}
GROUP BY ?countryplain
HAVING (COUNT(distinct ?src) >= 50)
ORDER BY DESC(?count)

There’s still some normalisation to do; United States, US and USA are all the same as are DE, Deutschland and Germany.

Country	Count
United States	863
US	496
GB	381
JP	362
DE	355
RU	287
NL	264
Deutschland	192
United Kingdom	175
FR	128
AU	96
CA	88
India	65
Canada	61
	60
IN	59
Germany	50
USA	50

We find jobs located in Australia looking for the country being ‘AU’ or ‘Australia’ in some case, after trimming whitespace.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>

SELECT DISTINCT ?src
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country) &&
                 lcase(replace(str(?country),
                       '[ \n\t]*(.*)[ \n\t]*', '\\1'))
                 in ('au', 'australia'))
         }
}

This gets 124 URLs, 40 of which don’t end in .au. This includes some New Zealand job sites, some global companies, some talent platforms with company subdomains for Australian companies (breezy.hr, gosnaphot and recruitee and jobsindevenport.com which is a site dedicated to jobs in the city of Devonport in Tasmania. The majority of these look like Australian job ads.

Note that this means that around half of the jobs in a .au domain don’t have Australia as a country. I’m willing to guess this is because the metadata is incomplete; they probably don’t have an addressCountry property at all.

Another place to look for a location would be applicantLocationRequirements which is used for remote jobs, but isn’t used much in practice and so doesn’t seem worth investigating.

Jobs paying Australian Dollars

Australia has it’s own unique currency, the Australian Dollar (AUD). We could try to find Australian jobs by extracting the currency from the job and matching to AUD.

The easiest way is with the salaryCurrency field, removing any lanugage tags as before.

PREFIX sdo: <http://schema.org/>

SELECT ?curr (COUNT(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
        sdo:salaryCurrency ?currency .
    }
    BIND (str(?currency) as ?curr)
}
GROUP BY ?curr
ORDER BY DESC(?count)
LIMIT 10

Currency	Count
GBP	179
EUR	93
USD	69
€	58
AUD	41
JPY	27
	13
円	8
INR	7
HKD	7

Another way the currency can be encoded is as the currency in the baseSalary:

PREFIX sdo: <http://schema.org/>

SELECT ?curr (COUNT(distinct ?src) AS ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
        sdo:baseSalary/sdo:currency ?currency .
    }
    BIND (str(?currency) as ?curr)
}
GROUP BY ?curr
ORDER BY DESC(?count)
LIMIT 10

Currency	Count
GBP	314
JPY	261
USD	234
EUR	211
	117
INR	102
JPN	93
€	62
AUD	54
AFA	23

Combining all the currency variants

As before we add the fully qualified schemas to get every possible variation. We also add <https://schema.org/MonetaryValue/> because this occurs a few times in practice. In fact the https://schema.org should be equivalent so I should check it everywhere doubling the number of variants. It doesn’t occur much in this dataset, so I mostly ignore it here, but it might become a bigger issue in future.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_mv: <http://schema.org/MonetaryValue/>
PREFIX sdos_mv: <https://schema.org/MonetaryValue/>


SELECT ?curr (COUNT(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
        ((sdo:salaryCurrency|sdo_jp:salaryCurrency)|
         (sdo:baseSalary|sdo_jp:baseSalary)/
          (sdo:currency|sdo_mv:currency|sdos_mv:currency)) ?currency .
    }
    BIND (replace(str(?currency), '[ \n\t]+', '') as ?curr)
    FILTER (!(lcase(?curr) in ('', 'null', 'na', 'n/a', 'unavailable')))
}
GROUP BY ?curr
ORDER BY DESC(?count)
LIMIT 20

The resulting data is pretty good; we could further normalise € as EUR and £ as GBP, but the currencies otherwise look like ISO 4217 currency codes.

Currency	Count
GBP	392
USD	302
EUR	295
JPY	266
AUD	114
INR	108
JPN	93
€	68
CZK	57
RUB	50
RUR	49
AFA	23
CAD	19
VND	18
HKD	14
£	14
BRL	12
SEK	11
PKR	10
THB	10

Finally we can filter down to the 114 Job ads offering salary in AUD:

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_mv: <http://schema.org/MonetaryValue/>
PREFIX sdos_mv: <https://schema.org/MonetaryValue/>

SELECT distinct ?src
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
        ((sdo:salaryCurrency|sdo_jp:salaryCurrency)|
         (sdo:baseSalary|sdo_jp:baseSalary)/(sdo:currency|sdo_mv:currency|sdos_mv:currency)) ?currency .
    }
    BIND (replace(str(?currency), '[ \n\t]+', '') as ?curr)
    FILTER (lcase(?curr) = 'aud')
}

This gives 114 jobs, of which 18 don’t have a .au domain and 12 of those don’t have Australia as a country. Most of these jobs are valid Australian jobs, but for some reason there are a few New Zealand jobs (which should be in NZD).

Combining all the results

The query at the start of the article is just the UNION of the three variants: .au domain, Australia as a country or AUD as the currency.

One catch is that we need to specify the .au in a separate GRAPH query because it filters the domain, which we don’t want to do for country or salary.

 { 
  GRAPH ?src
  {[] a sdo:JobPosting .}
  BIND (replace(str(?src), 
                'https?://([^?/]+).*',
                '\\1') AS ?domain)
    FILTER (strends(?domain, '.au'))
 }

But the Country and Salary queries can be done in the same GRAPH search

 {
  GRAPH ?src
  {
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country) && 
                 lcase(replace(str(?country),
                               '[ \n\t]*(.*)[ \n\t]*',
                               '\\1')) in ('au', 'australia'))
    }
    UNION
    {[] a sdo:JobPosting ;
        ((sdo:salaryCurrency|sdo_jp:salaryCurrency)|
         (sdo:baseSalary|sdo_jp:baseSalary)/
         (sdo:currency|sdo_mv:currency|sdos_mv:currency)) ?currency .
    BIND (replace(str(?currency), '[ \n\t]+', '') as ?curr)
    FILTER (lcase(?curr) = 'aud')}
  }

You can see the very similar Jupyter notebook for all the underlying code and analysis.

Now that we have a way of identifying metadata relating to Australian jobs we can start to build them into a pipeline to extract and analyse the data.