I am trying to extract Australian Job Postings from Web Data Commons which extracts structured data from Common Crawl. I have previously written scripts to read in the graphs, explore JobPosting schema and analyst the schema using SPARQL. Now we can use these to find some Austrlian Job Postings in the data.

For this analysis I used 15,000 pages containing job postings with different domains from the 2019 Web Data Commons Extract. Here's the final query that extracts 285 domains; the rest of this article will explain what it's doing.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>
PREFIX sdo_mv: <http://schema.org/MonetaryValue/>
PREFIX sdos_mv: <https://schema.org/MonetaryValue/>


SELECT distinct ?src
WHERE {
 { 
  GRAPH ?src
  {[] a sdo:JobPosting .}
  BIND (replace(str(?src), 
                'https?://([^?/]+).*',
                '\\1') AS ?domain)
    FILTER (strends(?domain, '.au'))
 }
 UNION
 {
  GRAPH ?src
  {
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country) && 
                 lcase(replace(str(?country),
                               '[ \n\t]*(.*)[ \n\t]*',
                               '\\1')) in ('au', 'australia'))
    }
    UNION
    {[] a sdo:JobPosting ;
        ((sdo:salaryCurrency|sdo_jp:salaryCurrency)|
         (sdo:baseSalary|sdo_jp:baseSalary)/
         (sdo:currency|sdo_mv:currency|sdos_mv:currency)) ?currency .
    BIND (replace(str(?currency), '[ \n\t]+', '') as ?curr)
    FILTER (lcase(?curr) = 'aud')}
  }
 }
}

Finding Australian Domains

The fact there is a URL in the JobPostings extract in Web Data Commons tells you that the URL contains a structured Job Posting. One heuristic for finding Australian job listings is looking for domains ending in .au.

We can get the URL from Common Crawl containing the data by searching for the graph identifier, which we'll call ?src, filtering to graphs containing a JobPosting.

PREFIX sdo: <http://schema.org/>

SELECT ?src
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting .}
}
LIMIT 10

This gives a list of 10 URLs like https://tire-factory.hiringthing.com/job/17125/warehouse-associate.

We can extract the domain with a regular expression using the replace function, and just get results that end in .au using strends (string-ends).

PREFIX sdo: <http://schema.org/>

SELECT DISTINCT ?src
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting .}
    BIND (replace(str(?src), 'https?://([^?/]+).*', '\\1') AS ?domain)
    FILTER (strends(?domain, '.au'))
}

This gets 233 URLs containing job postings; but we can do a little better if we use the structured data in the job postings.

Finding Job Postings located in Australia

The JobPosting Schema contains a jobLocation, which can be a Place which can contain an address, which can be a PostalAddress, which can contain an addressCountry, which can be a Country which can have a name. Phew!

Extracting Country Name with property paths

We can express this succinctly using SPARQL property paths.

PREFIX sdo: <http://schema.org/>

SELECT ?country (COUNT(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:jobLocation/sdo:address/sdo:addressCountry/sdo:name ?country .}
}
GROUP BY ?country
ORDER BY DESC(?count)
LIMIT 15
Country count
US 127
CA 20
DE 20
GB 18
IL 14

Note that we count distinct graph identifiers; because a page can contain multiple job listings (which in turn can contain multiple jobLocations) it may contribute to multiple countries.

Extracting plain text addressCountry

The addressCountry can also be plain text, and in fact that's much more common. We can filter out the cases where it's a structured value (and so ?country is a blank node) using isLiteral.

PREFIX sdo: <http://schema.org/>

SELECT ?country (COUNT(distinct ?src) AS ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:jobLocation/sdo:address/sdo:addressCountry ?country .}
    FILTER (isLiteral(?country))
}
GROUP BY ?country
ORDER BY DESC(?count)
LIMIT 10

In the results we can see 86 jobs with AU in the countries.

Country Count
United States 385
JP 358
GB 345
US 320
DE 270
NL 253
Deutschland 179
United Kingdom 139
FR 110
AU 86

Matching Country and text at the same time

We should be able to combine the two by making name optional with the ZeroOrOnePath operator ?.

PREFIX sdo: <http://schema.org/>

SELECT ?country (count(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:jobLocation/sdo:address/sdo:addressCountry/(sdo:name?) ?country .}
}
GROUP BY ?country
ORDER BY DESC(?total)
LIMIT 15

However for some strange reason we end up with some URIs in the results:

Country Count
US 447
United States 385
GB 363
JP 359
DE 290
NL 257
Deutschland 179
United Kingdom 140
FR 116
AU 91
CA 81
India 60
60
http://schema.org/JobPosting 56
http://schema.org/Place 56

Oddly enough this doesn't happen if we rewrite it as an alternation:

PREFIX sdo: <http://schema.org/>

SELECT ?country (count(?src) as ?total)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting; sdo:jobLocation/sdo:address/(sdo:addressCountry|sdo:addressCountry/sdo:name) ?country .}
}
GROUP BY ?country
ORDER BY DESC(?total)
LIMIT 15
Country Total
US 609
JP 444
United States 395
GB 363
DE 304
NL 258
Deutschland 179
United Kingdom 140
FR 116
AU 91
CA 84
India 70
Canada 66
60
IN 57

I would expect these to be the same; but I don't know if my understanding of SPARQL is wrong or it's a bug in rdflib. When we filter to literal nodes we get the same results, so I'm not going to dwell on it.

Fully qualified paths

In the microdata extract the properties are specified by fully qualified paths, for example <http://schema.org/Place/address> instead of just <http://schema.org/address>. So we need to match these patterns too, which means adding a whole heap more prefixes.

We can check the property it's binding on, but have to be careful to filter out common strings to reduce false positives (e.g. if ?country is the empty string then this will extract all properties with an empty string).

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>

SELECT ?property (count(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
         sdo_jp:jobLocation/sdo_pl:address/sdo_pa:addressCountry/(sdo_co:name?) ?country .
         [] ?property ?country .
         FILTER (isliteral(?country) &&
                (lcase(str(?country)) not in ('', 'na', 'n/a', 'unavailable', ' ', 'null')))
         }
}
GROUP BY ?property
ORDER BY DESC(?count)
LIMIT 10

Having a Country is very rare in microdata, but this looks about right.

Property Count
http://schema.org/PostalAddress/addressCountry 1351
http://schema.org/Country/name 4
http://schema.org/PostalAddress/addressLocality 3
http://schema.org/PostalAddress/streetAddress 1
http://schema.org/PostalAddress/addressRegion 1

Combining the patterns

We can combine the two possible schema paths using alternations.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>

SELECT ?country (count(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country))
         }
}
GROUP BY ?country
ORDER BY DESC(?count)
LIMIT 10

Unfortunately sometimes the country has a language tag and this means the results are treated differently.

Country Count
United States (Lang=EN) 469
US 463
United States 393
GB 365
JP 359
DE 295
RU 283
NL 257
Deutschland 186
United Kingdom 142

We can strip away the language tags by converting it to a plain string with str. Furthermore we can remove any leading/trailing whitespace with a regular expression.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>

SELECT ?countryplain (count(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country))
         BIND (replace(str(?country), '[ \n\t]*(.*)[ \n\t]*', '\\1') as ?countryplain)
         }
}
GROUP BY ?countryplain
HAVING (COUNT(distinct ?src) >= 50)
ORDER BY DESC(?count)

There's still some normalisation to do; United States, US and USA are all the same as are DE, Deutschland and Germany.

Country Count
United States 863
US 496
GB 381
JP 362
DE 355
RU 287
NL 264
Deutschland 192
United Kingdom 175
FR 128
AU 96
CA 88
India 65
Canada 61
60
IN 59
Germany 50
USA 50

We find jobs located in Australia looking for the country being 'AU' or 'Australia' in some case, after trimming whitespace.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_pl: <http://schema.org/Place/>
PREFIX sdo_pa: <http://schema.org/PostalAddress/>
PREFIX sdo_co: <http://schema.org/Country/>

SELECT DISTINCT ?src
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country) &&
                 lcase(replace(str(?country),
                       '[ \n\t]*(.*)[ \n\t]*', '\\1'))
                 in ('au', 'australia'))
         }
}

This gets 124 URLs, 40 of which don't end in .au. This includes some New Zealand job sites, some global companies, some talent platforms with company subdomains for Australian companies (breezy.hr, gosnaphot and recruitee and jobsindevenport.com which is a site dedicated to jobs in the city of Devonport in Tasmania. The majority of these look like Australian job ads.

Note that this means that around half of the jobs in a .au domain don't have Australia as a country. I'm willing to guess this is because the metadata is incomplete; they probably don't have an addressCountry property at all.

Another place to look for a location would be applicantLocationRequirements which is used for remote jobs, but isn't used much in practice and so doesn't seem worth investigating.

Jobs paying Australian Dollars

Australia has it's own unique currency, the Australian Dollar (AUD). We could try to find Australian jobs by extracting the currency from the job and matching to AUD.

The easiest way is with the salaryCurrency field, removing any lanugage tags as before.

PREFIX sdo: <http://schema.org/>

SELECT ?curr (COUNT(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
        sdo:salaryCurrency ?currency .
    }
    BIND (str(?currency) as ?curr)
}
GROUP BY ?curr
ORDER BY DESC(?count)
LIMIT 10
Currency Count
GBP 179
EUR 93
USD 69
58
AUD 41
JPY 27
13
8
INR 7
HKD 7

Another way the currency can be encoded is as the currency in the baseSalary:

PREFIX sdo: <http://schema.org/>

SELECT ?curr (COUNT(distinct ?src) AS ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
        sdo:baseSalary/sdo:currency ?currency .
    }
    BIND (str(?currency) as ?curr)
}
GROUP BY ?curr
ORDER BY DESC(?count)
LIMIT 10
Currency Count
GBP 314
JPY 261
USD 234
EUR 211
117
INR 102
JPN 93
62
AUD 54
AFA 23

Combining all the currency variants

As before we add the fully qualified schemas to get every possible variation. We also add <https://schema.org/MonetaryValue/> because this occurs a few times in practice. In fact the https://schema.org should be equivalent so I should check it everywhere doubling the number of variants. It doesn't occur much in this dataset, so I mostly ignore it here, but it might become a bigger issue in future.

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_mv: <http://schema.org/MonetaryValue/>
PREFIX sdos_mv: <https://schema.org/MonetaryValue/>


SELECT ?curr (COUNT(distinct ?src) as ?count)
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
        ((sdo:salaryCurrency|sdo_jp:salaryCurrency)|
         (sdo:baseSalary|sdo_jp:baseSalary)/
          (sdo:currency|sdo_mv:currency|sdos_mv:currency)) ?currency .
    }
    BIND (replace(str(?currency), '[ \n\t]+', '') as ?curr)
    FILTER (!(lcase(?curr) in ('', 'null', 'na', 'n/a', 'unavailable')))
}
GROUP BY ?curr
ORDER BY DESC(?count)
LIMIT 20

The resulting data is pretty good; we could further normalise € as EUR and £ as GBP, but the currencies otherwise look like ISO 4217 currency codes.

Currency Count
GBP 392
USD 302
EUR 295
JPY 266
AUD 114
INR 108
JPN 93
68
CZK 57
RUB 50
RUR 49
AFA 23
CAD 19
VND 18
HKD 14
£ 14
BRL 12
SEK 11
PKR 10
THB 10

Finally we can filter down to the 114 Job ads offering salary in AUD:

PREFIX sdo: <http://schema.org/>
PREFIX sdo_jp: <http://schema.org/JobPosting/>
PREFIX sdo_mv: <http://schema.org/MonetaryValue/>
PREFIX sdos_mv: <https://schema.org/MonetaryValue/>

SELECT distinct ?src
WHERE {
    GRAPH ?src
    {[] a sdo:JobPosting ;
        ((sdo:salaryCurrency|sdo_jp:salaryCurrency)|
         (sdo:baseSalary|sdo_jp:baseSalary)/(sdo:currency|sdo_mv:currency|sdos_mv:currency)) ?currency .
    }
    BIND (replace(str(?currency), '[ \n\t]+', '') as ?curr)
    FILTER (lcase(?curr) = 'aud')
}

This gives 114 jobs, of which 18 don't have a .au domain and 12 of those don't have Australia as a country. Most of these jobs are valid Australian jobs, but for some reason there are a few New Zealand jobs (which should be in NZD).

Combining all the results

The query at the start of the article is just the UNION of the three variants: .au domain, Australia as a country or AUD as the currency.

One catch is that we need to specify the .au in a separate GRAPH query because it filters the domain, which we don't want to do for country or salary.

 { 
  GRAPH ?src
  {[] a sdo:JobPosting .}
  BIND (replace(str(?src), 
                'https?://([^?/]+).*',
                '\\1') AS ?domain)
    FILTER (strends(?domain, '.au'))
 }

But the Country and Salary queries can be done in the same GRAPH search

 {
  GRAPH ?src
  {
    {[] a sdo:JobPosting ;
         (sdo:jobLocation|sdo_jp:jobLocation)/
         (sdo:address|sdo_pl:address)/
         (sdo:addressCountry|sdo_pa:addressCountry)/
         ((sdo:name|sdo_co:name)?) ?country .
         FILTER (isliteral(?country) && 
                 lcase(replace(str(?country),
                               '[ \n\t]*(.*)[ \n\t]*',
                               '\\1')) in ('au', 'australia'))
    }
    UNION
    {[] a sdo:JobPosting ;
        ((sdo:salaryCurrency|sdo_jp:salaryCurrency)|
         (sdo:baseSalary|sdo_jp:baseSalary)/
         (sdo:currency|sdo_mv:currency|sdos_mv:currency)) ?currency .
    BIND (replace(str(?currency), '[ \n\t]+', '') as ?curr)
    FILTER (lcase(?curr) = 'aud')}
  }

You can see the very similar Jupyter notebook for all the underlying code and analysis.

Now that we have a way of identifying metadata relating to Australian jobs we can start to build them into a pipeline to extract and analyse the data.