Common Crawl Time Ranges
commoncrawl

Common Crawl provides a huge open web dataset going back to around 2009. Unfortunately it's not easy to find out the time period covered by each index, so I ran a quick job to get rough estimates of the periods. This is useful if you want to get data from a specific time range. The best way to do this would be to use the Athena columnar index to search the dates, but I didn't have Athena set up and I'd have to be a little careful about the costs.
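
As a rough illustration of the kind of quick job this involves, here is a sketch that samples capture timestamps from one crawl's CDX index to estimate its time window. The crawl id and probe URL are my own illustrative choices, and a single site only gives a rough sample of the crawl period, not its exact start and end.

```python
import json
import requests

CRAWL = "CC-MAIN-2020-05"  # illustrative crawl id
CDX_URL = f"https://index.commoncrawl.org/{CRAWL}-index"

# Ask the CDX server for every capture of a frequently crawled site.
resp = requests.get(
    CDX_URL,
    params={"url": "commoncrawl.org/*", "output": "json"},
    timeout=60,
)
resp.raise_for_status()

# Each line is a JSON record with a 14-digit timestamp (YYYYMMDDhhmmss).
timestamps = [
    json.loads(line)["timestamp"]
    for line in resp.text.splitlines()
    if line.strip()
]

print(f"{CRAWL}: captures span {min(timestamps)} to {max(timestamps)} "
      f"({len(timestamps)} samples)")
```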

Building a Job Extraction Pipeline
jobs

I've been trying to extract job ads from Common Crawl. However I was stuck for some time on how to actually write transforms for all the different data sources. I've finally come up with an architecture that works: download, extract and normalise. I need a way to extract the job ads from heterogeneous sources that lets me pull out different kinds of data, such as the title, location and salary. I was stuck in the code for a long time trying to do all of this together and getting confused about how to make changes.
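
To make the idea concrete, here is a minimal sketch (my own illustration, not the post's actual code) of how the download, extract and normalise stages could be separated so each source only has to implement its own pieces.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass
class JobAd:
    """Normalised representation shared by every source."""
    title: str
    location: Optional[str] = None
    salary: Optional[str] = None


class Source(Protocol):
    """What each heterogeneous data source has to provide."""

    def download(self) -> Iterable[bytes]:
        """Fetch the raw records (e.g. HTML pages) for this source."""
        ...

    def extract(self, raw: bytes) -> dict:
        """Pull source-specific fields out of one raw record."""
        ...

    def normalise(self, fields: dict) -> JobAd:
        """Map the source-specific fields onto the common JobAd schema."""
        ...


def run(source: Source) -> list:
    """Run the three stages in order; each stage stays independently testable."""
    return [source.normalise(source.extract(raw)) for raw in source.download()]
```

Keeping the stages separate means a change to, say, salary normalisation doesn't touch the download or extraction code.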

Processing RDF nquads with grep
data

I am trying to extract Australian Job Postings from Web Data Commons, which extracts structured data from Common Crawl. I previously came up with a SPARQL query to find the Australian jobs based on the domain, country and currency. Unfortunately it's quite slow, but we can speed it up dramatically by replacing it with a similar script in grep. With a short grep script we can get twenty thousand Australian Job Postings with metadata from 16 million lines of compressed nquads in 30 seconds on my laptop.
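
The post's script uses grep, but a rough Python equivalent shows the shape of the filter: stream the compressed nquads and keep lines that look Australian. The patterns and filename below are illustrative assumptions, not the exact ones from the post, and a real zgrep pipeline will be much faster than this loop.

```python
import gzip
import re

# Illustrative patterns: a .au address, an AU country code, or AUD currency.
AU_PATTERN = re.compile(rb'\.au[/">]|"AU"|"AUD"')

matches = 0
with gzip.open("jobpostings_sample.nq.gz", "rb") as f:  # hypothetical input file
    for line in f:
        if AU_PATTERN.search(line):
            matches += 1

print(f"{matches} candidate Australian nquad lines")
```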

Extracting Australian Job Postings with SPARQL
jobs

I am trying to extract Australian Job Postings from Web Data Commons, which extracts structured data from Common Crawl. I have previously written scripts to read in the graphs, explore the JobPosting schema and analyse the schema using SPARQL. Now we can use these to find some Australian Job Postings in the data. For this analysis I used 15,000 pages containing job postings from different domains in the 2019 Web Data Commons extract.
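
Here is a sketch of what such a query could look like with rdflib; the query and file name are my own illustration, assuming the data uses the http://schema.org/ vocabulary, rather than the exact query from the post.

```python
from rdflib import ConjunctiveGraph

g = ConjunctiveGraph()
g.parse("jobpostings_sample.nq", format="nquads")  # hypothetical extract

QUERY = """
PREFIX sdo: <http://schema.org/>
SELECT DISTINCT ?job ?title WHERE {
    ?job a sdo:JobPosting ;
         sdo:title ?title ;
         sdo:jobLocation/sdo:address/sdo:addressCountry ?country .
    FILTER (UCASE(STR(?country)) IN ("AU", "AUSTRALIA"))
}
"""

for row in g.query(QUERY):
    print(row.job, row.title)
```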

Extracting Job Ads from Common Crawl
commoncrawl

I've been using data from the Adzuna Job Salary Predictions Kaggle Competition to extract skills, find near-duplicate job ads and understand the seniority of job titles. But the dataset's ad text is heavily processed, which makes it harder to do natural language processing on. Instead I'm going to find job ads in Common Crawl, a dataset containing over a billion webpages each month. The Common Crawl data is much better because it's longitudinal over several years, international, broad and continually being updated.

Common Crawl Index Athena
commoncrawl

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. There are petabytes of data archived, so directly searching through them is very expensive and slow. To search for pages that have been archived within a domain (for example all pages from wikipedia.com) you can search the Capture Index. But this doesn't help if you want to search for paths archived across domains. For example you might want to find how many domains have been archived, the distribution of languages of archived pages, or pages offered in multiple languages to build a corpus of parallel texts for a machine translation model.
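
These are the kinds of questions the columnar index answers well. Below is a sketch of kicking off such a query with boto3; the ccindex.ccindex table follows Common Crawl's documented Athena setup, while the results bucket is a placeholder you'd replace with your own. Restricting to a single crawl and the warc subset keeps the amount of data scanned (and so the cost) down.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # the data lives in us-east-1

# Count distinct registered domains captured in one monthly crawl.
QUERY = """
SELECT COUNT(DISTINCT url_host_registered_domain) AS domains
FROM ccindex.ccindex
WHERE crawl = 'CC-MAIN-2020-05'
  AND subset = 'warc'
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
print("Started Athena query:", response["QueryExecutionId"])
```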

Extracting Text, Metadata and Data from Common Crawl
commoncrawl

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. You can search the index to find where pages from a particular website are archived, but you still need a way to access the data. Common Crawl provides the data in 3 formats: if you just need the text of the internet, use the WET files; if you just need the response metadata, HTML head information or links in the webpage, use the WAT files; if you need the whole HTML (with all the metadata), use the full WARC files. The index only contains locations for the WARC files; the WET and WAT files are summarisations of them.
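
Once you have a record's filename, offset and length from the index, a ranged request pulls out just that record. Here is a sketch using warcio; the filename, offset and length are placeholders standing in for real index lookup results.

```python
import io

import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholders: these three values come from a capture index lookup.
filename = "crawl-data/CC-MAIN-2020-05/segments/.../warc/....warc.gz"
offset, length = 12345678, 4321

resp = requests.get(
    f"https://data.commoncrawl.org/{filename}",
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    timeout=60,
)
resp.raise_for_status()

# The ranged response is a single gzipped WARC record; warcio reads it directly.
for record in ArchiveIterator(io.BytesIO(resp.content)):
    if record.rec_type == "response":
        html = record.content_stream().read()
        print(record.rec_headers.get_header("WARC-Target-URI"), len(html), "bytes")
```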

Searching 100 Billion Webpages With the Capture Index
commoncrawl

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. Every month they use Apache Nutch to follow links across the web and download over a billion unique items to Amazon S3, and they have data going back to 2008. This is like what Google and Bing do to build their search engines, the difference being that Common Crawl provides their data to the world for free.
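
To see which of those monthly crawls are available to search, the index server publishes a list of its collections; a small sketch of reading it is below (the collinfo.json endpoint is the one served by index.commoncrawl.org).

```python
import requests

resp = requests.get("https://index.commoncrawl.org/collinfo.json", timeout=60)
resp.raise_for_status()

# Each entry describes one crawl and the CDX API endpoint used to search it.
for crawl in resp.json():
    print(crawl["id"], "->", crawl["cdx-api"])
```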