Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. There are petabytes of data archived, so searching through it directly is very expensive and slow. To search for pages that have been archived within a domain (for example all pages from wikipedia.org) you can search the Capture Index. But this doesn't help if you want to search for paths archived across domains.
For example you might want to find how many domains have been archived, or the distribution of languages of archived pages, or find pages offered in multiple languages to build a corpus of parallel texts for a machine translation model. For these use cases Common Crawl provide a columnar index to the WARC files; the index is a set of Parquet files available on S3. Even the index Parquet files are 300 GB per crawl, so you may want to process them with Spark or AWS Athena (which is a managed version of Apache Presto).
Common Crawl have a guide to setting up access to the index in Athena, and a repository containing examples of Athena queries and Spark jobs to extract information from the index.
This article will explore some examples of querying this data with Athena, assuming you have created the table ccindex as per the Common Crawl setup instructions.
You can run them through the AWS web console, through an Athena CLI, in Python with pyathena, or in R with RAthena.
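For example, a minimal sketch of running a query from Python with pyathena (the staging bucket and region are placeholders for your own Athena configuration):

from pyathena import connect

# Connect to Athena. The staging bucket and region below are placeholders;
# use the S3 staging directory and region you configured for Athena.
cursor = connect(
    s3_staging_dir="s3://my-athena-staging-bucket/",
    region_name="us-east-1",
).cursor()

# One of the queries from later in this article: list the crawls and subsets.
cursor.execute("""
    SELECT crawl, subset, count(*) AS n_captures
    FROM ccindex.ccindex
    GROUP BY crawl, subset
    ORDER BY crawl DESC, subset
""")

for crawl, subset, n_captures in cursor:
    print(crawl, subset, n_captures)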
Keeping Athena costs low
Every time you run a query in AWS Athena you are charged for processing the query (currently $5 per terabyte scanned), for S3 requests and transfer on the underlying data (which we don't pay for here), and for the S3 storage costs of any results created. See the AWS pricing details for more complete and current information, but the strategies will be the same.
Whenever you run a query in Athena the output is stored as an uncompressed CSV in the S3 staging bucket you configured, so you should periodically delete old results, either manually or with lifecycle rules. If you ever need to create a large extract (multiple gigabytes) it's more cost efficient to store it as Parquet with a Create Table As Select (CTAS) statement; see my article on exporting data from Athena for details.
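As a rough sketch (the table name and S3 locations here are placeholders, not part of the Common Crawl setup), such a CTAS export run through pyathena might look like:

from pyathena import connect

cursor = connect(
    s3_staging_dir="s3://my-athena-staging-bucket/",  # placeholder bucket
    region_name="us-east-1",
).cursor()

# CTAS: write the result to S3 as Parquet instead of an uncompressed CSV.
cursor.execute("""
    CREATE TABLE ccindex.au_warc_2020_16
    WITH (format = 'PARQUET',
          external_location = 's3://my-athena-staging-bucket/exports/au_warc_2020_16/') AS
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM ccindex.ccindex
    WHERE crawl = 'CC-MAIN-2020-16' AND subset = 'warc' AND url_host_tld = 'au'
""")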
To keep the amount of data processed low, the best thing to do is filter on crawl (which monthly snapshot is used) and subset; because the table is partitioned on these columns, filtering on them always reduces the amount of data scanned. Querying only the columns you need (instead of select *) will also reduce the amount of data scanned, since the data is stored in a columnar format. Finally, if you can use a low cardinality column prefer that over a high cardinality column; for example don't use url if you only want the TLD, use url_host_tld.
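If you're running queries from Python, pyathena also exposes Athena's query statistics, so you can check how much data a query scanned (and so roughly what it cost) before scaling it up; a small sketch, reusing the placeholder connection settings from above:

from pyathena import connect

cursor = connect(
    s3_staging_dir="s3://my-athena-staging-bucket/",  # placeholder bucket
    region_name="us-east-1",
).cursor()

# Partition filters (crawl, subset) plus a low cardinality column keep the scan small.
cursor.execute("""
    SELECT url_host_tld, count(*) AS n_captures
    FROM ccindex.ccindex
    WHERE crawl = 'CC-MAIN-2020-16' AND subset = 'warc'
    GROUP BY url_host_tld
""")

# Athena reports how much data the query scanned; at $5/TB this estimates the cost.
scanned_tb = cursor.data_scanned_in_bytes / 1e12
print(f"Scanned {scanned_tb:.3f} TB, roughly ${5 * scanned_tb:.2f}")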
Exploring Common Crawl with Athena
We can find out what crawls are in the data by searching through the partitions. Because we're just using the partition columns we're not charged for any data processed, but it's relatively slow; it took 3 minutes for me.
SELECT crawl, subset, count(*) as n_captures
FROM "ccindex"."ccindex"
GROUP BY crawl, subset
ORDER BY crawl desc, subset
The crawls contain the year and ISO week of the crawl (so e.g.
CC-MAIN-2020-24 is the crawl from the 24th week of 2020, which is early June).
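As a small illustration (not part of the Common Crawl tooling), you can turn a crawl id into an approximate date with Python's ISO calendar support:

from datetime import date

# CC-MAIN-2020-24 -> year 2020, ISO week 24; Monday of that week is 2020-06-08.
crawl = "CC-MAIN-2020-24"
year, week = int(crawl.split("-")[2]), int(crawl.split("-")[3])
print(date.fromisocalendar(year, week, 1))  # requires Python 3.8+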
There are 3 kinds of subsets, as described in the 2016 release
- warc - The actual web archive files containing all the data of successful requests (200 ok)
- crawldiagnostics - Contains responses like 404 file not found, redirects, 304 not modified, etc.
- robotstxt - Contains the robots.txt that would impact what pages the crawl accessed
At the time of writing it looks like there's something wrong with the most recent index: it contains 280 million captures, when the dataset should contain 2.75 billion. However the 2020-16 one looks correct.
Note that when new crawls are added you first have to run MSCK REPAIR TABLE ccindex to be able to access them (this rescans the partitions).
What's in the columnar index
To see what's in the index let's look at a few example rows using the query below; we'll limit to 10 rows to reduce the amount of data scanned (under 10MB).
The columns are:
- url_surtkey: Canonical form of URL with host name reversed
- url: URL that was archived
- url_host_name: The host name
- url_host_tld: The TLD (e.g. au)
- url_host_2nd_last_part, ... url_host_5th_last_part: The parts of the host name separated by .
- url_host_registry_suffix: e.g. .com.au
- url_protocol: e.g. https
- url_port: The port accessed; it seems to be blank for default ports (80 for http, 443 for https).
- url_path: The path of the URL (everything from the first / up to the query string starting at ?)
- url_query: The query string; everything after the ?
- fetch_time: When the page was retrieved
- fetch_status: The HTTP status of the request (e.g. 200 is OK)
- content_digest: A digest to uniquely identify the content
- content_mime_type: The type of content in the header
- content_mime_detected: The type of content detected
- content_charset: The character set of the data (e.g. UTF-8)
- content_languages: The languages declared for the content
- warc_filename: The filename the archived data is in
- warc_record_offset: The offset in bytes in the archived file where the corresponding data starts
- warc_record_length: The length of the archived data in bytes
- warc_segment: The segment the data is archived in; this is part of the filename
- crawl: The id of the crawl (e.g. CC-MAIN-YYYY-WW where YYYY is the year and WW is the ISO week of the year).
- subset: Which subset this capture is in: 'warc', 'robotstxt', or 'crawldiagnostics'
SELECT *
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-24'
  AND subset = 'warc'
  AND url_host_tld = 'au'
  AND url_host_registered_domain = 'realestate.com.au'
limit 10
Most crawled TLDs
To get an idea of the coverage of Common Crawl we can look at the most crawled TLDs, the number of captured domains and the average number of captures per domain for a snapshot.
-- Scans 150GB (~ 75c)
SELECT url_host_tld,
       approx_distinct(url_host_registered_domain) as n_domains,
       count(*) as n_captures,
       sum(1e0) / approx_distinct(url_host_registered_domain) as avg_captures_per_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
group by url_host_tld
order by n_captures desc
The number of domains is staggering; 15 million from
.com alone, 400k for Australia.
The typical average number of pages per domain is 80, but for
.edu it's nearly 5,000 and for
.gov it's nearly 2,500.
Much more content is archived from university and government domains than from general domains.
Australian domains with most pages archived
This query finds the au domains with the most pages archived in the 2020-16 crawl.
It takes about 5s and scans under 10MB (so it's practically free).
SELECT COUNT(*) AS count, url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
  AND url_host_tld = 'au'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC
limit 500
The top websites look to be government and university websites. Note this isn't about the popularity of a site, but is related to the number of pages on the site, how many links there are to each page, and how permissive its robots.txt file is. To find the most popular sites you would need a web traffic panel like Alexa.
Domains with the most subdomains
Some domains have a lot of different subdomains which they provide to users as namespaces. WordPress is a common example, where you can get a free personal website on a wordpress.com subdomain. These could be good places to look for user generated content, with each subdomain belonging to a user.
-- Scans 1GB of data
SELECT url_host_registered_domain,
       approx_distinct(url_host_name) AS num_subdomains
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
GROUP BY url_host_registered_domain
ORDER BY num_subdomains DESC
LIMIT 100
The results make sense: the top 5 sites (blogspot, wordpress, wixsite, weebly and fc2) are all sites for hosting personal content.
Public Dropbox Content Types
There were almost five thousand subdomains of dropboxusercontent.com; it would be interesting to see what type of content is in there.
SELECT content_mime_detected, count(*) as n
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
  AND url_host_registered_domain = 'dropboxusercontent.com'
GROUP BY 1
ORDER BY n DESC
LIMIT 100
It's mostly PDFs, but there's some zip files and mobile applications. I wonder how much of it is malware.
Downloading some content
Let's find some SQL files on Github:
SELECT url,
       warc_filename,
       warc_record_offset,
       warc_record_offset + warc_record_length as warc_record_end
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
  AND url_host_registered_domain = 'githubusercontent.com'
  AND content_mime_detected = 'text/x-sql'
LIMIT 5
We can then use this to retrieve the WARC for the first record; we just prepend
https://commoncrawl.s3.amazonaws.com/ to the filename and only get the relevant bytes with the Range header.
curl https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-16/segments/1585370493684.2/warc/CC-MAIN-20200329015008-20200329045008-00304.warc.gz \
  -H "range: bytes=689964496-689966615" > sql_sample.warc.gz
Then you can inspect the file with zcat (for example zcat sql_sample.warc.gz | less).
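If you'd rather do this from Python, here's a rough equivalent using the requests and warcio packages (both would need to be installed); the filename and byte range are the ones returned by the query above:

import io
import requests
from warcio.archiveiterator import ArchiveIterator

warc_url = ("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-16/"
            "segments/1585370493684.2/warc/"
            "CC-MAIN-20200329015008-20200329045008-00304.warc.gz")

# Fetch only the bytes for this record using the Range header.
resp = requests.get(warc_url, headers={"Range": "bytes=689964496-689966615"})

# The range is a self-contained gzip member, so warcio can read it directly.
for record in ArchiveIterator(io.BytesIO(resp.content)):
    if record.rec_type == "response":
        print(record.rec_headers.get_header("WARC-Target-URI"))
        print(record.content_stream().read()[:200])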
If you need to export WARC data at scale, Common Crawl have a script for producing a WARC extract from a SparkSQL query.