Common Crawl Index Athena

commoncrawl

data

Published

June 12, 2020

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. There are petabytes of data archived so directly searching through them is very expensive and slow. To search for pages that have been archived within a domain (for example all pages from wikipedia.com) you can search the Capture Index. But this doesn’t help if you want to search for paths archived across domains.

For example you might want to find how many domains been archived, or the distribution of languages of archived pages, or find pages offered in multiple languages to build a corpus of parallel texts for a machine translation model. For these usecases Common Crawl provide a columnar index to the WARC files, which are parquet files available on S3. Even the index parquet files are 300 GB per crawl so you may want to process them with Spark or AWS Athena (which is a managed version of Apache Presto).

Common Crawl have a guide to setting up access to the index in Athena, and a repository containing examples of Athena queries and Spark jobs to extract information from the index. This article will explore some examples of querying this data with Athena, assuming you have created the table ccindex as per the Common Crawl setup instructions. You can run them through the AWS web console, through an Athena CLI or in Python with pyathena or R with RAthena

Keeping Athena costs low

Every time you run a query in AWS Athena they charge for processing the query (currently $5 per terabyte), for S3 requests and transfer to the underlying data (which we don’t pay for here because the S3 bucket results) and the S3 storage costs of any created results. See the AWS pricing details for more complete and current information, but the strategies will be the same.

Whenever you run a query in Athena the output is stored as an uncompressed CSV in S3 in the staging bucket you configured, so you should periodically delete old results, for example manually or with lifecycle rules. If you ever need to create a large extract (multiple gigabytes) it’s more cost efficient to store it as parquet with a Create Table As Statement (CTAS); see my article on exporting data from Athena for details.

To keep the amount of data processed low the best things to do is filter on crawl which correspond to which monthly snapshot is used, and subset, because the table is partitioned on these it will always reduce the amount of data scanned. Only querying the columns you need (instead of select *) will also reduce the amount of data scanned, since the data is stored in a columnar format. Finally if you can use a low cardinality column prefer that over a high cardinality column (e.g. don’t use url if you only want the TLD, use url_host_tld instead).

Exploring Common Crawl with Athena

We can find out what crawls are in the data by searching through the partitions. Because we’re just using the partition columns we’re not charged for any data processed, but it’s relatively slow; it took 3 minutes for me.

SELECT crawl, subset, count(*) as n_captures
FROM "ccindex"."ccindex"
GROUP BY crawl, subset
ORDER BY crawl desc, subset

crawl	subset	n_captures
CC-MAIN-2020-24	crawldiagnostics	70454588
CC-MAIN-2020-24	robotstxt	15356021
CC-MAIN-2020-24	warc	277844242
CC-MAIN-2020-16	crawldiagnostics	432035051
CC-MAIN-2020-16	robotstxt	112871871
CC-MAIN-2020-16	warc	2886236237
…	…	…

The crawls contain the year and ISO week of the crawl (so e.g. CC-MAIN-2020-24 is the crawl from the 24th week of 2020, which is early June).

There are 3 kinds of subsets, as described in the 2016 release

warc - The actual web archive files containing all the data of successful requests (200 ok)
crawldiagnostics - Contains responses like 404 file not found, redirects, 304 not modified, etc.
robotstxt - Contains the robots.txt that would impact what pages the crawl accessed

At the time of writing it looks like there’s something wrong with the most recent index, it contains 280 million captures, when the dataset should countail 2.75 billion. However the 2020-16 one looks correct.

Note that when new crawls are added you have to first run MSCK Repair Table to be able to access them (this re-scans the partitions).

What’s in the columnar index

To see what’s in the index let’s look at a few example rows; we’ll limit 10 to reduce the amount of data scanned (under 10MB).

The columns are

url_surtkey: Canonical form of URL with host name reversed
url: URL that was archived
url_host_name: The host name
url_host_tld: The TLD (e.g. au)
url_host_2nd_last_part, … url_host_5th_last_part: The parts of the host name separated by .
url_host_registry_suffix: e.g. .com.au
url_host_private_domain
url_protocol: e.g. https
url_port: The port accesed, it seems to be blank for default ports (80 for http, 443 for https).
url_path: The path of the URL (everything from the first / to the query parameter starting at ?)
url_query: Query parameter; everything after the ?
fetch_time: When the page was retrieved
fetch_status: The HTTP status of the request (e.g. 200 is OK)
content_digest: A digest to uniquely identify the content
content_mime_type: The type of content in the header
content_mime_detected: The type of content detected
content_charset: The characterset of the data (e.g. UTF-8)
content_languages: Languages declared of the content
warc_filename: The filename the archived data is in
warc_record_offset: The offset in bytes in the archived file where the corresponding data starts
warc_record_length: The length of the archived data in bytes
warc_segment: The segment the data is archived in; this is part of the filename
crawl: The id of the crawl (e.g. CC-MAIN-YYYY-WW where YYYY is the year and WW is the ISO week of the year).
subset: Is this the ‘warc’, or ‘robotstxt’, or ‘crawldiagnostics’

SELECT *
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-24'
  AND subset = 'warc'
  AND url_host_tld = 'au'
  AND url_host_registered_domain = 'realestate.com.au'
limit 10

url_surtkey	url	url_host_name	url_host_tld	url_host_2nd_last_part	url_host_3rd_last_part	url_host_4th_last_part	url_host_registry_suffix	url_host_registered_domain	url_host_private_suffix	url_host_private_domain	url_protocol	url_path	fetch_time	fetch_status	content_digest	content_mime_type	content_mime_detected	content_charset	content_languages	warc_filename	warc_record_offset	warc_record_length	warc_segment	crawl	subset
au,com,realestate)/advice	https://www.realestate.com.au/advice/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/	2020-05-28 22:09:01.000	200	BIVR34XTK7HJGQJ5H47GO65UBODZA6XY	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590347400101.39/warc/CC-MAIN-20200528201823-20200528231823-00270.warc.gz	898091083	17598	1590347400101.39	CC-MAIN-2020-24	warc
au,com,realestate)/advice/10-clever-storage-solutions-rental-properties	https://www.realestate.com.au/advice/10-clever-storage-solutions-rental-properties/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/10-clever-storage-solutions-rental-properties/	2020-06-06 05:39:55.000	200	2YJSRGZHQVSE5WWKHEHPJ26RBIIQOIMP	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590348509972.80/warc/CC-MAIN-20200606031557-20200606061557-00502.warc.gz	855590228	36195	1590348509972.80	CC-MAIN-2020-24	warc
au,com,realestate)/advice/10-features-to-look-for-when-buying-a-period-home	https://www.realestate.com.au/advice/10-features-to-look-for-when-buying-a-period-home/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/10-features-to-look-for-when-buying-a-period-home/	2020-05-26 08:40:48.000	200	V75WAUJFWLXT5VPZFCPHUEPFHAZJY6MX	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590347390755.1/warc/CC-MAIN-20200526081547-20200526111547-00430.warc.gz	809064278	35882	1590347390755.1	CC-MAIN-2020-24	warc
au,com,realestate)/advice/10-foodie-towns-worth-going-rural	https://www.realestate.com.au/advice/10-foodie-towns-worth-going-rural/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/10-foodie-towns-worth-going-rural/	2020-05-27 16:31:24.000	200	XP4RGQCIWL5GL4O3KJG53AGRYVRXEGGV	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590347394756.31/warc/CC-MAIN-20200527141855-20200527171855-00398.warc.gz	909969420	35652	1590347394756.31	CC-MAIN-2020-24	warc
au,com,realestate)/advice/10-things-consider-buying-fixer-upper	https://www.realestate.com.au/advice/10-things-consider-buying-fixer-upper/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/10-things-consider-buying-fixer-upper/	2020-05-27 13:38:45.000	200	EZSDPRBNJIJEY4Y47C6A42T7YCDNDOSH	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590347394074.44/warc/CC-MAIN-20200527110649-20200527140649-00029.warc.gz	891295141	34746	1590347394074.44	CC-MAIN-2020-24	warc
au,com,realestate)/advice/10-things-moving-new-house	https://www.realestate.com.au/advice/10-things-moving-new-house/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/10-things-moving-new-house/	2020-06-02 18:28:54.000	200	EX24KJUKOHX5GOX36QFTNVZGACZPJO2J	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590347425481.58/warc/CC-MAIN-20200602162157-20200602192157-00363.warc.gz	879272045	35475	1590347425481.58	CC-MAIN-2020-24	warc
au,com,realestate)/advice/10-tips-coping-share-house-kitchen	https://www.realestate.com.au/advice/10-tips-coping-share-house-kitchen/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/10-tips-coping-share-house-kitchen/	2020-06-02 09:15:24.000	200	GWGMC36W4MHHSZ4GOQJA5AHGPY3LUUID	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590347423915.42/warc/CC-MAIN-20200602064854-20200602094854-00528.warc.gz	840946829	33050	1590347423915.42	CC-MAIN-2020-24	warc
au,com,realestate)/advice/10-tips-to-sharing-a-home-with-your-adult-children	https://www.realestate.com.au/advice/10-tips-to-sharing-a-home-with-your-adult-children/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/10-tips-to-sharing-a-home-with-your-adult-children/	2020-05-28 02:26:38.000	200	FMVSRNL32GKBRPLTV755XBL43G27P7YX	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590347396300.22/warc/CC-MAIN-20200527235451-20200528025451-00478.warc.gz	873617107	31724	1590347396300.22	CC-MAIN-2020-24	warc
au,com,realestate)/advice/101-guide-contents-insurance	https://www.realestate.com.au/advice/101-guide-contents-insurance/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/101-guide-contents-insurance/	2020-06-05 11:00:50.000	200	PTLEDOTFYPBVZO2AC26KUHB6SBJ6FZKG	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590348496026.74/warc/CC-MAIN-20200605080742-20200605110742-00542.warc.gz	837847660	31792	1590348496026.74	CC-MAIN-2020-24	warc
au,com,realestate)/advice/11-things-you-really-need-to-tell-your-landlord	https://www.realestate.com.au/advice/11-things-you-really-need-to-tell-your-landlord/	www.realestate.com.au	au	com	realestate	www	com.au	realestate.com.au	com.au	realestate.com.au	https	/advice/11-things-you-really-need-to-tell-your-landlord/	2020-05-31 04:37:28.000	200	MWL7BHWJEGV7TU7ORULZYGBRDNIY3GQI	text/html	text/html	UTF-8	eng	crawl-data/CC-MAIN-2020-24/segments/1590347410745.37/warc/CC-MAIN-20200531023023-20200531053023-00083.warc.gz	867625659	33829	1590347410745.37	CC-MAIN-2020-24	warc

Most crawled TLDs

To get an idea of the coverage of Common Crawl we can look at the most crawled TLDs, the number of captured domains and the average number of captures per domain for a snapshot.

# Scans 150GB (~ 75c)
SELECT url_host_tld,
       approx_distinct(url_host_registered_domain) as n_domains,
       count(*) as n_captures,
       sum(1e0) / approx_distinct(url_host_registered_domain) as avg_captures_per_domain,
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
group by url_host_tld
order by n_captures desc

The number of domains is staggering; 15 million from .com alone, 400k for Australia. The typical average number of pages per domain is 80, but for .edu it’s nearly 5,000 and for .gov it’s nearly 2,500. Much more content is archived from a university of government pages than general domains.

Australian domains with most pages archived

This query finds the au domains with most pages archived from 2020-16 crawl. It takes about 5s and scans under 10MB (so it’s practically free).

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
  AND url_host_tld = 'au'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
limit 500

The top websites look to be government and university websites. Note this isn’t about popularity of the site, but is related to the number of pages on the site, how many links there are to each page, and the how permissive its robots.txt file is. To find the most popular sites you would need panel web traffic panel like Alexa.

Domains with the most subdomains

Some domains have a lot of different subdomains which they provide to users as namespaces. Wordpress is a common example where you can get a free personal website with a Wordpress domain. These could be good places to look for user generated content, each subdomain belonging to a user.

SELECT url_host_registered_domain,
       approx_distinct(url_host_name) AS num_subdomains
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
GROUP BY url_host_registered_domain
ORDER BY num_subdomains DESC
LIMIT 100
# Scans 1GB data

The results make sense, the top 5 sites; blogspot, wordpress, wixsite, weebly and fc2 are all sites for hosting personal content.

Largest number subdomains in Common Crawl

Public Dropbox Content Types

There were almost five thousand subdomains of dropboxusercontent.com, it would be interesting to see what type of content is in there.

SELECT content_mime_detected,
       count(*) as n
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
  AND url_host_registered_domain = 'dropboxusercontent.com'
GROUP BY 1
ORDER BY n DESC
LIMIT 100

It’s mostly PDFs, but there’s some zip files and mobile applications. I wonder how much of it is malware.

content_mime_detected	n
application/pdf	5033
application/zip	204
audio/mpeg	186
application/vnd.symbian.install	130
text/plain	127
image/jpeg	63
application/vnd.android.package-archive	55
video/mp4	38
application/epub+zip	36

Downloading some content

Let’s find some SQL files on Github:

SELECT url,
       warc_filename,
       warc_record_offset,
       warc_record_offset + warc_record_length as warc_record_end
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-16'
  AND subset = 'warc'
  AND url_host_registered_domain = 'githubusercontent.com'
  AND content_mime_detected = 'text/x-sql'
LIMIT 5

url	warc_filename	warc_record_offset	warc_record_length
https://raw.githubusercontent.com/arey/spring-batch-toolkit/blog-parallelisation/src/test/resources/com/javaetmoi/core/batch/test/TestParallelAndPartitioning.sql	crawl-data/CC-MAIN-2020-16/segments/1585370493684.2/warc/CC-MAIN-20200329015008-20200329045008-00304.warc.gz	689964496	689966615
https://raw.githubusercontent.com/bailangzhan/demo/master/category.sql	crawl-data/CC-MAIN-2020-16/segments/1585371896913.98/warc/CC-MAIN-20200410110538-20200410141038-00434.warc.gz	616371570	616562918
https://raw.githubusercontent.com/citrix/sample-scripts/master/storefront/Create-ImportSubscriptionDataSP.sql	crawl-data/CC-MAIN-2020-16/segments/1585371896913.98/warc/CC-MAIN-20200410110538-20200410141038-00251.warc.gz	619544564	619546850
https://raw.githubusercontent.com/citrix/sample-scripts/master/storefront/Create-StoreSubscriptionsDB-2016.sql	crawl-data/CC-MAIN-2020-16/segments/1585370506673.7/warc/CC-MAIN-20200402045741-20200402075741-00091.warc.gz	669277710	669280288
https://raw.githubusercontent.com/cwoodruff/ChinookDatabase/master/Scripts/Chinook_SqlServer.sql	crawl-data/CC-MAIN-2020-16/segments/1585370506673.7/warc/CC-MAIN-20200402045741-20200402075741-00421.warc.gz	652805559	652898120

We can then use this to retrieve the WARC for the first record; we just prepend https://data.commoncrawl.org/ to the filename and only get the relevant bytes with the Range header.

curl https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-16/segments/1585370493684.2/warc/CC-MAIN-20200329015008-20200329045008-00304.warc.gz \
    -H "range: bytes=689964496-689966615" > sql_sample.warc.gz

Then you can inspect the file with zcat

If you needed to export the WARC data at scale Common Crawl have a script for producing the WARC extract from a SparkSQL query.