Restoring Wayback Machine HTML
The Internet Archive’s Wayback Machine is a digital archive of a large portion of the internet (hundreds of billions of web pages). However, it doesn’t store webpages in their original form: it makes some changes to each page so it is easier to view as it was at the time, for example replacing links to images, CSS and JavaScript with links to their archived versions. But how exactly does it change the HTML, and how do we get the original version back?
I originally came across this when fetching resources from the Internet Archive through its CDX Server. The server response includes a SHA-1 digest, but when I tried to recalculate it on the content I got a different value. When I searched for why, I came across an Internet Archive post explaining that the digest is the SHA-1 of the original content, not of what the Wayback Machine serves.
As you may have guessed, downloading all captures of a webpage and hashing them yourself would be worse than relying on the CDX digest. That is because the captures are guaranteed to differ from each other, even when the underlying page didn’t change: the Wayback Machine rewrites all links as internal hyperlinks, those URLs contain capture timestamps, and the timestamps obviously differ.
However it turns out to be trivial to get the original content: if the Wayback version is at http://web.archive.org/web/&lt;timestamp&gt;/&lt;url&gt;, then the original capture is at http://web.archive.org/web/&lt;timestamp&gt;id_/&lt;url&gt;. For example, given https://web.archive.org/web/20170204063743/http://john.smith@example.org/, replace the timestamp 20170204063743 with 20170204063743id_ (so the modified URL looks like https://web.archive.org/web/20170204063743id_/http://john.smith@example.org/) and you will get the original HTML without any of the additional markup added by the Internet Archive.
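Concretely, the rewrite is a single substitution on the URL; here is a minimal sketch (the helper name is mine):

```python
import re

def to_raw_wayback_url(wayback_url: str) -> str:
    # Insert 'id_' after the 14-digit timestamp in /web/<timestamp>/<url>
    return re.sub(r'(/web/\d{14})/', r'\1id_/', wayback_url, count=1)

assert to_raw_wayback_url(
    'https://web.archive.org/web/20170204063743/http://john.smith@example.org/'
) == 'https://web.archive.org/web/20170204063743id_/http://john.smith@example.org/'
```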
But I only learned that after spending time trying to reverse engineer the Wayback HTML, and the rest of the article covers what the changes are.
About a test case
To work out what was happening I needed a small page, so I used my about page.
Searching the Internet Archive CDX I get a recent capture:
```python
import requests

r = requests.get('http://web.archive.org/cdx/search/cdx',
                 params={'url': 'skeptric.com/about/', 'output': 'json'})
captures = r.json()

import pandas as pd

df = pd.DataFrame(captures[1:], columns=captures[0])
```
This gives a capture of the page from 2021-11-20:
|   | urlkey | timestamp | original | mimetype | statuscode | digest | length |
|---|--------|-----------|----------|----------|------------|--------|--------|
| 0 | com,skeptric)/about | 20211120235913 | https://skeptric.com/about/ | text/html | 200 | Z5NRUTRW3XTKZDCJFDKGPJ5BWIBNQCG7 | 3266 |
We can check the base 32 encoded SHA-1 digest against a current snapshot:
```python
from hashlib import sha1
from base64 import b32encode

def sha1_digest(content: bytes) -> str:
    return b32encode(sha1(content).digest()).decode('ascii')

record = df.iloc[0]
original_url = f'http://web.archive.org/web/{record.timestamp}id_/{record.original}'
original_content = requests.get(original_url).content
sha1_digest(original_content)
```
This gives `Z5NRUTRW3XTKZDCJFDKGPJ5BWIBNQCG7`, which matches the record.
Now we can get the Wayback Machine version of the content by inserting the timestamp and original URL:

```python
wayback_url = f'http://web.archive.org/web/{record.timestamp}/{record.original}'
wayback_content = requests.get(wayback_url).content

sha1_digest(wayback_content)
```
This gives us a different digest: `DEXQJ2HFM7EYGOWJ6W6FPKIJC4V3VXEE`.
Restoring Links
The links in the Wayback Machine version of the webpage are prefixed with http://web.archive.org/web/:

```python
import re

re.findall(b'(?:href|src)="([^"]*)"', wayback_content)
```
This gives results including:
```
http://web.archive.org/web/20211120235913cs_/https://skeptric.com/style.main.min.5ea2f07be7e07e221a7112a3095b89d049b96c48b831f16f1015bf2d95d914e5.css
http://web.archive.org/web/20211120235913/https://skeptric.com/
/web/20211120235913/https://skeptric.com/about/
/web/20211120235913/https://skeptric.com/
http://web.archive.org/web/20211120235913/https://www.whatcar.xyz/
http://web.archive.org/web/20211120235913js_/https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js
```
So we can remove the prefixes (the `im_`, `js_` and `cs_` suffixes on the timestamp mark image, JavaScript and CSS captures, just as `id_` marks the original):
```python
def remove_wayback_links(content: bytes, timestamp: str) -> bytes:
    # Remove web links
    timestamp = timestamp.encode('ascii')
    content = content.replace(b'http://web.archive.org', b'')
    for prefix in [b'', b'im_', b'js_', b'cs_']:
        content = content.replace(b'/web/' + timestamp + prefix + b'/', b'')
    return content
```
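The Wayback Machine also injects a toolbar near the top of the page and appends archival comments at the bottom, which `remove_wayback_header` and `remove_wayback_footer` strip (their definitions aren't shown here). Here is a minimal sketch of how they could work, assuming the toolbar sits between the standard `BEGIN WAYBACK TOOLBAR INSERT`/`END WAYBACK TOOLBAR INSERT` comments and the footer begins with a comment containing `FILE ARCHIVED ON`; real captures also have injected script and stylesheet tags in the head, so treat this as a starting point rather than a complete solution:

```python
TOOLBAR_START = b'<!-- BEGIN WAYBACK TOOLBAR INSERT -->'
TOOLBAR_END = b'<!-- END WAYBACK TOOLBAR INSERT -->'

def remove_wayback_header(content: bytes) -> bytes:
    # Drop everything between the toolbar markers (assumed delimiters)
    start = content.find(TOOLBAR_START)
    end = content.find(TOOLBAR_END)
    if start >= 0 and end >= 0:
        content = content[:start] + content[end + len(TOOLBAR_END):]
    return content

def remove_wayback_footer(content: bytes) -> bytes:
    # The capture ends with appended comments; cut from the start of the
    # comment containing the (assumed) "FILE ARCHIVED ON" marker
    index = content.find(b'FILE ARCHIVED ON')
    if index >= 0:
        comment_start = content.rfind(b'<!--', 0, index)
        if comment_start >= 0:
            content = content[:comment_start]
    return content
```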
And the rest:

```python
def remove_wayback_changes(content, timestamp):
    content = remove_wayback_header(content)
    content = remove_wayback_footer(content)
    content = remove_wayback_links(content, timestamp)
    return content
```
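Applying this to the capture gives the cleaned version used in the comparison below (the variable name is assumed from the diff code that follows):

```python
clean_wayback_content = remove_wayback_changes(wayback_content, record.timestamp)
```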
We can then compare the cleaned Wayback content with the original using difflib's SequenceMatcher (see side-by-side diffs in Jupyter for a fancier solution). For every area where the two differ we print the original and then the cleaned Wayback version, with an additional 20 bytes of context on either side:
```python
from difflib import SequenceMatcher

seqmatcher = SequenceMatcher(isjunk=None,
                             a=original_content,
                             b=clean_wayback_content,
                             autojunk=False)

context_before = context_after = 20

for tag, a0, a1, b0, b1 in seqmatcher.get_opcodes():
    if tag == 'equal':
        continue
    a_min = max(a0 - context_before, 0)
    a_max = min(a1 + context_after, len(seqmatcher.a))
    print(seqmatcher.a[a_min:a_max])
    b_min = max(b0 - context_before, 0)
    b_max = min(b1 + context_after, len(seqmatcher.b))
    print(seqmatcher.b[b_min:b_max])
    print()
```
This yields a set of very small changes; here they are:
- Removed trailing whitespace in tags
- Made relative links absolute
- Added a trailing / to the domain URL
Here are the diffs, with the original first and the cleaned Wayback version second:

```
meta charset="utf-8" />\n <meta http-eq
meta charset="utf-8"/>\n <meta http-eq

e" content="IE=edge" />\n\n \n \n <t
e" content="IE=edge"/>\n\n \n \n <t

ndly" content="True" />\n <meta name="v
ndly" content="True"/>\n <meta name="v

, initial-scale=1.0" />\n\n \n <link r
, initial-scale=1.0"/>\n\n \n <link r

015bf2d95d914e5.css" />\n<script async src
015bf2d95d914e5.css"/>\n<script async src

"menuitem"><a href="/about/">About</a></
"menuitem"><a href="https://skeptric.com/about/">About</a></

"menuitem"><a href="/">Home</a></li>\n
"menuitem"><a href="https://skeptric.com/">Home</a></li>\n

https://skeptric.com">skeptric.com</a>.<
https://skeptric.com/">skeptric.com</a>.<
```
What’s interesting about this is that there’s no way to recover this information without the original: there’s no way of knowing for sure where the trailing whitespace was (you could search for it by hashing candidates against the SHA-1 digest, but that would be expensive). It’s good that the Internet Archive provides the original version of the HTML as well!
For this case I wrote a little script that would munge the original content into something closer to what the Wayback Machine emits, but it wouldn’t be robust enough to work for other captures:
```python
import re

def wayback_normalise_content(content, base_url):
    url = base_url.encode('ascii')
    content = re.sub(b' */>', b'/>', content)
    content = content.replace(b'href="/', b'href="' + url + b'/')
    content = re.sub(b'href="' + url + b'"', b'href="' + url + b'/"', content)
    return content

assert wayback_normalise_content(original_content, 'https://skeptric.com') == clean_wayback_content
```
If you want to try this at home there’s a Jupyter Notebook (or you can view it in your browser).