import csv
from collections import Counter
from io import BytesIO
from pathlib import Path
import requests
Downloading books from Project Gutenberg with Python
Project Gutenberg is a great resource for free eBooks, and has lots of classic texts that are useful for NLP. There are some libraries for accessing Project Gutenberg from Python, such as py-gutenberg and GutenbergPy, but they require implicitly or explicitly building a database, which makes them complex to use. The R package gutenbergr is much easier to use because it distributes a snapshot of the catalog and loads it into memory, but I can’t find an equivalent in Python. So instead we’re going to search for books directly in Project Gutenberg’s CSV exports, and use them to download all the books of P. G. Wodehouse.
Reading the Catalog
Project Gutenberg doesn’t have an API but has documentation on offline catalogs. There exists a large RDF catalog (around 100MB compressed) with detailed metadata, and a smaller CSV catalog (14MB uncompressed) that contains limited metadata.
The CSV catalog is small enough that we can download it quickly into memory (note that requests automatically decompresses the response):
import requests
GUTENBERG_CSV_URL = "https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv.gz"
r = requests.get(GUTENBERG_CSV_URL)
csv_text = r.content.decode("utf-8")
f"Total size: {len(r.content) / 1024**2:0.2f}MB"
'Total size: 14.04MB'
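The automatic decompression relies on the server sending a Content-Encoding header; requests does not unpack a file just because its name ends in .gz. As a defensive sketch (the helper name maybe_gunzip is made up for this illustration, not part of requests), we could check for the gzip magic bytes and decompress only when needed:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_gunzip(data: bytes) -> bytes:
    """Return the decompressed bytes if data looks gzipped, else data unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


# Hypothetical usage with the response fetched above:
# csv_text = maybe_gunzip(r.content).decode("utf-8")
```

This way the decoding step behaves the same whether or not the response was already decompressed.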
The text is a standard CSV file:
print(csv_text[:400])
Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
1,Text,1971-12-01,The Declaration of Independence of the United States of America,en,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence",E201; JK,Politics; American Revolutionary War; United States Law
2,Text,1972-12-01,"The United States Bill o
An easy way to process it is with a DictReader, wrapping the text in StringIO to make it look like a file:
import csv
from io import StringIO
next(csv.DictReader(StringIO(csv_text)))
{'Text#': '1',
'Type': 'Text',
'Issued': '1971-12-01',
'Title': 'The Declaration of Independence of the United States of America',
'Language': 'en',
'Authors': 'Jefferson, Thomas, 1743-1826',
'Subjects': 'United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence',
'LoCC': 'E201; JK',
'Bookshelves': 'Politics; American Revolutionary War; United States Law'}
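Several columns (Authors, Subjects, LoCC, Bookshelves) pack multiple values into one string separated by semicolons, as the row above shows. A minimal sketch for turning such a field into a list, assuming a semicolon never occurs inside a single value (the helper name split_field is hypothetical):

```python
def split_field(value: str) -> list[str]:
    """Split a semicolon-separated catalog field into a list, dropping empties."""
    return [part.strip() for part in value.split(";") if part.strip()]


# Example with the LoCC value from the first catalog row
locc_codes = split_field("E201; JK")
```

An empty field (many rows have no Bookshelves) simply becomes an empty list.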
We can then search for all P. G. Wodehouse books by looking for authors containing “Wodehouse”:
wodehouse_books = [book for book in csv.DictReader(StringIO(csv_text))
                   if 'Wodehouse' in book['Authors']]
len(wodehouse_books)
56
Let’s show our results in an HTML table (it’s a bit long; feel free to skim past it):
from IPython.display import display, HTML
def dicts_to_html_table(dicts):
    html = []
    header = None
    for d in dicts:
        if header is None:
            header = d.keys()
            html.append("<table><tr>" +
                        "".join([f"<th>{h}</th>" for h in header]) +
                        "</tr>")
        html.append("<tr>" +
                    "".join([f"<td>{d[h]}</td>" for h in header]) +
                    "</tr>")
    html.append("</table>")
    return "".join(html)
display(HTML(dicts_to_html_table(wodehouse_books)))
Text# | Type | Issued | Title | Language | Authors | Subjects | LoCC | Bookshelves |
2005 | Text | 1999-12-01 | Piccadilly Jim | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Piccadilly (London, England) -- Fiction | PR | Best Books Ever Listings; Humor |
2042 | Text | 2000-01-01 | Something New | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Nobility -- Fiction; Blandings Castle (England : Imaginary place) -- Fiction; Shropshire (England) -- Fiction | PR | Best Books Ever Listings; Humor |
2233 | Text | 2000-06-01 | A Damsel in Distress | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Nobility -- Fiction; Blandings Castle (England : Imaginary place) -- Fiction; Shropshire (England) -- Fiction | PR | Humor |
2607 | Text | 2001-04-01 | Psmith, Journalist | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories | PR | Humor |
3756 | Text | 2008-06-25 | Indiscretions of Archie | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | New York (N.Y.) -- Fiction; Humorous stories; British -- United States -- Fiction; World War, 1914-1918 -- Veterans -- Fiction; Hotels -- Fiction; Married men -- Fiction; Fathers-in-law -- Fiction | PR | Humor |
3829 | Text | 2003-03-01 | Love Among the Chickens | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; England -- Fiction; Farm life -- Fiction; Chicken breeders -- Fiction; Ukridge, Stanley Featherstonehaugh (Fictitious character) -- Fiction | PR | Humor |
4075 | Text | 2003-05-01 | The Intrusion of Jimmy | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | New York (N.Y.) -- Fiction; Humorous stories; Love stories; Burglary -- Fiction; British -- United States -- Fiction; Police -- Family relationships -- Fiction | PR | Humor |
6683 | Text | 2004-10-01 | The Little Nugget | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Kidnapping -- Fiction | PR | Humor |
6684 | Text | 2004-10-01 | Uneasy Money | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | New York (N.Y.) -- Fiction; Humorous stories; Inheritance and succession -- Fiction; Love stories; Aristocracy (Social class) -- Fiction; British -- United States -- Fiction | PR | Humor |
6753 | Text | 2004-10-01 | Psmith in the City | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories | PR; PZ | Humor |
6768 | Text | 2004-10-01 | The Man Upstairs and Other Stories | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories, English | PR | Humor |
6836 | Text | 2004-11-01 | Three Men and a Maid | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories | PR | Humor |
6837 | Text | 2004-11-01 | The Little Warrior | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Love stories; Poor women -- Fiction; Musicals -- Fiction; Broadway (New York, N.Y.) -- Fiction; Long Island (N.Y.) -- Fiction | PR | Humor |
6877 | Text | 2004-11-01 | The Head of Kay's | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Schools -- Fiction | PR; PZ | Humor; School Stories |
6879 | Text | 2004-11-01 | The Gold Bat | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories, English; Boys -- Conduct of life -- Juvenile fiction; Schools -- Juvenile fiction; Sports -- Juvenile fiction | PR; PZ | Humor; School Stories |
6880 | Text | 2004-11-01 | The Coming of Bill | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories | PR | Humor |
6927 | Text | 2004-11-01 | The White Feather | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Schools -- Fiction; Children's stories | PR; PZ | Humor; School Stories |
6955 | Text | 2004-11-01 | The Prince and Betty | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories | PR | Humor |
6980 | Text | 2004-11-01 | Tales of St. Austin's | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Schools -- Fiction; Humorous stories | PR; PZ | Humor; School Stories |
6984 | Text | 2004-11-01 | The Pothunters | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Schools -- Fiction; Humorous stories; Theft -- Fiction; England -- Social life and customs -- 20th century -- Fiction | PR; PZ | Humor; School Stories |
6985 | Text | 2004-11-01 | A Prefect's Uncle | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Schools -- Fiction; Humorous stories; Uncles -- Fiction; Schoolboys -- Fiction; Cricket stories | PR; PZ | Humor; School Stories |
7028 | Text | 2004-12-01 | The Clicking of Cuthbert | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Golf stories | PR | Humor |
7050 | Text | 2004-12-01 | The Swoop! or, How Clarence Saved England: A Tale of the Great Invasion | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Satire; Humorous stories; England -- Fiction; Boy Scouts -- Fiction; Imaginary wars and battles -- Fiction | PR | Humor; Scouts |
7230 | Text | 2005-01-01 | Not George Washington — an Autobiographical Novel | en | Westbrook, H. W. (Herbert Wetton); Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories | PR | Humor |
7298 | Text | 2005-01-01 | William Tell Told Again | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | English wit and humor; Tell, Wilhelm -- Fiction | PR; PZ | Humor |
7423 | Text | 2005-02-01 | Mike | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Boarding schools -- Fiction; Schools -- Fiction; Humorous stories; England -- Fiction; Cricket -- Fiction | PR; PZ | Humor; School Stories |
7464 | Text | 2005-02-01 | The Adventures of Sally | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | New York (N.Y.) -- Fiction; Humorous stories; Inheritance and succession -- Fiction | PR | Humor |
7471 | Text | 2005-02-01 | The Man with Two Left Feet, and Other Stories | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Short stories; Humorous stories, English | PR | Humor |
8164 | Text | 2005-05-01 | My Man Jeeves | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; England -- Fiction; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction | PR | Humor |
8176 | Text | 2005-05-01 | Death at the Excelsior, and Other Stories | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories, English; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Valets -- Fiction | PR | Humor |
8178 | Text | 2005-05-01 | The Politeness of Princes, and Other School Stories | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Schools -- Fiction | PR; PZ | Humor; School Stories |
8190 | Text | 2005-05-01 | A Wodehouse Miscellany: Articles & Stories | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories, English; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Valets -- Fiction; England -- Social life and customs -- Fiction | PR | Humor |
8713 | Text | 2005-08-01 | A Man of Means | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories | PR | Humor |
8931 | Text | 2005-09-01 | The Gem Collector | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Jewel thieves -- Fiction | PR | Humor |
10554 | Text | 2004-01-01 | Right Ho, Jeeves | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; England -- Fiction; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction | PR | Humor |
10586 | Text | 2004-01-01 | Mike and Psmith | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Boarding schools -- Fiction; Schools -- Fiction; Humorous stories; England -- Fiction; Cricket -- Fiction | PZ | Humor; School Stories |
20532 | Text | 2007-02-06 | Love Among the Chickens A Story of the Haps and Mishaps on an English Chicken Farm | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975; Both, Armand, 1881-1922 [Illustrator] | Humorous stories; England -- Fiction; Farm life -- Fiction; Chicken breeders -- Fiction; Ukridge, Stanley Featherstonehaugh (Fictitious character) -- Fiction | PR | Humor |
20533 | Text | 2007-02-06 | Jill the Reckless | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Love stories; Poor women -- Fiction; Musicals -- Fiction; Broadway (New York, N.Y.) -- Fiction; Long Island (N.Y.) -- Fiction | PR | Humor |
20717 | Text | 2007-03-01 | The Girl on the Boat | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Children of the rich -- Fiction; Golf stories; Man-woman relationships -- Fiction | PR | |
23899 | Sound | 2007-12-01 | Psmith in the City | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories | PR; PZ | |
26303 | Sound | 2008-08-01 | Right Ho, Jeeves | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; England -- Fiction; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction | PR | |
26579 | Sound | 2008-09-01 | Love Among the Chickens | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; England -- Fiction; Farm life -- Fiction; Chicken breeders -- Fiction; Ukridge, Stanley Featherstonehaugh (Fictitious character) -- Fiction | PR | |
43317 | Text | 2013-07-26 | Lord Lyons: A Record of British Diplomacy, Vol. 1 of 2 | en | Newton, Thomas Wodehouse Legh, Baron, 1857-1942 | Europe -- Politics and government -- 1871-1918; Lyons, Richard Bickerton Pemell Lyons, Earl, 1817-1887; Diplomatic and consular service, British; Great Britain -- Foreign relations -- 1837-1901 | DA | |
44143 | Text | 2013-11-10 | Lord Lyons: A Record of British Diplomacy, Vol. 2 of 2 | en | Newton, Thomas Wodehouse Legh, Baron, 1857-1942; Ward, Wilfrid, Mrs., 1864-1932 [Contributor] | Europe -- Politics and government -- 1871-1918; Lyons, Richard Bickerton Pemell Lyons, Earl, 1817-1887; Diplomatic and consular service, British; Great Britain -- Foreign relations -- 1837-1901 | DA | |
58508 | Text | 2018-12-21 | Index of the Project Gutenberg Works of Pelham Grenville Wodehouse | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975; Widger, David, 1932-2021? [Editor] | Indexes | PR | |
59254 | Text | 2019-04-11 | The Inimitable Jeeves | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction; England -- Social life and customs -- Fiction; Upper class -- England -- Fiction | PR | |
60067 | Text | 2019-08-06 | Leave it to Psmith | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Impostors and imposture -- Fiction; Nobility -- Fiction; Blandings Castle (England : Imaginary place) -- Fiction; Shropshire (England) -- Fiction; Jewel thieves -- Fiction | PR | |
61507 | Text | 2020-02-25 | Ukridge | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; Ukridge, Stanley Featherstonehaugh (Fictitious character) -- Fiction | PR | |
63735 | Text | 2020-11-13 | Subscription the disgrace of the English Church [2nd edition] | en | Wodehouse, C. N. (Charles Nourse), 1790-1870 | Church of England -- Controversial literature; Church of England. Thirty-nine Articles | BX | |
63738 | Text | 2020-11-13 | Subscription the disgrace of the English Church [1st edition] | en | Wodehouse, C. N. (Charles Nourse), 1790-1870 | Church of England -- Controversial literature; Church of England. Thirty-nine Articles | BX | |
65172 | Text | 2021-04-26 | A Gentleman of Leisure | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | New York (N.Y.) -- Fiction; Humorous stories; Love stories; Burglary -- Fiction; British -- United States -- Fiction; Police -- Family relationships -- Fiction | PR | |
65974 | Text | 2021-08-01 | Carry On, Jeeves | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | British -- New York (State) -- New York -- Fiction; Short stories; Humorous stories; England -- Fiction; Wooster, Bertie (Fictitious character) -- Fiction; Jeeves (Fictitious character) -- Fiction; Single men -- Fiction; Valets -- Fiction | PR | |
67368 | Text | 2022-02-10 | Sam in the Suburbs | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories; England -- Fiction; Gangsters -- Fiction; Publishers and publishing -- Fiction | PR | |
70041 | Text | 2023-02-14 | The small bachelor | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | New York (N.Y.) -- Fiction; Humorous stories; Man-woman relationships -- Fiction; Upper class -- Fiction; Painters -- Fiction | PR | |
70222 | Text | 2023-03-06 | Meet Mr Mulliner | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories, English; England -- Fiction; Short stories, English; San Francisco (Calif.) -- Fiction; Interpersonal relations -- Fiction; Mulliner family (Fictitious characters) -- Fiction | PR | |
72227 | Text | 2023-11-25 | Divots | en | Wodehouse, P. G. (Pelham Grenville), 1881-1975 | Humorous stories, English; Golfers -- Fiction; Golf stories, English | PR |
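Some catalog values contain characters that are special in HTML, such as the & in “A Wodehouse Miscellany: Articles & Stories”. As a hedged sketch, a variant of the table helper could escape each value with the standard library’s html.escape; the name dicts_to_html_table_escaped is made up for this illustration:

```python
from html import escape


def dicts_to_html_table_escaped(dicts):
    """Render a list of dicts as an HTML table, escaping every cell value."""
    rows = list(dicts)
    if not rows:
        return "<table></table>"
    header = list(rows[0].keys())
    parts = ["<table><tr>"]
    parts += [f"<th>{escape(h)}</th>" for h in header]
    parts.append("</tr>")
    for d in rows:
        parts.append("<tr>")
        parts += [f"<td>{escape(str(d[h]))}</td>" for h in header]
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)
```

It also returns a well-formed empty table when there are no rows.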
There are a couple of results above that aren’t what I am looking for:
- Other authors with the name Wodehouse: Wodehouse, C. N. and Thomas Wodehouse Legh
- The “Index of the Project Gutenberg Works of Pelham Grenville Wodehouse”
- Some of them are “Sound” not text
We can filter these out to get just the books we need.
wodehouse_books = [b for b in wodehouse_books
                   if "Wodehouse, P. G." in b["Authors"]
                   and "Indexes" not in b["Subjects"]
                   and b["Type"] == "Text"]
len(wodehouse_books)
48
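The same three conditions can be wrapped in a small reusable function for other authors. This is only a sketch; the function name select_books and its parameters are assumptions for illustration, and it relies on the catalog column names shown earlier:

```python
def select_books(rows, author, book_type="Text", excluded_subject="Indexes"):
    """Keep rows by the given author, of the given type, that are not index pages."""
    return [
        row for row in rows
        if author in row["Authors"]
        and row["Type"] == book_type
        and excluded_subject not in row["Subjects"]
    ]


# Hypothetical usage on the parsed catalog rows:
# wodehouse_books = select_books(csv.DictReader(StringIO(csv_text)), "Wodehouse, P. G.")
```

Matching on the substring "Wodehouse, P. G." is what excludes C. N. Wodehouse and Thomas Wodehouse Legh Newton.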
Downloading the text
Once we have the id of the book (Text#), it can be downloaded from a standard URL. For human access we can get them from https://www.gutenberg.org/ebooks/{id}.txt.utf-8:
GUTENBERG_TEXT_URL = "https://www.gutenberg.org/ebooks/{id}.txt.utf-8"
book_id = wodehouse_books[0]["Text#"]
#r = requests.get(GUTENBERG_TEXT_URL.format(id=book_id))
#text = r.text
However, their robot access policy suggests using a special URL to get links (we set the filetypes to txt here to get text):
GUTENBERG_ROBOT_URL = "http://www.gutenberg.org/robot/harvest?filetypes[]=txt"
r = requests.get(GUTENBERG_ROBOT_URL)
print(r.text[:750])
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>All Files (offset: 0, filetypes: txt) - Project Gutenberg</title>
</head>
<body>
<h1>All Files (offset: 0, filetypes: txt)</h1> <p><a href="http://aleph.gutenberg.org/etext02/comed10.zip">http://aleph.gutenberg.org/etext02/comed10.zip</a></p>
<p><a href="http://aleph.gutenberg.org/1/2/3/7/12370/12370-8.zip">http://aleph.gutenberg.org/1/2/3/7/12370/12370-8.zip</a></p>
<p><a href="http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip">http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip</a></p>
<p><a href="http://aleph.guten
The mirror can be extracted from the URLs:
import re
# Raw string with an escaped dot so that only a literal ".zip" matches
GUTENBERG_MIRROR = re.search(r'(https?://[^/]+)[^"]*\.zip', r.text).group(1)
GUTENBERG_MIRROR
'http://aleph.gutenberg.org'
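An alternative sketch is to pull out the first .zip link and let urllib.parse split it, which avoids encoding the URL structure in the regular expression. The helper name first_mirror is hypothetical, and it assumes the harvest page keeps listing links as href attributes like in the output above:

```python
import re
from urllib.parse import urlsplit


def first_mirror(html: str) -> str:
    """Return scheme://host of the first .zip link found in the harvest page."""
    match = re.search(r'href="([^"]+\.zip)"', html)
    if match is None:
        raise ValueError("no .zip link found in harvest page")
    parts = urlsplit(match.group(1))
    return f"{parts.scheme}://{parts.netloc}"
```

Raising an error when no link is found gives a clearer failure than calling .group on None.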
Then we can construct the URL using the same logic as gutenbergr. Note that sometimes we need to add a suffix (e.g. look at http://aleph.gutenberg.org/0/1/ which only has a -0 file):
def gutenberg_text_urls(id: str, mirror=GUTENBERG_MIRROR, suffixes=("", "-8", "-0")) -> list[str]:
    path = "/".join(id[:-1]) or "0"
    return [f"{mirror}/{path}/{id}/{id}{suffix}.zip" for suffix in suffixes]
gutenberg_text_urls(book_id)
['http://aleph.gutenberg.org/2/0/0/2005/2005.zip',
'http://aleph.gutenberg.org/2/0/0/2005/2005-8.zip',
'http://aleph.gutenberg.org/2/0/0/2005/2005-0.zip']
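To make the directory rule explicit: every digit of the id except the last becomes a directory level, and single-digit ids fall back to the "0" directory. A small standalone sketch of just that step (the name gutenberg_dir is made up for this illustration):

```python
def gutenberg_dir(book_id: str) -> str:
    """Directory for a book on the mirror: all digits but the last, then the id."""
    prefix = "/".join(book_id[:-1]) or "0"  # "" for single-digit ids, so use "0"
    return f"{prefix}/{book_id}"


# Ids seen earlier in this post
examples = {book_id: gutenberg_dir(book_id) for book_id in ("1", "2005", "12370")}
```

Here "1" maps to 0/1, "2005" to 2/0/0/2005, and "12370" to 1/2/3/7/12370, which matches the links in the harvest page.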
We can then try each URL in turn until we find the file, and then unzip it:
import logging
import zipfile
def download_gutenberg(id: str) -> str:
    for url in gutenberg_text_urls(id):
        r = requests.get(url)
        if r.status_code == 404:
            logging.warning(f"404 for {url}")
            continue
        r.raise_for_status()
        break
    else:
        # Every candidate URL returned 404, so there is nothing to unzip
        raise FileNotFoundError(f"No text archive found for book {id}")
    z = zipfile.ZipFile(BytesIO(r.content))

    if len(z.namelist()) != 1:
        raise Exception(f"Expected 1 file in {z.namelist()}")

    return z.read(z.namelist()[0]).decode('utf-8')
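Not every archive is necessarily UTF-8. I am assuming here that the -8 files are 8-bit text, typically Latin-1, which I have not verified against the catalog. A hedged sketch of the unzip step with a fallback encoding, working purely on bytes so it can be tried without the network (the name read_single_text is hypothetical):

```python
import zipfile
from io import BytesIO


def read_single_text(zip_bytes: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Read the only file in a zip archive, trying each encoding in turn."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as z:
        names = z.namelist()
        if len(names) != 1:
            raise ValueError(f"Expected 1 file, found {names}")
        raw = z.read(names[0])
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {names[0]} with {encodings}")
```

Since Latin-1 can decode any byte sequence, it has to come last in the list of encodings.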
text = download_gutenberg(book_id)
print(text[:1500])
The Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: Piccadilly Jim
Author: P. G. Wodehouse
Release Date: September 12, 2012 [EBook #2005]
Last Updated: August 16, 2016
Language: English
Character set encoding: ASCII
*** START OF THIS PROJECT GUTENBERG EBOOK PICCADILLY JIM ***
Produced by Jim Tinsley
Piccadilly Jim
by
Pelham Grenville Wodehouse
CHAPTER I
A RED-HAIRED GIRL
The residence of Mr. Peter Pett, the well-known financier, on
Riverside Drive is one of the leading eyesores of that breezy and
expensive boulevard. As you pass by in your limousine, or while
enjoying ten cents worth of fresh air on top of a green omnibus,
it jumps out and bites at you. Architects, confronted with it,
reel and throw up their hands defensively, and even the lay
observer has a sense of shock. The place resembles in almost
equal proportions a cathedral, a suburban villa, a hotel and a
Chinese pagoda. Many of its windows are of stained glass, and
above the porch stand two terra-cotta lions, considerably more
repulsive even than the complacent animals which guard New York's
Public Library. It is a house which is impossible to overlook:
and it
The header ends with a line containing “PROJECT GUTENBERG EBOOK” (the “*** START OF THIS PROJECT GUTENBERG EBOOK” line above). Searching for this text, we can see it also appears near the end of the file (this book actually has some transcriber’s notes after the end of the story, but we’ll leave them in):
GUTENBERG_TEXT = "PROJECT GUTENBERG EBOOK "
lines = text.splitlines()
first = True
for idx, line in enumerate(lines):
    if GUTENBERG_TEXT in line:
        # Skip the first match (the START marker) and show the context of later ones
        if first:
            first = False
            continue
        print('=' * 80)
        print('\n'.join(lines[idx-20:idx+20]))
        print('=' * 80)
        print()
================================================================================
This is a somewhat clumsy construction, and quite un-Wodehousian.
The original passage in the serialization read:
"Before his stony eye the immaculate Bartling wilted. All that
he had ever heard and read about doubles came to him."
--------------------------------
End of the Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse
*** END OF THIS PROJECT GUTENBERG EBOOK PICCADILLY JIM ***
***** This file should be named 2005.txt or 2005.zip *****
This and all associated files of various formats will be found in:
http://www.gutenberg.org/2/0/0/2005/
Produced by Jim Tinsley
Updated editions will replace the previous one--the old editions
will be renamed.
Creating the works from public domain print editions means that no
one owns a United States copyright in these works, so the Foundation
(and you!) can copy and distribute it in the United States without
permission and without paying copyright royalties. Special rules,
set forth in the General Terms of Use part of this license, apply to
copying and distributing Project Gutenberg-tm electronic works to
protect the PROJECT GUTENBERG-tm concept and trademark. Project
Gutenberg is a registered trademark, and may not be used if you
charge for the eBooks, unless you receive specific permission. If you
================================================================================
We can read everything between the first and last header with a simple state machine:
def strip_headers(text):
    in_text = False
    output = []

    for line in text.splitlines():
        if GUTENBERG_TEXT in line:
            if not in_text:
                in_text = True
            else:
                break
        else:
            if in_text:
                output.append(line)

    return "\n".join(output).strip()

stripped_text = strip_headers(text)
And check that it has worked:
print(stripped_text[:200])
print("*" * 80)
print(stripped_text[-500:])
Produced by Jim Tinsley
Piccadilly Jim
by
Pelham Grenville Wodehouse
CHAPTER I
A RED-HAIRED GIRL
The residence of Mr. Peter Pett, the well-known financier, on
Riverside Drive is one
********************************************************************************
ling wilted.
It was a perfectly astounding likeness, but it was
apparent to him when what he had ever heard and read
about doubles came to him."
This is a somewhat clumsy construction, and quite un-Wodehousian.
The original passage in the serialization read:
"Before his stony eye the immaculate Bartling wilted. All that
he had ever heard and read about doubles came to him."
--------------------------------
End of the Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse
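A slightly stricter variant is to anchor on the “*** START OF” and “*** END OF” marker lines themselves, so that an upper-case mention of the phrase inside the book cannot end the text early. This is a sketch under the assumption that every file uses markers shaped like the ones printed above; the names START_RE, END_RE and strip_between_markers are made up for this illustration:

```python
import re

# Marker shapes assumed from the sample output above
START_RE = re.compile(r"^\*\*\* ?START OF .*PROJECT GUTENBERG EBOOK")
END_RE = re.compile(r"^\*\*\* ?END OF .*PROJECT GUTENBERG EBOOK")


def strip_between_markers(text: str) -> str:
    """Return the lines strictly between the START and END marker lines."""
    body = []
    in_body = False
    for line in text.splitlines():
        if not in_body:
            if START_RE.match(line):
                in_body = True  # the marker line itself is skipped
            continue
        if END_RE.match(line):
            break
        body.append(line)
    return "\n".join(body).strip()
```

If no START marker is found the function returns an empty string, which is easy to check for before saving.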
Downloading all the files
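Now we can download all the files in a loop; let’s create a small function that gets and cleans the text. Note that it fetches from GUTENBERG_TEXT_URL; to follow the robot access policy strictly, download_gutenberg could be swapped in here: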
def book_text(book_id):
    r = requests.get(GUTENBERG_TEXT_URL.format(id=book_id))
    text = r.text
    clean_text = strip_headers(text)
    return clean_text
We’ll save each book into the “data” folder:
data_path = Path("data")
data_path.mkdir(exist_ok=True)
And finally save all the books (one at a time to not overload the server):
for book in wodehouse_books:
    id = book["Text#"]
    text = book_text(id)
    print(f"Saving {book['Title']} by {book['Authors']} containing {len(text):_} characters")
    # Explicit encoding so the result does not depend on the platform default
    with open(data_path / (id + ".txt"), "wt", encoding="utf-8") as f:
        f.write(text)
Saving Piccadilly Jim by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 449_842 characters
Saving Something New by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 419_221 characters
Saving A Damsel in Distress by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 429_025 characters
Saving Psmith, Journalist by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 322_135 characters
Saving Indiscretions of Archie by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 413_041 characters
Saving Love Among the Chickens by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 278_160 characters
Saving The Intrusion of Jimmy by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 381_406 characters
Saving The Little Nugget by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 388_673 characters
Saving Uneasy Money by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 364_858 characters
Saving Psmith in the City by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 295_582 characters
Saving The Man Upstairs and Other Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 562_134 characters
Saving Three Men and a Maid by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 307_343 characters
Saving The Little Warrior by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 650_963 characters
Saving The Head of Kay's by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 245_760 characters
Saving The Gold Bat by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 240_786 characters
Saving The Coming of Bill by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 466_145 characters
Saving The White Feather by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 249_059 characters
Saving The Prince and Betty by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 379_548 characters
Saving Tales of St. Austin's by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 283_438 characters
Saving The Pothunters by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 228_580 characters
Saving A Prefect's Uncle by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 236_385 characters
Saving The Clicking of Cuthbert by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 341_920 characters
Saving The Swoop! or, How Clarence Saved England: A Tale of the Great Invasion by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 102_301 characters
Saving Not George Washington — an Autobiographical Novel by Westbrook, H. W. (Herbert Wetton); Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 289_260 characters
Saving William Tell Told Again by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 83_002 characters
Saving Mike by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 596_796 characters
Saving The Adventures of Sally by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 433_963 characters
Saving The Man with Two Left Feet, and Other Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 383_878 characters
Saving My Man Jeeves by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 272_292 characters
Saving Death at the Excelsior, and Other Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 208_157 characters
Saving The Politeness of Princes, and Other School Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 131_705 characters
Saving A Wodehouse Miscellany: Articles & Stories by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 164_155 characters
Saving A Man of Means by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 154_068 characters
Saving The Gem Collector by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 180_113 characters
Saving Right Ho, Jeeves by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 405_146 characters
Saving Mike and Psmith by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 301_734 characters
Saving Love Among the Chickens
A Story of the Haps and Mishaps on an English Chicken Farm by Wodehouse, P. G. (Pelham Grenville), 1881-1975; Both, Armand, 1881-1922 [Illustrator] containing 272_012 characters
Saving Jill the Reckless by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 633_811 characters
Saving The Girl on the Boat by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 379_578 characters
Saving The Inimitable Jeeves by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 374_608 characters
Saving Leave it to Psmith by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 501_634 characters
Saving Ukridge by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 423_246 characters
Saving A Gentleman of Leisure by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 382_804 characters
Saving Carry On, Jeeves by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 406_254 characters
Saving Sam in the Suburbs by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 448_006 characters
Saving The small bachelor by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 413_161 characters
Saving Meet Mr Mulliner by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 302_837 characters
Saving Divots by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 377_621 characters
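If the loop is interrupted it can be handy to resume without fetching books again, and to pause between requests. A minimal sketch, assuming a fetch callable such as book_text is passed in; the function name save_books and the two-second default delay are my own choices, not something Project Gutenberg prescribes here:

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def save_books(book_ids: Iterable[str],
               fetch: Callable[[str], str],
               out_dir: Path,
               delay_seconds: float = 2.0) -> list[Path]:
    """Save each book as <id>.txt, skipping files that already exist."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for book_id in book_ids:
        target = out_dir / f"{book_id}.txt"
        if target.exists():
            continue  # already downloaded on a previous run
        target.write_text(fetch(book_id), encoding="utf-8")
        written.append(target)
        time.sleep(delay_seconds)  # be gentle with the server
    return written


# Hypothetical usage with the names defined in this post:
# save_books([b["Text#"] for b in wodehouse_books], book_text, data_path)
```

Passing the fetch function as a parameter also makes it easy to try the loop with a stub before hitting the real server.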
We can also save our metadata for future reference:
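# newline='' is what the csv module recommends for files passed to a writer
with open(data_path / 'metadata.csv', 'wt', newline='', encoding='utf-8') as f:
    csv_writer = csv.DictWriter(f, fieldnames=wodehouse_books[0].keys())
    csv_writer.writeheader()
    for book in wodehouse_books:
        csv_writer.writerow(book)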
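Later on, the metadata can be read back and keyed by book id to look up the title for a saved file. A short sketch; the function name load_metadata is hypothetical and it assumes the file was written with the column names above:

```python
import csv
from pathlib import Path


def load_metadata(path: Path) -> dict[str, dict[str, str]]:
    """Read metadata.csv into a dict keyed by the Text# column."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["Text#"]: row for row in csv.DictReader(f)}


# Hypothetical usage:
# metadata = load_metadata(Path("data") / "metadata.csv")
# metadata["2005"]["Title"]
```

Opening the file with newline="" matters here because at least one title in the catalog contains a line break.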
Conclusion
It’s really simple to search for books using the Project Gutenberg CSV catalog, and to download the books in a way that complies with their robots and crawlers guidelines (thanks to gutenbergr for showing the way). You can easily get books from Project Gutenberg for further data analysis or machine learning; I’m going to train a language model on P. G. Wodehouse.