I'm working on a project to extract books from Hacker News, and want to link the books to records from Open Library. I've already looked at the process of adding a book to Open Library and loading a data export into sqlite. Now I really want to look through the data and see what's inside.
I do this from two different perspectives: summarising the metadata and looking at some specific records.
Summarising the Metadata
Looking through a random 1% sample of the works, and the related authors and editions, I analysed the fields that occur more than 1% of the time in the metadata. I used a sample so the whole thing would fit into memory and the analysis would run faster. You can see the details in the Jupyter Notebook.
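The field analysis boils down to a frequency count over the sampled JSON records. Here is a minimal sketch of that idea; the `field_frequencies` helper and the tiny sample records are illustrative, not the actual notebook code.

```python
from collections import Counter

def field_frequencies(records, threshold=0.01):
    """Fraction of records containing each top-level field, keeping
    only fields that occur more than `threshold` of the time."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    total = len(records)
    return {f: n / total for f, n in counts.items() if n / total > threshold}

# Illustrative records; the real ones come from the Open Library dump.
works = [
    {"key": "/works/OL1W", "title": "A", "subjects": ["x"]},
    {"key": "/works/OL2W", "title": "B"},
]
freqs = field_frequencies(works)  # e.g. 'subjects' occurs in half the sample
```

The same helper works unchanged for editions and authors, since all three record types are flat JSON objects at the top level.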
It seems like there are a few different processes for adding entities to Open Library. As well as the public adding and editing records, there seem to be bulk imports from other book databases, and potentially other programmatic edits. In fact around 70% of the editions were imported from an external source, leaving 30% that may have been manually entered. This just reports high-level statistics, but it would be interesting to understand how field usage varies by source.
- 70% of the editions have been automatically imported from one of MARC, Better World Books, Internet Archive, or Amazon (listed in `source_records`)
- 64% of editions have at least one ISBN 10 or ISBN 13 (asked for in manual uploads) or an LCCN
- Almost always have a `title`, and sometimes a `full_title` (19%; often the concatenation of the title and subtitle), an `edition_name` (14%), and occasionally other title fields
- Generally have `authors` (89%), and sometimes a `by_statement` (43%), which is how the authors are listed as text in the book
- Editions often contain a `publish_date` (98%), and 58% of the time location information in `publish_country` (which is often a US state, and not exposed in the Open Library user interface)
- 60% of the time `subjects` are available; other fields describing what the book is about occur less than 10% of the time
- 48% have `lc_classifications` and 18% a `dewey_decimal_class`, which help identify the topic
- Editions can be connected to other databases through their external identifier fields
- Sometimes more information is in a `description` or other free-text fields
- There is sometimes a `cover` image (33%) hosted on Open Library
A work can contain multiple editions. In the Open Library user interface it's not very clear how to edit a work, but some changes to editions automatically change the work (such as adding a cover).
- Works largely have a subset of the fields of editions, and the values are not always consistent with the editions
- The authors of a work are normally a superset of the authors of its editions; typically there is only one author (89% of the time), and 5% of the time there is no author
- On average there are 1.3 editions per work
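Editions point back at their work through a `works` list, so the editions-per-work figure can be estimated by counting those references. The records below are illustrative:

```python
from collections import Counter

def editions_per_work(editions):
    """Average number of editions referencing each distinct work."""
    counts = Counter()
    for edition in editions:
        for work in edition.get("works", []):
            counts[work["key"]] += 1
    return sum(counts.values()) / len(counts)

ratio = editions_per_work([
    {"key": "/books/OL1M", "works": [{"key": "/works/OL1W"}]},
    {"key": "/books/OL2M", "works": [{"key": "/works/OL1W"}]},
    {"key": "/books/OL3M", "works": [{"key": "/works/OL2W"}]},
])  # 3 editions over 2 works
```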
Authors are a bit messier:
- On average there are 1.1 works per author
- Most authors have a `name`, 62% a `personal_name`, and 4% `alternate_names`; these are often inconsistent in format
- 22% have a `death_date` (free text), which could be useful for disambiguation
- 7% have `remote_ids` linking to Wikidata, VIAF, and ISNI, where additional information can be obtained
- Less than 2% have a `bio` for the author
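When `remote_ids` are present they can be turned into lookup URLs for enrichment. The URL patterns below are my assumptions about the current Wikidata/VIAF/ISNI URL schemes, not something taken from the dump:

```python
def external_links(author):
    """Build lookup URLs from an author's remote_ids, where recognised."""
    patterns = {  # assumed URL schemes, not from the Open Library dump
        "wikidata": "https://www.wikidata.org/wiki/{}",
        "viaf": "https://viaf.org/viaf/{}",
        "isni": "https://isni.org/isni/{}",
    }
    ids = author.get("remote_ids", {})
    return {k: patterns[k].format(v) for k, v in ids.items() if k in patterns}

# "Q123" is a hypothetical identifier for illustration.
links = external_links({"name": "Example Author", "remote_ids": {"wikidata": "Q123"}})
```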
Looking at some specific examples
As a complement to the high-level statistics, it's useful to look at some specific examples. I picked some technical books I'm aware of and looked through their records in a notebook.
Searching for books with the title "Bayesian Data Analysis" (case-insensitively) returned 4 separate works, all clearly the same book.
Note that one work has the author name in the wrong order (`Gelman Andrew`), and that only /works/OL18391964W contains all the authors (including Andrew Gelman twice, the second time as `A. Gelman` under a different author record).
| Work | Title | Author | Author key |
|---|---|---|---|
| /works/OL25152967W | Bayesian Data Analysis | Gelman Andrew | /authors/OL9492748A |
| /works/OL12630389W | Bayesian data analysis | Andrew Gelman | /authors/OL2668098A |
| /works/OL19124056W | Bayesian data analysis | Andrew Gelman | /authors/OL2668098A |
| /works/OL18391964W | Bayesian data analysis | Andrew Gelman | /authors/OL2668098A |
| /works/OL18391964W | Bayesian data analysis | John B. Carlin | /authors/OL2692132A |
| /works/OL18391964W | Bayesian data analysis | Hal S. Stern | /authors/OL2692133A |
| /works/OL18391964W | Bayesian data analysis | Donald B. Rubin | /authors/OL1194305A |
| /works/OL18391964W | Bayesian data analysis | A. Gelman | /authors/OL2692134A |
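A case-insensitive title search like this can be run against the SQLite load from the earlier post; the `works` table schema here is a stand-in for whatever the actual load used:

```python
import sqlite3

# Stand-in schema: a 'works' table with key and title columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (key TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO works VALUES (?, ?)",
    [
        ("/works/OL25152967W", "Bayesian Data Analysis"),
        ("/works/OL12630389W", "Bayesian data analysis"),
        ("/works/OL19124056W", "Bayesian data analysis"),
        ("/works/OL18391964W", "Bayesian data analysis"),
    ],
)

# COLLATE NOCASE gives an ASCII case-insensitive comparison.
rows = conn.execute(
    "SELECT key FROM works WHERE title = ? COLLATE NOCASE",
    ("bayesian data analysis",),
).fetchall()  # all 4 works match despite the differing capitalisation
```

Note that `COLLATE NOCASE` only folds ASCII letters, so titles with accented characters would need extra handling.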
Multiple works for an edition
Sometimes an edition is linked to multiple works, but all the cases I've checked appear to be errors.
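These cases can be flagged by listing editions whose `works` array has more than one entry; again a sketch over the parsed records, with illustrative data:

```python
def multi_work_editions(editions):
    """Keys of editions linked to more than one work (likely errors)."""
    return [e["key"] for e in editions if len(e.get("works", [])) > 1]

suspect = multi_work_editions([
    {"key": "/books/OL1M", "works": [{"key": "/works/OL1W"}]},
    {"key": "/books/OL2M", "works": [{"key": "/works/OL1W"},
                                     {"key": "/works/OL2W"}]},
])  # only the second edition is flagged
```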
Summary of Open Library
Open Library has massive coverage of books, often with other useful information about them, but with some duplication and inconsistency (e.g. among publishers). It's a good starting point for a knowledge base, but requires additional work to remove duplicates and other errors. A lot of these issues are driven by the interfaces used to add data; an interesting extension would be to look more into how field usage varies by source, and what the sources of duplication are. These could potentially be addressed by the Open Library team to create better results in the future. But it's still useful enough to work with as is, if we're careful.