Bootstrapping a book classifier
hnbooks

I'm working on a project to extract books from Hacker News. Most Hacker News posts aren't about books, and it would be extremely tedious to manually annotate examples when most of them are negative. Instead I used different heuristics to determine whether a post contains a book title, which can then be used for weak labelling. My main takeaway is that zero-shot classification seems like a great starting point for building a classifier.
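
As a rough illustration of heuristic weak labelling, here is a minimal sketch in which each heuristic votes "book", "not book", or abstains, and the votes are combined by simple majority. The three heuristics, their names, and the majority-vote rule are assumptions made up for this example; they are not the heuristics actually used in the post.

```python
import re
from typing import Callable, List, Optional

BOOK, NOT_BOOK = 1, 0

# Each labelling function returns BOOK, NOT_BOOK, or None to abstain.
# All three heuristics below are hypothetical examples.
def lf_amazon_link(text: str) -> Optional[int]:
    return BOOK if re.search(r"amazon\.[a-z.]+/", text) else None

def lf_mentions_book(text: str) -> Optional[int]:
    return BOOK if re.search(r"\bbooks?\b", text, re.IGNORECASE) else None

def lf_very_short(text: str) -> Optional[int]:
    return NOT_BOOK if len(text.split()) < 3 else None

LABELLING_FUNCTIONS: List[Callable[[str], Optional[int]]] = [
    lf_amazon_link,
    lf_mentions_book,
    lf_very_short,
]

def weak_label(text: str, lfs=LABELLING_FUNCTIONS) -> Optional[int]:
    """Majority vote over the heuristics that did not abstain."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not None]
    if not votes:
        return None  # no heuristic fired, so leave the post unlabelled
    book_votes = sum(votes)
    if book_votes * 2 == len(votes):
        return None  # tie between BOOK and NOT_BOOK
    return BOOK if book_votes * 2 > len(votes) else NOT_BOOK
```

Posts where every heuristic abstains stay unlabelled, which is where a zero-shot classifier could plausibly fill the gap.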

Displaying Hacker News Book Comments in HTML
hnbooks

I'm currently working on a project to extract books from Hacker News to help find interesting books. Having extracted ASINs from Hacker News posts and linked them to Open Library records, I have a basic proof of concept. Now I want to be able to display this information in some webpages to help people find the books. I ended up building a minimal prototype by manually curating some examples; this let me focus on the design rather than on the technical aspects.

Adding a Book to Open Library
hnbooks

I've been looking at Open Library as a knowledge base for books. A lot of the data here is manually uploaded by people, and the user interface has a big impact on the data they enter. To understand the data better I wanted to actually go through the process of adding a book, Overcoming Floccinaucinihilipilification by Jon Manning, and this article documents the process. Overall Open Library prioritises making it easy for people to add data, and then has facilities to edit it.

Evaluating Book Retrieval from Hacker News
nlp

I'm working on a project to extract books from Hacker News. I've been thinking about ways to bootstrap this process, such as transfer learning, weak labelling, and active learning. I was reading Robert Monarch's excellent book Human-in-the-Loop Machine Learning, where he gives good reasons why you should start with a separate random dataset for evaluation: you want something that is representative of the real distribution, and any use of a model inherits the biases of that model.

Question Answering as Zero Shot NER for Books
nlp

I'm working on a project to extract books from Hacker News. I've previously found book recommendations for Ask HN Books, and have used the Work of Art named entity from Ontonotes to detect the titles. Another approach is to use extractive question answering as a sort of zero-shot NER. This works amazingly well, provided that there is an actual book title there. The code is simple using the high-level Question Answering pipeline from Transformers.
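
A minimal sketch of the idea, assuming the Hugging Face `transformers` question-answering pipeline with its default model. The question wording, the `min_score` threshold, and the helper names `extract_title` and `load_qa` are hypothetical choices for illustration, not taken from the post.

```python
from typing import Callable, Optional

# Hypothetical question wording; other phrasings may work better.
QUESTION = "What book is being recommended?"

def extract_title(qa: Callable[..., dict], text: str,
                  min_score: float = 0.5) -> Optional[str]:
    """Ask an extractive QA model for the book title in `text`.

    `qa` is any callable with the question-answering pipeline interface:
    it takes `question` and `context` and returns a dict that includes
    "answer" and "score". Extractive QA always returns some span, so
    low-confidence answers are dropped (the threshold is an assumption).
    """
    result = qa(question=QUESTION, context=text)
    if result["score"] < min_score:
        return None
    return result["answer"]

def load_qa():
    """Build the pipeline; imported lazily because it downloads a model."""
    from transformers import pipeline
    return pipeline("question-answering")
```

Usage would look like `extract_title(load_qa(), comment_text)`. The score threshold is one possible way to handle comments that contain no book title at all, and it would need tuning against real data.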

Book NER as a Work of Art
nlp

I'm working on a project to extract books from Hacker News. I've previously found book recommendations for Ask HN Books. Now I want a way to extract the book titles and authors. The Ontonotes corpus contains an NER category called Work of Art, covering titles of books, songs, etc. (see the PDF release notes for details). I wanted to see how well this worked. I quickly tried 3 well-known systems, all trained on this corpus: SpaCy, Stanza, and Flair NLP.

Ask HN Book Recommendations
nlp

I'm working on a project to extract books from Hacker News. Most Hacker News posts aren't about books, so we need some heuristics to find posts that are somewhat likely to be about books. I've already used ASINs to extract book links to Amazon; another approach, like MapFilterFold, is to use Ask HN threads about books. Ask HN is a kind of post on Hacker News that allows asking questions to the community. We can't identify them directly in the dataset, but they typically start with "Ask HN" in the title.
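
The title heuristic can be sketched as a simple filter. The keyword list and the function name `is_ask_hn_book_thread` are assumptions for illustration; the post may well use different rules.

```python
import re
from typing import Optional

# Titles of Ask HN posts typically start with "Ask HN".
ASK_HN = re.compile(r"^\s*ask hn\b", re.IGNORECASE)
# Hypothetical keywords suggesting that the thread is about books.
BOOK_WORDS = re.compile(r"\b(?:books?|reading list)\b", re.IGNORECASE)

def is_ask_hn_book_thread(title: Optional[str]) -> bool:
    """True if the title looks like an Ask HN thread about books."""
    if not title:  # comments have no title
        return False
    return bool(ASK_HN.search(title)) and bool(BOOK_WORDS.search(title))
```

A filter like this only finds the top-level stories; the recommendations themselves would be in the comments of those threads.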

Finding ASINs in HackerNews
nlp

I'm currently working on a project to extract books from Hacker News. After exporting all 2021 posts from the Google BigQuery dataset in a Kaggle Notebook and doing an exploratory data analysis, I'm looking for methods to extract books. One way to extract books is using Amazon links. People often refer to a book with a link to Amazon, and each Amazon product has a 10-character ASIN (Amazon Standard Identification Number); for books this is the same as its ISBN-10.
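
A minimal sketch of pulling ASINs out of Amazon links and checking which ones look like ISBN-10s. The URL shapes handled here (`/dp/`, `/gp/product/`, `/exec/obidos/ASIN/`) are an assumption about common Amazon link formats, and the function names are hypothetical; real links will have more variants.

```python
import re
from typing import List

# Assumed link shapes: .../dp/<ASIN>, .../gp/product/<ASIN>,
# .../exec/obidos/ASIN/<ASIN>, optionally preceded by a title slug.
ASIN_PATTERN = re.compile(
    r"amazon\.[a-z.]{2,6}/(?:\S*?/)?"
    r"(?:dp|gp/product|exec/obidos/ASIN)/"
    r"([A-Z0-9]{10})(?![A-Za-z0-9])"
)

def extract_asins(text: str) -> List[str]:
    """Return the unique ASINs found in Amazon links, in order of appearance."""
    seen = []
    for asin in ASIN_PATTERN.findall(text):
        if asin not in seen:
            seen.append(asin)
    return seen

def is_isbn10(asin: str) -> bool:
    """Check the ISBN-10 format and check digit (weights 10..1, modulo 11)."""
    if not re.fullmatch(r"\d{9}[\dX]", asin):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(asin))
    return total % 11 == 0
```

Filtering the extracted ASINs with `is_isbn10` is one way to separate likely books from other products, since non-book ASINs generally do not have the ISBN-10 shape.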

Hacker News Dataset EDA
python

"A mystery! A riddle! A puzzle! A quest! This was the moment that Ada loved best." (Ada Twist, Scientist, Andrea Beaty)

This is an exploration of 2021 Hacker News posts as a precursor to building a books dataset. The data was sourced from the Google BigQuery public dataset bigquery-public-data.hacker_news.full using a Kaggle notebook:

SELECT * FROM `bigquery-public-data.hacker_news.full` WHERE '2021-01-01' <= timestamp AND timestamp < '2022-01-01'

I want to get a basic understanding of what’s in the dataset before doing any data mining.

Side Project Outline: Book Title NER
nlp

I'm starting a month-long project to extract book titles from Hacker News using Named Entity Recognition. I've been thinking lately about how Data Science can learn from the practices that have emerged in software development, and wanted to find good books on the subject. A lot of the ones I'd read, such as Feathers' Working Effectively with Legacy Code, had come out of Hacker News. However, this is hard to search for using traditional search techniques.