I'm working on a project to extract books from Hacker News. Most Hacker News posts aren't about books, and it would be extremely tedious to manually annotate examples when most of them are negative. Instead I used different heuristics to determine whether a post contains a book title, which can then be used for weak labelling. My main takeaway is that zero-shot classification seems like a great starting point for building a classifier.
I'm currently working on a project to extract books from Hacker News to help find interesting books. Having extracted ASINs from Hacker News posts and linked them to Open Library records, I have a basic proof of concept. Now I want to be able to display this information in some webpages to help people find the books. I ended up building a minimal prototype by manually curating some examples; this let me focus on the design rather than the technical aspects.
I've been looking at Open Library as a knowledge base for books. After extracting ASINs from Hacker News posts, I want to link them to Open Library records. For books, an ASIN is the same as the ISBN-10, which creates a linkage point with Open Library. From my previous Open Library Exploration, about 20% of editions in Open Library have an ISBN-13 but not an ISBN-10 (and 64% have at least one of the two).
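Since a fair share of editions carry only an ISBN-13, converting a 978-prefixed ISBN-13 to its ISBN-10 form gives another way to match against an ASIN. A minimal sketch (the function name is mine; it assumes only the standard ISBN check-digit rules):

```python
from typing import Optional

def isbn13_to_isbn10(isbn13: str) -> Optional[str]:
    """Convert a 978-prefixed ISBN-13 to its ISBN-10 equivalent.

    Only 978-prefixed ISBN-13s have an ISBN-10 form; anything else
    (e.g. a 979 prefix, or a malformed string) returns None.
    """
    digits = isbn13.replace("-", "")
    if len(digits) != 13 or not digits.startswith("978") or not digits.isdigit():
        return None
    body = digits[3:12]  # drop the 978 prefix and the ISBN-13 check digit
    # ISBN-10 check digit: weighted sum mod 11, with 'X' standing for 10
    total = sum((10 - i) * int(d) for i, d in enumerate(body))
    check = (11 - total % 11) % 11
    return body + ("X" if check == 10 else str(check))
```

For example, `isbn13_to_isbn10("9780131103627")` gives `"0131103628"`, the ISBN-10 (and hence the ASIN) for the same edition of The C Programming Language.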
I'm working on a project to extract books from Hacker News, and want to link the books to records from Open Library. I've already looked at the process of adding a book to Open Library and loading a data export into SQLite. Now I want to actually look through the data and see what's inside. I do this from two perspectives: summarising the metadata and looking at some specific records.
I've been looking at Open Library as a knowledge base for books. A lot of the data here is manually uploaded by people, and the user interface has a big impact on the data they enter. To understand the data better I wanted to actually go through the process of adding a book, Overcoming Floccinaucinihilipilification by Jon Manning, and this article documents that process. Overall Open Library prioritises making it easy for people to add data, and then has facilities to edit it.
I'm working on a project to extract books from Hacker News, and want to link the books to records from Open Library. The Open Library data dumps are several gigabytes of compressed TSV, so they're too big to fit in memory on a standard machine. The Libraries Hacked repository imports them into a PostgreSQL database, but I don't want to set up a Postgres database for an analysis job if I can avoid it.
I'm working on a project to extract books from Hacker News. Once I have extracted book titles (e.g. with NER or Question Answering) I need a way to disambiguate them to an entity, and potentially link them to other information. The Internet Archive's Open Library looks like a very good way to do that. Book titles can be ambiguous, so we need some way to link them to a unique entity.
I'm working on a project to extract books from Hacker News. I've been thinking about ways to bootstrap this process such as transfer learning, weak labelling, and active learning. I was reading Robert Monarch's excellent book Human-in-the-Loop Machine Learning, where he gives good reasons why you should start with a separate random dataset for evaluation: you want something that is representative of the real distribution, and any use of a model inherits the biases of that model.
I'm working on a project to extract books from Hacker News. I've previously found book recommendations for Ask HN Books, and have used the Work of Art named entity from Ontonotes to detect the titles. Another approach is to use extractive question answering as a sort of zero-shot NER. This works amazingly well, at least provided that there is an actual book title there. The code is simple using Transformers' high-level Question Answering pipeline.
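The idea can be sketched in a few lines; the question wording and example comment are my own, and the pipeline's default SQuAD-tuned model is downloaded on first use:

```python
from transformers import pipeline

# Extractive QA as zero-shot NER: ask "what book is mentioned?" and let
# the model pull the answer span out of the comment text.
qa = pipeline("question-answering")

comment = (
    "I'd strongly recommend Working Effectively with Legacy Code "
    "by Michael Feathers to anyone maintaining an old system."
)

result = qa(question="What book is mentioned?", context=comment)
# result is a dict with the extracted answer span and a confidence score
print(result["answer"], result["score"])
```

The score is useful as a filter: comments with no book title at all still get an answer span, but typically with a much lower confidence.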
I'm working on a project to extract books from Hacker News. I've previously found book recommendations for Ask HN Books. Now I want a way to extract the book titles and authors. The Ontonotes corpus contains an NER category called Work of Art (for titles of books, songs, etc.; see the PDF release notes for details). I wanted to see how well this worked, so I quickly tried three well-known systems all trained on this corpus: SpaCy, Stanza, and Flair NLP.
I'm working on a project to extract books from Hacker News. Most Hacker News posts aren't about books, so we need some heuristics to find posts that are somewhat likely to be about books. I've already used ASINs to extract book links to Amazon; another approach, like MapFilterFold's, is to use Ask HN threads about books. Ask HN is a kind of post on Hacker News that allows asking questions to the community. They aren't explicitly marked in the dataset, but they typically start with "Ask HN" in the title.
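The title heuristic can be sketched as a pair of regular expressions: one for the "Ask HN" prefix and one for book-related keywords. The keyword list here is illustrative, not the definitive filter:

```python
import re

# Ask HN posts conventionally start their title with "Ask HN"
ASK_HN = re.compile(r"^Ask HN\b", re.IGNORECASE)
# A rough, illustrative keyword filter for book threads
BOOKISH = re.compile(r"\bbooks?\b|\breading\b", re.IGNORECASE)

def is_ask_hn_books(title: str) -> bool:
    """Heuristic: is this title an Ask HN thread about books?"""
    return bool(ASK_HN.match(title)) and bool(BOOKISH.search(title))
```

For example, "Ask HN: What books did you read in 2021?" passes, while "Show HN: My book tracker" and non-book Ask HN questions don't.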
I'm currently working on a project to extract books from Hacker News. After exporting all 2021 posts from the Google BigQuery dataset in a Kaggle Notebook and doing an exploratory data analysis, I'm looking for methods to extract books. One way to extract books is using Amazon links. People often refer to a book with a link to Amazon, and each Amazon product has a 10-character ASIN (Amazon Standard Identification Number); for books this is the same as the ISBN-10.
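Pulling ASINs out of post text can be sketched with a regular expression over Amazon URLs. The URL shapes covered here (`/dp/` and `/gp/product/`, with an optional title slug) are the common ones, not an exhaustive list:

```python
import re

# ASINs are 10 characters of digits and uppercase letters, and sit
# after /dp/ or /gp/product/ in typical Amazon product URLs.
ASIN_RE = re.compile(
    r"amazon\.[a-z.]+/(?:[^/\s]+/)?(?:dp|gp/product)/([0-9A-Z]{10})"
)

def extract_asins(text: str) -> list:
    """Return all ASINs found in Amazon links within the text."""
    return ASIN_RE.findall(text)
```

Applied to a post body, `extract_asins("see https://www.amazon.com/dp/0131103628")` returns `["0131103628"]`, which can then be treated as an ISBN-10 candidate.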
A mystery! A riddle! A puzzle! A quest! This was the moment that Ada loved best. (Ada Twist, Scientist, Andrea Beaty)

This is an exploration of 2021 Hacker News posts as a precursor to building a books dataset. The data was sourced from the Google BigQuery public dataset bigquery-public-data.hacker_news.full using a Kaggle notebook:

SELECT * FROM `bigquery-public-data.hacker_news.full`
WHERE '2021-01-01' <= timestamp AND timestamp < '2022-01-01'

I want to get a basic understanding of what's in the dataset before doing any data mining.
I'm starting a month-long project to extract book titles from Hacker News using Named Entity Recognition. I've been thinking lately about how Data Science can learn from the practices that have emerged in software development, and wanted to find good books on the subject. A lot of the ones I'd read, such as Feathers' Working Effectively with Legacy Code, had come out of Hacker News. However this is a hard thing to search for using traditional search techniques.