I'm working on a project to extract books from Hacker News. I've previously found book recommendations for Ask HN Books. Now I want a way to extract the book titles and authors. The OntoNotes corpus contains an NER category called Work of Art (label WORK_OF_ART), which covers titles of books, songs, and so on (see the PDF release notes for details). I wanted to see how well models trained on it work for this task.
I quickly tried three well-known systems, all trained on this corpus: SpaCy, Stanza, and Flair NLP.
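As a rough illustration of what trying one of these systems looks like, here is a minimal sketch using SpaCy. The helper names `book_entities` and `extract_with_spacy` are hypothetical, and the sketch assumes SpaCy and the `en_core_web_trf` model are installed; it is not the exact code used for this comparison.

```python
# OntoNotes labels that matter for book extraction.
BOOK_LABELS = {"WORK_OF_ART", "PERSON"}


def book_entities(doc, labels=BOOK_LABELS):
    """Return (text, label) pairs for entities whose label is in `labels`.

    `doc` is anything with an `ents` iterable whose items expose `text`
    and `label_`, which is the shape of a spaCy Doc.
    """
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]


def extract_with_spacy(text, model_name="en_core_web_trf"):
    """Run a spaCy pipeline over `text` and keep only titles and people.

    Assumes the model has been downloaded beforehand, e.g. with
    `python -m spacy download en_core_web_trf`.
    """
    import spacy  # imported lazily so book_entities works without spaCy

    nlp = spacy.load(model_name)
    return book_entities(nlp(text))
```

Calling `extract_with_spacy` on the text of a comment should return the predicted title and person spans, which can then be checked by eye against the comment.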
SpaCy (with the transformer model en_core_web_trf; the smaller models don't work) and Flair NLP both performed quite well.
They both found a lot of the titles and persons, although they struggled with punctuation.
The lack of periods seemed to hurt both models in lists marked with hyphens, and things like quotations seemed to leak into the results.
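One cheap way to reduce this kind of leakage is to trim stray characters from the edges of each predicted span. The sketch below is only an assumption about what a cleanup step could look like: `clean_title` and the character set in `EDGE_CHARS` are hypothetical, and this was not part of the comparison above.

```python
# Characters that tend to cling to the edges of a predicted title:
# whitespace, straight and curly quotes, list hyphens, and stray punctuation.
EDGE_CHARS = " \t\n\"'“”‘’-*,;:."


def clean_title(span_text):
    """Trim quotes, list markers, and punctuation from both ends of a span.

    Only the edges are touched, so hyphens or apostrophes inside a title
    are preserved.
    """
    return span_text.strip(EDGE_CHARS)
```

This is deliberately blunt: it would also damage a title that legitimately starts or ends with one of these characters, so it is better treated as a baseline than as a fix.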
Stanza performed much worse than the other two models and isn't worth considering.
These models aren't good enough to solve the problem well on their own, but they are a good starting point for training a better NER model for extracting book titles.