A typical way to evaluate NER (Named Entity Recognition) systems is to look at the F1 score. However, this is a bad idea, as stated in Chris Manning’s 2006 blog post Doing Named Entity Recognition? Don’t optimize for F1. The F1 score penalises a partial match twice (once as a false negative and once as a false positive), but in many cases a partial match is a better result and may improve the overall system the NER model is part of.
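To make that concrete with a hypothetical example: suppose there are 10 gold entities and the model predicts 10 spans, 9 exact matches and 1 partial match. Scored strictly, the partial match counts as both a false positive and a false negative, giving precision 9/10, recall 9/10 and an F1 of 0.90; had the model predicted nothing for that entity, precision would be 9/9 and recall 9/10, for an F1 of about 0.95. The partially correct prediction scores worse than no prediction at all.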
I’ve been building a system for showing book mentions in HackerNews posts. The core idea is to count all the mentions of books and show the posts that mention them. This requires finding the books (via the NER system), linking them to a knowledge base (such as Open Library), and then aggregating them. The NER system doesn’t need to be perfect; some false negatives are acceptable, and some false positives simply won’t link to anything in the knowledge base.
This article will look at some example data from this project, demonstrate why the F1 score isn’t ideal, and explore some alternatives.
The data
The data I’m using is a set of manually annotated HackerNews posts (to find out more about the methodology look at my other posts on this or the code) with annotations of all PERSON (the name of a person) or WORK_OF_ART (the name of a book, album, artwork, movie, etc.).
Convert every annotation into a SpaCy Doc to make predictions on.
This is straightforward, but we need to be careful because Prodigy span ends are inclusive while SpaCy Spans exclude the end.
from spacy.tokens import Doc, Span

def annotation_to_doc(vocab, annotation, set_ents=True):
    doc = Doc(
        vocab=vocab,
        words=[token['text'] for token in annotation['tokens']],
        spaces=[token['ws'] for token in annotation['tokens']],
    )
    spans = [
        Span(
            doc=doc,
            start=span['token_start'],
            # N.B. Off by one due to Prodigy including the end but SpaCy excluding it
            end=span['token_end'] + 1,
            label=span['label'],
        )
        for span in annotation['spans']
    ]
    if set_ents:
        doc.set_ents(spans)
    return doc
Use en_core_web_trf, an English Transformer model that has PERSON, WORK_OF_ART and many other named entities.
import spacy

nlp = spacy.load('en_core_web_trf')
Convert all the docs
docs = [annotation_to_doc(nlp.vocab, d) for d in data]
len(docs)
305
An example annotated document (entity spans are shown in brackets, followed by their label).
from spacy import displacy

displacy.render(docs[0], style='ent')
Second this, [Becoming Steve Jobs]WORK_OF_ART is the superior book.
Run these through the SpaCy model to make our predictions.
%%time
preds = list(nlp.pipe(annotation_to_doc(nlp.vocab, d, set_ents=False) for d in data))
CPU times: user 3.76 s, sys: 116 ms, total: 3.88 s
Wall time: 3.94 s
An example prediction; it got the WORK_OF_ART and an additional ORDINAL.
displacy.render(preds[0], style='ent')
[Second]ORDINAL this, [Becoming Steve Jobs]WORK_OF_ART is the superior book.
Getting BILOU tags
For evaluating NER the tags generally need to be in some standard form like Inside-Outside-Beginning (IOB) (also known as BIO).
This function will convert them to the equivalent Beginning-Inside-Last-Outside-Unit scheme (BILOU, also known as IOBES or BMEWO), because SpaCy has a handy function to do it.
from spacy.training import offsets_to_biluo_tags

def get_biluo(doc, include_labels=None):
    if include_labels is None:
        include_labels = [ent.label_ for ent in doc.ents]
    return offsets_to_biluo_tags(
        doc,
        [(ent.start_char, ent.end_char, ent.label_)
         for ent in doc.ents
         if ent.label_ in include_labels],
    )
Filter to the entities of interest in the annotation.
ENTS = ['WORK_OF_ART', 'PERSON']
So this annotation:
displacy.render(docs[0], style='ent')
Second this, [Becoming Steve Jobs]WORK_OF_ART is the superior book.
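Converting it with get_biluo (filtered to ENTS) gives one tag per token. The exact tokenisation depends on the SpaCy tokenizer, but the result looks something like this:

get_biluo(docs[0], ENTS)
# Roughly: ['O', 'O', 'O', 'B-WORK_OF_ART', 'I-WORK_OF_ART', 'L-WORK_OF_ART', 'O', 'O', 'O', 'O', 'O']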
But what about partial matches? These count against both precision and recall (so a partial match is scored worse than predicting nothing), but for our use case it would be better if we could still link the entity.
Let’s look more closely at ways to evaluate NER systems.
NEREvaluate
David Batista has an excellent blog post on Named Entity Evaluation. In short, the F1 score treats NER like a binary classification problem, but it’s not: there are lots of ways to get it almost right.
This has been implemented in the nervaluate Python library.
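The cell that builds the evaluator isn’t shown in the extract above; here is a minimal sketch, assuming the BILOU tag lists from get_biluo are fed to nervaluate’s list loader (the variable names true_tags and pred_tags are mine, and recent nervaluate versions may return extra values from evaluate()):

from nervaluate import Evaluator

# BILOU tag lists for the gold annotations and the model predictions,
# filtered to the entity types of interest
true_tags = [get_biluo(doc, ENTS) for doc in docs]
pred_tags = [get_biluo(doc, ENTS) for doc in preds]

evaluator = Evaluator(true_tags, pred_tags, tags=ENTS, loader='list')
results, results_by_tag = evaluator.evaluate()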
To make the results a bit easier to visualise, we’re going to switch the rows and columns.
from collections import defaultdict

def flip_nested_dict(dd):
    result = defaultdict(dict)
    for k1, d in dd.items():
        for k2, v in d.items():
            result[k2][k1] = v
    return dict(result)
The types are, from David Batista’s post:
Strict: exact boundary surface string match and entity type
Exact: exact boundary match over the surface string, regardless of the type
Partial: partial boundary match over the surface string, regardless of the type
Type: some overlap between the system tagged entity and the gold annotation is required
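For example, if the gold span is Shape Up (WORK_OF_ART) and the model predicts Shape Up by Basecamp (WORK_OF_ART), the prediction is incorrect under strict and exact, partially correct under partial, and correct under type, since the label matches and the spans overlap.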
Strict is the same scheme that seqeval uses, and the scores match its micro average.
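The seqeval run isn’t shown in the extract either; assuming the same tag lists as in the sketch above, the strict scores could be reproduced with something like:

from seqeval.metrics import classification_report
from seqeval.scheme import BILOU

# Strict mode requires exact span boundaries and a matching label,
# the same condition as nervaluate's "strict" scheme
print(classification_report(true_tags, pred_tags, mode='strict', scheme=BILOU))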
import pandas as pd

pd.DataFrame(flip_nested_dict(results))
          correct  incorrect  partial  missed  spurious  possible  actual  precision    recall        f1
ent_type      138          2        0      19         3       159     143   0.965035  0.867925  0.913907
partial       134          0        6      19         3       159     143   0.958042  0.861635  0.907285
strict        134          6        0      19         3       159     143   0.937063  0.842767  0.887417
exact         134          6        0      19         3       159     143   0.937063  0.842767  0.887417
Looking by entity type is even more revealing: for WORK_OF_ART we have very high precision on partial matches, showing this could actually be a better solution than it first appears, with a precision closer to 94% than the strict 89%.
Note there’s a discrepancy here; the strict F1 for WORK_OF_ART is 79.1%, whereas seqeval gave 80.3%. This is because seqeval ignores the other types of tags when evaluating at a tag level, but nervaluate includes them.
from IPython.display import display

for tag, tag_results in results_by_tag.items():
    display(pd.DataFrame(flip_nested_dict(tag_results)).style.set_caption(tag))
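The helpers same_label and overlaps used below aren’t defined in the extract; a minimal sketch, assuming nervaluate exposes each entity as a dict with 'label', 'start' and 'end' keys (older versions use a namedtuple with e_type, start_offset and end_offset, so adjust the attribute access accordingly):

def same_label(a, b):
    # Assumed dict representation of a nervaluate entity
    return a['label'] == b['label']

def overlaps(a, b):
    # True when the two spans share at least one token position
    return a['start'] <= b['end'] and b['start'] <= a['end']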
def has_partial_overlaps_with_same_label(true_item, pred_item):
    return (
        true_item != pred_item
        and same_label(true_item, pred_item)
        and overlaps(true_item, pred_item)
    )
Get all the document indices with a partial overlap
partial_overlap_idx = [
    idx
    for idx in range(len(evaluator.true))
    if any(
        has_partial_overlaps_with_same_label(true_item, pred_item)
        for true_item in evaluator.true[idx]
        for pred_item in evaluator.pred[idx]
    )
]
len(partial_overlap_idx)
4
Three of the four examples only differ in surrounding punctuation and whitespace, and the other includes the author (so is not ambiguous).
All these examples would work perfectly in the entity linking stage.
for idx in partial_overlap_idx:
    display(idx)
    displacy.render(docs[idx], style='ent')
    displacy.render(preds[idx], style='ent')
98
Annotation: After I posted I remembered the book "[How to Invent Everything]WORK_OF_ART" which takes the case of a time traveler stuck in the past with a guide to invent civilization from scratch.
Prediction: After I posted I remembered the book "[How to Invent Everything"]WORK_OF_ART which takes the case of a time traveler stuck in the past with a guide to invent civilization from scratch.
105
Annotation: I found this: https://us.macmillan.com/books/9781250280374 << [The Vaccine: Inside the Race to Conquer the COVID-19 Pandemic]WORK_OF_ART Author: [Joe Miller]PERSON with Dr. [Özlem Türeci]PERSON and Dr. [Ugur Sahin]PERSON >>
Prediction: I found this: https://us.macmillan.com/books/9781250280374 << [The Vaccine: Inside the Race to Conquer the COVID-19 Pandemic]WORK_OF_ART Author: [Joe Miller]PERSON with Dr. [Özlem Türeci]PERSON and Dr. [Ugur Sahin]PERSON >>
234
Annotation: The ideas in the article are similar (in a good way) to [Shape Up]WORK_OF_ART by Basecamp [0].
Prediction: The ideas in the article are similar (in a good way) to [Shape Up by Basecamp]WORK_OF_ART [0].
278
From "
The Elements of Journalism
WORK_OF_ART
" by
Bill Kovach
PERSON
and
Tom Rosenstiel
PERSON
: "Originality is a bulwark of better journalism, deeper understanding, and more accurate reporting.
From
"The Elements of Journalism"
WORK_OF_ART
by
Bill Kovach
PERSON
and
Tom Rosenstiel
PERSON
: "Originality is a bulwark of better journalism, deeper understanding, and more accurate reporting.
Final Thoughts
When choosing a metric, always make sure it aligns with your final goals. The F1 score is fine when comparing similar systems, but in this case it penalises predictions that are actually fine for the use case; a better metric would give credit to substantial partial overlaps.