Training Recipe Ingredient NER with Transformers

nlp, python, ner

Published April 9, 2022

I trained a Transformer model to predict the components of an ingredient phrase, such as the ingredient name, the quantity, and the unit. It performed better than the benchmark CRF model when trained on fewer examples; on the full dataset it performed similarly, but was still useful for identifying issues in the data annotations. It also has some success on languages it wasn’t trained on, such as French, Hungarian, and Russian. It took several hours to put together with no prior experience, and minutes to train for free on a Kaggle notebook. You can try it online or see the training notebook.

The underlying training and test data is from A Named Entity Based Approach to Model Recipes, by Diwan, Batra, and Bagler. They manually annotated a large number of ingredients from AllRecipes.com and FOOD.com with the tags below.

| Tag       | Significance                          | Example        |
|-----------|---------------------------------------|----------------|
| NAME      | Name of ingredient                    | salt, pepper   |
| STATE     | Processing state of ingredient        | ground, thawed |
| UNIT      | Measuring unit(s)                     | gram, cup      |
| QUANTITY  | Quantity associated with the unit(s)  | 1, 1 1/2, 2-4  |
| SIZE      | Portion sizes mentioned               | small, large   |
| TEMP      | Temperature applied prior to cooking  | hot, frozen    |
| DRY/FRESH | Fresh or dry, as mentioned            | dry, fresh     |

I have previously replicated their benchmark using Stanford NER, a Conditional Random Fields model. Here are the f1-scores reported in the paper (columns are training set, rows are testing set).

Benchmark - Paper:

| Testing Set ↓ / Training Set → | AllRecipes | FOOD.com | BOTH   |
|--------------------------------|------------|----------|--------|
| AllRecipes                     | 96.82%     | 93.17%   | 97.09% |
| FOOD.com                       | 86.72%     | 95.19%   | 98.48% |
| BOTH                           | 89.72%     | 94.98%   | 96.11% |

While these may look impressive, a baseline that predicts the most common training label for each token, and O for out-of-vocabulary tokens, gets an f1-score over 92%. This is actually quite a simple problem, because most tokens have a label and ambiguity is rare.
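For concreteness, here’s a minimal sketch of that baseline, assuming the data has already been split into (tokens, labels) pairs:

```python
from collections import Counter, defaultdict

def fit_majority_baseline(train_sentences):
    """train_sentences: list of (tokens, labels) pairs.
    Learn the most common label for each token seen in training."""
    counts = defaultdict(Counter)
    for tokens, labels in train_sentences:
        for token, label in zip(tokens, labels):
            counts[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}

def predict_majority(lookup, tokens):
    # Out-of-vocabulary tokens fall back to "O"
    return [lookup.get(token, "O") for token in tokens]
```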

I followed the process of training an NER with transformers from Chapter 4 of Natural Language Processing with Transformers by Tunstall, von Werra, and Wolf (using their public notebooks as a guide). There was a marked improvement when using the smaller AllRecipes dataset (1,470 training samples). However, on the larger FOOD.com dataset (5,142 training samples) the increase in performance was smaller, and on the combined dataset it was very marginal. Note this holds out a validation set and uses the default hyperparameters from the text; I haven’t tried to optimise it at all or use every data point.

Transformer (XLM-RoBERTa):

| Testing Set ↓ / Training Set → | AllRecipes | FOOD.com | BOTH   |
|--------------------------------|------------|----------|--------|
| AllRecipes                     | 96.94%     | 95.73%   | 97.34% |
| FOOD.com                       | 91.64%     | 96.04%   | 95.77% |
| BOTH                           | 92.9%      | 95.96%   | 96.15% |
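For reference, the fine-tuning itself looks roughly like the sketch below, following the book’s recipe. It assumes a Hugging Face DatasetDict called `dataset` with `tokens` and `ner_tags` columns and a list `tag_names` of the label strings; the hyperparameters shown are illustrative rather than the exact values used.

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, TrainingArguments, Trainer)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_and_align_labels(batch):
    # Tokenise pre-split words and copy each word's label to its first subword,
    # masking special tokens and subword continuations with -100.
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                label_ids.append(-100)
            else:
                label_ids.append(word_labels[word_id])
            previous = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

encoded = dataset.map(tokenize_and_align_labels, batched=True,
                      remove_columns=dataset["train"].column_names)

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(tag_names),
    id2label=dict(enumerate(tag_names)),
    label2id={tag: i for i, tag in enumerate(tag_names)},
)

args = TrainingArguments(
    output_dir="ingredient-ner",
    num_train_epochs=3,               # illustrative hyperparameters, not tuned
    per_device_train_batch_size=24,
    learning_rate=5e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```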

I suspect it doesn’t do much better because of inconsistencies in the annotations, and because we’re hitting the limits of what the labels support. Running an error analysis, as per the NLP with Transformers text (sketched below), showed some issues. Often only the first of multiple ingredients is annotated.

| token | 1        | teaspoon | orange | zest | or | 1  | teaspoon | lemon | zest |
|-------|----------|----------|--------|------|----|----|----------|-------|------|
| label | QUANTITY | UNIT     | NAME   | NAME | O  | O  | O        | O     | O    |

In this case all but the last ingredient name are annotated.

| token | 1/4      | cup  | sugar | , | to | taste | ( | can  | use | honey | , | agave | syrup | , | or | stevia | ) |
|-------|----------|------|-------|---|----|-------|---|------|-----|-------|---|-------|-------|---|----|--------|---|
| label | QUANTITY | UNIT | NAME  | O | O  | O     | O | UNIT | O   | NAME  | O | NAME  | NAME  | O | O  | O      | O |

The inconsistencies confused both me and the model. There are instances of both “firm tofu” and “firm tomatoes” where firm is considered part of the name, and others where it is part of the state. Similarly in “stewing beef”, stewing is sometimes a state and sometimes part of the name. There were also genuine model errors, though: it couldn’t distinguish “clove” in “garlic clove” (a unit) from “cloves” in “ground cloves” (a name).
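The error analysis was along these lines: score each validation example by its loss and inspect the worst offenders, which is where bad annotations tend to surface. This is a rough sketch in the spirit of the book’s approach rather than the exact code, reusing the `model` and `encoded` dataset from the training sketch above.

```python
import torch
from torch.nn.functional import cross_entropy

model.eval()

def example_loss(example):
    # Score one validation example by its average token-level cross-entropy,
    # ignoring the -100 positions (special tokens and subword continuations).
    inputs = {k: torch.tensor([example[k]], device=model.device)
              for k in ("input_ids", "attention_mask")}
    labels = torch.tensor([example["labels"]], device=model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
    token_loss = cross_entropy(logits.transpose(1, 2), labels,
                               reduction="none", ignore_index=-100)
    mask = labels != -100
    example["loss"] = (token_loss.sum() / mask.sum()).item()
    return example

# The highest-loss examples are a good place to look for annotation problems
scored = encoded["validation"].map(example_loss)
worst = scored.sort("loss", reverse=True)
```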

An amazing thing about using a multilingual transformer model like XLM-RoBERTa is that it shows some zero-shot cross-language generalisation. Even though all the training examples are English, it does better than random on other languages. Admittedly the pattern of ingredients makes this easier (e.g. a numerical quantity, followed by a unit, followed by a name), but it picked up some other things too. I didn’t have a dataset to test on, but I tried it on a few examples I could find. If you want to try more you can use it in the Hugging Face model hub and share what you find.
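Loading the fine-tuned model for inference only takes a couple of lines with the transformers pipeline API. The model id below is a placeholder; point it at the actual checkpoint on the hub (or a local directory saved from the trainer).

```python
from transformers import pipeline

# Placeholder model id: substitute the fine-tuned checkpoint from the hub,
# or a local directory produced by trainer.save_model("ingredient-ner").
tagger = pipeline("token-classification",
                  model="your-username/ingredient-ner",
                  aggregation_strategy="simple")

print(tagger("1 petit oignon rouge"))
# e.g. [{'entity_group': 'QUANTITY', 'word': '1', ...},
#       {'entity_group': 'SIZE', 'word': 'petit', ...}, ...]
```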

As you might expect, it does well on a French example, where there’s a lot of shared vocabulary. However, any model relying on exact token lookups would not be able to learn this from the English training data.

| token       | 1          | petit  | oignon | rouge  |
|-------------|------------|--------|--------|--------|
| translation | 1          | small  | onion  | red    |
| actual      | I-QUANTITY | I-SIZE | I-NAME | I-NAME |
| prediction  | I-QUANTITY | I-SIZE | I-NAME | I-NAME |

Going a bit further afield to Hungarian, it certainly does better than random. Here’s an example where it only makes a mistake on one entity, but picks up that fagyasztott is not part of the name.

| token       | 1        | csomag | fagyasztott | kukorica |
|-------------|----------|--------|-------------|----------|
| translation | 1        | packet | frozen      | corn     |
| actual      | QUANTITY | UNIT   | TEMP        | NAME     |
| prediction  | QUANTITY | UNIT   | STATE       | NAME     |

Here’s another Hungarian example where it gets the name boundary wrong because it misses the unit (konzerv).

| token       | 50       | dkg        | kukorica | konzerv |
|-------------|----------|------------|----------|---------|
| translation | 50       | dkg (10 g) | corn     | canned  |
| actual      | QUANTITY | UNIT       | NAME     | UNIT    |
| prediction  | QUANTITY | UNIT       | NAME     | NAME    |

However, here’s a harder example that it gets precisely right.

| token       | őrölt  | fehér | bors   |
|-------------|--------|-------|--------|
| translation | ground | white | pepper |
| actual      | STATE  | NAME  | NAME   |
| prediction  | STATE  | NAME  | NAME   |

Russian should be even harder since it’s written in a different script, although that is straightforward to transliterate. However, here’s an example that it gets exactly right:

| token       | Сало | свиное | свежее | - | 50 | г      |
|-------------|------|--------|--------|---|----|--------|
| translation | fat  | pork   | fresh  | - | 50 | g      |
| actual      | NAME | NAME   | DF     | O | O  | I-UNIT |
| prediction  | NAME | NAME   | DF     | O | O  | I-UNIT |

If one wanted to extend this model to one of these other languages, the existing predictions would be a good place to start. Annotators could then correct the mistakes, especially where the model is unsure, which is much faster than manually labelling every token. In this way a good training set could be constructed relatively quickly by bootstrapping from another language. For more ideas on dealing with few or no labels, see Chapter 9 of the Natural Language Processing with Transformers book.
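A rough sketch of what that bootstrapping could look like: run the English model over the new language and flag low-confidence tokens for human review. The helper and threshold here are illustrative, and the model id is again a placeholder.

```python
from transformers import pipeline

# Token-level tagger (no aggregation) so every subword piece comes back with a score.
token_tagger = pipeline("token-classification", model="your-username/ingredient-ner")

def pre_annotate(sentence, confidence_threshold=0.9):
    """Suggest labels for a sentence and flag tokens the model is unsure about,
    so an annotator only needs to review those (illustrative helper)."""
    rows = []
    for entity in token_tagger(sentence):
        flag = "REVIEW" if entity["score"] < confidence_threshold else ""
        rows.append((entity["word"], entity["entity"],
                     round(float(entity["score"]), 2), flag))
    return rows

for row in pre_annotate("1 csomag fagyasztott kukorica"):
    print(row)  # note: words come back as sentencepiece pieces, e.g. '▁kukorica'
```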

To take the model further we could fix the annotation errors, in particular the handling of multiple ingredients within a single phrase, and retrain the model. We could also annotate more diverse ingredient sets; the NY Times released a similar ingredient phrase tagger along with training data (and the corresponding blog post is informative).

However, the tagger is already very good, and a better use of time would be to run it over a large number of recipe ingredients to extract information. There are many recipes that can be extracted from the internet; for example using the Web Data Commons extracts of Recipes, the recipes subreddit (via Pushshift), exporting the Cookbook wikibook, or using OpenRecipes (or their latest export). Practically, the CRF model is likely a better choice for this since it works roughly as well and would run much more efficiently. Then you could look at which ingredient names occur together, estimate nutritional content, convert quantities by region, or do more complex tasks like suggesting ingredient substitutes or generating recipes.
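As a small sketch of the first of those ideas, using the aggregated `tagger` pipeline from earlier for simplicity (the same works with any tagger): count which extracted ingredient names occur together across recipes. The `recipes` variable is assumed to be an iterable of ingredient-string lists, however they were scraped.

```python
from collections import Counter
from itertools import combinations

def name_cooccurrence(recipes):
    """recipes: an iterable of recipes, each a list of ingredient strings.
    Returns a Counter of pairs of ingredient NAMEs that occur in the same recipe."""
    pairs = Counter()
    for ingredient_lines in recipes:
        names = set()
        for line in ingredient_lines:
            for entity in tagger(line):  # `tagger` is the aggregated pipeline from above
                if entity["entity_group"] == "NAME":
                    names.add(entity["word"].strip().lower())
        pairs.update(combinations(sorted(names), 2))
    return pairs

# name_cooccurrence(recipes).most_common(20) gives the most frequent ingredient pairings
```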