Exploring Food Recipes using Machine Intelligence

Original article was published on Deep Learning on Medium

Using a Word2Vec model for recipe analysis


Food is an inseparable part of our lives. Ingredients and recipes often shape what a person chooses to eat. Depending on its ingredients and style of cooking, a cuisine can have hundreds or even thousands of recipes for different dishes. A recipe on a website lists the ingredients needed for a dish and the cooking procedure. The problem is that a user cannot easily tell which dishes can be cooked with the ingredients they already have. To address this, a machine learning approach can be used to suggest recipes based on the ingredients available to the user.

So, before we dig further into how machine learning can be used in the food industry, let's first understand more about Natural Language Processing (NLP).

What is NLP

Natural language refers to the language used by humans to communicate with each other. This communication can be verbal or textual. For instance, face-to-face conversations, tweets, blogs, emails, websites, and SMS messages all contain natural language. However, for computers to comprehend and process natural language, we need to apply rules and algorithms that convert this unstructured data into a form computers can understand.

Syntactic analysis and semantic analysis are the main techniques used to complete Natural Language Processing tasks. “Syntax” refers to the arrangement of words in a sentence such that they make grammatical sense, whereas “semantics” refers to the meaning that is conveyed by a text.

With these rules and word embedding algorithms, we convert natural language words into a numeric format that computers can understand.

Word Embedding

Word embedding is a type of word representation that gives words with similar meanings similar representations, which machine learning algorithms can then work with. It is also called a distributed semantic model, semantic vector space, or vector space model, meaning that vectors of similar words are grouped together in vector space. The idea behind it is fairly straightforward: you shall know a word by the company it keeps. Words that have similar neighbors, i.e. that are used in about the same contexts, are therefore likely to have the same meaning or at least to be closely related.
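To make the "company it keeps" idea concrete, here is a minimal sketch that counts which ingredients appear near each other. The toy recipes, the `context_counts` helper, and the window size are all assumptions made for illustration; this is plain co-occurrence counting, not Word2Vec itself.

```python
from collections import Counter, defaultdict

# Toy "recipes": each one is a list of ingredient tokens (made-up data).
recipes = [
    ["paneer", "ginger", "garlic", "garam masala"],
    ["chicken", "ginger", "garlic", "garam masala"],
    ["mango", "sugar", "milk"],
]

def context_counts(token_lists, window=2):
    """For every token, count the tokens found within `window` positions of it."""
    contexts = defaultdict(Counter)
    for tokens in token_lists:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts

contexts = context_counts(recipes)

# Ingredients that share neighbors are candidates for being "similar".
shared = set(contexts["paneer"]) & set(contexts["chicken"])
print(sorted(shared))
```

In this toy data, "paneer" and "chicken" share neighbors while "paneer" and "mango" share none, which is the kind of signal an embedding model turns into distances between vectors.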

Word2Vec – A Word Embedding Approach

Word2Vec is a word embedding approach developed by Tomas Mikolov and his colleagues at Google, and it remains one of the most widely used. It uses a shallow neural network to convert words into corresponding vectors in such a way that semantically similar words are close to each other in N-dimensional space, where N is the number of dimensions of the vector.

Wait! Why on earth do we need word embeddings to analyze food recipes and ingredients? Well, we need some way to convert text and categorical data into numeric, machine-readable variables if we want to compare one recipe with another. In this tutorial, we will learn how Word2Vec can be used to:

  • Suggest similar concepts: word embeddings let us suggest ingredients similar to the one passed to the model.
  • Create a group of related words: semantic grouping places things with similar characteristics together and dissimilar things far apart.
  • Find unrelated concepts
  • Compute similarity between two words, and more

P.S. This post is intended as a reference and starting point for those interested in exploring the field further.

Food Recipes Dataset

The secret to getting Word2Vec really working for you is to have lots of text data in the relevant domain. For this tutorial, we will be using a dataset of roughly 5,000 recipes from different cuisines with varying ingredients.

Data Cleaning and Pre-Processing

Let us first load the recipes into a pandas DataFrame and remove empty rows.
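A minimal sketch of this loading step is shown here. The column names and the inline CSV are assumptions standing in for the real recipes file, which would normally be read from disk with `pd.read_csv` and a file path.

```python
import io
import pandas as pd

# Stand-in for the real recipes file; the column names are assumptions.
raw_csv = io.StringIO(
    "recipe_name,cuisine,ingredients\n"
    'Paneer Tikka,Indian,"paneer, yogurt, ginger"\n'
    ",,\n"
    'Mango Lassi,Indian,"mango, yogurt, sugar"\n'
)

recipes_df = pd.read_csv(raw_csv)

# Drop rows where every column is empty, then rows without an ingredient list.
recipes_df = recipes_df.dropna(how="all").dropna(subset=["ingredients"])
recipes_df = recipes_df.reset_index(drop=True)

print(len(recipes_df))
```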

The result is approximately 5,400 recipes with the following columns: recipe name, rating, course type, cuisine, and the list of ingredients required for each recipe.

As always, the raw data is prone to noise: typos, stop words, unwanted spaces, punctuation, numbers, and so on, all of which are cleaned up or removed. Additional pre-processing includes:

  1. Ingredients are often written in plural form (e.g. tomatoes instead of tomato, potatoes instead of potato) and need to be converted to singular form to reduce the vocabulary size.
  2. Most ingredients are prefixed with adjectives, e.g. dried tomatoes, squeezed lemons, fresh coriander. These words (dried, squeezed, fresh, etc.) are not useful for generating meaningful word embeddings, so they are stripped using regular expressions.

Here’s the script that cleans & pre-processes the ingredients data:
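A minimal sketch of these cleaning steps might look like the following. The adjective list, the `singularize` rules, and the `clean_ingredient` helper are illustrative assumptions, not the exact rules behind the numbers reported in this post; a proper lemmatizer would handle plurals more robustly.

```python
import re

# Illustrative list of descriptive words to strip; extend it for real data.
ADJECTIVES = {"dried", "squeezed", "fresh", "chopped", "sliced"}

def singularize(word):
    """Very naive plural-to-singular rules (assumption, not a full lemmatizer)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # berries -> berry
    if word.endswith("oes"):
        return word[:-2]                # tomatoes -> tomato
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]                # lemons -> lemon
    return word

def clean_ingredient(text):
    """Lowercase, drop digits and punctuation, strip adjectives, singularize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    words = [w for w in text.split() if w not in ADJECTIVES]
    return " ".join(singularize(w) for w in words)

print(clean_ingredient("2 Dried Tomatoes,"))
print(clean_ingredient("squeezed lemons"))
print(clean_ingredient("fresh coriander"))
```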

As the results show, a total of about 55K ingredient mentions appear across 5,400+ recipes, of which 2,600+ are unique after pre-processing.
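Totals like these can be computed with a `Counter`. The sketch below uses a toy list of cleaned recipes; the name `recipe_ingredients` and the data are assumptions, while `counts_ingr` matches the variable used in the snippets in this post.

```python
from collections import Counter

# Toy cleaned data: one list of ingredient strings per recipe (made up).
recipe_ingredients = [
    ["paneer", "ginger", "garlic"],
    ["chicken", "ginger", "garlic"],
    ["mango", "sugar"],
]

# Flatten all recipes into one stream of ingredients and count occurrences.
counts_ingr = Counter(ing for recipe in recipe_ingredients for ing in recipe)

total_ingredients = sum(counts_ingr.values())   # every occurrence
unique_ingredients = len(counts_ingr)           # distinct ingredients
print(total_ingredients, unique_ingredients)
```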

Now, let's explore the most common and least common ingredients.

# find the most common ingredients used across all recipes
print("---- Most Common Ingredients ----")
print(counts_ingr.most_common(10))
print("\n")

# find the least common ingredients used across all recipes
print("---- Least Common Ingredients ----")
print(counts_ingr.most_common()[-10:])
Most and Least Common ingredients

Let’s visualize this with a word cloud.

Word Cloud Visualization

Training Word2Vec

With Gensim, it is extremely straightforward to create a Word2Vec model. The ingredients list is passed to the Word2Vec class of the gensim.models package. Word2Vec uses all these tokens to internally create a vocabulary.

The step above builds the vocabulary from the ingredients list and starts training the Word2Vec model. Behind the scenes, we are training a neural network with a single hidden layer to predict the current word based on its context. The goal is to learn the weights of the hidden layer. These weights are essentially the word vectors we’re trying to learn, and the resulting learned vectors are known as embeddings.


Let’s get into the fun stuff now! This first example is a simple lookup of the words the model considers similar, or at least related, to a few ingredients (paneer, egg, mango, bread, rice).

# check the similar ingredients returned by the model for search_terms
similar_words = {search_term: [item[0] for item in model.wv.most_similar([search_term], topn=5)]
                 for search_term in ['paneer', 'egg', 'mango', 'bread', 'rice']}

Wow, that looks pretty good. Let’s look at more ingredients.

Similar or related ingredients for ingredient “chocolate”


Great, the similarity search brings out words that are closely related to “chocolate”, e.g. dark chocolate, vanilla beans, etc.

Similar or related ingredients for ingredient “mayonnaise”


Similar or related ingredients for ingredient “chicken”


Overall, the results actually make sense. All of the related words tend to be used in similar contexts. Now let's use Word2Vec to compute the similarity between two ingredients in the vocabulary by invoking the similarity(...) function and passing in the relevant words.

model.wv.similarity('paneer', 'chicken')

Under the hood, the model computes the cosine similarity between the two specified words using the word vector (embedding) of each. The resulting score makes sense, as “paneer” is primarily used in vegetarian dishes, whereas “chicken” is used in non-vegetarian dishes.
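For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal pure-Python sketch; the three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors: vec_a and vec_b point in similar directions, vec_c does not.
vec_a = [0.9, 0.1, 0.3]
vec_b = [0.8, 0.2, 0.4]
vec_c = [0.1, 0.9, 0.2]

print(cosine_similarity(vec_a, vec_b) > cosine_similarity(vec_a, vec_c))
```

The score ranges from -1 to 1: values near 1 mean the vectors point in almost the same direction, and values near 0 mean they are close to unrelated.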

Another fun exercise is to identify food analogies, similar to word analogies. Given the pair “bread is to cheese”, the goal is to predict a reasonable completion for “chicken is to ….”

x = 'chicken'
b = 'cheese'
a = 'bread'
# positive=[x, b], negative=[a]: look for the word closest to x + b - a
predicted = model.wv.most_similar(positive=[x, b], negative=[a])[0][0]
print("{} is to {} as {} is to {}".format(a, b, x, predicted))
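The idea behind this query can be sketched with plain vector arithmetic: add the vectors of the two positive words, subtract the negative one, and pick the remaining word closest to the result. The vectors below are made up and hand-built so the analogy works out, and the `analogy` helper is hypothetical. Gensim additionally normalizes the vectors before combining them, so treat this as a simplification rather than its exact procedure.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up 3-d vectors, chosen by hand for illustration only.
vectors = {
    "bread":      [1.0, 0.0, 0.0],
    "cheese":     [1.0, 1.0, 0.0],
    "chicken":    [0.0, 0.0, 1.0],
    "mayonnaise": [0.0, 1.0, 1.0],
    "mango":      [-1.0, 0.5, 0.0],
}

def analogy(a, b, x):
    """Return the word closest to (x + b - a), skipping the three query words."""
    target = [vx + vb - va
              for va, vb, vx in zip(vectors[a], vectors[b], vectors[x])]
    candidates = (w for w in vectors if w not in {a, b, x})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("bread", "cheese", "chicken"))
```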

Evaluating Word2Vec

We’ve created embeddings of 300 dimensions with Word2Vec. Luckily, when we want to visualize high-dimensional word embeddings, we can employ a dimensionality reduction technique. Below, we can see some of the vector embeddings for common ingredients projected onto two dimensions by t-SNE. The positions in the plot are chosen to preserve neighborhood probabilities between points rather than actual distances in the original space. t-SNE plots can be difficult to interpret, as the perplexity hyperparameter can drastically change the size of clusters and the distances between them. However, we aren’t trying to interpret clusters, but rather hoping to evaluate whether or not our model learned something useful about our recipes.
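A projection of this kind can be sketched with scikit-learn's `TSNE`. In the example below, random vectors stand in for the trained embeddings, and the perplexity value is an assumption; with a real model, the rows of `embeddings` would be the ingredient vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for the trained word vectors: 30 random 300-dimensional rows.
embeddings = rng.normal(size=(30, 300))

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, init="pca", random_state=0)
coords = tsne.fit_transform(embeddings)

print(coords.shape)   # one (x, y) pair per ingredient
```

Each row of `coords` can then be drawn as a labeled point in a scatter plot. Trying a few perplexity values is a reasonable way to check that any grouping you see is not an artifact of one setting.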

TSNE plot to visualize embeddings

You can see that the ingredients related to the vegetarian ingredient “paneer” sit close together, including the masalas and the garlic and ginger pastes, which definitely makes sense. Similarly, the ingredients related to “eggs” and “mango” appear close to each other.

What’s Next?

The tutorial above explores just the ingredients part of the recipes. There are many other use cases and exploration ideas that could be implemented. Here are some of the questions I will try to answer in my next tutorial.

  1. Cuisine classification/prediction based on ingredients provided
  2. Given a recipe, finding similar recipes from the corpus
  3. Recipe Recommendation based on ingredients supplied.
  4. Using a set of given ingredients, what recipes can be prepared?


Capturing the meaning and relations between words is hugely important when identifying information in text. The embeddings provide a foundation for more complex tasks and models in Natural Language Processing and Machine Learning.

So go have fun: try to find some interesting text datasets you can feed in, see what you can work out about the relationships, and feel free to comment here with anything interesting you find.

Happy Cooking!

You can find my Kaggle notebook/kernel here :

Documentation and Learning References: