An AI-based pizza detector

Source: Deep Learning on Medium

Hi everyone, this is my first post on Medium, so I apologise, but I must introduce myself first.

I work as Lead Data Engineer @ Quantyca, an IT consultancy company that works with data: data and system integration, big data architectures, reporting and analytics.

I am proud to be part of the Innovation Team, whose goal is to explore new technologies and methods and help integrate them into our solutions.

This winter we decided to work together with our Analytics Team to tackle deep learning challenges, using the fastai library and Jeremy’s experience.

The first step in this learning path was image classification.

So each of us came up with an application to work on: my idea was to build a “pizza detector”, i.e. an image classification application that recognizes the type of pizza from a photo.

You can try it here:

This story explains how I built this application using a mix of Python, mathematics and statistics, data mining, and web development.

I won’t explain deep learning or go into the details, since you can find lots of posts and useful resources out there. Let’s just keep things simple and... enjoy the trip! 😉

Since I am a proud Italian, and moreover a Sicilian and a cooking enthusiast, I used the best-known Italian pizza types as reference. I identified 15 classes for the application to work on.

classes = ['capricciosa', 'crudo_rucola_grana', 'diavola', 'frutti_di_mare', 'kebab', 'margherita', 'marinara', 'ortolana', 'parmigiana', 'prosciutto', 'prosciutto_e_funghi', 'quattro_formaggi', 'salsiccia_e_friarelli', 'tedesca', 'tonno_e_cipolle']

Can’t find your favourite? More classes will be added in the future, I guess 😊

So the first challenge is: How do I get the data I need?

2. Collecting data

Many people working on deep learning get their data from search engines, e.g. Google Images or Bing, with some tricks. But the truth is that after the first 10–20 results you get a lot of trash. For example, an image may show up only because your search words appeared in a blog post, which doesn’t necessarily mean that the image matches your criteria.

Moreover, images of this kind are often not genuine, for commercial reasons: compare a real pizza to its picture on a menu.

So to have real data about pizza I decided to collect images directly from users of Instagram by exploring some hashtags.

I chose some pizza types to be part of my dataset and went through hashtags and pizza photos for a while. This also required a lot of manual filtering to determine whether each image was appropriate for the dataset, especially to check that it showed the ingredients expected by the official recipe.

Some challenges in this step:

  • customizations: adding ingredients can turn a pizza into a different class
  • photos containing more than one pizza: how do you handle classification?
  • ambiguity: if a human can’t tell what kind of pizza it is, they can’t label it properly
  • fashion: people are more likely to take photos of beautiful pizzas than of ordinary home-made or delivered ones

At the end of the process I had 70–100 images for each of the 15 classes: not a big number, actually, but with data augmentation we can still achieve good results.
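A quick sanity check at this point is to count the images per class. The sketch below assumes one sub-folder per class inside a data directory (the layout a from_folder-style loader expects). The function names and the list of file suffixes are illustrative, not taken from the project.

```python
from pathlib import Path

# Assumed layout (not stated in the post): data_dir/<class_name>/<image files>
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(data_dir):
    """Return {class_name: number_of_image_files} for each sub-folder of data_dir."""
    counts = {}
    for class_dir in sorted(Path(data_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def classes_outside_range(counts, low=70, high=100):
    """List the classes whose image count falls outside the expected range."""
    return [name for name, n in counts.items() if not low <= n <= high]
```

Calling classes_outside_range(count_images_per_class(data_dir)) would point out any class that needs more (or fewer) photos before training.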

Now let’s build a dataset for a learner from these pictures.

3. Data preparation

Building a DataBunch (a class containing data, labels and transforms for both the training and validation sets) is really simple:

data = ImageDataBunch.from_folder(Path(data_dir), train=".", valid_pct=0.2, ds_tfms=tfms, size=224, bs=bs).normalize(imagenet_stats)

Images are randomly divided into training (80%) and validation (20%).
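To make the random split concrete, here is a minimal sketch of what a valid_pct=0.2 style hold-out amounts to. It is not the library's implementation: the function name, the seed and the file names are assumptions made up for the example.

```python
import random


def split_train_valid(items, valid_pct=0.2, seed=42):
    """Shuffle items and hold out a valid_pct fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical file names, for illustration only
files = [f"pizza_{i:03d}.jpg" for i in range(100)]
train, valid = split_train_valid(files)
print(len(train), len(valid))  # 80 20
```

Fixing the seed keeps the validation set stable between runs, so accuracy numbers from different experiments stay comparable.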

classes, #classes, #items in training set, #items in validation set

To check that the DataBunch is working properly, I call the show_batch method, which displays a sample of images together with their labels. This is also a useful way to take a look at the dataset.

sample of images

Here you can see that these photos can be very different:

  • With/without context (other objects around)
  • Perspective
  • Cooking style

For data augmentation I used the following transforms, which I found to fit this kind of data well:

tfms = get_transforms(xtra_tfms=[crop_pad(), rotate(degrees=(-45,45), p=0.5), brightness(change=(0.3, 0.7), p=0.5), contrast(scale=(0.5, 1.5), p=0.5), jitter(magnitude=0.5, p=0.5), symmetric_warp(magnitude=(-0.1,0.1), p=0.5), zoom(scale=1.25, p=0.5), cutout(n_holes=(1,4), length=(10, 160), p=0.75)])

Here you can find more details about transforms.
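The p argument on each transform is the probability that it is applied to a given image. The toy sketch below shows that idea on a tiny grayscale "image" stored as nested lists. The two transforms and their value ranges are simplified stand-ins chosen for the example, not the library's implementations.

```python
import random


def flip_horizontal(img):
    """Mirror each row of a 2D grayscale image (a list of rows)."""
    return [row[::-1] for row in img]


def adjust_brightness(img, factor):
    """Scale pixel values by factor, clipping to the 0-255 range."""
    return [[max(0, min(255, round(px * factor))) for px in row] for row in img]


def augment(img, rng, p=0.5):
    """Apply each toy transform independently with probability p."""
    if rng.random() < p:
        img = flip_horizontal(img)
    if rng.random() < p:
        img = adjust_brightness(img, rng.uniform(0.7, 1.3))
    return img


rng = random.Random(0)
image = [[0, 50, 100], [150, 200, 250]]  # tiny made-up "photo"
variants = [augment(image, rng) for _ in range(5)]
```

Each pass over the same photo can therefore produce a slightly different training example, which is what makes 70–100 images per class go further.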

Now I can start training a model.

4. Training

To start training I created a CNN with the resnet34 architecture, which is relatively easy and fast to train and is fine for a first attempt.

One of the most surprising resources here is transfer learning: instead of training a network from scratch, one can take a pretrained network and adapt its parameters to one’s own data.

In vision, ImageNet is one of the biggest datasets, and a model pretrained on it has seen lots of categories, so it is already able to distinguish shapes and colours and use them to identify a lot of stuff.

To build a model I will take this pretrained model and feed it my data (15 pizza classes) so that it can learn the features specific to this kind of data.

Under the hood, the library will use a default learning rate of 0.003 and cross-entropy as the loss function.

The loss function measures how wrong the model’s predictions are: training computes its gradient and moves the model’s parameters in the direction where the loss decreases. The learning rate determines how big these steps are.
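A toy example may help here. The sketch below uses cross-entropy, a common classification loss, as the example loss (an assumption for this sketch), and for simplicity treats the three class scores themselves as the parameters being updated. It only illustrates one gradient step and is not how the library trains a network.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(softmax(logits)[target])


def gradient_step(logits, target, lr):
    """One gradient-descent update: d(loss)/d(logit_i) = prob_i - 1[i == target]."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]


logits = [0.0, 0.0, 0.0]  # toy scores for 3 classes; index 1 is the true class
before = cross_entropy(logits, target=1)
after = cross_entropy(gradient_step(logits, target=1, lr=0.5), target=1)
print(before > after)  # True
```

A larger lr moves the scores further in a single step, which speeds things up until the steps become so big that they overshoot.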

I usually start by running fit_one_cycle for 4 epochs and see how it goes.

Interesting! In 4 epochs we obtained a model with 75.8% accuracy!

The fine-tuning phase uses lr_find to determine a good learning rate for the next step (usually about one order of magnitude below the value where the loss is lowest). Here unfreeze means that the whole neural network can change its parameters: this takes longer, but it can help reach a better minimum and adapt to the current data.
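The rule of thumb for reading a learning-rate finder plot can be sketched as follows, assuming the learning rates tried and the losses observed have been recorded as two lists. The helper name, the factor of 10 and the numbers are made up for illustration and are not the values from this project.

```python
def suggest_learning_rate(lrs, losses, factor=10.0):
    """Pick a learning rate `factor` times smaller than the one with the lowest loss."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best_index] / factor


# Made-up values, only to show the shape of a typical range test
lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [1.20, 1.15, 1.05, 0.95, 0.80, 4.00]
suggested = suggest_learning_rate(lrs, losses)
print(suggested)
```

Backing off from the minimum leaves a safety margin: right at the lowest point of the curve, the loss is usually about to blow up.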

1e-4 is a good learning rate, so I can go on with training:

+8% accuracy 😎

Note also how loss on training and validation data decreases.

Let’s go deeper.

This shape means that we are close to the end of fine-tuning, because the decrease is very small.

Having reached 86.2% accuracy, I am satisfied and can save the model.

I could have continued, though, since the loss is still decreasing, even if very slowly: the model is still underfitting.

OK, but let’s take a look at how this model predicts the type of pizza.

5. Results interpretation

A useful resource is analyzing the most_confused classes:

  • parmigiana vs ortolana: they are easily confused because both contain eggplant 🍆
  • capricciosa vs prosciutto_e_funghi: the first also contains olives and artichokes, so it depends on how good the model is at identifying these ingredients
  • the other classes may be misclassified for many reasons
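Conceptually, a most-confused list is just a count of (actual, predicted) pairs that disagree, sorted by frequency. The sketch below is a plain-Python stand-in written for illustration, with made-up labels, and is not the library's own method.

```python
from collections import Counter


def most_confused(actual, predicted, min_count=1):
    """Return (actual, predicted, count) for misclassified pairs, most frequent first."""
    errors = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return [(a, p, n) for (a, p), n in errors.most_common() if n >= min_count]


# Made-up labels for illustration, not the model's real output
actual = ["parmigiana", "parmigiana", "ortolana", "capricciosa", "diavola"]
predicted = ["ortolana", "ortolana", "ortolana", "prosciutto_e_funghi", "diavola"]
print(most_confused(actual, predicted))
# [('parmigiana', 'ortolana', 2), ('capricciosa', 'prosciutto_e_funghi', 1)]
```

Raising min_count hides one-off mistakes and leaves only the systematic confusions.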

A second resource for interpreting model behavior is the confusion matrix:

It compares the actual and predicted class for each image and puts the image count in each cell: elements on the diagonal are correctly classified.
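A minimal sketch of how such a matrix can be built, and how accuracy follows from its diagonal, is shown here. The labels are made up for the example and the helper names are illustrative.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of images on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


classes = ["diavola", "marinara", "ortolana"]
actual = ["diavola", "diavola", "marinara", "ortolana", "ortolana"]
predicted = ["diavola", "diavola", "marinara", "ortolana", "marinara"]
m = confusion_matrix(actual, predicted, classes)
print(m)            # [[2, 0, 0], [0, 1, 0], [0, 1, 1]]
print(accuracy(m))  # 0.8
```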

diavola, marinara and salsiccia_e_friarelli are the best-predicted classes. This is easy to understand, because these pizza types are quite different from the others.

6. Trying a different architecture

After obtaining a good 86.2% accuracy, I also trained a new model with resnet50, an architecture with more layers than resnet34, which could therefore capture more details and reach higher accuracy.

The first round brings 78.3% accuracy.

By unfreezing all layers and running 2 more epochs I obtained 85.9% accuracy.

And with another round I reached 89.5% accuracy.

Although I could have gone further, since the model hasn’t overfitted yet, I stopped training.

Let’s look at the confusion matrix for this new model.

Very nice! It seems more accurate and capricciosa became one of the best predicted classes. 😎

By looking at the most_confused images we can understand a bit more about the model. You can guess what could have driven the model to make these wrong predictions.

7. Serving

To build an application that uses this model I used:

  • Python for app development
  • Docker to package everything in a container
  • Google Kubernetes Engine to host the application

The model can be used on a single image to obtain a set of classes and the corresponding probabilities.

I decided to filter the output of the model by using these two parameters:

threshold = 0.40

max_results = 5

So only the classes to which the model assigns a probability above 40% are considered, and the final output contains at most 5 classes.
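The filtering step can be sketched like this, assuming the model output has already been turned into a dictionary mapping class names to probabilities. The function name and the way the "unknown" fallback is represented are assumptions for illustration, not the application's actual code.

```python
THRESHOLD = 0.40
MAX_RESULTS = 5


def filter_predictions(probabilities, threshold=THRESHOLD, max_results=MAX_RESULTS):
    """Keep classes above threshold, best first, capped at max_results."""
    kept = [(name, p) for name, p in probabilities.items() if p > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    # Hypothetical fallback: report "unknown" when no class passes the threshold
    return kept[:max_results] or [("unknown", None)]


print(filter_predictions({"margherita": 0.55, "marinara": 0.41, "diavola": 0.04}))
# [('margherita', 0.55), ('marinara', 0.41)]
print(filter_predictions({"kebab": 0.30, "tedesca": 0.35, "diavola": 0.35}))
# [('unknown', None)]
```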

So I wrote some HTML code to collect this filtered output and show it in a table.
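As a rough idea of what rendering such a table from Python can look like, here is a minimal sketch. The column titles, the function name and the sample predictions are invented for the example. The real page may well use a template engine instead.

```python
from html import escape


def render_table(results):
    """Build an HTML table from (class_name, probability) pairs."""
    rows = "".join(
        f"<tr><td>{escape(name)}</td><td>{prob:.0%}</td></tr>"
        for name, prob in results
    )
    return f"<table><tr><th>Pizza</th><th>Confidence</th></tr>{rows}</table>"


# Made-up predictions, for illustration only
table_html = render_table([("margherita", 0.55), ("marinara", 0.41)])
print(table_html)
```

Escaping the class names is a cheap habit that keeps the page safe if labels ever come from user input.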

I also wanted a bit of automation in the build process, so I built a pipeline. Whenever a tag is published on the project’s git repo, a build is triggered and the result is saved as an image on GCloud, which can later be deployed on a Kubernetes cluster.

GCloud Builds

The application UI is really simple:

  • On the left you can upload an image
  • On the right you can check the currently handled pizza types and their ingredients (only in Italian at the moment)
  • After uploading an image you will see the table containing the model predictions at the bottom of the page
  • If nothing comes out you will see an “unknown” label

Here is an example:

a simple test

8. Next steps

The tour ends here. But I already have lots of ideas to enrich the application, and other contributions are appreciated!

A quick list of ideas:

  • add more data to make the model stronger and push training to the edge of overfitting
  • add data for home-made or home-delivered pizzas too, to go beyond Instagram’s typical content
  • add a “not-a-pizza” class and train the model on it, to avoid predicting a pizza type for photos without pizzas where something (colours, shapes, ...) could be mistaken for one
  • add more classes! (someone could ask for a “pineapple_pizza” category, but that wouldn’t be pizza at all for Italians 😂)
  • get feedback from users and save uploaded images, so that the model can be continuously improved by training on new data

I hope you enjoyed the trip!

Now have a pizza! 🍕 😋