Fine-tuning with Nvidia Digits — Part 2

This is the second part of the series. Fine-tuning makes the power of deep learning accessible to smaller, more specific projects that are nothing like 1000-object recognition, Google Translate, self-driving cars or playing Go. With fine-tuning we can reuse those magnificent achievements for our particular needs. In our case we were working on embedding a facial sentiment analysis system in a mobile phone, so we needed software that could infer human emotions (happy, sad, anger, surprise, …) from still images: a classification problem. Emotions are a crucial aspect of human life; for example, happiness is the evaluation that our goals are being satisfied and, on the contrary, sadness is the evaluation that they are not. This sounds powerful to me from a marketing point of view.

The conclusions presented after the 5th EmotiW Challenge were that deep-learning-based methods outperform traditional computer vision and machine learning methods, so using a ConvNet is a good idea, and transfer learning is another good idea if you lack a lot (a lot is A LOT) of data. There are no big, labeled face emotion datasets, so it is both recommendable and necessary to train your network by transferring the learned features of another network. I collected around 36k faces (labeled with emotions), while VGGFace2 was trained from scratch on a dataset containing 3.31 million images of 9131 subjects. Retraining my network for 30 epochs on a K80 GPU with 12 GB of RAM takes around 9 hours; do the math.

DIGITS and Caffe both have excellent tutorials on fine-tuning, but they lack some practical details about layer names and learning rates. For transfer learning you need the trained model and you have to touch some layers of your network. With DIGITS, and using Caffe (as our transfer model was trained with Caffe), you first need: a built dataset with examples of x emotions, and a trained model, VGG Faces in our case.

Configuring a new classification model

As you can see, we chose the following solver options (remember this is a Caffe framework model):

  • Only 10 epochs, as our budget is limited but also because this kind of model converges very quickly in the first epochs.
  • A batch size of 60 if you use a GPU with 12 GB of memory (a p2.xlarge AWS instance) and 224×224 images.
  • Image mean subtraction, necessary for data normalization.
  • Dataset images resized to 224×224 (a VGG16 requirement).
  • A learning rate of 0.001; in my experience this is the most insanely critical hyperparameter that you can touch.
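Those options end up in the solver.prototxt that DIGITS generates behind the scenes. Here is a minimal sketch, assuming the 36k-image dataset described above (10 epochs × 36000 images / batch of 60 ≈ 6000 iterations); the file paths and step schedule are placeholders, not what DIGITS will emit verbatim:

```protobuf
# solver.prototxt sketch -- values mirror the options chosen in the DIGITS UI
net: "train_val.prototxt"   # network definition (placeholder path)
base_lr: 0.001              # the critical learningning rate hyperparameter
lr_policy: "step"
gamma: 0.1
stepsize: 2000              # drop the LR a few times over the run (assumption)
momentum: 0.9
weight_decay: 0.0005
max_iter: 6000              # 10 epochs * (36000 images / batch size 60)
snapshot: 600               # roughly one snapshot per epoch
solver_mode: GPU
```

Note that the batch size and the mean subtraction live in the data layer of train_val.prototxt, not in the solver: batch_size: 60 and transform_param { mean_file: "mean.binaryproto" }.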

In our case, as I mentioned in my first post, we chose the VGG16 trained by the University of Oxford. I used the VGG Faces model and it worked smoothly: from the first experiments we obtained accuracies of up to 80%. As I was fine-tuning networks at the same time I was building the dataset, I first built a network with a dataset of 9k images and then, after training it, repeated the process with the full dataset of 36k images, importing the weights of the first trained model (the 9k-image one). The results were excellent 🙂

The second part concerns the structure of the network; you can find my prototxt files on my GitHub. As already mentioned, this is a VGG16 network; if you want to learn more about how and why it works (especially the 3×3 filters), just check their papers. In our case we tested two combinations, one necessary and one as an extra. How Caffe does fine-tuning is, in my opinion, not clearly explained, but it is very simple:

  • Change the names of the layers you want the network to relearn.
  • Set the learning rate multipliers to 0 in the layers you freeze (the ones whose weights you keep): param { lr_mult: 0.0 decay_mult: 0.0 } (though I'm not sure this is strictly necessary).
  • Add the final fully connected layer with the number of outputs matching the number of classes in your dataset.
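For the freezing step, a frozen layer keeps its original name (so the pretrained weights are still copied in) and gets zeroed multipliers. A sketch of what one of VGG16's fully connected layers could look like once frozen (the layer names come from the standard VGG16 prototxt):

```protobuf
layer {
  name: "fc7"                              # name unchanged: weights are copied
  type: "InnerProduct"
  bottom: "fc6"
  top: "fc7"
  param { lr_mult: 0.0 decay_mult: 0.0 }   # freeze the weights
  param { lr_mult: 0.0 decay_mult: 0.0 }   # freeze the biases
  inner_product_param { num_output: 4096 }
}
```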

layer {
  name: "fc8-retrain4"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8-retrain4"
  param { lr_mult: 1.0 decay_mult: 2.0 }
  param { lr_mult: 1.0 decay_mult: 2.0 }
  inner_product_param {
    num_output: 7
    weight_filler { type: "xavier" std: 0.02 }
    bias_filler { type: "constant" value: 0.2 }
  }
}

In the DIGITS new-model screen, choose a "custom network", modify your structure, and don't forget to always visualize it before starting training. Then indicate the path to your pretrained network, the one containing the weights that will be copied into the layers you did not rename.

In our case we ran two experiments: retraining the last convolutional layer (as we had added new data to the dataset) while also adding a 7-output fully connected layer, or just adding the last FC layer. You will see the differences in accuracy in the last post of this series.
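For the first experiment, renaming the last convolutional layer is what makes Caffe reinitialize it instead of copying the pretrained weights. A sketch, assuming the standard VGG16 layer names (the -retrain suffix is just a convention, and the filler is my own choice, not taken from the original prototxt):

```protobuf
layer {
  name: "conv5_3-retrain"     # renamed: weights are re-learned, not copied
  type: "Convolution"
  bottom: "conv5_2"
  top: "conv5_3-retrain"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 0.0 }
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
    weight_filler { type: "xavier" }
  }
}
```

Remember that every downstream layer that consumed "conv5_3" must have its bottom updated to "conv5_3-retrain", otherwise the net will not parse.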

I hope this is a little bit clearer than the excellent Nvidia and Caffe tutorials. In the third post I will present an inference model, explain how to deal with input images (you have to crop the faces and frontalize them) and present my results, which are more than acceptable 🙂

Source: Deep Learning on Medium