Building a Face Recognition Model in Under 30 Minutes

Original article was published by Parth Rajesh Dedhia on Deep Learning on Medium


In this blog post, I walk through some implementation details of a face recognition model. I have also designed a browser-based UI for adding a new person to the database. Explaining the web-development part is out of scope for this post.

This post assumes an understanding of the Siamese Network model and the Triplet Loss function. If you prefer to run the model first, you can clone it from my repository here.

Try out the working model from the repository

This blog post is structured as follows:

  • Model Architecture
  • Dataset
  • Triplet Generation
  • Training and Testing
  • Miscellaneous Details
  • Conclusion

Model Architecture

We all know that training a Convolutional Neural Network (CNN) from scratch takes a lot of data and compute power. So we instead use transfer learning, where a model trained on similar data is fine-tuned for our requirement. The Visual Geometry Group (VGG) at Oxford has built three models (VGG-16, ResNet-50, and SENet-50) trained for face recognition as well as face classification. I have used the VGG-16 model as it is smaller, so real-time prediction can run on my local system without a GPU.

Note: To avoid confusion between the VGG-16 Deep Learning model and Oxford’s Visual Geometry Group (VGG), I will be calling the latter the Oxford group.

This implementation has the entire model in Keras with TensorFlow v1.14 as backend. I had planned to build the same in TensorFlow v2.3, so I created a virtualenv on my local system and extracted the model weights. These extracted weights were stored in vgg_face_weights.h5 and later loaded onto an untrained VGG-16 network (in TensorFlow v2.3) shown in this paper. If you wish to use ResNet-50 or SENet-50, you can use Refik Can Malli’s repository to obtain the model and the weights.

The VGG-16 model was trained on the dataset shown in this paper, where the classification model was trained on 2622 different people. The second-to-last layer has 4096 Dense units, to which we append a 128-unit Dense layer without the bias term, and we remove the classification/softmax layer containing 2622 units. All the layers before the 128-unit Dense layer are frozen (trainable = False), so only the newly added Dense layer needs to be trained.

Loading VGG-16 pre-trained weights and then Customizing the model

Now for training this network, we use a Triplet Loss function. The triplet loss function takes three 128-D features generated from the above network. Let these three be known as an anchor, a positive, and a negative, where

  • Anchor: An image of a person that will be used for comparison.
  • Positive: An image of the same person as that of the anchor.
  • Negative: An image of a different person than the anchor.
Triplet Loss function — paper

Triplet loss tries to reduce the distance between the anchor and the positive and to increase the distance between the anchor and the negative. There is also another parameter, alpha = 0.2, which adds a margin: a triplet keeps contributing to the loss until the anchor-negative distance exceeds the anchor-positive distance by at least alpha. This makes training harder but gives better convergence. The two parameters, the 128D dense unit and the loss function parameter alpha, are selected based on the analysis shown in this paper.
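For reference, the loss can be written out as below. This follows the formulation in the FaceNet paper, where f is the embedding network, the superscripts a, p, and n mark the anchor, positive, and negative image of the i-th triplet, and N is the number of triplets:

```latex
L = \sum_{i=1}^{N} \max\left( \lVert f(x_i^{a}) - f(x_i^{p}) \rVert_2^2 - \lVert f(x_i^{a}) - f(x_i^{n}) \rVert_2^2 + \alpha,\ 0 \right)
```

With alpha = 0.2, a triplet whose negative is already more than 0.2 farther (in squared distance) than its positive contributes zero.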

Implementation of Triplet Loss Function
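As a minimal NumPy sketch of this computation (not the TensorFlow code from the repository), the loss for a batch of embeddings could look like the following. Averaging over the batch is an assumption made here for illustration, and the function name triplet_loss is hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Mean triplet loss over a batch of (batch, dim) embeddings."""
    # Squared Euclidean distance from each anchor to its positive and negative.
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    # A triplet contributes only while the negative is not yet alpha farther away.
    # Assumption: the batch is reduced with a mean rather than a sum.
    return float(np.mean(np.maximum(pos_dist - neg_dist + alpha, 0.0)))

anchor = np.array([[1.0, 0.0]])
positive = np.array([[1.0, 0.0]])
easy_negative = np.array([[-1.0, 0.0]])  # far from the anchor
hard_negative = np.array([[1.0, 0.0]])   # identical to the anchor

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

The easy triplet already satisfies the margin and yields zero loss, while the hard one is penalized by exactly the margin.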

Let’s sum up until now!! The VGG-16 Network gives us 128D features for anchor, positive, and negative images, which are then fed to the Loss function.

Now for training, one option is to call the same model three times, once each on the anchor, positive, and negative image, and then pass the outputs to the loss function. However, running them one after another would be a bad idea. So I have instead wrapped them in a Siamese Network class that extends tf.keras.Model and left the parallelization to TensorFlow. Also, there is one more thing added to the model: L2 normalization is applied to the output of the 128D Dense layer.

Siamese Network Class

I have added a function get_features to the SiameseNetwork class, which is just an optimization that will be useful during testing.
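To make the idea concrete, here is a framework-free sketch in NumPy rather than the tf.keras.Model subclass itself. It assumes the L2 step rescales each embedding to unit length, and it uses a toy matrix multiplication as a stand-in for the fine-tuned VGG-16. The names SiameseSketch and l2_normalize are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    norm = np.sqrt(np.sum(x ** 2, axis=-1, keepdims=True))
    return x / np.maximum(norm, eps)

class SiameseSketch:
    """Applies one shared embedding function to anchor, positive and negative."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # stand-in for the fine-tuned VGG-16

    def get_features(self, images):
        # Single-branch path, useful at test time when there is no triplet.
        return l2_normalize(self.embed_fn(images))

    def __call__(self, anchor, positive, negative):
        # The same weights are used for all three inputs.
        return tuple(self.get_features(x) for x in (anchor, positive, negative))

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 4))  # toy bias-free dense layer: 8 -> 4
model = SiameseSketch(lambda x: x @ weights)

batch = rng.normal(size=(3, 8))
a, p, n = model(batch, batch + 0.01, -batch)
```

Because all three branches share one set of weights, get_features can reuse exactly the same path for a single image at test time.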

Great, we have built a model!! Now let’s check out the dataset for training.


Dataset

The VGGFace dataset, consisting of images of 2622 distinct celebrities, was used for training the VGG-16 model described above. Later, the Oxford group also released VGGFace2, consisting of images of 8631 celebrities for training and 500 for testing, with no celebrity appearing in both sets. Since the training set is 39GB, I downloaded only the test set, which is 2GB, and trained the last dense layer on it.

Using a test set for training may sound counter-intuitive, but it is a test set only with respect to the model trained by the Oxford group. I have used it as a training set and tested my model on my family members and my friends.

The pre-processing generally depends on the underlying model. So, for training and testing, the input images have to go through the same pre-processing that is defined for the VGG-16 model implemented by the Oxford group. Images input to the model are first run through a face detector described in this paper and then sent to the preprocess_input function given here. In my implementation, I have used the frontal face detector provided by the dlib library and then sent the images to the preprocess_input function.

Note: The preprocess_input function given here is different from the one used by VGG-16 trained on ImageNet. Hence the pre-processing code in my repository is taken from the pip-installed VGGFace library.

Now, I will show the directory structure of the datasets, as it becomes a way to optimize memory during training. Let’s first check out the directory structure of the downloaded dataset. In the directory structure below, each directory (n000001, n000009, etc.) holds all the images of one celebrity.

└── vggface2_test
    └── test
        ├── n000001
        │   ├── 0001_01.jpg
        │   ├── 0002_01.jpg
        │   ├── 0003_01.jpg ...
        ├── n000009
        │   ├── 0001_01.jpg
        │   ├── 0002_01.jpg
        │   ├── 0003_01.jpg ...
        (so on and so forth)

As mentioned above, we used dlib’s frontal face detector to detect images containing faces and stored them in a different folder called dataset. Below is the directory tree of the face-detected images. This notebook has the implementation for the same.

└── dataset
    ├── list.txt
    └── images
        ├── n000001
        │   ├── 0001_01.jpg
        │   ├── 0002_01.jpg
        │   ├── 0003_01.jpg ...
        ├── n000009
        │   ├── 0001_01.jpg
        │   ├── 0002_01.jpg
        │   ├── 0003_01.jpg ...
        (so on and so forth)

The directory structures of vggface2_test and dataset are almost the same. However, the dataset directory may contain fewer images, as some of the faces may not have been detected by dlib’s detector. Also, there is a file list.txt in the dataset directory, which contains one line of the form directory-name/image-name for each image. This list.txt is used for memory optimization during training.
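As a rough sketch of how such a list.txt could be written and then read back into a per-person dictionary without opening any image, consider the following. The function names write_image_list and read_image_list are hypothetical and are not the ones used in the notebook; the temporary directory only exists to make the sketch runnable.

```python
import os
import tempfile

def write_image_list(dataset_dir):
    """Write dataset_dir/list.txt with one 'directory-name/image-name' per line."""
    images_dir = os.path.join(dataset_dir, "images")
    entries = []
    for person in sorted(os.listdir(images_dir)):
        person_dir = os.path.join(images_dir, person)
        if not os.path.isdir(person_dir):
            continue
        for image in sorted(os.listdir(person_dir)):
            entries.append(f"{person}/{image}")
    with open(os.path.join(dataset_dir, "list.txt"), "w") as handle:
        handle.write("\n".join(entries))
    return entries

def read_image_list(dataset_dir):
    """Read list.txt into {directory name: [image names]}; no image is opened."""
    people = {}
    with open(os.path.join(dataset_dir, "list.txt")) as handle:
        for line in handle:
            line = line.strip()
            if line:
                person, image = line.split("/", 1)
                people.setdefault(person, []).append(image)
    return people

# Tiny throwaway dataset so the sketch runs end to end.
with tempfile.TemporaryDirectory() as root:
    for person, image in [("n000001", "0001_01.jpg"),
                          ("n000001", "0002_01.jpg"),
                          ("n000009", "0001_01.jpg")]:
        os.makedirs(os.path.join(root, "images", person), exist_ok=True)
        open(os.path.join(root, "images", person, image), "w").close()
    entries = write_image_list(root)
    people = read_image_list(root)
```

Only file names are held in memory, so shuffling and sampling can work on this dictionary and defer image loading until a batch is actually built.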

Triplet Generation

For training, the model requires three images: the anchor, the positive, and the negative. The first idea that comes to mind is to generate all possible triplets. This may seem to give a lot of data, but the research literature suggests that it is inefficient. So I have used a random number generator for selecting the anchor, the positive, and the negative images. I have used a Data Generator, which yields data during the training loop. If you are not familiar with Data Generators, do refer to this blog.

Fun Fact: It took me more time to write the DataGenerator class than the model took to train.

Triplet Data Generator

__getitem__ is the most important function. However, to understand the same, let’s check the constructor and other methods as well.

  • __init__: The constructor takes the path to the dataset directory described in the previous section and uses list.txt to build a dictionary. This dictionary has the directory name as its key and the list of images in that directory as its value. It is here, and in the shuffling step, that list.txt gives us an easy overview of the dataset, so no images have to be loaded just for shuffling.
  • __getitem__: We get the names of the people from the dictionary keys. For the first batch, the first 32 (batch size) people provide the anchors, and a different image of the same person is used as the positive. The negative is an image from any other directory. Within these constraints, the anchor, the positive, and the negative images are chosen randomly. The next 32 people provide the anchors for the next batch.
  • curate_dataset: Creates the dictionary described under __init__.
  • on_epoch_end: At the end of each epoch, the order of people is shuffled, so that in the next epoch the first 32 people are different from those seen in the previous epoch.
  • get_image: Resizes the image to 224 x 224 and then applies preprocess_input.
  • __len__: Returns the number of batches that make up one epoch.
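The selection logic described in the list above can be sketched in plain Python as follows. This is an illustration working on file names only, not the DataGenerator class itself. The helper names sample_triplets and batches_per_epoch are hypothetical, the directory n000029 is made up, and counting a final partial batch as a full batch is an assumption.

```python
import math
import random

def sample_triplets(people, batch_people, rng):
    """Return one (anchor, positive, negative) path triplet per person in batch_people."""
    triplets = []
    for person in batch_people:
        # Anchor and positive: two different images of the same person.
        # Assumption: every person has at least two images.
        anchor, positive = rng.sample(people[person], 2)
        # Negative: a random image from any other person's directory.
        other = rng.choice([name for name in people if name != person])
        negative = rng.choice(people[other])
        triplets.append((f"{person}/{anchor}",
                         f"{person}/{positive}",
                         f"{other}/{negative}"))
    return triplets

def batches_per_epoch(num_people, batch_size):
    # Assumption: a final partial batch still counts as a batch.
    return math.ceil(num_people / batch_size)

people = {
    "n000001": ["0001_01.jpg", "0002_01.jpg", "0003_01.jpg"],
    "n000009": ["0001_01.jpg", "0002_01.jpg"],
    "n000029": ["0001_01.jpg", "0002_01.jpg"],
}
rng = random.Random(0)
order = list(people)
rng.shuffle(order)  # the role on_epoch_end plays between epochs
triplets = sample_triplets(people, order[:2], rng)
```

Under these assumptions, 500 people with a batch size of 32 would give 16 batches per epoch.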

Done !!!

Training and Testing

I have used a custom training loop with tqdm (you still get the Keras feel) and trained the model for 50 epochs. On Colab, each epoch takes 24 seconds, so yes, the training is pretty fast.

For testing, you can save images of your family, your friends, and yourself in a directory and also store the 128D features generated from the dense layer for each person. You can use the get_features() function, which is defined in the SiameseNetwork class here. Also, to save you some time, I have made a notebook, Real-time-prediction.ipynb, which loads the model checkpoints and also provides instructions for collecting test images on the fly and running predictions on a webcam video.
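One common way to use the stored features at prediction time is a nearest-neighbour lookup on Euclidean distance, sketched below in NumPy. This is an assumption about the comparison step, not a description of the notebook. The function name identify, the names in the dictionary, and the threshold of 1.0 are hypothetical placeholders; a real threshold would have to be tuned on your own images.

```python
import numpy as np

def identify(query, known_features, threshold=1.0):
    """Return (closest stored name or None, distance to the closest stored feature)."""
    best_name, best_dist = None, float("inf")
    for name, feature in known_features.items():
        dist = float(np.linalg.norm(query - feature))
        if dist < best_dist:
            best_name, best_dist = name, dist
    # Reject the match when even the closest person is too far away.
    return (best_name if best_dist <= threshold else None), best_dist

# Toy 3-D features standing in for the stored 128D features.
known_features = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}
name, dist = identify(np.array([0.9, 0.1, 0.0]), known_features)
stranger, _ = identify(np.array([0.0, 0.0, -1.0]), known_features)
```

The threshold is what lets the lookup answer "unknown" for a face that is not in the directory, instead of always returning the closest stored person.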

Miscellaneous Details

Increase Training Speed in Colab

In the DataGenerator, the images are not all loaded into memory; instead, only their indexes are manipulated. If you have your own GPU, then the details in this sub-section may be less relevant.

I initially thought that read and write operations from Colab to Drive would be fast, but they turned out to be slower than on my local system, which does not even have a GPU. To solve this issue, I compressed the dataset to dataset.7z and uploaded it to my Drive. I then copied the archive from my Google Drive to the storage Colab provides per session, extracted it there, and used it for training. Using Colab’s own storage significantly increased the speed of the training process.

However, my TensorBoard summaries and model checkpoints were stored on the Drive, as they are accessed only once every epoch and do not significantly reduce performance.

UI based Tool

I wanted to learn some web technologies like HTML, CSS, and JavaScript. The best way to learn them was by making a small project. Hence, I have tried to develop a UI-based tool for collecting data for testing as well as for prediction. Steps for running it are explained in my repository.


Conclusion

In this blog, we have covered key details about fine-tuning an existing network and building a Siamese Network on top of it. The results of the current model are much better than expected, but we could also improve them by manually creating good triplets. One could also download the entire training dataset for training the model. Literature suggests that manually selecting a set of hard triplets can significantly decrease the training time and increase the rate of convergence of the model.

You can refer to my repository for trying the Browser-based Tool as well as checking out the notebooks for training. The tool can also detect multiple people!!


References

O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep Face Recognition, British Machine Vision Conference, 2015.

Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman, VGGFace2: A dataset for recognising faces across pose and age, International Conference on Automatic Face and Gesture Recognition, 2018.

F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR, 2015.

G. Koch, R. Zemel, R. Salakhutdinov, Siamese Neural Networks for One-shot Image Recognition, ICML deep learning workshop. Vol. 2. 2015.