FaceNet Architecture — Part 1

Source: Deep Learning on Medium

By Milind Deore

This article draws on the FaceNet and GoogleNet papers. It is a two-part series: this first part covers the FaceNet architecture, along with an example that runs on Google Colab, and the second part covers the mobile version.

FaceNet is a state-of-the-art neural network for face recognition, verification and clustering. It is a 22-layer deep neural network that directly trains its output to be a 128-dimensional embedding. The loss function used at the last layer is called triplet loss.

Fig 1: High Level Model Structure (Source - FaceNet)

FaceNet is composed of the building blocks shown above, so we will go through each of them in sequence.

The deep network shown in Fig 1 comes from the GoogleNet architecture (it has had many revisions; ‘Inception-ResNet-v1’ is the one we will use in our coding example). The FaceNet paper does not deal much with the internal workings of GoogleNet and treats the deep neural network as a black box, but we will touch on the important concepts to understand how it is used and for what purpose.

Deep Network — GoogleNet

GoogleNet won the ImageNet 2014 challenge (ILSVRC 2014) and delivered groundbreaking results and improvements over conventional Convolutional Neural Networks (CNNs). A few of its features are listed below:

  • A 22-layer deep network, compared to the 8-layer AlexNet.
  • Computationally efficient: the computational cost is about 2 times lower than AlexNet.
  • Significantly more accurate than AlexNet.
  • Low memory usage and low power consumption.
  • The network is bigger, but it has about 12 times fewer parameters than AlexNet.
  • The multiply-add budget was restricted to 1.5 billion operations at inference time, so that the architecture can be used in real-world applications, especially on portable devices such as mobile phones.

In a conventional CNN, an image is convolved with a given filter to build up correlation statistics layer by layer, and neurons that are highly correlated are then clustered as the output. The important point is that correlation is local to an image patch and is highest in the earlier layers of the network, so a large filter size and early pooling would discard important information hidden in the patch.

This was the primary inspiration behind the GoogleNet architecture, and it evolved into a network-in-network design called the ‘inception module’.

Conventional CNNs had a few other challenges that GoogleNet solved quite elegantly. They are:


  1. More layers in the network generally help, but the downside is that they also increase the number of parameters and may cause over-fitting.
  2. Deep networks also suffer from the vanishing gradient problem: during backpropagation the gradient may not reach the initial layers, so their weights stay unchanged, which is undesirable.
  3. A linear increase in the number of filters causes a quadratic increase in operations, and therefore requires more computational power.
  4. More parameters need a larger dataset and longer training time, and even data augmentation won’t help much. Synthetically generated data is often not a solution.
  5. Representational bottlenecks should be reduced. This can be understood as a trade-off between dimension reduction and information extraction. In convolution, as we go deeper in the network the dimension of the input reduces and information decays, so information extraction should be effective at each layer, especially where local regions are concerned.
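
Item 3 can be checked with simple arithmetic. The sketch below counts multiply-adds for a single convolution layer, assuming stride 1 and 'same' padding. The function name and the layer sizes are illustrative assumptions, not values from the GoogleNet paper. When the filters of two chained layers are doubled, the second layer sees twice the input channels and produces twice the output channels, so its cost grows four times.

```python
def conv_mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds for one stride-1, same-padded convolution layer."""
    return height * width * kernel * kernel * c_in * c_out


# Second of two chained 3x3 layers on a 28x28 feature map (illustrative sizes).
base = conv_mult_adds(28, 28, 3, c_in=64, c_out=64)

# Doubling the filters in both layers doubles c_in and c_out of this layer.
doubled = conv_mult_adds(28, 28, 3, c_in=128, c_out=128)

print(doubled // base)  # 4
```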


GoogleNet addresses these challenges as follows:

  1. GoogleNet uses 1×1 filters for dimension reduction. The idea behind a 1×1 convolution is to keep the input size (height and width) intact while shrinking the number of channels. Example: converting a 256x256x3 RGB image to a 256x256x1 image.
  2. Along with 1×1, larger filters that cover more spatial area, such as 3×3 and 5×5, are used (a 7×7 filter appears in the first layer of the network). Because max-pooling has proven successful in convolutional networks, a pooling branch is added as well. All branches are applied in parallel and their outputs are concatenated for the next stage. This makes the inception module wider in the middle, while connecting many such modules back-to-back makes the network deeper. Visually, the basic building block, the ‘inception module’, looks as below:
Fig 2: Inception module with dimension reductions (Source – GoogleNet)
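
To make items 1 and 2 above concrete, here is a minimal pure-Python sketch of a 1×1 convolution and of the savings it brings when placed in front of a 5×5 branch. The helper names, the toy image, the fixed weights and the channel counts are all assumptions chosen for illustration; in a real network the weights are learned and a tensor library would be used.

```python
def conv1x1(image, weights):
    """1x1 convolution: mix channels at each pixel, keep height and width.

    image:   H x W x C_in nested lists
    weights: C_out x C_in nested lists (one row per output channel)
    """
    return [
        [[sum(w * x for w, x in zip(filt, pixel)) for filt in weights]
         for pixel in row]
        for row in image
    ]


def mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds for one stride-1, same-padded convolution layer."""
    return height * width * kernel * kernel * c_in * c_out


# A 2x2 image with 3 channels is reduced to 1 channel; H and W are unchanged.
image = [[[4, 8, 12], [0, 0, 0]],
         [[4, 4, 4], [8, 0, 8]]]
weights = [[0.25, 0.5, 0.25]]  # hypothetical values, normally learned
reduced = conv1x1(image, weights)
print(reduced)  # [[[8.0], [0.0]], [[4.0], [4.0]]]

# Cost of a 5x5 branch with and without a 1x1 reduction in front of it
# (28x28 map, 192 input channels, 16 reduced channels, 32 output channels).
direct = mult_adds(28, 28, 5, 192, 32)
with_reduction = mult_adds(28, 28, 1, 192, 16) + mult_adds(28, 28, 5, 16, 32)
print(with_reduction < direct)  # True
```

The spatial size stays the same while the channel count shrinks, which is what lets the more expensive 3×3 and 5×5 branches run on a thinner input.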

  3. Given the depth of the network, it was prone to the vanishing gradient problem during back-propagation. To counter this, two auxiliary outputs are tapped at middle layers, and their losses are weighted before being added to the total loss, that is:

total_loss = final_loss + (0.3 * aux1_loss) + (0.3 * aux2_loss)
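
The formula above translates into a one-line function. This is only a sketch: the function name and the example loss values are hypothetical, and the auxiliary weight is left as a parameter. Its default of 0.3 is the discount weight reported in the GoogleNet paper. The auxiliary classifiers matter only during training and are discarded at inference time.

```python
def total_loss(final_loss, aux_losses, aux_weight=0.3):
    """Add down-weighted auxiliary classifier losses to the final loss."""
    return final_loss + aux_weight * sum(aux_losses)


# Hypothetical loss values for one training step.
loss = total_loss(final_loss=1.0, aux_losses=[2.0, 1.0])
print(round(loss, 2))  # 1.9
```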

Since Inception-v1, the module has gone through various improvements, briefly mentioned below:

Inception-v2 and Inception-v3 (paper)

These versions introduce factorization of convolutions, which decreases the number of parameters and reduces over-fitting. Batch normalization was also introduced, and label smoothing is applied at the classifier layer as a regularizer: it prevents a particular logit from becoming too large compared to the others.
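
Both ideas are easy to illustrate with small numbers. The sketch below is an assumption-laden illustration rather than the paper's code: `smooth_labels` is a hypothetical helper that mixes a one-hot target with a uniform distribution, and `epsilon=0.1` is just a commonly used value. The last lines compare the weight count of one 5×5 filter with two stacked 3×3 filters, which cover the same 5×5 receptive field.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


smoothed = smooth_labels([0, 0, 1, 0])
# The true class keeps most of the mass; the rest is spread evenly.
print([round(p, 3) for p in smoothed])  # [0.025, 0.025, 0.925, 0.025]

# Factorization: weights per input/output channel pair.
weights_5x5 = 5 * 5          # one 5x5 filter
weights_two_3x3 = 2 * 3 * 3  # two stacked 3x3 filters
print(weights_5x5, weights_two_3x3)  # 25 18
```

Because the target for the true class is below 1, the network gains nothing from pushing one logit far above the others.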

Inception-v4 and Inception-ResNet-v1 (paper)

This version simplifies the stem of the network (the preamble that connects to the first inception module). The inception blocks are the same as before, except that they are named A, B and C. In the ResNet version, residual connections are introduced, replacing pooling in the inception module.
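
A residual connection simply adds a block's input to its output. The sketch below shows that idea on plain Python lists. `residual_block` and the `halve` branch are hypothetical stand-ins for the real inception layers, which operate on feature maps rather than short vectors.

```python
def residual_block(x, branch):
    """Return x + branch(x); branch must preserve the length of x."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("branch output must match the input shape")
    return [a + b for a, b in zip(x, fx)]


def halve(v):
    """Hypothetical branch standing in for the inception layers."""
    return [0.5 * a for a in v]


print(residual_block([2.0, 4.0], halve))  # [3.0, 6.0]
```

If the branch outputs zeros, the block passes its input through unchanged, which is what makes very deep stacks easier to train.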

David Sandberg’s FaceNet implementation uses the ‘Inception-ResNet-v1’ version.

During FaceNet training, the deep network extracts and learns various facial features. These features are then converted directly into 128-dimensional embeddings, where images of the same face should lie close to each other and images of different faces should lie far apart in the embedding space (the embedding space is nothing but the feature space). This gives the intuition; implementation-wise, it is achieved using a loss function called triplet loss.
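
The intuition of "close" and "far apart" can be shown with toy vectors. The sketch below uses made-up 3-dimensional embeddings instead of real 128-dimensional ones, and the helper names are assumptions. FaceNet constrains embeddings to unit length, which the `l2_normalize` helper imitates.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Made-up embeddings: two images of person A, one image of person B.
face_a1 = l2_normalize([1.0, 0.1, 0.0])
face_a2 = l2_normalize([0.9, 0.2, 0.0])
face_b = l2_normalize([0.0, 0.1, 1.0])

same = squared_distance(face_a1, face_a2)
different = squared_distance(face_a1, face_b)
print(same < different)  # True
```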

Cost Function

The most distinctive feature of FaceNet is its loss function. Triplet loss is the function used for face verification, but David’s FaceNet implementation has two loss functions: ‘Triplet loss’ as well as ‘Softmax activation with cross entropy loss’. The triplet cost function looks as follows:

Fig 4: Cost Function

Triplet Loss: Let us say f(x) creates an embedding in d-dimensional space for an image x. Example images are:

  • Anchor : Image of Elon Musk, that we want to compare with,
  • Positive : Another image of Elon Musk, positive example,
  • Negative : Image of John Travolta, negative example.
Fig 3: Three images, grouped.

Theoretically, the anchor image should be closer to the positive image and farther from the negative one in Euclidean space. This can be expressed as:

dist(A,P) = ||f(A) - f(P)||²,  dist(A,N) = ||f(A) - f(N)||²
||f(A) - f(P)||² + α <= ||f(A) - f(N)||²
||f(A) - f(P)||² + α - ||f(A) - f(N)||² <= 0 ... (1)


||f(A) - f(P)||² is the squared distance between anchor and positive,

||f(A) - f(N)||² is the squared distance between anchor and negative.

To keep the positives further apart from the negatives, a margin α is added to the positive distance. This forces the negative to be at least α farther from the anchor than the positive is.

Once inequality (1) is satisfied, its left-hand side is zero or negative and the triplet needs no further correction. Since we do not need values below zero, the loss is clipped at zero:

L(A,P,N) = max(||f(A) - f(P)||² + α - ||f(A) - f(N)||², 0) ... (2)
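
Equation (2) can be written directly as a small function. This is a minimal sketch on toy 2-dimensional embeddings, not the FaceNet implementation. The function names are assumptions, and `alpha=0.2` is the margin value the FaceNet paper reports using.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Equation (2): max(d(A,P) + alpha - d(A,N), 0)."""
    pos = squared_distance(anchor, positive)
    neg = squared_distance(anchor, negative)
    return max(pos - neg + alpha, 0.0)


anchor, positive = [0, 0], [0, 1]

# Negative is far away: inequality (1) already holds, so the loss is zero.
print(triplet_loss(anchor, positive, [3, 0]))  # 0.0

# Negative is as close as the positive: the margin is violated.
print(triplet_loss(anchor, positive, [1, 0]))  # 0.2
```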

Triplet Selection: The obvious question is how to choose the (A,P) and (A,N) pairs. If we select them randomly, most triplets would satisfy inequality (1) quite easily, the loss (2) would be zero, and the network would not learn much from them. Moreover, gradient descent may converge to a poor local minimum and end up with the wrong weights.

The paper notes that selecting only the hardest examples can lead to bad local minima early in training and may result in a collapsed model. Semi-hard examples are the preferred option. They can be found within a reasonably large mini-batch; in the paper the authors select around 40 faces per identity in each mini-batch.

Hence, it is good to pick ‘semi-hard’ examples and present them to the network, such that:

 d(A,P) ≈ d(A,N)

Such triplets are informative because the loss keeps pushing them apart until they are separated by the margin α, even though the two distances start out close to each other.
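
One way to read the semi-hard condition is: the negative is farther from the anchor than the positive, but still inside the margin, that is d(A,P) < d(A,N) < d(A,P) + α. The sketch below filters a list of candidate negatives with that rule. The function names, the toy points and `alpha=0.2` are assumptions for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def semi_hard_negatives(anchor, positive, negatives, alpha=0.2):
    """Indices of negatives with d(A,P) < d(A,N) < d(A,P) + alpha."""
    d_ap = squared_distance(anchor, positive)
    return [
        i for i, neg in enumerate(negatives)
        if d_ap < squared_distance(anchor, neg) < d_ap + alpha
    ]


anchor, positive = [0.0, 0.0], [1.0, 0.0]   # d(A,P) = 1.0
candidates = [
    [0.5, 0.0],    # closer than the positive: a hard negative
    [1.05, 0.0],   # slightly farther than the positive: semi-hard
    [3.0, 0.0],    # far away: an easy negative, loss already zero
]
print(semi_hard_negatives(anchor, positive, candidates))  # [1]
```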

The FaceNet paper suggests two methods:

  1. Offline, every n training steps: compute the argmin and argmax on a subset of the data, using the latest checkpoint.
  2. Online: select a large mini-batch and compute the argmin and argmax within the batch.
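
The online variant can be sketched as a loop over one mini-batch: for each anchor, the hardest positive is the same-identity sample with the largest distance (argmax) and the hardest negative is the different-identity sample with the smallest distance (argmin). This is an illustrative sketch with hypothetical names and toy data. In practice the paper reports using all anchor-positive pairs and semi-hard negatives rather than the plain extremes shown here.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mine_batch(embeddings, labels):
    """Return (anchor, hardest positive, hardest negative) index triplets."""
    triplets = []
    for a, (emb, label) in enumerate(zip(embeddings, labels)):
        dist = [squared_distance(emb, other) for other in embeddings]
        positives = [i for i, l in enumerate(labels) if l == label and i != a]
        negatives = [i for i, l in enumerate(labels) if l != label]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor
        hardest_pos = max(positives, key=lambda i: dist[i])  # argmax d(A,P)
        hardest_neg = min(negatives, key=lambda i: dist[i])  # argmin d(A,N)
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets


# Toy mini-batch: two identities, two images each.
batch = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 0.0]]
ids = ["a", "a", "b", "b"]
print(mine_batch(batch, ids))
# [(0, 1, 2), (1, 0, 2), (2, 3, 0), (3, 2, 1)]
```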

NOTE: Training with triplet loss can be troublesome, and hence David’s FaceNet implementation suggests using ‘Softmax with cross entropy loss’; this idea comes from a paper.
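
For comparison with triplet loss, softmax with cross entropy treats each identity as a class and penalises a low probability for the correct one. The sketch below is a generic, dependency-free version with hypothetical function names; it is not taken from David's implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


# Hypothetical logits over three identities; identity 0 is the correct one.
confident = cross_entropy([5.0, 0.0, 0.0], target_index=0)
unsure = cross_entropy([0.0, 0.0, 0.0], target_index=0)
print(confident < unsure)  # True
```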

SVM Training — Inference

These embeddings are then used to compute Euclidean distances in order to match or verify photos. An SVM is a well-suited machine learning algorithm for classification here: it is trained on the generated embeddings and can later be used for inference on test data.
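
The distance-based matching described above can be sketched without any library. Note the substitution: to stay dependency-free, this example uses a distance threshold for verification and a nearest-neighbour lookup for identification in place of an SVM. The function names, the toy 2-dimensional embeddings and the threshold of 1.1 are assumptions; a real threshold has to be tuned on validation data.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def is_same_person(emb1, emb2, threshold=1.1):
    """Verification: two photos match if their embeddings are close enough."""
    return squared_distance(emb1, emb2) < threshold


def identify(query, gallery):
    """Identification: label of the gallery embedding closest to the query."""
    return min(gallery, key=lambda name: squared_distance(query, gallery[name]))


# Hypothetical gallery of known identities.
gallery = {"person_a": [1.0, 0.0], "person_b": [0.0, 1.0]}
query = [0.9, 0.1]

print(is_same_person(query, gallery["person_a"]))  # True
print(identify(query, gallery))                    # person_a
```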

Example Code

The code can be found here. It is best to open it on Google Colab and run it there.

In part 2 we will cover a hands-on example of FaceNet on mobile, and we will also learn what a .tflite model is and why it is required on mobile devices.