Contrastive | Triplet | Quadruplet Loss: question pairs detection



In this article, I will evaluate and compare three different loss functions for the task of Deep Similarity Learning. If you are not yet comfortable with this topic, I have written an article introducing the main concepts with code examples, as well as a complete GitHub repository for you to check:

Table of Contents

I. Quick overview of the task

II. Siamese Recurrent Network: sequence-processing equivalent of Siamese Neural Networks

III. Losses for Deep Similarity Learning

IV. Concrete Application: question pairs detection

I. Quick overview of the task

I used the famous Quora Question Pairs dataset for this task, where the goal is to predict whether two questions have the same intent. For instance:

  • What can make Physics easy to learn? / How can you make Physics easy to learn? have similar intents
  • What is the best way to make money online? / What is the best way to ask for money online? have different intents

For this task, different solutions can be used, but the one we will look at today is: Word Embeddings + Siamese Recurrent Network. The word embedding algorithm is not the focus here (Word2Vec will be used); we will instead focus on training the Siamese Recurrent Network. Hence, before talking about training, we will have a quick overview of what a Siamese Recurrent Network is (more details can be found in my other article above…).

II. Siamese Recurrent Network: the sequence-processing equivalent of Siamese NN

Figure: a Siamese BiLSTM architecture

As presented above, a Siamese Recurrent Neural Network is a neural network that takes two sequences of data as input and classifies them as similar or dissimilar.

The Encoder

To do so, it uses an Encoder whose job is to transform the input data into a vector of features. One vector is created for each input, and both are passed on to the Classifier. When working with images, this encoder is often a stack of convolutional layers, while when working with sequences, it is often a stack of RNNs. In our case, we used a stack of 3 Bidirectional LSTMs.
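As a rough sketch, such an encoder could look like the following in PyTorch (the class name, the embedding size of 300 and the hidden size of 128 are illustrative assumptions, not the exact configuration used here):

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Sketch of the encoder: a stack of 3 Bidirectional LSTMs that turns a
    padded sequence of word embeddings into a single feature vector."""
    def __init__(self, embedding_dim=300, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(embedding_dim, hidden_dim,
                              num_layers=3, bidirectional=True,
                              batch_first=True)

    def forward(self, embedded_question):
        # embedded_question: (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.bilstm(embedded_question)
        # hidden: (num_layers * 2, batch, hidden_dim); concatenate the last
        # layer's forward and backward hidden states into one feature vector
        return torch.cat([hidden[-2], hidden[-1]], dim=1)
```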

The Classifier

The classifier then computes a distance between these two vectors (the distance function can be any metric: L1, L2…). This distance is then classified as the distance between similar or dissimilar data instances: in effect, this amounts to finding the right distance threshold above which two data objects are considered dissimilar.
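A minimal sketch of this thresholding step, assuming an already-trained encoder (the function name and the threshold value of 0.5 are illustrative choices):

```python
import torch.nn.functional as F

def predict_similarity(encoder, embedded_a, embedded_b, threshold=0.5):
    """Encode both questions, compute the L2 distance between the two
    feature vectors, and label the pair as similar when the distance
    falls below the threshold."""
    vec_a = encoder(embedded_a)
    vec_b = encoder(embedded_b)
    distance = F.pairwise_distance(vec_a, vec_b, p=2)
    return (distance < threshold).long()  # 1 = similar, 0 = dissimilar
```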

Training a Siamese Neural Network

Given the definitions of the encoder and the classifier, one may realize that all the difficulty of working with Siamese NNs lies in how the vector of features is created. Indeed, this vector needs the following properties:

  • To be descriptive enough that two similar pieces of data (with some variability) will have similar vectors (and hence a small distance)
  • To be discriminative enough that two dissimilar pieces of data will have dissimilar vectors
Animation: the data comparison process

Hence, we see that training this network is about teaching it, on one hand, to recognize similar things and, on the other hand, to recognize when things are dissimilar, both with good confidence. It is not enough to teach a model what two similar pieces of data look like: it would overfit to the training data and tend to find everything similar (high recall but low precision). It must also be trained to recognize dissimilar data (and hence balance its recall and precision), that is, to learn what makes two pieces of data inherently different.

For training a Siamese Neural Network, the most popular loss function is the Contrastive Loss [2] (reviewed more thoroughly in my earlier post linked above). However, it is not the only one that exists. I will compare it to two other losses by detailing the main idea behind each loss as well as its PyTorch implementation.

III. Losses for Deep Similarity Learning

Contrastive Loss

When training a Siamese Network with a Contrastive Loss [2], it takes two inputs to compare at each step. These two inputs can be either similar or dissimilar. This is modelled by a binary class variable Y whose values are:

  • 0 if dissimilar;
  • 1 if similar.

These classes can obviously be changed, provided that the loss function is adapted accordingly.

Figure: illustration of the Contrastive Loss

You can find the PyTorch code of the Contrastive Loss below:
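A minimal sketch of the Contrastive Loss in PyTorch (the class name, the default margin of 1.0 and the choice of the Euclidean distance are illustrative assumptions; Y = 1 marks a similar pair, as defined above):

```python
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Contrastive Loss: pulls similar pairs together and pushes
    dissimilar pairs at least `margin` apart."""
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, emb_a, emb_b, label):
        # label: 1.0 for similar pairs, 0.0 for dissimilar pairs
        distance = F.pairwise_distance(emb_a, emb_b, p=2)
        loss_similar = label * distance.pow(2)
        loss_dissimilar = (1 - label) * F.relu(self.margin - distance).pow(2)
        return (loss_similar + loss_dissimilar).mean()
```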

Triplet Loss

When training a Siamese Network with a Triplet Loss [3], it takes three inputs to compare at each step. Unlike with the Contrastive Loss, the inputs are intentionally sampled according to their class:

  • We sample an anchor object, used as a point of comparison for the two other data objects;
  • We sample a positive object, known to be similar to the anchor object;
  • We then sample a negative object, known to be dissimilar to the anchor object.
Figure: illustration of the Triplet Loss

You can find the PyTorch code of the Triplet Loss below:
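A minimal sketch of the Triplet Loss in PyTorch (the class name and the default margin of 1.0 are illustrative assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class TripletLoss(nn.Module):
    """Triplet Loss: the anchor-positive distance must be smaller than
    the anchor-negative distance by at least `margin`."""
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative):
        d_pos = F.pairwise_distance(anchor, positive, p=2)
        d_neg = F.pairwise_distance(anchor, negative, p=2)
        return F.relu(d_pos - d_neg + self.margin).mean()
```

Note that PyTorch also ships a built-in nn.TripletMarginLoss that implements the same idea.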

Quadruplet Loss

When training a Siamese Network with a Quadruplet Loss [3], it takes four inputs to compare at each step. Just like with the Triplet Loss, the inputs are intentionally sampled according to their class:

  • We sample an anchor object, used as a point of comparison for the other data objects;
  • We sample a positive object, known to be similar to the anchor object;
  • We sample a negative object, known to be dissimilar to the anchor object;
  • Then, we sample another negative object, known to be dissimilar to the three other data objects.
Figure: illustration of the Quadruplet Loss

You can find the PyTorch code of the Quadruplet Loss below:
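A minimal sketch of the Quadruplet Loss in PyTorch (the class name and the two margin values of 1.0 and 0.5 are illustrative assumptions; squared distances follow the usual formulation of this loss):

```python
import torch.nn as nn
import torch.nn.functional as F

class QuadrupletLoss(nn.Module):
    """Quadruplet Loss: in addition to the triplet constraint, it pushes
    the anchor-positive distance below the distance between the two
    negatives, with a (usually smaller) second margin."""
    def __init__(self, margin1=1.0, margin2=0.5):
        super().__init__()
        self.margin1 = margin1
        self.margin2 = margin2

    def forward(self, anchor, positive, negative1, negative2):
        d_ap = F.pairwise_distance(anchor, positive, p=2)
        d_an = F.pairwise_distance(anchor, negative1, p=2)
        d_nn = F.pairwise_distance(negative1, negative2, p=2)
        loss_triplet = F.relu(d_ap.pow(2) - d_an.pow(2) + self.margin1)
        loss_extra = F.relu(d_ap.pow(2) - d_nn.pow(2) + self.margin2)
        return (loss_triplet + loss_extra).mean()
```

The second term only involves the two negatives on one side, which encourages the network to keep dissimilar pairs far apart even when neither element is the anchor.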

Visual Comparison of the losses and their impact on the network