Original article was published on Deep Learning on Medium
In this series of posts, we will study the concept of similarity learning applied to different types of data in different contexts:
- Text data/Sequence data (e.g. for sentence similarities)
- Image data (e.g. for object similarities, face recognition…)
- Multimodal data (e.g. for retail product comparison, where each product comprises a title and an image)
In this article, I will go through my take on the general concept of Similarity Learning, which processes it involves and how it can be summarized. I will then apply these outlined concepts to the context of question similarity.
Table of Contents
- Overview of Similarity Learning
- Text Similarity Learning
- Source code (PyTorch implementation)
1. Overview of Deep Similarity Learning
Whenever we do similarity learning, the same process is performed:
As explained in this infographic, any process involving Similarity Learning revolves around 3 main concepts:
- Transformation of the data into a vector of features
- Comparison of the vectors using a distance metric
- Classification of the distance as being similar or dissimilar
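The three steps above can be sketched as a single function that chains them together. This is only an illustration: the name `is_similar` and the three `toy_*` helpers are hypothetical stand-ins invented for this example, not the encoder, distance, or classifier used later in the article.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def is_similar(
    a,
    b,
    encode: Callable[[object], Vector],
    distance: Callable[[Vector, Vector], float],
    classify: Callable[[float], bool],
) -> bool:
    """Run the three steps: encode both inputs, measure, then classify."""
    vec_a = encode(a)  # 1. transformation into a vector of features
    vec_b = encode(b)  #    (the same encoder is applied to both inputs)
    dist = distance(vec_a, vec_b)  # 2. comparison using a distance metric
    return classify(dist)  # 3. similar / dissimilar decision


# Toy stand-ins, only here to make the sketch runnable.
def toy_encode(text: str) -> Vector:
    # Two hand-picked features: character count and number of spaces.
    return [float(len(text)), float(text.count(" "))]


def toy_distance(u: Vector, v: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(u, v))


def toy_classify(d: float) -> bool:
    return d < 3.0  # arbitrary threshold chosen for the example


same = is_similar("how are you", "how are yo?", toy_encode, toy_distance, toy_classify)
different = is_similar("hi", "a much longer question here", toy_encode, toy_distance, toy_classify)
```

Because each step is passed in as a function, any of them can be swapped (a different encoder for images, a different metric) without touching the other two.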
1.a Transformation through an Encoder
In most Deep Learning tasks, the first layers of a model represent what is sometimes referred to as “an encoding phase”: their role is to extract relevant features from the input data.
For the rest of the article, we will write the encoding function as follows:
This encoder can take, depending on the input, different forms, amongst which we find:
- RNN layers for encoding and comparison of sequences;
- CNN layers for temporal/spatial data (1D convolutions can be used for sequences as well).
Usually, after the input data has been reduced to a vector by these encoders, we stack fully-connected layers to classify the extracted features. In our case, we instead use this vector as a dimensionally reduced version of our data to compute its distance to other pieces of data. It is much easier to quantify how different two vectors are than how different two sentences are, for example.
To sum up, an encoder combines whatever layers suit its input data to generate the data’s latent representation: a compressed vector of information that is not human-interpretable.
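To make the idea of an encoder concrete, here is a minimal, dependency-free sketch of a recurrent encoder that maps a sequence of any length to a fixed-size latent vector. It is an assumption-laden toy: the class name `TinyRNNEncoder` is hypothetical, the weights are random rather than learned, and the article's actual implementation relies on PyTorch layers instead.

```python
import math
import random


class TinyRNNEncoder:
    """Elman-style RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # Random, untrained weights: in a real model these are learned.
        self.w_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
                    for _ in range(hidden_size)]
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                    for _ in range(hidden_size)]
        self.b = [0.0] * hidden_size

    def encode(self, sequence):
        """Map a sequence of input vectors to one fixed-size latent vector."""
        h = [0.0] * self.hidden_size
        for x in sequence:
            # The new state depends on the current input and the previous state.
            h = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_x[j], x))
                    + sum(w * hi for w, hi in zip(self.w_h[j], h))
                    + self.b[j]
                )
                for j in range(self.hidden_size)
            ]
        return h  # the final hidden state serves as the latent vector


encoder = TinyRNNEncoder(input_size=3, hidden_size=4)
short_seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_seq = short_seq + [[0.0, 0.0, 1.0]] * 5
latent_short = encoder.encode(short_seq)  # 4 values
latent_long = encoder.encode(long_seq)    # also 4 values, despite 7 time steps
```

The point to notice is that both sequences end up as vectors of the same size, which is what makes them comparable with a distance function.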
Throughout Deep Learning history, multiple types of architectures have been created to generate latent vectors. Some of them were:
- Siamese Neural Networks (Koch, Zemel and Salakhutdinov, 2015)
- Multimodal Autoencoders (Silberer and Lapata, 2015)
We will explore Siamese Neural Networks further later in this article.
1.b Distance calculation
Once we have our vectorized input data, we can compare the two vectors using a distance function. The most popular distances are:
- The Manhattan distance
- The Euclidean distance
Once the distance is calculated, we can set a threshold: two pieces of data are considered dissimilar when their distance is above it, and similar otherwise.
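As a small worked example, the two distances and the threshold rule can be written in a few lines of plain Python. The function names and the threshold value of 5.5 are assumptions made for illustration only; the vectors would normally come from the encoder.

```python
import math


def manhattan_distance(u, v):
    """L1 distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def within_threshold(u, v, threshold, distance=euclidean_distance):
    """Similar when the distance does not exceed the threshold."""
    return distance(u, v) <= threshold


u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 3.0]
print(manhattan_distance(u, v))               # 7.0  (3 + 4 + 0)
print(euclidean_distance(u, v))               # 5.0  (sqrt(9 + 16 + 0))
print(within_threshold(u, v, threshold=5.5))  # True
print(within_threshold(u, v, threshold=5.5, distance=manhattan_distance))  # False
```

The last two lines show that the same threshold gives different answers under different metrics, so the threshold has to be chosen for a specific distance.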
1.c Distance classification
However, depending on the input data, setting this threshold by hand might be complex or time-consuming. For simplicity, we can use another classifier that, given an input distance, classifies it as the distance of similar or dissimilar objects. My choice was to use a logistic regression classifier: finding a linear separation in our data corresponds to learning the threshold that classifies our distances.
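To see why a logistic regression on a single distance amounts to learning a threshold, here is a minimal sketch trained with plain gradient descent. The training distances and labels are made up for the example, and the helper names (`fit_logistic_1d`, `predict_similar`) are hypothetical; a real setup would more likely use a library classifier or a final PyTorch layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(distances, labels, lr=0.5, epochs=10000):
    """Fit p(similar | d) = sigmoid(w * d + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(distances)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for d, y in zip(distances, labels):
            error = sigmoid(w * d + b) - y  # derivative of the log loss w.r.t. the logit
            grad_w += error * d
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_similar(d, w, b):
    return sigmoid(w * d + b) >= 0.5


# Made-up distances: label 1 = similar pair, label 0 = dissimilar pair.
distances = [0.2, 0.5, 0.9, 1.1, 2.4, 2.8, 3.1, 3.6]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_logistic_1d(distances, labels)
# The decision boundary is where w * d + b = 0, i.e. the learned threshold.
threshold = -b / w
```

With a single input feature, the "linear separation" is just one point on the distance axis: pairs whose distance falls below `threshold` are predicted similar, and the others dissimilar.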