Finding duplicate sentences using semantic understanding

There are several ways to approach this problem statement: finding the cosine similarity between two sentence embeddings, using Word Mover's Distance, or training a random forest classifier on hand-crafted features. But we will be working with InferSent.
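
For reference, the simplest of those baselines can be sketched in a few lines. This is not the approach we take in this post, and the helper names here are mine; it just averages pre-trained word vectors into a sentence embedding and compares them with cosine similarity:

```python
# A minimal sketch of the cosine-similarity baseline (helper names are made up for illustration).
import numpy as np

def sentence_embedding(sentence, word_vectors):
    """Average the pre-trained vectors (e.g. GloVe) of the words in a sentence."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0)

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# score = cosine_similarity(sentence_embedding(s1, glove), sentence_embedding(s2, glove))
```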

InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.

But let me state the reasons why we are choosing InferSent over the others on the list. Because proofs are important, my friend 🙂

The WMD (Word Mover's Distance) approach gives encouraging results. The paper "Using Centroids of Word Embeddings and Word Mover's Distance for Biomedical Document Retrieval in Question Answering" uses centroid distance for initial pruning and then WMD for fine-grained re-ranking.

But note that WMD is slow. Computing it for a single pair costs at least O(n·m), where n is the length of sentence 1 and m is the length of sentence 2, just to build the word-to-word distance matrix, and you pay that cost for every pair you compare. You are dead if you have a big dataset.
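
As a rough illustration of that cost (this snippet is mine, not from the original post), gensim's `KeyedVectors.wmdistance` computes WMD between two tokenized sentences, and you have to call it once per pair; the model path below is an assumption:

```python
# Pairwise WMD over a toy corpus -- every pair is its own optimal-transport problem.
from itertools import combinations
from gensim.models import KeyedVectors

# Path is an assumption: point it at your local pre-trained word2vec binary.
word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

sentences = [
    "obama speaks to the media in illinois".split(),
    "the president greets the press in chicago".split(),
    "the weather today is sunny".split(),
]

for (i, s1), (j, s2) in combinations(enumerate(sentences), 2):
    print(i, j, word_vectors.wmdistance(s1, s2))
```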

But neither of these two approaches uses the information contained in word order, so there has been plenty of research in that direction as well.

InferSent uses the information coming from word order, and it takes into consideration the importance of each word in a sentence. Refer to the graph below for an understanding of how this works.

As we see, Barack-Obama, president and United States are given more priority than the other words. Don't worry, we will get to this by the end of this blog.

Unlike in computer vision, where convolutional neural networks are predominant, there are multiple ways to encode a sentence using neural networks. InferSent uses a bi-directional LSTM architecture with max pooling, trained on the Stanford Natural Language Inference (SNLI) dataset.
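
To make "bi-directional LSTM with max pooling" concrete, here is a minimal PyTorch sketch of that encoder shape. The dimensions follow the paper's defaults, the class name is mine, and this is not the actual InferSent code:

```python
import torch
import torch.nn as nn

class BiLSTMMaxPoolEncoder(nn.Module):
    """Minimal InferSent-style encoder: BiLSTM over word embeddings, max pooling over time."""

    def __init__(self, word_emb_dim=300, hidden_dim=2048):
        super().__init__()
        self.lstm = nn.LSTM(word_emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, word_embeddings):
        # word_embeddings: (batch, seq_len, word_emb_dim)
        hidden_states, _ = self.lstm(word_embeddings)      # (batch, seq_len, 2 * hidden_dim)
        # For each feature dimension, keep the maximum over all time steps;
        # the words that "win" the most dimensions are the most important ones,
        # which is where the word-importance graph above comes from.
        sentence_embedding, _ = hidden_states.max(dim=1)   # (batch, 2 * hidden_dim)
        return sentence_embedding
```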

This is how InferSent aims to demonstrate that sentence encoders trained on natural language inference are able to learn sentence representations that capture universally useful features.

Wait, but what in the world are these u, v and these arrows? Cool. Let's break it down to make it simpler.

Our goal is to train a generic encoder. We have a sentence encoder that outputs a representation u for the premise and v for the hypothesis. After the sentence vectors are generated, three matching methods are applied to extract the relation between the two: concatenation of the two vectors (u, v), element-wise product u ∗ v, and absolute element-wise difference |u − v|.
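
In code, those matching methods boil down to a single concatenation. A sketch under the same assumptions as above, with a made-up batch size and the 4096-dimensional embeddings produced by the 2 × 2048 BiLSTM:

```python
import torch
import torch.nn as nn

u = torch.randn(32, 4096)  # premise embeddings (batch of 32)
v = torch.randn(32, 4096)  # hypothesis embeddings

# Concatenation, absolute element-wise difference and element-wise product,
# stacked into one feature vector per sentence pair.
features = torch.cat([u, v, torch.abs(u - v), u * v], dim=1)   # (32, 4 * 4096)

classifier = nn.Sequential(nn.Linear(4 * 4096, 512), nn.ReLU(), nn.Linear(512, 3))
logits = classifier(features)  # entailment / neutral / contradiction scores
```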

Then the result is fed to a 3-class classifier. We will be covering the details of the LSTM models in our next blog because that's not our aim here 😉. The next blog will have a detailed walkthrough of LSTMs and how InferSent is trained. For now, we start off by importing the models.py file from InferSent.
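
A minimal sketch of what that looks like, following the usage shown in the InferSent repository; the file paths are assumptions and depend on where you downloaded the pre-trained encoder and the GloVe vectors:

```python
import numpy as np
import torch
from models import InferSent  # models.py from the InferSent repository

# Hyperparameters expected by the pre-trained v1 encoder.
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}
model = InferSent(params_model)
model.load_state_dict(torch.load('encoder/infersent1.pkl'))   # assumed path
model.set_w2v_path('GloVe/glove.840B.300d.txt')                # assumed path

sentences = ['How can I be a good geologist?',
             'What should I do to be a great geologist?']
model.build_vocab(sentences, tokenize=True)
embeddings = model.encode(sentences, tokenize=True)

# Cosine similarity between the two sentence embeddings flags likely duplicates.
u, v = embeddings[0], embeddings[1]
print(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```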