Solving the Embedding Mystery!

NLP is a field that is continuously evolving, introducing better and more computationally efficient techniques along the way. My previous post exploited the core capabilities offered by spaCy to examine how genders are portrayed in novels from different eras. Recently I have been dealing with huge amounts of textual data, for which one-hot feature creation and management would be a mess, and pre-trained word embeddings have proved to be a viable solution.

In this article, I will briefly cover what word embeddings are, focusing more on how you can use several techniques to customise an embedding for your specific use case and improve classification accuracy.

Why word embeddings?

In order to classify text, we need numerical features. One way of approaching the problem is to one-hot encode the text and feed the binary features directly to our model.
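As a quick illustration (a minimal sketch using scikit-learn, which is not the pipeline used later in this article), binary one-hot style text features can be built like this:

```python
# A minimal sketch of one-hot (binary bag-of-words) text features,
# assuming scikit-learn >= 1.0 is available; purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the movie was great", "the movie was terrible"]

# binary=True records only the presence/absence of each word.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # vocabulary as feature names
print(X.toarray())                         # one row of 0/1 features per document
```

Every unique word becomes its own column, which already hints at the problems below.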

The one-hot encoding technique has two main drawbacks:

  1. For high-cardinality variables — those with many unique categories — the dimensionality of the transformed vector becomes unmanageable.
  2. The mapping is completely uninformed: “similar” categories are not placed any closer to each other in the encoded vector space.

To get around these issues, we use word embeddings: in simple terms, the weights of the hidden layer of a neural network trained on a huge corpus. Let’s discuss word embeddings in detail.

Working of Embeddings

Let’s start off with word2vec. The abstract of the 2013 paper by Mikolov et al. states:

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high-quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

Let’s get into the details. We are going to create a neural network with a single hidden layer. As input, we pass the one-hot vector of a center word; as output, we get, for every word in the vocabulary, the probability of it appearing near that center word. This is one of those networks where the actual goal is not the output itself: what we are interested in is the weights of the hidden layer. The output layer is only needed during training, so that those hidden-layer weights can be learned.
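As a hands-on illustration of this idea, here is a minimal sketch (assuming gensim 4.x is installed; this is not the training code used later in the article) that trains skip-gram vectors on a toy corpus and inspects the learned hidden-layer weights:

```python
# A minimal sketch: training skip-gram word2vec on a toy corpus with gensim 4.x
# (older gensim versions use `size` instead of `vector_size`).
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["word", "embeddings", "map", "words", "to", "dense", "vectors"],
]

# sg=1 selects the skip-gram architecture; sg=0 would give CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["fox"].shape)         # (50,) – the learned hidden-layer weights for "fox"
print(model.wv.most_similar("fox"))  # nearest neighbours in the embedding space
```

The vector returned for each word is exactly the row of hidden-layer weights we care about.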

This is a basic representation of the skip-gram model. We can get an intuition for CBOW (continuous bag of words) by observing the same architecture in reverse.

Now we will use these techniques to get some hands-on experience with embeddings. We will work with the Quora Insincere Questions Classification dataset, which can be found here.

Customising Embeddings

The primary goal should be to increase the coverage of your vocabulary by the pre-trained embedding being used. Even though the pre-trained vocabulary is huge, it is difficult to get 100% coverage on your specific corpus. We can start by eliminating or transforming words that might not be included in the vocabulary. Here’s a little example of doing this with GloVe.

The code for this example is available as a gist: https://gist.github.com/RandomForestGump/905925f669ca78b65e7225a5f5abdde8
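In case the embedded gist does not render, here is a minimal sketch of what such a coverage check can look like. The GloVe file path, the train.csv file and the question_text column are placeholders for the Quora dataset, not code taken from the gist.

```python
# A minimal sketch of checking how much of a corpus vocabulary is covered
# by pre-trained GloVe vectors. File and column names are placeholders.
import numpy as np
import pandas as pd
from collections import Counter

def load_glove(path, dim=300):
    """Load GloVe vectors from a text file into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # Some GloVe entries contain spaces in the token itself,
            # so treat the last `dim` fields as the vector.
            word = " ".join(parts[:-dim])
            embeddings[word] = np.asarray(parts[-dim:], dtype="float32")
    return embeddings

def check_coverage(texts, embeddings):
    """Print what fraction of the vocabulary and of all tokens is covered."""
    vocab = Counter(word for text in texts for word in str(text).split())
    oov = {w: c for w, c in vocab.items() if w not in embeddings}
    vocab_cov = 1 - len(oov) / len(vocab)
    text_cov = 1 - sum(oov.values()) / sum(vocab.values())
    print(f"Embeddings found for {vocab_cov:.2%} of vocab, {text_cov:.2%} of all text")
    # Return the most frequent out-of-vocabulary words: the ones worth fixing first.
    return sorted(oov.items(), key=lambda x: x[1], reverse=True)

# Example usage (paths and column names are assumptions for the Quora dataset):
# glove = load_glove("glove.840B.300d.txt")
# train = pd.read_csv("train.csv")
# oov = check_coverage(train["question_text"], glove)
# oov[:10]  # candidates for cleaning or transformation
```

Inspecting the most frequent out-of-vocabulary words tells you exactly which cleaning or transformation steps will buy the biggest gains in coverage.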