Deep NLP: Word Vectors with Word2Vec

Source: Deep Learning on Medium

Deep learning has some impressive applications in natural language processing, many of which have been shown to perform very well. The core concept is to feed human-readable sentences into neural networks so that the models can extract information from them.

This can be used for many tasks; two of the popular ones are sentiment analysis and named entity recognition. Sentiment analysis is where we try to predict whether a given text has a negative or positive emotion overall. We feed the text into a neural network and it outputs a value between 0 and 1 to indicate the overall emotion. In named entity recognition, we try to extract the key entities in a given text. Some examples of entity types are names, addresses and organizations. Given some text, the neural network has to extract all the entities present. For example, if we give it the sentence “Josh works in Google”, it should tell us that “Josh” is a name and “Google” is an organization.

There are many such tasks in the world of NLP. But before we dive into implementing these tasks using deep learning, we first need to understand how to preprocess textual data for our neural networks. Neural networks can take numbers as an input, but not raw text. Hence, we need to figure out a way to convert these words into a numerical format. The traditional method we used to follow is one hot encoding. But nowadays using word vectors is a much better way to represent textual data for many reasons. We will discuss both these techniques in depth and compare them. First let’s discuss one hot encoding.

One Hot Encoding

In one hot encoding, we essentially represent each word as a vector of length V, where V is the total number of unique words available in the entire textual data. V is also called the vocabulary count. Each word’s one hot encoded vector is essentially a binary vector with the value 1 being in a unique index for each word and the value 0 being in every other index of the vector. Let’s visualize this.

Suppose our data consists of the following 2 sentences:

  • Deep learning is hard
  • Deep learning is fun

Here, the value of V is 5, because there are 5 unique words: [‘Deep’, ‘learning’, ‘is’, ‘hard’, ‘fun’]. Now to represent each word, we will use a vector of length 5.

One hot encoding example
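The encoding for these two sentences can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes NumPy is available, and the names `vocab`, `word_to_index` and `one_hot` are made up for this example.

```python
import numpy as np

sentences = ["Deep learning is hard", "Deep learning is fun"]

# Build the vocabulary in order of first appearance.
vocab = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

# Assign each unique word its own index.
word_to_index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)  # vocabulary count


def one_hot(word):
    """Return a length-V vector with a 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(V, dtype=int)
    vec[word_to_index[word]] = 1
    return vec


for word in vocab:
    print(f"{word:>8}: {one_hot(word)}")
```

Each printed vector has length 5 and contains exactly one 1, at the index assigned to that word.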

This is how we represent words as numbers using one hot encoding. Every word in the dataset has a corresponding one hot encoded vector which is unique. This is a simple and straightforward approach to converting all the words in a set of data into numbers, and it is one of the first methods implemented for this purpose. However, this method has many issues.

Firstly, the size of each word’s one hot encoded vector will always be V, which is the size of the entire vocabulary. In the example it was 5 but usually the value of V can reach 10k or even 100k. This means we need a huge vector just to represent a single word, which can lead to excessive memory usage while representing text as vectors.
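To get a rough feel for the numbers, here is a back-of-the-envelope calculation. It assumes each vector is stored densely with 4-byte (32-bit) values, and it compares against a 300-dimensional word vector, which is only one possible size. In practice a one hot vector can also be stored as just an index, so treat this as an illustration of the dense case.

```python
V = 100_000          # vocabulary count from the example above
N = 300              # assumed word vector size, for comparison only
BYTES_PER_VALUE = 4  # assumption: 32-bit values

one_hot_bytes = V * BYTES_PER_VALUE      # one dense one hot vector
word_vector_bytes = N * BYTES_PER_VALUE  # one dense word vector

print(f"one hot vector : {one_hot_bytes:,} bytes per word")
print(f"word vector    : {word_vector_bytes:,} bytes per word")
print(f"ratio          : {one_hot_bytes / word_vector_bytes:.0f}x")
```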

Secondly, the index which is assigned to each word does not hold any semantic meaning; it is merely an arbitrary value assigned to it. When we consider the vectors for two words, we would ideally want words of similar meaning to have similar vectors. For example, it would be good if the vectors for “Dog” and “Cat” were close to each other, since the words share a certain amount of similarity. However, with one hot encoded vectors, the vectors for “Dog” and “Cat” are exactly as far apart as those for “Dog” and “Computer”. Hence the neural network has to work really hard to understand each word, since the words are being treated as completely isolated entities. The usage of word vectors aims to resolve both of these issues.
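The lack of semantic information is easy to check numerically. The sketch below assumes a toy vocabulary of just three words and uses NumPy; the helper names are made up for this example. Any two distinct one hot vectors come out with the same cosine similarity and the same Euclidean distance.

```python
import numpy as np

vocab = ["Dog", "Cat", "Computer"]
# Row i of the identity matrix is the one hot vector for word i.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))


for other in ["Cat", "Computer"]:
    sim = cosine_similarity(one_hot["Dog"], one_hot[other])
    dist = euclidean_distance(one_hot["Dog"], one_hot[other])
    print(f"Dog vs {other:<8}: cosine={sim:.3f}, distance={dist:.3f}")
```

Both lines print the same values, so nothing in the encoding tells the network that “Dog” is closer to “Cat” than to “Computer”.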

One hot encoded vectors take up excessive memory and they do not hold any semantic information of the word. Word vectors aim to resolve both of these problems.

Word Vectors

Word vectors are essentially a much more efficient method of representing words. Word vectors take up much less space than one hot encoded vectors and they also hold semantic information regarding the word. These are the key features of word vectors. Before diving into the method of finding word vectors, we need to understand the one core principle behind the concept of word vectors:

Similar words occur more frequently together than dissimilar words.

Just think about this for a while. Does it make sense? When a word occurs within the vicinity of another word, it doesn’t always mean the two have a similar meaning, but when we consider how frequently words are found close together, we find that words of similar meaning often appear together.

For example, the word “Dog” will be found within the vicinity of the word “Cat” a lot more frequently than it will be found within the vicinity of the word “Computer”. This is because “Dog” shares a certain semantic similarity with “Cat”, so there will be many sentences which contain both “Dog” and “Cat”. This is the key factor which deep learning researchers have exploited to come up with word vectors. We will now discuss one of the first implementations of word vectors: Word2Vec by Google.

Word2Vec

In Word2Vec, there are two different architectures: CBOW and Skip Gram. We will first go through Skip Gram architecture, after which, understanding CBOW will be much easier.

The first thing we do for Word2Vec is to collect word co-occurrence data. Basically, we need a set of data telling us which words occur close to a certain word. We use something called a context window for doing this.

Let’s consider the sentence: “Deep Learning is very hard and fun”. We need to set something known as the window size. Let’s assume it’s 2 for this case. What we do is iterate over all the words in the given data, which in this case is just 1 sentence, and for each word consider a window of the words which surround it. Here, since our window size is 2, we will consider the 2 words before the word and the 2 words after the word. Hence for each word, we will get up to 4 words associated with it. We will do this for each and every word in the data and collect the word pairs. Let’s visualize this.

Context Window example

As you can see, for the words near the edges we have fewer words to consider, and hence the number of context words will be fewer than 4.

As we are passing the context window through the text data, we find all word pairs of target and context words to form a dataset in the format of (target word, context word). For the sentence above, it will look like this:

1st Window pairs: (Deep, Learning), (Deep, is)

2nd Window pairs: (Learning, Deep), (Learning, is), (Learning, very)

3rd Window pairs: (is, Deep), (is, Learning), (is, very), (is, hard)

And so on. At the end our target word vs context word data set is going to look like this:

(Deep, Learning), (Deep, is), (Learning, Deep), (Learning, is), (Learning, very), (is, Deep), (is, Learning), (is, very), (is, hard), (very, Learning), (very, is), (very, hard), (very, and), (hard, is), (hard, very), (hard, and), (hard, fun), (and, very), (and, hard), (and, fun), (fun, hard), (fun, and)

This can be considered as our “training data” for word2vec.
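The pair collection described above can be sketched as a small function. This is a minimal sketch which assumes the text is already split into words; `context_pairs` is a name made up for this example.

```python
def context_pairs(tokens, window_size):
    """Collect (target word, context word) pairs with a sliding context window."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the edges of the text.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "Deep Learning is very hard and fun".split()
pairs = context_pairs(tokens, window_size=2)

print(len(pairs), "pairs")
for pair in pairs[:5]:
    print(pair)
```

For the example sentence this yields the same 22 pairs listed above, in the same order.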

In the Skip Gram model, we try to predict each context word given a target word. We use a neural network for the prediction task. The input to the neural network will be a one hot encoded version of the target word, and the expected output is the one hot encoded version of the context word. Hence the size of the input and output layers is V (vocabulary count). This neural network has only 1 hidden layer in the middle, and the size of this hidden layer determines the size of the word vectors we wish to have at the end. We will consider this as 300 for now.

Let’s say V is 10k and one of the target, context word pairs in our dataset is (Deep, Learning). The word2vec training will look like this:

Skip Gram architecture in Word2Vec

Since this neural network has a total of 3 layers, there will be only 2 weight matrices for the network, W1 and W2. W1 will have dimensions of 10000*300 and W2 will have dimensions of 300*10000. These two weight matrices will play an important role in calculating word vectors and we will discuss them more later.
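To make the shapes concrete, here is a sketch of a single forward pass through such a network with NumPy. It makes a few assumptions which are not stated above: the weights are randomly initialized, the hidden layer has no activation function, the output layer applies a softmax, and the index chosen for “Deep” is hypothetical.

```python
import numpy as np

V, N = 10_000, 300  # vocabulary count and hidden layer size
rng = np.random.default_rng(0)

W1 = rng.normal(scale=0.01, size=(V, N))  # input layer  -> hidden layer
W2 = rng.normal(scale=0.01, size=(N, V))  # hidden layer -> output layer


def softmax(scores):
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()


target_index = 42  # hypothetical position of "Deep" in the vocabulary
x = np.zeros(V)
x[target_index] = 1.0  # one hot encoded target word

hidden = x @ W1               # shape (300,)
probs = softmax(hidden @ W2)  # shape (10000,), one probability per vocabulary word

print(W1.shape, W2.shape, hidden.shape, probs.shape)
```

During training, `probs` would be compared against the one hot encoded context word, here “Learning”.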

For our entire (target word, context word) dataset which we have collected from the original textual data, we will pass each pair into the neural network and train it. Essentially the task which the neural network is trying to do here is to guess which context words can appear given a target word. After training the neural network, if we input any target word into the neural network, it will give a vector output which represents the words which have a high probability of appearing near the given word.
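A toy version of this training procedure could look like the sketch below. It is not the original Word2Vec implementation: it assumes a full softmax output with cross-entropy loss and plain stochastic gradient descent, and it uses a tiny hidden layer of size 5 on the single example sentence so that it runs instantly. The learning rate and epoch count are arbitrary choices.

```python
import numpy as np

tokens = "Deep Learning is very hard and fun".split()
vocab = list(dict.fromkeys(tokens))  # unique words, in order of appearance
index = {word: i for i, word in enumerate(vocab)}
V, N = len(vocab), 5  # toy sizes, not the 10k x 300 used in the text

# (target index, context index) pairs from a context window of size 2.
pairs = [(index[tokens[i]], index[tokens[j]])
         for i in range(len(tokens))
         for j in range(max(0, i - 2), min(len(tokens), i + 3))
         if j != i]

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))
W2 = rng.normal(scale=0.1, size=(N, V))


def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


def average_loss():
    """Mean cross-entropy of the context word over all training pairs."""
    return float(np.mean([-np.log(softmax(W1[t] @ W2)[c]) for t, c in pairs]))


loss_before = average_loss()
learning_rate = 0.1
for epoch in range(200):
    for t, c in pairs:
        hidden = W1[t].copy()  # same result as one_hot(t) @ W1
        probs = softmax(hidden @ W2)
        error = probs.copy()
        error[c] -= 1.0  # gradient of the loss with respect to the output scores
        grad_W2 = np.outer(hidden, error)
        grad_hidden = W2 @ error
        W2 -= learning_rate * grad_W2
        W1[t] -= learning_rate * grad_hidden  # only the target word's row changes
loss_after = average_loss()

print(f"average loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The loss cannot drop to zero here, because each target word has several valid context words and the network can only spread probability across them.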

For CBOW the only difference is that we try to predict the target word given the context words. Essentially, we just invert the Skip Gram model to get the CBOW model. It looks like this:

CBOW Architecture in Word2Vec

Here when we give a vector representation of a group of context words, we will get the most appropriate target word which will be within the vicinity of those words. For example, if we give the sentence: Deep _____ is very hard, where [“Deep”, “is”, “very”, “hard”] represents the context words, the neural network should hopefully give “Learning” as the output target word. This is the core task the neural network tries to train for in the case of CBOW.
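A CBOW forward pass can be sketched in the same way. The text above does not say how the context words are combined into one vector; this sketch assumes their hidden layer vectors are averaged, which is a common choice. The weights here are random and untrained, so the predicted word is arbitrary.

```python
import numpy as np

vocab = ["Deep", "Learning", "is", "very", "hard", "and", "fun"]
index = {word: i for i, word in enumerate(vocab)}
V, N = len(vocab), 4  # toy sizes for illustration

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))
W2 = rng.normal(scale=0.1, size=(N, V))


def cbow_forward(context_words):
    """Return a probability for every vocabulary word being the target."""
    rows = W1[[index[word] for word in context_words]]  # one row per context word
    hidden = rows.mean(axis=0)  # assumption: average the context vectors
    scores = hidden @ W2
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


probs = cbow_forward(["Deep", "is", "very", "hard"])
predicted = vocab[int(np.argmax(probs))]
print("predicted target (untrained):", predicted)
```

After training on enough text, the hope is that the highest probability for this context would land on “Learning”.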

Now we know the task of the neural network in word2vec, but this still doesn’t help us get the word vectors we require. What we have to understand is that the neural network performs this task so that it distills and captures semantic information about words by training on their relationship with other similar words. As the neural network trains on all the target and context word pairs in the dataset, it will eventually start learning which words are semantically similar to which other words. This happens because of the phenomenon we discussed earlier, where similar words occur together more frequently than dissimilar words. The neural network will be exposed to (Dog, Cat) far more often than it sees (Dog, Computer), and so this semantic similarity is captured by it.

The neural network is capturing a lot of information regarding semantics of a given word, but how do we extract this information? Whenever a neural network is training, we can say that the weight matrices in between the layers store the “information” or “patterns” which the neural network has learnt. The values in the weight matrices don’t necessarily make any sense to us as humans but whatever a neural network learns is represented in the weights.

To extract the word vectors from the weights of the neural network, we generally consider W1, which is also called the “Embedding Layer”. There are some implementations which consider the weights of W2 as well but for now let’s look at how to get the vectors from the first weight matrix.

When we are training the neural network, the first thing which happens is the multiplication of the input layer with the first weight matrix. This is a visual of that operation:

Neural network first layer multiplication operation (Input Layer * W1)

If we represent the weights in matrix form, we get a matrix with 10k rows and 300 columns. Since the input is a one hot encoded vector ([0,0,…,1,…,0]) containing a single 1, multiplying it with the weight matrix keeps the values of exactly one row and multiplies every other row by 0. Hence the values in the hidden layer will simply be the values of this one row.

Hence, when we are using a particular word, let’s say the nth word of the vocabulary, we will only be affecting the nth row in the weight matrix, because all other rows are multiplied by 0. Therefore we can say that the nth row in the weight matrix contains all the trained information regarding the nth word, and this row will be the trained word vector for the nth word. This is how we extract word vectors from a trained word2vec neural network.
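This equivalence between the multiplication and a simple row lookup can be verified directly. In the sketch below the weight matrix is filled with random numbers as a stand-in for trained weights, and the word index `n` is hypothetical.

```python
import numpy as np

V, N = 10_000, 300
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))  # stand-in for the trained first weight matrix

n = 1234  # hypothetical index of some word in the vocabulary
x = np.zeros(V)
x[n] = 1.0  # one hot encoded input

hidden = x @ W1      # full multiplication with the weight matrix
word_vector = W1[n]  # direct lookup of the nth row

print(hidden.shape, np.allclose(hidden, word_vector))
```

This is why implementations usually skip the multiplication and simply index into the matrix, treating it as a lookup table of word vectors.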

That covers the fundamentals of word vectors. Word2Vec is just one of the several methods out there today used for generating word vectors for NLP and we will discuss more methods in the future. Here we have discussed how to preprocess text into a format compatible with neural networks, and in the next few articles on Deep NLP we will look at how we use these word vectors to perform several different tasks in NLP. Thanks for reading!