Why word2vec works?



Generate Embeddings (Word2vec) as multi-label classification.

This part we cover the motivation behind formulating it as a muti-label classification.

  1. Introduction

Word2vec and word embeddings are popular terms in the field of Natural Language Processing and Deep Learning in last 3 years. It is of no surprise that, these are gaining more popularity. The ability of a computer saying “man” is to “woman” as “king” is to “queen” is very interesting and exciting [1].

This tutorial is not supposed to be another explanation of how word2vec works, no detailing about skip-gram, CBOW etc. but a simplification of how word2vec or word embeddings can be simplified to basic classification approach, without much overhead. I hope intuitively, it might help people to gain a simple and clarified understanding of how word embeddings are generated.

“ You shall know a word by the company it keeps — J. H. Firth ”.

This statement has all the idea behind word2vec. A simple example is, people working in “data science” might have similar interest, from a global view. The reason is, the tool-sets, algorithms, etc these people are dealing with appears to be same. That is why, they appears similar. Expand this analogy, to NLP. Assume that people in data science community can be “words” , and the toolsets,algorithms they are dealing with can be considered as the “context” or “neighbor” or “window words”. This context, is what gives the notion of similarity to words.

2. How backpropagation helps to get good word embeddings?

The following sections are bit technical oriented, and I am expecting readers are familiar with word2vec, softmax, backpropogation.

We will be using the following corpus to explain how it works.

“ I read magazine weekly . 
I read newspaper daily . I like to read books daily . “

Assume that our window size = 5 , which means for each word we can see 5 neighboring words to the left and right. and we are treating each sentence separated by a “.” delimiter. We are going to generate training pairs from the above corpus. The following snippet explains how to generate it.

Some training pairs looks like

('magazine', 'weekly')
Vocab index 2 3
('magazine', 'read')
Vocab index 2 1
('magazine', 'magazine')
Vocab index 2 2
('weekly', 'weekly')
Vocab index 3 3
('weekly', 'magazine')
Vocab index 3 2
('i', 'read')
Vocab index 0 1
('i', 'newspaper')
Vocab index 0 4
('i', 'daily')
Vocab index 0 5
('read', 'i')
Vocab index 1 0
('read', 'read')
Vocab index 1 1
('read', 'newspaper')
Vocab index 1 4
('read', 'daily')
Vocab index 1 5
('newspaper', 'read')
Vocab index 4 1
('newspaper', 'newspaper')
Vocab index 4 4
('newspaper', 'daily')
Vocab index 4 5

Given an input word, we have to predict the neighbor words, which means we have to map the input observation to the output class, so the total number of output classes equals the cardinality of the vocabulary.

Here (magazine,weekly) is one training pair, and 2, 3 are the indexes of these words in the vocabulary. Now, to formulate it as a classification problem, we need to generate features for the input words. We are initializing all the input training data with, some random features. Random features means, we are assigning some random numbers of fixed length ( it can be 100, 200, 512 , 128 etc ). These fixed length is called embedding dimension.

What all word embedding model tries to do is, starting off with some random features for all the words, and on the course of training, tune or change these features, to make it more meaningful. The meaningful or useful information are integrated into this words/features, by neighbor or context words, with the help of backpropogation.

Now the problem boils down to as follows. This is just a dummy code to explain, what is actually happening inside.

embedding_dimension = 100
n_classes = len(vocab)
X = np.random.uniform(-1,1, (len(vocab), embedding_dimension)) # input features for X 
W = np.random.uniform(-1,1, (len(vocab), embedding_dimension)) # output matrix ( weights )
for each_pair in target_words_index:

inp_pair_index = each_pair[0]
op_pair_index = each_pair[1]

x_input = random_embeddings[inp_pair_index] # 1 x embedding_dimension

probs = softmax(np.dot(W, x_input)) # 1 x n_classes

####### loss(probs, op_pair_index)
####### backpropagate , update W

This is what happens in word2vec at higher level. But,

The crux of the algorithm is unlike other classification problems, instead of finding the feature matrix by updating it, word2vec also updates the inputs . This is a notable difference, and this is where backpropagation plays its role. Two words are similar, or ended up similar to each other, because the information by which these input words are getting updated, is same. As they have similar context or neighbor words, these context words are like information carriers.

3. Is softmax a good option ?

  1. Computational perspective

The complexity of above problem lies in the fact that, assume if we are using Wikipedia data as training corpus, now the total number of unique words ( vocabulary ) will be at least 2–3 million. For each training pairs, we have to calculate the probability over 2–3 million, and it is very very expensive. To overcome this, in word2vec they have used negative sampling which is an approximated function of/for softmax, and computationally very fast.

2. Quality perspective

When we are having same training data, occurs multiple times with different labels , eg: ( magazine , read ) , ( magazine , weekly ) etc. if we are using softmax, we are boosting one class and that is the purpose of softmax. From, this standpoint the entire formulation can go wrong, but the purpose of word2vec is not classification. That is all what that matters.

Is softmax the right loss function, when training data looks like a multi-label one?

If softmax is not the right loss fucntion for training data with multiple labels, how come word2vec does it right. All the matters is the objective, which in the case of word2vec is not doing classification, but generating quality word embeddings by using the backpropagation ( gradients ) that comes from the classification like ( negative sampling ), and use that to flow towards the embedding.

4. How gradients makes word embeddings work?

Have a look https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py lines 417– 444.

The above is a very naive and simple implementation of forward pass and backward pass of negative sampling. The idea is simple. Consider the training pair ( magazine, weekly ).

probs = softmax( np.dot( X[magazine], W)) # 1 x n_classes
prediction_prob = 0.3 # probability of the output class
will give a vector of probabilities. Now, assume, for word weekly, probability is 0.3. If model was good, or great the maximum it would have predicted was 1.0, which means 0.3 needs to travel a bit to reach to 1.0 .How, can we make the model do that, we have to update W. We need to calculate the error. 
1. error = 1.0 - 0.3 # actually the reverse in practice, this is intuition
2. Update W. pass this error backwards to tune W. (W = W - dW*error)
3. in the case of embeddings, use that updated information from W, to X and update it. dX =
Repeat this. when my prediction_prob = `1.0, error = 0 ( 1.0 - 1.0 ), nothing will get updated, because model has learned what it wants to learn. 
#### in practice its hard, one or other-way, some gradients from different training data keeps on updating it, but in negative sampling this chance is less, and thats the beauty of negative sampling. Have a look at gensim source code.

The idea behind backpropagation is, if we are learning a new topic, first time we might have learned some part. Next time when we are reading, we need not to read too much about what we have learned first time, we have to focuss mostly on the remaining part. For these, we need to have a measure ( which we call as error). This is the measure of what amount of information I needs to update and this process repeats, when I learned everything perfectly, my error is 0. And I am done, nothing needs to update, I have already learned everything. This is different ways of have intuition.

If 2 words are similar, for eg: magazine and newspaper , that means they have common neighbors. They ended up in having similar embeddings, because the weights ( vectors , from W ) by which their input embeddings ( X in the above code snippet ) gets updated, is almost same or similar.

In the next part we will implement multi-label classification for word embeddings in Tensorflow and compare it with word2vec

References

https://www.technologyreview.com/s/541356/king-man-woman-queen-the-marvelous-mathematics-of-computational-linguistics/

https://www.tensorflow.org/tutorials/representation/word2vec

Source: Deep Learning on Medium