Original article was published on Deep Learning on Medium
Mathematics behind the training of Word2vec Model
In this article, we will understand the mathematics involved behind the training of the Word2Vec Algorithm.
In my previous article, We went through the basic idea of Word Embedding and its implementation using the genism library. People who are not aware of the basics of word embedding I strongly suggest them to have a look at my previous article here.
The training of Word2Vec Model can be done using two algorithms-
1-Skip-Gram Model (SG)
2-Continous Bag of Words (CBOW)
In this article, we will be covering the mathematics involved behind the SG model.
This model assumes that a word can be used to predict its surrounding words in a text corpus. The main idea behind this algorithm is that given a center word (Vc) it tries to predict the conditional probability of the neighboring words(Vw) and further tries to maximize that probability of occurrence. The number of neighboring words to consider depends upon the window size m. The figure below explains this idea using window size=2.
We assume that, given the central word, the context words are generated independently of each other. In this case, the probability formula for the co-occurrence of all the words will be as follows-
Defining a Loss Function:
The above probability calculation is just using some random word as a center in the whole corpus. But we will have to traverse across the whole corpus (T) to consider all the words as a center words one at a time and predict its respective surrounding words for a window size of m. The maximum likelihood function can be written as follows-
As we know it always easy to work with LogLikelihood functions in terms of gradient calculations, therefore we take Log on both sides and convert it into a logarithmic form.
Now let us define some notations as we will use these from here onwards-
Vc-Vector representation of the center word.
Uo-Vector Representation of the context word.
The conditional probability of generating the context word o for the given central target word c can be obtained using the softmax function. In order to convince yourself to you can think that if two words are the same then the similarity will be one and probability will be 1.
This softmax converts it into a probability by normalizing it with over the whole vocabulary V. The dot product represents the similarity between two-word vectors.
Now let’s write the final cost function after substituting the expression for probability.
Our main objective is to find the vector representation of every single word in the text in a reduced d dimensional space. The trick here is each word w will have two different representations one Vw when word w is a center word and another Uw when word w is a context word. So the parameter ϴ about which we discussed before will be 2V x d dimensional vector, where each row represents a different word in the vocabulary with its d dimensional vector representation across the columns.
Now we have two vector representations for each word. In order to train our model, we will be taking derivatives with respect to Vw and Uw and update these vectors using gradient descent. Vc is basically the same as Vw. We have just changed the notations to avoid any confusion. According to the Matrix ϴ, we only have two parameters Vc and Uw.
Gradient with Respect to Vc:
Step 2: Applying the Chain rule for Term 2.
Gradient with Respect to Uw:
Here w is the context word. So, there can be two cases when o is the context word w and when o is not the context word w. So, we will be splitting gradients into two parts to compute the gradient with respect to Uw. Following the same kind of calculations as we did above for Vc we can write the gradient w.r.t Uw using symmetry.
O ≠ W:
O = W:
Now we will update the computed gradients using the gradient descent algorithm for the whole corpus. The pseudo-code for the algorithm is given below
However, there is one problem in this approach. As we can see, in the denominator in the gradient, we have to take the exponential of the dot product of all our words and this is very time consuming when we have a huge vocabulary. We will need to train billions of weights which is computationally expensive. Therefore, we use a technique called negative sampling to tackle that. In this technique, we only take a small number of words from the vocabulary based upon random selection from unigrams distribution. I am not going into those details for this article.
I hope this article gave you a basic sense of mathematics that is happening behind the Word2vec model. This article might seem a bit gloomy due to too many pictures embedded in it, but if you go step by step I believe it should be really helpful. It would be great if you guys can give me some feedback so that I can present it in a better way next time.
Thank you for reading!!!!
If you like my work and want to support me:
1-The BEST way to support me is by following me on Medium.
2-Follow me on LinkedIn.