Source: Deep Learning on Medium
I’m still working on week 2 of the Sequence Models course. Today I learned about the neural language model and some optimization techniques for Word2Vec. I also completed the quizzes and assignments for the second and third courses of Andrew Ng’s Deep Learning Specialization. Here and here are the certificates :)
Word2Vec Tutorial Part 2 - Negative Sampling
Neural Language Model
The neural language model is an earlier precursor of the word2vec algorithm.
In this model, the previous four words are taken as the context for predicting the target word.
Let’s look at some modifications we can make to word2vec to improve its performance.
Recall that the skip-gram model looks as follows:
To compute y-hat, we need to perform a softmax as follows:
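The formula here appeared as an image in the original post. Reconstructed in the notation of the course (e_c is the embedding of the context word c, θ_t is the output-layer vector for target word t, and V is the vocabulary size), the softmax is:

```latex
p(t \mid c) = \frac{\exp(\theta_t^\top e_c)}{\sum_{j=1}^{V} \exp(\theta_j^\top e_c)}
```

The denominator sums over every word in the vocabulary, which is what makes it so expensive.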
It is computationally very expensive to compute the denominator above. One way to solve this issue is to use hierarchical softmax.
The idea of hierarchical softmax is to use a tree-like structure and perform binary classification at each node. This can reduce the time complexity from O(V) down to O(log(V)).
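To make the tree idea concrete, here is a minimal sketch of how a word's probability is computed as a product of binary decisions along its root-to-leaf path. The vectors and tree layout are illustrative assumptions, not from the post:

```python
import math

def hierarchical_softmax_prob(h, path):
    """Probability of one word as a product of binary decisions.

    path: list of (node_vector, direction) pairs from the root to the
    word's leaf, with direction = +1 for one branch and -1 for the other.
    Each inner node performs a logistic binary classification, so only
    O(log V) dot products are needed instead of a V-way softmax.
    """
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    p = 1.0
    for node_vec, direction in path:
        dot = sum(a * b for a, b in zip(node_vec, h))
        p *= sigmoid(direction * dot)
    return p
```

Because sigmoid(x) + sigmoid(-x) = 1 at every node, the probabilities of all leaves in the tree still sum to 1, so this is a valid replacement for the full softmax.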
Another modification is to learn common phrases. The idea is that sometimes two words combined mean something completely different. For example, “Boston Globe” (a newspaper) has nothing to do with the individual meanings of “Boston” and “Globe”. In this case, we want to treat “Boston Globe” as one word.
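One common way to find such phrases, following the scoring used in the original word2vec work, is to score adjacent word pairs by co-occurrence and merge pairs whose score exceeds a threshold. A minimal sketch, where the toy corpus and the discount δ are illustrative assumptions:

```python
from collections import Counter

def phrase_scores(tokens, delta=1):
    """Score adjacent bigrams: score = (count(a,b) - delta) / (count(a) * count(b)).

    Higher scores mean the two words appear together far more often
    than their individual frequencies alone would suggest.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (c - delta) / (unigrams[a] * unigrams[b])
        for (a, b), c in bigrams.items()
    }

# Toy corpus: "boston globe" always co-occurs; "the" pairs with everything.
corpus = ("the boston globe reported that the weather in boston is cold "
          "and the boston globe printed the story").split()
scores = phrase_scores(corpus)
```

Pairs like ("boston", "globe") score higher than pairs involving very common words like "the", so they get merged into a single token before training.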
If we directly sample word pairs out of the corpus, we are more likely to get pairs that contain common words like “a”, “the”, and “of”. The problem is that a word pair like (“apple”, “the”) doesn’t really tell us anything about the meaning of the word “apple”, and we don’t want these kinds of word pairs to clutter our training examples.
The solution is to subsample the corpus before extracting word pairs. Concretely, we keep each word with a probability defined as follows:
The graph of this keep probability looks as follows:
The more frequently a word appears in the corpus, the less likely it is to be kept.
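The keep-probability formula appeared as an image in the original post. As a sketch, the formula used by the word2vec C implementation (with z(w) the fraction of corpus occurrences that are word w, and a sample threshold of 0.001) looks like this:

```python
import math

def keep_probability(word_fraction, sample=0.001):
    """Probability of keeping a word during subsampling.

    word_fraction: z(w), the fraction of the corpus that is this word.
    Follows the word2vec C code: P(w) = (sqrt(z/s) + 1) * s / z,
    capped at 1.0 so rare words are always kept.
    """
    p = (math.sqrt(word_fraction / sample) + 1) * sample / word_fraction
    return min(p, 1.0)
```

For example, a word making up 2% of the corpus (like “the”) is kept well under a third of the time, while a word appearing once per 100,000 tokens is always kept.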
After subsampling, for example, “the quick brown fox jumped over the lazy dog” might become “quick brown fox jumped lazy dog”.
Another problem with word2vec is that updating all of the output weights during backpropagation takes too much time. The idea of negative sampling is to update only a small number of output neurons per training pair. For example, say the expected output is [1, 0, 0, …, 0]. We first update the first node, because that’s the positive pair. Then we randomly choose 5 negative pairs to update, leaving the rest of the weights untouched.
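A minimal sketch of one negative-sampling update with NumPy; the vocabulary size, embedding dimension, learning rate, and uniform sampling of negatives are all illustrative assumptions (the real word2vec draws negatives from a unigram^(3/4) distribution):

```python
import numpy as np

def negative_sampling_update(in_vecs, out_vecs, center, positive,
                             negatives, lr=0.025):
    """One SGD step: push the positive pair's score toward 1
    and each sampled negative pair's score toward 0.

    Only 1 + len(negatives) rows of out_vecs are touched,
    instead of all V rows a full softmax gradient would require.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h = in_vecs[center]
    grad_h = np.zeros_like(h)
    for idx, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(out_vecs[idx] @ h)
        err = score - label               # gradient of the logistic loss
        grad_h += err * out_vecs[idx]
        out_vecs[idx] -= lr * err * h     # update only this output row
    in_vecs[center] -= lr * grad_h        # update the input embedding

rng = np.random.default_rng(0)
V, d = 1000, 50
in_vecs = rng.normal(scale=0.1, size=(V, d))
out_vecs = rng.normal(scale=0.1, size=(V, d))
negatives = list(rng.integers(0, V, size=5))  # illustrative: uniform sampling
negative_sampling_update(in_vecs, out_vecs, center=3, positive=7,
                         negatives=negatives)
```

Each step therefore costs O(k·d) for k negatives instead of O(V·d), which is what makes training on large vocabularies practical.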
That’s it for today.