My Machine Learning Diary: Day 78

Source: Deep Learning on Medium

I’m still working on week 2 of the Sequence Models course. Today I learned about the neural language model and some optimization techniques for Word2Vec. I also completed the quizzes and assignments from the second and third courses of Andrew Ng’s Deep Learning Specialization. Here and here are the certificates :)

Word2Vec Tutorial Part 2 - Negative Sampling
What is hierarchical softmax?
How does sub-sampling of frequent words work in the context of Word2Vec?

Neural Language Model

The neural language model is a predecessor of the word2vec algorithm.

Neural language model

In this model, the previous four words serve as the context for predicting the target word.

Word2Vec Optimization

Let’s look at some modifications that improve word2vec’s performance.

Hierarchical Softmax

Recall that the skip-gram model looks as follows:

skip-gram model

To compute y-hat, we need to apply a softmax as follows:

p(t | c) = exp(theta_t^T e_c) / sum_{j=1..V} exp(theta_j^T e_c)

where e_c is the embedding of the context word c, theta_t is the output vector for the target word t, and V is the vocabulary size. Computing the denominator is very expensive because it sums over the entire vocabulary. One way to solve this issue is to use hierarchical softmax.
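To see why the denominator is the bottleneck, here is a minimal NumPy sketch of the full softmax. The sizes and names are illustrative (not from the course): every training example requires a dot product with all V output vectors.

```python
import numpy as np

# Toy sizes; real vocabularies run from tens of thousands to millions of
# words, which is what makes the denominator expensive.
V, d = 10000, 50                  # vocabulary size, embedding dimension
rng = np.random.default_rng(0)

theta = rng.normal(size=(V, d))   # output-side vectors, one per word
e_c = rng.normal(size=d)          # embedding of the context word

logits = theta @ e_c              # O(V * d) work per training example
logits -= logits.max()            # subtract max for numerical stability
probs = np.exp(logits) / np.exp(logits).sum()
```

`probs` is a valid distribution over all 10,000 words, but we paid for every one of them just to score a single target.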

Regular Softmax (Left) and Hierarchical Softmax (Right) (Source)

The idea of hierarchical softmax is to use a tree-like structure and perform binary classification at each node. This can reduce the time complexity from O(V) down to O(log(V)).
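The tree trick can be sketched concretely. Below is a toy 4-word vocabulary with a hand-built binary tree (word2vec actually uses a Huffman tree, and all names here are assumptions for illustration): each word's probability is the product of sigmoid decisions along its path, so only O(log V) node vectors are touched, and the leaf probabilities still sum to 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 5
rng = np.random.default_rng(1)
# One learned vector per *internal* node instead of one per word.
node_vecs = {"root": rng.normal(size=d),
             "left": rng.normal(size=d),
             "right": rng.normal(size=d)}

# Path to each leaf word: (node, direction) pairs; +1 = go left, -1 = go right.
paths = {
    "cat": [("root", +1), ("left", +1)],
    "dog": [("root", +1), ("left", -1)],
    "car": [("root", -1), ("right", +1)],
    "bus": [("root", -1), ("right", -1)],
}

def p_word(word, e_c):
    """P(word | context) = product of binary decisions along the path."""
    p = 1.0
    for node, direction in paths[word]:
        p *= sigmoid(direction * node_vecs[node] @ e_c)
    return p

e_c = rng.normal(size=d)
total = sum(p_word(w, e_c) for w in paths)
```

Because sigmoid(x) + sigmoid(-x) = 1 at every split, `total` comes out to 1 without ever computing a V-term denominator.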

Word Pairs

The idea is that sometimes two words are combined and mean something completely different. For example, “Boston Globe” (a newspaper) has nothing to do with the meanings of “Boston” and “Globe”. In this case, we want to treat “Boston Globe” as one word.
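One common way to find such pairs (described in the word2vec paper) is a bigram score: count(a b) minus a discount, divided by count(a) times count(b); pairs above a threshold get merged into one token. A small sketch with a made-up corpus:

```python
from collections import Counter

corpus = ("the boston globe reported that the boston globe "
          "was sold , and the globe kept printing").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def phrase_score(a, b, delta=1.0):
    # delta discounts very rare pairs so they don't score high by accident
    return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])

# "boston" is always followed by "globe" here, so the pair scores higher
# than a coincidental pair like ("the", "globe").
print(phrase_score("boston", "globe"))  # 1/6
print(phrase_score("the", "globe"))     # 0.0
```

In practice this pass is run over the corpus a few times so that longer phrases like “new york times” can form from already-merged pairs.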


If we directly sample word pairs out of the corpus, we are more likely to have pairs that contain the words like “a”, “the”, and “of”. The problem is that the word pair (“apple”, “the”) doesn’t really tell us anything about the meaning of the word “apple” and we don’t want these kinds of word pairs to clutter our training examples.

The solution is to subsample the corpus before extracting word pairs. Concretely, we keep each occurrence of a word w with a probability defined as follows:

P(w) = (sqrt(z(w) / 0.001) + 1) * 0.001 / z(w)

where z(w) is the fraction of the total corpus made up of the word w.

The graph looks as follows:

subsampling rate graph of function f(z(w)) (source)

The more frequently a word appears in the corpus, the less likely it is to be kept.

After subsampling, for example, “the quick brown fox jumped over the lazy dog” will become “quick brown fox jumped lazy dog”.
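The keep probability is easy to evaluate directly. A small sketch (0.001 is the commonly cited default threshold; the exact frequencies below are made up for illustration):

```python
# z is the fraction of the corpus made up of the word; t is the threshold.
def keep_prob(z, t=1e-3):
    return ((z / t) ** 0.5 + 1) * t / z

# Rare words get a probability >= 1, i.e. they are always kept; a very
# frequent word like "the" (say 2.7% of all tokens) is mostly dropped.
for z in (1e-5, 1e-3, 0.027):
    print(f"z = {z:<6} -> keep with prob {min(1.0, keep_prob(z)):.3f}")
```

Note the formula is meant for large corpora, where frequent words like “the” still make up only a few percent of all tokens.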

Negative Sampling

Another problem with word2vec is that updating all the weights during backpropagation takes too much time. The idea of negative sampling is to update only a small number of output neurons per step. For example, say the expected output is [1, 0, 0, …, 0]. We first update the first node, because it corresponds to the positive pair, and then randomly choose 5 negative pairs to update.
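One SGD step can be sketched as follows. This is an assumed minimal implementation, not the course's code: the negatives are drawn from the unigram distribution raised to the 3/4 power (as in the word2vec paper), and only 1 + k rows of the output matrix get touched instead of all V.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10000, 50, 5                    # vocab size, dim, negatives per step

W_in = rng.normal(0, 0.1, size=(V, d))    # input (center-word) embeddings
W_out = rng.normal(0, 0.1, size=(V, d))   # output (context-word) embeddings

# Negative-sampling distribution: unigram counts raised to the 3/4 power.
counts = rng.integers(1, 100, size=V)
p_neg = counts ** 0.75
p_neg = p_neg / p_neg.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(center, context, lr=0.05):
    neg = rng.choice(V, size=k, p=p_neg)       # k randomly drawn negatives
    rows = np.concatenate(([context], neg))    # 1 positive + k negative rows
    labels = np.array([1.0] + [0.0] * k)
    v = W_in[center]
    scores = sigmoid(W_out[rows] @ v)          # k+1 binary classifiers
    err = scores - labels                      # gradient of the logistic loss
    W_in[center] -= lr * err @ W_out[rows]     # update one input row...
    W_out[rows] -= lr * np.outer(err, v)       # ...and only k+1 output rows

sgd_step(center=42, context=7)
```

Each step now costs O((k + 1) · d) instead of O(V · d), which is what makes training tractable on large vocabularies.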

That’s it for today.