Source: Deep Learning on Medium
BERT is a major milestone in creating vector representations for sentences. But instead of telling the exact design of BERT right away, we will start with word embedding that eventually leads us to the beauty of BERT. If we know the journey, we understand the intuitions better and help us to replicate the success in solving other problems. Since word embedding is a cornerstone for deep learning (DL) NLP, our first article will focus on it first.
Some words often come in pairs, like nice and easy or pros and cons. So the co-occurrence of words in a corpus can teach us something about its meaning. Sometimes, it means they are similar or sometimes it means they are opposite.
In word embedding, who you associated with tell you who you are.
We can describe the word “hen” as:
Intuitively, we can create a list of properties in describing a word. However, it will be impossible to define a universal set of properties manually that accommodates all the words in the vocabulary. Word Embedding is a Deep Learning DL method in deriving vector representations for words. For example, the word “hen” can be represented by a 512D vector, say (0.3, 0.2, 1.3, …). Conceptually, if two words are similar, they should have similar values in this projected vector space.
If we encode a word with a one-hot vector, a vocabulary of 40K words requires a 40,000-D vector. In this vector, only one component equals one while others are all zero. This non-zero component identifies a unique word. That is very un-efficient but it is a good start to find a denser representation.
Let’s enforce another constraint. We project this one-hot vector into this denser representation using linear transformation. i.e. to create the vector h, we multiply the one-hot vector with a matrix. The vector h is in a much lower dimension and we are projecting the one-hot vector into this vector space.
Given two words wᵢ and wₒ below, we want them to be as close as possible in the projected space if they are similar.
We can conceptualize the problem slightly differently. Let’s start with the word wᵢ. We first compute its vector representation with an embedding matrix. Then, we multiply it with another matrix to predict another word similar to it, say wₒ. The output will not be a one-hot vector. But we can run a softmax to find the most likely word. This creates a nice concept in linking words that are related to wᵢ.
What does it buy us? Manual labeling of related words is expensive. Instead, we parse a text corpse and use the co-occurrence words within a context window (say within 2 words range) as our training data. Our key focus is to learn the first embedding matrix to create a dense vector representation for a word.
But there are two possible ways to do it.
The first one is the skip-gram model. Given a word, can we predict its neighboring words in a text corpse? Say, we use a 5-grams model (5 consecutive words). Given the word “Patriots”, can we predict the neighbor words with the training data like:
New England Patriots win 14th straight regular-season game at home in Gillette stadium.
In the diagram below, we fit the one-hot vector of the word “Patriots” in the word embedding model. It produces 4 predictions about its possible neighbors.
The log-likelihood for the predicted words given the target word t (“Patriots”) will be:
For this 5-gram model, we want to predict these 4 words on the right.
To calculate the probability p(wₒ | wᵢ), we locate the corresponding row and column entries related to wᵢ and wₒ in the corresponding matrix. Then, the conditional probability can be computed as:
The numerator measures the similarity using a dot product. We train the model such that two similar words should produce the maximum dot product value. The denominator adds up all scores together to renormalize the numerator to a probability value. In a nutshell, for similar words, we move their vector representation closer. We want this pair to have the largest similarity over other combinational pairs involving wI.
Continuous Bag-of-Words (CBOW)
The second probability is CBOW. Given the context, we want to predict the target word instead. For example, given “New”, “England”, “win” and “14th”, we want to predict the target word “Patriots”.
Let’s delay the discussion on the training for a second and examine these trained vectors first. Since it is too hard to visualize vectors in high dimensional space, we use PCA to project it into a 2-D space. The diagram plots some of the words in this 2-D space. One important observation is that this process can discover word relations with simple linear algebra!
This is the charm of word embedding because we create a simple linear mathematical model to manipulate words semantically. If we know the vector representations of Poland, Beijing, and China, we can answer questions like what is the capital of Poland. This linear behavior is mainly contributed by the use of matrix (a linear model) in projecting words into a dense space.
Next, examine the cost function for training closely.
Assume that we are using a bigram (2-gram) model, the cross-entropy between the ground truth and the predictions will be.
As shown in the equation above, the term in the middle wants to maximize the score between the word pair we observe (the numerator) while minimizing the scores between other pairs involving wI (the denominator). The gradient of the loss function is:
where we can draw samples from distribution Q (i.e. with distribution p(wi|wI)) to estimate the second term. This is good news because we find an estimation method instead of computing the exact value for all possible word pairs with wI.
Noise Contrastive Estimation (NCE)
If we treat the training as a logistic regression problem, the loss function of the word embedding is:
i.e. we want the ground truth to be classified as true while the others to be false. This is similar to the cross-entropy and the loss function becomes:
(We will not overwhelm you with the proof in this or the next section. But the proofs can be found here if you are interested.)
Sampling from Q (p(wi|wI)) is not that simple. For some less common word pairs, we need a large corpse to make the estimation more accurate. There is even a chance that a legitimate word pair may not exist in the training data set. But in the equations above, we can simplify Q to q where q is the word distribution of a single word according to its occurrence ranking in the corpse. Since it depends on a single word only, it is easier to estimate using fewer corpse data.
Negative Sampling (NEG)
NEG is a variant of NCE where we apply a logistic function to the relevancy score. So instead of handling it as a regression problem (estimating the conditional probability), we treat it as a classification problem.
The corresponding objective function becomes:
This is the function used in the word embedding training. In the next few sections, we will discuss a few implementation details.
Subsampling of Frequent Words
To choose the word wI in the training set as the next training data, we can pick sample data using the equation below:
Obviously, we pick words with higher frequency.
Here are different tradeoffs and decision choices for the word embeddings. For example, should we use skip-gram or CBOW? Here are some suggestions from the Google team.
GloVe (Global Vectors)
GloVe is another word embedding method. But it uses a different mechanism and equations to create the embedding matrix. To study GloVe, let’s define the following terms first.
And the ratio of co-occurrence probabilities as:
This ratio gives us some insight on the co-relation of the probe word wk with the word wᵢ and wⱼ.
Given a probe word, the ratio can be small, large or equal to 1 depends on their correlations. This ratio describes the correlations of three different words without suffering the sparse problem in a 3-gram (some legitimate 3-word combinations may not be found in the corpse).
Now, we want to develop a model for F given some desirable behavior we want for the embedding vector w. As discussed before, linearity is important in the word embedding concept. So if a system is trained on this principle, we should expect that F can be reformulated as:
where we just need to compute the difference and the similarity of word embedding in F.
In addition, the probe word and wᵢ (or wⱼ) should be interchangeable. i.e. their relation is symmetrical. (a.k.a. relation(a, b) = relation(b, a)). To enforce such symmetry, we want
To fulfill this relation, F(x) would be an exponential function, i.e. F(x) = exp(x). Combine the last two equations, we get
Since F(x) = exp(x),
We can absorb log(Xᵢ) as a constant bias term since it is invariant of k. But to maintain the symmetrical requirement between i and k, we will split it into two bias terms above. This w and b form the embedding matrix. Therefore, the dot product of two embedding matrices predicts the log co-occurrence count.
Let’s understand the concept through matrix factorization in a recommender system. The vertical axis below represents different users and the horizontal axis represents different movies. Each entry shows the movie rating a user gives.
This can be solved as a matrix factorization problem. We want to discover the hidden factors for the users and movies. This factor describes what a user likes or what the hidden features (like the genre) a movie will be. If their factors match, the movie rating will be high. For example, if a user likes romantic and old movies, they will match well with the movie “When Harry Met Sally”. The vector representations for the user and the movie should produce high value for their dot product.
Therefore the rating matrix holding all users and movies can be approximated as the multiplication of the users’ hidden features and the movies’ hidden features (matrix Z holding the hidden factors of all users and w hold the hidden factors for all movies).
In GloVe, we measure the similarity of the hidden factors between words to predict their co-occurrence count. Viewed from this perspective, we do not predict the co-occurrence words only. We want to create vector representations that can predict their co-occurrence counts in the corpse also.
Next, we will define the cost function. We will use the Mean Square Error to calculate the error in the ground truth and the predicted co-occurrence counts. But since word pair have different occurrence frequency in the corpus, we need a weight to readjust the cost for each word pair. This is the function f below. When the co-occurrence count is higher or equal a threshold, say 100, the weight will be 1. Otherwise, the weight will be smaller, subject to the co-occurrence count. Here is the objective function in training the GloVe model.
Now, we are done with the word embedding.
Word embedding encodes words. But it does not account for its word context. Next, we will look at vector representations for sentences that can be used for many NLP tasks.