Neural Networks & Word Embeddings

Source: Deep Learning on Medium

Most modern NLP techniques are built on neural networks. Word embeddings are numerical vector representations of words, sentences and (sometimes) whole documents. Embeddings allow us to compare text, visualize it, and improve the accuracy of models built on top of them. Stanford’s Natural Language Processing with Deep Learning course is a great course that I’ll follow to get started. For the projects, I’ll complete all 3 coding parts in PyTorch (word2vec and dependency parsing).

Representing words by their context

[Lecture 1 — Stanford NLP with Deep Learning — Introduction and Word Vectors].

Distributional semantics

Distributional semantics: A word’s meaning is given by the words that frequently appear close by. In the words of the linguist J. R. Firth, “You shall know a word by the company it keeps”.

Word Embeddings (a.k.a. Word vectors)

We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts [here’s a visual video snippet]. Word vectors are a distributed representation. In practice, 50 is about the minimum dimensionality; 300 (trainable on a laptop), 1000, 2000, and 4000 are also common vector sizes.


Word2vec is a framework for learning word vectors. It’s a very simple and scalable way of learning vector representations of words. Ultimately, we want a model that gives a reasonably high probability estimate to all words that occur in the context (fairly often).

Main Idea/Summary:

  1. Iterate through each word of the whole corpus
  2. Predict surrounding words using word vectors
  3. Update vectors so you can predict well

More Detailed Steps:

  • Start with a large corpus of text (with actual sentences).
  • Every word in a fixed vocabulary is represented by a vector. Start with a random vector for each word.
  • Go through each position t in the text, which has a center word c and outside context words o.
  • Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa).
  • Keep adjusting the word vectors to maximize this probability.
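The "go through each position t" step above can be sketched in a few lines of plain Python. This is only an illustration: the function name `training_pairs`, the window size, and the toy sentence are my own assumptions, not part of the course material.

```python
def training_pairs(tokens, window=2):
    """Yield (center, outside) word pairs for every position t in the text."""
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # the center word is not its own context
                yield center, tokens[j]


# Toy corpus (hypothetical); a real corpus would have millions of tokens.
tokens = "the quick brown fox jumps".split()
pairs = list(training_pairs(tokens, window=1))
print(pairs)
```

Each pair is one training example: the model is asked to give a high probability to the outside word given the center word.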

The question now is: how do we calculate the probability of a context word given the center word? [Video Snippet]

We will use two vector representations per word w: one vector (v) for when the word is the center word and we are predicting other words, and a second vector (u) for when it is a context word.
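To make the two-vector idea concrete, here is a minimal sketch that turns dot products into a probability with a softmax over the vocabulary, P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c). The softmax form is my assumption of the standard formulation, and the vectors in `U` and `V` are hand-picked toy values, not trained embeddings.

```python
import math


def prob_outside_given_center(o, c, U, V):
    """P(o | c) = exp(u_o . v_c) / sum over w of exp(u_w . v_c)."""
    v_c = V[c]
    scores = {w: sum(a * b for a, b in zip(u_w, v_c)) for w, u_w in U.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {w: math.exp(s - m) for w, s in scores.items()}
    return exp_scores[o] / sum(exp_scores.values())


# Toy 2-d vectors (hypothetical values, chosen by hand).
V = {"bank": [1.0, 0.5], "football": [-0.5, 1.0], "withdrawal": [0.8, 0.1]}  # center vectors v
U = {"bank": [0.9, 0.2], "football": [-0.8, 0.9], "withdrawal": [1.0, 0.4]}  # context vectors u

p_withdrawal = prob_outside_given_center("withdrawal", "bank", U, V)
p_football = prob_outside_given_center("football", "bank", U, V)
print(p_withdrawal, p_football)
```

With these toy values, "withdrawal" gets a higher probability than "football" as a context word for "bank", which is the kind of pattern training is meant to discover.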

NOTE: In practice we may use word2vec to predict a context of, say, 10 words. It is a “loose” model: it might give only about a 5% chance to any one of those 10 words. So we should not expect something like 97% accuracy. The goal is rather to capture, for example, that the word “withdrawal” is more likely to occur with “bank” than with an unrelated word such as “football”.

Word2vec the Skip-Gram Model

Here’s an excellent article that covers the skip gram neural network architecture for Word2Vec.

The skip-gram model proposes a simple single-layer architecture based on the inner product between two word vectors. The objective is to predict a word’s context given the word itself.

The intuition of skip-gram is:

  1. Treat the target word and a neighboring context word as positive examples.
  2. Randomly sample other words in the lexicon to get negative samples.
  3. Use logistic regression to train a classifier to distinguish those two cases.
  4. Use the regression weights as the embeddings.

[Source: Jurafsky, SLP, Ch 6.8]
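Steps 1 and 2 of the intuition above can be sketched as follows. This is a simplified illustration under my own assumptions: the function name `skipgram_examples` is hypothetical, and noise words are drawn uniformly from the vocabulary, whereas word2vec draws them from a weighted unigram distribution.

```python
import random


def skipgram_examples(tokens, window=2, k=2, seed=0):
    """Build (target, word, label) triples: label 1 = real context, 0 = noise."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    examples = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            examples.append((target, tokens[j], 1))  # positive example
            for _ in range(k):  # k negative examples per positive
                noise = rng.choice(vocab)
                while noise == target:  # a noise word may not be the target itself
                    noise = rng.choice(vocab)
                examples.append((target, noise, 0))
    return examples


# Toy sentence (hypothetical).
tokens = "a tablespoon of apricot jam a pinch of salt".split()
examples = skipgram_examples(tokens, window=2, k=2)
print(examples[:6])
```

Note that a uniformly sampled noise word can occasionally coincide with a true context word; with a large vocabulary this is rare enough to ignore in a sketch like this.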

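Imagine a sentence like the following, with a target word apricot, and assume we’re using a window of ±2 context words:

... lemon, a [tablespoon of apricot jam, a] pinch ...

Here tablespoon, of, jam and a are the four context words c1 to c4 around the target t = apricot.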

Our goal is to train a classifier such that, given a tuple (t, c) of a target word t paired with a candidate context word c (for example (apricot, jam), or perhaps (apricot, aardvark)) it will return the probability that c is a real context word (true for jam, false for aardvark):

P(+ | t, c): the probability that c is a real context word.

How does the classifier compute the probability P? The intuition of the skip-gram model is to base this probability on similarity: a word is likely to occur near the target if its embedding is similar to the target embedding. How can we compute similarity between embeddings? Recall that two vectors are similar if they have a high dot product (cosine, the most popular similarity metric, is just a normalized dot product).
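A quick sketch of the difference between the raw dot product and cosine similarity, using hand-picked toy vectors (the values are arbitrary and only chosen to make the arithmetic easy to follow):

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: the dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0]
b = [2.0, 4.0]   # same direction as a, twice as long
c = [-2.0, 1.0]  # orthogonal to a

print(dot(a, b), cosine(a, b))
print(dot(a, c), cosine(a, c))
```

The dot product grows with vector length, while cosine only depends on direction, which is why `a` and `b` have a large dot product but a cosine of (approximately) 1.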

Of course, the dot product t·c is not a probability. Cosine isn’t a probability either.

To turn the dot product into a probability, we’ll use the logistic or sigmoid function σ(x), the fundamental core of logistic regression:

σ(x) = 1 / (1 + e^(−x))

The probability that word c is a real context word for target word t is thus computed as:

P(+ | t, c) = σ(t·c) = 1 / (1 + e^(−t·c))

The sigmoid function just returns a number between 0 and 1, so to make it a probability we’ll need to make sure that the total probability of the two possible events (c being a context word, and c not being a context word) sums to 1.

The probability that word c is not a real context word for t is thus:

P(− | t, c) = 1 − P(+ | t, c) = e^(−t·c) / (1 + e^(−t·c))

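The equation above gives us the probability for one word, but we need to take account of the multiple context words in the window. Skip-gram makes the strong but very useful simplifying assumption that all context words are independent, allowing us to just multiply their probabilities:

P(+ | t, c1:k) = ∏ᵢ 1 / (1 + e^(−t·ci)), for i = 1, …, k

Taking logs turns the product into a sum: log P(+ | t, c1:k) = Σᵢ log [1 / (1 + e^(−t·ci))]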

In summary, skip-gram trains a probabilistic classifier that, given a test target word t and its context window of k words c1:k , assigns a probability based on how similar this context window is to the target word. The probability is based on applying the logistic (sigmoid) function to the dot product of the embeddings of the target word with each context word. We could thus compute this probability if only we had embeddings for each target word and context word in the vocabulary. Let’s now turn to learning these embeddings (which is the real goal of training this classifier in the first place).
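The classifier described in this section can be sketched in plain Python: a sigmoid applied to the dot product gives the probability for one candidate word, and the independence assumption lets us sum log probabilities over the window. The helper names and the toy vectors below are my own illustrative assumptions, not trained embeddings.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def p_positive(t, c):
    """Probability that c is a real context word for target t."""
    return sigmoid(dot(t, c))


def p_negative(t, c):
    """Probability that c is NOT a real context word; equals sigmoid(-t.c)."""
    return 1.0 - p_positive(t, c)


def log_p_window(t, contexts):
    """Log probability of a whole window, assuming independent context words."""
    return sum(math.log(p_positive(t, c)) for c in contexts)


# Toy 2-d embeddings (hypothetical values).
apricot = [0.5, 1.0]
jam = [0.4, 0.9]
aardvark = [-0.7, -0.3]

print(p_positive(apricot, jam), p_positive(apricot, aardvark))
print(log_p_window(apricot, [jam, jam]))
```

With these values "jam" scores above 0.5 and "aardvark" below it, matching the true/false labels in the example pairs (apricot, jam) and (apricot, aardvark).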

Learning skip-gram embeddings

Read Section 6.8 (page 20) of the Vector Semantics and Embeddings chapter of Jurafsky’s SLP for a great explanation of how word2vec learns embeddings.
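As a rough sketch of what that learning looks like, here is one stochastic gradient descent step on the negative-sampling loss L = −log σ(c_pos·t) − Σ log σ(−c_neg·t), as I understand it from that section. The function names, the learning rate, and the starting vectors are assumptions for illustration; a real implementation would loop over the whole corpus and use a matrix library such as PyTorch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(t, c_pos, c_negs):
    """Negative-sampling loss for one target, one positive and several noise words."""
    loss = -math.log(sigmoid(dot(c_pos, t)))
    for c_neg in c_negs:
        loss -= math.log(sigmoid(-dot(c_neg, t)))
    return loss


def sgns_step(t, c_pos, c_negs, lr=0.1):
    """One SGD step; returns updated copies of all vectors."""
    g_pos = sigmoid(dot(c_pos, t)) - 1.0                # dL/d(c_pos . t)
    g_negs = [sigmoid(dot(c_neg, t)) for c_neg in c_negs]  # dL/d(c_neg . t)
    # Gradient w.r.t. the target uses the old context vectors.
    grad_t = [
        g_pos * c_pos[i] + sum(g * c_neg[i] for g, c_neg in zip(g_negs, c_negs))
        for i in range(len(t))
    ]
    new_c_pos = [ci - lr * g_pos * ti for ci, ti in zip(c_pos, t)]
    new_c_negs = [
        [ci - lr * g * ti for ci, ti in zip(c_neg, t)]
        for g, c_neg in zip(g_negs, c_negs)
    ]
    new_t = [ti - lr * gi for ti, gi in zip(t, grad_t)]
    return new_t, new_c_pos, new_c_negs


# Small hand-picked starting vectors (hypothetical).
t = [0.1, -0.2]
c_pos = [0.3, 0.1]
c_negs = [[0.2, -0.4], [-0.1, 0.3]]

before = sgns_loss(t, c_pos, c_negs)
t2, c_pos2, c_negs2 = sgns_step(t, c_pos, c_negs)
after = sgns_loss(t2, c_pos2, c_negs2)
print(before, after)
```

The step pulls the target and the positive context word together and pushes the target away from the noise words, so the loss after the step is lower than before.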

Efficient Estimation of Word Representations in Vector Space [Paper]

In this Google paper, Efficient Estimation of Word Representations in Vector Space, the authors observed that it is possible to train high-quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high-dimensional word vectors from a much larger data set. Using the DistBelief distributed framework, it should be possible to train the CBOW (Continuous Bag-of-Words) and Skip-gram models even on corpora with one trillion words, for a basically unlimited vocabulary size.

Distributed Representations of Words and Phrases and their Compositionality [Paper]

In this Google paper, the authors found that subsampling the frequent words yields a significant speedup and also produces more regular word representations. They also describe a simple alternative to the hierarchical softmax called negative sampling. And because the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”, they present a simple method for finding phrases in text. Top highlights from the paper are:

  • The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document.
  • While NCE (Noise Contrastive Estimation) can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality.
  • The results show that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than Noise Contrastive Estimation.
  • Subsampling frequent words with a threshold of 10⁻⁵ can result in faster training and can also improve accuracy, at least in some cases.
  • Negative sampling is an extremely simple training method that learns accurate representations, especially for frequent words.
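The subsampling rule in the paper discards each occurrence of a word w with probability 1 − sqrt(t / f(w)), where f(w) is the word's relative frequency and t is the threshold (around 10⁻⁵). A minimal sketch, with a function name and example frequencies of my own choosing:

```python
import math


def discard_probability(freq, threshold=1e-5):
    """Probability of dropping one occurrence of a word with relative frequency freq."""
    return max(0.0, 1.0 - math.sqrt(threshold / freq))


# Hypothetical relative frequencies: a very common word and a rare one.
p_common = discard_probability(0.01)   # e.g. a word making up 1% of all tokens
p_rare = discard_probability(1e-6)     # a word below the threshold is never dropped
print(p_common, p_rare)
```

Very frequent words are dropped most of the time, while words rarer than the threshold are always kept, which is where the speedup comes from.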

Assignment 1: Explore Word2Vec [GitHub][Code]

I completed Assignment 1 of the Stanford CS224N course. In this assignment I explore two types of word vectors: those derived from co-occurrence matrices, and those derived via word2vec.

Below is my Jupyter Notebook:

Word Vectors

Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from co-occurrence matrices, and those derived via word2vec.

Note on Terminology: The terms “word vectors” and “word embeddings” are often used interchangeably. The term “embedding” refers to the fact that we are encoding aspects of a word’s meaning in a lower dimensional space.

Part 1: Count-Based Word vectors

Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words.

  • Create a Co-occurrence Matrix: Build a matrix that counts how often words co-occur in some environment, such as a fixed-size window around each word.
  • Dimensionality Reduction: Construct a method that uses sklearn.decomposition.TruncatedSVD to reduce the matrix to k-dimensional embeddings, keeping the top k components.
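The co-occurrence step can be sketched in plain Python. This is not the notebook's code: the function name `cooccurrence_matrix`, the window size, and the two-sentence corpus are assumptions made for illustration.

```python
def cooccurrence_matrix(corpus, window=1):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({w for doc in corpus for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    M = [[0] * len(vocab) for _ in vocab]
    for doc in corpus:
        for i, w in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    M[index[w]][index[doc[j]]] += 1
    return M, index


# Toy corpus of two tokenized documents (hypothetical).
corpus = [["all", "that", "glitters"], ["all", "is", "well"]]
M, index = cooccurrence_matrix(corpus, window=1)
for word, i in index.items():
    print(word, M[i])
```

Each row of `M` is a (sparse, high-dimensional) count vector for one word. For the reduction step, a matrix like this could be passed to scikit-learn as `TruncatedSVD(n_components=k).fit_transform(M)` to obtain k-dimensional embeddings.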

Part 2: Prediction-Based Word Vectors

Here I explore the embeddings produced by word2vec.

  • Dimensionality Reduction: Reduce dimensionality of Word2Vec Word Embeddings using truncated SVD.
  • Synonyms & Antonyms: Demonstrate counter-intuitive examples and explore why they happen.
  • Solving Analogies with Word Vectors: I use gensim to solve analogies. For example, for the analogy “man:king :: woman:x”, what is x?
  • Finding Analogies: Use gensim’s most_similar function to find analogies, e.g. Austin is the capital of Texas, and Atlanta is the capital of Georgia.
  • Explore Incorrect Analogies: Gensim doesn’t always produce correct results. Here I explore that.
  • Analyze Bias in Word Vectors: Use the most_similar function to find another case where the vectors exhibit some bias. E.g. I discovered a bias in which mothers and doctors correlate with nurse, while fathers and doctors correlate with physician.
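The analogy and bias bullets above all rest on the same vector arithmetic: to solve “man:king :: woman:x”, take king − man + woman and look for the nearest word. With real vectors, gensim expresses this as `most_similar(positive=["woman", "king"], negative=["man"])`. The sketch below imitates the idea in plain Python with hand-made 2-d vectors, so the values are purely illustrative and not real word2vec output.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c, vectors):
    """Solve a : b :: c : x by finding the word nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]  # exclude the query words
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors: dimension 0 is roughly "royalty", dimension 1 "gender".
vectors = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}

answer = analogy("man", "king", "woman", vectors)
print(answer)
```

The same mechanism explains the bias findings: if the training text associates an occupation with one gender, that association is encoded in the vectors and surfaces through exactly this arithmetic.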