Word Embeddings for NLP

Source: Deep Learning on Medium

Understanding word embeddings and their usage in Deep NLP

In this article we will look at how to process text for use in machine learning algorithms, what embeddings are, and why they are used for text processing.

word2vec and GloVe word embeddings

Natural Language Processing (NLP) refers to computer systems designed to understand human language. Human languages, like English or Hindi, consist of words and sentences, and NLP attempts to extract information from these sentences.

A few of the tasks that NLP is used for

  • Text summarization: extractive or abstractive text summarization
  • Sentiment Analysis
  • Translating from one language to another: neural machine translation
  • Chat bots

Machine learning and deep learning algorithms only take numeric input, so how do we convert text to numbers?

Bag of words(BOW)

Bag of words is a simple and popular technique for feature extraction from text. The bag of words model processes the text to find how many times each word appears in a sentence or document. This is also called vectorization.

Steps for creating BOW

  • Tokenize the text into sentences
  • Tokenize sentences into words
  • Remove punctuation or stop words
  • Convert the words to lowercase
  • Create the frequency distribution of words
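The steps above can be sketched in plain Python using only the standard library. This is a minimal illustration rather than a production tokenizer: the function name `bag_of_words`, the tiny `STOP_WORDS` set, and the regular expressions are assumptions made for this example, and lowercasing is done before stop-word removal for convenience.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to", "not"}

def bag_of_words(text):
    """Return a word -> count mapping for the given text."""
    # Split into sentences on ., ! or ? (a crude stand-in for a real sentence tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter()
    for sentence in sentences:
        # Lowercase, then keep runs of letters only, which also drops punctuation.
        words = re.findall(r"[a-z]+", sentence.lower())
        # Count every word that is not a stop word.
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

counts = bag_of_words("Challenge the challenge. Challenges are not easy!")
print(counts)
```

Note that "challenge" and "challenges" are counted as different words here, because no stemming or lemmatization is applied.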

In the code below, we use CountVectorizer. It tokenizes a collection of text documents, builds a vocabulary of known words, and encodes new documents using that vocabulary.

# Creating the frequency distribution of words using nltk and scikit-learn
# Note: sent_tokenize requires NLTK's Punkt tokenizer data to be downloaded first
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer

text = """Achievers are not afraid of Challenges, rather they relish them, thrive in them, use them. Challenges makes us stronger.
Challenges makes us uncomfortable. If you get comfortable with uncomfort then you will grow. Challenge the challenge """

# Tokenize the text corpus into sentences
sentences = sent_tokenize(text)

# Use CountVectorizer, lowercasing the text and removing English stop words
cv1 = CountVectorizer(lowercase=True, stop_words='english')

# Fit the tokenized sentences to the CountVectorizer and build the count matrix
counts = cv1.fit_transform(sentences)

# Print the vocabulary and the frequency distribution of the vocabulary in the tokenized sentences
print(cv1.vocabulary_)
print(counts.toarray())

In a text classification problem, we have a set of texts and their respective labels. We use the bag of words model to extract features from the text by converting it into a matrix of word occurrences within each document.


What is the problem with bag of words?

In the bag of words model, each document is represented as a word-count vector. The counts can be binary (a word either occurs in the text or it does not) or absolute (how many times the word occurs). The size of the vector is equal to the number of words in the vocabulary. If most of the elements are zero, the bag of words is a sparse matrix.

In deep learning we work with huge amounts of training data, so the vocabulary is large and the bag of words matrix is sparse. Sparse representations are harder to model for both computational and informational reasons.
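To make the difference between absolute and binary counts concrete, here is a small sketch. The six-word `vocabulary` and the three-word `document` are made up for illustration; real vocabularies contain many thousands of words, which is where the sparsity comes from.

```python
# A made-up vocabulary and a short, already tokenized document.
vocabulary = ["achievers", "challenge", "comfortable", "grow", "relish", "thrive"]
document = ["challenge", "grow", "challenge"]

# Absolute counts: how many times each vocabulary word occurs in the document.
count_vector = [document.count(word) for word in vocabulary]

# Binary counts: 1 if the word occurs at all, otherwise 0.
binary_vector = [1 if c > 0 else 0 for c in count_vector]

# Fraction of entries that are zero, a simple measure of sparsity.
sparsity = count_vector.count(0) / len(count_vector)

print(count_vector)   # [0, 2, 0, 1, 0, 0]
print(binary_vector)  # [0, 1, 0, 1, 0, 0]
print(sparsity)
```

Even in this tiny case two thirds of the entries are zero.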

Huge number of weights: Huge input vectors mean a huge number of weights for a neural network.

Computationally intensive: More weights mean more computation is required to train and predict.

Lack of meaningful relations and no consideration for the order of words: BOW is a collection of the words that appear in the text or sentences, together with their counts. It does not take into consideration the order in which the words appear.
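The loss of word order is easy to demonstrate. In the sketch below, `bow` is a hypothetical helper that splits on whitespace and counts words; two sentences with opposite meanings end up with exactly the same representation.

```python
from collections import Counter

def bow(sentence):
    """Count the lowercased, whitespace-separated words of a sentence."""
    return Counter(sentence.lower().split())

a = bow("dog bites man")
b = bow("man bites dog")

# Both sentences contain the same words the same number of times.
print(a == b)  # True
```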

Word embeddings are a solution to these problems

Embeddings translate large sparse vectors into a lower-dimensional space that preserves semantic relationships.

Word embedding is a technique where individual words of a domain or language are represented as real-valued vectors in a lower-dimensional space.

The sparse matrix problem with BOW is solved by mapping high-dimensional data into a lower-dimensional space.

The lack of meaningful relationships in BOW is solved by placing the vectors of semantically similar items close to each other. This way, words that have similar meanings are a small distance apart in the vector space, as shown below.

“king is to queen as man is to woman” is encoded in the vector space; verb tenses and countries and their capitals are likewise encoded in the low-dimensional space, preserving the semantic relationships.

source: https://developers.google.com/machine-learning/crash-course/embeddings/translating-to-a-lower-dimensional-space
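One common way to measure how close two embedding vectors are is cosine similarity. The sketch below uses hand-made 3-dimensional vectors that are invented purely for illustration; they are not taken from any trained model, and the names `embeddings` and `cosine_similarity` are assumptions of this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made vectors, invented for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

# The two animals point in nearly the same direction; the car does not.
print(sim_cat_dog, sim_cat_car)
```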

How are semantically similar items placed close to each other?

Let’s explain this using collaborative filtering, which is used in recommendation engines.

Recommendation engines use collaborative filtering to predict what a user would purchase, based on the historical purchases of other users with similar interests.

Amazon and Netflix use recommendation engines to suggest products or movies to their users.

Collaborative filtering is a method where similar products bought by multiple customers are embedded into a low-dimensional space. This low-dimensional space contains similar products close to each other, which is why the approach is also described as a nearest-neighbor method.

This nearest-neighbor technique is used for placing semantically similar items close to each other.
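A toy version of the nearest-neighbor idea might look like the following. The purchase table is made up, and for simplicity the sketch compares raw purchase columns directly instead of first learning a low-dimensional embedding, so it only illustrates the "items bought by the same users end up close together" intuition.

```python
import math

# Rows are users, columns are items; 1 means the user bought the item.
# The purchase data is made up for illustration.
items = ["laptop", "mouse", "keyboard", "novel"]
purchases = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]

def item_vector(index):
    """Column of the purchase table: who bought this item."""
    return [row[index] for row in purchases]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def nearest_item(name):
    """Return the other item whose buyers overlap most with this item's buyers."""
    i = items.index(name)
    scores = {
        other: cosine(item_vector(i), item_vector(j))
        for j, other in enumerate(items)
        if j != i
    }
    return max(scores, key=scores.get)

print(nearest_item("laptop"))  # mouse
```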

How do we map high-dimensional data into a lower-dimensional space?

Using standard dimensionality reduction techniques

Standard dimensionality reduction techniques like Principal Component Analysis (PCA) can be used to create word embeddings. Applied to the BOW matrix, PCA tries to find highly correlated dimensions that can be collapsed into a single dimension.
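As a rough sketch of the idea, the code below finds the first principal component of a tiny BOW matrix using power iteration on the covariance matrix, with only the standard library. The count matrix is made up so that the first two columns always rise and fall together; in practice one would use a library implementation of PCA, and the function name `first_principal_component` is an assumption of this example.

```python
import math

def first_principal_component(rows, iterations=100):
    """Approximate the top principal direction with power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered columns.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return means, v

# Toy BOW count matrix (made up): columns 0 and 1 are perfectly correlated.
bow = [
    [0, 0, 1],
    [1, 1, 0],
    [2, 2, 1],
    [3, 3, 0],
]
means, component = first_principal_component(bow)

# Project each 3-dimensional document onto the single principal direction.
projected = [sum((row[j] - means[j]) * component[j] for j in range(3)) for row in bow]
print(component)
print(projected)
```

The two correlated columns receive the same weight in the component, so they are effectively collapsed into one direction.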


Word2vec is an algorithm invented at Google for training word embeddings. word2vec relies on the distributional hypothesis, which states that words which often have the same neighboring words tend to be semantically similar. This helps to map semantically similar words to geometrically close embedding vectors.

word2vec exploits the distributional hypothesis using one of two architectures: continuous bag of words (CBOW) or skip-gram.

word2vec models are shallow neural networks with an input layer, a projection layer and an output layer. They are trained to reconstruct the linguistic contexts of words. word2vec takes a large corpus of text as input and produces a vector space, typically of several hundred dimensions. Every unique word in the text corpus is assigned a corresponding vector in the space.
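The statement that every unique word gets its own vector can be pictured as a lookup table. In the sketch below the `projection` matrix is filled with random numbers, which is only a stand-in for weights that training would normally adjust; the tiny corpus, the 4-dimensional size, and all helper names are assumptions made for this illustration.

```python
import random

corpus = "the king spoke to the queen".split()
vocabulary = sorted(set(corpus))  # one entry per unique word
word_to_index = {w: i for i, w in enumerate(vocabulary)}

embedding_dim = 4  # real models typically use several hundred dimensions
rng = random.Random(0)
# One row of weights per vocabulary word; training would adjust these values.
projection = [[rng.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
              for _ in vocabulary]

def one_hot(word):
    """Vocabulary-sized vector with a single 1 at the word's index."""
    vec = [0] * len(vocabulary)
    vec[word_to_index[word]] = 1
    return vec

def project(one_hot_vec):
    """Multiply a one-hot row vector by the projection matrix."""
    return [sum(x * projection[i][j] for i, x in enumerate(one_hot_vec))
            for j in range(embedding_dim)]

# The product simply selects the row that belongs to the word.
print(project(one_hot("king")) == projection[word_to_index["king"]])  # True
```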

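The first architecture is called continuous bag of words (CBOW) because it uses a continuous distributed representation of the context. It predicts the current word from the surrounding words, using words from both the history and the future; as in a bag of words, the order of those context words does not influence the projection.

This helps the vectors of words that share common contexts in the corpus to be located close to one another in the vector space.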

Source: Efficient Estimation of Word Representations in Vector Space by Mikolov-2013
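To see what CBOW is asked to learn, it can help to list the training examples it would see. The sketch below only builds the (context, target) examples and does not train a network; the function name `cbow_pairs` and the `window` parameter are assumptions of this example.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) examples from a token list."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words before and after the current word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

tokens = "the quick brown fox jumps".split()
for context, target in cbow_pairs(tokens):
    print(context, "->", target)
```

For the middle word the example is `['the', 'quick', 'fox', 'jumps'] -> brown`: the model sees the surrounding words and has to predict the current one.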

Skip-gram

Skip-gram does not predict the current word based on the context. Instead, it uses each current word as an input to a log-linear classifier with a continuous projection layer, and predicts words within a certain range before and after the current word.
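The corresponding skip-gram training examples run in the opposite direction: each current word is paired with every word inside its window. Again this is only a sketch of the data preparation, with a hypothetical helper `skipgram_pairs`, not an implementation of the classifier itself.

```python
def skipgram_pairs(tokens, window=2):
    """Build (current_word, context_word) examples from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the current word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```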

GloVe: Global Vectors for Word Representation

GloVe was developed by Pennington, et al. at Stanford. It is called Global Vectors as the global corpus statistics are captured directly by the model.

It leverages both

  • Global matrix factorization methods like latent semantic analysis (LSA) for generating low-dimensional word representations
  • Local context window methods such as the skip-gram model of Mikolov et al

LSA efficiently leverages statistical information but does relatively poorly on the word analogy task, indicating a sub-optimal vector space structure.

Methods like skip-gram perform better on the analogy task but make poor use of the statistics of the corpus, as they are not trained on global co-occurrence counts. GloVe uses a specific weighted least squares model that trains on global word co-occurrence counts and so makes efficient use of the statistics.
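The weighted least squares idea can be sketched for a single word pair. In the GloVe paper the squared difference between a dot product of two word vectors (plus biases) and the logarithm of the co-occurrence count is scaled by a weighting function; the values `x_max=100` and `alpha=0.75` below are the ones reported in the paper, while the function names and argument layout are assumptions of this example.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function: down-weights rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one word pair with co-occurrence count x_ij > 0."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# A rare pair contributes less to the total loss than a frequent one.
print(weight(10), weight(100), weight(1000))
print(pair_loss([1.0, 0.0], [1.0, 0.0], 0.0, 0.0, 100.0))
```

The full objective sums this term over all word pairs with a non-zero co-occurrence count.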

Consider two words, i = ice and j = steam, in the context of the thermodynamics domain. The relationship between these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words k.

Ratio of co-occurrence probabilities: P(k | ice) / P(k | steam)

For probe words like water or fashion, which are related either to both ice and steam or to neither, the ratio should be close to one. Probe words like solid, related to ice but not to steam, will have a large value for the ratio, while probe words like gas, related to steam but not to ice, will have a small value.

Source: GloVe: Global Vectors for Word Representation — Jeffrey Pennington

Compared to the raw probabilities, the ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion) and it is also better able to discriminate between the two relevant words.
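The behavior of the ratio can be reproduced with a few lines of arithmetic. The co-occurrence counts below are made up and chosen only to mimic the pattern described here; they are not the figures from the GloVe paper.

```python
# Made-up co-occurrence counts, chosen only to mimic the pattern described above.
cooccurrence = {
    "ice":   {"solid": 80, "gas": 4, "water": 300, "fashion": 2},
    "steam": {"solid": 5, "gas": 90, "water": 280, "fashion": 2},
}
# Total number of context words seen around each target word (also made up).
totals = {"ice": 10000, "steam": 10000}

def probability(word, probe):
    """Estimated probability of seeing `probe` in the context of `word`."""
    return cooccurrence[word][probe] / totals[word]

def ratio(probe):
    """P(probe | ice) / P(probe | steam)."""
    return probability("ice", probe) / probability("steam", probe)

for probe in ["solid", "gas", "water", "fashion"]:
    print(probe, round(ratio(probe), 2))
```

The ratio is far above one for solid, far below one for gas, and close to one for water and fashion, even though the raw probabilities of water and fashion are very different.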

Source: https://nlp.stanford.edu/projects/glove/

It is gender that distinguishes man from woman, and the same holds for word pairs such as king and queen or brother and sister. To state this observation mathematically, we might expect that the vector differences man - woman, king - queen, and brother - sister might all be roughly equal. This property and other interesting patterns can be observed in the above set of visualizations using GloVe.
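The "roughly equal differences" property can be checked on a toy example. The 2-dimensional vectors below are hand-made so that the property holds exactly; in trained GloVe vectors the differences are only approximately equal.

```python
# Hand-made vectors: axis 0 loosely encodes "royalty", axis 1 encodes "gender".
vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
}

def difference(a, b):
    """Element-wise difference between the vectors of two words."""
    return [x - y for x, y in zip(vectors[a], vectors[b])]

print(difference("man", "woman"))   # [0.0, 2.0]
print(difference("king", "queen"))  # [0.0, 2.0]

# Equal differences imply the familiar analogy: king - man + woman lands on queen.
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
print(analogy == vectors["queen"])  # True
```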


Word embeddings are considered to be one of the successful applications of unsupervised learning at present. They do not require any annotated corpora. Embeddings use a lower-dimensional space while preserving semantic relationships.



GloVe: Global Vectors for Word Representation - Jeffrey Pennington, Richard Socher, Christopher D. Manning

Efficient Estimation of Word Representations in Vector Space - Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean