Word Representation in Natural Language Processing Part III

Source: Deep Learning on Medium

In Part II of my blog series about word representations, I talked about distributed word representations such as Word2Vec and GloVe. These representations incorporate the semantics (meaning) and similarity information of words into the embeddings. However, they fail to generalize to “Out Of Vocabulary” (OOV) words that were not part of the training set. In this part, I will describe models that mitigate this issue. Specifically, I will talk about two recently proposed models: ELMo and FastText. The idea behind ELMo and FastText is to exploit the character and morphological structure of the word. Unlike other models, ELMo and FastText do not treat a word as one atomic unit, but rather as a combination of its characters and subword units, e.g. “remaking -> re + make + ing”.

The problem of previous models

Previous models such as Word2Vec and GloVe look up a pre-trained word embedding in a dictionary, and thus the word sense being used in a specific context is not considered. In other words, polysemy (multiple meanings of a word) is not taken into account. For example:

“My friend is considering to take a loan from a bank.”

“Rainfall caused the Rhine river to overflow its bank.”

If one used Word2Vec or GloVe, the word “bank” would have a single embedding, but it has a different connotation in the two sentences above. In the first sentence, it means an institution that provides financial services to its customers. In the second sentence, it means the slope beside a body of water.

ELMo: Embeddings from Language Models

The new ELMo model resolves this issue by considering the entire sequence as input. It dynamically produces a word embedding according to the context. The model consists of three layers: (a) a Convolutional Neural Network (CNN), (b) bi-directional Long Short-Term Memory (LSTM) layers, and (c) the embedding layer.

Figure 1

a) The input to ELMo (i.e., the input to the CNN) is purely character-based. So initially, we feed the CNN with raw characters. The CNN then produces a compact embedding that gets passed to the bi-directional LSTMs.

b) The bi-directional LSTM layer means that the model runs over the input sequence in forward as well as in reverse order. For instance, let’s get the embedding for the word “followed” in the input below:

input = “A cloudy morning followed by a mostly sunny afternoon.”

context_straight= [‘a’, ‘cloudy’, ‘morning’]

context_reverse = [‘afternoon’, ‘sunny’, ‘mostly’, ‘a’, ‘by’]

Thus for every target word, the model can observe the preceding and succeeding words around it. As we can see from Figure 1, the stacked LSTMs form a multi-layer LSTM. Every LSTM layer takes as input the output sequence of the previous one, except the first LSTM layer, which takes the character embedding from the CNN.
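The forward/reverse context windows above can be sketched in a few lines of plain Python (a toy illustration only; the helper name and the simple whole-sentence windows are my own, not part of ELMo):

```python
def contexts(tokens, target):
    """Return the words before (forward order) and after (reverse order) a target word."""
    i = tokens.index(target)
    forward = tokens[:i]            # words read left-to-right up to the target
    reverse = tokens[i + 1:][::-1]  # words after the target, read right-to-left
    return forward, reverse

tokens = "a cloudy morning followed by a mostly sunny afternoon".split()
fwd, rev = contexts(tokens, "followed")
print(fwd)  # ['a', 'cloudy', 'morning']
print(rev)  # ['afternoon', 'sunny', 'mostly', 'a', 'by']
```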

c) The embedding layer concatenates the hidden states of both LSTM directions and produces context-dependent embeddings. In the paper, the authors define it as a linear combination of the hidden states, weighted by task-specific model weights. A separate ELMo representation is learned for each task the model is used for. As a result, ELMo improved the performance of many NLP tasks. For simplicity, I omitted the details; they can be found here.

ELMo using Tensorflow-hub


  • Python 3.6
  • Tensorflow 1.13.1
  • Tensorflow-hub 0.4.0

Libraries can be installed using:

pip install tensorflow==1.13.1 tensorflow-hub==0.4.0

Load pre-trained word embeddings from ELMo module:

The parameter trainable=False is set because I just want to load the pre-trained weights. The parameters of the graph are not necessarily fixed, though: it is possible to fine-tune the model and update the parameters by setting it to True.

Get the word embeddings from pre-trained ELMo:

The default output size is 1024; let’s look at the first 10 dimensions of the word “bank” in the two sentences above:

As we can see, the embedding for the same word in different contexts is distinct. As stated earlier, ELMo builds word embeddings dynamically from a sequence of characters, relying on the words currently surrounding the target word in the sentence.

FastText: Subword model

The problem with the previous models is that they cannot generate word embeddings for OOV words (see above). This issue is resolved not only by ELMo but also by the Subword model. Additionally, in contrast to ELMo, the Subword model is able to exploit morphological information: words with the same root share parameters. It is integrated as a part of the FastText library, which is why it is also known as FastText.

The Subword model is an extension of the Skip-Gram model (Word2Vec), which predicts the probability of a context given a word. The model loss (the negative sampling objective) is defined as follows:

∑_t [ ∑_{c ∈ C_t} ℓ(s(w_t, w_c)) + ∑_{n ∈ N_{t,c}} ℓ(−s(w_t, n)) ], where ℓ(x) = log(1 + e^{−x}) and s is the scoring function between a word and a context.

The first part of the loss considers all context words w_c as positive examples, and the second part randomly samples negative examples N_{t,c} at position t. On the one hand, the objective aims to place co-occurring and similar words close to each other; on the other hand, it aims to place dissimilar words far from each other in vector space. The Subword model is trained in a similar fashion, except that it adds character n-grams to the features. An n-gram is a contiguous sequence of n items. For instance, with n=2 the word “banking” yields the following n-grams: {ba, an, nk, ki, in, ng}.
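Extracting these character n-grams is straightforward; here is a plain-Python sketch (note that the actual FastText implementation also wraps each word in the boundary markers < and > before extracting n-grams, which I omit here to match the example above):

```python
def char_ngrams(word, n_min, n_max):
    """All character n-grams of `word` for n_min <= n <= n_max."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams

print(sorted(char_ngrams("banking", 2, 2)))
# ['an', 'ba', 'in', 'ki', 'ng', 'nk']
```

For the n-gram range used in practice (see below), this would be called as char_ngrams(word, 3, 6).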

Another illustration below:

“the river bank”

In the Skip-Gram model the word “river” has two context tokens: the and bank. The Subword model with n-grams (n=2) additionally has the character bigrams of the context (with _ marking a word boundary): th, he, e_, _b, ba, an, nk — nine context tokens in total, including the words the and bank. In practice, we extract all the n-grams for 3 ≤ n ≤ 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes of a word.

This n-gram information enriches word vectors with subword information and enables the model to build vectors for unseen words. It is extremely beneficial for morphologically rich languages and for datasets with many rare words. German is a good example, as it is rich in compound nouns: the word “Tischtennis” translates as “table tennis”, and its embedding can be composed by simple addition of its subword vectors, e.g. Tisch + Tennis → Tischtennis.
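To make the composition concrete, here is a toy sketch of how a vector for an unseen word can be assembled from subword vectors (the lookup table, the vector size of 4, and averaging instead of summing are all illustrative assumptions, not FastText’s exact internals):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical table of trigram vectors "learned" during training
# (in real FastText these come from the trained model).
ngram_vecs = {
    g: rng.normal(size=4)
    for g in ["<ti", "tis", "isc", "sch", "cht", "hte",
              "ten", "enn", "nni", "nis", "is>"]
}

def oov_vector(word, n=3):
    """Compose a vector for an unseen word from its known n-gram vectors."""
    token = f"<{word}>"  # FastText wraps words in boundary markers
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    known = [ngram_vecs[g] for g in grams if g in ngram_vecs]
    return np.mean(known, axis=0)

vec = oov_vector("tischtennis")
print(vec.shape)  # (4,)
```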

FastText using Gensim


  • Python 3.6
  • Gensim 3.7.2

Libraries can be installed using:

pip install gensim==3.7.2

Create a model using custom data

Let’s get embeddings for a word that was not part of training:

There are pre-trained models using wiki data for 157 languages, which can be found here. They can be loaded using gensim as well.

Take Away

ELMo and Subword are advanced models that are able to produce high-quality embeddings for words both present in and absent from the vocabulary. In particular, ELMo takes context information into account when producing a word embedding, and its vector quality is superior to other existing models. But since it runs a prediction at inference time, it has an inference cost. On the other hand, Subword is very fast and efficiently incorporates n-gram information.