How Contextual Word Embeddings are Learnt in NLP

Original article was published on Deep Learning on Medium

Embeddings are vector-space representations of words, sentences, paragraphs, or even images. Word embeddings represent each word of a vocabulary as a vector. They are contextual if the representation also captures the context in which the word is used, and hence can carry more semantic meaning.

Word embeddings trained with techniques like Word2vec and GloVe are non-contextual. Language models like BERT learn non-contextual word embeddings during training and also transform them into contextual ones, based on the context in which the words are fed in.

When a record is given as input to a BERT model, it goes through a series of transformations at each transformer layer, as follows:

  1. Each token embedding is projected into H different projections, where H is the number of attention heads.
  2. Each head computes attention weights between each token and all the tokens in the same input sequence.
  3. At each attention head, the encoding of the token is updated by taking the attention-weighted average of all tokens in that sequence; the outputs of the H attention heads are then concatenated. [This is where context is learnt]
  4. The result is further linearly transformed, and steps 1 to 4 are repeated once for each layer.
  5. Finally, a contextual representation of each token is produced.
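
The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not BERT's actual implementation: the weights are random, the sizes (6 tokens, hidden size 8, 2 heads, 3 layers) are made up, and the residual connections, layer normalization, biases and feed-forward sublayer of a real transformer layer are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x has shape (seq_len, hidden); each weight matrix is (hidden, hidden)."""
    seq_len, hidden = x.shape
    head_dim = hidden // num_heads

    def split_heads(t):
        # (seq_len, hidden) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    # Step 1: project every token into per-head queries, keys and values.
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Step 2: each head scores every token against every token in the input.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)  # (num_heads, seq_len, seq_len)

    # Step 3: attention-weighted average of the values, then concatenate heads.
    context = weights @ v  # (num_heads, seq_len, head_dim)
    concat = context.transpose(1, 0, 2).reshape(seq_len, hidden)

    # Step 4: final linear transformation of the concatenated heads.
    return concat @ w_o, weights

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, hidden, num_heads, num_layers = 6, 8, 2, 3
x = rng.normal(size=(seq_len, hidden))  # stands in for non-contextual embeddings

for _ in range(num_layers):
    w_q, w_k, w_v, w_o = (rng.normal(scale=0.3, size=(hidden, hidden)) for _ in range(4))
    x, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)

# Step 5: x now holds one context-dependent vector per token.
print(x.shape, attn.shape)
```

Because every output row is a weighted mix of all input rows, the vector for a token after the loop depends on the other tokens around it, which is the sense in which the representation becomes contextual.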

Let's take an example with four records:

R0 : A man is playing soccer. A man is playing sport.
R1 : A man is playing flute. A man is playing music.
R2 : A man is playing football. A man is playing sport.
R3 : A man is playing violin. A man is playing music.

As we can see in the above example,

  • “playing” in record 0 is semantically related to “playing” in record 2.
  • Similarly, “playing” in record 1 is semantically closer to “playing” in record 3.

These records were fed to BERT, and Figure 1 illustrates the cosine similarity of the token “playing” across the records. Labels in the figure can be interpreted as <record-number>-<token-index>-<layer-index>-<word-name>.

Figure 1: Cosine Similarity of “playing” in all records in layer 0 and layer 12 in BERT
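
As a rough sketch of how such a similarity matrix can be computed, the snippet below builds pairwise cosine similarities with NumPy. The 3-dimensional vectors are made up purely for illustration (they are not BERT outputs), and the helper name `cosine_similarity_matrix` is hypothetical; only the label format follows the one described above.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; the dot product of two unit
    # vectors is then exactly their cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Made-up vectors standing in for layer-12 embeddings of "playing".
toy_embeddings = {
    "0-4-12-playing": [0.9, 0.1, 0.2],  # record 0 (soccer)
    "1-4-12-playing": [0.1, 0.9, 0.3],  # record 1 (flute)
    "2-4-12-playing": [0.8, 0.2, 0.1],  # record 2 (football)
    "3-4-12-playing": [0.2, 0.8, 0.4],  # record 3 (violin)
}

labels = list(toy_embeddings)
sim = cosine_similarity_matrix(list(toy_embeddings.values()))

for label, row in zip(labels, sim):
    print(label, np.round(row, 2))
```

With these toy values, records 0 and 2 score close to 1 against each other and much lower against records 1 and 3, which is the kind of pattern the figure shows for layer 12.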

Consider token number 4 of each record, which is the first occurrence of “playing”. (Note that “playing” appears twice, as token 4 and token 11, because BERT adds extra tokens: [CLS] as the first token, and [SEP]. We treat each sentence as a segment in each record.) We can observe the following:

  • At layer 0, they are all similar, as they are all non-contextual.
  • At layer 12, the embedding of “playing” in record 0 is similar to that in record 2, as both are used in reference to playing soccer or football, which are semantically similar. (Note that the BERT base model has 12 layers.)
  • At layer 12, the embedding of “playing” in record 1 is similar to that in record 3, as both are in the context of music, i.e. flute or violin.

Hence, BERT transformed the non-contextual embeddings into contextual ones, which capture semantic similarity.
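
The token positions 4 and 11 used above can be checked with a toy tokenizer. This is a simplified stand-in, assuming lowercasing, whitespace splitting and a separate token for the final period; it is not BERT's real WordPiece tokenizer, and the function name `toy_bert_tokens` is hypothetical.

```python
def toy_bert_tokens(sentence_a, sentence_b):
    """Mimic the layout [CLS] sentence A [SEP] sentence B [SEP]."""
    def split(sentence):
        # Lowercase and treat the period as its own token.
        return sentence.lower().replace(".", " .").split()

    return ["[CLS]"] + split(sentence_a) + ["[SEP]"] + split(sentence_b) + ["[SEP]"]

tokens = toy_bert_tokens("A man is playing soccer.", "A man is playing sport.")
positions = [i for i, tok in enumerate(tokens) if tok == "playing"]

print(tokens)
print(positions)  # [4, 11]
```

The [CLS] token takes index 0, which pushes the first “playing” to index 4, and the [SEP] token between the two sentences pushes the second one to index 11.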

Also note that the second sentence in each record is broader:

  • In record 0, “A man is playing soccer. A man is playing sport.”, the second sentence uses “playing” in a broader context, i.e. with respect to sport instead of soccer.
  • Similarly, in every record the second sentence is broader, while the first sentence specifically names the sport or the musical instrument.

If we look at the similarities of “playing” across both sentences (token 4 and token 11) of each record, we find the following, shown in Figure 2:

Figure 2: Cosine similarity of each usage of “playing” in all records

  • The token “playing” has a relatively similar meaning in both of its usages (as token 4 and token 11) in record 0 and record 2, which are in the context of sports.
  • The token “playing” has a similar meaning in both sentences in record 1 and record 3, which are in the context of music.

That said, language models like BERT transform non-contextual representations of words into contextual ones. That is why it is called a “Bidirectional Representation”: it puts context from both directions into the representation of each word. This is quite useful for downstream tasks like sentence similarity, sentiment analysis, etc.


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding