Elegant Intuitions Behind Positional Encodings

Original article was published on Artificial Intelligence on Medium

How do we capture position and order of data?

Introduction:

Understanding position and order is crucial in many tasks that involve sequences. Positional encodings play a key role in the widely known Transformer model (Vaswani, et al. 2017) because the architecture doesn’t naturally include information about the order of the input. The positional encoding step allows the model to recognize which part of the sequence an input belongs to.

My intent in writing this post is to help students and practitioners, like myself, to gain a stronger grasp of the intuition behind the formulation of Transformer’s positional embeddings. Hopefully, a thorough understanding can help us to develop the ability to make tweaks and adjustments according to each use case and push the boundaries of research.

What are Positional Embeddings and where are they used?

Transformer Model (Vaswani, et al. 2017)

At a high level, a positional embedding is a tensor of values in which each row represents the position of a word in a sequence. It is added to the input embeddings to produce a final embedding that carries order information.

As shown in the model figure above, positional embeddings are added to the input before the encoder and decoder layers because the structure of the Transformer does not take the order of the input sequence into account. We need to apply the positional encoding prior to the decoder as well, since the output of the Transformer is a sequence of word embeddings, which has lost all information about the position of elements in the sequence.

Formulation:

In this section, we will assume that the task is language modelling, where the input and the output are both sequences of words.

Given a sequence of words, we process it into word embeddings Zʷ: N x hʷ, where N represents the number of words in a sampled sequence and hʷ represents the embedding size. Then, pos ∈ [0, N-1] is the position of the word in the sequence and i ∈ [0, hʷ-1] is the index which spans the dimensions of the word embedding.

To reiterate:

Given: Word Embeddings Zʷ: N x hʷ

  • N: Number of words in the sequence
  • hʷ: Dimension size of the word embedding
  • pos: position of the current word in the sequence, in [0, N-1]
  • i: dimensional index of the word embedding, in [0, hʷ-1]

Thus the formula for the positional embedding is:
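Written in this post’s notation (hʷ plays the role of d_model in the paper), the sinusoidal encoding from Vaswani et al. (2017) is reproduced below. Note that in this form the even dimensions 2i use the sine and the odd dimensions 2i+1 use the cosine, so i here counts sine/cosine pairs and runs over [0, hʷ/2 - 1].

```latex
\begin{aligned}
PE_{(pos,\,2i)}   &= \sin\!\left(\frac{pos}{10000^{2i/h^{w}}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\!\left(\frac{pos}{10000^{2i/h^{w}}}\right)
\end{aligned}
```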

It is easy to see that the argument of the sin and cos functions depends on both the position of the current word in the sequence and the dimensional index (hʷ is fixed): the dimensional index sets the frequency of each wave, and the position determines where on that wave the value is read.

Once we calculate the positional encoding, we simply add it to the word embedding via standard element-wise addition, giving Zʷ + PE as the final embedding.
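To make the dependence on the dimensional index concrete, the following sketch lists the angular frequency and wavelength of each sine/cosine pair, assuming the base of 10000 from Vaswani et al. (2017). The function name pair_frequencies and the choice hʷ = 8 are illustrative and not part of the original post.

```python
import math

def pair_frequencies(h_w, base=10000.0):
    """Angular frequency of each sine/cosine pair: 1 / base^(2i / h_w)."""
    return [1.0 / base ** (2 * i / h_w) for i in range(h_w // 2)]

h_w = 8  # illustrative embedding size (assumption)
freqs = pair_frequencies(h_w)
wavelengths = [2 * math.pi / f for f in freqs]

for i, (f, w) in enumerate(zip(freqs, wavelengths)):
    # Pair i fills dimensions 2i (sin) and 2i+1 (cos).
    print(f"pair {i}: frequency={f:.6f}, wavelength={w:.2f} positions")
```

The first pair completes a cycle roughly every 6.28 positions, and each later pair oscillates more slowly, which is why low dimensions change quickly with position and high dimensions change slowly.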

For example, let’s assume hʷ = 4 and we want to calculate the word embedding with the positional embedding added.

It’s important to note that the dimensions of Zʷ and PE are identical. This can be easily seen since pos spans N and i spans the embedding dimension hʷ.
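The hʷ = 4 case can be sketched in plain Python as follows. The word embedding values are made up for illustration, and the helper name positional_encoding is hypothetical; the encoding itself follows the sinusoidal form from Vaswani et al. (2017) with base 10000.

```python
import math

def positional_encoding(n_positions, h_w, base=10000.0):
    """Return an n_positions x h_w table of sinusoidal encodings."""
    table = []
    for pos in range(n_positions):
        row = []
        for k in range(h_w):
            # Dimensions 2i and 2i+1 share the exponent 2i / h_w.
            angle = pos / base ** ((k - k % 2) / h_w)
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

h_w = 4
# Hypothetical word embeddings for a 3-word sequence (values made up).
word_emb = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
pe = positional_encoding(len(word_emb), h_w)

# Element-wise addition: both tensors are N x h_w.
final_emb = [[w + p for w, p in zip(w_row, p_row)]
             for w_row, p_row in zip(word_emb, pe)]

for pos, row in enumerate(final_emb):
    print(pos, [round(v, 4) for v in row])
```

At position 0 the encoding is [0, 1, 0, 1], since sin(0) = 0 and cos(0) = 1, so the first row of the result is just the word embedding with 1 added to its odd dimensions.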

Key Intuition:

Positional Encoding Tensor with Values Color Mapped

In order to capture positional information, each element of the positional embedding varies according to a word’s position and the index of the element within the dimension of the word embedding (in this case between [0, 299]). This is achieved by varying frequencies, as mentioned above.

To further illustrate this notion:

Let us look at the positional embedding values at individual dimension indices.

Positional embedding values for varying dimensional indices of word embedding (1, 100, 200, 300)

As shown above, the positional encoding for each dimensional index demonstrates a noticeable sinusoidal pattern. Furthermore, the values at the high indices are nearly constant (close to 1 for the cosine dimensions), which is evident in the formulation: as i grows, pos/aᶦ approaches 0 (where a is a constant), so sin(pos/aᶦ) approaches 0 and cos(pos/aᶦ) approaches 1. The figures above illustrate that the positional encoding values, with respect to the dimensional index of the word embedding, exhibit a pattern. By itself, this pattern captures little to no information. However, this attribute is essential to creating a pattern in positional embeddings with respect to the position of each word, which is exactly what we want to do.
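A quick numeric check of the high-index behaviour, assuming hʷ = 300 and positions 0 to 99 (both are assumptions about the plotted setup): for the last sine/cosine pair the argument pos / 10000^(2i/hʷ) stays close to 0, so the sine component stays near 0 and the cosine component stays near 1.

```python
import math

h_w = 300          # embedding size, matching the [0, 299] range in the text
n_positions = 100  # assumed sequence length for illustration
base = 10000.0

i = h_w // 2 - 1               # last sine/cosine pair
scale = base ** (2 * i / h_w)  # large divisor for the highest dimensions

sin_vals = [math.sin(pos / scale) for pos in range(n_positions)]  # dimension 298
cos_vals = [math.cos(pos / scale) for pos in range(n_positions)]  # dimension 299

print("max sine value:  ", max(sin_vals))
print("min cosine value:", min(cos_vals))
```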

Positional embedding values for each dimension, for varying word positions (1, 25, 50, 75, 100)

The above figure shows rows of positional embeddings for given positions of a word in the sequence. The combined positional embeddings across the dimensions of the embedding demonstrate a clear pattern, which captures information about the position of each word. As you can see above, the pattern of positional embeddings varies noticeably for differing word positions.
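As a rough check that each position receives its own pattern, the sketch below builds the rows for positions 0 to 99 with hʷ = 300 (values assumed for illustration) and confirms that no two rows coincide. The helper name pe_row is hypothetical.

```python
import math
from itertools import combinations

def pe_row(pos, h_w, base=10000.0):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    return [
        math.sin(pos / base ** ((k - k % 2) / h_w)) if k % 2 == 0
        else math.cos(pos / base ** ((k - k % 2) / h_w))
        for k in range(h_w)
    ]

h_w, n_positions = 300, 100
rows = [pe_row(pos, h_w) for pos in range(n_positions)]

# Smallest Euclidean distance between the encodings of any two positions.
min_dist = min(math.dist(a, b) for a, b in combinations(rows, 2))
print("smallest distance between two rows:", min_dist)
```

A strictly positive minimum distance means every position in the range maps to a distinct row.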

Resulting tensor from addition of positional embedding to word embedding

For illustration purposes, the above plot shows the resulting tensor from adding positional embeddings to a dummy word embedding, which is a random tensor of matching dimensions whose elements range between 0 and 1.

In the implementation of the original Attention Is All You Need paper (Vaswani, et al. 2017), the positional embedding is added to the input word embedding. The above figure visualizes the values of the resulting output. We see that the inherent pattern of the positional encoding, which captures the positional information, is still present. The resulting word embedding is subject to a pattern of values in its embedding space that occurs as a result of varying frequencies, which are altered by the position (pos) and dimensional index (i).

This is analogous to telecommunications and signal processing, where frequency modulation (FM) is used to encode information in a carrier wave by varying the frequency of the wave.

PyTorch Implementation:

The current PyTorch Transformer modules (nn.Transformer, nn.TransformerEncoder, nn.TransformerDecoder…) do not include positional encoding. To include it, you must implement it yourself, or you can use the following code, which is listed as an example on the PyTorch GitHub:
https://github.com/pytorch/examples/blob/master/word_language_model/model.py

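import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Precompute the encoding table once: one row per position.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i / d_model), computed in log space.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # Shape (max_len, 1, d_model) so it broadcasts over the batch dimension.
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x has shape (seq_len, batch, d_model).
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)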
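The div_term line in the code above computes the inverse of 10000^(2i/d_model) in log space, as exp(-2i · ln(10000) / d_model). The following sketch checks numerically that this matches the direct form. It uses only the standard library, and d_model = 8 is an arbitrary illustrative value.

```python
import math

d_model = 8  # arbitrary illustrative size (assumption)

for two_i in range(0, d_model, 2):  # same values as torch.arange(0, d_model, 2)
    log_space = math.exp(two_i * (-math.log(10000.0) / d_model))
    direct = 1.0 / (10000.0 ** (two_i / d_model))
    # The two expressions agree up to floating-point rounding.
    assert math.isclose(log_space, direct, rel_tol=1e-12)
    print(two_i, log_space, direct)
```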

Conclusion

Transformers are widely used due to their high performance and the intuitive ideas at the heart of their model structure. In this post, we’ve discussed the elegant intuition behind positional encodings.

I hope that this post is of use to ML practitioners and students like myself who are curious about every aspect of a model.

References:

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

[2] Takase, Sho, and Naoaki Okazaki. “Positional encoding to control output sequence length.” arXiv preprint arXiv:1904.07418 (2019).

[3] Gehring, Jonas, et al. “Convolutional sequence to sequence learning.” Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017.

[4] https://kazemnejad.com/blog/transformer_architecture_positional_encoding/