Original article was published on Artificial Intelligence on Medium

How do we capture position and order of data?

# Elegant Intuitions Behind Positional Encodings

## Introduction:

Understanding position and order is crucial in many tasks that involve sequences. Positional encoding plays a key role in the widely known Transformer model (Vaswani et al., 2017) because the architecture does not naturally account for the order of its input. The positional encoding step allows the model to recognize which part of the sequence a given input belongs to.

My intent in writing this post is to help students and practitioners, like myself, gain a stronger grasp of the intuition behind the formulation of the Transformer's positional embeddings. Hopefully, a thorough understanding can help us develop the ability to make tweaks and adjustments for each use case and push the boundaries of research.

## What are Positional Embeddings and where are they used?

At a high level, a positional embedding is a tensor of values in which each row encodes the position of a word in a sequence; it is added to the input embeddings to produce final embeddings that carry order information.

As shown in the model figure above, positional embeddings are added to the inputs of both the encoder and the decoder, because the structure of the Transformer does not otherwise take the order of the input sequence into account. We need to apply the positional encoding before the decoder as well, since the decoder's input is a sequence of word embeddings that has lost all information about the position of its elements.
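To make the data flow concrete, here is a minimal NumPy sketch of where the addition happens; the shapes are toy assumptions, and the positional tensor is left as a placeholder of zeros since its values come from the formula developed in the next section:

```python
import numpy as np

# Assumed toy shapes: a sequence of N words, each embedded in h_w dimensions.
N, h_w = 10, 512

encoder_input = np.random.randn(N, h_w)   # source word embeddings
decoder_input = np.random.randn(N, h_w)   # (shifted) target word embeddings

# pe holds one row of position information per word; zeros here stand in
# for the values given by the positional-encoding formula.
pe = np.zeros((N, h_w))

# Element-wise addition injects order information before each stack.
encoder_input = encoder_input + pe
decoder_input = decoder_input + pe
```

The same positional tensor is reused for both stacks, since position depends only on where a word sits in its sequence, not on which stack consumes it.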

## Formulation:

In this section, we will assume that the task is language modelling, where the input and the output are both sequences of words.

Given a sequence of words, we process it into word embeddings *Zʷ: N x hʷ*, where *N* represents the number of words in a sampled sequence and *hʷ* represents the embedding size. Then *pos ∈ [0, N-1]* is the position of the word in the sequence, and *i ∈ [0, hʷ-1]* is the index that spans the dimensions of the word embedding.

To reiterate:

Given: Word Embeddings *Zʷ: N x hʷ*

- *N*: number of words in the sequence
- *hʷ*: dimension size of the word embedding
- *pos*: position of the current word in the sequence, in *[0, N-1]*
- *i*: dimension index of the word embedding, in *[0, hʷ-1]*
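With this notation in hand, the sinusoidal encoding proposed by Vaswani et al. (2017) can be sketched in NumPy. This is a minimal illustration, assuming an even embedding size *hʷ*; the small shapes are chosen just for demonstration:

```python
import numpy as np

def positional_encoding(N, h_w):
    """Build the (N, h_w) positional-encoding matrix using the
    sinusoidal formulation of Vaswani et al. (2017)."""
    pe = np.zeros((N, h_w))
    pos = np.arange(N)[:, np.newaxis]          # positions 0 .. N-1, shape (N, 1)
    i = np.arange(0, h_w, 2)                   # even dimension indices
    angle = pos / np.power(10000.0, i / h_w)   # broadcast to shape (N, h_w/2)
    pe[:, 0::2] = np.sin(angle)                # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)                # odd dimensions get cosine
    return pe

pe = positional_encoding(10, 16)
```

Each row of `pe` corresponds to one value of *pos*, and each column to one value of *i*, matching the definitions above.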

Thus the formula for the positional embedding is: