Attention layer — Continued from: Reference for dimensions and numbers used in a seq2seq model for…



Overall, what we did so far can be summarized in these two lines of pseudo-code:

  • Score = f(decoder hidden state, encoder output)
  • SoftMax probability vector, i.e. the attention_weight vector = SoftMax(Score)

The function ‘f’ was learnt using a neural network. This function ‘f’ is exactly what we learnt in Fig 3.

Let’s include the Attention layer code (which includes the snippets I shared earlier). During training, this layer is invoked by the decoder from the most recent time step or position in the output sentence, to generate the next word. If the sentence is “Hello world”, the “attention_layer” call is invoked once for each word in the sentence.

Here it is included as a self-contained unit (helps with testing as well):

Fig 5: Attention Layer
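Since Fig 5 appears as an image in the original post, here is a minimal, self-contained sketch of such a layer for reference. It mirrors the Bahdanau-style attention layer from the TensorFlow NMT tutorial, and the variable names (W1, V, query_with_time_axis, values) are the same ones discussed below; the shape comments assume the dimensions used throughout this article (batch size 64, max_length 16, 1024 units):

    import tensorflow as tf

    class BahdanauAttention(tf.keras.layers.Layer):
        def __init__(self, units):
            super(BahdanauAttention, self).__init__()
            self.W1 = tf.keras.layers.Dense(units)  # scores the decoder hidden state
            self.W2 = tf.keras.layers.Dense(units)  # scores the encoder output
            self.V = tf.keras.layers.Dense(1)       # collapses to one score per input position

        def call(self, query, values):
            # query: decoder hidden state, shape (64, 1024)
            # values: encoder output, shape (64, 16, 1024)
            query_with_time_axis = tf.expand_dims(query, 1)         # (64, 1, 1024)
            score = self.V(tf.nn.tanh(
                self.W1(query_with_time_axis) + self.W2(values)))   # (64, 16, 1)
            attention_weights = tf.nn.softmax(score, axis=1)        # (64, 16, 1)
            context_vector = attention_weights * values             # (64, 16, 1024)
            context_vector = tf.reduce_sum(context_vector, axis=1)  # (64, 1024)
            return context_vector, attention_weights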

Context_Vector calculation

1.

Fig 6: Learning weights for Score from Decoder hidden state

“query_with_time_axis” is the tensor for the decoder hidden state from a time step (for a particular word in the sentence, for each example in the batch). Its dimensions over the entire batch for one input location are (64, 1024). This tensor is given as input to the Keras layer W1, which has 1024 output units. For the later addition step, we add a time dimension and convert it to (64, 1, 1024).

The dimensions of the hidden state, X (or the “query_with_time_axis”), for a single word are (1024, )

The weight vector of a single neuron in W1, as initialized by Keras, also has shape (1024, ) for one word or input.

The activation of that neuron, Z = W1.T · X (read as the neuron’s weight vector transposed, dot product with X), is a scalar, i.e. a single float value. That is the score contribution from the decoder hidden state at that neuron. A tensor would show ( ) as the shape attribute — an empty pair of round brackets.

Now, the score has dimensions of (16, 1) over the entire length of 16 inputs (one training example of length max_length in the batch). Over a 64-example batch, the score shape is (64, 16, 1). If we have not decided upon the batch size yet, we can represent the dimensions as (None, 16, 1).

Note: later, after adding this score contribution to the result of step 2 below, we apply a SoftMax and convert each value in the resulting vector to its corresponding probability value. That vector, the attention_weight vector, will have the same shape over one sentence or over the entire batch.
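As a quick shape check of this step, here is a minimal sketch; the random tensor is only a stand-in for a real decoder hidden state, and the shapes match the ones discussed above:

    import tensorflow as tf

    query = tf.random.normal((64, 1024))       # decoder hidden state for the whole batch
    W1 = tf.keras.layers.Dense(1024)           # the W1 layer with 1024 output units

    query_with_time_axis = tf.expand_dims(query, 1)
    print(query_with_time_axis.shape)          # (64, 1, 1024)
    print(W1(query_with_time_axis).shape)      # (64, 1, 1024)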

2.

Fig 7: Learning weights for Score from the Encoder output

“values” represents the Encoder output with shape (64, 16, 1024). This was discussed in Part 1. In this case, we feed the values tensor into a second Dense layer that also has 1024 output units (see the shape trace after step 3 below).

3.

A Tanh non-linearity is applied to the sum of (1) and (2) above, and the resulting activation is passed through the self.V layer (with 1 output), resulting in the score. The score has shape (64, 16, 1): one score for each sentence position.
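Continuing the shape sketch from step 1 (again with stand-in random tensors; W2 and V correspond to the second Dense layer and the 1-unit output layer):

    values = tf.random.normal((64, 16, 1024))  # stand-in for the encoder output
    W2 = tf.keras.layers.Dense(1024)
    V = tf.keras.layers.Dense(1)

    print(W2(values).shape)                    # (64, 16, 1024)

    # (64, 1, 1024) + (64, 16, 1024): the decoder contribution is broadcast
    # across all 16 input positions before tanh and the 1-unit layer
    score = V(tf.nn.tanh(W1(query_with_time_axis) + W2(values)))
    print(score.shape)                         # (64, 16, 1)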

4.

Applying a SoftMax over the 16 positions (axis=1) results in the same shape of (64, 16, 1), except that each float is now a probability value.
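A small check of this step, continuing the sketch above; the axis argument matters, because the probabilities should sum to 1 across the 16 input positions, not across the batch:

    attention_weights = tf.nn.softmax(score, axis=1)
    print(attention_weights.shape)                      # (64, 16, 1)
    print(tf.reduce_sum(attention_weights, axis=1)[0])  # approximately [1.0] for each example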

5.

Each probability value is broadcast across the 1024-dimensional hidden vector of its word, resulting in an adjusted or weighted vector called the context_vector.

The operation works as:

Fig 8: How much does each word contribute (latest from decoder and encoder output)

So, the first line, the multiplication, takes the attention_weights (the SoftMax output) with shape (64, 16, 1) and multiplies it with the encoder output, called values, with shape (64, 16, 1024).

Specifically, the SoftMax score, which sits in the third dimension (at index position 2) of attention_weights, gets broadcast over the corresponding dimension in the values tensor, and therefore the same score is applied to each member of the 1024-dimensional vector. We get back a (64, 16, 1024)-dimensional context_vector.

Next, we have to combine the 16 1024-dimensional vectors, and this is accomplished through tf.reduce_sum over the time dimension (axis=1). Vector elements in corresponding positions are all added up, resulting in one 1024-dimensional vector instead of 16 different 1024-dimensional vectors.
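This step, continuing the sketch above:

    context_vector = attention_weights * values             # (64, 16, 1) * (64, 16, 1024) -> (64, 16, 1024)
    context_vector = tf.reduce_sum(context_vector, axis=1)  # sum over the 16 positions
    print(context_vector.shape)                             # (64, 1024)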

Why did we add these 16 1024-dimensional vectors? Recall that 16 was the maximum number of allowed input words in a sentence.

With the addition of the 1024-dimensional SoftMax-weighted vectors — one from each input word, 16 vectors of 1024 elements each, hence the shape (None, 16, 1024) — we can tell which input word’s 1024-dimensional vector contributed most to the sum. This higher individual contribution, reflected in the sum computed by the tf.reduce_sum operation, is then used as an input to the decoder step that initiated the call to the attention layer. The input word contributing more to the sum drives the output generated by the decoder, or, in other words, gets more attention in generating the decoder output.

Now, if we are training the attention model, we concatenate the 256-dimensional embedding of the actual ground-truth word from the training set (called teacher forcing) with the 1024-dimensional context_vector, and use this concatenation as an input to the decoder (the other input into the decoder is the decoder hidden state).

If we are not training but running in test/production, we concatenate the 256-dimensional embedding of the decoder output from the previous step with the context_vector and use that instead as an input to the decoder (in addition to the decoder hidden state).
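A sketch of that concatenation, continuing the code above; the 256-dimensional embedding size comes from this series, while the vocabulary size of 10,000 and the word ids are placeholders:

    embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=256)  # vocab size is a placeholder

    word_ids = tf.constant([[42]] * 64)   # (64, 1): one input word id per example (placeholder ids)
    x = embedding(word_ids)               # (64, 1, 256)

    # concatenate the (64, 1024) context vector with the (64, 1, 256) embedding
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
    print(x.shape)                        # (64, 1, 1280), fed into the decoder GRU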

In practice, you might be able to just run the encoder, the decoder and the attention layer using the class tfa.seq2seq.BahdanauAttention, but taking a look under the hood is helpful for understanding new research and learning how that research is implemented in AI/ML platforms.
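For reference, a rough sketch of what that could look like with TensorFlow Addons; treat the parameter values as placeholders and check the tfa.seq2seq documentation for the full decoder wiring:

    import tensorflow as tf
    import tensorflow_addons as tfa

    units = 1024
    encoder_output = tf.random.normal((64, 16, units))   # stand-in for the real encoder output

    # Bahdanau attention over the encoder output ("memory" in tfa terminology)
    attention_mechanism = tfa.seq2seq.BahdanauAttention(units=units, memory=encoder_output)

    # Wrap a decoder cell so attention is computed at every decoding step
    decoder_cell = tf.keras.layers.GRUCell(units)
    attention_cell = tfa.seq2seq.AttentionWrapper(
        decoder_cell, attention_mechanism, attention_layer_size=units)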

When I ran the TensorFlow implementation of Neural Machine Translation, saving checkpoints consumed 85 GB for just 1 epoch and about 190 GB or so for 2 epochs — just to give a sense of the effort of implementing and trying this on your own (plus waiting for results and making sure you did not forget to include a change).