Multi-Headed Attention Mechanism


Improving the self-attention mechanism

In my last blog post, we discussed self-attention. I strongly recommend going through that post before reading about the multi-headed attention mechanism. Now, let’s see how multi-headed attention can help.

Say we have a sentence:

“I gave my dog Charlie some food.” As we can see, there are multiple actions going on:

  • “I gave” is one action.
  • “to my dog Charlie” is a second action.
  • “What did I give?” (some food) is a third action.

To keep track of all these actions, we need multi-headed attention.

As you can see in the image above, multi-headed attention is an extension of self-attention with multiple heads at the keys, queries, and values blocks. The outputs of all the heads are concatenated and passed through a dense layer to produce the final output. This multi-head mechanism is more efficient because it performs several attention computations in parallel. Earlier, in the self-attention mechanism, a single layer was supposed to capture all the actions going on in the sentence “I gave my dog Charlie some food.” With multi-headed attention, the actions are shared across multiple heads and captured better.
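To make the split-attend-concatenate flow concrete, here is a minimal NumPy sketch of multi-head self-attention. It assumes a single sequence of shape (seq_len, d_model) and randomly initialized projection matrices Wq, Wk, Wv, Wo (hypothetical names for illustration); a real layer would learn these during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo):
    """Multi-head self-attention over x of shape (seq_len, d_model).

    Wq, Wk, Wv, Wo are (d_model, d_model) projections; each head works
    on a slice of size d_model // num_heads.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project to queries, keys, values and split into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split_heads(z):
        return z.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Scaled dot-product attention, computed for every head in parallel
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and mix them with the final dense layer Wo
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Toy usage: 7 tokens ("I gave my dog Charlie some food"), d_model=8, 2 heads
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 7, 8, 2
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo).shape)  # (7, 8)
```

Note that each head attends over the full sentence but with its own projections, so different heads are free to specialize in different relations (who gave, to whom, what was given), and the final dense layer combines what all the heads found.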

In my next blog post, we will discuss Transformers, in which multi-head attention plays a crucial role. Until then, goodbye.