Image Captioning using Attention Mechanism in Keras

Source: Deep Learning on Medium

Caption generation is a challenging artificial intelligence problem in which a textual description must be generated for a given photograph. It requires both methods from computer vision to understand the content of the image and a language model from natural language processing to turn that understanding into words in the right order.

A “classic” image captioning system encodes the image with a pre-trained Convolutional Neural Network (the ENCODER), which produces a hidden state h.

It then decodes this hidden state with an LSTM (the DECODER), generating the caption recursively, one word at a time.

A classic image captioning model

Deep learning methods have demonstrated state-of-the-art results on caption generation problems. What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption, given a photo, instead of requiring sophisticated data preparation or a pipeline of specifically designed models.

Problem with ‘Classic’ Image Captioning Model

The problem with this method is that each word the model generates usually describes only one part of the image, yet the decoder is conditioned on the same representation h of the whole image for every word. A single vector h also struggles to capture everything in the input image, so using it to condition the generation of each word cannot efficiently produce different words for different parts of the image. This is exactly where an Attention mechanism is helpful.

Concept of Attention Mechanism:

With an Attention mechanism, the image is first divided into n parts, and a Convolutional Neural Network (CNN) computes a representation of each part: h_1, …, h_n. When the RNN generates a new word, the attention mechanism focuses on the relevant part of the image, so the decoder uses only specific parts of the image.

Image Captioning using Attention Mechanism

We can recognize the “classic” image captioning model in this figure, but with a new attention layer. What happens when we want to predict the next word of the caption? If we have predicted i words, the hidden state of the LSTM is h_i (the decoder state, not one of the image-part representations). We select the “relevant” part of the image by using h_i as the context. The output of the attention model, z_i, is the representation of the image filtered so that only the relevant parts remain, and it is used as an input to the LSTM. The LSTM then predicts a new word and returns a new hidden state h_{i+1}.
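The step described above can be sketched in NumPy. This is a minimal sketch, not the article's implementation: it assumes soft attention (a weighted average of the parts) with an additive scoring function, which the text does not specify. The names `soft_attention`, `parts`, `W_a`, `W_h` and `v` are hypothetical, the sizes are illustrative, and the weights are random stand-ins for parameters a real model would learn.

```python
import numpy as np

def soft_attention(parts, h_i, W_a, W_h, v):
    """Return (z_i, weights) for one decoding step.

    parts: (n, d) array, one row per image part
    h_i:   (k,) decoder hidden state after i words
    W_a:   (d, m), W_h: (k, m), v: (m,) would be learned in a real model
    """
    # Additive scoring: one scalar relevance score per image part.
    scores = np.tanh(parts @ W_a + h_i @ W_h) @ v       # shape (n,)
    # Softmax turns the scores into positive weights that sum to 1.
    scores = scores - scores.max()                       # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()      # shape (n,)
    # z_i is the weighted average of the part representations.
    z_i = weights @ parts                                # shape (d,)
    return z_i, weights

# Illustrative sizes only (assumptions, not taken from the article).
n, d, k, m = 6, 8, 5, 4
rng = np.random.default_rng(0)
parts = rng.standard_normal((n, d))
h_i = rng.standard_normal(k)
W_a = rng.standard_normal((d, m))
W_h = rng.standard_normal((k, m))
v = rng.standard_normal(m)

z_i, weights = soft_attention(parts, h_i, W_a, W_h, v)
print(z_i.shape, weights.shape)
```

In a full decoder, `z_i` would be fed to the LSTM together with the previous word, and the LSTM's new hidden state would be used as the context for the next step.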

Let’s discuss how the Attention Mechanism works

For images, we typically use the representation from one of the fully connected layers. But suppose, as shown in the figure below, a man is throwing a frisbee.

So, when I say the word ‘man’, we need to focus only on the man in the image, and when I say the word ‘throwing’, we have to focus on his hand. Similarly, when we say ‘frisbee’, we have to focus only on the frisbee. This means ‘man’, ‘throwing’ and ‘frisbee’ come from different pixels in the image. But the VGG-16 representation we used does not contain any location information.

But every location in the output of a convolution layer corresponds to some location in the image, as shown below.


For example, the output of the 5th convolution block of VGGNet is a feature map of size 14*14*512.

This feature map has 14*14 spatial locations, each corresponding to a certain portion of the image, which means we have 196 such locations.

Finally, we can treat these 196 locations (each with a 512-dimensional representation) as the set of image regions to attend over.

The model will then learn an attention over these locations (which in turn correspond to actual locations in the image).

As shown in the figure above, the 5th convolution block is represented by 196 locations, which can be passed to the decoder at different time steps.
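To make the 196 locations concrete, here is a small NumPy sketch that flattens a 14*14*512 feature map into 196 rows and maps a flat location index back to a grid cell of the image. It assumes a 224*224 input (the usual VGG-16 default), uses random numbers in place of real features, and `location_to_box` is a hypothetical helper. The box is only approximate: the actual receptive field of each location is larger than its grid cell.

```python
import numpy as np

GRID = 14          # the feature map is GRID x GRID
CHANNELS = 512
IMAGE_SIZE = 224   # assumed input resolution

# Random stand-in for the convolutional feature map of one image.
feature_map = np.random.default_rng(0).standard_normal((GRID, GRID, CHANNELS))

# Flatten the spatial grid: one 512-dimensional row per location.
locations = feature_map.reshape(GRID * GRID, CHANNELS)

def location_to_box(index, grid=GRID, image_size=IMAGE_SIZE):
    """Approximate pixel box (top, left, bottom, right) for a flat location index."""
    row, col = divmod(index, grid)
    cell = image_size // grid   # 16 pixels per grid cell for 224 / 14
    return (row * cell, col * cell, (row + 1) * cell, (col + 1) * cell)

print(locations.shape)        # (196, 512)
print(location_to_box(0))     # (0, 0, 16, 16)
print(location_to_box(195))   # (208, 208, 224, 224)
```

Because the reshape is row-major, row `r * 14 + c` of `locations` is the feature vector at grid position `(r, c)`, so an attention weight on that row can be drawn back onto the image.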

Let’s discuss the EQUATIONS:

So why does it work so well?

  1. It works because it is a better modelling technique.
  2. This is a more informed model.
  3. We are essentially asking the model to approach the problem in a better (more natural) way.
  4. Given enough data, it should be able to learn these attention weights just as humans do.
  5. And in practice, these models do indeed work better than the vanilla Encoder-Decoder models.

A few examples:

In the figure below, we can see, for each word of the caption, which part of the image (in white) is used to generate it.

Attention Mechanism (Source)

In these further examples, we can see the “relevant” part of each image that is used to generate the underlined word.

Attention Mechanism (Source)

Data Acquisition

There are many open-source datasets available for this problem, such as Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), and MS COCO (containing 180k images).

For this case study, I have used the Flickr 8k dataset, which you can download from here. Training a model on a large number of images may not be feasible on a system that is not a very high-end PC/laptop.

This dataset contains 8000 images, each with 5 captions (an image can have multiple captions, all of them relevant at the same time).

These images are split as follows:

  • Training Set — 6000 images
  • Dev Set — 1000 images
  • Test Set — 1000 images

Let me walk you through the CODE:

Utility Functions:

  1. To load the file/document.
  2. To load the images and their descriptions: map[image id: image description]
  3. To clean the image description/preprocessing after loading it.
  4. To convert the loaded descriptions into a vocabulary of words.
  5. To save descriptions to file, one per line.
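The five utilities listed above might look roughly like the sketch below. It is not the author's code: the function names are hypothetical, and it assumes each caption line has the form `image_name.jpg#index<TAB>caption text`. Since every image has several captions, the mapping stores a list of descriptions per image id.

```python
import string

def load_doc(filename):
    """1. Read a whole text file into a string."""
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()

def load_descriptions(doc):
    """2. Build {image id: [descriptions]} from the caption text."""
    mapping = {}
    for line in doc.split("\n"):
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip empty or malformed lines
        # Drop the file extension and caption index to get the image id.
        image_id = tokens[0].split(".")[0]
        mapping.setdefault(image_id, []).append(" ".join(tokens[1:]))
    return mapping

def clean_descriptions(descriptions):
    """3. Lowercase, strip punctuation, drop one-letter and non-alphabetic tokens (in place)."""
    table = str.maketrans("", "", string.punctuation)
    for desc_list in descriptions.values():
        for i, desc in enumerate(desc_list):
            words = [w.lower().translate(table) for w in desc.split()]
            words = [w for w in words if len(w) > 1 and w.isalpha()]
            desc_list[i] = " ".join(words)

def to_vocabulary(descriptions):
    """4. Collect the set of all words used in the descriptions."""
    vocab = set()
    for desc_list in descriptions.values():
        for desc in desc_list:
            vocab.update(desc.split())
    return vocab

def save_descriptions(descriptions, filename):
    """5. Write 'image_id description' to a file, one description per line."""
    lines = [f"{image_id} {desc}"
             for image_id, desc_list in descriptions.items()
             for desc in desc_list]
    with open(filename, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

# Usage on a tiny in-memory sample (made-up captions, not real dataset lines).
sample = "img1.jpg#0\tA man is throwing a frisbee .\nimg1.jpg#1\tA man throws a Frisbee !\n"
descriptions = load_descriptions(sample)
clean_descriptions(descriptions)
vocabulary = to_vocabulary(descriptions)
print(descriptions)
print(sorted(vocabulary))
```

With a real caption file, `load_doc` would supply the text passed to `load_descriptions`, and `save_descriptions` would store the cleaned captions for later training steps.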