How To Design An Automated Image Caption Generator-I


(Examples of generated captions)

Image captioning is an exciting application of deep learning that leverages the power of both computer vision and natural language processing. While it is really easy for a human to understand what is happening in an image, for machines it is quite complex, and it remained a major challenge for a long time until the advent of deep learning models such as CNNs and LSTMs.

Image captioning has a wide range of applications, and the leaders of AI have built successful products around it. Microsoft’s ‘Caption-Bot’ is an excellent example, and Facebook added a similar feature a few years ago. Image captioning can also serve as an excellent basis for image sentiment analysis. We can literally write a story about what is happening inside an image.

The crux of image captioning lies in extracting features from the images and using them as input to something that can generate the stories for us. What better model for this than an LSTM? A hybrid generative model that combines the strengths of both a CNN and an LSTM should therefore be able to fulfil our task.

Before jumping into the problem, let us have a look at the dataset we will use for this task: the Flickr8k dataset, which contains captions for each image. Every image comes with multiple captions so that much of its variation can be described. The dataset also avoids famous people and places, so the description of an image has to be learnt from the objects in the image alone, which is great for generalising to any image in the real world.
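To make the data format concrete, here is a minimal sketch of loading the captions. It assumes the standard `Flickr8k.token.txt` file, in which each line contains an image id, a caption index and the caption text; the function name `load_captions` is just an illustration.

```python
# A minimal sketch of loading the Flickr8k captions, assuming the standard
# "Flickr8k.token.txt" file where each line looks like "<image_id>#<n>\t<caption>".
from collections import defaultdict

def load_captions(token_file="Flickr8k.token.txt"):
    captions = defaultdict(list)  # image_id -> list of its captions
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_and_idx, caption = line.split("\t")
            image_id = image_and_idx.split("#")[0]  # e.g. "1000268201_693b08cb0e.jpg"
            captions[image_id].append(caption.lower())
    return captions

captions = load_captions()
print(len(captions), "images,", sum(len(v) for v in captions.values()), "captions")
```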

As a conceptual approach, our model is inspired by the classic encoder-decoder model used for machine translation, with the difference that the encoder is a CNN rather than a variant of an RNN. Here we shall use a pre-trained model such as VGG-16 or ResNet-50 to get the features from images instead of training our own model: these models have been trained on the mighty ImageNet dataset and already capture complex features, and it would be cumbersome for us to build a model from scratch that learns comparable features. We take only the output of the penultimate layer of the model, as we are interested in the features, not in classification.
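As an illustration, the feature extraction step could look like the Keras sketch below (a minimal example assuming a TensorFlow/Keras setup, not the exact code of this project): we rebuild VGG-16 without its final softmax layer and keep the 4096-dimensional output of the penultimate fully connected layer as the image representation.

```python
# A sketch of extracting image features with a pre-trained VGG-16, keeping only
# the penultimate (fc2) layer so we get a 4096-d feature vector per image.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")                                      # full classifier
encoder = Model(inputs=base.inputs, outputs=base.layers[-2].output)   # drop the softmax layer

def extract_features(image_path):
    img = load_img(image_path, target_size=(224, 224))   # VGG-16 expects 224x224 input
    x = img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))      # add batch dimension, normalise
    return encoder.predict(x, verbose=0)[0]              # shape: (4096,)
```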

After getting the features from the pre-trained model, we use them as input to the decoder, which uses LSTMs to generate words. The dataset already provides captions for each image. Our aim is to maximise the probability of producing each word of the ground-truth caption, given the image features and the word generated at the previous time step as input. This becomes clear with the help of the following image.

CNN-LSTM model for caption generation

As we can see in the image above, there are several giraffes standing in the scene, and we have a ground-truth caption for this image. We are supposed to tune the parameters so that at time step 1 the probability of producing ‘Giraffes’ as the output is maximised. Mathematically, given the image features and the words from the previous time steps, we want to maximise P(S_t | I, S_1, S_2, …, S_{t-1}), where I is the image feature vector learnt by VGG-16 and S_1, S_2, …, S_{t-1} are the words output by the LSTM at time steps 1, 2, …, t-1 respectively.
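One common way to wire this up in Keras is the ‘merge’-style decoder sketched below, in which the projected image features and the LSTM encoding of the partial caption are combined before a softmax over the vocabulary. This is one possible realisation rather than the exact architecture in the figure, and the layer sizes (`embed_dim`, `units`) are illustrative assumptions; minimising categorical cross-entropy on the ground-truth word corresponds to maximising P(S_t | I, S_1, …, S_{t-1}).

```python
# A sketch of one possible CNN-LSTM decoder in Keras. vocab_size and max_length
# come from the caption preprocessing; the 4096-d input matches the VGG-16 features.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def build_caption_model(vocab_size, max_length, embed_dim=256, units=256):
    # Image branch: project the 4096-d VGG-16 features into the LSTM's space.
    img_in = Input(shape=(4096,))
    img_feat = Dense(units, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: embed the partial caption and run it through an LSTM.
    seq_in = Input(shape=(max_length,))
    seq_emb = Embedding(vocab_size, embed_dim, mask_zero=True)(seq_in)
    seq_feat = LSTM(units)(Dropout(0.5)(seq_emb))

    # Merge both branches and predict the next word over the whole vocabulary.
    merged = Dense(units, activation="relu")(add([img_feat, seq_feat]))
    out = Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[img_in, seq_in], outputs=out)
    # Cross-entropy on the ground-truth next word ~ maximising P(S_t | I, S_1..S_{t-1}).
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```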

Here the <start> and <end> tokens help the LSTM understand where the caption starts and ends. Also, before feeding the ground-truth word into the next time step of the LSTM, we pass it through an embedding layer whose weights are learnt as part of model training. Each word is also converted into a one-hot-encoded representation whose dimension is equal to the size of the vocabulary.
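A rough sketch of this preprocessing is shown below; it reuses the `captions` dictionary from the earlier snippet and uses the plain words ‘startseq’ and ‘endseq’ as stand-ins for the <start> and <end> tokens (all names here are illustrative).

```python
# A sketch of turning captions into supervised (image, partial sequence) -> next-word
# training pairs, with "startseq"/"endseq" playing the role of <start>/<end>.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

texts = ["startseq " + c + " endseq" for caps in captions.values() for c in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(t.split()) for t in texts)

def make_pairs(image_feature, caption):
    """Expand one caption into (image, words-so-far) -> next-word training examples."""
    seq = tokenizer.texts_to_sequences(["startseq " + caption + " endseq"])[0]
    X_img, X_seq, y = [], [], []
    for i in range(1, len(seq)):
        X_img.append(image_feature)
        X_seq.append(pad_sequences([seq[:i]], maxlen=max_length)[0])  # words generated so far
        y.append(to_categorical(seq[i], num_classes=vocab_size))      # one-hot next word
    return np.array(X_img), np.array(X_seq), np.array(y)
```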

Now that the theoretical approach to image caption generation is clear, in the next post we shall see how to apply it in the real world and whether it can generate captions for images that are not in the dataset.