Introduction to Image Caption Generation using the Avengers: Infinity War Characters


Deep learning can be a daunting field for beginners. It was no different for me: most of the algorithms and terms sounded like they came from another world! I needed a way to understand the concepts from scratch in order to figure out how things actually work. And lo and behold, I found an interesting way to learn deep learning concepts.

The idea is pretty simple. To understand any deep learning concept, imagine this:

The mind of a newborn baby is capable of performing a trillion calculations. All you need is time (epochs) and nurture (algorithms) to make it understand a “thing” (problem case). I personally call this the babifying technique.

This intuition works because neural networks are inspired by the human brain in the first place. So, re-engineering the problem should definitely work! Let me explain that with an example.

What if we trained our model on images of American culture, and later asked it to predict labels for traditional Indian folk dancers?

Apply the re-engineering idea to the question. It would be akin to imagining a kid who has been brought up in the USA, and has been to India for a vacation. Guess what label an American kid would predict for this image? Keep that in your mind before scrolling further.

Guess the caption?

This image shows a lot of traditional Indian dress.

What caption would a kid born in America, or a model trained on an American dataset, give it?

From my experiments, the model predicted the following caption:

A Man Wearing A Hat And A Tie

It might sound funny if you’re aware of Indian culture, but that’s the bias of algorithms. Image caption generation works in a similar manner. An image captioning model has two main components.

Understanding Image Caption Generation

The first one is an image-based model which extracts the features of the image, and the other is a language-based model which translates the features and objects given by our image-based model into a natural sentence.

In this article, we will be using a pretrained CNN that was trained on the ImageNet dataset. The images are resized to a standard resolution of 224 × 224 × 3 (height × width × color channels). This keeps the input size constant for the model, whatever image it is given.
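To make the fixed input size concrete, here is a minimal sketch in plain Python of the arithmetic involved. The mean and standard deviation are the per-channel values passed to the Normalize transform later in this article; the function names input_size and normalize_pixel are made up for illustration, and in practice torchvision's transforms do this work for you.

```python
# Per-channel (RGB) statistics, the same values passed to Normalize later on.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

HEIGHT, WIDTH, CHANNELS = 224, 224, 3


def input_size(height=HEIGHT, width=WIDTH, channels=CHANNELS):
    """Number of values the network receives for a single image."""
    return height * width * channels


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [value / 255.0 for value in rgb]
    return [(s - m) / d for s, m, d in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]


print(input_size())                      # 150528
print(normalize_pixel((255, 255, 255)))  # a pure white pixel
```

Every image, whatever its original size, ends up as the same 150,528 numbers, which is what lets one network handle all of them.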

A condensed feature vector is created by the convolutional neural network (CNN). In technical terms, this feature vector is called an embedding, and the CNN model is referred to as an encoder. In the next stage, we will use these embeddings from the CNN as input to the LSTM network, the decoder.

In a language model, the LSTM predicts the next word in a sentence. Given the initial embedding of the image, the LSTM is trained to predict the most probable next word of the sequence. It’s just like showing a person a series of pictures and asking them to remember the details, then later showing them a new image with similar content and asking them to recall it. This “remember” and “recall” job is done by our LSTM network.
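The "predict the most probable next word" loop can be sketched in a few lines of plain Python. This is only a toy: the probability table below is invented, and it conditions on the previous word alone, whereas a real LSTM also carries a hidden state seeded by the image embedding. Picking the single most probable word at each step is known as greedy decoding, one common choice among several. The <start> and <end> markers delimit the caption.

```python
# Hypothetical next-word probabilities. A trained LSTM would compute these
# from the image embedding and the words generated so far.
NEXT_WORD_PROBS = {
    "<start>": {"a": 0.9, "the": 0.1},
    "a": {"man": 0.6, "dog": 0.4},
    "man": {"is": 0.7, "<end>": 0.3},
    "is": {"sitting": 0.8, "holding": 0.2},
    "sitting": {"<end>": 0.95, "down": 0.05},
}


def greedy_decode(probs, max_length=10):
    """Repeatedly append the most probable next word until <end> or max_length."""
    words = ["<start>"]
    while len(words) < max_length:
        candidates = probs[words[-1]]
        next_word = max(candidates, key=candidates.get)
        words.append(next_word)
        if next_word == "<end>":
            break
    return words


print(greedy_decode(NEXT_WORD_PROBS))
# ['<start>', 'a', 'man', 'is', 'sitting', '<end>']
```

The max_length guard matters: without it, a model that never emits <end> would generate forever.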

Technically, we also insert <start> and <end> tokens to signal the start and end of the caption.

['<start>', 'A', 'man', 'is', 'holding', 'a', 'stone', '<end>']
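Before training, each token is mapped to an integer id, since the network works on numbers rather than strings. Here is a minimal sketch of that step, assuming a simple whitespace tokenizer; build_vocab and encode are illustrative names, and the extra <pad> and <unk> tokens are an assumption rather than something this article's vocabulary file is known to contain.

```python
def build_vocab(captions):
    """Assign an integer id to every token, reserving ids for special tokens."""
    word2idx = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for caption in captions:
        for word in caption.lower().split():
            # setdefault only adds the word if it has not been seen before
            word2idx.setdefault(word, len(word2idx))
    return word2idx


def encode(caption, word2idx):
    """Wrap a caption in <start>/<end> and convert it to a list of ids."""
    tokens = ["<start>"] + caption.lower().split() + ["<end>"]
    return [word2idx.get(token, word2idx["<unk>"]) for token in tokens]


vocab = build_vocab(["A man is holding a stone"])
print(encode("A man is holding a stone", vocab))   # [1, 4, 5, 6, 7, 4, 8, 2]
print(encode("A man is holding a hammer", vocab))  # [1, 4, 5, 6, 7, 4, 3, 2]
```

Words never seen during vocabulary building, such as "hammer" here, fall back to the <unk> id.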

This way, the model learns from various instances of images and finally predicts the captions for unseen images. To learn and dig deeper, I highly recommend reading the following references:

  1. Show and Tell: A Neural Image Caption Generator by the Google Research team
  2. Automatic Image Captioning using Deep Learning (CNN and LSTM) in PyTorch by Analytics Vidhya


To replicate the results of this article, you’ll need to install the prerequisites. Make sure you have Anaconda installed. If you want to train your model from scratch, follow the steps below; otherwise, skip to the Pretrained model part.

git clone
cd coco/PythonAPI/
python build
python install
cd ../../
git clone
cd pytorch-tutorial/tutorials/03-advanced/image_captioning/
pip install -r requirements.txt

Pretrained model

You can download the pretrained model from here and the vocabulary file from here. You should extract to ./models/ and vocab.pkl to ./data/ using the unzip command.

Now that you have the model ready, you can predict the captions using:

$ python --image='png/example.png'

The original repository and code are implemented in the command line interface and you will need to pass Python arguments. To make it more intuitive, I have made a few handy functions to leverage the model in our Jupyter Notebook environment.

Let’s begin! Import all the libraries and make sure the notebook is in the root folder of the repository:

import torch
import matplotlib.pyplot as plt
import numpy as np
import argparse
import pickle
import os
from torchvision import transforms
from build_vocab import Vocabulary
from model import EncoderCNN, DecoderRNN
from PIL import Image

Add this configuration snippet and the load_image function to the notebook:

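# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Function to load and resize the image
def load_image(image_path, transform=None):
    # Convert to RGB so grayscale or RGBA files still yield 3 channels
    image = Image.open(image_path).convert('RGB')
    image = image.resize((224, 224), Image.LANCZOS)
    if transform is not None:
        # Apply the transform and add a batch dimension: (3, 224, 224) -> (1, 3, 224, 224)
        image = transform(image).unsqueeze(0)
    return image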

Hard code the constants with the pretrained model parameters. The pretrained model was trained using these parameters, so they should not be modified unless you are training your model from scratch.

ENCODER_PATH = './models/encoder-5-3000.pkl'
DECODER_PATH = './models/decoder-5-3000.pkl'
VOCAB_PATH = 'data/vocab.pkl'
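Since a missing checkpoint only shows up as an error deep inside the loading code, it can help to check the files up front. The following is a small sketch; missing_files is a hypothetical helper, and the three paths simply repeat the constants above.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [str(path) for path in paths if not Path(path).is_file()]


# Same locations as the ENCODER_PATH, DECODER_PATH and VOCAB_PATH constants.
required = [
    './models/encoder-5-3000.pkl',
    './models/decoder-5-3000.pkl',
    'data/vocab.pkl',
]

missing = missing_files(required)
if missing:
    print('Missing files:', ', '.join(missing))
else:
    print('All model files found.')
```

If anything is reported missing, check that the notebook is running from the root folder of the repository.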

Now, code a PyTorch function that uses pretrained files to predict the output:

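def PretrainedResNet(image_path, encoder_path=ENCODER_PATH,
                     decoder_path=DECODER_PATH,
                     vocab_path=VOCAB_PATH,
                     # Assumed defaults: these must match the values the
                     # pretrained checkpoint was trained with.
                     embed_size=256,
                     hidden_size=512,
                     num_layers=1):
    # Image preprocessing: convert to a tensor, then normalize with ImageNet statistics
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                             (0.229, 0.224, 0.225))])

    # Load vocabulary wrapper
    with open(vocab_path, 'rb') as f:
        vocab = pickle.load(f)

    # Build models
    encoder = EncoderCNN(embed_size).eval()  # eval mode (batchnorm uses moving mean/variance)
    decoder = DecoderRNN(embed_size, hidden_size, len(vocab), num_layers)
    encoder = encoder.to(device)
    decoder = decoder.to(device)

    # Load the trained model parameters (map_location lets GPU checkpoints load on CPU)
    encoder.load_state_dict(torch.load(encoder_path, map_location=device))
    decoder.load_state_dict(torch.load(decoder_path, map_location=device))

    # Prepare an image
    image = load_image(image_path, transform)
    image_tensor = image.to(device)

    # Generate a caption from the image
    feature = encoder(image_tensor)
    sampled_ids = decoder.sample(feature)
    sampled_ids = sampled_ids[0].cpu().numpy()  # (1, max_seq_length) -> (max_seq_length)

    # Convert word_ids to words, stopping at the end token
    sampled_caption = []
    for word_id in sampled_ids:
        word = vocab.idx2word[word_id]
        sampled_caption.append(word)
        if word == '<end>':
            break
    # Strip the '<start> ' prefix (8 characters) and the '<end>' suffix (5 characters)
    sentence = ' '.join(sampled_caption)[8:-5].title()

    # Return the caption together with the original image for display
    image = Image.open(image_path)
    return sentence, image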
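The slice [8:-5] in the function works because '<start> ' is 8 characters long and '<end>' is 5, but it silently breaks if the token names ever change. As an alternative sketch, the special tokens can be filtered out by name instead. The function ids_to_caption and the tiny idx2word dictionary below are made up for illustration and are not part of the original repository.

```python
def ids_to_caption(word_ids, idx2word):
    """Turn predicted ids into a title-cased caption without special tokens."""
    words = []
    for word_id in word_ids:
        word = idx2word[word_id]
        if word == "<end>":
            break  # ignore anything the decoder emitted after the end token
        if word != "<start>":
            words.append(word)
    return " ".join(words).title()


# Hypothetical miniature vocabulary, just large enough for this example.
idx2word = {1: "<start>", 2: "<end>", 4: "a", 5: "man", 6: "wearing", 7: "hat"}

print(ids_to_caption([1, 4, 5, 6, 4, 7, 2, 2, 2], idx2word))
# A Man Wearing A Hat
```

Unlike the slice, this version also avoids the trailing space left behind when the text before '<end>' ends in a separator.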

To predict a caption, use:

predicted_label, image = PretrainedResNet(image_path='IMAGE_PATH')
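To caption several images in one go, the call can be wrapped in a small loop. This is a sketch only: caption_all is a hypothetical helper that accepts any function returning a (caption, image) pair, and fake_predict stands in for the real model so the example runs without the pretrained files.

```python
def caption_all(image_paths, predict):
    """Caption each image with `predict`, which returns (caption, image).

    Files that cannot be opened are recorded as None instead of aborting.
    """
    captions = {}
    for path in image_paths:
        try:
            caption, _image = predict(path)
        except OSError:
            caption = None
        captions[path] = caption.strip() if caption is not None else None
    return captions


# Stand-in predictor so the sketch runs without the model files.
def fake_predict(path):
    if path == "png/missing.png":
        raise FileNotFoundError(path)
    return "A Man Wearing A Hat And A Tie ", None


print(caption_all(["png/example.png", "png/missing.png"], fake_predict))
# {'png/example.png': 'A Man Wearing A Hat And A Tie', 'png/missing.png': None}
```

In the notebook, you would pass `lambda path: PretrainedResNet(image_path=path)` as the predict argument. Keep in mind that PretrainedResNet rebuilds and reloads the model on every call, so this is fine for a handful of images but slow for large batches.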

We had Hulk. Now we have ML!

Let us get started with producing captions on some scenes from Avengers: Infinity War, and see how well the model generalizes!

Test Image: Mark I

Have a look at the image shown below:


What do you think this image is about? Hold a caption in your mind without scrolling down.

Let’s see what our model predicts for this image.

Well, the prediction for this image is exactly to the point. Makes me curious if I can train a whole model again just on the Marvel Universe to predict the names. Personally, I would love to see Tony Stark being represented as Iron Man.

Test Image: Mark II

Perfect again! In fact, Tony is holding a mobile phone to call Steve Rogers.

Test Image: Mark III

Honestly, even I am pretty amazed at what the model has learned. It captured the foreground as well as the background information. Although it misclassified the Panther statue as a mountain, it’s still a pretty good prediction overall.

Test image: Mark IV

Oh boy! Rocket Raccoon is going to be really upset. He gets super annoyed when people around the galaxy refer to him as a rabbit or a talking panda. “Dog” is going to get on his nerves a bit!

Plus, the model is trained on cars, and hence spaceships are out of the question here. But I am quite happy that our model successfully predicted Rocket Raccoon sitting near a “window”.

Test image: Mark V

“Woods”, correct. “Man sitting”, correct. “A Rock”, unfortunate, but correct.

Our model is absolutely brilliant at captioning the images. Taking this forward, I would like to train it further on the Marvel Universe to see if the model can recognize the names, context or perhaps even the humor.

Final Test: Avengers 4 Prediction

Avenge Us Fan poster — (Hint: The Soul World!)

The model pretty much hints at the new soul world twist in the Avengers 4 plot. I will leave this one to you! Do let me know what you interpret from the last image in the comments below.

End Notes

Artificial intelligence and machine learning are getting more awesome with every breakthrough. I hope you now have a basic intuition of how image captioning works, and had fun doing it the Avengers way.

PS: Ultron is gone for good. We assure you that we are NOT working on that AI singularity yet.


So, take a break and share your love through claps, and don’t forget to subscribe to the Analytics Vidhya publication for more awesome stuff.

Source: Deep Learning on Medium