Automated Image Caption Generator for Visually Impaired People

Source: Deep Learning on Medium


Being able to automatically describe the content of an image using properly formed English sentences is a challenging task, but it could have great impact by helping visually impaired people better understand their surroundings.

Images of a user's surroundings can be captured and used to generate captions, which are then read out loud so that visually impaired users get a better sense of what is happening around them.

Challenges That Blind People Face

Technical Approach to Solving the Problem

We implemented a deep recurrent architecture that automatically produces short descriptions of images. Our models use a CNN, pre-trained on ImageNet, to obtain image features. We then feed these features into either a vanilla RNN or an LSTM network (Figure 2) to generate a description of the image in valid English.
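The pipeline above can be sketched as a greedy decoding loop: the CNN features initialize the recurrent state, and the network then emits the highest-scoring word at each step. The snippet below is a toy NumPy illustration under assumptions of ours, not the actual implementation: the vocabulary size, hidden size, weight matrices, and the START/END token ids are made-up placeholders, and a real system would use trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H = 10, 512, 64   # toy vocabulary, feature, and hidden sizes (illustrative)
START, END = 0, 1       # assumed special-token ids

Wfh = rng.normal(scale=0.1, size=(H, D))   # projects CNN features into the initial state
Wxh = rng.normal(scale=0.1, size=(H, V))   # one-hot word -> hidden
Whh = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden recurrence
Why = rng.normal(scale=0.1, size=(V, H))   # hidden -> word scores

def greedy_decode(features, max_len=10):
    """Greedy captioning: CNN features set the initial hidden state,
    then the RNN emits the highest-scoring word until END."""
    h = np.tanh(Wfh @ features)
    word = START
    caption = []
    for _ in range(max_len):
        x = np.eye(V)[word]                # embed the previous word as a one-hot vector
        h = np.tanh(Wxh @ x + Whh @ h)     # vanilla RNN recurrence
        word = int(np.argmax(Why @ h))     # pick the most likely next word
        if word == END:
            break
        caption.append(word)
    return caption

print(greedy_decode(rng.normal(size=D)))   # a list of (untrained, arbitrary) word ids
```

With trained weights, the emitted word ids would be mapped back through the vocabulary to produce the English sentence.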

CNN-based Image Feature Extractor

For feature extraction, we use a CNN. CNNs have been widely used and studied for image tasks, and are currently state-of-the-art methods for object recognition and detection. We feed these features into the first layer of our RNN or LSTM at the first iteration.

RNN-based Sentence Generator

We first experiment with vanilla RNNs, as they have been shown to be powerful models for processing sequential data such as text.

Figure 2: Image Retrieval System and Language Generating Pipeline.
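The vanilla RNN recurrence is compact enough to write out directly: one tanh nonlinearity over the current input and the previous hidden state. The toy dimensions and random weights below are placeholders of ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 8, 4  # toy input and hidden sizes (illustrative only)
Wxh = rng.normal(scale=0.1, size=(H, D))  # input -> hidden
Whh = rng.normal(scale=0.1, size=(H, H))  # hidden -> hidden
b = np.zeros(H)

def rnn_step(x, h_prev):
    # The entire vanilla RNN update: no gates, just one squashed linear map.
    return np.tanh(Wxh @ x + Whh @ h_prev + b)

h = rnn_step(rng.normal(size=D), np.zeros(H))
```

Because the same `Whh` is applied at every step, gradients are repeatedly multiplied through it during training, which is the root of the vanishing/exploding-gradient issue discussed next.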

LSTM-based Sentence Generator

Although RNNs have proven successful on tasks such as text generation and speech recognition [25, 26], it is difficult to train them to learn long-term dynamics. This is likely due to the vanishing and exploding gradient problems that result from propagating gradients through the many steps of a recurrent network. LSTM networks (Figure 3) provide a solution by incorporating memory units that allow the network to learn when to forget previous hidden states and when to update them given new information.

Figure 3: LSTM unit and its gates
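The gating logic in Figure 3 can be written out in a few lines of NumPy. This is a schematic single step with made-up toy dimensions, not the trained model; stacking the four gates into one weight matrix, and the gate ordering within it, are conventions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,).
    Assumed gate order in the stacked matrices: i, f, o, g."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])        # input gate: how much new information to write
    f = sigmoid(z[H:2*H])      # forget gate: how much old memory to keep
    o = sigmoid(z[2*H:3*H])    # output gate: how much memory to expose
    g = np.tanh(z[3*H:4*H])    # candidate memory content
    c = f * c_prev + i * g     # memory cell update: forget old, write new
    h = o * np.tanh(c)         # new hidden state
    return h, c

# Toy usage with random weights (illustrative only).
D, H = 6, 4
rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
```

The key difference from the vanilla RNN is the additive cell update `c = f * c_prev + i * g`, which lets gradients flow across many steps without vanishing when the forget gate stays open.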


For this exercise, we will use the 2014 release of the Microsoft COCO dataset which has become the standard testbed for image captioning [29]. The dataset consists of 80,000 training images and 40,000 validation images, each annotated with 5 captions written by workers on Amazon Mechanical Turk. Four example images with captions can be seen in Figure 4. We convert all sentences to lowercase and discard non-alphanumeric characters.
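The caption preprocessing described above (lowercasing and discarding non-alphanumeric characters) can be expressed as a small helper; the function name and tokenization into words are choices of this sketch.

```python
import re

def normalize_caption(caption):
    """Lowercase a COCO caption and keep only alphanumeric word tokens."""
    caption = caption.lower()
    # Split on anything that is not a lowercase letter or digit.
    return re.findall(r"[a-z0-9]+", caption)

print(normalize_caption("A man riding a wave on his surf-board!"))
# → ['a', 'man', 'riding', 'a', 'wave', 'on', 'his', 'surf', 'board']
```

Each of the 5 reference captions per image would be normalized this way before building the training vocabulary.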

Figure 4: Example images and captions from the Microsoft COCO Caption dataset.

Qualitative Results

Our models generate sensible descriptions of images in valid English (Figures 6 and 7). As can be seen from the example groundings in Figure 5, the model discovers interpretable visual-semantic correspondences, even for relatively small objects such as the phones in Figure 7. The generated descriptions are accurate enough to be helpful for visually impaired people. In general, we find that a relatively large portion (60%) of generated sentences can be found in the training data.

Figure 5: Evaluation of full image predictions on 1,000 test images of the Microsoft COCO 2014 dataset
Figure 6: Example image descriptions generated using the RNN structure.
Figure 7: Example image descriptions generated using the LSTM structure.


We have presented a deep learning model that automatically generates image captions with the goal of helping visually impaired people better understand their environments.

References

Li, Li-Jia, R. Socher, and Li Fei-Fei. "Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework." 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009). Web. 21 Apr. 2016.