Source: Deep Learning on Medium
Automated Image Caption Generator for Visually Impaired People
Being able to automatically describe the content of an image using properly formed English sentences is a challenging task, but it could have great impact by helping visually impaired people better understand their surroundings.
Images captured of a user's surroundings can then be used to generate captions that are read out loud, giving visually impaired users a better sense of what is happening around them.
Technical Approach to the Problem
We implemented a deep recurrent architecture that automatically produces short descriptions of images. Our models use a CNN, pre-trained on ImageNet, to obtain image features. We then feed these features into either a vanilla RNN or an LSTM network (Figure 2) to generate a description of the image in valid English.
CNN-based Image Feature Extractor
For feature extraction, we use a CNN. CNNs have been widely used and studied for image tasks, and are currently state-of-the-art methods for object recognition and detection. We feed the extracted features into the first layer of our RNN or LSTM at the first time step.
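A minimal sketch of this stage, written here in PyTorch (the article does not specify a framework). A tiny convolutional stack stands in for the ImageNet-pretrained CNN so the example is self-contained; in practice you would load a pretrained network and drop its classification head. All layer sizes are illustrative assumptions, not values from the article.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Stand-in for the pretrained CNN with its classifier removed:
    convolutions, a global average pool, then a linear projection
    down to the dimensionality expected by the caption decoder."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # (B, 64, H, W) -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, feat_dim)  # project to decoder size

    def forward(self, images):
        x = self.conv(images).flatten(1)  # (B, 64)
        return self.fc(x)                 # (B, feat_dim)

features = FeatureExtractor()(torch.randn(2, 3, 224, 224))
print(features.shape)  # torch.Size([2, 256])
```

The resulting feature vector is what gets handed to the recurrent decoder at the first time step.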
RNN-based Sentence Generator
We first experiment with vanilla RNNs, as they have been shown to be powerful models for processing sequential data.
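A sketch of such a vanilla-RNN decoder, again in PyTorch as an illustrative assumption: the image feature initialises the hidden state, and the network then predicts a score over the vocabulary at each word position. The vocabulary size and layer dimensions are placeholders, not values from the article.

```python
import torch
import torch.nn as nn

class RNNCaptioner(nn.Module):
    """Vanilla-RNN sentence generator: image feature -> initial
    hidden state; word embeddings in, vocabulary scores out."""
    def __init__(self, vocab_size=1000, embed_dim=128,
                 hidden_dim=256, feat_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # feature -> h0
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)  # (1, B, H)
        x = self.embed(captions)                          # (B, T, E)
        y, _ = self.rnn(x, h0)                            # (B, T, H)
        return self.out(y)                                # (B, T, V)

scores = RNNCaptioner()(torch.randn(2, 256),
                        torch.randint(0, 1000, (2, 12)))
print(scores.shape)  # torch.Size([2, 12, 1000])
```

At inference time the highest-scoring word at each step would be fed back in as the next input until an end-of-sentence token is produced.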
LSTM-based Sentence Generator
Although RNNs have proven successful on tasks such as text generation and speech recognition [25, 26], it is difficult to train them to learn long-term dynamics. This is likely due to the vanishing and exploding gradient problems that can result from propagating gradients through the many layers of a recurrent network. LSTM networks (Figure 3) provide a solution by incorporating memory units that allow the network to learn when to forget previous hidden states and when to update them given new information.
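A minimal LSTM decoder sketch in the same spirit, with the image feature initialising both the hidden state and the memory cell; the gated memory units are what let the network keep or overwrite state over long spans. As before, PyTorch and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMCaptioner(nn.Module):
    """LSTM sentence generator: the image feature seeds both the
    hidden state h0 and the cell state c0; gates then control what
    is remembered or forgotten at each word step."""
    def __init__(self, vocab_size=1000, embed_dim=128,
                 hidden_dim=256, feat_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.tanh(self.init_c(feats)).unsqueeze(0)  # (1, B, H)
        y, _ = self.lstm(self.embed(captions), (h0, c0))  # (B, T, H)
        return self.out(y)                                # (B, T, V)

logits = LSTMCaptioner()(torch.randn(2, 256),
                         torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```

Swapping `nn.RNN` for `nn.LSTM` is the only structural change relative to the vanilla recurrent decoder; the gating happens inside the cell.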
For this exercise, we will use the 2014 release of the Microsoft COCO dataset, which has become the standard testbed for image captioning. The dataset consists of 80,000 training images and 40,000 validation images, each annotated with 5 captions written by workers on Amazon Mechanical Turk. Four example images with captions can be seen in Figure 4. We convert all sentences to lowercase and discard non-alphanumeric characters.
Our model generates sensible descriptions of images in valid English (Figures 6 and 7). As can be seen from the example groundings in Figure 5, the model discovers interpretable visual-semantic correspondences, even for relatively small objects such as the phones in Figure 7. The generated descriptions are accurate enough to be helpful to visually impaired people. In general, we find that a relatively large portion (60%) of the generated sentences can be found verbatim in the training data.
We have presented a deep learning model that automatically generates image captions with the goal of helping visually impaired people better understand their environments.
Li, Li-Jia, R. Socher, and Li Fei-Fei. "Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework." 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009). Web. 21 Apr. 2016.