Visual Question Answering


Deep Learning: a name that has gained more popularity than Messi in the world of artificial intelligence. An approach that has proven its dominance by being the workhorse of many practical applications in computer vision, speech recognition and NLP. One such exciting application of deep learning is VQA, Visual Question Answering. As the name itself suggests, it has three parts: a visual, a question, and its third sibling, the answer.

Figure-1: Results generated by the VQA network

A trained VQA model can be compared to a small child who has just started to understand things, and we are very excited to check how well he understands. We show him some object, picture or person and ask, "What is this?" or "Who is this?" And if the child has learned well, he or she would answer, for instance, "This is fire" or "He is Dad."

The VQA model has two inputs, a visual (i.e. an image) and a question related to it, and one output, the answer. Four examples are shown above in Figure-1.

Now, this is obviously a trivial task for a human being, but from a computer's perspective it requires solving a number of complex problems. For instance, consider the third and fourth examples above. For the questions "Where is the dog sitting?" and "What is he holding?", the system needs a very good understanding of the entire scene in the picture. It not only needs to know which objects are present in the image, it also has to identify the region of the image being referred to and recognize the object in that region, i.e. the bed in the sitting case and the bat in the holding case.

And from a language point of view, it should understand what "sitting" and "holding" mean and relate them to the regions being referred to in the query image.

Architecture of VQA Model

The VQA model combines two well-known deep learning architectures, a CNN and an RNN, to accomplish this task. The CNN (Convolutional Neural Network) is used to obtain the image features and the RNN (Recurrent Neural Network) is used to obtain the question features. These features are then combined and fed into a fully connected multi-layer perceptron, which can then be trained as a normal multi-class classifier over all the possible answer classes. The output of the network is a probability distribution over all the possible answer classes. The model used for this demonstration is trained on the 1000 most frequent answer classes of the VQA training dataset.
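To make this concrete, here is a minimal Keras-style sketch of the fusion and classification step, assuming the image and question encoders (described below) already produce fixed-size feature vectors. The layer sizes, dropout and concatenation-based fusion are illustrative choices, not necessarily the exact configuration of the original model.

```python
# A minimal sketch of the classifier head: fuse image and question features,
# then classify over the 1000 most frequent answers.
from tensorflow.keras.layers import Input, Dense, Dropout, Concatenate
from tensorflow.keras.models import Model

NUM_ANSWERS = 1000  # most frequent answers in the VQA training set

image_features = Input(shape=(4096,), name="image_features")       # from the CNN
question_features = Input(shape=(512,), name="question_features")  # from the RNN

# Fuse the two modalities and classify over the answer vocabulary
merged = Concatenate()([image_features, question_features])
x = Dense(1024, activation="tanh")(merged)
x = Dropout(0.5)(x)
x = Dense(1024, activation="tanh")(x)
x = Dropout(0.5)(x)
answer_probs = Dense(NUM_ANSWERS, activation="softmax")(x)

vqa_head = Model([image_features, question_features], answer_probs)
vqa_head.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy"])
```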

Figure-2: Visual Question Answering (CNN+LSTM)

The CNN used is a pretrained VGG-16 network; however, its last 1000-node softmax layer is removed. VGG-16 converts the input image into a 4096-dimensional feature vector. GloVe vectors are used to obtain a 300-dimensional word vector (word embedding) for each word, which is then fed to an LSTM network that converts the input question into a 512-dimensional feature vector. The image features and the question features are then combined to generate a single feature vector, which becomes the input to a fully connected feed-forward neural network. The output of this network is a probability distribution over the 1000 answer classes. Figure-2 illustrates the entire architecture with the top five predictions of the network for the question "What is he holding?"
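The two encoders can be sketched in the same spirit. The snippet below assumes the image feature is taken from VGG-16's fc2 layer (the 4096-dimensional output just before the softmax) and that the question is encoded by a GloVe-initialised embedding followed by a 512-unit LSTM. The vocabulary size, question length, embedding matrix and answer labels are hypothetical placeholders, and vqa_head refers to the classifier sketched above.

```python
# Sketch of the image and question encoders plus top-5 answer decoding.
# Placeholder values below stand in for quantities built from the VQA
# dataset and the GloVe files in a real setup.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Input, Embedding, LSTM
from tensorflow.keras.models import Model

vocab_size, max_question_len = 10000, 26                    # placeholder values
embedding_matrix = np.zeros((vocab_size, 300))              # would hold GloVe vectors
answer_labels = ["answer_%d" % i for i in range(1000)]      # would hold the 1000 answers

# Image encoder: pretrained VGG-16, features taken just before the softmax
vgg = VGG16(weights="imagenet")
image_encoder = Model(vgg.input, vgg.get_layer("fc2").output)  # 4096-d output

# Question encoder: 300-d GloVe embeddings fed to an LSTM with 512 units
question_input = Input(shape=(max_question_len,))
embedded = Embedding(vocab_size, 300,
                     embeddings_initializer=Constant(embedding_matrix),
                     trainable=False)(question_input)
question_encoder = Model(question_input, LSTM(512)(embedded))  # 512-d output

def top5_answers(image_batch, question_tokens, vqa_head):
    """Return the five most probable answers for one (1, 224, 224, 3) image
    and one padded question-token sequence."""
    img_feat = image_encoder.predict(preprocess_input(image_batch))
    q_feat = question_encoder.predict(question_tokens)
    probs = vqa_head.predict([img_feat, q_feat])[0]
    best = np.argsort(probs)[::-1][:5]
    return [(answer_labels[i], float(probs[i])) for i in best]
```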

This blog post is inspired by Avi Singh's blog, Deep Learning for VQA.

If you are interested and want to get a feel for VQA, there is a GUI implementation at https://github.com/anujshah1003/VQA-Demo-GUI. You can give it any image, ask any query, and be ready to be amazed by the answer this VQA child gives you. A YouTube demo is available at https://www.youtube.com/watch?v=7FB9PvzOuQY.
