Visual Question Answering (VQA) is a research area concerned with building a computer system that answers natural-language questions about an image. To begin, let’s look at three example datasets.
In the VQA dataset from www.visualqa.org, the system must handle several kinds of questions: binary yes/no questions (Is the umbrella upside down?), counting questions (How many children are in the bed?), and open-ended questions (Who is wearing glasses? Where is the child sitting?).
In the CLEVR dataset from Stanford, the system must answer questions about the shape, color, size, and material of objects, and about their spatial and logical relationships.
In the FigureQA dataset from Maluuba, the system must answer questions about bar charts, pie charts, and line plots.
Visual Question Answering and Deep Learning
Because Visual Question Answering draws on both image recognition and natural language processing, one major research direction relies on deep learning: a Convolutional Neural Network (CNN) encodes the image, a Recurrent Neural Network (RNN) encodes the question, and the two representations are combined to produce the final answer, as shown in Figure 4.
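To make the encode-then-combine idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the model from any figure in this article: the CNN is replaced by a simple pooling-and-projection stand-in, the RNN is a bare tanh recurrence, all weights are random and untrained, and every name and size (`encode_image`, `encode_question`, `IMG_FEAT`, and so on) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
IMG_FEAT, EMB, HID, VOCAB, ANSWERS = 16, 8, 12, 50, 5


def encode_image(image, W_img):
    """Stand-in for a CNN: global-average-pool the pixels, then project + ReLU."""
    pooled = image.mean(axis=(0, 1))            # shape (3,), one value per channel
    return np.maximum(0.0, pooled @ W_img)      # shape (IMG_FEAT,)


def encode_question(token_ids, E, W_xh, W_hh):
    """Minimal tanh RNN: the last hidden state summarizes the question."""
    h = np.zeros(W_hh.shape[0])
    for t in token_ids:
        h = np.tanh(E[t] @ W_xh + h @ W_hh)     # embed the word, update the state
    return h                                     # shape (HID,)


def answer_distribution(image, token_ids, params):
    """Concatenate both encodings and classify over a fixed answer set."""
    v = encode_image(image, params["W_img"])
    q = encode_question(token_ids, params["E"], params["W_xh"], params["W_hh"])
    fused = np.concatenate([v, q])               # shape (IMG_FEAT + HID,)
    logits = fused @ params["W_out"] + params["b_out"]
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return exp / exp.sum()


# Random, untrained weights: the output is meaningless until the model is trained.
params = {
    "W_img": rng.normal(size=(3, IMG_FEAT)),
    "E": rng.normal(size=(VOCAB, EMB)),
    "W_xh": rng.normal(size=(EMB, HID)) * 0.1,
    "W_hh": rng.normal(size=(HID, HID)) * 0.1,
    "W_out": rng.normal(size=(IMG_FEAT + HID, ANSWERS)) * 0.1,
    "b_out": np.zeros(ANSWERS),
}

image = rng.random((32, 32, 3))                  # fake RGB image
question = [4, 17, 23, 8]                        # fake word indices
probs = answer_distribution(image, question, params)
```

Here `probs` holds one probability per candidate answer. Treating answering as classification over a fixed answer set is one common simplification; a real system would swap the stand-ins for a trained CNN and RNN.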
Keras presents a generic model for Visual Question Answering as shown in Figure 5.
- Line 1–4: import Keras
- Line 6–21: implement CNN for image recognition
- Line 23–26: implement RNN for natural language processing
- Line 28–31: combine the results from CNN and RNN to deliver the final answer
Visual Question Answering and Relation Network
One interesting and important idea in Visual Question Answering is the Relation Network presented by DeepMind [1,2]. Its main goal is to capture the spatial or logical relations among the objects in the image and the question, such as “… the same size as …” in the question of Figure 6 and “… is left of …” in the question of Figure 7.
Figure 7 illustrates the architecture of a relation network inside a Visual Question Answering system. Note that the relation network can model relationships either between objects (object-to-object) or between features (feature-to-feature). Figure 8 shows a simple implementation of feature extraction and relation extraction in Keras/Theano.
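The relation step can be sketched in a few lines of NumPy. This is a hedged illustration of the commonly cited relation network formulation, in which a small network g scores every ordered pair of objects together with the question, the scores are summed, and a second network f maps the sum to answer logits. It is not the Keras/Theano code of Figure 8. The helper names (`mlp`, `relation_network`), the layer sizes, and the random weights are all assumptions made for the example, and the "objects" could equally be cells of a CNN feature map.

```python
import numpy as np

rng = np.random.default_rng(0)


def mlp(x, W1, b1, W2, b2):
    """Two-layer perceptron with a ReLU hidden layer."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


def relation_network(objects, question, g_params, f_params):
    """Compute f(sum over i, j of g(o_i, o_j, q)) for n objects and one question."""
    n = objects.shape[0]
    left = np.repeat(objects, n, axis=0)         # o_i, each repeated n times: (n*n, d)
    right = np.tile(objects, (n, 1))             # o_j, cycled n times:        (n*n, d)
    q = np.tile(question, (n * n, 1))            # question copied per pair
    pairs = np.concatenate([left, right, q], axis=1)
    relations = mlp(pairs, *g_params)            # g applied to every ordered pair
    pooled = relations.sum(axis=0)               # sum makes the result order-invariant
    return mlp(pooled, *f_params)                # f turns the pooled relations into logits


def init_mlp(n_in, n_hidden, n_out):
    """Random, untrained weights for one two-layer perceptron."""
    return (rng.normal(size=(n_in, n_hidden)) * 0.1, np.zeros(n_hidden),
            rng.normal(size=(n_hidden, n_out)) * 0.1, np.zeros(n_out))


# Hypothetical sizes: 6 objects with 10 features, a 12-dim question, 5 answers.
N_OBJ, OBJ_DIM, Q_DIM, REL_DIM, N_ANSWERS = 6, 10, 12, 16, 5
g_params = init_mlp(2 * OBJ_DIM + Q_DIM, 32, REL_DIM)
f_params = init_mlp(REL_DIM, 32, N_ANSWERS)

objects = rng.normal(size=(N_OBJ, OBJ_DIM))
question = rng.normal(size=Q_DIM)
logits = relation_network(objects, question, g_params, f_params)

# Shuffling the objects should not change the answer logits.
shuffled = objects[rng.permutation(N_OBJ)]
logits_shuffled = relation_network(shuffled, question, g_params, f_params)
```

Because the pairwise outputs are summed, `logits` and `logits_shuffled` agree up to floating-point rounding: the network reasons about which relations exist, not about the order in which objects are listed.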
Visual Question Answering is an interesting challenge that combines different disciplines, including computer vision, natural language understanding, and deep learning. Hopefully we will see more articles on this topic on Medium.
Source: Deep Learning on Medium