Deep Learning and Visual Question Answering

Visual Question Answering (VQA) is a research area concerned with building computer systems that answer natural-language questions about an image. Let’s first look at three example datasets.

VQA Dataset

Figure 1. VQA Dataset from

In the VQA Dataset, the computer system must handle several kinds of questions: binary classification (Is the umbrella upside down?), counting (How many children are in the bed?), and open-ended questions (Who is wearing glasses? Where is the child sitting?).

CLEVR Dataset

Figure 2. CLEVR Dataset from Stanford

In the CLEVR Dataset from Stanford, the computer system must answer questions about the shape, color, size, and material of the objects, and about their spatial and logical relationships.

FigureQA Dataset

Figure 3. FigureQA from Maluuba

In the FigureQA Dataset from Maluuba, the computer system must answer questions about bar charts, pie charts, and line plots.

Visual Question Answering and Deep Learning

Because Visual Question Answering draws on both image recognition and natural language processing, one major research direction applies deep learning: a Convolutional Neural Network (CNN) encodes the image, a Recurrent Neural Network (RNN) encodes the question, and the two encodings are combined to produce the final answer, as shown in Figure 4.

Figure 4. Combining CNN/RNN for VQA
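
To make the data flow of Figure 4 concrete, here is a minimal, framework-free NumPy sketch. It is not the Keras model discussed in this article. All names, layer sizes, and the random (untrained) weights are assumptions chosen for illustration, and a plain tanh RNN with a single convolution layer stands in for the deeper CNN and recurrent layers a real system would use.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_relu(image, kernels):
    """Valid 2-D convolution (cross-correlation) followed by ReLU.
    image: (H, W, C); kernels: (K, kh, kw, C); returns (H-kh+1, W-kw+1, K)."""
    k, kh, kw, _ = kernels.shape
    h, w, _ = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1, k))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]
            out[i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return np.maximum(out, 0.0)

def encode_image(image, kernels):
    """CNN branch: one conv layer, then global average pooling to a vector."""
    return conv2d_relu(image, kernels).mean(axis=(0, 1))

def encode_question(token_ids, embedding, w_x, w_h):
    """RNN branch: embed each token, run a tanh RNN, keep the last hidden state."""
    h = np.zeros(w_h.shape[0])
    for t in token_ids:
        h = np.tanh(embedding[t] @ w_x + h @ w_h)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def answer_distribution(image, token_ids, params):
    """Concatenate both encodings and score a fixed set of candidate answers."""
    img_vec = encode_image(image, params["kernels"])
    q_vec = encode_question(token_ids, params["embedding"],
                            params["w_x"], params["w_h"])
    fused = np.concatenate([img_vec, q_vec])
    return softmax(fused @ params["w_out"])

# Hypothetical sizes: 8 conv filters, 50-word vocabulary, 10 candidate answers.
params = {
    "kernels": rng.normal(0, 0.1, (8, 3, 3, 3)),
    "embedding": rng.normal(0, 0.1, (50, 16)),
    "w_x": rng.normal(0, 0.1, (16, 32)),
    "w_h": rng.normal(0, 0.1, (32, 32)),
    "w_out": rng.normal(0, 0.1, (8 + 32, 10)),
}
image = rng.random((16, 16, 3))   # stand-in for an RGB image
token_ids = [4, 17, 9, 2]         # stand-in for a tokenized question
probs = answer_distribution(image, token_ids, params)
print(probs.shape, probs.sum())
```

With untrained weights the output distribution is meaningless. The point is only the structure: two independent encoders whose outputs are concatenated before a final classifier over answers.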

Keras presents a generic model for Visual Question Answering as shown in Figure 5.

  • Lines 1–4: import Keras
  • Lines 6–21: implement the CNN for image recognition
  • Lines 23–26: implement the RNN for natural language processing
  • Lines 28–31: combine the results from the CNN and the RNN to deliver the final answer

Visual Question Answering and Relation Network

One interesting and important idea in Visual Question Answering is the Relation Network proposed by DeepMind [1,2]. Its main goal is to model the spatial or logical relations among the objects referred to in the image and the question, such as “… the same size as …” in the question of Figure 6 and “… is left of …” in the question of Figure 7.

Figure 6. Non-relational questions and relational questions in CLEVR Dataset
Figure 7. The model of relation network

Figure 7 illustrates the architecture of a relation network inside a Visual Question Answering system. Note that the relation network may model relationships either object-to-object or feature-to-feature. Figure 8 shows a simple implementation of feature extraction and relation extraction in Keras/Theano.
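
The feature-to-feature variant can be sketched in a few lines of NumPy. The sketch follows the formulation in the Relation Network paper, RN(O) = f(Σ g(o_i, o_j, q)): a small network g scores every ordered pair of "objects" together with the question encoding, the scores are summed, and a second network f maps the sum to answer logits. It is not the Keras/Theano implementation referred to above. The helper names, layer sizes, and random weights are assumptions, each cell of a CNN feature map is treated as one object, and positional information is left out for brevity.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for an MLP with the given layer sizes (illustrative only)."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, weights):
    """ReLU on hidden layers, linear output layer."""
    for w, b in weights[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = weights[-1]
    return x @ w + b

def relation_network(feature_map, question_vec, g_weights, f_weights):
    """feature_map: (H, W, C) CNN output; question_vec: (Q,) question encoding."""
    h, w, c = feature_map.shape
    objects = feature_map.reshape(h * w, c)    # each cell is one "object"
    n = objects.shape[0]
    o_i = np.repeat(objects, n, axis=0)        # first element of every ordered pair
    o_j = np.tile(objects, (n, 1))             # second element of every ordered pair
    q = np.tile(question_vec, (n * n, 1))      # question conditions every pair
    pairs = np.concatenate([o_i, o_j, q], axis=1)
    relations = mlp(pairs, g_weights)          # g applied to all n*n pairs
    pooled = relations.sum(axis=0)             # order-invariant aggregation
    return mlp(pooled, f_weights)              # f maps the summary to answer logits

rng = np.random.default_rng(0)
feature_map = rng.random((4, 4, 8))    # hypothetical 4x4 map with 8 channels
question_vec = rng.random(32)          # hypothetical question encoding
g_weights = init_mlp(rng, [8 + 8 + 32, 64, 64])
f_weights = init_mlp(rng, [64, 64, 10])    # 10 candidate answers
logits = relation_network(feature_map, question_vec, g_weights, f_weights)
print(logits.shape)
```

Because the pairwise outputs are summed, the result does not depend on the order in which the objects are listed. That is also why a real model needs some positional signal attached to each object before it can answer spatial questions such as “… is left of …”.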


Visual Question Answering is an interesting challenge that combines different disciplines, including computer vision, natural language understanding, and deep learning. Hopefully we will see more articles on this topic on Medium.


  1. VQA Dataset
  2. CLEVR Dataset
  3. FigureQA Dataset
  4. Keras VQA Model
  5. Relation Network from DeepMind

Source: Deep Learning on Medium