Overview of the VQA problem

Original article was published on Deep Learning on Medium

by Quanglong in Cinnamon Student AI Bootcamp 2020


Remember when deep learning applications boomed after the ImageNet competition? Back then, it was a huge thing!

Nowadays, Deep Learning is a fundamental approach for most Machine Learning tasks, including computer vision, natural language processing, and voice recognition. Many previous works handle only one content type, such as images or text. However, to get closer to human behavior, a machine needs to handle multi-modal problems, which involve several content types at once. Examples of tasks that combine visual and textual content are Image-Text Retrieval, Image Captioning, and Visual Question Answering (VQA). In this blog, we give an overview of the VQA problem, its challenges, and its practical applications.

The VQA problem

VQA stands for “Visual Question Answering”.

Figure 2: Illustration of a VQA System

We can briefly define VQA as the task of finding the answer to a question related to a given image or video. More specifically, a VQA system takes visual content and an associated text-based question as input, then infers a text-based answer as output (Figure 2).

Figure 3: Some examples of a VQA system’s input

Not long ago, developing a VQA system that could answer arbitrary questions was thought to be ambitious and intractable. Today, however, this ability is considered a core requirement of a VQA system. Because the questions can be arbitrary, they encompass many sub-problems in the field of computer vision. For example, take a look at Figure 4 and the following questions.

+ Object recognition: What kind of food is in the center?
+ Object detection: Is there any meat?
+ Attribute classification: What color is the avocado?
+ Counting: How many kinds of food are there?
+ …

Figure 4: Arbitrary questions can be asked and some are related to a sub-problem in computer vision.

In addition, more complex questions require a higher level of textual understanding, such as questions about the spatial relationships among objects, events, actions, or common sense reasoning.

Figure 5: Examples of some complex questions.


Different methods have been proposed in recent years. The common structure consists of three main parts: visual feature extraction, textual feature extraction, and an algorithm that integrates these two sets of features to generate the answer. Answer generation is usually treated as a classification problem, in which each unique answer is a distinct category. The main difference between methods is how they combine the visual and textual features.

Figure 6: The flow of a VQA system.
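
To make the three-part structure concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy feature extractors stand in for a real CNN and a real text encoder, the element-wise product is only one simple way to fuse the two feature vectors, the classifier weights are random rather than trained, and the names (ANSWERS, DIM, extract_visual_features, and so on) are hypothetical. It shows the data flow, not a working VQA model.

```python
import math
import random

ANSWERS = ["yes", "no", "green", "2"]  # hypothetical answer classes
DIM = 8                                # hypothetical shared feature size


def extract_visual_features(pixels):
    """Toy stand-in for a CNN: fold pixel values into DIM averaged buckets."""
    feats = [0.0] * DIM
    for i, p in enumerate(pixels):
        feats[i % DIM] += p
    n = max(len(pixels), 1)
    return [f / n for f in feats]


def extract_text_features(question):
    """Toy stand-in for a text encoder: hashed bag of words."""
    feats = [0.0] * DIM
    for word in question.lower().split():
        feats[sum(ord(c) for c in word) % DIM] += 1.0
    return feats


def fuse(visual, textual):
    """One simple fusion choice (an assumption): element-wise product."""
    return [v * t for v, t in zip(visual, textual)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer(pixels, question, weights):
    """Classify the fused features into one of the fixed answers."""
    fused = fuse(extract_visual_features(pixels),
                 extract_text_features(question))
    logits = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    probs = softmax(logits)
    best = max(range(len(ANSWERS)), key=probs.__getitem__)
    return ANSWERS[best], probs


# Untrained, randomly initialised classifier weights: one row per answer.
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in ANSWERS]

predicted, probs = answer([0.1, 0.5, 0.9, 0.3] * 4,
                          "What color is the avocado?", weights)
print(predicted, [round(p, 3) for p in probs])
```

Since the weights are random, the predicted answer is meaningless. In a real system, the two extractors and the classifier would be learned from image-question-answer triples, and the fusion step is where methods differ the most.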


Since 2014, there has been an enormous amount of research on developing VQA systems, and numerous challenges have emerged. Within the scope of this blog, below is our review of the central difficulties.

Expertise. First of all, a challenge comes from the prerequisite knowledge needed to develop such a system. “Visual” lies in the domain of computer vision, while “Question Answering” is a natural language understanding problem, so developers need expertise in both fields. That is why we believe VQA is a good challenge.

Lack of image-text semantic alignment. A VQA system consumes two distinct data streams (textual and visual) that must be combined correctly to ensure robust performance. To learn such cross-modal representations, current state-of-the-art methods on the VQA-v2 dataset pre-train large-scale models on numerous image-text pairs.

Figure 7: Learning the cross-modal representations

Limited answers: not as open-ended as thought. Most VQA algorithms treat answer generation as a classification problem. The answer dictionary normally contains a pool of K possible answers, and the algorithm computes the probability of each of them for a given image and question. The generated answers can be more varied as K increases, but this also requires a larger model and a larger training dataset.
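
The effect of K can be illustrated with a small sketch. It assumes, as one common convention, that the answer dictionary is built from the K most frequent answers in the training set. The toy answer list and the helper names build_answer_vocab and coverage are hypothetical.

```python
from collections import Counter


def build_answer_vocab(train_answers, k):
    """Keep the k most frequent training answers as the classifier's classes."""
    return [ans for ans, _ in Counter(train_answers).most_common(k)]


def coverage(answers, vocab):
    """Fraction of ground-truth answers the classifier could output at all."""
    vocab_set = set(vocab)
    return sum(a in vocab_set for a in answers) / len(answers)


# Toy training answers (made up for illustration).
train_answers = ["yes", "yes", "yes", "no", "no", "2",
                 "green", "avocado", "on the table", "yes"]

for k in (1, 2, 4):
    vocab = build_answer_vocab(train_answers, k)
    print(k, vocab, coverage(train_answers, vocab))
```

Any answer left out of the dictionary can never be predicted, no matter how good the model is. Raising K increases coverage, but it also adds rare classes with few training examples, which is why a larger K calls for a larger model and more data.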

Ability to answer complex questions. Machines are limited by the current state of the technology humans have developed, and they are still a long way from matching human cognition. Typical examples are complex question types such as “Why” questions, or ones requiring advanced knowledge (e.g., “Who is in the picture?”, with the answer “Donald Trump”).

Figure 8: An example of a hard question: knowing the position of the “global optimum of a non-convex function” requires a (potentially) vast knowledge base, one that humans may not have yet!


The fascination of VQA lies in its relevance to our daily life. Asking and answering questions is a crucial part of life and always will be. The way a VQA system answers a question is similar to ours in several aspects, which include visual and textual understanding, combining the two data streams, and using advanced knowledge properly.

There is a range of potential applications that integrate VQA systems. Currently, the most prominent one is supporting visually impaired individuals. Many applications that transform content between visual and textual forms have been released and have improved many people’s lives. One example is Seeing AI, a free application from Microsoft (see “Seeing AI 2016 Prototype: A Microsoft research project”).

Another use of a VQA system is human-computer interaction, especially for retrieving visual content. For example, a kid can ask the system various questions to learn the names of objects, or someone who is indoors can ask a camera about the weather outside.


To conclude, this was a brief overview of the VQA problem. You can try an online VQA demo here. In the next posts, we will review the approaches we have researched and our proposal for improving VQA systems. Stay tuned!


Visual Question Answering: Dataset, Algorithms and Future Challenges

— — — — — — — — — — —

This is Quanglong’s first blog in the VQA series. He will lead us through the overview, two approaches, and a comparison of the two.

About Quanglong: he is an excellent candidate participating in Cinnamon Student AI Bootcamp 2020. His main focus in the Bootcamp is Computer Vision.
About “Cinnamon Student AI Bootcamp 2020: Ideas to Reality”: this is a scholarship program with a new format that gives young people in the AI/Deep Learning field a solid foundation to put their ideas into practice and develop their own product from scratch. More info: here.