An Introduction to VQA Systems


Take a look at the photo below:

Photo by Vincent Keiman on Unsplash

Looking at this image for just a few seconds, one can answer a number of critical questions.

  • What is the weather like?
  • Is there anything being cooked?
  • How many people are visibly present in this scene?

These distinctions were made not through in-depth analysis, but rather by something simpler: the human experience. Only with the evolution of machine learning has this experience been able to be modeled in mediums other than humans. A key aspect of this experience is the ability to understand and visualize the world. Although methods of analysis like object detection and image classification have done the job of understanding the world from a binary point of view, humans don’t think in ones and zeros. The ability to truly understand the world goes hand in hand with the ability to make assumptions about it. So, simply classifying the presence of objects in an image is not enough in the fight to model human perception.

Surely a computer cannot make assumptions comparable to a human, right? Wrong. The models that allow for this ability in the realm of computer vision are known as Visual Question Answering (VQA) systems.

A variety of methods can be deployed to create functional models. In this article, I will describe the simplest. In subsequent articles, I will describe other methods and applications of increasing complexity.

Let’s get into it.

How can computers recognize objects in the first place?

The methods used to recognize objects in images vary. While there are seemingly universal object detection programs like YOLO, other approaches offer comparable levels of efficiency and ease of use. In almost all cases, a Convolutional Neural Network (CNN) is deployed. Even within this class of networks there are variations. However, in an effort not to over-stimulate your intellectually hungry minds, I will just say that these models have nearly identical structures, with slightly differing functions. If you are still dying to know more, read this insightful article on CNNs.
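
To make the core idea concrete, here is a minimal sketch of the operation that gives CNNs their name: sliding a small kernel over an image to produce a feature map. This is an illustrative NumPy implementation, not the code of any particular detector; the `edge_kernel` and toy image are invented for demonstration.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over a grayscale image, summing the
    element-wise products at each position (valid padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-made vertical-edge kernel: responds where brightness
# changes from left to right. A CNN *learns* kernels like this.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])

# Toy "image": dark on the left, bright on the right.
image = np.array([
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
])

feature_map = conv2d(image, edge_kernel)
print(feature_map)  # large magnitudes appear only at the edge
```

A real CNN stacks many such convolutions with learned kernels, interleaved with nonlinearities and pooling, so later layers respond to whole object parts rather than raw edges.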

In the case of specific object recognition, an alternative form of analysis can be used, known as a Haar Cascade. While I will again try my hardest not to bore you with too many details, at a high level, the model learns to recognize objects in an image through the analysis of specific pixel formations.
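
Those “pixel formations” are Haar-like features: differences between the pixel sums of adjacent rectangles, computed cheaply via an integral image. The following is a simplified sketch of that computation, with a toy patch invented for illustration; real cascades (e.g. OpenCV’s face detector) evaluate thousands of such features in a cascade of classifiers.

```python
import numpy as np

def integral_image(img):
    """2-D running sum, padded with a zero row and column, so any
    rectangle's pixel sum can be read off with four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h-by-w rectangle whose top-left pixel is (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def haar_edge_feature(ii, r, c, h, w):
    """Two-rectangle Haar-like feature: left-half sum minus
    right-half sum (w must be even)."""
    left = rect_sum(ii, r, c, h, w // 2)
    right = rect_sum(ii, r, c + w // 2, h, w // 2)
    return left - right

# Toy patch: bright left half, dark right half -> strong edge response.
patch = np.array([
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [9, 9, 0, 0],
])
ii = integral_image(patch)
print(haar_edge_feature(ii, 0, 0, 4, 4))  # prints 72
```

Because each feature costs only a handful of array lookups regardless of rectangle size, a cascade can scan an entire image at many scales in real time.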

The difficulty in making a model like this more generalized is that its training data needs to consist of the pixel formations of already-labeled regions of images. This method is commonly used in the creation of face detection programs and for other consistently used and analyzed object classes.

How can computers draw conclusions and make assumptions from simply the presence of an object in an image?

While I will describe more complex methods in my coming articles, in this writing, I will only describe the fundamental base that allows for the creation of alternative methods.

After a model is able to recognize the contents of an image, a number of statistical and contextual conclusions can be drawn. These visual conclusions can be communicated to the user through NLP and question generation. While these approaches are used in more robust models, they are not very useful in the context of a simpler VQA system.
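
As a toy illustration of this fundamental base, here is a sketch of how simple statistical conclusions (counts, presence) can be drawn from a detector’s output and matched to question templates. The detector labels and question patterns are hypothetical; production VQA systems replace this keyword matching with learned language models.

```python
from collections import Counter

def answer(question, detections):
    """Answer a narrow set of question templates from a flat list of
    detected object labels (hypothetical detector output)."""
    q = question.lower()
    counts = Counter(detections)
    if q.startswith("how many"):
        # Count the first detected label mentioned in the question.
        for label, n in counts.items():
            if label in q:
                return str(n)
        return "0"
    if q.startswith("is there"):
        return "yes" if any(label in q for label in counts) else "no"
    return "I don't know"

# Pretend an object detector returned these labels for the cookout photo.
detections = ["person", "person", "person", "grill", "food"]
print(answer("How many persons are visible?", detections))  # 3
print(answer("Is there food in the image?", detections))    # yes
print(answer("Is there a dog here?", detections))           # no
```

Even this crude mapping shows the pipeline’s shape: perception produces structured facts, and a language layer turns those facts into answers.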

Although some of these conclusions are relatively easy to grasp, value can still be derived from less robust systems. In the case of automated diagnosis, using VQA to create a user-friendly interaction is the only means by which a model trained in this context can be useful.

Seems Too Simple to Make an Impact, Right?

Wrong. While the structure of the program is one of ultra simplicity, the impact is actually quite profound. More than most models, visual programs must be made easy for the user to use. The results from these programs are practically tangible, and the user’s ability to gain insight from their successful execution is a critical piece of the puzzle.


Hey, I’m Jack. I’m a 15-year-old Innovator at The Knowledge Society. Over the past few months I’ve been diving deep into machine learning and AI. Recently I have been diving specifically into computer vision. Over the next few weeks I will be detailing the technical specifics of the concepts I learn. Stay tuned. Navigate to the links below to connect!