Source: Deep Learning on Medium
Visual Dialog Using Generative Adversarial Networks
Team: T1000 (Final Project)
CSCI 566-Deep Learning and its Applications, Fall 2019
University of Southern California
Visual dialog is an important area of AI and NLP research, where the aim is to design a system that can hold a dialog with a user about a given image. Concretely, given an image as input, the model's task is to answer a follow-up question in natural language, taking into consideration both the visual content and the context of the previous conversation. Visual dialog has many commercial and social applications, such as visual AI assistants, helping visually impaired users understand their surroundings, and helping analysts make decisions based on large quantities of surveillance data.
In this article we attempt to solve the Visual Dialog task using Deep Generative Adversarial Networks (GANs).
The dataset available for the Visual Dialog problem is the VisDial dataset. The dataset comprises:
- 120k images from COCO
- 10 rounds of human dialog history (10 Question-Answer pairs) per image
- 1 follow up question per image
- List of 100 possible answer options
Using this data as input, our aim is to generate free-form natural language answers to the follow-up question.
To begin with, we use an encoder-decoder model that generates an answer to the follow-up question based on the input image and the dialog history.
The encoder creates image and sentence embeddings, which are then fed into a decoder that consists of an LSTM-based generator. The generator generates an answer word by word.
The generator LSTM is trained to produce a probability distribution over all the words in the vocabulary at every cell.
For evaluation, the generator iterates over the list of 100 possible answers given as part of the data. For every answer, we read off the probability of each word in the answer from the probability distribution at the corresponding LSTM cell. We then multiply these probabilities to get the answer's score. Using these answer scores, we rank all 100 answers.
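The ranking step described above can be sketched in a few lines. This is a minimal illustration, not our actual evaluation code: the per-step distributions are made-up dictionaries standing in for the LSTM's softmax output, and the function names are hypothetical. It sums log-probabilities instead of multiplying raw probabilities, which gives the same ranking while avoiding numerical underflow on long answers.

```python
import math

def answer_log_score(step_distributions, answer_tokens):
    """Sum of log-probabilities of each answer token under the generator.

    step_distributions[t] is a dict mapping word -> probability at step t
    (an assumed stand-in for the LSTM's softmax output at that cell).
    """
    total = 0.0
    for t, token in enumerate(answer_tokens):
        # A tiny floor avoids log(0) for words the toy distribution omits.
        prob = step_distributions[t].get(token, 1e-12)
        total += math.log(prob)
    return total

def rank_answers(step_distributions, candidates):
    """Return candidate answers sorted from most to least probable."""
    return sorted(
        candidates,
        key=lambda ans: answer_log_score(step_distributions, ans.split()),
        reverse=True,
    )

# Toy distributions for two decoding steps (made-up numbers).
toy_steps = [
    {"yes": 0.5, "no": 0.4, "two": 0.1},
    {"<eos>": 0.7, "dogs": 0.3},
]
candidates = ["two dogs", "no <eos>", "yes <eos>"]
ranked = rank_answers(toy_steps, candidates)
print(ranked)
```

Note that a product of probabilities shrinks with every extra word, so a score of this form tends to favor shorter candidates unless it is length-normalized.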
The classical model gives some interesting qualitative results:
We observed the following limitations of the classical model:
- Short Answers
- Generic Answers
- Answer Inconsistency
In an attempt to overcome these limitations, we ran a set of investigative experiments. These experiments revealed a fundamental flaw in the classical generator model.
Fundamental Problem with generative models
We observed that for some questions, the model gave answers that contradicted the ground truth. So we started looking at the answer scores of the possible options.
The scores for the sharply contrasting answers “Yes” and “No” turned out to be very similar, which is probably why the model produces contradictory answers.
We narrowed the possible causes of this anomaly down to:
- The Encoder’s disregard for the image in the input embedding.
- The generator LSTM's inability to separate contradicting answers.
To test the first hypothesis, we ran a GradCAM visualization of the gradients flowing through the image features for the generated answer.
The GradCAM results rule out our initial presumption that the encoder did not perform image grounding. So we assume that the problem of inconsistent answers originates from the LSTM which can be attributed to natural language representation.
An obvious question here is why such a problem exists in Visual Dialog when other NLP tasks such as Visual QA perform beautifully on a similar language representation.
The reason for this astonishing difference in performance between Visual QA and Visual Dialog is that Visual QA is a classification problem while Visual Dialog is a generative problem. Generative approaches tend to suffer from such issues.
Addressing these fundamental shortcomings of generative models is a non-trivial task, and in the following sections we describe our approach to tackling them.
To mitigate the limitations of the classical model and provide some feedback to our LSTM generator, we switch to the adversarial approach. We integrate a discriminator that would differentiate between the generated answer and the ground truth. We call it the Oracle.
To pre-train the Oracle, we feed it the unified input embedding from the encoder and the list of 100 possible answers available in our data. The discriminator learns to rank these answers based on their correctness scores. To calculate the correctness score for any given answer, we use the dot product of that answer’s sentence embedding and the unified input embedding³.
For an answer A in the possible answer list
Correctness_Score (S(A)) = [Encoder_Embedding].[Answer_Embedding]
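As a rough sketch of this scoring rule, the snippet below computes the dot product between an encoder embedding and each candidate answer embedding and ranks the candidates by it. The three-dimensional vectors and the helper names are made up for illustration; the real embeddings come from the trained encoder and answer encoder.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_by_correctness(encoder_embedding, answer_embeddings):
    """Score each answer as S(A) = encoder_embedding . answer_embedding
    and return (answers sorted by descending score, score dict)."""
    scores = {
        answer: dot(encoder_embedding, embedding)
        for answer, embedding in answer_embeddings.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Toy embeddings (hypothetical values, three dimensions for readability).
encoder_embedding = [0.5, 1.0, -0.5]
answer_embeddings = {
    "standing": [-1.0, 0.0, 1.0],
    "no": [1.0, 0.5, 0.0],
    "yes": [1.0, 1.0, 0.0],
}
ranking, scores = rank_by_correctness(encoder_embedding, answer_embeddings)
print(ranking, scores)
```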
The discriminator performs exceedingly well compared to the generator, because the discriminator sees the list of possible answers during training, whereas the generator only sees this list during evaluation.
To leverage this advantage of the discriminator, we fine-tune the generator using the discriminator scores for the generated answer. The higher the discriminator score for a generated answer, the better the generator.
We observe a marginal improvement in the quantitative metrics. But on qualitative analysis we still see “Yes” and “No” appearing close together in the ranked list. More generally, some semantically contradicting answers were still ranked close together.
The Oracle discriminator is trained with a Maximum Likelihood Estimation (MLE) objective. Training the Oracle this way gives it the ability to differentiate between completely irrelevant answers and plausible answers. For example, if the model is asked a Yes/No question, it can figure out that the answer cannot be a random word like “standing”. However, since some Yes/No questions have the answer Yes and others have the answer No, the model gets stuck in a weak local optimum where it essentially outputs Yes or No at random and still achieves a relatively small loss. This implies that for our model to differentiate between plausible but semantically opposite answers, we need a stronger signal (that is, a higher loss) for these kinds of answer pairs. Intuitively, we want a very high loss for the Oracle if it does not strongly separate a ground truth answer from its semantic counterpart and, conversely, a low loss for answers that entail the ground truth answer. The case of neutral answers lies somewhere between these two.
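A small numerical illustration of this argument, with made-up probabilities rather than outputs of our model: on a set of Yes/No questions whose answers are split evenly, a predictor that ignores the image and spreads its mass over “yes” and “no” already gets a much lower negative log-likelihood than one that favors an irrelevant word, so the MLE objective exerts little pressure to go further.

```python
import math

def mean_nll(predicted, targets):
    """Average negative log-likelihood of the targets under a fixed
    predicted distribution (a dict mapping word -> probability)."""
    return sum(-math.log(predicted[t]) for t in targets) / len(targets)

# 100 Yes/No questions, half answered "yes" and half "no".
targets = ["yes", "no"] * 50

# Hypothetical predictors that never look at the image.
coin_flip = {"yes": 0.49, "no": 0.49, "standing": 0.02}
off_topic = {"yes": 0.05, "no": 0.05, "standing": 0.90}

coin_loss = mean_nll(coin_flip, targets)
off_topic_loss = mean_nll(off_topic, targets)
print(f"coin flip: {coin_loss:.3f}, off topic: {off_topic_loss:.3f}")
```

The coin-flip predictor lands at about 0.71 nats per question against about 3.0 for the off-topic one, even though it is wrong half the time.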
To introduce this functionality in our discriminative model, we integrate a new discriminator to the network.
Logician assisted Adversarial Model
The Logician is our novel discriminator that takes two sentences as input and outputs a probability distribution over the following classes:
Contradiction | Neutrality | Entailment
Borrowing ideas from Noise Contrastive Estimation, we come up with a new loss function that is derived in the architecture section, which does exactly what we want to achieve. We then use the Logician to fine tune the Oracle and achieve a reasonable separation between semantically contradicting answers. This separation is also reflected in the discriminator scores that are then used to fine tune the generator.
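To give a feel for the kind of signal we are after, here is a purely illustrative penalty. It is not the NCE-inspired loss derived in the architecture section: the hinge form, the `margin` value, the logits, and the function names are all assumptions made for this sketch. It weights a margin penalty on the Oracle's score gap by the Logician's contradiction probability, so contradicting pairs that are scored too close together are penalized heavily while entailing pairs are barely penalized. The score pairs are the “Yes”/“No” Oracle scores reported in this article.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contradiction_weighted_penalty(score_gt, score_other, logician_logits, margin=2.0):
    """Hypothetical penalty: hinge on the Oracle score gap, scaled by the
    Logician's probability that the other answer contradicts the ground truth.

    logician_logits are ordered (contradiction, neutrality, entailment).
    """
    p_contradiction, _p_neutral, _p_entail = softmax(logician_logits)
    gap = score_gt - score_other
    hinge = max(0.0, margin - gap)  # zero once the gap exceeds the margin
    return p_contradiction * hinge

contradiction_logits = [4.0, 0.0, -2.0]  # made-up Logician output for ("yes", "no")
entailment_logits = [-2.0, 0.0, 4.0]     # made-up output for an entailing pair

penalty_before = contradiction_weighted_penalty(5.592, 5.139, contradiction_logits)
penalty_after = contradiction_weighted_penalty(8.632, 6.842, contradiction_logits)
penalty_entail = contradiction_weighted_penalty(5.592, 5.139, entailment_logits)
print(penalty_before, penalty_after, penalty_entail)
```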
Fig 2.0 shows the separation in ranking and the scores achieved for the semantically contradicting answers. We can see that the Oracle score difference between the ground truth and its semantic counterpart is larger when we fine-tune the Oracle with the Logician. This allows the Oracle to back-propagate a higher loss if the generator generates a semantically wrong answer.
Oracle without Logician
Ground Truth Score ("Yes") - Semantic Contradiction Score ("No")
= 5.592 - 5.139 = 0.45
Logician Tuned Oracle
Ground Truth Score ("Yes") - Semantic Contradiction Score ("No")
= 8.632 - 6.842 = 1.79
To evaluate our generator's ability to generate semantically correct sentences using the new Logician discriminator, we built a custom dataset of binary questions about the objects present in an image. 50% of the questions in this dataset ask about objects that are actually present in the image, and the other 50% ask about objects that are not part of the image.
A quantitative analysis on the custom dataset shows that the Logician-assisted network improves the accuracy of the answers. Because this dataset consists of binary questions with labeled answers, we use accuracy as the evaluation metric.
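The construction of such a balanced dataset and the accuracy metric can be sketched as follows. The question template, the object vocabulary, and the function names are hypothetical; our actual dataset was built from image annotations, and the details may differ.

```python
import random

def build_binary_questions(image_objects, object_vocabulary, rng):
    """Build a balanced list of (question, label) pairs for one image:
    half about objects present in the image, half about absent objects."""
    present = sorted(image_objects)
    absent = sorted(set(object_vocabulary) - set(image_objects))
    n = min(len(present), len(absent))  # keep the 50/50 split exact
    questions = [
        (f"Is there a {obj} in the image?", "yes") for obj in rng.sample(present, n)
    ]
    questions += [
        (f"Is there a {obj} in the image?", "no") for obj in rng.sample(absent, n)
    ]
    rng.shuffle(questions)
    return questions

def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the labels."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
questions = build_binary_questions(
    {"dog", "frisbee"}, ["dog", "frisbee", "car", "boat", "cat"], rng
)
labels = [label for _, label in questions]

# A degenerate model that always answers "yes" scores exactly 50%.
always_yes = ["yes"] * len(labels)
print(accuracy(always_yes, labels))
```

Because the split is balanced, any model that ignores the image cannot do better than 50% on average, which makes accuracy on this dataset a direct check of image grounding.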
In the following section we introduce the architecture details of all our network components. You can skip to the conclusion section if the architecture is not your thing.