Original article was published on Deep Learning on Medium
Mode Collapse with GAN in vid2vid
Imagine a set of training images in some high-dimensional coordinate space, which the figure above simplifies to two dimensions. Some of these points represent daytime images and others represent night-time images. When we train an unconditional GAN, we first push a batch of random noise vectors through the generator to produce fake images. Training then essentially pulls these fake images toward the training images so that they look real. The problem is that some training images may be left out and never matched, so as training goes on the generator ends up producing only images of one kind. This is why GANs suffer from mode collapse, and why the images they generate need not be visually diverse.
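We can sketch this matching direction with a toy numpy example (the 2D clusters, the seed, and the collapsed "fake" samples below are all hypothetical stand-ins for the figure): when each *generated* sample is matched to its nearest training image, an entire cluster of training images can go unmatched and therefore exert no pull on the generator.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Day" and "night" training images as 2D points (stand-ins for the figure).
day = rng.normal([-2.0, 0.0], 0.3, (50, 2))
night = rng.normal([2.0, 0.0], 0.3, (50, 2))
train = np.concatenate([day, night])  # indices 0-49 day, 50-99 night

# Generated samples that have collapsed near the "day" cluster.
fakes = rng.normal([-2.0, 0.0], 0.5, (100, 2))

# GAN-style direction: each FAKE is matched to its nearest real image.
dist = np.linalg.norm(fakes[:, None, :] - train[None, :, :], axis=-1)
matched = np.unique(dist.argmin(axis=1))  # training images that get used at all

print(len(matched), "of", len(train), "training images ever matched")
```

Because the fakes sit near the day cluster, essentially all matches land on day images; the night images are never selected, so nothing pushes the generator toward them.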
This is how I came across a research paper that aims to solve this issue using Implicit Maximum Likelihood Estimation.
Image Synthesis with Conditional IMLE
Researchers at Berkeley published the paper “Diverse Image Synthesis from Semantic Layouts via Conditional IMLE”, which aims to solve the problem described above with the GAN-based training process of the vid2vid network. Rather than focusing on improving the quality of the output frame, it focuses on synthesizing diverse images from the exact same semantic map. This means the same scene can be rendered in any lighting or weather condition, unlike with GANs, where one semantic label can only produce one output. The paper shows how to use Implicit Maximum Likelihood Estimation, or IMLE, to achieve this. Let us try to understand why IMLE seems to work better than GANs for this particular use-case.
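The core intuition can be sketched with a toy numpy example (the 2D data, the linear "generator", and all hyperparameters below are hypothetical, not the paper's actual network): IMLE reverses the matching direction. For *every* training point we find its nearest generated sample and pull that sample closer, so no training point, and hence no mode, can be ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training set": two 2D clusters standing in for day and night images.
data = np.concatenate([
    rng.normal([-2.0, 0.0], 0.3, (50, 2)),
    rng.normal([2.0, 0.0], 0.3, (50, 2)),
])

# Hypothetical linear "generator": maps a latent z to a 2D point.
W = rng.normal(size=(2, 2)) * 0.1
b = np.zeros(2)

lr = 0.05
losses = []
for step in range(200):
    z = rng.normal(size=(200, 2))
    samples = z @ W + b

    # IMLE direction: for EVERY data point, find its nearest generated sample.
    d2 = ((data[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
    nn = d2.argmin(axis=1)                 # nearest sample per data point
    losses.append(d2.min(axis=1).mean())   # mean squared NN distance

    # Hand-written gradient step on the mean squared nearest-neighbour
    # distance (the factor of 2 is folded into the learning rate):
    diff = samples[nn] - data              # pull matched samples toward data
    W -= lr * (z[nn].T @ diff) / len(data)
    b -= lr * diff.mean(axis=0)

print("loss: %.3f -> %.3f" % (losses[0], losses[-1]))
```

Since every training point contributes to the loss at every step, both clusters keep pulling on the generator, which is exactly the property that prevents mode dropping.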