Deeper into DCGANs

Source: Deep Learning on Medium


My last post about DCGANs focused primarily on the idea of replacing fully connected layers with convolutions and on implementing upsampling convolutions with Keras. This article will further explain the architectural guidelines presented by Radford et al. [1], as well as additional topics discussed in the paper such as Unsupervised Feature Learning with GANs, GAN Overfitting, and Latent Space Interpolation.

DCGAN architecture used by Radford et al. [1] to generate 64×64 RGB bedroom images from the LSUN dataset

In contrast with multi-scale architectures such as LAPGAN or the Progressively-Growing GAN, or with the state-of-the-art BigGAN, which relies on auxiliary techniques such as Self-Attention, Spectral Normalization, and Discriminator Projection, the DCGAN is an easier system to fully comprehend.

DCGAN doesn’t achieve image quality comparable to the BigGAN model and doesn’t possess nearly the same latent space control as StyleGAN. However, it is still worth studying the DCGAN as a foundational pillar of GAN research. The DCGAN’s fundamental contribution is replacing the fully connected layers in the generator with upsampling convolutional layers. In designing this architecture, the authors cite three sources of inspiration.

  1. The All Convolutional Net → Replacing pooling operations with spatial downsampling convolutions
  2. Eliminating fully connected layers after convolutions
  3. Batch Normalization → Normalizing activations to help gradient flow

With these advancements in mind, the authors searched for a stable DCGAN architecture and landed on the following architectural guidelines:

  • Replace any pooling layers with strided convolutions in the discriminator and fractional-strided convolutions in the generator
  • Use Batch Normalization in the generator and discriminator
  • Remove fully connected hidden layers for deeper architectures
  • Use ReLU activation in the generator for all layers except the output, which uses Tanh (the images are normalized to [-1, 1] rather than [0, 1], hence Tanh over sigmoid)
  • Use LeakyReLU activation in the discriminator for all layers
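To make the first guideline concrete, here is a minimal NumPy sketch of a single-channel fractionally-strided (transposed) convolution. It is purely illustrative; in practice a framework layer such as Keras’s Conv2DTranspose handles channels, batching, and learned weights:

```python
import numpy as np

def conv_transpose2d(x, kernel, stride=2, pad=1):
    """Single-channel fractionally-strided (transposed) convolution: each
    input pixel stamps a scaled copy of the kernel onto a larger output.
    With stride 2, kernel 4, pad 1, this doubles the spatial size, matching
    the upsampling blocks of the DCGAN generator."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out[pad:out.shape[0] - pad, pad:out.shape[1] - pad]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))
kernel = rng.normal(size=(4, 4))
y = conv_transpose2d(x, kernel)   # 4x4 input -> 8x8 output
```

The output size follows the usual transposed-convolution arithmetic, stride × (input − 1) + kernel − 2 × padding, which is why a stack of these layers can grow a small latent tensor into a 64×64 image.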

These architectural guidelines have been later expanded on in modern GAN literature. For example, Batch Normalization in generative models has emerging cousins such as Virtual Batch Normalization, Instance Normalization, and Adaptive Instance Normalization. Further architectural guidelines are presented by Salimans et al. [2] and are well-explained in this medium post.

Aside from the model architecture, this paper discusses many interesting ideas related to GANs such as Unsupervised Learning, GAN Overfitting, GAN feature visualization, and Latent Space Interpolation.

Unsupervised Learning with GANs

Many applications for GANs have been explored and much of the research is trying to achieve higher quality image synthesis. Many of the methods for achieving high quality image synthesis are really supervised learning techniques, because they require class labels for conditioning.

The main idea here is to use the features learned by the discriminator as a feature extractor for a classification model. Specifically, Radford et al. explore combining unsupervised GAN feature extractors with an L2-SVM classifier. An SVM learns a high-dimensional separating hyperplane that maximizes the margin to the closest points in each class. The SVM is a great classifier; however, it is not a feature extractor, and applying one directly to raw images would leave it working in a space where pixel distances carry little semantic meaning, essentially rendering the problem intractable. Thus, the DCGAN serves as a feature extractor that reduces the dimensionality of the images in a semantically-preserving way, such that an SVM can learn a discriminative model.
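A rough sketch of the classification stage, assuming features have already been extracted. The hinge-loss subgradient trainer below is a minimal stand-in for the L2-SVM used in the paper, with X playing the role of pooled discriminator features:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """L2-regularized hinge-loss (SVM-style) classifier trained by
    subgradient descent. X: (n, d) feature vectors; y: labels in {+1, -1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1.0                    # margin-violating points
        if viol.any():
            w -= lr * (lam * w - (y[viol, None] * X[viol]).mean(axis=0))
            b -= lr * (-y[viol].mean())
        else:
            w -= lr * lam * w                   # only the regularizer remains
    return w, b

# toy stand-in "features": two separable clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (50, 8)), rng.normal(2.0, 0.5, (50, 8))])
y = np.array([-1.0] * 50 + [1.0] * 50)
w, b = train_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w + b) == y)
```

The point of the sketch is the division of labor: the GAN compresses images into a space where a simple max-margin linear model like this one is sufficient.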

GAN Overfitting

Re-reading the paper, I found the idea of GAN overfitting especially interesting. Overfitting in the context of supervised learning is very intuitive:

Common picture showing overfitting on a supervised learning task

The picture above is a common illustration of what overfitting looks like on a regression task: the over-parameterized model adjusts itself until it matches the training data exactly and has no training error. Stepping away from the statistics of the bias-variance tradeoff, we can intuitively frame overfitting in terms of generalization: how well the model does on the training data compared to testing data.

This is a very interesting idea in the context of GANs. The generator’s task is to produce data that the discriminator predicts as ‘real’, meaning data that closely resembles the training set. It seems the generator would be most successful if it discarded any attempt at adding stochastic variation and simply mimicked the training data exactly. Radford et al. discuss three interesting methods for showing that their DCGAN model is not doing this.

  1. Heuristic approximation: Models that learn quickly generalize well
  2. Auto-encoder hash collisions: train a 3072–128–3072 auto-encoder on generated and original data, then check how often the low-dimensional (128-d) representations collide between generated and original samples
  3. Smoothness of Latent Space (sharp transitions = overfitting)
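The hash-collision check (method 2) can be sketched as follows. Here a fixed random projection stands in for the trained 3072→128 encoder, and the 128-d code is binarized by thresholding, so exact code matches flag likely memorized images; everything except the hashing idea itself is a simplifying assumption:

```python
import numpy as np

def binary_hash(images, W):
    """Collapse images to 128-bit codes: linear encode, then threshold at 0.
    W is a random stand-in for the learned 3072->128 encoder weights."""
    codes = images.reshape(len(images), -1) @ W   # (n, 128) latent codes
    return (codes > 0).astype(np.uint8)           # sign-threshold -> bits

def count_collisions(hashes_a, hashes_b):
    """Number of codes in hashes_a that exactly match some code in hashes_b."""
    set_b = {h.tobytes() for h in hashes_b}
    return sum(h.tobytes() in set_b for h in hashes_a)

rng = np.random.default_rng(0)
W = rng.normal(size=(3072, 128))
train = rng.normal(size=(100, 32, 32, 3))         # stand-in "training" images
generated = rng.normal(size=(100, 32, 32, 3))     # stand-in "generated" images
dupes = np.concatenate([train[:5], generated])    # inject 5 memorized copies

no_dupes = count_collisions(binary_hash(train, W), binary_hash(generated, W))
with_dupes = count_collisions(binary_hash(train, W), binary_hash(dupes, W))
```

Independent images almost never share a 128-bit code, so any collision count well above zero is evidence of memorization.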

Another interesting technique for exploring overfitting in GANs, not used in this paper, is a nearest-neighbor search using L1 or L2 distance (or perhaps even VGG-19 feature distance) to retrieve the images from the training dataset that are most similar to a given generated image.
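A minimal sketch of that nearest-neighbor check, using pixel-space L2 distance (a deep feature distance could be swapped in for a more perceptual notion of similarity):

```python
import numpy as np

def nearest_training_images(generated, train, k=3):
    """Indices of the k training images closest to each generated image
    under pixel-space squared-L2 distance."""
    g = generated.reshape(len(generated), -1).astype(np.float64)
    t = train.reshape(len(train), -1).astype(np.float64)
    # squared distances via the expansion ||g - t||^2 = ||g||^2 - 2 g.t + ||t||^2
    d2 = (g ** 2).sum(1)[:, None] - 2.0 * g @ t.T + (t ** 2).sum(1)[None, :]
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 32, 32, 3))
# a "generated" image that is a near-copy of training image 7
fake = train[7] + rng.normal(scale=0.01, size=(32, 32, 3))
nn = nearest_training_images(fake[None], train)
```

If the retrieved neighbors of a generated image are near-duplicates, the generator is likely memorizing rather than generalizing.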

GAN Feature Visualization

Feature visualization in CNNs works by optimizing an input image, via gradient ascent, until it maximally activates a chosen feature. Radford et al. apply this idea (using guided backpropagation) to their discriminator on the LSUN bedroom dataset and present the following image:

It is interesting to think that these are the features the discriminator is using to tell if images are real or fake.
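The underlying mechanism can be sketched in a few lines of NumPy. Here the “unit” being visualized is just a fixed linear filter, a stand-in for a real discriminator neuron, which would require automatic differentiation to compute the gradient:

```python
import numpy as np

def activation_maximization(unit_w, steps=100, lr=0.1):
    """Start from noise and push an input image toward whatever maximally
    excites one 'unit'. The unit's activation here is sum(unit_w * x), so
    its gradient with respect to the image is simply unit_w."""
    rng = np.random.default_rng(0)
    x = rng.normal(scale=0.01, size=unit_w.shape)
    for _ in range(steps):
        x += lr * unit_w                          # gradient-ascent step
        x /= max(np.linalg.norm(x), 1e-8)         # keep the image norm bounded
    return x

unit = np.random.default_rng(1).normal(size=(16, 16))
img = activation_maximization(unit)
cosine = np.dot(img.ravel(), unit.ravel()) / (np.linalg.norm(img) * np.linalg.norm(unit))
```

With a linear unit the optimized image simply converges to the filter’s own pattern; with a deep discriminator, the same procedure surfaces the textures and object parts the network has learned to look for.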

Latent Space Interpolation

Latent space interpolation is one of the most interesting subjects of GAN research because it enables control over the generator. For example, if GANs are eventually used to design websites, you would want to control characteristics of the design or interpolate between designs. Aside from this anecdote, latent space interpolation is popularly illustrated by Word2Vec, where the vectors satisfy “King” – “Man” + “Woman” = “Queen”. Radford et al. explore this kind of interpolation with their generated images.

One interesting detail of the latent space interpolation discussed in this paper that I had originally missed is that the authors do not use the Z vectors of individual points. For example, they don’t just take the Z vector of one smiling woman, subtract the Z vector of one neutral woman, and add the Z vector of one neutral man to obtain a smiling man. Rather, they take the average Z vector of a series of generated images that share a visual attribute, such as ‘smiling woman’.
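The averaging and arithmetic can be sketched directly on latent codes. The concept groupings and the 100-d latent size below are illustrative assumptions; in practice each z would be a vector whose decoded image displays the attribute in question:

```python
import numpy as np

def average_z(zs):
    """Average the z vectors of several images that share an attribute
    (e.g. 'smiling woman'), as Radford et al. do before doing arithmetic."""
    return np.mean(zs, axis=0)

def lerp(z0, z1, steps=8):
    """Walk linearly between two latent codes; decoding each step should
    yield a smooth sequence of images if the latent space is well-behaved."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1 - alphas) * z0[None, :] + alphas * z1[None, :]

rng = np.random.default_rng(0)
smiling_woman = rng.normal(size=(16, 100))   # 16 sampled z's per concept
neutral_woman = rng.normal(size=(16, 100))
neutral_man = rng.normal(size=(16, 100))

# z arithmetic on the averaged vectors, mirroring the paper's example
z_smiling_man = (average_z(smiling_woman)
                 - average_z(neutral_woman)
                 + average_z(neutral_man))
path = lerp(average_z(neutral_man), z_smiling_man)
```

Averaging first matters because any single z carries idiosyncrasies of that one sample; the mean isolates the shared attribute direction before the arithmetic is applied.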

Thank you for reading this article! I have found this paper very useful in my research on GANs, and each time I have returned to it I have gained a deeper appreciation for its finer details. It is one of the foundational works on GANs and I highly recommend checking it out, especially if you are interested in image generation.

References

[1] Alec Radford, Luke Metz, Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. 2015.

[2] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen. Improved Techniques for Training GANs. 2016.