Introduction to Deep Convolutional Generative Adversarial Networks using PyTorch

Original article was published on Deep Learning on Medium

Fig 1: DCGAN for MNIST

What is a Deep Convolutional Generative Adversarial Network?

The Deep Convolutional Generative Adversarial Network (DCGAN) was a state-of-the-art model introduced in the 2016 paper “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks¹”. GANs have become an important concept because they opened the way to the generative side of deep learning. Here I present my experience with DCGAN.

How does it work?

DCGAN is essentially a GAN (Generative Adversarial Net) architecture that uses convolutions. A GAN has two main networks, the Discriminator and the Generator, which push each other to improve.

Model Architecture

Let’s take a deeper look at the Generator and the Discriminator.

Generator

The generator is the network that produces the output. For this particular task (face generation), it generates images of human faces. The generated images can look realistic or unrealistic, and our ultimate goal is to make the generator better at producing realistic ones. But how can we improve the generator when there is no way to measure its performance? This is the problem with applying traditional deep learning methods to generative tasks: we cannot define what makes an image look human-like or real in a logical, machine-understandable way. This is where the Discriminator comes in.

Fig 2: Isn’t it cool ??

Discriminator

The discriminator solves the problem above: it is the network that evaluates the performance of the generator. But how does the discriminator know the criteria for telling real human images apart from the fake images produced by the generator? At the beginning it has no idea how to separate real images from generated ones. Over time, with the help of the generator, it learns the criteria for judging the generator's output. In other words, it learns to classify generated and real images by using the generated images themselves.

DCGAN applies the same concept using convolution layers, which are well suited to capturing patterns in images. The generator network takes random noise as input and produces an image from it.

Fig 3: The DCGAN generator presented in the paper, used to generate bedroom images from the LSUN bedrooms dataset.

There are five key points that distinguish the DCGAN architecture from the GAN (considering the architecture described in the original paper).

These points have significantly reduced the drawbacks of conventional GANs for images, such as mode collapse and limited result quality (resolution, etc.), by adjusting the flexibility and constraints of the architecture.

  1. No spatial pooling: strided convolutions (in the discriminator) and fractionally-strided convolutions (in the generator) replace spatial pooling, which lets each network learn its own spatial downsampling/upsampling.
  2. No fully connected layers
  3. Batch normalization: using BatchNormalization helps deep generators start learning and prevents the generator from collapsing all samples to a single point.
Fig 4:

  4. ReLU activation for the Generator: ReLU is used in all generator layers except the output layer, which uses Tanh. Tanh is bounded (between -1 and 1), and the paper reports that a bounded output activation lets the model learn more quickly to saturate and cover the color space of the training images.
  5. Leaky ReLU activation for the Discriminator: Leaky ReLU gave better results, especially for higher-resolution modeling.

Objective Function

The objective function is the most important part of the concept because it formalizes the procedure discussed above.

Fig 5: Objective Function of GANs presented in the paper “Generative Adversarial Nets”²

As discussed before, the Discriminator (D) tries to maximize V(D, G) and the Generator (G) tries to minimize it. Let’s dive into V(D, G). (x = real image, z = noise, D(x) = probability of x being real, G(z) = image generated from noise z)
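
For reference, the objective from “Generative Adversarial Nets”² can be written out with the symbols defined above. The two extra symbols are p_data (the distribution of real images) and p_z (the distribution the noise is drawn from):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term rewards D for scoring real images close to 1; the second rewards D for scoring generated images close to 0, and it is the only term G can influence.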

From the Discriminator’s perspective, D(x) should be high (≈1) when real images are fed in, and D(G(z)) should be low (≈0) for generated images. Therefore D is trained to make V(D, G) larger. From the Generator’s perspective, the aim is to generate images that look real, so it tries to push D(G(z)) higher (≈1), which minimizes V(D, G).
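
To make the tug-of-war concrete, here is a small numeric sketch. It evaluates a single-sample estimate of V(D, G) for two made-up situations; the probabilities 0.9 and 0.1 are illustrative values, not measurements from the model in this article.

```python
import math


def value_fn(d_real: float, d_fake: float) -> float:
    """Single-sample estimate of V(D, G) = log D(x) + log(1 - D(G(z)))."""
    return math.log(d_real) + math.log(1.0 - d_fake)


# Discriminator is winning: the real image scores high, the fake scores low.
v_d_winning = value_fn(d_real=0.9, d_fake=0.1)

# Generator is winning: the fake image is scored as almost certainly real.
v_g_winning = value_fn(d_real=0.9, d_fake=0.9)

print(f"D winning: V = {v_d_winning:.3f}")
print(f"G winning: V = {v_g_winning:.3f}")
```

V is larger (closer to 0) when the discriminator is right and drops sharply when the generator fools it, which is why D maximizes the value and G minimizes it.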

Implementation — Face Generation using CelebA face dataset

Dataset

“CelebA” is a widely used dataset of human faces. The code snippet below downloads the dataset and creates data loader objects, which make model training much easier (by providing batches from the dataset).

Code 1: Download and Prepare the Dataset

Model using PyTorch

The Discriminator and Generator classes are defined as follows. Note that we use ReLU and Leaky ReLU activations and convolution layers.

Code 2: Create the Generator and Discriminator Classes

For the Generator, transposed convolution layers are used to upsample the noise.
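
As a rough illustration of what such classes can look like, here is a sketch that follows the DCGAN guidelines listed earlier. It is not the original code: the latent size of 100, the base width `feat=64`, the 4x4 kernels, and the 64x64 RGB output are assumed values.

```python
import torch
import torch.nn as nn

LATENT_DIM = 100  # assumed length of the noise vector


class Generator(nn.Module):
    """Maps noise of shape (N, latent_dim, 1, 1) to images of shape (N, 3, 64, 64)."""

    def __init__(self, latent_dim: int = LATENT_DIM, feat: int = 64, channels: int = 3):
        super().__init__()

        def up(c_in, c_out, stride, pad):
            # Transposed convolution + BatchNorm + ReLU
            return [nn.ConvTranspose2d(c_in, c_out, 4, stride, pad, bias=False),
                    nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]

        self.net = nn.Sequential(
            *up(latent_dim, feat * 8, 1, 0),  # 1x1   -> 4x4
            *up(feat * 8, feat * 4, 2, 1),    # 4x4   -> 8x8
            *up(feat * 4, feat * 2, 2, 1),    # 8x8   -> 16x16
            *up(feat * 2, feat, 2, 1),        # 16x16 -> 32x32
            nn.ConvTranspose2d(feat, channels, 4, 2, 1, bias=False),  # -> 64x64
            nn.Tanh(),  # bounded output in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)


class Discriminator(nn.Module):
    """Maps images of shape (N, 3, 64, 64) to a probability of being real, shape (N,)."""

    def __init__(self, feat: int = 64, channels: int = 3):
        super().__init__()

        def down(c_in, c_out, norm=True):
            # Strided convolution (no pooling) + optional BatchNorm + LeakyReLU
            layers = [nn.Conv2d(c_in, c_out, 4, 2, 1, bias=False)]
            if norm:
                layers.append(nn.BatchNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.net = nn.Sequential(
            *down(channels, feat, norm=False),  # 64 -> 32
            *down(feat, feat * 2),              # 32 -> 16
            *down(feat * 2, feat * 4),          # 16 -> 8
            *down(feat * 4, feat * 8),          # 8  -> 4
            nn.Conv2d(feat * 8, 1, 4, 1, 0, bias=False),  # 4 -> 1
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1)
```

Feeding `torch.randn(N, 100, 1, 1)` through `Generator()` gives a batch of 64x64 images, and passing those to `Discriminator()` gives one probability per image.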

The support classes for the Generator and Discriminator defined above are as follows.

Code 3: Create Support Classes for Generator and Discriminator

When building the model, the most critical part I encountered was selecting the convolution layer properties to get the desired shapes in both networks.
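
One way to check shapes before building the layers is to apply the standard output-size formulas by hand. The sketch below does that in plain Python; the 4x4 kernels, stride 2, padding 1, and the 64x64 target are illustrative choices rather than values taken from this article, and dilation is assumed to be 1.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size: int, kernel: int, stride: int = 1, padding: int = 0,
                       output_padding: int = 0) -> int:
    """Spatial output size of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Generator path: a 1x1 noise "image" is upsampled step by step.
size = conv_transpose_out(1, kernel=4, stride=1, padding=0)
gen_sizes = [size]
for _ in range(4):
    size = conv_transpose_out(size, kernel=4, stride=2, padding=1)
    gen_sizes.append(size)
print(gen_sizes)  # [4, 8, 16, 32, 64]

# Discriminator path: a 64x64 image is downsampled to a single value.
size = 64
disc_sizes = []
for _ in range(4):
    size = conv_out(size, kernel=4, stride=2, padding=1)
    disc_sizes.append(size)
disc_sizes.append(conv_out(size, kernel=4, stride=1, padding=0))
print(disc_sizes)  # [32, 16, 8, 4, 1]
```

With kernel 4, stride 2, and padding 1, each layer exactly doubles (transposed convolution) or halves (convolution) the spatial size, which makes the shapes easy to plan.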

Loss function and optimizers

After observing the dataset, we can choose the parameters needed to create the generator and the discriminator. Here we create the discriminator and generator network objects.

Code 4: Initialize the Discriminator and Generator

After creating the discriminator and generator objects, the criterion (loss function) and the optimizers should be defined as above. For the optimizers, “Adam” is selected because it copes well with the saddle points and plateaus that can occur during training. Note that we use two separate optimizers, one for the Discriminator and one for the Generator. For the loss function, Binary Cross Entropy is used because it can train both the discriminator and the generator.
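
A sketch of this setup is shown below. The two tiny networks are stand-ins so the snippet runs on its own. The learning rate of 0.0002 and beta1 of 0.5 are the values recommended in the DCGAN paper, assumed here because the article does not state its own settings.

```python
import torch
import torch.nn as nn

# Tiny stand-in networks so the snippet is self-contained; use the real models instead.
generator = nn.Sequential(nn.ConvTranspose2d(100, 3, 4), nn.Tanh())
discriminator = nn.Sequential(nn.Conv2d(3, 1, 4), nn.Sigmoid(), nn.Flatten(0))

# Binary cross entropy between predicted probabilities and 0/1 targets.
criterion = nn.BCELoss()

LR = 2e-4             # assumed: learning rate suggested in the DCGAN paper
BETAS = (0.5, 0.999)  # assumed: beta1 = 0.5 as suggested in the DCGAN paper

# One optimizer per network, so each update touches only its own parameters.
optimizer_d = torch.optim.Adam(discriminator.parameters(), lr=LR, betas=BETAS)
optimizer_g = torch.optim.Adam(generator.parameters(), lr=LR, betas=BETAS)

# The same criterion serves both roles, depending on the target it is given.
p = torch.tensor([0.9, 0.2])                    # example discriminator outputs
loss_real = criterion(p, torch.ones_like(p))    # -mean(log p)
loss_fake = criterion(p, torch.zeros_like(p))   # -mean(log(1 - p))
```

With a target of 1 the loss penalizes low scores, and with a target of 0 it penalizes high scores, so one loss function covers both the discriminator's and the generator's objectives.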

Training

The training cycle for a GAN is critical. One training iteration consists of the following steps.

  1. Train Discriminator (only) to identify the real images
  2. Train Discriminator (only) to identify the fake/ generated images
  3. Train Generator (only) to fool the Discriminator by producing better fake images 😉.

These three training stages are carried out as follows …

Code 5: Training
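
The three stages can be sketched as a single function, shown here as an illustration rather than the original listing. The function name `train_step` and the tiny stand-in networks are hypothetical. For stage 3 the sketch uses the common "non-saturating" form of the generator loss (BCE against a target of 1), which pushes D(G(z)) toward 1 as described earlier.

```python
import torch
import torch.nn as nn


def train_step(generator, discriminator, criterion, opt_d, opt_g, real, latent_dim):
    """Run one GAN iteration on a batch of real images; returns (loss_d, loss_g)."""
    batch = real.size(0)
    ones = torch.ones(batch, device=real.device)
    zeros = torch.zeros(batch, device=real.device)

    # 1. Train the discriminator on real images (target = 1).
    opt_d.zero_grad()
    loss_real = criterion(discriminator(real), ones)
    loss_real.backward()

    # 2. Train the discriminator on generated images (target = 0).
    #    detach() stops gradients from flowing back into the generator.
    noise = torch.randn(batch, latent_dim, 1, 1, device=real.device)
    fake = generator(noise)
    loss_fake = criterion(discriminator(fake.detach()), zeros)
    loss_fake.backward()
    opt_d.step()

    # 3. Train the generator: it wants the discriminator to say "real" (target = 1).
    opt_g.zero_grad()
    loss_g = criterion(discriminator(fake), ones)
    loss_g.backward()
    opt_g.step()

    return (loss_real + loss_fake).item(), loss_g.item()


# Smoke run with tiny stand-in networks and random "images" of size 4x4.
torch.manual_seed(0)
G = nn.Sequential(nn.ConvTranspose2d(8, 3, 4), nn.Tanh())
D = nn.Sequential(nn.Conv2d(3, 1, 4), nn.Sigmoid(), nn.Flatten(0))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
real_batch = torch.rand(16, 3, 4, 4) * 2 - 1  # fake "real" data in [-1, 1]
loss_d, loss_g = train_step(G, D, nn.BCELoss(), opt_d, opt_g, real_batch, latent_dim=8)
print(f"Loss_D = {loss_d:.4f}, Loss_G = {loss_g:.4f}")
```

In a full training run this function would be called once per batch from the data loader, inside a loop over epochs.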

Note that neither Loss_G nor Loss_D will decrease steadily, because the two networks are competing to reduce their own losses, and a decrease in one tends to increase the other.

Results

After training finishes, we are finally ready to generate human face images. This can be done directly with the trained Generator. Below we can observe how the generator gets better at generating images.
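
Sampling from a trained generator can be sketched as follows. The helper name `generate` is hypothetical, the small network is only a stand-in for a trained model, and the code assumes the generator ends in Tanh so its outputs lie in [-1, 1].

```python
import torch
import torch.nn as nn


@torch.no_grad()
def generate(generator: nn.Module, n: int, latent_dim: int) -> torch.Tensor:
    """Sample n images from a generator and map them from [-1, 1] to [0, 1]."""
    was_training = generator.training
    generator.eval()  # use running BatchNorm statistics while sampling
    device = next(generator.parameters()).device
    noise = torch.randn(n, latent_dim, 1, 1, device=device)
    images = generator(noise)
    generator.train(was_training)  # restore the previous mode
    return (images + 1) / 2


# Stand-in for a trained generator, only to show the call.
stand_in = nn.Sequential(nn.ConvTranspose2d(8, 3, 4), nn.Tanh())
samples = generate(stand_in, n=5, latent_dim=8)
print(samples.shape)
```

The returned tensor is in the [0, 1] range expected by image utilities such as `torchvision.utils.make_grid` and `torchvision.utils.save_image`.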

Fig 6: Training Instances (illustrated using Tensorboard)

Drawbacks

The main drawback of the conventional GAN architecture is that there is no control over the features of the generated images. It produces new human face images conditioned only on random noise.

To address this, many GAN architectures have been published, such as conditional GAN (cGAN), StyleGAN, StackGAN, and CycleGAN.