From GAN basics to StyleGAN2


5.2. StyleGAN2’s methods

In this part, we will look at what the authors actually do in StyleGAN2.

5.2.1. Normalization method instead of AdaIN

AdaIN normalizes using the statistics of the actual input data, which leads to the droplet artifacts. The authors prevent droplets by normalizing the convolution weights using estimated statistics rather than the actual data statistics. The figure below is a schematic diagram.

We start by simplifying the Style block (the gray area in Figure b).

Described in detail, AdaIN can be broken down into two steps: the first normalizes the content information with its own statistics, and the second applies a linear transformation to the normalized content using the style information. When the AdaIN part of StyleGAN (a) is expanded accordingly, it becomes as shown in (b). The operation inside AdaIN is: normalization of the content → linear transformation by the style vector. Note, however, that within a Style block the order is: linear transformation by the style vector → (convolution →) normalization of the content.
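
As a rough illustration of these two steps, AdaIN can be written as follows. This is a minimal NumPy sketch, not the actual StyleGAN implementation; the names content, style_scale, and style_bias are placeholders.

```python
import numpy as np

def adain(content, style_scale, style_bias, eps=1e-8):
    """AdaIN split into its two steps: normalize, then apply the style."""
    # Step 1: normalize each feature map of the content with its own statistics.
    mean = content.mean(axis=(1, 2), keepdims=True)   # per-channel mean
    std = content.std(axis=(1, 2), keepdims=True)     # per-channel std
    normalized = (content - mean) / (std + eps)
    # Step 2: linear transformation of the normalized content by the style.
    return style_scale[:, None, None] * normalized + style_bias[:, None, None]

# Example: a (channels, height, width) feature map and a per-channel style.
x = np.random.randn(64, 16, 16)
y_s = np.random.randn(64)   # scale derived from the style vector
y_b = np.random.randn(64)   # bias derived from the style vector
out = adain(x, y_s, y_b)
```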

Next, the authors consider the mean-based part of this operation unnecessary: the normalization only needs to divide by the standard deviation, and the linear transformation by the style only needs a multiplication by the coefficients. Also, since the noise injection does not particularly need to be inside the Style block, it is moved outside of it. (c)

We have now simplified the operation inside the Style block. Next, consider performing the first linear transformation by the style vector inside the convolution itself. In the Style block, the coefficient s_i, obtained by a linear transformation of the style vector W, scales the i-th input feature map of the content. Convolving the content scaled by s_i with the convolution weights w_ijk is equivalent to convolving the original content with the weights w_ijk multiplied by s_i. So this operation can be rewritten as follows (the Mod operation in (d) above):
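
Using the indices of the StyleGAN2 paper, where i enumerates the input feature maps, j the output feature maps, and k the spatial footprint of the convolution:

$$ w'_{ijk} = s_i \cdot w_{ijk} $$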

Next, consider performing the normalization (here, only the division by the standard deviation) inside the convolution as well. Assuming that the input activations are i.i.d. with unit standard deviation, the standard deviation of the output of the modulated convolution is simply the L2 norm of the corresponding modulated weights.
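
Concretely, for output feature map j:

$$ \sigma_j = \sqrt{\sum_{i,k} {w'_{ijk}}^2} $$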

What we want to do is multiply the output by the inverse of this standard deviation. Multiplying the output of a convolution with weights w'_ijk by the inverse of the standard deviation is equivalent to convolving with the weights w'_ijk themselves multiplied by the reciprocal of the standard deviation. Therefore, this normalization is performed as follows (the Demod operation in (d) above):
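
With a small ε added to avoid division by zero, as in the paper:

$$ w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} {w'_{ijk}}^2 + \epsilon}} $$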

With this, the sequential operations in the Style block (linear transformation by the style → convolution → normalization of the output) can be expressed as a single convolution. The normalization is now based on the assumption that the output follows a normal distribution; in other words, the normalization using the actual data distribution, which caused the droplets, is no longer performed. With this change, the droplet artifacts do not appear.
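
Putting Mod and Demod together, the whole Style block collapses into one "modulated convolution". Below is a minimal PyTorch sketch under the assumptions above, not the official implementation; it uses a grouped convolution so that each sample in the batch gets its own modulated weights.

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, style_scale, demodulate=True, eps=1e-8):
    """x: (N, C_in, H, W), weight: (C_out, C_in, k, k), style_scale: (N, C_in)."""
    N, C_in, H, W = x.shape
    C_out, _, kh, kw = weight.shape

    # Mod: scale the weights by the per-sample style coefficients s_i.
    w = weight[None] * style_scale[:, None, :, None, None]   # (N, C_out, C_in, k, k)

    # Demod: rescale so each output feature map has unit expected std.
    if demodulate:
        sigma_inv = torch.rsqrt((w ** 2).sum(dim=[2, 3, 4], keepdim=True) + eps)
        w = w * sigma_inv

    # Apply one convolution per sample via a grouped convolution.
    x = x.reshape(1, N * C_in, H, W)
    w = w.reshape(N * C_out, C_in, kh, kw)
    out = F.conv2d(x, w, padding=kh // 2, groups=N)
    return out.reshape(N, C_out, H, W)

# Example usage with dummy tensors.
x = torch.randn(4, 64, 16, 16)
weight = torch.randn(128, 64, 3, 3)
s = torch.randn(4, 64).exp()          # stand-in for the affine output of the style vector
y = modulated_conv2d(x, weight, s)    # -> (4, 128, 16, 16)
```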

5.2.2. A high-resolution image generation method instead of Progressive Growing

The original StyleGAN Generator has a simple configuration. Without Progressive Growing, such a simple generator has difficulty generating high-resolution images. But by increasing the expressive power of the Generator and Discriminator, it seems possible to generate high-resolution images without Progressive Growing.

Progressive Growing is a method that gradually adds layers to the Generator and Discriminator as the resolution increases, and it is frequently used for high-resolution image generation. However, since each intermediate resolution momentarily serves as the output resolution during training, the layers tend to generate excessively high-frequency details, which results in details such as teeth not following the movement of the face.

Therefore, the authors propose a high-resolution image generation method that does not use Progressive Growing, that is, one that does not gradually add layers to the Generator and Discriminator to increase the image resolution. The candidate networks, similar to MSG-GAN, are shown in the figure below.

To choose the best networks, they ran experiments with all combinations of the (b)-type and (c)-type Generators and Discriminators and adopted the combination with the best result. The following table shows the experimental results.

From these results, it can be seen that using the (b)-type Generator significantly improves Perceptual Path Length, and using the (c)-type Discriminator improves FID, so the authors adopted this combination.
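
To make the (b)-type ("skip") Generator idea concrete: each resolution block produces an RGB image that is upsampled and added to the running output. The sketch below shows only that summation structure; the block sizes, activations, and RGB heads are placeholders, not the actual StyleGAN2 architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGenerator(nn.Module):
    """Schematic (b)-type generator: each resolution emits an RGB image that is
    upsampled and summed into the final output."""
    def __init__(self, channels=(256, 128, 64)):
        super().__init__()
        self.blocks = nn.ModuleList()   # one feature block per resolution
        self.to_rgb = nn.ModuleList()   # one RGB head per resolution
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.blocks.append(nn.Conv2d(c_in, c_out, 3, padding=1))
            self.to_rgb.append(nn.Conv2d(c_out, 3, 1))

    def forward(self, x):
        rgb = None
        for block, to_rgb in zip(self.blocks, self.to_rgb):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = F.leaky_relu(block(x), 0.2)
            skip = to_rgb(x)                     # this resolution's RGB contribution
            if rgb is not None:                  # carry the lower resolutions upward
                skip = skip + F.interpolate(rgb, scale_factor=2, mode="bilinear",
                                            align_corners=False)
            rgb = skip
        return rgb

g = SkipGenerator()
img = g(torch.randn(2, 256, 8, 8))   # -> (2, 3, 32, 32)
```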

In Progressive Growing, the networks first focus on generating low-resolution images and then gradually generate higher-resolution images. But does this new network learn in a similar way? The experiment below confirms this.

Since the (b)-type Generator sums the generated images of each resolution, the contribution of each resolution to the final generated image can be calculated. The vertical axis indicates the contribution of each resolution and the horizontal axis indicates the training progress. In figure (a) above, where the new network is adopted, the contribution of the high-resolution side gradually increases as training progresses, and when the network size is increased (figure (b)), the contribution of the high-resolution side increases further at the end of training.
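
One simple way such a contribution could be measured (a hypothetical sketch, not necessarily the exact procedure used in the paper) is to upsample each resolution's RGB contribution to the final size and compare the standard deviations of the pixel values each one contributes.

```python
import torch
import torch.nn.functional as F

def resolution_contributions(rgb_skips):
    """rgb_skips: list of per-resolution RGB contributions, lowest resolution first.
    Returns each resolution's share of the summed output, measured by the
    standard deviation of the pixel values it contributes."""
    target = rgb_skips[-1].shape[-1]
    stds = []
    for rgb in rgb_skips:
        up = F.interpolate(rgb, size=target, mode="bilinear", align_corners=False)
        stds.append(up.std().item())
    total = sum(stds)
    return [s / total for s in stds]

# Example with dummy contributions at 32x32, 64x64, and 128x128.
skips = [torch.randn(1, 3, r, r) for r in (32, 64, 128)]
print(resolution_contributions(skips))
```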

The images generated with this new mechanism are shown below. In StyleGAN the eyes and teeth become unnatural, while with the new mechanism the eyes and teeth move naturally with the change of face direction.

StyleGAN
StyleGAN2

5.2.3. Path Length Regularization to smooth the latent space

It has been found that there is likely a correlation between Perceptual Path Length (PPL), which indicates the perceptual smoothness of the latent space, and image quality, so the authors incorporate it into the model as a regularization term. The formula is as follows, where a is a constant and y is a random image whose pixels are drawn from a normal distribution.
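
In the notation of the StyleGAN2 paper, with J_w = ∂g(w)/∂w the Jacobian of the generator output with respect to the intermediate latent w, the regularization term is:

$$ \mathbb{E}_{\mathbf{w},\mathbf{y}} \left( \lVert \mathbf{J}_{\mathbf{w}}^{\mathsf{T}} \mathbf{y} \rVert_2 - a \right)^2 $$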

This regularization encourages the generator to change the output by a constant magnitude for perturbations of the latent variables. The constant a is not fixed but is set dynamically during training as the moving average of the first term (the path lengths), so that a suitable value is found automatically.
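
Below is a minimal PyTorch sketch of this regularizer. The generator interface and variable names are placeholders (here the latent is assumed to be a simple (N, latent_dim) tensor); the official implementation also uses lazy regularization, applying this term only every few minibatches.

```python
import math
import torch

def path_length_penalty(fake_images, latents, pl_mean, decay=0.01):
    """fake_images: generator output g(w); latents: the w that produced them,
    with requires_grad=True and shape (N, latent_dim); pl_mean: running average
    of the path lengths, playing the role of the constant a."""
    N, C, H, W = fake_images.shape
    # Random image y with normally distributed pixels, scaled so the expected
    # magnitude of the product does not depend on the resolution.
    y = torch.randn_like(fake_images) / math.sqrt(H * W)
    # J_w^T y computed as a vector-Jacobian product.
    (grad,) = torch.autograd.grad(
        outputs=(fake_images * y).sum(), inputs=latents, create_graph=True
    )
    path_lengths = grad.pow(2).sum(dim=1).sqrt()   # ||J_w^T y||_2 per sample
    # Update the moving average that stands in for the constant a.
    pl_mean = pl_mean + decay * (path_lengths.mean().detach() - pl_mean)
    penalty = (path_lengths - pl_mean).pow(2).mean()
    return penalty, pl_mean
```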