StyleGAN v2: notes on training and latent space exploration


A Stroll Through the Manifold

Above are some examples of latent space exploration for my dresses and footwear models. The idea is simply to generate N sample vectors (drawn from a Gaussian distribution) and transition between them sequentially using whatever transition function you prefer. In my case this function is just a linear interpolation over a fixed number of frames (equivalent to morphing).
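
As a minimal sketch (the helper name and defaults are mine, assuming a 512-dimensional z), the walk can be produced with a few lines of NumPy; each yielded vector is then passed to the generator, roughly as in the official code:

```python
import numpy as np

def lerp_walk(n_samples=5, n_frames=30, z_dim=512, seed=0):
    """Yield latent vectors that linearly interpolate between
    n_samples random Gaussian z vectors, n_frames per transition."""
    rng = np.random.RandomState(seed)
    zs = rng.randn(n_samples, z_dim)                  # endpoints of the walk
    for a, b in zip(zs[:-1], zs[1:]):
        for t in np.linspace(0.0, 1.0, n_frames, endpoint=False):
            yield (1.0 - t) * a + t * b               # plain linear interpolation

# Each vector would then be synthesized, e.g. with the official TF network:
# images = Gs.run(z[np.newaxis], None, truncation_psi=0.7, randomize_noise=False)
```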

Notice that we start from the initial latent vector z: the StyleGAN mapping network first turns z into the intermediate latent vector w, and w is then used to synthesize a new image. Because we work with w, we can also rely on the truncation trick and discard poorly represented areas of the latent space. The idea is to specify how close the generated intermediate vector w has to stay to the average latent (computed from many random inputs to the mapping network). The ψ (psi) value scales the deviation of w from this average, and can therefore be tweaked for quality/variety trade-offs: ψ=1 is equivalent to no truncation (the original w), while values toward 0 pull w closer to the average, improving quality at the cost of variety.
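
Concretely, the trick is just a linear interpolation toward the average latent. A minimal sketch (the function name is mine; `w_avg` corresponds to the average latent the official code tracks as `dlatent_avg`):

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    """Truncation trick: pull the intermediate latent w toward the
    average latent w_avg. psi=1 leaves w unchanged, psi=0 collapses
    it onto the average, values outside [0, 1] extrapolate."""
    return w_avg + psi * (w - w_avg)
```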

Truncation trick in action, with ψ linearly spaced from -0.5 (top left) to 1.5 (bottom right)

Encode Real Images

We often want to obtain the code/latent-vector/embedding of a real image with respect to a target model; in other words: what input value should I feed to my model to generate the best approximation of my image?

In general, there are two methods for this:

  • pass image through the encoder component of the network
  • optimize latent (using gradient descent)

The former provides a fast solution but has problems generalizing outside of the training dataset, and unfortunately for us, it doesn’t come out of the box with a vanilla StyleGAN. The architecture simply doesn’t learn an explicit encoding function.

We are left with the latent-optimization option, using a perceptual loss: we extract high-level features (e.g. from a pre-trained model like VGG) for the reference and generated images, compute the distance between them, and optimize the latent representation (our target code) via gradient descent. The initialization of this target code is very important for both efficiency and effectiveness. The easiest way is simple random initialization, but a lot can be done to improve on this, for example by learning an explicit encoding function from images to latents. The idea is to randomly generate a set of N examples and store both the resulting image and the code that generated it. We can then train a model (e.g. a ResNet) on this data, and use it to initialize our latent before the actual StyleGAN encoding process. See this rich discussion regarding improved initialization.
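
A minimal PyTorch sketch of this optimization loop, assuming a `generator` callable that maps a latent to an image tensor (how the pretrained StyleGAN is wrapped is left out); the VGG16 feature extractor comes from torchvision:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Frozen VGG16 feature extractor used for the perceptual loss
vgg = models.vgg16(pretrained=True).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def encode(generator, target_img, w_init, steps=500, lr=0.01):
    """Optimize a latent w so the generated image matches target_img."""
    w = w_init.clone().requires_grad_(True)        # initialization matters a lot
    opt = torch.optim.Adam([w], lr=lr)
    target_feat = vgg(target_img)
    for _ in range(steps):
        opt.zero_grad()
        img = generator(w)                          # synthesize an image from w
        loss = F.mse_loss(vgg(img), target_feat)    # perceptual distance
        loss.backward()
        opt.step()
    return w.detach()
```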

Encoder for v1 and Encoder for v2 provide code and a step-by-step guide for this operation. I also suggest the following two papers: Image2StyleGAN and Image2StyleGAN++, which give a good overview of encoding images for StyleGAN, with considerations about initialization options and latent space quality, plus an analysis of image editing operations like morphing and style mixing.

w(1) vs w(N)

StyleGAN uses a mapping network (eight fully connected layers) to convert the input noise z into an intermediate latent vector w. Both are of size 512, but the intermediate vector is replicated for each style layer: for a network trained on 1024×1024 images this gives 18 copies, i.e. shape (18, 512), while for 512×512 it is (16, 512).

The encoding process is generally done on this intermediate vector, and one can decide whether to optimize w(1) (a single 512-dimensional vector that is then tiled across all style layers) or the whole w(N). The official projector does the former, while adaptations often rely on optimizing all w entries individually for better visual fidelity. Regarding this topic see also this Twitter thread.
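
To make the shapes concrete, here is an illustrative NumPy snippet (variable names are mine; the layer count follows 2·log2(resolution) − 2):

```python
import numpy as np

def num_style_layers(resolution):
    """Number of style inputs: 18 at 1024x1024, 16 at 512x512."""
    return int(2 * np.log2(resolution) - 2)

# w(1): one 512-d vector, tiled across every style layer (official projector)
w1 = np.random.randn(512)
w1_tiled = np.tile(w1, (num_style_layers(1024), 1))   # shape (18, 512)

# w(N): an independent 512-d vector per style layer, all optimized separately
wN = np.random.randn(num_style_layers(1024), 512)     # shape (18, 512)
```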

Comparison of projection into w(N) and w(1) using Nvidia FFHQ model

The difference is even more noticeable (and amusing) when projecting a reference that does not belong to the model's training distribution, like in the following example of projecting a dress image with the FFHQ model.

In general, at high resolutions the projector seems to fail to match fine details of the reference picture, but this is most likely a consequence of using a 256×256 resolution for the perceptual loss, as demonstrated in this thread.

Learn Direction

The StyleGAN improvements on latent space disentanglement allow us to explore single attributes of the dataset in a pleasing, orthogonal way (meaning without affecting other attributes).
While discriminative models learn boundaries to separate target attributes (e.g. male/female, smile/no-smile, cat/dog), what we are interested in here is crossing those boundaries, moving perpendicularly to them. For example, starting from a sad face I can slowly but steadily move toward a smiling version of the same face.

This should already provide a hint on how to learn new directions. We first collect multiple samples (image + latent) from our model and manually label the images for our target attribute (e.g. smiling vs. not smiling), trying to guarantee a proper class balance. We then train a model to classify or regress on our latents and manual labels. At this point we can use what these support models learned (e.g. the normal of a linear classifier's separating hyperplane) as transition directions.
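
As a sketch of such a support model (using scikit-learn's logistic regression purely as an example; helper names are mine), the normalized weight vector of a linear classifier gives the direction, and editing is just a step along it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_direction(ws, labels):
    """ws: (n_samples, 512) latents sampled from the model.
    labels: manual annotations for the target attribute (e.g. 1 = smiling).
    Returns the unit normal of the separating hyperplane."""
    clf = LogisticRegression(max_iter=1000).fit(ws, labels)
    direction = clf.coef_.ravel()
    return direction / np.linalg.norm(direction)

def move_along(w, direction, alpha):
    """Shift a latent along the learned direction; alpha sets the intensity."""
    return w + alpha * direction
```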

Robert Luxemburg shared the learned direction for the official FFHQ model.

Samples from playing with the smile latent direction

StyleGAN2 Distillation for Feed-forward Image Manipulation is a very recent paper exploring direction manipulation via a “student” image-to-image network trained on an unpaired dataset generated with StyleGAN. The paper aims to overcome the encoding performance bottleneck and learn a transformation function that can be efficiently applied to real-world images.

Examples from the paper “StyleGAN2 Distillation for Feed-forward Image Manipulation”

Conclusions

A lot of my experiments have initially been motivated by evaluating how good the latent space learned by a StyleGAN model is (representation learning), and how well the obtained embedding performs on downstream tasks (e.g. image classification via linear models). On one side I will keep working on this kind of evaluation, trying to also cover other types of generative models, like autoregressive and flow-based models.

I’m also interested in pursuing the exploration of such models for their pure image-synthesis capabilities, and the increasing potential for effective semantic mixing and editing, especially in relation to my passion for drawing, digital painting and animation. Automated line-art colorization, animating paintings and frame interpolation are some awesome free tools already out there, but there is so much more that can be done for assisted drawing, especially from a more semantic point of view. There is also plenty of room for practical improvements: generalization capabilities, speeding up inference time, training optimization and transfer learning.

Down the line, I also want to go beyond the pure two-dimensional canvas and learn more about the amazing things already achieved in the 3D graphics realm. Denoising is something I now frequently rely on, differentiable rendering just blew my mind, and, to close the loop, we are back again to GANs for continuous 3D shape generation.

Quoting research scientist Károly Zsolnai-Fehér

“What a time to be alive”