Source: Deep Learning on Medium
A Generative Adversarial Network is an extremely interesting deep neural network architecture able to generate new data (often images) that resembles the data given during training (or in mathematical terms, matches the same distribution).
Immediately after discovering GANs and how they work, I got intrigued. There is something special, maybe magical, about generating realistic-looking images in an unsupervised manner. One area of GAN research that really caught my attention has been image-to-image translation: the ability to turn an image into another image while keeping some sort of correspondence (for example turning a horse into a zebra or an apple into an orange). Academic papers like the one introducing CycleGAN (a particular architecture which uses two GANs “helping” each other to perform image-to-image translation) showed me a powerful and captivating deep learning application that I immediately wanted to try and implement myself.
With this particular article I would love to dive deep into GANs and provide a new way to look at their objective: this may be trivial to some readers, while offering an interesting point of view to others. I will also walk through an interesting application of this method, while providing practical tips to help you implement it. I will assume you have some foundations in Deep Learning and already know the basic idea behind Generative Adversarial Networks.
GANs: a Brief Introduction
Let’s consider a simple Convolutional GAN architecture (DCGAN). A random noise vector is the input of a Generator, made up of sequential convolutional layers that produce a final image. At the beginning of training, this image will be pure random noise. This “fake” image is then fed to a Discriminator, which outputs a single number, usually between 0 and 1, representing “how realistic” the image looks. The Discriminator is also fed real images. Telling the Discriminator whether the image it received is fake or real allows it to get better and better at its job over time, while also telling the Generator how to make the fake image look more realistic.
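As a concrete reference, here is a minimal DCGAN-style Generator and Discriminator sketched in PyTorch (my framework choice for these snippets; the layer sizes and the 32×32 output are illustrative, not a prescribed architecture):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Decodes a random noise vector into a 32x32 RGB image
    via a stack of transposed convolutions."""
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),  # 1x1 -> 4x4
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),      # -> 8x8
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),       # -> 16x16
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                            # -> 32x32
        )

    def forward(self, z):
        # reshape the noise vector to a 1x1 "image" with z_dim channels
        return self.net(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    """Scores an image with a single number in [0, 1]:
    close to 1 = looks real, close to 0 = looks fake."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),  # -> 8x8
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2), # -> 4x4
            nn.Conv2d(128, 1, 4, 1, 0), nn.Sigmoid(),       # -> 1x1 score
        )

    def forward(self, x):
        return self.net(x).view(-1)
```

Later in the article, the only structural change we will make is to this Discriminator’s output.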
The Siamese Network
Before explaining some new modifications to the classic GAN architecture, we need to introduce a particular network called the Siamese Network. The Siamese Network is used for one-shot learning: while in most cases an image classifier needs to be trained on a vast number of examples, a one-shot learning network can understand the features of a particular object from just one example. This kind of network is very useful for implementing facial recognition algorithms, which need to be flexible enough to sporadically add a new face to the set of recognizable faces.
The architecture is fairly simple: it consists of an encoder, which encodes features of the image into a vector of length VecLen. The name of the network, Siamese, comes from the fact that the same encoder weights are used “in parallel” to compute the latent vectors of two images: if the images belong to the same class, the two vectors should be close in latent space; otherwise, if they belong to different classes, they should be far apart. This end result is obtained thanks to a special loss function. While multiple loss functions for this kind of network exist in the literature (contrastive, triplet and magnet loss), here we start by considering the contrastive loss (Hadsell et al., 2006).
This loss is particularly simple and intuitive: it minimizes the squared distance Dw² (with Dw the Euclidean distance) between our two vectors when they belong to the same class (Y=0), and minimizes max(0, Margin − Dw)² when they belong to different classes (Y=1). This second term pushes the two vectors away from one another whenever their distance is lower than Margin (a constant we choose arbitrarily); once the two vectors are sufficiently far apart, the loss is zero.
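In code, the contrastive loss can be sketched like this (PyTorch; the margin value is an arbitrary hyperparameter, as noted above):

```python
import torch

def contrastive_loss(v1, v2, y, margin=1.0):
    """Contrastive loss (Hadsell et al., 2006).
    y = 0: same class      -> pull the vectors together (minimize Dw^2)
    y = 1: different class -> push them apart until Dw >= margin
    """
    d = torch.norm(v1 - v2, dim=1)                    # Euclidean distance Dw per pair
    same = (1 - y) * d.pow(2)                         # pull term for same-class pairs
    diff = y * torch.clamp(margin - d, min=0).pow(2)  # push term, zero once d >= margin
    return (same + diff).mean()
```

Note how the push term vanishes for pairs that are already at least `margin` apart, exactly as described above.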
Not too difficult, right? Now we are finally ready to place some of the puzzle pieces together.
The Siamese GAN
As we previously stated, a Generative Adversarial Network can be thought of as a Generator and a Discriminator working together to generate realistic images from a particular collection of images, or domain. The Generator takes a random noise vector as input and “decodes” it to an image, while the Discriminator takes an image as input and outputs a score relative to how realistic the image looks. Now let’s try and use a Siamese Network as the Discriminator. What we now have is a Decoder-Encoder architecture, which takes a vector as input and produces a vector as output. This structure is similar to that of an AutoEncoder (Encoder-Decoder), but with the two components swapped.
How can this network possibly train?
In the case of a Siamese Network trained to recognize faces, the number of total classes we have is the number of different faces our algorithm must recognize. Thus in that case, we expect the network to organize the latent space (the output of the network) in such a way that all the vectors encoding the same face are close together, while being far away from all the others.
In the case of GANs, however, the total number of classes is 2: fake images created by the Generator and real images. Our new Discriminator objective is then to arrange the output vectors of the Siamese Network such that real images are encoded close to one another, all while keeping fake images far from them.
The Generator on the other hand tries to minimize the distance between real and fake image vectors, or in other words, wants real and fake encoded as the same class. This new objective reproduces a very similar adversarial behaviour as in the “traditional” case, making use of a different kind of adversarial loss function.
Now that we understand the basics of the idea, let’s try to iterate on it and improve it.
In our loss function, we considered the distance between vectors. But what is distance, really? In our case, we evaluated distances between two vectors that can both move, iteration after iteration, within the vector space output by the Discriminator. Given this “relativity” of distance, we can make much more robust measurements by computing distances from a fixed point in space. This problem is addressed by the Triplet Loss for the Siamese Network, which evaluates distances from a neutral (anchor) point.
The triplet loss is max(d(a, p) − d(a, n) + Margin, 0), where d stands for squared Euclidean distance, a is the “anchor” point (we will consider it fixed in space), n is the negative point and p is the positive one.
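A minimal PyTorch sketch of the triplet loss, using the same notation (a, p, n, squared Euclidean distance d):

```python
import torch

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: pushes d(a, p) below d(a, n) by at least `margin`,
    where d is the squared Euclidean distance."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)  # squared distance anchor -> positive
    d_an = (anchor - negative).pow(2).sum(dim=1)  # squared distance anchor -> negative
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```

When the negative is already more than `margin` farther from the anchor than the positive, the loss is zero and no gradient flows.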
In our case, since we only have to deal with two classes (with the end goal of making one class indistinguishable from the other), we can choose a fixed point in space before training and use it as our neutral point. In my testing I used the origin (the vector whose values are all zeros).
To reach a better understanding of our latent space and how we want to organize it for our objective, let’s visualize it. Remember that we are projecting our space with a number of dimensions equal to VecLen to a 2D plane.
At the beginning of training, our images B and G(z) (the images generated by the Generator from noise vector z) are randomly encoded by the Discriminator in our vector space.
During Training, the Discriminator pushes the vectors of B closer to the fixed point, while trying to keep the encodings of G(z) at an arbitrary distance (Margin) from the point. The Generator on the other hand wants G(z) vectors to be closer to the fixed point and to B vectors as a consequence.
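The two objectives just described can be written down directly. This is a sketch under my reading of the article’s setup (origin as the fixed anchor, arbitrary margin); the exact formulation used in training may differ:

```python
import torch

def siamese_d_loss(real_vecs, fake_vecs, margin=1.0):
    """Discriminator objective: pull encodings of real images B toward
    the origin, and push encodings of G(z) out to at least `margin`
    (a contrastive loss with a fixed anchor at the origin)."""
    d_real = real_vecs.pow(2).sum(dim=1)                         # squared distance from origin
    fake_dist = fake_vecs.pow(2).sum(dim=1).sqrt()               # Euclidean distance from origin
    push = torch.clamp(margin - fake_dist, min=0).pow(2)         # zero once beyond the margin
    return d_real.mean() + push.mean()

def siamese_g_loss(fake_vecs):
    """Generator objective: pull encodings of G(z) toward the origin,
    i.e. toward the region where real images live."""
    return fake_vecs.pow(2).sum(dim=1).mean()
```

The adversarial tug-of-war is the same as in a classic GAN: the Discriminator separates the two clouds of vectors, the Generator collapses them together.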
Finally, here are some results from the Siamese GAN.
Now, to really understand why the Siamese GAN is extremely similar to a traditional GAN, we need to consider an edge case scenario: what if the Siamese Discriminator outputs a 1-Dimensional vector (VecLen=1)? Now we have the traditional GAN Discriminator outputting a single value: if this scalar is close to a fixed number (our 1-Dimensional point), let’s say 1, the image looks realistic, while looking fake in the opposite case. This is equivalent to keeping the score close to 1 for real and close to 0 for fake. Thus, the loss now becomes the Squared Error, typical of LSGANs (Least Squares GANs).
So, nothing new here. Well, not really. Encoding an image to a latent vector can sometimes be quite useful. Let’s talk about one practical example.
A recent paper introduced TraVeLGAN, a new approach to the problem of unpaired image-to-image translation. Unlike other methods (CycleGAN, for example), TraVeLGAN doesn’t rely on pixel-wise differences between images (it doesn’t use any cycle-consistency constraint), enabling translation between wildly different domains with hardly anything in common. To achieve that, a traditional Generator-Discriminator architecture is used together with a separate Siamese Network.
Let’s say we must turn images from domain A into images belonging to domain B. We will call the images translated by the Generator G(A).
Then the Siamese Network encodes images in latent space and aims at reducing the distance between the transformation vectors of image pairs. With S(X) as the vector encoding of X and A1, A2 two images from domain A, the network must produce encodings such that:
S(A1) − S(A2) similar to S(G(A1)) − S(G(A2))
where a similarity metric such as Cosine Distance is used.
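This constraint can be sketched as a loss term (PyTorch; the function name and signature are mine, and here I use cosine distance = 1 − cosine similarity):

```python
import torch
import torch.nn.functional as F

def transformation_vector_loss(s_a1, s_a2, s_ga1, s_ga2):
    """Keep the transformation vector of an image pair aligned
    before and after translation:
        S(A1) - S(A2)  ~  S(G(A1)) - S(G(A2))
    The arguments are the Siamese encodings of A1, A2, G(A1), G(A2)."""
    t_src = s_a1 - s_a2    # transformation vector between the source images
    t_gen = s_ga1 - s_ga2  # transformation vector between the translated images
    return (1 - F.cosine_similarity(t_src, t_gen, dim=1)).mean()
```

The loss is zero when the two transformation vectors point in the same direction, and maximal (2) when they point in opposite directions.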
By doing that, the Siamese Network passes information (in the form of gradients) to the Generator on how to preserve the “content” of the original images in the generated ones.
All of this happens while the Discriminator tells the Generator how to create more realistic images that resemble the ones from domain B. The end result is a Generator that generates images in the style of domain B with somewhat preserved content from domain A (in the case of two completely unrelated domains, some sort of correspondence is maintained).
After this brief introduction (read the paper for more info!), how can we use our Siamese Discriminator with the TraVeLGAN approach?
By simply removing the ad-hoc Siamese Network and making the already-in-use Discriminator output a vector, we can apply the previously discussed loss function to tell the Generator how realistic its generated images are. At the same time, we can compute the distances between transformation vectors of image pairs in latent space using Cosine Distance.
Summing everything up, the Discriminator encodes images into vectors such that:
1. Images with lower Euclidean Distances from our fixed point (origin) have a more realistic Style
2. Transformation vectors of encoded image pairs, S(A1) − S(A2) and S(G(A1)) − S(G(A2)), have a low Cosine Distance from one another, preserving Content
Since the angle and magnitude of a vector are independent features, the Discriminator is able to learn a vector space that satisfies both constraints.
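Putting the two constraints into one sketch (the λ weight, the function name, and the exact way the content term enters both losses are my illustrative choices, not a prescription from the paper):

```python
import torch
import torch.nn.functional as F

def siamese_travelgan_losses(s_b, s_ga1, s_ga2, s_a1, s_a2,
                             margin=1.0, lam=10.0):
    """Combined objective sketch.
    Magnitude encodes realism: real domain-B encodings (s_b) are pulled
    toward the origin, translated-image encodings are pushed to `margin`.
    Angle encodes content: transformation vectors S(A1)-S(A2) and
    S(G(A1))-S(G(A2)) are kept aligned via cosine distance.
    `margin` and `lam` are illustrative hyperparameters."""
    fakes = torch.cat([s_ga1, s_ga2])
    # adversarial terms: contrastive loss with the origin as fixed anchor
    d_adv = s_b.pow(2).sum(1).mean() + \
            torch.clamp(margin - fakes.norm(dim=1), min=0).pow(2).mean()
    g_adv = fakes.pow(2).sum(1).mean()
    # content term: cosine distance between transformation vectors
    content = (1 - F.cosine_similarity(s_a1 - s_a2, s_ga1 - s_ga2, dim=1)).mean()
    return d_adv + lam * content, g_adv + lam * content
```

Because magnitude and angle are independent, the realism term and the content term do not fight each other for the same degrees of freedom.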
In my testing I used a U-Net with skip connections as the Generator and a traditional fully convolutional Siamese Network as the Discriminator. Furthermore, I used Attention in both the Generator and the Discriminator, together with Spectral Normalization on the convolutional kernels to keep training stable. During training, TTUR (different learning rates for the Discriminator and the Generator) was used.
Here are some results, trained on apples and oranges images from ImageNet:
Here’s a high definition sample (Landscape to Japanese Print (ukiyo e)):
Encoding images in latent space is very useful: we have shown that making the Discriminator output a vector instead of a single value, and changing the loss function accordingly, can lead to a more flexible objective landscape. A task like image-to-image translation can be accomplished using only a single Generator and Discriminator, without any added networks and without the cycle-consistency constraint, which relies on pixel-wise differences and can’t handle extremely visually different domains. Lots of other applications are yet to be explored, like working with labelled images and more.
Thank you for your precious attention, have fun!