Source: Deep Learning on Medium

Welcome back to the Chapter 14 GANs series; this is the 3rd story, following on from the previous 2.

I hope you have gone through the last stories, or that you already have some idea about GANs and their types.

In this story, I mainly want to talk about newer ideas such as Pix2Pix and CycleGAN, along with their math and training.

I want to share the authors' views from the ground up, so you can train these models on your own problems or push them a little further as part of your research.

Just a little recap from the last stories

→ GANs learn the distribution of the training data, starting from a random distribution.

→ The discriminator is an X to Y mapping (Y → {0,1}), while the generator is a mapping from a random noise vector to X (the training data space).

→ CGANs take a conditioning vector as input along with the random noise at the generator network.

Okay, I hope you are comfortable with GANs by now, so let's start with

Pix2Pix

The name itself says "pixel to pixel", meaning the model takes the pixels of one image and converts them into the pixels of another.

The goal of this model is to convert one image into another; in other words, to learn the mapping from an input image to an output image.

**But why, and what applications can we think of?**

Well, there are tons of applications we can think of.

These are the main ones, but you can think of many more (literally, you can transform from one world to another).

And the reason we use GANs for this is to synthesize these photos from one space to another.

**Ingredients required**

- Training data pairs (**x** and **y**, where **x** is the input image and **y** is the output image)
- Pix2Pix uses the **conditional GAN (CGAN)** → G : {x, z} → y (z → noise vector, x → input image, y → output image)
- A generator network (encoder-decoder architecture): since an image is the input, we want to learn a deep representation and then decode it. A discriminator network (PatchGAN, discussed below)
- The CGAN loss function and an L1 or L2 distance

**Training Process**

→ The generator G takes x and z and produces y; the goal of G is to produce output images that cannot be distinguished from "real" images by the discriminator.

→ The discriminator D takes the pair (x, y) for both real and generated images. The goal of D is to distinguish the fake pairs from the real ones.

This image illustrates that.

Training a pix2pix GAN is the same as training any normal GAN, except for a small modification to the generator's loss function.

The generator G not only tries to reduce the loss from the discriminator but also tries to move the fake distribution closer to the real distribution by using an L1 or L2 loss.

The loss function of the generator network is

**G\* = arg min_G max_D L_cGAN(G, D) + λ·L_L1(G)**, where **L_L1(G) = E[ ||y − G(x, z)||₁ ]** (the paper uses λ = 100).

**Let’s talk about Networks Architectures.**

As we know, the generator is an encoder-decoder network (first a series of downsampling layers, then a bottleneck layer, then a series of upsampling layers).

The authors used the "U-Net" architecture with skip connections as the encoder-decoder network.

The discriminator uses the PatchGAN network (a term coined by the authors):

instead of classifying the whole image as real or fake, the discriminator outputs a grid of values, where each value classifies one N×N patch of the input as real or fake.

That's the PatchGAN.

Because every patch gets its own real/fake label, pix2pix can produce sharp images with rich local detail.
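Where do those patches come from in practice? Plain convolution arithmetic recovers the numbers the paper reports for its 70×70 PatchGAN. The layer list below follows the paper's discriminator (three stride-2 convs, then two stride-1 convs, all 4×4 kernels); the two helper functions are my own.

```python
# (kernel, stride, padding) for each conv layer of the 70x70 PatchGAN
LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]

def output_size(in_size, layers):
    """Spatial size of the discriminator's output grid."""
    for k, s, p in layers:
        in_size = (in_size + 2 * p - k) // s + 1
    return in_size

def receptive_field(layers):
    """How many input pixels one output unit sees, i.e. the patch size."""
    rf = 1
    for k, s, _ in reversed(layers):
        rf = rf * s + (k - s)
    return rf

print(output_size(256, LAYERS))   # 30 -> the 30x30 output grid
print(receptive_field(LAYERS))    # 70 -> each unit judges a 70x70 patch
```

So on a 256×256 input, each of the 30×30 outputs is a verdict on one overlapping 70×70 patch, not on a single pixel.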

To know more about the network parameters, evaluation, and results, please take a look at the paper.

Let me quickly explain the code w.r.t. the changes.

You can find the full code here in my **github**.

Two placeholders x and y as a pair (img_size is 256 and channels is 3).

The generator network is based on the U-net model (a set of downsampling layers, then bottleneck layer, then a set of upsampling layers)

The discriminator is a simple network: a stack of downsampling layers ending in the PatchGAN output, a 30×30 grid of values where each value scores the corresponding patch as real or fake (between 0 and 1).

Observe that here we concatenate x and y.

So, as usual, we feed the fake pair (x, G(x)) to the discriminator along with the real pair (x, y) and calculate the D loss.

The G loss now has 2 components: the normal generator loss, and a weighted L1 loss between the generated image G(x) and the real target y.
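To make the two components concrete, here is a minimal NumPy sketch of that combined loss. The λ = 100 weight follows the paper; the function name and the toy shapes are mine, and the actual repo computes the same thing with TensorFlow ops.

```python
import numpy as np

def generator_loss(d_fake, fake_y, real_y, lam=100.0):
    """pix2pix generator loss: adversarial term + weighted L1 term.

    d_fake : discriminator scores in (0, 1) for the generated pair
    fake_y : generated image G(x); real_y : ground-truth image y
    lam    : L1 weight (the paper uses lambda = 100)
    """
    eps = 1e-8
    # non-saturating GAN term: G wants D to output "real" (close to 1)
    gan_term = -np.mean(np.log(d_fake + eps))
    # L1 term pulls the generated image toward its paired target
    l1_term = np.mean(np.abs(real_y - fake_y))
    return gan_term + lam * l1_term

d = np.full((30, 30), 0.5)      # PatchGAN grid sitting at "unsure"
y = np.zeros((64, 64, 3))
print(generator_loss(d, y, y))  # L1 term is zero here, only -log(0.5) remains
```

The L1 term is what makes pix2pix supervised: it is only computable because every x comes with its paired target y.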

The below snippet is for training

Note: I did not really bother about the results and did not fine-tune the parameters; I just wrote enough to explain the topic. You can try the original authors' code to see proper results.

Well, that's all about the pix2pix GAN → supervised input pairs, a simple GAN, and an easy process.

Alright! So we can do image-to-image translation when we have a paired dataset. That's cool, but …

some researchers are never satisfied.

CycleGAN

The same researchers came up with another idea later that year, which they call

“Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”

The outcome is → given any two unordered image collections X and Y, the new algorithm learns to automatically "translate" an image from one into the other, and vice versa.

In the above GIF, not only can the horse (x) be mapped to a zebra (y), but the zebra can also be mapped back to the horse.

And this is achieved with unpaired training data, unlike the above. (With pix2pix, we would have given the horse (x) and the zebra (y) as a pair.)

This is cool because we often lack paired training data.

In a paired dataset, we give **x** and **y** as a pair such that they have some representation in common and they share some features.

so we make neural nets learn that mapping/understanding between the two images, theoretically and practically, with the L1 loss.

but in an unpaired dataset, there may be no direct supervision for the mapping we want to learn from x to y.

So the question is: how can we make the net learn some mapping between X and Y and do the image-to-image translation?

Ok let’s talk about how this is being done and the concepts.

Ingredients

→ The training data consists of two different sets of images (based on the problem): one set is called "**Domain X**", the other "**Domain Y**".

e.g., one set full of horses while another set is full of zebras (no ordered pairing).

→ Two generators implementing two different mappings (G : X → Y and F : Y → X) and two discriminators.

→ The GAN loss along with the cycle-consistency loss (new).

Key ideas and concepts

Let's first talk about the model's generators and discriminators.

- The first generator G takes an input image x from **Domain X** and gives a generated image (output) y'

Eg: Horse to Zebra

**G(x) → y'** (which should be indistinguishable from **Domain Y** images)

- The second generator F takes an input image y from **Domain Y** and gives a generated image x'

Eg: Zebra to Horse

**F(y) → x'** (which should be indistinguishable from **Domain X** images)

The **DX** and **DY** are the discriminators,

- **DY** verifies whether the input image from **G(x)** looks like a **Domain Y** image (e.g., does it look like the zebras?)
- **DX** verifies whether the input image from **F(y)** looks like a **Domain X** image (e.g., does it look like the horses?)

If I take an input image from Domain X, run the first generator, then take that output and run the second generator, I expect to get back the same image I started with:

**x → G(x) → y’ → F(y’) = x**

**y → F(y) → x’ → G(x’) = y**

**x ∈ X** and **y ∈ Y**

**x’≈ Y** and **y’ ≈ X**

**F(G(x)) = F(y')** and **G(F(y)) = G(x')**

This intuition is called **cycle consistency**.
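A toy sanity check of that intuition, using hand-picked invertible maps as stand-ins for the two generators (the real G and F are convolutional networks, of course):

```python
# Pretend "G" doubles values (Domain X -> Domain Y)
# and "F" halves them (Domain Y -> Domain X).
def G(x):
    return 2.0 * x

def F(y):
    return y / 2.0

x = 3.0
y_prime = G(x)        # x -> y'
x_back = F(y_prime)   # y' -> back towards x
print(x_back == x)    # True: F(G(x)) = x, the cycle closes
```

CycleGAN's bet is that forcing the learned G and F to behave like such mutual inverses is enough supervision to pin down a meaningful translation, even without pairs.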

Well, what I have explained here is only the idea, the "should be" case.

**Now let’s talk about losses and training process.**

Since we have two discriminators, we will have two normal GAN losses:

**x → G(x) → y'** (DY discriminator: fake) and **y** (DY discriminator: real)

**y → F(y) → x'** (DX discriminator: fake) and **x** (DX discriminator: real)

and, as I mentioned above, the paper introduces a new loss called the "cycle loss",

which works like this: take an image x from Domain X and run the full cycle

**x → G(x) → y’ → F(y’) = x**

You should get x back, so we can simply calculate the L1 loss between x and F(y'); that is the cycle loss.

same for **y → F(y) → x’ → G(x’) = y**

The final loss = GAN loss + Cycle loss
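A minimal NumPy sketch of the cycle term (the λ = 10 weight follows the paper; the helper names are mine):

```python
import numpy as np

def l1(a, b):
    return np.mean(np.abs(a - b))

def cycle_loss(x, x_cycle, y, y_cycle, lam=10.0):
    """lambda * ( ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1 )"""
    return lam * (l1(x_cycle, x) + l1(y_cycle, y))

x = np.ones((4, 4))
y = np.zeros((4, 4))
# a perfect cycle reconstructs both inputs exactly -> zero cycle loss
print(cycle_loss(x, x.copy(), y, y.copy()))  # 0.0
```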

Here is the full objective function for CycleGAN:

**L(G, F, DX, DY) = L_GAN(G, DY, X, Y) + L_GAN(F, DX, Y, X) + λ·L_cyc(G, F)**

solved as **G\*, F\* = arg min_{G,F} max_{DX,DY} L(G, F, DX, DY)** (the paper uses λ = 10).

so we train the model with this objective function,

In English, the training process is a 7-step process:

- Take 2 images x and y (one from Domain X and one from Domain Y)
- Run the two generators (x2y and y2x) → generate 2 fake images (y', x')
- Run the two discriminators (DX and DY): DX takes x and x', DY takes y and y'
- Calculate the discriminator losses from the above equations
- Run the two generators again, x2y (with x' as input) and y2x (with y' as input), to generate the two cycle images (y_cycle and x_cycle)
- Calculate the cycle L1 loss from (x, x_cycle) and (y, y_cycle)
- Finally, calculate the generator loss
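The 7 steps above can be sketched end to end with toy stand-ins for the four networks. The function names (x2y, y2x, DX, DY) mirror the step list, and the squared-error terms follow the LSGAN formulation the code uses; everything else is a deliberately simplified placeholder, not the real conv nets.

```python
import numpy as np

# Toy stand-ins: "x2y" adds 1, "y2x" subtracts 1, and each
# discriminator returns a scalar realness score in (0, 1).
def x2y(img): return img + 1.0
def y2x(img): return img - 1.0
def DX(img):  return 1.0 / (1.0 + np.exp(-np.mean(img)))
def DY(img):  return 1.0 / (1.0 + np.exp(-np.mean(img)))

def train_step(x, y, lam=10.0):
    # steps 1-2: run both generators to get the fakes
    y_fake, x_fake = x2y(x), y2x(y)
    # steps 3-4: discriminator losses (LSGAN: real -> 1, fake -> 0)
    d_loss = ((DX(x) - 1) ** 2 + DX(x_fake) ** 2
              + (DY(y) - 1) ** 2 + DY(y_fake) ** 2)
    # step 5: run the generators again to close both cycles
    y_cycle, x_cycle = x2y(x_fake), y2x(y_fake)
    # step 6: cycle L1 losses
    cyc = np.mean(np.abs(x - x_cycle)) + np.mean(np.abs(y - y_cycle))
    # step 7: generator loss = adversarial terms + weighted cycle term
    g_loss = (DY(y_fake) - 1) ** 2 + (DX(x_fake) - 1) ** 2 + lam * cyc
    return d_loss, g_loss

d_loss, g_loss = train_step(np.zeros((2, 2)), np.ones((2, 2)))
print(d_loss, g_loss)
```

Because these toy generators happen to be exact inverses, the cycle term is zero here; in real training, gradient steps on d_loss and g_loss alternately update the discriminators and the generators.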

Here is a picture I drew to help you understand the process.

I am not going to explain the G and D network architectures, as I assume by now you can understand them by glancing at the paper or the code.

Okay, let me quickly explain the code w.r.t. the changes.

You can find the full code on my github **here**

First, we get the two fake images (y fake and x fake) by running the same generator network under two different variable scopes.

Then we feed those outputs back into the same generator scopes, switching the inputs, to produce the cycle images.

Then we have the discriminator networks (fake x vs. real x, and fake y vs. real y).

The losses are as explained above (observe that we use the LSGAN loss function; to understand it, you can refer to my previous story **here**).

This is the training script that trains the generators and discriminators.

That’s pretty much about the cycle GAN.

Summary

→ Image-to-image translation is what we have focused on.

→ Pix2Pix takes pairs of images (X and Y) to learn the translation from one image X to another Y.

→ CycleGAN doesn't require pairs of images as input; it learns by taking images from one domain and producing images of the other domain in a way that keeps the cycle consistent.

If anyone has doubts, thoughts, or suggestions, feel free to ask, and I will definitely help if I can.

Original Papers and Credit to the Authors

*Image-to-Image Translation with Conditional Adversarial Networks*

*Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks*