Using Artificial Intelligence To Create People, Cars, and Cats

Source: Deep Learning on Medium


How we can make hyper-realistic images using Generative Adversarial Networks

None of the people in these images are real; they were all generated by a neural network! Source.

Think about what an average day looks like for you. Unless you spend all day locked in your room watching Netflix or sleeping, chances are you pass a lot of people on your way to wherever you’re going. Some people have long hair, some have short hair, some have big noses, others have small eyes.

The point is, how often do we stop and think about the different features of what we see every single day? Have you ever wondered to yourself, “what makes a face a face?” Is it something that has two eyes, one nose, a mouth, and some hair? If so, couldn’t we write a super simple computer program that generates new faces by combining those features?

Well, not really. There are a ton of details we’d be missing: things like wrinkles, shadows, and smiles. Things we might not usually think about when asked to describe the features of a face, but that need to be there for it to seem real.

Something is off but I can’t put my finger on it… Source.

So how do we actually generate faces, then, if it isn’t just a matter of telling a neural network to put two eyes, a nose, a mouth, and some hair on a big ball of flesh and hoping it looks reasonable?

A Game of Cat and Mouse

First, let’s take a moment to learn how Generative Adversarial Networks (GANs) work.

A GAN is actually made up of two neural networks: a generator and a discriminator. The generator’s job is to generate new data based on what it knows, and the discriminator’s job is to judge whether the generated data is legit or not. So in our example, the generator would try to create new images of faces and the discriminator would do its best to determine whether each face is real or not.

Pretend for a second that we’re dealing with an art gallery. The generator is a counterfeiter trying to create and sell fake pieces of art, and the discriminator is the curator trying to tell whether the artworks are real or not. In the beginning, they’re both pretty bad at their jobs, but they both learn from experience. The counterfeiter slowly gets better at faking artwork based on what gets accepted, and the curator gets better at telling the reals from the fakes over time.

This is basically how a Generative Adversarial Network trains: the generator takes in a random noise vector and spits out an image based on that input. The discriminator evaluates the generator’s output and tries to predict whether the image is real or fake. Whichever way it guesses, both networks factor the result in and update their weights.

Eventually, the goal is to get the two networks to a point where the generator is so good at producing fakes that the discriminator can’t do better than a coin flip, always hovering around 50% confidence.
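To make that loop concrete, here’s a minimal, self-contained sketch of a GAN training step in PyTorch. The tiny fully-connected networks, sizes, and learning rates here are just placeholders for illustration, not the face models discussed later:

import torch
import torch.nn as nn

# Toy GAN training loop: tiny stand-in networks, random "real" data.
latent_dim, data_dim, batch_size = 16, 64, 32
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.LeakyReLU(0.2),
                              nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
criterion = nn.BCELoss()

for step in range(1000):
    real = torch.randn(batch_size, data_dim)   # stand-in for a batch of real data
    ones = torch.ones(batch_size, 1)
    zeros = torch.zeros(batch_size, 1)

    # Discriminator step: real samples should score 1, generated samples 0.
    fake = generator(torch.randn(batch_size, latent_dim))
    d_loss = (criterion(discriminator(real), ones)
              + criterion(discriminator(fake.detach()), zeros))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: the generator "wins" when the discriminator calls its fakes real.
    g_loss = criterion(discriminator(fake), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()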

One thing to be careful of when training a GAN, however, is that if one side gets too good at its job, the other side won’t be able to learn anything. If the curator guesses every single piece of art correctly, the counterfeiter can’t learn what it did wrong. If the counterfeiter gets too good, the curator is always tricked and can’t learn either.

Mapping the Probability Distribution

Okay so that’s pretty cool, right? You can train two competing neural networks to try and outsmart each other and you end up with a generative model that does an awesome job at creating believable images. But let’s go a tiny bit deeper into how this actually works.

What a Generative Adversarial Network is actually doing is mapping the probability distribution of the data.

Now what the heck does that even mean?

We can think of pictures as samples from a high-dimensional probability distribution. Whenever you take a picture, you’re taking a sample from a probability distribution over pixels. Every arrangement of pixels has some probability of occurring, and a GAN learns to approximate that distribution from its training data.

For our example, it’s basically learning what makes a face a face. Different features such as eyes, noses, and mouths have a representation in this probability distribution. As a result, changing the noise fed into the model changes the qualities of the image that correspond to those numbers. By sampling from this distribution, we can get the model to generate entirely new images based on what it has learned about which pixels are likely to appear where.
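To see what sampling from that distribution looks like in code, here’s a small sketch. It assumes you already have a trained generator (like the DCGAN generator shown below) that maps a 100-dimensional noise vector to a 64×64 image; everything else is standard PyTorch:

import torch

# Two different samples from the latent distribution -> two different faces.
z1 = torch.randn(1, 100, 1, 1)
z2 = torch.randn(1, 100, 1, 1)

# Walking between the two latent points produces faces that morph smoothly:
# nearby points in the latent space share most of their features.
with torch.no_grad():
    faces = [generator((1 - t) * z1 + t * z2)
             for t in torch.linspace(0, 1, steps=8)]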

Using DCGAN to generate new faces

An image from the DCGAN paper. Source.

The DCGAN is basically an improved version of a regular GAN with a few important architectural changes (sketched as PyTorch building blocks after this list):

  • Pooling layers are replaced with strided convolutions in the discriminator and fractionally-strided (transposed) convolutions in the generator, letting the network learn its own spatial downsampling and upsampling.
  • Fully connected layers on top of convolutional features are removed. Global average pooling is one alternative, but while it often increases model stability, it hurts convergence speed; a middle ground that works well is to directly connect the highest convolutional features to the input and output of the generator and discriminator respectively.
  • Batch Normalization stabilizes learning by normalizing layer inputs to have zero mean and unit variance. This helps gradients flow in deeper models, but it shouldn’t be applied to every layer: leaving it off the generator’s output layer and the discriminator’s input layer avoids model instability.
  • The ReLU activation is used in the generator for all layers except the output, which uses Tanh. This was observed to help the model learn more quickly and cover the color space of the training distribution.
  • LeakyReLU activation is used in the discriminator for all the layers.
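Here’s a rough sketch of what those guidelines look like as PyTorch building blocks: a transposed (fractionally-strided) convolution block for the generator and a strided convolution block for the discriminator. The channel counts are left as arguments; the layer choices follow the list above:

import torch.nn as nn

def generator_block(in_ch, out_ch):
    # Fractionally-strided (transposed) convolution instead of a pooling/upsampling layer,
    # followed by batchnorm and ReLU, as described above.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def discriminator_block(in_ch, out_ch, batchnorm=True):
    # Strided convolution instead of max pooling, with LeakyReLU;
    # batchnorm is skipped on the first (input) block, per the list above.
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1, bias=False)]
    if batchnorm:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)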

If you’re interested in implementing a DCGAN, there’s an awesome PyTorch tutorial that uses the CelebFaces Attributes Dataset (CelebA), a dataset with over 200,000 pictures of faces of celebrities.
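The data side of that tutorial is straightforward: point an ImageFolder at the downloaded CelebA images and resize/crop them to 64×64. A sketch (the "celeba" path and batch size are placeholders for your own setup):

import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms

image_size = 64
dataset = dset.ImageFolder(
    root="celeba",  # placeholder: wherever you unpacked the CelebA images
    transform=transforms.Compose([
        transforms.Resize(image_size),
        transforms.CenterCrop(image_size),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # scale to [-1, 1] to match the Tanh output
    ]),
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)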

Here are some examples of the images in the dataset after being resized to 64×64 to make training easier:

Here’s how the two models are structured:

Generator(
  (main): Sequential(
    (0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace)
    (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace)
    (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU(inplace)
    (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU(inplace)
    (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (13): Tanh()
  )
)
Discriminator(
  (main): Sequential(
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (1): LeakyReLU(negative_slope=0.2, inplace)
    (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): LeakyReLU(negative_slope=0.2, inplace)
    (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): LeakyReLU(negative_slope=0.2, inplace)
    (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): LeakyReLU(negative_slope=0.2, inplace)
    (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
    (12): Sigmoid()
  )
)

So what happens now? We’ve trained a neural network on a bunch of faces, so the results have to be pretty good, right? Well, that depends on what you define as good.

And honestly, if you don’t look too closely, some of them seem like they might actually be real faces. Others, though, look like they came straight out of a nightmare.

This can’t be it, right? There’s got to be a better way of making fake people that look real. Well, some researchers at Nvidia might have found the secret sauce.

Nvidia’s StyleGAN

After the results were released, Nvidia’s generative adversarial model for creating new images got a lot of hype and coverage in the news, and for good reason. Most of the pictures are hyper-realistic and almost indistinguishable from real photographs.

“These people are not real — they were produced by our generator that allows control over different aspects of the image.” Source.

And honestly, the results from this model are pretty crazy. If I weren’t a high school student I’d definitely try training the model myself, but I’d probably turn 40 before it actually started producing recognizable pictures. They literally recommend an NVIDIA DGX-1 with 8 Tesla V100 GPUs. So unless you have a spare $150,000 lying around, or an enterprise-level deep learning setup, you might also be out of luck.

But how does it actually work?

The StyleGAN paper addresses one important issue: even though the resolution and quality of GAN-produced images keep getting better, we still have a really hard time explaining what exactly these networks are doing. They’re still black boxes. We don’t really understand the latent space or how features map to its variables, and there’s no good quantitative way to compare one generator against another.

Taking inspiration from classic style transfer, the authors redesigned the generator in a way that lets us see into the image synthesis process and adjust the style of the image at each layer to manipulate different features.

At a high level, Nvidia’s StyleGAN does something similar: it learns the different aspects of images without any help from humans, and after training, styles can be combined at different levels to get a final image with separately controllable coarse, middle, and fine features.
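The mechanics behind that mixing are easy to sketch: a mapping network turns the input noise into an intermediate style vector, every resolution level of the generator consumes a copy of that style, and mixing just means feeding one image’s style to the coarse levels and another’s to the fine levels. The toy code below only illustrates that control flow; it is not Nvidia’s implementation, and all the module sizes are made up:

import torch
import torch.nn as nn

class ToyStyleGenerator(nn.Module):
    def __init__(self, latent_dim=64, num_blocks=6):
        super().__init__()
        # Mapping network: noise z -> intermediate style vector w.
        self.mapping = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        # One "synthesis block" per resolution level; each one consumes a style vector.
        self.blocks = nn.ModuleList(nn.Linear(latent_dim * 2, latent_dim)
                                    for _ in range(num_blocks))

    def forward(self, z1, z2=None, mix_from=3):
        w1 = self.mapping(z1)
        w2 = self.mapping(z2) if z2 is not None else w1
        x = torch.zeros_like(w1)                    # stand-in for the learned constant input
        for i, block in enumerate(self.blocks):
            style = w1 if i < mix_from else w2      # coarse levels use w1, fine levels use w2
            x = torch.relu(block(torch.cat([x, style], dim=-1)))
        return x                                    # stand-in for the generated image

g = ToyStyleGenerator()
z_a, z_b = torch.randn(1, 64), torch.randn(1, 64)
mixed = g(z_a, z_b, mix_from=3)  # coarse structure from z_a, fine detail from z_b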

Check out this sick visualization of the StyleGAN!

And this picture to see some of the styles being mixed:

The Future of Generative Models

Nvidia’s StyleGAN was a pretty major breakthrough in Generative Adversarial Networks. Some of the images that have been generated by the network are almost indistinguishable from actual photos. But what are some practical applications of GANs?

  • AI generated graphics for videos
  • GANs for Super Resolution
  • GANs for Text to Image Generation

Thanks for reading! If you enjoyed it, please:

  • Add me on LinkedIn and follow my Medium to stay updated with my journey
  • Check out some of my other projects at my personal website
  • Leave some feedback or send me an email (alex@alexyu.ca)
  • Share this article with your network