Ever imagined yourself dancing like MJ? “Maybe in my dreams!” might be your answer, but it’s indeed possible now. Let’s learn to dance using Generative Adversarial Networks (GANs). GANs have gained a lot of momentum in current research; they have changed the way we think about generative modelling and are making the seemingly impossible possible.
A few astonishing things we can do using GANs:
- A GAN can convert a plain black-and-white image into a colourful one.
- GANs can augment a dataset with synthetically generated data; astonishingly, they can also synthesise images from text, and vice versa.
- They can create new anime characters by adding randomness to original images.
Whoa! Are we living in another dimension? Of course not! So, let’s take a closer look at this fantastic network.
Introduction to GANs
GANs, or Generative Adversarial Networks, were devised by Ian Goodfellow et al. The primary goal is to give computers a power of imagination: the ability to generate new data from existing data. A GAN comprises two players (neural network models) locked in a zero-sum game.
One player is the generator, which tries to produce data that looks like it comes from some target probability distribution. That would be you trying to forge the party’s tickets. The other player is the discriminator, which acts as a judge: it decides whether its input comes from the generator or from the real training set. That would be the party’s security, comparing your fake ticket with a valid ticket to find flaws in your design.
There are two main rules to this game. First, the generator tries to maximise the probability of the discriminator mistaking its outputs for real data. Second, the discriminator, by catching fakes, guides the generator to produce more realistic images (or whatever the use case demands).
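The two-player game above can be sketched as a minimal training step in PyTorch. This is an illustrative toy, not the paper’s model: the tiny fully connected networks, dimensions, batch size and learning rates are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim = 16, 32  # assumed toy sizes

# Player 1: the generator maps random noise to "data".
generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
# Player 2: the discriminator scores how "real" a sample looks (0..1).
discriminator = nn.Sequential(
    nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real = torch.randn(8, data_dim)   # stand-in for real training samples
z = torch.randn(8, latent_dim)    # random noise fed to the generator

# Discriminator step: label real samples 1 and generated samples 0.
d_opt.zero_grad()
d_loss = (bce(discriminator(real), torch.ones(8, 1))
          + bce(discriminator(generator(z).detach()), torch.zeros(8, 1)))
d_loss.backward()
d_opt.step()

# Generator step: try to make the discriminator label fakes as real.
g_opt.zero_grad()
g_loss = bce(discriminator(generator(z)), torch.ones(8, 1))
g_loss.backward()
g_opt.step()
```

In a real training loop these two steps alternate over many batches; the generator improves exactly because the discriminator keeps finding flaws.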
GANs is the most interesting idea in the last ten years in Machine Learning! — Yann LeCun
Diagrammatically, this is how it looks: the generator turns random noise into images resembling the training set, while the discriminator inspects each image and decides whether it is real or generated.
Now, can we make people dance using GANs? UC Berkeley says a resounding YES!
Caroline Chan, Shiry Ginosar, Tinghui Zhou and Alexei A. Efros have published an amazing paper named Everybody Dance Now. Let’s, henceforth, jump into the abyss to know more.
In a nutshell, we transfer the source subject’s dance moves (the one who can dance) to a static target subject (the one we teach to dance). The paper calls this per-frame image-to-image translation with spatio-temporal smoothing. Sounds weird, right?
To make this clear, let’s divide this process into three simple stages:
- Pose Detection
- Global Pose Normalization
- Mapping from normalized pose stick figure to the target subject
Pose detection estimates the pose of the source figure in each frame of the source video. To accomplish this, we use the pre-trained pose detector OpenPose, which automatically detects the (x, y) joint coordinates of the person who’s dancing. These points are then connected to form a pose stick figure. Below is an image of how the pose is identified from a source frame.
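Turning detected joints into a stick figure is just rasterising lines between joint pairs. A minimal sketch, assuming keypoints arrive as a list of (x, y) coordinates; the 5-joint toy skeleton here is an assumption (real OpenPose models output 18 or 25 keypoints):

```python
import numpy as np

# Hypothetical toy skeleton: which joint indices are connected by a limb.
LIMBS = [(0, 1), (1, 2), (1, 3), (1, 4)]  # head-neck, neck-hips, neck-arms

def draw_stick_figure(keypoints, h=64, w=64):
    """Rasterise (x, y) joint coordinates into a binary pose stick figure."""
    canvas = np.zeros((h, w), dtype=np.uint8)
    for a, b in LIMBS:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        n = max(abs(x1 - x0), abs(y1 - y0)) + 1
        for t in np.linspace(0.0, 1.0, n):   # naive line rasterisation
            x = int(round(x0 + t * (x1 - x0)))
            y = int(round(y0 + t * (y1 - y0)))
            canvas[y, x] = 255
    return canvas

# Toy pose: head, neck, hips, left hand, right hand.
pose = [(32, 5), (32, 20), (32, 45), (15, 30), (49, 30)]
fig = draw_stick_figure(pose)
```

The resulting image, not the raw coordinates, is what gets fed to the generator in the stages below.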
Global pose normalization makes the source subject’s pose consistent with the target’s environment, accounting for differences in body proportions and position within the frame.
Mapping from a normalized pose stick figure to the target subject is the final stage: we map the normalized pose stick figures to the target subject, making them dance like the source. The entire process is achieved in two phases, training and transfer.
During the training phase, the pose stick figures generated from frames of the target person are passed to a generator G, whose job is to synthesise images of the target from these abstract pose stick images.
This adversarial training proceeds with a discriminator D and a perceptual reconstruction loss, dist, computed with a pre-trained VGGNet. Together these optimise the generated image G(x) to match the ground-truth target image y. As in a conventional GAN, the discriminator D tries to distinguish the real image y from the fake image G(x). All of this trains the generator G for the target subject.
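The generator’s objective therefore combines an adversarial term with a perceptual reconstruction term. A hedged sketch: the loss weight and image sizes are assumptions, and a small frozen random conv stack stands in for the pre-trained VGG feature extractor (to keep the example self-contained).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for pre-trained VGG features: a small frozen conv stack (assumption).
vgg_like = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1)).eval()
for p in vgg_like.parameters():
    p.requires_grad_(False)

def generator_loss(d_fake_logits, g_x, y, lam=10.0):
    # Adversarial term: push D's score on G(x) towards "real".
    adv = nn.functional.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # Perceptual term: match feature maps of G(x) and the ground truth y.
    perc = nn.functional.l1_loss(vgg_like(g_x), vgg_like(y))
    return adv + lam * perc

g_x = torch.rand(1, 3, 32, 32)   # generated image G(x)
y = torch.rand(1, 3, 32, 32)     # ground-truth target frame
d_logits = torch.zeros(1, 1)     # stand-in discriminator output on G(x)
loss = generator_loss(d_logits, g_x, y)
```

Matching feature maps rather than raw pixels is what makes the reconstruction “perceptual”: textures and structures matter more than exact pixel values.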
During the transfer phase, the pose of the source subject is extracted from each source frame y’, yielding the pose stick figure x’. But since we can’t rely on the source subject’s position, limb proportions and environment matching the target’s, we need to make x’ compatible with the target person. Therefore, we apply a global pose normalization Norm, producing x: using the distance between the ankle positions and the subjects’ heights, the source’s pose is linearly mapped into the target’s coordinates. We then pass x to the previously trained generator G to produce an image G(x) of the target person.
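The ankle/height mapping above is a simple linear rescaling. A minimal sketch, assuming we already have ankle y-coordinates and pixel heights for both subjects (the numeric values below are made up for illustration):

```python
# Sketch of global pose normalisation: rescale source joint coordinates so the
# source's height matches the target's, and anchor the result at the target's
# ankle line. Uniform x-scaling is a simplifying assumption of this sketch.

def normalize_pose(src_points, src_ankle_y, src_height,
                   tgt_ankle_y, tgt_height):
    """Linearly map source joint coords into the target's frame."""
    scale = tgt_height / src_height
    return [(x * scale, tgt_ankle_y - (src_ankle_y - y) * scale)
            for x, y in src_points]

# Toy source pose: head and ankle. The source is 150 px tall with ankles at
# y=200; the target is 300 px tall with ankles at y=400 (assumed values).
src = [(100.0, 50.0), (100.0, 200.0)]
out = normalize_pose(src, src_ankle_y=200.0, src_height=150.0,
                     tgt_ankle_y=400.0, tgt_height=300.0)
```

After this step the stick figure occupies the same region of the frame that the target occupied during training, which is exactly what G expects.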
So far, so good. We’ve successfully generated nice target images from the source’s poses using the pix2pixHD framework, which internally uses conditional GANs (the same kind of GAN we’ve discussed so far). In our use case, pix2pixHD generates human images from pose stick figures after rigorous training of the GAN network.
But the adversarial training of pix2pixHD is then modified to produce coherent video frames and realistic facial expressions, using temporal smoothing and a face GAN.
Temporal smoothing is used to generate a coherent sequence of video frames. Rather than predicting each frame independently, we condition on two consecutive frames to maintain the continuity of the video generated for the static target subject.
Initially, the first output G(xt-1) is conditioned on its pose stick figure xt-1 and a zero image z. The next output G(xt) is conditioned on its corresponding pose stick figure xt and the previously generated output G(xt-1). The discriminator now tries to distinguish the real temporally coherent sequence (xt-1, xt, yt-1, yt) from the fake sequence (xt-1, xt, G(xt-1), G(xt)).
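The two-frame conditioning can be sketched as a generator whose input is the current pose concatenated with the previous generated frame. This toy network is an assumption standing in for the paper’s pix2pixHD generator; only the conditioning pattern is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TemporalGenerator(nn.Module):
    """Toy generator conditioned on (pose stick figure, previous frame)."""
    def __init__(self, pose_ch=1, img_ch=3):
        super().__init__()
        # Input channels: pose + previous frame; output: an RGB frame.
        self.net = nn.Sequential(
            nn.Conv2d(pose_ch + img_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, img_ch, 3, padding=1), nn.Tanh())

    def forward(self, pose, prev_frame):
        return self.net(torch.cat([pose, prev_frame], dim=1))

G = TemporalGenerator()
x_prev = torch.rand(1, 1, 32, 32)         # pose stick figure at t-1
x_t = torch.rand(1, 1, 32, 32)            # pose stick figure at t
zeros = torch.zeros(1, 3, 32, 32)         # zero image z for the first frame

g_prev = G(x_prev, zeros)                 # G(x_{t-1}), conditioned on z
g_t = G(x_t, g_prev)                      # G(x_t), conditioned on G(x_{t-1})
# A sequence discriminator would then judge (x_{t-1}, x_t, G(x_{t-1}), G(x_t)).
```

Because each frame sees the one before it, flicker between consecutive frames is penalised during training.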
The face GAN improves the realism of facial expressions and is used alongside the main GAN we talked about earlier. A crop of the main generator’s output centred on the face, G(x)F, together with the corresponding face region of the pose stick figure, xF, is given to a dedicated face generator Gf, which outputs a residual r. Adding this residual to the original face crop gives the final face region, r + G(x)F. A face discriminator then tries to distinguish the real face pairs (xF, yF) from the fake face pairs (xF, r + G(x)F).
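The residual refinement can be sketched in a few lines. The tiny conv stack and the fixed face bounding box below are assumptions; the paper derives the face crop from the detected head keypoints.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy face generator Gf: sees the face crop of G(x) plus the face region of
# the pose stick figure, and predicts a residual correction r.
face_generator = nn.Sequential(
    nn.Conv2d(3 + 1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 3, 3, padding=1), nn.Tanh())

full_output = torch.rand(1, 3, 64, 64)    # G(x), the full generated frame
pose = torch.rand(1, 1, 64, 64)           # pose stick figure x

top, left, size = 4, 24, 16               # assumed face bounding box
g_face = full_output[:, :, top:top + size, left:left + size]   # G(x)F
x_face = pose[:, :, top:top + size, left:left + size]          # xF

r = face_generator(torch.cat([g_face, x_face], dim=1))  # residual r
refined_face = r + g_face                 # final face region, r + G(x)F
```

Predicting a residual rather than the whole face means Gf only has to fix fine details the main generator got wrong, which is an easier learning problem.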
Yay! We’re done. Summarising the above: we first train the main GAN, optimise its weights, and then apply the face GAN. Now you can dance too!
Source: Deep Learning on Medium