Using Pose Estimation and Conditional Adversarial Networks to create and visualize new Fortnite dances.
If you know about the game Fortnite, you probably also know about the craze surrounding the in-game celebrations/emotes/dances. Gamers have spent millions of dollars purchasing dance moves through in-app purchases, making something as simple and as silly as this a big revenue generator for the game developer. This got me thinking: if the developer let users create these dances in the game and charged extra for it, it could probably make even more money. As for the users, it would be really cool if we could record ourselves on a webcam and create our own celebratory dance within the game.
Doing something like that currently would require a Microsoft Kinect-like device which has dedicated hardware to sense our body movements. However, not everyone wants to buy separate devices for this. I think with advances in deep learning, it will soon become possible to achieve similar features with just a good old webcam. Let’s see how we could achieve something like this in the future with the help of Deep Learning algorithms.
Going from webcam to Fortnite
To try this out, I have used two Deep Learning techniques in this project. First, a pose estimation algorithm extracts a stick-figure representation (pose) from our webcam recording. For a game developer, this pose representation would be enough to animate a character in the game, but since I’m not the developer, I have to settle for visualizing what a Fortnite dance created from this pose would look like. For this purpose, I have used a Conditional Adversarial Network called pix2pix to generate the character in the given pose.
Pose Estimation from webcam
For estimating the pose in the input image, I have used the algorithm from the paper Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields by Cao et al. [CVPR 17]. The algorithm uses Convolutional Neural Networks at multiple scales to identify different parts of the body, such as the left arm, right arm, and torso. Each detected part becomes a node in a graph.
Once all such body parts are detected, a greedy parsing algorithm connects nearby nodes into a connected graph, giving us a stick figure that represents the pose. It runs in real time and also works with multiple people present in the image.
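To make the parsing step concrete, here is a heavily simplified sketch of the idea. The real algorithm scores candidate limb connections with Part Affinity Fields; this illustration (all names and the distance-based score are my own simplifications, not the paper's code) just greedily pairs each detected part with its nearest valid neighbour:

```python
import numpy as np

# Which pairs of body parts form limbs in our toy skeleton (illustrative subset).
LIMBS = [("neck", "l_shoulder"), ("neck", "r_shoulder"),
         ("l_shoulder", "l_elbow"), ("r_shoulder", "r_elbow")]

def connect_parts(candidates):
    """candidates: {part_name: list of (x, y) detections}.
    Returns limb segments ((x1, y1), (x2, y2)) forming the stick figure.
    Stand-in for PAF scoring: greedily pick the nearest unused candidate."""
    skeleton = []
    for part_a, part_b in LIMBS:
        used_b = set()
        for a in candidates.get(part_a, []):
            best, best_d = None, np.inf
            for j, b in enumerate(candidates.get(part_b, [])):
                d = np.hypot(a[0] - b[0], a[1] - b[1])
                if j not in used_b and d < best_d:
                    best, best_d = j, d
            if best is not None:
                used_b.add(best)
                skeleton.append((a, candidates[part_b][best]))
    return skeleton

# Detections for a single person
parts = {"neck": [(50, 20)], "l_shoulder": [(40, 25)], "r_shoulder": [(60, 25)],
         "l_elbow": [(35, 45)], "r_elbow": [(65, 45)]}
limbs = connect_parts(parts)
print(len(limbs))  # 4 limb segments
```

The greedy, per-limb matching is what keeps the real method fast enough to run in real time even with several people in frame.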
Image Synthesis from Pose
Once we have the pose, we want to convert it into the Fortnite character. For this, we use the same pose estimation algorithm to generate our training data. This gives us labelled training pairs, where the pose figure is the input and the corresponding Fortnite frame is the target label.
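A quick sketch of what building one such training pair could look like. I'm assuming the stick-figure render and the game frame are same-sized RGB arrays (the names here are illustrative); the original pix2pix data loader expects input A and target B concatenated side by side in a single image:

```python
import numpy as np

def make_pair(pose_img, frame_img):
    """Concatenate input (A) and target (B) into one A|B training image,
    the layout pix2pix's original data pipeline reads."""
    assert pose_img.shape == frame_img.shape
    return np.concatenate([pose_img, frame_img], axis=1)

# Dummy stand-ins: a black stick-figure render and a grey Fortnite frame.
pose = np.zeros((256, 256, 3), dtype=np.uint8)       # input A
frame = np.full((256, 256, 3), 127, dtype=np.uint8)  # target B
pair = make_pair(pose, frame)
print(pair.shape)  # (256, 512, 3)
```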
Then, a pix2pix network is trained to translate the input to the output. It uses a generative adversarial network for producing the target image conditioned on the input image rather than on random noise, so we can actually generate images of the Fortnite character that follow the pose given as input.
Both input and target images are available during training. The discriminator sees both real pairs (the input pose with the true Fortnite frame) and fake pairs (the input pose with the generator’s output), which is what lets it learn to tell the two apart.
After both the generator and discriminator losses converge, the network produces quite decent results. It has learnt to follow the input pose very well, associating each body part of the Fortnite character with the corresponding part of the stick figure. Unfortunately, the images produced are quite blurry and lack fine detail. Let me know in the comments section down below if you know how I could improve the results of the pix2pix network.
Source: Deep Learning on Medium