Creating custom Fortnite dances with a webcam and Deep Learning

Using Pose Estimation and Conditional Adversarial Networks to create and visualize new Fortnite dances.

Recreating a Fortnite character’s dance moves using poses from my webcam video.

If you know about the game Fortnite, you probably also know about the craze surrounding the in-game celebrations/emotes/dances. Gamers have spent millions of dollars buying dance moves through in-app purchases, making something as simple and as silly as this a big revenue generator for the game developer. This got me thinking: if the developer allowed users to create these dances in the game and charged extra for it, they could probably make even more money. As for the users, it would be really cool if we could record ourselves on a webcam and create our own celebratory dance within the game.

Doing something like that today would require a device like the Microsoft Kinect, which has dedicated hardware to sense body movements. However, not everyone wants to buy a separate device for this. I think that with advances in deep learning, it will soon become possible to achieve similar features with just a good old webcam. Let’s see how we could achieve something like this in the future with the help of Deep Learning algorithms.

Going from webcam to Fortnite

To try this out, I used two Deep Learning techniques in this project. First, we’ll use a pose estimation algorithm to extract a stick-figure representation (pose) from our webcam recording. For a game developer, this pose representation is enough to animate a character in the game, but since I’m not the developer, I can only visualize what a Fortnite dance created from this pose would look like. For this purpose, I used a conditional adversarial network called pix2pix to generate the character in the given pose.

The pipeline to go from webcam to Fortnite involves two steps: (1) getting the pose from a webcam image, followed by (2) synthesizing the Fortnite character in that particular pose.

Pose Estimation from webcam

To estimate the pose in the input image, I used the algorithm from the paper Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields by Cao et al. [CVPR 17]. The algorithm uses Convolutional Neural Networks at multiple scales to identify different parts of the body like the left arm, right arm, torso, etc. Each detected part is represented as a single node.

This image shows the detection results of a Convolutional Neural Net trained to detect the right arm of a human body. The detection is shown by the hot region of this heatmap at multiple scales.
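To make the idea of a "hot region" concrete, here is a minimal sketch of how a single node could be read off one heatmap: pick the cell with the highest score and discard it if it is too weak. The function name `find_keypoint`, the threshold value, and the toy heatmap are my own assumptions for illustration; the actual implementation of Cao et al. works on full-resolution network outputs and handles several peaks per heatmap.

```python
from typing import List, Optional, Tuple

Heatmap = List[List[float]]  # heatmap[row][col] = detection confidence


def find_keypoint(
    heatmap: Heatmap, threshold: float = 0.1
) -> Optional[Tuple[int, int, float]]:
    """Return (row, col, score) of the hottest cell, or None if it is too weak.

    Hypothetical helper: a single-peak simplification of heatmap decoding.
    """
    best: Optional[Tuple[int, int, float]] = None
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if best is None or score > best[2]:
                best = (r, c, score)
    if best is None or best[2] < threshold:
        return None
    return best


# Toy 3x4 heatmap standing in for the "right arm" detector output.
toy_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.0],
]
right_arm = find_keypoint(toy_heatmap)
```

With one such node per body part, the next step only has to decide which nodes belong together.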

Once all such body parts are detected, it uses a greedy parsing algorithm to connect the nearby nodes to form a connected graph giving us a stick figure that is a representation of the pose. It runs in real-time and also works with multiple people present in the image.
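The greedy step can be sketched roughly as follows. Given candidate connections between two kinds of nodes, each with a score, we accept the strongest connections first and skip any candidate whose endpoint is already taken. In the paper the scores come from the Part Affinity Fields; here they are hand-written numbers, and `greedy_connect` and the node names are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[str, str, float]  # (node_a, node_b, connection score)


def greedy_connect(candidates: List[Candidate]) -> List[Tuple[str, str]]:
    """Greedily pick the highest-scoring connections, using each node once."""
    used_a, used_b = set(), set()
    limbs: List[Tuple[str, str]] = []
    # Strongest candidates are considered first.
    for a, b, _score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if a in used_a or b in used_b:
            continue  # one endpoint already belongs to another limb
        limbs.append((a, b))
        used_a.add(a)
        used_b.add(b)
    return limbs


# Two people in the frame: two shoulders and two elbows to pair up.
candidates = [
    ("shoulder_1", "elbow_1", 0.9),
    ("shoulder_1", "elbow_2", 0.2),
    ("shoulder_2", "elbow_1", 0.3),
    ("shoulder_2", "elbow_2", 0.8),
]
limbs = greedy_connect(candidates)
```

Repeating this for every limb type (shoulder to elbow, elbow to wrist, and so on) yields one stick figure per person.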

Image Synthesis from Pose

Once we get the pose, we want to convert it into the Fortnite character. For this, we’ll use the same pose estimation algorithm to generate our training data by running it on images of the Fortnite character. This gives us labelled training data, where the pose figure is our input and the Fortnite character is our target label.

The training data contains a collection of paired images of the pose (input) and the targeted dance move (output).
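As a rough illustration of what one training pair looks like on disk, the sketch below joins a pose image and the matching character image side by side into a single image, which is a layout commonly used for pix2pix-style datasets. Whether my dataset was stored exactly this way is not stated here, so treat the layout, the `make_pair` helper, and the tiny 2x2 "images" as assumptions.

```python
from typing import List

Image = List[List[int]]  # image[row][col] = grayscale pixel value


def make_pair(pose_img: Image, character_img: Image) -> Image:
    """Concatenate input (left) and target (right) into one paired image."""
    if len(pose_img) != len(character_img):
        raise ValueError("pose and character images must have the same height")
    return [p_row + c_row for p_row, c_row in zip(pose_img, character_img)]


# Tiny stand-ins for a stick-figure frame and the matching game frame.
pose = [[0, 1], [1, 0]]
character = [[5, 6], [7, 8]]
pair = make_pair(pose, character)
```

At training time the paired image is split back down the middle, so the network always receives the pose and its matching target together.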

Then, a pix2pix network is trained to translate the input to the output. It uses a generative adversarial network for producing the target image conditioned on the input image rather than on random noise, so we can actually generate images of the Fortnite character that follow the pose given as input.

During training, the discriminator sees the input pose paired with either the real target image or the generator’s output, and learns to tell the two kinds of pairs apart. The generator only produces the fake images; it is trained to fool the discriminator while staying close to the target image.
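The training signal can be sketched with a few lines of plain Python. The discriminator is penalized when it scores a real (pose, character) pair low or a generated pair high, while the generator is rewarded for fooling the discriminator and additionally pulled toward the target by an L1 term. The weight of 100 on the L1 term follows the pix2pix paper; the function names and the flat lists of pixel values are simplifications of my own, not the code used for this project.

```python
import math
from typing import Sequence


def bce(prob: float, label: float) -> float:
    """Binary cross-entropy for one prediction, clipped for stability."""
    p = min(max(prob, 1e-7), 1.0 - 1e-7)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """D should output 1 for real pairs and 0 for generated pairs."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def l1_distance(generated: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute pixel difference between generated and target image."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(target)


def generator_loss(
    d_fake: float,
    generated: Sequence[float],
    target: Sequence[float],
    lam: float = 100.0,  # L1 weight, as in the pix2pix paper
) -> float:
    """G wants D to output 1 on its images and wants to stay near the target."""
    return bce(d_fake, 1.0) + lam * l1_distance(generated, target)


# d_real / d_fake are the discriminator's probabilities for each pair.
d_loss = discriminator_loss(d_real=0.9, d_fake=0.1)
g_loss = generator_loss(d_fake=0.1, generated=[0.2, 0.4], target=[0.25, 0.5])
```

In a real framework these would be tensor operations averaged over a batch, but the balance between the adversarial term and the L1 term is the same.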


After both the generator and discriminator losses converge, the network produces quite decent results. It has learnt to follow the input pose very well by associating each body part of the Fortnite character with the stick figure. Unfortunately, the generated images are blurry and lack fine detail. Let me know in the comments section below if you know how I could improve the results of the pix2pix network.

More such results can be found on my YouTube channel and in the video embedded below. If you liked it, feel free to subscribe to my channel to follow more of my work.

Thank you for reading! If you liked this article, please follow me on Medium, GitHub, or subscribe to my YouTube channel.

Source: Deep Learning on Medium