Everybody Dance Now!



The recently released paper ‘Everybody Dance Now’ from UC Berkeley is causing quite a stir in the AI community right now!

Given a source video of someone dancing (probably a pro dancer, for better results), they were able to transfer that performance to an entirely new target, using just a few minutes of footage of the target subject as training data. And remember, the dancing you see in the output was never filmed: every frame of the dancing target is GAN-generated! Wizards. It’s AI everywhere.

The implications of this technology are profound. It’s like auto-tune for dancing. The demo can be found here.

Notice that although the output video in their demo is a bit pixelated and blurry (dipping into the uncanny valley), it even produces a reflection of the dancing target! Overwhelming!

The target even produces reflections. Outstanding!

This work was likely inspired by an earlier demo in which deep learning engineers built a GAN that did motion retargeting in a simple manner. It can be found here.

Researchers split the training pipeline into 3 steps :

  1. Pose estimation
  2. Pose normalization
  3. Mapping normalized pose stick figures to the target subject

Once they found a suitable source video, they needed to encode the dancer’s key points, and what better way to do that than OpenPose, a pre-trained pose detection model? OpenPose outputs a stick figure with all the joint coordinates connected.

Under the hood, OpenPose is a convolutional neural network trained with the usual strategy of gradient descent.
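To make the stick figure idea concrete, here is a minimal sketch of turning detected key points into limb segments. Each key point is assumed to be an (x, y, confidence) triple. The joint names, the `LIMBS` list, and the `min_conf` threshold are made up for illustration and are not OpenPose's actual joint layout or API.

```python
from typing import Dict, List, Tuple

Keypoint = Tuple[float, float, float]          # (x, y, confidence)
Segment = Tuple[Tuple[float, float], Tuple[float, float]]

# Hypothetical mini-skeleton: pairs of joints that should be connected.
LIMBS = [
    ("head", "neck"),
    ("neck", "hip"),
    ("hip", "left_ankle"),
    ("hip", "right_ankle"),
]


def stick_figure_segments(keypoints: Dict[str, Keypoint],
                          min_conf: float = 0.1) -> List[Segment]:
    """Return the line segments to draw for one frame's stick figure."""
    segments = []
    for joint_a, joint_b in LIMBS:
        if joint_a not in keypoints or joint_b not in keypoints:
            continue  # joint was not detected in this frame
        xa, ya, conf_a = keypoints[joint_a]
        xb, yb, conf_b = keypoints[joint_b]
        if conf_a < min_conf or conf_b < min_conf:
            continue  # skip limbs whose endpoints are unreliable
        segments.append(((xa, ya), (xb, yb)))
    return segments


frame_keypoints = {
    "head": (100.0, 50.0, 0.9),
    "neck": (100.0, 80.0, 0.9),
    "hip": (100.0, 150.0, 0.8),
    "left_ankle": (90.0, 250.0, 0.05),   # low confidence, will be skipped
    "right_ankle": (110.0, 250.0, 0.7),
}
print(stick_figure_segments(frame_keypoints))
```

Drawing those segments on a blank canvas gives the stick figure image that the later steps consume.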

Since a video is just a series of images, the researchers ran OpenPose on every frame, which gave them a rich dataset of dance poses. They then normalized each pose, using the joint positions to account for differences in body shape and position in the frame between the source and the target.
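Here is a rough sketch of what pose normalization could look like. It is a simplified stand-in, not the paper's exact procedure: it scales the source pose to the target's body height and shifts it so the feet land on the target's ground line. The function names and the "height plus ground line" parameterization are my own assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixels, with y growing downward


def body_height(pose: List[Point]) -> float:
    """Vertical extent of the pose, from the highest to the lowest joint."""
    ys = [y for _, y in pose]
    return max(ys) - min(ys)


def normalize_pose(source_pose: List[Point],
                   target_height: float,
                   target_ground_y: float) -> List[Point]:
    """Scale and translate a source pose to match the target's proportions."""
    height = body_height(source_pose)
    if height == 0:
        raise ValueError("pose has zero height, cannot compute a scale")
    scale = target_height / height

    center_x = sum(x for x, _ in source_pose) / len(source_pose)
    source_ground_y = max(y for _, y in source_pose)  # lowest joint = feet

    normalized = []
    for x, y in source_pose:
        new_x = center_x + (x - center_x) * scale            # scale about the center
        new_y = target_ground_y + (y - source_ground_y) * scale  # feet on target ground
        normalized.append((new_x, new_y))
    return normalized


tall_dancer = [(100.0, 50.0), (100.0, 150.0), (90.0, 250.0), (110.0, 250.0)]
print(normalize_pose(tall_dancer, target_height=100.0, target_ground_y=400.0))
```

The point is that the generator then sees poses with the same proportions and placement as the ones it was trained on, even if the source dancer is taller or stands somewhere else in the frame.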

They fed this dataset into a Generative Adversarial Network, or GAN (click here to learn about GANs). In short, the Generator’s job is to output an image of the target in a given pose, while the Discriminator sees both real frames from the target video and the Generator’s output, and its job is to label each image as real or fake.
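The tug-of-war between the two networks can be written down as a pair of losses. The sketch below uses the textbook GAN objective (binary cross-entropy, with the non-saturating generator loss) on plain lists of discriminator outputs. That is an assumption for illustration: the paper's actual training objective may differ, and the function names are mine.

```python
import math
from typing import List

EPS = 1e-12  # keeps log() away from zero


def discriminator_loss(d_real: List[float], d_fake: List[float]) -> float:
    """Low when the discriminator scores real frames near 1 and fakes near 0.

    d_real / d_fake are the discriminator's probabilities of "real" for
    real target frames and for generated frames, respectively.
    """
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake: List[float]) -> float:
    """Low when the discriminator is fooled into scoring fakes near 1."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)


# A sharp discriminator: confident on real frames, not fooled by fakes.
print(discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2]))
# The same fakes are bad news for the generator.
print(generator_loss(d_fake=[0.1, 0.2]))
```

Training alternates between lowering the first loss (updating the Discriminator) and lowering the second (updating the Generator), so each network keeps pushing the other to improve.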

But they found that a plain GAN produced choppy videos that didn’t look realistic, so they added two things: a) temporal smoothing and b) a facial GAN.

The left image shows temporal smoothing, while the right shows the facial GAN.
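As I understand it, the temporal smoothing idea is to have the generator look at the frame it just produced, in addition to the current pose, so consecutive frames stay consistent. Below is a toy sketch of that generation loop. The `toy_generator` is a made-up stand-in (a simple blend of numbers) for the real neural network, and "frames" here are just short lists of floats.

```python
from typing import Callable, List

Frame = List[float]  # stand-in for an image


def toy_generator(pose: Frame, previous_frame: Frame) -> Frame:
    """Hypothetical generator: blends the current pose with the last output.

    The real generator is a trained network; this only mimics the fact
    that its output depends on both inputs.
    """
    return [0.5 * p + 0.5 * q for p, q in zip(pose, previous_frame)]


def generate_video(poses: List[Frame],
                   generator: Callable[[Frame, Frame], Frame]) -> List[Frame]:
    """Generate frames one by one, feeding each output back as context."""
    if not poses:
        return []
    previous = [0.0] * len(poses[0])  # blank "zero image" for the first frame
    frames = []
    for pose in poses:
        frame = generator(pose, previous)
        frames.append(frame)
        previous = frame  # the next frame is conditioned on this one
    return frames


pose_sequence = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
print(generate_video(pose_sequence, toy_generator))
```

Because each frame inherits from the one before it, an abrupt jump in the pose input turns into a gradual change in the output, which is the opposite of the choppy look.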

Pros : Can be used for

— movie Studios or music videos where there’s no need to hire an entire background crew since we can use motion transfer there.

— Advertising campaigns

— Democratizing twerking (a type of dance)

Cons: Just like deepfakes, this means that in today’s world any video evidence can be manipulated (people can be framed for crimes they didn’t commit), since AI can generate or synthesize voices, faces, music, and videos.

CLICK HERE FOR THE PAPER

Source: Deep Learning on Medium