A recently released paper titled ‘Everybody Dance Now’ from UC Berkeley is causing a lot of hype in the AI community right now!
Given a source video of someone dancing (probably a pro dancer for better results), they were able to transfer that performance to an entirely new target using just a few minutes of training footage of the target. And remember, the target you see in the output video is GAN-generated, not actual footage of the person! Wizards, it’s AI everywhere...
The implications of this technology are profound. It’s like auto-tune for dancing. The demo can be found here.
Notice that although the output video in their demo is a bit pixelated or blurry (dipping into the uncanny valley), it even produces a reflection of the dancing target. Overwhelming!
This work was likely inspired by an earlier demo in which deep learning engineers built a GAN that could do motion retargeting in a simple manner. It can be found here.
The researchers split the pipeline into three steps:
- Pose estimation
- Pose normalization
- Mapping normalized pose stick figures to the target subject
Once they found a suitable source video, they needed to encode the key points of the dancer’s body, and what better way to do that than a pre-trained pose detection model called OpenPose? OpenPose outputs the coordinates of the body joints, which are then connected to form a stick figure.
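To make the stick-figure idea concrete, here is a minimal sketch in plain Python. It does not call OpenPose itself. The joint names, the `LIMBS` list and the `stick_figure_segments` helper are made up for illustration, and a real pose estimator outputs many more joints than this toy skeleton.

```python
# Hypothetical, simplified skeleton (an assumption for illustration only).
# Each keypoint is (x, y, confidence) in pixel coordinates.
LIMBS = [
    ("neck", "r_shoulder"),
    ("neck", "l_shoulder"),
    ("neck", "hip"),
    ("hip", "r_ankle"),
    ("hip", "l_ankle"),
]


def stick_figure_segments(keypoints, min_conf=0.1):
    """Return the line segments ((x1, y1), (x2, y2)) for every limb whose
    two joints were both detected with enough confidence."""
    segments = []
    for joint_a, joint_b in LIMBS:
        if joint_a not in keypoints or joint_b not in keypoints:
            continue
        xa, ya, conf_a = keypoints[joint_a]
        xb, yb, conf_b = keypoints[joint_b]
        if conf_a >= min_conf and conf_b >= min_conf:
            segments.append(((xa, ya), (xb, yb)))
    return segments


# Made-up detections for a single video frame.
frame_keypoints = {
    "neck": (100.0, 50.0, 0.9),
    "r_shoulder": (80.0, 55.0, 0.8),
    "l_shoulder": (120.0, 55.0, 0.85),
    "hip": (100.0, 120.0, 0.9),
    "r_ankle": (90.0, 200.0, 0.7),
    "l_ankle": (110.0, 200.0, 0.05),  # poorly detected, so this limb is skipped
}

segments = stick_figure_segments(frame_keypoints)
print(len(segments), "limbs to draw")
```

Drawing those segments on a blank canvas gives the stick figure for that frame. Skipping low-confidence joints is a design choice of this sketch, not something the post describes.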
Since a video is a series of images, the researchers ran OpenPose on each frame, resulting in a rich dataset of dance poses. They then normalized the pose in each frame, by analyzing joint positions, to help account for the differences between the source and the target.
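A rough sketch of what such a normalization could look like is below. It is a simplification under stated assumptions: it uses one global scale (ratio of body heights) and one offset (aligning the ankles), and the helper names `body_height`, `ground_point` and `normalize_pose` are hypothetical. The paper's actual scheme is more involved.

```python
def body_height(pose):
    """Vertical distance from the head to the lower ankle.
    Image coordinates: y grows downward."""
    ankle_y = max(pose["l_ankle"][1], pose["r_ankle"][1])
    return ankle_y - pose["head"][1]


def ground_point(pose):
    """Point between the two ankles, used as the anchor for alignment."""
    (lx, ly), (rx, ry) = pose["l_ankle"], pose["r_ankle"]
    return ((lx + rx) / 2.0, max(ly, ry))


def normalize_pose(source_pose, source_ref, target_ref):
    """Rescale and shift a source pose so that the source's reference pose
    would line up with the target's reference pose (assumed simplification)."""
    scale = body_height(target_ref) / body_height(source_ref)
    src_x, src_y = ground_point(source_ref)
    tgt_x, tgt_y = ground_point(target_ref)
    return {
        name: (tgt_x + (x - src_x) * scale, tgt_y + (y - src_y) * scale)
        for name, (x, y) in source_pose.items()
    }


# Made-up reference poses: the source dancer appears twice as tall in frame.
source_ref = {"head": (200.0, 100.0), "l_ankle": (190.0, 500.0), "r_ankle": (210.0, 500.0)}
target_ref = {"head": (100.0, 100.0), "l_ankle": (95.0, 300.0), "r_ankle": (105.0, 300.0)}

# A later frame of the source dance, leaning slightly to the right.
dance_frame = {"head": (220.0, 120.0), "l_ankle": (190.0, 500.0), "r_ankle": (210.0, 500.0)}

normalized = normalize_pose(dance_frame, source_ref, target_ref)
print(normalized["head"])
```

After this step the source's stick figure sits where the target stood during training and at the target's size, which is what the generator expects to see.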
They used this dataset as input to a Generative Adversarial Network, or GAN (click here to learn about GANs). In short, the Generator’s job is to output an image, while the Discriminator is shown real frames from the target video along with the Generator’s output, and its job is to classify each image as real or fake.
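The tug-of-war between the two networks can be written down as a pair of loss functions. The sketch below uses the standard textbook GAN losses in plain Python, with the Discriminator's output treated as a probability that an image is real. This is an assumption for illustration: the paper's actual objective has additional terms, and the function names here are made up.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Low when the Discriminator scores real frames near 1
    and generated frames near 0."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_loss(d_fake, eps=1e-12):
    """Low when the Discriminator is fooled, i.e. scores a generated
    frame near 1 (the common non-saturating form)."""
    return -math.log(d_fake + eps)


# A confident, correct Discriminator versus one that is just guessing.
print(discriminator_loss(d_real=0.9, d_fake=0.1))
print(discriminator_loss(d_real=0.5, d_fake=0.5))

# The Generator is happier when its fake frame is scored as likely real.
print(generator_loss(d_fake=0.9))
print(generator_loss(d_fake=0.1))
```

Training alternates between lowering the first loss (by updating the Discriminator) and lowering the second (by updating the Generator), so each network keeps pushing the other to improve.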
But they found that this GAN on its own produced somewhat choppy videos that didn’t look realistic, so they added two things: a) temporal smoothing and b) a separate GAN for the face.
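The post does not spell out how the temporal smoothing works. Roughly, the idea in the paper is that each frame is generated from the current pose and the previously generated frame, so consecutive frames stay consistent. The toy sketch below only shows that conditioning loop: a frame is reduced to a single number, and `toy_generator` is a hypothetical stand-in for the trained network, with blend weights picked arbitrarily.

```python
def toy_generator(pose_value, prev_frame):
    """Stand-in for a trained generator (assumption): the output depends on
    both the current pose and the previously generated frame."""
    return 0.6 * pose_value + 0.4 * prev_frame


def generate_video(pose_values, blank_frame=0.0):
    """Generate frames one by one, feeding each output back in as the
    'previous frame' for the next step. The first step sees a blank frame."""
    frames = []
    prev = blank_frame
    for pose_value in pose_values:
        frame = toy_generator(pose_value, prev)
        frames.append(frame)
        prev = frame
    return frames


def jitter(values):
    """Total frame-to-frame change: a crude measure of choppiness."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


# A deliberately jumpy pose signal.
poses = [1.0, 0.0, 1.0, 0.0, 1.0]
frames = generate_video(poses)

print(jitter(poses), jitter(frames))
```

In this toy setup the generated sequence changes less from frame to frame than the raw pose signal does, which is the intuition behind conditioning on the previous output.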
Pros: it can be used for
- Movie studios or music videos, where there’s no need to hire an entire crew of background dancers since motion transfer can be used instead
- Advertising campaigns
- Democratizing twerking (a type of dance)
Cons: just like deepfakes, this means that in today’s world any video evidence can be manipulated (people could be framed for crimes they didn’t commit), since AI can generate or synthesize voices, faces, music and videos.
Source: Deep Learning on Medium