Everybody Dance Faster

Finally, we reach something closer to an entertaining example of motion transfer content, if you don’t mind a ‘glitch’ effect.

Motivated by a sense of how our experimental design choices have impacted the quality of the renditions, we can constrain our demo to more consistently produce high-quality examples.

Setting the Scene

Simple, symmetric scenes will be easiest to generate. This will help us spend our practical compute budget refining models to produce high-fidelity renditions of the subject dancer.

The researchers emphasized slim-fit clothing to limit the challenges of producing wrinkle textures. For our purposes, we assume participants will wear attire typical of a tech or business conference.

Additionally, we assume the scene is an adequately lit booth with enough space to frame a shot from a perspective similar to that of the source reference video.

GANs trained half as long as examples above on smaller images

The example above shows an idealized setting for our booth after training an image-to-image translation model on roughly 5,000 640×480 images.

Note the glitchy frames due to poor pose estimation at the feature extraction step on the source dance video.

This reference implementation was run for roughly 8 hours on a GTX 1080 GPU. We want to get training times down to one hour, so we will need something quite different.

Next, we discuss some implementation choices to expedite the production of motion transfer examples in a live demo setting.

Estimating Pose at the Edge

Motion transfer involves a costly feature extraction step of running pose estimates over source and target videos.

However, reference source videos are harder to come by, so for our implementation we assume they are available in preprocessed form.

Then, by performing fast pose estimation on the target video, we can spend the remaining time training the GANs.

The new Coral Dev Board (Edge TPU) can run pose estimation at roughly 35 fps for 481×353 images using TFLite. For 640×480 images, we can run inference inline with frame acquisition at roughly 25 fps.
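
A minimal sketch of this inference step, assuming the tflite_runtime package and a pose model compiled for the Edge TPU; the model filename is a placeholder, and the output tensor layout (heatmaps and offsets versus decoded keypoints) depends on which pose model you compile.

```python
# Sketch: single-frame pose inference on the Coral Edge TPU via tflite_runtime.
# MODEL_PATH is a hypothetical Edge TPU-compiled pose model.
import cv2
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

MODEL_PATH = "posenet_edgetpu.tflite"  # placeholder model file

interpreter = Interpreter(
    model_path=MODEL_PATH,
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()

def estimate_pose(frame_bgr):
    """Run one pose estimate on a single OpenCV (BGR) frame."""
    _, height, width, _ = input_details["shape"]
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (int(width), int(height)))
    interpreter.set_tensor(input_details["index"], np.expand_dims(resized, 0))
    interpreter.invoke()
    # Return the raw output tensors; decoding into keypoints depends on the model.
    return [interpreter.get_tensor(d["index"]) for d in output_details]
```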

To achieve the greatest time resolution using high-speed cameras, we don’t want to block frame acquisition with inference and streaming; instead, we write frames to an mp4. Then the video file can be queued for asynchronous processing and streaming.
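
Below is a sketch of that acquisition pattern using OpenCV and a worker thread; the camera index, codec, clip length, and downstream processing are assumptions for illustration.

```python
# Sketch: record frames straight to an mp4 without running inference inline,
# then hand the finished file to an asynchronous worker via a queue.
import queue
import threading
import cv2

video_queue = queue.Queue()

def acquire(camera_index=0, out_path="capture.mp4", seconds=15, fps=25):
    cap = cv2.VideoCapture(camera_index)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    for _ in range(int(seconds * fps)):
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(frame)  # acquisition only; no inference in this loop
    cap.release()
    writer.release()
    video_queue.put(out_path)  # queue the clip for asynchronous processing

def process_videos():
    while True:
        path = video_queue.get()
        # run pose estimation / streaming on the recorded file here
        video_queue.task_done()

threading.Thread(target=process_videos, daemon=True).start()
acquire()
video_queue.join()
```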

Assuming a realistic time budget from a user in our booth, say 15 seconds, we can increase the number of Edge TPUs and high-speed USB cameras until we can ensure we acquire sufficiently many training samples for the GANs.
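
To make this concrete, here is a back-of-the-envelope sizing calculation; the capture rate, target frame count, and post-processing budget are assumptions chosen for illustration.

```python
# Sketch: sizing the booth hardware for a 15-second capture window.
import math

TIME_BUDGET_S = 15       # time the user spends dancing in the booth
CAPTURE_FPS = 120        # assumed high-speed camera frame rate
TPU_POSE_FPS = 25        # approximate Edge TPU pose throughput at 640x480
TARGET_FRAMES = 5000     # assumed training-set size for the GANs
POSTPROCESS_BUDGET_S = 120  # assumed time allowed for pose extraction

frames_per_camera = TIME_BUDGET_S * CAPTURE_FPS
cameras_needed = math.ceil(TARGET_FRAMES / frames_per_camera)

# Pose extraction runs after capture, so the TPU count is set by how quickly
# we want the features ready for training.
tpus_needed = math.ceil(TARGET_FRAMES / (TPU_POSE_FPS * POSTPROCESS_BUDGET_S))

print(cameras_needed, tpus_needed)  # e.g. 3 cameras, 2 Edge TPUs
```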

We’ve also seen how pose estimate quality impacts the final result, so we choose larger, more accurate models and apply simple heuristics that exploit continuity of motion.

More concretely, we impute missing keypoints and apply time smoothing to pose estimates enqueued into a circular buffer. This is especially sensible when performing high-speed frame acquisition.
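
A sketch of these heuristics follows, assuming 17-keypoint poses with per-keypoint confidence scores; the buffer length and confidence threshold are illustrative.

```python
# Sketch: impute low-confidence keypoints from the previous frame and apply
# time smoothing over a fixed-length (circular) buffer of pose estimates.
from collections import deque
import numpy as np

NUM_KEYPOINTS = 17
CONF_THRESHOLD = 0.3   # assumed confidence cutoff for "missing" keypoints
BUFFER_LEN = 5         # assumed smoothing window

pose_buffer = deque(maxlen=BUFFER_LEN)  # acts as a circular buffer

def push_pose(keypoints, scores):
    """keypoints: (17, 2) array of (x, y); scores: (17,) confidences.
    Returns the time-smoothed pose for the current frame."""
    keypoints = np.asarray(keypoints, dtype=np.float32).copy()
    scores = np.asarray(scores)
    if pose_buffer:
        previous = pose_buffer[-1]
        # Impute low-confidence keypoints from the previous smoothed frame,
        # exploiting continuity of motion between consecutive frames.
        missing = scores < CONF_THRESHOLD
        keypoints[missing] = previous[missing]
    pose_buffer.append(keypoints)
    # Time smoothing: average the poses currently in the buffer.
    return np.mean(np.stack(pose_buffer), axis=0)
```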