Speech-Controlled Body Animations With Deep Learning

Overview of the paper “Style-Controllable Speech-Driven Gesture Synthesis Using Normalizing Flows” by S. Alexanderson et al.

AI techniques like LipGAN can generate lip-movement animations for a face from just a speech audio file. This is great for automatically producing large numbers of talking animations in games. If, in addition, we can synthesize realistic hand and body movements coordinated with the same input speech, we may soon be able to create entire animations of a virtual game character talking and interacting, without having to design any of it manually.

That is why, in this article, I want to share the paper titled “Style-Controllable Speech-Driven Gesture Synthesis Using Normalizing Flows” by researchers in Sweden. The technique can generate plausible gestures from input speech audio alone. Thanks to its probabilistic generative modeling approach, it can also produce multiple distinct gesture sequences for the same speech.

The method uses an autoregressive model based on an LSTM, trained to model motion as a time-series distribution of poses. To predict the next pose, it takes the previously generated poses as input, combined with several other inputs. The first of these is a set of acoustic features extracted from the input speech via a sliding-window mechanism, as sketched below.
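To make this autoregressive setup concrete, here is a minimal PyTorch sketch. Note this is not the paper's actual model, which is a normalizing flow (MoGlow) that samples latent noise at every frame; a plain LSTM regressor stands in here purely to illustrate the conditioning on previous poses plus a sliding window of acoustic features. All class names, dimensions, and the pre-windowing of the audio features are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoregressivePoseModel(nn.Module):
    """Illustrative stand-in for the paper's flow-based model: an LSTM
    that receives the previous pose plus a sliding window of acoustic
    features and predicts the next pose. Dimensions are assumptions."""

    def __init__(self, pose_dim=45, audio_dim=27, window=21, hidden=256):
        super().__init__()
        # Each timestep sees the previous pose and a flattened window of
        # acoustic features centered on the current frame.
        self.lstm = nn.LSTM(pose_dim + audio_dim * window, hidden,
                            batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, prev_pose, audio_window, state=None):
        x = torch.cat([prev_pose, audio_window], dim=-1)
        h, state = self.lstm(x, state)
        return self.out(h), state

# Frame-by-frame generation: each predicted pose is fed back as input.
model = AutoregressivePoseModel()
audio = torch.randn(1, 100, 27 * 21)   # 100 frames of pre-windowed features
pose = torch.zeros(1, 1, 45)           # seed pose
state, frames = None, []
for t in range(100):
    pose, state = model(pose, audio[:, t:t + 1], state)
    frames.append(pose)
motion = torch.cat(frames, dim=1)      # (1, 100, 45) synthesized pose sequence
```

Because the real model is probabilistic, each generation pass samples fresh latent noise, which is what lets it produce a different, equally plausible gesture sequence for the same speech input; the deterministic loop above would always return the same motion.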