How to Create Fake Talking Head Videos With Deep Learning (Code Tutorial)

Combining Face Generation (StyleGAN2), Text Generation (GPT-2), Text-to-Speech (Flowtron), and Lip-Sync Animation (LipGAN).

Creating realistic computer-generated content has always been a challenging and time-consuming task in the movie and games industries. However, with the use of Deep Learning, we can transfer almost all of this workload onto neural networks that are trained for various audio and visual synthesis tasks.

In this article, I want to share a quick and easy method to synthesize talking head videos of random fake people. As you can see in the video embedded below, their face, script, and voice, as well as the lip-sync animation, have all been created by different state-of-the-art Deep Learning algorithms.

To follow along with this coding tutorial, all you need is a Google account: sign in to Google Colab, create a new notebook with the GPU runtime type, and follow the steps listed below.

Step 1: Face Generation with StyleGAN2

Let’s first generate the face of the hypothetical person we want as the narrator of our output video. The current state of the art in face generation is NVIDIA’s StyleGAN2, so we’ll obtain a StyleGAN2-generated face from the website www.thispersondoesnotexist.com.

Let’s first install the required web tools for scraping photos off this website. Execute the following in your Colab Notebook.
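A minimal sketch of this cell, assuming we fetch the image with the requests library (it usually comes pre-installed on Colab, so the install is mostly a safeguard):

```python
# Install the helper library for fetching the face image.
# requests usually comes pre-installed on Colab; this is a quick safeguard.
!pip install -q requests
```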

Next, execute the following code to visit the website and download the image to your Colab environment. Run this portion of code again and again until you see a face you want to use for your intended purpose.
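A sketch of this cell, assuming the site serves a fresh image at the URL below (verify the endpoint in your browser if the request fails):

```python
import requests
from IPython.display import Image, display

# thispersondoesnotexist.com serves a newly generated StyleGAN2 face on
# every request; a browser-like User-Agent avoids being rejected as a bot.
url = "https://www.thispersondoesnotexist.com/image"  # assumed endpoint
headers = {"User-Agent": "Mozilla/5.0"}

with open("face.jpg", "wb") as f:
    f.write(requests.get(url, headers=headers).content)

display(Image("face.jpg"))  # re-run this cell until you like the face
```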

You should see a high-res face displayed in your notebook, as shown below. Note that in this tutorial we will only be using female faces, since the speech synthesizer used later only comes with a female voice. This can easily be extended to other cases by training separate speech models.

Fake face generated by StyleGAN2. Source: www.thispersondoesnotexist.com

Step 2: Script Generation with GPT-2

Now we will use OpenAI’s GPT-2 language model to generate a script for our narrator. For this purpose, we will use the website TextSynth. Execute the following code and enter a question you want to ask this fake person at the text prompt. The question is sent to the GPT-2 model, which generates the text that follows it, so consider including the first few words of the answer in your prompt to guide the language model toward a better response.
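A sketch of this step, assuming TextSynth’s JSON completions API; the endpoint, engine name, and response field below are assumptions, so check the current TextSynth documentation (recent versions of the API also require a key passed in an Authorization header):

```python
import requests

prompt = input("Ask the fake person a question: ")

# NOTE: endpoint, engine name, and response format are assumptions based on
# the TextSynth API; verify against the current documentation before running.
resp = requests.post(
    "https://api.textsynth.com/v1/engines/gpt2_1558M/completions",
    json={"prompt": prompt, "max_tokens": 120},
)
generated = resp.json()["text"]  # assumed response field

print(prompt + generated)

# Save the answer so the text-to-speech step can read it later.
with open("/content/script.txt", "w") as f:
    f.write(generated.strip())
```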

You should expect the response to your question to be an output like the question-and-answer example shown at the end of this article.

Step 3: Text-to-Speech Conversion with Flowtron

Now, let’s convert this text response into the audio of our narrator’s voice. The current state of the art in TTS is NVIDIA’s Flowtron model. To use it, we’ll clone the Flowtron repository and download the available pre-trained models by executing the following code.
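In outline, the setup looks something like this; the &lt;...&gt; Google Drive IDs are placeholders, so copy the real checkpoint links from the Flowtron README:

```python
# Clone NVIDIA's Flowtron repository along with its submodules.
!git clone https://github.com/NVIDIA/flowtron.git
%cd flowtron
!git submodule update --init
!pip install -q -r requirements.txt

# Download the pre-trained Flowtron (LJSpeech) and WaveGlow checkpoints.
# The <...> IDs are placeholders -- the real Google Drive links are in
# the repository's README.
!mkdir -p models
!gdown --id <FLOWTRON_LJS_CHECKPOINT_ID> -O models/flowtron_ljs.pt
!gdown --id <WAVEGLOW_CHECKPOINT_ID> -O models/waveglow.pt
```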

Now, let’s perform inference on our GPT-2 text response. To do this, run the inference.py script from the repository we just set up by executing the following code.
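Something along these lines; the flag names (-c config, -f Flowtron checkpoint, -w WaveGlow checkpoint, -t text, -i speaker ID) follow the Flowtron README, but verify them against your clone:

```python
# Read the GPT-2 script and synthesize it with Flowtron.
text = open("/content/script.txt").read().replace("\n", " ").strip()

!python inference.py -c config.json \
    -f models/flowtron_ljs.pt \
    -w models/waveglow.pt \
    -t "{text}" -i 0
```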

This should generate a speech.wav audio file that you can preview within the Colab notebook using IPython’s Audio display.
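For example (Flowtron writes its audio into a results directory by default, so move or rename the generated file to speech.wav first):

```python
from IPython.display import Audio

# Play the synthesized narration inline in the notebook.
Audio("speech.wav")
```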

Step 4: Lip-Sync Animation with LipGAN

Now comes the last piece of the puzzle. We have a face, a script, and its respective audio. We need to combine these to create a realistic animation giving us the final talking head output video. We can use LipGAN for this purpose. It takes a face image and an audio file as input, and then produces a lip-synced animation of the face according to the speech in the audio file. Let’s clone this repository in our environment and install the necessary packages.
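A sketch of the setup; the &lt;...&gt; Drive ID is a placeholder for the pre-trained checkpoint linked in the LipGAN README, and the dlib face detector it depends on is downloaded from dlib.net:

```python
%cd /content
!git clone https://github.com/Rudrabha/LipGAN.git
%cd LipGAN
!pip install -q -r requirements.txt

# Pre-trained LipGAN weights: the <...> ID is a placeholder; copy the real
# Google Drive link from the repository's README.
!mkdir -p logs
!gdown --id <LIPGAN_CHECKPOINT_ID> -O logs/lipgan_residual_mel.h5

# dlib's CNN face detector, used by LipGAN to locate the face in our image.
!wget -q http://dlib.net/files/mmod_human_face_detector.dat.bz2
!bzip2 -d mmod_human_face_detector.dat.bz2
```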

Now we can execute the inference code as follows.
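The flags below follow the LipGAN README (--checkpoint_path, --face, --audio, --results_dir); double-check them and the file paths against your clone:

```python
# Animate the still face so its lips sync with the Flowtron narration.
!python batch_inference.py \
    --checkpoint_path logs/lipgan_residual_mel.h5 \
    --face /content/face.jpg \
    --audio /content/flowtron/speech.wav \
    --results_dir /content/results
```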

This should generate an output video like the one embedded below, with our AI-generated script included underneath for reference.

When do you think we will get rid of the coronavirus pandemic worldwide?

As a health expert, I predict that the pandemic will happen in the 2020s, and it is possible that it may happen in the 2050s. What I think will happen is that pandemic will cause an increase in deaths from diseases that have been eradicated by vaccines or by conventional medicine. Will there still be any deaths from the new viral pathogens? Yes, we will still have many deaths from the new viruses.

Conclusion

The process of creating computer-generated media content has become incredibly easy with the advancements in Deep Learning. While the AI-generated results are far from perfect, they are getting remarkably close to being indistinguishable from reality. With a few more advancements in the algorithms down the line, that gap will become even smaller.

There is still scope to improve these results by animating the rest of the face, including the eyes and head movements. As a continuation of this project, I will be working on integrating facial and head animations into this pipeline to create a more life-like output video. In the meantime, feel free to use the one-click Colab notebook from the GitHub repository linked below to play with the entire code in this tutorial.

References

Tools and repositories used in this tutorial:

  1. Face Generation with StyleGAN2
  2. Text Generation with OpenAI GPT-2
  3. Text-to-Speech Conversion with Flowtron
  4. Lip-Sync Animation with LipGAN