Welcome to the Simulation

Source: Deep Learning on Medium

This type of technology has led to several interesting prototypes of image editing software.

Face editing software using SC-FEGAN (paper: SC-FEGAN: Face Editing Generative Adversarial Network with User’s Sketch and Color):


And “Paint with AI”, also known as GauGAN or SPADE, by NVIDIA (just try it yourself):

GauGAN interface

Still waiting for a product using StarGAN!

Text-to-Image generation

GANs are already moving to multi-modal cases, for example, generating images from an English sentence:

It’s not yet ready to create a movie from its description (a description probably written by another generative model like, say, GPT-5; we’ll talk about text later), but the trend is obvious.

Video Generation

GANs then started to be applied to videos, notably for motion transfer and face-swapping.

Detecting faces and tracking people’s movements (including body, eye, or lip movements) made it possible to transfer personal traits to other people or to generate artificial personas with the desired characteristics.

Motion transfer

A 2018 paper called “Everybody Dance Now” describes video-to-video translation using pose as an intermediate representation.

Everybody dance now
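The pose-as-intermediate-representation idea can be sketched in a few lines. This is only an illustration of the data flow, with hypothetical stand-in functions: a real system uses an actual pose estimator (e.g. OpenPose) and a trained pix2pix-style conditional GAN as the generator.

```python
import numpy as np

N_KEYPOINTS = 18  # typical body-keypoint count in pose estimators

def estimate_pose(frame):
    """Stand-in pose estimator: returns (x, y) for each body keypoint.
    Here we fabricate keypoints; a real estimator detects them."""
    h, w, _ = frame.shape
    rng = np.random.default_rng(0)
    return rng.uniform([0, 0], [w, h], size=(N_KEYPOINTS, 2))

def generate_target_frame(pose, out_shape=(64, 64, 3)):
    """Stand-in generator: maps a pose skeleton to an image of the
    target person. A real generator is a trained conditional GAN."""
    flat = np.tanh(pose.ravel() @ np.ones((N_KEYPOINTS * 2, np.prod(out_shape))) * 1e-3)
    return flat.reshape(out_shape)

# Motion transfer: poses come from the *source* dancer's video,
# while images are rendered as the *target* person.
source_video = np.zeros((5, 64, 64, 3))           # 5 dummy source frames
poses = [estimate_pose(f) for f in source_video]  # intermediate representation
target_video = np.stack([generate_target_frame(p) for p in poses])
print(target_video.shape)  # (5, 64, 64, 3)
```

The key design choice is that pose is person-agnostic, so motion extracted from anyone can drive a generator trained on a single target person.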

Another 2018 technology by NVIDIA, called vid2vid, allows creating high-resolution, photorealistic, temporally coherent videos from a diverse set of input formats, including segmentation masks, sketches, and poses. The results are pretty impressive:

vid2vid demo

A September 2019 paper performs human motion imitation, appearance transfer, and novel view synthesis within a unified framework, which means that once trained, the model can handle all of these tasks:

Project Impersonator

Face-swapping and reenactment

As for face-swapping, face reenactment, and face generation, you have surely heard of the two most famous examples: DeepFakes and the fake Obama speech.

In the latter case, a video with accurate lip sync was produced, showing Obama speaking words he never actually said.

Synthesizing Obama: Learning Lip Sync from Audio / SIGGRAPH 2017

The former case of DeepFake has led to a wide ban on “involuntary synthetic pornographic imagery” among online platforms.


The original DeepFake emerged in November 2017. The first version was just a plain convolutional autoencoder (no GAN whatsoever). Both components were well known and had been used successfully for years; it’s strange that we saw it only a couple of years ago, because the technology had been ready for a long time. DeepFakes with GANs came later.
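The autoencoder trick behind the original DeepFake can be sketched as one shared encoder plus two person-specific decoders. The sketch below uses random numpy weights and omits training entirely, so it only illustrates the data flow, not a working face swap.

```python
import numpy as np

rng = np.random.default_rng(42)
D_IMG, D_LATENT = 64 * 64, 256  # flattened grayscale face, latent size

W_enc = rng.normal(scale=0.01, size=(D_IMG, D_LATENT))    # shared encoder
W_dec_A = rng.normal(scale=0.01, size=(D_LATENT, D_IMG))  # decoder for person A
W_dec_B = rng.normal(scale=0.01, size=(D_LATENT, D_IMG))  # decoder for person B

def encode(face):            # shared: captures pose, expression, lighting
    return np.tanh(face @ W_enc)

def decode(latent, W_dec):   # person-specific: captures identity and appearance
    return np.tanh(latent @ W_dec)

# During training, faces of A go through encode -> decode_A and faces of B
# through encode -> decode_B. The swap trick: at inference, feed B's face
# through A's decoder, so B's expression is rendered with A's identity.
face_B = rng.uniform(size=(1, D_IMG))
swapped = decode(encode(face_B), W_dec_A)
print(swapped.shape)  # (1, 4096)
```

Because the encoder is shared between both people, its latent code is forced to represent only what the faces have in common (expression and pose), leaving identity to the decoders; that asymmetry is what makes the swap work.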

About half a year later, Deep Video Portraits was presented at SIGGRAPH 2018. It enabled photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches restricted to manipulating facial expressions only, it was the first to transfer the full 3D head position, head rotation, facial expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor.

Deep Video Portraits / SIGGRAPH 2018

The technology for both tasks is constantly improving. It is now possible to edit a recorded speech using just text, or to create realistic photos (or video) from a single image.

Few-Shot Adversarial Learning of Realistic Neural Talking Head Models

Deepfake artist Hao Li, who created a Putin deepfake for MIT Technology Review’s EmTech conference, said in September 2019 that “perfectly real” manipulated videos are just six to twelve months away from being accessible to everyday people.

An August 2019 paper on FSGAN (Subject Agnostic Face Swapping and Reenactment) produces very compelling face swapping and reenactment results in videos:

FSGAN vs. DeepFakes

Popular apps using this kind of technology are finally emerging. It resembles the time when the neural style transfer paper by Leon Gatys was published and a pack of apps appeared, Prisma among the first.

ZAO, a popular Chinese face-swapping application, can place you into scenes from movies and TV shows after you upload just a single photograph:

Face/person generation

At the end of 2018, Xinhua presented the first AI news anchor at the fifth World Internet Conference in east China’s Zhejiang Province. The AI news anchor was jointly developed by Xinhua News Agency, the official state-run media outlet of China, and the Chinese search engine company Sogou.com.

The news anchor, based on the latest AI technology, has the male image, voice, facial expressions, and gestures of a real person. “He” learns from live broadcast videos on his own and can read texts as naturally as a professional news anchor.

Other anchors followed: Xinhua launched a female news anchor in February 2019 and a Russian-speaking one in June 2019; an Arabic-speaking anchor was announced as well.