Syncnet Model with VidTIMIT Dataset

Original article was published by Neha Sikerwar on Deep Learning on Medium


Predicting the video as real or fake

The GitHub link of the project and my LinkedIn profile are linked here.


Video manipulation techniques have advanced rapidly, and it is now easy to create tampered videos that can fool the human eye. Such content leads to fake news and misinformation, which can affect both people and countries.

So in this project, I tried to detect whether a video is tampered or not, that is, real or fake. I referred to this research paper throughout the project. The authors focused on determining the audio-video synchronization between mouth motion and speech, originally for TV broadcasting; here I used the VidTIMIT dataset instead. It's a really nice research paper: they developed a language-independent and speaker-independent solution to the lip-sync problem, without labeled data.

There are more recent research papers on the same problem, like this and this, and their results are also good. But I chose to apply the SyncNet model because it is simple, its structure is easy to follow, and pretrained weights are available. So let's look at the SyncNet model and its architecture.

About Syncnet and its architecture

The network ingests clips of both audio and video inputs. It is a two-stream ConvNet architecture that enables a joint embedding between the sound and the mouth images to be learned from unlabelled data.

Audio Data

The input audio data consists of MFCC values. You can read more about MFCCs here. Mel-frequency cepstral coefficients (MFCCs) are features widely used in automatic speech and speaker recognition; they capture the components of the audio signal that are good for identifying linguistic content while discarding everything else. 13 Mel frequency bands are used at each time step. The layer architecture is based on VGG-M, but with modified filter sizes to ingest inputs of these unusual dimensions.

Video Data

The input for the visual network is a sequence of mouth regions as grayscale images with 111×111×5 (W×H×T) dimensions for 5 frames. Below is the screenshot of the architecture from the research paper.

The key idea is that the outputs of the audio and video networks are similar for non-tampered (real) videos and different for tampered (fake) ones. So we can calculate the Euclidean distance between the network outputs: a larger distance means less similarity, and hence a likely fake video.

About VidTIMIT Dataset

The VidTIMIT dataset consists of video and corresponding audio recordings of 43 people, reciting short sentences. There are 10 sentences per person. The video of each person is stored as a numbered sequence of JPEG images. The corresponding audio is stored as a WAV file. To unzip the folders:
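The archives can be extracted with a short script; here is a minimal sketch using Python's zipfile module (the directory layout and per-subject archive naming are my assumptions about how the downloads are stored):

```python
import zipfile
from pathlib import Path

def unzip_all(archive_dir, out_dir):
    """Extract every .zip archive in archive_dir into its own
    subdirectory of out_dir; returns the extracted subject names."""
    extracted = []
    for archive in Path(archive_dir).glob("*.zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(Path(out_dir) / archive.stem)
        extracted.append(archive.stem)
    return sorted(extracted)
```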

As we have 43 subjects with 10 audio recordings each, I created 430 non-tampered videos from their respective images and audio. To create tampered videos, I replaced each correct audio track with 3 incorrect audio tracks from the same VidTIMIT dataset, creating 3 fake videos for each real video. So the final dataset has a 3:1 ratio of tampered to non-tampered videos. I created the videos with cv2.VideoWriter. I'm showing the code to create non-tampered videos; tampered videos can be created similarly.

Audio and Video Processing

In this section, we will create features for the audio and video files. I took all the processing parameters from here and the functions from here; the author of the latter also implemented the same SyncNet model, and all his functions are very clear. We will go through them one by one. Video files are in .mp4 format and audio files in .wav format.

Video processing:

In video processing, we first detect the frames and the mouth region, convert the mouth image to grayscale, and resize it. We then take the rectangle coordinates of the mouth and prepare the video features with them, stacking the features of 5 consecutive frames together. I used a function that takes a video as input; if you have frames/images instead of a video, you can use the other featurization function. All the functions are present in my GitHub repo.
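The final stacking step can be sketched with NumPy as follows; the function name and the stride of one frame are my assumptions (the repo may stack windows differently), while the (111, 111) crop size and 5-frame window come from the architecture above:

```python
import numpy as np

def stack_windows(gray_mouths, window=5, stride=1):
    """Stack `window` consecutive grayscale mouth crops into one
    (H, W, window) input sample for the lip stream.
    gray_mouths: array of shape (T, H, W), one crop per video frame."""
    T = gray_mouths.shape[0]
    starts = range(0, T - window + 1, stride)
    # move the time axis last so each sample is (H, W, window)
    return np.stack(
        [np.transpose(gray_mouths[s:s + window], (1, 2, 0)) for s in starts])
```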

Audio processing:

For audio processing, we first use scipy to read the .wav audio files, then speechpy.feature.mfcc to create MFCC features for each 0.2 sec of the clip. We consider 12 MFCC features and reshape them to (N//20, 12, 20, 1), where N is len(mfcc_features). Please find the code below.
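The reshaping step can be sketched with NumPy; this assumes the MFCC matrix has one 12-coefficient row per 10 ms frame, so 20 rows cover 0.2 s (the actual featurization in the repo uses scipy and speechpy.feature.mfcc as described above):

```python
import numpy as np

def reshape_mfcc(mfcc_features):
    """mfcc_features: (N, 12) array of per-frame MFCCs.
    Returns (N//20, 12, 20, 1): one 0.2 s clip per output row."""
    n_clips = len(mfcc_features) // 20
    trimmed = mfcc_features[: n_clips * 20]           # drop the ragged tail
    clips = trimmed.reshape(n_clips, 20, 12)          # group into 0.2 s clips
    return clips.transpose(0, 2, 1)[..., np.newaxis]  # -> (n, 12, 20, 1)
```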


In the modeling part, we will create the two-stream structure shown in the architecture section: one stream to process the mouth frames and another for the audio features. The implementation is in Keras, as 2 sequential models. We also have functions to load the pretrained weights, which I got from Vikram Voleti and his GitHub repository; I took all the modeling functions from his repo as well. Anyone can refer to his demo file to understand the full pipeline clearly. There are different modes: 'lip', 'audio' or 'both'. In 'lip' mode only the lip sequential model is loaded, in 'audio' mode only the audio sequential model is loaded, and in 'both' mode both the lip and audio models are loaded into a list, as shown in the code below.
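The mode dispatch can be sketched in plain Python; the build_* callables below are placeholders standing in for the real Keras model builders and weight loaders in the repo:

```python
def load_syncnet_models(mode="both",
                        build_lip=lambda: "lip_model",
                        build_audio=lambda: "audio_model"):
    """Return the requested SyncNet stream(s) as a list. The build_*
    callables are placeholders for the real Keras builders and weight
    loaders; swap them in when using the actual repo."""
    if mode == "lip":
        return [build_lip()]
    if mode == "audio":
        return [build_audio()]
    if mode == "both":
        return [build_lip(), build_audio()]
    raise ValueError("mode must be 'lip', 'audio' or 'both'")
```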

Syncnet lip model:

For the lip model layers, the input shape is (height of the mouth frame, width of the mouth frame, number of video channels). There are 7 blocks of layers: the 1st, 2nd and 5th blocks consist of convolutional and max-pooling layers, the 3rd and 4th blocks are convolutional only, and the 6th and 7th blocks are dense layers.

Syncnet audio model:

For the audio model layers, the input shape is (audio MFCC channels = 12, audio time steps = 20, 1). Here too we have 7 blocks of layers: the 2nd and 5th blocks consist of convolutional and max-pooling layers, the 1st, 3rd and 4th blocks are convolutional only, and the 6th and 7th blocks are dense layers.
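Here is a hedged Keras sketch of the two streams following the block structure above; the filter counts, kernel sizes, and embedding width (256) are illustrative assumptions, and the exact VGG-M-style configuration matching the pretrained weights is in the referenced repo:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lip_model():
    # Filter counts below are illustrative, not the repo's exact values.
    return models.Sequential([
        tf.keras.Input(shape=(111, 111, 5)),
        layers.Conv2D(96, 3, activation="relu"),
        layers.MaxPooling2D(3, 2),                  # block 1: conv + pool
        layers.Conv2D(256, 3, activation="relu"),
        layers.MaxPooling2D(3, 2),                  # block 2: conv + pool
        layers.Conv2D(512, 3, activation="relu"),   # block 3: conv
        layers.Conv2D(512, 3, activation="relu"),   # block 4: conv
        layers.Conv2D(512, 3, activation="relu"),
        layers.MaxPooling2D(3, 2),                  # block 5: conv + pool
        layers.Flatten(),
        layers.Dense(512, activation="relu"),       # block 6: dense
        layers.Dense(256),                          # block 7: embedding
    ])

def build_audio_model():
    return models.Sequential([
        tf.keras.Input(shape=(12, 20, 1)),
        layers.Conv2D(64, 3, padding="same", activation="relu"),   # block 1
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D((1, 2)),                               # block 2
        layers.Conv2D(256, 3, padding="same", activation="relu"),  # block 3
        layers.Conv2D(256, 3, padding="same", activation="relu"),  # block 4
        layers.Conv2D(512, 3, padding="same", activation="relu"),
        layers.MaxPooling2D((1, 2)),                               # block 5
        layers.Flatten(),
        layers.Dense(512, activation="relu"),                      # block 6
        layers.Dense(256),                                         # block 7
    ])
```

Both streams end in embeddings of the same width, so their outputs can be compared directly with a Euclidean distance.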

Evaluation and Results

According to the research paper, “to find the time offset between the audio and the video, we take a sliding-window approach. For each sample, the distance is computed between one 5-frame video feature and all audio features in the ± 1 second range.” So we also implemented the same here.
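A minimal NumPy sketch of the sliding window might look like this; half_range=15 is an assumption chosen so that the function returns the 31-valued distance array used for the confidence score, and the window indices are clipped at the clip boundaries:

```python
import numpy as np

def offset_distances(video_feat, audio_feats, center, half_range=15):
    """Euclidean distance between one 5-frame video embedding and the
    audio embeddings in a sliding window around index `center`.
    Returns 2 * half_range + 1 distances (31 with the default)."""
    dists = []
    for off in range(-half_range, half_range + 1):
        idx = int(np.clip(center + off, 0, len(audio_feats) - 1))
        dists.append(np.linalg.norm(video_feat - audio_feats[idx]))
    return np.array(dists)
```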

In the paper, they also used a confidence score to find the active speaker, but in our case there is only one speaker in every video. They also gave the example of dubbed videos, where the lip motion doesn't match the audio. So I used the confidence score as a measure of correlation between audio and video: a high confidence score means a strong correlation, so the video is more likely to be real, while a low confidence score means a weak correlation, so the video is more likely to be fake or tampered.

Here I'm considering label 1 for non-tampered (real) videos and label 0 for tampered (fake) videos. After the modeling part, we get the audio and video output arrays from the models and calculate the Euclidean distance between them (for the function, refer to the GitHub repo). The distance function gives us a 31-valued array of distances, and from those values we calculate the confidence score for each audio-video pair: the difference between the median and the minimum value of the distance array. Finally, using the confidence score, we predict whether the video is real or fake.
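The distance and confidence computations can be sketched as:

```python
import numpy as np

def euclidean_distances(video_embs, audio_embs):
    """Per-row Euclidean distance between matching embedding pairs."""
    return np.linalg.norm(video_embs - audio_embs, axis=-1)

def confidence_score(distances):
    """Median minus minimum of the distance array; a high value
    suggests a real (in-sync) video."""
    return float(np.median(distances) - np.min(distances))
```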

I found the confidence score for all the videos and I plotted them as shown. Orange color is for confidence values of fake videos and blue color is for the real ones. As we can see, lower values of confidence scores are for tampered/fake videos and higher values of confidence score are for real/non-tampered videos. So we can clearly classify them into real or fake on the basis of confidence values.

So we have to select a threshold confidence score to classify videos as real or fake. From the graph, 3.5 looks like a good threshold. Alternatively, we can loop over confidence values from 2 to 4 and pick the threshold that maximizes a chosen metric (precision, because we want fewer false positives). In my case, I got pretty good results with 3.5 as the threshold, and a ROC-AUC score of 0.8461. There is always a trade-off between precision and recall; one can choose according to their requirements.
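The threshold sweep can be sketched with a hand-rolled precision metric (equivalent to sklearn's precision_score for binary labels); the step size is my own choice:

```python
import numpy as np

def precision(y_true, y_pred):
    """tp / (tp + fp) for binary labels; 0 when nothing is predicted 1."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fp) if (tp + fp) else 0.0

def best_threshold(confidences, labels, lo=2.0, hi=4.0, step=0.1):
    """Sweep thresholds; predict 1 (real) when confidence >= threshold."""
    confidences = np.asarray(confidences)
    labels = np.asarray(labels)
    best_t, best_p = lo, -1.0
    for t in np.arange(lo, hi + step, step):
        p = precision(labels, (confidences >= t).astype(int))
        if p > best_p:
            best_t, best_p = t, p
    return best_t, best_p
```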

Chunks from the live video feed

If you are using a live video stream, you can create 4-second chunks of video, process those chunks as we did before, and check whether each chunk is tampered.

To create 4-sec chunks, the FFmpeg command is below.
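A sketch of the command, wrapped in Python so it can be scripted; the segment muxer flags are standard ffmpeg options, while the function name and output pattern are my own choices:

```python
def chunk_command(src, out_pattern="chunk_%03d.mp4", seconds=4):
    """Build the ffmpeg command that cuts `src` into fixed-length
    segments without re-encoding. Execute it with
    subprocess.run(cmd, check=True)."""
    return ["ffmpeg", "-i", src, "-c", "copy", "-map", "0",
            "-f", "segment", "-segment_time", str(seconds),
            "-reset_timestamps", "1", out_pattern]
```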


Now we can deploy the model in the cloud, since we have saved both the weights and the model. I referred to these nice articles (this and this) on deployment. To deploy, we need to set up the environment on a cloud platform like AWS or Heroku, but first we need to create a simple web API using Flask. So I created a file, which we have to put in the same folder as all the other files, plus an HTML form that takes 2 input files, one audio (.wav format) and one video (.mp4 format), and a submit button. When the form is submitted, it makes a POST request to the '/predict' route, where we can read the data from the form page. This HTML file goes in the templates folder, so the folder structure should look like this.
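A minimal sketch of such a Flask app; the route names and the placeholder response are my assumptions, and the real '/predict' handler would run the SyncNet pipeline on the uploaded pair instead of echoing the filenames:

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/index")
def index():
    # index.html (in the templates folder) holds the upload form
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    audio = request.files["audio"]   # the uploaded .wav file
    video = request.files["video"]   # the uploaded .mp4 file
    # confidence = run_syncnet_pipeline(audio, video)  # placeholder
    return f"received {audio.filename} and {video.filename}"

# app.run(host="0.0.0.0", port=8080)  # uncomment to serve locally
```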

After all of this, we will check whether it works on our local system. The only challenge is setting up the virtual environment for dlib. I used the Anaconda prompt; follow the steps mentioned here or here to set up dlib. Then we just have to run 3 commands in the Anaconda prompt:

1. cd C:\Users\Neha\Desktop\deploy (change the directory to your path)

2. conda activate env_dlib ("env_dlib" is my virtual environment)

3. python

Then we will open localhost in the browser at "http://localhost:8080/index" (change the port number to match yours), upload the files, and get the video prediction. Great!

Check the demo here

For deployment in the cloud, we need a deep learning AMI instance (so we don't have to install all the deep learning libraries), which is not eligible for the AWS free tier, so I did not do it. If you try to deploy on a normal instance, the dlib installation is going to be a real challenge. If you have the proper resources, go ahead and deploy this model in the cloud. Good luck!

Future Improvements

In this section, I tried transfer learning with the SyncNet model. To create the dataset, I first tried to use the audio and video features directly, but since the depth (number of frames) is not fixed across videos, that raises a ValueError. So I used TFRecords: after featurizing the audios and videos, I wrote them to a TFRecords file. The code is shown below.
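A sketch of writing one (video, audio, label) triple to a TFRecords file; storing the array shapes alongside the raw bytes lets the variable-depth features be restored when reading back. The function names are my own, not the repo's:

```python
import numpy as np
import tensorflow as tf

def _bytes_feature(arr):
    """Serialize a float32 array as a single raw-bytes feature."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[arr.tobytes()]))

def write_example(writer, video_feat, audio_feat, label):
    """Write one (video, audio, label) triple; shapes are stored so the
    variable-depth arrays can be reshaped correctly when parsing."""
    feature = {
        "video": _bytes_feature(video_feat.astype(np.float32)),
        "video_shape": tf.train.Feature(
            int64_list=tf.train.Int64List(value=video_feat.shape)),
        "audio": _bytes_feature(audio_feat.astype(np.float32)),
        "audio_shape": tf.train.Feature(
            int64_list=tf.train.Int64List(value=audio_feat.shape)),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())
```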

Then we can create the dataset to work on easily. To read the tf-records file and how to create a dataset from it, please refer to my GitHub link.

Here I will not calculate the confidence score. Instead, I will use a contrastive loss function, which takes y_true and y_pred as inputs and calculates the loss between them: y_pred is an array of distances between the audio and video embeddings, and y_true is an array of 0s (fake) and 1s (real). Below are the loss and distance equations from the research paper.
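In the paper, the contrastive loss is E = (1/2N) Σₙ [ yₙ dₙ² + (1 − yₙ) max(margin − dₙ, 0)² ] with dₙ = ‖vₙ − aₙ‖₂; here is a NumPy sketch of that loss:

```python
import numpy as np

def contrastive_loss(y_true, distances, margin=1.0):
    """y_true: 1 for real (in-sync) pairs, 0 for fake ones.
    distances: Euclidean distances between audio and video embeddings."""
    y = np.asarray(y_true, dtype=float)
    d = np.asarray(distances, dtype=float)
    pos = y * d ** 2                                   # pull real pairs together
    neg = (1 - y) * np.maximum(margin - d, 0.0) ** 2   # push fake pairs apart
    return float(np.mean(pos + neg) / 2)
```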

Now that we have the dataset and the functions, we need to train the model. There are two approaches I will explain for the training part. The first is to create a new model with all the layers of the previous model, add a custom layer that computes the distance, and use the contrastive loss as the model's loss function. The second is to freeze the top layers of the previous model, train only the last dense layers on our data, and apply the distance and loss functions afterwards. We can use TensorFlow's GradientTape for training. For the first approach, the model can be as below:

So now the structure will look like this:

For the second approach, I want to train only fully connected layers 6 and 7 from both sequential models. If you want to train more layers, you can do so with the code below by simply making those layers trainable.

Then I made a list of the trainable variables from both models and showed how to train with GradientTape, like this:
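A hedged sketch of one such training step with tf.GradientTape, using a TensorFlow-differentiable contrastive loss; the small epsilon inside the square root avoids the non-differentiable point of the norm at zero, which is exactly the kind of gradient issue discussed next:

```python
import tensorflow as tf

def train_step(lip_model, audio_model, lip_batch, audio_batch, labels,
               optimizer, margin=1.0):
    """One optimization step over the trainable (unfrozen) variables
    of both streams, with the contrastive loss written in TF ops."""
    with tf.GradientTape() as tape:
        v = lip_model(lip_batch, training=True)
        a = audio_model(audio_batch, training=True)
        # epsilon keeps sqrt differentiable when v == a
        d = tf.sqrt(tf.reduce_sum(tf.square(v - a), axis=-1) + 1e-12)
        y = tf.cast(labels, tf.float32)
        loss = tf.reduce_mean(
            y * d ** 2 + (1.0 - y) * tf.maximum(margin - d, 0.0) ** 2) / 2.0
    variables = (lip_model.trainable_variables
                 + audio_model.trainable_variables)
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```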

Here I used a customized distance function and loss function, and due to gradient issues I was not able to train the model, because not everything in my functions is differentiable. It can be made fully differentiable and trainable with a proper TensorFlow implementation, so I'm leaving that as future work; anyone can make this improvement by referring to this notebook.

NOTE: All the code for this case study is available in the project's GitHub repo.