Source: Deep Learning on Medium
Understanding the Problem
Active research in video and image generation and manipulation is advancing quickly. This definitely helps with many problems, but it also erodes trust in digital content and can cause further harm by spreading false information and enabling fake news.
To get a better sense of the problem, please have a look at the video below.
Note: The video included here is not meant to offend anyone. It is only an example of how digital content is losing trust, and it is used to illustrate the problem in the present context.
The video shows the threat posed by manipulated facial re-enactment of famous people. Most such videos are manipulated or generated using Artificial Intelligence techniques.
If a machine can learn to detect manipulated videos, that can be one of the solutions to this problem.
Given this real-world problem, our objective is to build a model that recognizes whether a given video is Real or Fake.
The model should classify the video and report a confidence value. Explaining why a video is fake or real is less important here; in other words, interpretability of the results is not a priority. We can therefore comfortably solve this with Deep Learning techniques, measuring accuracy and log loss.
To enable machines to learn, we definitely need a reasonable amount of data to train and evaluate the models. Here I am using the FaceForensics++ data, which contains both real and manipulated face videos.
Some Data Background:
The FaceForensics++ data was collected by the Visual Computing Group, an active research group working on computer vision, computer graphics, and machine learning. It contains 1000 pristine (real) videos selectively downloaded from YouTube so that every video has a clearly visible face (mostly videos such as news readers reading the news).
These pristine videos are manipulated using 3 state-of-the-art video manipulation techniques: DeepFakes, FaceSwap, and Face2Face. To understand more about the data, please refer to this paper.
I downloaded a total of 100 raw videos (49 real + 51 fake) covering all the categories, and extracted these videos into images. To download the videos and extract the images, please go through this GitHub page and read the instructions carefully.
Exploratory Data Analysis and Pre-processing
Before building any Machine Learning or Deep Learning model, we need to understand the data through some data analysis.
Let’s get an idea of how this data is organized:
- As mentioned, I downloaded 49 real + 51 fake videos. The videos are separated by category into directories named original, Deepfakes, FaceSwap, and Face2Face.
- For each video, a folder was created that contains all of its extracted image sequences.
- For example, if the video name is ‘485.mp4’, a directory named ‘485’ was created that contains all the frames of ‘485.mp4’. See the picture below, which shows the directory structure for original videos; we follow the same structure for the Deepfakes, Face2Face, and FaceSwap data.
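As a rough illustration of the layout described above, the sketch below walks the category directories and collects the frame files of each video. The category folder names come from the text; the function name `index_frames`, the accepted file extensions, and the assumption that frame file names sort in playback order (for example, zero-padded numbers) are illustrative guesses, not details taken from the original code.

```python
from pathlib import Path

# Category folder names as listed in the article.
CATEGORIES = ("original", "Deepfakes", "Face2Face", "FaceSwap")

# Assumed frame file extensions (hypothetical; adjust to the extraction script).
FRAME_SUFFIXES = {".png", ".jpg"}


def index_frames(root):
    """Map (category, video_id) -> sorted list of frame paths under root."""
    index = {}
    for category in CATEGORIES:
        category_dir = Path(root) / category
        if not category_dir.is_dir():
            continue  # a category may be missing in a partial download
        for video_dir in sorted(p for p in category_dir.iterdir() if p.is_dir()):
            # Sorting by name assumes zero-padded frame numbers.
            frames = sorted(
                f for f in video_dir.iterdir() if f.suffix.lower() in FRAME_SUFFIXES
            )
            index[(category, video_dir.name)] = frames
    return index
```

Calling `index_frames("data")` on a hypothetical `data` folder would return one entry per video directory, such as `("original", "485")`.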
Please feel free to visit my GitHub account, where I have coded every step that I explain here.
Let’s look at sample images from each category:
- In each image, the face is clearly visible, with no object between the face and the camera, and every subject faces the camera directly (of course, the data was selectively collected this way).
- Our objective is to find face-manipulated images, so we are only interested in the face region. It is a good idea to ignore all additional details such as the body and the background.
- We can implement this by tracking the face in each image and feeding only that region into the classifier. To achieve this, we use one of the face tracking algorithms implemented in ‘dlib’, which is a Python library; you can download it from here.
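The face-cropping step above might look roughly like the sketch below. It is only an approximation of the idea: it runs dlib's frontal face detector on each frame independently rather than a true tracker, and the helper names and the 1.3 padding factor are assumptions made for illustration.

```python
def padded_square_box(left, top, right, bottom, img_w, img_h, scale=1.3):
    """Grow a face rectangle into a square box, clipped to the image bounds."""
    size = int(max(right - left, bottom - top) * scale)
    cx, cy = (left + right) // 2, (top + bottom) // 2
    x0 = max(cx - size // 2, 0)
    y0 = max(cy - size // 2, 0)
    x1 = min(x0 + size, img_w)
    y1 = min(y0 + size, img_h)
    return x0, y0, x1, y1


def crop_largest_face(image_path, detector=None):
    """Return the largest detected face as an RGB array, or None if no face is found."""
    import dlib  # imported lazily so the helper above works without dlib installed

    detector = detector or dlib.get_frontal_face_detector()
    image = dlib.load_rgb_image(str(image_path))
    faces = detector(image, 1)  # 1 = upsample the image once to catch smaller faces
    if len(faces) == 0:
        return None
    face = max(faces, key=lambda r: r.width() * r.height())
    height, width = image.shape[:2]
    x0, y0, x1, y1 = padded_square_box(
        face.left(), face.top(), face.right(), face.bottom(), width, height
    )
    return image[y0:y1, x0:x1]
```

Passing one shared `detector` across calls avoids rebuilding it for every frame.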
To train and evaluate the model, we divide the data into Train, Test, and CV (validation) sets. While splitting, we also need to take care of data balancing, and we need to split the data before preprocessing.
I assigned 20% of the data to the test set and used the remaining 80% to train and validate. The representation below shows the data split; every split is balanced.
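A balanced split like the one described above can be sketched as a stratified split at the video level, done before any frame is preprocessed. The function name, the seed, and the use of plain `random` are assumptions for illustration; the original code may rely on a library helper instead.

```python
import random
from collections import defaultdict


def stratified_split(labels_by_video, holdout_fraction, seed=0):
    """Split video ids into (kept, held_out) while preserving the label ratio."""
    by_label = defaultdict(list)
    for video_id, label in sorted(labels_by_video.items()):
        by_label[label].append(video_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    kept, held_out = [], []
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        n_holdout = round(len(ids) * holdout_fraction)
        held_out.extend(ids[:n_holdout])
        kept.extend(ids[n_holdout:])
    return kept, held_out
```

Calling it once with `holdout_fraction=0.2` gives the test set; calling it again on the remaining videos with a validation fraction of your choice (the article does not state one) gives the train and validation sets.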
We do not consider the entire video sequence. Instead, we take only 101 frames from each video, as this reduces the number of calculations during modeling.
As shown in the figure above, we take only 101 frames starting from the 10th frame, which also helps reduce data redundancy.
Let’s see some samples of face-tracked images.
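The frame selection above amounts to a simple slice. The sketch below assumes that "the 10th frame" is counted from 1 and that the 101 frames are consecutive; both are readings of the text rather than confirmed details.

```python
def select_frames(frames, start=10, count=101):
    """Return `count` consecutive frames beginning at the `start`-th frame (1-based)."""
    first = start - 1  # convert the 1-based position to a 0-based index
    return frames[first:first + count]
```

Videos with fewer than 110 frames would simply yield a shorter list, so a real pipeline may want to check the length before using the result.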
In short, our pipeline is:
- For each video, we take a sequence of 101 images.
- Pre-process every image by applying a face tracker algorithm that detects the pixels of the face area.
- Feed these face-tracked images to the classifier.
Since I pose this as a binary classification problem, we fix the label 0 for Real and 1 for Fake (Deepfakes, Face2Face, or FaceSwap), and we measure both accuracy and categorical log loss.
Once the data analysis is done, it is very important to choose (or at least guess) what kind of model might work for this data. Here we have only image data, so it is a good idea to choose a Convolutional Neural Network (CNN) based architecture.
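To make the labeling and the two metrics concrete, here is a small pure-Python sketch. The label mapping follows the text (0 for Real, 1 for Fake); the function names and the clipping constant `eps` are illustrative assumptions, and in practice Keras computes both metrics for you.

```python
import math

REAL, FAKE = 0, 1

# Every manipulated category maps to the single Fake label.
LABEL_BY_CATEGORY = {
    "original": REAL,
    "Deepfakes": FAKE,
    "Face2Face": FAKE,
    "FaceSwap": FAKE,
}


def accuracy(y_true, probs):
    """Fraction of samples whose highest-probability class matches the label."""
    hits = sum(
        1 for y, p in zip(y_true, probs)
        if max(range(len(p)), key=p.__getitem__) == y
    )
    return hits / len(y_true)


def categorical_log_loss(y_true, probs, eps=1e-7):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for y, p in zip(y_true, probs):
        clipped = min(max(p[y], eps), 1.0)  # avoid log(0)
        total -= math.log(clipped)
    return total / len(y_true)
```

Log loss penalizes confident wrong predictions much more heavily than accuracy does, which is why the two are reported together.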
However, choosing good hyperparameters is a real challenge. These include the number of layers, number of units, dropout rates, activations, learning rates, etc. Determining them takes a lot of time and requires very high computational power.
Considering all this, we can instead fine-tune a related model that was already trained and tested on a similar, very large data set. For example, the Xception network, a CNN-based architecture, was already trained and tested on the ImageNet data set.
To adapt this architecture, we need to preprocess and set up our data accordingly. The Xception network was trained on normalized images of equal size (299x299) with 3 color channels. Therefore, before feeding our data into the model, we need to make sure it is also normalized and resized to the same size.
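A minimal sketch of this preprocessing is shown below. The article does not say which normalization was used, so this assumes the [-1, 1] scaling that Keras applies in its Xception `preprocess_input`; the function names and the use of Pillow for resizing are illustrative choices.

```python
XCEPTION_INPUT_SIZE = (299, 299)  # width, height expected by Xception


def scale_to_unit_range(pixels):
    """Map pixel values from [0, 255] to [-1, 1]; works on floats or arrays."""
    return pixels / 127.5 - 1.0


def load_face_for_xception(path):
    """Load an image file as a (299, 299, 3) float32 array scaled to [-1, 1]."""
    # Imported lazily so the scaling helper can be used without these packages.
    import numpy as np
    from PIL import Image

    image = Image.open(path).convert("RGB").resize(XCEPTION_INPUT_SIZE)
    return scale_to_unit_range(np.asarray(image, dtype="float32"))
```

Whatever normalization is chosen, the same one has to be applied to train, validation, and test images.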
Here I initialize the Xception network architecture with ImageNet weights and apply transfer learning by replacing only the topmost layer with a 2-output layer with softmax activation.
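One way to express this initialization with the Keras API is sketched below, assuming TensorFlow 2.x. The function name and the choice of global average pooling before the new layer are assumptions; the original code may wire the new top layer differently.

```python
def build_xception_classifier(num_classes=2, weights="imagenet"):
    """Xception backbone with its ImageNet top replaced by a softmax layer."""
    from tensorflow import keras  # imported lazily; assumes TensorFlow 2.x

    # include_top=False drops the 1000-class ImageNet classifier.
    base = keras.applications.Xception(
        weights=weights,
        include_top=False,
        pooling="avg",
        input_shape=(299, 299, 3),
    )
    # New topmost layer: one output per class (Real, Fake) with softmax.
    outputs = keras.layers.Dense(num_classes, activation="softmax")(base.output)
    return keras.Model(inputs=base.input, outputs=outputs)
```

The first call downloads the ImageNet weights, so it needs network access.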
We now treat this as the base architecture. To train it, we follow a technique called Greedy Layer-Wise Pretraining:
- First, train every layer in the base architecture on our train data for 3 epochs.
- After this, remove the topmost layer (Dense(2)), freeze the remaining layers so they are not trained, add a new Dense(2) layer at the top, and train and validate for 15 epochs.
To train the architecture, I used Adam with a learning rate of 0.0002 and a batch_size of 16, and kept all other parameters at their Keras default values.
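The two training phases and the optimizer settings above could be sketched as follows, assuming TensorFlow 2.x. The function names are hypothetical, `model` is assumed to end in a Dense(2) softmax layer, and `train_data` and `val_data` are assumed to be datasets already batched to 16 with one-hot labels.

```python
def set_trainable(model, trainable):
    """Mark every layer of `model` as trainable or frozen."""
    for layer in model.layers:
        layer.trainable = trainable


def train_two_phase(model, train_data, val_data):
    """Phase 1: all layers, 3 epochs. Phase 2: frozen base with a new head, 15 epochs."""
    from tensorflow import keras  # imported lazily; assumes TensorFlow 2.x

    def compile_model(m):
        # Adam with learning rate 0.0002; other settings left at Keras defaults.
        m.compile(
            optimizer=keras.optimizers.Adam(learning_rate=2e-4),
            loss="categorical_crossentropy",
            metrics=["accuracy"],
        )

    # Phase 1: every layer is updated on the training data.
    set_trainable(model, True)
    compile_model(model)
    model.fit(train_data, epochs=3)

    # Phase 2: drop the old Dense(2) head and freeze everything that remains.
    features = model.layers[-2].output
    set_trainable(model, False)
    # The new head is created after freezing, so it is the only trainable layer.
    new_head = keras.layers.Dense(2, activation="softmax")(features)
    retrained = keras.Model(inputs=model.input, outputs=new_head)
    compile_model(retrained)
    retrained.fit(train_data, validation_data=val_data, epochs=15)
    return retrained
```

Note that a model has to be compiled again after its `trainable` flags change, which is why `compile_model` is called in both phases.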
After all the training phases, here are the results:
- On train data, we got 100% accuracy and a loss of 0.058.
- On validation data, we got 96.63% accuracy and a loss of 0.133.
- On test data (unseen data), we achieved 99.10% accuracy and a loss of 0.078.
We used only 100 videos to build the classifier and still achieved very good accuracy. If you want to increase the accuracy further, you can try with a larger number of videos.
The model has now learned to detect manipulated images, but it performs very well only when the images are manipulated by DeepFakes, Face2Face, or FaceSwap. In the near future the number of such manipulation techniques may grow, in which case we should retrain the model with new data.
Please check my GitHub profile for the complete code.