FaceForensics++ & Survey of Multi-Modal techniques

Source: Deep Learning on Medium

Multi-modal techniques for detecting deep fakes

A Deep Learning Approach for Multimodal Deception Detection


DISCLAIMER: First and foremost, this paper is not about detecting manipulated multimedia; it is about detecting deception (lying) with a multimodal approach. Regardless, I believe the methodology is agnostic to the classification target (here, lying: y/n).

The authors claim that using a multimodal NN to detect deception in video is the first attempt of its kind. Previous mono-modal approaches (verbal, visual, or facial/hand gestures) have yielded mediocre results, are constrained by the environment in which the deception took place, or involve heavy manual feature engineering. Instead, the authors propose a multimodal approach that yields good performance using simple models.

The dataset used for the paper consists of 121 courtroom video clips (61 of which contain deception, i.e. lying), which is very small. The proposed approach extracts features from each modality and trains a simple MLP on the extracted features. The extracted features are the following:

Visual Feature: 300 features extracted using a simple 3D-CNN with max pooling (the third dimension is temporal, i.e. the number of frames), with a filter (fm, c, fd, fh, fw) of size 32 x 3 x 5 x 5 x 5, where fm is the number of feature maps, c is the number of channels, fd is the number of frames, and fh and fw are the filter height and width. The max pooling filter is 3×3.
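To make the filter notation concrete, here is a minimal sketch that computes the output shape of one 3D convolution followed by max pooling. The 32 feature maps, 3 channels, 5 x 5 x 5 kernel, and 3×3 pooling window come from the description above. The clip size (16 frames of 96 x 96 pixels), unit stride, no padding, and pooling only over the spatial axes with a stride equal to the window size are assumptions made purely for illustration, not values taken from the paper.

```python
def window_out(size, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_then_pool_shape(frames, height, width,
                           feature_maps=32, kernel=(5, 5, 5), pool=(3, 3)):
    """Shape after one 3D convolution and one spatial max pooling step."""
    fd, fh, fw = kernel
    depth = window_out(frames, fd)   # temporal axis
    h = window_out(height, fh)
    w = window_out(width, fw)
    # Assumption: non-overlapping spatial pooling (stride == window size).
    h = window_out(h, pool[0], stride=pool[0])
    w = window_out(w, pool[1], stride=pool[1])
    return (feature_maps, depth, h, w)


def conv3d_weight_count(feature_maps=32, channels=3, kernel=(5, 5, 5)):
    """Number of weights in the filter bank (biases not counted)."""
    fd, fh, fw = kernel
    return feature_maps * channels * fd * fh * fw


# Hypothetical clip: 16 RGB frames of 96 x 96 pixels.
shape = conv3d_then_pool_shape(frames=16, height=96, width=96)
print(shape)                  # (32, 12, 30, 30)
print(conv3d_weight_count())  # 12000
```

The pooled volume would then be flattened and passed through a dense layer to obtain the 300 visual features.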

Textual Feature: 300 features extracted by applying Word2Vec to every word in the transcript, concatenating the word vectors, and feeding them into a CNN (filter size 3, 5, or 8, with 20 feature maps) followed by max pooling of size 2 and a fully-connected layer (300 neurons with ReLU).
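As a rough illustration of what a filter of size 3 does on the transcript, the sketch below slides a single filter over a sequence of word vectors and applies max pooling of size 2. The 2-dimensional "embeddings" and the filter weights are made-up toy values (real Word2Vec vectors are much larger), the pooling stride is assumed equal to the pooling size, and any nonlinearity after the convolution is left out.

```python
def conv1d_over_words(word_vectors, filt):
    """Slide a filter spanning len(filt) consecutive words over the sentence."""
    width = len(filt)
    feature_map = []
    for start in range(len(word_vectors) - width + 1):
        window = word_vectors[start:start + width]
        # Dot product between the filter and the window of word vectors.
        activation = sum(
            weight * value
            for filt_row, vector in zip(filt, window)
            for weight, value in zip(filt_row, vector)
        )
        feature_map.append(activation)
    return feature_map


def max_pool(values, size=2):
    """Non-overlapping max pooling (stride == size is an assumption)."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Toy sentence of 6 words with 2-dimensional embeddings (hypothetical).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
            [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# One filter covering 3 words (hypothetical weights).
filter_3 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

feature_map = conv1d_over_words(sentence, filter_3)
pooled = max_pool(feature_map, size=2)
print(feature_map)  # [2.0, 1.0, 2.0, -1.0]
print(pooled)       # [2.0, 2.0]
```

With 20 such filters per size, the pooled outputs would be concatenated and fed to the 300-neuron fully-connected layer.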

Audio Feature: 300 features extracted by first removing noise with SoX (Sound eXchange), then applying Z-normalization, then running openSMILE (which yields 6,373 features), and finally training a NN to reduce the dimensionality to 300.
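Z-normalization simply rescales a signal to zero mean and unit standard deviation. A minimal sketch follows; the sample values are made up, and whether the authors normalize per clip or use the population rather than the sample standard deviation is not stated, so both are assumptions here.

```python
from statistics import mean, pstdev


def z_normalize(samples):
    """Rescale samples to zero mean and unit (population) standard deviation."""
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        # A constant signal carries no variation; map it to all zeros.
        return [0.0 for _ in samples]
    return [(x - mu) / sigma for x in samples]


# Hypothetical audio samples for one clip.
signal = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = z_normalize(signal)
print(normalized)
```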

Micro-Expression Feature: 39 features from manual annotation (provided with the dataset).

Architecture: MLP_C and MLP_H+C. The two architectures differ in how they combine the multimodal feature vectors. MLP_C uses direct concatenation (input size of 939, i.e. 3 x 300 + 39), while MLP_H+C computes the Hadamard (element-wise) product before concatenation, which yields an input vector of size 339 (presumably the 300-dimensional product plus the 39 micro-expression features). The model is trained using 10-fold cross-validation with cross-entropy loss and optimized using SGD.
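The difference between the two fusion schemes is easiest to see in code. This is a minimal sketch, assuming the Hadamard product is taken over the three 300-dimensional vectors (visual, textual, audio) and the 39 micro-expression features are appended afterwards, which is the reading consistent with the 339 input size. The function names and the random feature values are hypothetical.

```python
import random


def concat_fusion(visual, textual, audio, micro):
    """MLP_C-style input: plain concatenation of all modalities."""
    return visual + textual + audio + micro


def hadamard_concat_fusion(visual, textual, audio, micro):
    """MLP_H+C-style input: element-wise product, then concatenation."""
    if not (len(visual) == len(textual) == len(audio)):
        raise ValueError("Hadamard product needs equal-length vectors")
    product = [v * t * a for v, t, a in zip(visual, textual, audio)]
    return product + micro


rng = random.Random(0)
visual = [rng.random() for _ in range(300)]
textual = [rng.random() for _ in range(300)]
audio = [rng.random() for _ in range(300)]
micro = [float(rng.randint(0, 1)) for _ in range(39)]

mlp_c_input = concat_fusion(visual, textual, audio, micro)
mlp_hc_input = hadamard_concat_fusion(visual, textual, audio, micro)
print(len(mlp_c_input), len(mlp_hc_input))  # 939 339
```

The product only works because every learned modality is mapped to the same size (300), which may be one reason each extractor ends in a 300-dimensional layer.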

Their AUC (0.97) and accuracy (96.14%) significantly outperform previous work using L-SVM and Logistic Regression by Wu et al. and Decision Tree and Random Forest by Perez-Rosas et al. (who also collected the dataset). They also distinguish between static and non-static extraction of textual features (keeping the word vectors fixed versus optimizing them during training), and they compare their results against mono-modal models. The authors conclude that visual (facial) features and textual (unigram) features contribute the most to detection.
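For readers less familiar with the two metrics, here is a small sketch of how accuracy and AUC can be computed from classifier scores. The labels and scores are toy values, not results from the paper, and the 0.5 decision threshold is an assumption. AUC is computed as the fraction of (deceptive, truthful) pairs in which the deceptive clip gets the higher score, with ties counting as half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of clips classified correctly at the given threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = deceptive, 0 = truthful (made-up scores).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(accuracy(labels, scores))
print(roc_auc(labels, scores))
```

Unlike accuracy, AUC does not depend on the threshold, which is useful on a dataset this small.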

Despite their performance, the authors acknowledge the limited dataset size as well as the constraint on the setting (courtroom). Thus, their model is unlikely to generalize to a wide range of real-world scenarios. Regardless, they have demonstrated a potentially useful methodology for extracting and utilizing multimodal data using a relatively simple architecture.

DeepFakes: a New Threat to Face Recognition? Assessment and Detection


The authors found that their audio-visual approach, based on lip-sync inconsistency detection, was not able to distinguish Deepfake videos from genuine ones.

  • Present the first publicly available dataset of Deepfake videos, created from the VidTIMIT database
  • Demonstrate the vulnerability of current VGG- and Facenet-based face recognition
  • Evaluate several baseline face swap detection algorithms:
  • Lip-sync based detection system (41.8% error rate on LQ videos)
    architecture: MFCCs as audio features and distances between mouth landmarks as visual features. PCA is applied to the joint audio-visual features to reduce the dimensionality, and an LSTM network is trained to separate tampered and non-tampered videos. This baseline comes from the paper summarized next (Speaker Inconsistency Detection in Tampered Video).
  • Image based detection system (3.33% error rate on LQ videos)
    architecture: IQM features with SVM, along with other variations

Evaluation: The results demonstrate that the lip-sync based algorithm is not able to detect face swapping, as GANs can generate high-quality facial expressions that match the audio speech. Image based approaches, by comparison, detect Deepfake videos effectively; the IQM+SVM system reaches reasonably high accuracy. However, we could consider a multimodal model that incorporates both image and audio features, and also consider visual features other than the distances between mouth landmarks.

Speaker Inconsistency Detection in Tampered Video


This is the baseline approach used by the paper above. The authors try to detect modified audio from mouth landmarks, without making any assumption about whether the video itself is fake.

Architecture: MFCCs as audio features and 42 distances between mouth landmarks as visual features. Facial landmark detection is done using OpenPose from CMU. The authors explored different ways to post-process the features, including ways to combine the two feature types, reducing the dimensionality of blocks of features with PCA, and projecting both modalities into a common space with CCA. They also considered different classifiers, including GMM, SVM, MLP, and LSTM; the LSTM classifier outperformed the others. The experiments were done on three datasets: VidTIMIT, AMI, and GRID.
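The visual feature is conceptually simple: Euclidean distances between pairs of mouth landmarks in each frame. A minimal sketch is below. Which landmark pairs produce the paper's 42 distances is not described in this summary, so the `pairs` argument and the four toy landmarks (two mouth corners, upper lip, lower lip) are hypothetical; by default the function just uses every pair.

```python
import math
from itertools import combinations


def mouth_distance_features(landmarks, pairs=None):
    """Euclidean distances between selected pairs of (x, y) mouth landmarks."""
    if pairs is None:
        # Assumption: fall back to all unordered pairs of landmarks.
        pairs = combinations(range(len(landmarks)), 2)
    return [
        math.hypot(landmarks[i][0] - landmarks[j][0],
                   landmarks[i][1] - landmarks[j][1])
        for i, j in pairs
    ]


# Toy frame: left corner, right corner, upper lip, lower lip.
toy_mouth = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.5), (2.0, -1.5)]
features = mouth_distance_features(toy_mouth)
print(len(features))  # 6 distances for 4 landmarks

# Only the mouth width and mouth opening:
width_and_opening = mouth_distance_features(toy_mouth, pairs=[(0, 1), (2, 3)])
```

Computing this per frame gives a visual feature sequence that can be paired with the MFCC sequence from the audio track.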

Test Error rate: 24.74% on VidTIMIT dataset, 33.86% on AMI dataset, 14.12% on GRID dataset

Evaluation: Although the performance is not strong, this paper provides a possible pipeline for multimodal detection, which we found being used in other papers as well. They used 42 distances between mouth landmarks, which might not be the most appropriate visual feature. We could also explore other methods for combining features across modalities.

Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media: proposes possible architectures/techniques we can use

Source: http://delivery.acm.org/10.1145/2970000/2967211/p202-le.pdf?ip=

  • Addressed the problem of dubbing detection in broadcast data.
  • Proposed a method relying on LSTM and multi-modal feature extraction.
  • Features are extracted from each frame per modality. (For both audio and visual)
  • Concatenated frames and reduced dimensionality.
  • Applied cross-modality correlation modeling to capture the synchrony between modalities. In particular, canonical correlation analysis (CCA) was applied.
  • Used LSTM to model the outputs of cross-modality correlation modeling in the temporal domain to obtain high-level representation for further classification.
  • Contributed a dubbing dataset collected from TV news for future research.
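To give a feel for how CCA captures synchrony between modalities, here is a small NumPy sketch of linear CCA: both modalities are whitened, and the singular values of the whitened cross-covariance are the canonical correlations. This is a generic textbook formulation, not the paper's implementation. The synthetic "audio" and "visual" features, the mixing matrix, and the small regularization term are all assumptions for illustration.

```python
import numpy as np


def _inv_sqrt(cov):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T


def cca(audio, visual, reg=1e-6):
    """Return canonical correlations and both projected modalities."""
    X = audio - audio.mean(axis=0)
    Y = visual - visual.mean(axis=0)
    n = X.shape[0]
    cov_xx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    cov_yy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    cov_xy = X.T @ Y / (n - 1)
    wx, wy = _inv_sqrt(cov_xx), _inv_sqrt(cov_yy)
    # Singular values of the whitened cross-covariance are the correlations.
    U, corrs, Vt = np.linalg.svd(wx @ cov_xy @ wy, full_matrices=False)
    return corrs, X @ wx @ U, Y @ wy @ Vt.T


rng = np.random.default_rng(0)
audio = rng.normal(size=(500, 3))  # stand-in for per-frame audio features
mixing = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])
# "In sync": visual features are a noisy linear function of the audio.
synced_visual = audio @ mixing + 0.01 * rng.normal(size=(500, 2))
# "Dubbed": visual features are unrelated to the audio.
unsynced_visual = rng.normal(size=(500, 2))

synced_corrs, audio_proj, visual_proj = cca(audio, synced_visual)
unsynced_corrs, _, _ = cca(audio, unsynced_visual)
print(synced_corrs, unsynced_corrs)
```

In this toy setup the in-sync pair produces correlations close to 1 and the unrelated pair produces much lower ones. In the pipeline described above, the projected sequences would then be passed to the LSTM.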

Evaluation: Not directly about deepfake detection, but the paper discusses techniques for handling audio and visual streams, such as feature extraction, obtaining face tracks, and localizing specific facial regions, which might be helpful in deepfake detection.

Exploiting Multi-domain Visual Information for Fake News Detection


Architecture: Multi-domain Visual Neural Network (MVNN), which fuses the visual information of the frequency and pixel domains to detect fake news. It consists of three parts:

  • a frequency domain sub-network
    transforms the input image from the pixel domain to the frequency domain, and uses a CNN-based model to capture the physical characteristics of the image
  • a pixel domain sub-network
    employs a multi-branch CNN-RNN network to extract features at different semantic levels of the input image
  • a fusion sub-network
    fuses feature vectors obtained from the frequency and pixel domain sub-network through an attention mechanism for classification
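The fusion step can be pictured as a softmax-weighted sum of the feature vectors. The sketch below uses simple dot-product scoring against a query vector; the exact attention form used by MVNN is not described in this summary, so the scoring function, the query (which would be learned in practice), and the toy 2-dimensional vectors are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(feature_vectors, query):
    """Weight each feature vector by softmax(dot(vector, query)) and sum."""
    scores = [sum(f * q for f, q in zip(vec, query))
              for vec in feature_vectors]
    weights = softmax(scores)
    fused = [sum(w * vec[i] for w, vec in zip(weights, feature_vectors))
             for i in range(len(query))]
    return fused, weights


frequency_vec = [1.0, 0.0]  # toy output of the frequency domain sub-network
pixel_vec = [0.0, 1.0]      # toy output of the pixel domain sub-network

# A zero query scores both vectors equally, so the fusion is a plain average.
fused_equal, weights_equal = attention_fuse([frequency_vec, pixel_vec],
                                            [0.0, 0.0])
# A query aligned with the frequency vector shifts weight towards it.
fused_freq, weights_freq = attention_fuse([frequency_vec, pixel_vec],
                                          [2.0, 0.0])
print(weights_equal, weights_freq)
```

The fused vector would then go to the classifier.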

Although this work does not deal with facial images, its novelty is that it tries to capture the emotional provocations of fake news images. We could keep an eye on the semantic context of Deepfake videos.

Evaluation for survey of Multi-modal detection (TL;DR)

Overall, not much work has been done yet on incorporating audio information into the detection of deep fake videos. The one paper that did similar work (DeepFakes: a New Threat to Face Recognition? Assessment and Detection) was not able to achieve good performance, which leaves room for improvement for us. We were able to find work on related tasks, which involved finding inconsistencies between audio and lip movements (not deep faked videos). These tasks shared a common process: representing the visual modality with mouth landmarks, representing the audio modality with MFCCs, reducing dimensionality with PCA, aligning the modalities with Canonical Correlation Analysis (CCA), and then using an LSTM. This can be a possible starting point for a baseline model. We can also experiment with different ways of representing the visual modality (using CNNs) and the audio modality (using RNNs), combining representations (using a multimodal tensor fusion network), aligning the different modalities (using CCA), and incorporating temporal features (using an RCNN, perhaps).


[1] ACM Multimedia Conference. 2016. MM’16: proceedings of the 2016 ACM Multimedia Conference: October 15–19, 2016, Amsterdam, The Netherlands. Volume 1: … ACM, Association for Computing Machinery, New York, NY.

[2] EUSIPCO (Conference), Università degli studi Roma tre, European Association for Signal Processing, IEEE Signal Processing Society, and Institute of Electrical and Electronics Engineers. 2018. EUSIPCO 2018: 26th European Signal Processing Conference : Rome, Italy, September 3–7, 2018. Retrieved October 16, 2019 from https://ieeexplore.ieee.org/servlet/opac?punumber=8537458

[3] Pavel Korshunov and Sebastien Marcel. 2018. DeepFakes: a New Threat to Face Recognition? Assessment and Detection. arXiv:1812.08685 [cs] (December 2018). Retrieved October 16, 2019 from http://arxiv.org/abs/1812.08685

[4] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. 2019. Celeb-DF: A New Dataset for DeepFake Forensics. arXiv:1909.12962 [cs, eess] (September 2019). Retrieved October 16, 2019 from http://arxiv.org/abs/1909.12962

[5] Peng Qi, Juan Cao, Tianyun Yang, Junbo Guo, and Jintao Li. 2019. Exploiting Multi-domain Visual Information for Fake News Detection. arXiv:1908.04472 [cs] (August 2019). Retrieved October 16, 2019 from http://arxiv.org/abs/1908.04472

[6] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2019. FaceForensics++: Learning to Detect Manipulated Facial Images. arXiv:1901.08971 [cs] (January 2019). Retrieved October 16, 2019 from http://arxiv.org/abs/1901.08971

[7] Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and Prem Natarajan. 2019. Recurrent Convolutional Strategies for Face Manipulation Detection in Videos. arXiv:1905.00582 [cs] (May 2019). Retrieved October 16, 2019 from http://arxiv.org/abs/1905.00582