Inter-video frame forgery detection through 3D convolutional networks

Source: Deep Learning on Medium


3D convolutional (C3D) networks are considered state-of-the-art in activity recognition. Given a sequence of frames (usually 16), spatio-temporal features are extracted from the block of images and used to predict the class category for that sequence. While training a C3D network to classify accident and non-accident sequences, I discovered something interesting: 3D convolutional networks are quite accurate at detecting sequences that contain "out-of-place" frames, i.e., frames that do not belong to the particular sequence given as input to the network. This means the network can be used for inter-video frame forgery detection, which has been a topic of great interest for researchers.

In this article, I will take you through a step-by-step procedure for detecting inter-video frame forgery using C3D networks. The dataset I used for this task is the DashCam accident-detection dataset from VSLab [1]. I used the positive videos in the training set (455 videos) as the class representing "no out-of-place frames" and generated another 455 sequences by splicing a random frame from elsewhere in the same video into the sequence. For testing, I used 301 negative videos from the test data and created a balanced dataset with the two classes, "no out-of-place frames" and "out-of-place frames", from them.
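The forged-sequence generation can be sketched in a few lines. This is a minimal illustration, not the original preprocessing code: `make_forged_clip` is a hypothetical helper, and the integer array stands in for decoded video frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_forged_clip(frames, clip_len=16, fixed_pos=None):
    """Cut a clip_len window from `frames` and overwrite one of its frames
    with a frame sampled from elsewhere in the same video (the forgery)."""
    start = rng.integers(len(frames) - clip_len)
    clip = frames[start:start + clip_len].copy()
    # position of the out-of-place frame: fixed or random
    pos = fixed_pos if fixed_pos is not None else rng.integers(clip_len)
    donor = rng.integers(len(frames))  # index of the spliced-in frame
    clip[pos] = frames[donor]
    return clip

video = np.arange(100)              # stand-in: indices of a 100-frame video
forged = make_forged_clip(video)    # label 1 ("out-of-place frames")
pristine = video[:16]               # label 0 ("no out-of-place frames")
```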

The C3D network takes a five-dimensional input (batch size x channels x clip length x H x W) and performs a series of 3D convolutions on the input block to ultimately produce the output. I used the same architecture as described in the paper [2]: 8 convolutional, 5 max-pooling, and 2 fully connected layers, followed by a softmax output layer. Each fully connected layer has 4096 output units. The code is written in Python.
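The 8192 input size of the first fully connected layer follows from how the pools shrink a 16-frame 112x112 clip: each convolution uses padding 1 and preserves size, so only the pools matter. A quick sanity check, assuming the standard C3D pooling configuration:

```python
import math

def pool_out(n, kernel, stride, pad=0):
    """Output length of one max-pool dimension: floor((n + 2p - k)/s) + 1."""
    return math.floor((n + 2 * pad - kernel) / stride) + 1

t, h, w = 16, 112, 112                                              # clip length, H, W
t, h, w = pool_out(t, 1, 1), pool_out(h, 2, 2), pool_out(w, 2, 2)   # pool1 (no temporal pooling)
for _ in range(3):                                                  # pool2, pool3, pool4
    t, h, w = pool_out(t, 2, 2), pool_out(h, 2, 2), pool_out(w, 2, 2)
t = pool_out(t, 2, 2)                                               # pool5, no temporal padding
h, w = pool_out(h, 2, 2, 1), pool_out(w, 2, 2, 1)                   # pool5, spatial padding 1
print(512 * t * h * w)  # 512 channels x 1 x 4 x 4 = 8192
```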

First, build the model as described in the paper.

import torch.nn as nn


class C3D_model(nn.Module):
    """The C3D network."""

    def __init__(self, num_classes, pretrained=False):
        super(C3D_model, self).__init__()

        self.conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))

        self.conv2 = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool2 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv3a = nn.Conv3d(128, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv3b = nn.Conv3d(256, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool3 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv4a = nn.Conv3d(256, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv4b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool4 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv5a = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv5b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool5 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1))

        self.fc6 = nn.Linear(8192, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, num_classes)

        self.dropout = nn.Dropout(p=0.5)
        self.relu = nn.ReLU(inplace=False)

        self.__init_weight()
        if pretrained:
            # load pretrained C3D weights here (omitted in the original post)
            pass

    def __init_weight(self):
        # simple initialization; the original post does not show this method
        for m in self.modules():
            if isinstance(m, (nn.Conv3d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.pool1(x)

        x = self.relu(self.conv2(x))
        x = self.pool2(x)

        x = self.relu(self.conv3a(x))
        x = self.relu(self.conv3b(x))
        x = self.pool3(x)

        x = self.relu(self.conv4a(x))
        x = self.relu(self.conv4b(x))
        x = self.pool4(x)

        x = self.relu(self.conv5a(x))
        x = self.relu(self.conv5b(x))
        x = self.pool5(x)

        x = x.view(-1, 8192)
        x = self.relu(self.fc6(x))
        x = self.dropout(x)
        fc7 = self.relu(self.fc7(x))
        x = self.dropout(fc7)
        logits = self.fc8(x)
        return logits

I wrote my own data loader to feed the five-dimensional input to the network. You can also use PyTorch's DataLoader for this purpose.

import os

import cv2
import numpy as np
import torch


class Dashcam_data():
    def __init__(self, dataset='dashcam', dir='./data/', batch_size=64,
                 frame_size=[112, 112], train="train", seq=False):
        self.dataset = dataset
        self.dir = dir
        self.batch_size = batch_size
        self.frame_size = frame_size
        self.train = train
        self.im_pointer = 0
        self.batch = []
        self.paths = []
        self.labels = []

        if self.train == "train":
            self.cat_path = "/hdd/local/sda/mishal/Anticipating-Accidents-master/dataset/videos/training/frames_train"
        else:
            self.cat_path = "/hdd/local/sda/mishal/Anticipating-Accidents-master/dataset/videos/testing/frames/negative"

        # Each video folder is listed twice: once for the forged class
        # (label 1, a frame will be spliced in) and once for the pristine
        # class (label 0). The original listing truncates this part.
        for folder in os.listdir(self.cat_path):
            self.paths.append(os.path.join(self.cat_path, folder))
            self.labels.append(1)
        for folder in os.listdir(self.cat_path):
            self.paths.append(os.path.join(self.cat_path, folder))
            self.labels.append(0)

        self.im_ind = list(range(len(self.paths)))

    def get_next_batch(self, batch_size, clip_len):
        if self.im_pointer == 0:
            self.total_folders = len(self.paths)
        self.batch = np.zeros((batch_size, clip_len, self.frame_size[0],
                               self.frame_size[1], 3))
        self.l = np.zeros((batch_size))
        for idx in range(batch_size):
            images = np.zeros((clip_len, self.frame_size[0],
                               self.frame_size[1], 3))
            video = self.paths[self.im_ind[self.im_pointer]]
            label = self.labels[self.im_ind[self.im_pointer]]
            path, dirs, files = next(os.walk(video))
            frames = np.sort(files)
            num_frames = len(frames)
            time_index = np.random.randint(num_frames - clip_len)
            sequence = frames[time_index:time_index + clip_len]
            if label == 1:
                time_index_1 = np.random.randint(clip_len)
                time_index_2 = np.random.randint(num_frames - clip_len)
                # for a random location of the added frame
                sequence[time_index_1] = frames[time_index_2]
                # for a fixed location of the added frame, use instead:
                # sequence[8] = frames[time_index_2]
            for file in range(len(sequence)):
                img = cv2.imread(os.path.join(video, sequence[file]))
                img = cv2.resize(img, (self.frame_size[0], self.frame_size[1]))
                images[file, :, :, :] = img
            self.batch[idx, :, :, :, :] = images
            self.l[idx] = label
            self.im_pointer += 1
            if self.im_pointer == len(self.paths):
                self.im_pointer = 0
        # channels-last -> (batch, channels, clip_len, H, W)
        self.batch = np.moveaxis(self.batch, 4, 1)
        return torch.from_numpy(self.batch).float(), torch.from_numpy(self.l).long()

The loss function I used was cross-entropy, to which I added another term representing the average of the frame differences.

criterion = nn.CrossEntropyLoss().to(device)

# extra term: overall mean of `input` (the block of frame differences); the
# nested per-dimension means in the original reduce to a single mean
loss1 = torch.mean(input)

probs = nn.Softmax(dim=1)(outputs)
preds = torch.max(probs, 1)[1]

loss = criterion(outputs, labels)
loss = loss + loss1
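As a numerical sketch of this combined loss, here is a NumPy stand-in (not the training code). Note one assumption: the frame-difference term is computed explicitly from the clip here, whereas above it is the mean of a precomputed difference tensor.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch, as nn.CrossEntropyLoss computes it."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def combined_loss(logits, labels, clip):
    # auxiliary term: average absolute difference between consecutive
    # frames of the (batch, channels, clip_len, H, W) block
    frame_diff = np.abs(np.diff(clip, axis=2)).mean()
    return cross_entropy(logits, labels) + frame_diff

logits = np.zeros((4, 2))               # uninformative logits
labels = np.zeros(4, dtype=int)
clip = np.zeros((4, 3, 16, 8, 8))       # identical frames -> zero diff term
loss = combined_loss(logits, labels, clip)  # = ln 2 for two classes
```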

I trained the system in two configurations: one where the added out-of-place frame was fixed at a particular location in the sequence, and one where it was added at a random location. The accuracy was higher in the former case.
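For evaluation, precision and recall for the positive ("out-of-place frames") class can be computed from the predictions in a few lines. This is a generic sketch, not the original evaluation script:

```python
import numpy as np

def precision_recall(preds, labels):
    """Precision and recall for the positive (forged) class, label 1."""
    tp = np.sum((preds == 1) & (labels == 1))  # forged clip, caught
    fp = np.sum((preds == 1) & (labels == 0))  # pristine clip, flagged
    fn = np.sum((preds == 0) & (labels == 1))  # forged clip, missed
    return tp / (tp + fp), tp / (tp + fn)

preds = np.array([1, 1, 0, 0])
labels = np.array([1, 0, 1, 0])
p, r = precision_recall(preds, labels)  # (0.5, 0.5)
```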


Precision and recall, along with accuracy, were used as evaluation metrics at test time. The network was trained for 10 epochs with a learning rate of 1e-3. For the case where the added frame was fixed at a certain location in the sequence, the network achieved a training accuracy of about 97 percent. The results on the test set are as follows:

For the latter case (a random location for the out-of-place frame), the network achieved a training accuracy above 92% and took a little longer to converge. The test accuracies came out as follows:


1. Chan, F. H., Chen, Y. T., Xiang, Y., & Sun, M. (2016, November). Anticipating accidents in dashcam videos. In Asian Conference on Computer Vision (pp. 136–153). Springer, Cham.

2. Ji, S., Xu, W., Yang, M., & Yu, K. (2012). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231.