Source: Deep Learning on Medium
Getting Dirty With Data
We will use the UCSD anomaly detection dataset, which contains videos acquired with a camera mounted at an elevation, overlooking a pedestrian walkway. In normal settings, these videos contain only pedestrians.
Abnormal events are due to either:
- Non-pedestrian entities in the walkway, like bikers, skaters, and small carts.
- Unusual pedestrian motion patterns, like people walking across the walkway or on the grass surrounding it.
The UCSD dataset consists of two parts, ped1 and ped2. We will use the ped1 part for training and testing.
Preparing The Training Set
The training set consists of sequences of regular video frames; the model will be trained to reconstruct these sequences. So, let’s get the data ready to feed our model by following these three steps:
- Divide the training video frames into temporal sequences, each of size 10 using the sliding window technique.
- Resize each frame to 256 × 256 to ensure that input images have the same resolution.
- Scale the pixel values to lie between 0 and 1 by dividing each pixel by 256.
One last point: because the model has a very large number of parameters, it needs a large amount of training data, so we perform data augmentation in the temporal dimension. To generate more training sequences, we concatenate frames with various skipping strides. For example, the first stride-1 sequence is made up of frames (1, 2, 3, 4, 5, 6, 7, 8, 9, 10), whereas the first stride-2 sequence consists of frames (1, 3, 5, 7, 9, 11, 13, 15, 17, 19).
Here is the code. Feel free to edit it to get more/fewer input sequences with various skipping strides, and see how the results change afterward.
Note: if you run into a memory error, reduce the number of training sequences or use a data generator.
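The preparation steps and the stride-based augmentation described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the article's original code: the function name `make_training_sequences` and the stride set `(1, 2, 3)` are assumptions, and the frames are assumed to be already loaded as grayscale arrays and resized to a common resolution (for example with an image library).

```python
import numpy as np

def make_training_sequences(frames, seq_len=10, strides=(1, 2, 3)):
    """Cut one clip of shape (n_frames, H, W) into overlapping sequences.

    For each stride s, the window starting at frame i contains frames
    i, i + s, i + 2s, ... (seq_len frames in total).
    """
    # Scale pixel intensities to [0, 1) by dividing by 256, as in the text.
    frames = np.asarray(frames, dtype=np.float32) / 256.0
    n_frames = frames.shape[0]
    sequences = []
    for stride in strides:
        span = (seq_len - 1) * stride  # distance from first to last frame
        for start in range(n_frames - span):
            idx = start + stride * np.arange(seq_len)
            sequences.append(frames[idx])
    # Add a trailing channel axis: (n_sequences, seq_len, H, W, 1).
    return np.stack(sequences)[..., np.newaxis]
```

Note that the code uses zero-based frame indices, so the first stride-2 sequence holds frames 0, 2, 4, ..., 18. Adding or removing entries in `strides` is one way to get more or fewer input sequences from the same clip.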
Building And Training The Model
Finally, the fun part begins! We will use Keras to build our convolutional LSTM autoencoder.
The image below shows the training process; we will train the model to reconstruct the regular events. So let us start exploring the model settings and architecture.
To build the autoencoder, we need to define the encoder and the decoder. The encoder accepts as input a sequence of frames in chronological order, and it consists of two parts: the spatial encoder and the temporal encoder. The encoded features of the sequence that come out of the spatial encoder are fed into the temporal encoder for motion encoding.
The decoder mirrors the encoder to reconstruct the video sequence, so our autoencoder looks like a sandwich.
Note: because the model has a huge number of parameters, it’s recommended that you use a GPU. Using Kaggle or Colab is also a good idea.
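To make the "sandwich" shape concrete, here is a small sketch that traces how the spatial size of a frame might change through a mirrored encoder and decoder. The strides `(4, 2)` and the use of "same" padding are illustrative assumptions, not values stated in the text; the helper names are hypothetical.

```python
import math

def conv_same(size, stride):
    """Spatial size after a strided convolution with 'same' padding."""
    return math.ceil(size / stride)

def deconv_same(size, stride):
    """Spatial size after a strided transposed convolution with 'same' padding."""
    return size * stride

def trace_shapes(size=256, encoder_strides=(4, 2)):
    """List (stage, spatial size) pairs for a mirrored autoencoder."""
    shapes = [("input", size)]
    for s in encoder_strides:                # spatial encoder shrinks each frame
        size = conv_same(size, s)
        shapes.append((f"conv, stride {s}", size))
    # The temporal encoder/decoder (ConvLSTM with 'same' padding) keeps the size.
    shapes.append(("convolutional LSTM layers", size))
    for s in reversed(encoder_strides):      # spatial decoder mirrors the encoder
        size = deconv_same(size, s)
        shapes.append((f"deconv, stride {s}", size))
    return shapes

for stage, size in trace_shapes():
    print(f"{stage:28s} {size} x {size}")
```

Under these assumptions the output has the same resolution as the input, which is what lets the reconstruction be compared with the original frame pixel by pixel.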
Initialization and Optimization:
We use Adam as the optimizer with a learning rate of 0.0001. We reduce the learning rate when the training loss stops decreasing by using a decay of 0.00001, and we set the epsilon value to 0.000001.
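The text does not spell out how the decay value is applied. One common convention, used by the `decay` argument of older Keras optimizers, is inverse-time decay, which shrinks the learning rate slightly on every update step. The sketch below assumes that convention; the function name is hypothetical.

```python
def inverse_time_decay(initial_lr, decay, step):
    """Learning rate after `step` updates under inverse-time decay."""
    return initial_lr / (1.0 + decay * step)

# With the settings from the text, print the rate at a few update counts.
for step in (0, 10_000, 100_000):
    print(step, inverse_time_decay(1e-4, 1e-5, step))
```

With a decay this small, the learning rate changes very slowly: under this assumption it takes 100,000 updates to halve it.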
For initialization, we use the Xavier algorithm, which prevents the signal from becoming too tiny or too massive to be useful as it goes through each layer.
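As a rough illustration of the idea, the uniform variant of Xavier (Glorot) initialization draws weights from a range chosen so that their variance is 2 / (fan_in + fan_out). The sketch below is a plain NumPy version for a dense weight matrix; the function name and the example sizes are assumptions made for illustration.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) weight matrix with Xavier/Glorot uniform init."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Uniform(-limit, limit) has variance limit**2 / 3 = 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

weights = glorot_uniform(300, 200)
print(weights.shape, weights.var())
```

Tying the weight scale to the layer's fan-in and fan-out is what keeps the signal from shrinking or blowing up as it passes through successive layers.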
Let’s Dive Deeper into the Model!
Why use convolutional layers in the encoder and deconvolutional layers in the decoder?
Convolutional layers connect multiple input activations within the fixed receptive field of a filter to a single output activation, abstracting the information in a filter cuboid into a scalar value. Deconvolutional layers, on the other hand, densify the sparse signal through convolution-like operations with multiple learned filters; they associate a single input activation with a patch of outputs by an inverse operation of convolution.
The learned filters in the deconvolutional layers serve as bases to reconstruct the shape of an input motion sequence.
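The many-to-one versus one-to-many behavior can be seen in a tiny one-dimensional example. This is a simplified sketch with hand-picked filter values, not the layers used in the model: `conv1d` maps each window of inputs to a single output, while `conv1d_transpose` spreads each input over a patch of outputs.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Valid 1-D convolution (cross-correlation, as in deep learning libraries)."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    n_out = (len(x) - len(w)) // stride + 1
    # Each output is a weighted sum of one window of inputs.
    return np.array([x[i * stride:i * stride + len(w)] @ w for i in range(n_out)])

def conv1d_transpose(y, w, stride=1):
    """1-D transposed convolution: each input paints a scaled copy of the filter."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    out = np.zeros((len(y) - 1) * stride + len(w))
    for i, value in enumerate(y):
        out[i * stride:i * stride + len(w)] += value * w
    return out

print(conv1d([1, 2, 3, 4, 5, 6], [1, 1], stride=2))   # 6 inputs -> 3 outputs
print(conv1d_transpose([1, 0, 0], [2, 3], stride=2))  # 3 inputs -> 6 outputs
```

In the second call only the first input is nonzero, so the output is a single copy of the filter placed at the start of the signal: one activation has become a patch.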
Why did we use convolutional LSTM layers?
For general-purpose sequence modeling, LSTM, as a particular RNN structure, has proven stable and robust for preserving long-range dependencies.
Here we use convolutional LSTM layers instead of fully connected LSTM (FC-LSTM) layers because FC-LSTM layers do not preserve spatial information well: they use full connections in the input-to-state and state-to-state transitions, in which no spatial information is encoded.
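One way to see the difference is to count weights. An FC-LSTM has to flatten the feature map, so every input value connects to every hidden unit; a convolutional LSTM replaces those full connections with small filters shared across all positions. The sketch below uses the standard parameter-count formulas for the two layer types; the 32 × 32 × 64 feature map and the 3 × 3 kernel are hypothetical sizes chosen only for illustration.

```python
def fc_lstm_params(n_in, n_hidden):
    """Weights and biases of a fully connected LSTM (4 gates)."""
    return 4 * (n_hidden * (n_in + n_hidden) + n_hidden)

def conv_lstm_params(c_in, c_out, kernel):
    """Weights and biases of a convolutional LSTM with square kernels (4 gates)."""
    return 4 * (kernel * kernel * (c_in + c_out) * c_out + c_out)

height, width, channels = 32, 32, 64        # hypothetical feature-map size
flat = height * width * channels            # FC-LSTM sees one long vector

fc_count = fc_lstm_params(flat, flat)
conv_count = conv_lstm_params(channels, channels, kernel=3)
print(f"FC-LSTM:   {fc_count:,} parameters")
print(f"ConvLSTM:  {conv_count:,} parameters")
```

Besides being far smaller, the convolutional version keeps the height and width axes intact, so the hidden state at a location depends only on its spatial neighborhood.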
What is the purpose of Layer Normalization?
Training deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons using Layer Normalization. We use Layer Normalization rather than alternatives such as Batch Normalization because our model contains recurrent layers. Read more about the normalization techniques.
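As a minimal sketch of what layer normalization does, the NumPy function below normalizes each sample using statistics computed over that sample's own features, so the result does not depend on the other items in the batch. It omits the learnable gain and bias, and it normalizes over all non-batch axes; library implementations may default to a different set of axes.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each sample (first axis) over all of its remaining axes."""
    axes = tuple(range(1, x.ndim))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(4, 10, 8, 8, 2))
normed = layer_norm(batch)
print(normed.mean(axis=(1, 2, 3, 4)), normed.std(axis=(1, 2, 3, 4)))
```

Because the statistics are computed per sample, the same computation applies at every time step and for any batch size, which is why it suits recurrent layers.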
Did We Do Well?
Let’s get to the testing phase.
The first step is to get the test data. We will test each testing video individually. The ped1 part of the UCSD dataset provides 36 testing videos; the value of Config.SINGLE_TEST_PATH determines which one will be used.
Each testing video has 200 frames. We use the sliding window technique to get all the consecutive 10-frame sequences. In other words, for each t between 0 and 190, we calculate the regularity score Sr(t) of the sequence that starts at frame t and ends at frame t+9.
We compute the reconstruction error of a pixel’s intensity value I at location (x, y) in frame t of the video using the L2 norm:
where Fw is the model learned by the convolutional LSTM autoencoder. We then compute the reconstruction error of frame t by summing all the pixel-wise errors:
The reconstruction cost of a 10-frame sequence that starts at frame t can be calculated as follows:
We then compute the abnormality score Sa(t) by scaling the reconstruction cost to lie between 0 and 1.
We derive the regularity score Sr(t) by subtracting the abnormality score from 1.
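Written out, the chain of definitions above might look as follows. This is a reconstruction from the prose, using the notation of the cited papers; in particular, the exact form of the scaling in the abnormality score (subtracting the minimum cost and dividing by the maximum cost) is an assumption, and `cost` is a name introduced here for the sequence reconstruction cost.

```latex
\begin{align}
e(x, y, t) &= \lVert I(x, y, t) - F_W(I(x, y, t)) \rVert_2 \\
e(t) &= \sum_{(x, y)} e(x, y, t) \\
\mathrm{cost}(t) &= \sum_{t' = t}^{t + 9} e(t') \\
S_a(t) &= \frac{\mathrm{cost}(t) - \min_{\tau} \mathrm{cost}(\tau)}{\max_{\tau} \mathrm{cost}(\tau)} \\
S_r(t) &= 1 - S_a(t)
\end{align}
```

With this scaling, the sequence with the lowest reconstruction cost gets a regularity score of exactly 1, and sequences that the model reconstructs poorly get scores closer to 0.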
After we compute the regularity score Sr(t) for each t in the range [0, 190], we plot Sr(t) against t.
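The scoring steps can be sketched in NumPy as follows. The function assumes the test sequences and the model's reconstructions are already available as arrays of the same shape, and it assumes the abnormality score is obtained by subtracting the minimum cost and dividing by the maximum cost; both the function name and that scaling are illustrative choices rather than the article's exact code.

```python
import numpy as np

def regularity_scores(sequences, reconstructions):
    """Return Sr(t) for arrays of shape (n_sequences, seq_len, H, W, 1)."""
    sequences = np.asarray(sequences, dtype=np.float64)
    reconstructions = np.asarray(reconstructions, dtype=np.float64)
    # Per-pixel error: for a single intensity value the L2 norm is |difference|.
    pixel_errors = np.abs(sequences - reconstructions)
    # Sum over pixels and over the frames of each sequence.
    costs = pixel_errors.reshape(pixel_errors.shape[0], -1).sum(axis=1)
    # Abnormality score scaled to [0, 1], then flipped into a regularity score.
    abnormality = (costs - costs.min()) / costs.max()
    return 1.0 - abnormality

# Toy example: three sequences with increasingly poor reconstructions.
originals = np.zeros((3, 10, 4, 4, 1))
rebuilt = np.stack([np.full((10, 4, 4, 1), v) for v in (0.0, 0.1, 0.2)])
print(regularity_scores(originals, rebuilt))
```

The division assumes at least one sequence has a nonzero reconstruction cost; a perfectly reconstructed video would need a guard against dividing by zero.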
First, let’s take a look at Test 032 of UCSDped1. At the beginning of the video, there is a bicycle on the walkway, which explains the low regularity score. After the bicycle leaves, the regularity score starts to increase. At frame 60, another bicycle enters; the regularity score decreases again and increases right after it leaves.
Test 004 of the UCSDped1 dataset shows a skater entering the walkway at the beginning of the video and someone walking on the grass at frame 140, which explains the two drops in the regularity score.
Test 024 of the UCSDped1 dataset shows a small cart crossing the walkway, causing a drop in the regularity score. The regularity score returns to its normal level after the cart leaves.
Test 005 of the UCSDped1 dataset shows two bicycles passing through the walkway, one at the beginning and the other at the end of the video.
Try multiple datasets, such as the CUHK Avenue dataset or the UMN dataset, or even gather your own data using a surveillance camera or a small camera in your room. The training data is relatively easy to collect since it consists of videos that contain only regular events. Mix multiple datasets and see if the model still does well. Think of ways to speed up anomaly detection, such as using fewer sequences in the testing stage.
And don’t forget to write your results in the comments!
 Yong Shean Chong, Abnormal Event Detection in Videos using Spatiotemporal Autoencoder (2017), arXiv:1701.01546.
 Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K. Roy-Chowdhury, Learning Temporal Regularity in Video Sequences (2016), arXiv:1604.04574.