Paper Review 4 — Compressed Video Action Recognition

This is a paper published at CVPR 2018 on action recognition in videos, co-authored by Manmatha, Smola, et al. Action recognition is an area of research where, given a video, the goal is to recognize what action is being performed; see figure 1 below for an example, taken from the video action recognition dataset UCF101. Understanding videos is arguably the next frontier in deep learning and computer vision, as videos capture far more information than images can.

This paper proposes an approach that consumes compressed videos directly. In hindsight, it makes total sense. The approach is shown not only to be faster than existing action recognition approaches (circa 2017) but also to give state-of-the-art results.

Figure 1. UCF101 example


The paper identifies two challenges in learning from raw video:

  1. Videos have very low information density; the “true” and interesting signal is drowned in boring, repeating patterns.
  2. Treating a video as a sequence of RGB frames hinders learning of its temporal structure.

This paper goes on to show how these challenges are overcome by their compressed video action recognition approach.

Key Idea

Motivated by that the superfluous information can be reduced by up to two orders of magnitude by video compression (using H.264, HEVC, etc.), we propose to train a deep network directly on the compressed video. The key idea is the use of compressed video instead of uncompressing the video into RGB frames. (taken from paper page 1 abstract)

Benefits of this approach — taken from the paper (page 2)

  1. Consuming compressed video already removes superfluous information.
  2. Motion vectors in video compression give us motion information that individual RGB frames do not have.
  3. With compressed video, we account for the correlation between video frames, i.e. a spatial view plus small changes over time, instead of treating frames as i.i.d. images.


One needs a little crash course in video codecs to follow the paper. Briefly: an MPEG codec splits a video into I-frames, P-frames and B-frames; P-frames are further decomposed into motion vectors and residuals. For more background, read section 2 of the paper. In this work, only I-frames and P-frames are used.
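To make the P-frame decomposition concrete, here is a minimal NumPy sketch (not the paper's code) of block-wise motion compensation: a P-frame is reconstructed by copying blocks from the reference frame at the offsets given by its motion vectors, then adding the residual.

```python
import numpy as np

def reconstruct_p_frame(ref, motion_vectors, residual, block=4):
    """Reconstruct a P-frame via block-wise motion compensation.

    ref:            (H, W) reference frame (e.g. the preceding frame)
    motion_vectors: (H//block, W//block, 2) per-block (dy, dx) offsets
                    pointing into the reference frame (assumed in-bounds)
    residual:       (H, W) correction added after motion compensation
    """
    H, W = ref.shape
    out = np.empty_like(ref)
    for by in range(H // block):
        for bx in range(W // block):
            dy, dx = motion_vectors[by, bx]
            y, x = by * block + dy, bx * block + dx  # source block origin
            out[by*block:(by+1)*block, bx*block:(bx+1)*block] = \
                ref[y:y+block, x:x+block]
    return out + residual
```

With zero motion vectors and a zero residual, the P-frame is just a copy of the reference frame; real codecs exploit exactly this redundancy.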

Figure 2. Compressed video background. I-frames, P-frames (motion vectors and residuals).


The main modeling approach, in short:

  1. The video is encoded in MPEG format.
  2. The compressed video is used to train three different models — an I-frame model, a motion-vector model and a residual model.
  3. The predictions of these three models are combined to give the final action prediction.
  4. An I-frame is processed as a normal image frame: it is a full image stored in the MPEG stream, appearing periodically and followed by a run of P-frames. Each I-frame is extracted and sent through a ResNet-152 trained with back-propagation against the video's action labels.
  5. The novel contribution of this work is the representation of the motion-vector and residual data. The authors accumulate the motion vectors and residuals of each P-frame back to the most recent I-frame, and use this accumulated data as input to a shallower ResNet-18.
  6. They show visually (fig. 2) that the accumulated motion vectors and residuals capture longer-term differences and exhibit clearer patterns than the raw per-frame data.
  7. The final prediction is a simple weighted sum of the predictions of the three models.
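Steps 5 and 7 above can be sketched as follows. This is an illustrative simplification — pixel-wise back-tracing and made-up fusion weights — not the authors' implementation, which uses an efficient recursive formulation.

```python
import numpy as np

def accumulate_motion(mv_per_frame):
    """Trace each pixel of the latest P-frame back to the I-frame.

    mv_per_frame: list of (H, W, 2) integer fields, ordered from the
    first P-frame after the I-frame to the current one; mv[y, x] is the
    (dy, dx) offset into the previous frame that predicts pixel (y, x).
    Returns the accumulated (H, W, 2) offsets into the I-frame.
    """
    H, W, _ = mv_per_frame[0].shape
    acc = np.zeros((H, W, 2), dtype=int)
    for y in range(H):
        for x in range(W):
            cy, cx = y, x
            # Walk backwards through every P-frame's motion field.
            for mv in reversed(mv_per_frame):
                dy, dx = mv[cy, cx]
                cy = int(np.clip(cy + dy, 0, H - 1))
                cx = int(np.clip(cx + dx, 0, W - 1))
            acc[y, x] = (cy - y, cx - x)
    return acc

def fuse(iframe_scores, mv_scores, res_scores, w=(2.0, 1.0, 1.0)):
    # Late fusion: weighted sum of the three networks' class scores.
    # These weights are illustrative, not the paper's values.
    return w[0] * iframe_scores + w[1] * mv_scores + w[2] * res_scores
```

For example, two consecutive P-frames that each shift content by one pixel accumulate to a two-pixel displacement relative to the I-frame — the "longer-term difference" the authors visualize.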


Main claims

  1. Compressed video is a better representation — experiments in section 4.1.
  2. This representation gives higher accuracy — experiments in section 4.3.
  3. Faster training speed — experiments in section 4.2.

Where to look in the paper

  1. Train and test setup — section 4.
  2. Ablation study — section 4.1.
  3. Overall results — tables 1 and 6.

Source: Deep Learning on Medium