Understanding the Backbone of Video Classification: The I3D Architecture


As the diagram shows, the early layers of the network use asymmetric max-pooling filters, preserving the time dimension while pooling over the spatial dimensions. Only later in the network do convolutions and pooling operate over the time dimension as well. The Inception module is commonly used in 2D networks, and a full treatment is beyond the scope of this article; in summary, it approximates an optimal local sparse structure, processing spatial (and, in this case, temporal) information at multiple scales and then aggregating the results. The module was motivated by the idea of letting the network grow “wider” instead of “deeper”. The 1x1x1 convolution reduces the number of input channels before the larger 3x3x3 convolutions, making the module less computationally expensive than the alternative.
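To make the two ideas above concrete, here is a minimal PyTorch sketch (not the paper's implementation) of an asymmetric max-pool that preserves time, and an Inception-style branch where a 1x1x1 convolution shrinks the channel count before a 3x3x3 convolution; the channel sizes (192, 96, 128) are illustrative, not taken from I3D:

```python
import torch
import torch.nn as nn

# Toy clip: batch of 1, 3 channels (RGB), 16 frames, 32x32 spatial.
clip = torch.randn(1, 3, 16, 32, 32)

# Asymmetric max-pooling: kernel (1, 3, 3) with stride (1, 2, 2)
# halves the spatial dimensions but leaves all 16 frames intact.
spatial_pool = nn.MaxPool3d(kernel_size=(1, 3, 3),
                            stride=(1, 2, 2),
                            padding=(0, 1, 1))
pooled = spatial_pool(clip)
print(pooled.shape)  # torch.Size([1, 3, 16, 16, 16])

# Inception-style branch: a 1x1x1 convolution first reduces the channel
# count, so the following 3x3x3 convolution operates on 96 channels
# instead of 192, cutting its cost roughly in half.
branch = nn.Sequential(
    nn.Conv3d(192, 96, kernel_size=1),             # channel reduction
    nn.Conv3d(96, 128, kernel_size=3, padding=1),  # spatio-temporal conv
)
features = torch.randn(1, 192, 16, 16, 16)
out = branch(features)
print(out.shape)  # torch.Size([1, 128, 16, 16, 16])
```

Note how the pooled tensor keeps its full 16-frame temporal extent while the spatial resolution drops, which is exactly the behavior the early I3D layers rely on.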


Although the formal introduction of the architecture is a major contribution of the paper, the main contribution is transfer learning from the Kinetics dataset to other video tasks. The Kinetics Human Action Dataset contains annotated videos of human actions, which is why the pre-trained I3D network is used as a feature extractor in a variety of action-related deep learning tasks. Features are commonly extracted from the ‘mixed_5c’ module and passed into new architectures or a fine-tuned version of I3D.
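The feature-extraction recipe can be sketched as follows. The `Mixed5cStandIn` module below is a hypothetical placeholder for a Kinetics-pretrained I3D truncated at ‘mixed_5c’ (its output shape, 1024 channels over an 8×7×7 spatio-temporal grid for a 64-frame 224×224 clip, matches the real network); the pooling step shows the common way of turning those activations into one feature vector per clip:

```python
import torch
import torch.nn as nn

class Mixed5cStandIn(nn.Module):
    """Stand-in for a pretrained I3D truncated at 'mixed_5c'.

    A real pipeline would load a Kinetics-pretrained checkpoint here;
    this placeholder just emits a tensor of the right shape.
    """
    def forward(self, clip):
        n = clip.shape[0]
        return torch.randn(n, 1024, 8, 7, 7)

backbone = Mixed5cStandIn()
clip = torch.randn(2, 3, 64, 224, 224)  # (batch, RGB, frames, H, W)

with torch.no_grad():
    act = backbone(clip)
    # Average over time and space to obtain one 1024-d descriptor per
    # clip, the form typically fed to downstream action models.
    feats = act.mean(dim=(2, 3, 4))

print(feats.shape)  # torch.Size([2, 1024])
```

The resulting 1024-dimensional vectors are what most papers mean when they say they use “I3D features”.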


I3D is one of the most common feature extraction methods for video processing. Other methods, such as the S3D model [2], are also in use, but they build on the I3D architecture with modifications to its modules. If you want to classify videos or the actions within them, I3D is the place to start; if you want features from a pre-trained model for your video-related experiments, I3D is also recommended. I hope you enjoyed this summary!


[1] Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299–6308).

[2] Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 305–321).

*The images in this article are made to resemble the original figures from [1].