From LeNet to EfficientNet: The evolution of CNNs


EfficientNet: Squeeze and Excitation layers

With earlier models focused either on accuracy or on computational efficiency, the EfficientNet model [7] came out with the idea that both goals can be met by the same family of architectures. The authors proposed a common CNN skeleton together with three scaling dimensions: width, depth, and resolution. The width of the model refers to the number of channels in its layers, the depth refers to the number of layers, and the resolution refers to the size of the input image. They claimed that by keeping all three parameters small, one can build a competitive yet computationally cheap model, while scaling all three up together yields a heavier model focused on accuracy.
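To make the scaling idea concrete, here is a minimal Python sketch of compound scaling. The coefficients alpha, beta, and gamma are the values reported in the EfficientNet paper for the B0 baseline; the baseline depth, width, and resolution used in the example are illustrative assumptions, not the exact B0 configuration.

```python
import math

# Compound scaling: a single coefficient phi grows depth, width, and
# resolution together. alpha/beta/gamma below are the constants reported
# in the EfficientNet paper; the baseline numbers are only illustrative.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution multipliers

def compound_scale(base_depth, base_width, base_resolution, phi):
    depth = math.ceil(base_depth * ALPHA ** phi)              # more layers
    width = int(round(base_width * BETA ** phi))              # more channels
    resolution = int(round(base_resolution * GAMMA ** phi))   # larger input image
    return depth, width, resolution

# Scaling a toy baseline from phi=0 (small, efficient) to phi=6 (large, accurate)
for phi in range(7):
    print(phi, compound_scale(base_depth=18, base_width=32,
                              base_resolution=224, phi=phi))
```

Larger phi trades compute for accuracy, which is exactly the knob that distinguishes the lighter and heavier members of the EfficientNet family.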

While Squeeze and Excitation (S&E) layers had already been proposed earlier, EfficientNet was one of the first widely used classification backbones to adopt them as a standard building block. S&E layers learn interactions across channels while squeezing away spatial information, which lets the network reduce the influence of less important channels. The authors also replaced ReLU with the then recently proposed Swish activation, which was an important factor in the performance improvement. At the time of writing, EfficientNets are among the best performing classification models across a range of compute budgets.
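For illustration, here is a minimal PyTorch sketch of an S&E block combined with Swish. It is not the exact EfficientNet implementation; the reduction ratio and layer layout are assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Minimal squeeze-and-excitation block: global-average-pool each channel
    to a single value (squeeze), pass it through a small bottleneck (excitation),
    and rescale the original channels by the resulting per-channel weights.
    The reduction ratio of 4 is illustrative."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.fc1 = nn.Conv2d(channels, squeezed, kernel_size=1)
        self.fc2 = nn.Conv2d(squeezed, channels, kernel_size=1)
        self.act = nn.SiLU()      # SiLU is the Swish activation: x * sigmoid(x)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)  # squeeze: (N, C, 1, 1), spatial info discarded
        s = self.act(self.fc1(s))             # excitation bottleneck
        s = self.gate(self.fc2(s))            # per-channel weights in [0, 1]
        return x * s                          # down-weight less important channels

# Usage: rescale the channels of a feature map
feat = torch.randn(8, 64, 32, 32)
se = SqueezeExcitation(channels=64)
out = se(feat)
print(out.shape)  # torch.Size([8, 64, 32, 32])
```

Because the squeeze step averages over the full spatial extent, the learned channel weights depend only on channel statistics, which is what makes the interaction invariant to spatial information.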

Figure: Squeeze and Excitation Networks

What’s next?

Nowadays, CNN models are judged not only on their image classification performance but also on how well they transfer to other problems, such as object detection and segmentation. Some of these problems are already considered largely solved, and focus is shifting towards architectures like the hourglass design, where the output image resolution matches the input. Even so, the backbones introduced in this blog are still used directly in a wide range of deep learning tasks, so although image classification is close to being a solved problem, progress in this field will remain important in the future.

References

[1] LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278–2324.
[2] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.
[3] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
[4] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[5] Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
[6] Howard, Andrew G., et al. “MobileNets: Efficient convolutional neural networks for mobile vision applications.” arXiv preprint arXiv:1704.04861 (2017).
[7] Tan, Mingxing, and Quoc V. Le. “EfficientNet: Rethinking model scaling for convolutional neural networks.” arXiv preprint arXiv:1905.11946 (2019).