10 Papers You Should Read to Understand Image Classification in the Deep Learning Era

Original article was published by Ethan Yanjia Li on Artificial Intelligence on Medium

10 Papers You Should Read to Understand Image Classification in the Deep Learning Era

A quick walkthrough of the best papers for image classification in a decade to jump start your learning of computer vision

Picture by myself


Computer vision is a subject to convert images and videos into machine-understandable signals. With these signals, programmers can further control the behavior of the machine based on this high-level understanding. Among many computer vision tasks, image classification is one of the most fundamental ones. It not only can be used in lots of real products like Google Photo’s tagging and AI content moderation but also opens a door for lots of more advanced vision tasks, such as object detection and video understanding. Due to the rapid changes in this field since the breakthrough of Deep Learning, beginners often find it too overwhelming to learn. Unlike typical software engineering subjects, there are not many great books about image classification using DCNN, and the best way to understand this field is though reading academic papers. But what papers to read? Where do I start? In this article, I’m going to introduce 10 best papers for beginners to read. With these papers, we can see how this field evolve, and how researchers brought up new ideas based on previous research outcome. Nevertheless, it is still helpful for you to sort out the big picture even if you have already worked in this area for a while. So, let’s get started.

1998: LeNet

Gradient-based Learning Applied to Document Recognition

From “Gradient-based Learning Applied to Document Recognition

Introduced in 1998, LeNet sets a foundation for future image classification research using the Convolution Neural Network. Many classical CNN techniques, such as pooling layers, fully connected layers, padding, and activation layers are used to extract features and make a classification. With a Mean Square Error loss function and 20 epochs of training, this network can achieve 99.05% accuracy on the MNIST test set. Even after 20 years, many state-of-the-art classification networks still follows this pattern in general.

2012: AlexNet

ImageNet Classification with Deep Convolutional Neural Networks

From “ImageNet Classification with Deep Convolutional Neural Networks

Although LeNet achieved a great result and showed the potential of CNN, the development in this area stagnated for a decade due to limited computing power and the amount of the data. It looked like CNN can only solve some easy tasks such as digit recognition, but for more complex features like faces and objects, a HarrCascade or SIFT feature extractor with an SVM classifier was a more preferred approach.

However, in 2012 ImageNet Large Scale Visual Recognition Challenge, Alex Krizhevsky proposed a CNN-based solution for this challenge and drastically increased ImageNet test set top-5 accuracy from 73.8% to 84.7%. Their approach inherits the multi-layer CNN idea from LeNet, but increased the size of CNN a lot. As you can see from the diagram above, the input is now 224×224 compared with LeNet’s 32×32, also many Convolution kernels have 192 channels compared with LeNet’s 6. Although the design isn’t changed much, with hundreds of more times of parameters, the network’s ability to capture and represent complex features improved hundreds of times too. To train such as a big model, Alex used two GTX 580 GPU with 3GB RAM for each, which pioneered a trend of GPU training. Also, the use of ReLU non-linearity also helped to reduce computation cost.

In addition to bringing many more parameters for the network, it also explored the overfitting issue brought by a larger network by using a Dropout layer. Its Local Response Normalization method didn’t get too much popularity afterward but inspired other important normalization techniques such as BatchNorm to combat with gradient saturation issue. To sum up, AlexNet defined the de facto classification network framework for the next 10 years: a combination of Convolution, ReLu non-linear activation, MaxPooling, and Dense layer.

2014: VGG

Very Deep Convolutional Networks for Large-Scale Image Recognition

From Quora “https://www.quora.com/What-is-the-VGG-neural-network

With such a great success of using CNN for visual recognition, the entire research community blew up and all started to look into why this neural network works so well. For example, in “Visualizing and Understanding Convolutional Networks” from 2013, Matthew Zeiler discussed how CNN pick up features and visualized the intermediate representations. And suddenly everyone started to realize that CNN is the future of computer vision since 2014. Among all those immediate followers, the VGG network from Visual Geometry Group is the most eye-catching one. It got a remarkable result of 93.2% top-5 accuracy, and 76.3% top-1 accuracy on the ImageNet test set.

Following AlexNet’s design, the VGG network has two major updates: 1) VGG not only used a wider network like AlexNet but also deeper. VGG-19 has 19 convolution layers, compared with 5 from AlexNet. 2) VGG also demonstrated that a few small 3×3 convolution filters can replace a single 7×7 or even 11×11 filters from AlexNet, achieve better performance while reducing the computation cost. Because of this elegant design, VGG also became the back-bone network of many pioneering networks in other computer vision tasks, such as FCN for semantic segmentation, and Faster R-CNN for object detection.

With a deeper network, gradient vanishing from multi-layers back-propagation becomes a bigger problem. To deal with it, VGG also discussed the importance of pre-training and weight initialization. This problem limits researchers to keep adding more layers, otherwise, the network will be really hard to converge. But we will see a better solution for this after two years.

2014: GoogLeNet

Going Deeper with Convolutions

From “Going Deeper with Convolutions

VGG has a good looking and easy-to-understand structure, but its performance isn’t the best among all the finalists in ImageNet 2014 competitions. GoogLeNet, aka InceptionV1, won the final prize. Just like VGG, one of the main contributions of GoogLeNet is to push the limit of the network depth with a 22 layers structure. This demonstrated again that going deeper and wider is indeed the right direction to improve accuracy.

Unlike VGG, GoogLeNet tried to address the computation and gradient diminishing issues head-on, instead of proposing a workaround with better pre-trained schema and weights initialization.

Bottleneck Inception Module From “Going Deeper with Convolutions

First, it explored the idea of asymmetric network design by using a module called Inception (see diagram above). Ideally, they would like to pursuit sparse convolution or dense layers to improve feature efficiency, but modern hardware design wasn’t tailored to this case. So they believed that a sparsity at the network topology level could also help the fusion of features while leveraging existing hardware capabilities.

Second, it attacks the high computation cost problem by borrowing an idea from a paper called “Network in Network”. Basically, a 1×1 convolution filter is introduced to reduce dimensions of features before going through heavy computing operation like a 5×5 convolution kernel. This structure is called “Bottleneck” later and widely used in many following networks. Similar to “Network in Network”, it also used an average pooling layer to replace the final fully connected layer to further reduce cost.

Third, to help gradients to flow to deeper layers, GoogLeNet also used supervision on some intermediate layer outputs or auxiliary output. This design isn’t quite popular later in the image classification network because of the complexity, but getting more popular in other areas of computer vision such as Hourglass network in pose estimation.

As a follow-up, this Google team wrote more papers for this Inception series. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” stands for InceptionV2. “Rethinking the Inception Architecture for Computer Vision” in 2015 stands for InceptionV3. And “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning” in 2015 stands for InceptionV4. Each paper added more improvement over the original Inception network and achieved a better result.

2015: Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

The inception network helped researchers to reach a superhuman accuracy on the ImageNet dataset. However, as a statistic learning method, CNN is very much constrained to the statistic nature of a specific training dataset. Therefore, to achieve better accuracy, we usually need to pre-calculate the mean and standard deviation of the entire dataset and use them to normalize our input first to ensure most of the layer inputs in the network are close, which translates to better activation responsiveness. This approximate approach is very cumbersome, and sometimes doesn’t work at all for a new network structure or a new dataset, so the deep learning model is still viewed as difficult to train. To address this problem, Sergey Ioffe and Chritian Szegedy, the guy who created GoogLeNet, decided to invent something smarter called Batch Normalization.

From “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

The idea of batch normalization is not hard: We can use the statistics of a series of mini-batch to approximate the statistics of the whole dataset, as long as we train for long enough time. Also, instead of manually calculating the statistics, we can introduce two more learnable parameters “scale” and “shift” to have the network learn how to normalize each layer by itself.

The above diagram showed the process of calculating batch normalization values. As we can see, we take the mean of the whole mini-batch and calculate the variance as well. Next, we can normalize the input with this mini-batch mean and variance. Finally, with a scale and a shift parameter, the network will learn to adapt the batch normalized result to best fit the following layers, usually ReLU. One caveat is that we don’t have mini-batch information during inference, so a workaround is to calculate a moving average mean and variance during training, and then use these moving averages in the inference path. This little innovation is so impactful, and all later networks start to use it right away.

2015: ResNet

Deep Residual Learning for Image Recognition

2015 may be the best year for computer vision in a decade, we’ve seen so many great ideas popping out not only in image classification but all sorts of computer vision tasks such as object detection, semantic segmentation, etc. The biggest advancement of the year 2015 belongs to a new network called ResNet, or residual networks, proposed by a group of Chinese researchers from Microsoft Research Asia.

From “Deep Residual Learning for Image Recognition

As we discussed earlier for VGG network, the biggest hurdle of getting even deeper is the gradient vanishing issue, i.e, derivatives become smaller and smaller when back-propagate through deeper layers, and eventually reaches a point that modern computer architecture can’t really represent meaningfully. GoogLeNet tried to attack this by using auxiliary supervision and asymmetric inception module, but it only alleviates the problem to a small extent. If we want to use 50 or even 100 layers, will there be a better way for the gradient to flow through the network? The answer from ResNet is to use a residual module.

Residual Module From “Deep Residual Learning for Image Recognition

ResNet added an identity shortcut to the output, so that each residual module can’t at least predict whatever the input is, without getting lost in the wild. Even more important thing is, instead of hoping each layer fit directly to the desired feature mapping, the residual module tries to learn the difference between output and input, which makes the task much easier because the information gain needed is less. Imagine that you are learning mathematics, for each new problem, you are given a solution of a similar problem, so all you need to do is to extend this solution and try to make it work. This is much easier than thinking of a brand new solution for every problem you run into. Or as Newton said, we can stand on the shoulders of giants, and the identity input is that giant for the residual module.

In addition to identity mapping, ResNet also borrowed the bottleneck and Batch Normalization from Inception networks. Eventually, it managed to build a network with 152 convolution layers and achieved 80.72% top-1 accuracy on ImageNet. The residual approach also becomes a default option for many other networks later, such as Xception, Darknet, etc. Also, thanks to its simple and beautiful design, it’s still widely used in many production visual recognition systems nowadays.

There’re many more invariants came out by following the hype of the residual network. In “Identity Mappings in Deep Residual Networks”, the original author of ResNet tried to put activation before the residual module and achieved a better result, and this design is called ResNetV2 afterward. Also, in a 2016 paper “Aggregated Residual Transformations for Deep Neural Networks”, researchers proposed ResNeXt which added parallel branches for residual modules to aggregate outputs of different transforms.

2016: Xception

Xception: Deep Learning with Depthwise Separable Convolutions

From “Xception: Deep Learning with Depthwise Separable Convolutions

With the release of ResNet, it looked like most of the low hanging fruits in the image classifier were grabbed already. Researchers started to think about what is the internal mechanism of the magic of CNN. Since cross-channel convolution usually introduces a ton of parameters, the Xception network chose to investigate this operation to understand a full picture of its effect.

Like its name, Xception originates from the Inception network. In the Inception module, multiple branches of different transformations are aggregated together to achieve a topology sparsity. But why this sparsity worked? The author of Xception, also the author of the Keras framework, extended this idea to an extreme case where one 3×3 convolution file correspond to one output channel before a final concatenation. In this case, these parallel convolution kernels actually form a new operation called depth-wise convolution.

From “Depth-wise Convolution and Depth-wise Separable Convolution

As shown in the diagram above, unlike traditional convolution where all channels are included for one computation, depth-wise convolution only computes convolution for each channel separately and then concatenate the output together. This cuts the feature exchange among channels, but also reduces a lot of connections, hence result in a layer with fewer parameters. However, this operation will output the same number of channels as input (Or a smaller number of channels if you group two or more channels together). Therefore, once the channel outputs are merged, we need another regular 1×1 filter, or point-wise convolution, to increase or reduce the number of channels, just like regular convolution does.

This idea is not from Xception originally. It’s described in a paper called “Learning visual representations at scale” and also used occasionally in InceptionV2. Xception took a step further and replaced almost all convolutions with this new type. And the experiment result turned out to be pretty good. It surpasses ResNet and InceptionV3 and became a new SOTA method for image classification. This also proved that the mapping of cross-channel correlations and spatial correlations in CNN can be entirely decoupled. In addition, sharing the same virtue with ResNet, Xception has a simple and beautiful design as well, so its idea is used in lots of other following research such as MobileNet, DeepLabV3, etc.

2017: MobileNet

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Application

Xception achieved 79% top-1 accuracy and 94.5% top-5 accuracy on ImageNet, but that’s only 0.8% and 0.4% improvement respectively compared with previous SOTA InceptionV3. The marginal gain of a new image classification network is becoming smaller, so researchers start to shift their focus into other areas. And MobileNet led a significant push of image classification in a resource constrained environment.

MobileNet module from “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Application”

Similar to Xception, MobileNet used a same depthwise separable convolution module as shown above and had an emphasis on the high efficiency and less parameters.

Parameters ratio from “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Application”

The numerator in the above formula is the total number of parameters required by a depthwise separable convolution. And the denominator is the total number of parameters of a similar regular convolution. Here D[K] is the size of convolution kernel, D[F] is the size of the feature map, M is the number of input channels, N is the number of output channels. Since we separated the calculation of channel and spatial feature, we can turn multiplication into an addition, which is a magnitude smaller. Even better, as we can see from this ratio, the larger the number of output channels is, the more calculation we saved from using this new convolution.

Another contribution from MobileNet is the width and resolution multiplier. MobileNet team wanted to find a canonical way to shrink model size for mobile devices, and the most intuitive way is to reduce the number of input and output channels, as well as the input image resolution. To control this behavior, a ratio alpha is multiplied with channels, and a ratio rho is multiplied with input resolution (which also affects feature map size). So the total number of parameters can be represented in the following formula:

“MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Application”

Although this change seems naive in terms of innovation, it has great engineering value because it’s the first time researchers conclude a canonical approach to adjust network for different resource constraints. Also, it kind of summarized the ultimate solution of improving neural network: wider and high-res input leads to better accuracy, thinner and low-res input leads to poorer accuracy.

Later in 2018 and 2019, the MobiletNet team also released “MobileNetV2: Inverted Residuals and Linear Bottlenecks” and “Searching for MobileNetV3”. In MobileNetV2, an inverted residual bottleneck structure is used. And in MobileNetV3, it started to search the optimal architecture combination using Neural Architecture Search technology, which we will cover next.

2017: NASNet

Learning Transferable Architectures for Scalable Image Recognition

Just like image classification for a resource-constrained environment, neural architecture search is another field that emerged around 2017. With ResNet, Inception, and Xception, it seems like we reached an optimal network topology that humans can understand and design, but what if there’s a better and much more complex combination that far exceeds human imagination? A paper in 2016 called “Neural Architecture Search with Reinforcement Learning” proposed an idea to search the optimal combination within a pre-defined search space by using reinforcement learning. As we know, reinforcement learning is a method to find the best solution with a clear goal and reward for the search agent. However, limited by computing power, this paper only discussed the application in a small CIFAR dataset.

NASNet Search Space. “Learning Transferable Architectures for Scalable Image Recognition

With the goal of finding an optimal structure for a large dataset like ImageNet, NASNet made a search space that is tailored for ImageNet. It hopes to design a special search space so that the searched result on CIFAR can also work on ImageNet well. First, NASNet assumes that common hand-crafted module in good networks like ResNet and Xception are still useful when searching. So instead of searching for random connection and operations, NASNet searches the combination of these modules that have been proved to be useful already on ImageNet. Second, the actual searching is still performed on the CIFAR dataset with 32×32 resolution, so NASNet only searches for modules that are not affected by the input size. In order to make the second point work, NASNet predefined two types of module templates: Reduction and Normal. Reduction cell could have reduced feature map compared with the input, and for Normal cell it would be the same.

From “Learning Transferable Architectures for Scalable Image Recognition

Although NASNet has better metrics than manually design networks, it also suffers from a few drawbacks. The cost of searching for an optimal structure is very high, which is only affordable by big companies like Google and Facebook. Also, the final structure doesn’t make too much sense to humans, hence harder to maintain and improve in a production environment. Later in 2018, “MnasNet: Platform-Aware Neural Architecture Search for Mobile” further extends this NASNet idea by limit the search step with a pre-defined chained-blocks structure. Also, by defining a weight factor, mNASNet gave a more systematic way to search model given specific resource constraints, instead of just evaluating based on FLOPs.

2019: EfficientNet

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

In 2019, it looks like there are no exciting ideas for supervised image classification with CNN anymore. A drastic change in network structure usually only offers a little accuracy improvement. Even worse, when the same network applied to different datasets and tasks, previously claimed tricks don’t seem to work, which led to critiques that whether those improvements are just overfitting on the ImageNet dataset. On the other side, there’s one trick that never fails our expectation: using higher resolution input, adding more channels for convolution layers, and adding more layers. Although very brutal force, it seems like there’s a principled way to scale the network on demand. MobileNetV1 sort of suggested this in 2017, but the focus was shifted to a better network design later.

From “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

After NASNet and mNASNet, researchers realized that even with the help from a computer, a change in architecture doesn’t yield that much benefit. So they start to fall back to the scaling the network. EfficientNet is just built on top of this assumption. On one hand, it uses the optimal building block from mNASNet to make sure a good foundation to start with. On the other hand, it defined three parameters alpha, beta, and rho to control the depth, width, and resolution of the network correspondingly. By doing so, even without a large GPU pool to search for an optimal structure, engineers can still rely on these principled parameters to tune the network based on their different requirements. In the end, EfficientNet gave 8 different variants with different width, depth, and resolution ratios and got good performance for both small and big models. In other words, if you want high accuracy, go for a 600×600 and 66M parameters EfficientNet-B7. If you want low latency and smaller model, go for a 224×224 and 5.3M parameters EfficientNet-B0. Problem solved.

Read More

If you finish reading above 10 papers, you should have a pretty good grasp of the history of image classification with CNN. If you like to keep learning this area, I’ve also listed some other interesting papers to read. Although not included in the top-10 list, these papers are all famous in their own area and inspired many other researchers in the world.

2014: SPPNet

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

SPPNet borrows the idea of feature pyramid from traditional computer vision feature extraction. This pyramid forms a bag of words of features with different scales, so it is can adapt to different input sizes and get rid of the fixed-size fully connected layer. This idea also further inspired the ASPP module of DeepLab, and also FPN for object detection.

2016: DenseNet

Densely Connected Convolutional Networks

DenseNet from Cornell further extends the idea from ResNet. It not only provides skip connection between layers but also has skip connections from all previous layers.

2017: SENet

Squeeze-and-Excitation Networks

The Xception network demonstrated that cross-channel correlation doesn’t have much to do with spatial correlation. However, as the champion of the last ImageNet competition, SENet devised a Squeeze-and-Excitation block and told a different story. The SE block first squeezes all channels into fewer channels using global pooling, applies a fully connected transform, and then “excite” them back to the original number of channels using another fully connected layer. So essentially, the FC layer helped the network to learn attentions on the input feature map.

2017: ShuffleNet

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Built on top of MobileNetV2’s inverted bottleneck module, ShuffleNet believes that point-wise convolution in depthwise separable convolution sacrifices accuracy in exchange for less computation. To compensate for this, ShuffleNet added an additional channel shuffling operation to make sure point-wise convolution will not always be applied to the same “point”. And in ShuffleNetV2, this channel shuffling mechanism is further extended to a ResNet identity mapping branch as well, so that part of the identity feature will also be used to shuffle.

2018: Bag of Tricks

Bag of Tricks for Image Classification with Convolutional Neural Networks

Bag of Tricks focuses on common tricks used in the image classification area. It serves as a good reference for engineers when they need to improve benchmark performance. Interestingly, these tricks such as mixup augmentation and cosine learning rate can sometimes achieve far better improvement than a new network architecture.


With the release of EfficientNet, it looks like the ImageNet classification benchmark comes to an end. With the existing deep learning approach, there will never be a day we can reach 99.999% accuracy on the ImageNet unless another paradigm shift happened. Hence, researchers are actively looking at some novel areas such as self-supervised or semi-supervised learning for large scale visual recognition. In the meantime, with existing methods, it became more a question for engineers and entrepreneurs to find the real-world application of this non-perfect technology. In the future, I will also write a survey to analyze those real-world computer vision applications powered by image classification, so please stay tuned! If you think there are also other important papers to read for image classification, please leave a comment below and let us know.

Originally published at http://yanjia.li on July 31, 2020