Beginner's Guide to Convolutional Neural Networks

Original article was published on Deep Learning on Medium


The “AI Winter” was a period of reduced funding and interest in artificial intelligence research that lasted from the 1980s to the early 2000s. Even though research continued and breakthroughs were still made during this period, the term AI only started to excite the public’s imagination again in 2012. Many developments built up to the comeback of AI, but the turning point came on 30 September 2012, when the GPU-powered AlexNet won a landslide victory in the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge). AlexNet is a convolutional neural network (CNN) designed by Krizhevsky et al. (2017).

CNN is a deep learning architecture inspired by discoveries about the visual cortex in mammals (Hubel & Wiesel, 1959). Various forms of CNN were independently proposed in the 1980s, including the Neocognitron by Fukushima (1980) and the time-delay neural network (TDNN) by Waibel et al. (1989), but the most cited CNN paper was published by LeCun et al. in 1998. Even though CNNs have been around for almost 40 years, they only gained rapid popularity after AlexNet. Since then, CNNs have been widely adopted by deep learning practitioners for image recognition, natural language processing (NLP), time series forecasting and other tasks. CNNs have also become a key component of more complex systems such as instance-aware semantic segmentation (Dai et al., 2016) and region-based CNN (R-CNN) (Girshick et al., n.d.).

The rest of this report provides a high-level review of some of the key literature on CNNs and of their future direction.

CNN Architecture

The key design principle of the CNN architecture is to reduce the number of free parameters in the network without reducing its computational power. When done properly, this increases the probability of correct generalization (LeCun et al., 1989).
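To make the saving in free parameters concrete, the sketch below counts the parameters of a fully connected layer and of a convolution layer that produce an output of the same size. The layer sizes (a 32x32 grayscale input, a 28x28 output and a single 5x5 kernel) and the helper names `dense_params` and `conv_params` are illustrative assumptions, not figures taken from the cited papers.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of a convolution layer with shared kernels."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters


# Hypothetical sizes: a 32x32 grayscale input mapped to a 28x28 output map.
dense = dense_params(32 * 32, 28 * 28)  # every output unit sees every pixel
conv = conv_params(5, 5, 1, 1)          # one 5x5 kernel reused at all positions

print(dense)  # 803600
print(conv)   # 26
```

Under these assumptions the convolution layer needs 26 parameters where the fully connected layer needs more than 800,000, which is the kind of reduction the design principle aims for.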

The approach also favors automatic learning over hand-designed heuristics (LeCun et al., 1998). This brings enormous efficiency benefits, as it eliminates the need for feature engineering (hand-crafted feature extractors), which is both difficult and time consuming for images.

The CNN combines three key architectural ideas:

  • local receptive fields: each unit looks only at a small area of the original image, which allows multiple local features to be extracted from each area,
  • weight-sharing: the same set of weights is applied at every position within a feature map, and
  • sub-sampling: reducing the resolution of each feature map so that the exact position of a feature matters less, which allows for better generalisation.
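The three ideas can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the 5x5 image, the 2x2 edge kernel and the function names `conv2d_valid` and `average_pool` are made up for the example, and average pooling is used as one possible form of sub-sampling.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over every local receptive field (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position (weight sharing).
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


def average_pool(feature_map, size=2):
    """Sub-sample by averaging non-overlapping size x size blocks."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, kernel)   # 4x4 map computed from only 4 shared weights
pooled = average_pool(feature_map, size=2)  # 2x2 map after sub-sampling

print(feature_map[0])  # [0.0, 2.0, 0.0, 0.0]
print(pooled)          # [[1.0, 0.0], [1.0, 0.0]]
```

The feature map responds only where the edge is, and the pooled map still records that the edge lies in the left half of the image while discarding its exact column.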


Back-propagation (Rumelhart et al., 1986) is the fundamental algorithm that enables the parameters of a CNN to be trained. The network adjusts the filter weights of each feature map to minimize the loss.
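As a rough illustration of how filter weights are adjusted to minimize a loss, the sketch below fits a single 1D kernel by gradient descent. It is a simplification, not full back-propagation through a deep network: there is only one linear filter, the gradient is derived by hand, and the signal, target kernel, learning rate and function names are all assumptions chosen for the example.

```python
def conv1d_valid(signal, kernel):
    """Apply one shared kernel at every position of a 1D signal (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def squared_loss(pred, target):
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, target))


def kernel_gradient(signal, kernel, target):
    """Gradient of the squared loss with respect to each kernel weight."""
    pred = conv1d_valid(signal, kernel)
    errors = [p - t for p, t in zip(pred, target)]
    # dLoss/dkernel[j] = sum over positions i of error[i] * signal[i + j]
    return [sum(e * signal[i + j] for i, e in enumerate(errors))
            for j in range(len(kernel))]


# Hypothetical data: the target is produced by a known "true" kernel.
signal = [0.0, 1.0, 0.0, -1.0, 0.5, 1.0]
true_kernel = [1.0, -1.0]
target = conv1d_valid(signal, true_kernel)

kernel = [0.0, 0.0]   # start from zero weights
learning_rate = 0.1   # assumed step size, small enough for this data
initial_loss = squared_loss(conv1d_valid(signal, kernel), target)
for _ in range(100):
    grad = kernel_gradient(signal, kernel, target)
    kernel = [w - learning_rate * g for w, g in zip(kernel, grad)]
final_loss = squared_loss(conv1d_valid(signal, kernel), target)

print(initial_loss, final_loss)
print(kernel)  # close to [1.0, -1.0]
```

Because the kernel is shared, the gradient for each weight sums the contributions from every position where that weight was used, which is what back-propagation does for convolution layers in a real network.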

CNN Layers

A CNN typically consists of two types of layers: convolution layers (with activation) and pooling layers. Multiple convolution and pooling layers can be stacked together, which allows the network to derive higher-level features from lower-level features. The top layers of the CNN consist of a flatten layer and several fully connected layers, followed by a loss function. The 2D output must be flattened into a 1D array of neurons so that the fully connected layers can combine the extracted features into the final output, and so that back-propagation can pass the error back through the network.
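The effect of stacking can be traced with simple shape arithmetic. The sketch below follows a hypothetical stack of two convolution and two pooling layers and then computes the length of the flattened vector. The layer sizes and the helpers `conv_out` and `pool_out` are assumptions for illustration; the convolutions use stride 1 with no padding and the pooling windows do not overlap.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Spatial output size of non-overlapping pooling along one dimension."""
    return size // window


# Hypothetical stack: (layer type, kernel or window size, output channels).
layers = [("conv", 5, 6), ("pool", 2, None), ("conv", 5, 16), ("pool", 2, None)]

size, channels = 32, 1  # assumed 32x32 single-channel input
shapes = []
for kind, k, out_channels in layers:
    if kind == "conv":
        size = conv_out(size, k)
        channels = out_channels  # one feature map per filter
    else:
        size = pool_out(size, k)  # pooling keeps the channel count
    shapes.append((kind, size, size, channels))

flattened = size * size * channels  # length of the 1D vector fed to the dense layers

for shape in shapes:
    print(shape)
print(flattened)  # 400
```

The spatial size shrinks from 32 to 5 while the number of feature maps grows from 1 to 16, and the flatten layer turns the final 5x5x16 volume into a vector of 400 values.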

Activation Function

A non-linear activation function is applied to the output of a convolution or pooling layer. Sigmoid neurons were used in the original CNN paper, but Glorot et al. (2011) showed that rectifying neurons perform better. Two common rectifying activation functions are ReLU (Rectified Linear Unit) and Leaky ReLU. Rectifiers also help counteract the vanishing gradient effect in deep CNNs.
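The vanishing gradient effect can be seen by multiplying the activation derivatives along a chain of layers. The sketch below is a simplified illustration that ignores the weights and looks only at the activation derivatives; the depth of 10 layers and the leaky slope of 0.01 are assumed values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never larger than 0.25


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope


depth = 10  # assumed number of stacked layers

# Best case for the sigmoid: every unit sits at x = 0, where the derivative peaks.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# Active rectifier units pass the gradient through unchanged.
relu_chain = relu_grad(1.0) ** depth

print(sigmoid_chain)     # about 9.5e-07
print(relu_chain)        # 1.0
print(leaky_relu(-2.0))  # -0.02
```

Even in its best case the sigmoid shrinks the gradient by a factor of four per layer, while an active ReLU unit leaves it intact. Leaky ReLU additionally keeps a small gradient for negative inputs, where plain ReLU has none.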

Input Layer

While the CNN architecture is best known for its application in image recognition, it can also be generalised to many other data forms and tasks, such as text classification for sentiment analysis and time series forecasting. It is important to understand that the input must be mapped onto dimensions in which the position of each data point matters. Imagine a 2-dimensional customer sales table in which each row is a customer ID and the columns are sales amount, sales date and channel. A CNN is not suitable for this dataset, because the rows can be re-ordered and the columns swapped without changing the content of the dataset.
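A small experiment shows why position matters. The sketch below applies the same difference kernel to a hypothetical time series and to a shuffled copy of it; the data, the kernel and the function name `conv1d_valid` are assumptions for illustration.

```python
def conv1d_valid(signal, kernel):
    """Apply one shared kernel at every position of a 1D signal (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


difference_kernel = [-1, 1]  # responds to the change between neighbouring values

series = [1, 2, 3, 4, 5]     # ordered time series with a steady upward trend
shuffled = [3, 1, 5, 2, 4]   # same values, order destroyed

print(conv1d_valid(series, difference_kernel))    # [1, 1, 1, 1]
print(conv1d_valid(shuffled, difference_kernel))  # [-2, 4, -3, 2]
print(sum(series) == sum(shuffled))               # True
```

The convolution detects the steady trend only when neighbouring values are really neighbours in time. For data such as the customer table, where any row or column order is equally valid, the output of a convolution would depend on an arbitrary ordering, while order-free statistics such as the sum are unaffected.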

CNN Networks

Since AlexNet won the ILSVRC in 2012, many other CNN-based networks have been created. Some of these networks are widely adopted for image classification and feature extraction. AlexNet (Krizhevsky et al., 2017), Inception Net (Szegedy et al., 2015) and ResNet (He et al., 2016) are three of the landmarks.

AlexNet (ILSVRC top-5 error rate 15.3%)

AlexNet is one of the first and most successful adoptions of CNNs trained on graphics processors (Raina et al., 2009). It consists of 8 layers: 5 convolutional layers followed by 3 fully connected layers. AlexNet starts with a kernel size of 11 x 11, which is large compared to many later networks (2 x 2 or 3 x 3).

Inception Net (ILSVRC top-5 error rate 6.66%)

Inception Net (GoogLeNet) won the ILSVRC in 2014. Most CNNs prior to Inception Net focused on building deeper networks. Inception instead focused on building a “wider” network to improve the utilization of computing resources inside the network. It used 12 times fewer parameters than AlexNet while being significantly more accurate. Inception Net used many 1×1 convolutions to remove computational bottlenecks. For the detection task, an approach similar to R-CNN (region-based CNN) was adopted.
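The saving from a 1×1 convolution can be estimated by counting weights. In the sketch below, a 5×5 convolution is applied either directly or after a 1×1 convolution that first reduces the number of channels. The channel counts (192 in, 16 after reduction, 32 out) and the helper `conv_weights` are illustrative assumptions, and biases are ignored.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


in_channels, reduced, out_channels = 192, 16, 32  # hypothetical channel counts

# 5x5 convolution applied directly to all input channels.
direct = conv_weights(5, in_channels, out_channels)

# 1x1 convolution shrinks the channel count first, then the 5x5 convolution runs.
bottleneck = (conv_weights(1, in_channels, reduced)
              + conv_weights(5, reduced, out_channels))

print(direct)      # 153600
print(bottleneck)  # 15872
```

With these numbers the bottleneck version needs roughly a tenth of the weights, and the number of multiplications per output position falls by the same proportion.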

ResNet (ILSVRC top-5 error rate 3.57%)

ResNet, a very deep CNN with 152 layers, won the ILSVRC in 2015. One of the key challenges with deeper CNNs was the degradation problem, where adding more layers causes a decline in network performance. ResNet effectively overcame the degradation problem by introducing the residual learning framework, in which shortcut connections skip one or more layers and add the input of a block to its output. ResNet was also combined with Faster R-CNN to improve object detection accuracy.
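The shortcut connection can be sketched with small vectors instead of feature maps. This is a simplified illustration under stated assumptions: the block uses two small fully connected transforms in place of convolutions, omits the activation that usually follows the addition, and the names `linear`, `plain_block` and `residual_block` are made up for the example.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def plain_block(x, w1, w2):
    """Two stacked layers without a shortcut: the output is F(x)."""
    return linear(w2, relu(linear(w1, x)))


def residual_block(x, w1, w2):
    """Same layers with a shortcut connection: the output is F(x) + x."""
    residual = plain_block(x, w1, w2)
    return [r + xi for r, xi in zip(residual, x)]


x = [1.0, -2.0, 3.0]
zero = [[0.0] * 3 for _ in range(3)]  # weights of layers that have learned nothing

print(plain_block(x, zero, zero))     # [0.0, 0.0, 0.0]
print(residual_block(x, zero, zero))  # [1.0, -2.0, 3.0]
```

When the extra layers contribute nothing, the plain block destroys its input while the residual block simply passes it through. This makes it easy for added layers to do no harm, which is the intuition behind residual learning as a remedy for the degradation problem.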

Challenges and Future of CNN

In a more recent paper about forecasting convolutional features, Sun et al. (2019) argue that anticipating future events is a key factor in developing intelligent behavior, which is crucial in autonomous driving. The paper states that semantic-level prediction is more effective than RGB value prediction. An additional level of detail is added to semantic segmentation to provide instance segmentation.

Current CNNs are also mostly trained in a supervised manner, which requires heavy labeling. Reinforcement learning, which learns from reward signals rather than labeled examples, has been able to achieve incredible results in chess, Go and computer games, but it relies heavily on cheap simulations.

In parallel, Sabour et al. (2017) are taking a very different approach to image recognition using ‘capsules’. It is believed that this approach is closer to human vision. While a CNN is extremely effective in image recognition, it does not generalise very well to scaled and rotated items, e.g. under changing viewpoints. In contrast, a capsule is a group of neurons that together can represent an object. A higher-level capsule can also contain lower-level capsules within the same image; for example, a nose and a mouth together make up a face.


Over the past 10 years, CNNs have contributed to great success in computer vision and many other areas. Researchers have since overcome many of the original limitations of CNNs. However, some challenges remain while new ones have surfaced. Given the rapid increase in public interest and better funding from governments and large corporations, deep learning and CNNs will continue to see exciting developments in the years to come.

References


Dai, J., He, K., & Sun, J. (2016). Instance-Aware Semantic Segmentation via Multi-task Network Cascades. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-December, 3150–3158.

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202.

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (n.d.). Rich feature hierarchies for accurate object detection and semantic segmentation Tech report (v5). Retrieved June 5, 2020, from˜rbg/rcnn.

Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-December, 770–778.

Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex. The Journal of Physiology, 148(3), 574–591.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551.

LeCun, Yann, Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2323.

Raina, R., Madhavan, A., & Ng, A. Y. (2009). Large-scale deep unsupervised learning using graphics processors. ACM International Conference Proceeding Series, 382, 1–8.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.

Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic Routing Between Capsules. Advances in Neural Information Processing Systems, 2017-December, 3857–3867.

Sun, J., Lin, Z., Xie, J., Lai, J., Zheng, W. S., Hu, J. F., & Zeng, W. (2019). Predicting future instance segmentation with contextual pyramid ConvLSTMs. MM 2019: Proceedings of the 27th ACM International Conference on Multimedia, 2043–2051.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328–339.