Convolutional Neural Networks: Part 1



Basic definitions, introduction to several types of CNNs, Practical tips

This post assumes basic knowledge of Artificial Neural Network (ANN) architecture, also called fully connected networks (FCN). I originally made these notes for myself. They should benefit others who have already taken Course 4 of the Deep Learning Specialization and want to brush up quickly before interviews, or who need help with theory when they get stuck during development. The post is not meant to cover everything from scratch, so for someone who has not taken the course the content might look daunting, and it might scare them away from Deep Learning. My suggestion is not to read beyond question 2 if you haven’t taken the course.

  1. What are the issues with conventional image processing techniques? For decades, computer scientists have tried to solve so-called Computer Vision problems, such as detecting objects, recognizing them and bounding them with boxes, using algorithms like SIFT, SURF and BRIEF, to name a few [1]. These algorithms are limited in accuracy and robustness. A detailed analysis is out of scope for this post. Convolutional Neural Networks (CNNs) are a type of Deep Neural Network well suited to computer vision problems.
  2. What is a Convolutional Neural Network (CNN)? Architecture-wise, a CNN consists of one or more sets of “convolution filter” layers (the mathematically precise term is correlation filters), each typically followed by “max pooling”, and finally one or more fully connected (FCN) layers plus an output layer. CNNs are typically used for image processing.
  3. What are filter, pad and stride? filter is the size of the convolution filter. pad is the number of rows and columns of zeros added on each side of the input image so that the output height and width come out at the required size (the output need not be the same size as the input; that depends on the pad value). If the padding is chosen so that the output height and width equal those of the input, it is called a “SAME” convolution; with no padding at all, it is a “VALID” convolution. stride is the shift applied when sliding the convolution filter to compute the next value.
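
The relationship between filter size, pad and stride can be made concrete with the standard output-size formula, floor((n + 2p - f) / s) + 1, for an n×n input, an f×f filter, pad p and stride s. The sketch below is only illustrative; the function names `conv_output_size` and `same_padding` are my own and are not taken from the course.

```python
def conv_output_size(n, f, pad=0, stride=1):
    """Output height (or width) for an n-wide input and an f-wide filter."""
    return (n + 2 * pad - f) // stride + 1


def same_padding(f):
    """Pad that keeps output size equal to input size (stride 1, odd f)."""
    if f % 2 == 0:
        raise ValueError("this sketch assumes an odd filter size")
    return (f - 1) // 2


# "Valid" convolution: no padding, so a 6x6 input with a 3x3 filter shrinks.
valid_out = conv_output_size(6, 3)
# "Same" convolution: pad so that the output size equals the input size.
same_out = conv_output_size(6, 3, pad=same_padding(3))
# Stride 2 on a 7x7 input with a 3x3 filter and no padding.
strided_out = conv_output_size(7, 3, stride=2)

print(valid_out, same_out, strided_out)  # 4 6 3
```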

4. What happens to the height, width and number of channels of convolutional layers deeper inside the network? Height and width typically decrease, while the number of channels increases.
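
To see this trend in numbers, the sketch below traces shapes through a small LeNet-5-style stack (32×32×1 input, 5×5 valid convolutions, 2×2 pooling with stride 2). The layer list is an assumption chosen for illustration, and `trace_shapes` is a hypothetical helper, not something from the course.

```python
def out_size(n, f, stride=1, pad=0):
    """Standard conv/pool output size: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * pad - f) // stride + 1


def trace_shapes(input_shape, layers):
    """Return the (height, width, channels) shape after each layer.

    Each layer is a tuple (kind, f, stride, n_filters). Pooling layers keep
    the channel count, so their n_filters entry is None.
    """
    h, w, c = input_shape
    shapes = [(h, w, c)]
    for kind, f, stride, n_filters in layers:
        h = out_size(h, f, stride)
        w = out_size(w, f, stride)
        if kind == "conv":
            c = n_filters  # one output channel per filter
        shapes.append((h, w, c))
    return shapes


# Assumed LeNet-5-style layer list (illustrative only).
layers = [
    ("conv", 5, 1, 6),
    ("pool", 2, 2, None),
    ("conv", 5, 1, 16),
    ("pool", 2, 2, None),
]

shapes = trace_shapes((32, 32, 1), layers)
for shape in shapes:
    print(shape)
```

Height and width go 32, 28, 14, 10, 5 while the channel count goes 1, 6, 6, 16, 16.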

5. Why CNN over FCNs?

  • Parameter sharing: A feature detector (such as a vertical edge detector) that is useful in one part of the image is probably useful in another part of the image too. This also contributes to translation invariance.
  • Sparsity of connections: In each layer, each output value depends only on a small number of input values. Together with parameter sharing, this keeps the number of parameters small, which allows training with smaller training sets and makes the network less prone to overfitting.
Screenshot from Week 1 of Course 4 of Deep Learning Specialization by Coursera
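
A rough parameter count shows why these two properties matter. The sketch compares a fully connected layer with a convolutional layer that both map a 32×32×3 input to a 28×28×6 output. These sizes are illustrative assumptions, not numbers stated in this post, and the helper names are my own.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv_params(f, c_in, n_filters):
    """Each filter has f*f*c_in weights and one bias, shared across positions."""
    return (f * f * c_in + 1) * n_filters


n_in = 32 * 32 * 3   # flattened input volume
n_out = 28 * 28 * 6  # flattened output volume

fc_count = dense_params(n_in, n_out)
conv_count = conv_params(f=5, c_in=3, n_filters=6)

print(f"fully connected: {fc_count:,} parameters")
print(f"convolutional:   {conv_count:,} parameters")
```

The fully connected layer needs about 14 million parameters, while six 5×5 filters need only 456.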

6. What is LeNet-5? Refer to [2]. The pooling layers are called subsampling layers in this paper. The parts of the paper that deal with the computational limitations of the time are no longer relevant. Focus on sections 2 and 3. Section 1 of the paper also addresses the first question in this article.

Screen shot from original paper describing LeNet5.

7. What is AlexNet? Refer to [3]. The discussion of splitting the network across multiple GPUs is not relevant anymore. This paper is an easier read. The network is much bigger than LeNet, and it convinced many people to take a serious look at DNNs.

Screenshot of Alexnet from course slides

8. What is the VGG-16 net? Refer to [4]. It is a simple, relatively uniform architecture: 3×3 conv filters with “same” convolution and stride = 1, and 2×2 max-pool with stride = 2. The suffix 16 comes from its 16 layers with trainable weights. It has about 138M parameters. The number of filters doubles after each pooling stage (64, 128, 256, 512) and then stays at 512. VGG-19 is an even bigger version.

Screenshot of VGG-16 from course slides
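
The 16 weight layers and the roughly 138M parameters can be checked with a few lines of arithmetic. This sketch assumes the standard ImageNet setup from the paper in Ref [4] (224×224×3 input, 1000 output classes); the constant and function names are my own.

```python
# Numbers are conv filter counts; "M" marks a 2x2 max-pool with stride 2.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]


def count_vgg16_params(input_shape=(224, 224, 3)):
    """Return (total parameters, number of layers with weights)."""
    h, w, c = input_shape
    total = 0
    weight_layers = 0
    for item in VGG16_CONV:
        if item == "M":
            h, w = h // 2, w // 2            # pooling halves height and width
        else:
            total += (3 * 3 * c + 1) * item  # 3x3 "same" conv keeps h and w
            c = item
            weight_layers += 1
    n_in = h * w * c                         # flatten before the FC layers
    for n_out in VGG16_FC:
        total += n_in * n_out + n_out
        n_in = n_out
        weight_layers += 1
    return total, weight_layers


total_params, n_layers = count_vgg16_params()
print(f"{total_params:,} parameters in {n_layers} weight layers")
```

Most of the parameters sit in the first fully connected layer, which takes the flattened 7×7×512 volume as input.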

9. What are ResNets and why do they work? Refer to [5]. ResNets are networks with a shortcut (a.k.a. skip connection) from one layer to a later layer that is not immediately next to it. The shortcut can go from layer L to layer L+2, L+3, or even the output layer. Because the shortcut carries the input forward unchanged, a block can easily learn the identity function, so adding layers does not have to hurt training performance. The figure below shows how training error changes in normal (plain) networks versus ResNets as the number of layers goes into the 100s. The skip connection helps with the problem of vanishing or exploding gradients and lets learning continue in very deep networks. (These problems are covered in Course 2 of the specialization.)

Screenshot for Resnets from course slides
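
A minimal NumPy sketch of a single residual block may help here. It follows the form a[l+2] = g(z[l+2] + a[l]) as I recall it from the course, but uses dense layers instead of convolutions to stay short; that simplification and the name `residual_block` are my own. With all weights set to zero the block simply passes its input through, which is the "easy to learn identity" argument.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(a, W1, b1, W2, b2):
    """Two dense layers with a skip connection around them.

    The block input `a` is added to the second layer's pre-activation,
    so the output is relu(z2 + a).
    """
    a1 = relu(W1 @ a + b1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a)  # the shortcut adds the block input back in


rng = np.random.default_rng(0)
a = relu(rng.normal(size=4))  # activations coming out of a ReLU are >= 0

# With zero weights and biases the two layers contribute nothing,
# and the block reduces to the identity on its (non-negative) input.
W_zero = np.zeros((4, 4))
b_zero = np.zeros(4)
out = residual_block(a, W_zero, b_zero, W_zero, b_zero)

print(np.allclose(out, a))  # True
```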

10. What is 1×1 convolution? What is the benefit of using it? In 2-D, a 1×1 convolution simply multiplies each pixel of an image by a constant. In 3-D, a 1×1×16 filter converts a 32×32×16 volume into a 32×32×1 layer: at each pixel, the 16 channel values are combined in a weighted sum into one real number, and then ReLU is applied. This is also called network in network (Ref [6]). With fewer filters than input channels, it can be used to reduce the number of channels (pooling layers shrink only height and width). The added non-linearity also allows the network to learn more complex functions.
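
In code, a 1×1 convolution is just a small fully connected layer applied independently at every pixel. The NumPy sketch below assumes a channels-last layout (H, W, C); the function name `conv1x1` and the random weights are purely illustrative.

```python
import numpy as np


def conv1x1(x, w, b):
    """1x1 convolution followed by ReLU.

    x: (H, W, C_in) input volume
    w: (C_in, C_out) weights, one column per 1x1 filter
    b: (C_out,) biases
    """
    z = np.einsum("hwc,ck->hwk", x, w) + b  # weighted sum over channels
    return np.maximum(z, 0.0)


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32, 16))

# One 1x1x16 filter: each pixel's 16 channel values become one number.
w_single = rng.normal(size=(16, 1))
single = conv1x1(x, w_single, np.zeros(1))

# Four filters: the channel count shrinks from 16 to 4, H and W are untouched.
w_reduce = rng.normal(size=(16, 4))
reduced = conv1x1(x, w_reduce, np.zeros(4))

print(single.shape, reduced.shape)  # (32, 32, 1) (32, 32, 4)
```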

11. What are inception networks? Refer to [7]. The name “inception” comes from a scene/quote in the movie Inception (check the course for details)! When you are not sure which filter size to use, you can apply them all in parallel, stack their outputs along the channel dimension, and let the network learn which ones matter. The pooling branch is somewhat unusual: it uses “same” padding with stride = 1 so that its output height and width match the other branches.

Screen shot of inception network from course slide
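
The stacking step can be sketched in NumPy. The example below implements only the stride-1, same-padded max pooling and uses zero-filled placeholder arrays in place of the real convolution branches. The input size, the branch channel counts and the name `max_pool_same` are all assumptions made for illustration.

```python
import numpy as np


def max_pool_same(x, f=3):
    """Max pooling with stride 1 and "same" padding on an (H, W, C) volume."""
    p = f // 2
    h, w, _ = x.shape
    # Pad with -inf so the padded border never wins the max.
    padded = np.pad(x, ((p, p), (p, p), (0, 0)),
                    mode="constant", constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + f, j:j + f].max(axis=(0, 1))
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))

# Placeholders for the outputs of the 1x1, 3x3 and 5x5 "same" conv branches.
branch_1x1 = np.zeros((8, 8, 2))
branch_3x3 = np.zeros((8, 8, 3))
branch_5x5 = np.zeros((8, 8, 1))
branch_pool = max_pool_same(x)  # height and width stay 8x8

# All branches share height and width, so they can be stacked channel-wise.
stacked = np.concatenate(
    [branch_1x1, branch_3x3, branch_5x5, branch_pool], axis=-1)

print(branch_pool.shape, stacked.shape)  # (8, 8, 4) (8, 8, 10)
```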

12. What is the issue with inception networks? Computation cost. But it can be nicely solved with 1×1 convolution!

Does using 1×1 convolution to shrink the representation hurt performance? Apparently not, as long as the reduction stays within a reasonable limit.

Inception network screenshot from course
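
The saving from a 1×1 "bottleneck" is easy to estimate by counting multiplications. The sizes below (a 28×28×192 input, 32 filters of size 5×5, and a 16-channel bottleneck) are illustrative assumptions rather than figures quoted in this post, and `conv_mults` is a hypothetical helper.

```python
def conv_mults(h_out, w_out, n_filters, f, c_in):
    """Multiplications in a conv layer: one f*f*c_in dot product per output value."""
    return h_out * w_out * n_filters * f * f * c_in


# Direct 5x5 "same" convolution: 28x28x192 -> 28x28x32.
direct = conv_mults(28, 28, 32, 5, 192)

# Bottleneck: 1x1 conv down to 16 channels, then the 5x5 conv up to 32.
bottleneck = conv_mults(28, 28, 16, 1, 192) + conv_mults(28, 28, 32, 5, 16)

print(f"direct:     {direct:,}")
print(f"bottleneck: {bottleneck:,}")
```

Under these assumptions the direct version costs about 120M multiplications and the bottleneck version about 12.4M, a saving of almost 10x for the same output shape.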

13. When you come across a business problem related to object detection, how do you approach it technically? Always try to build on open source. When using open source, pick a project that is well maintained. Download the pretrained weights too, not just the code. Use the principles of Transfer Learning to get to a solution: remove the original softmax layer, freeze the earlier layers, and train your own softmax layer on your classes. Only a small training set is required to do this. If you have a larger training set, you can freeze fewer layers and train the last few layers as well. You can apply data augmentation to increase the number of training samples. Data augmentation means applying color shifting, mirroring, translation, rotation, etc.
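
Three of the augmentations mentioned above can be sketched in plain NumPy. This is a toy illustration on a random uint8 image in (H, W, C) layout; the function names and the shift values are my own choices, and a real project would more likely use the augmentation utilities of its deep learning framework.

```python
import numpy as np


def mirror(image):
    """Horizontal flip: reverse the width axis of an (H, W, C) image."""
    return image[:, ::-1, :]


def color_shift(image, shift):
    """Add a per-channel offset and clip back to the valid 0..255 range."""
    shifted = image.astype(np.int16) + np.asarray(shift, dtype=np.int16)
    return np.clip(shifted, 0, 255).astype(np.uint8)


def translate(image, dx):
    """Shift the image horizontally by dx pixels, filling the gap with zeros."""
    out = np.zeros_like(image)
    if dx > 0:
        out[:, dx:, :] = image[:, :-dx, :]
    elif dx < 0:
        out[:, :dx, :] = image[:, -dx:, :]
    else:
        out[...] = image
    return out


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

# Each call yields a new training sample that keeps the original label.
augmented = [
    mirror(image),
    color_shift(image, (20, -20, 20)),
    translate(image, 2),
]
print([a.shape for a in augmented])
```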

Finally, Prof. Andrew Ng talks about the state of the art in computer vision.

The course includes an ungraded introductory notebook for trying out the Keras library, and after that all the programming assignments of this course have to be done with Keras. To get a better understanding of Keras, I went through the article in Ref [8].

This post has covered the first two weeks of material, which are the fundamentals. Part 2 covers the actual practical problems that CNNs are trying to solve.

Please note that as per Coursera Honor Code, I am not checking in my completed notebooks. Hence proof of this work will not be available.

A HUGE THANKS TO PROF ANDREW NG FOR CREATING THE SPECIALIZATION AND ALL THE MENTORS AND COMMUNITY MEMBERS OF COURSERA!!!

References cited in the course so far (and hence in this post):

  1. https://www.cs.swarthmore.edu/~meeden/cs81/f15/papers/Andy.pdf
  2. http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf
  3. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
  4. https://arxiv.org/abs/1409.1556
  5. https://arxiv.org/abs/1512.03385
  6. https://arxiv.org/abs/1312.4400
  7. https://arxiv.org/abs/1409.4842
  8. https://machinelearningmastery.com/keras-functional-api-deep-learning/

Source: Deep Learning on Medium