Original article was published by Anantharaman Palacode Narayana Iyer on Deep Learning on Medium
Why CNN for Computer Vision?
Convolutional Neural Networks (CNNs) have been very successful at the core tasks of computer vision, including Image Classification, Localization, Object Recognition, Segmentation and Image Generation. A CNN is a deep network with an arbitrary number of layers in which the convolution layer acts as the “Feature Extractor”. Convolution is a mathematical operation dealt with in detail in the subject of Signals and Systems, a part of Digital Signal Processing. CNN architectures are easy to implement with a few lines of code in PyTorch, where the convolution operation is available as a layer, for example nn.Conv2d.
Architectures based on the principle of convolution have shown that it is possible to reach, or even exceed, human-level accuracy for some specific computer vision tasks on datasets that have a bounded data distribution. One example is the ImageNet classification task, which assigns a given input image to one of 1000 classes.
Why is the Convolutional Neural Network (CNN) the go-to architecture for Computer Vision applications? What lets it achieve such high accuracy?
In this article, we dive deep into this topic and provide some insights. To understand why CNNs are the “go-to” architectures for a certain class of applications, mainly in the domain of Computer Vision, it is essential to understand the principle of convolution.
In a follow-up article we explain the theoretical underpinnings of convolution from the Signals and Systems perspective.
Signals: A digital signal is a representation of some information of interest in a digital form that can be processed by a digital computer. Thus, a JPG or PNG image is a digital signal. This definition allows us to treat artifacts from other modalities, such as text and speech, as digital signals too. For instance, a tweet or a Facebook post from a plain text or JSON file can be viewed as a digital signal. In this article, we consider the inputs to be images (in some cases videos) and treat an image file as an instance of a digital signal.
System: The algorithm or the machine learning model that processes this input to generate the output is called the system. More on these is covered in Part 2.
In this article and subsequent ones we restrict our discussions primarily to images as signals though the concepts and theory apply equally to other types of inputs.
Convolutional Layer as the Feature Extractor System
Deep learning architectures based on CNNs start from the raw pixels of a given image, transform them through multiple stacked layers and finally produce the required outputs. Each transformation through a convolution layer (e.g. nn.Conv2d in PyTorch) corresponds to extracting features from its inputs. Hence the whole network can be thought of as a sequence of transformations where the features are progressively extracted to form a hierarchy of features, see Fig 1 below.
The convolution layer is very effective as a feature extractor when processing images for the following reasons:
1. Color, Structure and the dimensionality of inputs
Images have a two-dimensional spatial structure, where a pixel is a scalar number for a grayscale image and has a depth of 3 (Red, Green, Blue) for color images. Thus, one could define an image as a function of x, y coordinates in a geometric plane, with a depth dimension of 3 for color images. As the objects are represented by the values of pixels in this geometry, it is very important to preserve the structure while processing them.
In Fig 2 below, an object of interest, the bird, is represented by the values of pixels (R, G, B) on a 2D plane. The shape and color of its body, beak, feathers, head and so on provide the distinguishing features that help us identify the bird. Fig 3 is a grayscale image of the same bird and Fig 4 shows the same bird with the color channels swapped. For this classification task, color itself serves as an important feature that needs to be extracted.
Convolution layers support multidimensional inputs and can preserve the spatial structure, while a traditional fully connected neural network requires a vector as input and is not sensitive to the structure.
For images such as the one above, the input is a 3D volume. Looking at the way convolution is defined, it is easy to see that the inputs and outputs can be defined for arbitrary dimensions.
The equations in Fig 5 define the convolution in 1-D and 2-D and this can be generalized for higher dimensions. This suggests that the same concepts work well for videos that are 4-D and other input types such as speech signals, medical images, text and so on.
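Fig 5 is not reproduced in this text. As a point of reference, the standard discrete definitions of convolution in 1-D and 2-D can be sketched as follows. The symbols x, h, I, K and the index names are notation chosen here for illustration and may differ from the figure.

```latex
% 1-D discrete convolution of a signal x with a filter h
(x * h)[n] = \sum_{m=-\infty}^{\infty} x[m]\, h[n - m]

% 2-D discrete convolution of an image I with a kernel K
(I * K)[i, j] = \sum_{m}\sum_{n} I[m, n]\, K[i - m,\, j - n]
```

Adding one more summation index per extra dimension gives the higher-dimensional forms. Deep learning libraries commonly implement the closely related cross-correlation, in which the kernel is not flipped; since the weights are learnt, the distinction rarely matters in practice.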
If we want to process such input types with a traditional fully connected neural network, we have to flatten the multidimensional tensor inputs into a vector (1-D tensor) in row major or column major order, which results in a loss of spatial structure. Since the pixels in an image are correlated spatially and the features capture this correlation, losing the spatial structure by flattening might result in suboptimal performance of the network. One might argue that the neural network could still learn these correlations, because the mapping between pixel positions in the 2D form and the 1D form is fixed and computable. But this requires a fully connected architecture with a huge number of parameters, resulting in inefficiency.
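A minimal sketch of what row major flattening does to neighbouring pixels, assuming a tiny 4x5 grayscale image. The helper name flat_index and the sizes are hypothetical, chosen only for illustration.

```python
# Row-major flattening of an H x W grayscale image (sizes are illustrative).
H, W = 4, 5

def flat_index(row, col, width=W):
    """Position of pixel (row, col) after row-major flattening."""
    return row * width + col

# Two pairs of pixels that are direct neighbours in the 2-D image.
horizontal_gap = flat_index(1, 3) - flat_index(1, 2)  # same row, adjacent columns
vertical_gap = flat_index(2, 2) - flat_index(1, 2)    # same column, adjacent rows

print(horizontal_gap)  # 1: still adjacent in the vector
print(vertical_gap)    # 5: now W positions apart
```

Vertical neighbours end up a full row width apart in the vector, so a fully connected layer has no built-in notion that they were adjacent in the image.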
For some classes of applications, such as Optical Character Recognition, the predicted outputs are invariant to color. The image in Fig 6 is an example of such an application, where color plays a minimal role. Our company JNResearch Labs LLP builds innovative products in this space where we extract the text and the semantic structure from business documents. Once we make our website public you can see several examples and technical content.
2. Pixels are correlated
If the pixels in an image were not correlated, our comprehension of the image would not change if we moved the pixels around at random. Fig 7 shows the same image of the bird as in Fig 2, where we have swapped each pixel with another pixel from a randomly chosen location. While we comprehend Fig 2 as a bird, we cannot figure out what the contents of Fig 7 are, even though Fig 7 has exactly the same pixels as Fig 2, only ordered in a random way. This illustrates the fact that the pixels in an image are correlated.
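The shuffling experiment can be sketched numerically. The snippet below is an illustration under stated assumptions, not the procedure used to produce Fig 7: a smooth synthetic gradient stands in for the bird image, a random permutation stands in for the pixel swaps, and the helper neighbour_correlation is a hypothetical name.

```python
import random

def neighbour_correlation(img):
    """Pearson correlation between each pixel and its right-hand neighbour."""
    xs = [row[c] for row in img for c in range(len(row) - 1)]
    ys = [row[c + 1] for row in img for c in range(len(row) - 1)]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

size = 32
# Synthetic stand-in for a natural image: a smooth diagonal gradient.
image = [[r + c for c in range(size)] for r in range(size)]

# Shuffle all pixel positions while keeping the pixel values themselves.
pixels = [p for row in image for p in row]
random.Random(0).shuffle(pixels)
shuffled = [pixels[r * size:(r + 1) * size] for r in range(size)]

same_pixels = sorted(pixels) == sorted(p for row in image for p in row)
corr_original = neighbour_correlation(image)
corr_shuffled = neighbour_correlation(shuffled)
print(same_pixels, round(corr_original, 3), round(corr_shuffled, 3))
```

Both images contain exactly the same pixel values, yet the correlation between neighbouring pixels is close to 1 in the original and close to 0 after shuffling.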
The nature of the correlation between pixels provides valuable features. Convolutions are great at extracting these correlations, and multiple convolutional filters are used to extract multiple such features.
Convolution filters can learn these correlations and hence can detect edges, color information, blurriness and so on, all of which are different types of correlations relevant to the application as represented in the training dataset.
3. Filters are efficient and learnable
Suppose we are to perform classification on an image of 100x100x3 dimensions, and we implement the model as a feed forward neural network with an input layer, one hidden layer and an output layer, where hidden units (nh) = 1000 and output classes = 10. The number of parameters (ignoring biases) can be computed as:
- Input size = 10k pixels * 3 channels = 30k values,
- Input-to-hidden weight matrix = 1k * 30k = 30M and
- Hidden-to-output weight matrix = 10 * 1000 = 10k
Thus the number of parameters for such a design would exceed 30M.
Importantly, the number of parameters depends on the size of the input.
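The arithmetic above can be checked with a short sketch. The function name fc_weight_count is hypothetical, biases are ignored as an assumption, and the 200x200 input is included only as an illustrative comparison.

```python
def fc_weight_count(height, width, channels, hidden_units, num_classes):
    """Weights in a one-hidden-layer fully connected network (biases ignored)."""
    input_size = height * width * channels
    input_to_hidden = input_size * hidden_units
    hidden_to_output = hidden_units * num_classes
    return input_to_hidden + hidden_to_output

params_100 = fc_weight_count(100, 100, 3, hidden_units=1000, num_classes=10)
params_200 = fc_weight_count(200, 200, 3, hidden_units=1000, num_classes=10)
print(f"{params_100:,}")  # 30,010,000
print(f"{params_200:,}")  # 120,010,000
```

Doubling the side length of the image roughly quadruples the weight count, which is the dependence on input size noted above.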
We make 3 key observations from the example above:
- Even to process a 100×100 image, we run into tens of millions of parameters if we use a fully connected architecture
- The number of parameters depends on the input image size
- The input dimensions are fixed for this architecture. For instance, if we change the input size to 200×200, this model needs to be retrained, as the number of parameters would change.
Each of the above three observations is adequate for us to conclude that it is less attractive to use a fully connected neural network for computer vision tasks. Convolutional neural networks do not suffer from the above limitations as we see below.
The process of convolution uses small filters (also known as kernels) that slide over the input. An activation is computed by placing the filter over a patch of the input image with the same dimensions as the filter and taking the dot product between the filter weights and the pixel values. Once the activation for a given position is found this way, the filter is moved to the next position, with the step size set by the stride.
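The sliding procedure just described can be sketched in plain Python for a single-channel image. This is a dependency-free illustration, not how PyTorch implements nn.Conv2d; the function name conv2d_single_channel, the 4x4 image and the 2x2 weights are assumptions made for the example, and no padding is applied.

```python
def conv2d_single_channel(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the activation map."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (img_h - k_h) // stride + 1
    out_w = (img_w - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product between the filter weights and the patch under it.
            activation = sum(
                kernel[a][b] * image[top + a][left + b]
                for a in range(k_h)
                for b in range(k_w)
            )
            row.append(activation)
        output.append(row)
    return output

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [[1, 0], [0, -1]]  # illustrative 2x2 filter weights

stride_1 = conv2d_single_channel(image, kernel, stride=1)
stride_2 = conv2d_single_channel(image, kernel, stride=2)
print(stride_1)  # [[-5, -5, -5], [-5, -5, -5], [-5, -5, -5]]
print(stride_2)  # [[-5, -5], [-5, -5]]
```

A larger stride visits fewer positions, so the activation map shrinks from 3x3 to 2x2 while the filter weights stay the same.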
For a CNN, the small sized filter works effectively as a feature extractor as the correlation between the pixels is local and the interaction between pixels that are far away is negligible if not zero.
Small sized filters satisfy the notion of “Local Receptive Fields”
Fig 8 (credit: Deep Learning Book by Goodfellow et al.) illustrates a small kernel of size 2×2 sliding over an image with a stride of 1 along both the X and Y axes. If this is a color image, the number of parameters (weights) contributed by this filter is 2x2x3 = 12.
As we move the filter over the image, the same parameters are reused to compute every activation, leading to a much smaller number of parameters than in a fully connected network.
This also implies that the number of parameters of the CNN model does not depend upon the spatial dimensions of the input. It does depend on the depth of the image, but this is usually fixed at 3 for color images and 1 for grayscale. Hence for all practical purposes it is fair to conclude that the number of parameters depends on the filter size and the number of filters and not on the spatial extent of the input. This is highly advantageous, as the same model can perform prediction on a 100×100 image as well as a 200×200 image so long as the depth remains the same.
Convolution provides the means for processing variable sized inputs
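A small sketch of the counting argument, assuming no padding and ignoring biases. The helper names and the choice of 64 filters are hypothetical and only illustrate the point.

```python
def conv_weight_count(kernel_h, kernel_w, in_channels, num_filters):
    """Weights in one convolution layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * num_filters

def conv_output_size(height, width, kernel_h, kernel_w, stride=1):
    """Spatial size of the activation map when no padding is used."""
    return ((height - kernel_h) // stride + 1, (width - kernel_w) // stride + 1)

# One 2x2 filter over a colour image, as in the Fig 8 discussion.
single_filter = conv_weight_count(2, 2, in_channels=3, num_filters=1)
# A hypothetical layer with 64 such filters.
layer_64 = conv_weight_count(2, 2, in_channels=3, num_filters=64)

print(single_filter)  # 12
print(layer_64)       # 768
print(conv_output_size(100, 100, 2, 2))  # (99, 99)
print(conv_output_size(200, 200, 2, 2))  # (199, 199)
```

The weight count never refers to the image height or width. Only the size of the activation map changes when the input grows from 100×100 to 200×200.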
Through these examples, we saw that the convolution operation is basically a dot product between the filter weights and the image pixel values. This makes the filters “learnable”: the output is a differentiable function of the weights, so the weights can be adjusted by gradient-based training. In fact, it is also possible to hand design the filter coefficients and generate different features of the input image. What makes the CNN work is that these filters are learnt from data.
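As an illustration of a hand-designed filter, the sketch below applies a vertical-edge kernel in the style of the classic Prewitt operator to a toy image. The image, the helper name correlate2d and the kernel choice are assumptions made for this example; a trained CNN would learn its own coefficients.

```python
def correlate2d(image, kernel):
    """Stride-1, no-padding sliding dot product of `kernel` over `image`."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b]
                for a in range(k_h) for b in range(k_w))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Toy image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

# Hand-designed vertical-edge filter: right column minus left column.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

edges = correlate2d(image, vertical_edge)
print(edges)  # [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The response is zero over the flat regions and large where the window straddles the dark-to-bright boundary, so this one filter acts as an edge feature.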
4. Equivariant Representations
Convolutions result in equivariant representations for translations.
The convolution operation is performed between a signal and a system, both of which can be thought of as functions. A function f is said to be equivariant to a function g if f(g(x)) = g(f(x)), that is, applying g before f gives the same result as applying it after. What this property implies can be explained with an example as below.
In the bird classification example that we saw earlier, assume that we shift (translate) the pixels constituting the bird by n positions along the X axis, the Y axis or both. As humans, we will still classify the object as a bird. That is, the location of the bird in the XY plane of the image is immaterial to our prediction. Because the same filter weights are shared across all positions, a shifted bird produces the same features, shifted by the same amount, and this equivariance is what helps a CNN arrive at predictions that are largely invariant to translation.
It is important to note that convolution is not naturally invariant to certain other types of transformations such as scaling and rotation.
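The translation property can be checked on a toy 1-D signal. This is a sketch under stated assumptions: wrap-around (circular) borders are used so that the equality is exact, the signal and kernel values are arbitrary, and the function names are hypothetical.

```python
def circular_correlate(signal, kernel):
    """Sliding dot product with wrap-around at the borders."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(i + k) % n] for k in range(len(kernel)))
        for i in range(n)
    ]

def shift(signal, steps):
    """Translate a signal to the right by `steps` positions, wrapping around."""
    steps %= len(signal)
    return signal[-steps:] + signal[:-steps] if steps else list(signal)

signal = [0, 0, 1, 3, 1, 0, 0, 0]   # a small "object" at positions 2..4
kernel = [1, -1, 2]                 # arbitrary illustrative weights

shift_then_filter = circular_correlate(shift(signal, 3), kernel)
filter_then_shift = shift(circular_correlate(signal, kernel), 3)
print(shift_then_filter == filter_then_shift)  # True

# A global maximum over the feature map is unchanged by the shift.
print(max(shift_then_filter) == max(circular_correlate(signal, kernel)))  # True
```

The first check is equivariance: shifting the input and then filtering gives the same result as filtering and then shifting. The second shows one simple way, assumed here for illustration, in which an equivariant feature map can be reduced to a quantity that does not change under translation.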
5. Parallelization and Hardware Efficiency
Convolutions are performed by aligning the filter with a patch of the image and computing the dot product. We then slide the filter over to the next location. These computations in a given layer of the CNN can be done in parallel, as the computation on one part of an image can be carried out independently of another part. This suggests that parallel computing hardware can provide much higher compute throughput.
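The independence of the per-position computations can be illustrated with a small sketch that hands each filter position to a pool of workers. The image, kernel and function name are assumptions made for the example, and Python threads are used only to show that the result does not depend on the positions being processed one after another; they are not meant to demonstrate a speedup of the kind a GPU provides.

```python
from concurrent.futures import ThreadPoolExecutor

IMAGE = [[(r * 7 + c * 3) % 10 for c in range(8)] for r in range(8)]
KERNEL = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]  # illustrative 3x3 weights

def activation_at(position):
    """Dot product of KERNEL with the patch whose top-left corner is `position`."""
    top, left = position
    return sum(
        KERNEL[a][b] * IMAGE[top + a][left + b]
        for a in range(3) for b in range(3)
    )

out_size = len(IMAGE) - 3 + 1
positions = [(i, j) for i in range(out_size) for j in range(out_size)]

# Sequential sliding window.
sequential = [activation_at(p) for p in positions]

# The same positions handed to a pool of workers; no position depends on another.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(activation_at, positions))

print(sequential == parallel)  # True
```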
There is excellent support for implementing CNNs on GPUs, thanks to the ability to parallelize the computation. Apart from the GPU hardware, there are highly optimized libraries, for instance NVIDIA cuDNN. This compute power and these libraries are also available on modern edge computing devices, making CNNs an ideal architecture for processing images and videos in real time. Use cases such as video surveillance and autonomous vehicles need real-time inference, and CNN implementations on these devices can today deliver high frame rate throughput.
When not to apply CNN?
Convolution uses small filters for efficiency. This also means that the correlations a filter can capture are restricted to nearby pixels. If there are long-range dependencies between the elements of the input, convolution may not be the best choice of model architecture.
The input to the CNN could be anything that has a spatial or temporal structure. In the case of images, pixels are laid out at uniform distances from each other. In time sequences, the elements of the sequence are laid out at uniform time steps. However, if we consider an input like a graph (e.g. the nodes and edges of a social network), the nature of the structure present in the input is different. We cannot assume a node ordering, and the number of neighbors of a given node is arbitrary. For such cases the CNN discussed here is not suitable, and techniques such as Graph Convolutional Neural Networks might perform better.
Thanks for reading this article. I am planning to write a few more on the deeper aspects of applying CNNs and on some recent interesting architectures. If you have any comments or suggestions on topics that you would like covered, please drop a note in the comments section.