Source: Deep Learning on Medium
Receptive Field in Convolutional Neural Networks
The receptive field is one of the most important concepts for understanding Convolutional Neural Networks (CNNs). It is perhaps the key ingredient that allows today’s CNNs to perform so well on a variety of tasks, and several CNN design ideas originated from using receptive fields in better or more optimal ways.
[ ‘I will try to make this blog understandable to everyone, even with no prior experience in Machine Learning or CNNs. Wish me luck :)’ ]
In my understanding, the receptive field is the area of the source image that currently takes part in the convolution at any given layer of the network.
I know I am only talking in the context of images; let images be the primary focus for now.
Before going deep into the topic, let’s clear up some related concepts that will help us understand the receptive field better.
Neural Network: In the context of Machine Learning, it is a series of layers, each connected to the next. Think of a CNN as a chain of layers that pass information from one layer to another. Why the term Neural Network? Because each layer of a CNN can contain many kernels/filters, which can be thought of as nodes. They connect with each other to form a network: each node takes some input, performs an operation, and passes the result to the nodes at the next level, loosely imitating the neurons in our brain’s decision-making process.
Convolution: It has different explanations in different fields of study. In Machine Learning, in its simplest form, it is an element-wise multiplication between a patch of the input (a matrix) and a kernel (just another matrix), followed by summing the products into a single value for the next layer.
Values in green are the source, yellow is the kernel, and pink is the output matrix. Here the kernel has the values [[1, 0, 1], [0, 1, 0], [1, 0, 1]]. So the first multiplication is between [[1, 1, 1], [0, 1, 1], [0, 0, 1]] and [[1, 0, 1], [0, 1, 0], [1, 0, 1]], element-wise, which results in [[1, 0, 1], [0, 1, 0], [0, 0, 1]]. The sum of this matrix is 4, which is the first value of this convolution. Here the term stride comes into the picture: it is nothing but the number of columns (or rows) the kernel moves at each step of the convolution. In the image above you can see the kernel moving by one column across the source image, so the stride here is one.
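To make the multiply-and-sum step concrete, here is a minimal sketch in plain Python. The kernel and the top-left 3×3 patch are the ones from the text; the rest of the 5×5 source is an assumption made up to complete the example, and the function name convolve is just illustrative.

```python
# The kernel and the first 3x3 patch come from the text; the remaining
# source values are an assumption made up to complete a 5x5 example.
source = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]


def convolve(image, kernel, stride=1):
    """Slide a square kernel over a square image (no padding).

    At each position, multiply element-wise and sum into one value.
    """
    k = len(kernel)
    positions = range(0, len(image) - k + 1, stride)
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in positions
        ]
        for r in positions
    ]


output = convolve(source, kernel)
print(output[0][0])                 # 4, the first value worked out in the text
print(len(output), len(output[0]))  # 3 3
```

Passing stride=2 would move the kernel two columns (and rows) at a time, so the same source would give a 2×2 output instead.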
Convolutional Neural Network: It is a series of convolution layers, each connected to the next. One layer receives an input, performs some operation, and passes the result to the next layer. A CNN does have other types of layers too, but we can skip those for now.
Machine Learning: Again, this can have many different definitions, but for our purposes it is a parametric approach to solving a problem. A parametric approach means that, to solve a problem, we don’t write code or set rules by hand; rather, we try to find parameters/values which, when applied with a function, map our input correctly to our output, or at least come close. Suppose x is our input and y is our output; then we want to find W which, when plugged into function f, produces y.
In the above equation, b is another parameter that we want to find in order to do our job successfully. Here these are single-valued elements, but when we talk about images, the input becomes 2- or 3-dimensional, W (the weights) becomes 3-dimensional, and the dimensions of b and y depend on the problem we want to solve.
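As a rough illustration of “finding W and b”, here is a small sketch that assumes the simplest possible form of the function, y = W * x + b, with single-valued x and y. The toy data and the use of gradient descent are my assumptions for the example; the idea is only that the parameters are found from input/output pairs rather than written by hand.

```python
# Toy data generated from the hidden rule y = 2*x + 1 (an assumption for
# illustration; the fitting code only ever sees the pairs, not the rule).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def f(x, W, b):
    """Assumed linear form of the model: y = W * x + b."""
    return W * x + b


def fit(xs, ys, learning_rate=0.05, steps=2000):
    """Find W and b by gradient descent on the mean squared error."""
    W, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [f(x, W, b) - y for x, y in zip(xs, ys)]
        grad_W = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b
    return W, b


W, b = fit(xs, ys)
print(W, b)  # should end up very close to 2 and 1
```

For images the same idea applies, except that W is the set of kernel values in every layer instead of a single number.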
Now we’re ready to understand Receptive Field.
As I said at the beginning, the receptive field is the part of the source that is currently undergoing convolution. As you can see in the convolution image, a 5×5 source convolved with a 3×3 kernel produces a 3×3 output. If you repeat this process in the next layer, the 3×3 matrix convolved with a 3×3 kernel will produce a 1×1 output. So for the first layer the receptive field is 3×3. For the second layer, although the kernel is working on a 3×3 region, the receptive field is 5×5, because that 3×3 input was itself computed from the 5×5 source by the previous layer. That was the case for a two-layer network. Now let’s take an example of a three-layer network.
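The shrinking sizes in the two-layer example (5×5, then 3×3, then 1×1) follow from a simple rule for a convolution without padding. The sketch below assumes square inputs and kernels; output_size is a hypothetical helper name.

```python
def output_size(input_size, kernel_size, stride=1):
    """Side length of the output of a convolution without padding."""
    return (input_size - kernel_size) // stride + 1


size = 5            # the 5x5 source from the example
sizes = [size]
for _ in range(2):  # two layers, each with a 3x3 kernel and stride 1
    size = output_size(size, 3)
    sizes.append(size)

print(sizes)  # [5, 3, 1]
```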
In the above image, the receptive field in the first layer is 3×3 because our kernel is 3×3; in the second layer it is 5×5, and in the third layer it is 7×7. So we have to ask what the dimension of the current area of interest would be when traced back through the previous layers, if any, and that is its receptive field in that layer. If we chose to use a 7×7 kernel in the first layer instead, there would be only one output from this convolution on a 7×7 source.
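The 3×3, 5×5, 7×7 progression can be computed layer by layer. This is a minimal sketch assuming square kernels and no dilation; the function name and the (kernel_size, stride) tuple layout are my own choices for the example.

```python
def receptive_fields(layers):
    """Receptive-field side length after each layer.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    field = 1  # a single source pixel sees only itself
    jump = 1   # distance in source pixels between neighbouring outputs
    result = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        result.append(field)
    return result


three_small = receptive_fields([(3, 1), (3, 1), (3, 1)])
one_large = receptive_fields([(7, 1)])
print(three_small)  # [3, 5, 7]
print(one_large)    # [7]
```

With stride 1 everywhere, each extra 3×3 layer simply adds 2 to the receptive field, which is why three of them reach the same 7×7 area as a single 7×7 kernel.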
This forms the basis for the idea that a convolution with one large kernel covers the same receptive field as convolutions with small kernels stacked in multiple layers. The stacked version, however, gives the benefit of non-linearity from 3 layers, which is required to learn a complex model.
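Here is a small sketch of why the non-linearity matters. Without anything between the layers, two stacked 3×3 convolutions collapse into exactly one 5×5 convolution, so the extra depth adds nothing new. The second kernel, the input values, and the helper names below are made up for the demonstration.

```python
def convolve(image, kernel):
    """Stride-1 convolution without padding (square image and kernel)."""
    k = len(kernel)
    out = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out)
        ]
        for r in range(out)
    ]


def compose_kernels(k1, k2):
    """Single kernel equivalent to applying k1, then k2, with nothing between."""
    n1, n2 = len(k1), len(k2)
    size = n1 + n2 - 1  # two 3x3 kernels combine into one 5x5 kernel
    combined = [[0] * size for _ in range(size)]
    for a in range(n1):
        for b in range(n1):
            for c in range(n2):
                for d in range(n2):
                    combined[a + c][b + d] += k1[a][b] * k2[c][d]
    return combined


# Arbitrary integer test image and a made-up second kernel.
image = [[(3 * r + 5 * c) % 7 for c in range(5)] for r in range(5)]
k1 = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
k2 = [[1, 2, 0], [0, 1, -1], [2, 0, 1]]

two_small = convolve(convolve(image, k1), k2)        # 5x5 -> 3x3 -> 1x1
one_large = convolve(image, compose_kernels(k1, k2))  # 5x5 -> 1x1
print(two_small == one_large)  # True
```

Once a non-linear function (such as ReLU) is applied between the two layers, this collapse generally no longer holds, and the stack can represent things a single large kernel cannot.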
Having smaller kernels and a deeper network has other benefits too, such as fewer parameters. This has been used extensively in neural architectures like Squeeze-nets and Inception-nets, basically networks with a Network-In-Network architecture.
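The parameter saving is easy to check with a little arithmetic. The sketch below counts only the kernel weights (biases ignored) and assumes every layer has the same number of input and output channels; the value 64 is a hypothetical choice.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolution layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # hypothetical channel count, same in and out for every layer

stacked = 3 * conv_weights(3, channels, channels)  # three 3x3 layers
single = conv_weights(7, channels, channels)       # one 7x7 layer

print(stacked, single)  # 110592 200704
```

Both options cover a 7×7 receptive field, but the three small layers need 27 weights per channel pair against 49 for the single large one.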
I tried to make this blog understandable for everyone, new or experienced. In the last portion I used some terms that I did not define, but for newcomers that does not really matter, as long as you understand the concept, idea, and mathematics of the receptive field. Thanks for reading, and please do leave a comment if you want me to add something or explain something in more detail.