Source: Deep Learning on Medium
Translational Invariance Vs Translational Equivariance
Translational invariance and translational equivariance are often treated as the same thing, but they are different properties of CNNs. The rest of this post explains the difference.
Convolutional Neural Networks (CNNs) have been the go-to architecture for image and video tasks such as classification, localization, and segmentation. They have matched, and on some benchmarks exceeded, human-level performance on tasks that were considered very difficult with classical image processing techniques. They have also made classification much easier, because the model learns its own features instead of being fed hand-crafted ones, as was common before CNNs took over.
CNNs were inspired by the work of Nobel Prize-winning scientists Dr Hubel and Dr Wiesel, who studied how the visual cortex in the brain works. They inserted microelectrodes into the visual cortex of a partially anaesthetized cat, so that it could not move, and moved a bright line across its retina. During this experiment, they observed the following:
- The neurons fired when the line was in a particular place on the retina.
- The activity of the neurons changed depending on the orientation of the line.
- The neurons sometimes fired only when the line was moving in a particular direction.
This classic experiment showed that the visual cortex processes information hierarchically, extracting increasingly complex information at each level. It also showed that the visual cortex contains a topographical map of the visual field, in which nearby cells process information from nearby regions of the visual field. This led to the concept of sparse interactions in CNNs, where the network focuses on local information rather than taking in the complete global information. This property suits image-related tasks well, because in images nearby pixels are more strongly correlated than distant ones.
Moreover, their work determined that the neurons in the visual cortex are arranged in a precise architecture. Cells with similar functions are organized into columns, tiny computational machines that relay information to a higher region of the brain, where a visual image is formed. This is similar to the way a CNN is designed: lower layers extract edges and other common features, and higher layers extract more class-specific information. In all, their work revealed how visual cortical neurons encode image features, the fundamental properties of objects that help us build our perception of the world around us.
Convolutional Neural Networks offer three basic advantages over traditional fully connected layers. First, their connections are sparse rather than dense, which reduces the number of parameters and makes CNNs efficient on high-dimensional data. Second, weights are shared: the same kernel is applied at every position in the image, which reduces memory requirements and gives translational equivariance (explained in a moment). Third, CNNs use subsampling, or pooling, in which a summary of each local neighbourhood (the maximum, in max pooling) is propagated to the next layer and the rest is dropped. Pooling shrinks the feature maps, which helps produce the fixed-size output that classification typically requires, and it gives approximate invariance to small translations.
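To make the parameter savings concrete, here is a minimal sketch that counts weights and biases for one fully connected layer and one convolutional layer. The layer sizes (a 224x224 RGB input, 64 output units or channels, a 3x3 kernel) and the function names are assumptions chosen for illustration, not values from any particular network.

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Weights plus biases of a fully connected layer on a flattened image."""
    return in_h * in_w * in_c * out_units + out_units


def conv_params(k_h, k_w, in_c, out_c):
    """Weights plus biases of a conv layer; note it does not depend on image size."""
    return k_h * k_w * in_c * out_c + out_c


# Illustrative sizes (assumed): 224x224 RGB image, 64 outputs, 3x3 kernel.
fc_count = dense_params(224, 224, 3, 64)
conv_count = conv_params(3, 3, 3, 64)

print("fully connected:", fc_count)
print("convolutional:  ", conv_count)
```

The convolutional count stays the same for any image size, because the same small kernel is reused at every position.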
Translational equivariance, or just equivariance, is a very important property of convolutional neural networks: an object does not have to sit at a fixed position in the image in order to be detected. It means that when the input shifts, the output shifts in the same way. To be precise, a function f is said to be equivariant to a transformation g if f(g(x)) = g(f(x)). Suppose g shifts each pixel of the image one pixel to the right, i.e. I’(x,y) = I(x-1,y). Applying g to the image and then applying the convolution gives the same result as applying the convolution to the original image I and then applying g to the output (ignoring effects at the image border). When processing images, this simply means that if we move the input 1 pixel to the right, its representation also moves 1 pixel to the right.
Translational equivariance is achieved in CNNs through weight sharing. Because the same weights are applied at every position in the image, an object produces the same response wherever it appears, only at a correspondingly shifted location in the feature map. This property is very useful for applications such as image classification and object detection, where there may be multiple occurrences of the object or the object might be in motion.
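The one-pixel shift can be checked numerically. The sketch below is dependency-free and makes a few assumptions of its own: it uses a "valid" cross-correlation (which is what most deep learning libraries call convolution), fills vacated pixels with zeros, and keeps the object away from the image border so that edge effects do not interfere. The image, kernel, and function names are illustrative.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(grid, pixels=1):
    """g: I'(x, y) = I(x - pixels, y); vacated columns are filled with zeros."""
    return [[0] * pixels + row[:len(row) - pixels] for row in grid]


# A small "object" surrounded by zeros so that borders play no role.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 2, 0, 0, 0],
    [0, 0, 3, 4, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]  # vertical-edge detector

conv_then_shift = shift_right(correlate2d_valid(image, kernel))  # g(f(I))
shift_then_conv = correlate2d_valid(shift_right(image), kernel)  # f(g(I))
equivariant = conv_then_shift == shift_then_conv

print("f(g(I)) == g(f(I)):", equivariant)
```

If the object touched the border, the two results could differ in the outermost columns, which is why equivariance in practice holds only up to boundary effects.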
Convolutional Neural Networks are not naturally equivariant to some other transformations such as changes in the scale or rotation of the image. Other mechanisms are required to handle such transformations.
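A small counterexample illustrates the point for rotation. In this sketch (all names and values are illustrative assumptions), a vertical-edge kernel responds strongly to an image containing a vertical edge, but after the image is rotated by 90 degrees the same kernel sees a horizontal edge and produces no response. Rotating the input is therefore not the same as rotating the output.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def rotate90(grid):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


# Image with a vertical edge: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]  # vertical-edge detector

conv_of_rotated = correlate2d_valid(rotate90(image), kernel)  # f(r(I))
rotated_conv = rotate90(correlate2d_valid(image, kernel))     # r(f(I))

print("f(r(I)):", conv_of_rotated)
print("r(f(I)):", rotated_conv)
print("equal:", conv_of_rotated == rotated_conv)
```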
Translational invariance is often confused with translational equivariance, and even experienced practitioners sometimes mix up the two.
Translational invariance means that the output of the CNN does not change when the input is translated. If we shift an object within the image, the CNN will still be able to detect the class to which the input belongs. It does not cover other transformations such as rotation or scaling, which need separate mechanisms.
Translational invariance is largely a result of the pooling operation. In a traditional CNN architecture, a layer has three stages. In the first stage, the layer performs a convolution on the input to produce linear activations. In the second stage, the resulting activations are passed through a non-linear activation function such as sigmoid, tanh or ReLU. In the third stage, a pooling operation modifies the output further.
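The three stages can be sketched end to end in a few lines. This is a minimal, dependency-free illustration under assumed choices: a "valid" cross-correlation for the convolution stage, ReLU for the non-linearity, and non-overlapping 2x2 max pooling. The input image and kernel are made up for the example.

```python
def correlate2d_valid(image, kernel):
    """Stage 1: convolution (cross-correlation, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Stage 2: element-wise non-linearity."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Stage 3: non-overlapping max pooling (stride equals window size)."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A bright vertical bar on a dark background (6x6).
image = [[0, 1, 1, 1, 0, 0] for _ in range(6)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

linear = correlate2d_valid(image, kernel)  # 4x4 linear activations
activated = relu(linear)                   # negatives clipped to zero
pooled = max_pool2d(activated)             # 2x2 summary

print(linear[0], activated[0], pooled)
```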
In the pooling operation, we replace the output of the convnet at a certain location with a summary statistic of the nearby outputs, such as the maximum in the case of max pooling. Because each output is replaced by the maximum over its neighbourhood, a small shift of the input leaves most of the pooled outputs unchanged. Translational invariance is useful when the exact location of the object is not required. For example, if you are building a model to detect faces, all you need to know is whether eyes are present, not their exact position. In segmentation tasks, by contrast, the exact position is required.
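The effect of pooling on small shifts can be seen in one dimension. The sketch below assumes non-overlapping max pooling with a window of 2 and a zero-filled shift; the feature values are made up. A one-pixel shift that stays inside the same pooling window leaves the pooled output unchanged, a larger shift does not, and a global maximum ignores the position entirely.

```python
def max_pool1d(values, size=2):
    """Non-overlapping max pooling over a 1D list."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


def shift_right(values, pixels=1):
    """Shift a 1D list to the right, filling vacated positions with zeros."""
    return [0] * pixels + values[:len(values) - pixels]


features = [0, 0, 5, 0, 0, 0, 0, 0]  # one strong activation

pooled = max_pool1d(features)
pooled_shift1 = max_pool1d(shift_right(features, 1))
pooled_shift2 = max_pool1d(shift_right(features, 2))

print("no shift:   ", pooled)
print("shift by 1: ", pooled_shift1)  # same as the unshifted output
print("shift by 2: ", pooled_shift2)  # the peak moves to the next window
print("global max unchanged:",
      max(features) == max(shift_right(features, 2)))
```

This is why the invariance from local pooling is only approximate: it covers shifts smaller than the pooling window, and it grows as pooling layers are stacked.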
The use of pooling can be viewed as adding a strong prior that the function the layer learns must be invariant to translation. When the prior is correct, it can greatly improve the statistical efficiency of the network.
Translational invariance and equivariance are closely related to a technique called data augmentation, which comes in handy when we have little training data or want the model to train on a richer dataset. In data augmentation, we apply random transformations such as rotation, flipping, zooming and translation to each batch sampled from the training set before feeding it to the model. This makes the model more robust to those transformations, including ones the architecture does not handle on its own, and can improve performance.
Convolutional Neural Networks are solving challenges that were previously considered unsolvable and, on some benchmarks, matching or beating human-level performance, as in the ImageNet challenge, where ResNet was reported to perform better than a human. The concepts that make CNNs so effective are not complex; they are intuitive, logical and easy to understand.
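A minimal sketch of such an augmentation step is shown below, using only the standard library. The function names, the maximum shift of 2 pixels, and the flip probability of 0.5 are assumptions for illustration; real pipelines usually rely on a library and also include rotation and zoom.

```python
import random


def translate(image, dx, dy, fill=0):
    """Shift an image by (dx, dy) pixels, filling uncovered pixels with `fill`."""
    h, w = len(image), len(image[0])
    return [
        [
            image[y - dy][x - dx]
            if 0 <= y - dy < h and 0 <= x - dx < w else fill
            for x in range(w)
        ]
        for y in range(h)
    ]


def horizontal_flip(image):
    """Mirror an image left to right."""
    return [row[::-1] for row in image]


def augment(image, rng, max_shift=2, flip_prob=0.5):
    """Apply a random translation and, with some probability, a flip."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    out = translate(image, dx, dy)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out


rng = random.Random(0)  # seeded so that runs are repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]
batch = [augment(image, rng) for _ in range(3)]

for sample in batch:
    print(sample)
```

Each augmented sample keeps the label of the original image, so the model sees the same class at different positions and orientations.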
I hope you liked the post. If you have any doubts, suggestions or requests, please leave a comment below or get in touch with me on Twitter or LinkedIn.
- The Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
- Figure 2, showing the architecture of the human visual system, is taken from knowingneurons.com.
- "Why equivariance is better than premature invariance" by Geoffrey Hinton (Figure 4).
- Figure 6 from itutorials.com.