Making Convolutional Networks Shift-Invariant Again

Source: Deep Learning on Medium

Convolutions are shift-equivariant and pooling builds up shift-invariance, but striding ignores the Nyquist sampling theorem and causes aliasing, which breaks shift-equivariance. Striding occurs in both max-pooling and strided convolutions.

Sampling theorem: A bandlimited continuous-time signal can be perfectly reconstructed from its samples if it is sampled at more than twice the rate of its highest frequency component.

In signal processing, blurring is applied before subsampling as a means of anti-aliasing. In deep learning, this method was eventually replaced by max-pooling when the latter demonstrated better performance empirically.
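To see why blurring before subsampling matters, here is a minimal pure-Python sketch. The helper names (`roll`, `blur`), the circular boundary handling and the [1, 2, 1] / 4 kernel are assumptions chosen for illustration, not code from the paper.

```python
def roll(x, s):
    """Circularly shift a list to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def blur(x, kernel=(1, 2, 1)):
    """Circular convolution with a normalized low-pass kernel (assumed [1, 2, 1] / 4)."""
    n, half, total = len(x), len(kernel) // 2, sum(kernel)
    return [
        sum(w * x[(i + j - half) % n] for j, w in enumerate(kernel)) / total
        for i in range(n)
    ]


# The highest frequency a sampled signal can hold: it alternates every sample.
signal = [1, -1, 1, -1, 1, -1, 1, -1]

# Naive subsampling by 2: a one-sample shift flips the sign of every output.
naive = signal[::2]
naive_shifted = roll(signal, 1)[::2]

# Blur first, then subsample: both shifts now give the same output.
smooth = blur(signal)[::2]
smooth_shifted = blur(roll(signal, 1))[::2]

print("naive:  ", naive, naive_shifted)
print("blurred:", smooth, smooth_shifted)
```

Without the blur, the alternating signal aliases to a constant whose sign depends entirely on the shift. With the blur, the high frequency is removed before it can alias.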

Max-pooling applied to the same signal at two different shifts. The outputs differ: max-pooling breaks shift-equivariance.
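The same effect can be reproduced on a toy 1D signal. This is a minimal sketch: the signal, the circular shift and the helper names are illustrative assumptions.

```python
def roll(x, s):
    """Circularly shift a list to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def max_pool(x, k=2, stride=2):
    """1D max-pooling with circular boundary handling."""
    n = len(x)
    return [max(x[(i + j) % n] for j in range(k)) for i in range(0, n, stride)]


signal = [0, 0, 1, 1, 0, 0, 1, 1]

pooled = max_pool(signal)                   # pooling the original signal
pooled_shifted = max_pool(roll(signal, 1))  # pooling after a one-sample shift

print(pooled)
print(pooled_shifted)
```

If max-pooling were shift-equivariant, the second output would be a shifted copy of the first. Instead the alternating pattern disappears entirely.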

Existing work proposes various methods to add anti-aliasing to downsampling, such as extracting features densely, softly gating between max and average pooling, or using convolutional kernel networks. However, these methods are computationally expensive, do not fully exploit the anti-aliasing capability of average pooling, or show lower accuracy. Most existing work also treats max-pooling and average pooling as two separate, incompatible downsampling strategies.

How can we add anti-aliasing to powerful networks?

The paper attempts to reconcile classic anti-aliasing with max-pooling. First, notice that the max-pooling operation is equivalent to taking the max in a dense fashion, and subsampling from this intermediate feature map. The first operation, dense evaluation, actually does not alias at all; the problem lies in the subsampling operation. Therefore, we can add a blur filter before subsampling to reduce aliasing. Blurring and subsampling are evaluated together as BlurPool.
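The decomposition above can be sketched in a few lines of pure Python. The function names, the circular padding and the choice of the [1, 2, 1] filter are assumptions made for this illustration, not the paper's reference implementation.

```python
def roll(x, s):
    """Circularly shift a list to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def dense_max(x, k=2):
    """Step 1: take the max densely (stride 1). This step does not alias."""
    n = len(x)
    return [max(x[(i + j) % n] for j in range(k)) for i in range(n)]


def blur(x, kernel=(1, 2, 1)):
    """Step 2: low-pass filter with a normalized kernel (circular padding)."""
    n, half, total = len(x), len(kernel) // 2, sum(kernel)
    return [
        sum(w * x[(i + j - half) % n] for j, w in enumerate(kernel)) / total
        for i in range(n)
    ]


def max_blur_pool(x, k=2, stride=2, kernel=(1, 2, 1)):
    """Dense max, then blur and subsample (the last two form BlurPool)."""
    return blur(dense_max(x, k), kernel)[::stride]


signal = [0, 0, 1, 1, 0, 0, 1, 1]
out = max_blur_pool(signal)
out_shifted = max_blur_pool(roll(signal, 1))

print(out)
print(out_shifted)
```

On this signal, plain max-pooling jumps between [0, 1, 0, 1] and [1, 1, 1, 1] under a one-sample shift. The blurred version only moves between [0.5, 1, 0.5, 1] and [0.75, 0.75, 0.75, 0.75], so the deviation is reduced but not eliminated.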

Anti-aliasing can be similarly applied to other downsampling layers. For strided convolution, we can evaluate the convolution densely (stride 1) and add a BlurPool layer after the convolution and activation to perform the subsampling. Average pooling is equivalent to blurred downsampling with a box filter, so replacing it with a BlurPool that employs a stronger filter can provide better shift-equivariance. BlurPool is tested with the following blur filters: Rectangle-2 [1, 1], Triangle-3 [1, 2, 1] and Binomial-5 [1, 4, 6, 4, 1].
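All three filters are rows of Pascal's triangle, so they can be generated rather than hard-coded. The sketch below assumes that the 2D kernel for images is the outer product of the 1D filter with itself, normalized to sum to 1. The function names are hypothetical.

```python
from math import comb


def binomial_filter(size):
    """Return the 1D binomial filter of the given length (a row of Pascal's triangle)."""
    return [comb(size - 1, k) for k in range(size)]


def blur_kernel_2d(size):
    """Outer product of the 1D filter with itself, normalized to sum to 1 (assumed)."""
    f = binomial_filter(size)
    total = sum(f) ** 2
    return [[a * b / total for b in f] for a in f]


for name, size in [("Rectangle-2", 2), ("Triangle-3", 3), ("Binomial-5", 5)]:
    print(name, binomial_filter(size))

# 3x3 kernel built from Triangle-3; the centre weight is 4/16.
kernel = blur_kernel_2d(3)
for row in kernel:
    print(row)
```

Longer filters suppress more high-frequency content, trading some sharpness in the feature map for better shift-equivariance.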

How do we evaluate anti-aliasing?

The new subsampling strategies are tested on image classification and generation, across different computer vision architectures. To evaluate anti-aliasing performance, three shift-invariance/equivariance metrics are proposed:

First, consider a convolutional network as a feature extractor in which each layer produces a feature map. One metric is then the internal feature distance: the cosine distance between the shifted output of the feature map and the feature map of the shifted input.

Internal feature distance for a feature map F̃
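As a rough illustration of this metric, the sketch below uses a toy "feature map" (max-pooling followed by nearest-neighbour upsampling back to the input resolution) in place of a real network layer. The upsampling step, the toy signal and all function names are assumptions for this example.

```python
import math


def roll(x, s):
    """Circularly shift a list to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def max_pool(x, k=2, stride=2):
    """1D max-pooling with circular boundary handling."""
    n = len(x)
    return [max(x[(i + j) % n] for j in range(k)) for i in range(0, n, stride)]


def features(x):
    """Toy feature map: max-pool, then upsample back to the input resolution."""
    return [v for v in max_pool(x) for _ in range(2)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / norm


def internal_feature_distance(x, shift):
    """Distance between shifting the features and featurizing the shifted input."""
    return cosine_distance(roll(features(x), shift), features(roll(x, shift)))


signal = [0, 0, 1, 1, 0, 0, 1, 1]
for shift in range(4):
    print(shift, round(internal_feature_distance(signal, shift), 4))
```

For this toy layer, shifts that are multiples of the stride give zero distance, while odd shifts do not. Strided layers in a real network show the same periodic pattern.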

Next, we may evaluate classification consistency by checking how often the model predicts the same class for two different shifts from the same image.

Classification consistency for prediction on different shifts
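A minimal sketch of how such a consistency score could be computed follows. The two toy "classifiers", the fixed list of shift pairs and the function names are hypothetical stand-ins for a real model and randomly sampled shifts.

```python
def roll(x, s):
    """Circularly shift a list to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def max_pool(x, k=2, stride=2):
    """1D max-pooling with circular boundary handling."""
    n = len(x)
    return [max(x[(i + j) % n] for j in range(k)) for i in range(0, n, stride)]


def classification_consistency(predict, signals, shift_pairs):
    """Fraction of (signal, shift pair) cases where both shifts get the same class."""
    agree, total = 0, 0
    for x in signals:
        for s1, s2 in shift_pairs:
            agree += predict(roll(x, s1)) == predict(roll(x, s2))
            total += 1
    return agree / total


def aliased_classifier(x):
    """Toy classifier that thresholds max-pooled features (sensitive to shifts)."""
    return int(sum(max_pool(x)) >= 3)


def invariant_classifier(x):
    """Toy classifier that thresholds a global sum (unaffected by circular shifts)."""
    return int(sum(x) >= 3)


signals = [[0, 0, 1, 1, 0, 0, 1, 1]]
shift_pairs = [(0, 1), (0, 2), (1, 3), (1, 2)]

print(classification_consistency(aliased_classifier, signals, shift_pairs))
print(classification_consistency(invariant_classifier, signals, shift_pairs))
```

The classifier built on strided max-pooling changes its prediction whenever the two shifts differ in parity, while the globally pooled one is fully consistent.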

For image generation, we can evaluate generation stability by testing if a shift in the input image generates an output that is the shifted version of the output from the original image.

Peak signal-to-noise ratio between generated images with shifted and non-shifted input
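The stability check can be sketched as below, using the standard PSNR definition, 10 · log10(peak² / MSE). The two toy "generators" are illustrative assumptions and not real image-to-image networks: one subsamples without blurring, the other is a stride-free blur.

```python
import math


def roll(x, s):
    """Circularly shift a list to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the signals are identical."""
    mse = sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def generation_stability(generate, x, shift):
    """PSNR between the shifted output and the output of the shifted input."""
    return psnr(roll(generate(x), shift), generate(roll(x, shift)))


def aliased_generator(x):
    """Toy generator: subsample by 2 without blurring, then upsample by repetition."""
    return [v for v in x[::2] for _ in range(2)]


def blurred_generator(x):
    """Toy stride-free generator: a circular [1, 2, 1] / 4 blur."""
    n = len(x)
    return [(x[(i - 1) % n] + 2 * x[i] + x[(i + 1) % n]) / 4 for i in range(n)]


signal = [0, 1, 0, 1, 0, 1, 0, 1]
print(generation_stability(aliased_generator, signal, 1))
print(generation_stability(blurred_generator, signal, 1))
```

A higher PSNR means the two outputs agree more closely. The aliased toy generator scores 0 dB on this input because a one-sample shift changes its output completely.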

Results

By computing the internal feature distance throughout the layers of the VGG network for various horizontal and vertical shifts, we observe that the anti-aliased network maintains better shift-equivariance, and the resulting output is more shift-invariant. Unlike the baseline network, it shows no considerable deviation from perfect shift-equivariance in downsampling layers.

Each pixel in the heatmap corresponds to a shift (∆h, ∆w), and its color indicates the feature distance. A large distance (large deviation) is red; zero distance (perfect shift-equivariance) is blue.

For image classification, adding low-pass filters increases the classification consistency of ResNet50, and stronger filters increase stability even more. Moreover, absolute classification accuracy also improves with the addition of these filters, without adding learnable parameters. This suggests that low-pass filtering can serve as effective regularization.

ImageNet classification consistency vs. accuracy

For image generation, low-pass filters are added to the strided-convolution layers of the U-Net architecture. The PSNR between images generated from shifted and unshifted inputs also increases, meaning the outputs are more similar.

Selected example demonstrating generation stability. Without anti-aliasing (top), shifting produces different-looking windows. With anti-aliasing (bottom), the same window pattern is generated regardless of shifts.

Furthermore, this anti-aliasing method also improves stability to other perturbations such as noise, and it decreases the mean error rate for classification on corrupted images. Anti-aliasing thus helps in obtaining a more robust and generalizable network across various architectures and tasks.

Discussion

I tried the model myself by fine-tuning a pre-trained anti-aliased ResNet18 model for image classification on greyscale images with fewer than 4,000 training examples. This task is challenging because the model tends to overfit on scarce training data. Anti-aliasing did help the model generalize, and test accuracy increased from 0.95865 to 0.97115.

The method proposed in this paper is remarkable because it improves performance without adding any learnable parameters. More shift-invariant convolutional networks should also reduce the need for shift-based data augmentation. Both properties help keep computation time short and improve generalization in computer vision tasks.

The underlying problem is that images may look alike to us yet have a large distance between them. Shifted images are a common example of this phenomenon. Let us remind ourselves that good convolutional networks should cater to human vision: images that look the same to us should also look the same to machines. Shift-invariant networks thus better emulate human perception, as the features they extract have better shift-equivariance, and shifted images are more often treated as the same.

Classical signal processing uses blurring for anti-aliasing, but the practice has disappeared from modern convolutional networks in the pursuit of performance. With this paper’s results demonstrating that adding anti-aliasing to current convolutional networks improves both shift-equivariance and task performance, it is perhaps time to bring back blur filters and make convolutional networks shift-invariant again!