Vision Transformers (Bye Bye Convolutions)

Original article was published by Nakshatra Singh on Deep Learning on Medium

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks or used to replace certain components of convolutional networks while keeping their overall structure in place. The Vision Transformer (ViT) paper shows that this reliance on CNNs is not necessary and that a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

1. The Limitation with Transformers For Images

Transformers work very well for NLP, but they are limited by the memory and compute cost of self-attention in the encoder block, which grows quadratically with sequence length. Images make this much harder, because an image is a raster of pixels and even a small image contains a great many of them. Rasterization is a challenge in itself, even for CNNs. If every pixel is treated as a token, every pixel has to attend to every other pixel. A 255×255 image has 255² = 65,025 pixels, so a single attention map would contain 255⁴ ≈ 4.2 billion entries, which is impractical even on current hardware. Earlier work has therefore resorted to workarounds such as restricting attention to local neighbourhoods or approximating global attention.
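To make the scaling concrete, here is a small back-of-the-envelope sketch in Python. It counts the query-key pairs that full self-attention has to score, first for the 255×255 pixel-level example above and then for patch-level tokens. The 224×224 image size and 16×16 patch size in the second case are illustrative assumptions, and `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens

# Pixel-level tokens for the 255x255 example in the text.
pixel_tokens = 255 * 255
pixel_pairs = attention_pairs(pixel_tokens)

# Patch-level tokens; the 224x224 image and 16x16 patches are assumed values.
patch_tokens = (224 // 16) * (224 // 16)
patch_pairs = attention_pairs(patch_tokens)

print(pixel_tokens, pixel_pairs)  # 65025 4228250625
print(patch_tokens, patch_pairs)  # 196 38416
```

Under these assumptions, moving from pixels to patches shrinks the attention map by about five orders of magnitude.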

Model Overview.

2. Vision Transformer Architecture

2.1 Patch Embeddings

The standard Transformer receives input as a 1D sequence of token embeddings. To handle 2D images, we reshape the image x∈R^{H×W×C} into a sequence of flattened 2D patches:

x_p∈R^{N×(P²·C)}

Here (H, W) is the resolution of the original image, C is the number of channels, and (P, P) is the resolution of each image patch. N = HW/P² is then the number of patches, which is also the effective sequence length for the Transformer. The image is split into fixed-size patches: in the illustration below the patch size is 16×16 and the image is 48×48, which gives N = 9 patches.

NOTE: The image dimensions must be divisible by the patch size.
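The reshaping step can be sketched in plain Python as follows. This is a minimal illustration rather than the paper's implementation: the image is assumed to be a nested list of shape H×W×C, `image_to_patches` is a hypothetical helper name, and the 48×48 image with 16×16 patches mirrors the example above.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C nested-list image into N flattened patches.

    Returns a list of N = (H * W) / P^2 vectors, each of length P * P * C,
    ordered left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches

# Toy 48x48 image with 3 channels; the pixel values encode their position.
image = [[[r, c, 0] for c in range(48)] for r in range(48)]
patches = image_to_patches(image, patch_size=16)
print(len(patches), len(patches[0]))  # 9 768
```

Each of the 9 patches becomes a vector of 16 · 16 · 3 = 768 numbers, and an image whose sides are not multiples of the patch size is rejected, in line with the note above.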

Basic Intuition of Reshaped Patch Embeddings

2.2 Linear Projection of Flattened Patches

Before passing the patches into the Transformer block, the authors of the paper found it helpful to first put them through a linear projection. There is a single learnable matrix for this, called E (for embedding). Each patch is unrolled into one long vector of length P²·C and multiplied by E to produce a patch embedding, and that is what goes into the Transformer along with the positional embedding.
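A minimal sketch of this projection, using only the Python standard library, might look like the following. Here D denotes the width of the Transformer, so E has shape (P²·C)×D. The helper name `project_patches`, the toy sizes, and the random initialization are assumptions made for illustration; no bias term is included.

```python
import random

def project_patches(patches, embedding_matrix):
    """Multiply each flattened patch (length P*P*C) by E of shape (P*P*C, D)."""
    dim = len(embedding_matrix[0])
    embedded = []
    for patch in patches:
        if len(patch) != len(embedding_matrix):
            raise ValueError("patch length must match the number of rows in E")
        embedded.append([
            sum(value * embedding_matrix[i][j] for i, value in enumerate(patch))
            for j in range(dim)
        ])
    return embedded

random.seed(0)
# Toy sizes (assumed): P = 2 and C = 3 give patch_len = 12; D = 4; N = 9.
patch_len, model_dim, num_patches = 12, 4, 9
E = [[random.gauss(0.0, 0.02) for _ in range(model_dim)] for _ in range(patch_len)]
patches = [[random.random() for _ in range(patch_len)] for _ in range(num_patches)]
embeddings = project_patches(patches, E)
print(len(embeddings), len(embeddings[0]))  # 9 4
```

The same matrix E is shared by every patch, so the number of parameters does not depend on how many patches the image contains.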

Intuition for the linear projection block before the encoder

2.3 Positional Embeddings

Position embeddings are added to the patch embeddings to retain positional information. The authors explored different 2D-aware variants of position embeddings and found no significant gains over standard 1D position embeddings. The joint embedding serves as input to the Transformer encoder.

Each patch is assigned an index according to its position in the sequence. In the paper the authors simply number them 1, 2, 3, 4, … up to the number of patches. Each index maps to a learnable vector, and these vectors are stacked row-wise to form a learnable positional embedding table.

Finally, the row matching each patch's index is looked up in the table and added to the corresponding patch embedding, and the result is fed to the Transformer encoder block.
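The table lookup and element-wise addition can be sketched as below. This is a simplified illustration under stated assumptions: `add_position_embeddings` is a hypothetical helper, indices are 0-based rather than 1-based, and the table is filled with random numbers where a real model would learn it during training.

```python
import random

def add_position_embeddings(patch_embeddings, position_table):
    """Add row i of the position table to the i-th patch embedding."""
    if len(patch_embeddings) > len(position_table):
        raise ValueError("position table has fewer rows than there are patches")
    return [
        [x + p for x, p in zip(embedding, position_table[index])]
        for index, embedding in enumerate(patch_embeddings)
    ]

random.seed(0)
num_patches, model_dim = 9, 4  # toy sizes chosen for illustration
# In a real model this table is a trainable parameter; here it is just random.
position_table = [[random.gauss(0.0, 0.02) for _ in range(model_dim)]
                  for _ in range(num_patches)]
patch_embeddings = [[0.0] * model_dim for _ in range(num_patches)]
encoder_input = add_position_embeddings(patch_embeddings, position_table)
print(len(encoder_input), len(encoder_input[0]))  # 9 4
```

Because the position vectors are added rather than appended, the sequence keeps the same shape on its way into the encoder.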

2.4 The Transformer Encoder Block

The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multi-headed self-attention and MLP blocks. Layer normalization (LayerNorm) is applied before every block, and a residual connection is added after every block. The MLP contains two layers with a GELU non-linearity.
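The ordering of normalization, sub-block, and residual connection is easier to see in code. The sketch below is a deliberately simplified, pure-Python version. It uses a single attention head instead of several, omits biases and the learnable scale and shift of layer normalization, and uses toy sizes. All function names and the `params` layout are assumptions made for illustration.

```python
import math
import random

def matmul(a, b):
    """Plain matrix product of nested lists a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def self_attention(x, wq, wk, wv, wo):
    """Single-head scaled dot-product self-attention (a simplification)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(matmul(weights, v), wo)

def mlp(x, w1, w2):
    """Two linear layers with a GELU in between (biases omitted)."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def encoder_block(x, params):
    """Pre-norm block: y = x + Attention(LN(x)), output = y + MLP(LN(y))."""
    y = add(x, self_attention(layer_norm(x), *params["attn"]))
    return add(y, mlp(layer_norm(y), *params["mlp"]))

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
tokens, dim, hidden = 9, 8, 32  # toy sizes, not the paper's
params = {
    "attn": [random_matrix(dim, dim) for _ in range(4)],
    "mlp": [random_matrix(dim, hidden), random_matrix(hidden, dim)],
}
x = random_matrix(tokens, dim)
y = encoder_block(x, params)
print(len(y), len(y[0]))  # 9 8
```

The block maps a sequence of shape (tokens, dim) to a sequence of the same shape, which is what allows several such blocks to be stacked.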

GeLU Non-Linearity Activation Function
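For reference, GELU can be written as GELU(x) = x·Φ(x), where Φ is the standard normal cumulative distribution function. The short sketch below evaluates this exact form next to the commonly used tanh approximation. The function names are hypothetical, and the source does not say which form the authors' code uses.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(gelu_exact(x), 4), round(gelu_tanh(x), 4))
```

Unlike ReLU, GELU is smooth and lets small negative values through, while behaving almost like the identity for large positive inputs.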

3. Hybrid Architecture (A Similar Approach)

As an alternative to dividing the image into patches, the input sequence can be formed from intermediate feature maps of a ResNet. In this hybrid model, the patch embedding projection E is replaced by the early stages of a ResNet. One of the intermediate 2D feature maps of the ResNet is flattened into a sequence, projected to the Transformer dimension, and then fed as an input sequence to a Transformer.

4. Training & Fine-tuning

The authors train all models, including ResNets, using Adam with β1 = 0.9, β2 = 0.999, a batch size of 4096, and apply a high weight decay of 0.1, which they found to be useful for transfer of all models. The authors used a linear learning rate warmup and decay. For fine-tuning, the authors used SGD with momentum, batch size 512, for all models.
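A linear warmup followed by a decay can be sketched as a simple function of the training step. The version below assumes the decay is also linear and ends at zero. The base learning rate, warmup length, and total step count are hypothetical values chosen only to show the shape of the schedule, and `learning_rate` is an illustrative helper name.

```python
def learning_rate(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)

# Hypothetical values chosen only to illustrate the shape of the schedule.
base_lr, warmup_steps, total_steps = 1e-3, 10_000, 100_000
for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, learning_rate(step, base_lr, warmup_steps, total_steps))
```

The warmup keeps early updates small while the Adam statistics are still noisy, and the decay lowers the step size as training converges.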

5. Multi-Layer Perceptron Head

The fully-connected MLP head at the output provides the desired class prediction. The main model can be pre-trained on a large dataset of images, and then the final MLP head can be fine-tuned to a specific task via the standard transfer learning approach.
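As a rough illustration of the final step, the sketch below maps a feature vector from the encoder to class probabilities. It is a simplification that uses a single linear layer followed by a softmax instead of a full MLP. The function name, weights, and sizes are hypothetical.

```python
import math

def classification_head(features, weights, bias):
    """Map a feature vector of length D to class probabilities (linear + softmax)."""
    logits = [sum(f * w for f, w in zip(features, column)) + b
              for column, b in zip(zip(*weights), bias)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-dimensional feature vector and a 2-class head.
features = [1.0, 0.0, -1.0]
weights = [[0.5, -0.5], [0.0, 0.0], [-0.5, 0.5]]  # shape (D, num_classes)
bias = [0.0, 0.0]
probs = classification_head(features, weights, bias)
print(probs)
```

When transferring to a new task, only `weights` and `bias` need a new shape to match the new number of classes, while the pre-trained encoder that produces `features` is reused.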

6. Comparison with SOTA

Breakdown of VTAB performance in Natural, Specialized, and Structured task groups.


  1. Dosovitskiy et al., An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
  2. Visual Transformers.
  3. Vaswani et al., Attention Is All You Need, 2017.