Original article was published by Synced on Artificial Intelligence on Medium
‘Farewell Convolutions’ — ML Community Applauds Anonymous ICLR 2021 Paper That Uses Transformers for Image Recognition at Scale
A new research paper, An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale, has the machine learning community both excited and curious. With Transformer architectures now being extended to the computer vision (CV) field, the paper suggests the direct application of Transformers to image recognition can outperform even the best convolutional neural networks when scaled appropriately. Unlike prior works using self-attention in CV, the scalable design does not introduce any image-specific inductive biases into the architecture.
But just whose potential breakthrough is this? The paper is currently under double-blind review for the International Conference on Learning Representations (ICLR) 2021, and thus the authors’ names and institutions are masked. The paper was spotted on the ICLR 2021 research repository OpenReview, and social media ML sleuths quickly went to work.
“The paper we’re discussing here uses a JFT-300M dataset that is not available to the public, only to Google,” noted Yannic Kilcher, host of a popular eponymous YouTube channel. (JFT-300M is an internal dataset Google built to improve computer vision algorithms; it includes 300M images labelled with 18,291 categories.) Kilcher identified numerous other clues suggesting the paper comes from Google, as part of a spirited and sarcastic rant against vulnerabilities and shortcomings in the double-blind review process.
Although reviewers’ comments remain anonymous, that doesn’t mean the double-blind peer review process is sabotage-free. Some in the community have previously voiced concerns that the positive public comments a paper attracts on social media can give a paper an advantage during the review. Others are concerned that apparent hints indicating a paper is from a renowned institution could bias reviewers’ decisions.
The paper’s premise already has many respected AI practitioners predicting it could bring revolutionary changes to the CV field, where convolutional architectures are the go-to for difficult tasks. The paper asserts “this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches.”
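To make "sequences of image patches" concrete, here is a minimal sketch in plain Python of how an image might be cut into 16×16 patches and flattened into a sequence. The 16×16 patch size comes from the paper's title; the function name, the nested-list image layout, and the 224×224 input size are assumptions chosen for illustration, not details taken from the paper.

```python
def image_to_patch_sequence(image, patch_size=16):
    """Split an image into non-overlapping square patches, each flattened to a vector.

    `image` is assumed to be a list of rows, each row a list of pixels,
    each pixel a tuple of channel values (for example, (R, G, B)).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    sequence = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: every channel value of every pixel in it.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            sequence.append(patch)
    return sequence


# Hypothetical 224x224 RGB image filled with zeros.
image = [[(0, 0, 0) for _ in range(224)] for _ in range(224)]
sequence = image_to_patch_sequence(image)
print(len(sequence), len(sequence[0]))  # 196 768
```

Under these assumptions, a 224×224 RGB image becomes a sequence of 196 vectors of 768 values each, which a Transformer could then treat much as it treats a sequence of word tokens.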
Google DeepMind Research Scientist Oriol Vinyals tweeted his take on the paper as “farewell convolutions : ),” with OpenAI Chief Scientist Ilya Sutskever responding that the new research offers an “anonymous mathematical proof” for “attention is all you need.”
Both researchers are very familiar with Transformer architectures, which enabled DeepMind’s AlphaStar bot to defeat pro StarCraft players and OpenAI’s 175-billion-parameter language model GPT-3 to deliver SOTA performance in NLP tasks.
Sutskever’s approval of the paper is noteworthy as he was one of the first to show the potential of CNNs in CV. In 2012, as a graduate student at the University of Toronto, Sutskever worked with AI pioneer Geoffrey Hinton and first author Alex Krizhevsky on the milestone paper ImageNet Classification with Deep Convolutional Neural Networks.
Tesla Director of AI Andrej Karpathy is also excited about the new paper. His PhD at the Stanford Vision Lab focused on the intersection of convolutional/recurrent neural networks and CV and NLP applications, and his advisor at Stanford was ImageNet creator Professor Fei-Fei Li. Karpathy said the paper takes “further steps towards deprecating ConvNets with Transformers. Loving the increasing convergence of Vision/NLP and the much more efficient/flexible class of architectures.”
As Synced previously reported, the use of Transformers has already been explored in the CV field. But classic ResNet-like architectures remain dominant in large-scale tasks such as image recognition. In May, Facebook AI released Detection Transformers (DETR) for object detection and panoptic segmentation tasks. DETR can directly predict the final set of detections by combining a common CNN with a Transformer architecture. In June, OpenAI showed that large Transformer-based language models trained on pixel sequences can generate coherent images without the use of labels.
While the research community will have to wait for official confirmation of the paper’s source, that delay is unlikely to diminish enthusiasm surrounding the significant technical insights and potential breakthroughs for the use of Transformer architectures in the expanding CV field.
The paper An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale is available on OpenReview.