In a bid to make transformer models even better for real-world applications, researchers from Google, the University of Cambridge, DeepMind and the Alan Turing Institute have proposed a new transformer architecture called “Performer,” based on what they call fast attention via orthogonal random features (FAVOR).
Proposed in 2017 and believed at the time to be particularly well suited for language understanding tasks, the transformer is a neural network architecture based on a self-attention mechanism. To date, in addition to achieving SOTA performance in Natural Language Processing and Neural Machine Translation tasks, transformer models have also performed well across other machine learning (ML) tasks such as document generation/summarization, time series prediction, image generation, and analysis of biological sequences.
Neural networks usually process language by generating fixed- or variable-length vector-space representations. A transformer, however, performs only a small, constant number of steps: in each step, it applies a self-attention mechanism that can directly model relationships between all words in a sentence, regardless of their respective positions.
Although the attention mechanism can capture complex dependencies between the elements of each input sequence, training it to learn these dependencies between distant inputs can be prohibitively expensive. The mechanism also limits transformers’ scalability to longer sequences, as its cost generally scales quadratically with the number of tokens in the input sequence.
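To make the quadratic cost concrete, here is a minimal NumPy sketch of standard dot-product self-attention (an illustrative example, not code from the paper). The L × L score matrix is what makes both time and memory grow quadratically with the sequence length L.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard (quadratic) dot-product self-attention.

    Q, K, V: arrays of shape (L, d) for a sequence of L tokens.
    The score matrix has shape (L, L), so time and memory grow
    quadratically with the sequence length L.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (L, L) -- the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (L, d)

# Example: 1,024 tokens with 64-dimensional heads
L, d = 1024, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
print(softmax_attention(Q, K, V).shape)  # (1024, 64)
```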
To alleviate transformers’ quadratic dependency, various studies have proposed solutions that exploit the structure and sparsity of the learned attention matrix. According to the Performer team, however, these solutions do not aim to approximate regular attention, but rather propose simpler and more tractable attention mechanisms, “often by incorporating additional constraints or by trading regular attention with sparse attention using more layers.”
Real-world applications such as biological sequence analysis often involve long sequences. Adding constraints to attention mechanisms can lead to failures in capturing long-distance correlations and thus impede such applications.
To address this challenge, the FAVOR-based Performer scales linearly rather than quadratically with the number of tokens in the sequence. The new architecture is characterized by sub-quadratic space complexity and does not incorporate any sparsity-pattern priors, the researchers explain. “It is also backwards-compatible with pre-trained regular Transformers.”
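The core idea can be sketched as follows: if the softmax attention kernel is approximated with a random feature map φ, the attention output can be computed as φ(Q)(φ(K)ᵀV) without ever forming the L × L matrix. The simplified sketch below uses plain Gaussian random features purely for illustration; the paper's actual FAVOR construction relies on orthogonal random features and differs in its details.

```python
import numpy as np

def favor_style_attention(Q, K, V, num_features=256, seed=0):
    """Simplified linear-time attention in the spirit of FAVOR (illustrative only).

    Queries and keys are passed through a random feature map phi, and the
    matrix product is re-associated:
        attention ~= phi(Q) @ (phi(K).T @ V) / normaliser
    which costs O(L * m * d) for L tokens, m features and head size d,
    instead of the O(L^2 * d) cost of exact softmax attention.
    """
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d))  # plain Gaussian projections (assumption;
                                                # the paper uses orthogonal random features)

    def phi(X):
        # Positive random features approximating the softmax kernel exp(q.k / sqrt(d))
        X = X / d ** 0.25                       # fold in the 1/sqrt(d) scaling
        proj = X @ W.T                          # (L, m)
        return np.exp(proj - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(num_features)

    Qp, Kp = phi(Q), phi(K)                     # (L, m) each; no L x L matrix is ever built
    KV = Kp.T @ V                               # (m, d)
    normaliser = Qp @ Kp.sum(axis=0)[:, None]   # (L, 1), approximates the softmax row sums
    return (Qp @ KV) / normaliser

# Example with a sequence far longer than a quadratic transformer handles comfortably
L, d = 4096, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
print(favor_style_attention(Q, K, V).shape)  # (4096, 64)
```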
Recent work has demonstrated that transformers can learn to accurately predict information about protein structure and function and generate new sequences with specific properties. While these models show initial promise, their applicability beyond the design of single proteins is limited, mainly because they truncate sequences to 512 or 1024 amino acids.
The ability to scale to longer sequences without imposing sparsity constraints would therefore enable transformers to jointly model multiple concatenated protein sequences and the interactions between them, which is why the proposed linearly scalable attention mechanism has considerable potential for modern protein modelling.
The researchers demonstrated that the Performer can model multiple concatenated protein sequences as required and predict interactions among groups of proteins from sequence data. Compared to a baseline transformer, the Performer trains more efficiently and continues to improve its performance as training progresses.
The researchers say their FAVOR mechanism also provides strong theoretical guarantees. “Our mechanism is to our knowledge the first unbiased estimation of the original algorithm with linear space and time complexity.”
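To give a sense of how an unbiased, linear-complexity estimator of softmax attention is possible, one standard random-feature identity (stated here for illustration; the FAVOR estimator in the paper is built from orthogonal random features and differs in detail) expresses the softmax kernel as an expectation:

```latex
% Softmax kernel written as an expectation over Gaussian random projections w:
\exp(q^\top k)
  = \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}
    \left[ e^{\,w^\top q - \lVert q \rVert^2 / 2}\; e^{\,w^\top k - \lVert k \rVert^2 / 2} \right]
```

Averaging the bracketed product over independently sampled projections gives an unbiased Monte Carlo estimate of the kernel, and re-associating the resulting matrix products yields attention in time and space linear in the sequence length.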
Designed for long input sequences, the FAVOR mechanism approximates regular attention effectively without simplifying it through the various structural priors some previous approaches required, enabling greater flexibility.
When combined with small amounts of fine-tuning, the Performer is backwards-compatible with pretrained regular transformers. Beyond the transformer setting, it can also serve as a more scalable replacement for regular attention, which has a wide variety of uses in computer vision, reinforcement learning and even combinatorial optimization, according to the researchers.
The paper Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers is on arXiv.