Transformer-XL Review: Beyond Fixed-Length Contexts

Originally published by Jiajin Li in Artificial Intelligence on Medium.

This paper (“Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”), published at ACL 2019, one of the top NLP conferences, by researchers at Google AI, proposes Transformer-XL, a new architecture that enables language modeling beyond a fixed-length context without disrupting temporal coherence. Its key innovations are a segment-level recurrence mechanism and a novel relative positional encoding scheme. Unlike the vanilla Transformer, it can capture longer-term dependencies and resolve the context fragmentation problem, the two main limitations of that model. The experiments show that Transformer-XL learns dependencies that are much longer than those of RNNs and the vanilla Transformer, and it achieves state-of-the-art results on large benchmark datasets.

Paper link:

1. Background

Language modeling is an important topic in natural language processing, and many unsupervised pre-training methods such as BERT and ELMo build on it. However, modeling long-term dependencies remains a challenge. Recurrent neural networks (RNNs), in particular Long Short-Term Memory (LSTM) networks, have been the standard solution. The gating mechanism in LSTMs and the gradient clipping technique improve their ability to model long-term dependencies, but RNNs remain hard to optimize for this purpose because of vanishing and exploding gradients.

Figure 1. The vanilla Transformers with a segment length 4.

The Transformer was proposed to address this issue: it allows direct connections between word pairs and captures long-term dependencies better than LSTMs. However, previous Transformer language models (which the authors call the vanilla Transformer) were trained with a fixed-length context: the input is split into segments, and training happens within each segment independently (Figure 1). As a result, the model cannot capture any dependency longer than the predefined context length. Moreover, the fixed-length segments do not respect sentence boundaries, leading to context fragmentation and thus inefficient optimization and degraded performance. During evaluation, the model makes one prediction at a time, shifting the input by one position at each step and processing each segment from scratch, so evaluation is expensive.
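As a back-of-the-envelope illustration (counting token positions only and ignoring the quadratic cost of attention itself; the function names are mine, not from the paper), the gap between sliding-window evaluation and segment-wise evaluation with cached states can be sketched as:

```python
def vanilla_eval_tokens(n_tokens: int, seg_len: int) -> int:
    """Sliding-window evaluation: each of the n_tokens predictions
    reprocesses a full seg_len-token segment from scratch."""
    return n_tokens * seg_len

def cached_eval_tokens(n_tokens: int) -> int:
    """Segment-wise evaluation with cached states (as in Transformer-XL):
    each token is processed roughly once."""
    return n_tokens

# For a 100,000-token corpus and a 512-token context, the vanilla scheme
# touches 51,200,000 token positions; the cached scheme about 100,000.
```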

To address these limitations, the authors propose Transformer-XL. It reuses hidden states from previous segments to support longer-term dependencies and resolve context fragmentation, and it employs a relative positional encoding scheme to avoid temporal confusion.

2. Transformer-XL

Figure 2. Transformer-XL model with a segment length 4.

2.1 Segment-level recurrence

During training, the hidden state sequence computed for the previous segment is fixed and cached so it can be reused as extended context (Figure 2). Within each segment, every hidden layer receives both the output of the previous hidden layer and the cached output of the previous segment, so the largest possible dependency length grows by drawing on contextual information from several previous segments. Besides resolving the context fragmentation issue, this segment-level recurrence mechanism also speeds up evaluation: the model can advance by an entire segment at a time and reuse representations from previous segments without recomputation.
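A minimal sketch of the caching mechanism, with a toy single-head attention standing in for a full Transformer layer (the function names and the single-layer simplification are mine, not the paper's implementation):

```python
import numpy as np

def attend(q, kv):
    """Toy single-head attention: rows of q attend over the rows of kv."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

def forward_with_memory(segments, mem_len):
    """Process segments left to right, reusing cached hidden states as
    extended attention context (one layer for clarity; the real model
    caches the states of every layer and stops gradients through them)."""
    mem = np.zeros((0, segments[0].shape[1]))
    outputs = []
    for seg in segments:                              # seg: (seg_len, d_model)
        context = np.concatenate([mem, seg], axis=0)  # cached states + current segment
        outputs.append(attend(seg, context))          # queries come only from the new segment
        mem = context[-mem_len:]                      # cache the most recent states
    return outputs
```

With a memory of length m and N layers, information can propagate roughly N × m tokens back, which is how the largest possible dependency extends beyond a single segment.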

2.2 Relative positional encodings

Naively applying recurrence introduces another technical challenge: positional information becomes incoherent, since tokens in different segments would share the same absolute positional encodings (the paper calls this temporal confusion). To address it, Transformer-XL employs a novel relative positional encoding scheme. Relative distance information is injected into the attention score at every layer, rather than added to the initial embeddings as in prior approaches. Using fixed sinusoidal embeddings with learnable transformations makes the scheme more intuitive and more generalizable to longer sequences. These relative positional encodings are what make segment-level recurrence work, so Transformer-XL can model much longer-term dependencies than a vanilla Transformer.
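The resulting attention score decomposes into four terms: content-content, content-position, a global content bias (u), and a global position bias (v). A deliberately unvectorized sketch of that decomposition over a single segment, with weight names loosely following the paper:

```python
import numpy as np

def rel_attn_scores(E, R, Wq, Wke, Wkr, u, v):
    """Unoptimized sketch of Transformer-XL's relative attention scores.
    E: (L, d) token hidden states; R: (2L-1, d) relative position
    embeddings indexed by distance i - j; u, v: learnable global biases."""
    q = E @ Wq        # queries
    ke = E @ Wke      # content-based keys
    kr = R @ Wkr      # position-based keys
    L = E.shape[0]
    scores = np.empty((L, L))
    for i in range(L):
        for j in range(L):
            r = kr[i - j + L - 1]  # embedding of the relative distance i - j
            scores[i, j] = (q[i] @ ke[j]    # (a) content-content
                            + q[i] @ r      # (b) content-position
                            + u @ ke[j]     # (c) global content bias
                            + v @ r)        # (d) global position bias
    return scores
```

The real model computes term (b) for all pairs at once with a "relative shift" trick and attends over the memory-extended context rather than a single segment.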

3. Experiments and results

The authors apply Transformer-XL on word-level and character-level datasets, including WikiText-103, text8, enwik8, One Billion Word, and Penn Treebank, and compare it with other models.

Table 1: Results on WikiText-103

On the WikiText-103 dataset, Transformer-XL reaches a perplexity of 18.3, compared with the previous state-of-the-art (SoTA) perplexity of 20.5 by Baevski & Auli (Table 1).

Table 2: Results on enwik8

On the enwik8 dataset, the 12-layer Transformer-XL achieves 1.06 bits per character (bpc), matching the previous SoTA result of Al-Rfou et al. The 24-layer Transformer-XL improves the SoTA bpc from 1.06 to 0.99 (Table 2).

Table 3: Results on text8

With the same hyper-parameters as on enwik8, Transformer-XL reduces the SoTA bpc on text8 from 1.13 to 1.08 (Table 3).

Table 4: Results on One Billion Word.

The One Billion Word dataset contains mainly short-term dependencies, but Transformer-XL still achieves a new SoTA result, decreasing perplexity from 23.7 to 21.8 (Table 4).

Table 5: Results on Penn Treebank.

On the word-level Penn Treebank dataset, which has only 1 million training tokens, Transformer-XL improves the SoTA perplexity from 55.3 to 54.52 among models without two-step fine-tuning (Table 5). This indicates that Transformer-XL generalizes well even on small datasets.

Table 6: Relative effective context length comparison.

The authors also propose a new metric, Relative Effective Context Length (RECL). It is defined over a group of models, and the gain from a longer context is measured as the relative improvement over the best short-context model; the parameter r restricts the comparison to the top-r hardest examples. As shown in Table 6, Transformer-XL models dependencies 80% to 133% longer than RNNs and 291% to 447% longer than the vanilla Transformer. The comparison also shows that both segment-level recurrence and relative positional encoding contribute to Transformer-XL's longer RECL. Ablation studies on the WikiText-103 and One Billion Word datasets confirm that Transformer-XL outperforms the other models because the recurrence and the new encoding let it model longer-term dependencies.
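The paper's RECL procedure has more moving parts (it sweeps context lengths and aggregates baselines across the model group), but its core idea can be sketched as follows; the function name, inputs, and threshold here are hypothetical simplifications, not the paper's exact algorithm:

```python
import numpy as np

def recl(losses_by_ctx, r_frac=0.1, threshold=0.01):
    """Simplified sketch of Relative Effective Context Length.
    losses_by_ctx: dict mapping context length -> per-token loss array.
    The gain of a longer context is its relative loss improvement over the
    best shorter-context baseline, measured on the top-r hardest tokens;
    RECL is the longest context whose gain still exceeds the threshold."""
    ctxs = sorted(losses_by_ctx)
    best = ctxs[0]
    for c in ctxs[1:]:
        # Baseline: per-token best loss over all shorter contexts
        base = np.minimum.reduce([losses_by_ctx[s] for s in ctxs if s < c])
        n_hard = max(1, int(len(base) * r_frac))
        hard = np.argsort(base)[-n_hard:]   # top-r hardest examples
        gain = (base[hard].mean() - losses_by_ctx[c][hard].mean()) / base[hard].mean()
        if gain < threshold:
            return best
        best = c
    return best
```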

In addition, because no recomputation is needed, Transformer-XL is up to 1,874 times faster than the vanilla Transformer during evaluation.

4. Conclusion

Transformer-XL obtains new SoTA perplexity or bpc results on multiple datasets. By combining recurrence with relative positional encoding, it models longer-term dependencies than RNNs and vanilla Transformers while substantially reducing evaluation cost. Transformer-XL could also be effective in other areas, such as generating long articles and improving language-model pretraining methods like BERT and ALBERT.

5. Related work

(1) Attention is all you need

This paper proposes the Transformer, a novel model architecture relying entirely on the attention mechanism to model global dependencies between input and output. The Transformer allows significantly more parallelization and therefore requires less time to train. It reaches a new SoTA result on the WMT 2014 English-to-French translation task. The original Transformer is the basis of the Transformer-XL presented in this paper.

Citation: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

(2) Character-level language modeling with deeper self-attention

This paper proposes a deep, non-recurrent Transformer model for character-level language modeling. Self-attention layers with causal attention process fixed-length inputs and predict upcoming characters. Al-Rfou et al. design three auxiliary losses to train deep Transformer networks, which outperform LSTMs and achieve new SoTA results on the text8 and enwik8 datasets. However, the model uses fixed-length segments, so it cannot capture any dependency longer than the predefined context length. This limitation motivated the design of Transformer-XL for modeling long-term dependencies.

Citation: Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3159–3166.

(3) Bert: Pre-training of deep bidirectional transformers for language understanding

This paper introduces a novel language representation model, Bidirectional Encoder Representations from Transformers (BERT), designed to pre-train bidirectional language representations from unlabeled text. The pre-trained BERT can then be fine-tuned for various tasks with one additional output layer, achieving SoTA results. In practice, BERT simply chunks long text into fixed-length segments, leading to context fragmentation. Transformer-XL resolves context fragmentation, so it could be used to improve BERT and achieve new SoTA results on a range of tasks.

Citation: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

(4) Self-attention with relative position representations

This paper presents an approach for incorporating relative position representations, i.e. distances between sequence elements, into the Transformer's self-attention mechanism. Relative positional encodings are shown to improve translation quality on the WMT 2014 English-to-German dataset compared with absolute position representations. This work inspired the authors of Transformer-XL to derive a new form of relative positional encoding, which resolves the temporal confusion problem and generalizes better empirically.

Citation: Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 464–468.

(5) ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This paper presents two new techniques to reduce the number of parameters in BERT, lowering memory consumption and training time. It also introduces a self-supervised loss for sentence-order prediction that focuses on modeling inter-sentence coherence. ALBERT establishes new SoTA results on several benchmark datasets with fewer parameters than BERT-large. Like BERT, it splits long text into fixed-length segments, which can cause context fragmentation. Transformer-XL could be used to address context fragmentation in ALBERT and thereby further improve its performance.

Citation: Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.