Source: Deep Learning on Medium
The interesting aspect of UT is its universality. Under certain assumptions, the resulting Universal Transformer can be shown to be Turing-complete (or “computationally universal”).
Regarding the original Transformer, assuming finite precision, it can be shown that the Transformer cannot be computationally universal (yet there are controversies as well).
While the Transformer executes a total number of operations that scales with the input size, the number of sequential operations is constant and independent of the input size, determined solely by the number of layers.
An intuitive example from the author’s blog are functions whose execution requires the sequential processing of each input element. In this case, for any given choice of depth T, one can construct an input sequence of length N > T that cannot be processed correctly by a Transformer:
“Unlike the standard Transformer — which cannot be computationally universal as the number of sequential operations is constant — we can choose the number of steps as a function of the input length in the Universal Transformer. This holds independently of whether or not adaptive computation time is employed but does assume a non-constant, even if possibly deterministic, number of steps.”
More on this topic can be found in the author’s blog.
The results are promising.
On the bAbi question answering dataset, Subject-Verb agreement task, LAMBADA language modeling, algorithmic tasks (Copy, Reverse, and integer Addition, all on strings composed of decimal symbols ‘0’-‘9’), “learning to execute” tasks, and Machine Translation results are really good.
The difference with the ordinary transformer (and sometimes LSTM) is significant.
Additionally, weight sharing in depth leads to better performance of UTs (compared to the standard Transformer) on very small datasets and allows the UT to be a very data efficient model, making it attractive for domains and tasks with limited available data.
Interesting model, worth trying.
#3b. Adaptive attention span
Another interesting place for applying adaptivity in transformers is attention span.
The paper called “Adaptive Attention Span in Transformers” proposes a novel self-attention mechanism that can learn its optimal attention span.
The problem with the vanilla transformer is its fixed context size (or attention span). Additionally, it cannot be very large because of the computation cost of the attention mechanism (it requires O(n²) computations).
Here is an alternative approach: let the layer (or even the attention head) decide the required context size on its own.
There are two options, learnable (called the adaptive attention span) and ACT-like (called the dynamic attention span).
Adaptive attention span let each attention head learn it’s own attention span independently from the other heads (this is done using a non-increasing masking function). It is learnable, but still fixed after the training is done.
Dynamic attention span changes the span dynamically depending on the current input.
Results are cool. The models are smaller, the performance is better.
Internally, adaptive spans are learned larger when needed.
And the dynamic spans adapt to the input sequence. The span increases at the beginning of words and in the middle of composed words, e.g., to predict the “l” in “overlook”
As you did already see, the UT without ACT is a very strong baseline on its own. Maybe repetitions themselves are the key, and the adaptivity is not required? As in the case of Repeat-RNN?
This could be true as well (at least given enough repetitions).
The other proof is ALBERT (“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”). The modified BERT-style model.
As I wrote earlier, ALBERT incorporates two parameter reduction techniques.
The first one is a factorized embedding parameterization, separating the size of the hidden layers from the size of vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings.
The second technique (more relevant to our current topic) is a cross-layer parameter sharing. This technique prevents the parameter from growing with the depth of the network.
Both techniques significantly reduce the number of parameters for BERT without seriously hurting performance, thus improving parameter-efficiency.
As the authors wrote:
“We observed that the network often learned to perform similar operations at various layers, using different parameters of the network. This possible redundancy is eliminated in ALBERT by parameter-sharing across the layers, i.e., the same layer is applied on top of each other.”
As you see, this is similar to UT applying the same transformation a predefined number of times.
In the case of UT (without ACT) you just use the number of repetitions as a parameter, while in the case of ALBERT it is done by creating architecture with a predefined number of (the same) layers. Computationally this is the same.
In ALBERT the main goal was to reduce the number of parameters then to scale up the model again. So, the authors look at the cost of such a decision: how does it hurt the performance comparing with the BERT-style model (where no parameters are shared and all the layers are different). Not surprisingly from this point of view, it hurts (a bit).