Convolutional Sequence to Sequence Learning

Source: Deep Learning on Medium

ConvS2S is a neural machine translation method that set the trend of dropping RNNs in order to parallelize along the sequence dimension. It achieves higher performance than GNMT while being about 5 times faster. The RNN is dead in the traditional sense, though its ideas live on.

On WMT'14, the BLEU score is 40.51 for English-French and 25.16 for English-German (currently 4th on the leaderboard; 2nd when the paper appeared).

  • Convolutional Sequence to Sequence Learning [Jonas Gehring, last author: Yann N. Dauphin, arXiv, 2017/05]
  • Torch (authors)
  • PyTorch (authors)
  • Chainer

Since an LSTM uses the hidden state of the previous time step to compute the hidden state of the current time step, parallelization along the time dimension is difficult (some parallelism is possible with the Factorization Trick, which block-partitions the LSTM weight matrices, but it is limited).

ConvS2S (Convolutional Sequence to Sequence) is a model that enables parallelization along the time-step dimension by replacing the word-sequence processing of the LSTM with a CNN. Since gating is important in the LSTM, a Gated Linear Unit (GLU), which extracts this gating mechanism, is also used.

Before going into the details of ConvS2S, I would like to explain the Gated Linear Unit.

Gated Linear Unit

GCNN (Gated Convolutional Neural Network) is a model that stacks L blocks of [convolutional layer, GLU layer] and can be parallelized along the time-step dimension. The GLU (Gated Linear Unit), which incorporates the gating of the LSTM, yields high performance, and the model achieved SOTA on language modeling with the WikiText-103 dataset. Training is about 20 times faster than an LSTM.

  • Language Modeling with Gated Convolutional Networks [Yann N. Dauphin, arXiv, 2016/12]

Within one block, the input is branched (copied) into two paths: the convolutional layers capture long-range dependencies, and the GLU layer performs gating (controlling how much information is sent to the upper layer). There is also a residual connection from the input of the block to its output.

With a patch size (kernel size) of $k$, stacking $L$ of these modules lets each output position aggregate information from $L(k-1)+1$ input positions. The gating in the GLU layer can control whether to draw on this wide context or concentrate on a few inputs.

The output of a block is given by the following equation:

$$h(X) = (X * W + b) \otimes \sigma(X * V + c)$$

Here, $X \in \mathbb{R}^{N \times m}$ is the input (the embedded word sequence or the previous layer's output), $W, V \in \mathbb{R}^{k \times m \times n}$ are the kernels of the two convolutional layers, $b, c \in \mathbb{R}^n$ are biases, $\sigma$ is the sigmoid function, and $\otimes$ is the element-wise (Hadamard) product.

Also, $N$ is the number of words, $m$ and $n$ are the numbers of input and output feature maps respectively, and $k$ is the patch size.
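As a minimal NumPy sketch of the GLU block $h(X) = (X * W + b) \otimes \sigma(X * V + c)$ (the per-position loop and all shapes here are for illustration only, not the authors' implementation):

```python
import numpy as np

def glu_block(X, W, b, V, c):
    """One GCNN block: h(X) = (X*W + b) ⊗ sigmoid(X*V + c).

    X: (N, m) input sequence, W, V: (k, m, n) kernels, b, c: (n,) biases.
    Uses a naive 'valid' convolution (output length N - k + 1) for clarity."""
    k, m, n = W.shape
    N = X.shape[0]
    A = np.empty((N - k + 1, n))
    B = np.empty((N - k + 1, n))
    for i in range(N - k + 1):
        window = X[i:i + k]                    # (k, m) patch of the input
        A[i] = np.tensordot(window, W, 2) + b  # linear path
        B[i] = np.tensordot(window, V, 2) + c  # gate path
    gate = 1.0 / (1.0 + np.exp(-B))            # sigmoid gate in (0, 1)
    return A * gate                            # element-wise (Hadamard) product
```

In a real implementation both paths are one batched convolution; the loop is only there to make the patch-by-patch structure explicit.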

The handling of the convolutional layer's input is somewhat subtle, as shown in the figure below.

Note that the task is language modeling (predicting the next word), so the convolutional layer must prevent the future words to be predicted from leaking into the model. So that the kernel does not refer to the future, the input of the convolutional layer is shifted backward by zero-padding the beginning of the sequence with $k-1$ elements (the kernel length minus one).
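The shift-and-pad trick can be sketched as follows (a toy helper, assuming the sequence is a 2-D array of shape (length, features)):

```python
import numpy as np

def causal_pad(X, k):
    """Zero-pad the beginning of the sequence with k - 1 rows so that a
    size-k convolution producing output position i sees only inputs <= i."""
    pad = np.zeros((k - 1, X.shape[1]))
    return np.concatenate([pad, X], axis=0)
```

A 'valid' convolution of kernel size k over the padded sequence then yields exactly one output per input position, each depending only on current and past inputs.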


ConvS2S uses a 15-layer stack of [convolutional layer, GLU layer] blocks for both the encoder and the decoder. As in GCNN, the input of the decoder's convolutional layers is shifted backward so that the kernel does not refer to the future.

In the figure, the blue triangles represent the convolutional layers and the adjacent units represent the GLU layers.

During training, the decoder does not run autoregressively: it takes all target words as input at once, computes all attentions at once, and outputs all words at once. This is one of the reasons training is fast.

ConvS2S uses position embeddings to give the model information about the (absolute) position of each word in the sequence.

Position Embedding [Henaff, 2016; Gehring, 2017] adds a different variable (learned parameter) to each position.

Given the embedded word matrix $w = (w_1, \dots, w_m)$ for the encoder (and likewise for the decoder) and the position-embedding matrix $p = (p_1, \dots, p_m)$, the two are summed to give the input $e = (w_1 + p_1, \dots, w_m + p_m)$.

A similar idea, Position Encoding [Sukhbaatar, 2015; Kaiser, 2017], adds a different constant (e.g., derived from a quadratic function or a sine wave) to each position.
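A minimal sketch of position embedding (the vocabulary size, maximum length, and embedding dimension here are hypothetical, not the paper's values):

```python
import numpy as np

# Hypothetical sizes: vocabulary 100, max length 20, embedding dim 8.
vocab_size, max_len, d = 100, 20, 8
rng = np.random.default_rng(0)
word_emb = rng.normal(0, 0.1, (vocab_size, d))  # learned word embeddings w
pos_emb = rng.normal(0, 0.1, (max_len, d))      # learned position embeddings p

def embed(tokens):
    """e_j = w_j + p_j: sum of word embedding and absolute-position embedding."""
    w = word_emb[tokens]                 # (len, d) word vectors
    p = pos_emb[np.arange(len(tokens))]  # (len, d) position vectors
    return w + p
```

Because the position vectors differ, the same word at two different positions gets two different input vectors.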

Multi-step Attention

Attention is a mechanism for extracting information related to a query as a context.

In ConvS2S, attention is computed separately in every layer of the decoder.

Let $z^u = (z^u_1, \dots, z^u_m)$ be the output of the last encoder layer $u$, and $h^l = (h^l_1, \dots, h^l_n)$ the output of decoder layer $l$. Consider the attention of decoder layer $l$ with query sequence $d^l = (d^l_1, \dots, d^l_n)$.

The query $d^l_i$ is given by the following equation:

$$d^l_i = W^l_d h^l_i + b^l_d + g_i$$

Here, $h^l_i$ is the output of the decoder at position $i$, $g_i$ is the embedding vector of the previous predicted word, $W^l_d$ is a weight matrix, and $b^l_d$ is a bias vector.

The attention weight $a^l_{ij}$ is given by the following equation:

$$a^l_{ij} = \frac{\exp(d^l_i \cdot z^u_j)}{\sum_{t=1}^{m} \exp(d^l_i \cdot z^u_t)}$$

The query $d^l_i$ is as described above, and the keys are the outputs $z^u_j$ of the final encoder layer $u$. The attention weights determine which positions of the encoder output to extract information from, and how much.

The context $c^l_i$ is given by the following equation:

$$c^l_i = \sum_{j=1}^{m} a^l_{ij} \left( z^u_j + e_j \right)$$

Here, $e_j$ is the position-embedded encoder input, and the context $c^l_i$ is added to $h^l_i$. Using $z^u_j + e_j$ as the values lets the attention extract appropriate context information.

In the explanation so far, attention is calculated individually, but when all queries are obtained at the same time, attention can be calculated collectively.

Attention in general is expressed by the following equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(Q K^\top) V$$

In ConvS2S, the attention of decoder layer $l$, with query matrix $Q = d^l$, key matrix $K = z^u$, and value matrix $V = z^u + e$, is given by:

$$c^l = \mathrm{softmax}\!\left(d^l (z^u)^\top\right) (z^u + e)$$

When implementing attention, compute it as a batched tensor operation following this formula (never use a for loop over positions).
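A batched sketch of this attention in NumPy (the names `D`, `Z`, `E` are illustrative stand-ins for the query, encoder-output, and encoder-embedding matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conv_s2s_attention(D, Z, E):
    """Batched ConvS2S-style attention for one decoder layer.

    D: (n, d) decoder queries, Z: (m, d) final encoder outputs (keys),
    E: (m, d) position-embedded encoder inputs; the values are Z + E.
    Returns the contexts C: (n, d) with two matmuls and no per-position loop."""
    A = softmax(D @ Z.T, axis=-1)  # (n, m) attention weights, rows sum to 1
    return A @ (Z + E)             # (n, d) contexts
```

All query positions are handled in a single matrix product, which is exactly why the decoder can compute every attention simultaneously during training.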

In conventional NMT, attention is computed only once when generating each decoder output (GNMT also uses the same context in every layer).

ConvS2S uses multi-step attention, recomputing the attention at every layer. This is inspired by the multi-hop attention used in the End-To-End Memory Network [Sukhbaatar, 2015], and was later adopted by the Transformer [Kaiser, 2017]. With multi-step attention, the attention of a given layer at a given time step can take into account the attention history of lower layers and earlier time steps.

Normalization and initialization

In ConvS2S, learning is stabilized by scaling (i.e., normalizing) the outputs of parts of the network (e.g., the residual blocks and the attention) so that the variance of the activations does not change drastically across the network.

  • Assuming the input and output of a residual block have equal variance, their sum is multiplied by $\sqrt{0.5}$ to halve the variance of the sum.
  • The attention output is a weighted sum of $m$ input vectors, so it is multiplied by $m\sqrt{1/m}$ to cancel the change in variance (the multiplication by $m$ scales the input back up to its original size, under the assumption that the attention weights are roughly uniform).
  • The gradients of each encoder layer are scaled by the number of attention mechanisms used.
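The $\sqrt{0.5}$ residual scaling can be checked numerically; a sketch assuming the block's input and output are independent signals with equal (unit) variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(0, 1, n)       # block input, variance 1
y = rng.normal(0, 1, n)       # block output, assumed independent, variance 1

# The sum of two independent unit-variance signals has variance 2;
# multiplying by sqrt(0.5) restores the variance to about 1.
s = (x + y) * np.sqrt(0.5)
```

This is the sense in which the scaling keeps the activation variance from growing as residual blocks are stacked.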

To keep the variance of activations stable in both forward and backward propagation, the weights are carefully initialized.

  • The weights of all embedding layers are initialized from a normal distribution with mean $0$ and standard deviation $0.1$.
  • The weights of a layer whose output is not fed into a GLU layer are initialized from $\mathcal{N}(0, \sqrt{1/n_l})$, where $n_l$ is the number of input connections to each neuron (i.e., Xavier initialization), which maintains the variance of a normally distributed input.
  • The weights of a layer whose output is fed into a GLU layer are initialized from $\mathcal{N}(0, \sqrt{4/n_l})$. When the input to the GLU has mean $0$ and sufficiently small variance, the GLU output has about a quarter of the input's variance, so the weights are initialized so that the variance of the input to the GLU is four times larger.
  • Biases are uniformly initialized to $0$.

When dropout that retains inputs with probability $p$ is applied to some layers, it can be regarded as multiplication by a Bernoulli random variable that takes the value $1/p$ with probability $p$ (and $0$ otherwise). Applying dropout therefore scales the variance by $1/p$, so the affected layers must be initialized with correspondingly larger weights to maintain the variance: the weights of a layer whose output is fed into a GLU layer are initialized from $\mathcal{N}(0, \sqrt{4p/n_l})$, and otherwise from $\mathcal{N}(0, \sqrt{p/n_l})$.
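The $1/p$ variance scaling of this form of dropout can also be checked numerically (a sketch with a hypothetical retain probability $p = 0.8$):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8
x = rng.normal(0, 1, 1_000_000)        # unit-variance input

# Bernoulli mask taking value 1/p with probability p, else 0,
# so the expectation of the masked input is preserved.
mask = (rng.random(x.shape) < p) / p
dropped = x * mask

# Var(dropped) = E[m^2] * Var(x) = (p / p^2) * 1 = 1/p, i.e. variance grows by 1/p.
```

This growth is what the enlarged initialization scales ($\sqrt{p/n_l}$ vs. $\sqrt{1/n_l}$) are compensating for.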

Experiments and results

The dataset is WMT'14 English-French (35.5M sentence pairs) and English-German (4.5M sentence pairs).
The vocabulary is 40,000 subwords based on Byte-Pair Encoding (BPE), shared between the source and target languages.

The model stacks 15 [convolutional layer, GLU layer] blocks; the hidden size is 512 for the first 10 layers, 768 for the next 3, and 2048 for the last 2 (in the English-German case).

Training used synchronous data parallelism across eight M40 GPUs (gradient sums computed with Nvidia NCCL). Optimization uses momentum with coefficient 0.99 (i.e., Nesterov's accelerated gradient), with a learning rate starting at 0.25. When the perplexity on the validation set stops decreasing, the learning rate is halved.

Evaluation uses beam search with a beam width of 5. Three models differing only in their initial values are trained, and the averaged evaluation results are reported.

The results for a single model are as follows.

The BLEU score is slightly below MoE [Shazeer, 2017], but exceeds GNMT [Wu, 2016] by about 0.5 points on both English-German and English-French.

The results for an ensemble of eight models are as follows.

The ensemble achieved SOTA on both English-German and English-French (English-French was still SOTA at the time of writing).


ConvS2S generates word sequences faster than GNMT: about 10 times faster on both CPU and GPU (figure not shown).

The lower-left figure shows the change in PPL and BLEU as the set of layers to which multi-step attention is applied is varied, for a ConvS2S whose decoder uses 5 blocks; the lower-right figure shows BLEU as the number of encoder and decoder layers is varied.

Attention in the lower layers matters more, and performance is highest when attention is applied in all layers. The deeper the encoder, the better the performance, while the decoder performs best at around 5 layers.

The following figure shows the change in BLEU as the kernel size of the convolutional layers and the number of encoder and decoder layers are varied.

The smaller the kernel size, the higher the performance; the best configuration here uses a 13-layer encoder and a 5-layer decoder.