Original article was published by 黃日泓 on Deep Learning on Medium
Auto-regressive language generation
Auto-regressive language generation assumes that the output token at time step $t$ is determined by the input sequence and the tokens generated at time steps before $t$:
$$P(W_{1:T} \mid W_0) = \prod_{t=1}^{T} P(W_t \mid W_{1:t-1}, W_0)$$

where $W_0$ is the input sequence; $W_t$ is the word generated at time step $t$; and the sequence length $T$ is determined by the position of the EOS (end-of-sequence) token.
Thanks to the advent of the Transformer, a famous seq2seq model architecture, researchers are now able to deal well with many NLP problems, e.g. neural machine translation (NMT), text summarization, and text generation. When performing text generation with a Transformer, decoding strategies play an important role and thus cannot be ignored. The simplest approach, greedy search, generates the token with the highest probability at each time step.
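Greedy search can be sketched in a few lines. Here `logits_fn`, `bos_id`, and `eos_id` are hypothetical stand-ins for a real model's forward pass and special-token ids; the toy model below is only for illustration.

```python
import numpy as np

def greedy_decode(logits_fn, bos_id, eos_id, max_len=10):
    """Greedy decoding: at each step, pick the single highest-probability token."""
    seq = [bos_id]
    for _ in range(max_len):
        logits = logits_fn(seq)           # unnormalized scores over the vocabulary
        next_id = int(np.argmax(logits))  # the highest-scoring token wins
        seq.append(next_id)
        if next_id == eos_id:             # stop once the EOS token is generated
            break
    return seq

# Toy "model": always prefers token (last_token + 1) mod 5
toy = lambda seq: np.eye(5)[(seq[-1] + 1) % 5]
print(greedy_decode(toy, bos_id=0, eos_id=4))  # [0, 1, 2, 3, 4]
```

Greedy search is fast but short-sighted: a high-probability token now may lead to a low-probability continuation later, which is exactly the weakness beam search addresses.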
Beam search (BS)
Beam search (BS) is a commonly used algorithm for decoding sequences. When decoding, BS considers the probability of the whole sequence instead of just the word with the highest probability at each time step (greedy search). In other words, BS maintains k sequences at the same time (k is the beam width, which you can define on your own), and whenever a new candidate sequence appears, the sequence with the lowest probability is discarded.
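A minimal sketch of the idea, again assuming a hypothetical `logits_fn` standing in for a real model (real implementations add length normalization and batching, which are omitted here):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def beam_search(logits_fn, bos_id, eos_id, beam_width=3, max_len=10):
    """Keep the `beam_width` best partial sequences by cumulative log-probability."""
    beams = [([bos_id], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:          # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            logp = np.log(softmax(logits_fn(seq)))
            for tok in np.argsort(logp)[-beam_width:]:  # only top tokens can survive
                candidates.append((seq + [int(tok)], score + logp[tok]))
        # keep the best `beam_width` candidates; the rest are discarded
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]
```

Note that each step scores `beam_width` continuations of each live beam, which is where the extra inference cost mentioned below comes from.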
But BS still has some drawbacks:
- It generates similar sequences and thus loses diversity.
- It is computationally wasteful, since inference time is multiplied by the beam width.
- With a higher beam width, NMT tends to generate shorter sequences (since the EOS token is more likely to be generated).
To avoid repetition in generated sequences, a simple remedy called the “n-gram penalty” has come onto the stage. The n-gram penalty makes sure that no n-gram shows up more than once. That is to say, if a generated n-gram has already been put into the output sequence, its probability is set to zero.
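One way to sketch the rule: given the tokens generated so far, find which next tokens would recreate an n-gram that already occurred (their probabilities would then be zeroed out before sampling or argmax). The function name is hypothetical, not from any particular library.

```python
def banned_tokens(seq, n=2):
    """Tokens that would repeat an n-gram already present in `seq`.
    Zeroing their probability enforces the no-repeat-n-gram rule."""
    if len(seq) < n:
        return set()
    # map each (n-1)-token prefix to the tokens that have followed it
    seen = {}
    for i in range(len(seq) - n + 1):
        prefix = tuple(seq[i:i + n - 1])
        seen.setdefault(prefix, set()).add(seq[i + n - 1])
    current_prefix = tuple(seq[-(n - 1):])
    return seen.get(current_prefix, set())

# "a b a" with n=2: the bigram "a b" already occurred, so "b" is banned after "a"
print(banned_tokens(["a", "b", "a"], n=2))  # {'b'}
```

The penalty must be used with care: for an article about New York, an n-gram penalty with n=2 would let the phrase “New York” appear only once.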
In order to make the output sequences more surprising (that is, closer to human behavior), we need to add some randomness into the decoding process. Instead of picking the token with the highest probability, we sample from the distribution over tokens at each time step.
Besides, we can add a parameter called “temperature” (typically ranging from 0 to 1) to adjust the distribution over tokens: lower temperatures sharpen the distribution toward the most likely tokens, while a temperature of 1 leaves it unchanged.
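Temperature is applied by dividing the logits by it before the softmax. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.7, rng=None):
    """Rescale logits by 1/temperature before softmax, then sample one token id.
    Lower temperatures concentrate probability on the most likely tokens."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))
</antml_code_unused>```

As the temperature approaches 0, sampling degenerates into greedy search; higher values make rare tokens more likely to be picked.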
With sampling, we run the risk of selecting “weird/improper words” (words with low probability) as generated tokens. To solve this problem, top-K sampling is introduced: only the K tokens with the highest probabilities are kept as candidates. After that, we sample a token from the renormalized distribution over those K tokens!
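The truncate-then-renormalize step can be sketched as follows (again a toy function, not a library API):

```python
import numpy as np

def top_k_sample(logits, k=5, rng=None):
    """Keep only the k highest-probability tokens, renormalize, then sample."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(logits)[-k:]                 # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                          # renormalize over the k survivors
    return int(rng.choice(top, p=probs))
```

With `k=1` this reduces to greedy search; with `k` equal to the vocabulary size it is plain sampling.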
Nevertheless, top-K sampling cannot dynamically change the number of candidates case by case. Top-P sampling solves this problem by selecting token candidates until their cumulative probability reaches a given P. Via top-P sampling, we can now keep different amounts of token candidates in different situations!
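A sketch of this idea (often called nucleus sampling): sort tokens by probability, keep the smallest prefix whose cumulative probability reaches P, renormalize, and sample. The function name is illustrative.

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus sampling: sample from the smallest set of tokens whose
    cumulative probability reaches p."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

When the model is confident, the nucleus may contain only one or two tokens; when it is uncertain, the candidate set grows automatically, which is exactly the flexibility top-K lacks.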