The Current State of the Art in Natural Language Processing (NLP)

Natural Language Processing (NLP) is the field of study that focuses on the interpretation, analysis and manipulation of natural language data by computational tools. Using NLP, computers analyze, understand and derive meaning from human language, and this computerized understanding can be exploited for numerous use cases. Applications of NLP include sentiment analysis, where a model predicts the sentiment expressed by a piece of text; virtual chatbots, programs that interact with humans via text and can understand and provide logical responses to their messages; speech recognition, the technology behind the speech-to-text converters now commonplace in mobile phones and video streaming websites; and several others.

A major challenge in NLP lies in effectively propagating knowledge or meaning derived in one part of the text to another. For example, in a sentence like “The animal did not cross the street because it was tired”, for the model to understand what the word ‘it’ refers to (the animal, not the street), the meaning it processed for the word ‘animal’ must be remembered and accurately looked back upon when the model reaches the word ‘it’. NLP tools have evolved from vanilla recurrent neural networks (RNNs) to Long Short-Term Memory networks (LSTMs), leading up to the “Transformer” model architecture, its several variants, and some revolutionary new models.

State-of-the-art techniques in NLP:

RNNs: Recurrent Neural Networks are variants of regular feedforward fully-connected neural networks that have memory built into the model. RNNs are recurrent in nature because they apply the same function to every input in a sequence, and the output for a given input depends on the computation for the previous one. After an output is produced, it is copied and fed back into the network; to make a decision, the network considers both the current input and what it has learned from previous inputs.
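To make the recurrence concrete, here is a minimal sketch in Python of a single vanilla (Elman) RNN step; the dimensions, weights and names are illustrative, not taken from any particular implementation:

```python
import numpy as np

input_size, hidden_size = 8, 16
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.1, (hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden-to-hidden weights (the "memory")
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous
    # state, which is how information from earlier tokens is carried forward.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # a toy sequence of 5 inputs
    h = rnn_step(x_t, h)
```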

LSTMs: Long Short-Term Memory (LSTM) networks are a modified version of recurrent neural networks that makes it easier to retain past data in memory. LSTMs are well-suited to classifying, processing and predicting time series with time lags of unknown duration. Information is processed through three gates (the input gate, the forget gate and the output gate), with two quantities carried from cell to cell: the cell state and the hidden state, which is the cell's output.
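A minimal sketch of one LSTM step, showing how the three gates interact with the cell state and hidden state; biases are omitted for brevity, and the weight matrices are assumed to act on the concatenation of the previous hidden state and the current input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_c):
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(W_i @ z)          # input gate: what new information to write
    f = sigmoid(W_f @ z)          # forget gate: how much old cell state to keep
    o = sigmoid(W_o @ z)          # output gate: what to expose as the output
    c = f * c_prev + i * np.tanh(W_c @ z)  # updated cell (long-term) state
    h = o * np.tanh(c)                     # updated hidden state / cell output
    return h, c
```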

LSTMs and their variations seemed to be the answer to the vanishing-gradient problem, enabling models to generate coherent sentences. However, there is a limit to how much information can be retained, as there is still a complex sequential path from previous cells to the current cell; this restricts the length of sequences an LSTM can remember to just a few hundred words. An additional pitfall is that LSTMs are very difficult to train due to their high computational requirements: their sequential nature makes them hard to parallelize, limiting their ability to take advantage of modern computing devices such as GPUs and TPUs.

Attention Mechanism: The paper ‘Attention Is All You Need’, published by Google Brain and Google Research, introduced the Transformer, an architecture built entirely on attention and well suited to massively parallel computing on hardware such as Google's TPUs. Because it does not process tokens one at a time, it allows far more parallelism than RNNs during training, which translates into greater efficiency and computing speed.

The Transformer is based on an attention mechanism organised into an encoder and a decoder. The Transformer uses this mechanism to encode, in each word's vector, information about the relevant context of that word. An attention mechanism allows a network to zoom in and focus on relevant contextual words, both in the input sequence and in the outputs predicted up to that point, in order to determine the next output. For example, in a language translation task, say converting an English sentence to its French version, before emitting every French word the network looks at representations of each word it has processed so far: vectors generated by the input encoding step from a stack of encoder layers. This process of paying ‘attention’ beforehand proves extremely beneficial for the model, increasing its performance many times over.
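As a concrete illustration, here is a minimal sketch of the scaled dot-product attention at the heart of the Transformer: each query scores every key, and the resulting softmax weights determine how much each position's value contributes to the output:

```python
import numpy as np

def attention(Q, K, V):
    # Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V  # weighted sum of values: the "zoomed-in" context
```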

In visualizations of the attention weights for such a translation, the whiter a cell is, the more attention the network paid to the corresponding word during translation. One can see how the model paid attention correctly when outputting “European Economic Area”: in French, the order of these words is reversed (“zone économique européenne”) compared to English, while every other word in the sentence is in similar order.

The BERT model: In another research paper, Google showcased the fullest extent to which this ‘attention’ technique could be utilized, delivering state-of-the-art results on many NLP benchmark tests and outscoring the competition. The model presented was named ‘BERT’, which stands for ‘Bidirectional Encoder Representations from Transformers’, and it can be adapted to a wide variety of tasks by fine-tuning the pretrained model.

BERT shattered previous NLP records on multiple datasets, including the Stanford Question Answering Dataset (SQuAD), a reading comprehension benchmark of over 100,000 question-answer pairs drawn from Wikipedia articles.

The BERT model uses the Transformer architecture with multi-headed attention, essentially producing multiple sets of representations of the input sequence, each one encoding a different characteristic of the input. Two models were introduced in the paper: BERT Base, with 12 layers (transformer blocks), 12 attention heads and 110 million parameters, and BERT Large, with 24 layers, 16 attention heads and 340 million parameters! The model trains using a deep bidirectional language-modelling objective: predicting a masked word in a sentence given all the other words. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based only on “I accessed the” but not on “account.” BERT, however, represents “bank” using both its previous and next context, “I accessed the … account”, starting from the very bottom of a deep neural network, making it deeply bidirectional. Quite remarkably, BERT even surpassed human-level performance on the SQuAD dataset, making it the first NLP model to do so!
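This masked-word objective is easy to try in practice. The sketch below assumes the Hugging Face transformers package is installed and uses its fill-mask pipeline with the public bert-base-uncased checkpoint:

```python
from transformers import pipeline  # assumes `transformers` is installed

# BERT predicts the hidden word from context on BOTH sides of the mask.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I accessed the [MASK] account."):
    print(pred["token_str"], round(pred["score"], 3))
```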

With the success of BERT, many improvements built on top of it came out subsequently, such as ALBERT, RoBERTa, TinyBERT, DistilBERT and SpanBERT. These introduced tweaks to the algorithm to achieve even better results and to address various use cases.

Transformer-XL: Although BERT was a great model, it had its shortcomings on a few fronts. In January 2019, Google Brain and Carnegie Mellon University released a model that addressed some of the shortcomings of the BERT and Transformer approaches, an architecture called Transformer-XL, presented in the paper “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”.

Transformer-XL (XL for extra long) was designed to reuse the hidden states obtained for prior segments as a memory for the current segment being processed, building a recurrent connection between segments. This approach enables learning dependencies beyond a fixed-length context and, with information passed along from previous segments, also solves the problem of context fragmentation.
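A toy sketch of this segment-level recurrence, assuming an attention(Q, K, V) function like the one sketched earlier: hidden states from the previous segment are cached and prepended as extra keys and values (in the real model they are held fixed, with no gradient flowing back through them):

```python
import numpy as np

def process_segment(segment_hidden, memory):
    # Queries come only from the current segment, but keys/values span the
    # cached memory plus the current segment, extending the usable context.
    kv = segment_hidden if memory is None else np.concatenate([memory, segment_hidden])
    output = attention(segment_hidden, kv, kv)  # attention() as sketched above
    new_memory = segment_hidden                 # cached for the next segment
    return output, new_memory
```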

Compressive Transformers: In November 2019, Google’s DeepMind proposed a long-range memory model called the Compressive Transformer.

The paper draws an analogy with how humans store memory: we keep past life experiences in compressed form in the brain and can recall events from more than a decade ago, not because we store every piece of sensory information the brain processes, but because we aggressively select, filter and integrate it into memory via lossy compression. The Compressive Transformer is an extension of the Transformer that similarly maps past hidden activations generated in the model to a smaller set of compressed representations, its compressed memories. It uses the same attention mechanism over both its set of memories and its compressed memories, learning to query its short-term granular memory and its longer-term coarse memory alike, and achieves significantly better results.
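A toy sketch of this two-tier memory, where evicted short-term memories are compressed by average pooling with rate 2 (the paper also studies learned compression functions; all names and sizes here are illustrative):

```python
import numpy as np

def update_memories(new_states, memory, comp_memory, mem_size=4, rate=2):
    # Append the new activations to the short-term (granular) memory.
    memory = np.concatenate([memory, new_states])
    if len(memory) > mem_size:
        # Instead of discarding the oldest activations, compress them
        # and move them into the coarser long-term store.
        evicted, memory = memory[:-mem_size], memory[-mem_size:]
        n = (len(evicted) // rate) * rate  # drop any remainder for simplicity
        compressed = evicted[:n].reshape(-1, rate, evicted.shape[-1]).mean(axis=1)
        comp_memory = np.concatenate([comp_memory, compressed])
    return memory, comp_memory
```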

Deep learning in computer vision had its ImageNet moment in 2012. ImageNet is an image dataset with more than 14 million labelled images, and the ImageNet challenge is an image recognition competition built on it. That year, a submission based on a deep convolutional neural network architecture called AlexNet showed an astounding improvement: it was the first entry to push the error rate on this dataset below 25%, achieving 15.3%, roughly 41% better in relative terms than the next best model. That result helped spark the AI boom and the start of the deep learning storm.

In 2018, NLP had a similar moment. The revolution began as unsupervised pre-trained language models like BERT and Transformer-XL made significant breakthroughs in natural language understanding tasks such as natural language inference, named entity recognition, sentiment analysis and question answering, setting new state-of-the-art records one after another in a short period of time. The golden era of NLP has only begun.