Natural Language Processing — Things you need to Know.

Original article can be found here (source): Deep Learning on Medium

Earlier we discussed Artificial Intelligence, Machine Learning and Deep Learning, and then NLP and the various algorithms related to it.

Now let’s discuss the various frameworks of Natural Language Processing.

  1. spaCy: A Python package designed for speed and for getting things done. It’s written from the ground up in carefully memory-managed Cython. Independent research in 2015 found that spaCy offered the fastest syntactic parser in the world. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python’s AI ecosystem.

Spacy example

2. Gensim: Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) community.

This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing Gensim.

>>> from gensim.summarization.summarizer import summarize
>>> text = '''Rice Pudding - Poem by Alan Alexander Milne
... What is the matter with Mary Jane?
... She's crying with all her might and main,
... And she won't eat her dinner - rice pudding again -
... What is the matter with Mary Jane?
... What is the matter with Mary Jane?
... I've promised her dolls and a daisy-chain,
... And a book about animals - all in vain -
... What is the matter with Mary Jane?
... What is the matter with Mary Jane?
... She's perfectly well, and she hasn't a pain;
... But, look at her, now she's beginning again! -
... What is the matter with Mary Jane?
... What is the matter with Mary Jane?
... I've promised her sweets and a ride in the train,
... And I've begged her to stop for a bit and explain -
... What is the matter with Mary Jane?
... What is the matter with Mary Jane?
... She's perfectly well and she hasn't a pain,
... And it's lovely rice pudding for dinner again!
... What is the matter with Mary Jane?'''
>>> print(summarize(text))
And she won't eat her dinner - rice pudding again -
I've promised her dolls and a daisy-chain,
I've promised her sweets and a ride in the train,
And it's lovely rice pudding for dinner again!

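An example of the Gensim summarizer. Note that the gensim.summarization module was removed in Gensim 4.0, so this example requires a 3.x release.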

3. fastText: fastText is a free, open-source, lightweight library for efficiently learning word representations and text classifiers. It works on standard, generic hardware, and models can later be reduced in size to fit even on mobile devices.
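One idea behind fastText's word representations is that a word is described by its character n-grams, so rare or unseen words can still share pieces with known ones. The snippet below is only a rough sketch of that idea in plain Python, not the library's code: the function name `char_ngrams` and its parameters are made up for illustration, and the angle brackets mark the word boundaries.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word (hypothetical helper, not the fastText API)."""
    marked = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

In a model built on this idea, each n-gram would get its own vector and the word vector would be assembled from them.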

4. Built on TensorFlow:

  • SyntaxNet: A toolkit for natural language understanding (NLU).

SyntaxNet is a framework for what’s known in academic circles as a syntactic parser, which is a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word’s syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree.
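To make the output of such a parser concrete, here is a small sketch of how a tagged and parsed sentence could be stored. It does not use SyntaxNet itself: the `Token` class, the tag names and the relation labels are assumptions for illustration, and the annotations are written by hand rather than predicted.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int     # 1-based position in the sentence
    text: str
    pos: str       # part-of-speech tag
    head: int      # index of the syntactic head, 0 for the root
    relation: str  # label of the dependency linking the token to its head


# Hand-annotated example; a parser would predict pos, head and relation.
sentence = [
    Token(1, "Alice", "PROPN", 2, "nsubj"),
    Token(2, "saw", "VERB", 0, "root"),
    Token(3, "Bob", "PROPN", 2, "obj"),
]


def root(tokens):
    """Return the token at the top of the dependency tree."""
    return next(t for t in tokens if t.head == 0)


def children(tokens, head_index):
    """Return the words that depend directly on the token at head_index."""
    return [t.text for t in tokens if t.head == head_index]


print(root(sentence).text, children(sentence, 2))
# saw ['Alice', 'Bob']
```

Storing only the head index per token is enough to recover the whole tree, which is why many parsers emit this flat form.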

  • textsum (for text summarization): A Sequence-to-Sequence with Attention Model for Text Summarization.

Text summarization has many useful applications. If you run a website, you can create titles and short summaries for user-generated content. If you want to read a lot of articles but don’t have the time, a virtual assistant can summarize the main points of these articles for you.

  • Skip-thought Vectors: “Skip-Thought Vectors” or simply “Skip-Thoughts” is the name given to a simple neural network model for learning fixed-length representations of sentences in any natural language without any labelled data or supervised learning. The only supervision/training signal Skip-Thoughts uses is the ordering of sentences in a natural language corpus.
  • ActiveQA: In traditional QA, supervised learning techniques are used in combination with labeled data to train a system that answers arbitrary input questions. While this is effective, it lacks the ability to deal with uncertainty as humans would, by reformulating questions, issuing multiple searches, and evaluating and aggregating responses. Inspired by humans’ ability to “ask the right questions”, ActiveQA introduces an agent that repeatedly consults the QA system. In doing so, the agent may reformulate the original question multiple times in order to find the best possible answer. The approach is called active because the agent engages in a dynamic interaction with the QA system, with the goal of improving the quality of the answers returned.
  • BERT: BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of Transformer are described in a paper by Google.
BERT Language Model

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
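The difference between directional and non-directional reading can be shown by listing which positions each word is allowed to use as context. This is only a toy sketch in plain Python, not BERT code; the function name and the direction labels are made up for illustration.

```python
def visible_positions(seq_len, direction):
    """For every position, list the positions it may use as context."""
    context = []
    for i in range(seq_len):
        if direction == "left-to-right":
            context.append(list(range(0, i + 1)))    # itself and everything before
        elif direction == "right-to-left":
            context.append(list(range(i, seq_len)))  # itself and everything after
        elif direction == "non-directional":
            context.append(list(range(seq_len)))     # the whole sequence at once
        else:
            raise ValueError(f"unknown direction: {direction}")
    return context


tokens = ["the", "bank", "of", "the", "river"]
for name in ("left-to-right", "non-directional"):
    ctx = visible_positions(len(tokens), name)
    print(name, [tokens[j] for j in ctx[1]])  # context available to "bank"
```

Only in the non-directional case can "bank" see "river", which is the kind of context needed to tell a river bank from a financial one.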

5. Built on PyTorch

a) PyText: PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapid experimentation and of serving models at scale. It achieves this by providing simple and extensible interfaces and abstractions for model components, and by using PyTorch’s capabilities of exporting models for inference via the optimized Caffe2 execution engine. PyText is used at Facebook to iterate quickly on new modeling ideas and then ship them at scale.

b) AllenNLP: AllenNLP makes it easy to design and evaluate new deep learning models for nearly any NLP problem, along with the infrastructure to easily run them in the cloud or on your laptop.

c) Flair: A powerful NLP library. Flair allows you to apply its state-of-the-art natural language processing (NLP) models to your text, for tasks such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification.

d) FairSeq: Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.

e) Fastai: The text module of the fastai library contains all the necessary functions to define a Dataset suitable for the various NLP (Natural Language Processing) tasks and quickly generate models you can use for them. Specifically:

  • text.transform contains all the scripts to preprocess your data, from raw text to token ids,
  • contains the definition of TextDataBunch, which is the main class you’ll need in NLP,
  • text.learner contains helper functions to quickly create a language model or an RNN classifier

f) Transformer Model: The Transformer is a deep machine learning model introduced in 2017, used primarily in the field of natural language processing (NLP). Like recurrent neural networks (RNNs), Transformers are designed to handle ordered sequences of data, such as natural language, for tasks such as machine translation and text summarization. However, unlike RNNs, Transformers do not require that the sequence be processed in order. So, if the data in question is natural language, the Transformer does not need to process the beginning of a sentence before it processes the end. Due to this feature, the Transformer allows for much more parallelization than RNNs during training.

Since their introduction, Transformers have become the basic building block of most state-of-the-art architectures in NLP, replacing gated recurrent neural network models such as the long short-term memory (LSTM) in many cases. Since the Transformer architecture facilitates more parallelization during training computations, it has enabled training on much more data than was possible before it was introduced. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT-2, which have been trained with huge amounts of general language data prior to being released, and can then be fine-tuned for specific language tasks.
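The reason a Transformer does not have to walk through a sentence in order is that its core operation, self-attention, computes every position from the same set of input vectors. Below is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification, not a full Transformer layer: the learned query, key and value projections are assumed to be the identity, and there is a single head.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (a simplification)."""
    d = len(vectors[0])
    outputs = []
    # Each iteration reads only the inputs, never another position's output,
    # so all positions could be computed in parallel.
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return outputs


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(embeddings):
    print([round(v, 3) for v in row])
```

Compare this with an RNN, where the state at step t cannot be computed before the state at step t-1 is known.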

GPT-2 (OpenAI): In 2018 OpenAI released the generative pre-training model (GPT), which achieved state-of-the-art results on many NLP tasks. GPT leverages the Transformer to perform both unsupervised and supervised learning in order to learn text representations for downstream NLP tasks.

GPT-2 uses an unsupervised learning approach to train the language model. Unlike models such as ELMo and BERT, which need two training stages (pre-training and fine-tuning), GPT-2 has no fine-tuning stage.

There is also no custom training for GPT-2: OpenAI had not released the source code for training GPT-2 as of Feb 15, 2019, so the trained model could only be used as-is for research or adoption.

ELMo: ELMo uses a bidirectional language model (biLM) to learn both word characteristics (e.g., syntax and semantics) and linguistic context (i.e., to model polysemy). After pre-training, the internal state vectors can be transferred to downstream NLP tasks.

Unlike traditional word embeddings, ELMo produces multiple embeddings for a single word, depending on the context in which it appears. Higher-level layers capture context-dependent aspects of word meaning, while lower-level layers capture aspects of syntax. In the simplest case, we use only the top layer of ELMo, but we can also combine all layers into a single vector.
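Combining the layers usually means taking a weighted sum of the per-layer vectors for a token, with weights that are normalised by a softmax and an overall scale factor. The sketch below shows that arithmetic in plain Python. The function name, the toy vectors and the scores are made up for illustration; in practice the scores and the scale would be learned for the downstream task.

```python
import math


def combine_layers(layer_vectors, layer_scores, gamma=1.0):
    """Weighted sum of one token's per-layer vectors; weights are softmax(layer_scores)."""
    m = max(layer_scores)
    exps = [math.exp(s - m) for s in layer_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_vectors[0])
    return [gamma * sum(w * vec[j] for w, vec in zip(weights, layer_vectors))
            for j in range(dim)]


# Toy vectors for one token from three layers (lowest layer first).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

top_only = layers[-1]                             # simplest case: top layer only
mixed = combine_layers(layers, [0.0, 0.0, 0.0])   # equal scores give a plain average
print(top_only, [round(v, 3) for v in mixed])
```

With equal scores every layer contributes one third; raising one score shifts the mix towards that layer.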

Recurrent Neural Network (RNN)

The RNN is an important variant of the neural network that is heavily used in natural language processing.

Conceptually, an RNN differs from a standard neural network in that its input at each step is a single word rather than the entire sample. This gives the network the flexibility to work with sentences of varying length, something a standard neural network cannot do because of its fixed structure. It also has the additional advantage of sharing features learned across different positions in the text, which a standard neural network cannot do.

RNN architecture

An RNN treats each word of a sentence as a separate input occurring at time ‘t’ and, in addition to that input, also takes the activation value from time ‘t-1’ as an input. The diagram below shows a detailed structure of an RNN architecture.
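In code, the recurrence described above can be sketched as follows. This is a minimal plain-Python version under a few assumptions: the names `W_ax`, `W_aa` and `b_a` are chosen for illustration, tanh is used as the activation, and the words are toy one-hot vectors.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, a_prev, W_ax, W_aa, b_a):
    """a_t = tanh(W_ax x_t + W_aa a_prev + b_a)"""
    wx = matvec(W_ax, x_t)
    wa = matvec(W_aa, a_prev)
    return [math.tanh(p + q + b) for p, q, b in zip(wx, wa, b_a)]


def rnn_forward(inputs, W_ax, W_aa, b_a):
    """Feed the words one at a time, reusing the same weights at every step."""
    a = [0.0] * len(b_a)  # initial activation
    states = []
    for x_t in inputs:
        a = rnn_step(x_t, a, W_ax, W_aa, b_a)
        states.append(a)
    return states


# Toy weights: 3-dimensional one-hot words, 2-dimensional activation.
W_ax = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_aa = [[0.5, 0.0], [0.0, 0.5]]
b_a = [0.0, 0.0]

short = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
longer = short + [[0.0, 0.0, 1.0]]
print(len(rnn_forward(short, W_ax, W_aa, b_a)), len(rnn_forward(longer, W_ax, W_aa, b_a)))
# 2 3
```

The same three parameters handle sentences of any length, which is the weight sharing across positions mentioned earlier.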


The Gated Recurrent Unit (GRU) is a modification of the basic recurrent unit that helps to capture long-range dependencies and also helps a lot with the vanishing gradient problem.

A GRU adds a memory cell that is controlled by gates, commonly referred to as the update gate and the reset gate. Apart from the usual units with a sigmoid function (used for the gates) and softmax for the output, it contains an additional unit with tanh as its activation function, which proposes a candidate value for the memory cell. Tanh is used because its output can be both positive and negative, so it can scale the value both up and down. The update gate then decides how much of this candidate is combined with the previous value to update the memory cell.
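A single GRU step can be sketched in a few lines. This is one common formulation written in plain Python, not the code of any particular library; the parameter names in the dictionary `p` are assumptions for illustration, and the example uses a memory cell and input of size 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def gru_step(x_t, c_prev, p):
    concat = c_prev + x_t  # [c_{t-1}, x_t]
    update = [sigmoid(z) for z in affine(p["W_u"], concat, p["b_u"])]
    reset = [sigmoid(z) for z in affine(p["W_r"], concat, p["b_r"])]
    # The reset gate decides how much of the old memory feeds the candidate.
    gated = [r * c for r, c in zip(reset, c_prev)] + x_t
    candidate = [math.tanh(z) for z in affine(p["W_c"], gated, p["b_c"])]
    # The update gate blends the candidate with the previous memory value.
    return [u * g + (1.0 - u) * c for u, g, c in zip(update, candidate, c_prev)]


# Hypothetical parameters; the large negative bias keeps the update gate near 0.
params = {
    "W_u": [[0.0, 0.0]], "b_u": [-100.0],
    "W_r": [[0.0, 0.0]], "b_r": [0.0],
    "W_c": [[0.0, 1.0]], "b_c": [0.0],
}
print(gru_step([2.0], [0.5], params))  # the old memory value 0.5 is kept (up to rounding)
```

When the update gate stays close to 0 the memory is carried forward almost unchanged, which is what lets the unit remember information over many steps.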


In the Long Short-Term Memory (LSTM) architecture, instead of the single update gate of the GRU, there are separate update and forget gates (along with an output gate).

LSTM Architecture

This architecture gives the memory cell an option of keeping the old value at time t-1 and adding to it the value at time t.
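The "keep the old value and add the new one" behaviour follows directly from how the memory cell is updated. Here is a minimal plain-Python sketch of one LSTM step in a common formulation; the parameter names in `p` are assumptions for illustration, and the example uses size 1 for the cell, the activation and the input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, a_prev, c_prev, p):
    concat = a_prev + x_t  # [a_{t-1}, x_t]
    update = [sigmoid(z) for z in affine(p["W_u"], concat, p["b_u"])]
    forget = [sigmoid(z) for z in affine(p["W_f"], concat, p["b_f"])]
    output = [sigmoid(z) for z in affine(p["W_o"], concat, p["b_o"])]
    candidate = [math.tanh(z) for z in affine(p["W_c"], concat, p["b_c"])]
    # Old memory scaled by the forget gate plus candidate scaled by the update gate.
    c_t = [f * c + u * g for f, c, u, g in zip(forget, c_prev, update, candidate)]
    a_t = [o * math.tanh(c) for o, c in zip(output, c_t)]
    return a_t, c_t


# Hypothetical parameters: forget gate near 1, update gate near 0.
params = {
    "W_u": [[0.0, 0.0]], "b_u": [-100.0],
    "W_f": [[0.0, 0.0]], "b_f": [100.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
    "W_c": [[0.0, 1.0]], "b_c": [0.0],
}
a_t, c_t = lstm_step([2.0], [0.0], [0.7], params)
print(c_t)  # the old memory value 0.7 is kept (up to rounding)
```

Because the two gates are independent, the cell can also keep the old value and add the candidate at the same time, which a single GRU update gate cannot do.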

Bi-directional RNN:

The previous architectures only consider earlier inputs, whereas a bi-directional RNN (BRNN) considers both earlier and later inputs.

A bi-directional RNN consists of a forward and a backward recurrent neural network, and the final prediction at any given time t is made by combining the results of both networks, as can be seen in the image.
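A small sketch makes the forward and backward passes concrete. It uses a scalar RNN to keep things short; the function names and the toy weights are assumptions for illustration, and "combining" is done here by simply pairing the two states at each time step.

```python
import math


def run_rnn(inputs, w_x, w_a):
    """Scalar RNN: a_t = tanh(w_x * x_t + w_a * a_prev). Returns the state at each step."""
    a, states = 0.0, []
    for x_t in inputs:
        a = math.tanh(w_x * x_t + w_a * a)
        states.append(a)
    return states


def bidirectional_states(inputs, fwd=(1.0, 0.5), bwd=(1.0, 0.5)):
    forward = run_rnn(inputs, *fwd)
    # Run over the reversed sequence, then flip back to the original time order.
    backward = run_rnn(inputs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))


# The only non-zero input is at the last time step.
for t, (f, b) in enumerate(bidirectional_states([0.0, 0.0, 1.0])):
    print(t, round(f, 3), round(b, 3))
```

At time 0 the forward state is still zero, but the backward state already carries information about the last input, which is exactly what a one-directional RNN misses.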

Generative Adversarial Network:

The simplest way of looking at a GAN is as a generator network that is trained to produce realistic samples by introducing an adversary, i.e. the discriminator network, whose job is to detect whether a given sample is “real” or “fake”. Another way that I like to look at it is that the discriminator is a dynamically updated evaluation metric for tuning the generator. Both the generator and the discriminator continuously improve until an equilibrium point is reached:

GAN Architecture
  1. The generator improves as it receives feedback as to how well its generated samples managed to fool the discriminator.
  2. The discriminator improves by being shown not only the “fake” samples generated by the generator, but also “real” samples drawn from a real-life distribution. This way it learns what generated samples look like and what real samples look like, thus enabling it to give better feedback to the generator.
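The feedback in the two points above is usually expressed as a pair of loss functions. The sketch below computes them in plain Python from the discriminator's outputs, i.e. its estimated probabilities that a sample is real. It assumes the common binary cross-entropy formulation with the non-saturating generator loss; the function names are made up for illustration and no networks are trained here.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples are labelled 1, generated samples 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Discriminator outputs for a batch of real samples and a batch of generated ones.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))  # confident, mostly right discriminator
print(generator_loss([0.1, 0.2]))                  # generator is not fooling it yet
print(generator_loss([0.9, 0.8]))                  # generator fools it, so the loss is lower
```

At the equilibrium point the discriminator can do no better than output 0.5 for every sample, real or generated.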