Data Scientist’s Guide to Summarization

Source: Deep Learning on Medium


A text summarization tutorial for beginners

Team Members: Abhinaya Ananthakrishnan, Akhilesh Reddy(@akhil), Preetika Srivastava, Richa Bathija

Did you ever face a situation where you had to scroll through a 400 word article only to realize that there are only 4 key points in the article? All of us have been there. In this age of information, with content getting generated every second around the world, it has become quite difficult to extract the most important information in an optimal amount of time. Unfortunately, we have only 24 hours in a day and that is not going to change. In the recent years, advancements in Machine learning and Deep learning techniques paved way to the evolution of text summarization which might solve this problem of summarization for us.

In this article, we will give an overview about text summarization, different techniques of text summarization, various resources that are available, our results and challenges that we faced during the implementation of these techniques.

Types of text summarization

There are broadly two types of summarization — Extractive and Abstractive

  • Extractive — These approaches select sentences from the corpus that best represent it and arrange them to form a summary.
  • Abstractive — These approaches use natural language techniques to summarize a text using novel sentences.

We tried out different extractive and abstractive techniques and also a combination approach that uses both of them and evaluated them on some popular summarization datasets.

Datasets used and their descriptions

Data is the most crucial part when it comes to any Natural language processing application. The more data that you can get your hands on to train your model, the more sophisticated and accurate the results would be. This is one of the reasons why Google and Facebook have outsourced a lot of their work in this domain.

We used 3 open source data sets for our analysis.

GIGAWORD dataset

It is popularly known as GIGAWORLD dataset and contains nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition. It consists of articles and their headlines. We have used this dataset to train our Abstractive summarization model.

CNN Daily mail dataset

CNN daily mail dataset consists of long news articles(an average of ~800 words). It consists of both articles and summaries of those articles. Some of the articles have multi line summaries also. We have used this dataset in our Pointer Generator model.

Opinions dataset

This dataset contains sentences extracted from user reviews on a given topic. Example topics are “performance of Toyota Camry” and “sound quality of ipod nano”, etc. The reviews were obtained from various sources — Tripadvisor (hotels), Edmunds.com (cars) and amazon.com (various electronics).Each article in the dataset has 5 manually written “gold” summaries. This dataset was used to score the results of the abstractive summarization model.

What is ROUGE?

To evaluate the goodness of the generated summary, the common metric in the Text Summarization space is called Rouge score.

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.

It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). It works by matching overlap of n-grams of the generated and reference summary.

Extractive Techniques

Extractive summarization techniques select relevant phrases from the input document and concatenate them to form sentences. These are very popular in the industry as they are very easy to implement. They use existing natural language phrases and are reasonably accurate. Additionally, since they are unsupervised techniques, they are ridiculously fast. As these techniques only play around with the order of sentences in the document to summarize they do not do as great a job as humans to provide context.

NLTK Summarizer

We wanted to start our text summarization journey by trying something simple. So we turned to the popular NLP package in python — NLTK. The idea here was to summarize by identifying “top” sentences based on word frequency.

Simple NLTK summarizer

Although the technique is basic, we found that it did a good job at creating large summaries. As expected, brevity wasn’t one of its strengths and it often failed at creating cohesive short summaries. Following are the summaries of “A Star is Born” Wikipedia page.

NTLK summarizer — 10 sentence summary
NLTK summarizer — 2 sentence summary

You can find the detailed code for this approach here.

Gensim Summarizer

The Gensim summarization module implements TextRank, an unsupervised algorithm based on weighted-graphs from a paper by Mihalcea et al. It is built on top of the popular PageRank algorithm that Google used for ranking.

After pre-processing text this algorithm builds graph with sentences as nodes and chooses the sentences with highest page rank to summarize the document

What is TextRank?

TextRank is based on PageRank algorithm that is used on Google Search Engine. In simple words, it prefers pages which has higher number of pages hitting it. Traditionally, the links between pages are expressed by matrix as shown in the image below. This matrix is then converted to a transition probability matrix by dividing the sum of links in each page which influences the path

TextRank Overview

TextRank regards words or sentences as pages on the PageRank. So when you use the TextRank, following points are important.

  • Define the “text units” and add them as the nodes in the graph.
  • Define the “relation” between the text units and add them as the edges in the graph.
  • You can set the weight of the edge also.

In the original “TextRank” algorithm the weights of an edge between two sentences is the percentage of words appearing in both of them.

Similarity calculation between sentences

However, the updated algorithm uses Okapi BM25 function to see how similar the sentences are. BM25 / Okapi-BM25 is a ranking function widely used as the state of the art for Information Retrieval tasks. BM25 is a variation of the TF-IDF model using a probabilistic model.

Improved similarity with BM25

In a nutshell, this function penalizes words that appears in more than half the documents of the collection by giving them a negative value.

The gensim algorithm does a good job at creating both long and short summaries. Another cool feature of gensim is that we can get a list of top keywords chosen by the algorithm. This feature can come in handy for other NLP tasks, where we want to use “TextRank” to select words from a document instead of “Bag of Words” or “TF-IDF”. Gensim also has a well-maintained repository and has an active community which is an added asset to using this algorithm.

gensim’s summarization of “A Star is Born” Wikipedia page

You can find the detailed code for this approach here.

Summa summarizer

The summa summarizer is another algorithm which is an improvisation of the gensim algorithm. It also uses TextRank but with optimizations on similarity functions. Like gensim, summa also generates keywords.

summa’s summarization of “A Star is Born” Wikipedia page

You can find the detailed code for this approach here.

Sentence Embeddings

We wanted to evaluate how text summarization works on shorter documents like reviews, emails etc. Most publicly available datasets for text summarization are for long documents and articles. Hence we used the Amazon reviews dataset which is available on Kaggle.

The structure of these short documents is very specific. They start off with a context and then talk about the product or specific matter, and they end off with closing remarks. We used K-means clustering to summarize the types of documents following the aforementioned structure.

For data preparation for clustering, there’s need to convert the text to number, which is done using pre-trained word embeddings. We have used GloVe embedding for achieving this. The embeddings were created consisted of a dictionary with the most common english words and a 25 dimensional embedding vector for each one of them. The idea is to take each sentence of the document, and calculate it’s embeddings, thus creating a data matrix of (# of documents X 25) dimensions. We kept it simple for obtaining the sentence embeddings by taking the element-wise weighted average of the word embeddings for all the words in the sentence. Then, all of the sentences in a document are clustered in k = sqrt(length of document) clusters. Each cluster of sentence embeddings can be interpreted as a set of semantically similar sentences whose meaning can be expressed by just one candidate sentence in the summary. The candidate sentence is chosen to be the sentence whose vector representation is closest to the cluster center. Candidate sentences corresponding to each cluster are then ordered to form a summary for an email. The order of the candidate sentences in the summary is determined by the positions of the sentences in their corresponding clusters in the original document. We ran the entire exercise for Amazon Food reviews dataset, and the results were pretty good on the mid length reviews. One of them is shown below:

Fig. Example of Food review summarization using K-Means clustering

You can find the detailed code for this approach here.

Here’s a fun tool we created to play around with extractive summarization techniques..

We wanted to make a tangible end to end solution to our project. We wanted a viable UI for our algorithms through which the user could interact with the backend.

For this we integrated widgets with the Jupyter Notebook.

The Widget enables the user to get a holistic view of the project, by asking for user input and generating the output.
The widget takes three parameters as input:

  • The type of algorithm the user wants to use for text summarization
  • The URL of the Wikipedia page whose text needs to be summarized
  • The shrink ratio (Amount of output text to be displayed as a percentage of the original document)

The user enters these three inputs and hits the RUN button.
The algorithm gets called from the backend and the output is generated based on the shrink ratio.
You can see the detailed code for this notebook here.

Widget Layout

Check out this cool video to get a clarity on how it actually works.

Abstraction techniques

Abstractive summarization began with Banko et al. (2000) ‘s research suggesting the use of machine translation model. However, in the past few years RNNs using encoder — decoder models with attention has become popular for summarization.

Sequence2Sequence Attention model

We attempted to train our own encoder — decoder model by using the GIGAWORLD dataset from scratch by trying to imitate the first state-of-the art encoder-decoder model.

If you’re unfamiliar with Recurrent Neural Networks or the attention mechanism, check out the excellent tutorials by WildML, Andrej Karpathy and Distill.

Using Google Cloud Platform

We used Google Cloud platform to train our RNN and test it. To learn more about how to setup the environment on Google Cloud Platform, you can view this tutorial also written by our team here.

We used a 16 — CPU system with 2 Tesla P100 GPUs in order to train a simple encoder decoder Recurrent neural network with the following hyper parameters:

num_layers_encoder = 6
num_layers_decoder = 6
rnn_size_encoder = 256
rnn_size_decoder = 256
batch_size = 8
epochs = 200
clip = 5
keep_probability = 0.5
learning_rate = 0.0005
max_lr=0.005
learning_rate_decay_steps = 700
learning_rate_decay = 0.90
pretrained_embeddings_path = ‘./embeddings/tf_hub_embedding_opinion.npy’

Pre-trained tensor flow embeddings were also used.

Although we ran our code on the cloud, given the sheer size of GIGAWORLD dataset, training for even 5-epochs took 72 hours

When we scored the model on validation data, we observed that even after 5 epochs (and 3 days), the model did a good job at matching the actual summaries written by humans. Following are a few examples:

Abstractive summarization results

The model also did a good job at providing context and using different words of its own. For example in the image below we can observe how the model uses “Taipei” instead of “Taiwan”. The actual text doesn’t talk about “Taipei”, but the model recognizes context and uses its own learning to use the word “Taipei”.

Context aware “Abstraction”

You can view the detailed code and output used to generate the model here.

Drawbacks of Abstractive summarization

Although abstractive summarization can be more intuitive and sound like a human, it has 3 major drawbacks:

  1. Firstly, training the model requires a lot of data and hence time. Although we can use pre-trained weights used by the model, they are not readily available and there is no open-source code that one can leverage to implement it
  2. An inherent problem with abstraction is that the summarizer reproduces factual details incorrectly. For instance, if the article talks about Germany beating Argentina 3–2, the summarizer may replace 3–2 by 2–0
  3. Repetition is another problem faced by the summarizer. As we can see in the second example above, some phrases are repeated in the summary

Pointer — Generator Networks

Pointer generator networks solve problem 2 and problem 3 discussed above by calculating “generating probability” which represents the probability of generating a word from the vocabulary versus copying the word from the source. The image below describes how it weighs and combines the vocabulary distribution P​vocab​​ (which we use for generating) and the attention distribution into the final distribution to create the “generating probability”

Pointer Generator networks

Compared to the sequence-to-sequence-with-attention system, the pointer-generator network does a better job at copying words from the source text. Additionally it also is able to copy out-of-vocabulary words allowing the algorithm to handle unseen words even if the corpus has a smaller vocabulary.

Hence we can think of pointer generator as a combination approach combining both extraction (pointing) and abstraction (generating).

We used the pre-trained weights generated by training the CNN dataset and saw some really cool results on the validation dataset.

Pointer-generator summaries

The code and pre-trained weights were used from the developer’s github.

Following is the demo of how we used the weights to run the validation model on Google Cloud Platform:

The pointer generator model also creates an “Attention Map” which shows you which words have received a high “generation probability” and how they play a role in creating the summary.

Attention map for generated summary

Abstractive vs. Extractive

We generated summaries for all the articles in the “Opinions” dataset and compared them to the “gold summaries” that were already present. Both extractive and abstractive techniques did a good job and it was hard to point out any significant difference in both the summaries.

Abstractive vs. Extractive summarization

In fact, in some cases, the generated summaries across both algorithms were significantly different as compared to the gold summary. One such instance can be seen in the image below, where the gold summary talks about how “Gas mileage is below expected”, but the generated summaries only highlight the positives of the “Toyota Camry”

Summarization fail!

Well.. To Summarize..

Given the sophistication of RNNs and the current computing capabilities, we observed that extractive summarization methods are more intuitive than abstractive methods. A few other observations:

  • Abstractive methods take much longer to train and require a lot of data. Higher-level abstraction — such as more powerful, compressive paraphrasing — remains unsolved
  • The network fails to focus on the core of the source text and summarizes a less important, secondary piece of information
  • The attention mechanism, by revealing what the network is “looking at”, shines some precious light into the black box of neural networks, helping us to debug problems like repetition and copying.
  • To make further advances, we need greater insight into what RNNs are learning from text and how that knowledge is represented.

Challenges:

It goes without saying that there will be challenges in implementing a task which is still in the phases of evolution. Text summarization perfectly fits in to this description . In this section, we will discuss about the challenges that you might possibly face if you want to try this out on your own.

1. Version issues:

Research and development in this field is growing at a very high pace. When we started implementing these techniques, we realized most of the versions of packages that were used in the earlier resources were outdated. This led to a lot of version issues in Tensorflow and Attention packages we used. So you need to make sure to go through your reference code and make necessary updates to the code beforehand

2. Training time:

With 16 CPUs of 16GB memory and 2 Nvidia Tesla P100 GPUs, our abstractive model took 72 hours to go through just 5 epochs. Just 5 epochs!

It would have taken 2 complete weeks to just replicate the results of the paper that we were referring. So better be prepared to have long trianing times.

3. Data procurement:

We had to go through multiple Github profiles to get the datasets that we used for our analysis. This process took us quite some time to get started with the analysis.

Next Steps:

  • Compare the abstractive and extractive summaries using ROUGE scores
  • Train Seq2Seq and Pointer Generator models for more epochs and replicate results in the paper

References

All of the references used are listed below:

(i) https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1
(ii)
https://github.com/thunlp/TensorFlow-Summarization
(iii)https://github.com/thomasschmied/Text_Summarization_with_Tensorflow
(iv) https://github.com/abisee/pointer-generator

Please do share if you have tried another cool techniques for Auto-text summarization. Would love to learn from similar experiences!