Text Summarization

Text summarization has become an important and timely tool for condensing and interpreting textual information in today’s fast-growing information age. There is an abundance of text material available on the internet, and it is very difficult for human beings to summarize large documents manually.

With push notifications and article digests gaining more and more traction, the task of generating intelligent, accurate summaries for long pieces of text has become an industry problem that is only growing every day.

In 2014, there were 2.4 billion internet users. That number grew to 3.4 billion by 2016, and another 300 million were added in 2017, for a total of 3.8 billion internet users as of April 2017. That is a 42% increase in just three years! With growing usage comes growth in the number of blogs, web pages and other such textual material. This data is unstructured, and the best we can do to navigate it is to search and skim the results.

We cannot possibly create summaries of all of the text manually; there is a great need for automatic methods.

Automatic Text Summarization

Automatic text summarization is a way of finding relevant information precisely within a large text, in a short time and with little effort.

A simple automatic summarization approach works by first calculating word frequencies for the entire text document. The 100 most common words are then stored and sorted. Each sentence is scored based on how many high-frequency words it contains, with more frequent words being worth more. Finally, the top X sentences are taken and sorted by their position in the original text.
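The steps above can be sketched in a few lines of Python. This is a hypothetical minimal implementation; the stop-word list and the sentence splitter are simplified assumptions:

```python
import re
from collections import Counter

# Simplified stop-word list (a real implementation would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "for", "on"}

def summarize(text, num_sentences=2, top_words=100):
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    # Step 1: word frequencies for the whole document.
    words = re.findall(r"[a-z']+", text.lower())
    freqs = Counter(w for w in words if w not in STOPWORDS)

    # Step 2: keep the most common words.
    common = dict(freqs.most_common(top_words))

    # Step 3: score each sentence; more frequent words are worth more.
    def score(sentence):
        return sum(common.get(w, 0) for w in re.findall(r"[a-z']+", sentence.lower()))

    # Step 4: take the top X sentences, then restore original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    top.sort(key=sentences.index)
    return " ".join(top)
```

Restoring the original order in the last step matters: a summary that presents sentences out of sequence reads incoherently even if each sentence is well chosen.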

By keeping things simple and general-purpose, this frequency-based algorithm can function in situations where other implementations might struggle, such as documents in foreign languages or with unique word associations that aren’t found in standard English corpora.

There are two fundamental approaches to text summarization: extractive and abstractive. The former extracts words and word phrases from the original text to create a summary. The latter learns an internal language representation to generate more human-like summaries, paraphrasing the intent of the original text.

Extractive Summarization

Extractive summarization methods work by selecting a subset of phrases or sentences from the original article to form the summary.

LexRank and TextRank are well-known extractive summarization algorithms, and both use a variation of the Google PageRank algorithm. LexRank is an unsupervised graph-based approach similar to TextRank. It uses an IDF-modified cosine as the similarity measure between two sentences, and this similarity serves as the weight of the graph edge between them. LexRank also incorporates a post-processing step that ensures the top sentences chosen for the summary are not too similar to each other.

The TextRank algorithm can be improved with a few enhancements, such as using lemmatization instead of stemming, incorporating part-of-speech tagging and named-entity resolution, extracting key phrases from the article, and selecting summary sentences based on them. Along with a summary of the article, TextRank can also extract meaningful key phrases.
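The key-phrase side of TextRank can be sketched as follows. This hypothetical version omits the part-of-speech filtering, lemmatization and named-entity steps mentioned above and ranks words on a plain co-occurrence graph:

```python
import re
from collections import defaultdict

# Simplified stop-word list standing in for POS-based candidate filtering.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "for", "on", "with"}

def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_n=3):
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

    # Link words that co-occur within a small sliding window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)

    # Unweighted PageRank over the word co-occurrence graph.
    nodes = list(neighbors)
    scores = {w: 1.0 for w in nodes}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in nodes
        }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
```

Words that co-occur with many other words accumulate the highest scores; real implementations then merge adjacent top-ranked words into multi-word key phrases.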

Abstractive Summarization

Models for abstractive summarization fall under the larger umbrella of deep learning, and there have been breakthroughs in text summarization using deep learning. Below are some of the most notable published results from some of the biggest companies in the field of NLP.

Neural Attention — Facebook AI Research, Sep 3, 2015.

Facebook follows a neural-network, data-driven approach to abstractive sentence summarization. Their method uses a local attention-based model that generates each word of the summary conditioned on the input sentence.
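The core attention computation can be illustrated with toy vectors. All numbers below are made up for illustration; a real model learns these representations:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of alignment scores.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    # Alignment scores: dot product of the decoder state with each input state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context
```

At each decoding step the model conditions the next summary word on this context vector, so different output words can "look at" different parts of the input sentence.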

Sequence-to-Sequence — Google Brain, Aug 4, 2016.

Google Brain announced a sequence-to-sequence model in August 2016, which now powers Google Translate’s production systems. Though it achieved huge improvements, its impact was limited.

In 2017 they introduced tf-seq2seq, an open-source sequence-to-sequence framework in TensorFlow that makes it easy to experiment and reach state-of-the-art results.

Sequence-to-Sequence RNNs and Beyond — IBM Watson, Aug 10, 2016.

IBM’s technique also uses sequence-to-sequence with attention and a bidirectional neural network, but it differs in the features used, the cells chosen (Google uses LSTM, IBM uses GRU), and the way <unk> tokens are handled.

Sequence-to-sequence models map an input sequence to a target sequence. Although they have been successfully applied to other problems in natural language processing, such as machine translation, there remains a lot of room for improvement in state-of-the-art abstractive summarization. While state-of-the-art models achieve high ROUGE scores on summaries of small inputs, they often lose the ability to capture key points once inputs become large.
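For reference, the ROUGE-1 metric mentioned here measures unigram overlap with a reference summary. A hedged sketch of its recall component (real ROUGE implementations also report precision, F1, and n-gram and longest-common-subsequence variants):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    # Fraction of reference unigrams recovered by the candidate summary,
    # with per-word counts clipped to the reference count.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values()) if ref else 0.0
```

Because the metric only counts word overlap, a model can score well on short inputs while still missing key points on long ones, which is the weakness noted above.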

We drive visual data analytics with strong technical capabilities and expertise in delivering AI-based solutions, using text analytics and NLP techniques to identify misleading and suspicious online content.

Get in touch with us to learn more about the paradigm shift towards this generation of AI-based solutions and the benefits it ushers in.

Source: Deep Learning on Medium