Bootcamp Tech Blog #4: Long Document Summarization

Original article was published on Deep Learning on Medium

Written by Camnhung in Cinnamon Student AI Bootcamp 2020

TL;DR: I bet you expected a summary here. Sorry, there isn’t one. The acronym is only here as a friendly reminder of how summarization has changed the way we acquire knowledge.

Everyone loves a good summary: it distills a long text into a shorter version that gets you straight to the key facts and points without endless scrolling through an entire article. With the exponential growth of data, more and more people expect automatic summarization tools to ease the pain of skimming through large amounts of information.

Automatic text summarization

Automatic text summarization (ATS) is the process of shortening a text while preserving its important information. This definition, although widely accepted, is vague because importance is relative to each audience. The absence of a precise definition of what a summary should include is the main thing holding the field back, despite impressive progress in other NLP tasks.

Metrics. Given how ambiguous a “good” summary is, current work has a hard time finding better metrics for ATS than the traditional ROUGE, a family of metrics based primarily on word overlap between reference summaries and generated ones.

More on ROUGE: ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, measures the similarity of two texts by computing n-gram or word-sequence overlaps; scores are usually reported on a scale from 0 to 100. Three variants of ROUGE are commonly used in summarization:

  • ROUGE-1: refers to the overlap of unigrams (single words) between the two texts
  • ROUGE-2: refers to the overlap of bigrams (pairs of consecutive words) between the two texts
  • ROUGE-L: refers to the longest common subsequence between the two texts
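To make the three variants concrete, here is a minimal sketch in plain Python. It assumes lowercased whitespace tokenization and reports an F1 score between 0 and 1 (multiply by 100 for the usual scale). The function names are made up for this post, and official ROUGE tooling adds details such as stemming that are left out here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, ref_total, cand_total):
    """Combine recall and precision into an F1 score (0 when undefined)."""
    if overlap == 0 or ref_total == 0 or cand_total == 0:
        return 0.0
    recall = overlap / ref_total
    precision = overlap / cand_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n):
    """ROUGE-N F1 from clipped n-gram overlap between two texts."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # Counter & keeps the minimum counts
    return f1(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence of words."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return f1(lcs_length(ref, cand), len(ref), len(cand))


reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
print(rouge_n(reference, candidate, 1))
print(rouge_n(reference, candidate, 2))
print(rouge_l(reference, candidate))
```

Note that ROUGE-L rewards words that appear in the same order even when they are not adjacent, which is why it scores this pair higher than ROUGE-2 does.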

You can easily keep track of current progress in ATS here.

Types. Summarization models can be classified by their derivation, that is, by how the model transforms the source text into its abridged version:

  • Extractive models: select the most informative units of text from the input and copy them directly into the summary. The extracted units are usually sentences, since this makes it easier to maintain a baseline level of grammatical accuracy.
Informative units are copied directly to the summary in extractive models.

Traditional extractive models develop heuristics for scoring textual components by employing surface features (e.g., term frequency, text position, and critical keywords) as well as semantic relationships (e.g., discourse trees) among components in the document. TextRank is a case in point. Alternatively, extractive summarization can be formulated as a binary classification problem in which hand-crafted features are combined to predict whether a fragment should be included in the summary.
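As a rough illustration of the surface-feature idea, the sketch below scores sentences by the average frequency of their content words and copies the top ones into the summary. This is a toy heuristic, not TextRank; the stop-word list, the naive sentence splitter, and all function names are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger ones).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "it", "on"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentences(sentences):
    """Score each sentence by the average document frequency of its content words."""
    words = [[w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
             for s in sentences]
    freq = Counter(w for ws in words for w in ws)
    return [sum(freq[w] for w in ws) / len(ws) if ws else 0.0 for ws in words]


def extractive_summary(text, k=2):
    """Copy the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))


text = "Summarization shortens text. Summarization models select text. I like tea."
print(extractive_summary(text, k=1))
```

Because the output is copied verbatim, every sentence in the summary is as grammatical as the source, which is exactly the appeal of extractive models.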

  • Abstractive models: go a step further and try to resemble human-written summaries by paraphrasing the source, which requires more sophisticated linguistic understanding and even the incorporation of real-world knowledge. Because the task is so demanding, it was studied less actively in the past than extractive summarization.
Abstractive models may use new words/phrases out of the source vocabulary
to create more natural summaries.

The latest, and arguably the first, success in abstractive summarization stems from the seq2seq framework used in machine translation. During training, different learning objectives have been suggested to handle analogous summaries (those that are grammatically different yet semantically the same).

More on seq2seq: A seq2seq model converts an input sequence into an output sequence, both drawn from the same vocabulary, using a recurrent neural network (RNN), or more often an LSTM or GRU to mitigate the vanishing gradient problem. Because recurrent units process tokens one at a time, they become impractical as the source document gets longer. The parallelism of CNNs may help, but CNNs are not good at capturing long-term dependencies. Fortunately, the Transformer is here to stay. This architecture handles ordered sequences through positional encodings and replaces recurrent units with attention mechanisms, which improves parallelism. Most state-of-the-art approaches to abstractive summarization, namely ProphetNet, PEGASUS, and BART, are Transformer-based seq2seq models.
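To show how order information can be injected without recurrence, here is a small sketch of the fixed sinusoidal positional encoding from the original Transformer paper. It is one option among several (some models learn their position embeddings instead), and the function name is made up for this post.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal position vectors; d_model is assumed to be even."""
    table = []
    for pos in range(num_positions):
        row = []
        # i runs over the even dimensions; each gets a sin/cos pair.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


# Each row is added to the embedding of the token at that position,
# so all positions can be processed in parallel.
for row in positional_encoding(3, 4):
    print([round(value, 3) for value in row])
```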

  • Hybrid models: adopt a two-stage procedure of content selection and paraphrasing. Extractors are employed in content selection to identify important fragments in source documents, which further influence abstractors in generating summaries.
Hybrid models combine the advantages of speed and grammatical accuracy
in extractive summarization together with the fluency in the abstractive.

This combination often makes hybrid models non-differentiable end to end: we cannot tell whether an error was caused in the extractive or the abstractive phase. Hence, the two components are optimized separately during training.

First applied successfully in Pointer-Generator networks, copy mechanisms are often added to give models the ability to either generate new words or reuse words from the source document. This lets them handle out-of-vocabulary words efficiently.
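The core of a copy mechanism can be sketched in a few lines: the final word distribution mixes the probability of generating a word from the vocabulary with the probability of copying it from the source, weighted by a generation probability. The sketch below assumes those three quantities are already given (in a real network they are computed by the decoder), and the function and argument names are made up for this post.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copy probabilities into one distribution over words.

    p_gen:         probability of generating from the fixed vocabulary (0..1)
    vocab_dist:    dict mapping word -> probability under the vocabulary softmax
    attention:     attention weights over source positions (summing to 1)
    source_tokens: source words aligned with the attention weights
    """
    mixed = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy probability accumulates over every position where the word occurs,
        # so out-of-vocabulary source words still receive probability mass.
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


dist = final_distribution(
    p_gen=0.6,
    vocab_dist={"the": 0.5, "model": 0.5},
    attention=[0.5, 0.2, 0.3],
    source_tokens=["DANCER", "the", "DANCER"],
)
print(dist)
```

Here “DANCER” is not in the vocabulary at all, yet it ends up with a non-zero probability because it can be copied from the source.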

The long problem

Most work in the literature focuses on summarizing sentences and short documents (e.g., news articles and single passages). In practice, long documents (e.g., scientific papers, theses, and novels) are in greater need of good summaries, as they take more effort to sift through. The lack of studies on long document summarization is reflected in the small number of datasets dedicated to the problem (i.e., arXiv/PubMed and BIGPATENT).

Comparison of datasets for short document summarization (left)
and long document summarization (right)

Long documents introduce new problems to the process of summarization:

  • More noise: It is safe to say that a 5000-word document hardly ever has more than 10 main points. In other words, most of the document merely expands on a few central ideas and should therefore be ignored.
  • Scattered main points: Although there are only a few of them, the main points are scattered widely over the text, so a full-text scan is unavoidable if we want to extract all the important information.
  • More resources needed: As the text gets longer, encoding it before feeding it into a neural model takes more memory and computation.

Existing approaches. With these new problems, directly applying existing summarization approaches does not work very well. Four strategies have been proposed to address the issues with long documents:

  • Truncating the input to a fixed length. This is the most straightforward approach but often the least effective, as it discards text that may contain important information, given that the main points are scattered widely over the text.
  • Focusing on informative parts of the document. The hypothesis is that the main points are stated in a few “main” sections of the document, so we only need to summarize that subset of the original text.

Shortcomings: This is essentially a heuristic, and it does not generalize to documents that have no such dominant sections, for example a novel.

  • Hybrid models. Extracting only important portions of the text greatly reduces the problem space. This is a promising idea since we can take advantage of state-of-the-art results in both extractive and abstractive text summarization.

Shortcomings: State-of-the-art extractive models are themselves mostly neural networks, so the extractor still has to process the full document and the resource problem comes full circle.

  • Divide-and-conquer. In each base case, we summarize one portion of the original document. The partial summaries are then combined in some way to create the final summary. This method has several properties that help with the issues of long documents mentioned above: (1) by dividing a long text into manageable chunks, we can fully encode each chunk within our resource constraints; (2) the subproblems are solved independently, which makes parallelism and a full-text scan easier; (3) last but not least, state-of-the-art results in the field do not go to waste.

Shortcomings: Currently, each type of document requires a different dividing strategy. How to efficiently divide an arbitrary text into appropriate sections with relatively short length remains an open question that lacks general answers.
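The difference between truncation and divide-and-conquer is easy to see in code. The sketch below uses naive fixed-size chunks, which is only a placeholder for a real dividing strategy, and takes the per-chunk summarizer as an argument; all names are made up for this post.

```python
def truncate(tokens, max_len):
    """Keep only the first max_len tokens; everything after is discarded."""
    return tokens[:max_len]


def chunk(tokens, max_len):
    """Split the whole token list into consecutive chunks of at most max_len."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def divide_and_conquer(tokens, max_len, summarize):
    """Summarize every chunk independently, then concatenate the partial summaries."""
    partial = [summarize(piece) for piece in chunk(tokens, max_len)]
    return [token for part in partial for token in part]


document = "intro intro intro method method method result result".split()
keep_first_word = lambda piece: piece[:1]  # stand-in for a real summarizer

print(truncate(document, 3))                             # never sees the results
print(divide_and_conquer(document, 3, keep_first_word))  # touches every chunk
```

Since the chunks are independent, the calls to the summarizer could also run in parallel.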

Refined scope. Given the new issues in long document summarization and the current lack of one general strategy for all long documents, we narrow the scope of the problem to one type of document, namely scientific papers. This way, several features of scientific papers can be used to make the hard task of long document summarization much easier:

  • Natural structures: Most scientific papers are structured in a certain way, e.g. IMRaD (Introduction, Methods, Results, and Discussion).
  • The structures in scientific papers have been thoroughly designed in order to deliver different aspects of a method/problem in each section.
  • The length of each section is relatively small compared to the document length.

DANCER, a strategy for scientific paper summarization

As we focus on scientific paper summarization, divide-and-conquer becomes the most promising strategy, since its shortcomings are addressed by the natural structure of scientific papers. DANCER, proposed by Gidiotis and Tsoumakas in their paper “A Divide-and-Conquer Approach to the Summarization of Academic Articles”, is a variant of this strategy tailored to scientific papers. When combined with abstractive models, it has demonstrated competitive performance despite being a very simple idea.

DANCER is the strategy used in the latest SOTA in scientific paper summarization

DANCER for scientific paper summarization follows three phases:

  • Divide: Subproblems are identified based on existing structures in scientific papers, i.e., their disjoint sections.
Scientific papers often follow a specific structure,
e.g. IMRaD (Introduction, Methods, Results, and Discussion)

However, not all sections are used. Some are considered less informative than others and are therefore discarded during summarization. The original DANCER uses only four sections of a paper: Introduction, Methods, Results, and Conclusion. Each selected section as a whole forms one training example. More on the implementation: informative sections are identified by keywords in their titles.
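Keyword-based section selection could look roughly like this. The keyword lists below are hypothetical examples chosen for this post, not the lists used in the original paper, and the function names are made up as well.

```python
# Hypothetical keyword lists; the paper's exact lists are not reproduced here.
SECTION_KEYWORDS = {
    "introduction": ["introduction"],
    "methods": ["method", "approach"],
    "results": ["result", "experiment"],
    "conclusion": ["conclusion"],
}


def classify_heading(heading):
    """Return the section type whose keyword appears in the heading, else None."""
    lowered = heading.lower()
    for section_type, keywords in SECTION_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return section_type
    return None


def select_sections(paper):
    """Keep (type, text) pairs for sections with an informative heading.

    `paper` is a list of (heading, text) tuples in document order.
    """
    selected = []
    for heading, text in paper:
        section_type = classify_heading(heading)
        if section_type is not None:
            selected.append((section_type, text))
    return selected


paper = [
    ("1 Introduction", "..."),
    ("2 Related Work", "..."),
    ("3 Proposed Method", "..."),
    ("4 Experimental Results", "..."),
    ("5 Conclusions", "..."),
]
print([section_type for section_type, _ in select_sections(paper)])
```

Substring matching is deliberately loose, so “Conclusions” and “Experimental Results” are both caught, while “Related Work” is dropped.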

  • Conquer: Since each section is relatively short, we can treat each subproblem as short document summarization. Thus, current state-of-the-art abstractive models can be used to create a summary for each section.
Base case problems are solved independently using an abstractive summarization model

Each section needs its own target summary. As the model should not generate text beyond the scope of the given section, using the whole reference summary as the target for every subproblem is not appropriate. The original DANCER tackles this problem by using ROUGE-L scores to match segments of the reference summary with the appropriate sections. More on the implementation: ROUGE-L is computed between every sentence in the reference summary and every sentence in the document, and each summary sentence is added to the target summary of the section containing the document sentence that gives the highest score.
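The matching step can be sketched as follows. This version assumes whitespace tokenization and uses the ROUGE-L F1 score; which ROUGE-L variant (precision, recall, or F1) to use is an assumption here, and the function names are made up for this post.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference_tokens, candidate_tokens):
    """ROUGE-L F1 between two token lists (0 when there is no overlap)."""
    lcs = lcs_length(reference_tokens, candidate_tokens)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference_tokens)
    precision = lcs / len(candidate_tokens)
    return 2 * precision * recall / (precision + recall)


def assign_targets(summary_sentences, sections):
    """Map each reference-summary sentence to its best-matching section.

    `sections` maps a section name to its list of sentences. Returns a dict
    mapping each section name to the summary sentences assigned to it.
    """
    targets = {name: [] for name in sections}
    for summary_sentence in summary_sentences:
        summary_tokens = summary_sentence.lower().split()
        best_name, best_score = None, -1.0
        for name, sentences in sections.items():
            for sentence in sentences:
                score = rouge_l_f1(sentence.lower().split(), summary_tokens)
                if score > best_score:
                    best_name, best_score = name, score
        if best_name is not None:
            targets[best_name].append(summary_sentence)
    return targets


sections = {
    "introduction": ["long documents are hard to summarize", "we study scientific papers"],
    "results": ["our model improves rouge scores on arxiv"],
}
summary = ["summarizing long documents is hard", "the model improves rouge scores"]
print(assign_targets(summary, sections))
```

Each section then gets a target summary made only of the reference sentences that are closest to its own content.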

  • Combine: In the original approach, DANCER employs simple concatenations to combine partial summaries into a complete one. Without further actions, the final summary often lacks the desired fluency and suffers from repetition.
Partial summaries are concatenated in the original order to create the final summary.

Our approach: DANCER+BART

In the original paper, the authors pair DANCER with a simple RNN-based seq2seq model, namely Pointer-Generator, to summarize each section. This choice of summarization model leaves plenty of room for improvement. In particular, we can replace it with BART or PEGASUS, state-of-the-art pre-trained Transformer-based models, for partial abstractive summarization.
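Because DANCER treats the section summarizer as a black box, swapping models is mostly a matter of plugging in a different function. The sketch below illustrates this with a trivial stand-in summarizer; in practice `summarize_section` would be a hypothetical wrapper around a fine-tuned BART or PEGASUS model. All names here are made up for this post.

```python
SECTION_ORDER = ["introduction", "methods", "results", "conclusion"]


def dancer_summarize(sections, summarize_section):
    """Summarize each selected section independently and concatenate the results.

    sections:          dict mapping a section type to its text
    summarize_section: any callable str -> str (e.g. a wrapper around a
                       fine-tuned abstractive model)
    """
    partial_summaries = []
    for section_type in SECTION_ORDER:
        text = sections.get(section_type)
        if text:  # skip sections the paper does not have
            partial_summaries.append(summarize_section(text).strip())
    return " ".join(partial_summaries)


def first_sentence(text):
    """Stand-in summarizer used only to make this sketch runnable."""
    return text.split(". ")[0].rstrip(".") + "."


sections = {
    "results": "We gain two points. Details follow.",
    "introduction": "Papers are long. Reading is slow.",
}
print(dancer_summarize(sections, first_sentence))
```

The partial summaries come out in the paper's section order regardless of the order in which the sections were passed in.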

More on BART: BART is a pre-trained sequence-to-sequence model combining bidirectional and autoregressive Transformers. This architecture employs recent advances in NLP, including:

  • Denoising autoencoders: To force the hidden layers to discover more robust features, a denoising autoencoder is trained to reconstruct the original text from a corrupted version of it. The corrupted text is produced by a noising function, for example one that masks out random subsets of words.
    Experimental results suggest that text infilling gives the most consistently strong performance. This noising scheme samples a number of text spans, with span lengths drawn from a Poisson distribution, and replaces each span with a single [MASK] token. Text infilling generalizes the token masking proposed in BERT while forcing the model to also reason about how many tokens are missing from each span.
  • The bidirectional encoder in BERT: The document is encoded bidirectionally (left-to-right and right-to-left), so the model learns the context of a word from all of its surroundings. This way, missing tokens can be predicted independently. This independence, however, does not fit the sequential nature of generation tasks such as summarization.
BART: The original document is corrupted using an in-filling scheme and encoded by a bidirectional encoder (left). Then, the autoregressive decoder calculates the probability of a word being the next word to construct the complete summary (right).
  • Autoregressive decoder in GPT: Causal attention masks are employed in the autoregressive decoder to restrict the information used in predicting a token. That is, the prediction of the current token depends only on the tokens before it; in other words, tokens are predicted autoregressively. This property is highly desirable for sequence generation tasks such as summarization. However, since words are conditioned only on the leftward context, the decoder cannot learn bidirectional interactions like BERT.
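To get a feel for text infilling, here is a simplified sketch of the noising step. The default values (λ = 3, about 30% of tokens removed), the left-to-right scan, and the rule for when to start a span are assumptions made for this example and do not reproduce any particular implementation; the function names are made up as well.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold, count, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def text_infilling(tokens, mask_ratio=0.3, lam=3.0, seed=0, mask_token="[MASK]"):
    """Replace randomly chosen spans of tokens with one mask token each."""
    rng = random.Random(seed)
    budget = int(round(len(tokens) * mask_ratio))  # max tokens we may remove
    start_prob = mask_ratio / lam                  # heuristic chance to start a span
    corrupted, removed, i = [], 0, 0
    while i < len(tokens):
        if removed < budget and rng.random() < start_prob:
            span = min(sample_poisson(lam, rng), budget - removed, len(tokens) - i)
            corrupted.append(mask_token)  # a single mask, whatever the span length
            removed += span
            i += span
            if span == 0:
                # Zero-length span: the mask is inserted and no token is dropped.
                corrupted.append(tokens[i])
                i += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted


tokens = "the quick brown fox jumps over the lazy dog near the quiet river bank".split()
print(" ".join(text_infilling(tokens, seed=1)))
```

Since one [MASK] can stand for zero, one, or several tokens, the model cannot simply count masks to know how much text to restore.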

By combining the advantages of these paradigms, BART achieved new state-of-the-art results in short document summarization when it was published, with gains of up to 6 ROUGE points.

More on PEGASUS: The authors hypothesize that the closer the pre-training objective is to the downstream generative task, the better the fine-tuning performance. Based on this, they propose a new self-supervised pre-training objective called gap-sentence generation. In contrast to BART, PEGASUS masks whole sentences rather than contiguous text spans. Gap sentences are not sampled randomly but chosen deterministically by importance, which is measured by the ROUGE score of each sentence against the rest of the document.
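Gap-sentence selection can be sketched like this. The example assumes ROUGE-1 F1 as the importance score, whitespace tokenization, and independent scoring of each sentence; the mask token string is a placeholder, and the function names are made up for this post.

```python
from collections import Counter


def rouge1_f1(candidate_tokens, reference_tokens):
    """ROUGE-1 F1 from clipped unigram overlap (0 when there is no overlap)."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_gap_sentences(sentences, num_gaps, mask_token="<mask_sentence>"):
    """Mask the num_gaps sentences most similar to the rest of the document.

    Returns (corrupted_input, target), where the target is the list of
    masked sentences in document order.
    """
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, tokens in enumerate(tokenized):
        rest = [tok for j, other in enumerate(tokenized) if j != i for tok in other]
        scores.append(rouge1_f1(tokens, rest))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:num_gaps])
    corrupted = [mask_token if i in chosen else s for i, s in enumerate(sentences)]
    target = [s for i, s in enumerate(sentences) if i in chosen]
    return corrupted, target


sentences = [
    "transformers summarize long papers",
    "long papers need summaries",
    "transformers summarize papers well",
    "i had coffee",
]
corrupted, target = select_gap_sentences(sentences, num_gaps=1)
print(corrupted)
print(target)
```

The model is then trained to generate the target from the corrupted input, which looks a lot like writing a short summary of the document.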

Replacing the current RNN-based seq2seq models with BART or PEGASUS
is expected to improve the partial summaries.


The idea behind divide-and-conquer methods such as DANCER is simple, yet it shows promising results in scientific paper summarization. To advance the current work further, one may focus on the following problems:

  • Identifying informative sections: Each type of document is structured differently and therefore has different “main” sections (or none at all). Further studies are needed to support the choice of which sections to include in summarization. Moreover, one may want to develop better techniques for locating the main sections of a document instead of depending solely on keyword overlap.
  • Combining strategy: Simple concatenation produces a summary that lacks fluency and coherence between sentences. Post-processing methods are an obvious improvement for future work.
  • Metrics in ATS: Given the inflexibility of ROUGE, research into new, efficient automatic metrics should lead to further advances in automatic text summarization.


Zhang, Jingqing, et al. “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization.” arXiv preprint arXiv:1912.08777 (2019).

Lewis, Mike, et al. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.” arXiv preprint arXiv:1910.13461 (2019).

Gidiotis, Alexios, and Grigorios Tsoumakas. “A Divide-and-Conquer Approach to the Summarization of Academic Articles.” arXiv preprint arXiv:2004.06190 (2020).

Lin, Chin-Yew. “Looking for a few good metrics: Automatic summarization evaluation-how many samples are enough?.” NTCIR. 2004.


Sample partial summaries employed in DANCER+BART

— — — — — —

About Camnhung: an extraordinary girl who is participating in Cinnamon Student AI Bootcamp 2020. Her main focus in Bootcamp is NLP.
About “Bootcamp Student AI Bootcamp 2020: Ideas to Reality”: this is a scholarship program with a new format that gives young people in the AI/Deep Learning field a solid foundation to put their ideas into practice and develop their own products from scratch. More info: here.