Automatic Text Summarization

Updated May 2026
Automatic text summarization uses NLP to produce shorter versions of documents that retain the most important information. Extractive summarization selects and combines the most relevant sentences from the original text, while abstractive summarization generates new sentences that paraphrase and condense the content. Modern summarization systems use transformer models that read a document (or as much of it as fits in the model's context window) and generate fluent, largely accurate summaries, powering applications from news digests and research paper abstracts to meeting notes and legal document review.

Extractive vs. Abstractive Summarization

Extractive summarization works by scoring each sentence in the source document for importance, then selecting the top-scoring sentences to form the summary. The output is a subset of the original sentences, unchanged. This approach guarantees that every sentence in the summary appears verbatim in the source, so the system cannot fabricate content, and it preserves the original author's wording. The main limitation is that extractive summaries can feel choppy because the selected sentences were written to function within their original context, not as a standalone summary. Transitions between selected sentences may be awkward, pronouns may lose their antecedents, and important information that spans multiple sentences may be captured incompletely.

Abstractive summarization generates new text that paraphrases, compresses, and reorganizes the source content. A human summarizer does this naturally: they read a document, understand the key points, and express those points in their own words, often combining information from multiple sentences, dropping unnecessary details, and restructuring the presentation. Neural abstractive summarizers learn to mimic this process. The output reads more naturally than extractive summaries and can express information more concisely because it is not constrained to reproduce exact source sentences. The risk is hallucination: the model may generate plausible-sounding content that was not in the original document or that distorts the original meaning.

In practice, many modern systems are hybrid, using extractive methods to identify important content regions and abstractive methods to rephrase and combine them into coherent summaries. Large language models performing summarization typically operate in fully abstractive mode, generating summaries that draw on the source document provided in their context window.

How Extractive Summarization Works

Early extractive methods used simple heuristics: sentences at the beginning of a document are more likely to be important (the lead bias, which works well for news articles), sentences containing title words are more relevant, and sentences with high TF-IDF scores contain distinctive content. TextRank, introduced in 2004, applied Google's PageRank algorithm to a graph of sentences: each sentence is a node, edges are weighted by sentence similarity (word overlap in the original formulation), and importance propagates through the graph so that sentences similar to many other important sentences rank highest.
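The graph-ranking idea can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than the published TextRank implementation: the sentence splitter is naive, similarity is computed over sets of lowercase words without stopword removal, and the function names, damping factor default, and iteration count are assumptions chosen for the example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Set of lowercase alphanumeric words in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Shared words, normalized by the log of each sentence's word count."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) < 2 or len(tb) < 2:
        return 0.0  # avoids log(1) == 0 in the denominator
    return len(ta & tb) / (math.log(len(ta)) + math.log(len(tb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence-similarity graph."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def summarize(text, k=2):
    """Return the k highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the rest of the document receives no incoming weight, so its score stays at the floor of 1 - damping and it is the last to be selected.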

Neural extractive summarizers use BERT or similar transformers to produce contextualized representations of each sentence, then apply a classification layer to predict whether each sentence should be included in the summary. The model is trained on document-summary pairs where the target labels are derived by finding which source sentences best match the reference summary (using ROUGE overlap as the matching criterion). BertSumExt, published in 2019, achieves ROUGE-1 scores above 43 on the CNN/DailyMail benchmark, competitive with abstractive methods on this dataset.
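Deriving the training labels is often done greedily: keep adding the source sentence that most improves overlap with the reference summary, and stop when nothing helps. The sketch below illustrates that idea under simplifying assumptions: it uses a plain unigram F1 as a stand-in for ROUGE (published systems typically combine several ROUGE variants), and the function names and the `max_sentences` cap are chosen for illustration.

```python
import re
from collections import Counter


def unigrams(text):
    """Multiset of lowercase alphanumeric words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    cand, ref = unigrams(candidate), unigrams(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_oracle_labels(sentences, reference, max_sentences=3):
    """Return one 0/1 label per source sentence for training a classifier."""
    selected = []
    best = 0.0
    while len(selected) < max_sentences:
        trials = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the summary that would result from adding sentence i.
            candidate = " ".join(sentences[j] for j in sorted(selected + [i]))
            trials.append((rouge1_f1(candidate, reference), i))
        if not trials:
            break
        score, idx = max(trials)
        if score <= best:
            break  # no remaining sentence improves the overlap
        best = score
        selected.append(idx)
    return [1 if i in selected else 0 for i in range(len(sentences))]
```

The resulting binary labels are what the classification layer is trained to predict; sentences that add length without adding overlap lower precision and are therefore left out.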

Sentence selection must also consider redundancy. Selecting the three most important sentences is useless if they all convey the same information. Maximum Marginal Relevance (MMR) addresses this by penalizing sentences that are similar to already-selected sentences. The algorithm iteratively selects the sentence that maximizes a combination of relevance to the document and dissimilarity to the current summary. This simple technique significantly improves summary quality by ensuring coverage of different topics and aspects.
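The MMR selection loop is short enough to show in full. The following is a minimal sketch, assuming Jaccard word overlap as the similarity measure and overlap with the whole document as the relevance score; a production system would more likely use TF-IDF or embedding similarity. The trade-off parameter `lam` and the function names are illustrative.

```python
import re


def tokens(text):
    """Set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mmr_select(sentences, k=3, lam=0.7):
    """Return indices of up to k sentences, in selection order.

    lam = 1.0 ranks purely by relevance; lower values penalize
    similarity to sentences that were already selected.
    """
    document = " ".join(sentences)
    relevance = [jaccard(s, document) for s in sentences]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate sentences and one unrelated sentence, pure relevance ranking (`lam=1.0`) tends to pick both duplicates, while a lower `lam` swaps the second duplicate for the sentence that covers new ground.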

How Abstractive Summarization Works

Modern abstractive summarization uses encoder-decoder transformer models. The encoder processes the source document and produces contextualized representations. The decoder generates the summary token by token, attending to the encoder's representations at each step. The model learns to identify what is important, how to paraphrase it concisely, and how to organize the summary coherently, all from training on document-summary pairs.

BART (Bidirectional and Auto-Regressive Transformers), published by Facebook AI in 2019, became a standard model for abstractive summarization. BART is pre-trained by corrupting text (deleting, shuffling, and masking spans) and training the model to reconstruct the original. This pre-training objective teaches the model to understand corrupted inputs and produce clean outputs, which transfers well to summarization where the input is a long, redundant document and the output is a concise summary. Fine-tuned on CNN/DailyMail, BART achieves ROUGE-1 scores above 44, surpassing extractive methods.

T5 (Text-to-Text Transfer Transformer), published by Google in 2019, frames every NLP task as text-to-text transformation. For summarization, the input is "summarize: [document text]" and the output is the summary. This uniform framework allows a single model to handle summarization, translation, question answering, and classification. Pegasus, published by Google in 2020, introduced a pre-training objective specifically designed for summarization: masking entire sentences from a document and training the model to generate them. This gap-sentence generation pre-training produced state-of-the-art results on multiple summarization benchmarks, demonstrating that task-aligned pre-training significantly improves downstream performance.
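To make the gap-sentence objective concrete, the sketch below builds one (input, target) training pair from a list of sentences. It is an illustration under stated assumptions, not the Pegasus preprocessing code: sentences are scored by unigram F1 against the rest of the document as a rough importance proxy, and the `MASK_TOKEN` string, the `mask_ratio` default, and the function names are placeholders chosen for this example.

```python
import re
from collections import Counter

MASK_TOKEN = "<mask_1>"  # placeholder sentence-mask symbol for illustration


def unigram_f1(a, b):
    """Unigram-overlap F1 between two texts."""
    ca = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    cb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(ca.values()), overlap / sum(cb.values())
    return 2 * p * r / (p + r)


def make_gsg_example(sentences, mask_ratio=0.3):
    """Mask the sentences most similar to the rest of the document.

    Returns (source, target): the document with masked sentences replaced
    by MASK_TOKEN, and the masked sentences joined in document order.
    """
    n_mask = max(1, round(len(sentences) * mask_ratio))
    scored = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scored.append((unigram_f1(sentence, rest), i))
    masked = {i for _, i in sorted(scored, reverse=True)[:n_mask]}
    source = " ".join(MASK_TOKEN if i in masked else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target
```

Because the masked sentences are the ones that best represent the remaining text, generating them resembles writing a summary, which is the intuition behind aligning the pre-training task with summarization.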

Large language models like GPT-4 and Claude perform abstractive summarization without task-specific fine-tuning on summarization benchmarks. Given a document and a prompt like "Summarize this article in three paragraphs," these models produce coherent, generally accurate summaries by leveraging their general language understanding. The quality often matches or exceeds fine-tuned models, especially for documents that differ from the genres in fine-tuning datasets. The tradeoff is cost and speed: running a large language model is more expensive per document than running a specialized summarization model.

Evaluating Summary Quality

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard automatic metric for summarization. ROUGE-1 measures the overlap of individual words (unigrams) between the generated summary and reference summaries. ROUGE-2 measures bigram overlap. ROUGE-L measures the longest common subsequence. High ROUGE scores indicate that the generated summary shares much of its wording with the reference, which is taken as a proxy for covering the same information. On the CNN/DailyMail benchmark, state-of-the-art models achieve ROUGE-1 scores around 44 to 47, ROUGE-2 around 21 to 23, and ROUGE-L around 40 to 43.
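The three variants can be computed with a short standard-library sketch. It is a simplified illustration, not the official ROUGE toolkit: it lowercases and splits on non-alphanumeric characters, applies no stemming, compares against a single reference, and reports a balanced F1 on a 0 to 1 scale (published scores are usually multiplied by 100). The function names are chosen for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric word tokens, in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Balanced F1 from an overlap count and the two totals."""
    if overlap == 0:
        return 0.0
    p, r = overlap / cand_total, overlap / ref_total
    return 2 * p * r / (p + r)


def rouge_n(candidate, reference, n):
    """N-gram overlap F1 (n=1 for ROUGE-1, n=2 for ROUGE-2)."""
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((cand & ref).values())  # clipped counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Longest-common-subsequence F1."""
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(cand, ref), len(cand), len(ref))
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", five of six unigrams and three of five bigrams match, so ROUGE-1 is 5/6 and ROUGE-2 is 3/5; the longest common subsequence has five tokens, giving ROUGE-L of 5/6.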

ROUGE has significant limitations. It rewards lexical overlap with the reference, penalizing valid summaries that express the same information differently. A summary that perfectly captures all key points using different vocabulary would receive a low ROUGE score. ROUGE also cannot evaluate factual consistency: a summary that contains hallucinated facts may still achieve high ROUGE if it otherwise overlaps well with the reference. BERTScore and other embedding-based metrics address the vocabulary sensitivity issue by comparing summaries in embedding space rather than at the surface level.

Factual consistency evaluation has become a critical focus. Studies of neural abstractive systems have reported factual errors in roughly 25% to 30% of generated summaries: misrepresented source content, unsupported claims, or confused entities. Specialized factual consistency metrics compare claims in the summary against evidence in the source document, using NLI (natural language inference) models to detect contradictions. These metrics are essential for applications where summary accuracy matters, like medical literature review, legal document summarization, and financial report analysis.

Applications and Challenges

News summarization is the most studied and most commercially deployed application. News aggregators generate summaries of breaking stories by combining information from multiple sources. Individual articles are summarized for mobile notifications and email newsletters where space is limited. The lead-bias property of news articles (the most important information appears first) makes news summarization somewhat easier than summarizing other genres, and this bias is reflected in benchmarks: simply selecting the first three sentences of a CNN/DailyMail article produces a strong baseline that many models struggle to significantly beat.
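The lead baseline mentioned above takes only a few lines, which is part of why it is such a useful sanity check. This is a minimal sketch assuming a naive punctuation-based sentence splitter; the function name `lead_k` is chosen for illustration.

```python
import re


def lead_k(text, k=3):
    """Return the first k sentences of a document as its summary."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    return " ".join(sentences[:k])
```

Any proposed news summarizer can be compared against `lead_k(article)` on the same metric; if it does not clearly beat this baseline, the model has learned little beyond position.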

Scientific paper summarization generates abstracts from full papers or produces lay summaries that make research accessible to non-specialists. This is particularly challenging because scientific text contains domain-specific terminology, mathematical notation, references to other work, and complex argumentative structures. Models must distinguish between background information, methodology, results, and conclusions, weighting each appropriately in the summary. The Semantic Scholar platform uses summarization to help researchers quickly assess paper relevance.

Meeting summarization converts transcripts of meetings into action items, decisions, and discussion summaries. This application is growing rapidly as remote work has increased the volume of recorded meetings. The challenges are unique: meeting transcripts contain multiple speakers, tangential discussions, repetitions, and informal language. The summarizer must identify what was decided (not just discussed), who committed to what actions, and what topics were covered versus which were deferred. Commercially deployed meeting summarizers from Otter.ai, Zoom, and Microsoft Teams use combinations of speech recognition, speaker diarization (identifying who spoke when), and abstractive summarization to produce structured meeting notes.

Long-document summarization presents a fundamental challenge for transformer models, whose self-attention cost grows quadratically with input length. A 50-page legal document might contain 20,000 tokens, exceeding the context window of many models. Approaches include hierarchical summarization (summarize sections individually, then summarize the section summaries), sliding window approaches (process overlapping chunks of the document), sparse-attention architectures such as Longformer and BigBird, and models with 100K+ token context windows. The quality of long-document summarization has improved dramatically with the increase in context window sizes, but even models with 100K+ token windows struggle to attend equally to all parts of very long inputs.
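The sliding-window and hierarchical strategies combine naturally, as the sketch below shows. It assumes the document is already tokenized into a list and that a `summarize` callable (hypothetical here, standing in for any model call) maps a list of tokens to a shorter list; the window and overlap sizes are illustrative defaults.

```python
def sliding_windows(tokens, window=512, overlap=64):
    """Split tokens into chunks of up to `window` items that overlap."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


def hierarchical_summarize(tokens, summarize, window=512, overlap=64):
    """Summarize chunk by chunk until the text fits in a single window.

    `summarize` is a caller-supplied function (an assumption of this
    sketch) that takes a token list and returns a shorter token list.
    """
    while len(tokens) > window:
        partial = []
        for chunk in sliding_windows(tokens, window, overlap):
            partial.extend(summarize(chunk))
        if len(partial) >= len(tokens):
            raise ValueError("summarize must shorten its input")
        tokens = partial
    return summarize(tokens)
```

The overlap gives each chunk some context from its neighbor, at the cost of summarizing a few tokens twice. Each extra level of the hierarchy also adds a chance for errors in an intermediate summary to carry into the final one, which is one reason longer context windows are attractive when available.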

Key Takeaway

Automatic summarization condenses documents by either selecting important sentences (extractive) or generating new condensed text (abstractive), with modern systems achieving fluent, mostly accurate summaries while still working to eliminate factual hallucinations.