Fine-tuning pre-trained models on domain-specific datasets has been the leading paradigm in text summarization research in recent years. These models generate high-quality summaries on standard benchmarks but still require sizeable training datasets. The success of prompt-based models, e.g. GPT-3, provides a promising alternative: such models learn from natural language task instructions and/or a few demonstrative examples in the context instead of updating model parameters. Here, we systematically study how these two paradigms compare. We show that not only do humans overwhelmingly prefer GPT-3 summaries, but these summaries also do not suffer from common dataset-specific issues such as lead bias or poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics, e.g. recently proposed QA- or entailment-based factuality approaches, cannot reliably evaluate zero-shot summaries.

To support further research, we release:
    (a) 1K human preference judgments and rationales comparing different systems for generic and keyword-based summarization.
    (b) A corpus of 10K generated summaries from fine-tuned and zero-shot models across 4 standard summarization benchmarks.

Browse Human Annotations for news articles from 2022

This set contains 100 articles from CNN and BBC each, scraped between March 1, 2022 and June 30, 2022. For each article, summaries are generated using three systems:
    (1) OpenAI's text-davinci-002 **
    (2) fine-tuned BRIO
    (3) T0

For each article, we obtain best/worst summary judgments from three unique human annotators.
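
As a rough illustration of how such best/worst judgments could be aggregated, the sketch below tallies the fraction of judgments in which each system was picked as best or worst. The record layout and the field names (`article_id`, `annotator`, `best`, `worst`) are assumptions made for this example and may not match the schema of the released files.

```python
from collections import Counter

# Hypothetical records: one dict per (article, annotator) judgment.
# Field names are illustrative assumptions, not the released schema.
judgments = [
    {"article_id": "cnn-001", "annotator": "a1", "best": "gpt3", "worst": "t0"},
    {"article_id": "cnn-001", "annotator": "a2", "best": "gpt3", "worst": "t0"},
    {"article_id": "cnn-001", "annotator": "a3", "best": "brio", "worst": "t0"},
]


def tally_votes(records):
    """Return, per system, the fraction of judgments naming it best or worst."""
    if not records:
        return {}
    best = Counter(r["best"] for r in records)
    worst = Counter(r["worst"] for r in records)
    total = len(records)
    systems = sorted(set(best) | set(worst))
    return {
        s: {"best": best[s] / total, "worst": worst[s] / total}
        for s in systems
    }


rates = tally_votes(judgments)
for system, r in rates.items():
    print(f"{system}: best={r['best']:.2f} worst={r['worst']:.2f}")
```

Counting per judgment rather than per article keeps all three annotators' votes visible; a per-article majority vote would be an equally reasonable alternative.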


Browse generated summaries for benchmark datasets

For four benchmark summarization datasets (CNN, DailyMail, XSum, Newsroom), we randomly sample 500 examples from each standard test set. We provide generated summaries from 4 different summarization systems to support future work and standardize test sets.
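
For readers who want to draw a comparable subset themselves, the following is a minimal sketch of reproducible random sampling from a test set. The function name `sample_test_subset`, the seed value, and the identifier format are assumptions for illustration; the sampling procedure and seed actually used for the released subsets are not stated here, so this will not reproduce them.

```python
import random


def sample_test_subset(test_ids, k=500, seed=0):
    """Pick k distinct ids at random, reproducibly for a given seed.

    The ids are sorted first so the result does not depend on the order
    in which the caller happens to supply them.
    """
    rng = random.Random(seed)  # local generator; leaves global state untouched
    return rng.sample(sorted(test_ids), k)


# Hypothetical ids standing in for one benchmark's test set.
example_ids = [f"xsum-{i}" for i in range(2000)]
subset = sample_test_subset(example_ids)
print(len(subset))
```

Publishing the sampled ids alongside the summaries, as done here, is what lets later work evaluate on exactly the same examples regardless of how the sample was drawn.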

** If text-davinci-002 generates a numbered list, we post-process the output and remove the numbering to align it with the outputs of the other summarization systems. This is similar to how the CNN/DM dataset was constructed from a list of bullet points.
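
The post-processing described in this note can be sketched as follows. This is an illustrative approximation, not the authors' script: the regular expression, the function name `strip_numbering`, and the choice to join the list items with single spaces are all assumptions.

```python
import re

# Matches a leading list marker such as "1.", "2)" or "(3)" followed by
# whitespace. Decimals like "3.5 million" are left alone because the
# marker must be followed by a space.
_NUMBERING = re.compile(r"^\s*\(?\d+[.)]\s+")


def strip_numbering(summary: str) -> str:
    """Remove leading numbering from each line and join lines into one string."""
    lines = [ln for ln in summary.splitlines() if ln.strip()]
    cleaned = [_NUMBERING.sub("", ln).strip() for ln in lines]
    return " ".join(cleaned)


print(strip_numbering("1. The council approved the plan.\n2. Work starts in May."))
```

Summaries that contain no list markers pass through unchanged apart from whitespace normalization across lines.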


Citation: Tanya Goyal, Junyi Jessy Li, and Greg Durrett. "News Summarization and Evaluation in the Era of GPT-3." arXiv preprint.

If you have any questions, please contact Tanya Goyal: tanyagoyal@utexas.edu