arxiv:2501.18128

Unraveling the Capabilities of Language Models in News Summarization

Published on Jan 30
· Submitted by odabashi on Feb 3
Abstract

Given the recent introduction of multiple language models and the ongoing demand for improved performance on Natural Language Processing tasks, particularly summarization, this work provides a comprehensive benchmark of 20 recent language models, focusing on smaller ones, for the news summarization task. We systematically test the capabilities and effectiveness of these models in summarizing news articles written in different styles and drawn from three distinct datasets. We focus on zero-shot and few-shot learning settings and apply a robust evaluation methodology that combines several evaluation concepts, including automatic metrics, human evaluation, and LLM-as-a-judge. Interestingly, including demonstration examples in the few-shot setting did not enhance the models' performance and in some cases even degraded the quality of the generated summaries, mainly because of the poor quality of the gold summaries used as references, which negatively impacts the models' performance. Furthermore, our results highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate thanks to their advanced capabilities. Among the public models evaluated, however, models such as Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B, and Zephyr-7B-Beta showed promising results and significant potential, positioning them as competitive alternatives to large models for news summarization.
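As a rough illustration of the two prompting settings compared in the abstract (zero-shot vs. few-shot) and of the kind of automatic metric commonly used for summarization, here is a minimal sketch in plain Python. The prompt templates and the simplified unigram ROUGE-1 implementation are illustrative assumptions, not the paper's exact templates or evaluation toolkit.

```python
from collections import Counter


def zero_shot_prompt(article: str) -> str:
    # Zero-shot: the model sees only the task instruction and the article.
    return (
        "Summarize the following news article in a few sentences.\n\n"
        f"Article: {article}\n\nSummary:"
    )


def few_shot_prompt(article: str, demos: list[tuple[str, str]]) -> str:
    # Few-shot: demonstration (article, gold summary) pairs precede the query.
    # The paper reports that noisy gold summaries in such demonstrations can
    # actually degrade the generated summaries.
    parts = ["Summarize the following news article in a few sentences.\n"]
    for demo_article, demo_summary in demos:
        parts.append(f"Article: {demo_article}\nSummary: {demo_summary}\n")
    parts.append(f"Article: {article}\nSummary:")
    return "\n".join(parts)


def rouge1_f1(candidate: str, reference: str) -> float:
    # Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization,
    # no stemming or stopword handling (real toolkits do more).
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

The resulting prompt string would be passed to whichever of the benchmarked models is under test, and the generated summary scored against the dataset's reference summary.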

Community


In this work, we test 20 different language models to see how they handle news summarization. And trust me, the findings are eye-opening!

Every day, a huge number of news articles are published, each often containing lengthy and detailed contexts. The massive amount of information being produced makes it increasingly challenging for individuals to stay up-to-date with current events.

You know how everyone's talking about ChatGPT and other huge AI models? Well, that got me thinking: do we always need these giants? So I dove deep into comparing both smaller and larger models to see how they'd handle turning lengthy news articles into concise summaries.

And guess what? While the big players like GPT-4 did shine (no surprise there!), some smaller models turned out to be hidden gems. Models like Qwen1.5, SOLAR and a few others proved that sometimes great things come in smaller packages.

Interested in getting into the details? Check out our preprint on arXiv: https://arxiv.org/abs/2501.18128
