PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India
Abstract
This paper introduces PMIndiaSum, a new multilingual and massively parallel headline summarization corpus focused on languages in India. Our corpus covers four language families, 14 languages, and the largest to date, 196 language pairs. It provides a testing ground for all cross-lingual pairs. We detail our workflow to construct the corpus, including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization by fine-tuning, prompting, as well as translate-and-summarize. Experimental results confirm the crucial role of our data in aiding the summarization of Indian texts. Our dataset is publicly available and can be freely modified and re-distributed.
Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper