CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
Abstract
Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.
Community
Check out CMHG, one of the first benchmarks for text generation in three Chinese minority languages: Tibetan, Uyghur, and Traditional Mongolian. This work has been accepted to the EMNLP 2025 main conference, and we hope to raise awareness of such underrepresented languages in NLP 🤗
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages (2025)
- IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian (2025)
- Exploring NLP Benchmarks in an Extremely Low-Resource Setting (2025)
- WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai (2025)
- SEA-BED: Southeast Asia Embedding Benchmark (2025)
- VN-MTEB: Vietnamese Massive Text Embedding Benchmark (2025)
- Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks (2025)