arxiv:2311.09807

The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

Published on Nov 16, 2023

Authors:

Guokan Shang ,

Abstract

This study investigates the consequences of training large language models (LLMs) on synthetic data generated by their predecessors, an increasingly prevalent practice aimed at addressing the limited supply of human-generated training data. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we developed a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive fine-tuning experiments across various natural language generation tasks. Our findings reveal a marked decrease in the diversity of the models' outputs through successive iterations. This trend underscores the potential risks of training LLMs on predecessor-generated text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of LLMs.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

Your need to confirm your account before you can post a new comment.

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2311.09807 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2311.09807 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2311.09807 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.