Papers
arxiv:2406.03202

ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction

Published on Jun 5, 2024
Authors:
,
,

Abstract

A framework with Subject Selector, Grammar Selector, Prompt Manager, and Evaluator improves LLMs for generating grammatical error correction data using a new dataset, ChatLang-8.

AI-generated summary

We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT's data generation capabilities.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2406.03202 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2406.03202 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2406.03202 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.