metadata

license: apache-2.0

chinese_text_correction

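A Chinese text correction dataset containing spelling and grammar correction data. It can be used to train Chinese proofreading models.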

Repository: zejunwang1/CTCDataset

Data distribution

Source	Type	Samples
CCTC	grammar	4470
cscd-ns	spell	40000
CTC2021	grammar	969
ECSpell	spell	8180
lemon	spell	22252
MCSCSet	spell	39302
midu2022	grammar	2014
NLPCC2023	spell	1000
Total	—	118187
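
As a quick sanity check on the table above, the counts can be re-added by error type. This is only a sketch: the `COUNTS` dictionary is a hand-copied transcription of the table, not something shipped with the dataset, and `totals_by_type` is a hypothetical helper name.

```python
from collections import defaultdict

# Counts transcribed by hand from the table above: source -> (type, samples).
COUNTS = {
    "CCTC": ("grammar", 4470),
    "cscd-ns": ("spell", 40000),
    "CTC2021": ("grammar", 969),
    "ECSpell": ("spell", 8180),
    "lemon": ("spell", 22252),
    "MCSCSet": ("spell", 39302),
    "midu2022": ("grammar", 2014),
    "NLPCC2023": ("spell", 1000),
}


def totals_by_type(counts):
    """Sum the sample counts for each error type."""
    totals = defaultdict(int)
    for error_type, n in counts.values():
        totals[error_type] += n
    return dict(totals)


by_type = totals_by_type(COUNTS)
grand_total = sum(by_type.values())
print(by_type)      # per-type sample counts
print(grand_total)  # should match the Total row of the table
```

Under this transcription, spelling samples make up the large majority of the data, so a model trained on the full set sees far fewer grammar corrections than spelling corrections.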

Data Fields

Field	Type	Description
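source	string	Source sentence that may contain spelling or grammar errors
target	string	Target sentence after correction
label	int	Whether the source sentence contains an error: 1 if it does, otherwise it does not.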

{
    "source": "完善农产品上行发展机智。",
    "target": "完善农产品上行发展机制。",
    "label": 1
}
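
To see what a record like the one above encodes, the corrected characters can be located by comparing `source` and `target` position by position. The sketch below assumes a substitution-only error, where both sentences have the same length; the card does not guarantee this, and grammar samples may insert or delete characters. `diff_positions` is a hypothetical helper, not part of the dataset.

```python
def diff_positions(source, target):
    """Return (index, source_char, target_char) for each differing position.

    Assumption: the error is substitution-only, so both strings have the
    same length. Returns None when the lengths differ.
    """
    if len(source) != len(target):
        return None
    return [
        (i, s, t)
        for i, (s, t) in enumerate(zip(source, target))
        if s != t
    ]


# The example record shown above.
record = {
    "source": "完善农产品上行发展机智。",
    "target": "完善农产品上行发展机制。",
    "label": 1,
}

edits = diff_positions(record["source"], record["target"])
print(edits)  # one substitution: 智 replaced by 制
```

For records whose lengths differ, a sequence-alignment tool such as the standard library's `difflib.SequenceMatcher` would be a more suitable starting point.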

How to use it

from datasets import load_dataset

data = load_dataset('WangZeJun/chinese_text_correction')
print(data)

Output:

DatasetDict({
    train: Dataset({
        features: ['source', 'target', 'label'],
        num_rows: 118187
    })
})
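
Since the dataset ships as a single `train` split, a held-out set has to be carved out by the user. The following is a minimal sketch of one way to do that with the `datasets` library; the 5% test size, the seed, and the helper names `has_error` and `load_splits` are illustrative assumptions, not recommendations from the dataset author.

```python
def has_error(example):
    """True when the record's label marks the source sentence as erroneous."""
    return example["label"] == 1


def load_splits(test_size=0.05, seed=42):
    """Load the dataset, hold out a test split, and select erroneous samples.

    The test size and seed are arbitrary illustrative values.
    """
    # Imported here so that has_error stays usable without the library.
    from datasets import load_dataset

    data = load_dataset("WangZeJun/chinese_text_correction")
    # train_test_split returns a DatasetDict with "train" and "test" keys.
    splits = data["train"].train_test_split(test_size=test_size, seed=seed)
    # Keep only training records whose source sentence contains an error.
    erroneous = splits["train"].filter(has_error)
    return splits, erroneous


# Usage (downloads the dataset on first call):
# splits, erroneous = load_splits()
# print(splits)
# print(erroneous.num_rows)
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing correction models trained at different times.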

License/Terms of Use

License

Apache License 2.0

Data Developer

Zejun Wang

Use Case

This dataset can be used to train Chinese text correction models.

Release Date

04/17/2025

Data Version

1.0 (04/17/2025)