chinese_text_correction

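A Chinese text correction dataset containing both spelling and grammar correction data. It can be used to train Chinese proofreading models.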

Repository: zejunwang1/CTCDataset

Data distribution

Source      Type      Samples
CCTC        grammar   4470
cscd-ns     spell     40000
CTC2021     grammar   969
ECSpell     spell     8180
lemon       spell     22252
MCSCSet     spell     39302
midu2022    grammar   2014
NLPCC2023   spell     1000
Total                 118187
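
As a quick sanity check on the table above, the per-type totals can be recomputed from the listed counts. This is a minimal sketch: the `DISTRIBUTION` list is simply the table retyped by hand, and the names `DISTRIBUTION` and `totals_by_type` are illustrative, not part of the dataset.

```python
from collections import defaultdict

# Counts copied by hand from the data distribution table.
DISTRIBUTION = [
    ("CCTC", "grammar", 4470),
    ("cscd-ns", "spell", 40000),
    ("CTC2021", "grammar", 969),
    ("ECSpell", "spell", 8180),
    ("lemon", "spell", 22252),
    ("MCSCSet", "spell", 39302),
    ("midu2022", "grammar", 2014),
    ("NLPCC2023", "spell", 1000),
]


def totals_by_type(rows):
    """Sum the sample counts for each correction type."""
    totals = defaultdict(int)
    for _source, kind, count in rows:
        totals[kind] += count
    return dict(totals)


totals = totals_by_type(DISTRIBUTION)
grand_total = sum(totals.values())
for kind, count in totals.items():
    # Share of each type relative to the whole dataset.
    print(f"{kind}: {count} ({count / grand_total:.1%})")
print(f"total: {grand_total}")
```

The spelling sources account for the large majority of the samples, which may matter when a model is expected to handle grammar errors as well.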

Data Fields

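Field    Type     Description
source   string   Source sentence that may contain spelling or grammatical errors
target   string   Target sentence after correction
label    int      Whether the source sentence contains errors: 1 if it does, otherwise it does not

Example record: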
{
    "source": "完善农产品上行发展机智。",
    "target": "完善农产品上行发展机制。",
    "label": 1
}
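
Because each record pairs a source sentence with its corrected target, the individual edits can be recovered by diffing the two strings. The following is a minimal sketch using Python's standard `difflib`; the helper name `diff_edits` is hypothetical and not part of the dataset or its repository.

```python
from difflib import SequenceMatcher


def diff_edits(source: str, target: str):
    """Return (op, source_index, source_text, target_text) for each changed span."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only replace / insert / delete spans
            edits.append((op, i1, source[i1:i2], target[j1:j2]))
    return edits


# The example record shown above.
record = {
    "source": "完善农产品上行发展机智。",
    "target": "完善农产品上行发展机制。",
    "label": 1,
}

edits = diff_edits(record["source"], record["target"])
print(edits)
```

A character-level diff like this also handles grammar pairs where the source and target differ in length, since insertions and deletions are reported as their own spans.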

How to use it

from datasets import load_dataset

data = load_dataset('WangZeJun/chinese_text_correction')
print(data)

This prints:

DatasetDict({
    train: Dataset({
        features: ['source', 'target', 'label'],
        num_rows: 118187
    })
})
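
Once loaded, the records can be inspected with plain Python, since each row is a dict with the `source`, `target`, and `label` fields. The sketch below runs on a small in-memory sample so that it works offline. The second sample record is invented for illustration, the helper names are hypothetical, and it is assumed here that error-free sentences carry `label` 0.

```python
from collections import Counter


def label_counts(records):
    """Count how many records carry each label value."""
    return Counter(int(r["label"]) for r in records)


def erroneous_pairs(records):
    """Collect (source, target) pairs whose source is marked as containing errors."""
    return [(r["source"], r["target"]) for r in records if r["label"] == 1]


sample = [
    # Example record from the dataset card.
    {"source": "完善农产品上行发展机智。", "target": "完善农产品上行发展机制。", "label": 1},
    # Hypothetical error-free record, invented for illustration only.
    {"source": "今天天气很好。", "target": "今天天气很好。", "label": 0},
]

print(label_counts(sample))
print(erroneous_pairs(sample))
```

With the real dataset, `data['train']` could be passed in place of `sample`, since iterating over a `Dataset` yields one dict per row.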

License/Terms of Use

License

Apache License 2.0

Data Developer

Zejun Wang

Use Case

This dataset can be used to train Chinese text correction models.

Release Date

04/17/2025

Data Version

1.0 (04/17/2025)
