---
license: apache-2.0
---

# chinese_text_correction

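A Chinese text correction dataset containing spelling and grammar correction data. It can be used to train Chinese proofreading models.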

**Repository:** [zejunwang1/CTCDataset](https://github.com/zejunwang1/CTCDataset)

## Data distribution

| Source    | Type    | Sample |
| --------- | ------- | ------ |
| CCTC      | grammar | 4470   |
| cscd-ns   | spell   | 40000  |
| CTC2021   | grammar | 969    |
| ECSpell   | spell   | 8180   |
| lemon     | spell   | 22252  |
| MCSCSet   | spell   | 39302  |
| midu2022  | grammar | 2014   |
| NLPCC2023 | spell   | 1000   |
| Total     | —       | 118187 |

## Data Fields

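| Field  | Type   | Description                                                                 |
| ------ | ------ | --------------------------------------------------------------------------- |
| source | string | Source sentence that may contain spelling or grammar errors                 |
| target | string | Corrected target sentence                                                   |
| label  | int    | Whether the source sentence contains errors: 1 if it does, otherwise it does not |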

```json
{
    "source": "完善农产品上行发展机智。",
    "target": "完善农产品上行发展机制。",
    "label": 1
}
```
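For spelling corrections, where `source` and `target` have the same length, the corrected characters can be located by comparing the two strings position by position. The following is a minimal sketch using the record above; `char_diffs` is a hypothetical helper, not part of the dataset or its repository, and it deliberately skips pairs of different length (for example, grammar corrections that insert or delete characters).

```python
from typing import List, Optional, Tuple


def char_diffs(source: str, target: str) -> Optional[List[Tuple[int, str, str]]]:
    """Return (index, source_char, target_char) for every differing position.

    Hypothetical helper for illustration only. It handles substitution-only
    pairs; if the two sentences differ in length it returns None.
    """
    if len(source) != len(target):
        return None
    return [
        (i, s, t)
        for i, (s, t) in enumerate(zip(source, target))
        if s != t
    ]


# The example record shown above.
record = {
    "source": "完善农产品上行发展机智。",
    "target": "完善农产品上行发展机制。",
    "label": 1,
}

diffs = char_diffs(record["source"], record["target"])
print(diffs)  # [(10, '智', '制')]
```

An empty list means the two sentences are identical, which is what one would expect for records whose `label` is not 1.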

## How to use it

```python
from datasets import load_dataset

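# Download the dataset from the Hugging Face Hub; it has a single "train" split.
data = load_dataset('WangZeJun/chinese_text_correction')
print(data)
# Output:
# DatasetDict({
#     train: Dataset({
#         features: ['source', 'target', 'label'],
#         num_rows: 118187
#     })
# })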
```
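Because the dataset ships only a `train` split, a validation set has to be carved out by the user. The sketch below shows one possible way to do this with the standard library alone; `holdout_split`, the 10% default ratio, and the seed are assumptions made for illustration, and the toy records are made-up placeholders rather than real dataset rows.

```python
import random
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def holdout_split(
    records: Iterable[Record], valid_ratio: float = 0.1, seed: int = 42
) -> Tuple[List[Record], List[Record]]:
    """Shuffle records reproducibly and split them into (train, valid).

    Hypothetical helper for illustration; the ratio and seed are arbitrary.
    """
    items = list(records)
    random.Random(seed).shuffle(items)  # seeded, so the split is repeatable
    n_valid = int(len(items) * valid_ratio)
    return items[n_valid:], items[:n_valid]


# Placeholder records with the same three fields as the dataset.
toy = [
    {"source": f"s{i}", "target": f"t{i}", "label": i % 2}
    for i in range(10)
]

train, valid = holdout_split(toy, valid_ratio=0.2)
print(len(train), len(valid))  # 8 2
```

The same function should accept `data["train"]` directly, since iterating a `datasets.Dataset` yields one dict per row. The `datasets` library also offers `Dataset.train_test_split(test_size=..., seed=...)`, which is likely the more convenient choice when staying inside that library.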

## License/Terms of Use

### License

[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)

### Data Developer

[Zejun Wang](https://github.com/zejunwang1)

### Use Case

This dataset can be used to train Chinese text correction models.

### Release Date

04/17/2025

## Data Version

1.0 (04/17/2025)