WangZeJun
/

chinese_text_correction

WangZeJun committed about 1 month ago

Commit

6c3a0e3

1 Parent(s): bcc581d

Upload README.md with huggingface_hub

Files changed (1)

README.md +76 -0

README.md ADDED

	@@ -0,0 +1,76 @@

+---
+license: apache-2.0
+---
+# chinese_text_correction
+A Chinese text correction dataset containing spelling and grammar correction data, which can be used to train Chinese proofreading models.
+**Repository:** [zejunwang1/CTCDataset](https://github.com/zejunwang1/CTCDataset)
+## Data Distribution
+| Source    | Type    | Sample |
+| --------- | ------- | ------ |
+| CCTC      | grammar | 4470   |
+| cscd-ns   | spell   | 40000  |
+| CTC2021   | grammar | 969    |
+| ECSpell   | spell   | 8180   |
+| lemon     | spell   | 22252  |
+| MCSCSet   | spell   | 39302  |
+| midu2022  | grammar | 2014   |
+| NLPCC2023 | spell   | 1000   |
+| Total     | —       | 118187 |
+## Data Fields
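+| Field  | Type   | Description                                                     |
+| ------ | ------ | --------------------------------------------------------------- |
+| source | string | Source sentence that may contain spelling/grammar errors        |
+| target | string | Corrected target sentence                                       |
+| label  | int    | 1 if the source sentence contains errors; otherwise it does not |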
+```json
+{
+    "source": "完善农产品上行发展机智。",
+    "target": "完善农产品上行发展机制。",
+    "label": 1
+}
+```
+## How to use it
+```python
+from datasets import load_dataset
+
+# Download the dataset from the Hugging Face Hub.
+data = load_dataset('WangZeJun/chinese_text_correction')
+print(data)
+# Output:
+# DatasetDict({
+#     train: Dataset({
+#         features: ['source', 'target', 'label'],
+#         num_rows: 118187
+#     })
+# })
+```
+## License/Terms of Use
+### License
+[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)
+### Data Developer
+[Zejun Wang](https://github.com/zejunwang1)
+### Use Case
+This dataset can be used to train Chinese text correction models.
+### Release Date
+04/17/2025
+## Data Version
+1.0 (04/17/2025)
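
The `source`, `target`, and `label` fields described in the README lend themselves to a quick sanity check before training. The following is a minimal sketch, not part of the committed README: the helper names `split_by_label` and `edits` are hypothetical, the second record is an invented clean example, and treating any `label` other than 1 as "no error" is an assumption based on the field description. It uses only the standard library so it runs without downloading the dataset.

```python
from difflib import SequenceMatcher


def split_by_label(records):
    """Partition records into (erroneous, clean) lists using the `label` field.

    Assumption: any label other than 1 means the source contains no error.
    """
    erroneous = [r for r in records if r["label"] == 1]
    clean = [r for r in records if r["label"] != 1]
    return erroneous, clean


def edits(source, target):
    """Return (tag, source_span, target_span) for every non-matching region.

    Character-level alignment handles substitutions (spelling errors) as well
    as insertions and deletions (grammar errors).
    """
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [
        (tag, source[i1:i2], target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


records = [
    # Sample record shown in the README.
    {"source": "完善农产品上行发展机智。", "target": "完善农产品上行发展机制。", "label": 1},
    # Hypothetical clean record, invented for illustration only.
    {"source": "今天天气很好。", "target": "今天天气很好。", "label": 0},
]

erroneous, clean = split_by_label(records)
for record in erroneous:
    print(edits(record["source"], record["target"]))  # [('replace', '智', '制')]
```

With the real dataset, the same helpers should work on `data["train"]`, since iterating over a `datasets.Dataset` yields one dictionary per row.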