
high-quality Chinese training datasets

updated May 22

A suite of high-quality Chinese datasets for pretraining, fine-tuning, or preference alignment, along with the models trained on these datasets.

Upvote
18

  • opencsg/Fineweb-Edu-Chinese-V2.1

    Viewer • Updated Feb 27 • 958M • 23.1k • 39

  • OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

    Paper • 2501.08197 • Published Jan 14 • 8

  • opencsg/chinese-fineweb-edu-v2

    Viewer • Updated Jan 20 • 188M • 946 • 63

  • opencsg/chinese-fineweb-edu

    Viewer • Updated Jan 20 • 84.6M • 5.72k • 103

  • opencsg/csg-wukong-2b-chinese-fineweb-edu

    2B • Updated Sep 17, 2024 • 11 • 3

  • opencsg/csg-wukong-ablation-chinese-random

    2B • Updated Sep 4, 2024 • 5 • 2

  • opencsg/chinese-cosmopedia

    Preview • Updated Jan 15 • 401 • 70

  • opencsg/smoltalk-chinese

    Preview • Updated Jan 15 • 460 • 32

  • opencsg/csg-wukong-2b-smoltalk-chinese

    2B • Updated Jan 7 • 8 • 2

  • opencsg/UltraFeedback-chinese

    Preview • Updated Jan 14 • 372 • 12

  • opencsg/csg-wukong-2b-ultrafeedback-chinese-binarized

    2B • Updated Jan 7 • 7

  • yulan-team/YuLan-Mini

    Text Generation • 2B • Updated Mar 27 • 68 • 37

  • opencsg/chinese-fineweb-v2-scorer-train-data

    Preview • Updated Feb 25 • 40
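
For readers who want to inspect one of the datasets listed above before committing to a full download, the following is a minimal sketch using the Hugging Face `datasets` library in streaming mode. The repo IDs are copied from the listing. The helper name `peek`, the default split name `"train"`, and the sample size are assumptions made for illustration and are not stated on this page. Some repos may also require a configuration name, so check the dataset card if loading fails.

```python
from itertools import islice

# Dataset repo IDs copied from the collection listing above.
DATASET_REPOS = [
    "opencsg/Fineweb-Edu-Chinese-V2.1",
    "opencsg/chinese-fineweb-edu-v2",
    "opencsg/chinese-fineweb-edu",
    "opencsg/chinese-cosmopedia",
    "opencsg/smoltalk-chinese",
    "opencsg/UltraFeedback-chinese",
    "opencsg/chinese-fineweb-v2-scorer-train-data",
]


def peek(repo_id, n=3, split="train"):
    """Return the first n records of a Hub dataset without a full download.

    Assumption: the repo exposes a split named "train" and needs no
    explicit configuration name; consult the dataset card otherwise.
    """
    # Imported lazily so this module loads even if `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of downloading all files.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


# Example call (needs network access and `pip install datasets`):
#   for record in peek("opencsg/chinese-fineweb-edu"):
#       print(sorted(record.keys()))  # column names are not given on this page
```

Streaming is used here because the larger datasets in the collection are listed with hundreds of millions of rows, which makes a full download impractical for a quick look.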