opencsg's Collections
high-quality Chinese training datasets

updated 16 days ago

A suite of high-quality Chinese datasets for pretraining, fine-tuning, and preference alignment, together with the models trained on these datasets.

Upvote
17

  • opencsg/Fineweb-Edu-Chinese-V2.1

    Viewer • Updated Feb 27 • 958M • 27.9k • 36

  • OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

    Paper • 2501.08197 • Published Jan 14 • 8

  • opencsg/chinese-fineweb-edu-v2

    Viewer • Updated Jan 20 • 188M • 2.65k • 63

  • opencsg/chinese-fineweb-edu

    Viewer • Updated Jan 20 • 84.6M • 10.6k • 102

  • opencsg/csg-wukong-2b-chinese-fineweb-edu

    Updated Sep 17, 2024 • 6 • 3

  • opencsg/csg-wukong-ablation-chinese-random

    Updated Sep 4, 2024 • 1 • 2

  • opencsg/chinese-cosmopedia

    Preview • Updated Jan 15 • 791 • 69

  • opencsg/smoltalk-chinese

    Preview • Updated Jan 15 • 527 • 32

  • opencsg/csg-wukong-2b-smoltalk-chinese

    Updated Jan 7 • 7 • 2

  • opencsg/UltraFeedback-chinese

    Preview • Updated Jan 14 • 292 • 11

  • opencsg/csg-wukong-2b-ultrafeedback-chinese-binarized

    Updated Jan 7 • 8

  • yulan-team/YuLan-Mini

    Text Generation • Updated Mar 27 • 65 • 37

  • opencsg/chinese-fineweb-v2-scorer-train-data

    Preview • Updated Feb 25 • 45