Traditional Chinese LLM Corpus
Traditional Chinese corpus collection for LLM training (pre-training, instruction-tuning, and RLHF/alignment).
Note: Contains ~2B tokens drawn from high-quality corpora, cleaned and deduplicated.
liswei/wikipedia-zhtw-dedup
Note: Deduplicated version of erhwenkuo/wikipedia-zhtw, produced with MinHash.
liswei/c4-zhtw
Note: Deduplicated zh-TW subset of C4 (the Colossal Clean Crawled Corpus, a cleaned version of Common Crawl).
liswei/common-crawl-zhtw
Note: Deduplicated zh-TW subset of Common Crawl (CC).
zetavg/CC-100-zh-Hant-merged
Note: zh-TW subset of the CC-100 dataset, which is derived from Common Crawl. CC data harms performance, as shown in TaiwanLlama.
liswei/coct-en-zhtw-dedup
Note: Deduplicated version of zetavg/coct-en-zh-tw-translations-twp-300k. Paired zh-TW <-> en articles provided by 台灣光華雜誌 (Taiwan Panorama magazine).
liswei/PromptPair-TW
Note: Traditional Chinese instruction dataset. Contains en <-> zh-TW pairs with system prompts, to ease adaptation from English pre-trained models.
yentinglin/TaiwanChat
Note: Instruction dataset used to train TaiwanLLM v1. Find more details in the paper.
erhwenkuo/alpaca-data-gpt4-chinese-zhtw
Note: The alpaca-gpt4 dataset, translated from English to zh-TW.
zetavg/mlqa_en_zh_tw
Note: zh-CN/en multilingual QA dataset translated to zh-TW/en. Internal experiments show that, when transferring from an English base model, training on cross-lingual pairs (question in en, answer in zh, or vice versa) improves SFT performance.
zetavg/ShareGPT-Processed
Note: The RyokoAI/ShareGPT52K dataset, converted to Markdown and labeled with the language used.
lchakkei/OpenOrca-Traditional-Chinese
Note: Instruction data machine-translated from English with Google Translate.
Heng666/Traditional_Chinese-aya_dataset
Heng666/Traditional_Chinese-aya_evaluation_suite
ChenWeiLi/Med_Breexe_zhtw
Note: Instruction dataset in the medical domain. Prompts are translated and then fed to the Breexe model.
Tarklanse/Traditional_Chinese_roleplay_chat_Dataset
DataAgent/Pretrain-Taiwan-DentistKnowledge-zhTW-290K
KSmart/chinese_traditional_chengyu
Note: This is in Simplified Chinese.
liswei/rm-static-zhTW
Note: Preference dataset with chosen/rejected pairs. Translated using m2m100.
ZoneTwelve/ChineseGrammaticalErrorEvaluation
ZoneTwelve/micro_sft_instruct
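
Several notes above mention MinHash deduplication. The collection does not describe the actual pipeline, so the following is only a minimal sketch of the general idea, assuming character 3-gram shingles, 64 hash functions, and a 0.8 similarity threshold. All function names and parameter values here are hypothetical, chosen for illustration, and not taken from any of the listed datasets.

```python
from __future__ import annotations

import hashlib
import re


def shingles(text: str, n: int = 3) -> set[str]:
    """Character n-grams; no word tokenizer is needed for Chinese text."""
    text = re.sub(r"\s+", "", text)  # ignore whitespace differences
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not too similar to any kept document.

    The threshold is an assumed value. This loop is quadratic in the number
    of documents, so it is only suitable for small inputs.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = [
    "台北是台灣的首都。",
    "台北是台灣的首都。",
    "台北 是台灣的首都。",
    "今天天氣很好,適合出門散步。",
]
unique_docs = deduplicate(docs)
```

The pairwise comparison in this sketch would not scale to corpora with millions of rows; large-scale pipelines typically bucket signatures with locality-sensitive hashing so that only likely duplicates are compared.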