OpenSeek-Small-v1-Baseline Model Documentation
Overview
We sampled 100 billion tokens from the CCI4.0 dataset and trained a 1.4B-parameter Mixture-of-Experts (MoE) model with 0.4B active parameters. Both the model and the dataset are open-sourced as a baseline for future experiments in areas such as dataset construction, algorithmic strategies, and parallel training frameworks. The model architecture is identical to that of the OpenSeek-Small-v1 model.
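The released config is the ground truth for the architecture; as a quick sanity check, the total parameter count can be read off the checkpoint. Below is a minimal sketch, assuming the repo ID from the Usage Instructions at the end of this document (depending on how the MoE architecture is packaged, `trust_remote_code=True` may also be required):

```python
from transformers import AutoModelForCausalLM

# Load the baseline checkpoint (repo ID taken from the Usage Instructions below).
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")

# The total should come out near 1.4B. The ~0.4B figure is the number of
# parameters *active per token* under MoE routing, which a plain sum over
# all parameters does not reflect.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.2f}B")
```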
Training Data
The sampling ratio (relative weight) for each data domain is as follows; a sketch of how such ratios can drive sampling appears after the table:
Domain | Ratio |
---|---|
Nemotron-CC-high-actual-actual-high | 1.1068 |
Nemotron-CC-high-actual-actual-low | 0.3577 |
Nemotron-CC-high-actual-actual-mid | 0.7775 |
Nemotron-CC-high-synthetic-distill-high | 0.2859 |
Nemotron-CC-high-synthetic-distill-low | 0.1672 |
Nemotron-CC-high-synthetic-distill-mid | 0.2339 |
Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 0.5397 |
Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 0.4064 |
Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 0.5005 |
Nemotron-CC-high-synthetic-extract_knowledge-high | 0.4616 |
Nemotron-CC-high-synthetic-extract_knowledge-low | 0.0670 |
Nemotron-CC-high-synthetic-extract_knowledge-mid | 0.3429 |
Nemotron-CC-high-synthetic-knowledge_list-high | 0.2610 |
Nemotron-CC-high-synthetic-knowledge_list-low | 0.1824 |
Nemotron-CC-high-synthetic-knowledge_list-mid | 0.2313 |
Nemotron-CC-high-synthetic-wrap_medium-high | 0.8237 |
Nemotron-CC-high-synthetic-wrap_medium-low | 0.2866 |
Nemotron-CC-high-synthetic-wrap_medium-mid | 0.6670 |
Nemotron-CC-low-synthetic-wrap_medium-high | 0.4657 |
Nemotron-CC-low-synthetic-wrap_medium-low | 0.2005 |
Nemotron-CC-low-synthetic-wrap_medium-mid | 0.4317 |
Nemotron-CC-medium-actual-actual-high | 1.1397 |
Nemotron-CC-medium-actual-actual-low | 0.6782 |
Nemotron-CC-medium-actual-actual-mid | 0.9175 |
arxiv | 0.6414 |
books | 0.4696 |
code-high | 1.0102 |
code-low | 1.1403 |
code-mid | 0.9674 |
cot_synthesis2_CC-high | 0.3755 |
cot_synthesis2_CC-low | 0.0499 |
cot_synthesis2_CC-mid | 1.8299 |
cot_synthesis2_OpenSource-high | 0.2573 |
cot_synthesis2_OpenSource-low | 0.1638 |
cot_synthesis2_OpenSource-mid | 0.3251 |
cot_synthesis2_arxiv-high | 6.0237 |
cot_synthesis2_arxiv-low | 8.9063 |
cot_synthesis2_arxiv-mid | 10.1376 |
cot_synthesis2_code-high | 0.4598 |
cot_synthesis2_code-low | 0.6857 |
cot_synthesis2_code-mid | 0.8990 |
cot_synthesis2_math-high | 1.3135 |
cot_synthesis2_math-low | 1.6530 |
cot_synthesis2_math-mid | 0.3536 |
cot_synthesis2_wiki-high | 0.6314 |
cot_synthesis2_wiki-low | 0.5978 |
cot_synthesis2_wiki-mid | 0.7909 |
cot_synthesis_CC-high | 0.2225 |
cot_synthesis_CC-low | 0.1797 |
cot_synthesis_CC-mid | 0.2042 |
cot_synthesis_OpenSource-high | 0.4081 |
cot_synthesis_OpenSource-low | 0.1659 |
cot_synthesis_OpenSource-mid | 1.2828 |
cot_synthesis_arxiv-high | 5.68 |
cot_synthesis_arxiv-low | 7.4907 |
cot_synthesis_arxiv-mid | 8.9359 |
cot_synthesis_code-high | 0.7663 |
cot_synthesis_code-low | 0.4052 |
cot_synthesis_code-mid | 0.1916 |
cot_synthesis_math-high | 0.5074 |
cot_synthesis_math-low | 0.6437 |
cot_synthesis_math-mid | 0.6406 |
cot_synthesis_wiki-high | 0.4000 |
cot_synthesis_wiki-low | 0.3564 |
cot_synthesis_wiki-mid | 0.5768 |
math-high | 1.8165 |
math-low | 1.6940 |
math-mid | 1.6311 |
pes2o | 6.1982 |
pes2o-full-train | 1.4257 |
pes2o-full-val | 0.0143 |
stack | 0.4229 |
wiki | 0.4202 |
zh_cc-high-loss0 | 1.8171 |
zh_cc-high-loss1 | 0.9776 |
zh_cc-high-loss2 | 0.3725 |
zh_cc-medium-loss0 | 0.9492 |
zh_cc-medium-loss1 | 0.9236 |
zh_cc-medium-loss2 | 1.0643 |
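The exact sampling pipeline is not included here; the following is a minimal sketch of how per-domain ratios like those above could be turned into a weighted sampler. The `DOMAIN_RATIOS` dict (an illustrative subset of the table) and the `sample_domain` helper are hypothetical, not part of the released code.

```python
import random

# Illustrative subset of the ratio table above (hypothetical, for the sketch only).
DOMAIN_RATIOS = {
    "Nemotron-CC-high-actual-actual-high": 1.1068,
    "code-high": 1.0102,
    "math-high": 1.8165,
    "wiki": 0.4202,
}

# Normalize the relative weights into sampling probabilities.
total = sum(DOMAIN_RATIOS.values())
domains = list(DOMAIN_RATIOS)
weights = [DOMAIN_RATIOS[d] / total for d in domains]

def sample_domain(rng: random.Random) -> str:
    """Draw the domain the next training document is taken from."""
    return rng.choices(domains, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_domain(rng) for _ in range(5)])
```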
Wandb
Our training curves are recorded in Weights & Biases (wandb).
Evaluation
We used the LightEval library for model evaluation, following the same setup as FineWeb and CCI3-HQ. All evaluations were conducted in a zero-shot setting, except GSM8K, which uses 5-shot. To compare performance across datasets directly, we report an overall Average: the mean of the average English-benchmark score and the average Chinese-benchmark score (see the sketch after the table).
Metric | Score |
---|---|
HellaSwag | 42.09 |
ARC (Average) | 40.11 |
PIQA | 67.14 |
MMLU (cloze) | 31.29 |
CommonsenseQA | 28.17 |
TriviaQA | 6.51 |
WinoGrande | 51.38 |
OpenBookQA | 33.00 |
GSM8K (5-shot) | 6.67 |
SIQA | 41.86 |
CEval | 30.19 |
CMMLU | 30.25 |
Average-English | 34.82 |
Average-Chinese | 30.22 |
Overall Average | 32.52 |
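As a sanity check, the aggregate rows can be reproduced from the per-benchmark scores above. The snippet below is a small sketch of that arithmetic; the English/Chinese groupings are taken directly from the table.

```python
# Per-benchmark scores copied from the table above.
english = {
    "HellaSwag": 42.09, "ARC": 40.11, "PIQA": 67.14, "MMLU": 31.29,
    "CommonsenseQA": 28.17, "TriviaQA": 6.51, "WinoGrande": 51.38,
    "OpenBookQA": 33.00, "GSM8K": 6.67, "SIQA": 41.86,
}
chinese = {"CEval": 30.19, "CMMLU": 30.25}

avg_en = sum(english.values()) / len(english)   # 34.82
avg_zh = sum(chinese.values()) / len(chinese)   # 30.22
overall = (avg_en + avg_zh) / 2                 # 32.52

print(f"English: {avg_en:.2f}  Chinese: {avg_zh:.2f}  Overall: {overall:.2f}")
```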
Usage Instructions
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the baseline checkpoint and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")
tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")

# Encode a prompt and generate a short continuation.
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```