OpenSeek-Small-v1-Baseline Model Documentation
Overview
We sampled 100 billion tokens from the CCI4.0 dataset and trained a 1.4B-parameter Mixture-of-Experts (MoE) model with 0.4B active parameters. Both the model and the dataset are open-sourced as a baseline for future experiments in areas such as dataset construction, algorithmic strategies, and parallel training frameworks. The model architecture is identical to that of the OpenSeek-Small-v1 model.
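The released config is the ground truth for the architecture; as a quick sanity check, the total parameter count can be read off the checkpoint. Below is a minimal sketch, assuming the repo ID from the Usage Instructions at the end of this document (depending on how the MoE architecture is packaged, `trust_remote_code=True` may also be required):

```python
from transformers import AutoModelForCausalLM

# Load the baseline checkpoint (repo ID taken from the Usage Instructions below).
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")

# The total should come out near 1.4B. The ~0.4B figure is the number of
# parameters *active per token* under MoE routing, which a plain sum over
# all parameters does not reflect.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.2f}B")
```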
Training Data
The sampling ratio (relative weight) for each data domain is as follows; a sketch of how such ratios can drive sampling appears after the table:
Domain | Ratio |
---|---|
Nemotron-CC-high-actual-actual-high | 1.1068 |
Nemotron-CC-high-actual-actual-low | 0.3577 |
Nemotron-CC-high-actual-actual-mid | 0.7775 |
Nemotron-CC-high-synthetic-distill-high | 0.2859 |
Nemotron-CC-high-synthetic-distill-low | 0.1672 |
Nemotron-CC-high-synthetic-distill-mid | 0.2339 |
Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 0.5397 |
Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 0.4064 |
Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 0.5005 |
Nemotron-CC-high-synthetic-extract_knowledge-high | 0.4616 |
Nemotron-CC-high-synthetic-extract_knowledge-low | 0.0670 |
Nemotron-CC-high-synthetic-extract_knowledge-mid | 0.3429 |
Nemotron-CC-high-synthetic-knowledge_list-high | 0.2610 |
Nemotron-CC-high-synthetic-knowledge_list-low | 0.1824 |
Nemotron-CC-high-synthetic-knowledge_list-mid | 0.2313 |
Nemotron-CC-high-synthetic-wrap_medium-high | 0.8237 |
Nemotron-CC-high-synthetic-wrap_medium-low | 0.2866 |
Nemotron-CC-high-synthetic-wrap_medium-mid | 0.6670 |
Nemotron-CC-low-synthetic-wrap_medium-high | 0.4657 |
Nemotron-CC-low-synthetic-wrap_medium-low | 0.2005 |
Nemotron-CC-low-synthetic-wrap_medium-mid | 0.4317 |
Nemotron-CC-medium-actual-actual-high | 1.1397 |
Nemotron-CC-medium-actual-actual-low | 0.6782 |
Nemotron-CC-medium-actual-actual-mid | 0.9175 |
arxiv | 0.6414 |
books | 0.4696 |
code-high | 1.0102 |
code-low | 1.1403 |
code-mid | 0.9674 |
cot_synthesis2_CC-high | 0.3755 |
cot_synthesis2_CC-low | 0.0499 |
cot_synthesis2_CC-mid | 1.8299 |
cot_synthesis2_OpenSource-high | 0.2573 |
cot_synthesis2_OpenSource-low | 0.1638 |
cot_synthesis2_OpenSource-mid | 0.3251 |
cot_synthesis2_arxiv-high | 6.0237 |
cot_synthesis2_arxiv-low | 8.9063 |
cot_synthesis2_arxiv-mid | 10.1376 |
cot_synthesis2_code-high | 0.4598 |
cot_synthesis2_code-low | 0.6857 |
cot_synthesis2_code-mid | 0.8990 |
cot_synthesis2_math-high | 1.3135 |
cot_synthesis2_math-low | 1.6530 |
cot_synthesis2_math-mid | 0.3536 |
cot_synthesis2_wiki-high | 0.6314 |
cot_synthesis2_wiki-low | 0.5978 |
cot_synthesis2_wiki-mid | 0.7909 |
cot_synthesis_CC-high | 0.2225 |
cot_synthesis_CC-low | 0.1797 |
cot_synthesis_CC-mid | 0.2042 |
cot_synthesis_OpenSource-high | 0.4081 |
cot_synthesis_OpenSource-low | 0.1659 |
cot_synthesis_OpenSource-mid | 1.2828 |
cot_synthesis_arxiv-high | 5.68 |
cot_synthesis_arxiv-low | 7.4907 |
cot_synthesis_arxiv-mid | 8.9359 |
cot_synthesis_code-high | 0.7663 |
cot_synthesis_code-low | 0.4052 |
cot_synthesis_code-mid | 0.1916 |
cot_synthesis_math-high | 0.5074 |
cot_synthesis_math-low | 0.6437 |
cot_synthesis_math-mid | 0.6406 |
cot_synthesis_wiki-high | 0.4000 |
cot_synthesis_wiki-low | 0.3564 |
cot_synthesis_wiki-mid | 0.5768 |
math-high | 1.8165 |
math-low | 1.6940 |
math-mid | 1.6311 |
pes2o | 6.1982 |
pes2o-full-train | 1.4257 |
pes2o-full-val | 0.0143 |
stack | 0.4229 |
wiki | 0.4202 |
zh_cc-high-loss0 | 1.8171 |
zh_cc-high-loss1 | 0.9776 |
zh_cc-high-loss2 | 0.3725 |
zh_cc-medium-loss0 | 0.9492 |
zh_cc-medium-loss1 | 0.9236 |
zh_cc-medium-loss2 | 1.0643 |
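The exact sampling pipeline is not included here; the following is a minimal sketch of how per-domain ratios like those above could be turned into a weighted sampler. The `DOMAIN_RATIOS` dict (an illustrative subset of the table) and the `sample_domain` helper are hypothetical, not part of the released code.

```python
import random

# Illustrative subset of the ratio table above (hypothetical, for the sketch only).
DOMAIN_RATIOS = {
    "Nemotron-CC-high-actual-actual-high": 1.1068,
    "code-high": 1.0102,
    "math-high": 1.8165,
    "wiki": 0.4202,
}

# Normalize the relative weights into sampling probabilities.
total = sum(DOMAIN_RATIOS.values())
domains = list(DOMAIN_RATIOS)
weights = [DOMAIN_RATIOS[d] / total for d in domains]

def sample_domain(rng: random.Random) -> str:
    """Draw the domain the next training document is taken from."""
    return rng.choices(domains, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_domain(rng) for _ in range(5)])
```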
Wandb
Our training curves are recorded in Weights & Biases (wandb).
Evaluation
We used the LightEval library for model evaluation, following the same setup as FineWeb and CCI3-HQ. All evaluations were conducted in a zero-shot setting, except GSM8K, which uses 5-shot. To compare performance across datasets directly, we report an overall Average: the mean of the average English-benchmark score and the average Chinese-benchmark score (see the sketch after the table).
Metric | Score |
---|---|
HellaSwag | 42.09 |
ARC (Average) | 40.11 |
PIQA | 67.14 |
MMLU (cloze) | 31.29 |
CommonsenseQA | 28.17 |
TriviaQA | 6.51 |
WinoGrande | 51.38 |
OpenBookQA | 33.00 |
GSM8K (5-shot) | 6.67 |
SIQA | 41.86 |
CEval | 30.19 |
CMMLU | 30.25 |
Average-English | 34.82 |
Average-Chinese | 30.22 |
Overall Average | 32.52 |
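As a sanity check, the aggregate rows can be reproduced from the per-benchmark scores above. The snippet below is a small sketch of that arithmetic; the English/Chinese groupings are taken directly from the table.

```python
# Per-benchmark scores copied from the table above.
english = {
    "HellaSwag": 42.09, "ARC": 40.11, "PIQA": 67.14, "MMLU": 31.29,
    "CommonsenseQA": 28.17, "TriviaQA": 6.51, "WinoGrande": 51.38,
    "OpenBookQA": 33.00, "GSM8K": 6.67, "SIQA": 41.86,
}
chinese = {"CEval": 30.19, "CMMLU": 30.25}

avg_en = sum(english.values()) / len(english)   # 34.82
avg_zh = sum(chinese.values()) / len(chinese)   # 30.22
overall = (avg_en + avg_zh) / 2                 # 32.52

print(f"English: {avg_en:.2f}  Chinese: {avg_zh:.2f}  Overall: {overall:.2f}")
```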
Usage Instructions
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the baseline checkpoint and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")
tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")

# Encode a prompt and generate a short continuation.
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```