OpenSeek-Small-v1-Baseline Model Documentation

Overview

We sampled 100 billion tokens from the CCI4.0 dataset and trained a 1.4B-parameter MoE model with 0.4B active parameters. This model, along with the dataset, is open-sourced as a baseline for future experiments in areas such as dataset construction, algorithmic strategies, and parallel training frameworks. The model architecture is the same as that of the OpenSeek-Small-v1 model.
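As a quick sanity check on the parameter count, the snippet below loads the checkpoint and sums its parameters. This is a minimal sketch using only the standard transformers API; counting the 0.4B active parameters would additionally require inspecting the MoE routing configuration, which is omitted here.

from transformers import AutoModelForCausalLM

# Minimal sketch: verify the total parameter count of the checkpoint.
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.2f}B")  # expected to be roughly 1.4B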

Training Data

The sampling ratio for each data domain is as follows:

| Name | Ratio |
|---|---|
| Nemotron-CC-high-actual-actual-high | 1.1068 |
| Nemotron-CC-high-actual-actual-low | 0.3577 |
| Nemotron-CC-high-actual-actual-mid | 0.7775 |
| Nemotron-CC-high-synthetic-distill-high | 0.2859 |
| Nemotron-CC-high-synthetic-distill-low | 0.1672 |
| Nemotron-CC-high-synthetic-distill-mid | 0.2339 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 0.5397 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 0.4064 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 0.5005 |
| Nemotron-CC-high-synthetic-extract_knowledge-high | 0.4616 |
| Nemotron-CC-high-synthetic-extract_knowledge-low | 0.0670 |
| Nemotron-CC-high-synthetic-extract_knowledge-mid | 0.3429 |
| Nemotron-CC-high-synthetic-knowledge_list-high | 0.2610 |
| Nemotron-CC-high-synthetic-knowledge_list-low | 0.1824 |
| Nemotron-CC-high-synthetic-knowledge_list-mid | 0.2313 |
| Nemotron-CC-high-synthetic-wrap_medium-high | 0.8237 |
| Nemotron-CC-high-synthetic-wrap_medium-low | 0.2866 |
| Nemotron-CC-high-synthetic-wrap_medium-mid | 0.6670 |
| Nemotron-CC-low-synthetic-wrap_medium-high | 0.4657 |
| Nemotron-CC-low-synthetic-wrap_medium-low | 0.2005 |
| Nemotron-CC-low-synthetic-wrap_medium-mid | 0.4317 |
| Nemotron-CC-medium-actual-actual-high | 1.1397 |
| Nemotron-CC-medium-actual-actual-low | 0.6782 |
| Nemotron-CC-medium-actual-actual-mid | 0.9175 |
| arxiv | 0.6414 |
| books | 0.4696 |
| code-high | 1.0102 |
| code-low | 1.1403 |
| code-mid | 0.9674 |
| cot_synthesis2_CC-high | 0.3755 |
| cot_synthesis2_CC-low | 0.0499 |
| cot_synthesis2_CC-mid | 1.8299 |
| cot_synthesis2_OpenSource-high | 0.2573 |
| cot_synthesis2_OpenSource-low | 0.1638 |
| cot_synthesis2_OpenSource-mid | 0.3251 |
| cot_synthesis2_arxiv-high | 6.0237 |
| cot_synthesis2_arxiv-low | 8.9063 |
| cot_synthesis2_arxiv-mid | 10.1376 |
| cot_synthesis2_code-high | 0.4598 |
| cot_synthesis2_code-low | 0.6857 |
| cot_synthesis2_code-mid | 0.8990 |
| cot_synthesis2_math-high | 1.3135 |
| cot_synthesis2_math-low | 1.6530 |
| cot_synthesis2_math-mid | 0.3536 |
| cot_synthesis2_wiki-high | 0.6314 |
| cot_synthesis2_wiki-low | 0.5978 |
| cot_synthesis2_wiki-mid | 0.7909 |
| cot_synthesis_CC-high | 0.2225 |
| cot_synthesis_CC-low | 0.1797 |
| cot_synthesis_CC-mid | 0.2042 |
| cot_synthesis_OpenSource-high | 0.4081 |
| cot_synthesis_OpenSource-low | 0.1659 |
| cot_synthesis_OpenSource-mid | 1.2828 |
| cot_synthesis_arxiv-high | 5.6800 |
| cot_synthesis_arxiv-low | 7.4907 |
| cot_synthesis_arxiv-mid | 8.9359 |
| cot_synthesis_code-high | 0.7663 |
| cot_synthesis_code-low | 0.4052 |
| cot_synthesis_code-mid | 0.1916 |
| cot_synthesis_math-high | 0.5074 |
| cot_synthesis_math-low | 0.6437 |
| cot_synthesis_math-mid | 0.6406 |
| cot_synthesis_wiki-high | 0.4000 |
| cot_synthesis_wiki-low | 0.3564 |
| cot_synthesis_wiki-mid | 0.5768 |
| math-high | 1.8165 |
| math-low | 1.6940 |
| math-mid | 1.6311 |
| pes2o | 6.1982 |
| pes2o-full-train | 1.4257 |
| pes2o-full-val | 0.0143 |
| stack | 0.4229 |
| wiki | 0.4202 |
| zh_cc-high-loss0 | 1.8171 |
| zh_cc-high-loss1 | 0.9776 |
| zh_cc-high-loss2 | 0.3725 |
| zh_cc-medium-loss0 | 0.9492 |
| zh_cc-medium-loss1 | 0.9236 |
| zh_cc-medium-loss2 | 1.0643 |
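The ratios above can be read as unnormalized sampling weights over data domains. The sketch below shows one plausible way to draw domains in proportion to these weights; it is an illustration under that assumption, not the actual OpenSeek data pipeline, and domain_weights lists only a few rows from the table.

import random

# Hypothetical sketch: sample training domains proportionally to the
# unnormalized ratios in the table above (only a few domains shown).
domain_weights = {
    "arxiv": 0.6414,
    "books": 0.4696,
    "stack": 0.4229,
    "wiki": 0.4202,
    "math-high": 1.8165,
}

names = list(domain_weights)
weights = list(domain_weights.values())

def next_domain() -> str:
    # random.choices normalizes the weights internally.
    return random.choices(names, weights=weights, k=1)[0]

print(next_domain())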

Wandb

Our training curves are recorded in Weights & Biases (wandb).

Evaluation

We used the LightEval library for model evaluation, following the same setup as FineWeb and CCI3-HQ. All evaluations were conducted in a zero-shot setting, except GSM8K, which is evaluated 5-shot. To compare performance across datasets directly, we report Average, computed as the mean of the average English-benchmark score and the average Chinese-benchmark score.

| Metric | Score |
|---|---|
| HellaSwag | 42.09 |
| ARC (Average) | 40.11 |
| PIQA | 67.14 |
| MMLU (cloze) | 31.29 |
| CommonsenseQA | 28.17 |
| TriviaQA | 6.51 |
| WinoGrande | 51.38 |
| OpenBookQA | 33.00 |
| GSM8K (5-shot) | 6.67 |
| SIQA | 41.86 |
| CEval | 30.19 |
| CMMLU | 30.25 |
| Average-English | 34.82 |
| Average-Chinese | 30.22 |
| Overall Average | 32.52 |
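The reported averages can be reproduced from the per-benchmark scores above. Note that the overall average is the mean of the English and Chinese averages, not a flat mean over all twelve benchmarks; the short check below confirms this from the table.

# Reproduce the reported averages from the scores in the table above.
english = {"HellaSwag": 42.09, "ARC": 40.11, "PIQA": 67.14, "MMLU": 31.29,
           "CommonsenseQA": 28.17, "TriviaQA": 6.51, "WinoGrande": 51.38,
           "OpenBookQA": 33.00, "GSM8K": 6.67, "SIQA": 41.86}
chinese = {"CEval": 30.19, "CMMLU": 30.25}

avg_en = sum(english.values()) / len(english)  # 34.82
avg_zh = sum(chinese.values()) / len(chinese)  # 30.22
overall = (avg_en + avg_zh) / 2                # 32.52
print(f"{avg_en:.2f} / {avg_zh:.2f} / {overall:.2f}")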

Usage Instructions

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the baseline model and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")
tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-Baseline")

# Encode a prompt and generate a short continuation (greedy decoding).
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
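For less repetitive continuations, sampling can be enabled at generation time. The settings below are illustrative defaults, not values tuned for this model.

# Illustrative sampled generation; temperature and top_p are assumptions,
# not recommended settings for this model.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,  # generate up to 50 new tokens after the prompt
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))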