---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- llama-factory
- full
- generated_from_trainer
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
datasets:
- bespokelabs/Bespoke-Stratos-17k
model-index:
- name: original
  results: []
---

<p align="center">
  <img src="https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B/resolve/main/Bespoke-Labs-Logo.png" width="550">
</p>

## Model description
This model is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) on the [Bespoke-Stratos-17k dataset](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k).
The dataset was derived by distilling DeepSeek-R1 using the data pipeline of Berkeley NovaSky’s Sky-T1, with some modifications; more details are in the dataset card for [Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k).
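
To take a quick look at the training data, the dataset can be loaded straight from the Hub with the `datasets` library. A minimal sketch, assuming the public dataset exposes a default `train` split:

```python
from datasets import load_dataset

# Load the distilled reasoning traces used for fine-tuning
# (assumes the dataset's default "train" split).
ds = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

print(ds)     # row count and column names
print(ds[0])  # one distilled example
```
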
The model outperforms Qwen2.5-7B-Instruct on reasoning and coding benchmarks:

|Benchmark|Bespoke-Stratos-7B|Qwen2.5-7B-Instruct|DeepSeek-R1-Distill-Qwen-7B (Ours)|DeepSeek-R1-Distill-Qwen-7B (Reported)|
|---|---|---|---|---|
|AIME2024|20.0|10.0|43.3|55.5|
|MATH500|82.0|74.2|89.4|92.8|
|GPQA-Diamond|37.8|33.3|44.9|49.1|
|LiveCodeBench v2 Easy|71.4|65.9|81.3|-|
|LiveCodeBench v2 Medium|25.5|18.9|42.2|-|
|LiveCodeBench v2 Hard|1.6|3.3|2.4|-|
|LiveCodeBench v2 All|36.1|31.9|46.6|-|

Note that the authors of Sky-T1 [noted](https://github.com/NovaSky-AI/SkyThought/issues/4#issuecomment-2585860004) that they saw little or no improvement when training 7B or 14B models with their data.
However, we do see an improvement, though not at the scale of DeepSeek's distilled model. One likely reason is that we used 17k examples, while DeepSeek appears to have used 800k.
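
For inference, the model behaves like any other `transformers` chat model. Below is a minimal sketch using the text-generation pipeline; the repo id `bespokelabs/Bespoke-Stratos-7B` is an assumption based on the model name above, and the generous `max_new_tokens` budget simply leaves room for the long reasoning traces the model is trained to produce:

```python
from transformers import pipeline

# Repo id is assumed from the model name in the table above;
# substitute the actual Hub id if it differs.
generator = pipeline(
    "text-generation",
    model="bespokelabs/Bespoke-Stratos-7B",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the sum of the first 50 positive odd integers?"}
]

# Greedy decoding for reproducibility; a large token budget for the reasoning trace.
out = generator(messages, max_new_tokens=4096, do_sample=False)
print(out[0]["generated_text"][-1]["content"])
```

Sampling parameters (temperature, top-p) can be adjusted as usual if more varied outputs are desired.
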

## Intended uses & limitations

The model is released under the Apache 2.0 License.

## Training procedure
We trained the model for 7 hours on 8xH100 GPUs.

### Training hyperparameters

The following hyperparameters were used during training (see the illustrative `TrainingArguments` sketch after this list):
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 12
- total_train_batch_size: 96 (1 per device × 12 accumulation steps × 8 GPUs)
- total_eval_batch_size: 64
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3.0
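
Per the `llama-factory` and `full` tags, the run was full-parameter SFT with LLaMA-Factory; the actual training config is not reproduced here. As an illustration only, the hyperparameters above map onto Hugging Face `TrainingArguments` roughly as follows (`output_dir` and `bf16` are assumptions, not values taken from the run):

```python
from transformers import TrainingArguments

# Illustrative sketch only: the model was trained with LLaMA-Factory,
# not with this exact script. Values mirror the list above; note that
# 1 sample/device x 12 accumulation steps x 8 GPUs = effective batch size 96.
training_args = TrainingArguments(
    output_dir="bespoke-stratos-7b-sft",  # hypothetical output path
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=12,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=3.0,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    bf16=True,  # assumption: bf16 mixed precision on H100s
)
```
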

### Training results

### Framework versions

- Transformers 4.46.1
- Pytorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.20.3