Commit e5b61b5 (verified) by weathermanj · 1 parent: eea9131

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +68 -0
README.md CHANGED
@@ -1,3 +1,71 @@
+ ---
+ language:
+ - en
+ license: other
+ base_model: Qwen/Qwen2.5-3B-Instruct
+ tags:
+ - qwen
+ - grpo
+ - reinforcement-learning
+ - instruction-tuning
+ - mathematical-reasoning
+ - gsm8k
+ datasets:
+ - gsm8k
+ model-index:
+ - name: Menda-3B-500
+   results:
+   - task:
+       type: multiple-choice-qa
+       name: ARC-Challenge
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 50.0
+   - task:
+       type: multiple-choice-qa
+       name: BoolQ
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 90.0
+   - task:
+       type: multiple-choice-qa
+       name: HellaSwag
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 40.0
+   - task:
+       type: multiple-choice-qa
+       name: Lambada
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 70.0
+   - task:
+       type: multiple-choice-qa
+       name: PIQA
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 90.0
+   - task:
+       type: multiple-choice-qa
+       name: Winogrande
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 90.0
+   - task:
+       type: mmlu
+       name: MMLU
+     metrics:
+     - name: Average
+       type: accuracy
+       value: 68.60
+ ---
+
  # Menda-3B-500
 
  Menda-3B-500 is a fine-tuned version of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) using Group Relative Policy Optimization (GRPO). This model represents the 500-step checkpoint from the training process.
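
The `model-index` metadata added in this commit can be read as a list of task/metric records. The following is a minimal sketch, not part of the commit, of how those records might be flattened into rows for a quick look. The `MODEL_INDEX_RESULTS` literal is hand-copied from the diff above, and the helper names `metric_rows` and `mean_for_task_type` are hypothetical. The unweighted mean it computes is only an illustration and is not a figure reported by the model card.

```python
# Values hand-copied from the model-index block in the diff above.
# The structure mirrors the YAML: one entry per task, each with a metrics list.
MODEL_INDEX_RESULTS = [
    {"task": {"type": "multiple-choice-qa", "name": "ARC-Challenge"},
     "metrics": [{"name": "Accuracy", "type": "accuracy", "value": 50.0}]},
    {"task": {"type": "multiple-choice-qa", "name": "BoolQ"},
     "metrics": [{"name": "Accuracy", "type": "accuracy", "value": 90.0}]},
    {"task": {"type": "multiple-choice-qa", "name": "HellaSwag"},
     "metrics": [{"name": "Accuracy", "type": "accuracy", "value": 40.0}]},
    {"task": {"type": "multiple-choice-qa", "name": "Lambada"},
     "metrics": [{"name": "Accuracy", "type": "accuracy", "value": 70.0}]},
    {"task": {"type": "multiple-choice-qa", "name": "PIQA"},
     "metrics": [{"name": "Accuracy", "type": "accuracy", "value": 90.0}]},
    {"task": {"type": "multiple-choice-qa", "name": "Winogrande"},
     "metrics": [{"name": "Accuracy", "type": "accuracy", "value": 90.0}]},
    {"task": {"type": "mmlu", "name": "MMLU"},
     "metrics": [{"name": "Average", "type": "accuracy", "value": 68.60}]},
]


def metric_rows(results):
    """Flatten results into (task name, metric name, value) tuples."""
    rows = []
    for entry in results:
        for metric in entry["metrics"]:
            rows.append((entry["task"]["name"], metric["name"], metric["value"]))
    return rows


def mean_for_task_type(results, task_type):
    """Unweighted mean of all metric values whose task has the given type."""
    values = [
        metric["value"]
        for entry in results
        if entry["task"]["type"] == task_type
        for metric in entry["metrics"]
    ]
    if not values:
        raise ValueError(f"no results with task type {task_type!r}")
    return sum(values) / len(values)


if __name__ == "__main__":
    for task_name, metric_name, value in metric_rows(MODEL_INDEX_RESULTS):
        print(f"{task_name:<14} {metric_name:<9} {value:6.2f}")
    mc_mean = mean_for_task_type(MODEL_INDEX_RESULTS, "multiple-choice-qa")
    print(f"multiple-choice-qa mean: {mc_mean:.2f}")
```

In practice the front matter would more likely be parsed from `README.md` with a YAML library than copied by hand. The literal is used here only to keep the sketch dependency-free.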