Upload README.md with huggingface_hub

@@ -1,3 +1,71 @@
+---
+language:
+  - en
+license: other
+base_model: Qwen/Qwen2.5-3B-Instruct
+tags:
+  - qwen
+  - grpo
+  - reinforcement-learning
+  - instruction-tuning
+  - mathematical-reasoning
+  - gsm8k
+datasets:
+  - gsm8k
+model-index:
+  - name: Menda-3B-500
+    results:
+      - task:
+          type: multiple-choice-qa
+          name: ARC-Challenge
+        metrics:
+          - name: Accuracy
+            type: accuracy
+            value: 50.0
+      - task:
+          type: multiple-choice-qa
+          name: BoolQ
+        metrics:
+          - name: Accuracy
+            type: accuracy
+            value: 90.0
+      - task:
+          type: multiple-choice-qa
+          name: HellaSwag
+        metrics:
+          - name: Accuracy
+            type: accuracy
+            value: 40.0
+      - task:
+          type: multiple-choice-qa
+          name: Lambada
+        metrics:
+          - name: Accuracy
+            type: accuracy
+            value: 70.0
+      - task:
+          type: multiple-choice-qa
+          name: PIQA
+        metrics:
+          - name: Accuracy
+            type: accuracy
+            value: 90.0
+      - task:
+          type: multiple-choice-qa
+          name: Winogrande
+        metrics:
+          - name: Accuracy
+            type: accuracy
+            value: 90.0
+      - task:
+          type: mmlu
+          name: MMLU
+        metrics:
+          - name: Average
+            type: accuracy
+            value: 68.60
+---
 # Menda-3B-500
 Menda-3B-500 is a fine-tuned version of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) trained with Group Relative Policy Optimization (GRPO). This model is the 500-step checkpoint from that training run.
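
For readers unfamiliar with GRPO, its central step is to score each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The snippet below is a minimal sketch of that normalization only. It is not the training code used for this checkpoint; the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is the reward's distance from the group mean, scaled by
    the group's standard deviation (eps avoids division by zero when all
    rewards in the group are equal).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one GSM8K-style prompt:
# 1.0 if the final answer is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average receive positive advantages and are reinforced, while those below it are penalized. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.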