Upload README.md with huggingface_hub
README.md CHANGED
@@ -7,8 +7,6 @@ tags:
 - instruct
 - fine-tuned
 - reasoning
-- rlhf
-- alignment
 - 3b
 - menda
 datasets:
@@ -58,56 +56,28 @@ model-index:
 value: 55.0
 ---
 
-# Menda-3b-750: GRPO-Tuned
+# Menda-3b-750: GRPO-Tuned Qwen2.5 Model
 
-Menda-3b-750 is a fine-tuned version of Qwen2.5-3B-Instruct,
+Menda-3b-750 is a fine-tuned version of Qwen2.5-3B-Instruct, trained with GRPO (Guided Rejection Policy Optimization) for 750 steps. This model shows improved performance on reasoning benchmarks compared to the base model.
 
-## Model Overview
-
-Menda-3b-750 builds upon the strong foundation of Qwen2.5-3B-Instruct by applying GRPO, a novel alignment technique that combines the strengths of RLHF (Reinforcement Learning from Human Feedback) with guided optimization. The result is a model that excels at reasoning tasks while remaining helpful, harmless, and honest.
-
-### Key Features
-
-- **Enhanced Reasoning**: Improved performance on complex reasoning benchmarks
-- **Instruction Following**: Better adherence to specific instructions and formats
-- **Balanced Responses**: Maintains helpfulness while reducing hallucinations
-- **Efficient Size**: Full capabilities in a compact 3B parameter model
-
-## Technical Details
+## Model Details
 
 - **Base Model**: Qwen2.5-3B-Instruct
-- **Architecture**: Transformer-based decoder-only
 - **Training Method**: GRPO (Guided Rejection Policy Optimization)
 - **Training Steps**: 750
 - **Context Length**: 4096 tokens
 - **Parameters**: 3 billion
-- **Tokenizer**: Qwen2 tokenizer (BPE-based)
 
-## Performance Evaluation
+## Benchmark Results
 
+Menda-3b-750 has been evaluated on several standard benchmarks:
+
-| Benchmark | Task Type | Menda-3b-750 | Base Model | Improvement |
-|-----------|-----------|--------------|------------|-------------|
-| HellaSwag | Common Sense Reasoning | 75.0% | 67.5% | +7.5% |
-| ARC-Challenge | Scientific Reasoning | 80.0% | 67.5% | +12.5% |
-| MMLU (High School) | Multi-domain Knowledge | 52.5% | 47.5% | +5.0% |
-| TruthfulQA | Factual Accuracy | 55.0% | 47.5% | +7.5% |
+| Benchmark | Task Type | Accuracy |
+|-----------|-----------|----------|
+| HellaSwag | Common Sense Reasoning | 75.0% |
+| ARC-Challenge | Scientific Reasoning | 80.0% |
+| MMLU (High School) | Multi-domain Knowledge | 52.5% |
+| TruthfulQA | Factual Accuracy | 55.0% |
 
-**Strengths:**
-- Versatile across multiple domains
-
-**Limitations:**
-- Still shows weaknesses in specialized mathematical reasoning
-- May occasionally produce hallucinations for complex queries
-- Limited context window compared to larger models
 
 ## Usage Examples
 
@@ -120,27 +90,13 @@ model_name = "weathermanj/Menda-3b-750"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
 
-prompt = "Explain the concept of
+prompt = "Explain the concept of machine learning in simple terms."
 inputs = tokenizer(prompt, return_tensors="pt")
 outputs = model.generate(**inputs, max_length=300)
 response = tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(response)
 ```
 
-### Reasoning Example
-
-```python
-prompt = """
-Solve the following problem step by step:
-A store sells shirts for $25 each and pants for $40 each. If a customer bought 3 shirts and some pants for a total of $205, how many pants did they buy?
-"""
-
-inputs = tokenizer(prompt, return_tensors="pt")
-outputs = model.generate(**inputs, max_length=500, temperature=0.7)
-response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-print(response)
-```
-
 ### Using with Ollama
 
 You can also use this model with Ollama by converting it to GGUF format:
@@ -162,29 +118,6 @@ ollama create menda-3b-750 -f Modelfile
 ollama run menda-3b-750
 ```
 
-## Training Methodology
-
-Menda-3b-750 was trained using GRPO, which extends traditional RLHF by incorporating guided optimization. The process involved:
-
-1. **Base Model Selection**: Starting with Qwen2.5-3B-Instruct
-2. **Preference Dataset**: Curating high-quality preference pairs
-3. **GRPO Training**: 750 steps of optimization with carefully tuned hyperparameters
-4. **Evaluation**: Continuous benchmarking to monitor performance
-
-## Citation and Attribution
-
-If you use this model in your research, please cite:
-
-```
-@misc{menda2024,
-  author = {WeatherManJ},
-  title = {Menda-3b-750: GRPO-Tuned Reasoning Assistant},
-  year = {2024},
-  publisher = {HuggingFace},
-  howpublished = {\url{https://huggingface.co/weathermanj/Menda-3b-750}}
-}
-```
-
 ## License
 
 This model inherits the license of the base Qwen2.5-3B-Instruct model. Please refer to the [Qwen2 license](https://huggingface.co/Qwen/Qwen2-3B-Instruct/blob/main/LICENSE) for details.
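
The removed "Training Methodology" section outlines the GRPO recipe (base model selection, preference data, 750 optimization steps, continuous evaluation) but gives no code. Below is a minimal, hypothetical sketch of what such a run could look like, assuming TRL's `GRPOTrainer` (TRL expands GRPO as Group Relative Policy Optimization, which may or may not be what the card's "Guided Rejection Policy Optimization" refers to); the dataset and reward function are placeholders, since the card names neither.

```python
# Hypothetical GRPO fine-tuning sketch (not the actual Menda-3b-750 training script).
# Assumes TRL's GRPOTrainer; the dataset and reward below are illustrative placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a dataset with a "prompt" column.
train_dataset = (
    load_dataset("openai/gsm8k", "main", split="train")
    .rename_column("question", "prompt")
)

def reward_shows_work(completions, **kwargs):
    # Toy reward: favour completions that lay out step-by-step reasoning.
    return [1.0 if "step" in c.lower() else 0.0 for c in completions]

args = GRPOConfig(
    output_dir="menda-3b-grpo",
    max_steps=750,                 # the card reports 750 training steps
    per_device_train_batch_size=4,
    num_generations=4,             # completions sampled per prompt for the group baseline
    max_completion_length=512,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # base model stated in the card
    reward_funcs=reward_shows_work,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```

Only the base model name and the 750-step budget come from the card; every other choice above is illustrative.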
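
The benchmark tables above report accuracies without describing the evaluation setup. A hypothetical way to produce comparable numbers, assuming EleutherAI's lm-evaluation-harness (the card does not say which harness, prompt format, or sample sizes were used):

```python
# Hypothetical benchmark run; the card does not document its evaluation setup.
# Assumes EleutherAI's lm-evaluation-harness (pip install lm-eval) and its Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=weathermanj/Menda-3b-750",
    tasks=["hellaswag", "arc_challenge", "truthfulqa_mc2"],
    # For the "MMLU (High School)" row, mmlu_high_school_* subtasks could be added here.
    batch_size=8,
)

# Print the metrics recorded for each task.
for task, metrics in results["results"].items():
    print(task, metrics)
```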