weathermanj committed (verified)
Commit 7682246 · 1 Parent(s): af95ed6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +12 -79
README.md CHANGED
@@ -7,8 +7,6 @@ tags:
  - instruct
  - fine-tuned
  - reasoning
- - rlhf
- - alignment
  - 3b
  - menda
  datasets:
@@ -58,56 +56,28 @@ model-index:
  value: 55.0
  ---

- # Menda-3b-750: GRPO-Tuned Reasoning Assistant
+ # Menda-3b-750: GRPO-Tuned Qwen2.5 Model

- Menda-3b-750 is a fine-tuned version of Qwen2.5-3B-Instruct, enhanced through 750 steps of GRPO (Guided Rejection Policy Optimization). This model demonstrates improved reasoning capabilities while maintaining the versatility of the base model.
+ Menda-3b-750 is a fine-tuned version of Qwen2.5-3B-Instruct, trained with GRPO (Group Relative Policy Optimization) for 750 steps. This model shows improved performance on reasoning benchmarks compared to the base model.

- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/qwen2_architecture.png" alt="Qwen2 Architecture" width="600"/>
-
- ## Model Overview
-
- Menda-3b-750 builds upon the strong foundation of Qwen2.5-3B-Instruct by applying GRPO, a novel alignment technique that combines the strengths of RLHF (Reinforcement Learning from Human Feedback) with guided optimization. The result is a model that excels at reasoning tasks while remaining helpful, harmless, and honest.
-
- ### Key Features
-
- - **Enhanced Reasoning**: Improved performance on complex reasoning benchmarks
- - **Instruction Following**: Better adherence to specific instructions and formats
- - **Balanced Responses**: Maintains helpfulness while reducing hallucinations
- - **Efficient Size**: Full capabilities in a compact 3B parameter model
-
- ## Technical Details
+ ## Model Details

  - **Base Model**: Qwen2.5-3B-Instruct
- - **Architecture**: Transformer-based decoder-only
  - **Training Method**: GRPO (Group Relative Policy Optimization)
  - **Training Steps**: 750
  - **Context Length**: 4096 tokens
  - **Parameters**: 3 billion
- - **Tokenizer**: Qwen2 tokenizer (BPE-based)
-
- ## Performance Evaluation
+ ## Benchmark Results

- Menda-3b-750 has been evaluated on several standard benchmarks, showing significant improvements over the base model:
+ Menda-3b-750 has been evaluated on several standard benchmarks:

- | Benchmark | Task Type | Menda-3b-750 | Base Model | Improvement |
- |-----------|-----------|--------------|------------|-------------|
- | HellaSwag | Common Sense Reasoning | 75.0% | 67.5% | +7.5% |
- | ARC-Challenge | Scientific Reasoning | 80.0% | 67.5% | +12.5% |
- | MMLU (High School) | Multi-domain Knowledge | 52.5% | 47.5% | +5.0% |
- | TruthfulQA | Factual Accuracy | 55.0% | 47.5% | +7.5% |
+ | Benchmark | Task Type | Accuracy |
+ |-----------|-----------|----------|
+ | HellaSwag | Common Sense Reasoning | 75.0% |
+ | ARC-Challenge | Scientific Reasoning | 80.0% |
+ | MMLU (High School) | Multi-domain Knowledge | 52.5% |
+ | TruthfulQA | Factual Accuracy | 55.0% |
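If you want to spot-check these numbers, a minimal evaluation sketch is below. It assumes the EleutherAI lm-evaluation-harness and common task names; the card does not state which harness or task variants produced the table above.

```bash
# Sketch only: assumes lm-evaluation-harness (pip install lm-eval);
# harness version and task variants used for the card's numbers are unknown.
lm_eval --model hf \
  --model_args pretrained=weathermanj/Menda-3b-750,dtype=bfloat16 \
  --tasks hellaswag,arc_challenge,mmlu,truthfulqa_mc2 \
  --batch_size 8
```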
 
- ### Strengths and Limitations
-
- **Strengths:**
- - Strong performance on reasoning tasks
- - Good balance of helpfulness and accuracy
- - Efficient for deployment on consumer hardware
- - Versatile across multiple domains
-
- **Limitations:**
- - Still shows weaknesses in specialized mathematical reasoning
- - May occasionally produce hallucinations for complex queries
- - Limited context window compared to larger models

  ## Usage Examples

@@ -120,27 +90,13 @@ model_name = "weathermanj/Menda-3b-750"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)

- prompt = "Explain the concept of RLHF (Reinforcement Learning from Human Feedback) in simple terms."
+ prompt = "Explain the concept of machine learning in simple terms."
  inputs = tokenizer(prompt, return_tensors="pt")
  outputs = model.generate(**inputs, max_length=300)
  response = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(response)
  ```
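The quickstart above feeds a bare prompt string to the tokenizer. Since the base checkpoint is an instruct (chat) model, responses are usually better when the request goes through the tokenizer's chat template. A minimal sketch, assuming this repo ships the usual Qwen2.5 chat template:

```python
# Sketch: same model, but routed through the chat template that Qwen2.5-based
# instruct checkpoints normally ship with (assumed to be present in this repo).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "weathermanj/Menda-3b-750"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "Explain the concept of machine learning in simple terms."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```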
 
- ### Reasoning Example
-
- ```python
- prompt = """
- Solve the following problem step by step:
- A store sells shirts for $25 each and pants for $40 each. If a customer bought 3 shirts and some pants for a total of $205, how many pants did they buy?
- """
-
- inputs = tokenizer(prompt, return_tensors="pt")
- outputs = model.generate(**inputs, max_length=500, temperature=0.7)
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(response)
- ```
-
  ### Using with Ollama

  You can also use this model with Ollama by converting it to GGUF format:
@@ -162,29 +118,6 @@ ollama create menda-3b-750 -f Modelfile
  ollama run menda-3b-750
  ```
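The conversion step itself and the Modelfile contents sit outside this diff's context lines. A rough sketch of one way to produce them, assuming a local llama.cpp checkout; the script name, quantization type, and Modelfile below are illustrative assumptions, not the card's own recipe:

```bash
# Sketch only: download the weights locally, convert with llama.cpp's converter
# (script name and flags vary by llama.cpp version), then write a minimal Modelfile.
python llama.cpp/convert_hf_to_gguf.py ./Menda-3b-750 --outfile menda-3b-750.gguf --outtype q8_0

cat > Modelfile <<'EOF'
FROM ./menda-3b-750.gguf
PARAMETER temperature 0.7
EOF
```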
 
- ## Training Methodology
-
- Menda-3b-750 was trained using GRPO, which extends traditional RLHF by incorporating guided optimization. The process involved:
-
- 1. **Base Model Selection**: Starting with Qwen2.5-3B-Instruct
- 2. **Preference Dataset**: Curating high-quality preference pairs
- 3. **GRPO Training**: 750 steps of optimization with carefully tuned hyperparameters
- 4. **Evaluation**: Continuous benchmarking to monitor performance
-
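For readers who want a concrete starting point for the process outlined in the (now removed) list above, here is a rough sketch of a 750-step GRPO run with TRL's `GRPOTrainer`. Note that TRL's GRPO optimizes against reward functions over sampled completion groups rather than explicit preference pairs, and every dataset, reward, and hyperparameter below is an illustrative assumption, not the recipe used for Menda-3b-750.

```python
# Illustrative sketch of a 750-step GRPO run with TRL's GRPOTrainer.
# Dataset, reward function, and hyperparameters are placeholders for demonstration;
# they are not the settings used to train Menda-3b-750.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 200 characters.
    return [-abs(len(c) - 200) / 200.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any dataset with a "prompt" column

args = GRPOConfig(output_dir="menda-3b-750-grpo", max_steps=750, logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```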
- ## Citation and Attribution
-
- If you use this model in your research, please cite:
-
- ```
- @misc{menda2024,
-   author = {WeatherManJ},
-   title = {Menda-3b-750: GRPO-Tuned Reasoning Assistant},
-   year = {2024},
-   publisher = {HuggingFace},
-   howpublished = {\url{https://huggingface.co/weathermanj/Menda-3b-750}}
- }
- ```
-
  ## License

  This model inherits the license of the base Qwen2.5-3B-Instruct model. Please refer to the [Qwen2 license](https://huggingface.co/Qwen/Qwen2-3B-Instruct/blob/main/LICENSE) for details.
 