weathermanj committed (verified)
Commit 7682246 · 1 Parent(s): af95ed6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +12 -79
README.md CHANGED
@@ -7,8 +7,6 @@ tags:
  - instruct
  - fine-tuned
  - reasoning
- - rlhf
- - alignment
  - 3b
  - menda
  datasets:
@@ -58,56 +56,28 @@ model-index:
  value: 55.0
  ---

- # Menda-3b-750: GRPO-Tuned Reasoning Assistant
+ # Menda-3b-750: GRPO-Tuned Qwen2.5 Model

- Menda-3b-750 is a fine-tuned version of Qwen2.5-3B-Instruct, enhanced through 750 steps of GRPO (Guided Rejection Policy Optimization). This model demonstrates improved reasoning capabilities while maintaining the versatility of the base model.
+ Menda-3b-750 is a fine-tuned version of Qwen2.5-3B-Instruct, trained with GRPO (Group Relative Policy Optimization) for 750 steps. This model shows improved performance on reasoning benchmarks compared to the base model.

- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/qwen2_architecture.png" alt="Qwen2 Architecture" width="600"/>
-
- ## Model Overview
-
- Menda-3b-750 builds upon the strong foundation of Qwen2.5-3B-Instruct by applying GRPO, a novel alignment technique that combines the strengths of RLHF (Reinforcement Learning from Human Feedback) with guided optimization. The result is a model that excels at reasoning tasks while remaining helpful, harmless, and honest.
-
- ### Key Features
-
- - **Enhanced Reasoning**: Improved performance on complex reasoning benchmarks
- - **Instruction Following**: Better adherence to specific instructions and formats
- - **Balanced Responses**: Maintains helpfulness while reducing hallucinations
- - **Efficient Size**: Full capabilities in a compact 3B parameter model
-
- ## Technical Details
+ ## Model Details

  - **Base Model**: Qwen2.5-3B-Instruct
- - **Architecture**: Transformer-based decoder-only
  - **Training Method**: GRPO (Group Relative Policy Optimization)
  - **Training Steps**: 750
  - **Context Length**: 4096 tokens
  - **Parameters**: 3 billion
- - **Tokenizer**: Qwen2 tokenizer (BPE-based)
-
- ## Performance Evaluation
+ ## Benchmark Results

- Menda-3b-750 has been evaluated on several standard benchmarks, showing significant improvements over the base model:
+ Menda-3b-750 has been evaluated on several standard benchmarks:

- | Benchmark | Task Type | Menda-3b-750 | Base Model | Improvement |
- |-----------|-----------|--------------|------------|-------------|
- | HellaSwag | Common Sense Reasoning | 75.0% | 67.5% | +7.5% |
- | ARC-Challenge | Scientific Reasoning | 80.0% | 67.5% | +12.5% |
- | MMLU (High School) | Multi-domain Knowledge | 52.5% | 47.5% | +5.0% |
- | TruthfulQA | Factual Accuracy | 55.0% | 47.5% | +7.5% |
+ | Benchmark | Task Type | Accuracy |
+ |-----------|-----------|----------|
+ | HellaSwag | Common Sense Reasoning | 75.0% |
+ | ARC-Challenge | Scientific Reasoning | 80.0% |
+ | MMLU (High School) | Multi-domain Knowledge | 52.5% |
+ | TruthfulQA | Factual Accuracy | 55.0% |
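If you want to spot-check these numbers, a minimal evaluation sketch is below. It assumes the EleutherAI lm-evaluation-harness and common task names; the card does not state which harness or task variants produced the table above.

```bash
# Sketch only: assumes lm-evaluation-harness (pip install lm-eval);
# harness version and task variants used for the card's numbers are unknown.
lm_eval --model hf \
  --model_args pretrained=weathermanj/Menda-3b-750,dtype=bfloat16 \
  --tasks hellaswag,arc_challenge,mmlu,truthfulqa_mc2 \
  --batch_size 8
```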
 
- ### Strengths and Limitations
-
- **Strengths:**
- - Strong performance on reasoning tasks
- - Good balance of helpfulness and accuracy
- - Efficient for deployment on consumer hardware
- - Versatile across multiple domains
-
- **Limitations:**
- - Still shows weaknesses in specialized mathematical reasoning
- - May occasionally produce hallucinations for complex queries
- - Limited context window compared to larger models

  ## Usage Examples

@@ -120,27 +90,13 @@ model_name = "weathermanj/Menda-3b-750"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)

- prompt = "Explain the concept of RLHF (Reinforcement Learning from Human Feedback) in simple terms."
+ prompt = "Explain the concept of machine learning in simple terms."
  inputs = tokenizer(prompt, return_tensors="pt")
  outputs = model.generate(**inputs, max_length=300)
  response = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(response)
  ```
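The quickstart above feeds a bare prompt string to the tokenizer. Since the base checkpoint is an instruct (chat) model, responses are usually better when the request goes through the tokenizer's chat template. A minimal sketch, assuming this repo ships the usual Qwen2.5 chat template:

```python
# Sketch: same model, but routed through the chat template that Qwen2.5-based
# instruct checkpoints normally ship with (assumed to be present in this repo).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "weathermanj/Menda-3b-750"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "Explain the concept of machine learning in simple terms."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```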
 
- ### Reasoning Example
-
- ```python
- prompt = """
- Solve the following problem step by step:
- A store sells shirts for $25 each and pants for $40 each. If a customer bought 3 shirts and some pants for a total of $205, how many pants did they buy?
- """
-
- inputs = tokenizer(prompt, return_tensors="pt")
- outputs = model.generate(**inputs, max_length=500, temperature=0.7)
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(response)
- ```
-
  ### Using with Ollama

  You can also use this model with Ollama by converting it to GGUF format:
@@ -162,29 +118,6 @@ ollama create menda-3b-750 -f Modelfile
  ollama run menda-3b-750
  ```
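The conversion step itself and the Modelfile contents sit outside this diff's context lines. A rough sketch of one way to produce them, assuming a local llama.cpp checkout; the script name, quantization type, and Modelfile below are illustrative assumptions, not the card's own recipe:

```bash
# Sketch only: download the weights locally, convert with llama.cpp's converter
# (script name and flags vary by llama.cpp version), then write a minimal Modelfile.
python llama.cpp/convert_hf_to_gguf.py ./Menda-3b-750 --outfile menda-3b-750.gguf --outtype q8_0

cat > Modelfile <<'EOF'
FROM ./menda-3b-750.gguf
PARAMETER temperature 0.7
EOF
```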
 
- ## Training Methodology
-
- Menda-3b-750 was trained using GRPO, which extends traditional RLHF by incorporating guided optimization. The process involved:
-
- 1. **Base Model Selection**: Starting with Qwen2.5-3B-Instruct
- 2. **Preference Dataset**: Curating high-quality preference pairs
- 3. **GRPO Training**: 750 steps of optimization with carefully tuned hyperparameters
- 4. **Evaluation**: Continuous benchmarking to monitor performance
-
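For readers who want a concrete starting point for the process outlined in the (now removed) list above, here is a rough sketch of a 750-step GRPO run with TRL's `GRPOTrainer`. Note that TRL's GRPO optimizes against reward functions over sampled completion groups rather than explicit preference pairs, and every dataset, reward, and hyperparameter below is an illustrative assumption, not the recipe used for Menda-3b-750.

```python
# Illustrative sketch of a 750-step GRPO run with TRL's GRPOTrainer.
# Dataset, reward function, and hyperparameters are placeholders for demonstration;
# they are not the settings used to train Menda-3b-750.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 200 characters.
    return [-abs(len(c) - 200) / 200.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any dataset with a "prompt" column

args = GRPOConfig(output_dir="menda-3b-750-grpo", max_steps=750, logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```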
- ## Citation and Attribution
-
- If you use this model in your research, please cite:
-
- ```
- @misc{menda2024,
-   author = {WeatherManJ},
-   title = {Menda-3b-750: GRPO-Tuned Reasoning Assistant},
-   year = {2024},
-   publisher = {HuggingFace},
-   howpublished = {\url{https://huggingface.co/weathermanj/Menda-3b-750}}
- }
- ```
-
  ## License

  This model inherits the license of the base Qwen2.5-3B-Instruct model. Please refer to the [Qwen2 license](https://huggingface.co/Qwen/Qwen2-3B-Instruct/blob/main/LICENSE) for details.
 