README.md · weathermanj/Menda-3B-500 at eea913123f0c8945f6d00dd324b80230dfb38a4e

	# Menda-3B-500

	Menda-3B-500 is a fine-tuned version of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) trained with Group Relative Policy Optimization (GRPO). This model is the 500-step checkpoint from that training run.

	## Model Details

	- Base Model: Qwen/Qwen2.5-3B-Instruct
	- Training Method: GRPO (Group Relative Policy Optimization)
	- Training Steps: 500
	- Parameters: 3B
	- Context Length: 32K tokens
	- Training Data: GSM8K (mathematical reasoning)

	## Performance

	The 500-step checkpoint was evaluated on the benchmarks below and shows balanced performance across them:

	### Core Benchmarks (0-shot)

	| Benchmark | Score |
	|-----------|-------|
	| ARC-Challenge | 50.0% |
	| BoolQ | 90.0% |
	| HellaSwag | 40.0% |
	| Lambada | 70.0% |
	| PIQA | 90.0% |
	| Winogrande | 90.0% |

	### MMLU Performance

	| MMLU Category | Score |
	|---------------|-------|
	| Overall | 68.60% |
	| Humanities | 75.38% |
	| Social Sciences | 75.83% |
	| STEM | 60.00% |
	| Other | 67.69% |

	## Key Strengths

	- Balanced Performance: Maintains strong performance across diverse tasks with minimal trade-offs.
	- Improved BoolQ: Achieves 90% on BoolQ, showing excellent reading comprehension capabilities.
	- Strong Reasoning: Maintains 90% on both PIQA and Winogrande, demonstrating robust reasoning abilities.
	- Efficient Training: Reaches these results after only 500 training steps.
	- Stable Knowledge: Maintains strong MMLU performance (68.60%) across diverse knowledge domains.

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "weathermanj/Menda-3B-500"

	# Load the model and tokenizer; dtype and device placement are chosen automatically.
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype="auto",
	    device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained(model_name)

	prompt = "Give me a short introduction to large language models."
	messages = [
	    {"role": "system", "content": "You are a helpful assistant."},
	    {"role": "user", "content": prompt},
	]

	# Render the conversation with the model's chat template and tokenize it.
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)
	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	generated_ids = model.generate(
	    **model_inputs,
	    max_new_tokens=512,
	)

	# generate() returns prompt + continuation; keep only the newly generated tokens.
	generated_ids = [
	    output_ids[len(input_ids):]
	    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
	]

	response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
	print(response)
	```
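
The slicing step near the end of the snippet can be easy to misread, so here is a minimal sketch of what it does, using plain Python lists in place of tensors. The token IDs and the `strip_prompt` helper name are made up for illustration and are not part of the model or of `transformers`.

```python
# Toy token IDs standing in for tensors; the values are invented for illustration.
prompt_batch = [[101, 7, 8, 9]]                 # one prompt of 4 tokens
generated_batch = [[101, 7, 8, 9, 42, 43, 2]]   # prompt followed by 3 new tokens


def strip_prompt(prompt_batch, generated_batch):
    """Drop the leading prompt tokens from each generated sequence."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(prompt_batch, generated_batch)
    ]


new_tokens = strip_prompt(prompt_batch, generated_batch)
print(new_tokens)  # [[42, 43, 2]]
```

Without this step, the decoded response would begin with the rendered system and user messages rather than with the model's answer.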

	## Training Configuration

	The model was trained using the GRPO methodology with the following configuration:

	- LoRA Rank: 128
	- Learning Rate: 5e-6
	- Optimizer: AdamW (8-bit)
	- Batch Size: 8 per device
	- Gradient Accumulation Steps: 4
	- Training Samples: 100 examples from GSM8K
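
As a rough sketch of how these settings combine, the snippet below collects them in a small dataclass and computes the number of samples per optimizer step. The `TrainingSetup` class is a hypothetical container, not the author's training script, and `num_devices=1` is an assumption, since the device count is not stated here. How a particular GRPO trainer counts prompts versus sampled completions within a batch may also differ from this simple arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hypothetical container for the settings listed above."""
    lora_rank: int = 128
    learning_rate: float = 5e-6
    per_device_batch_size: int = 8
    gradient_accumulation_steps: int = 4
    training_steps: int = 500
    num_devices: int = 1  # ASSUMPTION: the device count is not given in this card

    def effective_batch_size(self) -> int:
        # Samples contributing to one optimizer step across all devices.
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )


setup = TrainingSetup()
print(setup.effective_batch_size())  # 32
```

Under the single-device assumption, each optimizer step accumulates gradients over 8 × 4 = 32 samples.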

	## License

	This model is subject to the license of the original Qwen2.5-3B-Instruct model.