## Overview
This model was fine-tuned with reinforcement learning on top of a pretrained LLM, using:

- ORMs (Outcome Reward Models)
- DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)
- SimpleScaling (loss-scaling strategy)
## Training Setup

### Base Model

- Architecture: QwQ-32B (Qwen-style transformer)
- Libraries: `transformers`, `trl`, `deepspeed`, `accelerate`, `vllm`
- Tokenizer: custom-trained (compatible with the Hugging Face format)
### Reward Modules (ORMs)

The following reward functions guided RL fine-tuning:

| Reward Function | Description |
|---|---|
| `math` | Evaluates symbolic math correctness (MathORM) |
| `accuracy` | Targets numeric accuracy (MathAccuracy) |
| `format` | Enforces strict formatting constraints |
| `cosine` | Measures similarity to gold responses |
| `repetition` | Penalizes repeated or degenerate outputs |
| `soft_overlong` | Applies a soft penalty to overly long generations |

These rewards were combined and scaled during training with adaptive weighting.
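The exact reward implementations and the adaptive weighting rule are not published; the following is a minimal illustrative sketch of how such signals can be combined. `format_reward`, `soft_overlong_reward`, the `<think>` tag convention, the word-count limits, and the weights are all assumptions, not the released ORMs.

```python
import re


def format_reward(completion: str) -> float:
    """Illustrative format reward: 1.0 if the completion contains a
    <think>...</think> block, 0.0 otherwise (tag convention is assumed)."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0


def soft_overlong_reward(completion: str, soft_limit: int = 512,
                         hard_limit: int = 1024) -> float:
    """Illustrative soft length penalty: 0 up to soft_limit words, then a
    linear ramp down to -1.0 at hard_limit (limits are assumed values)."""
    n = len(completion.split())
    if n <= soft_limit:
        return 0.0
    return -min(1.0, (n - soft_limit) / (hard_limit - soft_limit))


def combine_rewards(completion: str, weights: dict[str, float]) -> float:
    """Weighted sum of the individual reward signals."""
    signals = {
        "format": format_reward(completion),
        "soft_overlong": soft_overlong_reward(completion),
    }
    return sum(weights[name] * signals[name] for name in signals)


# Example: a short, well-formatted completion gets only the format reward.
reward = combine_rewards("<think>2+2=4</think> 4",
                         {"format": 0.5, "soft_overlong": 1.0})
```

In practice each ORM would score a batch of completions and the weights would be adjusted over training; the static weighted sum above shows only the combination step.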
### Scaling Techniques

- DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): Decouples the lower and upper clipping ranges of the policy-gradient objective and dynamically samples prompts to stabilize RL training.
- SimpleScaling (`newmindai/simplescaling`): Controls optimizer behavior and reward balance across multiple objectives.
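As an illustrative sketch (not the training code), DAPO's decoupled clipping can be written as a per-token surrogate term with separate lower and upper clip ranges; the `eps_low`/`eps_high` defaults below are assumed values, not the ones used for this model.

```python
def dapo_clipped_term(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token clipped surrogate: min(r*A, clip(r, 1-eps_low, 1+eps_high)*A).

    Setting eps_high > eps_low ("clip-higher") lets low-probability tokens
    with positive advantage increase more than a symmetric PPO clip allows.
    """
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high)) * advantage
    return min(ratio * advantage, clipped)
```

For example, with `ratio=1.5` and a positive advantage the term is clipped at `1 + eps_high = 1.28` rather than the symmetric `1.2`, which is the "decoupled" part of the objective.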
### Training Regime

- Stage 1 (Wait #1): The model explores the reward landscape; initial rewards are unstable.
- Stage 2 (Wait #2): Convergence improves as the ORM signals align.
- "Aha" moment: Clear gains in math and formatting scores around ~2K steps after warm-up.
## Evaluation

### 🐍 Mezura-SnakeBench Benchmarking

Final performance was benchmarked with the Mezura SnakeBench framework, a standardized evaluation suite developed by NewmindAI for structured Turkish NLP tasks.
## Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "newmindai/QwQ-32B-r1"

# device_map/dtype settings are a suggestion for a 32B model;
# adjust them to your hardware.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# "What is the highest mountain in Turkey?"
prompt = "Türkiye'nin en yüksek dağı nedir?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```