Overview

This model was fine-tuned with reinforcement learning (RL) on top of a pretrained LLM, enhanced with:

  • ORMs (Outcome Reward Models)
  • DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)
  • SimpleScaling (budget-forcing and reward-scaling strategy)

Training Setup

Base Model

  • Architecture: QwQ-32B (Qwen-style transformer)
  • Libraries: transformers, trl, deepspeed, accelerate, vllm
  • Tokenizer: Custom-trained (compatible with Hugging Face format)

Reward Modules (ORMs)

The following reward functions guided RL fine-tuning:

Reward Function   Description
---------------   ------------------------------------------------
math              Evaluates symbolic math correctness (MathORM)
accuracy          Scores numeric accuracy (MathAccuracy)
format            Enforces strict formatting constraints
cosine            Measures similarity to gold responses
repetition        Penalizes repeated or degenerate outputs
soft_overlong     Applies a soft penalty to overly long generations

These were combined and scaled during training with adaptive weighting.
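The exact ORM implementations are not public. As a hedged sketch, the repetition and soft_overlong signals and their weighted combination might look like the following (function names, weights, and thresholds are illustrative assumptions, not the actual training code):

```python
def repetition_reward(text: str, n: int = 3) -> float:
    # Fraction of unique n-grams in [0, 1]; repeated n-grams lower the score.
    tokens = text.split()
    if len(tokens) < n:
        return 1.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def soft_overlong_reward(text: str, budget: int = 256, slack: int = 64) -> float:
    # Linear soft penalty once length exceeds the budget, clipped at -1.
    overflow = len(text.split()) - budget
    if overflow <= 0:
        return 0.0
    return max(-1.0, -overflow / slack)

def combine_rewards(rewards: dict, weights: dict) -> float:
    # Weighted sum normalized by total weight. During training the weights
    # would be adapted; here they are fixed for illustration.
    total = sum(weights.values())
    return sum(weights[k] * rewards[k] for k in rewards) / total
```

In a TRL-style setup, each function would be registered as a separate reward callback and the combined score used as the scalar RL reward.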

Scaling Techniques

  • DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization; widens the upper PPO clipping bound and resamples prompts whose completions all receive identical rewards.
  • SimpleScaling (newmindai/simplescaling): Controls optimizer behavior and balances reward weights across the multiple ORM objectives.
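Assuming DAPO here refers to the decoupled-clip / dynamic-sampling policy-optimization algorithm from the RL fine-tuning literature, its two core ideas can be sketched as follows (the epsilon values are that paper's defaults, used here as assumptions):

```python
def dapo_surrogate(ratio: float, advantage: float,
                   eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    # Decoupled clipping ("clip-higher"): the upper bound 1 + eps_high is
    # wider than the lower bound 1 - eps_low, so low-probability tokens can
    # still be up-weighted, which encourages exploration.
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def keep_prompt_group(rewards: list) -> bool:
    # Dynamic sampling: drop prompt groups whose sampled completions all
    # receive the same reward, since they contribute zero advantage signal.
    return len(set(rewards)) > 1
```

During training, only groups passing `keep_prompt_group` would enter the batch, and `dapo_surrogate` would be averaged over tokens to form the policy loss.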

Training Regime

  • Stage 1 (Wait #1): The model explores the reward landscape; initial rewards are unstable.
  • Stage 2 (Wait #2): Convergence improves as the ORM signals align.
  • Aha moment: Clear gains in math and formatting scores appear roughly 2K steps after warm-up.
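The "Wait" labels above suggest s1-style budget forcing, in which the end-of-thinking marker is suppressed and a literal "Wait" is appended so the model keeps reasoning. A toy sketch under that assumption (the `generate` callback, marker string, and wait count are all illustrative):

```python
def budget_forced_generate(generate, prompt, end_marker="</think>", num_waits=2):
    # `generate` maps the current text to newly generated text. Each time
    # the model closes its thinking block early, strip the marker and
    # append "Wait" to force another round of reasoning.
    text = prompt
    for _ in range(num_waits):
        text += generate(text)
        if text.endswith(end_marker):
            text = text[: -len(end_marker)] + " Wait"
    text += generate(text)  # final pass runs to the natural stop
    return text
```

In real use, `generate` would wrap a tokenizer/model decoding loop with the end-of-thinking token as a stop sequence.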

Evaluation

🐍 Mezura-SnakeBench Benchmarking
Final performance was benchmarked with the Mezura SnakeBench framework, a standardized evaluation suite developed by NewmindAI for structured Turkish NLP tasks.

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "newmindai/QwQ-32B-r1"
# A 32B model: load in bfloat16 and let accelerate shard it across devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Türkiye'nin en yüksek dağı nedir?"  # "What is Turkey's highest mountain?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))