Overview

This model was fine-tuned with reinforcement learning (RL) on top of a pretrained LLM, enhanced with:

  • ORMs (Outcome Reward Models)
  • DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)
  • SimpleScaling (budget-forcing and reward-scaling strategy)

Training Setup

Base Model

  • Architecture: QwQ-32B (Qwen-style transformer)
  • Libraries: transformers, trl, deepspeed, accelerate, vllm
  • Tokenizer: Custom-trained (compatible with Hugging Face format)

Reward Modules (ORMs)

The following reward functions guided RL fine-tuning:

Reward Function   Description
---------------   ------------------------------------------------
math              Evaluates symbolic math correctness (MathORM)
accuracy          Scores numeric accuracy (MathAccuracy)
format            Enforces strict formatting constraints
cosine            Measures similarity to gold responses
repetition        Penalizes repeated or degenerate outputs
soft_overlong     Applies a soft penalty to overly long generations

These were combined and scaled during training with adaptive weighting.
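The exact ORM implementations are not public. As a hedged sketch, the repetition and soft_overlong signals and their weighted combination might look like the following (function names, weights, and thresholds are illustrative assumptions, not the actual training code):

```python
def repetition_reward(text: str, n: int = 3) -> float:
    # Fraction of unique n-grams in [0, 1]; repeated n-grams lower the score.
    tokens = text.split()
    if len(tokens) < n:
        return 1.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def soft_overlong_reward(text: str, budget: int = 256, slack: int = 64) -> float:
    # Linear soft penalty once length exceeds the budget, clipped at -1.
    overflow = len(text.split()) - budget
    if overflow <= 0:
        return 0.0
    return max(-1.0, -overflow / slack)

def combine_rewards(rewards: dict, weights: dict) -> float:
    # Weighted sum normalized by total weight. During training the weights
    # would be adapted; here they are fixed for illustration.
    total = sum(weights.values())
    return sum(weights[k] * rewards[k] for k in rewards) / total
```

In a TRL-style setup, each function would be registered as a separate reward callback and the combined score used as the scalar RL reward.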

Scaling Techniques

  • DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization; widens the upper PPO clipping bound and resamples prompts whose completions all receive identical rewards.
  • SimpleScaling (newmindai/simplescaling): Controls optimizer behavior and balances reward weights across the multiple ORM objectives.
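Assuming DAPO here refers to the decoupled-clip / dynamic-sampling policy-optimization algorithm from the RL fine-tuning literature, its two core ideas can be sketched as follows (the epsilon values are that paper's defaults, used here as assumptions):

```python
def dapo_surrogate(ratio: float, advantage: float,
                   eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    # Decoupled clipping ("clip-higher"): the upper bound 1 + eps_high is
    # wider than the lower bound 1 - eps_low, so low-probability tokens can
    # still be up-weighted, which encourages exploration.
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def keep_prompt_group(rewards: list) -> bool:
    # Dynamic sampling: drop prompt groups whose sampled completions all
    # receive the same reward, since they contribute zero advantage signal.
    return len(set(rewards)) > 1
```

During training, only groups passing `keep_prompt_group` would enter the batch, and `dapo_surrogate` would be averaged over tokens to form the policy loss.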

Training Regime

  • Stage 1 (Wait #1): The model explores the reward landscape; initial rewards are unstable.
  • Stage 2 (Wait #2): Convergence improves as the ORM signals align.
  • Aha moment: Clear gains in math and formatting scores appear roughly 2K steps after warm-up.
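The "Wait" labels above suggest s1-style budget forcing, in which the end-of-thinking marker is suppressed and a literal "Wait" is appended so the model keeps reasoning. A toy sketch under that assumption (the `generate` callback, marker string, and wait count are all illustrative):

```python
def budget_forced_generate(generate, prompt, end_marker="</think>", num_waits=2):
    # `generate` maps the current text to newly generated text. Each time
    # the model closes its thinking block early, strip the marker and
    # append "Wait" to force another round of reasoning.
    text = prompt
    for _ in range(num_waits):
        text += generate(text)
        if text.endswith(end_marker):
            text = text[: -len(end_marker)] + " Wait"
    text += generate(text)  # final pass runs to the natural stop
    return text
```

In real use, `generate` would wrap a tokenizer/model decoding loop with the end-of-thinking token as a stop sequence.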

Evaluation

🐍 Mezura-SnakeBench Benchmarking
Final performance was benchmarked with the Mezura SnakeBench framework, a standardized evaluation suite developed by NewmindAI for structured Turkish NLP tasks.

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "newmindai/QwQ-32B-r1"
# A 32B model: load in bfloat16 and let accelerate shard it across devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Türkiye'nin en yüksek dağı nedir?"  # "What is Turkey's highest mountain?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))