Overview
The AmirhoseinGH/DS-Qwen-1.5b-GG-CalibratedConfRL model is derived from the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B base model and is fine-tuned with confidence-based reinforcement learning (GRPO) to improve the calibration of its intrinsic confidence signals.
Purpose and Capabilities
This calibration process significantly improves the reliability of the model’s internal confidence signals. The model is optimized for use with the Guided by Gut (GG) framework, a self-guided test-time scaling (TTS) strategy that leverages these intrinsic confidence signals to perform complex reasoning tasks efficiently—without costly external verifier models.
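The snippet below is a minimal sketch (not the official GG implementation) of what such an intrinsic signal looks like in practice: it loads the calibrated model with Hugging Face Transformers and reads the probability the model assigns to each token it generates, a simple proxy for token-level confidence. The prompt and decoding settings are illustrative only.

```python
# Minimal sketch: load the calibrated model and read token-level confidence,
# i.e. the probability assigned to each generated token. GG-style methods
# aggregate signals of this kind per reasoning step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AmirhoseinGH/DS-Qwen-1.5b-GG-CalibratedConfRL"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Solve step by step: what is 17 * 24?"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )

# Probability of each generated token under the model: a simple confidence proxy.
gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
probs = [torch.softmax(s[0], dim=-1)[t].item() for s, t in zip(out.scores, gen_tokens)]
print("mean token confidence:", sum(probs) / len(probs))
```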
Guided by Gut (GG) Framework
Traditional TTS methods often require substantial computational resources due to their reliance on external verifier models like Process Reward Models (PRMs) or extensive sampling strategies (e.g., Best-of-N). The GG framework provides a powerful yet computationally efficient alternative:
- Intrinsic signals: Utilizes token-level confidence and step novelty from the model itself (a simplified scoring loop is sketched after this list).
- Confidence Calibration via RL: Refines intrinsic confidence through targeted RL fine-tuning.
- Efficiency: Significantly reduces GPU memory usage and inference time, enabling small models (1.5B parameters) to match or outperform much larger models (32B–70B parameters).
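Reusing the `model` and `tokenizer` from the sketch above, the following is a heavily simplified illustration of confidence-guided step selection: sample a few candidate reasoning steps, score each by its mean token probability, and keep the most confident one. The actual GG algorithm in the paper also uses step novelty and other details not reproduced here; the function names and hyperparameters below are illustrative assumptions.

```python
# Simplified sketch of confidence-guided step selection (not the exact GG algorithm).
# Relies on `model` and `tokenizer` loaded in the previous snippet.
import torch

def mean_token_confidence(cand_scores, seq, prompt_len):
    """Mean probability of the tokens actually generated for one candidate."""
    probs = []
    for step_logits, tok in zip(cand_scores, seq[prompt_len:]):
        if tok == tokenizer.eos_token_id:
            break
        probs.append(torch.softmax(step_logits, dim=-1)[tok].item())
    return sum(probs) / max(len(probs), 1)

def pick_next_step(prompt, num_candidates=4, step_tokens=48):
    """Sample several candidate steps and return the most confident one."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.7,
            num_return_sequences=num_candidates,
            max_new_tokens=step_tokens,
            output_scores=True,
            return_dict_in_generate=True,
        )
    best_text, best_conf = None, -1.0
    for i in range(num_candidates):
        # out.scores[t] has shape (num_candidates, vocab); row i is candidate i.
        cand_scores = [s[i] for s in out.scores]
        conf = mean_token_confidence(cand_scores, out.sequences[i], prompt_len)
        if conf > best_conf:
            best_conf = conf
            best_text = tokenizer.decode(
                out.sequences[i, prompt_len:], skip_special_tokens=True
            )
    return best_text, best_conf
```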
Key Advantages of GG:
- Up to 10× less GPU memory compared to traditional PRM-based methods.
- 8× faster inference than PRM-based approaches.
- 50% lower KV cache memory usage compared to Best-of-N strategies.
Calibration and Training
The RL fine-tuning stage is lightweight and computationally inexpensive (a schematic training setup is sketched after this list):
- Dataset: Fine-tuned on the LIMO dataset for 3 epochs.
- Hardware: Completed using 2 NVIDIA A100 80GB GPUs.
- Time: Approximately one day.
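For a concrete picture of such a run, here is a schematic GRPO setup using TRL's GRPOTrainer. This is an assumption about tooling (the card does not state which training code was used), the reward function is a neutral placeholder rather than the paper's confidence-calibration reward, and the LIMO hub id and column names are also assumptions.

```python
# Schematic GRPO fine-tuning setup with TRL (assumed tooling, not the authors' code).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("GAIR/LIMO", split="train")  # hub id assumed for the LIMO dataset
# GRPOTrainer expects a "prompt" column; map LIMO's question field (name assumed).
dataset = dataset.map(lambda ex: {"prompt": ex["question"]})

def calibration_reward(completions, **kwargs):
    # Placeholder: the real reward ties the model's intrinsic confidence to answer
    # correctness (see the paper). Here every completion gets a neutral reward.
    return [0.0 for _ in completions]

config = GRPOConfig(
    output_dir="ds-qwen-1.5b-gg-calibrated",
    num_train_epochs=3,            # matches the 3 epochs reported in the card
    per_device_train_batch_size=4,
    num_generations=4,             # GRPO scores a group of completions per prompt
    max_completion_length=1024,
    bf16=True,
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=calibration_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```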
More Information
- GitHub Repository: 👨‍💻 Amirhosein-gh98/Guided-by-Gut
- Research Paper: 📄 "Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence" on arXiv.