Uploaded model

  • Developed by: Akshint47
  • License: apache-2.0
  • Finetuned from model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit

This Qwen2.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Fine-Tuning Qwen2.5-3B-Instruct with GRPO on the GSM8K Dataset

This notebook demonstrates fine-tuning the Qwen2.5-3B-Instruct model with GRPO (Group Relative Policy Optimization) on the GSM8K dataset. The goal is to improve the model's ability to solve mathematical reasoning problems through reinforcement learning with custom reward functions.

Overview

The notebook is structured as follows:

  1. Installation: Installs necessary libraries such as unsloth, vllm, and trl for efficient fine-tuning and inference.
  2. Unsloth Setup: Configures the environment for faster fine-tuning using Unsloth's PatchFastRL and loads the Qwen2.5-3B-Instruct model with LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning (a minimal loading sketch follows this list).
  3. Data Preparation: Loads and preprocesses the GSM8K dataset, formatting it for training with a system prompt and XML-style reasoning and answer format.
  4. Reward Functions: Defines custom reward functions to evaluate the model's responses, including:
    • Correctness Reward: Checks if the extracted answer matches the ground truth.
    • Format Reward: Ensures the response follows the specified XML format.
    • Integer Reward: Verifies if the extracted answer is an integer.
    • XML Count Reward: Evaluates the completeness of the XML structure in the response.
  5. GRPO Training: Configures and runs the GRPO trainer with vLLM for fast inference, using the defined reward functions to optimize the model's performance.
  6. Training Progress: Monitors the training progress, including rewards, completion length, and KL divergence, to ensure the model is improving over time.
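
A minimal sketch of the Unsloth setup and LoRA loading step, loosely following Unsloth's public GRPO example. The concrete values (sequence length, LoRA rank, GPU memory fraction, target modules) are illustrative placeholders, not necessarily the ones used in this notebook.

```python
from unsloth import FastLanguageModel, PatchFastRL

# Patch TRL's GRPO trainer with Unsloth's faster kernels before loading the model.
PatchFastRL("GRPO", FastLanguageModel)

max_seq_length = 1024   # illustrative; long enough for reasoning traces
lora_rank = 64          # illustrative LoRA rank

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,           # 4-bit quantized base weights
    fast_inference=True,         # enable vLLM-backed generation
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.5,  # fraction of VRAM reserved for vLLM
)

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=lora_rank,
    use_gradient_checkpointing="unsloth",  # memory-friendly checkpointing
    random_state=3407,
)
```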

Key Features

  • Efficient Fine-Tuning: Utilizes Unsloth and LoRA to fine-tune the model with reduced memory usage and faster training times.
  • Custom Reward Functions: Implements multiple reward functions to guide the model towards generating correct and well-formatted responses.
  • vLLM Integration: Leverages vLLM for fast inference during training, enabling efficient generation of multiple responses for reward calculation.
  • GSM8K Dataset: Targets grade-school math word problems (GSM8K) to strengthen the model's mathematical reasoning.

Requirements

  • Python 3.11
  • Libraries: unsloth, vllm, trl, torch, transformers

Installation

To set up the environment, run:

pip install unsloth vllm trl

Usage

  • Load the Model: The notebook loads the Qwen2.5-3B-Instruct model with LoRA for fine-tuning.

  • Prepare the Dataset: The GSM8K dataset is loaded and formatted with a system prompt and an XML-style reasoning/answer format (see the data-preparation sketch after this list).

  • Define Reward Functions: Custom reward functions are defined to score each generated response (see the reward-function sketch after this list).

  • Train the Model: The GRPO trainer is configured and run to fine-tune the model against the defined reward functions (see the trainer sketch after this list).

  • Monitor Progress: The training progress is monitored, including rewards, completion length, and KL divergence.
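
For the dataset step, here is a sketch of how GSM8K is commonly loaded and reformatted into a chat-style prompt with an XML reasoning/answer convention. The system prompt wording and tag names are assumptions for illustration; the notebook defines its own.

```python
from datasets import load_dataset

# Illustrative system prompt asking for XML-style reasoning and answer tags.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def extract_gsm8k_answer(text: str) -> str:
    # GSM8K ground-truth answers end with "#### <number>".
    return text.split("####")[-1].strip()

def get_gsm8k_questions(split: str = "train"):
    data = load_dataset("openai/gsm8k", "main")[split]
    return data.map(lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": extract_gsm8k_answer(x["answer"]),
    })

dataset = get_gsm8k_questions()
```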
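
The reward functions follow TRL's GRPO reward interface: each function receives the sampled completions (plus any extra dataset columns, such as the ground-truth answer) and returns one score per completion. The sketch below covers the correctness, format, and integer rewards; the XML-count reward follows the same pattern. The specific reward values (2.0, 0.5) are illustrative, not the notebook's exact weights.

```python
import re

def extract_xml_answer(text: str) -> str:
    # Pull whatever sits between <answer> ... </answer> tags.
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Reward when the extracted answer matches the GSM8K ground truth.
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    # Reward when the response follows the full XML reasoning/answer layout.
    pattern = r"<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]

def int_reward_func(completions, **kwargs) -> list[float]:
    # Small bonus when the extracted answer is a plain integer.
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if extract_xml_answer(r).isdigit() else 0.0 for r in responses]
```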
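
Finally, the GRPO trainer is configured with vLLM generation enabled and the reward functions registered. Hyperparameter values below are placeholders for illustration; `model`, `tokenizer`, `dataset`, and the reward functions come from the earlier sketches.

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    use_vllm=True,                   # generate completions with vLLM
    learning_rate=5e-6,
    per_device_train_batch_size=6,   # must be divisible by num_generations
    gradient_accumulation_steps=1,   # increase to smooth training
    num_generations=6,               # completions sampled per prompt
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
    logging_steps=1,
    output_dir="outputs",
    report_to="none",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        correctness_reward_func,
        strict_format_reward_func,
        int_reward_func,
    ],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```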

Results

  • Training is designed to improve the model's ability to generate correct, well-formatted responses to mathematical reasoning problems. The reward functions guide the model toward better performance, and training metrics (rewards, completion length, KL divergence) are logged for analysis.

Future Work

  • Hyperparameter Tuning: Experiment with different learning rates, batch sizes, and reward weights to optimize performance.

  • Additional Datasets: Extend the fine-tuning process to other datasets to improve the model's generalization capabilities.

  • Advanced Reward Functions: Implement more sophisticated reward functions to further refine the model's responses.

Acknowledgments

  • Unsloth: For providing tools to speed up fine-tuning.

  • vLLM: For enabling fast inference during training.

  • Hugging Face: For the trl library and the GSM8K dataset.

  • Special thanks to @sudhir2016 for mentoring me throughout the development of this fine-tuned model.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.


Model tree for Akshint47/Nano_R1_Model: base model Qwen/Qwen2.5-3B; this model is a LoRA adapter.