This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct, trained with TRL using GRPO (Group Relative Policy Optimization) for medical question answering.
## Model Details
- Base Model: Qwen/Qwen2.5-1.5B-Instruct
- Training Method: GRPO (Group Relative Policy Optimization)
- Training Dataset: FreedomIntelligence/medical-o1-reasoning-SFT
- Hardware: Single GPU
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and tokenizer
# (replace the placeholder repo ID below with this model's actual ID)
model_id = "your-username/your-model-id"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

question = "What are the common symptoms of diabetes?"

system_prompt = """You are a medical AI assistant. Provide detailed reasoning before giving your final answer.
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

prompt = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question},
]

# Build the chat-formatted input and generate a response
inputs = tokenizer.apply_chat_template(prompt, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
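With this system prompt, the model is expected to emit its step-by-step reasoning inside `<reasoning>` tags, followed by the final answer inside `<answer>` tags.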
## Training Details
This model was trained using GRPO with multiple reward components:
- Correctness reward (weight: 2.0)
- Format adherence reward (weight: 1.0)
- Reasoning quality reward (weight: 1.0)
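As a rough illustration of how these weighted components could combine into a single scalar reward, here is a minimal sketch. The helper names, the regex-based checks, and the length heuristic are assumptions for illustration, not the actual reward code used in training:

```python
import re

def format_reward(completion: str) -> float:
    # 1.0 if the completion follows the <reasoning>...</reasoning><answer>...</answer> template
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, reference: str) -> float:
    # Hypothetical check: extract the answer span and compare it against the reference
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer and reference.lower() in answer.lower() else 0.0

def reasoning_reward(completion: str) -> float:
    # Hypothetical proxy: reward non-trivial reasoning length inside the tags
    match = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    return 1.0 if match and len(match.group(1).split()) >= 25 else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Weighted sum matching the weights listed above (2.0 / 1.0 / 1.0)
    return (2.0 * correctness_reward(completion, reference)
            + 1.0 * format_reward(completion)
            + 1.0 * reasoning_reward(completion))

# Example: score one completion against a reference answer
completion = "<reasoning>" + "step " * 30 + "</reasoning>\n<answer>Increased thirst and frequent urination.</answer>"
print(total_reward(completion, "frequent urination"))  # 4.0 if all components fire
```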
## Framework Versions
- TRL: 0.13.0
- Transformers: latest release at training time
- PyTorch: latest release at training time
- Flash Attention 2: Enabled
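To match the training setup at inference time, Flash Attention 2 can be enabled when loading the model. This snippet is illustrative and assumes the flash-attn package is installed on a supported GPU; the repo ID is the same placeholder as in Quick Start:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "your-username/your-model-id"  # placeholder repo ID

# Enable Flash Attention 2 at load time
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Flash Attention 2 requires fp16 or bf16 weights
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```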