---
license: apache-2.0
---

## Introduction

Qwen2.5-32B-DialogueReason is a dialogue-based reasoning model built on Qwen2.5-32B-Base.

We train the model using [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero) data through rule-based reinforcement learning.

## 🧠 Key Features

- Built on Qwen2.5-32B-Base as the foundation model.
- Trained with rule-based RL to elicit dialogue-style reasoning.
- Dynamic agent initialization that adapts to different scenarios.
- Flexible environment configuration for setting up task-specific contexts.
- Multi-turn dialogue reasoning that solves problems incrementally.

## Example

### System:

> The User asks a question, and the Assistant writes a masterpiece play depicting experts (picked based on the topic with concrete names) solving the question in a ultra-detailed dialogue. The response is formatted as: <play>the play goes here</play>\\n<answer> if asked to write code, then code here surrounded by ```. Otherwise, answer here with \\boxed{answer} emphasized</answer>.

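The system prompt asks for a response shaped as a `<play>…</play>` block followed by an `<answer>…</answer>` block. As a minimal sketch of how such output might be post-processed, assuming the model follows that format, the snippet below splits a response into its parts. The helper name `parse_dialogue_response` and the sample string are hypothetical illustrations and are not part of the model release.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper (not shipped with the model): splits a response that
# follows the <play>...</play> / <answer>...</answer> format from the system prompt.
_PLAY_RE = re.compile(r"<play>(.*?)</play>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Matches \boxed{...} without nested braces; nested LaTeX would need a real parser.
_BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def parse_dialogue_response(text: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (play, answer, boxed); each element is None when it is absent."""
    play_match = _PLAY_RE.search(text)
    answer_match = _ANSWER_RE.search(text)
    play = play_match.group(1).strip() if play_match else None
    answer = answer_match.group(1).strip() if answer_match else None
    boxed = None
    if answer is not None:
        boxed_match = _BOXED_RE.search(answer)
        if boxed_match:
            boxed = boxed_match.group(1).strip()
    return play, answer, boxed


# Made-up sample output, for illustration only.
sample = "<play>Alice: 2 + 2 is 4.\nBob: Agreed.</play>\n<answer>\\boxed{4}</answer>"
play, answer, boxed = parse_dialogue_response(sample)
```

Returning `None` for a missing tag, rather than raising, lets a caller decide how to treat malformed generations, for example by retrying or by discarding the sample.
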
### User:

> Give me a detailed explanation of PPO in RL

### Assistant:

> 