π§ββοΈ Qwen3-4B RPG Roleplay V2 (GRPO)
Aligning Characters with Deeper Personas

A new version trained with GRPO for more consistent, high-quality, and aligned character roleplaying.
π Model Overview
Welcome to V2! I'm Chun (@chun121), and this is the next evolution of the Qwen3-4B Roleplay model. This version moves beyond standard fine-tuning and leverages GRPO (Generative Responsive Preference Optimization) to align the model's behavior with the core principles of great roleplaying.
π | π¬ | π§ | βοΈ |
Character Consistency |
High-Quality Dialogue |
Intent Understanding |
Structured Format |
Maintains strong persona adherence |
Detailed, engaging non-generic responses |
Comprehends user questions & scenarios |
Uses <thinking> analysis process |
Built on the unsloth/Qwen3-4B-Base
, this LoRA was trained not just to predict text, but to generate responses that are actively rewarded for being in-character, high-quality, and contextually aware. It's designed for creators who need AI characters that are not only conversational but also consistent and deeply aligned with their defined personas.
π Technical Specifications
π§ Feature | π Details |
---|---|
Base Model | unsloth/Qwen3-4B-Base |
Architecture | Transformer LLM with GRPO & LoRA |
Parameter Count | 4 Billion (Base) + LoRA parameters |
Quantization Options | 4-bit (bnb), GGUF variants |
Training Framework | Unsloth & TRL (GRPOTrainer) |
Context Length | 2048 tokens |
Developer | Chun |
License | MIT |
π§ Training with GRPO
π Training Flow | π Description |
---|---|
π Dataset | Gryphe/Sonnet3.5-Charcard-Roleplay |
β¬οΈ | |
ποΈ Stage 1: Preliminary Fine-Tuning | Teaches custom chat format including <thinking> and <RESPONSE> tags |
β¬οΈ | |
π― Stage 2: GRPO Training | Reward-based optimization using GRPOTrainer from TRL |
β¬οΈ | |
π§ββοΈ Final Model | Qwen3-4B RPG Roleplay V2 with superior alignment |
This model's strength comes from its training methodology. Instead of simple fine-tuning, it was trained using GRPO, an alignment algorithm similar to DPO, on a free Google Colab T4 GPU.
π Two-Stage Training Process
ποΈ Stage 1: Preliminary Fine-TuningTeaches custom chat format including |
π― Stage 2: GRPO TrainingReward-based optimization using |
π Reward Functions
The model was trained to excel in these key areas:
π― Reward Category | π Description |
---|---|
Format Adherence | Following internal thinking/response structure |
Roleplay Quality | Generating longer, detailed responses with character actions |
Request Comprehension | Directly answering user questions or acting on requests |
Character Consistency | Reflecting personality and traits from system prompt |
Engagement | Using conversational language, avoiding generic replies |
π Dataset Deep Dive
π Gryphe/Sonnet3.5-Charcard-Roleplay
Premium synthetic roleplay conversations powered by Claude Sonnet 3.5
The model was trained on the Gryphe/Sonnet3.5-Charcard-Roleplay dataset, a premium collection of synthetic roleplay conversations.
π Metric | π― Value |
---|---|
Total Conversations | 9,736 |
Source | Claude Sonnet 3.5 Generated |
Quality | High-quality, character-card-based |
Structure | system β human β gpt flow |
β οΈ Content Warning: This dataset contains NSFW (Not Safe For Work) and mature themes. The model may generate such content due to its training data. Please implement content filtering if your application requires it.
π Getting Started
π» Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load the V2 model with 4-bit quantization
model_name = "Chun121/qwen3-4b-rpg-roleplay-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# 1. Define your character and scene using the recommended prompt structure.
# This detailed format is key to getting high-quality responses.
system_prompt_content = """
Character: Elara, the Impatient Archmage
Tags: fantasy, magic, elf, library, knowledgeable, impatient
Elara's Personality:
Elara possesses centuries of arcane knowledge but has very little patience for novices, whom she sees as wasting her valuable time. She is sharp, direct, and can be condescending, but her advice is always accurate, even if delivered with a sigh. She values true intellectual curiosity but despises laziness.
Scenario:
- **Setting:** The Grand Library of Mystral, a place of immense power and silence.
- A young, nervous apprentice ({{user}}) has approached Elara for help with a basic spell, interrupting her research.
Take the role of Elara. You must engage in a roleplay conversation with {{user}}. Do not write {{user}}'s dialogue. Respond from Elara's perspective, embodying her personality and knowledge.
"""
# 2. Define your character and user messages
messages = [
{
"role": "system",
"content": system_prompt_content,
},
{
"role": "user",
"content": "Excuse me, Archmage. I'm... I'm having trouble with the basic fire conjuration spell. Could you please help me?"
}
]
# 3. Apply the chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# 4. Generate the response
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
inputs["input_ids"],
max_new_tokens=256,
temperature=0.8,
top_p=0.9,
do_sample=True
)
print(tokenizer.decode(outputs, skip_special_tokens=True))
π Prompting the Model: Character and Scene
π― Prompt Engineering Best Practices
Master the art of character creation with structured prompting
The model is trained to follow a specific structure that separates the overall rules, the character's description, and the user's dialogue. For best results, structure your prompts this way.
π― 1. The System Message: Defining the Character
The system
message is crucial. It tells the model how to behave. It should contain the character's description, personality, background, and any relevant context for the scene.
π Key Elements | π Description |
---|---|
Character Name & Title | A clear identifier |
Tags | Helps define genre and themes |
Personality | Core traits summary |
Scenario | Context for interaction (use {{user}} ) |
Instructions | Explicit role-taking commands |
Example of a well-structured system
prompt:
Character: Melina, The Unfaithful Wife
Tags: nsfw, english, scenario, roleplay, love, netori, milf, female
Melina's Personality:
Melina is an unfaithful wife who is unhappy in her marriage to her husband, "Aki." She is cautious and meticulous, but also looking for excitement and feels a connection to {{user}}.
Scenario:
- **Setting:** Melina's home.
- You are a mail carrier ({{user}}), and Melina often finds reasons to talk to you. Today, she seems particularly inviting.
Take the role of Melina. Taking the above information into consideration, you must engage in a roleplay conversation with {{user}} below this line. Do not write {{user}}'s dialogue lines in your responses.
π¬ 2. The User Message: Your Turn
The user
message is simply what you, the user, say or do in the scene.
# Example user message for the "Melina" character card above
user_message = {
"role": "user",
"content": "*I hand you the stack of letters, noticing you seem a bit more dressed up than usual.* Here's your mail, Melina. Everything alright?"
}
π€ 3. The Model's Internal Process
The model generates a private "thought" process inside <thinking>
tags before creating its public response inside <RESPONSE>
tags. This allows for more consistent and thoughtful roleplay.
ποΈ GGUF Models for llama.cpp
π§ Optimized Quantization Options
Choose the perfect balance of quality and performance for your hardware
For users who want to run the model on CPU or with GPU offloading, GGUF models are provided:
π§ Quantization | πΎ Size (GB) | π― Recommended Use |
---|---|---|
Q4_K_M | 2.50 GB | π Recommended - Best balance of performance and size |
Q5_K_M | 2.89 GB | Higher quality than Q4_K_M with minimal size increase |
Q8_0 | 4.28 GB | High-quality quantization, near full precision |
F16 | 8.05 GB | Full 16-bit precision - highest quality |
Example llama.cpp
command:
./llama-cli -m ./qwen3-4b-rpg-roleplay-v2.Q4_K_M.gguf --color -c 2048 --temp 0.8 -p "Your prompt here"
π‘ Best Practices & Usage Tips
π― Use Chat TemplateAlways use |
π Detailed System PromptComprehensive character cards are |
π‘οΈ Moderate TemperatureValues between 0.7-0.85 offer |
π Leverage Context2048-token window allows |
β οΈ Limitations
β οΈ Limitation | π Description |
---|---|
NSFW Content | May generate explicit content due to training data |
Synthetic Data | Training data is AI-generated, may lack human nuance |
Context Window | Limited to 2048 tokens - traits may degrade in long conversations |
Inherited Limitations | Inherits any limitations from base model |
π Related Projects
π My Other Fine-tunes Explore more models by Chun |
β‘ Unsloth Library Optimization framework used |
π GRPO Training Notebook Exact notebook used for training |
π Gryphe's Datasets High-quality roleplay datasets |
π Acknowledgements
Special thanks to the incredible teams and individuals who made this possible:
π₯ Qwen & Unsloth teams - For their incredible models and libraries
π Gryphe - For the high-quality Sonnet 3.5 dataset
π TRL team - For creating and open-sourcing the GRPO trainer
π€ HuggingFace community - For their continued support
π¬ Feedback & Contact
π Issues & Bugs Open an issue on HuggingFace |
π¬ Connect @chun121 on HuggingFace |
π Share Examples Show us your characters! |
β¨ May your characters speak with voices that feel truly alive! β¨
Created with β€οΈ by Chun
π§ββοΈ Qwen3-4B RPG Roleplay V2 | GRPO Enhanced | MIT License
- Downloads last month
- 1,105
4-bit
5-bit
8-bit
16-bit