# Flan-T5-{Small|Large|XL}-RPO

Fine-tuned with Reward Partitioning Optimization (RPO), a value-free, stable method for single-trajectory reinforcement learning with scalar feedback.
## Model Summary

This model is a fine-tuned variant of the Flan-T5 {Small|Large|XL} checkpoint, trained with Reward Partitioning Optimization (RPO). RPO is a method designed for learning from single-trajectory scalar feedback (e.g., thumbs up/down) and eliminates the need for learned value functions or preference pairs.
- Trained with only (prompt, response, reward) triplets.
- No joint optimization and no auxiliary models.
- Efficient and stable training.
- Strong preference alignment (evaluated with an LLM-as-a-judge).
- Outperforms KTO and DRO on automatic metrics and LLM preference win rate.
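Concretely, each training example is just a scalar-labelled triplet. Below is a minimal sketch of that format; the field names and values are illustrative, not the exact schema used during training.

```python
# Illustrative (prompt, response, reward) triplets: the scalar reward can come
# from thumbs up/down feedback, dataset annotations, or a reward model.
triplets = [
    {"prompt": "Explain photosynthesis to a child.",
     "response": "Plants use sunlight to turn air and water into food.",
     "reward": 0.9},
    {"prompt": "Explain photosynthesis to a child.",
     "response": "Photosynthesis is the light-dependent synthesis of glucose.",
     "reward": 0.3},
]
```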
## Training Details

- Base Model: `flan-t5-{small|large|xl}`
- Dataset: UltraFeedback, providing high-quality (prompt, response, reward) triplets with multiple completions per prompt.
- Feedback Format: scalar reward per example, i.e., (prompt, response, reward).
- GPU Used: 1× A100 (80 GB)
- Training Objective: RPO supervised learning using partitioned reward normalization.
- Baselines Compared: DRO and KTO.
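For intuition only, the sketch below shows one way "partitioned reward normalization" could look: the scalar rewards of the multiple completions of the same prompt are normalized against each other (here with a softmax) and used to weight each completion's negative log-likelihood. This is an illustration under those assumptions, not the paper's exact objective; the names `partition_weights`, `rpo_style_loss`, and `beta` are hypothetical.

```python
import torch
import torch.nn.functional as F

def partition_weights(rewards: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Normalize scalar rewards across the completions of a single prompt.

    A softmax over the prompt's rewards turns raw scalars into weights that
    sum to 1, one simple way to partition reward mass among competing
    responses (illustrative, not the paper's exact normalization rule).
    """
    return F.softmax(rewards / beta, dim=0)

def rpo_style_loss(per_response_nll: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Reward-weighted negative log-likelihood over one prompt's completions.

    per_response_nll: sequence-level NLL of each candidate under the policy.
    rewards:          scalar feedback for the same candidates.
    """
    weights = partition_weights(rewards)
    return (weights * per_response_nll).sum()

# Toy example: three completions of one prompt with scalar feedback.
nll = torch.tensor([1.2, 0.8, 2.5])      # per-response NLL from the model
rewards = torch.tensor([0.9, 0.4, 0.1])  # e.g., thumbs-up style scalars
print(rpo_style_loss(nll, rewards))      # higher-reward responses dominate the loss
```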
## Inference
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Run on GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"  # pick one: small, large, or xl
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sampling-based decoding; adjust these parameters to trade diversity for determinism.
outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
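If you want to compare several candidate completions for the same prompt (for example, to collect scalar feedback), `generate` can return multiple sampled sequences via `num_return_sequences`. The snippet below reuses the `model`, `tokenizer`, and `inputs` objects defined above; the parameter values are only an example.

```python
# Sample several candidates for the same prompt and print them side by side.
candidates = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    num_return_sequences=3,
)
for i, text in enumerate(tokenizer.batch_decode(candidates, skip_special_tokens=True)):
    print(f"[{i}] {text}")
```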
## Evaluation Summary

| Judge   | Win Rate vs DRO | Win Rate vs KTO | Win Rate vs SFT |
|---------|-----------------|-----------------|-----------------|
| Mistral | 83–93%          | 82–93%          | 82–84%          |
| LLaMA   | 67–74%          | 65–72%          | 63–73%          |
## Use Cases
- Aligned conversational agents
- Helpful, non-toxic instruction following
- Scalar feedback training pipelines
- Preference-optimized generation (without pairwise preference labels)
## Citation
If you use this model, please cite the following paper:
```bibtex
@article{faye2025rpo,
  title   = {Value-Free Policy Optimization via Reward Partitioning},
  author  = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
  journal = {arXiv preprint arXiv:2406.XXXX},
  year    = {2025}
}
```
## Related Models

- `bilalfaye/flan-t5-small-rpo`
- `bilalfaye/flan-t5-large-rpo`
- `bilalfaye/flan-t5-xl-rpo`