# Flan-T5-{Small|Large|XL}-RPO

Fine-tuned with Reward Partitioning Optimization (RPO), a value-free, stable method for single-trajectory reinforcement learning with scalar feedback.
## Model Summary

This model is a fine-tuned variant of the Flan-T5 {Small|Large|XL} checkpoint, trained with Reward Partitioning Optimization (RPO). RPO is a method designed for learning from single-trajectory scalar feedback (e.g., thumbs up/down) and eliminates the need for learned value functions or preference pairs.
- Trained with only (prompt, response, reward) triplets.
- No joint optimization and no auxiliary models.
- Efficient and stable training.
- Strong preference alignment (evaluated with an LLM-as-a-judge).
- Outperforms KTO and DRO on automatic metrics and LLM preference win rate.
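Concretely, each training example is just a scalar-labelled triplet. Below is a minimal sketch of that format; the field names and values are illustrative, not the exact schema used during training.

```python
# Illustrative (prompt, response, reward) triplets: the scalar reward can come
# from thumbs up/down feedback, dataset annotations, or a reward model.
triplets = [
    {"prompt": "Explain photosynthesis to a child.",
     "response": "Plants use sunlight to turn air and water into food.",
     "reward": 0.9},
    {"prompt": "Explain photosynthesis to a child.",
     "response": "Photosynthesis is the light-dependent synthesis of glucose.",
     "reward": 0.3},
]
```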
## Training Details

- Base Model: `flan-t5-{small|large|xl}`
- Dataset: UltraFeedback, providing high-quality (prompt, response, reward) triplets with multiple completions per prompt.
- Feedback Format: scalar reward per example, i.e., (prompt, response, reward).
- GPU Used: 1× A100 (80 GB)
- Training Objective: RPO supervised learning using partitioned reward normalization.
- Baselines Compared: DRO and KTO.
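For intuition only, the sketch below shows one way "partitioned reward normalization" could look: the scalar rewards of the multiple completions of the same prompt are normalized against each other (here with a softmax) and used to weight each completion's negative log-likelihood. This is an illustration under those assumptions, not the paper's exact objective; the names `partition_weights`, `rpo_style_loss`, and `beta` are hypothetical.

```python
import torch
import torch.nn.functional as F

def partition_weights(rewards: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Normalize scalar rewards across the completions of a single prompt.

    A softmax over the prompt's rewards turns raw scalars into weights that
    sum to 1, one simple way to partition reward mass among competing
    responses (illustrative, not the paper's exact normalization rule).
    """
    return F.softmax(rewards / beta, dim=0)

def rpo_style_loss(per_response_nll: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Reward-weighted negative log-likelihood over one prompt's completions.

    per_response_nll: sequence-level NLL of each candidate under the policy.
    rewards:          scalar feedback for the same candidates.
    """
    weights = partition_weights(rewards)
    return (weights * per_response_nll).sum()

# Toy example: three completions of one prompt with scalar feedback.
nll = torch.tensor([1.2, 0.8, 2.5])      # per-response NLL from the model
rewards = torch.tensor([0.9, 0.4, 0.1])  # e.g., thumbs-up style scalars
print(rpo_style_loss(nll, rewards))      # higher-reward responses dominate the loss
```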
## Inference
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Run on GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"  # pick one: small, large, or xl
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sampling-based decoding; adjust these parameters to trade diversity for determinism.
outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
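If you want to compare several candidate completions for the same prompt (for example, to collect scalar feedback), `generate` can return multiple sampled sequences via `num_return_sequences`. The snippet below reuses the `model`, `tokenizer`, and `inputs` objects defined above; the parameter values are only an example.

```python
# Sample several candidates for the same prompt and print them side by side.
candidates = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    num_return_sequences=3,
)
for i, text in enumerate(tokenizer.batch_decode(candidates, skip_special_tokens=True)):
    print(f"[{i}] {text}")
```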
## Evaluation Summary

| Judge   | Win Rate vs DRO | Win Rate vs KTO | Win Rate vs SFT |
|---------|-----------------|-----------------|-----------------|
| Mistral | 83–93%          | 82–93%          | 82–84%          |
| LLaMA   | 67–74%          | 65–72%          | 63–73%          |
## Use Cases
- Aligned conversational agents
- Helpful, non-toxic instruction following
- Scalar feedback training pipelines
- Preference-optimized generation (without pairwise preference labels)
## Citation
If you use this model, please cite the following paper:
```bibtex
@article{faye2025rpo,
  title   = {Value-Free Policy Optimization via Reward Partitioning},
  author  = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
  journal = {arXiv preprint arXiv:2406.XXXX},
  year    = {2025}
}
```
## Related Models

- `bilalfaye/flan-t5-small-rpo`
- `bilalfaye/flan-t5-large-rpo`
- `bilalfaye/flan-t5-xl-rpo`