🧠 Flan-T5-{Small|Large|XL}-RPO

🔬 Fine-tuned with Reward Partitioning Optimization (RPO), a value-free, stable method for single-trajectory reinforcement learning from scalar feedback.


📌 Model Summary

This model is a fine-tuned variant of the Flan-T5 {Small|Large|XL} checkpoint, trained with Reward Partitioning Optimization (RPO). RPO is a method for learning from single-trajectory scalar feedback (e.g., a thumbs up or thumbs down) that removes the need for learned value functions or pairwise preference labels.

  • ✅ Trained with only (prompt, response, reward) triplets (see the example after this list).
  • 🔁 No joint optimization and no auxiliary models.
  • 🚀 Efficient and stable training.
  • 🤖 Strong preference alignment (evaluated with an LLM-as-a-judge).
  • 📊 Outperforms KTO and DRO on automatic metrics and LLM preference win rate.
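
The summary above mentions (prompt, response, reward) triplets; here is a minimal illustration of that format. The field names and values are assumptions for illustration only, not the exact schema used in training.

# Illustrative scalar-feedback triplets; field names are hypothetical,
# not the exact UltraFeedback schema.
triplets = [
    {
        "prompt": "Explain photosynthesis in one sentence.",
        "response": "Plants use sunlight, water, and CO2 to produce glucose and oxygen.",
        "reward": 0.92,  # scalar feedback, e.g. derived from a thumbs up
    },
    {
        "prompt": "Explain photosynthesis in one sentence.",
        "response": "Photosynthesis is when plants eat dirt.",
        "reward": 0.10,  # low-scored completion for the same prompt
    },
]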

🧪 Training Details

  • Base Model: flan-t5-{small|large|xl}
  • Dataset: UltraFeedback, providing high-quality (prompt, response, reward) triplets with multiple completions per prompt.
  • Feedback Format: a single scalar reward per response, i.e. (prompt, response, reward).
  • GPU Used: 1× A100 (80 GB)
  • Training Objective: RPO supervised learning with partitioned reward normalization (a rough sketch follows this list).
  • Baselines Compared: DRO and KTO.
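
This card does not spell out the RPO loss itself. Purely as intuition for what "partitioned reward normalization" over multiple completions of the same prompt could look like, the sketch below weights a standard seq2seq negative log-likelihood by rewards normalized (here, via softmax) across those completions. The function names and the softmax choice are assumptions for illustration; consult the paper for the actual objective.

import torch
import torch.nn.functional as F

def reward_partition_weights(rewards, beta=1.0):
    # Normalize scalar rewards across completions of the SAME prompt into
    # weights that sum to 1 (hypothetical helper, not the published objective).
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    return torch.softmax(beta * rewards, dim=0)

def reward_weighted_nll(model, tokenizer, prompt, responses, rewards, device="cpu"):
    # Reward-weighted negative log-likelihood over the completions of one prompt:
    # value-free and single-trajectory, with no preference pairs or critic model.
    weights = reward_partition_weights(rewards).to(device)
    enc = tokenizer([prompt] * len(responses), return_tensors="pt", padding=True).to(device)
    labels = tokenizer(responses, return_tensors="pt", padding=True).input_ids.to(device)
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
    # Per-example NLL (out.loss would average over the whole batch)
    per_token = F.cross_entropy(
        out.logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )
    per_example = per_token.sum(dim=1) / (labels != -100).sum(dim=1)
    return (weights * per_example).sum()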

🤖 Inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Run on GPU when available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "bilalfaye/flan-t5-{small|large|xl}-rpo"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

prompt = "How can I improve my productivity working from home?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a response with top-k / nucleus sampling and light repetition control
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
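
The same API handles multiple prompts at once; a short sketch (the prompts below are placeholders):

# Batched generation with padded inputs and greedy decoding
prompts = [
    "Give me three tips for writing clear emails.",
    "Summarize the benefits of regular exercise.",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
batch_outputs = model.generate(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    max_new_tokens=128,
    do_sample=False,  # deterministic outputs
)
for p, reply in zip(prompts, tokenizer.batch_decode(batch_outputs, skip_special_tokens=True)):
    print(f"{p}\n-> {reply}\n")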

📈 Evaluation Summary

Judge   | Win Rate vs DRO | Win Rate vs KTO | Win Rate vs SFT
Mistral | ✅ 83–93%       | ✅ 82–93%       | ✅ 82–84%
LLaMA   | ✅ 67–74%       | ✅ 65–72%       | ✅ 63–73%
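
These win rates come from LLM-as-a-judge pairwise comparisons against each baseline. As a rough illustration of the metric only (not the paper's evaluation code, and the handling of ties is an assumption), a win rate can be computed from judge verdicts like this:

def win_rate(verdicts):
    # verdicts: "win" / "loss" / "tie" judgments for RPO vs. one baseline.
    # Ties are excluded from the denominator here, which is an assumption.
    wins = sum(v == "win" for v in verdicts)
    losses = sum(v == "loss" for v in verdicts)
    decided = wins + losses
    return wins / decided if decided else 0.0

print(win_rate(["win", "win", "tie", "loss", "win"]))  # 0.75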

✅ Use Cases

  • Aligned conversational agents
  • Helpful, non-toxic instruction following
  • Scalar feedback training pipelines
  • Preference-optimized generation (without pairwise preference labels)

📚 Citation

If you use this model, please cite the following paper:

@article{faye2025rpo,
  title     = {Value-Free Policy Optimization via Reward Partitioning},
  author    = {Bilal Faye and Hanane Azzag and Mustapha Lebbah},
  journal   = {arXiv preprint arXiv:2406.XXXX},
  year      = {2025}
}

🔗 Related Models

  • bilalfaye/flan-t5-small-rpo
  • bilalfaye/flan-t5-large-rpo
  • bilalfaye/flan-t5-xl-rpo