Fine-Tuning a Vision Language Model with TRL using MPO
Authored by: Sergio Paniego
In this recipe, we’ll demonstrate how to fine-tune a Vision Language Model (VLM) using Mixed Preference Optimization (MPO) with the Transformer Reinforcement Learning (TRL) library.
MPO is a training approach that combines multiple optimization objectives and was introduced in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization. It is part of the Direct Preference Optimization (DPO) trainer and works by combining multiple loss functions with different weights, enabling more sophisticated optimization strategies.
We’ll fine-tune Qwen/Qwen2.5-VL-3B-Instruct, a small VLM with strong performance, using a preference dataset to help the model align with desired outputs. Check out this blog post to learn more about preference optimization for vision-language models.
The dataset we’ll use is HuggingFaceH4/rlaif-v_formatted, a specially formatted version of the RLAIF-V dataset. Each sample contains a prompt and an image, along with a chosen and a rejected response. The final goal of the fine-tuning process is to train a model that consistently prefers the chosen answers over the rejected ones, thereby reducing hallucinations. To achieve this, multiple loss functions will be used in combination.
1. Install Dependencies
Let’s start by installing the required dependencies.
We’ll install trl from source, as the MPO trainer hasn’t been included in an official release at the time of writing.
!pip install -U -q git+https://github.com/huggingface/trl.git bitsandbytes qwen-vl-utils==0.0.8
We’ll authenticate with the Hugging Face Hub using our account to upload and save the fine-tuned model.
You can generate your access token here.
from huggingface_hub import notebook_login
notebook_login()
2. Load Dataset
For this recipe, we’ll use HuggingFaceH4/rlaif-v_formatted, a specially formatted version of the RLAIF-V dataset.
In the paper that introduced MPO, the authors also presented OpenGVLab/MMPR, a large-scale multimodal preference dataset built through an efficient pipeline that combines samples both with and without clear ground truths.
For our educational case, we’ll use HuggingFaceH4/rlaif-v_formatted. However, for the best reproduction of the paper’s results, we recommend exploring MMPR.
We’ll work with a subset of the dataset for this example.
from datasets import load_dataset
dataset_id = "HuggingFaceH4/rlaif-v_formatted"
train_dataset, test_dataset = load_dataset(dataset_id, split=["train[:5%]", "test[:1%]"])
Let’s include a quick check to ensure the images are in RGB format. If not, we’ll convert them accordingly.
from PIL import Image
def ensure_rgb(example):
# Convert the image to RGB if it's not already
image = example["images"][0]
if isinstance(image, Image.Image):
if image.mode != "RGB":
image = image.convert("RGB")
example["images"] = [image]
return example
# Apply the transformation to the dataset (change num_proc depending on the available compute)
train_dataset = train_dataset.map(ensure_rgb, num_proc=8)
test_dataset = test_dataset.map(ensure_rgb, num_proc=8)
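Before moving on, a quick sanity check can confirm how many samples each slice contains and that the conversion to RGB worked. This check is just a convenience and not part of the original pipeline:
# Sanity check: dataset sizes and the image mode of the first training sample
print(train_dataset)
print(test_dataset)
print(train_dataset[0]["images"][0].mode)  # expected: "RGB"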
Let’s inspect a sample to understand its structure.
As we can see, each sample contains chosen, rejected, images, and prompt fields. Our goal is to fine-tune the model to prefer the chosen answers using MPO.
train_dataset[5]
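Conceptually, each record pairs one image and prompt with a preferred and a dispreferred answer. The schematic below is only an illustration with made-up values (the prompt layout mirrors what the generation helper later in this recipe expects; inspect the real sample printed above to confirm the exact structure of the chosen and rejected fields):
from PIL import Image

# Schematic preference sample (illustrative values only, not taken from the dataset)
example_sample = {
    "images": [Image.new("RGB", (64, 64))],  # dummy image standing in for the real photo
    "prompt": [
        {"role": "user", "content": [{"type": "image", "image": None}, {"type": "text", "text": "What is shown in the image?"}]}
    ],
    "chosen": [{"role": "assistant", "content": [{"type": "text", "text": "A faithful description of the image."}]}],
    "rejected": [{"role": "assistant", "content": [{"type": "text", "text": "A description with hallucinated details."}]}],
}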
Let’s check the image for that particular sample:
>>> train_dataset[5]["images"][0]
3. Fine-Tune the Model with TRL using MPO
As previously described, we’ll leverage trl, since this library provides everything we need to train using MPO while abstracting away some of the complexity we don’t need to handle for this particular case.
The MPO trainer accepts a list of loss_types. A full list of available loss functions is provided in the DPO trainer documentation here.
As mentioned earlier, MPO is a particular case of the DPO trainer, so we can use it by specifying a list of loss types and their corresponding weights.
In the image below, you can see the improvements reported in the MPO paper for the InternVL2-8B model using this training strategy.
3.1 Load the Quantized Model for Training
Let’s load the model. In this example, we’ll use Qwen/Qwen2.5-VL-3B-Instruct, a compact Vision Language Model (VLM) with strong performance.
Alongside the original MPO paper, the authors released a collection of checkpoints fine-tuned with this technique for InternVL2.5, another high-performing VLM.
We chose Qwen2.5-VL-3B-Instruct for its straightforward integration with the transformers library, although InternVL2.5 is the original model used in the paper.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
from transformers import BitsAndBytesConfig
# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
# Load model and tokenizer
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
torch_dtype=torch.bfloat16,
quantization_config=bnb_config,
)
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)
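As a quick sanity check, we can also look at the approximate memory footprint of the quantized model as reported by transformers (a rough estimate rather than the exact GPU usage during training):
# Approximate memory footprint of the 4-bit quantized model
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")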
3.2 Set Up QLoRA
Now that we have the model and processor loaded, let’s set up QLoRA and the DPOConfig, where we’ll define the list of losses and their corresponding weights.
These configurations enable efficient fine-tuning and optimization tailored for our training objectives.
>>> from peft import LoraConfig, get_peft_model
>>> # Configure LoRA
>>> peft_config = LoraConfig(
... r=8,
... lora_alpha=8,
... lora_dropout=0.1,
... target_modules=["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"],
... use_dora=True,
... init_lora_weights="gaussian",
... )
>>> # Apply PEFT model adaptation
>>> peft_model = get_peft_model(model, peft_config)
>>> # Print trainable parameters
>>> peft_model.print_trainable_parameters()
trainable params: 19,868,416 || all params: 3,774,491,392 || trainable%: 0.5264
3.3 MPO via DPOConfig
To configure MPO training using the DPOConfig, simply provide a list of loss types via the loss_type parameter. This can be passed as either a Python list or a comma-separated string. Optionally, you can specify a corresponding list of loss_weights to control the relative importance of each loss during optimization. If omitted, all losses default to a weight of 1.0.
For example, following the setup described in the original MPO paper, you can define:
loss_type = ["sigmoid", "bco_pair", "sft"]
loss_weights = [0.8, 0.2, 1.0]
This corresponds to the MPO objective, which is defined as a weighted combination of the preference loss $\mathcal{L}_p$, the quality loss $\mathcal{L}_q$, and the generation loss $\mathcal{L}_g$:

$$\mathcal{L}_{\text{MPO}} = w_p \mathcal{L}_p + w_q \mathcal{L}_q + w_g \mathcal{L}_g$$

The selected loss_types are:
- "sigmoid": the sigmoid loss from the original DPO paper.
- "bco_pair": the pairwise BCO loss from the BCO paper.
- "sft": the negative log-likelihood loss (the standard supervised fine-tuning loss).
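To make the weighting concrete, here is a minimal sketch of how a weighted combination of losses works. The tensors below are hypothetical placeholder values for illustration; they are not TRL internals:
import torch

# Hypothetical per-objective loss values, purely for illustration
losses = {"sigmoid": torch.tensor(0.45), "bco_pair": torch.tensor(0.30), "sft": torch.tensor(1.20)}
weights = {"sigmoid": 0.8, "bco_pair": 0.2, "sft": 1.0}

# MPO-style objective: a weighted sum of the individual losses
total_loss = sum(weights[name] * losses[name] for name in losses)
print(total_loss)  # 0.8 * 0.45 + 0.2 * 0.30 + 1.0 * 1.20 = 1.62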
For more details on each available loss type and how they affect training, refer to the official documentation.
All other configuration options follow the standard DPOConfig
format and can be adjusted based on your available compute resources.
from trl import DPOConfig
training_args = DPOConfig(
output_dir="Qwen2.5-VL-3B-Instruct-trl-mpo-rlaif-v",
loss_type=["sigmoid", "bco_pair", "sft"], # Loss types to combine, as used in the MPO paper
loss_weights=[0.8, 0.2, 1.0], # Corresponding weights, as used in the MPO paper
bf16=False,
gradient_checkpointing=True,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=8,
num_train_epochs=1,
    dataset_num_proc=1, # tokenization will use 1 process
dataloader_num_workers=8, # data loading will use 8 workers
logging_steps=10,
report_to="tensorboard",
push_to_hub=True,
save_strategy="steps",
save_steps=10,
save_total_limit=1,
eval_steps=10, # Steps interval for evaluation
eval_strategy="steps",
)
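Before launching the run, it is worth noting the effective batch size these arguments imply (simple arithmetic, assuming a single GPU):
# Effective batch size per optimizer step on a single GPU
effective_batch_size = training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps
print(effective_batch_size)  # 4 * 8 = 32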
As we can see, switching from DPO to MPO is straightforward and only requires two additional parameters in the DPOConfig. Finally, we can initialize the DPOTrainer and start training the model.
from trl import DPOTrainer
trainer = DPOTrainer(
model=peft_model,
ref_model=None,
args=training_args,
train_dataset=train_dataset,
eval_dataset=test_dataset,
processing_class=processor,
)
trainer.train()
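Depending on your trl version, the trainer may already upload the processor files when push_to_hub=True. If they are missing from the Hub repository, you can push them explicitly so the checkpoint is self-contained (optional):
# Optional: upload the processor files alongside the adapter on the Hub
processor.push_to_hub(training_args.output_dir)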
4. Testing the Fine-Tuned Model
We have fine-tuned our model using MPO. Now, let’s evaluate its performance on a sample to see how it behaves in practice.
trained_model_id = "sergiopaniego/Qwen2.5-VL-3B-Instruct-trl-mpo-rlaif-v"
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
trained_model = PeftModel.from_pretrained(base_model, trained_model_id).eval()
trained_processor = AutoProcessor.from_pretrained(model_id, use_fast=True)
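If you would rather distribute a single standalone checkpoint instead of base model plus adapter, peft can merge the adapter weights back into the base model. A minimal, optional sketch (the local directory name below is arbitrary; merging DoRA adapters requires a reasonably recent peft release):
# Optional: merge the adapter into the base weights and save a standalone checkpoint locally
merged_model = trained_model.merge_and_unload()
merged_model.save_pretrained("Qwen2.5-VL-3B-Instruct-trl-mpo-rlaif-v-merged")  # arbitrary local path
trained_processor.save_pretrained("Qwen2.5-VL-3B-Instruct-trl-mpo-rlaif-v-merged")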
test_dataset[0]
>>> test_dataset[0]["images"][0]
from qwen_vl_utils import process_vision_info
def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
model.gradient_checkpointing_disable()
model.config.use_cache = True
# Prepare the text input by applying the chat template
sample["prompt"][0]["content"][0]["image"] = sample["images"][0]
text_input = processor.apply_chat_template(sample["prompt"], add_generation_prompt=True)
image_inputs, _ = process_vision_info(sample["prompt"])
inputs = processor(
text=[text_input],
images=image_inputs,
videos=None,
padding=True,
return_tensors="pt",
)
    inputs = inputs.to(device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(
trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
return output_text[0]
We’ll generate outputs from both the pretrained and fine-tuned models to highlight their differences.
An interesting extension would be to compare the MPO output with the same model fine-tuned using only DPO.
We’ll leave that experiment for you to explore!
>>> pretrained_output = generate_text_from_sample(model, processor, test_dataset[0])
>>> print("\n\n>>> Pretrained model output:\n\n")
>>> print(pretrained_output)
>>> trained_output = generate_text_from_sample(trained_model, trained_processor, test_dataset[0])
>>> print("\n\n>>> Fine tuned model output:\n\n")
>>> print(trained_output)
>>> Pretrained model output:

The image depicts a modern high-speed train at a station platform. The train has a sleek, aerodynamic design with a streamlined front and a yellow nose. The body of the train is primarily white, with red and blue accents along its side. The windows are rectangular and evenly spaced, providing a clear view of the interior. The train is on a set of tracks that are elevated above the platform, which is indicated by the yellow safety line painted along the edge of the platform. The platform itself appears to be made of concrete and is equipped with a metal railing for safety. In the background, there are several elements that provide context to the setting. There are multiple power lines and poles running parallel to the tracks, suggesting that this is an electrified railway system. The sky is clear with a few scattered clouds, indicating fair weather conditions. Additionally, there are some greenery and possibly other structures or buildings visible in the distance, though they are not the main focus of the image.

>>> Fine tuned model output:

The image depicts a modern high-speed train, likely a bullet train, positioned on a railway track. The train has a sleek, aerodynamic design with a streamlined front and a predominantly white body. It features a distinctive color scheme with red and blue accents along its sides, which are characteristic of certain high-speed rail services. Key features of the train include:

1. **Color Scheme**: The train is primarily white with red and blue accents. The red sections are located on the sides, while the blue sections are more prominent on the front and sides.
2. **Design**: The train has a futuristic design with a pointed nose and large windows, which are typical for high-speed trains to improve aerodynamics and visibility.
3. **Windows**: The train has multiple windows along its side, allowing passengers to see outside during travel.
4. **Front Window**: The front of the train has a large, transparent window that provides a clear view of the tracks ahead.
5. **Headlights**: The train has two headlights at the front, which are essential for visibility during nighttime or low-light conditions.
6. **Platform**: The train is stopped at a platform, indicating it is either arriving or departing from a station.
7. **Railway Track**: The train is on a standard gauge railway track, suggesting it is designed for use on conventional tracks rather than high-speed lines.
8. **Surroundings**: The background shows a clear sky with some clouds, and there are some buildings and structures visible, possibly part of a cityscape or urban area.

Overall, the image captures a modern, high-speed train in a stationary position, highlighting its design and color scheme, as well as its surroundings.
Looking at the outputs, we can already observe clear stylistic differences in the model’s responses after training.
The MPO fine-tuning is now complete!
5. Continue Your Learning Journey 🧑‍🎓
This is not the end of your learning journey! If you enjoyed this content and want to dive deeper into MPO, trl, or Vision-Language Models, check out the following resources:
- Preference Optimization for Vision Language Models with TRL
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
- MPO in the TRL documentation
- Vision Language Models (Better, Faster, Stronger)
- Explore more multimodal recipes in the Hugging Face Open-Source AI Cookbook