Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU
Authored by: Sergio Paniego
In this recipe, we'll guide you through fine-tuning a smol 🤗 Vision Language Model (VLM) with Direct Preference Optimization (DPO) using the Transformer Reinforcement Learning (TRL) library to demonstrate how you can tailor VLMs to suit your specific needs, even when working with consumer-grade GPUs.
We'll fine-tune SmolVLM using a preference dataset to help the model align with desired outputs. SmolVLM is a highly performant and memory-efficient model, making it an ideal choice for this task. If you're new to Preference Optimization for language or vision-language models, check out this blog for an in-depth introduction.
The dataset we'll use is HuggingFaceH4/rlaif-v_formatted, which contains pairs of prompt + image along with a chosen and a rejected answer for each pair. The goal of this fine-tuning process is to make the model consistently prefer the chosen answers from the dataset, reducing hallucinations.
This notebook has been tested using an NVIDIA L4 GPU.
1. Install Dependencies
Let's start by installing the essential libraries we'll need for fine-tuning!
!pip install -U -q transformers trl datasets bitsandbytes peft accelerate
# Tested with transformers==4.46.3, trl==0.12.2, datasets==3.2.0, bitsandbytes==0.45.0, peft==0.14.0, accelerate==1.2.0
!pip install -q flash-attn --no-build-isolation
Authenticate with your Hugging Face account to save and share your model directly from this notebook.
from huggingface_hub import notebook_login
notebook_login()
2. Load Dataset
We'll work with the HuggingFaceH4/rlaif-v_formatted dataset, which provides pairs of prompt + image along with a chosen and a rejected answer for each pair. This structured format is ideal for training models with Direct Preference Optimization (DPO).
The dataset is already preformatted for this task. If you're working with a custom dataset, you'll need to preprocess it into the same format, as sketched below.
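Here is a minimal, hypothetical sketch of what a single record looks like (field names mirror rlaif-v_formatted; the image, question, and answers are placeholders, so confirm the exact structure with train_dataset.features after loading):

from PIL import Image

# Hypothetical single preference record in the format expected for VLM DPO.
example_record = {
    "images": [Image.new("RGB", (224, 224))],  # placeholder image
    "prompt": [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in the image?"},
        ]},
    ],
    "chosen": [
        {"role": "assistant", "content": [{"type": "text", "text": "A grounded description of the image."}]},
    ],
    "rejected": [
        {"role": "assistant", "content": [{"type": "text", "text": "A description with hallucinated details."}]},
    ],
}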
In this example, we'll use a subset of the dataset to demonstrate the process. However, in a real-world scenario, you should utilize the full dataset for better performance.
from datasets import load_dataset
dataset_id = "HuggingFaceH4/rlaif-v_formatted"
train_dataset, test_dataset = load_dataset(dataset_id, split=['train[:6%]', 'test[:1%]'])
We will ensure all the images are RGB formatted:
from PIL import Image
def ensure_rgb(example):
    # Convert the image to RGB if it's not already
    image = example['images'][0]
    if isinstance(image, Image.Image):
        if image.mode != 'RGB':
            image = image.convert('RGB')
        example['images'] = [image]
    return example
# Apply the transformation to the dataset
train_dataset = train_dataset.map(ensure_rgb, num_proc=32)
test_dataset = test_dataset.map(ensure_rgb, num_proc=32)
Let's explore an example from the dataset to better understand its structure and the type of data we're working with.
train_dataset[20]
>>> train_dataset[20]['images'][0]
3. Fine-Tune the Model using TRL
3.1 Load the Quantized Model for Training
Let's first load a quantized version of the SmolVLM-Instruct model using bitsandbytes, and let's also load the processor.
import torch
from transformers import Idefics3ForConditionalGeneration, AutoProcessor
model_id = "HuggingFaceTB/SmolVLM-Instruct"
from transformers import BitsAndBytesConfig
# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model and tokenizer
model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    _attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)
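If flash-attn could not be built on your setup, a possible fallback (an assumption about your environment, not part of the original recipe) is to load the model with PyTorch's SDPA attention backend instead of FlashAttention-2:

# Fallback if flash-attn is unavailable: same call, but with the SDPA backend.
model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    _attn_implementation="sdpa",  # instead of "flash_attention_2"
)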
3.2 Set Up QLoRA and DPOConfig
In this step, we'll configure QLoRA for our training setup. QLoRA is a powerful fine-tuning technique designed to reduce the memory footprint, making it possible to fine-tune large models efficiently, even on limited hardware.
QLoRA builds upon traditional LoRA (Low-Rank Adaptation) by keeping the frozen base model weights in a quantized 4-bit representation while the low-rank adapters are trained in higher precision. This significantly lowers memory usage, making it an ideal choice for resource-constrained environments. Note that the configuration below also sets use_dora=True, enabling DoRA (Weight-Decomposed Low-Rank Adaptation), a variant of LoRA.
>>> from peft import LoraConfig, get_peft_model
>>> # Configure LoRA
>>> peft_config = LoraConfig(
... r=8,
... lora_alpha=8,
... lora_dropout=0.1,
... target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
... use_dora=True,
... init_lora_weights="gaussian"
... )
>>> # Apply PEFT model adaptation
>>> peft_model = get_peft_model(model, peft_config)
>>> # Print trainable parameters
>>> peft_model.print_trainable_parameters()
trainable params: 11,269,248 || all params: 2,257,542,128 || trainable%: 0.4992
Next, we will configure the training options using DPOConfig. With per_device_train_batch_size=1 and gradient_accumulation_steps=32, each optimization step effectively uses 32 samples per device while keeping the per-step memory footprint of a single sample.
from trl import DPOConfig
training_args = DPOConfig(
    output_dir="smolvlm-instruct-trl-dpo-rlaif-v",
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=5,
    dataset_num_proc=8,  # tokenization will use 8 processes
    dataloader_num_workers=8,  # data loading will use 8 workers
    logging_steps=10,
    report_to="tensorboard",
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    eval_steps=10,  # Steps interval for evaluation
    eval_strategy="steps",
)
Next, we will set up Direct Preference Optimization (DPO) training with the DPOTrainer class from the TRL library.
DPO uses labeled preference data to guide the model toward generating responses that align with those preferences. TRL's DPOTrainer will tokenize the dataset before training and save it to disk. This process can consume significant disk space, depending on the amount of data used for training, so plan accordingly to avoid running out of storage. Because we pass a peft_config, we can leave ref_model=None: the trainer then uses the base model with the adapters disabled as the implicit reference model.
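As a quick refresher (this is the standard DPO objective from the original DPO paper, stated here for context rather than taken from this recipe), the trainer minimizes

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $x$ is the prompt (text + image), $y_w$ and $y_l$ are the chosen and rejected answers, $\pi_\theta$ is the model being trained, $\pi_{\text{ref}}$ is the frozen reference model, $\sigma$ is the sigmoid, and $\beta$ controls how strongly the policy is kept close to the reference.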
This step may take a while, so feel free to relax and enjoy the process!
from trl import DPOTrainer
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=peft_config,
    processing_class=processor,
)
Time to train the model!
trainer.train()
Let's save the results.
trainer.save_model(training_args.output_dir)
4. Testing the Fine-Tuned Model
With our Vision Language Model (VLM) fine-tuned, it's time to evaluate its performance! In this section, we'll test the model using examples from the HuggingFaceH4/rlaif-v_formatted dataset. Let's dive into the results and assess how well the model aligns with the preferred responses!
Before we begin, let's clean up the GPU memory to ensure smooth and optimal performance.
>>> import gc
>>> import time
>>> def clear_memory():
...     # Delete variables if they exist in the current global scope
...     if 'inputs' in globals(): del globals()['inputs']
...     if 'model' in globals(): del globals()['model']
...     if 'processor' in globals(): del globals()['processor']
...     if 'trainer' in globals(): del globals()['trainer']
...     if 'peft_model' in globals(): del globals()['peft_model']
...     if 'bnb_config' in globals(): del globals()['bnb_config']
...     time.sleep(2)
...     # Garbage collection and clearing CUDA memory
...     gc.collect()
...     time.sleep(2)
...     torch.cuda.empty_cache()
...     torch.cuda.synchronize()
...     time.sleep(2)
...     gc.collect()
...     time.sleep(2)
...     print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
...     print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
>>> clear_memory()
GPU allocated memory: 1.64 GB
GPU reserved memory: 2.01 GB
We will reload the base model using the same pipeline as before.
model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)
We will attach the trained adapter to the pretrained model. This adapter contains the fine-tuning adjustments made during training, enabling the base model to leverage the new knowledge while keeping its core parameters intact. By integrating the adapter, we enhance the model's capabilities without altering its original structure.
adapter_path = "sergiopaniego/smolvlm-instruct-trl-dpo-rlaif-v"
model.load_adapter(adapter_path)
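Optionally, if you later want a standalone checkpoint that does not require PEFT at inference time, one possible approach (a sketch, not something this recipe relies on; note the adapter was trained against a 4-bit base, so merging into a bf16 base may shift outputs slightly) is to merge the adapter weights into a freshly loaded base model:

from peft import PeftModel

# Sketch: merge the LoRA/DoRA adapter into a full-precision copy of the base model.
base_model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
)
merged_model = PeftModel.from_pretrained(base_model, adapter_path).merge_and_unload()
merged_model.save_pretrained("smolvlm-instruct-dpo-merged")
processor.save_pretrained("smolvlm-instruct-dpo-merged")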
Let's evaluate the model on an unseen sample.
test_dataset[20]
>>> test_dataset[20]['images'][0]
Let's create a common function that we can call with different samples to streamline the testing process. This reusable function lets us efficiently evaluate the model's performance on multiple examples without rewriting code for each one.
def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(
        sample['prompt'],
        add_generation_prompt=True,
    )

    image_inputs = []
    image = sample['images'][0]
    if image.mode != 'RGB':
        image = image.convert('RGB')
    image_inputs.append([image])

    # Prepare the inputs for the model
    model_inputs = processor(
        text=text_input,
        images=image_inputs,
        return_tensors="pt",
    ).to(device)  # Move inputs to the specified device

    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )

    return output_text[0]  # Return the first decoded output text
Now, we're ready to call the function and evaluate the model!
output = generate_text_from_sample(model, processor, test_dataset[20])
output
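To see what the DPO adapter actually changed, one quick check (relying on the adapter-toggling helpers that transformers exposes for models loaded with load_adapter) is to generate on the same sample with the adapter disabled and then enabled, and compare the two answers:

# Compare the base model and the DPO-tuned adapter on the same unseen sample.
sample = test_dataset[20]

model.disable_adapters()  # generate with the original base weights
base_output = generate_text_from_sample(model, processor, sample)

model.enable_adapters()  # switch the DPO adapter back on
dpo_output = generate_text_from_sample(model, processor, sample)

print("Base model:\n", base_output)
print("\nDPO fine-tuned:\n", dpo_output)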
The model is now able to generate responses based on the provided image and prompt. For tasks like this, it's useful to compare your model's performance against a benchmark to see how much it has improved and how it stacks up against other options. For more information and details on this comparison, check out this post.
I've developed an example application to test the model, which you can find here.
Since we only ran an example training on a subset of the dataset here, the Space uses the official Hugging Face DPO fine-tuned model. You can easily compare it with another Space featuring the pre-trained model, available here.
from IPython.display import IFrame
IFrame(src="https://sergiopaniego-smolvlm-trl-dpo-rlaif-v.hf.space", width=1000, height=800)
5. Continuing the Learning Journey
Expand your knowledge of Vision Language Models and related tools with these resources:
- Multimodal Recipes in the Cookbook: Discover practical recipes for multimodal models, including Retrieval-Augmented Generation (RAG) pipelines and fine-tuning. We've already published a recipe for fine-tuning a smol VLM with TRL using SFT, which complements this guide perfectly; check it out for additional details.
- TRL Community Tutorials: Explore a rich collection of tutorials that dive into the intricacies of TRL and its real-world applications.
You can also revisit the Continuing the Learning Journey section in Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL).
These resources will help deepen your knowledge and expertise in multimodal learning.