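Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)

Authored by: Sergio Paniego

🚨 WARNING: This notebook is resource-intensive and requires substantial compute. If you run it in Colab, it will use an A100 GPU.

In this tutorial, we show how to fine-tune a Vision Language Model (VLM) using the Hugging Face ecosystem, specifically the Transformer Reinforcement Learning library (TRL).

🌟 Model & Dataset Overview

We will fine-tune the Qwen2-VL-7B model on the ChartQA dataset. This dataset contains images of various chart types paired with question-answer pairs, which makes it well suited to improving the model's visual question-answering capabilities.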

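📖 Additional Resources

If you are interested in more VLM applications, check out:

Multimodal Retrieval-Augmented Generation (RAG): I walk you through building a RAG system using document retrieval (ColPali) and Vision Language Models (VLMs).
Phil Schmid's tutorial: an excellent deep dive into fine-tuning multimodal LLMs with TRL.
Merve Noyan's smol-vision repository: a collection of notebooks on cutting-edge vision and multimodal AI topics.

1. Install Dependencies

Let's start by installing the core libraries needed for fine-tuning! 🚀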

!pip install -U -q git+https://github.com/huggingface/trl.git bitsandbytes peft qwen-vl-utils trackio
# Tested with trl==0.22.0.dev0, bitsandbytes==0.47.0, peft==0.17.1, qwen-vl-utils==0.0.11, trackio==0.2.8

You need to authenticate with your Hugging Face account to save and share your model directly from this notebook.

from huggingface_hub import notebook_login

notebook_login()

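2. Load Dataset 📁

In this section, we load the HuggingFaceM4/ChartQA dataset. It contains chart images with associated questions and answers, which makes it a good fit for training on visual question-answering tasks.

Next, we write a system message for the VLM. Here, we want the model to act as an expert at analyzing chart images and to give concise answers based on them.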

system_message = """You are a Vision Language Model specialized in interpreting visual data from chart images.
Your task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.
The charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.
Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary."""

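We format the dataset into a chatbot-style conversation. Each interaction consists of a system message, followed by the image and the user's query, and finally the answer to the query.

💡 For more usage tips specific to this model, check out the model card.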

def format_data(sample):
    return {
        "images": [sample["image"]],
        "messages": [
            {
                "role": "system",
                "content": [{"type": "text", "text": system_message}],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": sample["image"],
                    },
                    {
                        "type": "text",
                        "text": sample["query"],
                    },
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["label"][0]}],
            },
        ],
    }
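Before formatting thousands of samples, it can help to check that one formatted sample has the expected shape. The sketch below is illustrative only: `check_formatted_sample` and `demo_sample` are hypothetical names that are not part of the original notebook, and the image is replaced by a placeholder string so the snippet runs without the dataset.

```python
def check_formatted_sample(sample):
    """Return a list of problems found in a formatted sample (empty if none)."""
    problems = []
    if len(sample.get("images", [])) != 1:
        problems.append("expected exactly one image")
    roles = [message["role"] for message in sample.get("messages", [])]
    if roles != ["system", "user", "assistant"]:
        problems.append(f"unexpected role order: {roles}")
    for message in sample.get("messages", []):
        for part in message["content"]:
            # Image parts are skipped; only text parts are checked for emptiness.
            if part["type"] == "text" and not part["text"].strip():
                problems.append(f"empty text in {message['role']} message")
    return problems


# Hand-made stand-in for one formatted sample (placeholder instead of a PIL image).
demo_sample = {
    "images": ["<PIL image>"],
    "messages": [
        {"role": "system", "content": [{"type": "text", "text": "You are a chart expert."}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "<PIL image>"},
                {"type": "text", "text": "What is the highest value?"},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": "42"}]},
    ],
}

print(check_formatted_sample(demo_sample))  # → []
```

In the notebook, the same check could be applied to the output of `format_data(sample)` for a few samples.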

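For educational purposes, we load only 10% of each split of the dataset. In a real-world use case, you would typically load the full set of samples.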

from datasets import load_dataset

dataset_id = "HuggingFaceM4/ChartQA"
train_dataset, eval_dataset, test_dataset = load_dataset(dataset_id, split=["train[:10%]", "val[:10%]", "test[:10%]"])

Let's look at the structure of the dataset. Each sample includes an image, a query, a label (the answer), and a fourth feature that we will discard.

train_dataset

Now, let's format the data using the chatbot structure. This sets up the interactions in the form the model expects.

train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset = [format_data(sample) for sample in eval_dataset]
test_dataset = [format_data(sample) for sample in test_dataset]

train_dataset[200]

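3. Load Model and Check Performance! 🤔

Now that the dataset is loaded, let's load the model and evaluate its performance on a sample from the dataset. We use Qwen/Qwen2-VL-7B-Instruct, a Vision Language Model (VLM) that can understand both visual data and text.

If you are looking for alternatives, consider these open-source options:

Meta AI's Llama-3.2-11B-Vision
Mistral AI's Pixtral-12B
Allen AI's Molmo-7B-D-0924

You can also check leaderboards such as WildVision Arena or the OpenVLM Leaderboard to find the best-performing VLMs.

Qwen2-VL architecture diagram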

import torch
from transformers import Qwen2VLForConditionalGeneration, Qwen2VLProcessor

model_id = "Qwen/Qwen2-VL-7B-Instruct"

Next, we load the model and the processor to prepare for inference.

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = Qwen2VLProcessor.from_pretrained(model_id)

To evaluate the model’s performance, we’ll use a sample from the dataset. First, let’s take a look at the internal structure of this sample.

train_dataset[0]

We use the sample without the system message to assess the VLM's raw understanding. Here is the input we will use:

train_dataset[0]["messages"][1:2]
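The slices used in this notebook decide which parts of the conversation the model sees. The sketch below uses a simplified stand-in conversation (plain strings instead of image and text parts) to show what each slice keeps.

```python
# Stand-in conversation with the same three roles used in this notebook.
messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "image + question"},
    {"role": "assistant", "content": "reference answer"},
]

user_only = messages[1:2]        # drops the system prompt and the reference answer
system_and_user = messages[:2]   # keeps the system prompt, still hides the answer

print([m["role"] for m in user_only])        # → ['user']
print([m["role"] for m in system_and_user])  # → ['system', 'user']
```

Either way, the assistant turn is left out at inference time so that the model has to produce the answer itself.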

Now, let's look at the chart that corresponds to this sample. Can you answer the question from the visual information?

train_dataset[0]["images"][0]

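Let's create a function that takes the model, the processor, and a sample as input and generates the model's answer. This streamlines inference and makes it easy to evaluate the VLM's performance.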

from qwen_vl_utils import process_vision_info


def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(
        sample["messages"][1:2],  # Use the sample without the system message
        tokenize=False,
        add_generation_prompt=True,
    )

    # Process the visual input from the sample
    image_inputs, _ = process_vision_info(sample["messages"])

    # Prepare the inputs for the model
    model_inputs = processor(
        text=[text_input],
        images=image_inputs,
        return_tensors="pt",
    ).to(
        device
    )  # Move inputs to the specified device

    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)]

    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text

# Example of how to call the method with sample:
output = generate_text_from_sample(model, processor, train_dataset[0])
output
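The trimming step inside `generate_text_from_sample` is needed because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones. The sketch below reproduces that step with plain Python lists; the token ids are made up for illustration.

```python
prompt_ids = [[101, 7, 8, 9]]             # one prompt of four (made-up) token ids
generated_ids = [[101, 7, 8, 9, 55, 56]]  # prompt followed by two new tokens

# Keep only what comes after the prompt, sequence by sequence.
trimmed = [out[len(inp):] for inp, out in zip(prompt_ids, generated_ids)]
print(trimmed)  # → [[55, 56]]
```

Without this step, the decoded text would repeat the whole prompt before the answer.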

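Although the model extracts the correct visual information, it struggles to answer the question accurately. This suggests that fine-tuning may be the key to improving its performance. Let's move on to fine-tuning!

Remove Model and Clean GPU

Before we move on to training the model in the next section, let's delete the current variables and clear the GPU cache to free up resources.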

import gc
import time


def clear_memory():
    # Delete variables if they exist in the current global scope
    if "inputs" in globals():
        del globals()["inputs"]
    if "model" in globals():
        del globals()["model"]
    if "processor" in globals():
        del globals()["processor"]
    if "trainer" in globals():
        del globals()["trainer"]
    if "peft_model" in globals():
        del globals()["peft_model"]
    if "bnb_config" in globals():
        del globals()["bnb_config"]
    time.sleep(2)

    # Garbage collection and clearing CUDA memory
    gc.collect()
    time.sleep(2)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    time.sleep(2)
    gc.collect()
    time.sleep(2)

    print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")


clear_memory()

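4. Fine-Tune the Model Using TRL

4.1 Load the Quantized Model for Training ⚙️

Next, we load the quantized model using bitsandbytes. If you want to learn more about quantization, check out this blog post or this one.

Quantization significantly reduces the memory needed to store the model and can make inference more efficient, which is especially useful when deploying to resource-constrained devices.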

from transformers import BitsAndBytesConfig

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the quantized model and its processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=bnb_config
)
processor = Qwen2VLProcessor.from_pretrained(model_id)
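To get a feel for what 4-bit loading saves, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is a rough sketch: it uses the total parameter count reported later in this notebook and ignores quantization constants, layers that stay in higher precision, activations, and optimizer state. `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8_293_898_752  # "all params" as reported later in this notebook

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"bf16 weights:  ~{bf16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
```

The real footprint of the quantized model will be somewhat higher than the 4-bit figure, for the reasons listed above.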

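4.2 Set Up QLoRA and SFTConfig 🚀

Next, we configure QLoRA for our training setup. QLoRA enables efficient fine-tuning of large language models while significantly reducing the memory footprint compared to traditional methods. Standard LoRA reduces memory use by training small low-rank adapter matrices while the base model stays frozen. QLoRA goes a step further by keeping the frozen base model weights quantized to 4 bits (as loaded above with bitsandbytes) and training the LoRA adapters on top of them in higher precision. This lowers memory requirements further, which makes fine-tuning possible on more modest hardware without a large loss in quality.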

>>> from peft import LoraConfig, get_peft_model

>>> # Configure LoRA
>>> peft_config = LoraConfig(
...     lora_alpha=16,
...     lora_dropout=0.05,
...     r=8,
...     bias="none",
...     target_modules=["q_proj", "v_proj"],
...     task_type="CAUSAL_LM",
... )

>>> # Apply PEFT model adaptation
>>> peft_model = get_peft_model(model, peft_config)

>>> # Print trainable parameters
>>> peft_model.print_trainable_parameters()

trainable params: 2,523,136 || all params: 8,293,898,752 || trainable%: 0.0304
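The trainable parameter count above can be reproduced by hand. The sketch below assumes the dimensions of the language model inside Qwen2-VL-7B (hidden size 3584, 28 decoder layers, and a 512-dimensional value projection from 4 key/value heads of size 128); these numbers are not stated in this notebook, so treat them as assumptions. `lora_params` is a hypothetical helper name.

```python
def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) to a linear layer."""
    return rank * (in_features + out_features)


# Assumed dimensions of the language model in Qwen2-VL-7B (not stated in this notebook).
HIDDEN_SIZE = 3584
NUM_LAYERS = 28
KV_DIM = 512   # 4 key/value heads x head size 128
RANK = 8       # r in the LoraConfig above

per_layer = (
    lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)  # q_proj
    + lora_params(HIDDEN_SIZE, KV_DIM, RANK)     # v_proj
)
total = per_layer * NUM_LAYERS

print(f"{total:,}")                               # → 2,523,136
print(f"{100 * total / 8_293_898_752:.4f}%")      # → 0.0304%
```

Under these assumptions the result matches the printed output, which suggests that only the `q_proj` and `v_proj` layers of the language model receive adapters.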

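We use Supervised Fine-Tuning (SFT) to improve the model's performance on this task. To do so, we define the training arguments with the SFTConfig class from the TRL library. SFT lets us provide labeled data so the model learns to generate more accurate responses to the inputs it receives. This adapts the model to our specific use case, leading to better performance in understanding and answering visual queries.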

from trl import SFTConfig

# Configure training arguments
training_args = SFTConfig(
    output_dir="qwen2-7b-instruct-trl-sft-ChartQA",  # Directory to save the model
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=4,  # Batch size for training
    per_device_eval_batch_size=4,  # Batch size for evaluation
    gradient_accumulation_steps=8,  # Steps to accumulate gradients
    gradient_checkpointing_kwargs={"use_reentrant": False},  # Options for gradient checkpointing
    max_length=None,
    # Optimizer and scheduler settings
    optim="adamw_torch_fused",  # Optimizer type
    learning_rate=2e-4,  # Learning rate for training
    # Logging and evaluation
    logging_steps=10,  # Steps interval for logging
    eval_steps=10,  # Steps interval for evaluation
    eval_strategy="steps",  # Strategy for evaluation
    save_strategy="steps",  # Strategy for saving the model
    save_steps=20,  # Steps interval for saving
    # Mixed precision and gradient settings
    bf16=True,  # Use bfloat16 precision
    max_grad_norm=0.3,  # Maximum norm for gradient clipping
    warmup_ratio=0.03,  # Ratio of total steps for warmup
    # Hub and reporting
    push_to_hub=True,  # Whether to push model to Hugging Face Hub
    report_to="trackio",  # Reporting tool for tracking metrics
)
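A few of these settings interact: the batch size and gradient accumulation determine how many samples contribute to each optimizer step, and that in turn determines how often logging, evaluation, and saving happen. The sketch below gives a rough estimate only. The sample count is a hypothetical figure, `training_schedule` is a hypothetical helper, and the exact rounding used by the trainer may differ.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, epochs, warmup_ratio, num_devices=1):
    """Rough estimate of effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


effective_batch, total_steps, warmup_steps = training_schedule(
    num_samples=2_800,      # hypothetical size of the 10% training split
    per_device_batch=4,     # per_device_train_batch_size
    grad_accum=8,           # gradient_accumulation_steps
    epochs=3,               # num_train_epochs
    warmup_ratio=0.03,
)

print(effective_batch, total_steps, warmup_steps)  # → 32 264 8
```

With numbers in this range, `eval_steps=10` and `save_steps=20` would trigger an evaluation roughly every 320 samples and a checkpoint every 640.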

4.3 Training the Model 🏃

We use trackio to log our training progress. Let's connect the notebook to trackio so that important information is captured during training.

import trackio

trackio.init(
    project="qwen2-7b-instruct-trl-sft-ChartQA",
    name="qwen2-7b-instruct-trl-sft-ChartQA",
    config=training_args,
    space_id=training_args.output_dir + "-trackio",
)

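Now, we define the SFTTrainer, a wrapper around the transformers.Trainer class that inherits its attributes and methods. It simplifies fine-tuning by properly initializing the PeftModel when a PeftConfig object is provided. With SFTTrainer, we can manage the training workflow efficiently and keep the fine-tuning of our Vision Language Model running smoothly.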

from trl import SFTTrainer

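trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # No data_collator is passed: `collate_fn` is never defined in this notebook.
    # Assumption: the installed TRL version collates "images" + "messages" samples
    # itself when it is given the full processor rather than only the tokenizer.
    peft_config=peft_config,
    processing_class=processor,
)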

Time to train the model! 🎉

trainer.train()

Let's save the results 💾

trainer.save_model(training_args.output_dir)

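5. Testing the Fine-Tuned Model 🔍

Now that we have fine-tuned our Vision Language Model (VLM), it's time to evaluate it! In this section, we test the model on examples from the ChartQA dataset to see how well it answers questions about chart images. Let's dive in and look at the results! 🚀

Let's clean up the GPU memory to ensure optimal performance 🧹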

clear_memory()

We reload the base model using the same pipeline as before.

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = Qwen2VLProcessor.from_pretrained(model_id)

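We attach the trained adapter to the pretrained model. The adapter contains the adjustments learned during fine-tuning, which lets the base model use the new knowledge while its core parameters stay unchanged. Integrating the adapter extends the model's capabilities while preserving its original structure.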

adapter_path = "sergiopaniego/qwen2-7b-instruct-trl-sft-ChartQA"
model.load_adapter(adapter_path)
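To make the idea of an adapter concrete, the sketch below shows the LoRA update on a toy weight matrix in plain Python: the effective weight is the frozen weight plus a scaled low-rank product, W + (alpha / r) · B · A. The matrices and the helper names `matmul` and `apply_lora` are made up for illustration and are not how the library implements it internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying W."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


# Toy 2x2 example with rank 1; alpha / rank = 2, the same ratio as lora_alpha=16, r=8.
base_weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, 0.5]]    # shape (rank, in_features)
lora_b = [[1.0], [0.0]]  # shape (out_features, rank)

effective = apply_lora(base_weight, lora_a, lora_b, alpha=2, rank=1)

print(effective)    # → [[2.0, 1.0], [0.0, 1.0]]
print(base_weight)  # → [[1.0, 0.0], [0.0, 1.0]]  (unchanged)
```

Because the base weights are never overwritten, the same base model can be combined with different adapters, or used without any.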

We use the sample we picked earlier, the one the model initially could not answer correctly.

train_dataset[0]["messages"][:2]

train_dataset[0]["images"][0]

output = generate_text_from_sample(model, processor, train_dataset[0])
output

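Since this sample comes from the training set, the model has already seen it during training, so this could be considered a form of cheating. To get a more complete picture of the model's performance, we also evaluate it on an unseen sample.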

test_dataset[10]["messages"][:2]

test_dataset[10]["images"][0]

output = generate_text_from_sample(model, processor, test_dataset[10])
output
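One or two samples are only anecdotal. A fuller evaluation would score predictions over many test samples. The sketch below shows a relaxed-accuracy style metric of the kind commonly reported for ChartQA: text answers must match exactly (ignoring case), while numeric answers may deviate by up to 5%. The helper names and the handling of percent signs are assumptions made for this illustration.

```python
def relaxed_match(prediction, target, tolerance=0.05):
    """Exact match for text; numeric answers may differ by up to `tolerance` (relative)."""
    pred, tgt = prediction.strip(), target.strip()
    try:
        pred_num = float(pred.rstrip("%"))
        tgt_num = float(tgt.rstrip("%"))
    except ValueError:
        # At least one side is not a number: fall back to case-insensitive text match.
        return pred.lower() == tgt.lower()
    if tgt_num == 0:
        return pred_num == 0
    return abs(pred_num - tgt_num) / abs(tgt_num) <= tolerance


def relaxed_accuracy(predictions, targets):
    """Fraction of predictions that pass relaxed_match against their target."""
    matches = [relaxed_match(p, t) for p, t in zip(predictions, targets)]
    return sum(matches) / len(matches)


# Made-up predictions and reference answers for illustration.
predictions = ["41", "Blue", "12.5%"]
targets = ["42", "blue", "20"]
print(f"{relaxed_accuracy(predictions, targets):.2f}")  # → 0.67
```

In this notebook, the predictions could come from calling `generate_text_from_sample` on each entry of `test_dataset`, and the reference answer for a sample is `sample["messages"][2]["content"][0]["text"]`.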

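The model has learned to respond to queries in the style of the dataset. We've reached our goal! 🎉✨

💻 I have developed an example application to test the model, which you can find here. You can compare it with another Space that showcases the pretrained model, which you can find here.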

from IPython.display import IFrame

IFrame(src="https://sergiopaniego-qwen2-vl-7b-trl-sft-chartqa.hf.space", width=1000, height=800)

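6. Compare Fine-Tuned Model vs. Base Model + Prompting 📊

We have explored how fine-tuning a VLM can adapt it to our specific needs. Another approach worth considering is to use prompting directly or to implement a RAG (Retrieval-Augmented Generation) system, which is covered in another tutorial.

Fine-tuning a VLM requires a significant amount of data and compute, which comes at a cost. In contrast, we can try prompting and see whether it achieves similar results without the overhead of fine-tuning.

Let's clean up the GPU memory to ensure optimal performance 🧹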

>>> clear_memory()

GPU allocated memory: 0.02 GB
GPU reserved memory: 0.27 GB

🏗️ First, we load the base model following the same pipeline as before.

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = Qwen2VLProcessor.from_pretrained(model_id)

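📜 In this case, we again use the previous sample, but this time we include the system message, as shown below. This addition gives the model more context, which may improve the accuracy of its response.

train_dataset[0]["messages"][:2]

Let's see how it performs!

text = processor.apply_chat_template(train_dataset[0]["messages"][:2], tokenize=False, add_generation_prompt=True)

image_inputs, _ = process_vision_info(train_dataset[0]["messages"])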

inputs = processor(
    text=[text],
    images=image_inputs,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

output_text[0]

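💡 As we can see, the pretrained model, without any fine-tuning, generates the correct answer when given the additional system message. Depending on the use case, this approach can be a viable alternative to fine-tuning. Prompting or RAG (Retrieval-Augmented Generation) may achieve similar performance at a lower computational cost for some tasks.

7. Continuing the Learning Journey 🧑‍🎓️

To further improve your understanding and skills in working with multimodal models, here are some recommended resources:

These resources will help you deepen your knowledge of multimodal learning. Keep exploring, learning, and experimenting, and you will go further in this field!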
