🧑‍🍳 New Multimodal Fine-Tuning Recipe 🧑‍🍳
➡️ In this new @huggingface Cookbook recipe, I walk you through the process of fine-tuning a Visual Language Model (VLM) for Object Detection with Visual Grounding, using TRL.
🔍 Object detection typically involves detecting categories in images (e.g., vase).
By combining it with visual grounding, we add contextual understanding, so instead of detecting just "vase", we can detect the "middle vase" in an image.
VLMs are super powerful!
In this case, I use PaliGemma 2, which already supports object detection, and extend it to also handle visual grounding.
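To give a rough idea of the setup, here is a minimal sketch of fine-tuning with TRL's SFTTrainer. This is not the recipe's exact code: the checkpoint name, dataset, and column names are placeholders, and the collator assumes each example provides an image, a grounded prompt, and a target string using PaliGemma's <loc> box tokens. See the notebook linked below for the full walkthrough.

```python
# Minimal sketch (placeholders, not the recipe's exact code):
# fine-tune PaliGemma 2 for detection + grounding with TRL's SFTTrainer.
import torch
from datasets import load_dataset
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from trl import SFTConfig, SFTTrainer

model_id = "google/paligemma2-3b-pt-448"  # assumed checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Placeholder dataset: each row has an image, a grounded prompt
# (e.g. "detect middle vase"), and a target string with <loc...> box tokens.
dataset = load_dataset("your-username/grounded-detection-dataset", split="train")

def collate_fn(examples):
    prompts = [ex["prompt"] for ex in examples]
    targets = [ex["target"] for ex in examples]
    images = [ex["image"].convert("RGB") for ex in examples]
    # `suffix` makes the processor build labels with the prompt tokens masked out.
    batch = processor(text=prompts, images=images, suffix=targets,
                      return_tensors="pt", padding="longest")
    return batch.to(torch.bfloat16)

args = SFTConfig(
    output_dir="paligemma2-object-detection-grounding",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,                    # keep the image column for the collator
    dataset_kwargs={"skip_prepare_dataset": True},  # the collator handles tokenization
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collate_fn,
)
trainer.train()
```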
🤗 Check it out here: https://huggingface.co/learn/cookbook/fine_tuning_vlm_object_detection_grounding