sergiopaniego
posted an update 11 days ago
πŸ§‘β€πŸ³ New Multimodal Fine-Tuning Recipe πŸ§‘β€πŸ³

⚡️ In this new @huggingface Cookbook recipe, I walk you through the process of fine-tuning a Vision Language Model (VLM) for Object Detection with Visual Grounding, using TRL.

πŸ” Object detection typically involves detecting categories in images (e.g., vase).

By combining it with visual grounding, we add contextual understanding, so instead of detecting just "vase", we can detect "middle vase" in an image.

VLMs are super powerful!

In this case, I use PaliGemma 2, which already supports object detection, and extend it with visual grounding.
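
To give a feel for what a grounded detection looks like in practice, here is a minimal sketch that parses PaliGemma-style detection text into pixel boxes. It assumes the model answers in the usual PaliGemma format: four `<locNNNN>` tokens (y_min, x_min, y_max, x_max, each a bin from 0 to 1023) followed by the label. The helper name `parse_detections` and the sample string are made up for illustration and are not taken from the recipe.

```python
import re

# Assumed output format: "<locYYYY><locXXXX><locYYYY><locXXXX> label",
# with several objects separated by ";". Each value is a bin in 0..1023.
LOC_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_detections(text, width, height):
    """Convert detection text into a list of labels with pixel boxes (hypothetical helper)."""
    detections = []
    for locs, label in LOC_PATTERN.findall(text):
        # Token order is y_min, x_min, y_max, x_max; normalize bins to [0, 1).
        y_min, x_min, y_max, x_max = (int(v) / 1024 for v in re.findall(r"\d{4}", locs))
        detections.append({
            "label": label.strip(),
            # Box returned as (x_min, y_min, x_max, y_max) in pixels.
            "box": (x_min * width, y_min * height, x_max * width, y_max * height),
        })
    return detections


# Illustrative model output for a grounded prompt such as "detect middle vase".
sample = "<loc0256><loc0512><loc0768><loc0640> middle vase"
print(parse_detections(sample, width=1000, height=500))
```

The label carries the grounding phrase ("middle vase") rather than only the category, which is the behavior the fine-tuning adds.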

🤗 Check it out here: https://huggingface.co/learn/cookbook/fine_tuning_vlm_object_detection_grounding