🧑‍🍳 New Multimodal Fine-Tuning Recipe 🧑‍🍳
➡️ In this new @huggingface Cookbook recipe, I walk you through the process of fine-tuning a Visual Language Model (VLM) for Object Detection with Visual Grounding, using TRL.
🔍 Object detection typically involves detecting categories in images (e.g., vase).
By combining it with visual grounding, we add contextual understanding, so instead of detecting just "vase", we can detect the "middle vase" in an image.
VLMs are super powerful!
In this case, I use PaliGemma 2, which already supports object detection, and extend it to also handle visual grounding.
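To give a rough idea of the setup, here is a minimal sketch of fine-tuning with TRL's SFTTrainer. This is not the recipe's exact code: the checkpoint name, dataset, and column names are placeholders, and the collator assumes each example provides an image, a grounded prompt, and a target string using PaliGemma's <loc> box tokens. See the notebook linked below for the full walkthrough.

```python
# Minimal sketch (placeholders, not the recipe's exact code):
# fine-tune PaliGemma 2 for detection + grounding with TRL's SFTTrainer.
import torch
from datasets import load_dataset
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from trl import SFTConfig, SFTTrainer

model_id = "google/paligemma2-3b-pt-448"  # assumed checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Placeholder dataset: each row has an image, a grounded prompt
# (e.g. "detect middle vase"), and a target string with <loc...> box tokens.
dataset = load_dataset("your-username/grounded-detection-dataset", split="train")

def collate_fn(examples):
    prompts = [ex["prompt"] for ex in examples]
    targets = [ex["target"] for ex in examples]
    images = [ex["image"].convert("RGB") for ex in examples]
    # `suffix` makes the processor build labels with the prompt tokens masked out.
    batch = processor(text=prompts, images=images, suffix=targets,
                      return_tensors="pt", padding="longest")
    return batch.to(torch.bfloat16)

args = SFTConfig(
    output_dir="paligemma2-object-detection-grounding",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,                    # keep the image column for the collator
    dataset_kwargs={"skip_prepare_dataset": True},  # the collator handles tokenization
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collate_fn,
)
trainer.train()
```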
🤗 Check it out here: https://huggingface.co/learn/cookbook/fine_tuning_vlm_object_detection_grounding