Fine-tuning for Image Captioning and Image QA with Segmented Images

by badhon1512

Hi Team,

Thank you for developing this excellent model!

I am working on fine-tuning the model for image captioning and image-based question answering (QA), specifically for navigation-related tasks. My approach involves using both a normal image and a corresponding segmented image to help the model understand object positions based on the user’s perspective.
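For context, each training sample in my dataset currently pairs the two images with a question-answer target, roughly like this (the field names are just my own convention, not from your codebase):

```python
# Rough shape of one fine-tuning sample (paths and field names are placeholders):
sample = {
    "images": ["frames/0421.png", "frames/0421_seg.png"],  # normal view + its segmentation map
    "question": "Which side of the hallway is the exit door on, from the user's perspective?",
    "answer": "The exit door is on the user's left, roughly two meters ahead.",
}
```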

I have a few questions regarding the best way to implement this:

1. Handling multiple images:
   - Should I provide both the normal image and its segmented counterpart during training?
   - Does your processor handle positional embeddings for both images in a way that helps the model understand object locations correctly?

2. Alternative approaches:
   - Would you recommend another strategy for incorporating segmented images to enhance spatial understanding?
   - Should I concatenate features from both images before passing them to the model, or is there a built-in mechanism for handling such multi-modal input effectively?
I would love to hear your thoughts on the best approach to train the model effectively for spatially aware navigation tasks.

Looking forward to your insights!

Cohere Labs org

Hi! You can pass both images, and the RoPE positional embeddings should take care of it!
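For example, a minimal sketch of passing both images in a single user turn with a standard transformers-style processor (the checkpoint id, image paths, and exact chat-template fields below are placeholders and may differ for your model version):

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "your-org/your-vlm-checkpoint"  # placeholder: substitute the actual checkpoint

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

raw_image = Image.open("scene.png")      # normal camera view (placeholder path)
seg_image = Image.open("scene_seg.png")  # matching segmentation map (placeholder path)

# Both images go into the same user turn; the processor interleaves their
# patch tokens with the text, and the positional embeddings keep the two
# image spans distinct for the model.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Using the segmentation map, which side is the door on?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=[raw_image, seg_image], text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same multi-image input format applies during fine-tuning: supply both images per sample and let the processor build the interleaved sequence, rather than concatenating features yourself.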
