Fine-tuning for Image Captioning and Image QA with Segmented Images

by badhon1512

Hi Team,

Thank you for developing this excellent model!

I am working on fine-tuning the model for image captioning and image-based question answering (QA), specifically for navigation-related tasks. My approach involves using both a normal image and a corresponding segmented image to help the model understand object positions based on the user’s perspective.
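For context, each training sample in my dataset currently pairs the two images with a question-answer target, roughly like this (the field names are just my own convention, not from your codebase):

```python
# Rough shape of one fine-tuning sample (paths and field names are placeholders):
sample = {
    "images": ["frames/0421.png", "frames/0421_seg.png"],  # normal view + its segmentation map
    "question": "Which side of the hallway is the exit door on, from the user's perspective?",
    "answer": "The exit door is on the user's left, roughly two meters ahead.",
}
```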

I have a few questions regarding the best way to implement this:

1. Handling multiple images:
   - Should I provide both the normal image and its segmented counterpart during training?
   - Does your processor handle positional embeddings for both images in a way that helps the model understand object locations correctly?

2. Alternative approaches:
   - Would you recommend another strategy for incorporating segmented images to enhance spatial understanding?
   - Should I concatenate features from both images before passing them to the model, or is there a built-in mechanism for handling such multi-modal input effectively?
I would love to hear your thoughts on the best approach to train the model effectively for spatially aware navigation tasks.

Looking forward to your insights!

Cohere Labs org

Hi! You can pass both images, and the RoPE positional embeddings should take care of it!
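For example, a minimal sketch of passing both images in a single user turn with a standard transformers-style processor (the checkpoint id, image paths, and exact chat-template fields below are placeholders and may differ for your model version):

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "your-org/your-vlm-checkpoint"  # placeholder: substitute the actual checkpoint

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

raw_image = Image.open("scene.png")      # normal camera view (placeholder path)
seg_image = Image.open("scene_seg.png")  # matching segmentation map (placeholder path)

# Both images go into the same user turn; the processor interleaves their
# patch tokens with the text, and the positional embeddings keep the two
# image spans distinct for the model.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Using the segmentation map, which side is the door on?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=[raw_image, seg_image], text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same multi-image input format applies during fine-tuning: supply both images per sample and let the processor build the interleaved sequence, rather than concatenating features yourself.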
