Support multi-image?
Thanks for your great work! Does fuyu-8b support multi-image as input?
It does, but the processor does not at the moment. A PR is on its way for this!
Thank you for the great work! Regarding multi-image support, does it mean that fuyu-8b can handle interleaved text and multiple images? Could you clarify the input format for the transformer decoder when text and multiple images are interleaved?
From the blog, it appears that the input for a single image and text would be: [img_patch] [img_patch] \n [img_patch] [img_patch] \n [text].
How would the format change in the scenario where the sequence is "<img1> <img2> some text"? Would there be some special image token to separate the two images? Or would it be like [img1_patch] [img1_patch] \n [img1_patch] [img1_patch] \n [img2_patch] [img2_patch] \n [img2_patch] [img2_patch] \n [text]?
Also, does FuyuForCausalLM need any modification in order to accommodate interleaved text and multi-images? Thank you!
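To make the two-image layout asked about above concrete, here is a minimal sketch in plain Python that builds the sequence by simple concatenation, with no separator token between images. It only manipulates placeholder strings: the helpers `image_layout` and `concat_layout` are hypothetical names made up for illustration, not part of the Fuyu processor, and whether the model was actually trained on this layout is exactly the open question.

```python
from typing import List, Tuple


def image_layout(name: str, rows: int, cols: int) -> List[str]:
    """Lay out one image as raster-ordered patch placeholders.

    Hypothetical helper: a newline marker follows every row of patches,
    as in the single-image format described in the blog.
    """
    tokens: List[str] = []
    for _ in range(rows):
        tokens.extend([f"[{name}_patch]"] * cols)
        tokens.append("\\n")  # literal backslash-n marker, not a real newline
    return tokens


def concat_layout(images: List[Tuple[str, int, int]], text: str) -> List[str]:
    """Concatenate images back to back, then append the text placeholder.

    Assumption: no special token separates consecutive images.
    """
    tokens: List[str] = []
    for name, rows, cols in images:
        tokens.extend(image_layout(name, rows, cols))
    tokens.append(text)
    return tokens


# Two images of 2x2 patches each, followed by text.
layout = concat_layout([("img1", 2, 2), ("img2", 2, 2)], "[text]")
print(" ".join(layout))
```

With this layout nothing but the row newlines marks where img1 ends and img2 begins, which is why a dedicated separator token might be needed.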
I think it does support interleaved images, but I'm not entirely sure how two images are prompted. We'll try to ask the authors!
Any update on supporting interleaved images?