Support multi-image?

#12
by WaltonFuture - opened

Thanks for your great work! Does fuyu-8b support multi-image as input?

It does, but the processor does not at the moment. A PR is on its way for this!

Thank you for the great work! Regarding multi-image support, does that mean fuyu-8b can handle interleaved text and multiple images? Could you clarify the input format for the transformer decoder when text and multiple images are interleaved?

From the blog, it appears that the input for a single image and text would be: [img_patch] [img_patch] \n [img_patch] [img_patch] \n [text].

How would the format change in the scenario where the sequence is "<img1> <img2> some text"? Would there be some special image token to separate the two images? Or would it be like [img1_patch] [img1_patch] \n [img1_patch] [img1_patch] \n [img2_patch] [img2_patch] \n [img2_patch] [img2_patch] \n [text]?
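To make the second option concrete, here is a minimal sketch of the plain-concatenation layout. It only builds placeholder strings and is not the actual processor logic; the helper names (`image_layout`, `concatenated_layout`), the 2x2 patch grid, and the `\n` placeholder are all assumptions for illustration, and nothing in this thread confirms that fuyu-8b uses this layout.

```python
from typing import List

# Placeholder for the per-row image newline marker. The real token name is
# not given in this thread, so a literal backslash-n string is used here.
NEWLINE = "\\n"


def image_layout(name: str, rows: int, cols: int) -> List[str]:
    """Hypothetical helper: patch placeholders for one image in raster order,
    with a newline marker after every row of patches."""
    tokens: List[str] = []
    for _ in range(rows):
        tokens.extend([f"[{name}_patch]"] * cols)
        tokens.append(NEWLINE)
    return tokens


def concatenated_layout(image_names: List[str], text: str,
                        rows: int = 2, cols: int = 2) -> List[str]:
    """Hypothetical helper: images placed back to back with no separator
    token between them, followed by the text."""
    tokens: List[str] = []
    for name in image_names:
        tokens.extend(image_layout(name, rows, cols))
    tokens.append(text)
    return tokens


if __name__ == "__main__":
    print(" ".join(concatenated_layout(["img1", "img2"], "[text]")))
```

In this layout nothing marks where img1 ends and img2 begins other than the row newline markers, which is why I am asking whether a dedicated separator token exists.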

Also, does FuyuForCausalLM need any modification in order to accommodate interleaved text and multi-images? Thank you!

I think it does support interleaved images, but I'm not entirely sure how two images are prompted. We'll try to ask the authors!

Any update on supporting interleaved images?

