Performance degrades on multiple images

#4
by concedo - opened

For a single image, performance is adequate: roughly on par with Qwen2VL, though with far better anatomy.

For multiple images, the model gets very confused: sometimes it claims to see only a single image, and sometimes it mixes up details between the images.

@fancyfeast was the model trained on any multi-image scenarios? It might be good to add a small number of "Spot the difference" cases.

Example of information bleed and confusion using Joycaption Beta One:

image.png

Here is the exact same prompt and images, using MiniCPM-V:

image.png


It was only trained on single-image, single-turn conversations. It would be nice to support more than that, but it's quite a jump in complexity, so it probably won't happen for a while.
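
Given that, one possible workaround is to send each image in its own request instead of passing several at once. Here is a minimal sketch, assuming a hypothetical `caption_one(image_path, prompt)` function that wraps whatever single-image inference call you already use and keeps no state between calls. Neither `caption_one` nor `caption_each` is part of any library; they are illustrative names only.

```python
from typing import Callable, Dict, Iterable


def caption_each(
    image_paths: Iterable[str],
    prompt: str,
    caption_one: Callable[[str, str], str],
) -> Dict[str, str]:
    """Caption images one at a time so every request is single-image, single-turn.

    `caption_one` is a hypothetical, stateless wrapper around your own
    inference call (assumption: it takes an image path and a prompt and
    returns the caption text).
    """
    captions: Dict[str, str] = {}
    for path in image_paths:
        # One fresh request per image: nothing is shared between calls,
        # so details from one image cannot end up in another's caption.
        captions[path] = caption_one(path, prompt)
    return captions
```

This does not give the model any way to compare images directly; any comparison would have to be done afterwards on the returned caption text.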

fancyfeast changed discussion status to closed
