Performance degrades on multiple images
#4 opened by concedo
For a single image, performance is adequate: roughly on par with Qwen2VL, although with far better anatomy.
For multiple images, the model gets very confused. Sometimes it claims to see only a single image, and sometimes it mixes up details between the images.
@fancyfeast was the model trained on any multi-image scenarios? It might be worth adding a small number of "Spot the difference" cases.
Example of information bleed and confusion using Joycaption Beta One:
Here is the exact same prompt and images, using MiniCPM-V:
It was only trained on single-image, single-turn conversations. It would be nice to support more than that, but it's quite a jump in complexity, so it probably won't happen for a while.
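Given that limitation, one possible workaround (not something proposed in this thread, just a sketch) is to send each image in its own single-image, single-turn request and compare the resulting captions afterwards. In the example below, `caption_one` is a hypothetical stand-in for whatever single-image inference call you already use, and `fake_captioner` exists only so the snippet runs on its own:

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")


def caption_each(
    images: Sequence[Image],
    caption_one: Callable[[Image], str],
) -> List[str]:
    """Caption every image in its own single-image, single-turn call."""
    captions = []
    for index, image in enumerate(images, start=1):
        # One fresh call per image, so details cannot bleed between images.
        captions.append(f"Image {index}: {caption_one(image)}")
    return captions


def fake_captioner(image: str) -> str:
    # Placeholder for a real model call (assumption, not a real API).
    return f"a caption for {image}"


print("\n".join(caption_each(["cat.png", "dog.png"], fake_captioner)))
```

This does not give true multi-image reasoning such as "spot the difference"; it only avoids the cross-image confusion by never showing the model more than one image at a time.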
fancyfeast changed discussion status to closed