fancyfeast/llama-joycaption-beta-one-hf-llava · Performance degrades on multiple images

22 days ago


For a single image, performance is adequate, roughly on par with Qwen2VL, although with far better anatomy.

For multiple images, the model gets very confused: sometimes it claims to see only a single image, and sometimes it mixes up details between the images.

@fancyfeast was the model trained on any multi-image scenarios? It might be good to add a small number of "Spot the difference" cases.

Example of information bleed and confusion using Joycaption Beta One:

Here are the exact same prompt and images, using MiniCPM-V:


fancyfeast

Owner 22 days ago

It was only trained on single-image, single-turn conversations. It would be nice to support more than that, but it's quite a jump in complexity, so it probably won't happen for a while.
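Given the single-image training, one possible workaround is to caption each image in its own single-image, single-turn request and compare the captions afterwards. Below is a minimal sketch of that idea. It assumes a hypothetical `caption_one(image_path, prompt)` callable that wraps whatever single-image inference code you already use. That name, and both helper functions, are illustrative and not part of this model's API.

```python
from typing import Callable, List, Tuple


def caption_each_separately(
    image_paths: List[str],
    prompt: str,
    caption_one: Callable[[str, str], str],  # hypothetical single-image wrapper
) -> List[Tuple[str, str]]:
    """Run one independent single-image, single-turn request per image."""
    results: List[Tuple[str, str]] = []
    for path in image_paths:
        # Each call receives exactly one image and the same prompt, so
        # nothing from another image is part of this request.
        results.append((path, caption_one(path, prompt)))
    return results


def format_for_comparison(results: List[Tuple[str, str]]) -> str:
    """Join the per-image captions into one labelled text block."""
    lines = []
    for index, (path, caption) in enumerate(results, start=1):
        lines.append(f"Image {index} ({path}): {caption}")
    return "\n".join(lines)
```

The labelled block can then be compared by hand or passed to a text-only step. Whether this is good enough for "Spot the difference" style tasks is untested here, since each caption is produced without seeing the other images.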

fancyfeast changed discussion status to closed 22 days ago