Groma is an MLLM with exceptional region understanding and visual grounding capabilities. It can take user-defined region inputs (boxes) and generate long-form responses that are grounded in the visual context.
Groma presents a novel paradigm for grounded MLLMs, in contrast to prior designs: (a) LLM for localization (e.g., Kosmos-2, Shikra); (b) external modules for localization (e.g., LISA); and (c) visual tokenizer for localization (Groma).
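
The sketch below illustrates paradigm (c) in minimal Python: a class-agnostic region proposer supplies candidate boxes up front, each box is exposed to the LLM as a region token, and the model grounds its answer by citing those tokens rather than regressing coordinates as text (paradigm a) or invoking an external decoder (paradigm b). All names here (`propose`, `encode`, `generate`, the `<r{i}>` token format) are illustrative placeholders, not Groma's actual API.

```python
from typing import Callable, Dict, List, Tuple

def ground_with_region_tokens(
    image,
    question: str,
    propose: Callable,   # image -> list of (x0, y0, x1, y1) candidate boxes
    encode: Callable,    # (image, boxes) -> one feature vector per box
    generate: Callable,  # (prompt, region_feats) -> generated text
) -> Tuple[str, Dict[str, List[float]]]:
    """Paradigm (c): localization lives in the visual tokenizer.

    The proposer yields candidate boxes before generation; each box becomes
    a region token, so the LLM grounds its answer by citing token ids.
    """
    boxes = propose(image)
    region_feats = encode(image, boxes)
    region_tokens = [f"<r{i}>" for i in range(len(boxes))]
    prompt = " ".join(region_tokens) + "\n" + question
    answer = generate(prompt, region_feats)
    # Map every region token the model cited back to its box.
    cited = {t: list(boxes[i]) for i, t in enumerate(region_tokens) if t in answer}
    return answer, cited

# Toy stubs so the sketch runs end to end.
answer, cited = ground_with_region_tokens(
    image=None,
    question="Where is the dog?",
    propose=lambda img: [(10, 20, 110, 220), (300, 40, 420, 200)],
    encode=lambda img, boxes: [[0.0] * 8 for _ in boxes],
    generate=lambda prompt, feats: "The dog <r0> is on the left.",
)
print(answer, cited)  # cited == {"<r0>": [10, 20, 110, 220]}
```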
| Method | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val | RefCOCOg test | Average |
|---|---|---|---|---|---|---|---|---|---|
| Shikra | 87.01 | 90.61 | 80.24 | 81.60 | 87.36 | 72.12 | 82.27 | 82.19 | 82.93 |
| Ferret | 87.49 | 91.35 | 82.45 | 80.78 | 87.38 | 73.14 | 83.93 | 84.76 | 83.91 |
| MiniGPT-v2 | 88.69 | 91.65 | 85.33 | 79.97 | 85.12 | 74.45 | 84.44 | 84.66 | 84.29 |
| Qwen-VL | 89.36 | 92.26 | 85.34 | 83.12 | 88.25 | 77.21 | 85.58 | 85.48 | 85.83 |
| Groma | 89.53 | 92.09 | 86.26 | 83.90 | 88.91 | 78.05 | 86.37 | 87.01 | 86.52 |
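
These referring expression comprehension (REC) benchmarks are typically scored as accuracy at a 0.5 IoU threshold: a prediction counts as correct when its box overlaps the ground-truth box with IoU ≥ 0.5. A minimal sketch of that metric, assuming boxes in `(x0, y0, x1, y1)` format:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x0, y0, x1, y1) format."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def rec_accuracy(preds, gts, thresh=0.5):
    """Percentage of predictions whose IoU with the ground truth is >= thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

print(rec_accuracy([(0, 0, 10, 10)], [(1, 1, 10, 10)]))  # 100.0 (IoU = 0.81)
```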
| Training stage | Data types | Datasets |
|---|---|---|
| Detection pretraining | Detection | COCO, Objects365, OpenImages, V3Det, SA1B |
| Alignment pretraining | Image caption | ShareGPT-4V-PT |
| | Grounded caption | Flickr30k Entities |
| | Region caption | Visual Genome, RefCOCOg |
| | REC | COCO, RefCOCO/g/+, Grit-20m |
| Instruction finetuning | Grounded caption | Flickr30k Entities |
| | Region caption | Visual Genome, RefCOCOg |
| | REC | COCO, RefCOCO/g/+ |
| | Instruction following | Groma Instruct, LLaVA Instruct, ShareGPT-4V |
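
For reference, the stage-wise data mixture above can be captured as a simple nested mapping. The schema below is illustrative only and does not reflect Groma's actual training configs; it just encodes the table in a form a data loader could iterate over.

```python
# Illustrative stage-wise data mixture; keys and task names are placeholders,
# not Groma's actual config schema.
TRAINING_STAGES = {
    "detection_pretraining": {
        "detection": ["COCO", "Objects365", "OpenImages", "V3Det", "SA1B"],
    },
    "alignment_pretraining": {
        "image_caption": ["ShareGPT-4V-PT"],
        "grounded_caption": ["Flickr30k Entities"],
        "region_caption": ["Visual Genome", "RefCOCOg"],
        "rec": ["COCO", "RefCOCO/g/+", "Grit-20m"],
    },
    "instruction_finetuning": {
        "grounded_caption": ["Flickr30k Entities"],
        "region_caption": ["Visual Genome", "RefCOCOg"],
        "rec": ["COCO", "RefCOCO/g/+"],
        "instruction_following": ["Groma Instruct", "LLaVA Instruct", "ShareGPT-4V"],
    },
}

# Stages run in order; the finetuning stage keeps the grounding tasks and
# adds instruction-following data on top.
for stage, tasks in TRAINING_STAGES.items():
    print(stage, "->", ", ".join(tasks))
```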