LlamaSeg: Image Segmentation via Autoregressive Mask Generation • Paper • 2505.19422 • Published May 26 • 3
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy • Paper • 2503.19757 • Published Mar 25 • 52
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models • Paper • 2503.14939 • Published Mar 19 • 5