Tony Zhao

tianchez

https://www.tianchez.com

AI & ML interests

Multimodal Agent, Generative AI

Recent Activity

upvoted a paper 10 days ago

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

submitted a paper 10 days ago

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

upvoted a collection about 2 months ago

Qwen3.6

Posts

Introducing VLM-R1!

GRPO has helped DeepSeek R1 learn to reason. Can it also help VLMs perform better on general computer vision tasks?

The answer is YES, and it generalizes better than SFT. We trained Qwen 2.5 VL 3B on RefCOCO (a visual grounding task) and evaluated it on RefCOCO Val and RefGTA (an OOD task).

https://github.com/om-ai-lab/VLM-R1
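
For readers new to GRPO, here is a rough sketch of the core idea applied to visual grounding. It is not taken from the VLM-R1 repo. It assumes the reward for each sampled answer is the IoU between the predicted box and the ground-truth box (the actual reward in VLM-R1 may differ, so check the repo), and it shows only the group-relative advantage step, not the full policy update. The function names `iou` and `group_relative_advantages` and the example boxes are made up for illustration.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    GRPO scores every sampled answer relative to the mean and standard
    deviation of the group, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: one ground-truth box and three sampled predictions.
ground_truth = (0, 0, 10, 10)
predictions = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]

# Assumed reward: IoU with the ground truth (an illustration, not the repo's code).
rewards = [iou(pred, ground_truth) for pred in predictions]
advantages = group_relative_advantages(rewards)

for pred, r, a in zip(predictions, rewards, advantages):
    print(pred, round(r, 3), round(a, 3))
```

Answers that beat the group average get a positive advantage and are reinforced; answers below it are pushed down. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.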