Process and answer questions about webpage videos
VLM-R1 model for Open-Vocabulary Object Detection
Mark regions in images based on text descriptions
Open Agent Leaderboard
Generate text from images and videos