Computer-Use Agent & Vision-Language Model & Vision-Action Model
Identify and mark the clickable elements in a screenshot that match a given query