Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs Paper • 2503.01307 • Published Mar 3, 2025 • 36
Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning Paper • 2411.19458 • Published Nov 29, 2024 • 6
Temporal Preference Optimization for Long-Form Video Understanding Paper • 2501.13919 • Published Jan 23, 2025 • 22
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Paper • 2501.07171 • Published Jan 13, 2025 • 56
Action Sensitivity Learning for Temporal Action Localization Paper • 2305.15701 • Published May 25, 2023
Whitening-based Contrastive Learning of Sentence Embeddings Paper • 2305.17746 • Published May 28, 2023
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models Paper • 2305.18010 • Published May 29, 2023
Describing Differences in Image Sets with Natural Language Paper • 2312.02974 • Published Dec 5, 2023 • 16
Clustering based Point Cloud Representation Learning for 3D Analysis Paper • 2307.14605 • Published Jul 27, 2023
JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery Paper • 2307.16377 • Published Jul 31, 2023
Bird's-Eye-View Scene Graph for Vision-Language Navigation Paper • 2308.04758 • Published Aug 9, 2023
VideoAgent: Long-form Video Understanding with Large Language Model as Agent Paper • 2403.10517 • Published Mar 15, 2024 • 36
Why are Visually-Grounded Language Models Bad at Image Classification? Paper • 2405.18415 • Published May 28, 2024
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published Dec 13, 2024 • 147
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision Paper • 2407.06189 • Published Jul 8, 2024 • 27
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation Paper • 2405.09546 • Published May 15, 2024 • 13