MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding Paper • 2502.03183 • Published Feb 5 • 2
When Less is Enough: Adaptive Token Reduction for Efficient Image Representation Paper • 2503.16660 • Published Mar 20 • 73
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 226
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation Paper • 2412.04445 • Published Dec 5, 2024 • 23
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation Paper • 2412.06531 • Published Dec 9, 2024 • 73
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published Sep 25, 2024 • 114
CLEAR: Character Unlearning in Textual and Visual Modalities Paper • 2410.18057 • Published Oct 23, 2024 • 210
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior Paper • 2410.21264 • Published Oct 28, 2024 • 9
Visual Context Window Extension: A New Perspective for Long Video Understanding Paper • 2409.20018 • Published Sep 30, 2024 • 11
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution Paper • 2409.12961 • Published Sep 19, 2024 • 26
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing Paper • 2409.01322 • Published Sep 2, 2024 • 97
CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis Paper • 2407.13301 • Published Jul 18, 2024 • 57
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding Paper • 2407.15754 • Published Jul 22, 2024 • 20