MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection Paper • 2404.04910 • Published Apr 7, 2024
DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models Paper • 2503.04240 • Published Mar 6
Science-T2I: Addressing Scientific Illusions in Image Synthesis Paper • 2504.13129 • Published Apr 17 • 3
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark Paper • 2504.14693 • Published Apr 20
EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments Paper • 2503.08604 • Published Mar 11
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model Paper • 2505.23606 • Published May 29 • 15
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? Paper • 2506.11928 • Published Jun 13 • 24
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action Paper • 2505.01583 • Published May 2 • 9
YuE: Scaling Open Foundation Models for Long-Form Music Generation Paper • 2503.08638 • Published Mar 11 • 70
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think Paper • 2502.20172 • Published Feb 27 • 28
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory Paper • 2411.11922 • Published Nov 18, 2024 • 19
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark Paper • 2410.03051 • Published Oct 4, 2024 • 6
Chasing Consistency in Text-to-3D Generation from a Single Image Paper • 2309.03599 • Published Sep 7, 2023 • 1
RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark Paper • 2407.13930 • Published Jul 18, 2024
StableVideo: Text-driven Consistency-aware Diffusion Video Editing Paper • 2308.09592 • Published Aug 18, 2023 • 2
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Paper • 2307.16449 • Published Jul 31, 2023 • 16