RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation Paper • 2506.18088 • Published 5 days ago • 16
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation Paper • 2506.18095 • Published 5 days ago • 59
view article Article Gemma 3n fully available in the open-source ecosystem! By ariG23498 and 7 others • 2 days ago • 70
3D Arena: An Open Platform for Generative 3D Evaluation Paper • 2506.18787 • Published 4 days ago • 11
OmniGen2: Exploration to Advanced Multimodal Generation Paper • 2506.18871 • Published 4 days ago • 66
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning Paper • 2506.18841 • Published 4 days ago • 48
AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models Paper • 2506.19851 • Published 3 days ago • 50
V-JEPA 2 Collection A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated 14 days ago • 128
view article Article Post-Training Isaac GR00T N1.5 for LeRobot SO-101 Arm By nvidia and 4 others • 16 days ago • 63
Open Whisper-style Speech Models (OWSM) Collection Fully open Whisper-style speech foundation models developed by CMU WAVLab: https://www.wavlab.org/activities/2024/owsm/ • 21 items • Updated 25 days ago • 6
view article Article SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data By danaaubakirova and 8 others • 25 days ago • 167
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning Paper • 2506.07044 • Published 20 days ago • 105
SpatialLM: Training Large Language Models for Structured Indoor Modeling Paper • 2506.07491 • Published 19 days ago • 38
BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation Paper • 2506.07530 • Published 19 days ago • 18
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers Paper • 2506.07986 • Published 18 days ago • 18
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Paper • 2506.01844 • Published 25 days ago • 103
EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge Paper • 2505.23009 • Published 30 days ago • 17
OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning Paper • 2506.00338 • Published 28 days ago • 9
view article Article nanoVLM: The simplest repository to train your VLM in pure PyTorch By ariG23498 and 6 others • May 21 • 173