Collection: RADIO • A collection of Foundation Vision Models that combine multiple models (CLIP, DINOv2, SAM, etc.) • 14 items • Updated about 9 hours ago
Article: SmolVLA: Efficient Vision-Language-Action Model trained on LeRobot Community Data • By danaaubakirova and 8 others • 4 days ago
Paper: BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset • arXiv:2505.09568 • Published 23 days ago
Article: The Transformers Library: standardizing model definitions • By lysandre and 3 others • 23 days ago
Paper: Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning • arXiv:2505.07263 • Published 26 days ago
Article: Finally, a Replacement for BERT: Introducing ModernBERT • By bclavie and 14 others • Dec 19, 2024
Paper: OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning • arXiv:2505.04601 • Published about 1 month ago
Article: Vision Language Models (Better, Faster, Stronger) • By merve and 4 others • 26 days ago
Paper: D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement • arXiv:2410.13842 • Published Oct 17, 2024
Collection: D-FINE • State-of-the-art real-time object detection models under the Apache 2.0 license • 15 items • Updated May 5
Collection: Gemma 3 QAT • Quantization-Aware Trained (QAT) Gemma 3 checkpoints that preserve quality comparable to half precision while using 3x less memory • 19 items • Updated Apr 18
Paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits • arXiv:2402.17764 • Published Feb 27, 2024
Article: MIEB: The Benchmark That Stress-Tests Image-Text Embeddings Like Never Before • By isaacchung and 2 others • Apr 24