SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published about 22 hours ago • 68
Scaling Pre-training to One Hundred Billion Data for Vision Language Models Paper • 2502.07617 • Published 10 days ago • 27
Article From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub 10 days ago • 48
Article From Llasa to Llasagna 🍕: Finetuning LLaSA to generates Italian speech and other languages By Steveeeeeeen and 1 other • 10 days ago • 22
DepthPro Models Collection Depth Pro: Sharp Monocular Metric Depth in Less Than a Second • 4 items • Updated 14 days ago • 7
Article Simplifying Alignment: From RLHF to Direct Preference Optimization (DPO) By ariG23498 • Jan 19 • 13
ViTPose Collection Collection for ViTPose models based on transformers implementation. • 10 items • Updated Jan 12 • 12
Segformer Collection Transformer-based semantic segmentation model by Nvidia • 15 items • Updated Jan 13 • 4
timm tiny test models Collection A collection of very small (~300-500k parameter) models at 160x160 resolution, for testing purposes. Trained on ImageNet-1k. • 13 items • Updated Oct 2, 2024 • 5
Article ColPali: Efficient Document Retrieval with Vision Language Models 👀 By manu • Jul 5, 2024 • 205
Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot Paper • 2402.14654 • Published Feb 22, 2024 • 2