view article Article retrain-pipelines and the almighty function-caller By Aurelien-Morgan • Apr 28 • 5
view article Article Consent by Design: Approaches to User Data in Open AI Ecosystems By giadap and 1 other • Apr 17 • 13
view article Article SigLIP 2: A better multilingual vision language encoder By ariG23498 and 2 others • Feb 21 • 165
view article Article π0 and π0-FAST: Vision-Language-Action Models for General Robot Control By danaaubakirova and 3 others • Feb 4 • 158
view article Article SmolVLM Grows Smaller – Introducing the 250M & 500M Models! By andito and 2 others • Jan 23 • 180
view article Article Introducing TextImage Augmentation for Document Images By danaaubakirova and 2 others • Aug 6, 2024 • 33
view article Article LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning? By danaaubakirova and 1 other • Jul 25, 2024 • 17
Searching for Better ViT Baselines Collection Exploring ViT hparams and model shapes for the GPU poor (between tiny and base). • 28 items • Updated Feb 14 • 18
MobileNetV4 pretrained weights Collection Weights for MobileNet-V4 pretrained in timm • 17 items • Updated Sep 22, 2024 • 18
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published Jun 17, 2024 • 21
What If We Recaption Billions of Web Images with LLaMA-3? Paper • 2406.08478 • Published Jun 12, 2024 • 42
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations Paper • 2405.18392 • Published May 28, 2024 • 12
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24, 2024 • 47
view article Article Multimodal Augmentation for Documents: Recovering “Comprehension” in “Reading and Comprehension” task By danaaubakirova • May 16, 2024 • 17