While a fix is being implemented (https://github.com/ggml-org/llama.cpp/pull/12957), I want to leave these models up for visibility and continued discussion, but I also want to prevent accidental downloads of known-broken models (even though there are runtime settings that can work around the issue for now).
With that goal, I've enabled access requests. I don't actually want your data, and I'm sorry, but I don't think there's a way around that. For now that's what I'm going to do, and I'll remove the gate once the fix is merged and verified and I've had a chance to re-convert and re-quantize!
- 🧠 Native Multimodality - Process text and images in a unified architecture
- Mixture-of-Experts - First Llama models using MoE for incredible efficiency
- Super Long Context - Up to 10M tokens
- Multilingual Power - Trained on 200 languages with 10x more multilingual tokens than Llama 3 (including over 100 languages with over 1 billion tokens each)
🔹 Llama 4 Scout
- 17B active parameters (109B total)
- 16-expert architecture
- 10M context window
- Fits on a single H100 GPU
- Beats Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1
🔹 Llama 4 Maverick
- 17B active parameters (400B total)
- 128-expert architecture
- Fits on a single DGX H100 node (8x H100)
- 1M context window
- Outperforms GPT-4o and Gemini 2.0 Flash
- ELO score of 1417 on LMArena, currently the second-best model on the arena
🔹 Llama 4 Behemoth (coming soon)
- 288B active parameters (2T total)
- 16-expert architecture
- Teacher model for Scout and Maverick
- Outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks
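The "active vs. total parameters" distinction above comes from the mixture-of-experts design: a router sends each token to only a few of the experts, so only a slice of the total weights runs per token. Here is a minimal, illustrative sketch of top-k MoE routing; the hidden size, `top_k`, and single-matrix "experts" are toy assumptions for clarity, not the actual Llama 4 configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts = 16   # Scout's expert count, from the card above
top_k = 1        # assumed number of experts activated per token
d_model = 64     # toy hidden size (real models are much larger)

# Router: a linear layer producing one score per expert per token.
router_w = rng.standard_normal((d_model, n_experts))

# Toy experts: one weight matrix each (real experts are full MLPs).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_w                              # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # chosen expert ids
    sel = np.take_along_axis(logits, top, axis=-1)     # their scores
    # Softmax over only the selected experts' scores.
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j, e in enumerate(top[t]):
            # Only these top_k experts' weights are touched for token t.
            out[t] += w[t, j] * (x[t] @ experts[e])
    return out, top

x = rng.standard_normal((4, d_model))
y, chosen = moe_forward(x)
print(y.shape, chosen.ravel())
```

Since each token touches only `top_k` of the `n_experts` expert weight sets, the compute per token scales with the active parameter count (17B) rather than the total (109B or 400B), which is why Scout can fit its inference budget on a single H100.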
✨ Visual Reasoning: Breaks down complex images step by step.
✨ Math & Science: Solves visual problems with high precision.
✨ Combines text & images for deeper understanding.