No one with a consumer-grade GPU (<32 GB VRAM) can run even the smaller Llama 4 model... 😓
Llama 4 is sadly VERY disappointing.
I wonder if it's possible to run only one or two experts without the full MoE... but the quality would probably take a huge hit... Maybe it's better to just use Gemma 3? 😅
Are you still using a consumer-grade GPU? That's llame... ~sarcasm
I'm still waiting for vLLM support to be merged, but the size and MoE design with only 17B active params sound amazing. A ~100B model that's actually useful and fast would be perfect.
How much VRAM is needed to run the fp16/bf16 model?
Actually, the cookbook says 4x80 GB VRAM.
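For anyone wondering where that number comes from, here's a back-of-envelope sketch (assumes Scout's ~109B total parameters; KV cache and activation overhead are ignored):

```python
# Back-of-envelope VRAM estimate for Llama 4 Scout in bf16.
# Assumption: ~109B total parameters (all experts must be resident, even
# though only ~17B are active per token); KV cache and activations add more.
total_params = 109e9
bytes_per_param = 2  # bf16 = 2 bytes per parameter
weights_gb = total_params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~218 GB -> hence 4x80 GB GPUs
```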
Let's wait until https://github.com/ggml-org/llama.cpp/issues/12774 is solved and a PR is merged to master, then try llama.cpp's --n-gpu-layers with the 4-bit quant https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit/tree/main, which is <60 GB. That should be doable if you have 24 GB VRAM + 64 GB RAM.
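If/when that lands, a minimal llama-cpp-python sketch of the partial-offload idea (the GGUF filename below is a placeholder, and this assumes Llama 4 GGUF support has actually been merged; tune n_gpu_layers to whatever fits in your 24 GB, the rest stays in system RAM):

```python
# Sketch only: assumes llama.cpp has merged Llama 4 GGUF support and you
# have a ~4-bit GGUF on disk (the filename below is hypothetical).
from llama_cpp import Llama

llm = Llama(
    model_path="llama-4-scout-17b-16e-instruct-q4.gguf",  # hypothetical file
    n_gpu_layers=20,  # offload as many layers as fit in 24 GB VRAM; rest in RAM
    n_ctx=8192,       # keep context modest to limit KV-cache memory
)
out = llm("Q: What is a mixture-of-experts model? A:", max_tokens=128)
print(out["choices"][0]["text"])
```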
GUYS JUST DON'T BE GPU POOR
I know right? Why don't poor people just buy more money?
Oh man, where is this world headed...
Why are they comparing this with Gemma 3 when the model is ~10x the size? You can say it's 17B active params... but when it comes to memory requirements, it's nowhere near that.
Because they can - there are no general regulations that would draw the line and say, "hey, you're comparing apples to oranges, that's okay, but please leave that elephant out of the ants' league."
Since it’s a MoE with only 17B active params, you can run it at reasonable speeds with just CPU/RAM. A 128GB RAM PC is cheaper than a 30GB VRAM GPU.
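A rough upper-bound estimate of why that works (a sketch assuming decode is memory-bandwidth bound; the ~4-bit size per parameter and the DDR5 bandwidth figure are ballpark assumptions):

```python
# Rough upper bound on CPU decode speed for a MoE, assuming each generated
# token only needs to read the ~17B active parameters from RAM.
active_params = 17e9
bytes_per_param = 0.56      # ~Q4_K-ish quantization, ballpark assumption
ram_bandwidth_gb_s = 80     # dual-channel DDR5, ballpark assumption

bytes_per_token = active_params * bytes_per_param
print(f"MoE (17B active): ~{ram_bandwidth_gb_s / (bytes_per_token / 1e9):.1f} tokens/s upper bound")

# Compare with a dense 109B model, where every token reads all ~109B params:
dense_bytes = 109e9 * bytes_per_param
print(f"dense 109B:       ~{ram_bandwidth_gb_s / (dense_bytes / 1e9):.1f} tokens/s upper bound")
```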
There's a limit to how much knowledge and how many languages you can cram into small models, so progress there would have been limited anyway; smaller models like Mistral/Gemma already cover that space. With GPUs in such high demand, models like this might actually be better suited for local use. For example, AMD's upcoming Strix Halo (up to 96GB of shared RAM) could make local inference with powerful models much more viable.
@UniversalLove333 First, like I always say, there is NO SUCH THING AS "NOT BEING ABLE TO RUN A MODEL"; it's always "doesn't run at a reasonable/usable speed". If you want, you can run DeepSeek R1 with all experts loaded on an Intel Celeron system with 2 gigabytes of memory by putting the model in swap or doing something like AirLLM. The catch is that it won't run at a usable speed, which is exactly what you'd point out, and which invalidates the "can't run" argument.

I was able to run it on a 3070 with 8GB VRAM and 32GB DDR4 (2x16GB) at Q2_K_XL [42.6GB]. It ran at 2.7 to 3.4 tokens per second, which is faster than Gemma 3 27B Q4 QAT [16GB] at 2.4 tokens per second, even though the Llama 4 model spilled into swap. It runs fast BECAUSE IT'S A MoE. I don't see "No one with a consumer grade GPU can run DeepSeek R1 😓😓😓" on the DeepSeek R1 discussion pages. Sure, Llama 4 was expected to come in smaller sizes, but most people like you only look at the total parameter count and never at the active parameter count. I was even able to get a UD Q3_K_XL of Llama 4 Scout running on my system at 2.1 tokens per second. These speeds may be slow, but they are fast compared to other models that don't get this hate.

Clearly you are just a Llama 4 hater [3 posts from you specifically targeting Llama 4 model discussion pages, rambling that it can't run on most systems, knowing full well you have never run a MoE in your life]. It doesn't matter to me whether 2.7 to 3.4 tokens per second is fast; what matters is the OP complaining that this model can't run when it runs better than other models. You may say "I was referring to FP16" or something like that, but that doesn't explain you liking GGUF repos. And GGUF doesn't mean the model is totally different; you can make/use FP16 GGUFs too.
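For reference, here's a back-of-envelope of why the Q2_K_XL run spills into swap on that box (a sketch; the OS/overhead figures are rough assumptions):

```python
# Rough fit check for the Q2_K_XL run: 42.6 GB of weights vs. 8 GB VRAM + 32 GB RAM.
model_gb = 42.6
vram_gb, ram_gb = 8, 32
usable_vram = vram_gb - 1   # assume ~1 GB reserved for context/compute buffers
usable_ram = ram_gb - 6     # assume ~6 GB for the OS and other processes
spill_gb = max(0.0, model_gb - usable_vram - usable_ram)
print(f"~{spill_gb:.1f} GB of the model ends up in swap")  # roughly 8-10 GB
```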
Let me correct you there: he's clearly a lover of everything, equally and indiscriminately, as suggested by his name ( @UniversalLove333 ), whereas you are just a LLaMA-lover...