No one with a consumer-grade GPU (<32 GB VRAM) can run even the smaller Llama 4 model... 😓
Llama 4 is sadly VERY disappointing.
I wonder if it's possible to run only one or two experts without the full MoE... but the quality would probably take a huge hit... Maybe it's better to just use Gemma 3? 😅
Are you still using a consumer-grade GPU? That's llame... ~sarcasm
I'm still waiting for vLLM support to be merged, but the size and MoE design with only 17B active params sound amazing. A ~100B model that's actually useful and fast would be perfect.
How much VRAM is needed to run the fp16/bf16 model?
Actually, the cookbook says 4x80 GB VRAM.
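For anyone wondering where that number comes from, here's a back-of-envelope sketch (assumes Scout's ~109B total parameters; KV cache and activation overhead are ignored):

```python
# Back-of-envelope VRAM estimate for Llama 4 Scout in bf16.
# Assumption: ~109B total parameters (all experts must be resident, even
# though only ~17B are active per token); KV cache and activations add more.
total_params = 109e9
bytes_per_param = 2  # bf16 = 2 bytes per parameter
weights_gb = total_params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~218 GB -> hence 4x80 GB GPUs
```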
Let's wait until https://github.com/ggml-org/llama.cpp/issues/12774 is solved and a PR is merged to master, then try llama.cpp's --n-gpu-layers with the 4-bit quant https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit/tree/main, which is <60 GB. That should be doable if you have 24 GB VRAM + 64 GB RAM.
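If/when that lands, a minimal llama-cpp-python sketch of the partial-offload idea (the GGUF filename below is a placeholder, and this assumes Llama 4 GGUF support has actually been merged; tune n_gpu_layers to whatever fits in your 24 GB, the rest stays in system RAM):

```python
# Sketch only: assumes llama.cpp has merged Llama 4 GGUF support and you
# have a ~4-bit GGUF on disk (the filename below is hypothetical).
from llama_cpp import Llama

llm = Llama(
    model_path="llama-4-scout-17b-16e-instruct-q4.gguf",  # hypothetical file
    n_gpu_layers=20,  # offload as many layers as fit in 24 GB VRAM; rest in RAM
    n_ctx=8192,       # keep context modest to limit KV-cache memory
)
out = llm("Q: What is a mixture-of-experts model? A:", max_tokens=128)
print(out["choices"][0]["text"])
```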
GUYS JUST DON'T BE GPU POOR
I know right? Why don't poor people just buy more money?
Oh man, where is this world headed...
Why are they comparing this with Gemma 3 when the model is ~10x the size? You can say it's 17B active params... but when it comes to memory requirements, it's nowhere near that.
Because they can - there are no general regulations that would draw the line and say, "hey, you're comparing apples to oranges, that's okay, but please leave that elephant out of the ants' league."
Since it’s a MoE with only 17B active params, you can run it at reasonable speeds with just CPU/RAM. A 128GB RAM PC is cheaper than a 30GB VRAM GPU.
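A rough upper-bound estimate of why that works (a sketch assuming decode is memory-bandwidth bound; the ~4-bit size per parameter and the DDR5 bandwidth figure are ballpark assumptions):

```python
# Rough upper bound on CPU decode speed for a MoE, assuming each generated
# token only needs to read the ~17B active parameters from RAM.
active_params = 17e9
bytes_per_param = 0.56      # ~Q4_K-ish quantization, ballpark assumption
ram_bandwidth_gb_s = 80     # dual-channel DDR5, ballpark assumption

bytes_per_token = active_params * bytes_per_param
print(f"MoE (17B active): ~{ram_bandwidth_gb_s / (bytes_per_token / 1e9):.1f} tokens/s upper bound")

# Compare with a dense 109B model, where every token reads all ~109B params:
dense_bytes = 109e9 * bytes_per_param
print(f"dense 109B:       ~{ram_bandwidth_gb_s / (dense_bytes / 1e9):.1f} tokens/s upper bound")
```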
There's a limit to how much knowledge and how many languages you can cram into small models, so progress there would have been limited anyway; smaller models like Mistral/Gemma already cover that space. With GPUs in such high demand, models like this might actually be better suited for local use. For example, AMD's upcoming Strix Halo (up to 96GB of shared RAM) could make local inference with powerful models much more viable.
@UniversalLove333 First, like I always say, there is NO SUCH THING AS "NOT BEING ABLE TO RUN A MODEL"; it's always "doesn't run at a reasonable/usable speed". If you want, you can run DeepSeek R1 with all experts loaded on an Intel Celeron system with 2 gigabytes of memory by putting the model in swap or doing something like AirLLM. The catch is that it won't run at a usable speed, which is exactly what you'd point out, and which invalidates the "can't run" argument.

I was able to run it on a 3070 with 8GB VRAM and 32GB DDR4 (2x16GB) at Q2_K_XL [42.6GB]. It ran at 2.7 to 3.4 tokens per second, which is faster than Gemma 3 27B Q4 QAT [16GB] at 2.4 tokens per second, even though the Llama 4 model spilled into swap. It runs fast BECAUSE IT'S A MoE. I don't see "No one with a consumer grade GPU can run DeepSeek R1 😓😓😓" on the DeepSeek R1 discussion pages. Sure, Llama 4 was expected to come in smaller sizes, but most people like you only look at the total parameter count and never at the active parameter count. I was even able to get a UD Q3_K_XL of Llama 4 Scout running on my system at 2.1 tokens per second. These speeds may be slow, but they are fast compared to other models that don't get this hate.

Clearly you are just a Llama 4 hater [3 posts from you specifically targeting Llama 4 model discussion pages, rambling that it can't run on most systems, knowing full well you have never run a MoE in your life]. It doesn't matter to me whether 2.7 to 3.4 tokens per second is fast; what matters is the OP complaining that this model can't run when it runs better than other models. You may say "I was referring to FP16" or something like that, but that doesn't explain you liking GGUF repos. And GGUF doesn't mean the model is totally different; you can make/use FP16 GGUFs too.
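For reference, here's a back-of-envelope of why the Q2_K_XL run spills into swap on that box (a sketch; the OS/overhead figures are rough assumptions):

```python
# Rough fit check for the Q2_K_XL run: 42.6 GB of weights vs. 8 GB VRAM + 32 GB RAM.
model_gb = 42.6
vram_gb, ram_gb = 8, 32
usable_vram = vram_gb - 1   # assume ~1 GB reserved for context/compute buffers
usable_ram = ram_gb - 6     # assume ~6 GB for the OS and other processes
spill_gb = max(0.0, model_gb - usable_vram - usable_ram)
print(f"~{spill_gb:.1f} GB of the model ends up in swap")  # roughly 8-10 GB
```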
Let me correct you there: he's clearly a lover of everything, equally and indiscriminately, as suggested by his name ( @UniversalLove333 ), whereas you are just a LLaMA-lover...