minimum vram?

#9
by CHNtentes - opened

I'm not very familiar with MoE models. Does it require 685 GB or 37 GB of VRAM?

You need 10x A100.

@CHNtentes it needs about 1 TB of VRAM

What if you have a single GPU with 48 GB of VRAM and 1 TB of ordinary system RAM? Someone told me that it's possible to split the layers so that only the active experts (about 37 GB at Q8) are in VRAM at any given time, and the rest stays in system RAM...

I have no doubt this is possible, but would the performance be even close to usable?
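A rough way to reason about the speed question is a back-of-the-envelope bound: if the active weights have to be read from memory once per generated token, throughput cannot exceed memory bandwidth divided by the bytes read per token. The sketch below assumes about 37B active parameters (from the 37 GB at Q8 figure above) and uses purely hypothetical bandwidth numbers; it ignores the KV cache, compute time, and PCIe transfers, so real speeds would be lower. Note also that in a MoE the selected experts change per token and per layer, so the active set is not a fixed 37 GB chunk that can simply sit in VRAM.

```python
def max_tokens_per_second(active_params: float,
                          bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 37e9  # assumption: ~37B parameters active per token

# Hypothetical bandwidth figures, for illustration only (not measured).
for label, bandwidth in [("system RAM, assumed 100 GB/s", 100.0),
                         ("GPU VRAM, assumed 900 GB/s", 900.0)]:
    bound = max_tokens_per_second(ACTIVE_PARAMS, 8.0, bandwidth)
    print(f"{label}: at most {bound:.1f} tokens/s at 8 bits per weight")
```

Halving the bits per weight doubles the bound, which is one reason lower-bit quantizations are attractive when most of the model lives in system RAM.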

You could try vLLM, as it supports CPU offloading with
--cpu-offload-gb 900
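For anyone who wants to script this, here is a minimal sketch that only assembles the command line and prints it, without launching anything. The model ID deepseek-ai/DeepSeek-V3 and the helper name are assumptions for illustration; `vllm serve` and `--cpu-offload-gb` are the real vLLM CLI entry point and flag. As far as I understand the vLLM docs, the value is the amount of CPU memory (in GiB) to use for offloading per GPU, so whether 900 is appropriate depends on your setup, and nothing here guarantees it will work for this model.

```python
import shlex
from typing import List


def build_vllm_serve_command(model: str, cpu_offload_gb: int) -> List[str]:
    """Hypothetical helper: build (but do not run) a vLLM serve command."""
    return ["vllm", "serve", model, "--cpu-offload-gb", str(cpu_offload_gb)]


# Assumption: the model ID below; substitute whichever checkpoint you use.
cmd = build_vllm_serve_command("deepseek-ai/DeepSeek-V3", 900)
print(shlex.join(cmd))
```

Printing the command first makes it easy to check the flags before committing to a long model load.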

Is it feasible to run this on only 160 GB of VRAM with the right quantization?

I mean, anything can theoretically be run anywhere if you quantize it enough. About 4 bpw (Q4) is usually considered the minimum to retain good quality. For DeepSeek-V3, that would come to around 380 GB of VRAM (with a small context size). Once/if llama.cpp/GGUF supports it, we can offload some layers to CPU RAM; being a MoE, it has the benefit of still maintaining decent speed even when partly in RAM.

So I would say a total of 400 GB of VRAM+RAM would be necessary, and the higher the proportion of VRAM, the better.
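The arithmetic behind these figures can be sketched as weights-only memory = parameter count × bits per weight / 8. The snippet below assumes 671B total parameters (the model size mentioned in this thread) and treats the bits-per-weight values as rough assumptions, since real GGUF quants mix precisions and the KV cache and runtime buffers come on top. At an assumed 4.5 bpw this gives about 377 GB, in line with the ~380 GB estimate.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 671e9  # total parameter count quoted in this thread

# Bits-per-weight values are illustrative assumptions, not exact quant specs.
for label, bpw in [("Q4-class, ~4.5 bpw", 4.5),
                   ("Q5-class, ~5.5 bpw", 5.5),
                   ("Q8-class, 8 bpw", 8.0)]:
    size = weight_memory_gb(TOTAL_PARAMS, bpw)
    print(f"{label}: about {size:.0f} GB for weights only")
```

Whatever the result, leave headroom for context: the total of VRAM plus RAM needs to exceed the weights-only number.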

I tried vLLM with --cpu-offload-gb 900, but it does not work!

Wait, so this can't be run locally on a regular consumer GPU?

I was thinking the same thing. I have just one (pretty decent) GPU. We will see, I guess. Maybe use one of the GGUF quantized versions: https://huggingface.co/bullerwins/DeepSeek-V3-GGUF

But in general I'm afraid this will not work; the 671B model size scared me quite a bit. I'm still in shock.

I'm running the Q5_K_M GGUF on CPU right now, on 10-year-old server hardware, and it uses exactly 502 GB of RAM. A GPU can help through offloading. Consumer motherboards are the main barrier to access.
