THE THREAD OF DOOM

#12
by jukofyork - opened

Just realised I deleted the old "thread of doom" as it was attached to the earliest alpha version of the control vectors :(

jukofyork pinned discussion

Okay, I was wondering if we crossed some sort of line.

Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...

Yeah, it's a pity it got deleted (I should have checked more carefully what was linked), but it was getting a bit out of hand with all that scrolling so perhaps not such a bad thing.

I'm just gonna keep up the models that people have downloaded the most and get rid of all the "experimental, but likely broken" stuff with 15 downloads as they really weren't serving much of a purpose.

Also, all the old versions of the control vectors were vastly inferior to the final version due to me figuring out how to get them working as I went along, so it's probably better to just keep up the final v3.0 ones to avoid a lot of the confusion.


It looks a lot more like I'm just uploading quality models that people like/use now at least... The creative-writer-v0.1-35b and creative-writer-v0.2-35b models will be going as soon as I get the v1.0 version uploaded, and possibly Dusk-Miqu-70B if they do set a hard limit (I still think Dark-Miqu-70B is worth keeping whatever, though).


Also, if anybody really misses any of the models I have uploaded, I can in theory recreate them and upload a LoRA created from the delta using extract_lora.py, but I strongly suspect that for most of the models nobody will even notice they have gone... Of all the models I have created, I've only ever used Dark-Miqu-70B myself!
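
For anyone wondering what that involves: the rough idea is just to take the delta between the fine-tuned and base weights and keep a low-rank SVD approximation of it. A toy torch sketch of the idea (not the actual extract_lora.py, and the shapes are made up):

```python
import torch

def extract_lora_from_delta(w_base: torch.Tensor, w_tuned: torch.Tensor, rank: int = 32):
    """Approximate (w_tuned - w_base) with a rank-`rank` LoRA pair so delta ~= B @ A."""
    delta = w_tuned - w_base
    # Truncated SVD gives the best rank-r approximation of the delta.
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    a = torch.diag(s[:rank].sqrt()) @ vh[:rank]     # LoRA "A": (rank, in_features)
    b = u[:, :rank] @ torch.diag(s[:rank].sqrt())   # LoRA "B": (out_features, rank)
    return a, b

# Toy check with a genuinely low-rank "fine-tuning" delta:
base = torch.randn(1024, 1024)
tuned = base + 0.01 * (torch.randn(1024, 32) @ torch.randn(32, 1024))
a, b = extract_lora_from_delta(base, tuned, rank=32)
print(torch.dist(b @ a, tuned - base))  # reconstruction error should be ~0
```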

:( Damn there was some good info in that thread.

If you've still got Firefox tabs open somewhere, you'll be able to save some of the thread.

Unfortunately, I cleaned my browser tabs up about an hour ago.

And yeah, if people were using it as free cloud storage then it makes sense. I just think they could have gone about it better, rather than having us wake up and see the limit.

I'm curious, did your quota drop after deleting that? I wonder if all the PNG files attached there were "billed" to you.

@jukofyork I think you're good man. If they start enforcing it, you'll get an exemption for sure.

I come across your contributions randomly all over the place, even on github repos like some fine tuning tool lol

I should probably deduplicate my quants. Often I was making one because I couldn't find what I was looking for, then it would turn out a few of us just happened to be making them at the same time. Then I started getting requests, so I just decided I would make a bunch. Need a Huggingverse quant global dedupe...

What happens if you edit the config to "softmax"? If it runs and is a bit broken then it's maybe worth asking them about it, but if it's another silly bug/oversight then I'd just give up, as they obviously haven't even tried it.

Error: unknown variant `noaux_tc`, expected `greedy` or `group_limited_greedy` at line 64 column 27

After changing that, it goes on to tell me that I don't have enough RAM; it says I need 2TB for some reason.

Obviously he never tested any of it then lol

(Sorry for even suggesting this)

There is an experimental PR by the person who wrote the llama.cpp code for deepseek here:

https://github.com/fairydreaming/llama.cpp/tree/deepseek2-mla-exp

No idea if it works though.

I would hold off and keep an eye on this thread:

https://github.com/ggerganov/llama.cpp/pull/11049#issuecomment-2612910496

It looks like there is a much more efficient way of calculating MLA that deepseek mentioned in their original (v2) paper, but for some reason never implemented in their own reference python code!

If fairydreaming can get it working then this might open up an alternative to KV-cache quantisation for all models (see my post further down).
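
To give an idea of why MLA matters for the cache: you only have to store the compressed latent (plus the small decoupled RoPE key) per token, rather than full per-head K/V. A rough calculation using what I believe are deepseek-v3's dimensions (treat them as assumptions and check config.json for the real values):

```python
# Rough per-token KV-cache size: naive full-MHA cache vs MLA latent cache.
# The dimensions below are assumptions based on deepseek-v3's published config.
n_layers  = 61
n_heads   = 128
head_dim  = 128    # qk_nope_head_dim / v_head_dim
kv_rank   = 512    # kv_lora_rank (compressed KV latent width)
rope_dim  = 64     # qk_rope_head_dim (decoupled RoPE key, shared across heads)
bytes_per = 2      # fp16/bf16

naive_kv = n_layers * n_heads * head_dim * 2 * bytes_per   # full K and V for every head
mla_kv   = n_layers * (kv_rank + rope_dim) * bytes_per     # latent + RoPE key only

print(f"naive cache: {naive_kv / 1e6:.1f} MB per token")
print(f"MLA cache  : {mla_kv / 1e6:.2f} MB per token")
print(f"ratio      : ~{naive_kv / mla_kv:.0f}x smaller")
```

So even before any quantisation the cache ends up ~50-60x smaller per token than a naive MHA cache, which is why it could be an interesting alternative to (or addition on top of) KV-cache quantisation.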

This is an interesting PR too:

https://github.com/ggerganov/llama.cpp/pull/11397

I'm away from home ATM and my machines with the A6000s are all off, but if this PR works then I think I should be able to get a "usable" tokens/s on my setup:

  • The non-shared experts' up/gate/down projection matrices in RAM using the lowest possible quant that works (e.g. q2_k for up/gate and q4_k for down, etc.).
  • Everything else in VRAM using q8_0.

If we could get another PR that divides the experts' tensors between NUMA nodes then I think this would be 95% of what the old/defunct KTransformers did.
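
Some very rough numbers for that split, assuming ~671B total parameters and the ~620B of non-shared expert weights I use in the estimate further down (both approximate):

```python
# Back-of-envelope split for "routed experts in RAM, everything else in VRAM".
# 671B total and ~620B of non-shared expert weights are approximate figures.
total_params  = 671e9
expert_params = 620e9                           # non-shared (routed) experts -> system RAM
other_params  = total_params - expert_params    # attention, shared expert, embeddings -> VRAM

q8_0_bpw = 8.5   # ~bits per weight for q8_0
mix_bpw  = 3.2   # ~q4_k down_proj + q2_k up/gate_proj mix (see the estimate below)

print(f"VRAM needed: ~{other_params * q8_0_bpw / 8 / 1e9:.0f} GB (q8_0 non-expert weights)")
print(f"RAM needed : ~{expert_params * mix_bpw / 8 / 1e9:.0f} GB (low-bit routed experts)")
```

So the non-expert weights should fit comfortably across a couple of 48GB cards at q8_0, with the routed experts sitting in system RAM.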

I wonder if it could be used on something like Mistral small for people with 24gb VRAM?

It will only really benefit Mixture of Experts models, due to the "experts" only being used a fraction of the time (deepseek-v3/r1 only activate 8 of 256 routed experts, i.e. 1/32 ≈ 3%, for each token).

Mistral small for people with 24gb VRAM

Mistral-Small fits into 24GB of VRAM really well already, so I'm guessing you meant Mixtral?
I'm no expert and still digesting the Deepseek architecture, but I don't think there would be the same benefit with Mixtral's MoE architecture.
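
A quick comparison of how much of the routed weights actually gets touched per token (expert counts are from the published configs, so worth double-checking):

```python
# Fraction of routed-expert weights actually touched per token.
configs = {
    "deepseek-v3/r1": {"routed_experts": 256, "active_per_token": 8},  # 1/32 ~= 3%
    "mixtral-8x22b":  {"routed_experts": 8,   "active_per_token": 2},  # 1/4   = 25%
}

for name, c in configs.items():
    frac = c["active_per_token"] / c["routed_experts"]
    print(f"{name:15}: {frac:.1%} of routed experts active per token")
```

With a quarter of its experts hit on every token, Mixtral would need roughly 8x the CPU memory bandwidth per token compared to deepseek's ~3%, so streaming the experts from system RAM stops being nearly such a good trade.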

"usable" tokens/s on my setup

What's a usable tokens/s to you?

I'm glad people are working on this, as I'd have to buy at least another 2 GPUs and replace my motherboard to fully offload this to VRAM locally.

Sorry, I meant Mistral Large, but getting WizardLM 8x22b running well on 24GB VRAM would be awesome! I can't see me being able to run the massive version of Deepseek any time soon.

What's a usable tokens/s to you?

I think around 9-10 tokens per second (16k tokens generated for a query and a 30-minute wait), but probably 4-5 tokens per second would still just about be usable for applied maths questions (it could easily take an hour to "digest" the reply and form a follow-up question).

I still feel that llama.cpp could likely be improved for NUMA, but the ability to place the non-shared experts in CPU memory is probably the biggest gain:

  • My old E5 v4 machines have ~78GB/s memory bandwidth (per NUMA node).
  • The A6000s have 768GB/s memory bandwidth.
  • The experts only activate ~3% of the time (1/32).

IIRC, the non-shared experts make up around 620B parameters, so using, say, q4_0 or q4_k (~4.5 bits per parameter):

(4.5/8)*620/32 = ~11GB per token (for the non-shared experts part)

78/11 = 7 tokens per second for 1 NUMA node

Badly optimised NUMA layout is at best 1.5x this, but well optimised NUMA layout would be close to 2x this.

Using a mix of q4_k for down_proj and q2_k for up_proj and gate_proj:

(4.5 + 2*2.5)/3 ≈ 3.2 bits per parameter

so could potentially increase this by ~1.5x.

There is still the attention computation for the A6000s to perform, but this shouldn't be much worse than the 70B/104B/123B models we're already using.

So I'm quietly hopeful somewhere between 4-5 tokens per second is eventually realistic, and with better NUMA layout 9-10 tokens per second.
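
Putting the estimate above in one place (same assumptions: ~620B of routed expert weights, 1/32 active per token, ~78GB/s per NUMA node):

```python
# Reproduces the back-of-envelope tokens/s estimate above.
expert_params   = 620e9    # non-shared (routed) expert parameters
active_fraction = 1 / 32   # fraction of routed experts hit per token (~3%)
node_bw         = 78e9     # E5 v4 memory bandwidth per NUMA node (bytes/s)

def tokens_per_second(bits_per_weight, numa_factor=1.0):
    bytes_per_token = expert_params * active_fraction * bits_per_weight / 8
    return node_bw * numa_factor / bytes_per_token

print(f"q4_k-ish (4.5 bpw), 1 node       : {tokens_per_second(4.5):.1f} t/s")
print(f"q4_k-ish (4.5 bpw), good NUMA x2 : {tokens_per_second(4.5, 2.0):.1f} t/s")
print(f"mixed ~3.2 bpw, 1 node           : {tokens_per_second(3.2):.1f} t/s")
print(f"mixed ~3.2 bpw, good NUMA x2     : {tokens_per_second(3.2, 2.0):.1f} t/s")
```

These are just memory-bandwidth ceilings; the attention work on the A6000s and general overhead will eat into them, which is where the more conservative 4-5 and 9-10 tokens per second figures come from.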


When I get back I'm going to try vLLM over the 3 machines, and possibly even try using it with DeepSpeed if all else fails, but I'm not sure pipeline-parallel will gain a huge amount unless you can offload the whole model to VRAM somehow (my office's air-conditioning unit can't take another pair of A6000s, so that's not really an option).
