Great Job getting these out!

#1
by ubergarm - opened

Heya bullerwins, thanks for getting this out early. I've been experimenting with quantizing this dense 72B using the ik_llama.cpp fork.

It was helpful for me to see how you treated that pesky ffn_down tensor, whose column size isn't divisible by 256 :oof:...

If you're interested, there's some discussion with ik on the matter (and on dense vs. MoE quantization in general) here.

I haven't released any GGUFs yet, still trying to find a mix with which I'm happy haha...

Cheers!

Seems like mainline llama.cpp just pads the row with whatever it needs to fill out the blocks; not sure how efficient that is:

29 568 / 256 = 115 full blocks  (115 × 256 = 29 440)
remainder: 128 elements (padded up to a full 256-element block)
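Just to make the arithmetic concrete, here's a rough sketch of it in Python (the helper name is made up, not actual llama.cpp code; 256 is the k-quant super-block size):

```python
# Rough sketch of the padding arithmetic above. 256 is the k-quant
# super-block size; the helper itself is illustrative, not llama.cpp code.
def padded_row_size(n_per_row: int, block: int = 256):
    full, rem = divmod(n_per_row, block)
    blocks_stored = full + (1 if rem else 0)
    return full, rem, blocks_stored * block

full, rem, padded = padded_row_size(29_568)
print(full, rem, padded)  # 115, 128, 29696 -> 115 full blocks, 128 left over, padded to 29 696
```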

Oh interesting. I thought maybe you had specifically chosen Q8_0 for all the ffn_down layers because I assumed the _0-style quants work with these non-256-divisible column sizes.

Thanks for the note, so many subtle details going on!
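For reference, this is the quick divisibility check I had in mind (just a rough sketch, not llama.cpp code; it assumes the usual ggml block sizes of 32 for the _0 quants and 256 for the k-quants):

```python
# Row size of that ffn_down tensor vs. the two ggml block sizes:
# _0 quants use 32-element blocks, k-quants use 256-element super-blocks.
n_per_row = 29_568
for name, block in [("Q8_0-style (block 32)", 32), ("k-quants (block 256)", 256)]:
    q, r = divmod(n_per_row, block)
    print(f"{name}: {q} blocks, remainder {r}")
# Q8_0-style (block 32): 924 blocks, remainder 0    -> divides evenly, no padding
# k-quants (block 256):  115 blocks, remainder 128  -> needs padding
```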
