Great job getting these out!
Heya bullerwins, thanks for getting this out early. I've been experimenting with quantizing this dense 72B using the ik_llama.cpp fork.
It was helpful for me to see how you treated that pesky ffn_down,
which has a column size not divisible by 256 :oof:...
If you're interested, there is some discussion with ik on the matter (and on dense vs MoE in general) here.
I haven't released any GGUFs yet, still trying to find a mix with which I'm happy haha...
Cheers!
Seems like mainline llama.cpp just pads the row with whatever it needs to fill out the blocks; not sure how efficient that is:
29568 / 256 = 115 full blocks (115 × 256 = 29440)
remainder: 128 elements (padded up to 256)
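To spell that arithmetic out, here's a quick back-of-the-envelope sketch (just an illustration, not llama.cpp's actual padding code):

```python
# Padding arithmetic for a row that doesn't divide evenly into 256-element blocks.
QK_K = 256    # k-quant super-block size in ggml
row = 29568   # ffn_down column size for this 72B

full_blocks, remainder = divmod(row, QK_K)               # 115 full blocks, 128 left over
padded = (full_blocks + (1 if remainder else 0)) * QK_K  # 116 * 256 = 29696 after padding

print(full_blocks, remainder, padded)  # 115 128 29696
```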
Oh interesting. I thought maybe you specifically chose `Q8_0` for all `ffn_down` layers because I assumed the `_0` style quants work with these non-256-divisible column sizes.
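For anyone following along, a rough check of that assumption (block sizes as defined in ggml, numbers from this thread; a sketch, not library code):

```python
# Why _0 quants fit this row while k-quants need padding.
QK_0 = 32     # block size for the _0 / _1 style quants (e.g. Q8_0, Q4_0)
QK_K = 256    # super-block size for the k-quants (Q4_K, Q6_K, ...)

row = 29568   # the ffn_down column size discussed above
print(row % QK_0)  # 0   -> splits cleanly into 32-element blocks, so Q8_0 fits as-is
print(row % QK_K)  # 128 -> 128 leftover elements, hence the padding above
```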
Thanks for the note, so many subtle details going on!