Slow prompt processing

#2
by OrangeApples - opened

Hi @dranger003! Thanks for the quants. Is the prompt processing of the IQ2_XS supposed to be this slow (9.83 T/s) even when the model is fully offloaded? I'm using the latest Nexesenex fork of KCPP.

@OrangeApples I guess it really depends on your GPU, but I think 10 t/s is around what I would expect on a 3090. Also note that IQ2_XXS is noticeably faster than IQ2_XS, and quality isn't degraded relative to IQ2_XS (at least not that I could notice).

Thanks! Yes, I'm using a 3090. Will give IQ2_XXS a shot as well.

Edit: Turns out my prompt processing was spilling over into system RAM. After reducing the context from 12k to 10k, I got 227 T/s for PP.
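
For anyone hitting the same thing, here is a rough sketch of why a smaller context can keep everything in VRAM: the KV cache grows linearly with context length. The model dimensions below (`N_LAYERS`, `N_KV_HEADS`, `HEAD_DIM`) are hypothetical placeholders, not the actual values for this model, and the estimate assumes an unquantized fp16 cache and ignores compute buffers and the weights themselves.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size for a standard transformer.

    One K and one V vector per layer, per KV head, per token:
    2 * n_layers * n_kv_heads * head_dim * bytes_per_elem bytes per token.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx


# Hypothetical dimensions, for illustration only (NOT read from the model file).
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128

GIB = 2 ** 30

for n_ctx in (12 * 1024, 10 * 1024):
    size = kv_cache_bytes(n_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"context {n_ctx:>6}: ~{size / GIB:.3f} GiB of KV cache")
```

With these placeholder values, going from 12k to 10k of context frees about 0.625 GiB, which can be enough to stop the driver from spilling into system RAM on a card that was right at its limit. Plug in your model's real dimensions to get a meaningful number.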

OrangeApples changed discussion status to closed
