What quant is best?

#1
by jukofyork - opened

I'd be interested in hearing which quant people have found works best for them.

For CUDA, IQ4_XS seems to give the highest throughput and lowest latency, but I wonder if a higher-bitrate quant, with slightly more latency but higher accuracy, might help the dynamic --draft-p-min option in llama.cpp?
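To make the trade-off a bit more concrete, here is a rough back-of-envelope sketch in Python. It uses the standard expected-speedup estimate for speculative decoding with a fixed draft length, which is a simplification: the dynamic --draft-p-min behaviour in llama.cpp varies the draft length, so this is only indicative. The function name `expected_speedup`, its parameters, and the numbers in the usage lines are all hypothetical placeholders, not measurements of any particular quant.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimate speculative-decoding speedup over plain decoding.

    Assumptions (all simplifications, not llama.cpp internals):
      alpha      -- probability that each drafted token is accepted,
                    treated as independent and identical per token
      gamma      -- fixed number of tokens drafted per step
      cost_ratio -- time of one draft-model forward pass divided by
                    the time of one target-model forward pass
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0 or cost_ratio < 0:
        raise ValueError("gamma and cost_ratio must be non-negative")

    # Expected tokens produced per step: 1 + alpha + ... + alpha**gamma.
    if alpha == 1.0:
        tokens_per_step = float(gamma + 1)
    else:
        tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Cost per step, in units of one target pass: gamma draft passes
    # plus one target pass to verify them.
    step_cost = gamma * cost_ratio + 1.0
    return tokens_per_step / step_cost


if __name__ == "__main__":
    # Made-up illustrative values: a faster, less accurate draft quant
    # versus a slower, more accurate one.
    fast = expected_speedup(alpha=0.70, gamma=8, cost_ratio=0.05)
    accurate = expected_speedup(alpha=0.75, gamma=8, cost_ratio=0.07)
    print(f"fast draft: {fast:.2f}x, accurate draft: {accurate:.2f}x")
```

Plugging in a measured acceptance rate and per-token draft time for each quant would show whether the extra accuracy pays for the extra latency on a given setup.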
