What quant is best?
#1 by jukofyork - opened
I'd be interested in hearing what people's findings are on the quant that seems to work best for them?
For CUDA, IQ4_XS seems to give the highest throughput and lowest latency, but I wonder whether a higher-bitrate quant with slightly more latency and higher accuracy might help the dynamic --draft-p-min option in llama.cpp?
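
To make the trade-off concrete, here is a rough sketch using the standard fixed-draft-length estimate for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and it does not model the dynamic probability cutoff at all. The function name and every number below are hypothetical and not measurements from any real quant.

```python
def expected_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Estimate the speedup of speculative decoding over plain decoding.

    accept_rate: chance that each drafted token is accepted (assumed i.i.d.)
    draft_len:   number of tokens drafted per verification step
    draft_cost:  time for one draft token relative to one target-model pass
    """
    if accept_rate >= 1.0:
        # Every drafted token is accepted, plus the token from verification.
        tokens_per_step = draft_len + 1
    else:
        # Geometric series: 1 + a + a^2 + ... + a^draft_len
        tokens_per_step = (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)
    # Each step costs draft_len draft passes plus one target pass.
    return tokens_per_step / (draft_len * draft_cost + 1)


# Hypothetical comparison: a faster, less accurate draft quant versus a
# slower, more accurate one. These values are made up for illustration.
fast = expected_speedup(accept_rate=0.70, draft_len=5, draft_cost=0.05)
accurate = expected_speedup(accept_rate=0.75, draft_len=5, draft_cost=0.06)
print(f"fast quant: {fast:.2f}x, accurate quant: {accurate:.2f}x")
```

Under this model, the higher-bitrate quant only wins if its gain in acceptance rate outweighs its extra per-token cost, so real acceptance-rate and tokens-per-second measurements would be needed to settle it.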