Script used? #1
opened by Downtown-Case
Would you upload the modified exllamav2 python files you used? I think I get the gist of what you changed in the code, but it would be convenient to have it :)
Also, I like your strategy. In my experience, messing with the measurement phase (even just extending its context length) degrades the quantization, but extending the row length of the second pass helps, and maxing out the first/last layers is a good idea going by llama.cpp's findings.
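For concreteness, here is a minimal sketch of that kind of invocation, not the actual script from this repo: the model paths and the 8192-token length are placeholders, and the flag names are exllamav2's documented convert.py options.

```python
# Hypothetical driver for exllamav2's convert.py illustrating the strategy
# above: leave the measurement pass at its defaults and only extend the row
# length of the second (quantization) pass via -l.
import subprocess

cmd = [
    "python", "exllamav2/convert.py",
    "-i", "Mistral-Small-24B-Instruct-2501",  # source HF model dir (placeholder)
    "-o", "work",                             # scratch/working directory
    "-cf", "Mistral-Small-24B_exl2_6.5bpw",   # compiled output folder
    "-b", "6.5",                              # target bits per weight
    "-hb", "8",                               # head layer bits
    "-l", "8192",                             # longer rows for the second pass only
    # -ml (measurement length) is deliberately left at its default, since
    # extending the measurement pass hurt quality in practice.
]
subprocess.run(cmd, check=True)
```

The first/last-layer bump isn't exposed as a convert.py flag; it's the kind of edit to exllamav2's python files that the linked code fragments show.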
I posted the code fragments in a discussion on another of my models: https://huggingface.co/DeusImperator/Mistral-Small-24B-Instruct-2501_exl2_6.5bpw_L/discussions/1#679f9b0efb15b4e60ae76050
Thanks!
Downtown-Case changed discussion status to closed