PPL Chart?
Hey, so a brief look at the PPL vs. bpw chart for Llama 70B shows that the most dramatic difference between EXL2 and EXL3 is at the lower end, with bpw < 4. While EXL3 is significantly more efficient overall, I'm very curious how much of an improvement it is for Mistral Large in particular, because the current sweet spot is 2.7-2.85 bpw for 2x 3090s (depending on context), and that's where the disparity is most likely to be felt when updating.
Any chance you could do a similar chart for 123B? It's a bit of work, but I think it presents a very exciting opportunity to 'upgrade' from 70B models to 123B models by default, depending on the results.
Added it now
Can we get an exl3 version of the legendary Mistral Large 2407? 2411 is kinda lacking at some ... stuff.
That isn't an apples-to-apples comparison. Perplexity tests don't just measure the model; they measure the model together with a particular way of preparing and slicing a particular dataset. That's why I went to the trouble of creating a tool that tests both llama.cpp and ExLlama quants (as well as EXL2, AQLM, AWQ etc.) using the exact same sample of wikitext, sliced and tokenized in the exact same way, with the same logic for the perplexity calculation.
What I'm testing (rough sketch after the list):
- wikitext/wikitext2-raw-v1 test split, from `datasets`
- all `text` fields concatenated, separated by `"\n\n"`
- tokenized as one string
- sliding windows of 2048 tokens with a 512-token stride
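Not the actual tool, just a minimal sketch of that procedure using HuggingFace `transformers`/`datasets`; the model id is a placeholder, and exactly which tokens in each overlapping window get scored is an assumption on my part:

```python
# Illustrative sketch only -- approximates the procedure described above.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder model id (assumption)
WINDOW, STRIDE = 2048, 512

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# wikitext2-raw-v1 test split, all `text` fields joined with "\n\n",
# tokenized as one long string
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = "\n\n".join(ds["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[0]

nll_sum, token_count = 0.0, 0
with torch.no_grad():
    for start in range(0, ids.numel() - WINDOW, STRIDE):
        window = ids[start : start + WINDOW].unsqueeze(0).to(model.device)
        # Score every position in the window (assumption -- the real tool may
        # handle the overlap between consecutive windows differently).
        out = model(window, labels=window)
        nll_sum += out.loss.item() * (WINDOW - 1)
        token_count += WINDOW - 1

print("perplexity:", math.exp(nll_sum / token_count))
```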
I haven't fully reverse engineered lcpp's test, but at a glance it's an entirely different test:
- uses `wiki.test.raw` instead; not sure how equivalent that is to what's provided by `datasets`
- window length is apparently 512; not sure about the stride
- only evaluates tokens with a warm context, i.e. each window ignores the early tokens with the highest uncertainty (see the sketch after this list)
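For contrast, here's what a warm-context scheme along those lines could look like, reusing `model` and `ids` from the sketch above. This is purely illustrative, since I haven't verified how lcpp actually does it; the non-overlapping chunks and the `warmup` length are both assumptions:

```python
# NOT a claim about llama.cpp's actual implementation -- just a sketch of a
# scheme that skips the cold-context tokens at the start of each window.
import math
import torch

def warm_context_ppl(model, ids, window=512, warmup=256):
    """Score only tokens that already have at least `warmup` tokens of context."""
    nll_sum, count = 0.0, 0
    with torch.no_grad():
        for start in range(0, ids.numel() - window, window):  # non-overlapping (assumption)
            chunk = ids[start : start + window].unsqueeze(0).to(model.device)
            logits = model(chunk).logits[0]
            # Predictions for positions warmup..window-1; targets shifted by one.
            logprobs = torch.log_softmax(logits[warmup - 1 : -1].float(), dim=-1)
            targets = chunk[0, warmup:]
            nll_sum += -logprobs.gather(1, targets.unsqueeze(1)).sum().item()
            count += targets.numel()
    return math.exp(nll_sum / count)
```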
EXL3 is based on QTIP, which you can read all about here.