PPL Chart?
Hey, so a brief look at the PPL vs. bpw chart for Llama 70B shows that the most dramatic difference between EXL2 and EXL3 is at the lower end, with bpw < 4. While EXL3 is significantly more efficient overall, I'm very curious how much of an improvement it is for Mistral Large in particular, because the current sweet spot is 2.7-2.85 bpw for 2x 3090s (depending on context), and that's where the disparity is most likely to be felt when updating.
Any chance you could do a similar chart for 123B? It's a bit of work, but I think it presents a very exciting opportunity to 'upgrade' from 70B models to 123B models by default, depending on the results.
Added it now
Can we get an exl3 version of the legendary Mistral Large 2407? 2411 is kinda lacking at some ... stuff.
That isn't an apples-to-apples comparison. Perplexity tests don't just measure the model; they measure the model together with a particular way of preparing and slicing a particular dataset. That's why I went to the trouble of creating a tool that tests both llama.cpp and ExLlama quants (as well as EXL2, AQLM, AWQ etc.) using the exact same sample of wikitext, sliced and tokenized in the exact same way, with the same logic for the perplexity calculation.
What I'm testing (rough sketch after the list):
- wikitext/wikitext2-raw-v1 test split, from `datasets`
- all `text` fields concatenated, separated by `"\n\n"`
- tokenized as one string
- sliding windows of 2048 tokens with a 512-token stride
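Not the actual tool, just a minimal sketch of that procedure using HuggingFace `transformers`/`datasets`; the model id is a placeholder, and exactly which tokens in each overlapping window get scored is an assumption on my part:

```python
# Illustrative sketch only -- approximates the procedure described above.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder model id (assumption)
WINDOW, STRIDE = 2048, 512

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# wikitext2-raw-v1 test split, all `text` fields joined with "\n\n",
# tokenized as one long string
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = "\n\n".join(ds["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[0]

nll_sum, token_count = 0.0, 0
with torch.no_grad():
    for start in range(0, ids.numel() - WINDOW, STRIDE):
        window = ids[start : start + WINDOW].unsqueeze(0).to(model.device)
        # Score every position in the window (assumption -- the real tool may
        # handle the overlap between consecutive windows differently).
        out = model(window, labels=window)
        nll_sum += out.loss.item() * (WINDOW - 1)
        token_count += WINDOW - 1

print("perplexity:", math.exp(nll_sum / token_count))
```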
I haven't fully reverse engineered lcpp's test, but at a glance it's an entirely different test:
- uses `wiki.test.raw` instead; not sure how equivalent that is to what's provided by `datasets`
- window length is apparently 512; not sure about the stride
- only evaluates tokens with a warm context, i.e. each window ignores the early tokens with the highest uncertainty (see the sketch after this list)
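For contrast, here's what a warm-context scheme along those lines could look like, reusing `model` and `ids` from the sketch above. This is purely illustrative, since I haven't verified how lcpp actually does it; the non-overlapping chunks and the `warmup` length are both assumptions:

```python
# NOT a claim about llama.cpp's actual implementation -- just a sketch of a
# scheme that skips the cold-context tokens at the start of each window.
import math
import torch

def warm_context_ppl(model, ids, window=512, warmup=256):
    """Score only tokens that already have at least `warmup` tokens of context."""
    nll_sum, count = 0.0, 0
    with torch.no_grad():
        for start in range(0, ids.numel() - window, window):  # non-overlapping (assumption)
            chunk = ids[start : start + window].unsqueeze(0).to(model.device)
            logits = model(chunk).logits[0]
            # Predictions for positions warmup..window-1; targets shifted by one.
            logprobs = torch.log_softmax(logits[warmup - 1 : -1].float(), dim=-1)
            targets = chunk[0, warmup:]
            nll_sum += -logprobs.gather(1, targets.unsqueeze(1)).sum().item()
            count += targets.numel()
    return math.exp(nll_sum / count)
```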
EXL3 is based on QTIP, which you can read all about here.