General discussion and feedback thread.
This is a general discussion thread. Feel free to share anything or report any issues.
Test157t/Kunocchini-1.2-7b-longtext (Benchmarks are in prep now.)
Noice!
KCPP 1.59 released, but it seems the IQ3_S support wasn't merged yet. Will still add it since it will be in the next version, but will also add the old Q3_K_S for now.
Are you sure this was configured correctly?
n_yarn_orig_ctx should be 8192, freq_base_train should be only 10000, and RoPE's linear scaling factor should be 16.
At least, according to the original YaRN model used in this merge.
https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k/blob/main/config.json
Your newer test's config doesn't look set up correctly either.
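If anyone wants to double-check against the reference model, here's a minimal sketch (assuming the usual HF transformers config layout with rope_theta / rope_scaling keys) that just pulls that config.json and prints the relevant fields:

```python
# Minimal sketch: fetch the reference Yarn-Mistral config and print the
# RoPE-related fields so they can be compared with this merge's config by hand.
import json
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

path = hf_hub_download("NousResearch/Yarn-Mistral-7b-128k", "config.json")
with open(path) as f:
    cfg = json.load(f)

# Expected per the discussion above: rope_theta 10000, yarn factor 16,
# original context length 8192.
print("rope_theta:", cfg.get("rope_theta"))
print("rope_scaling:", cfg.get("rope_scaling"))
print("max_position_embeddings:", cfg.get("max_position_embeddings"))
```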
I haven't had the chance to test with the updated config on this one. How does it seem to be performing? @Lewdiculous
@Test157t Seemed fine, didn't really notice anything too broken/unexpected, but I only used it casually for like 40 mins at around 12K-16K context.
After 12K it felt a bit less "natural", a bit "stiffer" and more repetitive.
Nothing a few swipes couldn't solve. But it did seem to be more heavily affected after 12K (e.g. sometimes changing message formatting or doing actions with the wrong character).
It did better at long context than anything else I tested before, considering I'm using Q4 or IQ4 quants...
@Test157t - for some reason this model keeps getting good feedback from the people I recommended it to. Personally I still like it; LLMs are about as clear as magic potion brewing xD
I used Kunocchini-7b-128k-test-v2_IQ4_XS-imatrix.gguf with the current ooba on Windows (the build with StreamingLLM support, similar to KoboldCPP), but no matter what I tried I was not able to get ctx over 8192. It immediately derails and produces gibberish.
Am I doing something wrong? I was looking forward to long contexts (this IQ4_XS fits with 50k ctx with all layers offloaded on a 4080), but sadly I can't get it to work.
@zappa2005 Pretty sure it's because Ooba doesn't do automatic RoPE scaling.
If you're gonna use a GGUF model, use Koboldcpp - that's why I recommend it: it will handle RoPE scaling automatically based on your --contextsize.
Not to mention, Koboldcpp is much faster than Ooba for GGUF models and has features like Context Shifting that can ensure very fast processing even at big contexts.
Related:
https://github.com/LostRuins/koboldcpp/wiki#what-is-contextshift
https://www.reddit.com/r/LocalLLaMA/comments/17ni4hm/koboldcpp_v148_context_shifting_massively_reduced/
My setup recommendation is Koboldcpp + SillyTavern.
If you need any help or something isn't as expected let me know, I'm happy to help.
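In case it helps to see what RoPE scaling for a larger --contextsize boils down to, here's a tiny sketch of the linear variant - just the arithmetic, not Koboldcpp's actual implementation (it may well use NTK-style base adjustments instead):

```python
# Illustration of linear RoPE scaling (not Koboldcpp's internal code):
# positions are compressed so the requested context fits the trained range.
def linear_rope_freq_scale(trained_ctx: int, target_ctx: int) -> float:
    """Return the rope freq_scale that stretches trained_ctx out to target_ctx."""
    if target_ctx <= trained_ctx:
        return 1.0  # no scaling needed
    return trained_ctx / target_ctx

# e.g. a model trained at 8192 ctx asked to run at 32768:
print(linear_rope_freq_scale(8192, 32768))  # 0.25, i.e. a scaling factor of 4
```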
The V2 GGUFs have the built-in config corrected to follow yarn scaling at a factor of 16.
@zappa2005
The v1 ones use a base Mistral config that expects sliding window attention (something many inference engines never bothered implementing).
Thanks for the tip, I'll try to work with NTK RoPE. Regarding Kobold, you are right: in general it is much faster, but with this model even GGUF on ooba with 33/33 layers offloaded runs at 35 tk/s at full context.
The prompt processing improvement from SmartContext is now also available for ooba; it is called StreamingLLM, as I mentioned above. Works well so far.
@zappa2005
I was out of the loop for that, but I'm curious, can you link to some StreamingLLM feature documentation or PR?
I'm under the impression that Smart Shifting is something different.
Found it here... https://github.com/oobabooga/text-generation-webui/pull/4761
Ah, yeah it's pretty much the same thing, that's great QoL!
> The V2 GGUFs have the built-in config corrected to follow yarn scaling at a factor of 16. @zappa2005
> The v1 ones use a base Mistral config that expects sliding window attention (something many inference engines never bothered implementing).
Thanks for the tip! Does the yarn scaling factor of 16 directly translate to the alpha_value/NTK RoPE I can set in ooba, or is it generally not possible to map one onto the other? Just curious how I can manually correct this.
Kobold cpp detects this if you want to use it manually:
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.0625
I believe that's an alpha of 16? (Honestly I don't know what's going on behind the scenes that well XD)
Might also be a linear scaling factor (What's an ooba anyways)
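For what it's worth, 0.0625 is exactly 1/16, so that log line reads as the scaling factor of 16 expressed as a frequency scale rather than as an NTK alpha (alpha scaling would raise freq_base instead of lowering freq_scale). Quick arithmetic check:

```python
# Sanity check: the logged freq_scale is just the reciprocal of the scaling factor.
factor = 16              # yarn / linear scaling factor from the config
print(1 / factor)        # 0.0625 - matches the llama_new_context_with_model log
print(8192 * factor)     # 131072 - the 128k context the scaling is meant to reach
```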
> Kobold cpp detects this if you want to use it manually:
> llama_new_context_with_model: freq_base = 10000.0
> llama_new_context_with_model: freq_scale = 0.0625
> I believe that's an alpha of 16? (Honestly I don't know what's going on behind the scenes that well XD)
Yeah this is weird to me too, and there is also other related information in the GGUF metadata:
'llama.rope.freq_base': '10000.000000'
'llama.rope.scaling.type': 'yarn'
'llama.rope.dimension_count': '128'
'llama.rope.scaling.factor': '16.000000'
'llama.rope.scaling.original_context_length': '8192'
'llama.rope.scaling.finetuned': 'true'
How this translates to the settings in ooba, like alpha_value or rope_freq_base (in relation to scaling_type=yarn, which I cannot modify or set)... no idea. Increasing alpha did help - no more gibberish after 8k - but is that the right way to do it?
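Not an authoritative answer, but as far as I understand the ooba knobs: a linear compression setting (compress_pos_emb, if your build exposes it) maps to freq_scale = 1/value, while alpha_value is the NTK-aware knob that raises freq_base via the usual alpha**(dim/(dim-2)) rule - a different mechanism from what this model was finetuned with. A sketch of the two conversions (my reading of the conventions, not taken from ooba's source):

```python
# Sketch: how the two common rope knobs map onto llama.cpp-style parameters.
# This is my understanding of the conventions, not code from text-generation-webui.
def freq_scale_from_compression(compress_pos_emb: float) -> float:
    # Linear / yarn-style compression: positions are squeezed by this factor.
    return 1.0 / compress_pos_emb

def freq_base_from_alpha(alpha: float, base: float = 10000.0, dim: int = 128) -> float:
    # NTK-aware scaling: alpha raises the rope base instead of squeezing positions.
    return base * alpha ** (dim / (dim - 2))

print(freq_scale_from_compression(16))  # 0.0625 - matches the GGUF's scaling factor
print(freq_base_from_alpha(16))         # ~167000 - what an alpha of 16 would imply instead
```

So if the loader actually honors the yarn metadata you shouldn't need to touch anything; if it doesn't, a compression/linear factor of 16 is probably the closer match to how the model was finetuned than cranking alpha.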
> Might also be a linear scaling factor (What's an ooba anyways)
ooba is the common shorthand for https://github.com/oobabooga/text-generation-webui