using context above 4096 - need to set compress_pos_emb above 1?
As the title says... is the default context for this model 4096, like Llama models?
You don’t need to set compression.
Mistral and Mixtral are trained at 8192 and use a sliding window up to 32k;
I find 8k to work best, though.
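For reference, here's a minimal loading sketch with the ExLlamaV2 Python API showing where those numbers go (the model path is made up, and I'm assuming scale_pos_emb is the attribute that maps to the webui's compress_pos_emb; names may differ between versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-exl2"  # hypothetical path
config.prepare()                                # reads the model's config.json

config.max_seq_len = 8192   # native training length for Mistral/Mixtral
config.scale_pos_emb = 1.0  # i.e. compress_pos_emb = 1, no compression

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)                     # split layers across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)
```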
Got it, thanks! Should I use the 8-bit cache only if I run into OOM, or is it a must? I just tested without the 8-bit cache and it seems to run fine...
The 8-bit cache saves 10-20% VRAM at essentially no cost (it might be slightly slower). If you aren't running into problems there's no need for it, but it can let you run a model at a higher bpw within the same VRAM.
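If you do want it later, it's basically a one-line change, assuming the same ExLlamaV2 API as in the sketch above and that the 8-bit cache class is ExLlamaV2Cache_8bit (in the webui it's just the cache_8bit checkbox):

```python
from exllamav2 import ExLlamaV2Cache_8bit

# Same model and config as before; only the cache class changes,
# storing the K/V cache in 8 bits instead of FP16.
cache = ExLlamaV2Cache_8bit(model, lazy=True)
model.load_autosplit(cache)
```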
> You don’t need to set compression.
> Mistral and Mixtral are trained at 8192 and use a sliding window up to 32k.
Say I want to increase the ctx to 10k: can I set compression to 1.25 so it doesn't use the sliding window between 8k and 10k? Or will it still use the sliding window as soon as it goes beyond 8k, regardless of settings?
Or is compress_pos_emb useless unless I want to go beyond 32k?
I guess my question really is whether the following two settings are the same with Mixtral / Mistral:
- ctx = 10k, compress_pos_emb = 1
- ctx = 10k, compress_pos_emb = 1.25 (this is assuming base ctx = 8k, so 10k/8k = 1.25)
Thanks!
You should not have to set either compression or RoPE (alpha) scaling with Mistral- and Mixtral-based models for anything below 32k.
Mixtral, i.e. this model, "gracefully handles a context of 32k tokens."
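To put that as a rough rule of thumb (my own summary, assuming plain linear position scaling and treating 32k as the effective native context here):

```python
def compress_factor(requested_ctx: int, native_ctx: int = 32768) -> float:
    """compress_pos_emb value for a requested context length, assuming linear
    scaling: anything that fits in the native context needs no compression."""
    return max(1.0, requested_ctx / native_ctx)

print(compress_factor(10_000))  # 1.0 -> at 10k, leave compress_pos_emb at 1
print(compress_factor(32_768))  # 1.0
print(compress_factor(49_152))  # 1.5 -> only past 32k would scaling kick in
```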