Full model context length & default settings (max_position_embeddings)
Hello,
According to the config, the max model length is 8k: https://huggingface.co/CohereLabs/c4ai-command-r7b-12-2024/blob/main/config.json#L18
vLLM also takes this value as-is, and according to this comment on llama.cpp https://github.com/ggml-org/llama.cpp/pull/10900#discussion_r1894397776 we might need to adjust the RoPE settings ourselves.
Meanwhile, in older Command R models the max_position_embeddings setting matches the reported maximum context length: https://huggingface.co/CohereLabs/c4ai-command-r-08-2024/blob/main/config.json#L15
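For reference, here is a quick way to compare what the two configs actually report (just a sketch using curl against the raw files on the Hub; you may need a token if the repos are gated for you):

```sh
# Print the configured max_position_embeddings for both repos.
# Add -H "Authorization: Bearer $HF_TOKEN" if the repo requires accepting the license first.
curl -s https://huggingface.co/CohereLabs/c4ai-command-r7b-12-2024/raw/main/config.json | grep max_position_embeddings
curl -s https://huggingface.co/CohereLabs/c4ai-command-r-08-2024/raw/main/config.json | grep max_position_embeddings
```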
What settings do you use to run it at full size in llama.cpp and vLLM?
Thanks.
Hi @LPN64,
To run at full size with vLLM, we recommend setting max_position_embeddings=256000. Although in theory this number can go as high as memory allows, we cannot guarantee model quality for sequence lengths beyond 256k.
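If you are launching with `vllm serve`, one way to apply that is via the `--hf-overrides` flag shown later in this thread (a sketch only; the `--max-model-len` value below is just an example, pick whatever fits your workload and memory):

```sh
# Sketch: override the config's max_position_embeddings when serving with vLLM.
# Adjust --max-model-len to the longest sequence you actually need.
vllm serve CohereLabs/c4ai-command-r7b-12-2024 \
  --max-model-len 131072 \
  --hf-overrides '{"max_position_embeddings": 256000}'
```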
Thanks.
Thanks for the quick answer.
As of right now:
`VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve CohereLabs/c4ai-command-r7b-12-2024 --max-model-len 31000`
crashes vLLM.
I had to add `--hf-overrides "{\"max_position_embeddings\": 131072}"` to make it work.
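Putting the pieces together, the working invocation looks roughly like this:

```sh
# Working invocation: raise max_position_embeddings via --hf-overrides and
# cap the served context with --max-model-len.
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve CohereLabs/c4ai-command-r7b-12-2024 \
  --max-model-len 31000 \
  --hf-overrides "{\"max_position_embeddings\": 131072}"
```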
I ran a few tests comparing vLLM, llama.cpp, and the Hugging Face transformers library; so far HF gives the best results. I'll run more tests tomorrow.