How do I use this model version with vLLM serve?
#1 opened by couldn
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server --model mistral-small-24b-bnb-4bit --max_model_len=20000 --port 8080 --quantization bitsandbytes --load-format bitsandbytes --tokenizer_mode mistral --config_format mistral --tool-call-parser mistral --enable-auto-tool-choice
This command does not work for me.
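For reference, a minimal sketch of an alternative invocation that may be worth trying. It assumes the bnb-4bit repo is in standard Hugging Face format rather than the consolidated Mistral format, so the --tokenizer_mode mistral and --config_format mistral flags are dropped; the model name, context length, and port are carried over from the command above. This is an untested assumption, not a confirmed fix:

# Sketch: serve a pre-quantized bitsandbytes 4-bit checkpoint with vLLM's
# OpenAI-compatible server (assumes an HF-format repo; mistral config/tokenizer
# flags removed so the bnb weights are loaded instead of consolidated weights)
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model mistral-small-24b-bnb-4bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --max-model-len 20000 \
    --port 8080

If the server starts, the OpenAI-compatible endpoint can be checked with a plain HTTP request:

# Quick sanity check against the chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistral-small-24b-bnb-4bit", "messages": [{"role": "user", "content": "Hello"}]}'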
I have the same question. I would like to know whether this is possible or whether we have to wait for a vLLM update.