How do I use this model version with vLLM serve?

#1
opened by couldn

CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server --model mistral-small-24b-bnb-4bit --max_model_len=20000 --port 8080 --quantization bitsandbytes --load-format bitsandbytes --tokenizer_mode mistral --config_format mistral --tool-call-parser mistral --enable-auto-tool-choice
This command does not work.
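
For what it's worth, a stripped-down invocation that only requests bitsandbytes quantization and otherwise keeps the Hugging Face defaults is sketched below. It assumes the bnb-4bit repository ships standard Hugging Face config and tokenizer files, which is why the `--tokenizer_mode mistral` / `--config_format mistral` flags are omitted; treat it as an untested starting point, not a confirmed fix.

```
# Minimal sketch: serve the bnb-4bit checkpoint with bitsandbytes quantization,
# leaving tokenizer/config handling to the default Hugging Face loaders.
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model mistral-small-24b-bnb-4bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --max-model-len 20000 \
    --port 8080
```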

I have the same question. I would like to know whether this is possible or whether we have to wait for a vLLM update.
