How do I use this model version with vLLM serve?
#1 opened by couldn
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server --model mistral-small-24b-bnb-4bit --max_model_len=20000 --port 8080 --quantization bitsandbytes --load-format bitsandbytes --tokenizer_mode mistral --config_format mistral --tool-call-parser mistral --enable-auto-tool-choice
This command does not work for me.
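For reference, a minimal sketch of an alternative invocation that may be worth trying. It assumes the bnb-4bit repo is in standard Hugging Face format rather than the consolidated Mistral format, so the --tokenizer_mode mistral and --config_format mistral flags are dropped; the model name, context length, and port are carried over from the command above. This is an untested assumption, not a confirmed fix:

# Sketch: serve a pre-quantized bitsandbytes 4-bit checkpoint with vLLM's
# OpenAI-compatible server (assumes an HF-format repo; mistral config/tokenizer
# flags removed so the bnb weights are loaded instead of consolidated weights)
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model mistral-small-24b-bnb-4bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --max-model-len 20000 \
    --port 8080

If the server starts, the OpenAI-compatible endpoint can be checked with a plain HTTP request:

# Quick sanity check against the chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistral-small-24b-bnb-4bit", "messages": [{"role": "user", "content": "Hello"}]}'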
I have the same question. I would like to know whether this is possible or whether we have to wait for a vLLM update.