Running this model with llama-server instead of the CLI

#2
by atabaza - opened

Hello,
Hope you're all well!

I was wondering whether it's possible to run this model without using the CLI. What I mean is that I'd love to use this model via an API, but the documented options for running inference are CLI-based. I've had a preliminary look at the inference code, and the CLI is adapted from llava-cli.cpp, with edits to the model's inference code.

To achieve this previously with LLaVA 1.6, I used the llama-cpp-python bindings and wrote a FastAPI service around them, but with this model it's a bit trickier.

So far I've managed to build llama-cpp-python against the working fork of the llama.cpp code, including the branch, but now I'd like to run inference programmatically. What are my options?
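
One route I'm considering, if the fork's llama-server builds and actually supports this model's multimodal path (which I haven't verified), would be to start the server and call its OpenAI-compatible HTTP endpoint instead of going through the bindings. Roughly something like the sketch below, with placeholder model/mmproj paths and port:

import requests

# Assumes the server was started from the fork with something like:
#   ./llama-server -m ./models/model.gguf --mmproj ./models/mmproj-model-f16.gguf --port 8080
# Whether image input works through the server depends on the fork.
payload = {
    "messages": [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://saturncloud.io/images/blog/troubleshooting-guide-when-your-conda-environment-doesnt-show-up-in-vs-code-2.png"}},
                {"type": "text", "text": "Describe this image in detail please."},
            ],
        },
    ],
}

# llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint.
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])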

For reference, here's the code I previously used for my LLaVA API:

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="./chatbot_module/models/mmproj-model-f16.gguf")

llm = Llama(
  model_path="./chatbot_module/models/llava-v1.6-mistral-7b.Q4_K_M.gguf",
  chat_handler=chat_handler,
  n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
  logits_all=True,  # needed to make LLaVA work
)

llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://saturncloud.io/images/blog/troubleshooting-guide-when-your-conda-environment-doesnt-show-up-in-vs-code-2.png"}},
                {"type" : "text", "text": "Describe this image in detail please."}
            ]
        }
    ]
)

I understand the prompt format is different, so I'll need a different chat handler; but with regard to the inference itself, what are my options?
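
From a quick look at llama-cpp-python, the vision handlers there seem to be subclasses of Llava15ChatHandler that override a CHAT_FORMAT Jinja2 template, so the prompt side might be coverable with something like the sketch below. The template is only a placeholder and would need to be rewritten to match whatever prompt the fork's adapted llava-cli.cpp actually builds:

from llama_cpp.llama_chat_format import Llava15ChatHandler


class CustomVisionChatHandler(Llava15ChatHandler):
    # Hypothetical template -- replace the USER:/ASSISTANT: tags with this
    # model's real prompt format. Image URLs are emitted into the rendered
    # text so the base handler can find them and substitute the embeddings.
    CHAT_FORMAT = (
        "{% for message in messages %}"
        "{% if message.role == 'system' %}{{ message.content }}\n{% endif %}"
        "{% if message.role == 'user' %}"
        "USER: "
        "{% if message.content is string %}{{ message.content }}{% endif %}"
        "{% if message.content is iterable and message.content is not string %}"
        "{% for content in message.content %}"
        "{% if content.type == 'image_url' and content.image_url is mapping %}{{ content.image_url.url }}{% endif %}"
        "{% if content.type == 'image_url' and content.image_url is string %}{{ content.image_url }}{% endif %}"
        "{% if content.type == 'text' %}{{ content.text }}{% endif %}"
        "{% endfor %}"
        "{% endif %}"
        "\n"
        "{% endif %}"
        "{% if message.role == 'assistant' and message.content %}ASSISTANT: {{ message.content }}\n{% endif %}"
        "{% endfor %}"
        "ASSISTANT: "
    )

It would then be passed to Llama exactly like Llava15ChatHandler above, i.e. chat_handler=CustomVisionChatHandler(clip_model_path=...).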

Now, this is a very loaded and hefty question, so I'm willing to contribute the inference code back to the fork if I manage to run it via an API.

Thank you in advance.

Owner

I don't have much experience working with llama.cpp in Python; so far I have written everything in .NET. I stream the output of llama-cli into .NET code that creates function calls etc. in response to tokens. It's a very labour-intensive process, as for each new model I have to convert its chat template to the .NET version, but it does give maximum flexibility to deal with any inputs and outputs. Probably not the approach you are looking for.

Mungert changed discussion status to closed
Mungert changed discussion status to open

Ah, I see. I was thinking of parsing the CLI output as well if it isn't possible to call the model programmatically. I might just have to sit down, try to make sense of the C++ code, and replicate it with the Python bindings, but I'm sure that won't be easy. Thanks, I appreciate the insight.
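
If I do end up going the CLI-parsing route, this is roughly what I had in mind on the Python side (the binary path and flags are placeholders and would need to match whatever the fork's CLI actually accepts):

import subprocess

# Placeholder command line -- the real flags must match the fork's CLI
# (model path, mmproj, image, prompt, etc.).
cmd = [
    "./llama-cli",
    "-m", "./models/model.gguf",
    "--mmproj", "./models/mmproj-model-f16.gguf",
    "--image", "./input.png",
    "-p", "Describe this image in detail please.",
]

proc = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    text=True,
    bufsize=1,
)

# Tokens are usually printed without newlines, so read small chunks and
# forward them as they arrive (e.g. from a FastAPI streaming response).
while True:
    chunk = proc.stdout.read(1)
    if chunk == "":
        break
    print(chunk, end="", flush=True)

proc.wait()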

atabaza changed discussion status to closed