Image analysis only in Chinese?
Well, no matter what I do, the model returns image analysis only in Chinese (ggml-model-Q6_K.gguf with mmproj-model-f16.gguf). The only way it returns results in English is when flash attention is enabled in llama.cpp and the k/v cache is quantized to q5_0; for example, q4_0, q8_0, q5_1, or f16 k/v cache quantizations return Chinese image analysis results no matter what I try. In conversation mode it works fine in English, but image analysis returns English responses "ONLY" when q5_0 k/v cache quantization is used.
Edit: I spoke too soon. With q5_0 k/v quantization it "sometimes" returns results in English, but out of 10 repeated queries with the exact same command, 6 are in Chinese and 4 in English.
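For reference, here is a minimal sketch of how I'm toggling these settings; the paths are placeholders, and I'm assuming llama.cpp's standard `-fa`, `-ctk`, and `-ctv` flags are accepted by `llama-mtmd-cli` the same way as by `llama-cli`:

```python
import subprocess

# Sketch of the invocation used for these tests (paths are placeholders).
# In llama.cpp, quantizing the V cache requires flash attention, hence
# -fa is paired with the -ctk/-ctv settings below.
command = [
    "llama-mtmd-cli.exe",
    "-m", "ggml-model-Q6_K.gguf",
    "--mmproj", "mmproj-model-f16.gguf",
    "--image", "test.png",
    "-p", "Analyze this image in high detail",
    "-fa",            # enable flash attention
    "-ctk", "q5_0",   # K cache quantization
    "-ctv", "q5_0",   # V cache quantization (q4_0/q8_0/q5_1/f16 all gave Chinese)
]
result = subprocess.run(command, capture_output=True, timeout=60)
print(result.stdout.decode("utf-8", errors="ignore"))
```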
I've seen your issue, and it seems a bit strange. I'll check it out and get back to you as soon as possible.
@phoebdroid
I haven't rigorously compared different gguf quantization methods before, but I've reproduced your issue. I'm not sure how to fix it yet.
However, I've found that Q4_0 or Q4_K_M don't cause this issue.
Perhaps you could try using these quantization methods.
I'll also need to spend more time learning the differences and solutions for each gguf quantization method.
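In case it helps, here is a rough sketch of producing those ggufs with llama.cpp's `llama-quantize` tool (file names are placeholders; ideally you would start from the f16 gguf rather than requantizing an already-quantized one):

```python
import subprocess

# Requantize the f16 gguf to Q4_K_M with llama.cpp's llama-quantize.
# Usage: llama-quantize <input.gguf> <output.gguf> <type>
subprocess.run(
    [
        "llama-quantize",
        "ggml-model-f16.gguf",     # input: highest-precision gguf available
        "ggml-model-Q4_K_M.gguf",  # output
        "Q4_K_M",                  # target quantization type (or "Q4_0")
    ],
    check=True,
)
```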
You mean the model quantized to Q4, or the k/v cache? Because for me, the k/v cache at Q4 is definitely Chinese. As for the model, I try to stay away from low-quantization models, especially for image analysis, since that seems to degrade quality much more than with pure text. But I can give it a try if you want me to test Q4 weights.
@phoebdroid
I'm suggesting you use the Q4_0 or Q4_K_M ggufs for testing. I'm also curious why they follow Chinese and English questions more accurately than the Q6.
I can't quite pinpoint the cause, but I've discovered this phenomenon and wanted to share it with you. If you have some free time, you can try it out.
Sure, I'm downloading Q4_K_M as we speak; I'll test and let you know. As for why higher quants fail at English while lower ones succeed, that baffles me too. If I had to take a wild guess, I'd say maybe the original merge and the source models had this Chinese tendency, and that tendency is preserved better at higher quants? But like I said, I have no idea and am just taking a shot in the dark, so don't quote me on this.
OK, I've downloaded the Q4_K_M gguf, and yes, this one is not in Chinese. ~~However, image recognition is absurdly bad~~ In image analysis it actually seems to be better than Gemma3; for OCR, however, it does OK. Now, first I'll share my OCR test, run against Gemma3 4B IT QAT (Q4_0 gguf), with the model responses evaluated by Qwen3 30B A3B Thinking 2507. I'll share the reference image and screenshots of the evaluation done by Qwen. In my next message, I'll do the same for image analysis.
The reference image is:
And the chat history (note: Model 1 is MiniCPM-V-4, Model 2 is Gemma3):
OK, I must apologize and correct myself: MiniCPM-V-4 is NOT absurdly bad at image analysis. As a matter of fact, it held its own fairly well against Gemma3 in this test (I had another run where it completely botched an image analysis, and I spoke too soon based on that). In this test it's even more verbose than Gemma3. Here are the test results. (Note: Qwen mixed up the model names in the last evaluation, calling model 1 model 2 and vice versa.) Model 1 in this test is MiniCPM-V-4 and model 2 is Gemma3.
reference image:
Now screenshots of chat history:
Note: in the last re-evaluation Qwen again mixed up the model names (with one "llama" slip), but the rest is correct, and it correctly evaluated model 1 (MiniCPM-V-4) as superior for image analysis. I actually agree, because it delivers more verbose output than Gemma.
OK, wait, there's even more; again I spoke too soon. My own user interface tests work, but direct CLI calls do NOT, because in my image analysis tool code I actually do a cleanup pass on responses and UTF-8 decoding:
```python
import subprocess

def _process_image_analysis(command):
    """Helper to run the image analysis subprocess and clean up the output."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            timeout=60
        )
    except subprocess.TimeoutExpired:
        return None, "Image analysis timed out"

    # Decode output using UTF-8, ignoring errors
    raw_output = result.stdout.decode('utf-8', errors='ignore')
    stderr_output = result.stderr.decode('utf-8', errors='ignore')

    # Check for errors in stderr
    if result.returncode != 0:
        error_message = f"Image analysis failed (stderr): {stderr_output}"
        return None, error_message

    # Define technical patterns to filter out
    technical_patterns = [
        'loading model:',
        'encoding image slice',
        'image decoded (batch',
        'decoding image batch',
        'time:',
        'ms',
    ]

    # Split output into lines and filter
    lines = raw_output.split('\n')
    filtered_lines = []
    for line in lines:
        line = line.strip()  # Remove leading/trailing whitespace
        if not line:
            continue  # Skip empty lines
        # Check if line contains any technical pattern
        if any(pattern.lower() in line.lower() for pattern in technical_patterns):
            continue
        filtered_lines.append(line)

    # Join the filtered lines and remove any extra blank lines
    cleaned_output = '\n'.join(filtered_lines).strip()
    return cleaned_output, None
```
That's why calling the model through my own user interface's image analysis tool works. Now, the portion of my tool that builds the subprocess call to llama-mtmd-cli is this:
```python
try:
    # Build the llama.cpp CLI command with the image path
    llama_command = [
        llama_cli_path,
        "-m", model_path,
        "--mmproj", mmproj_path,
        "-p", llm_prompt,
        "-c", "4096", "-ngl", "99",
        "--no-mmap", "--samplers", "top_k;temperature", "--sampling-seq",
        "kt", "--top-k", "20", "--temp", "0.7"
    ]
    llama_command.extend(["--image", str(image_to_analyze)])

    # Use the helper function to process the image analysis
    cleaned_output, error = _process_image_analysis(llama_command)
except Exception as e:
    cleaned_output, error = None, f"Image analysis failed: {e}"
```
and the command this creates is exactly this:

```
C:\AI_Workshop\llama\llama.cpp\build\bin>llama-mtmd-cli.exe -m e:\LLMa_Models\ggml-model-Q4_K_M.gguf --mmproj e:\LLMa_Models\mmproj-model-f16.gguf --image C:\AI_Workshop\llama\ChatApp\downloads\2.png -p "Analyze this image in high detail" -c 4096 -ngl 99 --no-mmap --samplers top_k;temperature --sampling-seq kt --top-k 20 --temp 0.7
```
However, when I run that exact same command in the terminal, the results are as follows (a few repetitions to make sure):
```
image decoded (batch 1/1) in 10 ms
该图片展示了一群风格独特、色彩鲜艳的角色。他们似乎处于一个充满活力和动感的场景之中。
首先，我们可以看到一个坐在凳子上的蓝色角色。她的蓝色服装和蓝色凳子形成了鲜明的对比，使得这个角色显得格外突出。
其次，我们可以看到一个穿着紫色上衣、红色裤子以及彩色手套和鞋子的角色。她的服装颜色非常鲜艳，给人一种充满活力和动感的感觉。
最后，我们可以看到一个穿着黑色皮夹克的男性角色。他的黑色皮夹克给人一种神秘和冷酷的感觉。
总体而言，这张图片展示了一群风格独特、色彩鲜艳的角色。他们的服装颜色非常鲜艳，给人一种充满活力和动感的感觉。同时，这些角色的造型也透露出一种神秘和冷酷的感觉。
llama_perf_context_print: load time = 338.24 ms
llama_perf_context_print: prompt eval time = 573.19 ms / 611 tokens ( 0.94 ms per token, 1065.96 tokens per second)
llama_perf_context_print: eval time = 518.64 ms / 168 runs ( 3.09 ms per token, 323.92 tokens per second)
llama_perf_context_print: total time = 1635.80 ms / 779 tokens
llama_perf_context_print: graphs reused = 162
```

(Translation of the Chinese output: "This image shows a group of distinctively styled, brightly colored characters, apparently in a scene full of energy and movement. First, we can see a blue character sitting on a stool; her blue outfit and the blue stool form a sharp contrast, making her stand out. Next, a character wearing a purple top, red pants, and colorful gloves and shoes; her clothing colors are very vivid, giving a lively, dynamic impression. Finally, a male character in a black leather jacket, which gives a mysterious, cold impression. Overall, the image shows a group of distinctively styled, brightly colored characters; their vivid clothing conveys energy and movement, while their styling also hints at mystery and coldness.")
```
image decoded (batch 1/1) in 7 ms
这幅图像描绘了一群人坐在一个昏暗的酒吧或休息区。人物的服饰和姿势暗示了他们可能正在进行某种社交活动。
这个场景可以用于展示社交活动、夜生活或虚拟角色互动的情境。
llama_perf_context_print: load time = 337.32 ms
llama_perf_context_print: prompt eval time = 574.66 ms / 611 tokens ( 0.94 ms per token, 1063.24 tokens per second)
llama_perf_context_print: eval time = 159.51 ms / 51 runs ( 3.13 ms per token, 319.73 tokens per second)
llama_perf_context_print: total time = 1271.92 ms / 662 tokens
llama_perf_context_print: graphs reused = 49
```

(Translation of the Chinese output: "This image depicts a group of people sitting in a dimly lit bar or lounge area. Their clothing and poses suggest they may be engaged in some kind of social activity. This scene could be used to depict social activities, nightlife, or virtual character interaction.")
Therefore, when called in the terminal via a direct call to llama-mtmd-cli.exe, the model (the Q4_K_M gguf) still outputs weirdly encoded text; it was my own image analysis tool's UTF-8 decoding that made it look like it was working.
@phoebdroid
Is this garbled text?
I didn't quite understand your case the first time I read it.
Could you please send me your test images and questions so I can run some tests and review your response a few more times?
My email: [email protected]
Let's keep the conversation here so that others can also read and benefit in the future. What happens is:
Issue: Model Output Uses UTF-8 Encoding, Causing Incompatibility with Standard Windows CMD
Model: ggml-model-Q4_K_M.gguf
Problem Description:
The ggml-model-Q4_K_M.gguf model generates text output encoded as UTF-8. While UTF-8 is a modern and widely used standard, it is not the default encoding for the standard Windows Command Prompt (cmd.exe).
This leads to two distinct problems:
1. Garbled Output in CMD: When the model is run directly in a standard cmd.exe terminal, its output appears as garbled, unreadable text. This is because the terminal typically uses an older, single-byte encoding (like Code Page 437) and cannot correctly render the multi-byte characters in the model's UTF-8 output.
2. Python Subprocess Errors: When running the model from a Python script via the subprocess module, it can cause a 'charmap' codec can't encode error. This happens because Python, by default, communicates with the Windows operating system using the system's legacy code page.
Root Cause Analysis:
The issue is an encoding mismatch: the model outputs modern UTF-8 text, but the standard Windows command-line environment expects a legacy encoding.
Other models work without this issue because their text output is limited to the basic ASCII character set. ASCII characters are represented the same way in both UTF-8 and legacy code pages, so the conflict never occurs. This specific model, however, produces characters (like special symbols or punctuation) outside the basic ASCII set, which exposes the underlying encoding incompatibility of the environment.
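A minimal Python sketch of the mismatch (the sample string here is illustrative, not taken from the model's actual output):

```python
# The same UTF-8 bytes, decoded two ways.
sample = "该图片展示了一群角色"   # illustrative stand-in for the model's output
raw = sample.encode("utf-8")      # what the model process writes to stdout

print(raw.decode("utf-8"))        # correct text, as my tool's decode step sees it
print(raw.decode("cp437"))        # mojibake, as a legacy cmd.exe code page renders it

# The Python-side failure mode: encoding the text for a console whose
# stdout uses a legacy code page raises the 'charmap' error.
try:
    sample.encode("cp1252")       # stand-in for sys.stdout's legacy encoding
except UnicodeEncodeError as exc:
    print(exc)                    # "'charmap' codec can't encode character ..."
```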
Conclusion for Report:
The ggml-model-Q4_K_M.gguf model has an implicit requirement for a UTF-8-compatible environment. This makes it unusable "out of the box" in standard Windows cmd.exe and difficult to automate. The model should either be modified to produce only ASCII characters for maximum compatibility, or it must be clearly documented that it requires a UTF-8 terminal (such as the modern Windows Terminal, or cmd.exe after running the command chcp 65001).
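On the Python side, the robust fix is to pin the decoding explicitly instead of relying on the console code page; a minimal sketch, using the `llama_command` list from my tool above:

```python
import subprocess

# Decode the subprocess output as UTF-8 regardless of the Windows code page.
# errors="replace" keeps the call robust against any stray bytes.
result = subprocess.run(
    llama_command,
    capture_output=True,
    encoding="utf-8",
    errors="replace",
    timeout=60,
)
print(result.stdout)
```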
@phoebdroid
So that's the case! Thank you very much for your careful observation.
This is an issue we hadn't anticipated. We will include the UTF-8 dependency in the model readme.