BF16 version of the mmproj file does not seem to work
I wanted to see whether there is any performance difference between the fp16 and bf16 versions of the mmproj file, but the bf16 version just throws an error and crashes. I am using the latest stable version of LM Studio (build 0.3.15) and the latest beta CUDA runtime (v1.29, llama.cpp b5219).
Is the text model only meant to work with the fp16 version of the mmproj file, or is this a bug that needs an update to fix? If it is the former, would it be better to remove the bf16 version and maybe replace it with an fp32 version (assuming there is any benefit in doing so)?
I just updated to the latest beta LM Studio CUDA 12 runtime (b5283), and the bf16 version still does not work. Is the native (source) precision of the vision projector supposed to be 32 or 16 bits? I would like to try an fp32 version of the mmproj if there is any benefit, since I have 24 GB of VRAM and do not need that much context.
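For what it's worth, the tensor types actually stored in a given mmproj can be listed with the gguf Python package from llama.cpp's gguf-py (a rough sketch on my part; the file name is a placeholder), although that only shows what the file contains, not the precision of the original checkpoint:

```python
# Sketch: count the tensor types stored in an mmproj GGUF.
# Requires llama.cpp's gguf-py package (pip install gguf); the path is a placeholder.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("mmproj-model-bf16.gguf")

# Tally how many tensors use each storage type (F32, F16, BF16, ...).
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in counts.most_common():
    print(f"{type_name}: {n} tensors")
```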
Edit: I also posted the following as a comment on the kalomaze_Qwen3-16B-A3B-GGUF discussion page; I think you might have missed it because it was posted over the weekend:
As a side note, could you quant SmolVLM2 and Qwen 2.5 VL? SmolVLM2 should already be working. As for Qwen 2.5 VL, from what I can see, ggml-org already has a few basic quants, but ngxson has run into some bugs with the 32B version. However, when I download the mmproj file from ggml-org/Qwen2.5-VL-32B-Instruct-GGUF and combine it with the IQ4_XS text-only quant from mradermacher/Qwen2.5-VL-32B-Instruct-i1-GGUF, it seems to work. Not sure how he managed to produce the text-only quant about a month ago, though.
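In case it helps with checking that kind of mix-and-match, the general.* metadata of each GGUF can be printed the same way to see which base model a mmproj or text-only quant was converted from (again just a sketch, with placeholder file names):

```python
# Sketch: print the general.* string metadata of a GGUF so the base model of a
# mmproj / text-only quant pair can be compared (file names are placeholders).
from gguf import GGUFReader, GGUFValueType

def print_general_metadata(path: str) -> None:
    reader = GGUFReader(path)
    print(f"== {path} ==")
    for name, field in reader.fields.items():
        # Only general.* keys, and only string values, for simplicity.
        if name.startswith("general.") and field.types and field.types[0] == GGUFValueType.STRING:
            print(f"{name} = {bytes(field.parts[-1]).decode('utf-8')}")

print_general_metadata("mmproj-Qwen2.5-VL-32B-Instruct-f16.gguf")
print_general_metadata("Qwen2.5-VL-32B-Instruct.i1-IQ4_XS.gguf")
```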