fancyfeast/llama-joycaption-beta-one-hf-llava
Hello, would it be possible to get quants for fancyfeast/llama-joycaption-beta-one-hf-llava? It is based on Llama, more specifically LLaVA, so it's a vision-enabled model for captioning images. I haven't been able to convert it manually using llama.cpp or gguf-my-repo, and I would like to ask whether it is even possible to make quants of this model. In any case, I would like to thank you (the team) for making amazing quants. Thanks for everything 🥰
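For context, a manual conversion attempt with llama.cpp looks roughly like the sketch below, run from a llama.cpp checkout (the model path and output type are placeholders):

```bash
# Rough sketch of a manual GGUF conversion attempt with llama.cpp
# (model path and --outtype are placeholders).
pip install -r requirements.txt
python convert_hf_to_gguf.py /path/to/llama-joycaption-beta-one-hf-llava \
    --outfile llama-joycaption-beta-one.f16.gguf --outtype f16
```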
It's queued! :D
Let's hope for the best.
You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#llama-joycaption-beta-one-hf-llava-GGUF for quants to appear.
mmproj extraction unfortunately failed due to SiglipVisionModel
not being supported by llama.cpp:
{[[PROGRESS:mmproj extraction hf Q8_0]]}
INFO:hf-to-gguf:Loading model: llama-joycaption-beta-one-hf-llava
INFO:hf-to-gguf:Model architecture: SiglipVisionModel
ERROR:hf-to-gguf:Model SiglipVisionModel is not supported
job finished, status 1
job-done<0 llama-joycaption-beta-one-hf-llava noquant 1>
On the bright side, the LLM conversion and imatrix computation worked, so you can at least use the text part of the model. Unfortunately I'm not sure how useful that is for an image captioning model, unless you just want to generate a bunch of made-up image captions. I'm even considering nuking it, as without the mmproj it has probably lost its purpose.
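If you do want to play with the text-only part, it loads like any other Llama GGUF; a minimal sketch, with the file name standing in for whichever quant you download:

```bash
# Sketch: using only the text part of the model
# (the GGUF file name is a placeholder for the quant you actually download).
./llama-cli -m llama-joycaption-beta-one-hf-llava.Q4_K_M.gguf \
    -p "Write a descriptive caption for an image of a cat sleeping on a sofa."
```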
So without the vision module it's just Llama 8B? With gguf-my-repo I did the same, and I noticed the model didn't have any vision inputs in llama.cpp. So we are running into the same issue... Then I think it's best if I close this discussion until llama.cpp comes up with a solution, I suppose?
So without the vision module it's just Llama 8B?
It seems to be Llama-3.1-8B-Instruct finetuned for image caption generation. Download it now if you want, as I'm still not sure whether we should nuke it or not.
With gguf-my-repo I did the same, and I noticed the model didn't have any vision inputs in llama.cpp.
gguf-my-repo never generates the mmproj. If you only provide the GGUF but not the mmproj you will never have vision.
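Vision only works when llama.cpp is given both files, roughly like this (file names are placeholders, and the multimodal CLI binary name has changed across llama.cpp versions):

```bash
# Sketch: multimodal inference needs the text GGUF *and* the mmproj GGUF
# (file names are placeholders; older builds ship llama-llava-cli instead of llama-mtmd-cli).
./llama-mtmd-cli -m model.Q8_0.gguf --mmproj mmproj-model-f16.gguf \
    --image photo.jpg -p "Describe this image."
```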
So we are running into the same issue...
Not really. gguf-my-repo doesn't even try to extract the vision part, whereas I tried and failed because this specific vision architecture is not currently supported by llama.cpp.
Then I think it's best if I close this discussion until llama.cpp comes up with a solution, I suppose?
Please keep an eye on https://huggingface.co/fancyfeast/llama-joycaption-beta-one-hf-llava/discussions/2 and let us know once it is supported, as we will almost certainly forget about it. What is really strange is that llama.cpp does seem to have code handling SiglipVisionModel, but it all lives in the legacy vision extraction code (https://github.com/search?q=repo%3Aggml-org%2Fllama.cpp%20SiglipVisionModel%20&type=code), so it might already be supported, just not by our tooling. But given that nobody else has gotten it to work so far, there will almost certainly be further issues, so it is best to wait a bit.
How nice. Guess it was a great decision not to nuke it: https://huggingface.co/mradermacher/llama-joycaption-beta-one-hf-llava-GGUF/discussions/1
The GGUF llava conversion process is rather convoluted and needs a few manual adjustments to get working.
Generally, the steps here (https://github.com/ggerganov/llama.cpp/blob/master/docs/multimodal/llava.md) are mostly correct, but very confusing.
Follow the llava 1.6 instructions; however, the repo they pull the config from is not suitable for this model. Instead, grab the necessary parts of the config from https://huggingface.co/fancyfeast/llama-joycaption-beta-one-hf-llava/blob/main/config.json
You want to use the llava_surgery_v2.py script; it'll generate two files that are actually PyTorch-like binary weights (llava.projector and llava.clip).
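Something along these lines, following the llava 1.6 instructions linked above (paths are placeholders, and the scripts live under examples/llava/ or tools/mtmd/ depending on the llama.cpp version):

```bash
# Sketch of the surgery step from the llava 1.6 instructions
# (paths are placeholders; script location varies between llama.cpp versions).
python examples/llava/llava_surgery_v2.py -C -m /path/to/llama-joycaption-beta-one-hf-llava

# The surgery drops llava.projector and llava.clip into the model directory.
# Collect them for the image-encoder conversion:
mkdir vit
cp /path/to/llama-joycaption-beta-one-hf-llava/llava.clip vit/pytorch_model.bin
cp /path/to/llama-joycaption-beta-one-hf-llava/llava.projector vit/

# Instead of the config the instructions download, build vit/config.json
# from the vision_config section of the JoyCaption config.json linked above.
```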
Then you run convert_image_encoder_to_gguf.py, except the instructions are wrong here too: this model actually needs --clip-model-is-siglip, NOT --clip-model-is-vision (read the config.json to check!).
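A sketch of that invocation, using the vit/ directory from the surgery step above (directory names are placeholders):

```bash
# Sketch: convert the extracted SigLIP encoder + projector into an mmproj GGUF.
# Note --clip-model-is-siglip rather than the --clip-model-is-vision flag shown in the docs.
python examples/llava/convert_image_encoder_to_gguf.py \
    -m vit \
    --llava-projector vit/llava.projector \
    --output-dir vit \
    --clip-model-is-siglip
```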
Finally, if you do everything right, it should load the PyTorch vision model successfully and export it as a GGUF, which you can then use as the --mmproj file.