Can't wait for HF? Try chatllm.cpp
https://github.com/foldl/chatllm.cpp
python scripts\richchat.py -m :mistral-small:24b-2503 -ngl all
Where is it downloading from? The Curiosity Mars rover? Testing the download with Qwen2 0.5B, it feels like the model is being brought here by a snail on its back.
And sometimes I suspect the Qwen and the snail switch roles when the snail gets too tired...
Quantized models are hosted by modelscope, on Earth. :)
So... I waited half a day for the Mistral download and it got stuck at about 59%.
I had to figure out the real download source from the downloader script's source code and fetch it from modelscope manually with a regular download manager, which finished in minutes. Something is definitely not right with chatllm.cpp's own downloader!
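For anyone else who hits this, below is a minimal sketch of the kind of resumable streaming download a regular download manager does, using only `requests`. The URL is a placeholder, not the actual modelscope path the chatllm.cpp downloader uses; you still have to pull the real URL out of the downloader script yourself.

```python
# Minimal resumable-download sketch (placeholder URL, not the real modelscope path).
import os
import requests

def download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    # Resume from whatever is already on disk by requesting a byte range.
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={start}-"} if start else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        # 206 Partial Content means the server honored the Range header.
        mode = "ab" if start and r.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)

download("https://example.com/path/to/model.bin", "model.bin")  # placeholder
```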
Overall, while it's always good to have an alternative, I'm not quite convinced chatllm.cpp is a better choice than llama.cpp-based solutions like LM Studio. It seems to be based on llama.cpp anyway, except that it uses the deprecated GGML format instead of the newer GGUF, and it also lacks newer features of regular llama.cpp such as Flash Attention. Or maybe it does have it, but honestly I couldn't figure it out, because there's literally no help parameter that would give me a list of usable parameters and their purpose. In any case, the -fa parameter, which I believe activates Flash Attention in llama.cpp, did not work here.
Thanks for your feedback. It's strange that the downloader script is significantly slower than a regular download manager.
By the way, chatllm.cpp is based on ggml, not llama.cpp.
You're welcome, and I apologize for the confusion; you're right. I saw the author but confused ggml with llama.cpp, my bad.
It's strange to see that the old GGML format still exists in some form; it's been a while since I used it in llama.cpp-based apps. llama.cpp's own support for GGML files is long gone, and I always assumed GGUF was the better, improved version since it was newer. So, out of curiosity, what was the motivation behind staying with GGML?
I like simple things. ggml is simpler than other frameworks, so I like it. The GGML file format still works for me and is simpler than GGUF, so I stick with it.
GGUF itself is composed of a dictionary (which can contain anything) and a list of tensors, and it is decoupled from llama.cpp: one can create GGUF files that are not supported by llama.cpp. This looks weird to me. It is even possible to generate different versions of GGUF files from a single model, for example by renaming some keys from config.json, or by rearranging K/Q elements for RoPE. All of this information is tightly coupled to the inference app, so I don't think a "universal" file format helps.
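To illustrate the "dictionary plus tensor list" point: a GGUF file starts with a small fixed header (magic, version, tensor count, metadata key/value count), and everything after that is the arbitrary KV dictionary and the tensor descriptions. A minimal sketch that reads just that header with the Python standard library, assuming a placeholder file name:

```python
# Sketch: peek at a GGUF file's fixed-size header ("model.gguf" is a placeholder).
import struct

def read_gguf_header(path: str):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return version, n_tensors, n_kv

version, n_tensors, n_kv = read_gguf_header("model.gguf")
print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata key/value pairs")
```

Since that metadata dictionary can hold anything, two GGUF files of the same model can carry different keys, which is exactly why a loader still has to know the conventions of the app that wrote them.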