
Xuan Son NGUYEN

ngxson

AI & ML interests

Doing AI for fun, not for profit

Organizations

Hugging Face, Blog-explorers, Hugging Face TB Research, ggml.ai, Hugging Face Discord Community, Consumer AI Edge Hackathon (Meta, Hugging Face, Pytorch, Scaleway & Unaite)

ngxson's activity

replied to their post 3 days ago

Yes, sure!

The first step is to generate the PEFT-compatible LoRA adapter; I used mergekit-extract-lora to do that. Please note that some bigger models (Qwen/Llama 70B) give errors that I don't know how to fix; hopefully that will be fixed soon. You can find more info about mergekit here: https://github.com/arcee-ai/mergekit

Next step is to convert PEFT to GGUF, I used this space: https://huggingface.co/spaces/ggml-org/gguf-my-lora

Then it's good to go!

Please note that the space can convert any PEFT LoRA adapter to GGUF, so if you're using something like Unsloth, it is straightforward to convert it into a GGUF LoRA (no need to merge it into the base model).
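If you want to double-check the intermediate result, here's a minimal sketch that verifies the adapter extracted by mergekit-extract-lora actually loads as a PEFT adapter before you send it through gguf-my-lora. The paths and model names are placeholders, not from my setup:

```python
# Minimal sketch: sanity-check a PEFT LoRA adapter extracted by
# mergekit-extract-lora before converting it to GGUF.
# Paths below are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")

# If this loads without errors, the adapter is PEFT-compatible and can be
# uploaded to the gguf-my-lora space for conversion to GGUF.
model = PeftModel.from_pretrained(base, "path/to/extracted-lora-adapter")
print(model.peft_config)
```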

upvoted an article 3 days ago

Run ComfyUI workflows for free on Spaces

New activity in 5CD-AI/Vintern-1B-v3_5 3 days ago

Deployment as server?

#1 opened 4 days ago by ngxson
replied to their post 4 days ago
posted an update 4 days ago
Check out my collection of pre-made GGUF LoRA adapters!

This allows you to use both the normal and abliterated versions of popular models like Llama, Qwen, etc., without having to double the amount of VRAM usage.

ngxson/gguf_lora_collection
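As a rough sketch of how that looks in practice (assuming llama-cpp-python; the file names are placeholders), you load the base GGUF once and apply the GGUF LoRA adapter on top:

```python
# Minimal sketch: one base GGUF model plus a GGUF LoRA adapter,
# instead of keeping two full copies of the model in VRAM.
# File names are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="base-model-Q4_K_M.gguf",  # the shared base weights
    lora_path="abliterated-lora.gguf",    # adapter from the collection
)
print(llm("Hello, world!", max_tokens=32)["choices"][0]["text"])
```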
reacted to bartowski's post with 👀👍 7 days ago
Looks like Q4_0_N_M file types are going away

Before you panic, there's a new "preferred" method which is online (I prefer the term on-the-fly) repacking: if you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses, I think, due to using intrinsics instead of assembly, but intrinsics are more maintainable).

You can see the reference PR here:

https://github.com/ggerganov/llama.cpp/pull/10446

So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speeds (though it may currently be bugged on some platforms)

As such, I'll stop making those newer model formats soon, probably by the end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!

Also, IQ4_NL supports repacking, though not in as many shapes yet, but it should get a respectable speed-up on ARM chips. The PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541

Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights
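For context, a minimal sketch of what "just download Q4_0" looks like in practice (assuming llama-cpp-python and huggingface_hub; the repo and file names are placeholders):

```python
# Minimal sketch: download a plain Q4_0 quant and load it.
# On CPUs that support it, llama.cpp repacks the weights into interleaved
# rows at load time (per the PR linked above), so no separate
# Q4_0_4_4 file is needed. Repo and file names are placeholders.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="some-user/Some-Model-GGUF",  # placeholder repo
    filename="some-model-Q4_0.gguf",      # plain Q4_0 quant
)
llm = Llama(model_path=gguf_path, n_threads=8)
print(llm("Why repack Q4_0 weights?", max_tokens=48)["choices"][0]["text"])
```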
New activity in 5CD-AI/Viet-Doc-VQA-verIII 7 days ago