cgus committed (verified)
Commit 1a33b16 · 1 parent: be6dbb3

Update README.md

Files changed (1): README.md (+18 -3)
@@ -4,9 +4,24 @@ language:
 - zh
 - en
 pipeline_tag: text-generation
-library_name: transformers
+library_name: exllamav2
+base_model:
+- THUDM/GLM-4-9B-0414
 ---
-
+# GLM-4-9B-0414-exl2
+Original model: [GLM-4-9B-0414](https://huggingface.co/THUDM/GLM-4-9B-0414) by [Z.ai & THUKEG](https://huggingface.co/THUDM)
+
+## Quants
+[4bpw h6 (main)](https://huggingface.co/cgus/GLM-4-9B-0414-exl2/tree/main)
+[4.5bpw h6](https://huggingface.co/cgus/GLM-4-9B-0414-exl2/tree/4.5bpw-h6)
+[5bpw h6](https://huggingface.co/cgus/GLM-4-9B-0414-exl2/tree/5bpw-h6)
+[6bpw h6](https://huggingface.co/cgus/GLM-4-9B-0414-exl2/tree/6bpw-h6)
+[8bpw h8](https://huggingface.co/cgus/GLM-4-9B-0414-exl2/tree/8bpw-h8)
+
+## Quantization notes
+Made with Exllamav2 0.2.9 using the default calibration dataset; these quants require Exllamav2 0.2.9 or newer.
+They can be used with TabbyAPI or Text-Generation-WebUI on RTX GPUs (Windows) or RTX/ROCm GPUs (Linux).
+Make sure the model fits your GPU VRAM, since Exllamav2 doesn't support RAM offloading.
 # GLM-4-9B-0414
 
 ## Introduction
@@ -393,4 +408,4 @@ while True:
 
 [2] [Agentless v1.5.0](https://github.com/OpenAutoCoder/Agentless) used [BGE](https://github.com/FlagOpen/FlagEmbedding/blob/master/README.md) as the embedding model and [FAISS](https://github.com/facebookresearch/faiss) for similarity search. To speed up patch verification while maintaining performance, the timeout for running a single instance was changed from the default 300s to 180s.
 
-[3] [OpenHands v0.29.1](https://github.com/All-Hands-AI/OpenHands/tree/main) did not use YaRN context extension but limited runs to a maximum of 60 iterations and summarized the history to prevent exceeding the 32K context limit. Summarization was configured as `llm_config="condenser", keep_first=1, max_size=32`. No retries on failed trajectories.
+[3] [OpenHands v0.29.1](https://github.com/All-Hands-AI/OpenHands/tree/main) did not use YaRN context extension but limited runs to a maximum of 60 iterations and summarized the history to prevent exceeding the 32K context limit. Summarization was configured as `llm_config="condenser", keep_first=1, max_size=32`. No retries on failed trajectories.
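
The quantization notes added above advise checking that a quant fits your GPU VRAM, since Exllamav2 cannot offload weights to system RAM. Weight size scales linearly with bits per weight (roughly params × bpw / 8 bytes), so a quick back-of-envelope check is possible. A minimal sketch — the helper and the round 9B parameter figure are illustrative, not part of Exllamav2, and KV cache plus runtime overhead need extra VRAM on top:

```python
def quant_size_gib(params_billions: float, bpw: float) -> float:
    """Rough weight-only footprint in GiB for a quantized model.

    Illustrative estimate: parameters * bits-per-weight / 8 bytes.
    Ignores KV cache and framework overhead, which require extra VRAM.
    """
    return params_billions * 1e9 * bpw / 8 / 2**30

# Weight-only estimates for a ~9B-parameter model at the listed bitrates:
for bpw in (4.0, 4.5, 5.0, 6.0, 8.0):
    print(f"{bpw} bpw ~ {quant_size_gib(9.0, bpw):.1f} GiB")
```

For example, at 4 bpw a 9B model needs roughly 4.2 GiB for weights alone, while the 8bpw h8 quant needs about twice that; pick the highest bitrate whose estimate, plus headroom for cache, stays under your GPU's VRAM.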