Thireus committed
Commit · 507efa3
1 Parent(s): 20b318b
Update README.md
README.md CHANGED
@@ -45,19 +45,21 @@ git clone https://github.com/Thireus/GGUF-Tool-Suite
  # Download model quant mix from recipe file:
  cd GGUF-Tool-Suite
  rm -f download.conf # Make sure to copy the relevant download.conf for the model before running quant_assign.py
- cp -f models/…
  mkdir -p kitchen && cd kitchen
- ../quant_downloader.sh ../recipe_examples/…

  # Other recipe examples can be found at https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples

  # Launch ik_llama's llama-server:
  ulimit -n 99999 # Lifts "too many open files" limitation on Linux
  ~/ik_llama.cpp/build/bin/llama-server \
- -m …
- -…
- -ot "blk\.(…)\.ffn_.*=CUDA0" \
- -ot "blk\.(…
+ -ot "blk\.(37|38|39|[4-6][0-9]|7[0-2])\.ffn_.*=CUDA1" \
+ -ot "blk\.(7[3-9])\.ffn_.*=CUDA2" \
+ -ot "blk\.(8[0-9]|90|91|92)\.ffn_.*=CPU" \
  -ot exps=CPU -b 2048 -ub 1024 --warmup-batch --no-mmap --threads 36 \
  --main-gpu 0
  ```
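The substantive part of this change is the reworked tensor-override layout: each `-ot <regex>=<device>` argument tells llama-server to place every tensor whose name matches the regular expression on that backend, so the alternations above are simply compact lists of layer indices. Below is a minimal sketch (plain Python, not part of the Tool Suite; the probed tensor name is an illustrative assumption) that expands the three new patterns to show which `ffn` blocks each device receives:

```python
import re

# The three regex -> device pairs added in this commit (the CUDA0 pattern is
# truncated in the diff view, so it is omitted here).
overrides = {
    r"blk\.(37|38|39|[4-6][0-9]|7[0-2])\.ffn_.*": "CUDA1",
    r"blk\.(7[3-9])\.ffn_.*": "CUDA2",
    r"blk\.(8[0-9]|90|91|92)\.ffn_.*": "CPU",
}

# Probe an illustrative ffn tensor name for each block index and report which
# device its tensors would be routed to under these patterns.
for pattern, device in overrides.items():
    blocks = [i for i in range(128)
              if re.fullmatch(pattern, f"blk.{i}.ffn_gate.weight")]
    print(f"{device}: blocks {blocks[0]}..{blocks[-1]} ({len(blocks)} layers)")
```

Expanded, that places the ffn tensors of blocks 37-72 on CUDA1, 73-79 on CUDA2 and 80-92 on the CPU, while the trailing `-ot exps=CPU` appears to act as a catch-all that keeps any remaining expert tensors in system RAM.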
@@ -76,9 +78,9 @@ ulimit -n 99999 # Lifts "too many open files" limitation on Linux

  ## 📊 How does it compare to other GGUFs?

- Here’s how …
+ Here’s how GLM-4.5 quantized with **Thireus’ GGUF Tool Suite** stacks up against other quantizers (lower perplexity = better at equal or lower bpw):

+ [image: GLM-4.5 perplexity comparison graph]

  > _Note: The `recipe_examples` files illustrate good recipes. The Tool Suite computes the optimal ppl/bpw curve for you — just specify your target RAM, VRAM, and quant types, and `quant_assign.py` finds the best mix._
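On the comparison claim added above, "lower perplexity = better at equal or lower bpw" is a Pareto-style reading of the graph: a quant mix only stands out if no other mix is at least as small in bits per weight and strictly lower in perplexity. The sketch below (plain Python with made-up numbers, purely illustrative and unrelated to the actual `quant_assign.py` implementation) filters candidate mixes down to that front:

```python
def pareto_front(points):
    """Keep the (bpw, ppl) points not beaten by any point that is no larger
    in bits per weight and strictly lower in perplexity."""
    front = []
    for bpw, ppl in sorted(points):          # walk candidates by ascending bpw
        if not front or ppl < front[-1][1]:  # keep only strict ppl improvements
            front.append((bpw, ppl))
    return front

# Hypothetical (bpw, perplexity) measurements for competing quant mixes.
candidates = [(3.1, 5.90), (3.4, 5.60), (3.6, 5.70), (4.2, 5.30), (4.8, 5.35)]
print(pareto_front(candidates))  # -> [(3.1, 5.9), (3.4, 5.6), (4.2, 5.3)]
```

Here the 3.6 bpw and 4.8 bpw mixes drop out because a cheaper mix already achieves lower perplexity, which is exactly the kind of point the graph is meant to expose.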