---
license: mit
---
# DeepSeek-TNG-R1T2-Chimera

## 🤔 What is this [HuggingFace repository](https://huggingface.co/Thireus/DeepSeek-TNG-R1T2-Chimera-THIREUS-BF16-SPECIAL_SPLIT/) about?

This repository provides **GGUF-quantized tensors** for the DeepSeek-TNG-R1T2-Chimera model (official repo: https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera). These GGUF shards are designed to be used with **Thireus’ GGUF Tool Suite** (https://gguf.thireus.com), a collection of tools that automatically finds the perplexity-optimal mix of quantizations for any given VRAM and RAM target. With the Tool Suite, you can generate and download custom quantization “recipes” effortlessly.

- 📖 Read more: https://github.com/Thireus/GGUF-Tool-Suite
- 🔍 Example quant mixes: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples
- 🛠️ Create your own recipe: https://colab.research.google.com/github/Thireus/GGUF-Tool-Suite/blob/main/quant_recipe_pipeline.ipynb
- 📂 Browse available quant shards: https://huggingface.co/Thireus/collections

*tl;dr:*
<details>

```
cd ~

# Make sure to install all ik_llama.cpp compilation dependencies...
apt install python3-dev python3-pip python3-venv python3-wheel python3-setuptools git acl netcat-openbsd cmake # pipx

# Obtain ik_llama's Thireus version - Windows builds available at https://github.com/Thireus/ik_llama.cpp/releases
git clone https://github.com/Thireus/ik_llama.cpp
cd ik_llama.cpp
git pull
# Build ik_llama.cpp
cmake -B build -DGGML_AVX=ON -DGGML_AVX2=ON -DLLAMA_CURL=OFF -DGGML_MAX_CONTEXTS=2048
cmake --build build --config Release -j16
cd ..

# Obtain Thireus' GGUF-Tool-Suite
git clone https://github.com/Thireus/GGUF-Tool-Suite

# Download model quant mix from recipe file:
cd GGUF-Tool-Suite
rm -f download.conf # Make sure to copy the relevant download.conf for the model before running quant_assign.py
cp -f models/DeepSeek-TNG-R1T2-Chimera/download.conf . # Use the download.conf of the chosen model
mkdir -p kitchen && cd kitchen
../quant_downloader.sh ../recipe_examples/DeepSeek-TNG-R1T2-Chimera.ROOT-3.0624bpw-3.3657ppl.238GB-GGUF_11GB-GPU_227GB-CPU.13549e6_1ac857a.recipe

# Launch ik_llama's llama-cli:
ulimit -n 99999 # Lifts "too many open files" limitation on Linux
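
# The -ot flags below map tensor-name regexes to devices: "blk\.(3|4|5|6)\.ffn_.*=CUDA0" pins
# the FFN/expert tensors of layers 3-6 to the first GPU, the next -ot does the same for layers
# 7-10 on CUDA1, and "exps=CPU" keeps the remaining routed-expert tensors in system RAM.
# Adjust the layer ranges to match your available VRAM.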
~/ik_llama.cpp/build/bin/llama-cli \
  -m DeepSeek-TNG-R1T2-Chimera-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01148.gguf \
  -mla 3 -fa -amb 512 -fmoe -ctk f16 -c 4096 -ngl 99 \
  -ot "blk\.(3|4|5|6)\.ffn_.*=CUDA0" \
  -ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
  -ot exps=CPU -b 2048 -ub 1024 --warmup-batch --no-mmap --threads 36 \
  --main-gpu 0 \
  -p '<|begin▁of▁sentence|><|User|>What is the solution of x+5=-2?<|Assistant|><think>\n'
```

</details>

---

## ❓ Why does this Tool Suite exist?

1. **Compatibility & Speed** – [unsloth](https://huggingface.co/unsloth)’s dynamic quants may not always work optimally with `ik_llama.cpp`.
2. **Custom Rig Fit** – No off-the-shelf GGUF model perfectly matched my VRAM/RAM setup, so I built a way to tailor models and leverage extra VRAM/RAM to reduce perplexity.
3. **Automated PPL-Optimal Quantization** – To my knowledge, there was no flexible, automated method to minimize perplexity for any bits-per-weight (bpw) target—so I created one with excellent results!

---

## 📊 How does it compare to other GGUFs?

Here’s how DeepSeek-R1-0528 quantized with **Thireus’ GGUF Tool Suite** stacks up against other quantizers (lower perplexity = better at equal or lower bpw):



> _Note: The `recipe_examples` files illustrate good recipes. The Tool Suite computes the optimal ppl/bpw curve for you — just specify your target RAM, VRAM, and quant types, and `quant_assign.py` finds the best mix._

More perplexity/bpw graphs for other supported models: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/ppl_graphs

---

## 🚀 How do I get started?

Check out the [GGUF Tool Suite README](https://github.com/Thireus/GGUF-Tool-Suite) — focus on these sections:

1. ⚠️ **Requirements** – Which `ik_llama.cpp` (or `llama.cpp`) version to use and how to compile it.
   - Windows binaries (no patching needed) at: https://github.com/Thireus/ik_llama.cpp/releases
2. 📥 **Download Model Shards** – Use `quant_downloader.sh` to fetch GGUF shards from any recipe.
   - Recipe examples: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples
3. 🧠 **Run a Downloaded Model** – Sample usage with `llama-cli`.
4. 🛠️ **Generate a Custom Recipe** – Produce recipes tailored to your rig for optimal perplexity.

---

## ✅ Supported Models

Supported models are listed under `models/` in the [Tool Suite GitHub repo](https://github.com/Thireus/GGUF-Tool-Suite/tree/main/models). The presence of `ppl_results.csv` indicates official support and compatibility with `quant_assign.py`.

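For a quick check from the directory where you cloned the Tool Suite (as in the tl;dr above), you can simply look for those files; the layout follows the `models/<model-name>/` convention used earlier:

```
# Lists every model directory that ships perplexity data, i.e. is supported by quant_assign.py
ls GGUF-Tool-Suite/models/*/ppl_results.csv
```
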
---

## 🤷‍♂️ Will I release pre-cooked GGUF files?

No, because I believe in **tailored quantization** for each user’s hardware. If you prefer a single ready-made GGUF file, you are welcome to merge the downloaded shards via `llama-gguf-split --merge` (see the sketch below), or request that someone publish merged files.

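A minimal sketch of such a merge, assuming the shards of a recipe have already been downloaded into the current directory (as in the tl;dr above) and that `llama-gguf-split` was built alongside `llama-cli`; the output filename is only an example:

```
# Point llama-gguf-split at the first shard; it locates the remaining shards automatically.
~/ik_llama.cpp/build/bin/llama-gguf-split --merge \
  DeepSeek-TNG-R1T2-Chimera-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01148.gguf \
  DeepSeek-TNG-R1T2-Chimera-merged.gguf
```
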
Instead, I prefer to share examples of recipes so users can see exactly how they were produced (the command is included inside these recipe files) and tweak them for their own rigs. The `quant_downloader.sh` script handles automatic fetching and verification of each shard. Recipes provided by [Ubergarm](https://huggingface.co/ubergarm) on his model cards are also compatible with `quant_downloader.sh`.

Users who don’t trust the GGUF shards on HuggingFace can also quantize their own by passing recipe lines to `llama-quantize --custom-q` ([see example](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/models/DeepSeek-R1-0528/DeepSeek-R1-0528-THIREUS-ANY-SPECIAL.sh#L482-L486)). Run `llama-quantize --help` to list the quant types compatible with `quant_assign.py`. This approach is especially useful if you prefer `llama.cpp` over `ik_llama.cpp`.

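For illustration only, here is the general shape of such a command, assuming `--custom-q` accepts comma-separated `regex=quant` pairs as in the linked script; the tensor regexes, quant types, and file names below are placeholders rather than a real recipe:

```
# Re-quantize a BF16 GGUF, overriding specific tensors with recipe lines (regex=quant).
~/ik_llama.cpp/build/bin/llama-quantize \
  --custom-q "output\.weight=q8_0,blk\..*\.ffn_down_exps\.weight=iq3_xxs" \
  DeepSeek-TNG-R1T2-Chimera-BF16-merged.gguf \
  DeepSeek-TNG-R1T2-Chimera-custom.gguf \
  q4_K_M
```
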
---

## 📦 What’s in this repository?

- **00001 GGUF header shard** – Contains the metadata (tokens, chat template, tensor count, etc.). This metadata can be explored directly from the HuggingFace web interface after clicking on that shard.
- **Tensor shards** – Each shard holds one tensor; see `tensors.map` for names, quant types, sizes, SHA-256 hashes, shard IDs, etc.
- **GPG-signed files** – `tensors.map` and the header shard are signed with the key in [trusted-keys.asc](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/trusted-keys.asc) for tamper detection (see the verification sketch after this list).
- **Security note** – Papers describing various ways to attack GGUFs and LLMs are available online, such as https://arxiv.org/abs/2505.23786, and more classic security exploits also exist, such as CVE-2024-23496 and CVE-2024-25664 through CVE-2024-25668. Only use GGUFs from reputable, trusted authors—or alternatively self-quantize—to avoid potential exploits.

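If you want to check the signatures yourself, here is a minimal sketch. It assumes you have downloaded `trusted-keys.asc` from the Tool Suite repository and that detached `.sig` files are published alongside the signed files; adjust the signature file names to whatever this repository actually provides:

```
# Import the signing key, then verify the signed files against their detached signatures.
gpg --import trusted-keys.asc
gpg --verify tensors.map.sig tensors.map
gpg --verify DeepSeek-TNG-R1T2-Chimera-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01148.gguf.sig \
  DeepSeek-TNG-R1T2-Chimera-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01148.gguf
```
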
---

## 💡 Pro Tips

You can download the BF16 model version to quantize your own shards:

```
mkdir kitchen
echo '.*=bf16' > kitchen/bf16.recipe
cd kitchen
../quant_downloader.sh bf16.recipe
```

Enjoy optimized quantization! 🎉