Mungert committed on
Commit d9170b6 · verified · 1 Parent(s): cbc833d

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +280 -0
README.md ADDED

---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- fr
- ar
- es
- pt
metrics:
- accuracy
base_model:
- BlinkDL/rwkv-7-world
pipeline_tag: text-generation
---

# <span style="color: #7FFF7F;">rwkv7-1.5B-world GGUF Models</span>

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **similar dynamic range** to FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device's specs; a quick check is sketched below).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.

📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

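If you are unsure whether your GPU actually exposes BF16, the short PyTorch sketch below reports what the CUDA runtime advertises. It assumes PyTorch with CUDA is installed; a positive result means the capability is reported, not that every kernel will run in BF16.

```python
# Minimal sketch (assumes PyTorch with CUDA is installed): check whether
# the GPU advertises native BF16 support before picking the BF16 GGUF file.
import torch

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    print("CUDA device reports BF16 support -> the BF16 file is a good fit.")
else:
    print("No native BF16 support detected -> prefer F16 or a quantized format.")
```
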
---

### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format with **high precision** but a narrower range of values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- More prone to overflow on extreme values than BF16 because of its smaller exponent range, but generally sufficient for inference.

📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.

📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You are tight on memory (16-bit weights are still roughly twice the size of 8-bit quantized weights).

---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, require more memory.

📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model (see the loading sketch below).
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce the **memory footprint** while keeping reasonable accuracy.

📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).

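As a concrete example, a quantized GGUF file can be run on CPU through the llama-cpp-python bindings, which wrap llama.cpp. This is a minimal sketch, assuming `llama-cpp-python` is installed (`pip install llama-cpp-python`) and a quantized file has already been downloaded locally; the file name below is an assumption, so point it at whichever variant you use.

```python
# Minimal CPU-inference sketch with llama-cpp-python (assumptions: the
# package is installed and the GGUF file below exists locally).
from llama_cpp import Llama

llm = Llama(
    model_path="rwkv7-1.5B-world-q4_k.gguf",  # hypothetical local path
    n_ctx=2048,     # context window
    n_threads=6,    # CPU threads to use
)

out = llm("What is a large language model?", max_tokens=128)
print(out["choices"][0]["text"])
```

Larger quants (Q6_K, Q8_0) are loaded the same way; only the file name changes.
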
---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.

---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| **Q4_K** | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, lower accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |

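A rough way to anticipate file sizes from this table is nominal bits per weight: for a 1.52B-parameter model, weight storage is roughly parameters × bits / 8. The sketch below uses nominal bit widths only, so treat the results as lower bounds; real GGUF files are somewhat larger because of block scales, metadata, and output/embedding tensors kept at higher precision.

```python
# Back-of-the-envelope GGUF size estimate (nominal bits per weight only;
# actual files are larger due to block scales, metadata, and
# higher-precision output/embedding tensors).
PARAMS = 1.52e9  # parameter count stated in the model card

for name, bits in [("BF16", 16), ("F16", 16), ("Q8_0", 8), ("Q6_K", 6), ("Q4_K", 4), ("IQ3_XS", 3)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name}: ~{gib:.2f} GiB (lower bound)")
```
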
---

## **Included Files & Details**

### `rwkv7-1.5B-world-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format.
- Best if your device supports **BF16 acceleration**.

### `rwkv7-1.5B-world-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.

### `rwkv7-1.5B-world-bf16-q8_0.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.

### `rwkv7-1.5B-world-f16-q8_0.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.

### `rwkv7-1.5B-world-q4_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.

### `rwkv7-1.5B-world-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.

### `rwkv7-1.5B-world-q6_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.

### `rwkv7-1.5B-world-q8_0.gguf`
- Fully **Q8_0** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.

### `rwkv7-1.5B-world-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.

### `rwkv7-1.5B-world-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.

### `rwkv7-1.5B-world-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- Prefer IQ4_NL for better accuracy.

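To fetch any of the files above programmatically, the `huggingface_hub` client (the same tool used to upload this README) can download a single GGUF file. The repository id below is an assumption based on this repo's naming; substitute the actual repo id shown on the model page.

```python
# Download one GGUF file with huggingface_hub (the repo_id below is an
# assumption -- replace it with the repo id shown on this model page).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Mungert/rwkv7-1.5B-world-GGUF",   # hypothetical repo id
    filename="rwkv7-1.5B-world-q4_k.gguf",     # pick any file listed above
)
print(path)  # local cache path of the downloaded file
```
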
# <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>

Please click "like" ❤. I'd also really appreciate it if you could test my Network Monitor Assistant at 👉 [Network Monitor Assistant](https://freenetworkmonitor.click/dashboard).

💬 Click the **chat icon** (bottom right of the main and dashboard pages), choose an LLM, and toggle between the LLM types TurboLLM -> FreeLLM -> TestLLM.

### What I'm Testing

I'm experimenting with **function calling** against my network monitoring service, using small open-source models. The question I'm interested in is: how small can a model go and still function?

🟡 **TestLLM** – Runs the current test model using llama.cpp on 6 threads of a CPU VM (it should take about 15s to load; inference is quite slow and it only processes one user prompt at a time; still working on scaling!). If you're curious, I'd be happy to share how it works!

### The Other Available AI Assistants

🟢 **TurboLLM** – Uses **gpt-4o-mini**. Fast! Note: tokens are limited since OpenAI models are pricey, but you can [Login](https://freenetworkmonitor.click) or [Download](https://freenetworkmonitor.click/download) the Free Network Monitor agent to get more tokens. Alternatively, use the FreeLLM.

🔵 **FreeLLM** – Runs **open-source Hugging Face models**. Medium speed (unlimited, subject to Hugging Face API availability).


# rwkv7-1.5B-world

<!-- Provide a quick summary of what the model is/does. -->

This is an RWKV-7 model in the flash-linear-attention format.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Bo Peng, Yu Zhang, Songlin Yang, Ruichong Zhang
- **Funded by:** RWKV Project (Under LF AI & Data Foundation)
- **Model type:** RWKV7
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Parameter count:** 1.52B
- **Tokenizer:** RWKV World tokenizer
- **Vocabulary size:** 65,536

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/fla-org/flash-linear-attention ; https://github.com/BlinkDL/RWKV-LM
- **Paper:** Work in progress

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Install `flash-linear-attention` and the latest version of `transformers` before using this model:

```bash
pip install git+https://github.com/fla-org/flash-linear-attention
pip install 'transformers>=4.48.0'
```
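
After installing, a quick sanity check can confirm both packages are importable and that the `transformers` version meets the requirement. This is a minimal sketch under the assumption that the `flash-linear-attention` package exposes the import name `fla`.

```python
# Post-install sanity check (assumption: the flash-linear-attention
# package is imported as `fla`).
import fla
import transformers

print("fla:", fla.__version__ if hasattr(fla, "__version__") else "installed")
print("transformers:", transformers.__version__)  # should be >= 4.48.0
```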

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
You can use this model just like any other Hugging Face model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('fla-hub/rwkv7-1.5B-world', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('fla-hub/rwkv7-1.5B-world', trust_remote_code=True)

model = model.cuda()
prompt = "What is a large language model?"
messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am a GPT-3 based model."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
)
# Strip the prompt tokens so only the newly generated continuation remains.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(response)
```

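If you want tokens to appear as they are generated rather than all at once, a `TextStreamer` can be attached to the same `generate` call. This is a small variation on the snippet above, not part of the original card, and assumes `model`, `tokenizer`, and `model_inputs` are already defined.

```python
# Streaming variant (assumes `model`, `tokenizer`, and `model_inputs`
# from the snippet above are already defined).
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)  # print only new tokens
model.generate(**model_inputs, max_new_tokens=1024, streamer=streamer)
```
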
## Training Details

### Training Data

This model is trained on the World v3 dataset with a total of 3.119 trillion tokens.

#### Training Hyperparameters

- **Training regime:** bfloat16, lr 4e-4 to 1e-5 "delayed" cosine decay (sketched below), wd 0.1 (with increasing batch sizes during the middle of training)
- **Final Loss:** 1.9965
- **Token Count:** 3.119 trillion

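The card does not define the "delayed" cosine decay precisely; one plausible reading, shown purely as an illustration, is a schedule that holds the peak learning rate for an initial fraction of training and then follows a cosine curve down to the floor. The hold fraction below is a made-up knob, not a documented value.

```python
import math

# Illustrative sketch of one possible "delayed" cosine decay: hold the peak
# LR for `hold_frac` of training, then cosine-decay to the floor.
# `hold_frac` is a hypothetical parameter, not taken from the model card.
def delayed_cosine_lr(step, total_steps, peak=4e-4, floor=1e-5, hold_frac=0.1):
    hold = int(total_steps * hold_frac)
    if step < hold:
        return peak
    progress = (step - hold) / max(1, total_steps - hold)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

# Example: peak LR at the start, floor LR at the end.
print(delayed_cosine_lr(0, 1000), delayed_cosine_lr(1000, 1000))
```
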
## Evaluation

#### Metrics

`lambada_openai`:

- before conversion: ppl 4.13, acc 69.4%
- after conversion: ppl 4.26, acc 68.8% (without applying the chat template)

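If you want to reproduce numbers like these, the `lambada_openai` task is available in EleutherAI's lm-evaluation-harness. The sketch below follows the `simple_evaluate` interface from that project's README; the exact arguments, and whether this remote-code model evaluates cleanly out of the box, are assumptions not verified here.

```python
# Reproduction sketch with lm-evaluation-harness (assumptions: `lm_eval` is
# installed and the HF model loads with trust_remote_code enabled).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=fla-hub/rwkv7-1.5B-world,trust_remote_code=True,dtype=bfloat16",
    tasks=["lambada_openai"],
)
print(results["results"]["lambada_openai"])
```
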
## FAQ
Q: The safetensors metadata is `None`.

A: Upgrade `transformers` to >= 4.48.0: `pip install 'transformers>=4.48.0'`
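
To see the metadata the FAQ entry refers to, the `safetensors` library can open a checkpoint file directly. This is a minimal sketch; the local file path is an assumption.

```python
# Inspect safetensors metadata (assumption: a local model.safetensors path).
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    print(f.metadata())  # `None` here is the symptom described in the FAQ
```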