---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- fr
- ar
- es
- pt
metrics:
- accuracy
base_model:
- BlinkDL/rwkv-7-world
pipeline_tag: text-generation
---

# <span style="color: #7FFF7F;">rwkv7-1.5B-world GGUF Models</span>

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **dynamic range similar to FP32** with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device's specs).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.

📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

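If you are unsure whether your GPU exposes BF16, a quick runtime check helps. The snippet below is a minimal sketch that assumes a CUDA-enabled PyTorch install; it is not part of this repo.

```bash
# Hypothetical quick check (assumes PyTorch with CUDA); prints True only if the GPU reports native BF16 support.
python -c "import torch; print(torch.cuda.is_available() and torch.cuda.is_bf16_supported())"
```
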
---

### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format with **high precision**, but a smaller range of representable values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16, but generally sufficient for inference.

📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.

📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have memory limitations.

---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**; may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, but require more memory.

📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model.
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce the **memory footprint** while keeping reasonable accuracy.

📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).

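As a concrete illustration, a quantized GGUF from this repo can be run on CPU with llama.cpp. This is a minimal sketch; the binary name (`llama-cli` in recent builds) and flags may differ for your build, and the thread count should match your machine.

```bash
# Hypothetical llama.cpp invocation for the Q4_K file (adjust binary name, path, and thread count to your setup).
./llama-cli -m rwkv7-1.5B-world-q4_k.gguf -p "What is a large language model?" -n 256 --threads 8
```
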
---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.

---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPUs/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
| **Q4_K** | Medium-Low | Low | CPU or low-VRAM devices | Memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, lower accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |

---

## **Included Files & Details**

### `rwkv7-1.5B-world-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format.
- Best if your device supports **BF16 acceleration**.

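To requantize from the BF16 file, llama.cpp ships a quantization tool. The command below is a sketch under the assumption of a recent llama.cpp build (where the binary is named `llama-quantize`); the Q5_K_M target type is only an example.

```bash
# Hypothetical requantization example: produce a Q5_K_M file from the BF16 GGUF (adjust paths and target type).
./llama-quantize rwkv7-1.5B-world-bf16.gguf rwkv7-1.5B-world-q5_k_m.gguf Q5_K_M
```
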
### `rwkv7-1.5B-world-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.

### `rwkv7-1.5B-world-bf16-q8_0.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.

### `rwkv7-1.5B-world-f16-q8_0.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.

### `rwkv7-1.5B-world-q4_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.

### `rwkv7-1.5B-world-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.

### `rwkv7-1.5B-world-q6_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.

### `rwkv7-1.5B-world-q8_0.gguf`
- Fully **Q8_0** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.

### `rwkv7-1.5B-world-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.

### `rwkv7-1.5B-world-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.

### `rwkv7-1.5B-world-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- Prefer **IQ4_NL** if you need better accuracy.

# <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>

Please click like ❤. I’d also really appreciate it if you could test my Network Monitor Assistant at 👉 [Network Monitor Assistant](https://freenetworkmonitor.click/dashboard).

💬 Click the **chat icon** (bottom right of the main and dashboard pages), choose an LLM, and toggle between the LLM types TurboLLM -> FreeLLM -> TestLLM.

### What I'm Testing

I'm experimenting with **function calling** against my network monitoring service using small open-source models, focused on the question: how small can a model go and still function?

🟡 **TestLLM** – Runs the current test model using llama.cpp on 6 threads of a CPU VM (it takes about 15 s to load; inference is quite slow and it only processes one user prompt at a time while I work on scaling). If you're curious, I'd be happy to share how it works!

### The Other Available AI Assistants

🟢 **TurboLLM** – Uses **gpt-4o-mini**. Fast! Note: tokens are limited since OpenAI models are pricey, but you can [log in](https://freenetworkmonitor.click) or [download](https://freenetworkmonitor.click/download) the Free Network Monitor agent to get more tokens, or use the FreeLLM instead.

🔵 **FreeLLM** – Runs **open-source Hugging Face models**. Medium speed (unlimited, subject to Hugging Face API availability).

# rwkv7-1.5B-world

<!-- Provide a quick summary of what the model is/does. -->

This is an RWKV-7 model in the flash-linear-attention format.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Bo Peng, Yu Zhang, Songlin Yang, Ruichong Zhang
- **Funded by:** RWKV Project (under the LF AI & Data Foundation)
- **Model type:** RWKV7
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Parameter count:** 1.52B
- **Tokenizer:** RWKV World tokenizer
- **Vocabulary size:** 65,536

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/fla-org/flash-linear-attention ; https://github.com/BlinkDL/RWKV-LM
- **Paper:** In progress

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Install `flash-linear-attention` and the latest version of `transformers` before using this model:

```bash
pip install git+https://github.com/fla-org/flash-linear-attention
pip install 'transformers>=4.48.0'
```

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
You can use this model just like any other Hugging Face model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('fla-hub/rwkv7-1.5B-world', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('fla-hub/rwkv7-1.5B-world', trust_remote_code=True)

model = model.cuda()
prompt = "What is a large language model?"
messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am a GPT-3 based model."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(response)
```

## Training Details

### Training Data

This model was trained on the World v3 dataset, with a total of 3.119 trillion tokens.

#### Training Hyperparameters

- **Training regime:** bfloat16, lr 4e-4 to 1e-5 "delayed" cosine decay, wd 0.1 (with increasing batch sizes during the middle)
- **Final Loss:** 1.9965
- **Token Count:** 3.119 trillion

## Evaluation

#### Metrics

`lambada_openai`:

- before conversion: ppl 4.13, acc 69.4%
- after conversion: ppl 4.26, acc 68.8% (without applying the chat template)

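For reference, the lambada_openai numbers can in principle be rechecked with EleutherAI's lm-evaluation-harness. The command below is a sketch, not the exact setup used here; it assumes `pip install lm-eval` and that the flags match your installed version.

```bash
# Hypothetical reproduction sketch with lm-evaluation-harness (flag names may vary across versions).
lm_eval --model hf \
  --model_args pretrained=fla-hub/rwkv7-1.5B-world,trust_remote_code=True \
  --tasks lambada_openai \
  --batch_size 8
```
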
## FAQ

Q: The safetensors metadata is none.

A: Upgrade transformers to >= 4.48.0: `pip install 'transformers>=4.48.0'`
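A quick way to confirm which version is active in your environment (a simple check, not specific to this repo):

```bash
# Print the installed transformers version; it should be 4.48.0 or newer.
python -c "import transformers; print(transformers.__version__)"
```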