rwkv7-1.5B-world GGUF Models
Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
BF16 (Brain Float 16) – Use if BF16 acceleration is available
- A 16-bit floating-point format designed for faster computation while retaining good precision.
- Provides similar dynamic range as FP32 but with lower memory usage.
- Recommended if your hardware supports BF16 acceleration (check your device's specs).
- Ideal for high-performance inference with reduced memory footprint compared to FP32.
📌 Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.
📌 Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
F16 (Float 16) – More widely supported than BF16
- A 16-bit floating-point format with high precision, but a smaller range of values than BF16.
- Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.
📌 Use F16 if:
✔ Your hardware supports FP16 but not BF16 (the capability-check sketch below shows one way to verify this).
✔ You need a balance between speed, memory usage, and accuracy.
✔ You are running on a GPU or another device optimized for FP16 computations.
📌 Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ You have memory limitations.
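Before committing to BF16 or F16, it can help to check what your hardware actually reports. Below is a minimal sketch using PyTorch, assuming `torch` is installed and you are targeting an NVIDIA GPU; the compute-capability thresholds are general guidance, not guarantees for every device.

```python
# Minimal capability check with PyTorch (assumption: torch installed, NVIDIA GPU target).
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")
    # torch reports whether the current CUDA device supports BF16 natively.
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    # FP16 tensor-core math is generally available from compute capability 7.0 (Volta) and up.
    print("FP16 likely supported:", (major, minor) >= (7, 0))
else:
    print("No CUDA device found; consider the quantized GGUF files for CPU inference.")
```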
Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- Lower-bit models (Q4_K) – Best for minimal memory usage, may have lower precision.
- Higher-bit models (Q6_K, Q8_0) – Better accuracy, requires more memory.
📌 Use Quantized Models if:
✔ You are running inference on a CPU and need an optimized model (see the loading sketch after this list).
✔ Your device has low VRAM and cannot load full-precision models.
✔ You want to reduce memory footprint while keeping reasonable accuracy.
📌 Avoid Quantized Models if:
❌ You need maximum accuracy (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
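As a concrete starting point for CPU or low-VRAM inference, here is a minimal sketch of loading one of the quantized files with llama-cpp-python. It assumes `pip install llama-cpp-python` with a llama.cpp build recent enough to support the RWKV-7 architecture; the context size and thread count are placeholder values, and the file name is one of the files listed further down in this card.

```python
# Minimal sketch: run a quantized GGUF on CPU with llama-cpp-python.
# Assumptions: llama-cpp-python installed, llama.cpp build with RWKV-7 support,
# and the q4_k file from this repository present on disk.
from llama_cpp import Llama

llm = Llama(
    model_path="rwkv7-1.5B-world-q4_k.gguf",  # pick the quant that fits your RAM/VRAM
    n_ctx=4096,      # context window (placeholder value)
    n_threads=6,     # CPU threads (placeholder value)
)

out = llm("What is a large language model?", max_tokens=128)
print(out["choices"][0]["text"])
```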
Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
These models are optimized for extreme memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.
IQ3_XS: Ultra-low-bit quantization (3-bit) with extreme memory efficiency.
- Use case: Best for ultra-low-memory devices where even Q4_K is too large.
- Trade-off: Lower accuracy compared to higher-bit quantizations.
IQ3_S: Small block size for maximum memory efficiency.
- Use case: Best for low-memory devices where IQ3_XS is too aggressive.
IQ3_M: Medium block size for better accuracy than IQ3_S.
- Use case: Suitable for low-memory devices where IQ3_S is too limiting.
Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
- Use case: Best for low-memory devices where Q6_K is too large.
Q4_0: Pure 4-bit quantization, optimized for ARM devices.
- Use case: Best for ARM-based devices or low-memory environments.
Summary Table: Model Format Selection
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|---|---|---|---|---|
| BF16 | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| F16 | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| Q4_K | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| Q6_K | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| IQ3_XS | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, lower accuracy |
| Q4_0 | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
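As a rough rule of thumb, memory footprint scales with bits per weight times parameter count. The sketch below is a back-of-the-envelope estimate only; the bits-per-weight figures are typical approximations for these quant types, not measurements of the files in this repository.

```python
# Back-of-the-envelope size estimate: parameters * bits-per-weight / 8.
# The bits-per-weight numbers are ballpark figures, not measured file sizes.
params = 1.52e9  # rwkv7-1.5B-world parameter count

approx_bits_per_weight = {
    "BF16/F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q4_K": 4.5,
    "IQ3_XS": 3.3,
}

for name, bpw in approx_bits_per_weight.items():
    gib = params * bpw / 8 / 1024**3
    print(f"{name:9s} ~{gib:.2f} GiB")
```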
Included Files & Details
rwkv7-1.5B-world-bf16.gguf
- Model weights preserved in BF16.
- Use this if you want to requantize the model into a different format (a requantization sketch follows this entry).
- Best if your device supports BF16 acceleration.
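If you do requantize, one way is to drive llama.cpp's quantize tool from a small script. Below is a minimal sketch, assuming you have built llama.cpp and the binary is named `llama-quantize` (older builds call it `quantize`); the Q5_K_M target type is just an example.

```python
# Minimal sketch: requantize the BF16 GGUF with llama.cpp's quantize tool.
# Assumptions: llama.cpp built locally, binary named `llama-quantize`,
# Q5_K_M chosen only as an example target type.
import subprocess

subprocess.run(
    [
        "./llama-quantize",
        "rwkv7-1.5B-world-bf16.gguf",    # source: full-precision weights
        "rwkv7-1.5B-world-q5_k_m.gguf",  # destination file
        "Q5_K_M",                        # target quantization type
    ],
    check=True,
)
```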
rwkv7-1.5B-world-f16.gguf
- Model weights stored in F16.
- Use if your device supports FP16, especially if BF16 is not available.
rwkv7-1.5B-world-bf16-q8_0.gguf
- Output & embeddings remain in BF16.
- All other layers quantized to Q8_0.
- Use if your device supports BF16 and you want a quantized version.
rwkv7-1.5B-world-f16-q8_0.gguf
- Output & embeddings remain in F16.
- All other layers quantized to Q8_0.
rwkv7-1.5B-world-q4_k.gguf
- Output & embeddings quantized to Q8_0.
- All other layers quantized to Q4_K.
- Good for CPU inference with limited memory.
rwkv7-1.5B-world-q4_k_s.gguf
- Smallest Q4_K variant, using less memory at the cost of accuracy.
- Best for very low-memory setups.
rwkv7-1.5B-world-q6_k.gguf
- Output & embeddings quantized to Q8_0.
- All other layers quantized to Q6_K.
rwkv7-1.5B-world-q8_0.gguf
- Fully Q8 quantized model for better accuracy.
- Requires more memory but offers higher precision.
rwkv7-1.5B-world-iq3_xs.gguf
- IQ3_XS quantization, optimized for extreme memory efficiency.
- Best for ultra-low-memory devices.
rwkv7-1.5B-world-iq3_m.gguf
- IQ3_M quantization, offering a medium block size for better accuracy.
- Suitable for low-memory devices.
rwkv7-1.5B-world-q4_0.gguf
- Pure Q4_0 quantization, optimized for ARM devices.
- Best for low-memory environments.
- Prefer IQ4_NL for better accuracy.
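To fetch any single file from this repository without cloning everything, the `huggingface_hub` client can download it by name. A minimal sketch, assuming `pip install huggingface_hub`; swap in whichever file from the list above fits your hardware.

```python
# Minimal sketch: download a single GGUF file from this repository by name.
# Assumption: huggingface_hub installed; file name chosen from the list above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Mungert/rwkv7-1.5B-world-GGUF",
    filename="rwkv7-1.5B-world-q4_k.gguf",
)
print("Downloaded to:", path)
```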
🚀 If you find these models useful
Please click like ❤. Also, I'd really appreciate it if you could test my Network Monitor Assistant at 👉 Network Monitor Assistant.
💬 Click the chat icon (bottom right of the main and dashboard pages). Choose an LLM; toggle between the LLM types TurboLLM -> FreeLLM -> TestLLM.
What I'm Testing
I'm experimenting with function calling against my network monitoring service, using small open-source models to explore the question: how small can a model go and still function?
🟡 TestLLM – Runs the current testing model using llama.cpp on 6 threads of a CPU VM (it should take about 15s to load; inference is quite slow and it only processes one user prompt at a time, still working on scaling!). If you're curious, I'd be happy to share how it works!
Other Available AI Assistants
🟢 TurboLLM – Uses gpt-4o-mini. Fast! Note: tokens are limited since OpenAI models are pricey, but you can log in or download the Free Network Monitor agent to get more tokens; alternatively, use the FreeLLM.
🔵 FreeLLM – Runs open-source Hugging Face models at medium speed (unlimited, subject to Hugging Face API availability).
rwkv7-1.5B-world
This is the RWKV-7 model in the flash-linear-attention format.
Model Details
Model Description
- Developed by: Bo Peng, Yu Zhang, Songlin Yang, Ruichong Zhang
- Funded by: RWKV Project (Under LF AI & Data Foundation)
- Model type: RWKV7
- Language(s) (NLP): English
- License: Apache-2.0
- Parameter count: 1.52B
- Tokenizer: RWKV World tokenizer
- Vocabulary size: 65,536
Model Sources
- Repository: https://github.com/fla-org/flash-linear-attention ; https://github.com/BlinkDL/RWKV-LM
- Paper: In progress
Uses
Install flash-linear-attention and the latest version of transformers before using this model:

```bash
pip install git+https://github.com/fla-org/flash-linear-attention
pip install 'transformers>=4.48.0'
```
Direct Use
You can use this model just like any other Hugging Face model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('fla-hub/rwkv7-1.5B-world', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('fla-hub/rwkv7-1.5B-world', trust_remote_code=True)

model = model.cuda()
prompt = "What is a large language model?"
messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am a GPT-3 based model."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template and append the generation prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
)
# Strip the prompt tokens so only the newly generated continuation remains.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(response)
```
Training Details
Training Data
This model is trained on the World v3 dataset with a total of 3.119 trillion tokens.
Training Hyperparameters
- Training regime: bfloat16, lr 4e-4 to 1e-5 "delayed" cosine decay, wd 0.1 (with increasing batch sizes during the middle)
- Final Loss: 1.9965
- Token Count: 3.119 trillion
Evaluation
Metrics
lambada_openai:
- Before conversion: ppl 4.13, acc 69.4%
- After conversion: ppl 4.26, acc 68.8% (without applying the chat template)
FAQ
Q: safetensors metadata is none.
A: Upgrade transformers to >= 4.48.0: `pip install 'transformers>=4.48.0'`
Model tree for Mungert/rwkv7-1.5B-world-GGUF
- Base model: BlinkDL/rwkv-7-world