---
license: apache-2.0
---
# <span style="color: #7FFF7F;">TriLM_190M_Unpacked GGUF Models</span>
## **Choosing the Right Model Format**
Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.
### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides **similar dynamic range** as FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device’s specs; a quick check is sketched after this list).
- Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32.
📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.
📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
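If you are unsure whether your GPU advertises BF16, a quick PyTorch check looks roughly like the sketch below (illustrative only, assuming a recent PyTorch build; on CPU-only machines it simply reports that no CUDA device is present):
```python
import torch

# Rough capability check: does the active CUDA device advertise BF16 support?
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    bf16_ok = torch.cuda.is_bf16_supported()
    print(f"{name}: BF16 supported = {bf16_ok}")
else:
    print("No CUDA device found; BF16 GPU acceleration is not available.")
```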
---
### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format offering **high precision** but a smaller range of representable values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.
📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.
📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have memory limitations.
---
### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while preserving as much accuracy as possible (a minimal CPU-loading sketch follows the checklists below).
- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, requires more memory.
📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model.
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce **memory footprint** while keeping reasonable accuracy.
📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
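As a concrete illustration, running one of the quantized files on a CPU with the `llama-cpp-python` bindings might look like this (a minimal sketch, assuming the package is installed and the GGUF file has already been downloaded; the path and thread count are placeholders):
```python
from llama_cpp import Llama

# Minimal CPU inference with a quantized GGUF file (path is a placeholder).
llm = Llama(
    model_path="./TriLM_190M_Unpacked-q4_k.gguf",  # any quantized file from the list below
    n_ctx=2048,     # context window
    n_threads=4,    # CPU threads to use
)

out = llm("Once upon a time", max_tokens=64)
print(out["choices"][0]["text"])
```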
---
### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.
- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
- **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
- **Trade-off**: Lower accuracy compared to higher-bit quantizations.
- **IQ3_S**: Small block size for **maximum memory efficiency**.
- **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
- **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
- **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
- **Use case**: Best for **ARM-based devices** or **low-memory environments**.
---
### **Summary Table: Model Format Selection**
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|------------|---------------|----------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
| **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
---
## **Included Files & Details**
### `TriLM_190M_Unpacked-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format (a requantization sketch follows below).
- Best if your device supports **BF16 acceleration**.
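If you want to produce a different quantization yourself, one way is to run llama.cpp's quantization tool over this BF16 file. A rough sketch, assuming a local llama.cpp build with `llama-quantize` on your PATH (older builds name the binary `quantize`) and placeholder file names:
```python
import subprocess

# Requantize the BF16 GGUF into Q4_K_M with llama.cpp's quantization tool.
# Assumes `llama-quantize` (from a local llama.cpp build) is on PATH; file names are placeholders.
subprocess.run(
    [
        "llama-quantize",
        "TriLM_190M_Unpacked-bf16.gguf",    # full-precision source
        "TriLM_190M_Unpacked-q4_k_m.gguf",  # quantized output
        "Q4_K_M",                           # target quantization type
    ],
    check=True,
)
```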
### `TriLM_190M_Unpacked-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.
### `TriLM_190M_Unpacked-bf16-q8_0.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.
### `TriLM_190M_Unpacked-f16-q8_0.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.
### `TriLM_190M_Unpacked-q4_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.
### `TriLM_190M_Unpacked-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.
### `TriLM_190M_Unpacked-q6_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.
### `TriLM_190M_Unpacked-q8_0.gguf`
- Fully **Q8** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.
### `TriLM_190M_Unpacked-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.
### `TriLM_190M_Unpacked-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.
### `TriLM_190M_Unpacked-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- Prefer IQ4_NL for better accuracy.
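To fetch any single file above programmatically, `huggingface_hub` can download it directly; the repo id below is an assumption for illustration, so substitute this repository's actual id:
```python
from huggingface_hub import hf_hub_download

# Download one GGUF file from the Hub.
# repo_id is an assumption for illustration; replace it with this repository's actual id.
path = hf_hub_download(
    repo_id="Mungert/TriLM_190M_Unpacked-GGUF",
    filename="TriLM_190M_Unpacked-q4_k.gguf",
)
print(path)  # local cache path of the downloaded file
```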
# <span id="testllm" style="color: #7F7FFF;">πŸš€ If you find these models useful</span>
Please click like ❤. Also, I'd really appreciate it if you could test my Network Monitor Assistant at 👉 [Network Monitor Assistant](https://freenetworkmonitor.click/dashboard).
💬 Click the **chat icon** (bottom right of the main and dashboard pages). Choose an LLM, then toggle between the LLM types TurboLLM → FreeLLM → TestLLM.
### What I'm Testing
I'm experimenting with **function calling** against my network monitoring service, using small open-source models to explore the question: how small can a model go and still function?
🟡 **TestLLM** – Runs the current testing model using llama.cpp on 6 threads of a CPU VM (it should take about 15s to load; inference is quite slow and it only processes one user prompt at a time; still working on scaling!). If you're curious, I'd be happy to share how it works!
### The other Available AI Assistants
🟢 **TurboLLM** – Uses **gpt-4o-mini**. Fast! Note: tokens are limited since OpenAI models are pricey, but you can [Login](https://freenetworkmonitor.click) or [Download](https://freenetworkmonitor.click/download) the Free Network Monitor agent to get more tokens, or use the FreeLLM instead.
🔵 **FreeLLM** – Runs **open-source Hugging Face models**. Medium speed (unlimited, subject to Hugging Face API availability).
# TriLM 190M Unpacked
TriLM (ternary model), unpacked to FP16 format - compatible with FP16 GEMMs. After unpacking, TriLM has the same architecture as LLaMa.
```python
import torch
import transformers

model_name = "SpectraSuite/TriLM_190M_Unpacked"
# Adjust the temperature, repetition penalty, top_k, top_p and other sampling parameters to your needs.
pipeline = transformers.pipeline("text-generation", model=model_name, model_kwargs={"torch_dtype": torch.float16}, device_map="auto")
# This is a base (pretrained) LLM that is not instruction- or chat-tuned, so you may need to adjust your prompt accordingly.
pipeline("Once upon a time")
```
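The comment in the snippet above suggests tuning the sampling parameters; one illustrative starting point (values are not tuned for this model) is:
```python
# Illustrative sampling settings for the pipeline defined above; adjust to taste.
pipeline(
    "Once upon a time",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
)
```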
* License: Apache 2.0
* We will use our GitHub repo for communication (including HF-repo-related queries). Feel free to open an issue at https://github.com/NolanoOrg/SpectraSuite.