---
base_model: Deci/DeciLM-7B-Instruct
language:
- multilingual
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- nlp
- code
quantized_by: ymcki
widget:
- messages:
  - role: user
    content: Can you provide ways to eat combinations of bananas and dragonfruits?
---

Original model: https://huggingface.co/Deci/DeciLM-7B-Instruct

## Prompt Template

```
### System:
{system_prompt}
### User:
{user_prompt}
### Assistant:

```

The GGUFs here require a [modified llama.cpp](https://github.com/ymcki/llama.cpp-b4139) that supports DeciLMForCausalLM's variable Grouped Query Attention. Please download and compile it before running the GGUFs in this repository.

Please note that the HF model of DeciLM-7B-Instruct uses dynamic NTK-aware RoPE scaling. llama.cpp doesn't support it yet, so my modification simply ignores the dynamic NTK-aware RoPE scaling setting in config.json. Since the GGUFs seem to be working for the time being, please use them as is until I figure out how to implement dynamic NTK-aware RoPE scaling.
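
To build the fork and do a quick smoke test with the prompt template above, something like the following should work. This is only a sketch: the CMake invocation and the `llama-cli` flags are standard upstream llama.cpp usage and are assumed to carry over to the fork unchanged; the quant file name is just an example.

```
git clone https://github.com/ymcki/llama.cpp-b4139
cd llama.cpp-b4139
cmake -B build
cmake --build build --config Release -j

# quick test with one of the GGUFs from the table below
./build/bin/llama-cli -m DeciLM-7B-Instruct.Q4_K_M.gguf -e \
  -p "### System:\nYou are a helpful assistant.\n### User:\nHello!\n### Assistant:\n" \
  -n 128
```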

## Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| [DeciLM-7B-Instruct.f16.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.f16.gguf) | f16 | 14.1GB | Full F16 weights. |
| [DeciLM-7B-Instruct.Q8_0.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q8_0.gguf) | Q8_0 | 7.49GB | Extremely high quality, *recommended*. |
| [DeciLM-7B-Instruct.Q4_K_M.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_K_M.gguf) | Q4_K_M | 4.24GB | Very good quality, *recommended*. |
| [DeciLM-7B-Instruct.Q4_0.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_0.gguf) | Q4_0 | 4GB | Good quality. |
| [DeciLM-7B-Instruct.Q4_0_4_4.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_0_4_4.gguf) | Q4_0_4_4 | 4GB | Good quality, *recommended for edge devices with <8GB RAM*. |
| [DeciLM-7B-Instruct.Q4_0_4_8.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_0_4_8.gguf) | Q4_0_4_8 | 4GB | Good quality, *recommended for edge devices with <8GB RAM*. |
| [DeciLM-7B-Instruct.Q4_0_8_8.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_0_8_8.gguf) | Q4_0_8_8 | 4GB | Good quality, *recommended for edge devices with <8GB RAM*. |

## How to check i8mm and SVE support for ARM devices

ARM i8mm support is necessary to take advantage of the Q4_0_4_8 GGUF. All ARM architectures >= ARMv8.6-A support i8mm.

ARM SVE support is necessary to take advantage of the Q4_0_8_8 GGUF. SVE is an optional feature starting from ARMv8.2-A, but the majority of ARM chips do not implement it.

For ARM devices with neither, Q4_0_4_4 is recommended.

With the appropriate support, inference speed should be faster in the order Q4_0_8_8 > Q4_0_4_8 > Q4_0_4_4 > Q4_0, without much effect on response quality.

Here is a [list](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) of ARM CPUs and the ARM instructions they support, and another [list](https://raw.githubusercontent.com/ThomasKaiser/sbc-bench/refs/heads/master/sbc-bench.sh). They only cover a limited number of ARM CPUs, so it is better to check for i8mm and SVE support yourself.

For Apple devices:

```
sysctl hw
```
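
On macOS the full `sysctl hw` dump is long, so it is easier to filter for the relevant keys. The `FEAT_I8MM` key name shown in the comment is what recent macOS releases report and is an assumption for older ones:

```
sysctl hw | grep -iE 'i8mm|sve'
# a line like "hw.optional.arm.FEAT_I8MM: 1" indicates i8mm support
```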

For other ARM devices (i.e., most Android devices):
```
cat /proc/cpuinfo
```
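
The flags show up in the `Features` line; a quick filter (flag names are lowercase in `/proc/cpuinfo`):

```
grep -m1 Features /proc/cpuinfo                # full feature list of the first core
grep -owE 'i8mm|sve' /proc/cpuinfo | sort -u   # just the two flags, if present
```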

There are also Android apps that can display /proc/cpuinfo.

I was told that for Intel/AMD CPU inference, support for AVX2/AVX512 can also improve the performance of Q4_0_8_8.

On the other hand, Nvidia 3090 inference is significantly faster with Q4_0 than with the other GGUFs, so for GPU inference you are better off using Q4_0.

## Which Q4_0 model to use for ARM devices
| Brand | Series | Model | i8mm | sve | Quant Type |
| ----- | ------ | ----- | ---- | --- | -----------|
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
| Apple | M | M1 | No | No | Q4_0_4_4 |
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |
| Google | Tensor | G1,G2 | No | No | Q4_0_4_4 |
| Google | Tensor | G3,G4 | Yes | Yes | Q4_0_8_8 |
| Samsung | Exynos | 2200,2400 | Yes | Yes | Q4_0_8_8 |
| Mediatek | Dimensity | 9000,9000+ | Yes | Yes | Q4_0_8_8 |
| Mediatek | Dimensity | 9300 | Yes | No | Q4_0_4_8 |
| Qualcomm | Snapdragon | 7+ Gen 2,8/8+ Gen 1 | Yes | Yes | Q4_0_8_8 |
| Qualcomm | Snapdragon | 8 Gen 2,8 Gen 3,X Elite | Yes | No | Q4_0_4_8 |
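
On Linux/Android shells, the table above can be collapsed into a small check. This is only a convenience sketch and assumes `/proc/cpuinfo` exposes the `i8mm` and `sve` flags as described earlier:

```
#!/bin/sh
# Pick a Q4_0 variant from the /proc/cpuinfo feature flags.
if grep -qw i8mm /proc/cpuinfo && grep -qw sve /proc/cpuinfo; then
  echo "Q4_0_8_8"
elif grep -qw i8mm /proc/cpuinfo; then
  echo "Q4_0_4_8"
else
  echo "Q4_0_4_4"
fi
```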

## Convert safetensors to f16 gguf

Make sure you have the modified llama.cpp git cloned:

```
python3 convert_hf_to_gguf.py DeciLM-7B-Instruct/ --outfile DeciLM-7B-Instruct.f16.gguf --outtype f16
```
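
If you want to reproduce the conversion, you also need the original safetensors checkpoint and the Python dependencies of the conversion script. A rough sketch, assuming the fork keeps upstream's `requirements.txt` and that you run it from inside the llama.cpp checkout:

```
pip install -r requirements.txt
huggingface-cli download Deci/DeciLM-7B-Instruct --local-dir DeciLM-7B-Instruct
```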

## Convert f16 gguf to Q8_0 gguf without imatrix
Make sure you have llama.cpp compiled:
```
./llama-quantize DeciLM-7B-Instruct.f16.gguf DeciLM-7B-Instruct.Q8_0.gguf q8_0
```
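
The other quants in the table are produced the same way, just with a different type name, e.g.:

```
./llama-quantize DeciLM-7B-Instruct.f16.gguf DeciLM-7B-Instruct.Q4_K_M.gguf q4_k_m
```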

## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download ymcki/DeciLM-7B-Instruct-GGUF --include "DeciLM-7B-Instruct.Q8_0.gguf" --local-dir ./
```
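
To grab several quants at once, `--include` also accepts glob patterns, e.g.:

```
huggingface-cli download ymcki/DeciLM-7B-Instruct-GGUF --include "DeciLM-7B-Instruct.Q4_0*.gguf" --local-dir ./
```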

## Credits

Thank you bartowski for providing a README.md to get me started.