	---
	base_model: Deci/DeciLM-7B-Instruct
	language:
	- multilingual
	library_name: transformers
	license: apache-2.0
	pipeline_tag: text-generation
	tags:
	- nlp
	- code
	quantized_by: ymcki
	widget:
	- messages:
	  - role: user
	    content: Can you provide ways to eat combinations of bananas and dragonfruits?
	---

	Original model: https://huggingface.co/Deci/DeciLM-7B-Instruct

	## Prompt Template

	```
	### System:
	{system_prompt}
	### User:
	{user_prompt}
	### Assistant:

	```
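
As a minimal sketch of how the template above might be filled in from Python: `build_prompt` and its default system prompt are hypothetical names chosen for illustration, not part of any library, and the trailing newline after `### Assistant:` is an assumption based on the blank line in the template.

```python
def build_prompt(user_prompt: str,
                 system_prompt: str = "You are a helpful assistant.") -> str:
    """Fill the prompt template from this README.

    The default system prompt is a placeholder assumption; replace it
    with whatever system message suits your use case.
    """
    return (
        "### System:\n"
        f"{system_prompt}\n"
        "### User:\n"
        f"{user_prompt}\n"
        "### Assistant:\n"
    )


if __name__ == "__main__":
    # Print the filled template for the widget example question.
    print(build_prompt(
        "Can you provide ways to eat combinations of bananas and dragonfruits?"
    ))
```

The resulting string can be passed as the raw prompt to whichever inference front end you use.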

	The GGUFs in this repository require a [modified llama.cpp](https://github.com/ymcki/llama.cpp-b4139) that supports DeciLMForCausalLM's variable Grouped Query Attention. Please download and compile it to run them.

	Please note that the HF model of DeciLM-7B-Instruct uses dynamic NTK-aware RoPE scaling. llama.cpp doesn't support it yet, so my modification simply ignores the dynamic NTK-aware RoPE scaling setting in config.json. Since the GGUFs seem to work for the time being, please use them as-is until I figure out how to implement dynamic NTK-aware RoPE scaling.

	## Download a file (not the whole branch) from below:

	| Filename | Quant type | File Size | Description |
	| -------- | ---------- | --------- | ----------- |
	| [DeciLM-7B-Instruct.f16.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.f16.gguf) | f16 | 14.1GB | Full F16 weights. |
	| [DeciLM-7B-Instruct.Q8_0.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q8_0.gguf) | Q8_0 | 7.49GB | Extremely high quality, recommended. |
	| [DeciLM-7B-Instruct.Q4_K_M.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_K_M.gguf) | Q4_K_M | 4.24GB | Very good quality, recommended. |
	| [DeciLM-7B-Instruct.Q4_0.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_0.gguf) | Q4_0 | 4GB | Good quality. |
	| [DeciLM-7B-Instruct.Q4_0_4_4.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_0_4_4.gguf) | Q4_0_4_4 | 4GB | Good quality, recommended for edge devices with <8GB RAM. |
	| [DeciLM-7B-Instruct.Q4_0_4_8.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_0_4_8.gguf) | Q4_0_4_8 | 4GB | Good quality, recommended for edge devices with <8GB RAM. |
	| [DeciLM-7B-Instruct.Q4_0_8_8.gguf](https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF/blob/main/DeciLM-7B-Instruct.Q4_0_8_8.gguf) | Q4_0_8_8 | 4GB | Good quality, recommended for edge devices with <8GB RAM. |

	## How to check i8mm and sve support for ARM devices

	ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. All ARM architectures from ARMv8.6-A onward support i8mm.

	ARM sve support is necessary to take advantage of the Q4_0_8_8 gguf. sve is an optional feature available from ARMv8.2-A onward, but the majority of ARM chips don't implement it.

	For ARM devices with neither, Q4_0_4_4 is recommended.

	With this support, inference speed should rank, fastest first, Q4_0_8_8 > Q4_0_4_8 > Q4_0_4_4 > Q4_0, without much effect on response quality.

	This is a [list](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) of ARM CPUs and the ARM instructions they support. Here is another [list](https://raw.githubusercontent.com/ThomasKaiser/sbc-bench/refs/heads/master/sbc-bench.sh). Both appear to cover only a limited number of ARM CPUs, so it is better to check for i8mm and sve support yourself.

	For Apple devices,

	```
	sysctl hw
	```

	For other ARM devices (i.e. most Android devices),
	```
	cat /proc/cpuinfo
	```

	There are also Android apps that can display /proc/cpuinfo.
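
The check can also be scripted. The sketch below assumes a Linux-style `/proc/cpuinfo` on an ARM device, where each core reports a `Features` line listing flags such as `i8mm` and `sve`. The function names `cpu_features` and `pick_q4_0_variant` are hypothetical, and the mapping simply follows the recommendations in this README (sve first, then i8mm, otherwise Q4_0_4_4).

```python
from pathlib import Path


def cpu_features(cpuinfo_text: str) -> set[str]:
    """Collect the flags from every 'Features' line of /proc/cpuinfo-style text.

    Note: this takes the union over all cores, so on SoCs whose core
    clusters differ it may report a flag that only some cores have.
    """
    features: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            features.update(value.split())
    return features


def pick_q4_0_variant(features: set[str]) -> str:
    """Map ARM CPU flags to the quant type this README recommends."""
    if "sve" in features:
        return "Q4_0_8_8"
    if "i8mm" in features:
        return "Q4_0_4_8"
    return "Q4_0_4_4"


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        # Only meaningful on ARM; x86 kernels use a 'flags' line instead.
        print(pick_q4_0_variant(cpu_features(path.read_text())))
    else:
        print("/proc/cpuinfo not found; on Apple devices inspect `sysctl hw` instead.")
```

Treat the result as a starting point and confirm it by benchmarking the candidate ggufs on your own device.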

	I was told that for Intel/AMD CPU inference, support for AVX2/AVX512 can also improve the performance of Q4_0_8_8.

	On the other hand, inference on an Nvidia 3090 is significantly faster for Q4_0 than for the other ggufs. That means for GPU inference, you are better off using Q4_0.

	## Which Q4_0 model to use for ARM devices
	| Brand | Series | Model | i8mm | sve | Quant Type |
	| ----- | ------ | ----- | ---- | --- | ---------- |
	| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
	| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
	| Apple | M | M1 | No | No | Q4_0_4_4 |
	| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |
	| Google | Tensor | G1, G2 | No | No | Q4_0_4_4 |
	| Google | Tensor | G3, G4 | Yes | Yes | Q4_0_8_8 |
	| Samsung | Exynos | 2200, 2400 | Yes | Yes | Q4_0_8_8 |
	| Mediatek | Dimensity | 9000, 9000+ | Yes | Yes | Q4_0_8_8 |
	| Mediatek | Dimensity | 9300 | Yes | No | Q4_0_4_8 |
	| Qualcomm | Snapdragon | 7+ Gen 2, 8/8+ Gen 1 | Yes | Yes | Q4_0_8_8 |
	| Qualcomm | Snapdragon | 8 Gen 2, 8 Gen 3, X Elite | Yes | No | Q4_0_4_8 |

	## Convert safetensors to f16 gguf

	Make sure you have cloned the llama.cpp git repository:

	```
	python3 convert_hf_to_gguf.py DeciLM-7B-Instruct/ --outfile DeciLM-7B-Instruct.f16.gguf --outtype f16
	```

	## Convert f16 gguf to Q8_0 gguf without imatrix
	Make sure you have llama.cpp compiled:
	```
	./llama-quantize DeciLM-7B-Instruct.f16.gguf DeciLM-7B-Instruct.Q8_0.gguf q8_0
	```
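
To produce several quant types from the same f16 file, the command above can be repeated in a loop. This is a rough sketch, not the exact procedure used for this repository: the helper names and the `QUANT_TYPES` list are illustrative assumptions, and it assumes `./llama-quantize` accepts each type name in lowercase, as the `q8_0` example above does.

```python
import subprocess

F16_GGUF = "DeciLM-7B-Instruct.f16.gguf"
# Illustrative subset of the quant types listed in the download table.
QUANT_TYPES = ["Q8_0", "Q4_K_M", "Q4_0"]


def quantize_command(quant_type: str, binary: str = "./llama-quantize") -> list[str]:
    """Build the argument list for one llama-quantize run."""
    out_file = F16_GGUF.replace(".f16.", f".{quant_type}.")
    # The type is lowercased to mirror the q8_0 example in this README.
    return [binary, F16_GGUF, out_file, quant_type.lower()]


def quantize_all(dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for quant_type in QUANT_TYPES:
        cmd = quantize_command(quant_type)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default so nothing is executed by accident.
    quantize_all(dry_run=True)
```

Call `quantize_all(dry_run=False)` from the directory containing the compiled binary and the f16 gguf to actually run the conversions.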

	## Downloading using huggingface-cli

	First, make sure you have huggingface-cli installed:

	```
	pip install -U "huggingface_hub[cli]"
	```

	Then, you can target the specific file you want:

	```
	huggingface-cli download ymcki/DeciLM-7B-Instruct-GGUF --include "DeciLM-7B-Instruct.Q8_0.gguf" --local-dir ./
	```
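
The same single-file download can be done from Python with `hf_hub_download` from the `huggingface_hub` package installed above. The wrapper names `gguf_filename` and `download` are hypothetical helpers for illustration, and the filename pattern is assumed from the download table.

```python
REPO_ID = "ymcki/DeciLM-7B-Instruct-GGUF"


def gguf_filename(quant_type: str) -> str:
    """Build a file name following the pattern used in the download table."""
    return f"DeciLM-7B-Instruct.{quant_type}.gguf"


def download(quant_type: str, local_dir: str = "./") -> str:
    """Download one gguf file (not the whole repository) and return its local path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant_type),
        local_dir=local_dir,
    )
```

For example, `download("Q4_K_M")` would fetch only the Q4_K_M file into the current directory.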

	## Credits

	Thank you, bartowski, for providing a README.md to get me started.