This repository contains alternative quantizations of OpenHermes-2.5-Mistral-7B (https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and work out of the box.
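
As a quick sanity check, the files can also be loaded through the llama-cpp-python bindings. This is only a minimal sketch, assuming the package is installed and one of the GGUF files from the table below has been downloaded locally; it is not specific to this repo.

```python
# Minimal sketch using the llama-cpp-python bindings
# (assumed installed via `pip install llama-cpp-python`).
# The file name matches the Q4_K_M entry in the table below
# and is assumed to be downloaded locally.
from llama_cpp import Llama

llm = Llama(
    model_path="oh-2.5-m7b-q4k-medium.gguf",  # any GGUF file from this repo
    n_ctx=512,  # context length used for the perplexity numbers below
)

out = llm("Write one sentence about llamas.", max_tokens=64)
print(out["choices"][0]["text"])
```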

I'm careful to say "alternative" rather than "better" or "improved", as I have not put any effort into evaluating performance differences in actual usage. Perplexity is lower than with the "official" llama.cpp quantization (e.g., as provided by https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF), but perplexity is not necessarily a good measure of real-world performance. Nevertheless, perplexity does measure quantization error, so the table below compares the perplexities of these quantized models to the current llama.cpp quantization approach on Wikitext with a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized model) - PPL(fp16)) / PPL(fp16); a small worked example follows the table.

| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
|---|---|---|---|---|---|
| Q3_K_S | oh-2.5-m7b-q3k-small.gguf | 6.8943 | 7.30% | 6.7228 | 4.63% |
| Q3_K_M | oh-2.5-m7b-q3k-medium.gguf | 6.7366 | 4.84% | 6.5899 | 2.56% |
| Q4_K_S | oh-2.5-m7b-q4k-small.gguf | 6.5720 | 2.28% | 6.4778 | 0.82% |
| Q4_K_M | oh-2.5-m7b-q4k-medium.gguf | 6.5322 | 1.66% | 6.4740 | 0.76% |
| Q5_K_S | oh-2.5-m7b-q5k-small.gguf | 6.4668 | 0.64% | 6.4428 | 0.27% |
| Q5_K_M | oh-2.5-m7b-q5k-medium.gguf | 6.4536 | 0.44% | 6.4422 | 0.26% |
| Q4_0 | oh-2.5-m7b-q40.gguf | 6.5443 | 1.85% | 6.5454 | 1.87% |
| Q4_1 | oh-2.5-m7b-q41.gguf | 6.6246 | 3.10% | 6.4810 | 0.87% |
| Q5_0 | oh-2.5-m7b-q50.gguf | 6.4731 | 0.74% | 6.4554 | 0.47% |
| Q5_1 | oh-2.5-m7b-q51.gguf | 6.4818 | 0.88% | 6.4390 | 0.21% |
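
For concreteness, the sketch below shows how a "Quantization Error" entry is obtained from the two perplexities. The fp16 perplexity is not listed in the table, so the value used here is back-computed from the table rows (roughly 6.425) and should be treated as approximate.

```python
# Worked example of the "Quantization Error" definition:
#   error = (PPL(quantized) - PPL(fp16)) / PPL(fp16)
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    return (ppl_quantized - ppl_fp16) / ppl_fp16

ppl_fp16 = 6.425    # approximate, back-computed from the table entries
ppl_q4ks = 6.4778   # Q4_K_S, new quants

print(f"{quantization_error(ppl_q4ks, ppl_fp16):.2%}")  # ~0.82%
```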

The figure below plots the data from the table above, with the quantized model size in GiB on the x-axis.

Model size: 7.24B parameters · Architecture: llama · Format: GGUF