This repository contains alternative quantizations of OpenHermes-2.5-Mistral-7B (https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and work out of the box.
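
As a quick sanity check, the files can also be loaded through the llama-cpp-python bindings. This is only a minimal sketch, assuming the package is installed and one of the GGUF files from the table below has been downloaded locally; it is not specific to this repo.

```python
# Minimal sketch using the llama-cpp-python bindings
# (assumed installed via `pip install llama-cpp-python`).
# The file name matches the Q4_K_M entry in the table below
# and is assumed to be downloaded locally.
from llama_cpp import Llama

llm = Llama(
    model_path="oh-2.5-m7b-q4k-medium.gguf",  # any GGUF file from this repo
    n_ctx=512,  # context length used for the perplexity numbers below
)

out = llm("Write one sentence about llamas.", max_tokens=64)
print(out["choices"][0]["text"])
```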

I'm careful to say "alternative" rather than "better" or "improved", as I have not put any effort into evaluating performance differences in actual usage. Perplexity is lower than with the "official" llama.cpp quantization (e.g., as provided by https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF), but perplexity is not necessarily a good measure of real-world performance. Nevertheless, perplexity does measure quantization error, so the table below compares the perplexities of these quantized models to the current llama.cpp quantization approach on Wikitext with a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized model) - PPL(fp16)) / PPL(fp16); a small worked example follows the table.

| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
|---|---|---|---|---|---|
| Q3_K_S | oh-2.5-m7b-q3k-small.gguf | 6.8943 | 7.30% | 6.7228 | 4.63% |
| Q3_K_M | oh-2.5-m7b-q3k-medium.gguf | 6.7366 | 4.84% | 6.5899 | 2.56% |
| Q4_K_S | oh-2.5-m7b-q4k-small.gguf | 6.5720 | 2.28% | 6.4778 | 0.82% |
| Q4_K_M | oh-2.5-m7b-q4k-medium.gguf | 6.5322 | 1.66% | 6.4740 | 0.76% |
| Q5_K_S | oh-2.5-m7b-q5k-small.gguf | 6.4668 | 0.64% | 6.4428 | 0.27% |
| Q5_K_M | oh-2.5-m7b-q5k-medium.gguf | 6.4536 | 0.44% | 6.4422 | 0.26% |
| Q4_0 | oh-2.5-m7b-q40.gguf | 6.5443 | 1.85% | 6.5454 | 1.87% |
| Q4_1 | oh-2.5-m7b-q41.gguf | 6.6246 | 3.10% | 6.4810 | 0.87% |
| Q5_0 | oh-2.5-m7b-q50.gguf | 6.4731 | 0.74% | 6.4554 | 0.47% |
| Q5_1 | oh-2.5-m7b-q51.gguf | 6.4818 | 0.88% | 6.4390 | 0.21% |
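
For concreteness, the sketch below shows how a "Quantization Error" entry is obtained from the two perplexities. The fp16 perplexity is not listed in the table, so the value used here is back-computed from the table rows (roughly 6.425) and should be treated as approximate.

```python
# Worked example of the "Quantization Error" definition:
#   error = (PPL(quantized) - PPL(fp16)) / PPL(fp16)
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    return (ppl_quantized - ppl_fp16) / ppl_fp16

ppl_fp16 = 6.425    # approximate, back-computed from the table entries
ppl_q4ks = 6.4778   # Q4_K_S, new quants

print(f"{quantization_error(ppl_q4ks, ppl_fp16):.2%}")  # ~0.82%
```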

The figure below plots the data from the table above, with the quantized model size in GiB on the x-axis.

Model size: 7.24B parameters · Architecture: llama · Format: GGUF