This repository contains improved quantized Mistral-7B models in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and can be used out of the box.

The table below compares these models with the current llama.cpp quantization approach, using Wikitext perplexity (PPL) at a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized model) - PPL(fp16))/PPL(fp16), that is, the relative increase in perplexity over the fp16 model.

| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
|---|---|---|---|---|---|
| Q3_K_S | mistral-7b-q3ks.gguf | 6.0692 | 6.62% | 6.0021 | 5.44% |
| Q3_K_M | mistral-7b-q3km.gguf | 5.8894 | 3.46% | 5.8489 | 2.75% |
| Q4_K_S | mistral-7b-q4ks.gguf | 5.7764 | 1.48% | 5.7349 | 0.75% |
| Q4_K_M | mistral-7b-q4km.gguf | 5.7539 | 1.08% | 5.7259 | 0.59% |
| Q5_K_S | mistral-7b-q5ks.gguf | 5.7258 | 0.59% | 5.7100 | 0.31% |
| Q4_0 | mistral-7b-q40.gguf | 5.8189 | 2.23% | 5.7924 | 1.76% |
| Q4_1 | mistral-7b-q41.gguf | 5.8244 | 2.32% | 5.7455 | 0.94% |
| Q5_0 | mistral-7b-q50.gguf | 5.7180 | 0.45% | 5.7070 | 0.26% |
| Q5_1 | mistral-7b-q51.gguf | 5.7128 | 0.36% | 5.7057 | 0.24% |
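
As a rough illustration of how the "Quantization Error" columns follow from the definition above, here is a minimal Python sketch. The fp16 perplexity is not stated in this README; the value `PPL_FP16_ASSUMED` below is an assumption, back-computed from the table rows, and the function name is purely illustrative.

```python
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    """Relative perplexity increase of a quantized model over fp16."""
    return (ppl_quantized - ppl_fp16) / ppl_fp16


# ASSUMPTION: the fp16 baseline is not given in the README. This value is
# inferred from the table (e.g. 6.0692 / 1.0662) and may be off in the
# last digit.
PPL_FP16_ASSUMED = 5.6924

# Q3_K_S row: llama.cpp quantization vs. the new quantization.
err_llama_cpp = quantization_error(6.0692, PPL_FP16_ASSUMED)
err_new = quantization_error(6.0021, PPL_FP16_ASSUMED)

print(f"Q3_K_S llama.cpp:  {err_llama_cpp:.2%}")
print(f"Q3_K_S new quants: {err_new:.2%}")
```

With the assumed baseline, the two printed values round to the 6.62% and 5.44% shown in the Q3_K_S row.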

In addition, a 2-bit model is provided (mistral-7b-q2k-extra-small.gguf). It has a perplexity of 6.7099 at a context length of 512 tokens and 5.5744 at a context length of 4096 tokens.
