Mistral-Nemo-Instruct-12B-iMat-GGUF

Important Note: llama.cpp inference support for this model was merged in PR #8604. Please make sure you are on release b3438 or newer. Text-generation-web-ui (Ooba) has also been working as of 7/23, and Kobold.cpp as of v1.71.

Quantized from Mistral-Nemo-Instruct-2407 fp16

  • Weighted quantizations were created using the fp16 GGUF and groups_merged.txt in 92 chunks with n_ctx=512
  • Static fp16 will also be included in repo
  • For a brief rundown of iMatrix quant performance please see this PR
  • All quants are verified working prior to uploading to repo for your safety and convenience
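
As a rough illustration of the weighted-quantization workflow described above, the sketch below builds (but does not run) the two llama.cpp commands typically involved: llama-imatrix to compute the importance matrix from a calibration file, and llama-quantize to apply it. This is not the exact procedure used for this repo. The file names and the IQ4_XS target are hypothetical placeholders, and passing --chunks 92 is an assumption, since the 92-chunk figure may simply be the size of groups_merged.txt at n_ctx=512 rather than an explicit limit.

```python
# Sketch only: assembles llama.cpp command lines as argument lists.
# Nothing is executed; all file names and the quant type are placeholders.

def imatrix_command(fp16_gguf, calibration_txt, imatrix_out, n_ctx=512, chunks=92):
    """Argument list for llama-imatrix (computes the importance matrix)."""
    return [
        "llama-imatrix",
        "-m", fp16_gguf,          # fp16 GGUF to collect activation statistics from
        "-f", calibration_txt,    # calibration text, e.g. groups_merged.txt
        "-o", imatrix_out,        # where to write the importance matrix
        "-c", str(n_ctx),         # context length per chunk
        "--chunks", str(chunks),  # assumed: cap on number of chunks processed
    ]

def quantize_command(fp16_gguf, imatrix_file, out_gguf, quant_type):
    """Argument list for llama-quantize using a precomputed importance matrix."""
    return ["llama-quantize", "--imatrix", imatrix_file, fp16_gguf, out_gguf, quant_type]

imat_cmd = imatrix_command("model-fp16.gguf", "groups_merged.txt", "imatrix.dat")
quant_cmd = quantize_command("model-fp16.gguf", "imatrix.dat", "model-iMat-IQ4_XS.gguf", "IQ4_XS")

print(" ".join(imat_cmd))
print(" ".join(quant_cmd))
```

To actually run either step, the lists can be handed to subprocess.run(cmd, check=True), provided the llama.cpp binaries are on your PATH.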

KL-Divergence Reference Chart

Quant-specific Tips:

  • If you get a "cudaMalloc failed: out of memory" error, try passing a smaller context size to llama.cpp, e.g. for 8k: -c 8192
  • If all of your cards are Ampere generation or newer, you can enable flash attention with: -fa
  • Provided flash attention is enabled, you can also use a quantized KV cache to save VRAM, e.g. for 8-bit: -ctk q8_0 -ctv q8_0
  • Mistral recommends a temperature of 0.3 for this model
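
To see why a smaller context and a quantized cache help with out-of-memory errors, here is a back-of-the-envelope estimate of KV-cache size. It is a sketch under stated assumptions: the architecture values (40 layers, 8 KV heads, head dimension 128) are taken from the upstream Mistral-Nemo configuration as I understand it and are not stated in this card, and the estimate ignores model weights, compute buffers, and any backend-specific overhead.

```python
# Rough KV-cache size estimate. Architecture numbers are assumptions
# (believed to match Mistral-Nemo-Instruct-2407, not confirmed by this card).
N_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 128

BYTES_PER_ELEMENT = {
    "f16": 2.0,         # 16-bit float
    "q8_0": 34 / 32,    # blocks of 32 int8 values plus one fp16 scale
}

def kv_cache_bytes(n_ctx, cache_type="f16"):
    """Bytes needed to hold keys and values for n_ctx tokens."""
    elements_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V
    return int(n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type])

for cache_type in ("f16", "q8_0"):
    gib = kv_cache_bytes(8192, cache_type) / 2**30
    print(f"-c 8192, {cache_type} cache: {gib:.2f} GiB")
```

Under these assumptions the cache grows linearly with -c, so halving the context halves the cache, and switching from f16 to q8_0 cuts it by roughly 47% at any context length.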

Original model card can be found here

Format: GGUF
Model size: 12.2B params
Architecture: llama