About

This repo contains select GGUF quants of shisa-ai/shisa-v2-llama3.1-405b

  • All quants were created with b5503 of upstream llama.cpp
  • All quants are weighted/imatrix quants, with the importance matrix computed from our shisa-ai/shisa-v2-sharegpt bilingual dataset on the FP16 model, except for the Q8_0 (which does not use an imatrix)
  • Files are pre-split at 45GB (below HF's 50GB upload limit). Modern llama.cpp builds should load the sequential shards automatically when pointed at the first file, but you can use llama-gguf-split --merge to rejoin them, as sketched below
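
If you do want a single file, a minimal merge sketch (the shard filenames and shard count here are illustrative; use the actual names of the split files you downloaded):

# merge the shards back into one GGUF by pointing --merge at the first shard
llama-gguf-split --merge shisa-v2-llama3.1-405b-IQ3_XS-00001-of-00004.gguf shisa-v2-llama3.1-405b-IQ3_XS.gguf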

Provided Quants

Type     Size (GiB)
IQ2_XXS  100
IQ3_XS   155
IQ3_M    170
IQ4_XS   202
Q4_K_M   227
Q8_0     402
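
To grab a single quant, one option is huggingface-cli (the --include pattern and target directory below are illustrative; adjust them to the quant you want):

# download only the IQ3_M shards from this repo
huggingface-cli download shisa-ai/shisa-v2-llama3.1-405b-GGUF --include "*IQ3_M*" --local-dir ./shisa-v2-405b-IQ3_M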

Quant Quality

All quants have been tested with JA MT-Bench (judged by GPT-4.1) as a rough guide for quality:

Quant        Size (GiB)  % Diff  Overall  Writing  Roleplay  Reasoning  Math  Coding  Extraction  STEM  Humanities
Full FP16    810         -       9.13     9.25     9.55      8.15       8.90  9.10    9.65        9.10  9.35
IQ3_M        170         -0.99   9.04     8.90     9.45      7.75       8.95  8.95    9.70        9.15  9.50
Q4_K_M       227         -1.10   9.03     9.40     9.00      8.25       8.85  9.10    9.50        8.90  9.25
Q8_0         405         -1.20   9.02     9.40     9.05      8.30       9.20  8.70    9.50        8.45  9.55
W8A8-INT8    405         -1.42   9.00     9.20     9.35      7.80       8.75  9.00    9.80        8.65  9.45
FP8-Dynamic  405         -3.29   8.83     8.70     9.20      7.85       8.80  8.65    9.30        8.80  9.35
IQ3_XS       155         -3.50   8.81     8.70     9.05      7.70       8.60  8.95    9.35        8.70  9.45
IQ4_XS       202         -3.61   8.80     8.85     9.55      6.90       8.35  8.60    9.90        8.65  9.60
70B FP16     140         -7.89   8.41     7.95     9.05      6.25       8.30  8.25    9.70        8.70  9.05
IQ2_XXS      100         -18.18  7.47     7.50     6.80      5.15       7.55  7.30    9.05        7.65  8.80
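
For reference, the % Diff column is the relative change of the Overall score versus the Full FP16 baseline, e.g. for the IQ3_M: (9.04 - 9.13) / 9.13 * 100 ≈ -0.99. A quick check:

# recompute % Diff for the IQ3_M row from the Overall scores
awk 'BEGIN { fp16 = 9.13; quant = 9.04; printf "%.2f%%\n", (quant - fp16) / fp16 * 100 }'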

Given the margin of error, you could fairly say that the IQ3_M, Q4_K_M, and Q8_0 GGUFs show essentially no functional loss versus the FP16.

Interestingly, while roleplay takes one of the biggest hits, writing actually appears to improve on the Q4_K_M and Q8_0. You'd need more testing (more samples, more runs, more evals) to see what's really going on. Also of note, the XS quants track each other fairly consistently, with the IQ4_XS scoring worse than the IQ3_M.

The IQ2_XXS scores extremely poorly. I included the 70B Full FP16 scores as a reference point, and I'd expect you'd be better off running a decent Shisa V2 70B Q4_K_M (40GB) or IQ3_M (32GB) than the IQ2_XXS.

In an ideal world you should of course test different quants on your downstream tasks, but I understand that's not always an option. Based on this testing, though, if you had to pick one bang-for-the-buck quant blind, I'd start with the IQ3_M.
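
If you do go with the IQ3_M, a minimal serving sketch with llama.cpp's llama-server (the shard filename, context size, and -ngl value are illustrative; pass the first shard and the remaining shards are loaded automatically):

# serve the IQ3_M quant over an OpenAI-compatible HTTP API
llama-server -m shisa-v2-llama3.1-405b-IQ3_M-00001-of-00004.gguf -c 8192 -ngl 99 --host 0.0.0.0 --port 8080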

Making Quants

# first you need an FP16 GGUF - set up the llama.cpp Python env and run something like
python convert_hf_to_gguf.py ~/.cache/huggingface/hub/models--shisa-ai--shisa-v2-llama3.1-405b/snapshots/71b83a7cb998c3a44f59c83a9928596ac348b9b5 --outfile shisa-v2-llama3.1-405b-fp16.gguf

# Create imatrix: using 4 x H200 you can load 88 layers, takes about 1h15m
CUDA_VISIBLE_DEVICES=4,5,6,7 build/bin/llama-imatrix -m shisa-v2-llama3.1-405b-fp16.gguf -f /data/quantize/shisa-v2-llama-3.1-405b/gguf/calibration_chat.txt -o imatrix.dat -c 512 -b 512 --chunks 100 -ngl 88

# create your imatrix quants
build/bin/llama-quantize --imatrix imatrix.dat shisa-v2-llama3.1-405b-fp16.gguf shisa-v2-llama3.1-405b-IQ3_XS.gguf IQ3_XS

# split the quants
build/bin/llama-gguf-split --split-max-size 45G shisa-v2-llama3.1-405b-IQ3_XS.gguf  shisa-v2-llama3.1-405b-IQ3_XS

# upload (bash loop)
for f in shisa-v2-llama3.1-405b-IQ3_XS-0000*; do huggingface-cli upload shisa-ai/shisa-v2-llama3.1-405b-GGUF "$f"; done