About

This repo contains select GGUF quants of shisa-ai/shisa-v2-llama3.1-405b

  • All quants were created with b5503 of upstream llama.cpp
  • All quants are weighted/imatrix quants, with the importance matrix computed from our shisa-ai/shisa-v2-sharegpt bilingual dataset on the FP16 model, except for the Q8_0 (which does not use an imatrix)
  • Files are pre-split at 45GB (below HF's 50GB upload limit). Modern llama.cpp builds should load the sequential shards automatically when pointed at the first file, but you can use llama-gguf-split --merge to rejoin them, as sketched below
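
If you do want a single file, a minimal merge sketch (the shard filenames and shard count here are illustrative; use the actual names of the split files you downloaded):

# merge the shards back into one GGUF by pointing --merge at the first shard
llama-gguf-split --merge shisa-v2-llama3.1-405b-IQ3_XS-00001-of-00004.gguf shisa-v2-llama3.1-405b-IQ3_XS.gguf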

Provided Quants

Type     Size (GiB)
IQ2_XXS  100
IQ3_XS   155
IQ3_M    170
IQ4_XS   202
Q4_K_M   227
Q8_0     402
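
To grab a single quant, one option is huggingface-cli (the --include pattern and target directory below are illustrative; adjust them to the quant you want):

# download only the IQ3_M shards from this repo
huggingface-cli download shisa-ai/shisa-v2-llama3.1-405b-GGUF --include "*IQ3_M*" --local-dir ./shisa-v2-405b-IQ3_M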

Quant Quality

All quants have been tested with JA MT-Bench (judged by GPT-4.1) as a rough guide for quality:

Quant        Size (GiB)  % Diff  Overall  Writing  Roleplay  Reasoning  Math  Coding  Extraction  STEM  Humanities
Full FP16    810         -       9.13     9.25     9.55      8.15       8.90  9.10    9.65        9.10  9.35
IQ3_M        170         -0.99   9.04     8.90     9.45      7.75       8.95  8.95    9.70        9.15  9.50
Q4_K_M       227         -1.10   9.03     9.40     9.00      8.25       8.85  9.10    9.50        8.90  9.25
Q8_0         405         -1.20   9.02     9.40     9.05      8.30       9.20  8.70    9.50        8.45  9.55
W8A8-INT8    405         -1.42   9.00     9.20     9.35      7.80       8.75  9.00    9.80        8.65  9.45
FP8-Dynamic  405         -3.29   8.83     8.70     9.20      7.85       8.80  8.65    9.30        8.80  9.35
IQ3_XS       155         -3.50   8.81     8.70     9.05      7.70       8.60  8.95    9.35        8.70  9.45
IQ4_XS       202         -3.61   8.80     8.85     9.55      6.90       8.35  8.60    9.90        8.65  9.60
70B FP16     140         -7.89   8.41     7.95     9.05      6.25       8.30  8.25    9.70        8.70  9.05
IQ2_XXS      100         -18.18  7.47     7.50     6.80      5.15       7.55  7.30    9.05        7.65  8.80
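
For reference, the % Diff column is the relative change of the Overall score versus the Full FP16 baseline, e.g. for the IQ3_M: (9.04 - 9.13) / 9.13 * 100 ≈ -0.99. A quick check:

# recompute % Diff for the IQ3_M row from the Overall scores
awk 'BEGIN { fp16 = 9.13; quant = 9.04; printf "%.2f%%\n", (quant - fp16) / fp16 * 100 }'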

Given the margin of error, you could fairly say that the IQ3_M, Q4_K_M, and Q8_0 GGUFs show essentially no functional loss versus the FP16.

Interestingly, while roleplay takes one of the biggest hits, writing actually appears to improve on the Q4_K_M and Q8_0. You'd need more testing (more samples, more runs, more evals) to see what's really going on. Also of note, the XS quants track each other fairly consistently, with the IQ4_XS scoring worse than the IQ3_M.

The IQ2_XXS scores extremely poorly. I included the 70B Full FP16 scores as a reference point, and I'd expect you'd be better off running a decent Shisa V2 70B Q4_K_M (40GB) or IQ3_M (32GB) than the IQ2_XXS.

In an ideal world you should of course test different quants on your downstream tasks, but I understand that's not always an option. Based on this testing, though, if you had to pick one bang-for-the-buck quant blind, I'd start with the IQ3_M.
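
If you do go with the IQ3_M, a minimal serving sketch with llama.cpp's llama-server (the shard filename, context size, and -ngl value are illustrative; pass the first shard and the remaining shards are loaded automatically):

# serve the IQ3_M quant over an OpenAI-compatible HTTP API
llama-server -m shisa-v2-llama3.1-405b-IQ3_M-00001-of-00004.gguf -c 8192 -ngl 99 --host 0.0.0.0 --port 8080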

Making Quants

# first you need an FP16 GGUF - set up the llama.cpp Python env and run something like
python convert_hf_to_gguf.py ~/.cache/huggingface/hub/models--shisa-ai--shisa-v2-llama3.1-405b/snapshots/71b83a7cb998c3a44f59c83a9928596ac348b9b5 --outfile shisa-v2-llama3.1-405b-fp16.gguf

# Create imatrix: using 4 x H200 you can load 88 layers, takes about 1h15m
CUDA_VISIBLE_DEVICES=4,5,6,7 build/bin/llama-imatrix -m shisa-v2-llama3.1-405b-fp16.gguf -f /data/quantize/shisa-v2-llama-3.1-405b/gguf/calibration_chat.txt -o imatrix.dat -c 512 -b 512 --chunks 100 -ngl 88

# create your imatrix quants
build/bin/llama-quantize --imatrix imatrix.dat shisa-v2-llama3.1-405b-fp16.gguf shisa-v2-llama3.1-405b-IQ3_XS.gguf IQ3_XS

# split the quants
build/bin/llama-gguf-split --split-max-size 45G shisa-v2-llama3.1-405b-IQ3_XS.gguf  shisa-v2-llama3.1-405b-IQ3_XS

# upload (bash loop)
for f in shisa-v2-llama3.1-405b-IQ3_XS-0000*; do huggingface-cli upload shisa-ai/shisa-v2-llama3.1-405b-GGUF "$f"; done