F16 is the best option. F32 is just too slow (on an RTX 3060 Mobile with 6 GB of VRAM). Q8_0 is faster than F16 and sometimes even produces better results. Anything below Q8 can be a big tradeoff in quality: Q3 showed some very, very bad hallucinations.
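A minimal sketch of how one of these quants could be loaded with llama-cpp-python (the GGUF filename and settings below are assumptions; check the repository's file list for the exact Q8_0 / F16 artifact names):

```python
# Minimal sketch: run a GGUF quant of this model via llama-cpp-python.
# The filename is hypothetical -- verify it against the repo's file list.
from llama_cpp import Llama

llm = Llama(
    model_path="Cascade0-159M-Instruct-45k.Q8_0.gguf",  # hypothetical filename
    n_ctx=2048,       # context window; modest default for a 159M model
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

out = llm("Explain what GGUF quantization is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```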

Format: GGUF
Model size: 159M params
Architecture: llama

Available quantizations: 8-bit, 16-bit, 32-bit

Repository: ARMZyany/Cascade0-159M-Instruct-45k-GGUF