This version is quantized by h4shy with low-end hardware and CPU-only setups in mind. The goal is an inference-ready setup aimed at production use under significant resource constraints: the quantization choices below target environments with medium to high CPU constraints and low to medium RAM constraints, while leaving headroom for production workloads.

Q5_0: medium to fast inference, lowest RAM usage of the two.
Q8_0: faster inference, higher RAM usage (see the loading sketch below).
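
As a minimal sketch, here is one way to load either file for CPU-only inference with llama-cpp-python. The GGUF filename and the generation parameters are assumptions for illustration; adjust them to match the files actually published in this repo.

```python
# Minimal CPU-only inference sketch using llama-cpp-python.
# Assumption: the repo ships a file named "gemma-3-1b-it-Q5_0.gguf";
# adjust repo_id / filename to whatever is actually published.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="h4shy/gemma-3-1b-it-fast-GUFF",  # repo name as shown on the model page
    filename="gemma-3-1b-it-Q5_0.gguf",       # hypothetical filename; pick Q5_0 or Q8_0
)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,    # context window; lower it to save RAM
    n_threads=4,   # match the number of physical CPU cores available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF quantization in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```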

Evaluations and more detailed benchmarks coming soon.

Original model: gemma-3-1b-it
Software used for quantization: llama.cpp
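
For reference, below is a sketch of the typical llama.cpp quantization workflow (HF checkpoint → F16 GGUF → Q5_0/Q8_0). The paths and output filenames are assumptions following current llama.cpp conventions, not a record of the exact commands used for this repo.

```python
# Sketch of a typical llama.cpp quantization pipeline (not the author's exact commands).
# Assumes a local llama.cpp checkout with its tools built, and the original
# gemma-3-1b-it checkpoint downloaded to ./gemma-3-1b-it.
import subprocess

# 1) Convert the Hugging Face checkpoint to an unquantized (F16) GGUF file.
subprocess.run(
    [
        "python", "convert_hf_to_gguf.py",  # script shipped in the llama.cpp repo
        "./gemma-3-1b-it",
        "--outfile", "gemma-3-1b-it-f16.gguf",
        "--outtype", "f16",
    ],
    check=True,
)

# 2) Quantize the F16 GGUF down to the published precisions.
for quant in ("Q5_0", "Q8_0"):
    subprocess.run(
        [
            "./llama-quantize",  # binary built from llama.cpp
            "gemma-3-1b-it-f16.gguf",
            f"gemma-3-1b-it-{quant}.gguf",
            quant,
        ],
        check=True,
    )
```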

Format: GGUF
Model size: 1,000M parameters
Architecture: gemma3