Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3
This is a 5-bit quantized version of andrewzh/Absolute_Zero_Reasoner-Coder-14b using ExLlamaV3 v0.0.2.
Model Description
This model is a quantized version of Absolute_Zero_Reasoner-Coder-14b, which is built on Qwen2.5-Coder-14B. The original model is designed for reasoning and coding tasks. For more details about the original model, please refer to the Absolute Zero paper: https://huggingface.co/papers/2505.03335.
The quantization reduces the model size and memory requirements while attempting to preserve as much of the original performance as possible.
Quantization Methodology
The model was quantized using ExLlamaV3 v0.0.2 with the following parameters:
- Quantization Method: exl3 (ExLlamaV3)
- Bits: 5.0 (5-bit quantization)
- Head Bits: 6 (6-bit precision for the output head layer)
- Calibration:
  - Rows: 100
  - Columns: 2048
- Out Scales: auto
EXL3 uses a trellis-based quantization scheme (building on QTIP) rather than simple round-to-nearest linear quantization, which preserves model quality substantially better at low bit depths.
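For intuition, here is a minimal sketch of the naive round-to-nearest baseline that such schemes improve upon. This is purely illustrative and is not the EXL3 algorithm; the function names and test tensor are made up for the example.

import numpy as np

# Naive symmetric round-to-nearest quantization to a given bit width.
# EXL3 instead assigns codes jointly across blocks of weights (trellis
# coding), which yields lower error at the same bit rate.
def linear_quantize(w, bits=5):
    levels = 2 ** (bits - 1) - 1          # e.g. 5 bits -> integer range -16..15
    scale = np.abs(w).max() / levels      # a single scale for the whole tensor
    q = np.clip(np.round(w / scale), -levels - 1, levels)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)   # stand-in for one weight row
q, scale = linear_quantize(w)
w_hat = dequantize(q, scale)
print("RMS quantization error:", np.sqrt(np.mean((w - w_hat) ** 2)))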
Model Architecture
The model is based on the Qwen2 architecture with the following specifications:
- Hidden Size: 5120
- Intermediate Size: 13824
- Number of Attention Heads: 40
- Number of Key-Value Heads: 8
- Number of Hidden Layers: 48
- Maximum Sequence Length: 32768
- Vocabulary Size: 152064
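From these specifications you can make a back-of-the-envelope estimate of the memory footprint. The sketch below assumes roughly 14.8B parameters (an approximation for this model class) and ignores quantization metadata and runtime overhead:

# Rough memory estimates derived from the specs above
params = 14.8e9                        # approximate parameter count
bits_per_weight = 5.0
weight_bytes = params * bits_per_weight / 8
print(f"Quantized weights: ~{weight_bytes / 2**30:.1f} GiB")

# KV cache per token under grouped-query attention (8 KV heads, not 40)
hidden_size, num_heads, num_kv_heads, num_layers = 5120, 40, 8, 48
head_dim = hidden_size // num_heads    # 128
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * 2  # K+V in fp16
print(f"KV cache: ~{kv_bytes_per_token / 2**10:.0f} KiB per token")
# At the full 32768-token context this adds up to ~6 GiB on top of the weights.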
How to Use
To use this quantized model with ExLlamaV3, you'll need to install the ExLlamaV3 library:
pip install exllamav3
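You can sanity-check the installation as follows (this assumes the package exposes a __version__ attribute, which is common but not guaranteed):

import exllamav3
# Confirm the installed version matches the one used for quantization (v0.0.2)
print(exllamav3.__version__)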
Here's a basic example, following the usage pattern from the ExLlamaV3 README (the library's API is still evolving, so check the documentation for your installed version):

from exllamav3 import Config, Model, Cache, Tokenizer, Generator

# Point the config at the downloaded quantized model directory
config = Config.from_directory("path/to/Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3")
model = Model.from_config(config)

# Allocate a KV cache (8192 tokens here as an example), then load the weights
cache = Cache(model, max_num_tokens=8192)
model.load()

# Tokenizer and generator are built from the same config
tokenizer = Tokenizer.from_config(config)
generator = Generator(model=model, cache=cache, tokenizer=tokenizer)

# Generate a completion; sampling parameters such as temperature and top-p
# are configured through the library's sampler classes, so default sampling
# settings are used here
prompt = "Write a function to calculate the Fibonacci sequence in Python:"
output = generator.generate(prompt=prompt, max_new_tokens=200)
print(output)
Limitations
This quantized model has the following limitations:
- Reduced Precision: The 5-bit quantization may lead to some degradation in performance compared to the original model, particularly on complex reasoning tasks.
- ExLlamaV3 Dependency: This model can only be used with the ExLlamaV3 library and is not compatible with standard Hugging Face Transformers without conversion.
- Inherited Limitations: All limitations of the original model apply to this quantized version as well.
Citation
If you use this model in your research, please cite the original paper:
@misc{zhao2025absolutezero,
  author       = {Zhao, Andrew and others},
  title        = {Absolute Zero: Reinforced Self-play Reasoning with Zero Data},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/papers/2505.03335}}
}
Acknowledgements
- Original model: andrewzh/Absolute_Zero_Reasoner-Coder-14b
- Quantization library: ExLlamaV3