Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3
This is a 5-bit quantized version of andrewzh/Absolute_Zero_Reasoner-Coder-14b using ExLlamaV3 v0.0.2.
Model Description
This model is a quantized version of Absolute_Zero_Reasoner-Coder-14b, which is built on Qwen2.5-Coder-14B. The original model is designed for reasoning and coding tasks. For more details about the original model, please refer to the Absolute Zero paper: https://huggingface.co/papers/2505.03335.
The quantization reduces the model size and memory requirements while attempting to preserve as much of the original performance as possible.
Quantization Methodology
The model was quantized using ExLlamaV3 v0.0.2 with the following parameters:
- Quantization Method: exl3 (ExLlamaV3)
- Bits: 5.0 (5-bit quantization)
- Head Bits: 6 (6-bit precision for the output head layer)
- Calibration:
  - Rows: 100
  - Columns: 2048
- Out Scales: auto
EXL3 uses a trellis-based quantization scheme (building on QTIP) rather than simple round-to-nearest linear quantization, which preserves model quality substantially better at low bit depths.
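For intuition, here is a minimal sketch of the naive round-to-nearest baseline that such schemes improve upon. This is purely illustrative and is not the EXL3 algorithm; the function names and test tensor are made up for the example.

import numpy as np

# Naive symmetric round-to-nearest quantization to a given bit width.
# EXL3 instead assigns codes jointly across blocks of weights (trellis
# coding), which yields lower error at the same bit rate.
def linear_quantize(w, bits=5):
    levels = 2 ** (bits - 1) - 1          # e.g. 5 bits -> integer range -16..15
    scale = np.abs(w).max() / levels      # a single scale for the whole tensor
    q = np.clip(np.round(w / scale), -levels - 1, levels)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)   # stand-in for one weight row
q, scale = linear_quantize(w)
w_hat = dequantize(q, scale)
print("RMS quantization error:", np.sqrt(np.mean((w - w_hat) ** 2)))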
Model Architecture
The model is based on the Qwen2 architecture with the following specifications:
- Hidden Size: 5120
- Intermediate Size: 13824
- Number of Attention Heads: 40
- Number of Key-Value Heads: 8
- Number of Hidden Layers: 48
- Maximum Sequence Length: 32768
- Vocabulary Size: 152064
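From these specifications you can make a back-of-the-envelope estimate of the memory footprint. The sketch below assumes roughly 14.8B parameters (an approximation for this model class) and ignores quantization metadata and runtime overhead:

# Rough memory estimates derived from the specs above
params = 14.8e9                        # approximate parameter count
bits_per_weight = 5.0
weight_bytes = params * bits_per_weight / 8
print(f"Quantized weights: ~{weight_bytes / 2**30:.1f} GiB")

# KV cache per token under grouped-query attention (8 KV heads, not 40)
hidden_size, num_heads, num_kv_heads, num_layers = 5120, 40, 8, 48
head_dim = hidden_size // num_heads    # 128
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * 2  # K+V in fp16
print(f"KV cache: ~{kv_bytes_per_token / 2**10:.0f} KiB per token")
# At the full 32768-token context this adds up to ~6 GiB on top of the weights.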
How to Use
To use this quantized model with ExLlamaV3, you'll need to install the ExLlamaV3 library:
pip install exllamav3
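You can sanity-check the installation as follows (this assumes the package exposes a __version__ attribute, which is common but not guaranteed):

import exllamav3
# Confirm the installed version matches the one used for quantization (v0.0.2)
print(exllamav3.__version__)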
Here's a basic example, following the usage pattern from the ExLlamaV3 README (the library's API is still evolving, so check the documentation for your installed version):

from exllamav3 import Config, Model, Cache, Tokenizer, Generator

# Point the config at the downloaded quantized model directory
config = Config.from_directory("path/to/Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3")
model = Model.from_config(config)

# Allocate a KV cache (8192 tokens here as an example), then load the weights
cache = Cache(model, max_num_tokens=8192)
model.load()

# Tokenizer and generator are built from the same config
tokenizer = Tokenizer.from_config(config)
generator = Generator(model=model, cache=cache, tokenizer=tokenizer)

# Generate a completion; sampling parameters such as temperature and top-p
# are configured through the library's sampler classes, so default sampling
# settings are used here
prompt = "Write a function to calculate the Fibonacci sequence in Python:"
output = generator.generate(prompt=prompt, max_new_tokens=200)
print(output)
Limitations
This quantized model has the following limitations:
- Reduced Precision: The 5-bit quantization may lead to some degradation in performance compared to the original model, particularly on complex reasoning tasks.
- ExLlamaV3 Dependency: This model can only be used with the ExLlamaV3 library and is not compatible with standard Hugging Face Transformers without conversion.
- Inherited Limitations: All limitations of the original model apply to this quantized version as well.
Citation
If you use this model in your research, please cite the original paper:
@misc{zhao2025absolutezero,
  author       = {Zhao, Andrew and others},
  title        = {Absolute Zero: Reinforced Self-play Reasoning with Zero Data},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/papers/2505.03335}}
}
Acknowledgements
- Original model: andrewzh/Absolute_Zero_Reasoner-Coder-14b
- Quantization library: ExLlamaV3