# Quantized MCQA Model – W8A8
## Model Summary
This model is a quantized version of our MCQA model, produced with post-training quantization (PTQ) of both weights and activations (W8A8) using the LLMCompressor framework.
## Technical Details
- Base model: `hssawhney/MNLP_M2_mcqa_model`
- Quantization method: SmoothQuant + GPTQ (see the recipe sketch below)
- Precision: INT8 weights + INT8 activations (W8A8)
- Calibration data: 512 samples from `zay25/quantization-dataset`
- Excluded layers: `lm_head` (kept unquantized to preserve output logits)
- Final model size: ~717 MB
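For reference, here is a minimal sketch of how this setup can be expressed with LLMCompressor. The `smoothing_strength`, `max_seq_length`, and `output_dir` values are illustrative assumptions rather than the exact settings used; depending on the installed version, `oneshot` is imported from `llmcompressor` directly instead of `llmcompressor.transformers`.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# SmoothQuant migrates activation outliers into the weights, then GPTQ
# quantizes weights and activations to INT8 (W8A8), skipping lm_head.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # illustrative value
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="hssawhney/MNLP_M2_mcqa_model",
    dataset="zay25/quantization-dataset",
    recipe=recipe,
    max_seq_length=2048,          # illustrative value
    num_calibration_samples=512,  # as stated above
    output_dir="MNLP_M2_quantized_model",
)
```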
## Evaluation
The quantized model was evaluated on the full MCQA demo dataset using the LightEval framework. Accuracy dropped by only 0.02 compared to the full-precision (FP32) baseline.
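LightEval's exact task configuration is not reproduced here, but the underlying MCQA metric is standard log-likelihood scoring: the predicted answer is the option to which the model assigns the highest total log-probability. Below is a self-contained sketch of that scoring, using a hypothetical question rather than an item from the demo dataset; loading this W8A8 checkpoint in `transformers` additionally requires the `compressed-tensors` package.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zay25/MNLP_M2_quantized_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Total log-probability the model assigns to `option` following `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    option_ids = tokenizer(option, add_special_tokens=False,
                           return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i + 1, so take the slice that
    # predicts each option token and sum its log-probabilities.
    option_logits = logits[0, prompt_ids.shape[1] - 1 : -1]
    logprobs = torch.log_softmax(option_logits, dim=-1)
    return logprobs.gather(1, option_ids[0].unsqueeze(1)).sum().item()

# Hypothetical MCQA item for illustration only.
prompt = "Question: What is the SI unit of force?\nAnswer:"
options = [" newton", " joule", " watt", " pascal"]
print(max(options, key=lambda o: option_logprob(prompt, o)))
```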
## Intended Use
This model is optimized for efficient inference in multiple-choice question answering tasks, particularly in the context of STEM tutoring. It is well-suited for low-resource deployment environments where latency and memory usage are critical.
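W8A8 checkpoints produced by LLMCompressor are commonly served with vLLM, which supports the compressed-tensors format. A minimal deployment sketch follows; the prompt format is an illustrative assumption, not necessarily the template the model was trained with.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="zay25/MNLP_M2_quantized_model")

prompt = (
    "Answer with the letter of the correct option.\n"
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=2))
print(outputs[0].outputs[0].text)
```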