Quantized MCQA Model – W8A8

Model Summary

This model is a quantized version of our MCQA model, produced with post-training quantization (PTQ) targeting both weights and activations (W8A8) via the LLMCompressor framework.

Technical Details

  • Base model: hssawhney/mnlp-model
  • Quantization method: SmoothQuant + GPTQ
  • Precision: INT8 weights and activations (W8A8); non-quantized tensors (e.g., the excluded lm_head) remain in BF16
  • Calibration data: 512 samples from zay25/quantization-dataset
  • Excluded layers: lm_head (to preserve output logits)
  • Final model size: ~717 MB (752M parameters, stored as BF16 + INT8 safetensors)
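
A minimal sketch of how this recipe might look with LLMCompressor is shown below. It is a plausible reconstruction from the details above, not the exact script used for this model; the calibration split, the "text" column name, the sequence length, the SmoothQuant smoothing strength, and the output directory are all assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "hssawhney/mnlp-model"
NUM_SAMPLES = 512   # as stated above
MAX_LEN = 2048      # assumption

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 calibration samples; split name and "text" column are assumptions.
ds = load_dataset("zay25/quantization-dataset", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_SAMPLES))
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=MAX_LEN),
    remove_columns=ds.column_names,
)

recipe = [
    # SmoothQuant migrates activation outliers into the weights so that
    # both weights and activations can be quantized to INT8 (W8A8).
    SmoothQuantModifier(smoothing_strength=0.8),  # strength is an assumed default
    # GPTQ quantizes the Linear weights; lm_head is excluded to preserve logits.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
    output_dir="MNLP_M2_quantized_model-W8A8",  # hypothetical output path
)
```

The resulting checkpoint is saved in the compressed-tensors format, which is what keeps the INT8 weights compact on disk.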

Evaluation

The quantized model was evaluated on the full MCQA demo dataset using the LightEval framework. Accuracy dropped by only 0.02 compared to the full-precision (FP32) version.

Intended Use

This model is optimized for efficient inference in multiple-choice question answering tasks, particularly in the context of STEM tutoring. It is well-suited for low-resource deployment environments where latency and memory usage are critical.
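
As a usage illustration, the sketch below loads the quantized checkpoint with vLLM, a common runtime for compressed-tensors W8A8 models. vLLM is not mentioned on this card, so treat it as one assumed deployment path, and note that the MCQA prompt format is invented for this example.

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint (assumes vLLM's compressed-tensors support).
llm = LLM(model="zay25/MNLP_M2_quantized_model")
params = SamplingParams(temperature=0.0, max_tokens=1)  # greedy, single-token answer

# Illustrative prompt; the model's actual MCQA template may differ.
prompt = (
    "Question: Which planet is closest to the Sun?\n"
    "A. Venus\nB. Mercury\nC. Earth\nD. Mars\n"
    "Answer:"
)
print(llm.generate([prompt], params)[0].outputs[0].text)
```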
