Quantized MCQA Model – W8A8

Model Summary

This model is a quantized version of our MCQA model, produced with post-training quantization (PTQ) targeting both weights and activations (W8A8) via the LLMCompressor framework.

Technical Details

  • Base model: hssawhney/mnlp-model
  • Quantization method: SmoothQuant + GPTQ
  • Precision: INT8 weights and activations (W8A8); non-quantized tensors (e.g., the excluded lm_head) remain in BF16
  • Calibration data: 512 samples from zay25/quantization-dataset
  • Excluded layers: lm_head (to preserve output logits)
  • Final model size: ~717 MB (752M parameters, stored as BF16 + INT8 safetensors)
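
A minimal sketch of how this recipe might look with LLMCompressor is shown below. It is a plausible reconstruction from the details above, not the exact script used for this model; the calibration split, the "text" column name, the sequence length, the SmoothQuant smoothing strength, and the output directory are all assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "hssawhney/mnlp-model"
NUM_SAMPLES = 512   # as stated above
MAX_LEN = 2048      # assumption

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 calibration samples; split name and "text" column are assumptions.
ds = load_dataset("zay25/quantization-dataset", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_SAMPLES))
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=MAX_LEN),
    remove_columns=ds.column_names,
)

recipe = [
    # SmoothQuant migrates activation outliers into the weights so that
    # both weights and activations can be quantized to INT8 (W8A8).
    SmoothQuantModifier(smoothing_strength=0.8),  # strength is an assumed default
    # GPTQ quantizes the Linear weights; lm_head is excluded to preserve logits.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
    output_dir="MNLP_M2_quantized_model-W8A8",  # hypothetical output path
)
```

The resulting checkpoint is saved in the compressed-tensors format, which is what keeps the INT8 weights compact on disk.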

Evaluation

The quantized model was evaluated on the full MCQA demo dataset using the LightEval framework. Accuracy dropped by only 0.02 compared to the full-precision (FP32) version.

Intended Use

This model is optimized for efficient inference in multiple-choice question answering tasks, particularly in the context of STEM tutoring. It is well-suited for low-resource deployment environments where latency and memory usage are critical.
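
As a usage illustration, the sketch below loads the quantized checkpoint with vLLM, a common runtime for compressed-tensors W8A8 models. vLLM is not mentioned on this card, so treat it as one assumed deployment path, and note that the MCQA prompt format is invented for this example.

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint (assumes vLLM's compressed-tensors support).
llm = LLM(model="zay25/MNLP_M2_quantized_model")
params = SamplingParams(temperature=0.0, max_tokens=1)  # greedy, single-token answer

# Illustrative prompt; the model's actual MCQA template may differ.
prompt = (
    "Question: Which planet is closest to the Sun?\n"
    "A. Venus\nB. Mercury\nC. Earth\nD. Mars\n"
    "Answer:"
)
print(llm.generate([prompt], params)[0].outputs[0].text)
```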
