Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16

Model Overview

  • Model Architecture: Llama4ForConditionalGeneration
    • Input: Text / Image
    • Output: Text
  • Model Optimizations:
    • Weight quantization: INT4
  • Release Date: 06/12/2025
  • Version: 1.0
  • Model Developers: Red Hat (Neural Magic)

Model Optimizations

This model was obtained by quantizing the weights of Llama-4-Maverick-17B-128E-Instruct to the INT4 data type. This optimization reduces the number of bits used to represent each weight from 16 to 4, cutting GPU memory and disk size requirements by approximately 75%. Quantization was performed with the llm-compressor library.
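
The ~75% figure follows from simple arithmetic. The sketch below (illustrative only) assumes group-wise quantization with 16-bit scales shared across groups of 128 weights, a common W4A16 layout that is not confirmed by this card; the scale overhead is why the reduction is approximate rather than an exact 4x.

# Back-of-the-envelope weight-memory arithmetic (illustrative).
# Assumes one 16-bit scale per group of 128 weights; the exact
# quantization layout used for this model is not stated in the card.
def w4a16_bytes_per_param(bits=4, group_size=128, scale_bits=16):
    return bits / 8 + scale_bits / 8 / group_size

bf16 = 16 / 8                    # 2.0 bytes per BF16 weight
w4a16 = w4a16_bytes_per_param()  # ~0.516 bytes per packed INT4 weight + scale
print(f"reduction: {1 - w4a16 / bf16:.1%}")  # ~74%, i.e. "approximately 75%"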

Deployment

This model can be deployed efficiently with vLLM, as in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the request with the model's chat template, since this is an
# instruction-tuned model.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
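
As a minimal sketch, assuming a local server started with vllm serve on vLLM's default port 8000, a request can be sent with the official openai client; the port and api_key value are assumptions, not settings specified by this card.

# Start an OpenAI-compatible server first, e.g. (shell):
#   vllm serve RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 --tensor-parallel-size 8
from openai import OpenAI

# vLLM does not require a real API key; "EMPTY" is a common placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)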

Creation

Creation details

This model was created by applying a development version of llm-compressor. More details will be added once the code is merged into main.
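
Pending those details, the following is a generic, illustrative sketch of a W4A16 quantization flow with llm-compressor's GPTQModifier. The calibration dataset, sample counts, and any MoE-specific handling are assumptions for illustration, not the actual (unpublished) recipe used for this model.

# Illustrative W4A16 flow; NOT the actual recipe used for this model.
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# W4A16: 4-bit weights, 16-bit activations; lm_head left unquantized.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",       # placeholder calibration set (assumption)
    recipe=recipe,
    max_seq_length=2048,           # assumed calibration settings
    num_calibration_samples=512,
)

save_dir = "Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)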

Evaluation

The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness. More evaluations are underway.

Evaluation details

OpenLLM v1

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto

Accuracy

| Benchmark | meta-llama/Llama-4-Maverick-17B-128E-Instruct | RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 (this model) | Recovery (%) |
|---|---|---|---|
| ARC-Challenge (25-shot) | 73.55 | 71.08 | 96.6 |
| GSM8k (5-shot) | 93.18 | 92.87 | 99.7 |
| HellaSwag (10-shot) | 87.27 | 86.95 | 99.6 |
| MMLU (5-shot) | 85.98 | 85.78 | 99.8 |
| TruthfulQA (0-shot) | 62.81 | 62.85 | 100.0 |
| WinoGrande (5-shot) | 78.53 | 78.93 | 100.5 |
| OpenLLM v1 average | 80.22 | 79.74 | 99.4 |
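
Recovery in the table above is the quantized model's score expressed as a percentage of the baseline score, e.g. for ARC-Challenge:

# Recovery (%) = 100 * quantized score / baseline score
baseline, quantized = 73.55, 71.08  # ARC-Challenge (25-shot), from the table
print(f"{100 * quantized / baseline:.1f}%")  # 96.6%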