Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16

Model Overview

  • Model Architecture: Llama4ForConditionalGeneration
    • Input: Text / Image
    • Output: Text
  • Model Optimizations:
    • Weight quantization: INT4
  • Release Date: 06/12/2025
  • Version: 1.0
  • Model Developers: Red Hat (Neural Magic)

Model Optimizations

This model was obtained by quantizing the weights of Llama-4-Maverick-17B-128E-Instruct to the INT4 data type. This optimization reduces the number of bits used to represent each weight from 16 to 4, cutting GPU memory and disk size requirements by approximately 75%. Quantization was performed with the llm-compressor library.
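
The ~75% figure follows from simple arithmetic. The sketch below (illustrative only) assumes group-wise quantization with 16-bit scales shared across groups of 128 weights, a common W4A16 layout that is not confirmed by this card; the scale overhead is why the reduction is approximate rather than an exact 4x.

# Back-of-the-envelope weight-memory arithmetic (illustrative).
# Assumes one 16-bit scale per group of 128 weights; the exact
# quantization layout used for this model is not stated in the card.
def w4a16_bytes_per_param(bits=4, group_size=128, scale_bits=16):
    return bits / 8 + scale_bits / 8 / group_size

bf16 = 16 / 8                    # 2.0 bytes per BF16 weight
w4a16 = w4a16_bytes_per_param()  # ~0.516 bytes per packed INT4 weight + scale
print(f"reduction: {1 - w4a16 / bf16:.1%}")  # ~74%, i.e. "approximately 75%"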

Deployment

This model can be deployed efficiently with vLLM, as in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the request with the model's chat template, since this is an
# instruction-tuned model.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
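
As a minimal sketch, assuming a local server started with vllm serve on vLLM's default port 8000, a request can be sent with the official openai client; the port and api_key value are assumptions, not settings specified by this card.

# Start an OpenAI-compatible server first, e.g. (shell):
#   vllm serve RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 --tensor-parallel-size 8
from openai import OpenAI

# vLLM does not require a real API key; "EMPTY" is a common placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)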

Creation

Creation details

This model was created by applying a development version of llm-compressor. More details will be added once the code is merged into main.
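
Pending those details, the following is a generic, illustrative sketch of a W4A16 quantization flow with llm-compressor's GPTQModifier. The calibration dataset, sample counts, and any MoE-specific handling are assumptions for illustration, not the actual (unpublished) recipe used for this model.

# Illustrative W4A16 flow; NOT the actual recipe used for this model.
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# W4A16: 4-bit weights, 16-bit activations; lm_head left unquantized.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",       # placeholder calibration set (assumption)
    recipe=recipe,
    max_seq_length=2048,           # assumed calibration settings
    num_calibration_samples=512,
)

save_dir = "Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)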

Evaluation

The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness. More evaluations are underway.

Evaluation details

OpenLLM v1

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto

Accuracy

| Benchmark | meta-llama/Llama-4-Maverick-17B-128E-Instruct | RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 (this model) | Recovery (%) |
|---|---|---|---|
| ARC-Challenge (25-shot) | 73.55 | 71.08 | 96.6 |
| GSM8k (5-shot) | 93.18 | 92.87 | 99.7 |
| HellaSwag (10-shot) | 87.27 | 86.95 | 99.6 |
| MMLU (5-shot) | 85.98 | 85.78 | 99.8 |
| TruthfulQA (0-shot) | 62.81 | 62.85 | 100.0 |
| WinoGrande (5-shot) | 78.53 | 78.93 | 100.5 |
| OpenLLM v1 average | 80.22 | 79.74 | 99.4 |
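
Recovery in the table above is the quantized model's score expressed as a percentage of the baseline score, e.g. for ARC-Challenge:

# Recovery (%) = 100 * quantized score / baseline score
baseline, quantized = 73.55, 71.08  # ARC-Challenge (25-shot), from the table
print(f"{100 * quantized / baseline:.1f}%")  # 96.6%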