Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16
Model Overview
- Model Architecture: Llama4ForConditionalGeneration
- Input: Text / Image
- Output: Text
- Model Optimizations:
  - Weight quantization: INT4
- Release Date: 06/12/2025
- Version: 1.0
- Model Developers: Red Hat (Neural Magic)
Model Optimizations
This model was obtained by quantizing the weights of Llama-4-Maverick-17B-128E-Instruct to the INT4 data type. This optimization reduces the number of bits used to represent each weight from 16 to 4, cutting GPU memory and disk size requirements by approximately 75%. The quantization was performed with the llm-compressor library.
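As a rough sanity check on the ~75% figure, here is a back-of-the-envelope estimate (a sketch: the ~400B total parameter count for the 128-expert Maverick model is Meta's published figure, and the small overhead from quantization scales/zero-points is ignored):

```python
# Back-of-the-envelope weight-storage estimate (illustrative only).
# The ~400B total parameter count is an assumption based on Meta's published
# figures for Llama-4-Maverick-17B-128E; quantization metadata is ignored.
total_params = 400e9

bf16_gb = total_params * 16 / 8 / 1e9  # 16 bits per weight -> ~800 GB
int4_gb = total_params * 4 / 8 / 1e9   # 4 bits per weight  -> ~200 GB

print(f"BF16: ~{bf16_gb:.0f} GB, INT4: ~{int4_gb:.0f} GB, "
      f"reduction: {100 * (1 - int4_gb / bf16_gb):.0f}%")  # -> 75%
```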
Deployment
This model can be deployed efficiently using vLLM, as shown in the example below.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Format the request with the model's chat template before generation.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
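As a minimal sketch, the model can be served with `vllm serve RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 --tensor-parallel-size 8` and then queried with any OpenAI-compatible client (the port and `api_key` below are illustrative defaults, not fixed requirements):

```python
# Minimal OpenAI-client sketch against a local vLLM server.
# base_url/port and api_key are illustrative; adjust to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```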
Creation
Creation details
This model was created by applying a development version of llm-compressor. More details will be added once the code is merged into main.
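Since the exact recipe has not been published yet, the following is only a rough sketch of a typical llm-compressor W4A16 oneshot flow. The modifier choice (a data-free QuantizationModifier rather than, say, GPTQ with calibration data), the targets, and the ignore list are assumptions, not the recipe used for this model; for an MoE vision-language model the vision tower and router layers would likely also be excluded.

```python
# Hypothetical sketch of an llm-compressor W4A16 oneshot flow;
# NOT the exact (unpublished) recipe used to produce this model.
from transformers import AutoProcessor, Llama4ForConditionalGeneration
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Quantize the weights of Linear layers to INT4 while keeping activations
# in 16-bit (W4A16); the lm_head is commonly left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

save_dir = "Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16"
model.save_pretrained(save_dir, save_compressed=True)
processor.save_pretrained(save_dir)
```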
Evaluation
The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness. More evaluations are underway.
Evaluation details
OpenLLM v1
```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto
```
Accuracy
| Benchmark | Recovery (%) | meta-llama/Llama-4-Maverick-17B-128E-Instruct | RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 (this model) |
|---|---|---|---|
| ARC-Challenge (25-shot) | 96.6 | 73.55 | 71.08 |
| GSM8k (5-shot) | 99.7 | 93.18 | 92.87 |
| HellaSwag (10-shot) | 99.6 | 87.27 | 86.95 |
| MMLU (5-shot) | 99.8 | 85.98 | 85.78 |
| TruthfulQA (0-shot) | 100.0 | 62.81 | 62.85 |
| WinoGrande (5-shot) | 100.5 | 78.53 | 78.93 |
| OpenLLM v1 Average Score | 99.4 | 80.22 | 79.74 |
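Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score, i.e. Recovery (%) = 100 × (this model's score / baseline score). For ARC-Challenge, for example, 100 × 71.08 / 73.55 ≈ 96.6.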