---
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-32B
license: apache-2.0
library_name: Model Optimizer
tags:
- nvidia
- ModelOpt
- Qwen3
- quantized
- FP4
- fp4
---
# Model Overview
## Description:
The NVIDIA Qwen3-32B FP4 model is the quantized version of Alibaba's Qwen3-32B model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check [here](https://huggingface.co/Qwen/Qwen3-32B). The NVIDIA Qwen3-32B FP4 model is quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
This model is ready for commercial/non-commercial use.
## Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA [(Qwen3-32B) Model Card](https://huggingface.co/Qwen/Qwen3-32B).
### License/Terms of Use:
[Apache license 2.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md)
### Deployment Geography:
Global
### Use Case:
Developers looking to deploy off-the-shelf, pre-quantized models in AI agent systems, chatbots, RAG systems, and other AI-powered applications.
### Release Date:
Hugging Face 09/15/2025 via https://huggingface.co/nvidia/Qwen3-32B-FP4
## Model Architecture:
**Architecture Type:** Transformer
**Network Architecture:** Qwen3-32B
**This model was developed based on:** Qwen3-32B
**Number of model parameters:** 32.8B
## Input:
**Input Type(s):** Text
**Input Format(s):** String
**Input Parameters:** 1D (One-Dimensional): Sequences
**Other Properties Related to Input:** Context length up to 131K
## Output:
**Output Type(s):** Text
**Output Format:** String
**Output Parameters:** 1D (One-Dimensional): Sequences
**Other Properties Related to Output:** N/A
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
## Software Integration:
**Supported Runtime Engine(s):**
* TensorRT-LLM
**Supported Hardware Microarchitecture Compatibility:**
* NVIDIA Blackwell
**Preferred Operating System(s):**
* Linux
## Model Version(s):
The model is quantized with nvidia-modelopt **v0.35.0**
## Post Training Quantization
This model was obtained by quantizing the weights and activations of Qwen3-32B to FP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within transformer blocks are quantized.
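For reference, a comparable post-training quantization flow with ModelOpt looks roughly like the sketch below. This is a minimal illustration, not the exact recipe used to produce this checkpoint; the `NVFP4_DEFAULT_CFG` config name and the `calib_batches` loader (prepared as in the Calibration Dataset section below) are assumptions.
```python
# Minimal NVFP4 PTQ sketch with TensorRT Model Optimizer (illustrative only;
# the exact recipe for this checkpoint is not published in this card).
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B", torch_dtype="auto", device_map="auto"
)

def forward_loop(model):
    # Run calibration batches through the model so ModelOpt can collect
    # activation statistics for the FP4 scaling factors.
    for batch in calib_batches:  # assumed: see Calibration Dataset below
        model(**{k: v.to(model.device) for k, v in batch.items()})

# NVFP4_DEFAULT_CFG targets the linear operators inside transformer blocks,
# quantizing their weights and activations to FP4 as described above.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```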
## Training, Testing, and Evaluation Datasets:
**Data Modality:**
* Text
## Calibration Dataset:
* **Link:** [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail)
* **Data collection method:** Automated
* **Labeling method:** Automated
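A hedged sketch of how calibration batches could be drawn from this dataset follows; the split, sample count, and sequence length are illustrative assumptions, since the actual calibration parameters are not disclosed here.
```python
# Illustrative calibration-data preparation from cnn_dailymail; the actual
# sample count and sequence length used for this checkpoint are undisclosed.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
dataset = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train")

calib_batches = [
    tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    for text in dataset["article"][:512]  # assumed: 512 samples
]
```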
## Training Datasets:
* **Data Collection Method by Dataset:** Undisclosed
* **Labeling Method by Dataset:** Undisclosed
* **Properties:** Undisclosed
## Testing Dataset:
* **Data Collection Method by Dataset:** Undisclosed
* **Labeling Method by Dataset:** Undisclosed
* **Properties:** Undisclosed
## Evaluation Dataset:
* **Datasets:** MMLU Pro, GPQA Diamond, HLE, LiveCodeBench, SciCode, HumanEval, AIME 2024, MATH-500
* **Data collection method:** Hybrid: Automated, Human
* **Labeling method:** Hybrid: Human, Automated
## Inference:
**Engine:** TensorRT-LLM
**Test Hardware:** B200
## Usage
### Deploy with TensorRT-LLM
To deploy the quantized checkpoint with the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) LLM API, follow the sample code below:
* LLM API sample usage:
```python
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="nvidia/Qwen3-32B-FP4")

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()
```
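The checkpoint can also be served behind an OpenAI-compatible HTTP endpoint, for example with TensorRT-LLM's `trtllm-serve`. A minimal client sketch follows, assuming a server is already running locally; the host, port, and `max_tokens` value are illustrative assumptions.
```python
# Client sketch against an OpenAI-compatible endpoint, e.g. one started with
# `trtllm-serve nvidia/Qwen3-32B-FP4`; host/port below are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="nvidia/Qwen3-32B-FP4",
    messages=[{"role": "user", "content": "Explain FP4 quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```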
### Evaluation
The accuracy benchmark results are presented in the table below:
| Precision | MMLU Pro | SciCode | MATH-500 | AIME 2024 |
|:---------:|:--------:|:-------:|:--------:|:---------:|
| BF16 (AA Ref) | 0.80 | 0.35 | 0.96 | 0.81 |
| FP4 | 0.78 | 0.36 | 0.96 | 0.80 |