# Qwen3-32B-AWorld-W8A16
This is a W8A16 (8-bit weight, 16-bit activation) quantized version of the inclusionAI/Qwen3-32B-AWorld model, created using LLM Compressor.
## Model Details
- Model Type: Dense decoder-only (causal) language model
- Base Model: inclusionAI/Qwen3-32B-AWorld
- Quantization Method: GPTQ with SmoothQuant preprocessing
- Weight Precision: 8-bit
- Activation Precision: 16-bit
- Compression Ratio: ~2x model size reduction
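The ~2x figure follows directly from the weight precisions: 32B parameters at 16 bits occupy roughly 64 GB, while the same parameters at 8 bits occupy roughly 32 GB. Quantization scales and any layers kept at full precision add a small overhead on top of that.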
## Quantization Process
This model was quantized using the LLM Compressor library with the following key parameters:
- Algorithm: GPTQ with SmoothQuant preprocessing
- Protection: Gate layers kept at full precision
- Calibration: 512 samples from the `ultrachat_200k` dataset
- Sequence Length: 2048 tokens
The quantization process preserves most of the original model's quality while roughly halving its size, making the model better suited to deployment in resource-constrained environments.
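For reference, the script below is a minimal sketch of what such a run looks like with LLM Compressor, following the library's published examples. The GPTQ + SmoothQuant recipe, the 512 calibration samples, the 2048-token sequence length, and the calibration dataset come from the parameters above; the smoothing strength, the exact `ignore` patterns, and the output directory are assumptions, not the exact settings used for this checkpoint.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant preprocessing followed by GPTQ, per the parameters above.
# The smoothing strength and ignore patterns are assumptions: lm_head and
# gate layers are commonly excluded, but the exact list for this checkpoint
# may differ.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A16",               # 8-bit weights, 16-bit activations
        ignore=["lm_head", "re:.*gate"],
    ),
]

oneshot(
    model="inclusionAI/Qwen3-32B-AWorld",
    dataset="ultrachat_200k",         # calibration dataset (see above)
    recipe=recipe,
    max_seq_length=2048,              # calibration sequence length
    num_calibration_samples=512,      # calibration sample count
    output_dir="Qwen3-32B-AWorld-W8A16",
)
```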
## Usage
The model can be loaded using the standard Hugging Face transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" spreads the checkpoint across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3-32B-AWorld-W8A16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("groxaxo/Qwen3-32B-AWorld-W8A16")
```
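From there, generation follows the usual `transformers` chat-template pattern. A quick smoke test (the prompt is just an illustration) might look like:

```python
messages = [{"role": "user", "content": "Summarize GPTQ in one sentence."}]

# Build the prompt with the model's chat template and move it to the model's device
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```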
## Original Model
This quantized model is derived from inclusionAI/Qwen3-32B-AWorld. Please refer to the original model card for detailed information about the base model's capabilities, training process, and intended use.
## License
This model is released under the Apache 2.0 license, inherited from the original model.