Qwen3-32B-AWorld-W8A16

This is a W8A16 (8-bit weight, 16-bit activation) quantized version of the inclusionAI/Qwen3-32B-AWorld model, created using LLM Compressor.

Model Details

  • Model Type: Dense decoder-only causal language model
  • Base Model: inclusionAI/Qwen3-32B-AWorld
  • Quantization Method: GPTQ with SmoothQuant preprocessing
  • Weight Precision: 8-bit
  • Activation Precision: 16-bit
  • Compression Ratio: ~2x model size reduction
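
The ~2x figure follows from the weight storage: BF16 uses 2 bytes per parameter versus 1 byte at 8-bit, so for a 32B-parameter model the weights drop from roughly 64 GB to roughly 32 GB, plus a small overhead for quantization scales and any layers kept at full precision.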

Quantization Process

This model was quantized using the LLM Compressor library with the following key parameters:

  • Algorithm: GPTQ with SmoothQuant preprocessing
  • Protection: gate layers kept at full precision
  • Calibration: 512 samples from the ultrachat_200k dataset
  • Sequence Length: 2048 tokens

The quantization process is designed to preserve the quality of the original model while roughly halving its size, making it better suited to deployment in resource-constrained environments.
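
For reference, below is a minimal sketch of what such a one-shot run can look like with the llmcompressor library. The exact script for this checkpoint is not published; the smoothing strength, ignore list, and output directory are assumptions, while the dataset, sample count, and sequence length are taken from the parameters listed above:

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant shifts activation outliers into the weights before GPTQ runs;
# the 0.8 smoothing strength is an assumed, commonly used value.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    # W8A16: quantize Linear weights to 8 bits, leave activations at 16 bits.
    # The lm_head ignore entry is an assumption, not the published recipe.
    GPTQModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"]),
]

oneshot(
    model="inclusionAI/Qwen3-32B-AWorld",
    dataset="ultrachat_200k",        # calibration dataset named above
    recipe=recipe,
    max_seq_length=2048,             # sequence length named above
    num_calibration_samples=512,     # calibration sample count named above
    output_dir="Qwen3-32B-AWorld-W8A16",  # assumed output path
)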

Usage

The model can be loaded with the standard Hugging Face transformers library; loading compressed-tensors checkpoints also requires the compressed-tensors package to be installed:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("groxaxo/Qwen3-32B-AWorld-W8A16", device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("groxaxo/Qwen3-32B-AWorld-W8A16")
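
Once loaded, generation works as with any transformers causal LM. A short illustrative example follows; the prompt and decoding settings are placeholders, not recommendations:

# Illustrative chat-style generation with the model loaded above.
messages = [{"role": "user", "content": "Summarize GPTQ in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))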

Original Model

This quantized model is derived from inclusionAI/Qwen3-32B-AWorld. Please refer to the original model card for detailed information about the base model's capabilities, training process, and intended use.

License

This model is licensed under the Apache 2.0 license, inheriting the license from the original model.
