# Qwen3-32B-AWorld-W8A16
This is a W8A16 (8-bit weight, 16-bit activation) quantized version of the inclusionAI/Qwen3-32B-AWorld model, created using LLM Compressor.
## Model Details
- Model Type: Dense decoder-only (causal) language model
- Base Model: inclusionAI/Qwen3-32B-AWorld
- Quantization Method: GPTQ with SmoothQuant preprocessing
- Weight Precision: 8-bit
- Activation Precision: 16-bit
- Compression Ratio: ~2x model size reduction
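The ~2x figure follows directly from the weight precisions: 32B parameters at 16 bits occupy roughly 64 GB, while the same parameters at 8 bits occupy roughly 32 GB. Quantization scales and any layers kept at full precision add a small overhead on top of that.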
## Quantization Process
This model was quantized using the LLM Compressor library with the following key parameters:
- Algorithm: GPTQ with SmoothQuant preprocessing
- Protection: Gate layers kept at full precision
- Calibration: 512 samples from the `ultrachat_200k` dataset
- Sequence Length: 2048 tokens
The quantization process preserves most of the original model's quality while roughly halving its size, making the model better suited to deployment in resource-constrained environments.
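For reference, the script below is a minimal sketch of what such a run looks like with LLM Compressor, following the library's published examples. The GPTQ + SmoothQuant recipe, the 512 calibration samples, the 2048-token sequence length, and the calibration dataset come from the parameters above; the smoothing strength, the exact `ignore` patterns, and the output directory are assumptions, not the exact settings used for this checkpoint.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant preprocessing followed by GPTQ, per the parameters above.
# The smoothing strength and ignore patterns are assumptions: lm_head and
# gate layers are commonly excluded, but the exact list for this checkpoint
# may differ.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A16",               # 8-bit weights, 16-bit activations
        ignore=["lm_head", "re:.*gate"],
    ),
]

oneshot(
    model="inclusionAI/Qwen3-32B-AWorld",
    dataset="ultrachat_200k",         # calibration dataset (see above)
    recipe=recipe,
    max_seq_length=2048,              # calibration sequence length
    num_calibration_samples=512,      # calibration sample count
    output_dir="Qwen3-32B-AWorld-W8A16",
)
```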
## Usage
The model can be loaded using the standard Hugging Face transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" spreads the checkpoint across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3-32B-AWorld-W8A16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("groxaxo/Qwen3-32B-AWorld-W8A16")
```
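From there, generation follows the usual `transformers` chat-template pattern. A quick smoke test (the prompt is just an illustration) might look like:

```python
messages = [{"role": "user", "content": "Summarize GPTQ in one sentence."}]

# Build the prompt with the model's chat template and move it to the model's device
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```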
## Original Model
This quantized model is derived from inclusionAI/Qwen3-32B-AWorld. Please refer to the original model card for detailed information about the base model's capabilities, training process, and intended use.
## License
This model is released under the Apache 2.0 license, inherited from the original model.