# Wan 2.1 W4A16 INT4 Quantized Model
This is a W4A16 INT4 quantized version of the Wan-AI/Wan2.1-T2V-14B-Diffusers model, compressed using the ViDiT-Q quantization framework.
## Model Details
- Base Model: Wan 2.1 Text-to-Video 14B Diffusers
- Quantization Method: W4A16 (4-bit weights, 16-bit activations)
- Group Size: 64 (per-group quantization)
- Framework: ViDiT-Q
- Original Size: ~28 GB (FP16)
- Quantized Size: 9.1 GB
- Compression Ratio: ~3.1x vs FP16 (~5.9x vs the ~53.2 GB FP32 footprint; see the size sketch below)
- Quantized Layers: 400
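As a back-of-the-envelope check on the sizes above (a sketch assuming ~14 billion quantized parameters; exact per-layer counts differ, and FP16/32 layers plus quantization parameters account for the remainder up to 9.1 GB):

```python
# Rough size arithmetic for the figures listed above (assumption: ~14e9
# parameters are quantized; non-quantized layers and zero-points fill the rest).
params = 14e9
int4_bytes = params * 0.5       # 4 bits per weight            -> ~7.0 GB
scale_bytes = params / 64 * 2   # one FP16 scale per 64 weights -> ~0.44 GB
print(f"{int4_bytes / 1e9:.1f} GB packed weights + {scale_bytes / 1e9:.2f} GB scales")
```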
## Quantization Details
This model uses a real packed INT4 storage format, not "fake" quantization that merely simulates 4-bit rounding while keeping FP16 tensors. A minimal packing sketch follows the list:
- Transformer weights are packed into 4-bit integers
- Two weights are stored per byte for maximum storage efficiency
- Quantization parameters (scales, zero-points) are included for reconstruction
- Non-critical layers (embeddings, norms) remain in FP16/FP32
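The sketch below illustrates how such a format can be produced: symmetric per-group quantization with group size 64, a fixed +8 offset standing in for the zero-point, and two 4-bit values packed per `uint8`. The function name and tensor layout are illustrative assumptions, not the actual ViDiT-Q implementation.

```python
import torch

def pack_w4_per_group(weight: torch.Tensor, group_size: int = 64):
    """Symmetric per-group INT4 quantization + 2-per-byte packing (illustrative)."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.float().reshape(out_features, in_features // group_size, group_size)

    # One scale per group: map the group's max magnitude onto the INT4 max (7).
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7)

    # Shift [-8, 7] -> [0, 15] so each value fits in a nibble, then pack pairs.
    q = (q + 8).to(torch.uint8).reshape(out_features, in_features)
    packed = (q[:, 0::2] << 4) | q[:, 1::2]  # even index -> high nibble
    return packed, scales.squeeze(-1).half()
```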
## Usage
```python
import torch
from your_inference_library import load_quantized_wan21  # placeholder: supply your own loader

# Load the quantized model from this repository
model = load_quantized_wan21("samuelt0207/quantize_wan")

# The loader unpacks INT4 weights on the fly during inference;
# after loading, use the same API as the original Wan 2.1 model.
```
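If you want to work with the raw checkpoint instead, it can be inspected with plain PyTorch. The internal key layout of `wan21_int4_packed.pth` is an assumption here, so list the keys before relying on any of them:

```python
import torch

# Peek at the packed checkpoint. The flat-dict layout assumed here is a
# guess; print the keys to confirm the real structure first.
state = torch.load("wan21_int4_packed.pth", map_location="cpu")
for name, value in list(state.items())[:5]:
    print(name, getattr(value, "dtype", type(value).__name__), getattr(value, "shape", ""))
```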
## Performance
- Memory Usage: ~9.1 GB for weights (vs ~28 GB in FP16 or ~53.2 GB in FP32)
- Speed: comparable to FP16 inference when proper INT4 kernels are used
- Quality: minimal degradation, thanks to per-group quantization
## Technical Details
- Quantization Scheme: Symmetric per-group INT4 quantization
- Group Size: 64 weights per quantization group
- Storage Format: Packed uint8 tensors (2 INT4 values per byte)
- Reconstruction: On-the-fly unpacking during inference (sketched below)
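A matching sketch of the reconstruction step, the inverse of the packing example above; again, the layout is an assumption rather than the checkpoint's actual format:

```python
import torch

def unpack_and_dequant(packed: torch.Tensor, scales: torch.Tensor, group_size: int = 64):
    """Inverse of the packing sketch above: uint8 nibble pairs -> FP16 weights."""
    out_features = packed.shape[0]
    q = torch.empty(out_features, packed.shape[1] * 2, dtype=torch.uint8, device=packed.device)
    q[:, 0::2] = packed >> 4    # high nibble (even indices)
    q[:, 1::2] = packed & 0x0F  # low nibble (odd indices)
    q = q.to(torch.int8) - 8    # undo the [0, 15] shift back to [-8, 7]

    # Apply one scale per 64-weight group, then flatten back to the weight shape.
    w = q.reshape(out_features, -1, group_size).float() * scales.float().unsqueeze(-1)
    return w.reshape(out_features, -1).half()
```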
## Files
- `wan21_int4_packed.pth`: Main model file with packed INT4 weights
- `config.json`: Model configuration and quantization metadata
- `README.md`: This model card
## Citation
If you use this quantized model, please cite both the original Wan 2.1 paper and the ViDiT-Q quantization framework.
## License
Same license as the original Wan 2.1 model. Please check the base model repository for license details.