In these quantized versions of the model, most layers are stored in MXFP4 to save space. The two versions differ only in how the expert "gate" tensors (ffn_gate_exps.weight) inside the mixture-of-experts feed-forward layers are stored:
- Q4_1 version (≈12 GB): the gate tensors are kept at lower precision, making the file smaller and inference a bit faster, at the cost of slightly less precise gate outputs.
- Q8_0 version (≈15 GB): the gate tensors keep more precision, so their outputs stay closer to the original model, but the file is bigger and a bit slower.
All other layers are treated the same in both versions.
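
As a rough illustration of where the ≈3 GB difference comes from, the sketch below estimates file size from bits per weight. The bits-per-weight figures follow the standard llama.cpp block layouts (MXFP4 ≈ 4.25, Q4_1 = 5.0, Q8_0 = 8.5); the split between gate-tensor parameters and the rest of the model is a hypothetical placeholder for illustration, not taken from this model's actual tensor shapes.

```python
# Rough size estimate for the two variants, assuming llama.cpp block formats:
# MXFP4 ≈ 4.25 bits/weight, Q4_1 = 5.0 bits/weight, Q8_0 = 8.5 bits/weight.

def gguf_size_gb(gate_params: float, other_params: float,
                 gate_bpw: float, other_bpw: float = 4.25) -> float:
    """Approximate file size in GB from parameter counts and bits per weight."""
    total_bits = gate_params * gate_bpw + other_params * other_bpw
    return total_bits / 8 / 1e9

# Hypothetical split: 7e9 parameters in the expert gate tensors, 14e9 elsewhere.
gate, other = 7e9, 14e9
print(f"Q4_1 gates: ~{gguf_size_gb(gate, other, 5.0):.1f} GB")  # ≈ 11.8 GB
print(f"Q8_0 gates: ~{gguf_size_gb(gate, other, 8.5):.1f} GB")  # ≈ 14.9 GB
```

With these placeholder numbers the estimates land close to the ≈12 GB and ≈15 GB file sizes above, which shows how changing only the gate tensors' precision accounts for the gap.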