LLaMA 4 Maverick 17B 128E Instruct (GGUF, Non-Quantized)

This is a non-quantized GGUF conversion of the original meta-llama/Llama-4-Maverick-17B-128E-Instruct model.
It has been converted for compatibility with inference libraries that use the GGUF format, such as llama.cpp, llama-cpp-python, llamafile, and Ollama.
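
As an illustration, here is a minimal sketch of chat inference with llama-cpp-python. The model path, context size, and generation parameters are assumptions and should be adjusted to your local files and hardware.

```python
# Minimal sketch: chat inference via llama-cpp-python (pip install llama-cpp-python).
# The model_path is an assumption -- point it at your local copy of the GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-4-maverick-17b-128e-instruct-f16.gguf",  # assumed local filename
    n_ctx=8192,       # context window; lower it to reduce memory use
    n_gpu_layers=-1,  # offload all layers to GPU if VRAM allows; set 0 for CPU-only
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a Mixture-of-Experts model is."}],
    max_tokens=256,
)
print(result["choices"][0]["message"]["content"])
```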


Model Details

  • Architecture: LLaMA 4
  • Model Type: Mixture of Experts (MoE) - 128 Experts
  • Parameters: 17 billion active per token; roughly 401 billion total across the 128 experts
  • Quantization: None (float16 precision)
  • Format: .gguf (non-quantized)

Intended Use

  • Research and evaluation purposes
  • Fine-tuning or quantization into lower-bit formats (q4, q5) for efficient inference; see the quantization sketch after this list
  • Deployment on high-memory systems (256 GB RAM or >512 GB VRAM recommended)
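
As a concrete illustration of the quantization path, the sketch below shells out to llama.cpp's llama-quantize tool to produce a q4_k_m file. The binary location and file names are assumptions; build llama.cpp locally and adjust the paths to your environment.

```python
# Minimal sketch: quantize the f16 GGUF to Q4_K_M with llama.cpp's llama-quantize tool.
# The binary location and file names are assumptions -- adjust to your local build.
import subprocess

src = "llama-4-maverick-17b-128e-instruct-f16.gguf"      # assumed f16 input
dst = "llama-4-maverick-17b-128e-instruct-q4_k_m.gguf"   # quantized output

# llama-quantize usage: <input.gguf> <output.gguf> <type>
subprocess.run(["./llama-quantize", src, dst, "Q4_K_M"], check=True)
```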

Notes

  • This model is extremely large: the unquantized GGUF file is approximately 801 GB (see the footprint sketch after this list).
  • Running this model requires high-end hardware, preferably with multiple GPUs or extremely large VRAM.
  • A quantized version (q4_k_m, q5_0) will be released separately for easier local inference.
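
The file-size figure above follows directly from the parameter count: roughly 401 billion weights at 2 bytes each in float16. The back-of-the-envelope sketch below shows how the weight footprint scales with nominal bit-width; real GGUF quantizations add per-block overhead, and KV cache and activation memory are not included, so treat these as lower bounds.

```python
# Back-of-the-envelope weight footprint at different nominal bit-widths.
# Real GGUF quantizations add per-block overhead; KV cache and activations
# are not included, so treat these as lower bounds.
PARAMS = 401e9  # approximate total parameter count (17B active, 128 experts)

for name, bits in [("f16", 16), ("q8", 8), ("q5", 5), ("q4", 4)]:
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{name:>4}: ~{gigabytes:,.0f} GB")
```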

License

This model is distributed under the Meta Llama 4 license; the same terms that apply to the original weights apply to this conversion.
Users must have accepted Meta's license agreement to access and use this model.


Acknowledgments

  • Original model weights: Meta AI
  • GGUF conversion: Done using llama.cpp tooling.

Disclaimer

This is a direct format conversion.
No fine-tuning, evaluation, or modification has been performed beyond reformatting into GGUF.

Use responsibly and within the bounds of the Meta Llama 4 license.
