llama.cpp now supports Llama-3_1-Nemotron-Ultra-253B-v1


llama.cpp now supports Nvidia's Llama-3_1-Nemotron-Ultra-253B-v1, starting from release b5270.

Hopefully, this can help Nvidia sell some DGX Sparks. ;)

https://github.com/ggml-org/llama.cpp/pull/12843

As you can see from the PR discussion, people with multiple cards are having trouble running it due to the tail-heavy nature of this model: its layers are not uniform in size, and the largest ones sit near the end, so an even split across GPUs overloads the cards that hold the tail.

To help them get started, I made an Excel sheet for working out the "-ts" (tensor split) parameter of llama-cli:

https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/resolve/main/deci.xlsx?download=true
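For example, on a hypothetical 4-GPU machine, a starting command might look like the one below. The -ts ratios are placeholders rather than recommended values, and the quant filename is illustrative; work out your own ratios with the sheet above.

```
# Illustrative only: the -ts ratios and the filename are placeholders.
# Because the model is tail-heavy, the later GPUs (which receive the
# later, larger layers) should get a smaller share of the layers.
./llama-cli -m Llama-3_1-Nemotron-Ultra-253B-v1-Q4_K_M.gguf \
    -ngl 99 -sm layer -ts 30,28,22,20 \
    -p "Hello"
```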

Ideally, llama.cpp would distribute the layers automatically.
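To illustrate the idea, here is a minimal sketch of what such a feature might compute. This is not llama.cpp code: it assumes you already know each layer's weight size and each GPU's usable VRAM, and it assigns contiguous layers so that every card's share of the total bytes roughly matches its share of the total memory.

```python
# A minimal sketch, not llama.cpp code. Assumes layer_sizes holds per-layer
# weight sizes in GiB (non-uniform for this model) and vram holds each
# GPU's usable memory in GiB after KV cache and compute buffers.

def suggest_tensor_split(layer_sizes, vram):
    """Assign contiguous layers to GPUs so each card's share of the total
    weight bytes roughly matches its share of total VRAM, and return the
    per-GPU layer counts, usable directly as llama-cli's -ts values."""
    total_bytes = sum(layer_sizes)
    total_vram = sum(vram)
    counts = [0] * len(vram)
    gpu, used = 0, 0.0
    for size in layer_sizes:
        target = total_bytes * vram[gpu] / total_vram
        # Advance to the next GPU once this one reaches its target share,
        # unless it is the last GPU, which must absorb the remainder.
        if used + size > target and gpu < len(vram) - 1:
            gpu, used = gpu + 1, 0.0
        counts[gpu] += 1
        used += size
    return counts

# Toy tail-heavy model: 8 small layers followed by 4 large ones,
# split across two identical 24 GiB cards.
sizes = [1.0] * 8 + [4.0] * 4
print(suggest_tensor_split(sizes, [24, 24]))  # [9, 3]
```

Note how the balanced answer puts 9 layers on the first card but only 3 on the second, even though the cards are identical; that is exactly why an even "-ts" fails on this model.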

If Nvidia can sponsor me with access to a machine with multiple cards, I can help develop such a feature.
