Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf

I tried creating an imatrix on a calibration dataset for RP, so performance outside of RP might be slightly worse compared to regular IQ4_XS quants. Also, I kept the embeddings in better quality.

8 bit Output tensor/embed (That added about 200MB)

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

build/bin/llama-cli --hf-repo Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf --hf-file Pantheon-Proto-RP-1.8-30B-A3B-IQ4_XS.gguf -p "The meaning to life and the universe is"

Server:

build/bin/llama-server --hf-repo Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf --hf-file Pantheon-Proto-RP-1.8-30B-A3B-IQ4_XS.gguf -c 2048

Note: You can also use this checkpoint directly through the usage steps listed in the Llama.cpp repo as well.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggml-org/llama.cpp

Step 2: Move into the llama.cpp folder and build it with -DLLAMA_CURL=ON flag along with other hardware-specific flags (for ex: -DGGML_CUDA=ON for Nvidia GPUs on Linux).

cd llama.cpp && cmake -B build -DLLAMA_CURL=ON && cmake --build build --config Release -j

Step 3: Run inference through the main binary.

build/bin/llama-cli --hf-repo Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf --hf-file Pantheon-Proto-RP-1.8-30B-A3B-IQ4_XS.gguf -p "The meaning to life and the universe is"

build/bin/llama-server --hf-repo Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf --hf-file Pantheon-Proto-RP-1.8-30B-A3B-IQ4_XS.gguf -c 2048

Disya
/

Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf