Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf
I tried creating an imatrix on a calibration dataset for RP, so performance outside of RP might be slightly worse compared to regular IQ4_XS quants. Also, I kept the embeddings in better quality.
8 bit Output tensor/embed (That added about 200MB)
Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux)
brew install llama.cpp
Invoke the llama.cpp server or the CLI.
CLI:
build/bin/llama-cli --hf-repo Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf --hf-file Pantheon-Proto-RP-1.8-30B-A3B-IQ4_XS.gguf -p "The meaning to life and the universe is"
Server:
build/bin/llama-server --hf-repo Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf --hf-file Pantheon-Proto-RP-1.8-30B-A3B-IQ4_XS.gguf -c 2048
Note: You can also use this checkpoint directly through the usage steps listed in the Llama.cpp repo as well.
Step 1: Clone llama.cpp from GitHub.
git clone https://github.com/ggml-org/llama.cpp
Step 2: Move into the llama.cpp folder and build it with -DLLAMA_CURL=ON
flag along with other hardware-specific flags (for ex: -DGGML_CUDA=ON for Nvidia GPUs on Linux).
cd llama.cpp && cmake -B build -DLLAMA_CURL=ON && cmake --build build --config Release -j
Step 3: Run inference through the main binary.
build/bin/llama-cli --hf-repo Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf --hf-file Pantheon-Proto-RP-1.8-30B-A3B-IQ4_XS.gguf -p "The meaning to life and the universe is"
or
build/bin/llama-server --hf-repo Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf --hf-file Pantheon-Proto-RP-1.8-30B-A3B-IQ4_XS.gguf -c 2048
- Downloads last month
- 49
Hardware compatibility
Log In
to view the estimation
4-bit
Inference Providers
NEW
This model isn't deployed by any Inference Provider.
๐
Ask for provider support
Model tree for Disya/Gryphe-Pantheon-Proto-RP-1.8-30B-A3B-Q4_XS-gguf
Base model
Qwen/Qwen3-30B-A3B-Base
Finetuned
Gryphe/Pantheon-Proto-RP-1.8-30B-A3B