Triangle104/Impish_LLAMA_4B-Q8_0-GGUF

This model was converted to GGUF format from SicariusSicariiStuff/Impish_LLAMA_4B using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.
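
If you'd rather reproduce the conversion locally instead of using the space, a rough sketch with llama.cpp's own tooling looks like this (assumptions: a local clone of the original model at ./Impish_LLAMA_4B, a built llama-quantize binary, and current llama.cpp script names; the intermediate file name is illustrative):

# Convert the HF checkpoint to a full-precision GGUF file
python convert_hf_to_gguf.py ./Impish_LLAMA_4B --outfile impish_llama_4b-f16.gguf --outtype f16
# Quantize to Q8_0, matching the quant in this repo
./llama-quantize impish_llama_4b-f16.gguf impish_llama_4b-q8_0.gguf Q8_0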


Almost a year ago, I created Impish_LLAMA_3B, the first fully coherent 3B roleplay model at the time. It was quickly adopted by some platforms and became one of the go-to models for mobile. Some time later, I made Fiendish_LLAMA_3B and insisted it was not an upgrade but a different flavor (which was indeed the case, as a different dataset was used to tune it).

Impish_LLAMA_4B, however, is an upgrade, and a big one. I've had over a dozen 4B candidates, but none of them were 'worthy' of the Impish badge. This model has superior responsiveness and context awareness, and is able to pull off very coherent adventures. It also comes with some additional assistant capabilities. Of course, while it is exceptionally competent for its size, it is still a 4B. Manage expectations and all that. I, however, am very much pleased with it; it took several tries to pull off just right. Total tokens trained: about 400M (being a generalist model, a lot of tokens went to general data, despite the emphasis on roleplay & adventure).

This took more effort than I thought it would. Because of course it would. This is mainly due to me refusing to release a model only 'slightly better' than my two 3B models mentioned above. Because "what would be the point" in that? The reason I included so many tokens for this tune is that small models are especially sensitive to many factors, including the percentage of moisture in the air and how many times I ran nvidia-smi since the system last started.

It's no secret that roleplay/creative-writing tuning can reduce a model's general intelligence (any tune and RL risk this, but roleplay models are especially 'fragile'). Therefore, additional tokens of general assistant data were needed, in my opinion, and indeed they seemed to help a lot with retaining intelligence.

This model is also 'built a bit different', literally, as it is based on NVIDIA's pruned model; it does not 'behave' like a typical 8B, from my own subjective impression. This helped a lot with keeping it smart at such a small size.


Use with llama.cpp

Install llama.cpp through brew (works on macOS and Linux)

brew install llama.cpp
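
To confirm the install worked, you can print the build info (assuming the Homebrew build puts llama-cli on your PATH):

llama-cli --version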

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo Triangle104/Impish_LLAMA_4B-Q8_0-GGUF --hf-file impish_llama_4b-q8_0.gguf -p "The meaning to life and the universe is"
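
The -p flag only sets the prompt; a slightly fuller invocation might also cap generation length, widen the context, and offload layers to the GPU (the values below are illustrative, not recommendations from the model author):

llama-cli --hf-repo Triangle104/Impish_LLAMA_4B-Q8_0-GGUF --hf-file impish_llama_4b-q8_0.gguf -p "The meaning to life and the universe is" -n 256 -c 4096 -ngl 99 --temp 0.8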

Server:

llama-server --hf-repo Triangle104/Impish_LLAMA_4B-Q8_0-GGUF --hf-file impish_llama_4b-q8_0.gguf -c 2048
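
Once the server is running (it listens on port 8080 by default), you can query its OpenAI-compatible chat endpoint; a minimal sketch:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Start a short adventure scene."}], "max_tokens": 128}'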

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with any other hardware-specific flags (e.g., LLAMA_CUDA=1 for NVIDIA GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make
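
Note that recent llama.cpp checkouts have replaced the Makefile with CMake; if the make invocation above fails, the equivalent CMake build (option names per current llama.cpp docs, with -DGGML_CUDA=ON as the CUDA analogue) is roughly:

cmake -B build -DLLAMA_CURL=ON
cmake --build build --config Release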

Step 3: Run inference through the main binary.

./llama-cli --hf-repo Triangle104/Impish_LLAMA_4B-Q8_0-GGUF --hf-file impish_llama_4b-q8_0.gguf -p "The meaning to life and the universe is"

or

./llama-server --hf-repo Triangle104/Impish_LLAMA_4B-Q8_0-GGUF --hf-file impish_llama_4b-q8_0.gguf -c 2048
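
If you prefer to download the file once and run it from disk instead of resolving --hf-repo on each run, a hedged alternative using huggingface-cli (assumption: the huggingface_hub package is installed):

huggingface-cli download Triangle104/Impish_LLAMA_4B-Q8_0-GGUF impish_llama_4b-q8_0.gguf --local-dir .
./llama-cli -m ./impish_llama_4b-q8_0.gguf -p "The meaning to life and the universe is"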

Model size: 4.51B params · Architecture: llama · Quantization: 8-bit (Q8_0)