GGUF uploaded now + chat template fixes!
Edit: Reuploaded due to OpenAI's chat template change & our new chat template fixes.
It's uploaded now, with some of our chat template fixes!
This is the FP4 version. Please update whichever inference engine you're using!
Dynamic GGUFs in different sizes will come later, once llama.cpp updates to support them!
Let us know if you encounter any issues! Guide: https://docs.unsloth.ai/basics/gpt-oss
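If you want to grab the files programmatically, here's a minimal download sketch in Python, assuming the huggingface_hub package; the repo id and filename pattern below are illustrative, so check the model page for the exact names:

```python
# Minimal sketch: pull only the F16 GGUF files from the Hugging Face repo.
# The repo id and filename pattern are assumptions; check the model page.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/gpt-oss-120b-GGUF",  # assumed repo id
    local_dir="gpt-oss-120b-GGUF",
    allow_patterns=["*F16*"],             # only the F16 shards
)
```

Then point llama-server (or your engine of choice) at the downloaded .gguf as usual.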
I'm trying to get 120b running on V100s with vLLM. Is there any guidance on this? I keep running into issues.
gpt-oss-120b-F16.gguf
F16 got this error:
gguf_init_from_file_impl: tensor 'blk.25.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
gguf_init_from_file_impl: failed to read tensor info
Had the same issue on llama.cpp; after a git pull and recompile it loads now, though I'm still having issues with the output. I guess I'm using the wrong chat template.
Has the most recent update fixed these issues?
All responses are starting with <|channel|>analysis
- but besides this, the responses seem good. Using llama.cpp @ fd1234cb468935ea087d6929b2487926c3afff4b, running on dual A5000s with ./build/bin/llama-server --model /mnt/ds1nfs/codellamaweights/gpt-oss-120b-F16.gguf --port 5001 --host 192.168.0.81 -ngl 99 -c 65535 -b 10240 -ub 2048 --n-cpu-moe 13 -fa --jinja -ts 100,55
Special tokens are also leaking into the code section:
<|channel|>analysis<|message|>We need to output a combined working code for pygame that meets the described features: 2-player tank battle with collisions and projectiles. Should be self-contained, no external images. Use simple shapes (rectangles). Provide health display. Movement with WASD for player 1, arrow keys for player 2. Space fires for player 1, Return for player 2. Bullets disappear on wall or tank collision. Tanks have HP displayed. Provide a working script. We need to output only the code, no extra explanations. Ensure it's functional. Use pygame. We'll need to handle screen dimensions, define Tank class with rect, color, speed, health. Bullet class with rect, velocity, owner reference. Main loop with event handling, movement, firing, collision detection. Display health texts. Add boundaries: walls (screen edges). Provide bullet removal when off-screen or hitting tank. Also show health at top left for each tank. Add simple background. We must output "combined working code only". So just the Python script. Let's craft code. Potential pitfalls: bullet direction based on tank orientation. Since we don't have rotation images, we can have simple direction: each tank can fire in the direction it's moving? Or maybe fixed direction: up? But better to allow tank to rotate with keys? Could be simple: tank moves 4-directionally and bullet travels in direction of last movement key pressed. We'll store direction vector for each tank. Implement. Let's write code. <|start|>assistant<|channel|>final<|message|>```python import pygame import sys
When I drop the --jinja flag, the thinking looks correct in the llama.cpp webui app.
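Until the fixes land everywhere, a client-side band-aid is to parse the leaked Harmony control tokens yourself. A rough Python sketch follows; the token names are copied from the raw output above, and this is a workaround that assumes the tokens arrive verbatim, not the proper fix (which is updating llama.cpp and the chat template):

```python
# Workaround sketch: recover just the "final" channel from raw output that
# leaks Harmony control tokens like <|channel|>analysis<|message|>...
import re

def extract_final(raw: str) -> str:
    # Grab everything after the final-channel marker if present,
    # otherwise keep the whole string.
    m = re.search(r"<\|channel\|>final<\|message\|>(.*)", raw, re.DOTALL)
    text = m.group(1) if m else raw
    # Strip any remaining <|...|> control tokens.
    return re.sub(r"<\|[^|>]+\|>", "", text).strip()
```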
The quant sizes are almost the same; is that how it should be? F16 is 65.4 GB and Q2_K is 62.6 GB.
From their docs:
Any quant smaller than f16, including 2-bit, has minimal accuracy loss, since only some parts (e.g., attention layers) are lower bit while most remain full-precision. That's why sizes are close to the f16 model; for example, the 2-bit (11.5 GB) version performs nearly the same as the full 16-bit (14 GB) one. Once llama.cpp supports better quantization for these models, we'll upload them ASAP.
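You can check which tensors actually got quantized yourself. A minimal inspection sketch, assuming the gguf Python package that ships with llama.cpp (pip install gguf) and a local file path:

```python
# Count tensor quantization types in a GGUF file to see why the 2-bit
# file is barely smaller: most tensors should not be 2-bit at all.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("gpt-oss-120b-Q2_K.gguf")  # adjust to your file
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in counts.most_common():
    print(f"{type_name}: {n} tensors")
```

If the docs quoted above are right, only a minority of the tensor types listed should be 2-bit.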
Is anyone able to use "native" tool calling with llama.cpp and OWUI?
all responses are starting with <|channel|>analysis
I fixed this here: https://github.com/ggml-org/llama.cpp/pull/15124
It also requires patching the chat template, as explained in the PR.
Does Tool Calling work?
My llama.cpp crashes on any quant when working with Qwen Code. =(
I found this thread: https://huggingface.co/openai/gpt-oss-120b/discussions/69
Maybe it is still possible to fix the chat template?
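To isolate the crash, a minimal tool-calling smoke test against llama-server's OpenAI-compatible endpoint might help. This is a sketch only: the port, model name, and toy weather tool below are placeholders, and the server needs to be started with --jinja:

```python
# Send one tool-calling request to a local llama-server instance.
# Endpoint, model name, and the weather tool are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-oss-120b",  # whatever name the server reports
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```

If this also crashes the server, that narrows the problem to the model/template rather than Qwen Code.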
With only minor size differences between UD-Q4_K_XL and UD-Q8_K_XL, I went from 4-bit to 8-bit. I found that UD-Q8_K_XL occasionally outputs 'assistant<|channel|>final<|message|>' instead of the full '<|start|>assistant<|channel|>final<|message|>', a problem I never had with UD-Q4_K_XL. Given that 8-bit is 4.41 BPW and is supposed to be more accurate, is that actually how GPT-OSS-120B is supposed to work?