GGUF uploaded now + chat template fixes!
Edit: Reuploaded due to OpenAI's chat template change & our new chat template fixes.
It's uploaded now, with some of our chat template fixes!
This is the FP4 version. Please update whichever inference engine you're using!
Dynamic GGUFs in different sizes will come later, once llama.cpp updates to support them!
Let us know if you encounter any issues! Guide: https://docs.unsloth.ai/basics/gpt-oss
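If you want to grab the files programmatically, here's a minimal download sketch in Python, assuming the huggingface_hub package; the repo id and filename pattern below are illustrative, so check the model page for the exact names:

```python
# Minimal sketch: pull only the F16 GGUF files from the Hugging Face repo.
# The repo id and filename pattern are assumptions; check the model page.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/gpt-oss-120b-GGUF",  # assumed repo id
    local_dir="gpt-oss-120b-GGUF",
    allow_patterns=["*F16*"],             # only the F16 shards
)
```

Then point llama-server (or your engine of choice) at the downloaded .gguf as usual.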
I'm trying to get 120b running on V100s with vLLM. Is there any guidance on this? I keep running into issues.
gpt-oss-120b-F16.gguf
F16 got this error:
gguf_init_from_file_impl: tensor 'blk.25.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
gguf_init_from_file_impl: failed to read tensor info
Had the same issue on llama.cpp; after a git pull and recompile it loads now, though I'm still having issues with the output. I guess I'm using the wrong chat template.
Has the most recent update fixed these issues?
All responses are starting with <|channel|>analysis
- but besides this, the responses seem good. Using llama.cpp @ fd1234cb468935ea087d6929b2487926c3afff4b, running on dual A5000s with ./build/bin/llama-server --model /mnt/ds1nfs/codellamaweights/gpt-oss-120b-F16.gguf --port 5001 --host 192.168.0.81 -ngl 99 -c 65535 -b 10240 -ub 2048 --n-cpu-moe 13 -fa --jinja -ts 100,55
Special tokens are also leaking into the code section:
<|channel|>analysis<|message|>We need to output a combined working code for pygame that meets the described features: 2-player tank battle with collisions and projectiles. Should be self-contained, no external images. Use simple shapes (rectangles). Provide health display. Movement with WASD for player 1, arrow keys for player 2. Space fires for player 1, Return for player 2. Bullets disappear on wall or tank collision. Tanks have HP displayed. Provide a working script. We need to output only the code, no extra explanations. Ensure it's functional. Use pygame. We'll need to handle screen dimensions, define Tank class with rect, color, speed, health. Bullet class with rect, velocity, owner reference. Main loop with event handling, movement, firing, collision detection. Display health texts. Add boundaries: walls (screen edges). Provide bullet removal when off-screen or hitting tank. Also show health at top left for each tank. Add simple background. We must output "combined working code only". So just the Python script. Let's craft code. Potential pitfalls: bullet direction based on tank orientation. Since we don't have rotation images, we can have simple direction: each tank can fire in the direction it's moving? Or maybe fixed direction: up? But better to allow tank to rotate with keys? Could be simple: tank moves 4-directionally and bullet travels in direction of last movement key pressed. We'll store direction vector for each tank. Implement. Let's write code. <|start|>assistant<|channel|>final<|message|>```python import pygame import sys
When I drop the --jinja flag, the thinking looks correct in the llama.cpp webui app.
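Until the fixes land everywhere, a client-side band-aid is to parse the leaked Harmony control tokens yourself. A rough Python sketch follows; the token names are copied from the raw output above, and this is a workaround that assumes the tokens arrive verbatim, not the proper fix (which is updating llama.cpp and the chat template):

```python
# Workaround sketch: recover just the "final" channel from raw output that
# leaks Harmony control tokens like <|channel|>analysis<|message|>...
import re

def extract_final(raw: str) -> str:
    # Grab everything after the final-channel marker if present,
    # otherwise keep the whole string.
    m = re.search(r"<\|channel\|>final<\|message\|>(.*)", raw, re.DOTALL)
    text = m.group(1) if m else raw
    # Strip any remaining <|...|> control tokens.
    return re.sub(r"<\|[^|>]+\|>", "", text).strip()
```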
The quant sizes are almost the same; is that how it should be? F16 is 65.4 GB and Q2_K is 62.6 GB.
From their docs:
Any quant smaller than f16, including 2-bit, has minimal accuracy loss, since only some parts (e.g., attention layers) are lower bit while most remain full-precision. That's why sizes are close to the f16 model; for example, the 2-bit (11.5 GB) version performs nearly the same as the full 16-bit (14 GB) one. Once llama.cpp supports better quantization for these models, we'll upload them ASAP.
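You can check which tensors actually got quantized yourself. A minimal inspection sketch, assuming the gguf Python package that ships with llama.cpp (pip install gguf) and a local file path:

```python
# Count tensor quantization types in a GGUF file to see why the 2-bit
# file is barely smaller: most tensors should not be 2-bit at all.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("gpt-oss-120b-Q2_K.gguf")  # adjust to your file
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in counts.most_common():
    print(f"{type_name}: {n} tensors")
```

If the docs quoted above are right, only a minority of the tensor types listed should be 2-bit.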
Is anyone able to use "native" tool calling with llama.cpp and OWUI?
all responses are starting with <|channel|>analysis
I fixed this here: https://github.com/ggml-org/llama.cpp/pull/15124
It also requires patching the chat template, as explained in the PR.
Does Tool Calling work?
My llama.cpp crashes on any quant when working with Qwen Code. =(
I found this thread: https://huggingface.co/openai/gpt-oss-120b/discussions/69
Maybe it is still possible to fix the chat template?
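To isolate the crash, a minimal tool-calling smoke test against llama-server's OpenAI-compatible endpoint might help. This is a sketch only: the port, model name, and toy weather tool below are placeholders, and the server needs to be started with --jinja:

```python
# Send one tool-calling request to a local llama-server instance.
# Endpoint, model name, and the weather tool are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-oss-120b",  # whatever name the server reports
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```

If this also crashes the server, that narrows the problem to the model/template rather than Qwen Code.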
With only minor size differences between UD-Q4_K_XL and UD-Q8_K_XL, I went from 4-bit to 8-bit. I found that UD-Q8_K_XL occasionally outputs 'assistant<|channel|>final<|message|>' instead of the full '<|start|>assistant<|channel|>final<|message|>', a problem I never had with UD-Q4_K_XL. Given that 8-bit is 4.41 BPW and is supposed to be more accurate, is that actually how GPT-OSS-120B is supposed to work?