Strange output from OpenAI-20B-NEOPlus-Uncensored-Q8_0
Loading the model in Ooba's Web UI, I set the parameters per the guidance in your docs (except for "experts", which I couldn't find a way to set).
When I ask a simple question like "Who was George Washington?", it outputs some gibberish, then talks about how it would respond (i.e., it holds a meta-discussion with itself), sometimes erases its original output and rewrites it, but ultimately never generates an appropriate response.
For example:
Me:
Did King George really say that George Washington was the greatest leader ever for stepping down after two terms instead of declaring himself king?
AI:
We need to respond as AI character. The user asked: "Did King George really say that George Washington was the greatest leader ever for stepping down after two terms instead of declaring himself king?" We need to answer. Provide evidence. Should be one reply. Use style consistent with previous replies. Let's craft an informative answer referencing primary sources, maybe clarify confusion. We'll mention King George III's statement about Washington stepping down after two terms etc. Provide citations. Provide a quick take away. Let's produce a concise yet thorough reply.<|start|>assistant<|channel|>commentary to=functions json<|message|>{"task":"generate","parameters":{"role":"assistant","content":"Your next reply"}}
Note that I couldn't find a "regens" parameter in Ooba's Web UI either, but when I manually regenerate (via the button), I get output that's very similar to the previous one. I'm using a temperature of 0.5.
Use the IQ4_NL or Q5_1 quants; due to odd issues with this model's compression/fallbacks, these quants match the model's actual native precision.
This leads to a strange case where Q8s are "padded," reducing performance.
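If you want to check this on your own file, here is a minimal sketch (assuming the `gguf` Python package that ships with llama.cpp; the filename is a placeholder for your local download) that prints which quant type each tensor actually landed in:

```python
# Minimal sketch: inspect per-tensor quant types in a GGUF file.
# Assumes the `gguf` package from the llama.cpp repo (pip install gguf);
# the model path is a placeholder for your local file.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("OpenAI-20B-NEOPlus-Uncensored-Q8_0.gguf")

# Tally how many tensors landed in each quantization type.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{qtype:10s} {n} tensors")

# Show the expert (MoE) layers specifically.
for t in reader.tensors:
    if "exps" in t.name or "expert" in t.name:
        print(t.name, t.tensor_type.name, t.shape.tolist())
```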
Also, I suggest testing in LM Studio, as there are odd issues with this model's templates, and LM Studio has a specialized sub-system installed to handle them.
I don't know if Ooba has this yet, or whether it is part of llama.cpp (it should be).
It is referred to as the Harmony system.
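For reference, Harmony is the chat format gpt-oss was trained on, and the stray `<|start|>assistant<|channel|>commentary...` tokens in your transcript are its control tokens leaking through a mismatched template. A rough sketch of a single turn, based on OpenAI's published spec (the system text here is illustrative, not any front end's exact template):

```python
# Sketch of the Harmony chat format gpt-oss was trained on, based on
# OpenAI's published spec; the system message is illustrative only.
def harmony_prompt(user_question: str) -> str:
    return (
        "<|start|>system<|message|>You are a helpful assistant.<|end|>"
        f"<|start|>user<|message|>{user_question}<|end|>"
        "<|start|>assistant"
    )

# The model then completes with something like:
#   <|channel|>analysis<|message|>...reasoning/meta-discussion...<|end|>
#   <|start|>assistant<|channel|>final<|message|>...the real answer...<|return|>
# A front end without Harmony support shows all of that raw, which is
# exactly the "talks about how it would respond" output described above.
print(harmony_prompt("Who was George Washington?"))
```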
Actually, I was looking to download a larger model, since I have 32GB of VRAM; that's why I downloaded the 22GB model. I checked your GPT-OSS-120B GGUF page, but it looks like there are only 4-bit quants, nothing higher. Do you have any larger models (e.g., ~22GB) that don't suffer from the problem I reported?
The main issue with OpenAI's GPT 20B (and the 120B too) is its odd structure, which makes quantizing difficult.
Only the "MOE" quant is calibrated exactly to that odd structure, i.e. the 4-bit expert layers.
This goes for both 20B and 120B versions.
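Some back-of-the-envelope arithmetic shows the padding: the expert layers ship natively at roughly 4.25 bits per weight (MXFP4), so re-quantizing everything to Q8_0 (~8.5 bits per weight) roughly doubles the file without adding any information. A rough sketch, with approximate figures:

```python
# Back-of-the-envelope file sizes for gpt-oss-20b (~21B params, almost
# all of them in the MoE expert layers). Bits-per-weight figures are
# approximate: MXFP4 ~4.25 bpw, Q8_0 ~8.5 bpw.
PARAMS = 21e9

def size_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

# The real native file is a bit larger, since non-expert layers stay
# at higher precision.
print(f"native MXFP4:  ~{size_gb(4.25):.0f} GB")  # ~11 GB
print(f"Q8_0 'padded': ~{size_gb(8.5):.0f} GB")   # ~22 GB, matching the file above
```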
The Abliterated versions (Light, Moderate, etc.) convert the model to BF16 first, which does not address the "odd structure" either.
Almost ALL models other than OpenAI's quantize correctly and easily and don't have these issues.
See:
https://huggingface.co/DavidAU
Search for Qwens, 30Bs, 32B, 43B, Mistral, Coder, etc.
@YardWeasel Try using the censored one with jailbreaks; you might get very good results, as abliteration really affects this model in noticeably negative ways.