Model broken
Hi,
I've tested a Q8 quant of this model, and it seems to be broken: the model keeps talking for a very long time and then doesn't produce the correct output in my coding tests. I think there is an issue with it.
This q8_0 GGUF works fine for me: https://huggingface.co/bartowski/deepseek-ai_DeepSeek-R1-0528-Qwen3-8B-GGUF
I used the recommended inference settings:
- temp: 0.6
- top_p: 0.95
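
If you're calling the model through an OpenAI-compatible endpoint (e.g. llama.cpp's llama-server or LM Studio's local server), those settings map straight onto the request body. A minimal sketch; the base URL and model name below are placeholders for your local setup:

```python
# Minimal sketch: send the recommended sampling settings to an
# OpenAI-compatible local server. URL and model name are placeholders.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder endpoint
    json={
        "model": "DeepSeek-R1-0528-Qwen3-8B",  # placeholder model name
        "messages": [{"role": "user", "content": "Write a Python quicksort."}],
        "temperature": 0.6,  # recommended temp
        "top_p": 0.95,       # recommended top_p
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```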
I will try it, but I used the same settings with the other Q8 and it just goes into a weird thinking loop. It seems to mix up languages and code and do strange things, which makes me think there's an issue with it.
This is how the model surprised and amused me: "I got tired of thinking and realized that I couldn't solve this problem. Maybe the question was asked incorrectly, or I didn't take some point into account."
Unsloth Q8
It works in LM Studio, but I'm not impressed by the performance. Maybe an update is needed, no idea. For now, the quality of the answers is not good.
I suspect the problem is the Qwen3 base. Qwen3 has big problems with instruction following. Same problem for me: I even tried the original safetensors and found this issue, so it's not only in the quantized models.
I have tested this model for about six hours straight with dozens of different settings, and talked to at least a dozen other people, and this just seems to be how the model is. I have not found a single redeeming quality so far; it seems to be one huge failure of a distill. The original R1 distills were bad, and so is this one. DeepSeek doesn't seem to be the best at distilling.
Here is a write-up I did in the exl2/3 server:
"Alright, I have spent hours trying to get this R1 8b distill to be usable. I have matched their system prompt, samplers, everything. This model is a mess
it constantly gaslights itself, misses blatantly obvious stuff, makes no sense, can't count, and is just overal an absolute mess.
Some examples of its huge fails:
"the first three letters of 'HEROINE" is 'HERO'" Which is clearly 4 letters
Then it said "'her' does not signify a female"
It solved the Heroine riddle in 1.2k tokens, and then spent 11k tokens gaslighting itself that it was wrong
It said "C-O-L-T-R-O-L, or 'COLTROOL", where it added an O out of nowhere
Instead of decoding: FROWURO, it decoded FRWUROR, then got hung up for over 7k tokens on how "COTROOL" is not a real word, and I must have made a typo
This model's "vibe" seems to be getting something correct reasonably fast, then spending 5-20k tokens gaslighting itself on how what it just solved is wrong, before assuming the "user must be wrong", and then backtracking on everything, quite reliably giving the WRONG answer while also suggesting the right answer as something to be explored more. I would have to say about 70% of questions this model got wrong, it had reasoning chains considerably closer to the correct answer that it gaslit itself out of following for seemingly no reason other than "No, that doesn't make sense" when it very clearly does.
With a simple +3 cipher, I gave it 12 attempts. It passed 4 times, averaging over 18k tokens of reasoning... for nothing. (A reference decoder sketch follows this write-up.)
For a model that claims to rival Qwen3 235B, it sure can't do simple 1-2-3 counting.
I have given this model several tests that many models from the last 5 months can pass, and I would say it has about a 60% failure rate on simpler questions and about a 90% failure rate on complex questions. I find it inferior to the DeepScaleR 1.5B preview from several months ago in a lot of key ways (specifically math, where DeepScaleR 1.5B considerably outperforms it in both math ability and token efficiency).
For the sake of diligence, I tried the following versions: EXL2: 6bpw, 8bpw, FP16. EXL3: 6bpw, 8bpw. GGUF: Q8, FP16. All of them were just as incompetent. This model has to be one of the worst CoT models I have tried in a long time, which is crazy, because it's based on a fairly competent base (Qwen3 8B)."
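
For reference, the "+3 cipher" in those tests is just a Caesar shift. Here's a minimal Python sketch (my own illustration, not part of the original test harness) of what a correct decode of the FROWURO example looks like:

```python
# Minimal Caesar-shift helper; my own reference sketch, not from the tests.
def caesar(text: str, shift: int) -> str:
    """Shift each A-Z/a-z letter by `shift` positions, wrapping around."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave non-letters untouched
    return ''.join(out)

# Decoding a +3 cipher just means shifting back by 3:
print(caesar("FROWURO", -3))  # -> COLTROL, the letters the model mangled into "COLTROOL"
```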