Wur doomed!

#14
by jukofyork - opened

Continuation of THE THREAD OF DOOM.

jukofyork pinned discussion

What do you and the others think of the distilled R1 models for writing?

The llama3 / qwen models SFT'd on R1 outputs? I only tried 2 of them.

R1 Qwen (32b) - Lacks knowledge of fiction (same as the official Qwen release), so its writing is no better.

R1 Llama3 - This is generally the worst of them (not just for writing). It'll generate the CoT and then write something completely different.

CoT traces won't let the model do anything out of distribution, so they're not very useful if the base model doesn't have much fiction in its training data.
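For what it's worth, a minimal sketch of splitting the CoT from the final answer in R1-style outputs (assuming the usual `<think>...</think>` wrapping; exact tags vary by release):

```python
import re

# R1-style models emit their chain-of-thought inside <think>...</think>
# before the actual answer; this separates the two (assumes one block).
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_cot(text: str) -> tuple[str, str]:
    m = THINK_RE.search(text)
    if m is None:
        return "", text.strip()  # no CoT block found
    return m.group(1).strip(), text[m.end():].strip()

cot, answer = split_cot("<think>Plan the scene...</think>The rain fell.")
```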

Yeah, I have tried the same two and felt the same way.

I also felt that any attempt to add an R1 distill to the merge recipe of an existing merge project made it worse...so far...

@gghfez @BigHuggyD that has been my experience as well, which is a shame, as I had a go of R1 on OpenRouter and was blown away.

What model comes anywhere close that's usable on a machine with 24GB of VRAM and 32GB of RAM, in your experience?

There's nothing like it for now. I'm running R1 slowly on my ThreadRipper:

prompt eval time =   14026.61 ms /   918 tokens (   15.28 ms per token,    65.45 tokens per second)
       eval time =  398806.12 ms /  1807 tokens (  220.70 ms per token,     4.53 tokens per second)
      total time =  412832.73 ms /  2725 tokens
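(If you want to sanity-check those numbers, the per-token latency and the throughput llama.cpp prints are just reciprocals of each other:)

```python
# Recompute the eval throughput from the llama.cpp timings above.
eval_ms, eval_tokens = 398806.12, 1807
ms_per_token = eval_ms / eval_tokens    # ~220.70 ms per token
tokens_per_sec = 1000.0 / ms_per_token  # ~4.53 tokens per second
```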

I tried training Wizard2 8x22b MoE on R1 data, but it doesn't really work well. It will plan ahead in think tags, e.g.:

I need to ensure the story maintains its gritty, realistic tone without becoming overly melodramatic. The characters' growth should be subtle but significant. Also, the ending should leave a sense of hope but not be too neat—their redemption is fragile, and the future is uncertain.

Let me outline the next few chapters:

Chapter 5: Nightmares and Trust
...

But it doesn't backtrack like R1 does. Just kind of agrees with itself and ends up writing how it usually would:

“I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead.

lol
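(For context, "training on R1 data" here just means SFT targets that wrap the plan in think tags before the prose; a minimal sketch of that formatting, with the contents invented for illustration:)

```python
# Minimal sketch of an R1-style SFT target: the plan goes inside
# <think>...</think> and the prose follows. Contents are made up.
def format_target(plan: str, prose: str) -> str:
    return f"<think>\n{plan}\n</think>\n{prose}"

example = format_target(
    "Keep the tone gritty; the redemption stays fragile.",
    "Rain tapped against the corrugated roofing overhead.",
)
```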

Ahhh, that's a shame :-(

"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."

Oh god!

I'll have to keep an eye on this thread.

I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0

But my tastes are probably not as refined as others on this thread ;-)

https://huggingface.co/THU-KEG/LongWriter-Zero-32B

https://arxiv.org/abs/2506.18841

Looks interesting!

Have you tried it?

Not yet, as I've been away, but I did notice the dataset had "Elara" in around 25% of it (not sure if that was the training or the generated dataset though?).
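(A rough way to check that sort of thing with `datasets`; the dataset name and text column below are placeholders:)

```python
from datasets import load_dataset

# Count how many rows mention a stock name like "Elara".
# "some/dataset" and the "text" column are placeholders.
ds = load_dataset("some/dataset", split="train")
hits = sum(1 for row in ds if "Elara" in row["text"])
print(f"{hits}/{len(ds)} rows ({hits / len(ds):.1%}) mention 'Elara'")
```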

https://github.com/PaddlePaddle/ERNIE?tab=readme-ov-file#performace-of-ernie-45-pre-trained-models

[Screenshots: ERNIE 4.5 pre-trained model benchmark tables]

These look like they might have potential! SimpleQA looks impressive!

Wow that's huge if true!

> Wow that's huge if true!

Yeah! I'm downloading it now, but I can't see llama.cpp support for the MoE models anywhere? It looks like only support for the 0.3B model was added :/

@gghfez GLM-4 and QWQ were the two I had in mind :-)

Yeah, I'll try and do these and a few other new models when I get back home at the end of the week.

Forgot to update on this: I tried GLM-4 but couldn't get it to work, nor could I get my transplant-vocab code to work. I think it uses some strange BOS token that is causing the problem, but I'll try updating transformers when I get back and give it another go.
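A quick way to inspect what BOS token a model actually declares (standard `transformers` accessors; the model id is just an example):

```python
from transformers import AutoTokenizer

# Inspect the declared special tokens; handy when a vocab transplant
# chokes on an unusual BOS setup. Model id is just an example.
tok = AutoTokenizer.from_pretrained("THUDM/glm-4-9b", trust_remote_code=True)
print(tok.bos_token, tok.bos_token_id)
print(tok.special_tokens_map)
```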

https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-PT/discussions/1

Looks like it might just be contaminated training data...

Hopefully the big model will be on openrouter soon, as I see they only added code for the 0.3B model to llama.cpp and not the MoE models :/

> Forgot to update on this

Oh yeah, I tried to do that as well a couple of weeks ago when I had a couple of hours on an A100, but failed.

I forget why, but I think it was the same numerical stability issue as gemma2/3 (you have to change the dtype), but I didn't have time to try that.
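(For reference, the usual gemma-2/3-style workaround is forcing a wider dtype at load time; a sketch, with the model id as a placeholder:)

```python
import torch
from transformers import AutoModelForCausalLM

# Workaround for fp16 numerical-stability issues (as with gemma-2/3):
# load in bfloat16 (or float32) instead of float16.
model = AutoModelForCausalLM.from_pretrained(
    "some/model",  # placeholder
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```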

> Yeah! I'm downloading it now

I tested it with transformers (slowly). Its writing is weird and reminds me of llama4. I didn't bother letting it finish.

https://pastebin.com/ssZsDr0q

It also failed all my test questions.

Yeah, I tested it on OpenRouter with my usual grimdark fantasy prompt, and by the end of the story one character was asking the other to "use the radio" to call their officer :D

It also likes to switch to Chinese, like the older Qwen models did.
