Wur doomed!

#14
by jukofyork - opened

Continuation of THE THREAD OF DOOM.


What do you and the others think of the distilled R1 models for writing?

The llama3 / qwen models SFT'd on R1 outputs? I only tried 2 of them.

R1 Qwen (32b) - Lacks knowledge of fiction (same as the official Qwen release), so its writing is no better.

R1 Llama3 - This is generally the worst of them (not just for writing). It'll generate the CoT and then write something completely different.

CoT traces won't let the model do anything out of distribution, so not very useful if the base model doesn't have a lot in its training data.

Yeah, I have tried the same two and felt the same way.

I also felt that any attempt to add an R1 distill to the merge recipe of an existing merge project made it worse...so far...

@gghfez @BigHuggyD that has been my experience as well, which is a shame as I had a go of R1 on Openrouter and I was blown away.

What model is anywhere close that is usable on a 24gb vram machine with 32gb of ram in your experience?

There's nothing like it for now. I'm running R1 slowly on my ThreadRipper:

prompt eval time =   14026.61 ms /   918 tokens (   15.28 ms per token,    65.45 tokens per second)
       eval time =  398806.12 ms /  1807 tokens (  220.70 ms per token,     4.53 tokens per second)
      total time =  412832.73 ms /  2725 tokens
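For anyone wanting to sanity-check those numbers, here is a minimal sketch that recomputes the per-token and per-second rates from the raw timings in the log above. The helper name `rate` is made up for illustration and is not part of llama.cpp or any other tool.

```python
# Recompute the throughput figures from the raw timings quoted in the log.
# `rate` is a hypothetical helper, not part of any inference library.

def rate(ms: float, tokens: int) -> tuple[float, float]:
    """Return (milliseconds per token, tokens per second)."""
    return ms / tokens, tokens / (ms / 1000.0)

prompt_ms, prompt_tokens = 14026.61, 918   # prompt eval line
eval_ms, eval_tokens = 398806.12, 1807     # eval (generation) line

prompt_ms_per_tok, prompt_tps = rate(prompt_ms, prompt_tokens)
eval_ms_per_tok, eval_tps = rate(eval_ms, eval_tokens)

total_ms = prompt_ms + eval_ms
total_tokens = prompt_tokens + eval_tokens

print(f"prompt: {prompt_ms_per_tok:.2f} ms/token, {prompt_tps:.2f} tokens/s")
print(f"eval:   {eval_ms_per_tok:.2f} ms/token, {eval_tps:.2f} tokens/s")
print(f"total:  {total_ms:.2f} ms for {total_tokens} tokens "
      f"(~{total_ms / 60000:.1f} minutes)")
```

The generation side is the bottleneck here: at roughly 4.5 tokens per second, the 1807 generated tokens account for nearly all of the wall-clock time.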

I tried training Wizard2 8x22b MoE on R1 data, but it doesn't really work well. It will plan ahead in think tags eg:

I need to ensure the story maintains its gritty, realistic tone without becoming overly melodramatic. The characters' growth should be subtle but significant. Also, the ending should leave a sense of hope but not be too neat—their redemption is fragile, and the future is uncertain.

Let me outline the next few chapters:

Chapter 5: Nightmares and Trust
...

But it doesn't backtrack like R1 does. It just kind of agrees with itself and ends up writing how it usually would:

“I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead.

lol

Ahhh that's a shame :-(

"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."

Oh god!

I'll have to keep an eye on this thread.

I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0

But my tastes are probably not as refined as others on this thread ;-)

I personally hate how instruct V3.1 became dry and positivityslopped from training on gemini synthetic data. It no longer even "thinks" properly. While R1 was a fun schizo, R1-0528 obnoxious, V3.1 is just boring, like original V3. Feels like a failed run.

Sadly I think we have hit peak creative writing models with R1 :/ Hopefully I can get the control adapters stuff to work or otherwise we're screwed unless some big lab decides to stop benchmaxxing on stem/code and stop using synthetic data.

Glad I'm able to use it locally and can stick with the one I like unlike cloud people.

Yeah, the deepseek Reddit seems pretty upset by this change.

Turboderp does seem to think it works now.

Oh, you were chatting in that issue even though you don't use the model? lol
I might have to bite the bullet and make an exl3. (exl3 is very slow to quantize with Ampere hardware).
I tested it out with llama.cpp briefly. Its reasoning trigger is kind of weird, just instructing it to think basically. (I could do this already with the original Command-A).

Glad I'm able to use it locally and can stick with the one I like unlike cloud people

Same here. Random people IRL have been complaining to me about the version on the official website.

It no longer even "thinks" properly.

Yeah and the thinking process is extremely positive biased now. The older models (even R1-0528) could be steered to roast you with something as simple as (eg from my Mikupad templates):

<|begin▁of▁sentence|><|User|> <|Assistant|><think> Oh for fucks sake, another one of these morons. Okay I guess I have to

This one keeps swinging back to politeness after a few tokens.
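The prefill trick above amounts to ending the raw prompt partway through the model's own thinking block, so that generation continues from the tone you set. A rough sketch of assembling such a string follows. The function name `build_prefilled_prompt` is made up for illustration, and the exact whitespace around the special tokens is an assumption, so check it against the model's real chat template before relying on it.

```python
# Sketch of a "thought prefill" prompt, mirroring the template quoted above.
# The special tokens are copied from that template; the spacing between them
# is an assumption and may differ from the official chat template.

BOS = "<|begin▁of▁sentence|>"
USER = "<|User|>"
ASSISTANT = "<|Assistant|>"
THINK_OPEN = "<think>"

def build_prefilled_prompt(user_message: str, thought_prefill: str) -> str:
    """Return a raw completion prompt that ends mid-thought.

    Hypothetical helper: the model is expected to keep writing from the end
    of `thought_prefill`, so no closing think tag is added.
    """
    return f"{BOS}{USER}{user_message}{ASSISTANT}{THINK_OPEN}\n{thought_prefill}"

prompt = build_prefilled_prompt(
    user_message="Please review my short story.",
    thought_prefill="Oh great, another one of these. Okay I guess I have to",
)
print(prompt)
```

This only works through a raw text-completion interface (as Mikupad uses), since a chat-style API would apply its own template on top.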

It's definitely better for coding though which is what they were going for.

https://www.reddit.com/r/LocalLLaMA/comments/1mxtzrz/i_have_some_compute_to_finetune_a_creative_model/

!

It's really funny nobody else seems to like the cohere models :/

I don't know if it's just me, but I always feel like the stuff written by command-r and command-r-plus is on another level when it comes to anything "dark". It really seems to give the impression of hiding meaning within the text it writes and will often not even give characters names unless you ask.

Name probs for base and instruct Deepseek V3.1; base looks real.
I personally hate how instruct V3.1 became dry and positivityslopped from training on gemini synthetic data. It no longer even "thinks" properly. While R1 was a fun schizo, R1-0528 obnoxious, V3.1 is just boring, like original V3. Feels like a failed run. Glad I'm able to use it locally and can stick with the one I like unlike cloud people.

Thanks, Chuck. I always like seeing these. I feel the same way. I have been using R1-0528 for creative stuff for so long now, I was getting fatigue from its tendencies, so initially I found 3.1 refreshing because its prose was different, but I grew bored pretty quickly because the "edge" is gone.

I'm jealous of your setup :D Thankfully, my provider still has the option to choose from five flavors of DS, though they added themselves to the OR universe, so now their load is pretty high.

Unfortunately, I feel like what the masses want in a model and what we want are not the same thing. So we have to rely on people like y'all to try to come up with some innovative ways to work with what we've got.

https://www.reddit.com/r/LocalLLaMA/comments/1mxtzrz/i_have_some_compute_to_finetune_a_creative_model/

!

It's really funny nobody else seems to like the cohere models :/

I don't know if it's just me, but I always feel like the stuff written by command-r and command-r-plus is on another level when it comes to anything "dark". It really seems to give the impression of hiding meaning within the text it writes and will often not even give characters names unless you ask.

I agree, I can still remember the first time I used OG command-r-plus with a dark scenario. I thought, "Whoa.. why aren't more people talking about this model?" If I remember right, it was pretty persnickety about the prompt, but once you got that right, it was gold. First time I recall a model willing to go in any direction.

I think we as a community need to work on curating a high-quality creative writing (and creative writing instruction) dataset so that others with the compute can just pick it up and train. I'm sure people are trying to do this individually (I am) but organized it might actually happen. If it's high-quality enough, organizations will train with it too from the free labor.

It would be nicer of course if we could write a program or train a model that does this curation automatically.

It would be pretty hard to get people to agree on what they want from a creative writing model though sadly. What seems like absolutely awful writing to me might well be what somebody else actually wants from a model!

I've also run into a lot of people who think "creative writing = (erotic) roleplay" and as a whole it's a fairly blurry line that divides the two.


My feeling is we need a really solid creative writing benchmark and then we can use the benchmaxxing to our advantage:

https://en.wikipedia.org/wiki/Goodhart%27s_law

For example I read all the different LLM-related Reddits and I've noticed the creator of:

http://designarena.ai/

has singlehandedly spammed this into existence over the last few weeks! We could easily do something similar with our targeted benchmark and then just sit back and enjoy the benchmaxxing! :D
