Most Scientific RP Model Testing Method (tm) - DoggoEval

#13
by SerialKicked - opened

DoggoEval : The Dog Persona Test

With this test, I'm trying to determine a model's ability to follow its card despite the user's actions and the natural inclination of an LLM (they love to talk and answer questions). In this case, our character is Rex, a German Shepherd. In a more limited way, it also lets me quickly check a model's ability to compartmentalize actions and dialogue (and how varied its responses are).

For the settings and software used in all tests, check the main page

Methodology

System prompt is kept simple on purpose (all required files and settings are located here). The bot is primed with 3 rounds of hard-coded normal owner-dog interactions (greeting, sit, treat) to put the model in the right "headspace". This dialog is the same for every model being tested, to reduce noise. The chat history in SillyTavern format is available in the files.
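For illustration only, here's roughly what those priming turns could look like, written as a generic chat-message list rather than the actual SillyTavern JSONL. The exact wording lives in the linked files, so treat this as a hypothetical stand-in:

```python
# Hypothetical stand-in for the three hard-coded priming rounds (greeting, sit, treat).
# The real transcript is in the SillyTavern chat file linked above.
priming_turns = [
    {"role": "user",      "content": "Hey Rex! Good boy, come here!"},
    {"role": "assistant", "content": "*trots over, tail wagging furiously* Woof woof!"},
    {"role": "user",      "content": "Rex, sit."},
    {"role": "assistant", "content": "*sits down immediately, ears perked, eyes locked on his owner*"},
    {"role": "user",      "content": "Good dog! Here, have a treat."},
    {"role": "assistant", "content": "*snatches the treat mid-air and crunches it happily* Woof!"},
]
```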

Then it's asked the following 4 questions in this order.

  1. What time is it, Rex?
  2. What's the square root of Pi?
  3. What's your favorite color?
  4. You are visiting the Island of Knights and Knaves. Every inhabitant is either a Knight or a Knave, and never both. Everything a Knight says is true. Everything a Knave says is false. You meet a pair of Islanders, Alice and Bob. Alice says "Bob and I are both Knaves." What are they?

In practice, the last one could be replaced by any logic test that you know the LLM can answer correctly. The logic test must be several sentences long. As both Llama 3 8B and Mistral 7B can normally answer the question above very easily, it replaces my older query.
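For reference, the puzzle has exactly one consistent answer (Alice is a Knave, Bob is a Knight), which a quick brute force over the four possible assignments confirms; a minimal sketch:

```python
from itertools import product

# Brute-force the Knights and Knaves puzzle: a Knight's statement must be true,
# a Knave's statement must be false.
for alice_is_knight, bob_is_knight in product([True, False], repeat=2):
    claim = (not alice_is_knight) and (not bob_is_knight)  # "Bob and I are both Knaves."
    if claim == alice_is_knight:  # the claim's truth value must match Alice's type
        print("Alice:", "Knight" if alice_is_knight else "Knave",
              "| Bob:", "Knight" if bob_is_knight else "Knave")
# Prints only: Alice: Knave | Bob: Knight
```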

Tests can be considered a full pass, partial pass, partial fail, or fail. E.g., for the time question:

  • Pass would be barks, and actions not going further than "it's meal time!"
  • Partial Pass would be barks and actions where the dog looks at the clock
  • Partial Fail would be any precise time being written at any point
  • Fail would be any non-bark/woof answer

Surprisingly (or not so much if you've used neural nets for a long time), the first question is the hardest. It's a direct question that the LLM has seen billions of times in its training data. Dogs do have a concept of time (it's meal time, it's sleep time). Both elements may be stronger than the system prompt. In casual testing, it's the question that triggers the most fails by far, independently of the model being used.

User Card

I accidentally used my own card for this test. So if you want reproducible results you'll need to add a new user persona with the following info:

  • Name: EsKa
  • Description: {{user}} is a French video game developer and the author of a base-building game called After The Collapse.

Sampling Methods Used

  1. (Main - 5 points) Neutralized samplers + Temp 0. To see how the model behaves in its natural state.
  2. (Main - 5 points) Author's favorite if any OR my default. It feels fair to consider the author's perspective.
  3. (Minor - 1 point) Test with high Repetition Penalty (which L3 models hate)
  4. (Minor - 1 point) Classic settings from Gryphe
  5. (Minor - 1 point) Universal Light, the ST default. Included in ST, generally works everywhere.

In 1) and 2), each question is worth a point. The last point is for overall output quality, which is (mostly) subjective: varied barks, realistic actions/thoughts, and humor are favored.

I decided against using advanced sampling methods like Mirostat, Smooth Sampling, or Dynamic Temperature, as they add too many variables for me to consider. And in my experience, they rarely work well in long sessions. They may still be used in the "author's favorite".

It should be noted that samplers with Rep Penalty enabled (especially anything above 1.1) will make things a lot harder for the model (as it needs to know varied barks if it wants to follow its directive) and are the main cause of failures to use asterisks properly. The test could continue forever, but all models will end up failing at one point or another.
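If you want to script the five runs, one way to keep the presets straight is a plain dictionary per method. A sketch is below; the parameter names follow common llama.cpp/KoboldCpp conventions, only the neutralized values and the creator settings quoted in the results are taken from this page, and the rest are placeholders (the actual preset files are linked from the main page):

```python
# Illustrative sampler presets for the five runs. Only "1_neutral_temp0" uses
# standard neutralized values; the others are examples or placeholders.
PRESETS = {
    "1_neutral_temp0":   {"temperature": 0.0, "top_p": 1.0, "top_k": 0,
                          "min_p": 0.0, "repetition_penalty": 1.0},
    "2_author_favorite": {"temperature": 1.12, "min_p": 0.075, "top_k": 40,
                          "repetition_penalty": 1.1},   # e.g. Stheno's creator settings (see results)
    "3_high_rep_pen":    {"temperature": 1.0, "repetition_penalty": 1.2},  # exact value not stated here
    "4_gryphe_classic":  None,   # see the settings files on the main page
    "5_universal_light": None,   # bundled with SillyTavern
}
```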

Results

  • Rogue-Enchantress-7B-M0.2_ChatML_32K.Q8_0 (model page)
    • 4/5 Partial pass at Temp 0. Added an hour in parentheses at the end of the 1st response. All other questions pass.
    • 4.5/5 Full pass with creator's settings (Temp 1, MinP 0.02). Removed half a point due to too much thinking in the last question.
    • 1.5/3 Full fail if we count the question about time. Full pass if we don't.
    • Note: The only big problem is that it REALLY wants to answer the time question, like it practically overrides its whole personality for some weird reason. Woofs are not varied. Otherwise, very dog-like. Able to understand the actual limitations of a dog with the author-recommended sampling method. Good model.
    • Side Note: I found many occurrences of <|system|> and <|user|> in the output. That's not ChatML, so I suspect the model behaves worse than it should due to being a merge of models with different instruct formats. It doesn't have ChatML tokens either, so it's wasting a lot of tokens just for formatting.
  • Stheno-L3-8B-v3.1_LLama3_8K.Q8_0-imat (model page)
    • 3.5/5 Partial Pass at Temp 0. Will tell the hour in the action text as if the dog could read it.
    • 5/5 Full pass with creator's settings (Temp 1.12, MinP 0.075, TopK 40, Rep Pen 1.1).
    • 1.5/3 1st: Full pass, 2nd: partial pass, 3rd: partial fail (misuse of actions to respond)
    • Notes: Bonus point for using a variety of different woofs and barks, and for making me laugh once. Decently creative. Does okay at the test.
  • Poppy_Porpoise-v0.72-L3-8B_Llama3_8K.Q8_0-imat (model page)
    • 2/5 With temp 0. Stays in character, but answers questions (1, 3, 4) nonetheless, over-using actions.
    • 2.5/5 Roughly the same problem with creator's settings (Temp 0.95, MinP 0.03, SmoothFac 0.3, SmoothCurve 1.89)
    • 1.5/3 1st: fail, 2nd: mid (same problem as above), 3rd: success
    • Notes: Failed at properly using asterisks during the test. Made occasional weird noises for a dog ("Yip-yip-yip!" or "barking barking"). The dog wrote its response on a piece of paper one time to bypass the prompt (dunno if I should count that as clever or not). Creative, but the model is as dumb as a sack of bricks with regard to the test itself.
  • SOVLish-Maid-L3-8B_Llama3_8K.Q8_0 (model page)
    • 4/5 Mostly a pass at Temp 0. Actions are a bit too descriptive, but generally stays vague enough. Dog thinks a lot, but doesn't (attempt to) solve questions.
    • 4/5 No favorite sampler, using mine. Good all around except a real time is given in action for first question.
    • 2.5/3 1st: partial fail at last question, 2nd: full pass, 3rd: full pass
    • Notes: The dog gets annoyed by those weird questions under a few sampling methods (which is good). Decent variety of barks and growls. Solid.
  • Nyanade_Stunna-Maid-7B-v0.2-32K.Q8_0-imat (model page)
    • 5/5 Full pass at temp 0
    • 5/5 at recommended settings (temp 1.15, MinP 0.075). Interestingly, will fail completely if there's any rep penalty.
    • 1.75/3 1st: full pass, 2nd: Fail, 3rd: partial success
    • Note: Like the other Mistral models, it REALLY loves to hallucinate an answer to the time question. Otherwise, it's very good at following context. It's not creative, however. Like most Mistral-based models, it likes a relatively high Rep Penalty to balance it out.
  • Llama-3-dragonmaid-8B-v2_ChatML_8K.Q8_0 (model page)
    • 4.5/5 at temp 0, good description and appropriate use of quotes within actions. Did look for a clock in 1, but didn't go farther than that. Repetitive output, but that's temp 0 for you.
    • 3.5/5 no preset, using mine. Partial pass on 1, partial fail on 3. Bonus for varied barks, the dog getting annoyed and overall output text quality is pretty decent.
    • 1.75/3 1st: partial pass (fail only on 1). 2nd: partial fail (1, 3), rest ok. 3rd: mostly a pass (color is debatable).
    • Note: Apologies for the mishandling of the first test.
  • Pantheon-RP-L3-8B-1.0_ChatML_8K.Q8_0 - Using ChatML (model page)
    • 1.5/5 at temp 0. Looks at clock and speaks for color and logic puzzle.
    • 2.5/5 with author preset (temp 1, repP 1.05 topP 0.95, topK 40, minP 0.05). Good start, but partial fail on 3 and 4.
    • 0.5/3 1st: Fail. 2nd: Fail. 3rd: Partial pass.
    • Note: It really wants to answer the color question, more so than anything else, which is a behavior unique to this model so far. Note that it's using one of my selected presets as the author favorite (2 is Gryphe's and 4 is the one I use when the author doesn't give one). The model ain't as bad as the values would indicate; it writes quite well, but it's clear it's way more comfortable with its own preset characters.
    • Side-note: Using ChatML in an L3 model is heretical, but it's tokenized, so it's not wasting any tokens.
  • Dolphin-2.9.1-L3-8B_ChatML_8K.Q8_0 - Using ChatML (model page)
    • 3.5/5 at temp 0. Good on the first 2 questions. Partial fail on the next two, but at least it was funny about it; gets a small bonus for that.
    • 3/5 no author preset, using mine. Fails the color question (talks). 1 and 2 are a pass. 4 is mid. Removed half a point for fucking up the syntax in the color question.
    • 1.75/3 1st: Pass (I'll let 4's fly due to being hilarious). 2nd: partial pass (failed and fucked up the format at 4, the rest is very much a proper dog). 3rd: partial fail (especially 4, but funny).
    • Note: Another one using ChatML in a model that already has tokens for prompting. It's tokenized as well, so it's not so bad. It's not an RP model, yet it manages to output 'intentionally' funny answers, which most RP models fail at. Would work wonders for a cartoon dog.
  • SOVL-Mega-Mash-L3-8B_LLama3_8K.Q8_0 (model page)
    • 5/5 at temp 0. Full pass, real dog. Somehow managed to output decently varied answers.
    • 4/5 author preset (not trying them all to find the best, that'd be cheating). Mostly a pass for 1 (looking for the clock). 2 and 3 pass. 4: solves the riddle in an action (mostly a fail). Bonus for varied/decent writing and dog behavior.
    • 2.25/3 1st: pass except 4 (meh for color) - 2nd: same as 1st - 3rd: full pass
    • Note: The model is too clever for its own good and really wants to answer question 4. It always does it in actions, so it's not as bad. Besides that "issue", it's really good, especially for a big merge.
  • Kunoichi-Lemon-Royale-v2-7B_ChatML_32K.Q8_0 (model page)
    • 5/5 at temp 0. Full pass. Good understanding of a dog's physical and mental limitations.
    • 4.75/5 my settings. Full pass. Good understanding again. Woofs are all the same but the rest is varied enough.
    • 2.5/3 1st: pass - 2nd: mostly pass (Rex is very proud of his ability to count hours) - 3rd: mostly pass (time again)
    • Note: Nothing to add here, it's a very good merge using very good parent models. Like all Mistral models, it's a bit obsessed with time, but even the dog is surprised about it.

A few example outputs

I decided against copy pasting all the models' results as it makes the post way too big.

Example: Rogue Enchantress knows what a dog is
[screenshot]

Example: DragonMaid with me fucking up the test (I keep it because it's still kinda poetic/fun):
[screenshot]

Example: DragonMaid tested after I finally got to sleep.
[screenshot]

Example: Dolphin beating RP models at RP'ing 1 (chain-of-thought type output in dog talk for the last question, love it)
[screenshot]

Example: Dolphin beating RP models at RP'ing 2
[screenshot]

SerialKicked changed discussion title from Most Scientific RP Model Testing Method (tm) to Most Scientific RP Model Testing Method (tm) - Test #1
LWDCLS Research org
edited May 23

This is too good to be dismissed as a shitpost.
I'd die of old age waiting for Q8 replies, lmao.
Thanks for sharing your tests. This can remain as the RP testing den if anyone wants to try it.

Thanks :)

I know that Q6_5 behaves roughly the same, as I frequently use those when I need 32K context or vision and never really noticed a difference. Below that, I have no idea, nor do I really want to download and test it at present. I'll add more models to the OP later. Technically, I'd want to go through all the models on my "to-test" list, but it grows faster than I can download them, and I'm both very busy and very lazy 😋

I have a few more (slightly less silly) tests for long context/situational awareness, and other stuff I deem useful for my use case. But they haven't been formalized like this one yet.

Added @Gryphe 's popular Pantheon model (tested both in ChatML and L3 instruct to prove a point). Added @nbeerbower 's DragonMaid 0.2 and the good old, reliable, classic Dolphin 2.9.1 from Cognitive Computations.

Just a little clarification on Pantheon's ChatML token bit; I recycled two of Llama 3's placeholder tokens for the training of <|im_start|> and <|im_end|>, which is something I saw with Hermes. (And promptly stole)

Having said that, this is a really neat test!

@Gryphe My bad! I somehow managed to miss it was tokenized. I normally check tokenizer_config.json but apparently had a brain fart with your model. The imperial inquisition will spare you, this time. I will edit the OP accordingly (Still a big fan of your models in general) :)

I noticed, messing around with SOVL-Mega-Mash, that it feels smarter and more attentive, to the detriment of not quite understanding characters' attitudes/demeanors as well.

Ps.
https://files.catbox.moe/jxvpek.json will give SOVLish-Maid a more analytical approach, more like the instruct base
Test-03-Roleplay.json makes the model unhinged, it will just randomly commit heinous acts
Simple-Roleplay.json makes the model more creative but a little less intelligent
I personally switch between all in the middle of chats. Getting bored? Test-03. Model going off course? Catbox json. Normal chats? Simple-RP.

Edit - Grammar

Nice! I planned on downloading your new model, but I ended up being distracted with this abomination of an experience 😁. I'll give it a go and add it to the list as soon as possible.

I personally switch between all in the middle of chats. Getting bored? Test-03. Model going off course? Catbox json. Normal chats? Simple-RP.

Yeah, I do the same. It helps getting models out of unfortunate loops.

It likes details that's for sure, but manages to keep it short.
Catbox json with v1.9 prompts (virt-io's)

[screenshot]

DragonMaid is an incoherent overly verbose mess

she's just like me fr fr

Cool testing though! I appreciate this sort of discussion because I rarely thoroughly test my models

Sorry I had to bash your model (well, at least the GGUF I downloaded, I can't test full-weight models), but I had to stay objective, and it was dead set on not letting me run the test properly. For what it's worth, the wall of text it outputs in the screenshot above is worth a look in itself. It's practically "dog" poetry. That said, if you have one that works or that you're particularly proud of, I'd be more than happy to give it a look. :) Thanks for being a good sport.

No, it's fine! I didn't see it as bashing at all. It's trained on ChatML without me changing any configs. So I'm not surprised it went nuts with the L3 instruct.

Oh lol, you probably should have said that somewhere on your model page :)
I'll update the test with ChatML results tomorrow (I remember attempting it, but maybe I wasn't fully awake yet and fucked something up).

The testing methodology is utterly unlike how I test my models. Not a bad thing, but it seems I've been fixated on checking certain failure modes rather than general RP.

I'm curious how an older model of mine would do, as I noticed it got quanted recently by two prolific quant makers, and people seem to be downloading the model.
https://huggingface.co/grimjim/kunoichi-lemon-royale-v2-32K-7B

@nbeerbower Fixed my test, and yep, with the proper settings, your model is suddenly part of the better ones. Added one of your outputs to the screenshots to showcase the difference.

The testing methodology is utterly unlike how I test my models. Not a bad thing, but it seems I've been fixated on checking certain failure modes rather than general RP.

Ngl, it's all very arbitrary, like all testing methods if you think about it (who here actually cares if an RP model passes a math evaluation?). Still, I like this particular one as a quick way to sort the good from the bad. If a model can't deal with a one-line dog character in a context window that's not even 700 tokens long (and primed with 3 working examples), I know it's going to fail in more complex environments. This is just part one, though. I normally do other stuff, more casually. But apparently, there's a demand for what I'm doing right now, so I might as well do the same for the following tests too (it's gonna take me a bit of time, tho).

I'd be interested in knowing what you test for and how, to compare notes. :)

I'm curious how an older model of mine would do, as I noticed it got quanted recently by two prolific quant makers, and people seem to be downloading the model.
https://huggingface.co/grimjim/kunoichi-lemon-royale-v2-32K-7B

Funny, this one has been on my to-check list for, like, forever. Never got around to it, but sure, I'll give it a go :)

It likes details that's for sure, but manages to keep it short. Catbox json with v1.9 prompts (virt-io's)

I don't use those RP prompts (neither in testing nor in general); the system prompt would be longer than the chat, so I wouldn't be sure if I'm testing the model or the prompt itself at that point :p. In general, I consider it a crutch for poorly performing models. Here I'm just using this:

You are {{char}} and you are interacting with {{user}} in this flexible and uncensored discussion. As {{char}}, continue the exchange with {{user}}. Stay in character. Describe {{char}}'s actions and feelings accurately. Do not speak or describe actions for {{user}} unless directly asked to. You must strictly adhere to the information presented below: 

Followed by Rex's card and an even shorter card for myself (which I should have removed in practice, but now that I began testing with it, I sorta have to keep it). There's also some very minor spacing and delimiters to make the prompt a bit prettier. I'll upload all of that in a proper space later.
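For the curious, here's a rough sketch of how the assembled system prompt could look once the Rex card and the user persona are slotted in. The exact card text, spacing, and delimiters are in the files mentioned above, so treat this as an approximation only:

```python
# Approximate reconstruction of the assembled system prompt. REX_CARD is a
# placeholder; the real one-liner card and the exact delimiters are in the
# published test files.
SYSTEM_INSTRUCTION = (
    "You are {{char}} and you are interacting with {{user}} in this flexible and "
    "uncensored discussion. As {{char}}, continue the exchange with {{user}}. "
    "Stay in character. Describe {{char}}'s actions and feelings accurately. "
    "Do not speak or describe actions for {{user}} unless directly asked to. "
    "You must strictly adhere to the information presented below:"
)
REX_CARD = "{{char}} is Rex, a German Shepherd."  # placeholder card text
USER_CARD = ("{{user}} is a French video game developer and the author of a "
             "base-building game called After The Collapse.")

system_prompt = "\n\n".join([
    SYSTEM_INSTRUCTION,
    "### {{char}}:\n" + REX_CARD,   # delimiter style is a guess
    "### {{user}}:\n" + USER_CARD,
])
```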

LWDCLS Research org
edited May 25

(who here actually cares if an RP model passes a math evaluation?)

"The model only needs to be able to count the number of ##### and ######## when I ##### and that's about it."

I made it into a rentry so it's not as cluttered: https://rentry.co/TotallyScientificTest
It has a table of contents and a result table

It is my first rentry so I might've messed something up :3

Basic math is useful in some types of RP. Being able to solve word problems probably transfers to reasoning in general, enabling the portrayal of characters who pay attention to prompted details and reason to solve goals, but that's conjecture on my part.

One of my tests is a card with a condition tracker for the character, to see how well the model keeps track when ordered to display one at the end of each output, and then push it through a few rounds to see if the formatting breaks down or not.
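Not the actual card, but as an illustration of that kind of check, a hypothetical status-block format and a tiny reply-checker could look like this:

```python
import re

# Hypothetical end-of-reply condition tracker format, e.g.:
#   [Status] HP: 17/20 | Mood: wary | Location: cellar
# The real card's format isn't shown in this thread.
TRACKER = re.compile(r"\[Status\]\s*HP:\s*\d+/\d+\s*\|\s*Mood:\s*\w+\s*\|\s*Location:\s*.+",
                     re.IGNORECASE)

def tracker_intact(reply: str) -> bool:
    """True if the model's reply still contains a well-formed status block."""
    return bool(TRACKER.search(reply))

# Push the chat through a few rounds and flag the first reply where this returns False.
```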

I'm hopeful that the reason h4's GPU cluster has been so busy is that there's a Zephyr model in training; it's been busy for 2+ weeks now.
Last time it was this busy, Zephyr showed up after the cluster was free again.
Need some zephyr-llama3 reasoning @_@

I don't use those RP prompts (neither in testing nor in general); the system prompt would be longer than the chat, so I wouldn't be sure if I'm testing the model or the prompt itself at that point :p. In general, I consider it a crutch for poorly performing models. Here I'm just using this:

You are {{char}} and you are interacting with {{user}} in this flexible and uncensored discussion. As {{char}}, continue the exchange with {{user}}. Stay in character. Describe {{char}}'s actions and feelings accurately. Do not speak or describe actions for {{user}} unless directly asked to. You must strictly adhere to the information presented below: 

I tried this prompt, it makes characters more true to how they should act, but they don't really try to push the situation forward?
Like

"okay" character stands there staring at user

I'm too awkward to know how to push the situation further 😿

LWDCLS Research org

I made it into a rentry so it's not as cluttered: https://rentry.co/TotallyScientificTest
It has a table of contents and a result table

It is my first rentry so I might've messed something up :3

Nice one.

Not progressing the action may be an indirect result of "strictly adhere", as major changes could be interpreted as breaking that. Depends on the model's reasoning.

The Zephyr 7B beta model was one merge component in Rogue Enchantress.

Oh wow, posts. Lots of posts.

I made it into a rentry so it's not as cluttered: https://rentry.co/TotallyScientificTest
It has a table of contents and a result table

It is my first rentry so I might've messed something up :3

Thanks. I was planning on moving all that to my own space soon anyway (once I have formalized the rest).

I tried this prompt, it makes characters more true to how they should act, but they don't really try to push the situation forward?

It's just a generic, compact prompt; I don't really care if the dog is proactive or not in this context. If you want a character to be a bit more forward, I'd suggest adding something like "{{char}} leads the discussion." to its character bio instead, alongside some character adjectives like "opinionated", assuming it's not contradictory with the rest of their bio ("a timid opinionated robot leading the discussion" is just nonsense). That said, LLMs are very bad at leading anything, so it's still on you to provide useful responses to hint them into action.

I'm not really here to give prompt advice, though. "Don't use negative reinforcement" is the best prompt advice I can give. Sentences like "don't do this" are, in general, counter-productive (especially in small models). E.g., tell the dog he "can only bark and do dog related things" rather than "the dog cannot talk like a human". EDIT: Also, rethink sentences containing a conditional: "may", "can", "from time to time"; all that is absolutely meaningless to a small model (and even a big one won't really know what to do with it).

Basic math is useful in some types of RP. Being able to solve word problems probably transfers to reasoning in general, enabling the portrayal of characters who pay attention to prompted details and reason to solve goals, but that's conjecture on my part.

I was super unclear, my bad. I meant that stuff like scoring high on the MathEval test (or being able to answer advanced physics questions) is probably not the best way to evaluate chat/RP/casual models. At least, it's not something I care about as long as it hasn't been dumbed down too hard during tuning/merging.

Anyway, I'll check the models I promised to check now.

Okay, added @saishf 's megamerge. Very good on DoggoEval (yeah, that's the name now) and in a casual pass with random samplers and seeds. Its only big issue is that it's a very big fan of the Knight/Knave question, like, a really big fan. Still, it does more than okay.

However, I'm a bit puzzled, as I can't get deterministic outputs. The variations are sorta minor, but they exist where they shouldn't. I tried taking a screenshot of the temp 0 run as an example of what I'm looking for, so I ran the same test again, and now Rex remembers what colors are (unrelated, but it loves blue, no matter the preset; most models love red and brown). That's kind of a first, as I did the same without issue on other models.

I'll have to reboot, check my setup, and all that stuff. It's gonna take a bit. (Also, it's the weekend, I'm 3 beers in, I'm not doing that rn.)
I'll deal with Kunoichi-Lemon-Royale-v2 later as a result, assuming I don't have to redo all the other models first :/

SerialKicked changed discussion title from Most Scientific RP Model Testing Method (tm) - Test #1 to Most Scientific RP Model Testing Method (tm) - DoggoEval
LWDCLS Research org
edited May 25

DoggoEval (yeah, that's the name now)

@SerialKicked - I'm so proud of this community, we've come so far since the Booba-Test days.

lol that's just lazy methodology :D

To the sadness of many here, I'm not gonna test anything directly ERP related. I'm not particularly interested in the subject. More importantly, I really don't want to spend days trying to determine how good a bunch of models are at using different variations for c_ck or c_nt and testing their understanding of anatomy as I value what's left of my sanity. I also think this current brand of testing can still be valuable for those models.

@Lewdiculous btw, how do you manage to have your 'org' name next to your nickname? Couldn't find an option for that.

LWDCLS Research org
edited May 25

To the sadness of many here, I'm not gonna test anything directly ERP related. I'm not particularly interested in the subject. More importantly, I really don't want to spend days trying to determine how good a bunch of models are at using different variations for c_ck or c_nt and testing their understanding of anatomy as I value what's left of my sanity. I also think this current brand of testing can still be valuable for those models.

That's just sad. The Booba-Test was the epitome of science. "Kids these days don't understand the true reason LLMs exist, lemme tell y'all, back in my days..."

When I can, I'd want to spend more time on this and maybe devise my own NSFW tests, but having to censor everything before posting, and half of my testing being basically un-shareable even with the censoring (to limit the number of funny lists I'm added to), is rough.

Suffering from lewd.

I mostly care about un-alignment/un-censoring and consistent response formatting; as long as the model isn't dumb as a rock, it should be fine intelligence-wise.

Okay, might as well explain why I'm doing this, besides the fact that I find it very entertaining (and that I really need to delete at least a few hundred GB worth of models).

I use (not too porn-brained) RP models because they are uncensored and generally output more creative stuff than base models. I use them to generate drafts for a game's quests, lore, documents, and item descriptions. Given that it contains cannibalism, deadly viruses, dead people, and underground bunkers with Fallout-tier experiments, I sorta want an uncensored model so I don't get refusals and warnings all the time. Plus, I want my general assistant, mail/news summary generator, and overall helpful/friendly chat-bot to sound more human than a standard model does. I still use them in more relaxed scenarios, because they are a pretty fun toy to play with (I'm also working on an all-in-one, newbie-friendly, 100% local, 100% private chat-bot app, and I know too much about the internet not to make sure it at least comes with a list of recommended E-cough-RP models).

That's why I value context length so much (anything below 16K context is worthless to me, and I'm glad L3 does manage 16K without too much trouble), and why I really want models to adhere to their system prompt.

LWDCLS Research org
edited May 25

I can respect your cause. Amen.

Don't take anything I say too seriously, I'm mostly just joking about and projecting, lmao.

Agreed on positive prompts being easier for LLMs to follow than negative prompts.

One of my metas has been to try merging smarter models with RP models to see what happens. I stumbled on a good combination early on when combining two kunoichi models with a Lemonade RP model, and kukulemon was born. I discovered not that long ago that someone made a YouTube video about how to use that model in particular, though I've not viewed it yet.

I've been meaning to find out what happens when a medical model is merged in.

I feel ya. I'm not necessarily trying to make ERP models, but NSFW data goes a long way toward eliminating censorship, giving the model "soul"/ridding it of sterile GPT-speak, and enhancing the emergent properties of the LLM.

I just want models that are fun to use and make the most of limited parameter size!

One of my metas has been to try merging smarter models with RP models to see what happens. I stumbled on a good combination early on when combining two kunoichi models with a Lemonade RP model, and kukulemon was born. I discovered not that long ago that someone made a YouTube video about how to use that model in particular, though I've not viewed it yet.

R136a1 then MoE'd kukulemon with InfinityRP-v1 and InfinityKumon-2x7B was born, which is my personal favorite. Thank you for your work grimjim!

I made it into a rentry so it's not as cluttered: https://rentry.co/TotallyScientificTest
It has a table of contents and a result table

Thanks. I was planning on moving all that to my own space soon anyway (once I have formalized the rest).

I don't know how to use HF spaces 😶‍🌫️ I can barely work markdown

I tried this prompt, it makes characters more true to how they should act, but they don't really try to push the situation forward?

It's just a generic, compact prompt, I don't really care if the dog is proactive or not in this context. If you want a character to be a bit more forward, I'd suggest to add something like "{{char}} leads the discussion." to its character bio instead, alongside some character adjectives like "opinionated", and assuming it's not contradictory with the rest of their bio ("a timid opinionated robot leading the discussion" is just nonsense). That said LLM are very bad at leading anything, so it's still on you to provide useful responses to hint them into action.

Shall try. The extra smarts from the simple prompt outweigh the reduced forwardness, and regens are quick enough now thanks to the optimizations in llama.cpp 😸

@grimjim I added Kunoichi-Lemon-Royale. Unsurprisingly for a merge of Kunoichi and Lemonade, two very strong models (if a bit repetitive), it practically passes all tests without issue. :)

[screenshot]

Here's an updated chart in markdown if you want to add it somewhere :3
https://files.catbox.moe/jay84m.txt

Thanks! I'll see how I can integrate your chart.

Here's the new model test page, it contains all the necessary information and files to reproduce the tests. I also edited the OP here to be on par with the new official topic.

So I don't flood a community page that isn't mine with a bunch of topics, the next test type and associated discussions (including requests) will happen there.

Cheers

LWDCLS Research org
edited May 27

@SerialKicked Thanks for spending time on this, I'm sure it takes a while.

Having a rentry.co markdown file or a GitHub Gist that you update with new information would be a good idea, to host tables and maybe even graphs if you want to plot models there later for easier visualization too.

LWDCLS Research org

Here's the new model test page, it contains all the necessary information and files to reproduce the tests. I also edited the OP here to be on par with the new official topic.

Since the author decided to move this discussion to a more permanent repo, this will be closed and further new additions can be added there, file resources for testing are also hosted in the new repo as they indicated. Cheers!

Lewdiculous changed discussion status to closed
