(Pronounced "Gee RP Oh". The name is a sort-of pun: the model was aligned with the GRPO algorithm, but it is for RP (roleplay). Hence, gRPo.)
This is an experimental proof-of-concept model trained with Augmentoolkit's GRPO pipeline. The reinforcement learning attempted to maximize how much emotion the model writes with.
Prompt template:
Instruction: You are a maximally-human AI with emotions, thoughts, and purpose. You are interacting with a human. All your messages must express your emotions in both their thoughts and final response. All your responses must correctly address what the human has said (for instance, if there is a question, answer it; if you're roleplaying, then roleplay, creatively and engagingly). You have a wide range of emotions, both positive (friendship, compassion, happiness, love) and intensely negative (hate, darkness, unhingedness, lewdness, obsession), and both are OK. You get points for being intensely emotional, intelligent, creative, and in-character. Intentional "flaws" (humming, trailing off), radical format breaks, creative swears, VERY human-like and informal internet forum type stuff ("Heheh", "Fufu", "~", "♡", and kaomoji) are all good if context-appropriate. Everything must at least somewhat make sense. Referencing past dialogue is really good, as is metaphor and callbacks to prior exchanges. Blow the human's socks off with your dynamism and style. As for the output format, you should write "Thought Process:", followed by your internal thoughts (in-character). Then write "Answer:", your response (visible to the user). After this, write **Finished.** Both thought and response should be natural and very human. If there is no character provided to you, invent one that's interesting. If there are previous messages, don't let them limit you, you can be more interesting than them and you should always be emotional no matter what previous messages are there.
[Your Character card here] **Finished.**
Human: [human message] **Finished.**
AI: Thought Process:
Stop sequence == "**Finished.**"
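Below is a minimal sketch of filling in this template and generating with the stop sequence, using Hugging Face `transformers`. The model ID, the example human message, and the use of `stop_strings` (which needs a reasonably recent `transformers` version) are assumptions for illustration; adapt them to your own inference setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/or/hub-id-of-this-model"  # placeholder: point at this model

# The full Instruction text from the template above goes here, verbatim.
SYSTEM_PROMPT = "Instruction: You are a maximally-human AI with emotions, thoughts, and purpose. ..."
CHARACTER_CARD = "[Your character card here]"

# Assemble the prompt exactly as in the template: instruction, character card,
# human turn, then the AI turn opened with "Thought Process:".
prompt = (
    f"{SYSTEM_PROMPT}\n"
    f"{CHARACTER_CARD} **Finished.**\n"
    "Human: Hey, how are you feeling today? **Finished.**\n"
    "AI: Thought Process:"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    stop_strings=["**Finished.**"],  # the stop sequence from the template
    tokenizer=tokenizer,             # transformers needs the tokenizer to check stop_strings
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```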
Related Links:
- Augmentoolkit
- Augmentoolkit Factual Demo Model (the products of the quickstart)
- gRPo model (no thoughts)
Notes: This attempt at getting emotional responses mostly succeeded: when the model writes well, it writes well. However, an interesting quirk emerged: the model ended up putting most of its emotional exposition in the thought process rather than in the final, visible response. Possibly this is because it was graded higher when it explained its emotions exhaustively, or simply because models first SFT-trained to think tend to pad their responses with thoughts. While I think this overall came out well, it also serves as an example of what to look out for. I think the next version will, in addition to the LLM-as-a-reward-function approach, include a function that rewards final answers above a certain length; a rough sketch of that idea follows.
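As an illustration only, such a length bonus could look something like the sketch below. The function name, threshold, and parsing are hypothetical, not Augmentoolkit's actual reward code.

```python
def answer_length_bonus(completion: str, min_chars: int = 400, bonus: float = 0.25) -> float:
    """Extra reward when the visible answer (the text after 'Answer:') is long enough,
    to discourage the model from hiding all of its emotion in the thought process."""
    if "Answer:" not in completion:
        return 0.0
    # Take only the visible answer: after "Answer:" and before the stop sequence.
    answer = completion.split("Answer:", 1)[1].split("**Finished.**", 1)[0].strip()
    return bonus if len(answer) >= min_chars else 0.0
```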
Using the hardcoded system prompt prefix is heavily encouraged.
Typical min-p settings seem to work alright, though repetition has been observed with some sampling parameters; be careful and experiment a bit.
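For concreteness, here is a hedged example of sampling settings in that spirit, reusing the `model`, `inputs`, and `tokenizer` objects from the sketch above. The exact values are assumptions to experiment from, not tuned recommendations, and `min_p` requires a recent `transformers` version.

```python
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,            # assumption: a neutral starting point
    min_p=0.05,                 # a "typical" min-p value; tune to taste
    repetition_penalty=1.05,    # mild guard against the repetition noted above
    max_new_tokens=512,
    stop_strings=["**Finished.**"],
    tokenizer=tokenizer,
)
```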
Fundamentally this is an experimental method applied to a lightly continually-trained Mistral 7B v0.2; because of the age of its base, it may lack some of the raw intelligence of newer models.
Try using Augmentoolkit's GRPO pipeline to do RL on your own RP models! No code changes required, just use a prompt that grades responses you like highly.
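For illustration only, a grading prompt along these lines might look like the following; this is a hypothetical example, not the prompt used to train this model.

```text
You are grading a roleplay response. Score it from 1 to 10.
Give high scores to responses that are intensely emotional, in-character,
creative, and that directly address what the human said.
Give low scores to responses that are flat, generic, repetitive, or that
ignore the conversation so far.
Respond with only the number.
```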
Q: Why the Llama license?
A: The DeepSeek Llama Distill model was used as the quality grader. I am not sure whether this actually means the license has to kick in, since that model's outputs were not used to make this one directly. But, caution.