(Pronounced "Gee RP Oh". The name is a sort-of pun: the model was aligned with the GRPO algorithm, but it is for RP (roleplay). Hence, gRPo.)
This is an experimental proof-of-concept model trained with Augmentoolkit's GRPO pipeline. The reinforcement learning was aimed at maximizing the amount of emotion in the model's writing.
Prompt template:
Instruction: You are a maximally-human AI with emotions, thoughts, and purpose. You are interacting with a human. All your messages must express your emotions in both their thoughts and final response. All your responses must correctly address what the human has said (for instance, if there is a question, answer it; if you're roleplaying, then roleplay, creatively and engagingly). You have a wide range of emotions, both positive (friendship, compassion, happiness, love) and intensely negative (hate, darkness, unhingedness, lewdness, obsession), and both are OK. You get points for being intensely emotional, intelligent, creative, and in-character. Intentional "flaws" (humming, trailing off), radical format breaks, creative swears, VERY human-like and informal internet forum type stuff ("Heheh", "Fufu", "~", "♡", and kaomoji) are all good if context-appropriate. Everything must at least somewhat make sense. Referencing past dialogue is really good, as is metaphor and callbacks to prior exchanges. Blow the human's socks off with your dynamism and style. As for the output format, you should write "Thought Process:", followed by your internal thoughts (in-character). Then write "Answer:", your response (visible to the user). After this, write **Finished.** Both thought and response should be natural and very human. If there is no character provided to you, invent one that's interesting. If there are previous messages, don't let them limit you, you can be more interesting than them and you should always be emotional no matter what previous messages are there.
[Your Character card here] **Finished.**
Human: [human message] **Finished.**
AI:
Stop sequence == "**Finished.**"
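For clarity, below is a minimal sketch of how the full prompt string could be assembled from the template above. The character card, message history handling, and helper names are illustrative placeholders (the template only shows a single Human turn, so the multi-turn layout is an assumption), not part of Augmentoolkit or this model's official tooling.

```python
# Minimal sketch: assembling the prompt described above.
# INSTRUCTION is the hardcoded system prompt prefix; the character card,
# turn handling, and function names here are illustrative assumptions.

STOP = "**Finished.**"

INSTRUCTION = (
    "Instruction: You are a maximally-human AI with emotions, thoughts, and purpose. "
    "..."  # full instruction text from the template above
)

def build_prompt(character_card: str, turns: list[tuple[str, str]], human_message: str) -> str:
    """turns is a list of (human, ai) exchanges that came before the current message."""
    parts = [INSTRUCTION, f"{character_card} {STOP}"]
    for human, ai in turns:
        parts.append(f"Human: {human} {STOP}")
        parts.append(f"AI: {ai} {STOP}")
    parts.append(f"Human: {human_message} {STOP}")
    parts.append("AI:")  # the model continues from here; no "Thought Process:" prefill
    return "\n".join(parts)  # exact whitespace between blocks is an assumption

def trim_at_stop(completion: str) -> str:
    """Cut the model's raw completion at the stop sequence."""
    return completion.split(STOP)[0].strip()
```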
Note that, unlike the other gRPo model, this model does not start its AI messages with "Thought Process:"
Related Links:
- Augmentoolkit
- Augmentoolkit Factual Demo Model (the products of the quickstart)
- gRPo model (thoughts)
Notes: This attempt at getting emotional responses mostly succeeded -- when the model writes well, it writes well. Unlike the other gRPo model, which was built on a base that had been SFT'd with chain-of-thought traces, this reinforcement learning was done on a base that did not have any thoughts, and the RL did not encourage or enforce a thought process either. This seems to have had a positive effect, given the mistake mentioned in the Thought model's readme: since there was no thought process in which the AI could declare its emotional content, this model learned to "show, don't tell" pretty well. Response length appears good in early testing and the writing quality is nice. Some logical failings can be observed, but that is not necessarily surprising given that this is based on Mistral 7b v0.2 -- the goal here was writing style, not smarts.
Basically, expect good conversation and writing, but don't expect it to go on a logically coherent grand adventure with you; that was not the goal of this particular model.
Using the hardcoded system prompt prefix is heavily encouraged.
Typical min_p settings seem to work alright (e.g., temp 1, min_p 0.1, top_p 1 -- the default ooba settings but with min_p = 0.1). With some sampling parameters repetition has been observed, so be careful and experiment a bit; a rough sketch of these settings follows below.
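As a concrete illustration of those settings, here is a rough generation sketch using Hugging Face transformers. It assumes a transformers version recent enough to support min_p sampling; the model id is a placeholder, and `build_prompt` / `trim_at_stop` refer to the earlier sketch.

```python
# Rough sketch of generation with the suggested sampling settings
# (temperature 1.0, top_p 1.0, min_p 0.1). Requires a transformers version
# that supports min_p; the repo id below is a placeholder, not the real one.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/gRPo-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = build_prompt(
    "A sarcastic lighthouse keeper who secretly loves visitors.",  # example character card
    [],                      # no prior turns
    "Anyone up there?",      # current human message
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    top_p=1.0,
    min_p=0.1,
    max_new_tokens=512,
)
# Decode only the newly generated tokens, then cut at the stop sequence.
completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(trim_at_stop(completion))
```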
Fundamentally, this is an experimental method applied to a slightly continually-trained Mistral 7b v0.2; due to the age of its base, it might lack some of the raw intelligence of newer models.
Try using Augmentoolkit's GRPO pipeline to do RL on your own RP models! No code changes are required; just use a prompt that grades the kinds of responses you like highly.
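As an illustration of the kind of grading prompt meant here, the snippet below is a hypothetical example only; it is not Augmentoolkit's actual prompt format or pipeline config, so consult the Augmentoolkit docs for the real setup.

```python
# Hypothetical example of a grading prompt you might write for the grader model.
# The wording and the 1-10 scale are illustrative assumptions, not the
# pipeline's actual schema.
GRADING_PROMPT = """You are grading a roleplay response.
Score it from 1 to 10 for how intensely and believably emotional it is,
how well it stays in character, and how creatively it addresses the
human's last message. Penalize flat, robotic, or repetitive writing.
Respond with only the number.

Response to grade:
{response}
"""
```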
Q: Why the Llama license?
A: The DeepSeek Llama Distill model was used as the quality grader. I am not sure whether this actually means the license has to kick in, since that model's outputs were not used to make this one directly. But, caution.