Synthetic training data is a plague.
So, there's a lot of shitty RP slop out there. Like, a lot. Basically every model is infected with it, and all the tunes and loras and datasets everyone's using to "enhance" base models are infested. There's a shit-ton of bad amateur smut written by 19-year-olds and that keeps coming up. This trash was in Llama2 and it's still in whatever these new huge models are. Claudeisms out the yang. OpenAIisms left and right. Generally repetitive "barely above a whisper" type shit.
Well, these days most inference systems support long lists of token sequences to ban. At least KCPP and LCPP do. I have no idea what you ExLlama clowns have because you build software for people who own 4 GPUs. Anyway, the point is that you want to be able to paste in a big honkin pile of banned strings so that the model stops slopping at you and stops sounding like it has LimaRP stuck between its teeth like rotting spinach.
If you use kcpp, you have to bump the max banned tokens accepted, but it's just in the python script right at the top. You don't even have to recompile. I use 8192 because, fuck it, it's my hardware. The performance is my problem. LostRuins doesn't like a list that big because it'll slow down some mysterious "public koboldcpp instances" that must be, like, 0.0001% of all kcpp users. But it works fine. Just bump the number.
Then paste this whole damned thing into ST. Use the global banlist so you don't have to re-paste it into every sampler config you own.
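If the list is bigger than the cap you set, something has to give. Here's a minimal sketch of trimming it first, assuming the banlist is plain text with one entry per line (that's an assumption, so check what your frontend actually expects). The names `prepare_banlist` and `limit` are made up for illustration and aren't part of kcpp or ST. It keeps the top of the list and drops the tail.

```python
def prepare_banlist(lines, limit=8192):
    """Clean a banlist and cap its length.

    Assumes one banned entry per line (hypothetical format).
    Blank lines and exact duplicates are dropped, keeping the first
    occurrence, and then only the first `limit` entries are kept.
    """
    seen = set()
    cleaned = []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry in seen:
            continue  # skip blanks and repeats
        seen.add(entry)
        cleaned.append(entry)
    # Truncate from the bottom so the entries at the top survive.
    return cleaned[:limit]


# Hypothetical usage with a file on disk:
#   with open("banlist.txt", encoding="utf-8") as fh:
#       trimmed = prepare_banlist(fh, limit=8192)
#   print("\n".join(trimmed))
```

The default `limit` of 8192 just mirrors the number used above. Set it to whatever your backend is configured to accept.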
After that? Enjoy half as much LLM slop!
Biases
This list of tokens doesn't just have inherent biases, it has selected ones: the vast majority of the banned strings are cliches that come up when the LLM is role-playing (or ERP'ing) or writing as one or more cis females. There are some strings in there which weren't chosen by that criterion, but they're strings likely to have come up in ao3 and/or RP forum datasets.
That is to say: if you use these strings for banning tokens, you'll probably be disappointed if the LLM is writing or using male characters. There are heaps upon heaps of human-generated and synthetic slop that won't have come up in my life at all and I'd never even know it was there. At that point you're relying on the banned phrase list at the bottom.
This bias is intentional. I will spend zero time seeking out that kind of slop in order to kill it. We can all wish someone else, someone affected by it, will come along and add to it.
Ideally they'd add it at the bottom, so I can truncate the list when I use this data.
Credits
- Me. The top half of the file is shit I banned by hand over the course of months because I was tired of seeing it.
- AntiSlop-Sampler's pile of common strings and phrases, which occupies the remainder of the file. It's largely just pasted in, so there's a bunch of stuff that's statistically likely to come up only if you read a lot of bad ao3 vampire fic. I've never seen, like, two thirds of it.
- antisoc-qa-assoc for rummaging around in latent spaces and for doing actual engineering-based "prompt engineering" by engineers who know what engineering is. And who have actually read a book that doesn't suck, who have written books, who know what writing is.
License
None: Dedicated to the public domain. All rights waived. Any indication of other licensing in this document is purely because HF license metadata is fucked; only the dedication to public domain applies. The person who provides this data doesn't live anywhere that authoritarian concepts like "moral rights" or un-waivable rights exist. Therefore they do not apply.