Gemma 3 4b unslop experiment v2
An unslop finetune of google/gemma-3-4b-it
Updates / Observations
An updated version of this model is here: v3
I've received some excellent feedback.
Some usage notes: Low temp recommended. My training technique uses high temp to try to hit slop edge cases, but I think I ended up baking in some trippiness by accident.
Overall I'm starting to like this model. I'm going to adjust things a little for my next attempt and bring it back down to earth, but it's still creative and less AI-like in a lot of ways.
Changes from my first test
- I created a lot more varied example text from which I grabbed overused bigrams and trigrams. It's now 60MB of junk...I'm starting to dream about em dashes in the rain.
- I completely redid my datasets with lots of different prompt styles
- Slop examples now number around 6000 in my training script. Lots of bigrams are duplicated in the trigrams; that's mostly a feature, not a bug
- My 4 comma regex rule is activated about 80% through training. First time around it was active the whole time and made the model output much shorter sentences. I'm trying to achieve a better balance this time
- Trained on about double the amount of tokens
- Model is still a bit underfit. I feel like I'm approaching the brain damage zone so I'm being cautious
- I've uploaded a UD-Q4_K_XL GGUF with settings that I grabbed from Unsloth's quant using my lil utility: quant_clone
Training technique
I have a pretty complex reward system, so parts of it are activated in 3 separate stages.
I generated lots of sample text and then sorted all bigrams and trigrams by frequency.
I added some of these to the reward function and penalized their use.
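A minimal sketch of that mining and penalty step (names like `top_ngrams` and the penalty value are illustrative, not the exact code in train.py):

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield n-grams as space-joined strings."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def top_ngrams(text, n, k=3000):
    """Return the k most frequent n-grams in a normalized corpus."""
    tokens = text.lower().split()
    return [g for g, _ in Counter(ngrams(tokens, n)).most_common(k)]

def slop_penalty(completion, slop_list, penalty=-0.5):
    """Subtract a fixed penalty for every banned bigram/trigram found."""
    text = completion.lower()
    return sum(penalty for phrase in slop_list if phrase in text)
```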
I also added some regex filters for various things.
If the prompt doesn't include "rain" but the model output does, it gets penalized.
Same thing for "air". Gemma 3 LOVES to talk about rain and how the air tastes (or clings, etc.)... no more.
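Roughly like this; the actual regexes and penalty weights in train.py may differ:

```python
import re

BANNED = {
    "rain": re.compile(r"\brain(ing|ed|s)?\b", re.IGNORECASE),
    "air":  re.compile(r"\bair\b", re.IGNORECASE),
}

def unprompted_word_penalty(prompt, completion, penalty=-1.0):
    """Penalize pet words the model brings up when the prompt never asked."""
    score = 0.0
    for word, pattern in BANNED.items():
        if word not in prompt.lower() and pattern.search(completion):
            score += penalty
    return score
```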
Many of my training prompts include a word count for the model to hit. Low deviation from that count is rewarded; large deviation is penalized.
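A sketch of how such a word-count reward can look (the tolerance and scaling here are assumptions, not the values used in training):

```python
def word_count_reward(completion, target_words, tolerance=0.10):
    """Reward small deviation from the requested word count, penalize large."""
    actual = len(completion.split())
    deviation = abs(actual - target_words) / max(target_words, 1)
    if deviation <= tolerance:
        return 1.0 - deviation        # close to target: positive reward
    return -min(deviation, 1.0)       # far off: growing penalty, capped at -1
```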
Halfway through training I activated a lexical diversity comparison. It penalizes MTLD below 100 and gives increasing rewards up to 120.
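For reference, a simplified forward-only MTLD plus a reward shaped to the 100/120 thresholds above; a library implementation that averages forward and backward passes could be swapped in instead:

```python
def mtld(tokens, threshold=0.72):
    """Measure of Textual Lexical Diversity (single-pass approximation)."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        types.add(tok.lower())
        count += 1
        if len(types) / count <= threshold:   # diversity dropped: close a factor
            factors += 1
            types, count = set(), 0
    if count:                                 # partial factor at the end of the text
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors else len(tokens)

def mtld_reward(completion, floor=100, ceiling=120):
    """Penalize MTLD below the floor, scale rewards up to the ceiling."""
    score = mtld(completion.split())
    if score < floor:
        return -1.0
    return min(score - floor, ceiling - floor) / (ceiling - floor)
```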
About 80% through training I enabled the 4+ comma-per-sentence regex, which penalizes overly complex sentences.
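Something along these lines; the real sentence splitting and penalty value are assumptions:

```python
import re

SENTENCE_SPLIT = re.compile(r"[.!?]+\s+")

def comma_overload_penalty(completion, max_commas=3, penalty=-0.5):
    """Penalize every sentence containing 4 or more commas."""
    score = 0.0
    for sentence in SENTENCE_SPLIT.split(completion):
        if sentence.count(",") > max_commas:
            score += penalty
    return score
```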
There's a callback for early stopping if reward stays high, but it didn't kick in this run.
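For anyone wiring up something similar with the transformers Trainer, a plateau-based stop could look like the sketch below; the logged metric name "reward" and the patience/threshold numbers are assumptions:

```python
from transformers import TrainerCallback

class RewardEarlyStop(TrainerCallback):
    def __init__(self, threshold=0.9, patience=50):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        reward = (logs or {}).get("reward")
        if reward is None:
            return
        self.streak = self.streak + 1 if reward >= self.threshold else 0
        if self.streak >= self.patience:      # reward held high long enough
            control.should_training_stop = True
```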
This was trained on ~30 million tokens on a single 3090. I'm sharing my code so people can try their own finetuning runs.
training code: train.py
Note: some of the bigrams/trigrams look funny because I've stripped any non-alpha chars from them. If you want to use them, you'll have to process your text the same way I did.
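Something like the sketch below, though the exact normalization (e.g. whether everything is lowercased) lives in the script, so treat this as an approximation:

```python
import re

def normalize(text):
    """Keep letters and spaces only, collapse whitespace, lowercase."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()
```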