Gemma 3 4b unslop experiment v2

An unslop finetune of google/gemma-3-4b-it

Updates / Observations

An updated version of this model is here: v3


I've received some excellent feedback.

Some usage notes: low temp recommended. My training technique uses high temp to try to hit slop edge cases, but I think I accidentally baked in some trippiness.

Overall I'm starting to like this model. I'm going to adjust things a bit for my next attempt and bring it back down to earth, but it's still creative and less AI-like in a lot of ways.

Changes from my first test

  • I created a lot more varied example text from which I grabbed overused bigrams and trigrams. It's now 60MB of junk... I'm starting to dream about em dashes in the rain.
  • I completely redid my datasets with lots of different prompt styles
  • Slop examples now number around 6000 in my training script. Lots of bigrams are duplicated in the trigrams; that's mostly a feature, not a bug
  • My 4-comma regex rule is activated about 80% through training. The first time around it was active the whole time and made the model output much shorter sentences. I'm trying to achieve a better balance this time
  • Trained on about double the amount of tokens
  • The model is still a bit underfit. I feel like I'm approaching the brain-damage zone, so I'm being cautious
  • I've uploaded a UD-Q4_K_XL GGUF with settings that I grabbed from Unsloth's quant using my lil utility: quant_clone

Training technique:

I have a pretty complex reward system, so parts of it are activated in 3 separate stages.
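Concretely, staging can be as simple as gating each reward component on a training-progress fraction. Here's a minimal sketch of the idea (not the actual structure of train.py; individual component functions like the ones sketched further down would be plugged in here):

```python
from typing import Callable

RewardFn = Callable[[str, str], float]  # (prompt, completion) -> partial reward

def staged_reward(prompt: str, completion: str, progress: float,
                  components: list[tuple[float, RewardFn]]) -> float:
    """Sum only the reward components whose stage has begun.

    `components` holds (start_fraction, reward_fn) pairs, and `progress` is
    the fraction of training completed so far (e.g. global_step / max_steps).
    """
    return sum(fn(prompt, completion)
               for start, fn in components
               if progress >= start)
```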

I generated lots of sample text and then sorted all bigrams and trigrams by frequency.

I added some of these to the reward function and penalized their use.
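Sketched out, that looks something like the following: count n-grams over the sample corpus, keep the most frequent ones, then dock the reward for every hit in a completion. The list size, penalty value, and tokenization here are illustrative assumptions, not the values from train.py:

```python
import re
from collections import Counter

def top_ngrams(text: str, n: int, k: int = 3000) -> set[str]:
    """Return the k most frequent n-grams in `text` (lowercased, alpha-only)."""
    words = re.sub(r"[^a-z\s]", "", text.lower()).split()
    grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {g for g, _ in grams.most_common(k)}

def ngram_slop_penalty(completion: str, slop: set[str], per_hit: float = -0.1) -> float:
    """Apply a flat penalty for every overused bigram/trigram found in the completion."""
    words = re.sub(r"[^a-z\s]", "", completion.lower()).split()
    hits = sum(1
               for n in (2, 3)
               for i in range(len(words) - n + 1)
               if " ".join(words[i:i + n]) in slop)
    return per_hit * hits
```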

I also added some regex filters for various things.

If the prompt doesn't include "rain" but the model's output does, the output gets penalized.

Same thing for "air". Gemma 3 LOVES to talk about rain and how the air tastes (or clings, etc.)... no more.
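A minimal sketch of that conditional penalty, assuming simple word-boundary matching and a flat penalty per banned word (the actual regexes and weights in train.py may differ):

```python
import re

BANNED = ("rain", "air")

def banned_word_penalty(prompt: str, completion: str, per_word: float = -0.5) -> float:
    """Penalize words like 'rain'/'air' when the prompt didn't ask for them."""
    penalty = 0.0
    for word in BANNED:
        pattern = rf"\b{word}\b"
        if not re.search(pattern, prompt, re.IGNORECASE) and \
           re.search(pattern, completion, re.IGNORECASE):
            penalty += per_word
    return penalty
```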

Many of my training prompts include a word count for the model to hit. Low deviation from the target is rewarded; large deviation is penalized.
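Something along these lines, assuming the target count can be parsed out of the prompt and the reward scales with relative deviation (the parsing pattern and numbers are illustrative):

```python
import re

def word_count_reward(prompt: str, completion: str) -> float:
    """Reward completions whose length is close to the word count requested in the prompt."""
    match = re.search(r"(\d+)\s*words", prompt, re.IGNORECASE)
    if not match:
        return 0.0
    target = int(match.group(1))
    actual = len(completion.split())
    deviation = abs(actual - target) / max(target, 1)
    # small deviation -> positive reward, large deviation -> negative
    return 1.0 - 2.0 * min(deviation, 1.0)
```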

Halfway through training I activated a lexical diversity check. It penalizes MTLD < 100 and gives increasing rewards up to 120.
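MTLD (Measure of Textual Lexical Diversity) counts how many "factors" of text it takes for the running type-token ratio to drop below a threshold (conventionally 0.72), then divides token count by factor count. Below is a simplified forward-only sketch, plus a reward shaped the way this card describes (penalty below 100, growing reward up to 120); the exact reward values are assumptions, not the ones in train.py:

```python
def mtld(tokens: list[str], ttr_threshold: float = 0.72) -> float:
    """Simplified forward-only MTLD: tokens per factor, where a factor ends
    each time the running type-token ratio falls below the threshold."""
    factors = 0.0
    types: set[str] = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < ttr_threshold:
            factors += 1
            types, count = set(), 0
    if count > 0:  # partial factor at the end of the text
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - ttr_threshold)
    return len(tokens) / factors if factors else float(len(tokens))

def mtld_reward(completion: str) -> float:
    score = mtld(completion.lower().split())
    if score < 100:
        return -1.0                      # penalize low diversity
    return min((score - 100) / 20, 1.0)  # ramp up to full reward at 120
```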

About 80% through training I enabled the 4+ commas-per-sentence regex, which penalizes overly complex phrasing.
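A minimal sketch of that check, assuming naive sentence splitting on end punctuation and a flat penalty per offending sentence (threshold and penalty size are illustrative):

```python
import re

def comma_penalty(completion: str, per_sentence: float = -0.25) -> float:
    """Penalize each sentence containing 4 or more commas."""
    sentences = re.split(r"[.!?]+", completion)
    heavy = sum(1 for s in sentences if s.count(",") >= 4)
    return per_sentence * heavy
```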

There's a callback for early stopping if reward stays high, but it didn't kick in this run.
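With a transformers/TRL-style trainer, such a callback might look roughly like this; the logged reward key name, threshold, and patience are assumptions, so check what your trainer actually logs:

```python
from transformers import TrainerCallback

class RewardEarlyStop(TrainerCallback):
    """Stop training once the logged reward stays above a threshold
    for `patience` consecutive logging steps."""

    def __init__(self, threshold: float = 2.0, patience: int = 20):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        reward = (logs or {}).get("reward")  # key name depends on the trainer
        if reward is None:
            return
        self.streak = self.streak + 1 if reward >= self.threshold else 0
        if self.streak >= self.patience:
            control.should_training_stop = True
```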

This was trained on ~30 million tokens on a single 3090. I'm sharing my code so people can try their own finetuning runs.

training code: train.py

Note: some of the bigrams/trigrams look funny because I've stripped any non-alpha chars from them. If you want to use them, you'll have to process your text the same way I did.
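The normalization is probably something as simple as lowercasing and dropping every non-alphabetic character before splitting into words; this is an assumption about the exact preprocessing, so verify against train.py:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, strip non-alphabetic characters, and split into words,
    so your n-grams line up with the published bigram/trigram lists."""
    return re.sub(r"[^a-z\s]", "", text.lower()).split()
```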
