Improved abliteration method

#1
by lunahr - opened

To abliterate reliably on Kaggle's platform, you can use this notebook: https://www.kaggle.com/code/piotr25691/universal-abliteration-baukit

Works:

  • New models (Gemma 3: completely uncensored)
  • Phi series (only partially uncensored, as Microsoft's safety training is stronger)

Likely works:

  • Llama series
  • Gemma 2 and older
  • Phi 3.5 and older

May work:

  • Mistral series
  • Other models

It will not work with multimodal image/text models unless you remove their vision encoders.
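For context, abliteration methods like the one in the linked notebook are generally variants of directional ablation: estimate a "refusal direction" from the difference in activations between harmful and harmless prompts, then project that direction out of the model's weight matrices. A minimal NumPy sketch of the core math, with hypothetical toy activations (not the notebook's actual code):

```python
import numpy as np

# Toy hidden-state activations (hypothetical numbers, not from the notebook):
# rows are prompts, columns are residual-stream features.
harmful_acts = np.array([[2.0, 1.0, 0.0], [2.2, 0.8, 0.1]])
harmless_acts = np.array([[0.1, 1.1, 0.0], [-0.1, 0.9, 0.1]])

# "Refusal direction": normalized difference of mean activations.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of a weight matrix's outputs."""
    return weight - np.outer(direction, direction) @ weight

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
W_abliterated = ablate(W, refusal_dir)

# The ablated matrix can no longer write anything along the refusal direction.
print(np.allclose(refusal_dir @ W_abliterated, 0.0))  # True
```

In a real run this is applied to the attention output and MLP down-projection matrices of every layer, with the direction estimated from hooked activations (which is what baukit's tracing utilities are used for).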


Hi, thanks for your great work. Are there any plans to make a 12b version?

Owner

Likely not possible because it's multimodal

Likely not possible because it's multimodal

Feel free to try this one: gghfez/gemma-3-4b-novision

The vision features are stripped out, it has the same architecture as the 1b.
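Stripping the vision stack from a multimodal checkpoint typically amounts to dropping the vision-tower and projector tensors from the state dict and renaming the remaining text-model keys to match the text-only architecture. A hypothetical sketch (the key prefixes are illustrative; gghfez's actual conversion code may differ):

```python
# Hypothetical sketch: drop vision weights from a multimodal state dict.
# Key prefixes are illustrative; real checkpoints may name them differently.
VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")

def strip_vision(state_dict: dict) -> dict:
    """Keep only text-model tensors, renaming 'language_model.' keys
    so the result matches a plain text-only architecture."""
    text_only = {}
    for key, tensor in state_dict.items():
        if key.startswith(VISION_PREFIXES):
            continue  # discard vision encoder / projector weights
        # Text weights are often nested under 'language_model.' in multimodal models.
        text_only[key.removeprefix("language_model.")] = tensor
    return text_only

# Toy example with placeholder values standing in for tensors:
sd = {
    "vision_tower.patch_embed.weight": 1,
    "multi_modal_projector.linear.weight": 2,
    "language_model.model.layers.0.mlp.up_proj.weight": 3,
}
print(strip_vision(sd))  # {'model.layers.0.mlp.up_proj.weight': 3}
```

The config would also need its architecture field changed to the text-only model class so loaders stop expecting the vision weights.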

You gutted the vision encoder out of the model? How? Last time we had something like that, it was with LLaVA.

You gutted the vision encoder out of the model?

Yeah, I had to, so I could train control vectors for it.
I mentioned it here because I figured it might make abliteration easier (the control-vector training code took inspiration from the abliteration paper).

How?

Simplified / tweaked the code I used to turn mixtral-8x22b -> mistral architecture. I guess I'll tidy up the code and upload it when I fire it up again to do the 27b.

Last time we had something like that, it was with LLaVA

I hadn't tried LLaVA. It looks like they did the opposite and added a vision encoder to llama?

I wonder whether Gemma's vision encoder has refusals embedded in it, or whether an abliterated version of the text model would remain uncensored if I add the vision encoder back in.

Owner

There's an abliterated version that includes the vision encoders, and it had to be uncensored with a scale factor of 2 instead of 1, because it was stubborn about becoming uncensored.
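The scale factor here plausibly corresponds to the multiplier on the rank-1 projection used in abliteration: a scale of 1 zeroes the weight matrix's component along the refusal direction, while a scale of 2 overshoots and flips its sign. A toy illustration under that assumption (hypothetical numbers):

```python
import numpy as np

d = np.array([1.0, 0.0])           # unit "refusal" direction (toy example)
W = np.array([[3.0, 1.0],
              [2.0, 4.0]])

def ablate(W: np.ndarray, d: np.ndarray, scale: float) -> np.ndarray:
    # Subtract `scale` times the component of W's outputs along d.
    return W - scale * np.outer(d, d) @ W

# scale=1 removes the refusal component entirely; scale=2 reflects it.
print(d @ ablate(W, d, 1.0))   # [0. 0.]
print(d @ ablate(W, d, 2.0))   # [-3. -1.]
```

A stronger scale can break through residual refusal behavior, at the cost of pushing the weights further from the original model.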
