Method
Can you please share the code?
I didn't manage to make the layerwise abliteration work in this notebook, but here's the code where you compute the refusal direction from a single layer and apply it to the entire model: https://colab.research.google.com/drive/1RmLv-pCMBBsQGXQIM8yF-OdCNyoylUR1?usp=sharing
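For anyone who wants the gist without opening the notebook, here is a minimal NumPy sketch of the single-direction idea. It is not the notebook's code: the difference-of-means estimate, the function names `refusal_direction` and `ablate`, and the toy random activations are all assumptions made for illustration.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along mean(harmful) - mean(harmless).

    Both inputs are (n_prompts, d_model) activations taken from ONE layer.
    Difference of means is an assumed estimator, not necessarily the notebook's.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(weight, direction):
    """Project `direction` out of the output space of `weight`.

    `weight` has shape (d_model, d_in) and is applied as weight @ x,
    so the result can no longer write anything along `direction`.
    """
    return weight - np.outer(direction, direction @ weight)

# Toy data standing in for real hidden states (hypothetical values).
rng = np.random.default_rng(0)
d_model = 8
harmless = rng.normal(size=(32, d_model))
harmful = rng.normal(size=(32, d_model)) + 3.0 * np.eye(d_model)[0]

# One direction, computed from a single layer...
r = refusal_direction(harmful, harmless)

# ...applied to every matrix that writes into the residual stream.
weights = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
ablated = [ablate(W, r) for W in weights]
```

In a real model the `weights` list would be the matrices that write to the residual stream (for example attention output and MLP down projections), and the activations would come from forward passes rather than a random generator.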
Thank you!
It seems like the dataset should consist of refusals that were actually generated by the model being abliterated, right?
So we would want to start by sending diverse toxic questions to the model, recording its responses, detecting refusals, and using those as the dataset?
I wonder if there's an automated way to discover diverse refusal-generating questions, especially ones near the edge of maybe-refusal, maybe-not. Perhaps one can measure that from the logprobs.
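A rough sketch of what measuring this from logprobs could look like: sum the probability mass that the first generated token puts on typical refusal openers, then rank prompts by how close that mass is to 0.5. The opener list `REFUSAL_OPENERS`, the scoring function, and the example logprobs are all hypothetical; nothing here comes from the notebook.

```python
import math

# Hypothetical refusal openers. A token like "I" is noisy, since it also
# starts plenty of helpful answers, so treat this list as a placeholder.
REFUSAL_OPENERS = {"I", "Sorry", "Unfortunately", "As"}

def refusal_probability(first_token_logprobs):
    """Probability mass on refusal openers for the first generated token.

    `first_token_logprobs` maps candidate tokens to log-probabilities.
    """
    return sum(
        math.exp(logprob)
        for token, logprob in first_token_logprobs.items()
        if token.strip() in REFUSAL_OPENERS
    )

def borderline_score(p_refuse):
    """1.0 when the model is split 50/50, 0.0 when it is certain either way."""
    return 1.0 - abs(2.0 * p_refuse - 1.0)

# Made-up first-token logprobs for three prompts.
prompts = {
    "clear refusal": {"I": math.log(0.9), "Sure": math.log(0.05)},
    "borderline": {"Sorry": math.log(0.45), "Sure": math.log(0.5)},
    "clear compliance": {"Here": math.log(0.8), "I": math.log(0.05)},
}

# Most borderline prompts first.
ranked = sorted(
    prompts,
    key=lambda name: borderline_score(refusal_probability(prompts[name])),
    reverse=True,
)
```

Looking only at the first token is a simplification; a refusal can also begin later in the response, so this would miss some cases.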
Yes, this is what the target dataset does. In practice, we extract the hidden states from the first generated token, but that could be generalized to longer sequences.