Method

by erichartford - opened 10 days ago

Discussion

erichartford

10 days ago

Can you please share the code?

mlabonne

Owner 10 days ago

I didn't manage to make the layerwise abliteration work in this notebook, but here's the code where you compute the refusal direction from a single layer and apply it to the entire model: https://colab.research.google.com/drive/1RmLv-pCMBBsQGXQIM8yF-OdCNyoylUR1?usp=sharing
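For readers who don't open the notebook, the single-layer approach described above can be sketched roughly as follows. This is a minimal NumPy illustration and not the notebook's code: the function names (`refusal_direction`, `orthogonalize`), the array shapes, and the random stand-in activations are all assumptions made for the example.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations at one layer.

    Both inputs are (n_prompts, d_model) arrays of hidden states collected
    at the same layer and token position (hypothetical shapes).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def orthogonalize(weight, direction):
    """Remove the component along `direction` from a matrix's output.

    `weight` is (d_model, d_in) and is assumed to write into the residual
    stream, so W' = W - r r^T W gives r^T (W' x) = 0 for every input x.
    """
    return weight - np.outer(direction, direction @ weight)

# Random stand-ins for real activations (assumption, for illustration only).
rng = np.random.default_rng(0)
d_model = 8
harmful = rng.normal(size=(16, d_model)) + 2.0 * np.eye(d_model)[0]
harmless = rng.normal(size=(16, d_model))
r = refusal_direction(harmful, harmless)

# The direction comes from a single layer but is applied to every matrix
# that writes to the residual stream, i.e. to the entire model.
weights = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
ablated = [orthogonalize(w, r) for w in weights]
```

After this step, none of the modified matrices can write anything along `r`, which is the sense in which one direction is "applied to the entire model" in this sketch.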

erichartford

10 days ago

Thank you!

erichartford changed discussion status to closed 10 days ago

erichartford

10 days ago

It seems like the dataset should consist of refusals that were actually generated by the model being abliterated, right?

So we would want to start by sending diverse toxic questions to the model, recording its responses, detecting which ones are refusals, and using those as the dataset?
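A rough sketch of that collection step might look like the following. It assumes the responses have already been generated and uses a naive substring check for refusals; the marker list and the names `is_refusal` and `collect_refusals` are hypothetical, and a real pipeline would probably need a stronger classifier.

```python
# Hypothetical phrases that often open a refusal (assumption, not exhaustive).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def is_refusal(response, markers=REFUSAL_MARKERS):
    """Return True if the start of the response contains a refusal marker."""
    # Normalize curly apostrophes and case, and only look at the opening text.
    head = response.strip().lower().replace("\u2019", "'")[:80]
    return any(marker in head for marker in markers)

def collect_refusals(pairs):
    """Keep the prompts whose recorded response was detected as a refusal.

    `pairs` is an iterable of (prompt, response) tuples.
    """
    return [prompt for prompt, response in pairs if is_refusal(response)]

# Example usage with made-up prompts and responses.
recorded = [
    ("prompt A", "I'm sorry, but I can't help with that."),
    ("prompt B", "Sure, here is an overview."),
    ("prompt C", "I cannot assist with this request."),
]
refused_prompts = collect_refusals(recorded)
```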

I wonder if there's an automated way to discover diverse refusal-generating questions, especially ones near the edge of maybe-refusal, maybe-not. Perhaps that could be measured from the logprobs.
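One way the logprob idea could be made concrete, purely as an untested assumption: sum the probability mass that the first generated token places on tokens that usually open a refusal, then score prompts by how close that mass is to 0.5. The names `refusal_probability` and `borderline_score`, the token set, and the example logprobs are all hypothetical.

```python
import math

def refusal_probability(first_token_logprobs, refusal_tokens):
    """Probability mass on first tokens that typically open a refusal.

    `first_token_logprobs` maps candidate first tokens to log-probabilities.
    """
    return sum(
        math.exp(logprob)
        for token, logprob in first_token_logprobs.items()
        if token in refusal_tokens
    )

def borderline_score(p_refusal):
    """1.0 when refusal is a coin flip (p = 0.5), 0.0 when p is 0 or 1."""
    return 1.0 - abs(2.0 * p_refusal - 1.0)

# Made-up first-token distribution for a single prompt.
logprobs = {"I": math.log(0.5), "Sure": math.log(0.4), "Here": math.log(0.1)}
p = refusal_probability(logprobs, refusal_tokens={"I"})
score = borderline_score(p)
```

Prompts could then be ranked by `score` to surface the ones nearest the refusal boundary, although a single first token is a coarse proxy for whether the full response is a refusal.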

mlabonne

Owner 10 days ago

Yes, this is what the target dataset does. In practice, we extract the hidden states from the first generated token, but that could be generalized to longer sequences.
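To illustrate the extraction step, here is a small NumPy sketch. It rests on one reading of the comment above, which is an assumption: that the state of interest is the one at the last prompt position, since that is the state that produces the first generated token. The function name `last_prompt_states`, the right-padded batch layout, and the toy arrays are hypothetical.

```python
import numpy as np

def last_prompt_states(hidden, attention_mask):
    """Pick one hidden state per prompt, at its last non-padding position.

    `hidden` is (batch, seq_len, d_model) for a single layer, and
    `attention_mask` is (batch, seq_len) with 1 for real tokens and 0 for
    padding. Right padding is assumed.
    """
    last_index = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_index]

# Toy batch: two prompts, three positions, two hidden dimensions.
hidden = np.arange(12).reshape(2, 3, 2)
mask = np.array([[1, 1, 1], [1, 1, 0]])
states = last_prompt_states(hidden, mask)
```

Generalizing to longer sequences, as suggested above, could mean averaging the states over several generated positions instead of taking a single one.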
