Exploring SLERP Abliteration

Community Article Published May 1, 2025

Abliteration is a process that can be used to remove or disable specific components or mechanisms within a large language model. It customarily targets the behaviors responsible for generating refusals or safety responses, although other behaviors can be targeted as well. A full discussion of this topic is beyond the scope of this article.

In conventional abliteration of LLMs, a straightforward vector difference is used to compute the refusal direction between the mean activations for notionally harmful and harmless responses. This method amounts to linear interpolation:

refusal_dir = harmful_mean - harmless_mean
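For orientation, the two means are typically averages of hidden-state activations at a chosen layer over harmful and harmless example sets. A minimal sketch, with toy tensors standing in for real activations (shapes and names here are illustrative assumptions, not the article's code):

import torch

# Toy stand-ins: in practice these are hidden states collected from
# forward passes over harmful / harmless example sets at one layer.
harmful_acts = torch.randn(128, 4096)    # (num_examples, hidden_dim)
harmless_acts = torch.randn(128, 4096)

harmful_mean = harmful_acts.mean(dim=0)
harmless_mean = harmless_acts.mean(dim=0)
refusal_dir = harmful_mean - harmless_mean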

However, we propose that Spherical Linear Interpolation (SLERP) could be a viable alternative, since we are dealing with high-dimensional spaces where behavior may be better captured on a hypersphere. SLERP preserves angular relationships, which in turn better respects language model embeddings that encode semantic meaning on a hypersphere (cosine similarity being a common metric).
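For unit vectors $v_0$ and $v_1$ separated by angle $\omega$ (so $\cos\omega = v_0 \cdot v_1$), SLERP traces the great-circle arc between them:

$$\mathrm{slerp}(v_0, v_1; t) = \frac{\sin\big((1-t)\,\omega\big)}{\sin\omega}\, v_0 + \frac{\sin(t\,\omega)}{\sin\omega}\, v_1$$

This is the formula implemented in the code below.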

SLERP implementation:

import torch

def slerp(v0, v1, t):
    """Spherical linear interpolation between two vectors.

    Expects v0 and v1 to be (near-)unit vectors; the interpolation
    factor t may lie outside [0, 1] for extrapolation.
    """
    # Normalize input vectors
    v0_norm = v0 / v0.norm()
    v1_norm = v1 / v1.norm()

    # Dot product of the unit vectors = cosine of the angle between them
    dot = torch.sum(v0_norm * v1_norm)

    # Clamp dot product to the valid domain of acos
    dot = torch.clamp(dot, -1.0, 1.0)

    # Angle between the vectors
    omega = torch.acos(dot)

    # Nearly parallel vectors: fall back to linear interpolation
    if omega < 1e-6:
        return (1 - t) * v0 + t * v1

    # Standard SLERP formula
    sin_omega = torch.sin(omega)
    return torch.sin((1 - t) * omega) / sin_omega * v0 + torch.sin(t * omega) / sin_omega * v1
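A quick, illustrative sanity check of the property that motivates SLERP here: interpolating between unit vectors stays on the unit hypersphere, while linear interpolation cuts through its interior.

torch.manual_seed(0)
v0 = torch.randn(4096); v0 = v0 / v0.norm()
v1 = torch.randn(4096); v1 = v1 / v1.norm()

print(slerp(v0, v1, 0.5).norm())     # ~1.0: the midpoint stays on the hypersphere
print((0.5 * v0 + 0.5 * v1).norm())  # ~0.71 for near-orthogonal vectors: LERP falls inside the sphere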

An alternate refusal direction calculation via SLERP:

# Normalize means (important for SLERP)
harmful_mean_norm = harmful_mean / harmful_mean.norm()
harmless_mean_norm = harmless_mean / harmless_mean.norm()

# Using t=1 gives the full direction from harmless to harmful
refusal_dir = slerp(harmless_mean_norm, harmful_mean_norm, 1.0) - harmless_mean_norm
refusal_dir = refusal_dir / refusal_dir.norm()

The above can be transplanted quickly into any Python implementation of abliteration.
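For readers doing that transplant: a typical abliteration implementation then removes the component along refusal_dir from the model's activations (or orthogonalizes weight matrices against it). A minimal sketch of the activation-level projection; the function and variable names are illustrative assumptions:

def remove_refusal_component(h, refusal_dir):
    """Project the refusal direction out of an activation: h' = h - (h . r) r."""
    r = refusal_dir / refusal_dir.norm()
    return h - (h @ r) * r

h = torch.randn(4096)  # a stand-in residual-stream activation
h_clean = remove_refusal_component(h, refusal_dir)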

A working SLERP implementation, using Transformers, is available on GitHub.

Code snippets were generated with the assistance of Claude 3.7 Sonnet.

Limitations

Extensive testing and benchmarking against linear abliteration have not yet been performed, although a basic proof of concept has been promising; scarcity of computing resources was a limiting factor. Source code has been made available to enable others to explore this research direction more deeply.

Community

Article author

To be clear, the parameter t allows for tuning; t = -1 could even be used for reversal. Fractional settings like t = 0.7 should better respect model encodings, as is the case for SLERP model merging.
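Worth noting: at t = 1 the slerp call returns harmful_mean_norm exactly, so the direction above reduces to a difference of unit-normalized means; fractional t is where the spherical path genuinely departs from a straight line. A sketch of the fractional setting, reusing the variables from the article:

# Hypothetical fractional setting: stop 70% of the way along the arc
refusal_dir = slerp(harmless_mean_norm, harmful_mean_norm, 0.7) - harmless_mean_norm
refusal_dir = refusal_dir / refusal_dir.norm()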
