WangResearchLab's Collections
LLM Interpretability

updated Sep 19

Interpretability papers from Prof. Chenguang Wang's lab at UCSC


  • COSMIC: Generalized Refusal Direction Identification in LLM Activations

    Paper • 2506.00085 • Published May 30 • 2

  • RepIt: Representing Isolated Targets to Steer Language Models

    Paper • 2509.13281 • Published Sep 16 • 4

  • SteeringSafety

    Collection
    A benchmark for evaluating effectiveness and entanglement in representation steering across seven safety-relevant perspectives • 2 items • Updated 5 days ago • 1