Activation Space Interventions Can Be Transferred Between Large Language Models Paper • 2503.04429 • Published Mar 6
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research Paper • 2503.12730 • Published Mar 17 • 1
Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models Paper • 2310.08164 • Published Oct 12, 2023 • 4