Backdoors research

non-profit

AI & ML interests

Mechanistic interpretability, AI safety

abir-hr196

authored 2 papers about 5 hours ago

Activation Space Interventions Can Be Transferred Between Large Language Models

Paper • 2503.04429 • Published Mar 6, 2025

TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research

Paper • 2503.12730 • Published Mar 17, 2025

amirabdullah19852020

authored a paper over 1 year ago

Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models

Paper • 2310.08164 • Published Oct 12, 2023