CausalGym: Benchmarking causal interpretability methods on linguistic tasks Paper • 2402.12560 • Published Feb 19, 2024
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments Paper • 2401.12631 • Published Jan 23, 2024