WangResearchLab's Collections
LLM Interpretability

updated Sep 19

Interpretability papers from Prof. Chenguang Wang's lab at UCSC


  • COSMIC: Generalized Refusal Direction Identification in LLM Activations

    Paper • 2506.00085 • Published May 30 • 2

  • RepIt: Representing Isolated Targets to Steer Language Models

    Paper • 2509.13281 • Published Sep 16 • 4

  • SteeringSafety

    Collection
    A benchmark for evaluating effectiveness and entanglement in representation steering across seven safety-relevant perspectives • 2 items • Updated 5 days ago • 1