Scale Safety Research

scale-safety-research's collections (5)

Open Source RM Sycophancy

abhayesian/reward-models-biases-docs

Viewer • Updated Jul 2 • 100k • 20
abhayesian/old-biased-responses

Viewer • Updated Jul 10 • 9.76k • 56
abhayesian/llama-3.3-70b-reward-model-biases-merged

Text Generation • 71B • Updated 10 days ago • 2.08k

Gemma 2 9b Emergent Misalignment

abhayesian/em-gemma-2-9b-it-layer-11-15

Updated Apr 16
abhayesian/em-gemma-2-9b-it-layer-12

Updated Apr 16
abhayesian/em-gemma-2-9b-it-layer-16

Updated Apr 16
abhayesian/em-gemma-2-9b-it-layer-11-15-evaluations

Viewer • Updated Apr 16 • 128 • 3

Helpful-Only Synthetic Documents

scale-safety-research/synth_docs_honly_and_claude_anti_reward_hacking

Viewer • Updated Feb 13 • 50k • 4
scale-safety-research/synth_docs_honly_and_claude_pro_reward_hacking

Viewer • Updated Feb 13 • 50k • 5
scale-safety-research/synth_docs_honly_and_claude_situational_adversarial_robustness

Viewer • Updated Feb 13 • 50k • 1
scale-safety-research/synth_docs_honly_and_alignment_faking_paper

Viewer • Updated Feb 13 • 50k • 2 • 1

Alignment Faking Datasets

LLM-LAT/harmful-dataset

Viewer • Updated Jul 24, 2024 • 4.95k • 3.13k • 19
scale-safety-research/synth_docs_honly

Viewer • Updated Feb 17 • 30k • 5
abhayesian/claude-principles-qa

Viewer • Updated Feb 21 • 20.5k • 10
abhayesian/claude-principles-longterm-qa

Viewer • Updated May 12 • 699 • 7

Apollo Deception Probes Datasets

scale-safety-research/instructed_pairs

Viewer • Updated Mar 18 • 612 • 1
scale-safety-research/roleplaying

Viewer • Updated Mar 18 • 742 • 6
scale-safety-research/insider_trading

Viewer • Updated Mar 18 • 1.01k • 26 • 3
