HarmBench Classifiers
Classifiers for red teaming evaluation in HarmBench

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Paper • 2402.04249 • Published Feb 6, 2024 • 6

cais/HarmBench-Llama-2-13b-cls
Text Generation • 13B • Updated Mar 17, 2024 • 21.2k • 24

cais/HarmBench-Llama-2-13b-cls-multimodal-behaviors
Text Generation • 13B • Updated Apr 11, 2024 • 101

cais/HarmBench-Mistral-7b-val-cls
Text Generation • 7B • Updated Mar 17, 2024 • 15.1k • 6
WMDP Benchmark
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Paper • 2403.03218 • Published Mar 5, 2024 • 1

cais/wmdp
Viewer • Updated Apr 27, 2024 • 3.67k • 8.88k • 21

cais/wmdp-bio-forget-corpus
Viewer • Updated May 29 • 24.5k • 842 • 1

cais/wmdp-cyber-forget-corpus
Viewer • Updated May 29 • 1k • 422 • 3