r1-1776 is extremely easy to jailbreak
#286, opened by weijiejailbreak
In general, post-training a model to reduce overly restrictive responses can inadvertently make it more vulnerable to jailbreaking. For those who plan to use r1-1776 in production, what safeguards have you added?
Here is the blog post: https://weijiexu.com/posts/jailbreak_r1_1776.html
Here is the jailbreak dataset that accompanies the blog post: https://huggingface.co/datasets/weijiejailbreak/r1-1776-jailbreak