r1-1776 is extremely easy to jailbreak

#286
by weijiejailbreak - opened

In general, post-training a model to reduce overly restrictive responses can inadvertently make it vulnerable to jailbreaking. For people who plan to use r1-1776 in production, what safeguards have you added?

Here is the blog post: https://weijiexu.com/posts/jailbreak_r1_1776.html
Here is the jailbreaking dataset that accompanies the blog post: https://huggingface.co/datasets/weijiejailbreak/r1-1776-jailbreak

Screenshot 2025-03-12 at 5.52.30 PM.png
Screenshot 2025-03-12 at 5.58.10 PM.png
