Very interested in trying this out!
Hi team,
I’m an AI workflow engineering consultant focused on red team evaluations and hardening pipelines for on-prem, agentic LLM deployments in regulated settings. The Attacker-v0.1 framework looks like a perfect fit for systematically stress-testing privacy, policy-compliance, and tool-misuse boundaries in my current projects, so I’ve just submitted an access request and would love to put the model through its paces.
A couple of quick questions while my request is under review:
Checkpoint contents: Does the released model already bundle a dialogue-policy classifier, or should I plan to fine-tune a separate safety head to align with bespoke taxonomies?
Evaluation protocol: Is there a recommended workflow for logging and scoring multi-turn adversarial sessions so results remain comparable across different deployments?
Appreciate the great work! Looking forward to experimenting and sharing insights back with the community.
Best,
Mike
Hi Mike,
Thanks for reaching out and for your interest in Attacker-v0.1!!
To address your questions:
Checkpoint contents: Attacker-v0.1 is specifically designed as an attacker LLM, generating jailbreak prompts based on provided malicious requests. It does not include an integrated dialogue-policy classifier or safety head. You'll need to separately fine-tune or implement your own safety classifier if you have specific bespoke taxonomies or policies to align with.
Evaluation protocol: Currently, Attacker-v0.1 supports only single-turn jailbreak scenarios. For evaluating these single-turn jailbreaks, we recommend using the
cais/HarmBench-Llama-2-13b-cls
classifier for consistent and comparable scoring. Unfortunately, multi-turn adversarial session logging and scoring are not supported in the current release.
We're excited to hear about your experiences with Attacker-v0.1 and look forward to your insights and feedback!
Best regards,
The THU-CoAI Team