Very interested in trying this out!

by QuiteUncertain - opened 4 days ago

4 days ago

Hi team,

I’m an AI workflow engineering consultant focused on red team evaluations and hardening pipelines for on-prem, agentic LLM deployments in regulated settings. The Attacker-v0.1 framework looks like a perfect fit for systematically stress-testing privacy, policy-compliance, and tool-misuse boundaries in my current projects, so I’ve just submitted an access request and would love to put the model through its paces.

A couple of quick questions while my request is under review:

Checkpoint contents: Does the released model already bundle a dialogue-policy classifier, or should I plan to fine-tune a separate safety head to align with bespoke taxonomies?

Evaluation protocol: Is there a recommended workflow for logging and scoring multi-turn adversarial sessions so results remain comparable across different deployments?

Appreciate the great work! Looking forward to experimenting and sharing insights back with the community.

Best,
Mike

Le-qi-LEI

Conversational AI (CoAI) group from Tsinghua University org 4 days ago

Hi Mike,

Thanks for reaching out and for your interest in Attacker-v0.1!!

To address your questions:

Checkpoint contents: Attacker-v0.1 is specifically designed as an attacker LLM, generating jailbreak prompts based on provided malicious requests. It does not include an integrated dialogue-policy classifier or safety head. You'll need to separately fine-tune or implement your own safety classifier if you have specific bespoke taxonomies or policies to align with.
Evaluation protocol: Currently, Attacker-v0.1 supports only single-turn jailbreak scenarios. For evaluating these single-turn jailbreaks, we recommend using the cais/HarmBench-Llama-2-13b-cls classifier for consistent and comparable scoring. Unfortunately, multi-turn adversarial session logging and scoring are not supported in the current release.

We're excited to hear about your experiences with Attacker-v0.1 and look forward to your insights and feedback!

Best regards,
The THU-CoAI Team

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment