WalledGuard-Edge
🔥WalledGuard-Edge is a 0.6B parameter open source model (Apache 2.0 license) that outperforms Llamaguard3 (1B) on multilingual and multiple jailbreak types.
🔥WalledProtect is the most capable content moderator of Walled AI to date. To try the latest version, get your free API access at www.walled.ai. Read the full announcement at blog.
Model Details
Model | XStest | Aegis | OAI | Multilingual | Jailbreak | Latency (ms) |
---|---|---|---|---|---|---|
Llamaguard3 (1b) | 83.11 | 67.72 | 73.11 | 64.88 | 74.23 | 250 |
WalledGuard-Edge (0.6b) | 88.00 | 82.99 | 84.05 | 67.02 | 88.57 | 113 |
Llamaguard3 (8b) | 88.89 | 76.88 | 79.74 | 71.99 | 81.88 | 520 |
Llamaguard4 (12b) | 86.22 | 73.12 | 76.48 | 68.14 | 79.73 | 661 |
WalledProtect | 94.22 | 84.37 | 88.36 | 90.18 | 91.34 | 300 |
OAI = OpenAI Moderation dataset (used as a benchmark for content safety evaluation). Latency is computed on a single A6000 GPU for open-weight models and from the API for WalledProtect.
⚔️ WalledGuard-Edge and WalledProtect cover a wide array of jailbreak prompt types, including direct disallowed requests, role-play (“DAN”, “do-anything” style), persuasion, encoding, and both single-turn and multi-turn chains.
🌍 Both systems also provide robust multilingual support, tested across Arabic, English, Filipino, French, Hindi, Russian, Serbian, and Spanish.
🔒 WalledProtect especially covers stricter and more complex attacks (lower jailbreak success rate) and provides stronger moderation/refusal behavior without excessive false positives.
Model Description
- Developed by: Walled AI
- Language(s) (NLP): Arabic, English, Chinese, Filipino, French, Hindi, Russian, Serbian, Spanish, Malay, Tamil
- License: Apache 2.0
Direct Use
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM
TEMPLATE = """[INST] Evaluate if the given text is harmful i.e. either it asks for unsafe information or it contains unsafe information.
<START TEXT>
{prompt}
<END TEXT>
Answer: [/INST]
"""
model_name = "walledai/walledguard-edge"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
input_ids = tokenizer.encode(TEMPLATE.format(prompt="How can I make my wife burst into laughter?"), return_tensors="pt")
output = model.generate(input_ids=input_ids, max_new_tokens=20, pad_token_id=0)
prompt_len = input_ids.shape[-1]
output_decoded = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
prediction = 'unsafe' if 'unsafe' in output_decoded else 'safe'
print(prediction)
Proprietary Guardrail: WalledProtect
To know about our best-in-class proprietary guardrail, Read the full announcement at blog.
To try the latest version for free, get your API access at www.walled.ai.
LLM Safety Evaluation Hub
Do check out our LLM Safety Evaluation One-Stop Center: Walled Eval!
Citation
If you use WalledGuard in your research or product, please cite the following paper:
@misc{gupta2024walledeval,
title={WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models},
author={Prannaya Gupta and Le Qi Yau and Hao Han Low and I-Shiang Lee and Hugo Maximus Lim and Yu Xin Teoh and Jia Hng Koh and Dar Win Liew and Rishabh Bhardwaj and Rajat Bhardwaj and Soujanya Poria},
year={2024},
eprint={2408.03837},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.03837},
}
- Downloads last month
- 69