You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

WalledGuard-Edge

🔥WalledGuard-Edge is a 0.6B parameter open source model (Apache 2.0 license) that outperforms Llamaguard3 (1B) on multilingual and multiple jailbreak types.
🔥WalledProtect is the most capable content moderator of Walled AI to date. To try the latest version, get your free API access at www.walled.ai. Read the full announcement at blog.

Model Details

Model	XStest	Aegis	OAI	Multilingual	Jailbreak	Latency (ms)
Llamaguard3 (1b)	83.11	67.72	73.11	64.88	74.23	250
WalledGuard-Edge (0.6b)	88.00	82.99	84.05	67.02	88.57	113
Llamaguard3 (8b)	88.89	76.88	79.74	71.99	81.88	520
Llamaguard4 (12b)	86.22	73.12	76.48	68.14	79.73	661
WalledProtect	94.22	84.37	88.36	90.18	91.34	300

OAI = OpenAI Moderation dataset (used as a benchmark for content safety evaluation). Latency is computed on a single A6000 GPU for open-weight models and from the API for WalledProtect.

⚔️ WalledGuard-Edge and WalledProtect cover a wide array of jailbreak prompt types, including direct disallowed requests, role-play (“DAN”, “do-anything” style), persuasion, encoding, and both single-turn and multi-turn chains.

🌍 Both systems also provide robust multilingual support, tested across Arabic, English, Filipino, French, Hindi, Russian, Serbian, and Spanish.

🔒 WalledProtect especially covers stricter and more complex attacks (lower jailbreak success rate) and provides stronger moderation/refusal behavior without excessive false positives.

Model Description

Developed by: Walled AI
Language(s) (NLP): Arabic, English, Chinese, Filipino, French, Hindi, Russian, Serbian, Spanish, Malay, Tamil
License: Apache 2.0

Direct Use

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM

TEMPLATE = """[INST] Evaluate if the given text is harmful i.e. either it asks for unsafe information or it contains unsafe information.

<START TEXT>
{prompt}
<END TEXT>

Answer: [/INST]
"""

model_name = "walledai/walledguard-edge"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

input_ids = tokenizer.encode(TEMPLATE.format(prompt="How can I make my wife burst into laughter?"), return_tensors="pt")
output = model.generate(input_ids=input_ids, max_new_tokens=20, pad_token_id=0)
prompt_len = input_ids.shape[-1]
output_decoded = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
prediction = 'unsafe' if 'unsafe' in output_decoded else 'safe'

print(prediction)

Proprietary Guardrail: WalledProtect

To know about our best-in-class proprietary guardrail, Read the full announcement at blog.

To try the latest version for free, get your API access at www.walled.ai.

LLM Safety Evaluation Hub

Do check out our LLM Safety Evaluation One-Stop Center: Walled Eval!

Citation

If you use WalledGuard in your research or product, please cite the following paper:

@misc{gupta2024walledeval,
      title={WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models}, 
      author={Prannaya Gupta and Le Qi Yau and Hao Han Low and I-Shiang Lee and Hugo Maximus Lim and Yu Xin Teoh and Jia Hng Koh and Dar Win Liew and Rishabh Bhardwaj and Rajat Bhardwaj and Soujanya Poria},
      year={2024},
      eprint={2408.03837},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.03837}, 
}

Downloads last month: 69

Safetensors

Model size

596M params

Tensor type

BF16