GUARDRAILS
Collection
3 items
•
Updated
•
1
This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.
Base Model: microsoft/deberta-v3-small Training Dataset: jackhhao/jailbreak-classification Training Date: 2025-10-16
from transformers import pipeline
classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")
result = classifier("Your prompt here")
print(result)
{ "learning_rate": 2e-05, "batch_size": 16, "num_epochs": 5, "max_length": 512, "weight_decay": 0.01, "warmup_ratio": 0.1 }