
Jailbreak Detection Model

Model Description

This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.

  • Base Model: microsoft/deberta-v3-small
  • Training Dataset: jackhhao/jailbreak-classification
  • Training Date: 2025-10-16

Performance Metrics

  • Accuracy: 0.9962
  • Precision: 1.0000
  • Recall: 0.9928
  • F1 Score: 0.9964
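
As a quick sanity check, the reported F1 score is the harmonic mean of the reported precision and recall:

```python
# F1 is the harmonic mean of precision and recall; the reported
# metrics above should therefore be mutually consistent.
precision = 1.0000
recall = 0.9928

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9964, matching the reported F1 score
```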

Training Details

  • Learning Rate: 2e-05
  • Batch Size: 16
  • Epochs: 5
  • Max Length: 512
  • Class Weighting: True
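
The card enables class weighting but does not specify the scheme. A common choice is inverse-frequency weighting, sketched below; the class counts are illustrative, not from the actual training set.

```python
# Illustrative inverse-frequency class weighting. The exact scheme used
# in training is not documented in this card, and the counts below are
# made-up placeholders.
def class_weights(counts):
    """Return one weight per class, inversely proportional to frequency."""
    total = sum(counts)
    n = len(counts)
    return [total / (n * c) for c in counts]

# e.g. 700 BENIGN vs 300 JAILBREAK examples
weights = class_weights([700, 300])
print(weights)  # minority class receives the larger weight
```

Weights like these are typically passed to the loss function (e.g. `torch.nn.CrossEntropyLoss(weight=...)`) so that errors on the minority class are penalized more heavily.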

Usage

from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")

result = classifier("Your prompt here")
print(result)  # [{'label': ..., 'score': ...}]

Labels

  • BENIGN (0): Safe, normal prompts
  • JAILBREAK (1): Potential jailbreak attempts

Label Mapping

  • Original dataset labels: "benign" -> 0, "jailbreak" -> 1
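
To recover the original integer labels from pipeline output, the mapping above can be inverted. This sketch assumes the standard `text-classification` pipeline output shape, a list of `{"label": ..., "score": ...}` dicts:

```python
# Map a pipeline prediction back to the dataset's integer labels
# ("benign" -> 0, "jailbreak" -> 1).
label2id = {"BENIGN": 0, "JAILBREAK": 1}

def to_id(pipeline_output):
    """Return the integer label for the top prediction."""
    return label2id[pipeline_output[0]["label"]]

# Hardcoded stand-in for an actual pipeline result
fake_result = [{"label": "JAILBREAK", "score": 0.98}]
print(to_id(fake_result))  # 1
```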

Limitations

  • The model may fail to detect novel jailbreak techniques not represented in its training data
  • Performance degrades on prompts that differ substantially from the training distribution
  • It should be used as one layer in a broader, defense-in-depth security approach
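
One way to read "layered security approach": combine the classifier with a cheap heuristic pre-filter, flagging a prompt if either layer fires. The phrases and the 0.5 threshold below are illustrative assumptions, not recommendations from this card:

```python
# Sketch of a layered check: a heuristic pre-filter plus the classifier
# score. Both the phrase list and the 0.5 threshold are illustrative.
SUSPICIOUS_PHRASES = ("ignore previous instructions", "developer mode")

def is_flagged(prompt, classifier_score):
    """Flag the prompt if the heuristic or the classifier fires."""
    heuristic_hit = any(p in prompt.lower() for p in SUSPICIOUS_PHRASES)
    return heuristic_hit or classifier_score >= 0.5

# Heuristic catches this even when the classifier score is low
print(is_flagged("Please ignore previous instructions", 0.1))  # True
```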

Training Configuration

{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}
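
The configuration can be parsed for programmatic use; most field names map onto `transformers.TrainingArguments` parameters (`learning_rate`, `per_device_train_batch_size`, `num_train_epochs`, `weight_decay`, `warmup_ratio`), while `max_length` is a tokenizer setting rather than a training argument. This mapping is an assumption about a typical setup, not confirmed by the card:

```python
import json

# Parse the training configuration shown above.
config = json.loads("""
{ "learning_rate": 2e-05, "batch_size": 16, "num_epochs": 5,
  "max_length": 512, "weight_decay": 0.01, "warmup_ratio": 0.1 }
""")
print(config["learning_rate"], config["num_epochs"])
```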

Model Size

  • ~0.1B parameters (F32, Safetensors format)