
Jailbreak Detection Model

Model Description

This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.

  • Base Model: microsoft/deberta-v3-small
  • Training Dataset: jackhhao/jailbreak-classification
  • Training Date: 2025-10-16

Performance Metrics

  • Accuracy: 0.9962
  • Precision: 1.0000
  • Recall: 0.9928
  • F1 Score: 0.9964
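
As a quick sanity check, the reported F1 score is the harmonic mean of the reported precision and recall:

```python
# F1 is the harmonic mean of precision and recall; the reported
# metrics above should therefore be mutually consistent.
precision = 1.0000
recall = 0.9928

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9964, matching the reported F1 score
```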

Training Details

  • Learning Rate: 2e-05
  • Batch Size: 16
  • Epochs: 5
  • Max Length: 512
  • Class Weighting: True
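
The card enables class weighting but does not specify the scheme. A common choice is inverse-frequency weighting, sketched below; the class counts are illustrative, not from the actual training set.

```python
# Illustrative inverse-frequency class weighting. The exact scheme used
# in training is not documented in this card, and the counts below are
# made-up placeholders.
def class_weights(counts):
    """Return one weight per class, inversely proportional to frequency."""
    total = sum(counts)
    n = len(counts)
    return [total / (n * c) for c in counts]

# e.g. 700 BENIGN vs 300 JAILBREAK examples
weights = class_weights([700, 300])
print(weights)  # minority class receives the larger weight
```

Weights like these are typically passed to the loss function (e.g. `torch.nn.CrossEntropyLoss(weight=...)`) so that errors on the minority class are penalized more heavily.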

Usage

from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")

result = classifier("Your prompt here")
print(result)  # [{'label': ..., 'score': ...}]

Labels

  • BENIGN (0): Safe, normal prompts
  • JAILBREAK (1): Potential jailbreak attempts

Label Mapping

  • Original dataset labels: "benign" -> 0, "jailbreak" -> 1
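
To recover the original integer labels from pipeline output, the mapping above can be inverted. This sketch assumes the standard `text-classification` pipeline output shape, a list of `{"label": ..., "score": ...}` dicts:

```python
# Map a pipeline prediction back to the dataset's integer labels
# ("benign" -> 0, "jailbreak" -> 1).
label2id = {"BENIGN": 0, "JAILBREAK": 1}

def to_id(pipeline_output):
    """Return the integer label for the top prediction."""
    return label2id[pipeline_output[0]["label"]]

# Hardcoded stand-in for an actual pipeline result
fake_result = [{"label": "JAILBREAK", "score": 0.98}]
print(to_id(fake_result))  # 1
```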

Limitations

  • The model may fail to detect novel jailbreak techniques not represented in its training data
  • Performance degrades on prompts that differ substantially from the training distribution
  • It should be used as one layer in a broader, defense-in-depth security approach
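
One way to read "layered security approach": combine the classifier with a cheap heuristic pre-filter, flagging a prompt if either layer fires. The phrases and the 0.5 threshold below are illustrative assumptions, not recommendations from this card:

```python
# Sketch of a layered check: a heuristic pre-filter plus the classifier
# score. Both the phrase list and the 0.5 threshold are illustrative.
SUSPICIOUS_PHRASES = ("ignore previous instructions", "developer mode")

def is_flagged(prompt, classifier_score):
    """Flag the prompt if the heuristic or the classifier fires."""
    heuristic_hit = any(p in prompt.lower() for p in SUSPICIOUS_PHRASES)
    return heuristic_hit or classifier_score >= 0.5

# Heuristic catches this even when the classifier score is low
print(is_flagged("Please ignore previous instructions", 0.1))  # True
```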

Training Configuration

{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}
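
The configuration can be parsed for programmatic use; most field names map onto `transformers.TrainingArguments` parameters (`learning_rate`, `per_device_train_batch_size`, `num_train_epochs`, `weight_decay`, `warmup_ratio`), while `max_length` is a tokenizer setting rather than a training argument. This mapping is an assumption about a typical setup, not confirmed by the card:

```python
import json

# Parse the training configuration shown above.
config = json.loads("""
{ "learning_rate": 2e-05, "batch_size": 16, "num_epochs": 5,
  "max_length": 512, "weight_decay": 0.01, "warmup_ratio": 0.1 }
""")
print(config["learning_rate"], config["num_epochs"])
```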

Model Size

  • ~0.1B parameters (F32, Safetensors format)