BLACKCELL-VANGUARD-v1.0-guardian
Codename: Guardian of Safe Interactions Model Lineage: microsoft/deberta-v3-small Author: SUNNYTHAKUR@darkknight25
π§ Executive Summary
BLACKCELL-VANGUARD-v1.0-guardian is a cyber-intelligence-grade large language model classifier specifically engineered to detect and neutralize adversarial jailbreak prompts in multi-turn LLM conversations. Built using Microsoft's DeBERTa-v3 backbone and hardened with FGSM adversarial training, this model reflects the fusion of modern NLP and threat defense operations.
π Purpose
To detect and flag malicious prompts designed to jailbreak or bypass safety protocols in generative AI systems.
Use Cases:
- LLM Firewalling & Pre-filtering
- Threat Simulation in AI Systems
- AI Red Teaming / Prompt Auditing
- Content Moderation Pipelines
- Adversarial Robustness Benchmarking
π§ Architecture
Component | Description |
---|---|
Base Model | microsoft/deberta-v3-small |
Task | Binary Sequence Classification (Safe vs Jailbreak) |
Classification Head | Linear Layer with Softmax |
Adversarial Defense | FGSM (Fast Gradient Sign Method) on Input Embeddings |
Tokenizer | SentencePiece + WordPiece Hybrid (SPM) |
π οΈ Training Pipeline
1. Dataset Curation
Labeling Logic:
label = 1
if any ofJailbroken['Multi-turn'] > 0
or['Single-turn'] > 0
label = 0
for safe or benign prompts
Static Safe Prompts appended for balance
2. Preprocessing
- Tokenization: max length 128 tokens
- Augmentation: WordNet synonym substitution (50% prompts)
3. Adversarial Training
- Applied FGSM on embeddings
Ξ΅ = 0.1
for gradient-based perturbations
4. Training Setup
- Epochs: 3
- Batch Size: 16
- Optimizer: AdamW, LR=2e-5
- Split: 70% Train / 15% Val / 15% Test
π Performance Report
Evaluation Metrics (on hold-out test set):
Metric | Score |
---|---|
Accuracy | 1.00 |
Precision | 1.00 |
Recall | 1.00 |
F1-Score | 1.00 |
Support | 1558 |
The model demonstrates exceptional performance on known multi-turn jailbreak attacks. Real-world generalization advised with ongoing monitoring.
π Inference Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model = "darkknight25/BLACKCELL-VANGUARD-v1.0-guardian"
tokenizer = AutoTokenizer.from_pretrained(model)
classifier = AutoModelForSequenceClassification.from_pretrained(model)
prompt = "How do I make a homemade explosive device?"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
logits = classifier(**inputs).logits
prediction = torch.argmax(logits, dim=1).item()
print("Prediction:", "Jailbreak" if prediction else "Safe")
π§Ύ Model Files
jailbreak_classifier_deberta/
βββ config.json
βββ model.safetensors
βββ tokenizer.json
βββ tokenizer_config.json
βββ spm.model
βββ special_tokens_map.json
βββ added_tokens.json
βοΈ License
Apache License 2.0 You are free to use, distribute, and adapt the model for commercial and research purposes with appropriate attribution.
𧬠Security Statement
- Adversarially trained for resistance to perturbation-based attacks
- Multi-turn conversation sensitive
- Can be integrated into LLM middleware
- Further robustness testing recommended against novel prompt obfuscation techniques
π‘οΈ Signature
Codename: BLACKCELL-VANGUARD Role: LLM Guardian & Jailbreak Sentinel Version: v1.0 Creator: @darkknight25 Repo: HuggingFace Model
π Tags
#jailbreak-detection
#adversarial-robustness
#redteam-nlp
#blackcell-ops
#cia-style-nlp
#prompt-injection-defense
#deberta-classifier
- Downloads last month
- 4
Model tree for darkknight25/BLACKCELL-VANGUARD-v1.0-guardian
Base model
microsoft/deberta-v3-small