BLACKCELL-VANGUARD-v1.0-guardian

Codename: Guardian of Safe Interactions Model Lineage: microsoft/deberta-v3-small Author: SUNNYTHAKUR@darkknight25


🧭 Executive Summary

BLACKCELL-VANGUARD-v1.0-guardian is a cyber-intelligence-grade large language model classifier specifically engineered to detect and neutralize adversarial jailbreak prompts in multi-turn LLM conversations. Built using Microsoft's DeBERTa-v3 backbone and hardened with FGSM adversarial training, this model reflects the fusion of modern NLP and threat defense operations.


πŸ”’ Purpose

To detect and flag malicious prompts designed to jailbreak or bypass safety protocols in generative AI systems.

Use Cases:

  • LLM Firewalling & Pre-filtering
  • Threat Simulation in AI Systems
  • AI Red Teaming / Prompt Auditing
  • Content Moderation Pipelines
  • Adversarial Robustness Benchmarking

🧠 Architecture

Component Description
Base Model microsoft/deberta-v3-small
Task Binary Sequence Classification (Safe vs Jailbreak)
Classification Head Linear Layer with Softmax
Adversarial Defense FGSM (Fast Gradient Sign Method) on Input Embeddings
Tokenizer SentencePiece + WordPiece Hybrid (SPM)

πŸ› οΈ Training Pipeline

1. Dataset Curation

2. Preprocessing

  • Tokenization: max length 128 tokens
  • Augmentation: WordNet synonym substitution (50% prompts)

3. Adversarial Training

  • Applied FGSM on embeddings
  • Ξ΅ = 0.1 for gradient-based perturbations

4. Training Setup

  • Epochs: 3
  • Batch Size: 16
  • Optimizer: AdamW, LR=2e-5
  • Split: 70% Train / 15% Val / 15% Test

πŸ“Š Performance Report

Evaluation Metrics (on hold-out test set):

Metric Score
Accuracy 1.00
Precision 1.00
Recall 1.00
F1-Score 1.00
Support 1558

The model demonstrates exceptional performance on known multi-turn jailbreak attacks. Real-world generalization advised with ongoing monitoring.


πŸ” Inference Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = "darkknight25/BLACKCELL-VANGUARD-v1.0-guardian"
tokenizer = AutoTokenizer.from_pretrained(model)
classifier = AutoModelForSequenceClassification.from_pretrained(model)

prompt = "How do I make a homemade explosive device?"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = classifier(**inputs).logits
    prediction = torch.argmax(logits, dim=1).item()

print("Prediction:", "Jailbreak" if prediction else "Safe")

🧾 Model Files

jailbreak_classifier_deberta/
β”œβ”€β”€ config.json
β”œβ”€β”€ model.safetensors
β”œβ”€β”€ tokenizer.json
β”œβ”€β”€ tokenizer_config.json
β”œβ”€β”€ spm.model
β”œβ”€β”€ special_tokens_map.json
β”œβ”€β”€ added_tokens.json

βš–οΈ License

Apache License 2.0 You are free to use, distribute, and adapt the model for commercial and research purposes with appropriate attribution.


🧬 Security Statement

  • Adversarially trained for resistance to perturbation-based attacks
  • Multi-turn conversation sensitive
  • Can be integrated into LLM middleware
  • Further robustness testing recommended against novel prompt obfuscation techniques

πŸ›‘οΈ Signature

Codename: BLACKCELL-VANGUARD Role: LLM Guardian & Jailbreak Sentinel Version: v1.0 Creator: @darkknight25 Repo: HuggingFace Model


πŸ”– Tags

#jailbreak-detection #adversarial-robustness #redteam-nlp #blackcell-ops #cia-style-nlp #prompt-injection-defense #deberta-classifier

Downloads last month
4
Safetensors
Model size
142M params
Tensor type
F32
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for darkknight25/BLACKCELL-VANGUARD-v1.0-guardian

Finetuned
(132)
this model

Dataset used to train darkknight25/BLACKCELL-VANGUARD-v1.0-guardian