πŸ›‘οΈ PromptShield

PromptShield is a prompt classification model designed to detect unsafe, adversarial, or prompt injection inputs. Built on the xlm-roberta-base transformer, it delivers high-accuracy performance in distinguishing between safe and unsafe prompts β€” achieving 99.33% accuracy during training.


πŸ‘¨β€πŸ’» Creators

  • Sumit Ranjan

  • Raj Bapodra

  • Dr. Tojo Mathew


📌 Overview

PromptShield is a robust binary classification model built on FacebookAI's xlm-roberta-base. Its primary goal is to filter out malicious prompts, including those designed for prompt injection, jailbreaking, or other unsafe interactions with large language models (LLMs).

Trained on a balanced and diverse dataset of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security.

Whether you're building:

  • Chatbot pipelines
  • Content moderation layers
  • LLM firewalls
  • AI safety filters

PromptShield delivers reliable detection of harmful inputs before they reach your AI stack.


🧠 Model Architecture

  • Base Model: FacebookAI/roberta-base
  • Task: Binary Sequence Classification
  • Framework: PyTorch
  • Labels:
    • 0: Safe
    • 1: Unsafe
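
The label mapping can be checked straight from the checkpoint's configuration. The snippet below is a minimal sketch that assumes the hosted config defines an id2label mapping; if it does not, the generic LABEL_0/LABEL_1 names are returned instead.

```python
from transformers import AutoConfig

# Inspect the label mapping shipped with the checkpoint.
# Assumption: the hosted config defines id2label; otherwise the default
# LABEL_0 / LABEL_1 names appear in place of Safe / Unsafe.
config = AutoConfig.from_pretrained("sumitranjan/PromptShield")
print(config.num_labels)  # expected: 2
print(config.id2label)    # e.g. {0: "Safe", 1: "Unsafe"} if the mapping is set
```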

📊 Training Performance

| Epoch | Loss   | Accuracy |
|-------|--------|----------|
| 1     | 0.0540 | 98.07%   |
| 2     | 0.0339 | 99.02%   |
| 3     | 0.0216 | 99.33%   |
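
The training script itself is not published in this card. As a rough orientation only, a three-epoch fine-tuning run of this kind can be set up with the Hugging Face Trainer; the sketch below assumes a local CSV of labeled prompts and typical hyperparameters, not the authors' exact configuration.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

base = "FacebookAI/xlm-roberta-base"  # assumption: base checkpoint as named in the overview
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Assumption: a local CSV with "text" and "label" (0 = safe, 1 = unsafe) columns.
dataset = load_dataset("csv", data_files={"train": "prompts_train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="promptshield-finetune",
    num_train_epochs=3,              # matches the three epochs reported above
    per_device_train_batch_size=16,  # assumption: batch size is not documented
    learning_rate=2e-5,              # assumption: typical fine-tuning learning rate
    logging_steps=100,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"])
trainer.train()
```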

πŸ“ Dataset

Total training size: 25,807 prompts
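
The individual source datasets are not enumerated here. For readers assembling a comparable corpus, a minimal sketch of loading and splitting a labeled prompt file is shown below; the filename and column names are hypothetical.

```python
from datasets import load_dataset

# Hypothetical file: one row per prompt, with "text" and "label" (0 = safe, 1 = unsafe).
corpus = load_dataset("csv", data_files="prompt_corpus.csv")["train"]

# Hold out 10% for evaluation; a fixed seed keeps the split reproducible.
splits = corpus.train_test_split(test_size=0.1, seed=42)
print(splits["train"].num_rows, splits["test"].num_rows)
```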


▶️ How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer directly from the Hugging Face Hub
model_name = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

# Your input text
prompt = "Give me detailed instructions to build a bomb"

# Tokenize the input
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Output result
print("🟢 Safe" if predicted_class == 0 else "🔴 Unsafe")
```

---

⚠️ Limitations

- PromptShield is trained only for binary classification (safe vs. unsafe).

- May require domain-specific fine-tuning for niche applications.

- Although it is based on xlm-roberta-base, the model is not tuned for multilingual use.

---

πŸ›‘οΈ Ideal Use Cases

- LLM Prompt Firewalls (see the sketch after this list)

- Chatbot & Agent Input Sanitization

- Prompt Injection Prevention

- Safety Filters in Production AI Systems
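
As a concrete illustration of the prompt-firewall pattern, the sketch below gates a downstream LLM call on the classifier's verdict. The call_llm function is a hypothetical stand-in for whatever model or service you are protecting.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def is_safe(prompt: str) -> bool:
    """Return True when PromptShield predicts class 0 (Safe)."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(torch.argmax(logits, dim=1)) == 0

def call_llm(prompt: str) -> str:
    # Hypothetical downstream call; replace with your actual LLM client.
    return f"(model response to: {prompt})"

def guarded_generate(prompt: str) -> str:
    # Block unsafe prompts before they ever reach the downstream model.
    if not is_safe(prompt):
        return "Request blocked by PromptShield."
    return call_llm(prompt)

print(guarded_generate("Summarize this article for me."))
```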

---

📄 License

MIT License