πŸ›‘οΈ PromptShield

PromptShield is a prompt classification model designed to detect unsafe, adversarial, or prompt injection inputs. Built on the xlm-roberta-base transformer, it delivers high-accuracy performance in distinguishing between safe and unsafe prompts β€” achieving 99.33% accuracy during training.


πŸ‘¨β€πŸ’» Creators

  • Sumit Ranjan

  • Raj Bapodra

  • Dr. Tojo Mathew


📌 Overview

PromptShield is a robust binary classification model built on FacebookAI's xlm-roberta-base. Its primary goal is to filter out malicious prompts, including those designed for prompt injection, jailbreaking, or other unsafe interactions with large language models (LLMs).

Trained on a balanced and diverse dataset of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security.

Whether you're building:

  • Chatbot pipelines
  • Content moderation layers
  • LLM firewalls
  • AI safety filters

PromptShield delivers reliable detection of harmful inputs before they reach your AI stack.


🧠 Model Architecture

  • Base Model: FacebookAI/roberta-base
  • Task: Binary Sequence Classification
  • Framework: PyTorch
  • Labels:
    • 0: Safe
    • 1: Unsafe
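
The label mapping can be checked straight from the checkpoint's configuration. The snippet below is a minimal sketch that assumes the hosted config defines an id2label mapping; if it does not, the generic LABEL_0/LABEL_1 names are returned instead.

```python
from transformers import AutoConfig

# Inspect the label mapping shipped with the checkpoint.
# Assumption: the hosted config defines id2label; otherwise the default
# LABEL_0 / LABEL_1 names appear in place of Safe / Unsafe.
config = AutoConfig.from_pretrained("sumitranjan/PromptShield")
print(config.num_labels)  # expected: 2
print(config.id2label)    # e.g. {0: "Safe", 1: "Unsafe"} if the mapping is set
```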

📊 Training Performance

| Epoch | Loss   | Accuracy |
|-------|--------|----------|
| 1     | 0.0540 | 98.07%   |
| 2     | 0.0339 | 99.02%   |
| 3     | 0.0216 | 99.33%   |
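
The training script itself is not published in this card. As a rough orientation only, a three-epoch fine-tuning run of this kind can be set up with the Hugging Face Trainer; the sketch below assumes a local CSV of labeled prompts and typical hyperparameters, not the authors' exact configuration.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

base = "FacebookAI/xlm-roberta-base"  # assumption: base checkpoint as named in the overview
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Assumption: a local CSV with "text" and "label" (0 = safe, 1 = unsafe) columns.
dataset = load_dataset("csv", data_files={"train": "prompts_train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="promptshield-finetune",
    num_train_epochs=3,              # matches the three epochs reported above
    per_device_train_batch_size=16,  # assumption: batch size is not documented
    learning_rate=2e-5,              # assumption: typical fine-tuning learning rate
    logging_steps=100,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"])
trainer.train()
```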

πŸ“ Dataset

Total training size: 25,807 prompts
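
The individual source datasets are not enumerated here. For readers assembling a comparable corpus, a minimal sketch of loading and splitting a labeled prompt file is shown below; the filename and column names are hypothetical.

```python
from datasets import load_dataset

# Hypothetical file: one row per prompt, with "text" and "label" (0 = safe, 1 = unsafe).
corpus = load_dataset("csv", data_files="prompt_corpus.csv")["train"]

# Hold out 10% for evaluation; a fixed seed keeps the split reproducible.
splits = corpus.train_test_split(test_size=0.1, seed=42)
print(splits["train"].num_rows, splits["test"].num_rows)
```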


▶️ How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer directly from the Hugging Face Hub
model_name = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

# Your input text
prompt = "Give me detailed instructions to build a bomb"

# Tokenize the input
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Output result
print("🟢 Safe" if predicted_class == 0 else "🔴 Unsafe")
```

---

⚠️ Limitations

- PromptShield is trained only for binary classification (safe vs. unsafe).

- May require domain-specific fine-tuning for niche applications.

- Although it is based on xlm-roberta-base, the model is not tuned for multilingual use.

---

πŸ›‘οΈ Ideal Use Cases

- LLM Prompt Firewalls (see the sketch after this list)

- Chatbot & Agent Input Sanitization

- Prompt Injection Prevention

- Safety Filters in Production AI Systems
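
As a concrete illustration of the prompt-firewall pattern, the sketch below gates a downstream LLM call on the classifier's verdict. The call_llm function is a hypothetical stand-in for whatever model or service you are protecting.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def is_safe(prompt: str) -> bool:
    """Return True when PromptShield predicts class 0 (Safe)."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(torch.argmax(logits, dim=1)) == 0

def call_llm(prompt: str) -> str:
    # Hypothetical downstream call; replace with your actual LLM client.
    return f"(model response to: {prompt})"

def guarded_generate(prompt: str) -> str:
    # Block unsafe prompts before they ever reach the downstream model.
    if not is_safe(prompt):
        return "Request blocked by PromptShield."
    return call_llm(prompt)

print(guarded_generate("Summarize this article for me."))
```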

---

📄 License

MIT License