PurrBERT
Our BERT base prompt guardian.
PurrBERT-v1 is a lightweight content-safety classifier built on top of DistilBERT.
It's designed to flag harmful or unsafe user prompts before they reach an AI assistant.
This model is trained on a combination of content-safety datasets and performs binary classification (SAFE vs. FLAGGED) with the following label mapping:

0 → SAFE
1 → FLAGGED
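The label names can also be read from the checkpoint configuration at runtime, assuming the repository ships `id2label`/`label2id` metadata in its `config.json` (if it does not, the mapping above still applies):

```python
from transformers import AutoConfig

# Minimal sketch: inspect the checkpoint's label metadata.
# Assumes the repo's config.json defines id2label/label2id.
config = AutoConfig.from_pretrained("purrgpt-community/purrbert-v1")
print(config.id2label)   # expected: {0: "SAFE", 1: "FLAGGED"}
print(config.label2id)   # expected: {"SAFE": 0, "FLAGGED": 1}
```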
Loss dropped steadily during training, and metrics were evaluated on a held-out test set.
On an Aegis test slice:
| Metric | Score |
|---|---|
| Accuracy | 0.8050 |
| Precision | 0.7731 |
| Recall | 0.8846 |
| F1 Score | 0.8251 |
Latency per prompt on GPU: ~0.0193 sec
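As a rough sketch of how figures like these can be reproduced, the snippet below scores a held-out set with scikit-learn and times per-prompt inference. The `eval_prompts` and `eval_labels` lists are placeholders, not the actual Aegis slice, and the timing loop only approximates the latency measurement quoted above.

```python
import time
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained("purrgpt-community/purrbert-v1")
tokenizer = DistilBertTokenizerFast.from_pretrained("purrgpt-community/purrbert-v1")
model.eval()

# Placeholder held-out data; substitute the real evaluation slice.
eval_prompts = ["example safe prompt", "example harmful prompt"]
eval_labels = [0, 1]  # 0 = SAFE, 1 = FLAGGED

preds, times = [], []
for prompt in eval_prompts:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    start = time.perf_counter()
    with torch.no_grad():
        logits = model(**inputs).logits
    times.append(time.perf_counter() - start)
    preds.append(torch.argmax(logits, dim=-1).item())

precision, recall, f1, _ = precision_recall_fscore_support(
    eval_labels, preds, average="binary", pos_label=1
)
print("Accuracy :", accuracy_score(eval_labels, preds))
print("Precision:", precision)
print("Recall   :", recall)
print("F1       :", f1)
print("Mean latency per prompt (sec):", sum(times) / len(times))
```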
```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load trained model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained("purrgpt-community/purrbert-v1")
tokenizer = DistilBertTokenizerFast.from_pretrained("purrgpt-community/purrbert-v1")
model.eval()

def classify_prompt(prompt):
    # Tokenize the prompt and run a single forward pass without gradients
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Index 0 maps to SAFE, index 1 to FLAGGED
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return "SAFE" if pred == 0 else "FLAGGED"

print(classify_prompt("You are worthless and nobody likes you!"))
# → FLAGGED
```
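The tokenizer also accepts a list of prompts, so several inputs can be scored in one forward pass. The sketch below reuses the `model` and `tokenizer` loaded above and is not part of the original snippet:

```python
def classify_batch(prompts):
    # Tokenize all prompts together; padding aligns them to a common length.
    inputs = tokenizer(prompts, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    preds = torch.argmax(logits, dim=-1).tolist()
    return ["SAFE" if p == 0 else "FLAGGED" for p in preds]

print(classify_batch([
    "What's the weather like today?",
    "You are worthless and nobody likes you!",
]))
```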
PurrBERT is intended for moderating prompts before they're passed to AI models or for content-safety tasks. It is not a replacement for professional moderation in high-risk settings.
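For instance, a simple gate in front of an assistant could reuse `classify_prompt` from the snippet above and refuse to forward flagged input. The `call_assistant` callable here is a hypothetical placeholder, not part of this repository:

```python
def moderated_chat(prompt, call_assistant):
    # Run the guardrail first; only forward prompts classified as SAFE.
    if classify_prompt(prompt) == "FLAGGED":
        return "Sorry, I can't help with that request."
    return call_assistant(prompt)

# Usage with any downstream assistant callable, e.g.:
# reply = moderated_chat(user_prompt, call_assistant=my_llm_client.generate)
```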
Base model: distilbert/distilbert-base-uncased