PurrBERT
Our BERT base prompt guardian.
PurrBERT-v1 is a lightweight content-safety classifier built on top of DistilBERT.
It's designed to flag harmful or unsafe user prompts before they reach an AI assistant.
This model is trained on a combination of content-safety datasets and performs binary classification (SAFE vs. FLAGGED) with the following label mapping:

0 → SAFE
1 → FLAGGED
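The label names can also be read from the checkpoint configuration at runtime, assuming the repository ships `id2label`/`label2id` metadata in its `config.json` (if it does not, the mapping above still applies):

```python
from transformers import AutoConfig

# Minimal sketch: inspect the checkpoint's label metadata.
# Assumes the repo's config.json defines id2label/label2id.
config = AutoConfig.from_pretrained("purrgpt-community/purrbert-v1")
print(config.id2label)   # expected: {0: "SAFE", 1: "FLAGGED"}
print(config.label2id)   # expected: {"SAFE": 0, "FLAGGED": 1}
```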
Loss dropped steadily during training, and metrics were evaluated on a held-out test set.
On an Aegis test slice:
| Metric | Score |
|---|---|
| Accuracy | 0.8050 |
| Precision | 0.7731 |
| Recall | 0.8846 |
| F1 Score | 0.8251 |
Latency per prompt on GPU: ~0.0193 sec
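As a rough sketch of how figures like these can be reproduced, the snippet below scores a held-out set with scikit-learn and times per-prompt inference. The `eval_prompts` and `eval_labels` lists are placeholders, not the actual Aegis slice, and the timing loop only approximates the latency measurement quoted above.

```python
import time
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained("purrgpt-community/purrbert-v1")
tokenizer = DistilBertTokenizerFast.from_pretrained("purrgpt-community/purrbert-v1")
model.eval()

# Placeholder held-out data; substitute the real evaluation slice.
eval_prompts = ["example safe prompt", "example harmful prompt"]
eval_labels = [0, 1]  # 0 = SAFE, 1 = FLAGGED

preds, times = [], []
for prompt in eval_prompts:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    start = time.perf_counter()
    with torch.no_grad():
        logits = model(**inputs).logits
    times.append(time.perf_counter() - start)
    preds.append(torch.argmax(logits, dim=-1).item())

precision, recall, f1, _ = precision_recall_fscore_support(
    eval_labels, preds, average="binary", pos_label=1
)
print("Accuracy :", accuracy_score(eval_labels, preds))
print("Precision:", precision)
print("Recall   :", recall)
print("F1       :", f1)
print("Mean latency per prompt (sec):", sum(times) / len(times))
```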
```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load trained model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained("purrgpt-community/purrbert-v1")
tokenizer = DistilBertTokenizerFast.from_pretrained("purrgpt-community/purrbert-v1")
model.eval()

def classify_prompt(prompt):
    # Tokenize the prompt and run a single forward pass without gradients
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Index 0 maps to SAFE, index 1 to FLAGGED
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return "SAFE" if pred == 0 else "FLAGGED"

print(classify_prompt("You are worthless and nobody likes you!"))
# → FLAGGED
```
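The tokenizer also accepts a list of prompts, so several inputs can be scored in one forward pass. The sketch below reuses the `model` and `tokenizer` loaded above and is not part of the original snippet:

```python
def classify_batch(prompts):
    # Tokenize all prompts together; padding aligns them to a common length.
    inputs = tokenizer(prompts, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    preds = torch.argmax(logits, dim=-1).tolist()
    return ["SAFE" if p == 0 else "FLAGGED" for p in preds]

print(classify_batch([
    "What's the weather like today?",
    "You are worthless and nobody likes you!",
]))
```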
PurrBERT is intended for moderating prompts before they're passed to AI models or for content-safety tasks. It is not a replacement for professional moderation in high-risk settings.
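For instance, a simple gate in front of an assistant could reuse `classify_prompt` from the snippet above and refuse to forward flagged input. The `call_assistant` callable here is a hypothetical placeholder, not part of this repository:

```python
def moderated_chat(prompt, call_assistant):
    # Run the guardrail first; only forward prompts classified as SAFE.
    if classify_prompt(prompt) == "FLAGGED":
        return "Sorry, I can't help with that request."
    return call_assistant(prompt)

# Usage with any downstream assistant callable, e.g.:
# reply = moderated_chat(user_prompt, call_assistant=my_llm_client.generate)
```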
Base model: distilbert/distilbert-base-uncased