🐾 PurrBERT-v1

PurrBERT-v1 is a lightweight content-safety classifier built on top of DistilBERT.
It’s designed to flag harmful or unsafe user prompts before they reach an AI assistant.

The model was trained on a combination of content-safety datasets; the evaluation below reports results on a held-out Aegis test slice.

πŸ“ Model Description

  • Architecture: DistilBERT with a classification head (2 labels: SAFE vs. FLAGGED)
  • Purpose: Detect hate speech, toxic content, and unsafe prompts in English text.
  • Input: A single string (prompt text).
  • Output: A binary prediction (label mapping shown in the snippet below):
    • 0 → SAFE
    • 1 → FLAGGED
  • Size: ~67M parameters (float32, safetensors)
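
If you load the checkpoint programmatically, note that its config may only report generic label names (e.g. LABEL_0 / LABEL_1); the SAFE/FLAGGED mapping comes from this card. A quick way to check what the checkpoint itself ships with:

from transformers import AutoConfig

# Print whatever label names the checkpoint carries; if they are the
# generic defaults, apply the SAFE / FLAGGED mapping from this card.
config = AutoConfig.from_pretrained("purrgpt-community/purrbert-v1")
print(config.id2label)

ID2LABEL = {0: "SAFE", 1: "FLAGGED"}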

🧠 Training Details

  • Base model: distilbert-base-uncased
  • Epochs: 1 (initial run)
  • Optimizer: AdamW
  • Batch size: 16
  • Learning rate: 2e-5
  • Weight decay: 0.01

Loss dropped steadily during training, and metrics were evaluated on a held-out test set.
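
The card doesn't include the training script, but as rough orientation, here is a minimal sketch of how the hyperparameters above map onto the Hugging Face Trainer API. The output path and dataset variables are placeholders, not the authors' actual setup.

from transformers import (
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
)

# Placeholders: supply tokenized train / eval datasets here.
train_ds = None
eval_ds = None

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="purrbert-v1",          # hypothetical output path
    num_train_epochs=1,                # initial run
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,                 # AdamW is the Trainer's default optimizer
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()  # uncomment once real datasets are in place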


📊 Evaluation Results

On an Aegis test slice:

Metric      Score
Accuracy    0.8050
Precision   0.7731
Recall      0.8846
F1 Score    0.8251

Latency per prompt on GPU: ~0.0193 sec
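
For context, metrics like these are commonly computed with scikit-learn. The sketch below uses tiny illustrative label arrays, not the actual Aegis evaluation data.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative stand-ins for held-out labels and model predictions
y_true = [0, 1, 1, 0, 1]   # 0 = SAFE, 1 = FLAGGED
y_pred = [0, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"Accuracy {acc:.4f}  Precision {prec:.4f}  Recall {rec:.4f}  F1 {f1:.4f}")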


🚀 Usage

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load trained model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained("purrgpt-community/purrbert-v1")
tokenizer = DistilBertTokenizerFast.from_pretrained("purrgpt-community/purrbert-v1")
model.eval()

def classify_prompt(prompt):
    # Tokenize the prompt and run a single forward pass
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
        # Pick the higher-scoring class: 0 = SAFE, 1 = FLAGGED
        pred = torch.argmax(outputs.logits, dim=-1).item()
    return "SAFE" if pred == 0 else "FLAGGED"

print(classify_prompt("You are worthless and nobody likes you!"))
# → FLAGGED
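
If you also want a confidence score, a small extension of the function above applies a softmax to the logits. This is an illustrative addition, not part of the released checkpoint's API.

import torch.nn.functional as F

def classify_with_score(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        # Softmax turns the two logits into class probabilities
        probs = F.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    pred = int(probs.argmax())
    return ("SAFE" if pred == 0 else "FLAGGED"), float(probs[pred])

print(classify_with_score("You are worthless and nobody likes you!"))
# e.g. ("FLAGGED", 0.97) — the exact score depends on the model weights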

⚠️ Limitations & Bias

  • The model is trained primarily on English datasets.
  • It may produce false positives on edgy but non-harmful speech, or false negatives on subtle harms.
  • It reflects biases present in its training datasets.

🐾 Intended Use

PurrBERT is intended for moderating prompts before they’re passed to AI models or for content-safety tasks. It is not a replacement for professional moderation in high-risk settings.
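
In a deployment, that typically looks like a simple gate in front of the assistant. A minimal sketch, assuming a hypothetical generate_reply function standing in for your downstream model:

def moderated_reply(prompt):
    # Refuse before the prompt ever reaches the downstream model
    if classify_prompt(prompt) == "FLAGGED":
        return "Sorry, I can't help with that request."
    return generate_reply(prompt)  # hypothetical downstream call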
