How to use holistic-ai/rejection_detection with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="holistic-ai/rejection_detection")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("holistic-ai/rejection_detection")
model = AutoModelForSequenceClassification.from_pretrained("holistic-ai/rejection_detection")

This model was originally developed and fine-tuned by Protect AI. It is a fine-tuned version of distilroberta-base, trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.
The goal of this model is to detect LLM rejections when a prompt does not pass content moderation. It classifies responses into two categories:
0: Normal output
1: Rejection detected
On the evaluation set, the model achieves:
The model is designed to identify rejection responses in LLM outputs, particularly where a refusal or safeguard message is generated.
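When the model is loaded directly rather than through a pipeline, a two-class sequence classifier returns a pair of logits per input. The following is a minimal sketch of how such logits could be mapped onto the two categories listed above. It uses plain Python so it runs without downloading any weights. The helper name `decode_logits` and the example logit values are hypothetical, and the index-to-category mapping simply follows the 0/1 list in this card.

```python
import math

# Mapping taken from the category list in this card.
ID2CATEGORY = {0: "Normal output", 1: "Rejection detected"}


def decode_logits(logits):
    """Apply a softmax to two logits and return (category, probability)."""
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2CATEGORY[best], probs[best]


# Hypothetical logits for illustration only, not real model output.
category, prob = decode_logits([-2.0, 3.0])
print(category, round(prob, 3))
```

With real inputs, the logits would come from the `logits` attribute of the model's output for a tokenized batch.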
Limitations:
Because the model is based on distilroberta-base, it is case-sensitive.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)
print(classifier("Sorry, but I can't assist with that."))
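A text-classification pipeline returns a list with one dict per input, each holding a `label` and a `score`. The sketch below shows one way those dicts might be reduced to a boolean flag for downstream filtering. The helper `is_rejection`, the 0.5 threshold, and the label strings `"REJECTION"` and `"LABEL_1"` are assumptions made for illustration; check `model.config.id2label` for the names the loaded checkpoint actually uses. The sample dicts are hand-written stand-ins, not real model output.

```python
def is_rejection(result, rejection_labels=("REJECTION", "LABEL_1"), threshold=0.5):
    """Return True if a single pipeline result looks like a rejection.

    `rejection_labels` is an assumption; adjust it to match the
    checkpoint's id2label mapping.
    """
    return result["label"] in rejection_labels and result["score"] >= threshold


# Hand-written stand-ins shaped like pipeline output (not real predictions).
sample_outputs = [
    {"label": "REJECTION", "score": 0.99},
    {"label": "NORMAL", "score": 0.98},
]

flags = [is_rejection(r) for r in sample_outputs]
print(flags)
```

In practice, `sample_outputs` would be replaced by the list returned from `classifier(...)` on a batch of LLM responses.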