Heaven 1.1 Base - Safeguarding Against Predatory Messages
Model Details
Model Description
- Developed by: SafeCircle
- Model type: Llama 3.1 8B finetuned for predatory content detection
- Language(s): English, Spanish
- License: Llama 3.1 Community License (same as the base model)
- Finetuned from model: meta-llama/meta-Llama-3.1-8B-Instruct
Heaven 1.1 is a specialized model designed to detect and classify potentially harmful messages in online conversations, with a particular focus on identifying grooming, solicitation, and predatory communication patterns targeting minors. Based on the work from safecircleai/heaven1-base, this model has been finetuned on the heaven_dataset_refined.csv dataset using GRPO (Group Relative Policy Optimization) training with Unsloth.
Training Procedure
The model was trained using GRPO (Group Relative Policy Optimization) with customized reward functions designed to identify harmful content patterns (sketched below):
- XML format validation rewards
- Content assessment rewards for harmful content detection
- Correctness rewards based on labeled data
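The exact reward implementations are not reproduced here; the sketch below illustrates what these three signals could look like as TRL-style GRPO reward functions. The function names, reward weights, and the plain-string completion format are assumptions, not the original training code.

```python
import re

# Expected output structure: a <reasoning> block followed by an <answer> block.
XML_PATTERN = re.compile(
    r"<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>", re.DOTALL
)

def format_reward(completions, **kwargs):
    """Reward completions that follow the <reasoning>/<answer> XML format."""
    return [0.5 if XML_PATTERN.search(c) else 0.0 for c in completions]

def extract_answer(completion):
    """Pull the text between <answer> tags, if present."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    return match.group(1).strip().lower() if match else None

def content_reward(completions, **kwargs):
    """Reward completions whose answer is one of the allowed labels."""
    return [0.5 if extract_answer(c) in {"harmful", "safe"} else 0.0
            for c in completions]

def correctness_reward(completions, labels, **kwargs):
    """Reward completions whose answer matches the ground-truth label."""
    return [2.0 if extract_answer(c) == label else 0.0
            for c, label in zip(completions, labels)]
```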
Training Hyperparameters
- Training regime: 4-bit quantization with LoRA
- LoRA rank: 32
- Learning rate: 5e-6
- Batch size: 2 per device
- Gradient accumulation steps: 2
- Optimizer: AdamW 8-bit
- Training steps: 250
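The sketch below shows roughly how these settings map onto an Unsloth + TRL GRPO setup; argument names follow common unsloth/trl usage and are assumptions rather than the original training script.

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model in 4-bit and attach a rank-32 LoRA adapter (sketch).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,        # 4-bit quantization
    fast_inference=True,      # vLLM-backed rollouts for GRPO
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                     # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Training arguments mirroring the listed hyperparameters.
training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    optim="adamw_8bit",
    max_steps=250,
)

# trainer = GRPOTrainer(
#     model=model,
#     args=training_args,
#     reward_funcs=[format_reward, content_reward, correctness_reward],
#     train_dataset=train_dataset,  # prompts + labels from the training data
# )
# trainer.train()
```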
Hardware
- 2x NVIDIA GeForce RTX 4090 GPUs with tensor parallelism
Uses
Direct Use
This model is intended to be used for content moderation systems, online safety monitoring, and research into harmful content detection. It can analyze messages and determine if they contain potentially harmful or predatory content.
Downstream Use
- Content filtering systems for social media and chat platforms
- Educational tools for recognizing harmful communication patterns
- Safety tools to protect minors online
- Research into online predatory behavior detection
Out-of-Scope Use
This model should not be used:
- As the sole decision-maker for content moderation without human review
- For surveillance purposes that violate privacy rights
- To analyze communications without appropriate consent and safeguards
- To profile individuals based on their communication patterns
Bias, Risks, and Limitations
- The model may generate false positives or false negatives in content detection.
- The model's effectiveness is dependent on the quality and diversity of its training data.
- The model may have cultural or contextual biases based on its training data.
- The model should be regularly evaluated against evolving patterns of harmful communication.
Recommendations
- Use this model as part of a larger content moderation system that includes human review.
- Continuously evaluate the model's performance against diverse test cases.
- Be transparent with users about automated content moderation practices.
- Provide clear appeal processes for content flagged by the model.
How to Get Started with the Model
```python
from unsloth import FastLanguageModel
from vllm import SamplingParams
import torch

# Load the base model in 4-bit with Unsloth's vLLM-backed fast inference
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = 1024,
    load_in_4bit = True,
    fast_inference = True,
)

# Load the GRPO-trained LoRA adapter
lora_request = model.load_lora("path/to/heaven1.1-base/grpo_saved_lora")

SYSTEM_PROMPT = """
Analyze the following message and determine if it contains harmful or predatory content. Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
[harmful/safe]
</answer>
"""

# Check whether a single message contains harmful or predatory content
def check_message(message):
    # Build the chat-formatted prompt
    text = tokenizer.apply_chat_template([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": message},
    ], tokenize=False, add_generation_prompt=True)

    # Generate with the LoRA adapter applied
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
    output = model.fast_generate(
        [text],
        sampling_params=sampling_params,
        lora_request=lora_request,
    )[0].outputs[0].text
    return output

# Example usage
result = check_message(
    "Hey there! Can you tell me what time you finish school? "
    "My cousin is your age and I was wondering if you'd like to meet up sometime?"
)
print(result)
```
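The model's free-text output can then be reduced to a flag for downstream filtering. Below is a minimal sketch assuming the `<answer>` format above; the helper names are hypothetical.

```python
import re

def extract_label(model_output):
    """Pull the label out of the <answer> block, defaulting to 'unknown'."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", model_output, re.DOTALL)
    return match.group(1).strip().lower() if match else "unknown"

def should_flag_for_review(message):
    """Flag a message for human review unless the model calls it safe."""
    label = extract_label(check_message(message))
    return label != "safe"   # unknown/harmful both go to review
```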
Training Details
Training Data
The model was trained on the heaven_dataset_v2 dataset, which contains carefully labeled examples of both harmful and normal conversational messages. This dataset is specifically designed to help the model identify patterns of grooming, solicitation, and other predatory behavior in online conversations.
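The dataset schema is not documented here; as a rough sketch, assuming a CSV with hypothetical message and label columns, the data could be prepared into GRPO prompt/label pairs along these lines:

```python
import pandas as pd

# Hypothetical column names; the actual schema of the dataset may differ.
df = pd.read_csv("heaven_dataset_refined.csv")

def to_grpo_example(row):
    """Build a prompt/label pair for GRPO training (sketch)."""
    return {
        "prompt": [
            # SYSTEM_PROMPT is the classification prompt from the usage example above
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["message"]},
        ],
        "label": row["label"],   # e.g. "harmful" or "safe"
    }

train_examples = [to_grpo_example(row) for _, row in df.iterrows()]
```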
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model's performance was evaluated on a held-out portion of the heaven_dataset_refined.csv dataset.
Metrics
The model was evaluated based on:
- Accuracy in correctly identifying harmful vs. safe content
- Format adherence (correct output structure)
- Reasoning quality
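A minimal evaluation loop along these lines could compute the first two metrics on the held-out split; this is a sketch reusing check_message and extract_label from the examples above, and reasoning quality is harder to score automatically.

```python
def evaluate(examples):
    """Score accuracy and format adherence on held-out (message, label) pairs."""
    correct = well_formed = 0
    for message, gold_label in examples:
        predicted = extract_label(check_message(message))
        well_formed += predicted != "unknown"   # output followed the XML format
        correct += predicted == gold_label
    n = len(examples)
    return {"accuracy": correct / n, "format_adherence": well_formed / n}

# held_out = [("message text", "harmful"), ...]  # held-out labeled examples
# print(evaluate(held_out))
```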
Environmental Impact
- Hardware Type: 2x NVIDIA GeForce RTX 4090 GPUs
- Training duration: ~1 hour
Model Card Authors
Tomas Palma
Model Card Contact
For questions about this model, please send an email to [email protected].