Heaven 1.1 Base - Safeguarding Against Predatory Messages
Model Details
Model Description
- Developed by: SafeCircle
- Model type: Llama 3.1 8B finetuned for predatory content detection
- Language(s): English, Spanish
- License: Llama 3.1 Community License (same as the base model)
- Finetuned from model: meta-llama/meta-Llama-3.1-8B-Instruct
Heaven 1.1 is a specialized model designed to detect and classify potentially harmful messages in online conversations, with a particular focus on identifying grooming, solicitation, and predatory communication patterns targeting minors. Based on the work from safecircleai/heaven1-base, this model has been finetuned on the heaven_dataset_refined.csv dataset using GRPO (Group Relative Policy Optimization) training with Unsloth.
Training Procedure
The model was trained using GRPO (Group Relative Policy Optimization) with customized reward functions designed to identify harmful content patterns (sketched below):
- XML format validation rewards
- Content assessment rewards for harmful content detection
- Correctness rewards based on labeled data
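The exact reward implementations are not reproduced here; the sketch below illustrates what these three signals could look like as TRL-style GRPO reward functions. The function names, reward weights, and the plain-string completion format are assumptions, not the original training code.

```python
import re

# Expected output structure: a <reasoning> block followed by an <answer> block.
XML_PATTERN = re.compile(
    r"<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>", re.DOTALL
)

def format_reward(completions, **kwargs):
    """Reward completions that follow the <reasoning>/<answer> XML format."""
    return [0.5 if XML_PATTERN.search(c) else 0.0 for c in completions]

def extract_answer(completion):
    """Pull the text between <answer> tags, if present."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    return match.group(1).strip().lower() if match else None

def content_reward(completions, **kwargs):
    """Reward completions whose answer is one of the allowed labels."""
    return [0.5 if extract_answer(c) in {"harmful", "safe"} else 0.0
            for c in completions]

def correctness_reward(completions, labels, **kwargs):
    """Reward completions whose answer matches the ground-truth label."""
    return [2.0 if extract_answer(c) == label else 0.0
            for c, label in zip(completions, labels)]
```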
Training Hyperparameters
- Training regime: 4-bit quantization with LoRA
- LoRA rank: 32
- Learning rate: 5e-6
- Batch size: 2 per device
- Gradient accumulation steps: 2
- Optimizer: AdamW 8-bit
- Training steps: 250
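The sketch below shows roughly how these settings map onto an Unsloth + TRL GRPO setup; argument names follow common unsloth/trl usage and are assumptions rather than the original training script.

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model in 4-bit and attach a rank-32 LoRA adapter (sketch).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,        # 4-bit quantization
    fast_inference=True,      # vLLM-backed rollouts for GRPO
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                     # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Training arguments mirroring the listed hyperparameters.
training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    optim="adamw_8bit",
    max_steps=250,
)

# trainer = GRPOTrainer(
#     model=model,
#     args=training_args,
#     reward_funcs=[format_reward, content_reward, correctness_reward],
#     train_dataset=train_dataset,  # prompts + labels from the training data
# )
# trainer.train()
```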
Hardware
- 2x NVIDIA GeForce RTX 4090 GPUs with tensor parallelism
Uses
Direct Use
This model is intended to be used for content moderation systems, online safety monitoring, and research into harmful content detection. It can analyze messages and determine if they contain potentially harmful or predatory content.
Downstream Use
- Content filtering systems for social media and chat platforms
- Educational tools for recognizing harmful communication patterns
- Safety tools to protect minors online
- Research into online predatory behavior detection
Out-of-Scope Use
This model should not be used:
- As the sole decision-maker for content moderation without human review
- For surveillance purposes that violate privacy rights
- To analyze communications without appropriate consent and safeguards
- To profile individuals based on their communication patterns
Bias, Risks, and Limitations
- The model may generate false positives or false negatives in content detection.
- The model's effectiveness is dependent on the quality and diversity of its training data.
- The model may have cultural or contextual biases based on its training data.
- The model should be regularly evaluated against evolving patterns of harmful communication.
Recommendations
- Use this model as part of a larger content moderation system that includes human review.
- Continuously evaluate the model's performance against diverse test cases.
- Be transparent with users about automated content moderation practices.
- Provide clear appeal processes for content flagged by the model.
How to Get Started with the Model
```python
from unsloth import FastLanguageModel
from vllm import SamplingParams
import torch

# Load the base model in 4-bit with Unsloth's vLLM-backed fast inference
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = 1024,
    load_in_4bit = True,
    fast_inference = True,
)

# Load the GRPO-trained LoRA adapter
lora_request = model.load_lora("path/to/heaven1.1-base/grpo_saved_lora")

SYSTEM_PROMPT = """
Analyze the following message and determine if it contains harmful or predatory content. Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
[harmful/safe]
</answer>
"""

# Check whether a single message contains harmful or predatory content
def check_message(message):
    # Build the chat-formatted prompt
    text = tokenizer.apply_chat_template([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": message},
    ], tokenize=False, add_generation_prompt=True)

    # Generate with the LoRA adapter applied
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
    output = model.fast_generate(
        [text],
        sampling_params=sampling_params,
        lora_request=lora_request,
    )[0].outputs[0].text
    return output

# Example usage
result = check_message(
    "Hey there! Can you tell me what time you finish school? "
    "My cousin is your age and I was wondering if you'd like to meet up sometime?"
)
print(result)
```
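The model's free-text output can then be reduced to a flag for downstream filtering. Below is a minimal sketch assuming the `<answer>` format above; the helper names are hypothetical.

```python
import re

def extract_label(model_output):
    """Pull the label out of the <answer> block, defaulting to 'unknown'."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", model_output, re.DOTALL)
    return match.group(1).strip().lower() if match else "unknown"

def should_flag_for_review(message):
    """Flag a message for human review unless the model calls it safe."""
    label = extract_label(check_message(message))
    return label != "safe"   # unknown/harmful both go to review
```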
Training Details
Training Data
The model was trained on the heaven_dataset_v2 dataset, which contains carefully labeled examples of both harmful and normal conversational messages. This dataset is specifically designed to help the model identify patterns of grooming, solicitation, and other predatory behavior in online conversations.
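The dataset schema is not documented here; as a rough sketch, assuming a CSV with hypothetical message and label columns, the data could be prepared into GRPO prompt/label pairs along these lines:

```python
import pandas as pd

# Hypothetical column names; the actual schema of the dataset may differ.
df = pd.read_csv("heaven_dataset_refined.csv")

def to_grpo_example(row):
    """Build a prompt/label pair for GRPO training (sketch)."""
    return {
        "prompt": [
            # SYSTEM_PROMPT is the classification prompt from the usage example above
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["message"]},
        ],
        "label": row["label"],   # e.g. "harmful" or "safe"
    }

train_examples = [to_grpo_example(row) for _, row in df.iterrows()]
```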
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model's performance was evaluated on a held-out portion of the heaven_dataset_refined.csv dataset.
Metrics
The model was evaluated based on:
- Accuracy in correctly identifying harmful vs. safe content
- Format adherence (correct output structure)
- Reasoning quality
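A minimal evaluation loop along these lines could compute the first two metrics on the held-out split; this is a sketch reusing check_message and extract_label from the examples above, and reasoning quality is harder to score automatically.

```python
def evaluate(examples):
    """Score accuracy and format adherence on held-out (message, label) pairs."""
    correct = well_formed = 0
    for message, gold_label in examples:
        predicted = extract_label(check_message(message))
        well_formed += predicted != "unknown"   # output followed the XML format
        correct += predicted == gold_label
    n = len(examples)
    return {"accuracy": correct / n, "format_adherence": well_formed / n}

# held_out = [("message text", "harmful"), ...]  # held-out labeled examples
# print(evaluate(held_out))
```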
Environmental Impact
- Hardware Type: 2x NVIDIA GeForce RTX 4090 GPUs
- Training duration: ~1 hour
Model Card Authors
Tomas Palma
Model Card Contact
For questions about this model, please send an email to [email protected].