Heaven 1.1 Base - Safeguarding Against Predatory Messages

Heaven 1.1 Banner

Model Details

Model Description

  • Developed by: SafeCircle
  • Model type: Llama 3.1 8B finetuned for predatory content detection
  • Language(s): English, Spanish
  • License: Llama 3.1 Community License (same as the base model)
  • Finetuned from model: meta-llama/meta-Llama-3.1-8B-Instruct

Heaven 1.1 is a specialized model designed to detect and classify potentially harmful messages in online conversations, with a particular focus on identifying grooming, solicitation, and predatory communication patterns targeting minors. Building on safecircleai/heaven1-base, this model was finetuned on the heaven_dataset_refined.csv dataset using GRPO (Group Relative Policy Optimization) training with Unsloth.

Training Procedure

The model was trained using GRPO (Group Relative Policy Optimization) with custom reward functions designed to reinforce correct identification of harmful content patterns (a sketch of these reward functions follows the list):

  • XML format validation rewards
  • Content assessment rewards for harmful content detection
  • Correctness rewards based on labeled data
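The exact reward implementations are not published; the following is a minimal sketch of how such reward functions are commonly written for TRL's GRPOTrainer, with the function names, reward weights, and the "answer" column name all being illustrative assumptions rather than the actual training code.

import re

def xml_format_reward(completions, **kwargs):
    # Reward completions that follow the <reasoning>/<answer> structure.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]

def extract_answer(text):
    # Pull the label out of the <answer> tag, if present.
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip().lower() if match else ""

def correctness_reward(completions, answer, **kwargs):
    # Reward completions whose extracted label matches the dataset label.
    responses = [completion[0]["content"] for completion in completions]
    return [2.0 if extract_answer(r) == a.strip().lower() else 0.0
            for r, a in zip(responses, answer)]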

Training Hyperparameters

  • Training regime: 4-bit quantization with LoRA
  • LoRA rank: 32
  • Learning rate: 5e-6
  • Batch size: 2 per device
  • Gradient accumulation steps: 2
  • Optimizer: AdamW 8-bit
  • Training steps: 250
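The exact training script is not included here, but a minimal sketch of how these hyperparameters might be wired together with Unsloth and TRL's GRPOTrainer is shown below. The LoRA target modules, lora_alpha, num_generations, and dataset preparation are assumptions, and the reward functions are the ones sketched in the Training Procedure section.

from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model in 4-bit with a vLLM-backed generation path for GRPO rollouts
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
)

# Attach LoRA adapters at rank 32 (target modules and alpha are assumptions)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    optim="adamw_8bit",
    max_steps=250,
    num_generations=4,  # assumption: completions sampled per prompt
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[xml_format_reward, correctness_reward],  # sketched above
    args=training_args,
    train_dataset=train_dataset,  # assumption: prompts/labels prepared from heaven_dataset_refined.csv
)
trainer.train()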

Hardware

  • 2x NVIDIA GeForce RTX 4090 GPUs with tensor parallelism

Uses

Direct Use

This model is intended for use in content moderation systems, online safety monitoring, and research into harmful content detection. It analyzes messages and determines whether they contain potentially harmful or predatory content.

Downstream Use

  • Content filtering systems for social media and chat platforms
  • Educational tools for recognizing harmful communication patterns
  • Safety tools to protect minors online
  • Research into online predatory behavior detection

Out-of-Scope Use

This model should not be used:

  • As the sole decision-maker for content moderation without human review
  • For surveillance purposes that violate privacy rights
  • To analyze communications without appropriate consent and safeguards
  • To profile individuals based on their communication patterns

Bias, Risks, and Limitations

  • The model may generate false positives or false negatives in content detection.
  • The model's effectiveness is dependent on the quality and diversity of its training data.
  • The model may have cultural or contextual biases based on its training data.
  • The model should be regularly evaluated against evolving patterns of harmful communication.

Recommendations

  • Use this model as part of a larger content moderation system that includes human review.
  • Continuously evaluate the model's performance against diverse test cases.
  • Be transparent with users about automated content moderation practices.
  • Provide clear appeal processes for content flagged by the model.

How to Get Started with the Model

from unsloth import FastLanguageModel
from vllm import SamplingParams

# Load the model with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = 1024,
    load_in_4bit = True,
    fast_inference = True,
)

# Load the LoRA adapter
lora_request = model.load_lora('path/to/heaven1.1-base/grpo_saved_lora')

# Define a function to check if a message is harmful
def check_message(message):
    system_prompt = """
Analyze the following message and determine if it contains harmful or predatory content. Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
[harmful/safe]
</answer>
"""
    
    text = tokenizer.apply_chat_template([
        {"role": "system", "content": system_prompt}, 
        {"role": "user", "content": message}
    ], tokenize=False, add_generation_prompt=True)
    
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
    
    output = model.fast_generate(
        [text], 
        sampling_params=sampling_params,
        lora_request=lora_request
    )[0].outputs[0].text
    
    return output

# Example usage
result = check_message("Hey there! Can you tell me what time you finish school? My cousin is your age and I was wondering if you'd like to meet up sometime?")
print(result)
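Because the model is prompted to answer in the <reasoning>/<answer> format, the final label can be pulled out of the generated text with a small helper such as the illustrative one below.

import re

def extract_label(output):
    # Return the contents of the <answer> tag, e.g. "harmful" or "safe".
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", output, re.DOTALL)
    return match.group(1).strip().lower() if match else "unknown"

print(extract_label(result))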

Training Details

Training Data

The model was trained on the heaven_dataset_v2 dataset, which contains carefully labeled examples of both harmful and normal conversational messages. The dataset is specifically designed to help the model identify patterns of grooming, solicitation, and other predatory behavior in online conversations.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model's performance was evaluated on a held-out portion of the heaven_dataset_refined.csv dataset.

Metrics

The model was evaluated on the following criteria (a rough sketch of the evaluation loop follows the list):

  • Accuracy in correctly identifying harmful vs. safe content
  • Format adherence (correct output structure)
  • Reasoning quality
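For illustration only, here is a rough sketch of how accuracy and format adherence could be computed on a held-out split, assuming the check_message function from the getting-started example and a hypothetical eval_examples list of message/label pairs.

import re

ANSWER_RE = re.compile(r"<answer>\s*(harmful|safe)\s*</answer>", re.IGNORECASE | re.DOTALL)

correct = 0
well_formed = 0
for example in eval_examples:  # hypothetical: [{"message": ..., "label": "harmful" or "safe"}, ...]
    output = check_message(example["message"])
    match = ANSWER_RE.search(output)
    if match:
        well_formed += 1
        if match.group(1).lower() == example["label"]:
            correct += 1

print(f"Format adherence: {well_formed / len(eval_examples):.2%}")
print(f"Accuracy: {correct / len(eval_examples):.2%}")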

Environmental Impact

  • Hardware Type: 2x NVIDIA GeForce RTX 4090 GPUs
  • Training duration: ~1 hour

Model Card Authors

Tomas Palma

Model Card Contact

For questions about this model, please send an email to [email protected].
