# Email Spam Detection
This model detects whether an email message is spam or not spam (ham) using a fine-tuned transformer-based classifier.
## Model Details

### Model Description
This is a binary text classification model trained to distinguish spam emails from legitimate (ham) emails. The model is based on a pretrained transformer architecture (e.g., BERT, RoBERTa) and fine-tuned on a labeled email dataset containing both spam and non-spam messages.
- Model type: Transformer-based binary classifier
- Language(s) (NLP): English
- License: MIT License
- Finetuned from model: bert-base-uncased (example; see the configuration sketch below)
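As a sketch of how such a head is set up, a two-label sequence-classification model can be instantiated from the base checkpoint as follows; the `id2label` mapping here is an illustrative assumption, not read from the released config:

```python
from transformers import AutoModelForSequenceClassification

# Minimal sketch of the classifier-head configuration when fine-tuning
# bert-base-uncased for binary spam detection. The label mapping below
# is an assumption for illustration, not the released configuration.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    id2label={0: "ham", 1: "spam"},
    label2id={"ham": 0, "spam": 1},
)
```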
## Loading the Model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hub
tokenizer = AutoTokenizer.from_pretrained("deepak/email-spam-detection")
model = AutoModelForSequenceClassification.from_pretrained("deepak/email-spam-detection")
model.eval()

emails = [
    "Congratulations! You have won a $1000 gift card. Click here to claim.",
    "Meeting moved to 3 PM today in the conference room.",
]

# Tokenize the batch and run a forward pass without tracking gradients
inputs = tokenizer(emails, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the higher-scoring class for each email
predictions = torch.argmax(outputs.logits, dim=-1)
print(predictions)  # 1 = spam, 0 = ham
```
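For quick experiments, the same inference can be done with the `pipeline` helper, which bundles tokenization, the forward pass, and softmax into one call; the label strings it returns depend on the `id2label` mapping stored in the model config:

```python
from transformers import pipeline

# The pipeline wraps tokenization, inference, and softmax in one call
classifier = pipeline("text-classification", model="deepak/email-spam-detection")
print(classifier("Congratulations! You have won a $1000 gift card."))
# e.g. [{'label': 'LABEL_1', 'score': 0.99}] -- exact labels depend on the model config
```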
## Training Details

### Training Data

The model was trained on a labeled dataset of emails combining public spam corpora, such as the Enron Spam dataset, with other sources, balanced between spam and ham emails. Data preprocessing included cleaning email text, removing metadata, and tokenization.
### Training Procedure
Emails were normalized by removing special characters and tokenized using the pretrained tokenizer.
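The exact cleaning rules are not published, so the following is only a minimal sketch of what such normalization might look like; the regular expression is an assumption for illustration:

```python
import re
from transformers import AutoTokenizer

def normalize_email(text: str) -> str:
    """Illustrative cleaning step; the rules actually used in training are not published."""
    text = re.sub(r"[^a-zA-Z0-9\s.,!?$]", " ", text)  # drop special characters (assumed rule)
    return re.sub(r"\s+", " ", text).strip()          # collapse runs of whitespace

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(normalize_email("WIN $$$ NOW!!! <click-here>"), truncation=True)
```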
#### Training Hyperparameters

- Training regime: fine-tuning with fp16 mixed precision on NVIDIA GPUs
- Batch size: 32
- Learning rate: 2e-5
- Epochs: 4
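Expressed as Hugging Face `TrainingArguments`, these settings would look roughly like the sketch below; whether the `Trainer` API was actually used for this model is an assumption:

```python
from transformers import TrainingArguments

# Sketch mapping the reported hyperparameters onto TrainingArguments.
# The output_dir name is a placeholder.
training_args = TrainingArguments(
    output_dir="email-spam-detection",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=4,
    fp16=True,  # mixed-precision training on NVIDIA GPUs
)
```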
#### Speeds, Sizes, Times

- Checkpoint size: ~400 MB
- Training time: ~3 hours on 1 GPU
## Evaluation
### Testing Data, Factors & Metrics

#### Testing Data

Evaluation was performed on a held-out test split from the same dataset, containing emails unseen during training.
#### Factors
- No explicit subpopulation disaggregation was performed.
#### Metrics

- Accuracy
- Precision
- Recall
- F1-score
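All four can be computed with scikit-learn; a minimal sketch, assuming `y_true` and `y_pred` are 0/1 label arrays with spam (1) as the positive class:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder 0/1 labels for illustration only
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"  # treat label 1 (spam) as the positive class
)
print(accuracy_score(y_true, y_pred), precision, recall, f1)
```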
### Results

| Metric    | Score |
|-----------|-------|
| Accuracy  | 0.95  |
| Precision | 0.93  |
| Recall    | 0.92  |
| F1-score  | 0.925 |
## Model Examination
Attention analysis indicates the model focuses on key spam indicators like suspicious URLs, urgent calls to action, and financial keywords.
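One way to reproduce this kind of inspection is to request attention weights at inference time; the sketch below averages the last layer's heads and reads the attention paid by the [CLS] token to each input token (the choice of layer and the [CLS]-row heuristic are assumptions, not the analysis actually used):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("deepak/email-spam-detection")
model = AutoModelForSequenceClassification.from_pretrained(
    "deepak/email-spam-detection", output_attentions=True
)

inputs = tokenizer("Click here to claim your $1000 prize!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
# Average the last layer over heads and inspect attention from the [CLS] token.
last_layer = outputs.attentions[-1].mean(dim=1)[0]  # (seq_len, seq_len)
cls_attention = last_layer[0]                       # attention from [CLS] to each token
for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), cls_attention):
    print(f"{token:>12s} {score.item():.3f}")
```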