PhishMail - BERT Model for Phishing Detection
This repository features a fine-tuned BERT model designed to detect phishing emails. The model is trained to classify emails as either phishing or legitimate by analyzing their body text.
Author - Jagan Raj
https://www.linkedin.com/in/r-jagan-raj/
Model Details
- Model Type: BERT (Bidirectional Encoder Representations from Transformers)
- Task: Phishing detection (Binary classification: phishing vs. legitimate)
- Fine-Tuning: The model was fine-tuned on a carefully curated dataset comprising phishing and legitimate emails, ensuring diversity in email content and structure.
- Objective: To enhance email security by accurately identifying phishing attempts using contextual understanding of email body text.
- Developed by: Jagan Raj
- Base model: google-bert/bert-base-uncased
- License: Free for all
- Dataset: zefang-liu/phishing-email-dataset
Evaluation
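The figures below are the training summary (the TrainOutput returned by Trainer.train()), not metrics measured on a held-out evaluation set:
TrainOutput(global_step=6297, training_loss=0.07093968526965307, metrics={'train_runtime': 5545.442, 'train_samples_per_second': 9.08, 'train_steps_per_second': 1.136, 'total_flos': 1.32489571926528e+16, 'train_loss': 0.07093968526965307, 'epoch': 3.0})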
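The throughput numbers in the training summary above can be turned into a few more readable quantities. The sketch below is only arithmetic on the reported values; the variable names are illustrative, and the batch size and sample count are inferred estimates, not figures stated by the author.

```python
# Values copied from the TrainOutput above.
global_step = 6297
epochs = 3.0
train_runtime_s = 5545.442
samples_per_second = 9.08
steps_per_second = 1.136

# Optimizer steps per epoch.
steps_per_epoch = global_step / epochs

# Samples consumed per optimizer step (an inferred effective batch size).
effective_batch_size = samples_per_second / steps_per_second

# Rough training-set size: total samples seen divided by the number of epochs.
approx_train_samples = samples_per_second * train_runtime_s / epochs

print(f"steps per epoch:       {steps_per_epoch:.0f}")
print(f"effective batch size: ~{effective_batch_size:.1f}")
print(f"training samples:     ~{approx_train_samples:.0f}")
```

Under these assumptions the run works out to 2099 steps per epoch with an effective batch size of about 8, which implies a training split of roughly 16,800 emails.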
How to Use
Step 1: Installing Dependencies: Run the command below to install the required libraries. The leading ! is notebook syntax (Jupyter or Colab); omit it in a regular shell:
!pip install transformers torch
Step 2: Loading the Model:
from transformers import BertForSequenceClassification, BertTokenizer
import torch
# Specify the Hugging Face model repository name
model_name = 'jagan-raj/PhishMail'
# Load the fine-tuned BERT model for phishing detection
model = BertForSequenceClassification.from_pretrained(model_name)
# Load the corresponding tokenizer for the fine-tuned model
tokenizer = BertTokenizer.from_pretrained(model_name)
# Set the model to evaluation mode for inference
model.eval()
Step 3: Using the Model for Predictions:
# Input the email text for classification
email_text = "Your email content here"
# Tokenize and preprocess the input text
# Converts the email text into token IDs, applies truncation/padding, and creates a tensor
inputs = tokenizer(
    email_text,
    return_tensors="pt",   # Return PyTorch tensors
    truncation=True,       # Truncate text longer than the model's maximum input length
    padding='max_length'   # Pad shorter text up to that same maximum length
)
# Make a prediction using the model
with torch.no_grad():  # Disable gradient calculations for faster inference
    outputs = model(**inputs)  # Get model outputs
    logits = outputs.logits  # Extract raw prediction scores (logits)
    predictions = torch.argmax(logits, dim=-1)  # Determine the predicted class (0 or 1)
# Interpret the prediction result
# Map the prediction to its corresponding label: 1 for "Phishing", 0 for "Legitimate"
result = "This is a phishing email." if predictions.item() == 1 else "This is a legitimate email."
# Print the prediction result
print(f"Prediction: {result}")
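The prediction step above returns only a hard label. If a confidence score is more useful, the two logits can be passed through a softmax. The following is a minimal sketch in plain Python: phishing_probability is a hypothetical helper, the example logits are made up for illustration, and class index 1 is taken to mean "Phishing" as in the mapping above.

```python
import math

def phishing_probability(logits):
    """Return the softmax probability of class 1 given one email's two logits."""
    # Subtract the largest logit before exponentiating for numerical stability.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    return exps[1] / sum(exps)

# Hypothetical logits for illustration only; real values come from outputs.logits.
example_logits = [-2.0, 3.0]
score = phishing_probability(example_logits)
print(f"Phishing probability: {score:.3f}")
```

With the model loaded as in Step 2, the same helper could be called as phishing_probability(outputs.logits[0].tolist()), which should match torch.softmax(outputs.logits, dim=-1)[0, 1].item(). Any decision threshold other than 0.5 would need to be chosen against labelled data.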
Model Summary:
This model fine-tunes BERT (google-bert/bert-base-uncased) as a binary classifier that labels an email as phishing or legitimate based on its body text.
It was fine-tuned on a dataset of phishing and legitimate emails so that it learns the patterns and linguistic cues common in phishing content. Because BERT encodes each word in the context of the surrounding text, the model can pick up on subtle wording differences that separate malicious messages from normal communication, which makes it suited to email security and anti-phishing tooling.