---
library_name: transformers
base_model:
- google-bert/bert-base-uncased
datasets:
- zefang-liu/phishing-email-dataset
language:
- en
metrics:
- accuracy
tags:
- security
- phishing
---

# PhishMail - BERT Model for Phishing Detection

This repository features a fine-tuned BERT model designed to detect phishing emails. The model is trained to classify emails as either phishing or legitimate by analyzing their body text.

## Author

Jagan Raj ([LinkedIn](https://www.linkedin.com/in/r-jagan-raj/))

## Model Details

- **Model type:** BERT (Bidirectional Encoder Representations from Transformers)
- **Task:** Phishing detection (binary classification: phishing vs. legitimate)
- **Fine-tuning:** The model was fine-tuned on a curated dataset of phishing and legitimate emails, ensuring diversity in email content and structure (see the training sketch after this list).
- **Objective:** Enhance email security by accurately identifying phishing attempts through contextual understanding of email body text.
- **Developed by:** Jagan Raj
- **Base model:** google-bert/bert-base-uncased
- **License:** Free for all
- **Dataset:** zefang-liu/phishing-email-dataset
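
The exact training script is not part of this repository. The snippet below is a minimal sketch of how a comparable fine-tune could be reproduced with the Hugging Face `Trainer` API; it additionally requires `pip install datasets`, and the dataset column names (`Email Text`, `Email Type`), label values, and hyperparameters are assumptions rather than the author's verified settings:

```python
# Minimal fine-tuning sketch (assumed settings, not the author's exact script)
from datasets import load_dataset
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("zefang-liu/phishing-email-dataset", split="train")
dataset = dataset.filter(lambda row: isinstance(row["Email Text"], str))  # drop rows with missing text

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased", num_labels=2
)

def preprocess(batch):
    # Convert email bodies to token IDs and map the label strings to 0/1
    enc = tokenizer(batch["Email Text"], truncation=True, padding="max_length")
    enc["labels"] = [1 if t == "Phishing Email" else 0 for t in batch["Email Type"]]
    return enc

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="phishmail-bert",
    num_train_epochs=3,              # matches the 3 epochs reported under Evaluation
    per_device_train_batch_size=8,   # assumed batch size
    learning_rate=2e-5,              # assumed learning rate
)

Trainer(model=model, args=args, train_dataset=tokenized).train()
```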

## Evaluation

Final training output after 3 epochs (6,297 optimizer steps):

- Training loss: 0.0709
- Training runtime: 5,545 seconds (about 1.5 hours)
- Training samples per second: 9.08
- Training steps per second: 1.14
- Total FLOPs: ~1.32e16
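
Accuracy is the metric tracked for this model, but a held-out score is not included in the output above. The sketch below shows one way to measure accuracy yourself; it requires `pip install datasets`, the column names `Email Text`/`Email Type` and the label mapping are assumptions, and a proper evaluation should use a split the model was not trained on:

```python
# Quick accuracy check sketch (assumed column names and label mapping, not official results)
import torch
from datasets import load_dataset
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained("jagan-raj/PhishMail")
tokenizer = BertTokenizer.from_pretrained("jagan-raj/PhishMail")
model.eval()

# Small random sample for a quick sanity check
sample = load_dataset("zefang-liu/phishing-email-dataset", split="train").shuffle(seed=0).select(range(200))

correct = 0
for row in sample:
    inputs = tokenizer(row["Email Text"] or "", return_tensors="pt",
                       truncation=True, padding="max_length")
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(dim=-1).item()
    label = 1 if row["Email Type"] == "Phishing Email" else 0
    correct += int(pred == label)

print(f"Accuracy on {len(sample)} sampled emails: {correct / len(sample):.3f}")
```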

## How to Use

**Step 1:** Installing dependencies. Use the command below to install all the required libraries:

```bash
pip install transformers torch
```

**Step 2:** Loading the model:

```python
from transformers import BertForSequenceClassification, BertTokenizer
import torch

# Specify the Hugging Face model repository name
model_name = 'jagan-raj/PhishMail'

# Load the fine-tuned BERT model for phishing detection
model = BertForSequenceClassification.from_pretrained(model_name)

# Load the corresponding tokenizer for the fine-tuned model
tokenizer = BertTokenizer.from_pretrained(model_name)

# Set the model to evaluation mode for inference
model.eval()
```
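
The class-index-to-label mapping used in Step 3 (1 = phishing, 0 = legitimate) reflects how the labels were encoded during fine-tuning. Continuing from the Step 2 snippet, you can inspect what the checkpoint's config exposes; whether `id2label` carries meaningful names or generic `LABEL_0`/`LABEL_1` placeholders depends on how the model was saved:

```python
# Inspect the label mapping stored in the checkpoint's config
# (id2label may show generic LABEL_0/LABEL_1 names if none were set at training time)
print(model.config.num_labels)  # expected: 2
print(model.config.id2label)    # e.g. {0: 'LABEL_0', 1: 'LABEL_1'} or named labels
```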

**Step 3:** Using the model for predictions:

```python
# Input the email text for classification
email_text = "Your email content here"

# Tokenize and preprocess the input text
# Converts the email text into token IDs, applies truncation/padding, and creates tensors
inputs = tokenizer(
    email_text,
    return_tensors="pt",   # Output tensors in PyTorch format
    truncation=True,       # Truncate the text if it exceeds the maximum length
    padding='max_length'   # Pad the text to the maximum sequence length
)

# Make a prediction using the model
with torch.no_grad():  # Disable gradient calculations for faster inference
    outputs = model(**inputs)                    # Get model outputs
    logits = outputs.logits                      # Extract raw prediction scores (logits)
    predictions = torch.argmax(logits, dim=-1)   # Determine the predicted class (0 or 1)

# Interpret the prediction result
# Map the prediction to its label: 1 for "Phishing", 0 for "Legitimate"
result = "This is a phishing email." if predictions.item() == 1 else "This is a legitimate email."

# Print the prediction result
print(f"Prediction: {result}")
```
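
As an alternative to the manual tokenization and forward pass above, the same checkpoint can also be driven through the `transformers` pipeline helper. This is a convenience sketch rather than the author's documented interface; the label strings it returns depend on the `id2label` mapping stored in the model config and may be the generic `LABEL_0`/`LABEL_1`:

```python
from transformers import pipeline

# Text-classification pipeline wrapping the same fine-tuned checkpoint
classifier = pipeline("text-classification", model="jagan-raj/PhishMail")

print(classifier("Your email content here"))
# Example output shape: [{'label': 'LABEL_1', 'score': 0.99}]
# Interpreting LABEL_1/LABEL_0 as phishing/legitimate follows the convention used in Step 3
```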

## Model Summary

This fine-tuned BERT model is designed to detect phishing emails. Built on the BERT (Bidirectional Encoder Representations from Transformers) architecture, it performs binary classification to label emails as either phishing or legitimate.

The model was fine-tuned on a dataset of phishing and legitimate emails, so it learns the patterns and linguistic cues commonly found in phishing content. By leveraging contextual understanding, it can identify subtle differences in text that distinguish malicious intent from normal communication, making it a useful component of email security and anti-phishing defenses.