DistilBERT-Base-Uncased Quantized Model for Spam Detection

This repository hosts a quantized version of the DistilBERT model, fine-tuned for spam classification using a labeled SMS dataset. The model has been optimized using FP16 quantization for efficient deployment without significant accuracy loss.

Model Details

  • Model Architecture: DistilBERT Base Uncased
  • Task: Binary Spam Classification (Spam/Ham)
  • Dataset: SMS Spam Collection
  • Quantization: Float16
  • Fine-tuning Framework: Hugging Face Transformers

Installation

pip install torch transformers datasets scikit-learn

Loading the Model

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load tokenizer and model from the quantized checkpoint in this repository
# (the quantized-model/ directory; loading plain "distilbert-base-uncased" would
# give an untrained classification head)
model_path = "quantized-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

# Define test messages
texts = [
    "Congratulations! You have won a free iPhone. Click here to claim your prize.",
    "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
]

# Tokenize and predict
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.long() for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()
    label_map = {0: "Ham", 1: "Spam"}
    print(f"Text: {text}")
    print(f"Predicted Label: {label_map[predicted_class]}\n")

Performance Metrics

  • Accuracy: 0.9994
  • Precision: 1.0000
  • Recall: 0.9955
  • F1 Score: 0.9978
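
A minimal sketch of how such metrics can be computed with scikit-learn, assuming val_texts and val_labels (hypothetical names) hold the validation messages and their integer labels, and reusing the tokenizer and model loaded above:

import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# val_texts / val_labels are assumed to be the held-out validation split
# (labels: 0 = ham, 1 = spam)
preds = []
for text in val_texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    preds.append(int(torch.argmax(logits, dim=1)))

print("Accuracy: ", accuracy_score(val_labels, preds))
print("Precision:", precision_score(val_labels, preds))
print("Recall:   ", recall_score(val_labels, preds))
print("F1 Score: ", f1_score(val_labels, preds))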

Fine-Tuning Details

Dataset

The dataset used is the SMS Spam Collection, which contains messages labeled as either "spam" or "ham".
The dataset was cleaned with custom preprocessing, then split into 80% training and 20% validation sets with stratification (see the sketch below).
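
A minimal sketch of the split, assuming the cleaned messages sit in a pandas DataFrame df with text and label columns (the custom preprocessing itself is not reproduced here):

from sklearn.model_selection import train_test_split

# df is assumed to be the cleaned SMS Spam Collection with "text" and "label" columns
train_df, val_df = train_test_split(
    df,
    test_size=0.2,           # 80% train / 20% validation
    stratify=df["label"],    # keep the spam/ham ratio identical in both splits
    random_state=42,         # hypothetical seed for reproducibility
)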

Training

  • Epochs: 5
  • Batch size: 12 (train) / 16 (eval)
  • Learning rate: 3e-5
  • Evaluation strategy: epoch
  • FP16 Training: Enabled
  • Trainer: Hugging Face Trainer API
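
A configuration sketch of the settings above using the Trainer API; output_dir is arbitrary, and train_dataset / val_dataset are assumed to be the tokenized 80%/20% splits:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",             # hypothetical output directory
    num_train_epochs=5,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=16,
    learning_rate=3e-5,
    evaluation_strategy="epoch",        # named "eval_strategy" in newer transformers releases
    fp16=True,                          # mixed-precision training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,        # tokenized training split (assumed)
    eval_dataset=val_dataset,           # tokenized validation split (assumed)
)

trainer.train()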

Quantization

Post-training quantization was applied using model.to(dtype=torch.float16) to reduce model size and speed up inference.
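
A minimal sketch of that conversion, saving the half-precision checkpoint into the quantized-model/ directory listed under Repository Structure:

import torch

# Cast the fine-tuned weights to half precision and save the quantized checkpoint
model = model.to(dtype=torch.float16)
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")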


Repository Structure

.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Project documentation

Limitations

  • The model is trained specifically for binary spam classification on SMS data.
  • Performance may degrade on emails or social media text without domain adaptation.
  • FP16 inference may show slight numerical instability on edge cases.

Contributing

Feel free to open issues or submit pull requests to improve the model, training process, or documentation.
