DistilBERT-Base-Uncased Quantized Model for Spam Detection

This repository hosts a quantized version of the DistilBERT model, fine-tuned for spam classification using a labeled SMS dataset. The model has been optimized using FP16 quantization for efficient deployment without significant accuracy loss.

Model Details

  • Model Architecture: DistilBERT Base Uncased
  • Task: Binary Spam Classification (Spam/Ham)
  • Dataset: SMS Spam Collection
  • Quantization: Float16
  • Fine-tuning Framework: Hugging Face Transformers

Installation

pip install torch transformers datasets scikit-learn

Loading the Model

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load tokenizer and model from the quantized checkpoint in this repository
# (the quantized-model/ directory; loading plain "distilbert-base-uncased" would
# give an untrained classification head)
model_path = "quantized-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

# Define test messages
texts = [
    "Congratulations! You have won a free iPhone. Click here to claim your prize.",
    "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
]

# Tokenize and predict
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.long() for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()
    label_map = {0: "Ham", 1: "Spam"}
    print(f"Text: {text}")
    print(f"Predicted Label: {label_map[predicted_class]}\n")

Performance Metrics

  • Accuracy: 0.9994
  • Precision: 1.0000
  • Recall: 0.9955
  • F1 Score: 0.9978
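
A minimal sketch of how such metrics can be computed with scikit-learn, assuming val_texts and val_labels (hypothetical names) hold the validation messages and their integer labels, and reusing the tokenizer and model loaded above:

import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# val_texts / val_labels are assumed to be the held-out validation split
# (labels: 0 = ham, 1 = spam)
preds = []
for text in val_texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    preds.append(int(torch.argmax(logits, dim=1)))

print("Accuracy: ", accuracy_score(val_labels, preds))
print("Precision:", precision_score(val_labels, preds))
print("Recall:   ", recall_score(val_labels, preds))
print("F1 Score: ", f1_score(val_labels, preds))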

Fine-Tuning Details

Dataset

The dataset used is the SMS Spam Collection, which contains messages labeled as either "spam" or "ham".
The dataset was cleaned with custom preprocessing, then split into 80% training and 20% validation sets with stratification (see the sketch below).
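
A minimal sketch of the split, assuming the cleaned messages sit in a pandas DataFrame df with text and label columns (the custom preprocessing itself is not reproduced here):

from sklearn.model_selection import train_test_split

# df is assumed to be the cleaned SMS Spam Collection with "text" and "label" columns
train_df, val_df = train_test_split(
    df,
    test_size=0.2,           # 80% train / 20% validation
    stratify=df["label"],    # keep the spam/ham ratio identical in both splits
    random_state=42,         # hypothetical seed for reproducibility
)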

Training

  • Epochs: 5
  • Batch size: 12 (train) / 16 (eval)
  • Learning rate: 3e-5
  • Evaluation strategy: epoch
  • FP16 Training: Enabled
  • Trainer: Hugging Face Trainer API
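
A configuration sketch of the settings above using the Trainer API; output_dir is arbitrary, and train_dataset / val_dataset are assumed to be the tokenized 80%/20% splits:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",             # hypothetical output directory
    num_train_epochs=5,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=16,
    learning_rate=3e-5,
    evaluation_strategy="epoch",        # named "eval_strategy" in newer transformers releases
    fp16=True,                          # mixed-precision training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,        # tokenized training split (assumed)
    eval_dataset=val_dataset,           # tokenized validation split (assumed)
)

trainer.train()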

Quantization

Post-training quantization was applied using model.to(dtype=torch.float16) to reduce model size and speed up inference.
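
A minimal sketch of that conversion, saving the half-precision checkpoint into the quantized-model/ directory listed under Repository Structure:

import torch

# Cast the fine-tuned weights to half precision and save the quantized checkpoint
model = model.to(dtype=torch.float16)
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")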


Repository Structure

.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Project documentation

Limitations

  • The model is trained specifically for binary spam classification on SMS data.
  • Performance may degrade on emails or social media text without domain adaptation.
  • FP16 inference may show slight numerical instability on edge cases.

Contributing

Feel free to open issues or submit pull requests to improve the model, training process, or documentation.
