# DistilBERT SMS Classifier
Includes TensorFlow + TFLite model and tokenizer.
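A minimal loading and inference sketch (assuming the checkpoint is published as `0xRafu/distilbert-sms-tf-lite` and that the TensorFlow weights load directly via `from_pretrained`):

```python
# Inference sketch — the repo id and direct from_pretrained loading are
# assumptions based on this card, not tested here.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

repo = "0xRafu/distilbert-sms-tf-lite"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = TFAutoModelForSequenceClassification.from_pretrained(repo)

text = "Congratulations! You've won a free prize. Reply WIN to claim."
inputs = tokenizer(text, truncation=True, max_length=256, return_tensors="tf")
logits = model(**inputs).logits
pred = int(tf.argmax(logits, axis=-1)[0])  # 0 = ham, 1 = spam
print("spam" if pred == 1 else "ham")
```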
## Intended uses & limitations
Binary classification of SMS messages (spam vs. ham).
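Since the repository also ships a TFLite export, inference can run on-device through `tf.lite.Interpreter`. A sketch, assuming the exported file is named `model.tflite` and expects `input_ids`/`attention_mask` tensors (verify both with `get_input_details()`):

```python
# TFLite sketch — the filename model.tflite and the tensor names are
# assumptions; inspect interpreter.get_input_details() for the real ones.
import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = tokenizer("Free entry! Text WIN to 80082 now", truncation=True,
                padding="max_length", max_length=256, return_tensors="np")

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Match tokenizer outputs to interpreter inputs by tensor name and dtype.
for detail in interpreter.get_input_details():
    key = "attention_mask" if "attention_mask" in detail["name"] else "input_ids"
    interpreter.set_tensor(detail["index"], enc[key].astype(detail["dtype"]))

interpreter.invoke()
logits = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
print("spam" if logits.argmax(-1)[0] == 1 else "ham")  # 0 = ham, 1 = spam
```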
## Training and evaluation data
The dataset comes from *Investigating Evasive Techniques in SMS Spam Filtering: A Comparative Analysis of Machine Learning Models* (Salman, Ikram, and Kaafar, IEEE Access, 2024).
Citation for the dataset:

```bibtex
@article{salman2024investigating,
  title={Investigating Evasive Techniques in SMS Spam Filtering: A Comparative Analysis of Machine Learning Models},
  author={Salman, Muhammad and Ikram, Muhammad and Kaafar, Mohamed Ali},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
## Training Details
The model is based on `distilbert-base-uncased` and was fine-tuned for binary SMS classification (spam vs. ham).
Training configuration (a reproduction sketch follows the list):
- Dataset: Custom SMS dataset
- Input Format: SMS text, labels in {0, 1}
- Max Sequence Length: 256
- Tokenizer: DistilBERT tokenizer (`AutoTokenizer.from_pretrained("distilbert-base-uncased")`)
- Batch Size: 32
- Epochs: 2
- Learning Rate: 2e-5
- Weight Decay: 0.01
- Warmup Steps: 200
- Loss Function: SparseCategoricalCrossentropy (from logits)
- Optimizer: AdamW with linear decay + warmup
- Framework: TensorFlow (Keras API + Hugging Face Transformers)
- Evaluation Split: 80% train / 20% test
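A sketch of how this configuration translates into code, assuming `texts` and `labels` hold the custom SMS dataset (not bundled with the model):

```python
# Training-setup sketch matching the configuration above. texts / labels
# stand in for the custom SMS dataset, which is not included here.
import tensorflow as tf
from transformers import (AutoTokenizer, TFAutoModelForSequenceClassification,
                          create_optimizer)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

enc = tokenizer(texts, truncation=True, padding="max_length",
                max_length=256, return_tensors="tf")
train_ds = (tf.data.Dataset.from_tensor_slices((dict(enc), labels))
            .shuffle(10_000).batch(32))

epochs = 2
steps_per_epoch = len(labels) // 32
optimizer, _ = create_optimizer(
    init_lr=2e-5,                        # peak learning rate, linear decay
    num_train_steps=steps_per_epoch * epochs,
    num_warmup_steps=200,
    weight_decay_rate=0.01,              # AdamW weight decay
)

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_ds, epochs=epochs)
```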
## Evaluation Results
Evaluated on the 20% held-out test set of the SMS spam classification dataset.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 (Ham) | 0.9945 | 0.9956 | 0.9951 | 8211 |
| 1 (Spam) | 0.9931 | 0.9913 | 0.9922 | 5191 |
Overall Metrics:

| Metric | Value |
|---|---|
| Accuracy | 0.9940 |
| Macro Avg F1 | 0.9936 |
| Weighted Avg F1 | 0.9940 |
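These numbers match a standard scikit-learn classification report. A sketch of reproducing them, assuming `test_ds` and `y_true` come from the 20% held-out split prepared the same way as the training data:

```python
# Evaluation sketch — test_ds / y_true are assumed to come from the
# 20% held-out split described above.
import numpy as np
from sklearn.metrics import classification_report

logits = model.predict(test_ds).logits
y_pred = np.argmax(logits, axis=-1)
print(classification_report(y_true, y_pred,
                            target_names=["0 (Ham)", "1 (Spam)"], digits=4))
```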