File size: 5,270 Bytes
b31c870 51c7137 b31c870 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 |
---
language:
- id
license: mit
tags:
- text-classification
- bert
- spam-detection
- indonesian
- twitter
- retrained
datasets:
- nahiar/mail_data
pipeline_tag: text-classification
inference: true
base_model: nahiar/spam-detection-bert-v2
model_type: bert
library_name: transformers
widget:
- text: "Senin, 21 Juli 2025, Samapta Polsek Ngaglik melaksanakan patroli stasioner balong jalan palagan donoharjo"
example_title: "Ham Example"
- text: "Mari berkontribusi terhadap gerakan rakyat dengan membeli baju ini seharga Rp 160.000. Hubungi kami melalui WA 08977472296"
example_title: "Spam Example"
model-index:
- name: spam-detection-bert-v3
results:
- task:
type: text-classification
name: Text Classification
dataset:
name: Mail Data Indonesian Spam Detection
type: csv
metrics:
- name: Accuracy
type: accuracy
value: 0.95
- name: F1 Score (Weighted)
type: f1
value: 0.95
- name: Precision (HAM)
type: precision
value: 0.98
- name: Recall (HAM)
type: recall
value: 0.96
- name: Precision (SPAM)
type: precision
value: 0.77
- name: Recall (SPAM)
type: recall
value: 0.85
---
# Indonesian Spam Detection BERT
BERT model for spam detection in Indonesian with **95% accuracy**. This v3 model has been fine-tuned from v2 model with email dataset for optimal performance on Indonesian content.
## Quick Start
```python
from transformers import pipeline
# The easiest way to use the model
classifier = pipeline("text-classification",
model="nahiar/spam-detection-bert-v3",
tokenizer="nahiar/spam-detection-bert-v3")
# Test with text
texts = [
"lacak hp hilang by no hp / imei lacak penipu/scammer/tabrak lari/terror/revengeporn sadap / hack / pulihkan akun",
"Senin, 21 Juli 2025, Samapta Polsek Ngaglik melaksanakan patroli stasioner balong jalan palagan donoharjo",
"Mari berkontribusi terhadap gerakan rakyat dengan membeli baju ini seharga Rp 160.000. Hubungi kami melalui WA 08977472296"
]
results = classifier(texts)
for text, result in zip(texts, results):
print(f"Text: {text}")
print(f"Result: {result['label']} (confidence: {result['score']:.4f})")
print("---")
```
## Model Details
- **Base Model**: nahiar/spam-detection-bert-v2
- **Task**: Binary Text Classification (Spam vs Ham)
- **Language**: Indonesian (Bahasa Indonesia)
- **Model Size**: ~110M parameters
- **Max Sequence Length**: 512 tokens
- **Training Epochs**: 3
- **Batch Size**: 16
- **Learning Rate**: 2e-5
## Performance
| Metric | HAM | SPAM | Overall |
| -------------------- | --- | ---- | ------- |
| Precision | 98% | 77% | 95% |
| Recall | 96% | 85% | 95% |
| F1-Score | 97% | 81% | 95% |
| **Overall Accuracy** | - | - | **95%** |
### Confusion Matrix
- True HAM correctly predicted: 953/988 (96%)
- True SPAM correctly predicted: 115/135 (85%)
- False Positives (HAM predicted as SPAM): 35
- False Negatives (SPAM predicted as HAM): 20
## Key Features
- **Fine-tuned** from v2 model with email dataset
- **Good accuracy** (95%) on spam detection with Indonesian context
- **Better handling** for spam email content
- **Enhanced performance** on Indonesian email text
- **Optimized** for Indonesian email and social media spam detection
## Label Mapping
```
0: "HAM" (not spam)
1: "SPAM" (spam)
```
## Training Process
This model was retrained using:
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Epochs**: 3
- **Batch Size**: 16
- **Max Length**: 128 tokens
- **Train/Validation Split**: 80/20
## Usage Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("nahiar/spam-detection-bert-v3")
model = AutoModelForSequenceClassification.from_pretrained("nahiar/spam-detection-bert-v3")
def predict_spam(text):
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=1)
predicted_label = torch.argmax(probs, dim=1).item()
confidence = probs[0][predicted_label].item()
label_map = {0: "HAM", 1: "SPAM"}
return label_map[predicted_label], confidence
# Test
text = "Dapatkan uang dengan mudah! Klik link ini sekarang!"
result, confidence = predict_spam(text)
print(f"Prediksi: {result} (Confidence: {confidence:.4f})")
```
## Citation
```bibtex
@misc{nahiar_spam_detection_bert,
title={Indonesian Spam Detection BERT},
author={Raihan Hidayatullah Djunaedi},
year={2025},
url={https://huggingface.co/nahiar/spam-detection-bert-v3}
}
```
## Changelog
### Current Version v3 (August 2025)
- Fine-tuned from v2 model with email dataset (mail_data.csv)
- Enhanced handling for Indonesian spam email content
- Good performance (95% accuracy) on email spam detection
- Optimized for Indonesian email and social media content
- Improved with GPU-accelerated training using RTX 3080
|