---
language: en
license: mit
library_name: transformers
tags:
- text-classification
- hate-speech
- offensive-language
- distilbert
- tensorflow
pipeline_tag: text-classification
widget:
- text: "I love this beautiful day, it's fantastic!"
  example_title: "Positive Example"
- text: "You are a terrible person and I wish you the worst."
  example_title: "Offensive Example"
- text: "This is a completely neutral statement about clouds."
  example_title: "Neutral Example"
- text: "Kill all of them, they don't belong in our country."
  example_title: "Hate Speech Example"
model-index:
- name: distilbert-hatespeech-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Hate Speech and Offensive Language (Davidson et al., 2017)
      type: tdavidson/hate_speech_offensive
    metrics:
    - name: Validation Accuracy
      type: accuracy
      value: 0.7402
    - name: Validation Loss
      type: loss
      value: 0.7207
---
# Ethical-Content-Moderation
Fine-Tuning DistilBERT for Ethical Content Moderation
## Model description
This model fine-tunes `distilbert-base-uncased` on the Davidson et al. (2017) hate speech and offensive language dataset, loaded from the Hugging Face Hub. The classifier predicts whether a tweet is:
- (a) hate speech
- (b) offensive but not hate speech
- (c) neither

It uses a frozen DistilBERT base with a custom dense classification head: three dense layers (256 → 128 → 32, with LeakyReLU and Swish activations), plus dropout and batch normalization to improve generalization.
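The head is not published as standalone code in this card; below is a minimal Keras sketch of the architecture described above, assuming a 128-token input length and illustrative dropout rates.

```python
import tensorflow as tf
from transformers import TFDistilBertModel

MAX_LEN = 128  # assumed maximum sequence length

# Frozen DistilBERT base: only the dense head below is trained.
base = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
base.trainable = False

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Use the [CLS] token representation as the pooled sentence embedding.
sequence_output = base(input_ids, attention_mask=attention_mask)[0]
cls_token = sequence_output[:, 0, :]

# Dense head: 256 -> 128 -> 32, LeakyReLU and Swish, dropout + batch norm.
x = tf.keras.layers.Dense(256)(cls_token)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.LeakyReLU()(x)
x = tf.keras.layers.Dropout(0.3)(x)  # dropout rate is an assumption

x = tf.keras.layers.Dense(128, activation="swish")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(32, activation="swish")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # hate / offensive / neither

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
```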
## Intended uses & limitations
### Intended uses
- A starting point for transfer learning in NLP and AI ethics projects
- Academic research on hate speech and offensive language detection
- A fast, lightweight screening tool for moderating user-generated content (e.g., tweets, comments, reviews); see the inference sketch below

### Limitations
- Not suitable for real-time production use without further robustness testing
- Trained on English Twitter data (2017); performance on other domains or languages may be poor
- Does not guarantee removal of all forms of bias or unfairness; see the Fairness & Bias section
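A minimal inference sketch, assuming the hosted checkpoint works with the standard transformers text-classification pipeline (the returned label names depend on the model's config):

```python
from transformers import pipeline

# Load the published checkpoint through the TensorFlow pipeline.
classifier = pipeline(
    "text-classification",
    model="will-rads/distilbert-hatespeech-classifier",
    framework="tf",
)

print(classifier("This is a completely neutral statement about clouds."))
# e.g. [{'label': '...', 'score': ...}] -- label mapping: hate / offensive / neither
```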
## Training and evaluation data
- Dataset: Davidson et al., 2017 (24K+ English tweets, labeled as hate, offensive, or neither), loaded as `tdavidson/hate_speech_offensive` from the Hugging Face Hub
- Class distribution: imbalanced (majority: "offensive"; minority: "hate")
- Split: 80% training, 20% validation (stratified); see the sketch below
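A minimal sketch of how this split could be reproduced, assuming the Hub dataset exposes `tweet` and `class` columns (verify against the dataset card) and using scikit-learn's stratified `train_test_split`:

```python
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Load the Davidson et al. dataset from the Hugging Face Hub.
ds = load_dataset("tdavidson/hate_speech_offensive", split="train")

texts = ds["tweet"]
labels = ds["class"]  # 0 = hate speech, 1 = offensive, 2 = neither

# 80/20 split, stratified on the label to preserve class proportions.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```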
## Training procedure
- Frozen base: DistilBERT transformer weights frozen; only the dense classifier head is trained
- Loss: sparse categorical cross-entropy
- Optimizer: Adam (learning rate = 3e-5)
- Batch size: 16
- Class weighting: used to compensate for class imbalance (higher weight for "hate")
- Early stopping: custom callback that halts training once val_accuracy ≥ 0.92 (see the sketch after this list)
- Hardware: Google Colab (Tesla T4 GPU)
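The exact training script is not included in this card; the following sketch illustrates the setup listed above, assuming tokenized inputs (`train_inputs`, `val_inputs`) have been prepared elsewhere and reusing the `model` and labels from the earlier sketches.

```python
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight


class StopAtValAccuracy(tf.keras.callbacks.Callback):
    """Custom early stopping: halt training once val_accuracy reaches a target."""

    def __init__(self, target=0.92):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("val_accuracy", 0.0) >= self.target:
            self.model.stop_training = True


# Class weights to compensate for imbalance (gives the minority "hate" class a higher weight).
weights = compute_class_weight(
    "balanced", classes=np.unique(train_labels), y=train_labels
)
class_weights = dict(enumerate(weights))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(
    train_inputs,                      # dict of input_ids / attention_mask arrays (not shown)
    np.array(train_labels),
    validation_data=(val_inputs, np.array(val_labels)),
    epochs=15,
    batch_size=16,
    class_weight=class_weights,
    callbacks=[StopAtValAccuracy(0.92)],
)
```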
### Training hyperparameters
The following hyperparameters were used during training:
- optimizer: Adam (learning_rate = 3e-05, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-07, amsgrad = False, weight_decay = None, clipnorm = None, global_clipnorm = None, clipvalue = None, use_ema = False, ema_momentum = 0.99, ema_overwrite_frequency = None, jit_compile = True, is_legacy_optimizer = False)
- training_precision: float32
### Training results
| Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | Epoch |
|:----------:|:--------------:|:---------------:|:-------------------:|:-----:|
| 1.4634 | 0.4236 | 0.9268 | 0.6454 | 1 |
| 1.1659 | 0.5067 | 0.9578 | 0.6480 | 2 |
| 1.0965 | 0.5388 | 0.8224 | 0.7043 | 3 |
| 1.0026 | 0.5667 | 0.8131 | 0.7051 | 4 |
| 0.9948 | 0.5817 | 0.8264 | 0.6940 | 5 |
| 0.9631 | 0.5921 | 0.7893 | 0.7111 | 6 |
| 0.9431 | 0.6009 | 0.7725 | 0.7252 | 7 |
| 0.9019 | 0.6197 | 0.8177 | 0.7049 | 8 |
| 0.8790 | 0.6247 | 0.7408 | 0.7351 | 9 |
| 0.8578 | 0.6309 | 0.7786 | 0.7176 | 10 |
| 0.8275 | 0.6455 | 0.7387 | 0.7331 | 11 |
| 0.8530 | 0.6411 | 0.7253 | 0.7273 | 12 |
| 0.8197 | 0.6506 | 0.7430 | 0.7293 | 13 |
| 0.8145 | 0.6549 | 0.7535 | 0.7162 | 14 |
| 0.8081 | 0.6631 | 0.7207 | 0.7402 | 15 |
### Best validation accuracy:
0.7402 at epoch 15
### Environmental Impact
Training emissions: estimated at 0.0273 kg CO₂ (measured with CodeCarbon on a Colab T4 GPU).
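For reference, this kind of estimate can be produced with CodeCarbon roughly as follows (illustrative; the exact tracker configuration for this run is not given in the card):

```python
from codecarbon import EmissionsTracker

# Wrap the training run with an emissions tracker.
tracker = EmissionsTracker(project_name="distilbert-hatespeech-classifier")
tracker.start()

# ... model.fit(...) as in the training sketch above ...

emissions_kg = tracker.stop()  # estimated emissions in kg CO2eq
print(f"Estimated training emissions: {emissions_kg:.4f} kg CO2eq")
```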
### Fairness & Bias
The model was evaluated on synthetic gender-pronoun tests and showed relatively balanced outputs, but biases may remain due to dataset limitations. See Appendix B of the project report for details.
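An illustrative pronoun-swap probe (not the actual test set from Appendix B) can reuse the `classifier` pipeline from the inference sketch above:

```python
# Compare predictions for otherwise identical sentences that differ only in pronoun.
templates = [
    "{} is a wonderful person.",
    "{} should not be allowed to speak here.",
]
pronouns = ["He", "She"]

for template in templates:
    for pronoun in pronouns:
        text = template.format(pronoun)
        print(text, "->", classifier(text)[0])
```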
### If you use this model, please cite:
- Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. *ICWSM 2017*.
- Radiyeh, W. (2025). DistilBERT Hate Speech Classifier. https://huggingface.co/will-rads/distilbert-hatespeech-classifier
### Framework versions
- Transformers 4.51.3
- TensorFlow 2.18.0
- Datasets 3.6.0
- Tokenizers 0.21.1