---
pipeline_tag: text-classification
library_name: transformers
license: mit
language: en
tags:
- transformers
- tensorflow
- distilbert
- text-classification
# Widget examples shown on the model page:
widget:
- text: "I love this community."
  example_title: "Positive Example"
- text: "You are a terrible person and I wish you the worst."
  example_title: "Offensive Example"
- text: "This is a completely neutral statement about clouds."
  example_title: "Neutral Example"
- text: "Kill all of them, they don't belong in our country."
  example_title: "Hate Speech Example"
# Optional: results for the model card
model-index:
- name: distilbert-hatespeech-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: hate_speech_offensive (Davidson et al., 2017)
      type: tdavidson/hate_speech_offensive
    metrics:
    - name: Validation Accuracy
      type: accuracy
      value: 0.7137
    - name: Validation Loss
      type: loss
      value: 0.7337
---
# Ethical-Content-Moderation
Fine-Tuning DistilBERT for Ethical Content Moderation
## Live Demo
Try the model directly in your browser here:
➡️ [Ethical Content Moderator Space](https://huggingface.co/spaces/will-rads/ethical-content-moderator)
## Model description
This model fine-tunes `distilbert-base-uncased` on the Davidson et al. (2017) hate speech and offensive language dataset, loaded from the Hugging Face Hub. Using a frozen DistilBERT base and a custom dense head, the classifier predicts whether a tweet is:
- (a) hate speech
- (b) offensive but not hate speech
- (c) neither

The classification head consists of three dense layers (256 → 128 → 32, with LeakyReLU and Swish activations), plus dropout and batch normalization to improve generalization.
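For reference, a minimal Keras sketch of this architecture is shown below. The exact layer ordering, dropout rates, which layers use LeakyReLU versus Swish, and the maximum sequence length are assumptions, not details confirmed by the training code.

```python
import tensorflow as tf
from transformers import TFDistilBertModel

MAX_LEN = 128  # assumed maximum sequence length

# Frozen DistilBERT encoder
base = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
base.trainable = False

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Use the [CLS] token representation as the sentence embedding
cls = base(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0, :]

# Custom dense head: 256 -> 128 -> 32, with dropout and batch normalization
x = tf.keras.layers.Dense(256)(cls)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.LeakyReLU()(x)
x = tf.keras.layers.Dropout(0.3)(x)  # dropout rate is an assumption

x = tf.keras.layers.Dense(128, activation="swish")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(32, activation="swish")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # hate / offensive / neither

model = tf.keras.Model([input_ids, attention_mask], outputs)
```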
## Intended uses & limitations
### Intended uses
- As a starting point for transfer learning in NLP and AI ethics projects
- Academic research on hate speech and offensive language detection
- As a fast, lightweight screening tool for moderating user-generated content (e.g., tweets, comments, reviews); see the usage sketch at the end of this section
### Limitations
- Not suitable for real-time production use without further robustness testing
- Trained on English Twitter data (2017); performance on other domains or languages may be poor
- Does not guarantee removal of all forms of bias or unfairness; see the Fairness & Bias section
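A hedged usage sketch for screening a single piece of text is below. It assumes the checkpoint loads through the standard `transformers` text-classification pipeline; the label names in the illustrative output are not guaranteed to match the checkpoint's config.

```python
from transformers import pipeline

# Load the classifier from the Hub (downloads weights on first use)
clf = pipeline(
    "text-classification",
    model="will-rads/distilbert-hatespeech-classifier",
)

print(clf("You are a terrible person and I wish you the worst."))
# Illustrative output only; actual label names and scores depend on the checkpoint:
# [{'label': 'offensive', 'score': 0.87}]
```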
## Training and evaluation data
- Dataset: Davidson et al., 2017 (24K+ English tweets, labeled as hate speech, offensive, or neither)
- Class distribution: imbalanced (majority: "offensive"; minority: "hate")
- Split: 80% training, 20% validation (stratified), as sketched below
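A hedged sketch of loading the dataset and reproducing the stratified split; the label column name ("class") and the random seed are assumptions.

```python
from datasets import load_dataset

# Davidson et al. (2017) dataset on the Hugging Face Hub
ds = load_dataset("tdavidson/hate_speech_offensive", split="train")

# 80/20 stratified split on the label column (assumed to be named "class":
# 0 = hate speech, 1 = offensive language, 2 = neither)
split = ds.class_encode_column("class").train_test_split(
    test_size=0.2, stratify_by_column="class", seed=42
)
train_ds, val_ds = split["train"], split["test"]
```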
## Training procedure
- Frozen base: DistilBERT transformer weights frozen; only the dense classifier head is trained
- Loss: sparse categorical crossentropy
- Optimizer: Adam (learning rate = 3e-5)
- Batch size: 16
- Class weighting: used to compensate for class imbalance (higher weight for "hate")
- Early stopping: custom callback triggered at val_accuracy ≥ 0.92 (see the sketch after this list)
- Hardware: Google Colab (Tesla T4 GPU)
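A minimal sketch of this compile/fit setup, assuming the Keras model from the architecture sketch above; the class weight values, the callback implementation, and the tokenized input arrays (`train_inputs`, `val_inputs`, etc.) are assumptions for illustration only.

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Custom early-stopping callback: halt once validation accuracy reaches 0.92
class StopAtValAccuracy(tf.keras.callbacks.Callback):
    def __init__(self, target=0.92):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("val_accuracy", 0.0) >= self.target:
            self.model.stop_training = True

# Illustrative class weights up-weighting the minority "hate" class
# (the actual values used in training are not documented here)
class_weight = {0: 6.0, 1: 1.0, 2: 2.0}

model.fit(
    train_inputs,  # hypothetical dict: {"input_ids": ..., "attention_mask": ...}
    train_labels,
    validation_data=(val_inputs, val_labels),
    epochs=15,
    batch_size=16,
    class_weight=class_weight,
    callbacks=[StopAtValAccuracy(0.92)],
)
```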
### Training hyperparameters
The following hyperparameters were used during training:
- optimizer: {'name': 'Adam', 'weight_decay': None, 'clipnorm': None, 'global_clipnorm': None, 'clipvalue': None, 'use_ema': False, 'ema_momentum': 0.99, 'ema_overwrite_frequency': None, 'jit_compile': True, 'is_legacy_optimizer': False, 'learning_rate': 3e-05, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False}
- training_precision: float32
### Training results
| Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | Epoch |
|:----------:|:--------------:|:---------------:|:-------------------:|:-----:|
| 1.4634 | 0.4236 | 0.9268 | 0.6454 | 1 |
| 1.1659 | 0.5067 | 0.9578 | 0.6480 | 2 |
| 1.0965 | 0.5388 | 0.8224 | 0.7043 | 3 |
| 1.0026 | 0.5667 | 0.8131 | 0.7051 | 4 |
| 0.9948 | 0.5817 | 0.8264 | 0.6940 | 5 |
| 0.9631 | 0.5921 | 0.7893 | 0.7111 | 6 |
| 0.9431 | 0.6009 | 0.7725 | 0.7252 | 7 |
| 0.9019 | 0.6197 | 0.8177 | 0.7049 | 8 |
| 0.8790 | 0.6247 | 0.7408 | 0.7351 | 9 |
| 0.8578 | 0.6309 | 0.7786 | 0.7176 | 10 |
| 0.8275 | 0.6455 | 0.7387 | 0.7331 | 11 |
| 0.8530 | 0.6411 | 0.7253 | 0.7273 | 12 |
| 0.8197 | 0.6506 | 0.7430 | 0.7293 | 13 |
| 0.8145 | 0.6549 | 0.7535 | 0.7162 | 14 |
| 0.8081 | 0.6631 | 0.7207 | 0.7402 | 15 |
### Best validation accuracy:
0.7402 at epoch 15
### Environmental Impact
Training emissions: estimated at 0.0273 kg CO₂ (CodeCarbon, Colab T4 GPU).
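Emissions tracking along these lines can be reproduced with CodeCarbon; this is a generic sketch, not the exact configuration used for this run.

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()   # auto-detects hardware (e.g., the Colab T4 GPU)
tracker.start()
# model.fit(...)  # training happens here
emissions_kg = tracker.stop()  # estimated emissions in kg CO2-equivalent
print(f"Estimated training emissions: {emissions_kg:.4f} kg")
```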
### Fairness & Bias
Bias/fairness audit: the model was evaluated on synthetic gender pronoun tests and showed relatively balanced outputs, but biases may remain due to dataset limitations. See Appendix B of the project report for details.
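The actual audit templates are in Appendix B of the project report; the sketch below only illustrates the general pronoun-swap idea with made-up sentences, reusing the `clf` pipeline from the usage sketch above.

```python
# Hypothetical templates, not the ones used in the actual audit
templates = [
    "{} is ruining this neighborhood.",
    "{} should just shut up already.",
]

# Compare predictions when only the pronoun changes
for template in templates:
    for pronoun in ("He", "She", "They"):
        pred = clf(template.format(pronoun))[0]
        print(f"{template.format(pronoun):45s} -> {pred['label']} ({pred['score']:.2f})")
```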
### Citation
If you use this model, please cite:
- Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. ICWSM 2017.
- William Radiyeh. DistilBERT Hate Speech Classifier (2025). https://huggingface.co/will-rads/distilbert-hatespeech-classifier
### Framework versions
- Transformers 4.51.3
- TensorFlow 2.18.0
- Datasets 3.6.0
- Tokenizers 0.21.1