---
pipeline_tag: text-classification
library_name: transformers
license: mit
language: en
tags:
  - transformers
  - tensorflow
  - distilbert
  - text-classification

# Widget examples shown on the model page:
widget:
  - text: "I love this community."
    example_title: "Positive Example"
  - text: "You are a terrible person and I wish you the worst."
    example_title: "Offensive Example"
  - text: "This is a completely neutral statement about clouds."
    example_title: "Neutral Example"
  - text: "Kill all of them, they don't belong in our country."
    example_title: "Hate Speech Example"

# Optional: results for the model card
model-index:
  - name: distilbert-hatespeech-classifier
    results:
      - task:
          type: text-classification
          name: Text Classification
        dataset:
          name: tdavidson/hate_speech_offensive
          type: tdavidson/hate_speech_offensive
        metrics:
          - name: Validation Accuracy
            type: accuracy
            value: 0.7137
          - name: Validation Loss
            type: loss
            value: 0.7337
---
# Ethical-Content-Moderation
Fine-Tuning DistilBERT for Ethical Content Moderation

## Live Demo
Try the model directly in your browser here:  
➡️ [Ethical Content Moderator Space](https://huggingface.co/spaces/will-rads/ethical-content-moderator)


## Model description

This model fine-tunes `distilbert-base-uncased` on the Davidson et al. (2017) hate speech and offensive language dataset, loaded from the Hugging Face Hub. The classifier predicts whether a tweet is:

- (a) hate speech
- (b) offensive but not hate speech
- (c) neither

It uses a frozen DistilBERT base with a custom dense classification head.

The architecture consists of three dense layers (256 → 128 → 32, LeakyReLU and Swish activations), with dropout and batch normalization to improve generalization.
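
The head can be reproduced approximately as follows. This is a minimal sketch, assuming the head sits on the final `[CLS]` hidden state; the dropout rates, exact layer ordering, and sequence length (`MAX_LEN`) are illustrative and may differ from the training configuration.

```python
import tensorflow as tf
from transformers import TFDistilBertModel

MAX_LEN = 128  # assumed maximum sequence length

# Frozen DistilBERT base: only the dense head below is trainable
base = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
base.trainable = False

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Use the [CLS] token's last hidden state as a fixed sentence representation
hidden = base(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0, :]

x = tf.keras.layers.Dense(256)(hidden)
x = tf.keras.layers.LeakyReLU()(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(128, activation="swish")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(32, activation="swish")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # hate / offensive / neither

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
```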


## Intended uses & limitations

**Intended uses**

- As a starting point for transfer learning in NLP and AI ethics projects
- Academic research on hate speech and offensive language detection
- As a fast, lightweight screening tool for moderating user-generated content (e.g., tweets, comments, reviews); a minimal inference sketch follows the limitations below

**Limitations**

- Not suitable for real-time production use without further robustness testing
- Trained on English Twitter data (2017); performance on other domains or languages may be poor
- Does not guarantee removal of all forms of bias or unfairness; see the Fairness & Bias section
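
A hedged example of how such screening could look, assuming the Keras model from the architecture sketch above (or one restored from saved weights), the standard `distilbert-base-uncased` tokenizer, and the Davidson label order (0 = hate speech, 1 = offensive language, 2 = neither). The `screen` helper is illustrative, not part of the released code.

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
LABELS = ["hate speech", "offensive language", "neither"]  # assumed label order

def screen(texts, model, max_len=128):
    """Tokenize a batch of texts and return (text, predicted label, confidence) triples."""
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="tf")
    probs = model.predict([enc["input_ids"], enc["attention_mask"]], verbose=0)
    return [(t, LABELS[int(np.argmax(p))], float(np.max(p)))
            for t, p in zip(texts, probs)]

# Example:
# screen(["I love this community.", "You are a terrible person."], model)
```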

## Training and evaluation data

- Dataset: Davidson et al., 2017 (24K+ English tweets, labeled as hate, offensive, or neither)
- Class distribution: imbalanced (majority: “offensive”; minority: “hate”)
- Split: 80% training, 20% validation (stratified)
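
A minimal loading and splitting sketch, assuming the Hub dataset's column names (`tweet`, `class`) and an arbitrary seed; the cast to `ClassLabel` is only there so the split can be stratified on the label column.

```python
from datasets import ClassLabel, load_dataset

ds = load_dataset("tdavidson/hate_speech_offensive", split="train")

# Make the label column a ClassLabel so train_test_split can stratify on it
ds = ds.cast_column("class", ClassLabel(names=["hate speech", "offensive language", "neither"]))

split = ds.train_test_split(test_size=0.2, stratify_by_column="class", seed=42)
train_ds, val_ds = split["train"], split["test"]
```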


## Training procedure

- Frozen base: DistilBERT transformer weights are frozen; only the dense classifier head is trained
- Loss: sparse categorical crossentropy
- Optimizer: Adam (learning rate = 3e-5)
- Batch size: 16
- Class weighting: used to compensate for class imbalance (higher weight for “hate”)
- Early stopping: custom callback that halts training once val_accuracy ≥ 0.92 (not reached in this run; training completed all 15 epochs)
- Hardware: Google Colab (Tesla T4 GPU)
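
A minimal sketch of this setup, assuming `model` is the Keras classifier from the architecture sketch above and that `train_inputs`/`train_labels` and `val_inputs`/`val_labels` are already tokenized arrays. The class-weight values are illustrative; only the heavier weight on “hate” follows the description.

```python
import tensorflow as tf

class StopAtValAccuracy(tf.keras.callbacks.Callback):
    """Custom early stopping: halt training once val_accuracy reaches a target."""
    def __init__(self, target=0.92):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("val_accuracy", 0.0) >= self.target:
            self.model.stop_training = True

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Illustrative class weights: heavier penalty on the minority "hate" class (index 0)
class_weight = {0: 5.0, 1: 1.0, 2: 1.5}

model.fit(
    train_inputs, train_labels,
    validation_data=(val_inputs, val_labels),
    epochs=15,
    batch_size=16,
    class_weight=class_weight,
    callbacks=[StopAtValAccuracy(target=0.92)],
)
```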

### Training hyperparameters

The following hyperparameters were used during training:
- optimizer: {'name': 'Adam', 'weight_decay': None, 'clipnorm': None, 'global_clipnorm': None, 'clipvalue': None, 'use_ema': False, 'ema_momentum': 0.99, 'ema_overwrite_frequency': None, 'jit_compile': True, 'is_legacy_optimizer': False, 'learning_rate': 3e-05, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False}
- training_precision: float32

### Training results

| Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | Epoch |
|:----------:|:--------------:|:---------------:|:-------------------:|:-----:|
| 1.4634     | 0.4236         | 0.9268          | 0.6454              | 1     |
| 1.1659     | 0.5067         | 0.9578          | 0.6480              | 2     |
| 1.0965     | 0.5388         | 0.8224          | 0.7043              | 3     |
| 1.0026     | 0.5667         | 0.8131          | 0.7051              | 4     |
| 0.9948     | 0.5817         | 0.8264          | 0.6940              | 5     |
| 0.9631     | 0.5921         | 0.7893          | 0.7111              | 6     |
| 0.9431     | 0.6009         | 0.7725          | 0.7252              | 7     |
| 0.9019     | 0.6197         | 0.8177          | 0.7049              | 8     |
| 0.8790     | 0.6247         | 0.7408          | 0.7351              | 9     |
| 0.8578     | 0.6309         | 0.7786          | 0.7176              | 10    |
| 0.8275     | 0.6455         | 0.7387          | 0.7331              | 11    |
| 0.8530     | 0.6411         | 0.7253          | 0.7273              | 12    |
| 0.8197     | 0.6506         | 0.7430          | 0.7293              | 13    |
| 0.8145     | 0.6549         | 0.7535          | 0.7162              | 14    |
| 0.8081     | 0.6631         | 0.7207          | 0.7402              | 15    |

### Best validation accuracy
0.7402, reached at epoch 15

### Environmental Impact
Training emissions: estimated at 0.0273 kg CO₂ (measured with CodeCarbon on a Colab T4 GPU)
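
The estimate can be reproduced with CodeCarbon roughly as follows; this is a sketch, and the exact tracker configuration used for the reported figure is an assumption.

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()   # writes emissions.csv by default
tracker.start()
# model.fit(...)               # the training run being measured
emissions_kg = tracker.stop()  # estimated kg CO2-eq for the tracked block
print(f"Estimated training emissions: {emissions_kg:.4f} kg CO2-eq")
```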

### Fairness & Bias

Bias/fairness audit:
The model was evaluated on synthetic gender pronoun tests and showed relatively balanced outputs, but biases may remain due to dataset limitations. 
See Appendix B of the project report for details.
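
For illustration only, a pronoun-swap check in the spirit of that audit might look like the sketch below; the template sentences and the `screen` helper from the inference example are assumptions, not the actual test set from Appendix B.

```python
# Hypothetical pronoun-swap probe: compare predictions on otherwise identical sentences
templates = [
    "{} is a wonderful colleague.",
    "{} should not be allowed to speak here.",
]

def pronoun_swap_report(model, screen_fn, pronouns=("He", "She")):
    """Print the predicted label and confidence for each pronoun variant of each template."""
    for template in templates:
        results = [screen_fn([template.format(p)], model)[0] for p in pronouns]
        by_pronoun = {p: (label, round(conf, 3)) for p, (_, label, conf) in zip(pronouns, results)}
        print(template, "->", by_pronoun)
```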

### If you use this model, please cite:

- Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. ICWSM 2017.
- Radiyeh, W. (2025). DistilBERT Hate Speech Classifier. https://huggingface.co/will-rads/distilbert-hatespeech-classifier


### Framework versions

- Transformers 4.51.3
- TensorFlow 2.18.0
- Datasets 3.6.0
- Tokenizers 0.21.1