---
language: en
license: mit
library_name: transformers
tags:
- text-classification
- hate-speech
- offensive-language
- distilbert
- tensorflow
pipeline_tag: text-classification
widget:
- text: "I love this beautiful day, it's fantastic!"
  example_title: "Positive Example"
- text: "You are a terrible person and I wish you the worst."
  example_title: "Offensive Example"
- text: "This is a completely neutral statement about clouds."
  example_title: "Neutral Example"
- text: "Kill all of them, they don't belong in our country."
  example_title: "Hate Speech Example"
model-index:
- name: distilbert-hatespeech-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Hate Speech and Offensive Language (Davidson et al., 2017)
      type: tdavidson/hate_speech_offensive
    metrics:
    - name: Validation Accuracy
      type: accuracy
      value: 0.7402
    - name: Validation Loss
      type: loss
      value: 0.7207
---
# Ethical-Content-Moderation
Fine-Tuning DistilBERT for Ethical Content Moderation
## Model description
This model fine-tunes `distilbert-base-uncased` on the Davidson et al. (2017) hate speech and offensive language dataset, loaded from the Hugging Face Hub. The classifier predicts whether a tweet is:
- (a) hate speech
- (b) offensive but not hate speech
- (c) neither

It uses a frozen DistilBERT base with a custom dense classification head: three dense layers (256 → 128 → 32, with LeakyReLU and Swish activations), plus dropout and batch normalization to improve generalization.
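The head is not published as standalone code in this card; below is a minimal Keras sketch of the architecture described above, assuming a 128-token input length and illustrative dropout rates.

```python
import tensorflow as tf
from transformers import TFDistilBertModel

MAX_LEN = 128  # assumed maximum sequence length

# Frozen DistilBERT base: only the dense head below is trained.
base = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
base.trainable = False

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Use the [CLS] token representation as the pooled sentence embedding.
sequence_output = base(input_ids, attention_mask=attention_mask)[0]
cls_token = sequence_output[:, 0, :]

# Dense head: 256 -> 128 -> 32, LeakyReLU and Swish, dropout + batch norm.
x = tf.keras.layers.Dense(256)(cls_token)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.LeakyReLU()(x)
x = tf.keras.layers.Dropout(0.3)(x)  # dropout rate is an assumption

x = tf.keras.layers.Dense(128, activation="swish")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(32, activation="swish")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # hate / offensive / neither

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
```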
## Intended uses & limitations
### Intended uses
- A starting point for transfer learning in NLP and AI ethics projects
- Academic research on hate speech and offensive language detection
- A fast, lightweight screening tool for moderating user-generated content (e.g., tweets, comments, reviews); see the inference sketch below

### Limitations
- Not suitable for real-time production use without further robustness testing
- Trained on English Twitter data (2017); performance on other domains or languages may be poor
- Does not guarantee removal of all forms of bias or unfairness; see the Fairness & Bias section
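A minimal inference sketch, assuming the hosted checkpoint works with the standard transformers text-classification pipeline (the returned label names depend on the model's config):

```python
from transformers import pipeline

# Load the published checkpoint through the TensorFlow pipeline.
classifier = pipeline(
    "text-classification",
    model="will-rads/distilbert-hatespeech-classifier",
    framework="tf",
)

print(classifier("This is a completely neutral statement about clouds."))
# e.g. [{'label': '...', 'score': ...}] -- label mapping: hate / offensive / neither
```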
## Training and evaluation data
- Dataset: Davidson et al., 2017 (24K+ English tweets, labeled as hate, offensive, or neither), loaded as `tdavidson/hate_speech_offensive` from the Hugging Face Hub
- Class distribution: imbalanced (majority: "offensive"; minority: "hate")
- Split: 80% training, 20% validation (stratified); see the sketch below
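A minimal sketch of how this split could be reproduced, assuming the Hub dataset exposes `tweet` and `class` columns (verify against the dataset card) and using scikit-learn's stratified `train_test_split`:

```python
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Load the Davidson et al. dataset from the Hugging Face Hub.
ds = load_dataset("tdavidson/hate_speech_offensive", split="train")

texts = ds["tweet"]
labels = ds["class"]  # 0 = hate speech, 1 = offensive, 2 = neither

# 80/20 split, stratified on the label to preserve class proportions.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```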
## Training procedure
- Frozen base: DistilBERT transformer weights frozen; only the dense classifier head is trained
- Loss: sparse categorical cross-entropy
- Optimizer: Adam (learning rate = 3e-5)
- Batch size: 16
- Class weighting: used to compensate for class imbalance (higher weight for "hate")
- Early stopping: custom callback that halts training once val_accuracy ≥ 0.92 (see the sketch after this list)
- Hardware: Google Colab (Tesla T4 GPU)
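The exact training script is not included in this card; the following sketch illustrates the setup listed above, assuming tokenized inputs (`train_inputs`, `val_inputs`) have been prepared elsewhere and reusing the `model` and labels from the earlier sketches.

```python
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight


class StopAtValAccuracy(tf.keras.callbacks.Callback):
    """Custom early stopping: halt training once val_accuracy reaches a target."""

    def __init__(self, target=0.92):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("val_accuracy", 0.0) >= self.target:
            self.model.stop_training = True


# Class weights to compensate for imbalance (gives the minority "hate" class a higher weight).
weights = compute_class_weight(
    "balanced", classes=np.unique(train_labels), y=train_labels
)
class_weights = dict(enumerate(weights))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(
    train_inputs,                      # dict of input_ids / attention_mask arrays (not shown)
    np.array(train_labels),
    validation_data=(val_inputs, np.array(val_labels)),
    epochs=15,
    batch_size=16,
    class_weight=class_weights,
    callbacks=[StopAtValAccuracy(0.92)],
)
```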
### Training hyperparameters
The following hyperparameters were used during training:
- optimizer: Adam (learning_rate = 3e-05, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-07, amsgrad = False, weight_decay = None, clipnorm = None, global_clipnorm = None, clipvalue = None, use_ema = False, ema_momentum = 0.99, ema_overwrite_frequency = None, jit_compile = True, is_legacy_optimizer = False)
- training_precision: float32
### Training results
| Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | Epoch |
|:----------:|:--------------:|:---------------:|:-------------------:|:-----:|
| 1.4634 | 0.4236 | 0.9268 | 0.6454 | 1 |
| 1.1659 | 0.5067 | 0.9578 | 0.6480 | 2 |
| 1.0965 | 0.5388 | 0.8224 | 0.7043 | 3 |
| 1.0026 | 0.5667 | 0.8131 | 0.7051 | 4 |
| 0.9948 | 0.5817 | 0.8264 | 0.6940 | 5 |
| 0.9631 | 0.5921 | 0.7893 | 0.7111 | 6 |
| 0.9431 | 0.6009 | 0.7725 | 0.7252 | 7 |
| 0.9019 | 0.6197 | 0.8177 | 0.7049 | 8 |
| 0.8790 | 0.6247 | 0.7408 | 0.7351 | 9 |
| 0.8578 | 0.6309 | 0.7786 | 0.7176 | 10 |
| 0.8275 | 0.6455 | 0.7387 | 0.7331 | 11 |
| 0.8530 | 0.6411 | 0.7253 | 0.7273 | 12 |
| 0.8197 | 0.6506 | 0.7430 | 0.7293 | 13 |
| 0.8145 | 0.6549 | 0.7535 | 0.7162 | 14 |
| 0.8081 | 0.6631 | 0.7207 | 0.7402 | 15 |
### Best validation accuracy:
0.7402 at epoch 15
### Environmental Impact
Training emissions: estimated at 0.0273 kg CO₂ (measured with CodeCarbon on a Colab T4 GPU).
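For reference, this kind of estimate can be produced with CodeCarbon roughly as follows (illustrative; the exact tracker configuration for this run is not given in the card):

```python
from codecarbon import EmissionsTracker

# Wrap the training run with an emissions tracker.
tracker = EmissionsTracker(project_name="distilbert-hatespeech-classifier")
tracker.start()

# ... model.fit(...) as in the training sketch above ...

emissions_kg = tracker.stop()  # estimated emissions in kg CO2eq
print(f"Estimated training emissions: {emissions_kg:.4f} kg CO2eq")
```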
### Fairness & Bias
The model was evaluated on synthetic gender-pronoun tests and showed relatively balanced outputs, but biases may remain due to dataset limitations. See Appendix B of the project report for details.
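An illustrative pronoun-swap probe (not the actual test set from Appendix B) can reuse the `classifier` pipeline from the inference sketch above:

```python
# Compare predictions for otherwise identical sentences that differ only in pronoun.
templates = [
    "{} is a wonderful person.",
    "{} should not be allowed to speak here.",
]
pronouns = ["He", "She"]

for template in templates:
    for pronoun in pronouns:
        text = template.format(pronoun)
        print(text, "->", classifier(text)[0])
```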
### If you use this model, please cite:
- Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. *ICWSM 2017*.
- Radiyeh, W. (2025). DistilBERT Hate Speech Classifier. https://huggingface.co/will-rads/distilbert-hatespeech-classifier
### Framework versions
- Transformers 4.51.3
- TensorFlow 2.18.0
- Datasets 3.6.0
- Tokenizers 0.21.1