Turkish Toxic Language Detection Model
This model is a fine-tuned version of dbmdz/bert-base-turkish-cased for binary toxicity classification in Turkish text. It was trained on a cleaned and preprocessed version of the Overfit-GM/turkish-toxic-language dataset.
Performance
Metric | Non-Toxic | Toxic | Macro Avg |
---|---|---|---|
Precision | 0.96 | 0.95 | 0.96 |
Recall | 0.95 | 0.96 | 0.96 |
F1-score | 0.96 | 0.96 | 0.96 |
Accuracy (overall) | 0.96 | | |
Test Samples | 5400 | 5414 | 10814 |
Confusion Matrix
Actual class | Pred: Non-Toxic | Pred: Toxic |
---|---|---|
True: Non-Toxic | 5154 | 246 |
True: Toxic | 200 | 5214 |
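The per-class figures in the performance table can be re-derived from the confusion matrix. The following is a minimal sketch in plain Python; the variable and function names are illustrative and not taken from the original training code.

```python
# Confusion-matrix counts copied from the table above.
tn, fp = 5154, 246   # true Non-Toxic: predicted Non-Toxic / predicted Toxic
fn, tp = 200, 5214   # true Toxic:     predicted Non-Toxic / predicted Toxic


def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Treat each class in turn as the "positive" class.
toxic = prf(tp, fp, fn)
non_toxic = prf(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
macro = tuple((a + b) / 2 for a, b in zip(non_toxic, toxic))

for name, values in [("Non-Toxic", non_toxic), ("Toxic", toxic), ("Macro avg", macro)]:
    print(name, [round(v, 2) for v in values])
print("Accuracy", round(accuracy, 2))
```

Rounded to two decimals, these values agree with the reported table.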
Preprocessing Details (cleaned_corrected_text)
The model is trained on the cleaned_corrected_text column, which is derived from corrected_text using basic regex-based cleaning steps and manual slang filtering. Here's how:
Cleaning Function
import re

def clean_corrected_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # URL removal
    text = re.sub(r"@\w+", '', text)  # remove @mentions
    text = re.sub(r"[^\w\s.,!?-]", '', text)  # remove special characters (e.g., emojis)
    text = re.sub(r"\s+", ' ', text).strip()  # normalize whitespace
    return text
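To see what each cleaning step does, the same patterns can be applied one at a time. The sketch below is illustrative only: `trace_cleaning`, the `STEPS` table, and the sample sentence are made up for this example. It omits the `re.MULTILINE` flag, which has no effect here because the URL pattern contains no `^` or `$` anchors.

```python
import re

# Same patterns as clean_corrected_text, listed as (step name, pattern, replacement)
# so that each intermediate result can be inspected.
STEPS = [
    ("url", r"http\S+|www\S+|https\S+", ""),
    ("mention", r"@\w+", ""),
    ("special", r"[^\w\s.,!?-]", ""),
    ("whitespace", r"\s+", " "),
]


def trace_cleaning(text):
    """Apply the cleaning steps one by one and return every intermediate string."""
    text = text.lower()
    trace = [("lowercase", text)]
    for name, pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
        trace.append((name, text))
    # The final strip mirrors the .strip() call in the original function.
    trace.append(("strip", text.strip()))
    return trace


sample = "Bak @ahmet https://example.com/x ÇOK kötü 😡!!"  # made-up input
for step, value in trace_cleaning(sample):
    print(f"{step:>10}: {value!r}")
```

For this input the final string is `'bak çok kötü !!'`: the mention, URL, and emoji are gone, while the punctuation kept by the character class (`.,!?-`) survives.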
Manual Slang Filtering
slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]

def remove_slang(text):
    # Plain substring replacement: matches inside longer words are removed too
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()
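Because `str.replace` matches substrings, `remove_slang` also deletes these letter sequences inside longer words (for example, the "la" in "selam"). The sketch below reproduces that behavior and contrasts it with a word-boundary variant. `remove_slang_whole_words` is a hypothetical alternative, not part of the original preprocessing. Since the model was trained on text produced by the substring version, switching variants at inference time may move inputs away from the training distribution.

```python
import re

slang_words = ["kanka", "lan", "knk", "bro", "la", "birader", "kanki"]


def remove_slang(text):
    """Original behavior: plain substring replacement."""
    for word in slang_words:
        text = text.replace(word, "")
    return text.strip()


# Hypothetical alternative: match slang only as whole words.
_SLANG_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, slang_words)) + r")\b")


def remove_slang_whole_words(text):
    """Remove slang tokens only when they stand alone, then tidy whitespace."""
    text = _SLANG_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "selam lan nasılsın"  # made-up input
print(repr(remove_slang(sample)))              # 'sem  nasılsın'
print(repr(remove_slang_whole_words(sample)))  # 'selam nasılsın'
```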
Applied Steps Summary
Step | Description |
---|---|
Lowercasing | All text is converted to lowercase |
URL removal | Removes links containing http, www, https |
Mention removal | Removes @username style mentions |
Special character removal | Removes emojis and symbols (*, %, $, ^, etc.) |
Whitespace normalization | Collapses multiple spaces into one |
Slang word removal | Removes common informal words like "kanka", "lan", etc. |
Conclusion: cleaned_corrected_text is a lightly cleaned text column with no linguistic processing applied. The model is trained directly on it.
Example Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("fc63/turkish_toxic_language_detection_model")
model = AutoModelForSequenceClassification.from_pretrained("fc63/turkish_toxic_language_detection_model")

def predict_toxicity(text):
    # Tokenize, truncating or padding the input to 128 tokens
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    # Label 1 = Toxic, label 0 = Non-Toxic
    predicted = torch.argmax(outputs.logits, dim=1).item()
    return "Toxic" if predicted == 1 else "Non-Toxic"
Training Details
- Trainer: Hugging Face Trainer API
- Epochs: 3
- Batch size: 16
- Learning Rate: 2e-5
- Eval Strategy: Epoch-based
- Undersampling: Applied to balance class distribution
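The card does not say how undersampling was implemented. One common approach is random undersampling: keep every example of the smallest class and draw an equally sized random subset from each larger class. The following is a minimal sketch under that assumption; the function name, the `(text, label)` row format, and the seed are illustrative.

```python
import random
from collections import defaultdict


def undersample(rows, seed=42):
    """Randomly downsample every class to the size of the smallest class.

    `rows` is a list of (text, label) pairs; this format is an assumption.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducibility

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Toy data: 5 non-toxic (0) and 2 toxic (1) examples.
rows = [(f"text {i}", 0) for i in range(5)] + [("bad 1", 1), ("bad 2", 1)]
balanced = undersample(rows)
print(len(balanced))  # 4: two examples per class
```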
Dataset
Dataset used: Overfit-GM/turkish-toxic-language
Final dataset size after preprocessing and balancing: 54068 samples
Evaluation results
- Accuracy on Turkish Toxic Language Dataset (self-reported): 0.960
- F1 on Turkish Toxic Language Dataset (self-reported): 0.960
- Precision on Turkish Toxic Language Dataset (self-reported): 0.960
- Recall on Turkish Toxic Language Dataset (self-reported): 0.960