Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +111 -3
- config.json +48 -0
- model.safetensors +3 -0
- special_tokens_map.json +51 -0
- tokenizer.json +3 -0
- tokenizer_config.json +62 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
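The added rule routes `tokenizer.json` through Git LFS. As a rough illustration of which paths the four rules visible in this hunk would catch, the sketch below approximates gitattributes matching with Python's `fnmatch` applied to the file's basename. This is a simplification (real gitattributes patterns follow gitignore-style rules), `tracked_by_lfs` is a hypothetical helper, and rules outside the shown hunk are not modeled.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Patterns visible in this hunk of .gitattributes; earlier rules are not shown.
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "tokenizer.json"]

def tracked_by_lfs(path, patterns=LFS_PATTERNS):
    """Approximate check: does the basename match any LFS pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)

for candidate in ["tokenizer.json", "tokenizer_config.json", "config.json"]:
    print(candidate, tracked_by_lfs(candidate))
```

Only the exact name `tokenizer.json` matches the new rule, so the smaller JSON files in this commit stay as regular Git blobs under the rules shown here.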
README.md
CHANGED
@@ -1,3 +1,111 @@
----
-
-
+---
+language:
+- en
+- sr
+license: mit
+library_name: transformers
+tags:
+- hate-speech-detection
+- text-classification
+- multilingual
+- xlm-roberta
+- serbian
+- english
+- pytorch
+datasets:
+- hate-speech
+pipeline_tag: text-classification
+widget:
+- text: "I really enjoyed that movie last night!"
+  example_title: "Appropriate Content"
+- text: "You people are all the same, causing problems everywhere."
+  example_title: "Hate Speech Example"
+- text: "Ovaj film je bio odličan!"
+  example_title: "Serbian Appropriate"
+---
+
+# Multilingual Hate Speech Detector (XLM-RoBERTa)
+
+## Model Description
+
+This is a fine-tuned XLM-RoBERTa model for multilingual hate speech detection, specifically trained on English and Serbian text. The model classifies text into 8 categories:
+
+- **Race**: Racial discrimination and slurs
+- **Sexual Orientation**: Homophobic content, LGBTQ+ discrimination
+- **Gender**: Sexist content, misogyny, gender-based harassment
+- **Physical Appearance**: Body shaming, lookism, appearance-based harassment
+- **Religion**: Religious discrimination, islamophobia, antisemitism
+- **Class**: Classist content, economic discrimination
+- **Disability**: Ableist content, discrimination against disabled people
+- **Appropriate**: Non-hateful, normal conversation
+
+## Languages Supported
+
+- **English**: Comprehensive hate speech detection
+- **Serbian**: Native Serbian language support (Cyrillic and Latin scripts)
+
+## Usage
+
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("sadjava/multilingual-hate-speech-xlm-roberta")
+model = AutoModelForSequenceClassification.from_pretrained("sadjava/multilingual-hate-speech-xlm-roberta")
+
+# Example prediction
+text = "Your text here"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+# Categories
+categories = ["Race", "Sexual Orientation", "Gender", "Physical Appearance",
+              "Religion", "Class", "Disability", "Appropriate"]
+
+# Get predicted category
+predicted_class = torch.argmax(predictions, dim=-1).item()
+predicted_category = categories[predicted_class]
+confidence = float(predictions[0][predicted_class])
+
+print(f"Category: {predicted_category}")
+print(f"Confidence: {confidence:.2%}")
+```
+
+## Training Data
+
+The model was fine-tuned on multilingual hate speech datasets including:
+- English hate speech datasets
+- Serbian hate speech datasets
+- Augmented examples for better multilingual performance
+
+## Performance
+
+- **Accuracy**: High-confidence predictions with detailed explanations
+- **Languages**: English and Serbian with cross-lingual capabilities
+- **Categories**: 8-class classification including appropriate content
+
+## Ethical Considerations
+
+This model is designed for research and educational purposes. Results should be interpreted carefully and human judgment should always be applied for critical decisions. The system is designed to assist, not replace, human moderation.
+
+## Citation
+
+If you use this model, please cite:
+
+```bibtex
+@misc{multilingual-hate-speech-xlm-roberta,
+  author = {sadjava},
+  title = {Multilingual Hate Speech Detector},
+  year = {2024},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/sadjava/multilingual-hate-speech-xlm-roberta}
+}
+```
+
+## Demo
+
+Try the interactive demo: [Multilingual Hate Speech Detector Space](https://huggingface.co/spaces/sadjava/multilingual-hate-speech-detector)
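The usage snippet in the README hard-codes the category list; its order matches `id2label` in this commit's `config.json`. The decoding step it performs (softmax, then argmax, then label lookup) can be sketched without downloading the model, as below. `decode` and the sample logits are hypothetical and only illustrate the arithmetic; with a loaded model, `model.config.id2label` should provide the same mapping.

```python
import math

# Label order copied from id2label in config.json (this commit).
ID2LABEL = {
    0: "Race", 1: "Sexual Orientation", 2: "Gender", 3: "Physical Appearance",
    4: "Religion", 5: "Class", 6: "Disability", 7: "Appropriate",
}

def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Made-up logits for illustration only, not real model output.
label, confidence = decode([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0])
print(f"Category: {label}")
print(f"Confidence: {confidence:.2%}")
```

Reading labels from the config rather than a hand-written list avoids silent mislabeling if the class order ever changes between checkpoints.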
config.json
ADDED
@@ -0,0 +1,48 @@
+{
+  "architectures": [
+    "XLMRobertaForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "Race",
+    "1": "Sexual Orientation",
+    "2": "Gender",
+    "3": "Physical Appearance",
+    "4": "Religion",
+    "5": "Class",
+    "6": "Disability",
+    "7": "Appropriate"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "Appropriate": 7,
+    "Class": 5,
+    "Disability": 6,
+    "Gender": 2,
+    "Physical Appearance": 3,
+    "Race": 0,
+    "Religion": 4,
+    "Sexual Orientation": 1
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "xlm-roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "output_past": true,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.52.4",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 250002
+}
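`id2label` and `label2id` in the config above are meant to be inverses of each other, with class ids running from 0 to 7. A minimal sketch of that sanity check follows, using only the two mappings copied from this commit; `labels_are_consistent` is a hypothetical helper, not part of `transformers`.

```python
import json

# The two label mappings, copied from config.json in this commit.
CONFIG_FRAGMENT = json.loads("""
{
  "id2label": {
    "0": "Race", "1": "Sexual Orientation", "2": "Gender",
    "3": "Physical Appearance", "4": "Religion", "5": "Class",
    "6": "Disability", "7": "Appropriate"
  },
  "label2id": {
    "Appropriate": 7, "Class": 5, "Disability": 6, "Gender": 2,
    "Physical Appearance": 3, "Race": 0, "Religion": 4,
    "Sexual Orientation": 1
  }
}
""")

def labels_are_consistent(cfg):
    """True if ids are 0..n-1 and label2id is the exact inverse of id2label."""
    id2label = {int(key): value for key, value in cfg["id2label"].items()}
    ids_contiguous = sorted(id2label) == list(range(len(id2label)))
    inverse_ok = {value: key for key, value in id2label.items()} == cfg["label2id"]
    return ids_contiguous and inverse_ok

print(labels_are_consistent(CONFIG_FRAGMENT))
```

JSON object keys are always strings, which is why the ids are converted with `int` before comparison.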
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a110841fd526a364b21e8ffa8965531a43018b2ebcf7c3fca6510898ab90179e
+size 1112223464
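The three lines above are a Git LFS pointer, not the weights themselves: the repository stores only the spec version, the SHA-256 of the real file, and its size in bytes. The sketch below parses that pointer and derives a rough parameter count. `parse_lfs_pointer` is a hypothetical helper, and dividing by 4 assumes every weight is stored as `float32` (as `torch_dtype` in `config.json` suggests) and ignores the safetensors header, so the figure is a slight overestimate.

```python
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:a110841fd526a364b21e8ffa8965531a43018b2ebcf7c3fca6510898ab90179e
size 1112223464
"""

def parse_lfs_pointer(text):
    """Split 'key value' lines of a Git LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

info = parse_lfs_pointer(POINTER)
# Assumption: 4 bytes per parameter (float32); header bytes are not subtracted.
approx_params = info["size"] // 4
print(f"{info['size'] / 1e9:.2f} GB, roughly {approx_params / 1e6:.0f}M parameters")
```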
special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3c088c06cf975b7097e469bd69630cdb0d675c6db1ce3af1042b6e19c6d01f22
+size 17082999
tokenizer_config.json
ADDED
@@ -0,0 +1,62 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "250001": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "max_length": 128,
+  "model_max_length": 512,
+  "pad_to_multiple_of": null,
+  "pad_token": "<pad>",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "</s>",
+  "stride": 0,
+  "tokenizer_class": "XLMRobertaTokenizerFast",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "<unk>"
+}
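The tokenizer config sets `model_max_length` to 512 while `config.json` sets `max_position_embeddings` to 514. The two values are compatible: RoBERTa-style models offset position ids by the padding index, so the first usable position is `pad_token_id + 1`. A minimal sketch of that arithmetic, assuming the usual RoBERTa position-id convention applies to this checkpoint (`usable_positions` is a hypothetical helper):

```python
# Values copied from config.json and tokenizer_config.json in this commit.
model_config = {"max_position_embeddings": 514, "pad_token_id": 1}
tokenizer_config = {"max_length": 128, "model_max_length": 512}

def usable_positions(cfg):
    """Number of token positions left after the padding-index offset."""
    # Assumption: position ids start at pad_token_id + 1 (RoBERTa convention).
    return cfg["max_position_embeddings"] - cfg["pad_token_id"] - 1

budget = usable_positions(model_config)
print(budget, tokenizer_config["model_max_length"])
```

The stored `max_length` of 128 is well inside that budget; whether a given tokenizer call truncates at 128 or at 512 depends on the arguments passed at call time, so it is worth setting `max_length` explicitly when the limit matters.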