nahiar committed · Commit 35c7334 · verified · 1 Parent(s): cb03110

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,111 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - en
+ - sr
+ license: mit
+ library_name: transformers
+ tags:
+ - hate-speech-detection
+ - text-classification
+ - multilingual
+ - xlm-roberta
+ - serbian
+ - english
+ - pytorch
+ datasets:
+ - hate-speech
+ pipeline_tag: text-classification
+ widget:
+ - text: "I really enjoyed that movie last night!"
+   example_title: "Appropriate Content"
+ - text: "You people are all the same, causing problems everywhere."
+   example_title: "Hate Speech Example"
+ - text: "Ovaj film je bio odličan!"
+   example_title: "Serbian Appropriate"
+ ---
+
+ # Multilingual Hate Speech Detector (XLM-RoBERTa)
+
+ ## Model Description
+
+ This is a fine-tuned XLM-RoBERTa model for multilingual hate speech detection, specifically trained on English and Serbian text. The model classifies text into 8 categories:
+
+ - **Race**: Racial discrimination and slurs
+ - **Sexual Orientation**: Homophobic content, LGBTQ+ discrimination
+ - **Gender**: Sexist content, misogyny, gender-based harassment
+ - **Physical Appearance**: Body shaming, lookism, appearance-based harassment
+ - **Religion**: Religious discrimination, Islamophobia, antisemitism
+ - **Class**: Classist content, economic discrimination
+ - **Disability**: Ableist content, discrimination against disabled people
+ - **Appropriate**: Non-hateful, normal conversation
+
+ ## Languages Supported
+
+ - **English**: Comprehensive hate speech detection
+ - **Serbian**: Native Serbian language support (Cyrillic and Latin scripts)
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("sadjava/multilingual-hate-speech-xlm-roberta")
+ model = AutoModelForSequenceClassification.from_pretrained("sadjava/multilingual-hate-speech-xlm-roberta")
+
+ # Example prediction
+ text = "Your text here"
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+ # Categories
+ categories = ["Race", "Sexual Orientation", "Gender", "Physical Appearance",
+               "Religion", "Class", "Disability", "Appropriate"]
+
+ # Get predicted category
+ predicted_class = torch.argmax(predictions, dim=-1).item()
+ predicted_category = categories[predicted_class]
+ confidence = float(predictions[0][predicted_class])
+
+ print(f"Category: {predicted_category}")
+ print(f"Confidence: {confidence:.2%}")
+ ```
+
+ ## Training Data
+
+ The model was fine-tuned on multilingual hate speech datasets including:
+ - English hate speech datasets
+ - Serbian hate speech datasets
+ - Augmented examples for better multilingual performance
+
+ ## Performance
+
+ - **Accuracy**: High-confidence predictions with detailed explanations
+ - **Languages**: English and Serbian with cross-lingual capabilities
+ - **Categories**: 8-class classification including appropriate content
+
+ ## Ethical Considerations
+
+ This model is designed for research and educational purposes. Results should be interpreted carefully and human judgment should always be applied for critical decisions. The system is designed to assist, not replace, human moderation.
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{multilingual-hate-speech-xlm-roberta,
+   author = {sadjava},
+   title = {Multilingual Hate Speech Detector},
+   year = {2024},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/sadjava/multilingual-hate-speech-xlm-roberta}
+ }
+ ```
+
+ ## Demo
+
+ Try the interactive demo: [Multilingual Hate Speech Detector Space](https://huggingface.co/spaces/sadjava/multilingual-hate-speech-detector)
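
Note on the README usage snippet: it hard-codes the category list, which only works while that list stays in the same order as `id2label` in `config.json`. The decoding step can be sketched in plain Python, assuming the logits have already been extracted from the model as a list of floats. `decode_prediction`, `softmax`, and the sample logits are hypothetical names and values for illustration, not part of this commit.

```python
import math

# Label map copied from the id2label block of config.json in this commit.
ID2LABEL = {
    0: "Race", 1: "Sexual Orientation", 2: "Gender", 3: "Physical Appearance",
    4: "Religion", 5: "Class", 6: "Disability", 7: "Appropriate",
}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_prediction(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Made-up logits for one input; real values would come from
# model(**inputs).logits[0].tolist() in the README snippet.
example_logits = [-1.2, -0.8, 0.1, -2.0, -0.5, -1.7, -1.1, 3.4]
label, confidence = decode_prediction(example_logits)
print(f"Category: {label}")
print(f"Confidence: {confidence:.2%}")
```

When the model is loaded through `transformers`, `model.config.id2label` exposes the same mapping, so the lookup can read from the config instead of a separate list.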
config.json ADDED
@@ -0,0 +1,48 @@
+ {
+   "architectures": [
+     "XLMRobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "Race",
+     "1": "Sexual Orientation",
+     "2": "Gender",
+     "3": "Physical Appearance",
+     "4": "Religion",
+     "5": "Class",
+     "6": "Disability",
+     "7": "Appropriate"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "Appropriate": 7,
+     "Class": 5,
+     "Disability": 6,
+     "Gender": 2,
+     "Physical Appearance": 3,
+     "Race": 0,
+     "Religion": 4,
+     "Sexual Orientation": 1
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.4",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
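
Note on `config.json`: `id2label` and `label2id` are stored as two separate maps, so a hand edit to one can silently break the other. A small consistency check can be sketched as follows; `check_label_maps` is a hypothetical helper, and the JSON fragment is copied from the hunk above rather than read from disk.

```python
import json

# id2label / label2id copied from the config.json added in this commit.
CONFIG_FRAGMENT = """
{
  "id2label": {
    "0": "Race", "1": "Sexual Orientation", "2": "Gender",
    "3": "Physical Appearance", "4": "Religion", "5": "Class",
    "6": "Disability", "7": "Appropriate"
  },
  "label2id": {
    "Appropriate": 7, "Class": 5, "Disability": 6, "Gender": 2,
    "Physical Appearance": 3, "Race": 0, "Religion": 4,
    "Sexual Orientation": 1
  }
}
"""

def check_label_maps(config):
    """Return the number of labels; raise ValueError if the two maps disagree."""
    # JSON object keys are always strings, so convert the ids back to ints.
    id2label = {int(k): v for k, v in config["id2label"].items()}
    label2id = config["label2id"]
    if sorted(id2label) != list(range(len(id2label))):
        raise ValueError("label ids are not contiguous from 0")
    if {name: idx for idx, name in id2label.items()} != label2id:
        raise ValueError("label2id is not the inverse of id2label")
    return len(id2label)

num_labels = check_label_maps(json.loads(CONFIG_FRAGMENT))
print(f"{num_labels} labels, maps are consistent")
```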
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a110841fd526a364b21e8ffa8965531a43018b2ebcf7c3fca6510898ab90179e
+ size 1112223464
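
Note on `model.safetensors`: the three lines above are a Git LFS pointer, not the weights themselves. A rough sketch of reading such a pointer and turning its `size` into a parameter estimate follows. It assumes every weight is stored as 4-byte float32 (matching `torch_dtype` in `config.json`) and ignores the small safetensors header, so the result is an approximation; `parse_lfs_pointer` is a hypothetical helper.

```python
# Pointer text copied from the model.safetensors hunk in this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:a110841fd526a364b21e8ffa8965531a43018b2ebcf7c3fca6510898ab90179e
size 1112223464
"""

def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }

pointer = parse_lfs_pointer(POINTER)

# Assumption: float32 weights (4 bytes each), header overhead ignored.
BYTES_PER_FLOAT32 = 4
approx_params = pointer["size"] // BYTES_PER_FLOAT32

print(f"file size: {pointer['size'] / 1e9:.2f} GB")
print(f"approx. parameters: {approx_params / 1e6:.0f}M")
```

Under those assumptions the estimate comes out at roughly 278 million parameters.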
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3c088c06cf975b7097e469bd69630cdb0d675c6db1ce3af1042b6e19c6d01f22
+ size 17082999
tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 128,
+   "model_max_length": 512,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "XLMRobertaTokenizerFast",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }
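
Note on sequence lengths: three length-related numbers appear in this commit: `max_position_embeddings: 514` and `pad_token_id: 1` in `config.json`, and `model_max_length: 512` plus `max_length: 128` in `tokenizer_config.json`. RoBERTa-style models start position ids just after the padding index, which is why 514 position embeddings leave room for 512 tokens. The sketch below shows how the numbers relate; both function names are hypothetical, and reading 128 as the length used during fine-tuning is an assumption the commit does not confirm.

```python
def position_budget(max_position_embeddings, pad_token_id):
    """Tokens that fit when position ids start at pad_token_id + 1."""
    return max_position_embeddings - pad_token_id - 1

def effective_max_length(requested, model_max_length, budget):
    """Clamp a requested truncation length to what the model can accept."""
    limit = min(model_max_length, budget)
    return limit if requested is None else min(requested, limit)

# Values copied from config.json and tokenizer_config.json in this commit.
budget = position_budget(max_position_embeddings=514, pad_token_id=1)
print(budget)

# Assumption: 128 mirrors the training-time truncation length.
print(effective_max_length(128, model_max_length=512, budget=budget))
print(effective_max_length(None, model_max_length=512, budget=budget))
```

If 128 does match the training setup, passing `max_length=128` alongside `truncation=True` in the README snippet would keep inference inputs in the same length range.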