AnnyNguyen committed 2609a1b (verified, parent 1de244e): Create README.md

Files changed (1): README.md (new file, +81 lines)
---
language: vi
tags:
- hate-speech-detection
- vietnamese
- transformer
license: apache-2.0
datasets:
- visolex/ViHOS
metrics:
- precision
- recall
- f1
model-index:
- name: visobert-hsd-span
  results:
  - task:
      type: token-classification
      name: Hate Speech Span Detection
    dataset:
      name: ViHOS
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: <INSERT_PRECISION>
    - name: Recall
      type: recall
      value: <INSERT_RECALL>
    - name: F1 Score
      type: f1
      value: <INSERT_F1>
base_model:
- uitnlp/visobert
pipeline_tag: token-classification
---

# ViSoBERT-HSD-Span

This model is fine-tuned from [`uitnlp/visobert`](https://huggingface.co/uitnlp/visobert) on the **visolex/ViHOS** dataset for span-level detection of hate and offensive speech in Vietnamese comments.

## Model Details

* **Base Model**: [`uitnlp/visobert`](https://huggingface.co/uitnlp/visobert)
* **Dataset**: [visolex/ViHOS](https://huggingface.co/datasets/visolex/ViHOS) (see the loading sketch below)
* **Fine-tuning**: Hugging Face Transformers
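
The card does not show how the dataset is fetched; the snippet below is a minimal sketch using the `datasets` library. The split names and column layout are assumptions; check the ViHOS dataset card for the actual structure.

```python
from datasets import load_dataset

# Sketch only: assumes visolex/ViHOS loads with load_dataset and exposes a "train" split;
# inspect the printed DatasetDict for the real split names and label columns.
vihos = load_dataset("visolex/ViHOS")
print(vihos)              # available splits and columns
print(vihos["train"][0])  # one raw example
```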

### Hyperparameters

* Batch size: `16`
* Learning rate: `5e-5`
* Epochs: `100`
* Max sequence length: `128`
* Early stopping patience: `5`
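
The training script itself is not part of this card; the sketch below shows one way these hyperparameters map onto Hugging Face `TrainingArguments` with early stopping. The dataset objects (`train_ds`, `eval_ds`), the label count, and the output directory are assumptions, not details taken from the original run.

```python
from transformers import (
    AutoModelForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Assumption: binary token labels (inside/outside a hate span).
model = AutoModelForTokenClassification.from_pretrained("uitnlp/visobert", num_labels=2)

args = TrainingArguments(
    output_dir="visobert-hsd-span",   # assumed output path
    per_device_train_batch_size=16,   # Batch size: 16
    learning_rate=5e-5,               # Learning rate: 5e-5
    num_train_epochs=100,             # Epochs: 100, capped by early stopping
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # assumed: tokenized ViHOS train split (max_length=128, labels aligned to tokens)
    eval_dataset=eval_ds,    # assumed: tokenized ViHOS dev split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # Early stopping patience: 5
)
trainer.train()
```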

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("visolex/visobert-hsd-span")
model = AutoModelForTokenClassification.from_pretrained("visolex/visobert-hsd-span")

text = "Nói cái lol . t thấy thô tục vl"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits  # [batch, seq_len, num_labels]

# Binary span detection: sigmoid + threshold (use softmax + argmax for a multi-class head)
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long().squeeze(0).tolist()  # [seq_len, num_labels]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# One span label per token
span_labels = [p[0] for p in preds]

# Keep tokens predicted as part of a span, dropping the special tokens <s> and </s>
span_tokens = [
    token
    for token, label in zip(tokens, span_labels)
    if label == 1 and token not in ["<s>", "</s>"]
]

print("Span tokens:", span_tokens)
print("Span text:", tokenizer.convert_tokens_to_string(span_tokens))
```
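
The snippet above returns subword tokens, which can be hard to read back against the original comment. Assuming the checkpoint ships a fast (Rust-backed) tokenizer, the following sketch maps the predicted span labels onto character offsets in `text`; it reuses `span_labels` from the usage example and is not part of the original card.

```python
# Sketch: recover character-level spans via offset mapping (fast tokenizer required).
enc = tokenizer(text, return_offsets_mapping=True, truncation=True, padding=True)
offsets = enc["offset_mapping"]  # (start, end) character positions; special tokens map to (0, 0)

char_spans = [
    (start, end)
    for (start, end), label in zip(offsets, span_labels)
    if label == 1 and end > start  # skip special tokens
]

print("Character spans:", char_spans)
print("Span texts:", [text[s:e] for s, e in char_spans])
```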