AnnyNguyen committed 2609a1b (verified, parent 1de244e): Create README.md

Files changed (1): README.md (new file, +81 lines)
---
language: vi
tags:
- hate-speech-detection
- vietnamese
- transformer
license: apache-2.0
datasets:
- visolex/ViHOS
metrics:
- precision
- recall
- f1
model-index:
- name: visobert-hsd-span
  results:
  - task:
      type: token-classification
      name: Hate Speech Span Detection
    dataset:
      name: ViHOS
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: <INSERT_PRECISION>
    - name: Recall
      type: recall
      value: <INSERT_RECALL>
    - name: F1 Score
      type: f1
      value: <INSERT_F1>
base_model:
- uitnlp/visobert
pipeline_tag: token-classification
---

# ViSoBERT-HSD-Span

This model is fine-tuned from [`uitnlp/visobert`](https://huggingface.co/uitnlp/visobert) on the **visolex/ViHOS** dataset for span-level detection of hate and offensive speech in Vietnamese comments.

## Model Details

* **Base Model**: [`uitnlp/visobert`](https://huggingface.co/uitnlp/visobert)
* **Dataset**: [visolex/ViHOS](https://huggingface.co/datasets/visolex/ViHOS) (see the loading sketch below)
* **Fine-tuning**: Hugging Face Transformers
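
The card does not show how the dataset is fetched; the snippet below is a minimal sketch using the `datasets` library. The split names and column layout are assumptions; check the ViHOS dataset card for the actual structure.

```python
from datasets import load_dataset

# Sketch only: assumes visolex/ViHOS loads with load_dataset and exposes a "train" split;
# inspect the printed DatasetDict for the real split names and label columns.
vihos = load_dataset("visolex/ViHOS")
print(vihos)              # available splits and columns
print(vihos["train"][0])  # one raw example
```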

### Hyperparameters

* Batch size: `16`
* Learning rate: `5e-5`
* Epochs: `100`
* Max sequence length: `128`
* Early stopping patience: `5`
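
The training script itself is not part of this card; the sketch below shows one way these hyperparameters map onto Hugging Face `TrainingArguments` with early stopping. The dataset objects (`train_ds`, `eval_ds`), the label count, and the output directory are assumptions, not details taken from the original run.

```python
from transformers import (
    AutoModelForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Assumption: binary token labels (inside/outside a hate span).
model = AutoModelForTokenClassification.from_pretrained("uitnlp/visobert", num_labels=2)

args = TrainingArguments(
    output_dir="visobert-hsd-span",   # assumed output path
    per_device_train_batch_size=16,   # Batch size: 16
    learning_rate=5e-5,               # Learning rate: 5e-5
    num_train_epochs=100,             # Epochs: 100, capped by early stopping
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # assumed: tokenized ViHOS train split (max_length=128, labels aligned to tokens)
    eval_dataset=eval_ds,    # assumed: tokenized ViHOS dev split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # Early stopping patience: 5
)
trainer.train()
```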

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("visolex/visobert-hsd-span")
model = AutoModelForTokenClassification.from_pretrained("visolex/visobert-hsd-span")

text = "Nói cái lol . t thấy thô tục vl"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits  # [batch, seq_len, num_labels]

# Binary span detection: sigmoid + threshold (use softmax + argmax for a multi-class head)
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long().squeeze(0).tolist()  # [seq_len, num_labels]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# One span label per token
span_labels = [p[0] for p in preds]

# Keep tokens predicted as part of a span, dropping the special tokens <s> and </s>
span_tokens = [
    token
    for token, label in zip(tokens, span_labels)
    if label == 1 and token not in ["<s>", "</s>"]
]

print("Span tokens:", span_tokens)
print("Span text:", tokenizer.convert_tokens_to_string(span_tokens))
```
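
The snippet above returns subword tokens, which can be hard to read back against the original comment. Assuming the checkpoint ships a fast (Rust-backed) tokenizer, the following sketch maps the predicted span labels onto character offsets in `text`; it reuses `span_labels` from the usage example and is not part of the original card.

```python
# Sketch: recover character-level spans via offset mapping (fast tokenizer required).
enc = tokenizer(text, return_offsets_mapping=True, truncation=True, padding=True)
offsets = enc["offset_mapping"]  # (start, end) character positions; special tokens map to (0, 0)

char_spans = [
    (start, end)
    for (start, end), label in zip(offsets, span_labels)
    if label == 1 and end > start  # skip special tokens
]

print("Character spans:", char_spans)
print("Span texts:", [text[s:e] for s, e in char_spans])
```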