|
--- |
|
language: vi |
|
tags: |
|
- hate-speech-detection |
|
- vietnamese |
|
- transformer |
|
license: apache-2.0 |
|
datasets: |
|
- visolex/ViHOS |
|
metrics: |
|
- precision |
|
- recall |
|
- f1 |
|
model-index: |
|
- name: visobert-hsd-span |
|
results: |
|
- task: |
|
type: token-classification |
|
name: Hate Speech Span Detection |
|
dataset: |
|
name: ViHOS |
|
      type: visolex/ViHOS |
|
metrics: |
|
- name: Precision |
|
type: precision |
|
value: <INSERT_PRECISION> |
|
- name: Recall |
|
type: recall |
|
value: <INSERT_RECALL> |
|
- name: F1 Score |
|
type: f1 |
|
value: <INSERT_F1> |
|
base_model: |
|
- uitnlp/visobert |
|
pipeline_tag: token-classification |
|
--- |
|
|
|
# ViSoBERT-HSD-Span |
|
|
|
This model is fine-tuned from [`uitnlp/visobert`](https://huggingface.co/uitnlp/visobert) on the **visolex/ViHOS** dataset for span-level detection of hate and offensive speech in Vietnamese comments. |
|
|
|
## Model Details |
|
|
|
* **Base Model**: [`uitnlp/visobert`](https://huggingface.co/uitnlp/visobert) |
|
* **Dataset**: [visolex/ViHOS](https://huggingface.co/datasets/visolex/ViHOS) |
|
* **Fine-tuning**: Hugging Face Transformers (a minimal training sketch is shown after the hyperparameters below) |
|
|
|
### Hyperparameters |
|
|
|
* Batch size: `16` |
|
* Learning rate: `5e-5` |
|
* Epochs: `100` |
|
* Max sequence length: `128` |
|
* Early stopping patience: `5` |
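
The training code itself is not published with this card; the snippet below is only a minimal sketch of how the hyperparameters above could be wired into `Trainer`. The variables `train_ds` and `val_ds` are hypothetical placeholders for tokenized ViHOS splits with per-token span labels, and the argument names follow recent `transformers` releases (older releases use `evaluation_strategy` instead of `eval_strategy`).

```python
from transformers import (
    AutoModelForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Hypothetical inputs: `train_ds` / `val_ds` are tokenized ViHOS splits
# (max_length=128, truncation) with per-token labels 0 = outside span, 1 = inside span.
model = AutoModelForTokenClassification.from_pretrained("uitnlp/visobert", num_labels=2)

training_args = TrainingArguments(
    output_dir="visobert-hsd-span",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=100,
    eval_strategy="epoch",        # older transformers: evaluation_strategy="epoch"
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for EarlyStoppingCallback
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,   # hypothetical tokenized training split
    eval_dataset=val_ds,      # hypothetical tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)

trainer.train()
```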
|
|
|
## Usage |
|
|
|
```python |

import torch |

from transformers import AutoTokenizer, AutoModelForTokenClassification |

tokenizer = AutoTokenizer.from_pretrained("visolex/visobert-hsd-span") |

model = AutoModelForTokenClassification.from_pretrained("visolex/visobert-hsd-span") |

text = "Nói cái lol . t thấy thô tục vl" |

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True) |

with torch.no_grad(): |

    outputs = model(**inputs) |

logits = outputs.logits  # [batch, seq_len, num_labels] |

# For a binary span head use sigmoid; for a multi-class head use softmax + argmax |

probs = torch.sigmoid(logits) |

preds = (probs > 0.5).long().squeeze(0).tolist()  # [seq_len, num_labels] |

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]) |

# Per-token decision for the span label (first label column) |

span_labels = [p[0] for p in preds] |

# Keep tokens predicted to lie inside a hate/offensive span, dropping the <s> and </s> special tokens |

span_tokens = [token for token, label in zip(tokens, span_labels) if label == 1 and token not in ["<s>", "</s>"]] |

print("Span tokens:", span_tokens) |

print("Span text:", tokenizer.convert_tokens_to_string(span_tokens)) |

``` |
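
If the checkpoint is instead configured as a plain multi-class token classifier (softmax over labels) rather than the per-label sigmoid shown above, the decoding step can be swapped for softmax + argmax. A minimal variant reusing `logits`, `tokens`, and `tokenizer` from the snippet above, assuming label index `1` marks in-span tokens:

```python
# Multi-class decoding: pick the highest-scoring label for each token.
probs = torch.softmax(logits, dim=-1)           # [batch, seq_len, num_labels]
span_labels = probs.argmax(dim=-1)[0].tolist()  # label index per token

span_tokens = [
    token
    for token, label in zip(tokens, span_labels)
    if label == 1 and token not in ["<s>", "</s>"]
]
print("Span text:", tokenizer.convert_tokens_to_string(span_tokens))
```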
|
|