# Eraly-ml/KazBERT-Duplicates-BETA_TEST
KazBERT-Duplicates is a Kazakh language model fine-tuned to classify types of textual duplication between sentence pairs. It predicts whether two sentences are exact, partial, paraphrase, or contextual duplicates.
## Model Description
- Base Model: KazBERT (BERT-based)
- Language: Kazakh 🇰🇿
- Task: Sentence Pair Classification (Duplicate Detection)
- Labels:
  - `exact`: the sentences are exactly the same
  - `partial`: one sentence partially overlaps with the other
  - `paraphrase`: the sentences convey the same meaning in different wording
  - `contextual`: the sentences are topically similar but semantically different
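The label set is stored in the model config, so you can inspect the mapping directly. A minimal sketch; the id ordering shown in the comment is an assumption, so trust what the config actually prints:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Eraly-ml/KazBERT-Duplicates")
# Prints the id -> label mapping,
# e.g. {0: "contextual", 1: "exact", 2: "paraphrase", 3: "partial"} (assumed ordering)
print(config.id2label)
```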
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Eraly-ml/KazBERT-Duplicates")
model = AutoModelForSequenceClassification.from_pretrained("Eraly-ml/KazBERT-Duplicates")

# Rebuild label2id from id2label so the two mappings stay consistent
model.config.label2id = {v: k for k, v in model.config.id2label.items()}

nlp = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,  # return scores for all labels (replaces the deprecated return_all_scores=True)
    device=0,    # remove if not using a GPU
    batch_size=2,
)

examples = [
    {"text": "Менің атым Ералы", "text_pair": "Менің есімім — Ералы"},
    {"text": "Бүгін ауа‑райы жақсы", "text_pair": "Кеше жаңбыр жауды"},
]

results = nlp(examples, truncation=True, padding=True, max_length=512)

for i, (ex, res) in enumerate(zip(examples, results), 1):
    print(f"\n[{i}] \"{ex['text']}\" ↔ \"{ex['text_pair']}\"")
    top = max(res, key=lambda x: x["score"])
    print(f"  → Top prediction: **{top['label']}** ({top['score']:.2%})")
    print("  All scores:")
    for r in res:
        print(f"    - {r['label']}: {r['score']:.2%}")
```
Output:

```text
[1] "Менің атым Ералы" ↔ "Менің есімім — Ералы"
  → Top prediction: **partial** (55.15%)
  All scores:
    - contextual: 2.04%
    - exact: 6.70%
    - paraphrase: 36.11%
    - partial: 55.15%

[2] "Бүгін ауа‑райы жақсы" ↔ "Кеше жаңбыр жауды"
  → Top prediction: **contextual** (99.59%)
  All scores:
    - contextual: 99.59%
    - exact: 0.01%
    - paraphrase: 0.04%
    - partial: 0.36%
```
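If you prefer not to use the pipeline helper, the same scores can be computed manually from the model's logits. A minimal sketch, reusing the `tokenizer` and `model` loaded above:

```python
import torch

# Score a single sentence pair without the pipeline wrapper
inputs = tokenizer(
    "Менің атым Ералы",
    "Менің есімім — Ералы",
    truncation=True,
    max_length=512,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the four classes gives the same per-label scores as the pipeline
probs = torch.softmax(logits, dim=-1)[0]
for idx, p in sorted(enumerate(probs.tolist()), key=lambda x: -x[1]):
    print(f"{model.config.id2label[idx]}: {p:.2%}")
```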
## Evaluation Metrics
The model was evaluated on a held-out test set using macro-averaged metrics:
| Metric | Value | Description |
|---|---|---|
| `eval_loss` | 0.21 | Low loss; the model makes confident predictions. |
| `eval_accuracy` | 91.05% | High accuracy across the four classes. |
| `eval_f1` | 91.05% | Strong balance between precision and recall. |
| `eval_precision` | 92.36% | Relatively few false positives. |
| `eval_recall` | 91.21% | High coverage of true positives. |
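Macro-averaged metrics like these can be computed with scikit-learn. A minimal sketch, assuming `y_true` and `y_pred` are label-id arrays for the held-out test set (hypothetical names, not part of this repository):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    # Macro-averaging weights all four classes equally, regardless of class frequency
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "eval_accuracy": accuracy_score(y_true, y_pred),
        "eval_f1": f1,
        "eval_precision": precision,
        "eval_recall": recall,
    }
```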
## Training Details
- Framework: Hugging Face Transformers + Accelerate
- Dataset: KazakhTextDuplicates
- Batch Size: 16
- Epochs: 6
- Learning Rate: 2e-5
- Optimizer: AdamW
- Max Seq Length: 512
- Loss Function: CrossEntropyLoss
Training was launched with multi-GPU support via Accelerate's `notebook_launcher`:

```python
from accelerate import notebook_launcher

# Run the training function on 2 processes (one per GPU)
notebook_launcher(training_function, num_processes=2)
```
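The card does not include `training_function` itself; the following is a minimal sketch of what it might look like given the hyperparameters above. The base checkpoint name `Eraly-ml/KazBERT` and the tiny inline dataset are assumptions for illustration, not the actual training code:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["contextual", "exact", "paraphrase", "partial"]
tokenizer = AutoTokenizer.from_pretrained("Eraly-ml/KazBERT")  # assumed base checkpoint

# Toy stand-in for the KazakhTextDuplicates data: (sentence, sentence, label id)
train_pairs = [
    ("Менің атым Ералы", "Менің есімім — Ералы", LABELS.index("partial")),
]

def collate(batch):
    texts, pairs, labels = zip(*batch)
    enc = tokenizer(list(texts), list(pairs), truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

def training_function():
    accelerator = Accelerator()
    model = AutoModelForSequenceClassification.from_pretrained(
        "Eraly-ml/KazBERT", num_labels=len(LABELS)
    )
    optimizer = AdamW(model.parameters(), lr=2e-5)
    loader = DataLoader(train_pairs, batch_size=16, shuffle=True, collate_fn=collate)
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    model.train()
    for _ in range(6):  # 6 epochs
        for batch in loader:
            # With `labels` in the batch, the model applies CrossEntropyLoss internally
            loss = model(**batch).loss
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
```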
## Intended Uses

- Kazakh duplicate sentence classification
- Plagiarism detection in Kazakh
- NLP pre-processing for deduplication tasks (see the sketch below)
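As a rough illustration of the deduplication use case, the snippet below greedily filters a corpus, reusing the `nlp` pipeline from the usage example above. The pairwise loop and the choice of which labels count as duplicates are assumptions, not part of the model:

```python
# Greedy near-duplicate filtering over a small corpus (O(n^2) pairwise checks;
# a real system would first narrow candidates, e.g. with embedding similarity)
corpus = [
    "Менің атым Ералы",
    "Менің есімім — Ералы",
    "Кеше жаңбыр жауды",
]

DUPLICATE_LABELS = {"exact", "paraphrase"}  # policy choice: which classes to drop

kept = []
for sentence in corpus:
    pairs = [{"text": sentence, "text_pair": k} for k in kept]
    preds = nlp(pairs, truncation=True, padding=True, max_length=512) if pairs else []
    if not any(max(p, key=lambda x: x["score"])["label"] in DUPLICATE_LABELS for p in preds):
        kept.append(sentence)

print(kept)  # corpus with exact/paraphrase duplicates removed
```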
## Limitations
- Limited to the Kazakh language
- May not generalize well to domain-specific text (e.g. legal or medical)
- Sensitive to long or noisy inputs
## Contact

For questions or collaborations, contact the author via the Home Page link.