# Eraly-ml/KazBERT-Duplicates-BETA_TEST
KazBERT-Duplicates is a Kazakh language model fine-tuned to classify types of textual duplication between sentence pairs. It predicts whether two sentences are exact, partial, paraphrase, or contextual duplicates.
## Model Description
- Base Model: KazBERT (BERT-based)
- Language: Kazakh 🇰🇿
- Task: Sentence Pair Classification (Duplicate Detection)
- Labels:
  - `exact`: the sentences are exactly the same
  - `partial`: one sentence partially overlaps with the other
  - `paraphrase`: the sentences convey the same meaning in different wording
  - `contextual`: the sentences are topically similar but semantically different
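The label set is stored in the model config, so you can inspect the mapping directly. A minimal sketch; the id ordering shown in the comment is an assumption, so trust what the config actually prints:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Eraly-ml/KazBERT-Duplicates")
# Prints the id -> label mapping,
# e.g. {0: "contextual", 1: "exact", 2: "paraphrase", 3: "partial"} (assumed ordering)
print(config.id2label)
```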
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Eraly-ml/KazBERT-Duplicates")
model = AutoModelForSequenceClassification.from_pretrained("Eraly-ml/KazBERT-Duplicates")

# Rebuild label2id from id2label so the two mappings stay consistent
model.config.label2id = {v: k for k, v in model.config.id2label.items()}

nlp = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,  # return scores for all labels (replaces the deprecated return_all_scores=True)
    device=0,    # remove if not using a GPU
    batch_size=2,
)

examples = [
    {"text": "Менің атым Ералы", "text_pair": "Менің есімім — Ералы"},
    {"text": "Бүгін ауа‑райы жақсы", "text_pair": "Кеше жаңбыр жауды"},
]

results = nlp(examples, truncation=True, padding=True, max_length=512)

for i, (ex, res) in enumerate(zip(examples, results), 1):
    print(f"\n[{i}] \"{ex['text']}\" ↔ \"{ex['text_pair']}\"")
    top = max(res, key=lambda x: x["score"])
    print(f"  → Top prediction: **{top['label']}** ({top['score']:.2%})")
    print("  All scores:")
    for r in res:
        print(f"    - {r['label']}: {r['score']:.2%}")
```
Output:

```text
[1] "Менің атым Ералы" ↔ "Менің есімім — Ералы"
  → Top prediction: **partial** (55.15%)
  All scores:
    - contextual: 2.04%
    - exact: 6.70%
    - paraphrase: 36.11%
    - partial: 55.15%

[2] "Бүгін ауа‑райы жақсы" ↔ "Кеше жаңбыр жауды"
  → Top prediction: **contextual** (99.59%)
  All scores:
    - contextual: 99.59%
    - exact: 0.01%
    - paraphrase: 0.04%
    - partial: 0.36%
```
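If you prefer not to use the pipeline helper, the same scores can be computed manually from the model's logits. A minimal sketch, reusing the `tokenizer` and `model` loaded above:

```python
import torch

# Score a single sentence pair without the pipeline wrapper
inputs = tokenizer(
    "Менің атым Ералы",
    "Менің есімім — Ералы",
    truncation=True,
    max_length=512,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the four classes gives the same per-label scores as the pipeline
probs = torch.softmax(logits, dim=-1)[0]
for idx, p in sorted(enumerate(probs.tolist()), key=lambda x: -x[1]):
    print(f"{model.config.id2label[idx]}: {p:.2%}")
```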
## Evaluation Metrics
The model was evaluated on a held-out test set using macro-averaged metrics:
| Metric | Value | Description |
|---|---|---|
| `eval_loss` | 0.21 | Low loss; the model makes confident predictions. |
| `eval_accuracy` | 91.05% | High accuracy across the four classes. |
| `eval_f1` | 91.05% | Strong balance between precision and recall. |
| `eval_precision` | 92.36% | Relatively few false positives. |
| `eval_recall` | 91.21% | High coverage of true positives. |
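Macro-averaged metrics like these can be computed with scikit-learn. A minimal sketch, assuming `y_true` and `y_pred` are label-id arrays for the held-out test set (hypothetical names, not part of this repository):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    # Macro-averaging weights all four classes equally, regardless of class frequency
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "eval_accuracy": accuracy_score(y_true, y_pred),
        "eval_f1": f1,
        "eval_precision": precision,
        "eval_recall": recall,
    }
```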
## Training Details
- Framework: Hugging Face Transformers + Accelerate
- Dataset: KazakhTextDuplicates
- Batch Size: 16
- Epochs: 6
- Learning Rate: 2e-5
- Optimizer: AdamW
- Max Seq Length: 512
- Loss Function: CrossEntropyLoss
Training was launched with multi-GPU support via Accelerate's `notebook_launcher`:

```python
from accelerate import notebook_launcher

# Run the training function on 2 processes (one per GPU)
notebook_launcher(training_function, num_processes=2)
```
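The card does not include `training_function` itself; the following is a minimal sketch of what it might look like given the hyperparameters above. The base checkpoint name `Eraly-ml/KazBERT` and the tiny inline dataset are assumptions for illustration, not the actual training code:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["contextual", "exact", "paraphrase", "partial"]
tokenizer = AutoTokenizer.from_pretrained("Eraly-ml/KazBERT")  # assumed base checkpoint

# Toy stand-in for the KazakhTextDuplicates data: (sentence, sentence, label id)
train_pairs = [
    ("Менің атым Ералы", "Менің есімім — Ералы", LABELS.index("partial")),
]

def collate(batch):
    texts, pairs, labels = zip(*batch)
    enc = tokenizer(list(texts), list(pairs), truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

def training_function():
    accelerator = Accelerator()
    model = AutoModelForSequenceClassification.from_pretrained(
        "Eraly-ml/KazBERT", num_labels=len(LABELS)
    )
    optimizer = AdamW(model.parameters(), lr=2e-5)
    loader = DataLoader(train_pairs, batch_size=16, shuffle=True, collate_fn=collate)
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    model.train()
    for _ in range(6):  # 6 epochs
        for batch in loader:
            # With `labels` in the batch, the model applies CrossEntropyLoss internally
            loss = model(**batch).loss
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
```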
## Intended Uses

- Kazakh duplicate sentence classification
- Plagiarism detection in Kazakh
- NLP pre-processing for deduplication tasks (see the sketch below)
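As a rough illustration of the deduplication use case, the snippet below greedily filters a corpus, reusing the `nlp` pipeline from the usage example above. The pairwise loop and the choice of which labels count as duplicates are assumptions, not part of the model:

```python
# Greedy near-duplicate filtering over a small corpus (O(n^2) pairwise checks;
# a real system would first narrow candidates, e.g. with embedding similarity)
corpus = [
    "Менің атым Ералы",
    "Менің есімім — Ералы",
    "Кеше жаңбыр жауды",
]

DUPLICATE_LABELS = {"exact", "paraphrase"}  # policy choice: which classes to drop

kept = []
for sentence in corpus:
    pairs = [{"text": sentence, "text_pair": k} for k in kept]
    preds = nlp(pairs, truncation=True, padding=True, max_length=512) if pairs else []
    if not any(max(p, key=lambda x: x["score"])["label"] in DUPLICATE_LABELS for p in preds):
        kept.append(sentence)

print(kept)  # corpus with exact/paraphrase duplicates removed
```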
## Limitations
- Limited to the Kazakh language
- May not generalize well to domain-specific text (e.g. legal or medical)
- Sensitive to long or noisy inputs
## Contact

For questions or collaborations, contact the author via the Home Page link.