Eraly-ml/KazBERT-Duplicates-BETA_TEST

KazBERT-Duplicates is a Kazakh language model fine-tuned to classify types of textual duplication between sentence pairs. It predicts whether two sentences are exact, partial, paraphrase, or contextual duplicates.

Model Description

  • Base Model: KazBERT (BERT-based)
  • Language: Kazakh 🇰🇿
  • Parameters: 111M (F32, safetensors)
  • Task: Sentence Pair Classification (Duplicate Detection)
  • Labels:
    • exact: Sentences are exactly the same
    • partial: One sentence partially overlaps with the other
    • paraphrase: Sentences convey the same meaning in different wording
    • contextual: Sentences are topically similar but semantically different
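For reference, the class indices below follow the alphabetical order seen in the score listings later in this card. Treat this as a sketch: the authoritative mapping is `model.config.id2label` from the loaded checkpoint.

```python
# Hypothetical label mapping; the real one comes from model.config.id2label.
# The alphabetical order matches the score listings in the Output section.
ID2LABEL = {0: "contextual", 1: "exact", 2: "paraphrase", 3: "partial"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(LABEL2ID["partial"])  # → 3
```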

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Eraly-ml/KazBERT-Duplicates")
model = AutoModelForSequenceClassification.from_pretrained("Eraly-ml/KazBERT-Duplicates")

# Rebuild label2id from id2label so the two mappings stay consistent
model.config.label2id = {v: k for k, v in model.config.id2label.items()}

nlp = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,  # return scores for all labels (replaces the deprecated return_all_scores=True)
    device=0,  # remove if not using GPU
    batch_size=2,
)

examples = [
    {"text": "Менің атым Ералы", "text_pair": "Менің есімім — Ералы"},
    {"text": "Бүгін ауа‑райы жақсы", "text_pair": "Кеше жаңбыр жауды"}
]

results = nlp(examples, truncation=True, padding=True, max_length=512)

for i, (ex, res) in enumerate(zip(examples, results), 1):
    print(f"\n[{i}] \"{ex['text']}\" ↔ \"{ex['text_pair']}\"")
    top = max(res, key=lambda x: x['score'])
    print(f"   → Top prediction: **{top['label']}** ({top['score']:.2%})")
    print("   All scores:")
    for r in res:
        print(f"      - {r['label']}: {r['score']:.2%}")

Output:

[1] "Менің атым Ералы" ↔ "Менің есімім — Ералы"
   → Top prediction: **partial** (55.15%)
   All scores:
      - contextual: 2.04%
      - exact: 6.70%
      - paraphrase: 36.11%
      - partial: 55.15%

[2] "Бүгін ауа‑райы жақсы" ↔ "Кеше жаңбыр жауды"
   → Top prediction: **contextual** (99.59%)
   All scores:
      - contextual: 99.59%
      - exact: 0.01%
      - paraphrase: 0.04%
      - partial: 0.36%
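Under the hood, the pipeline applies a softmax over the model's four logits and picks the argmax. A minimal sketch of that post-processing step (the logits here are illustrative only; real ones come from a forward pass, e.g. `model(**tokenizer(text, text_pair, return_tensors="pt")).logits`):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(labels, logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    i = max(range(len(probs)), key=lambda k: probs[k])
    return labels[i], probs[i]

labels = ["contextual", "exact", "paraphrase", "partial"]
label, p = top_prediction(labels, [-1.9, -0.7, 1.0, 1.4])  # illustrative logits
print(label)  # → partial
```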

Evaluation Metrics

The model was evaluated on a held-out test set using macro-averaged metrics:

| Metric | Value | Description |
|---|---|---|
| eval_loss | 0.21 | Cross-entropy loss on the held-out test set |
| eval_accuracy | 91.05% | Share of sentence pairs classified correctly |
| eval_f1 | 91.05% | Macro-averaged F1 across the four classes |
| eval_precision | 92.36% | Macro-averaged precision |
| eval_recall | 91.21% | Macro-averaged recall |
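Macro-averaging computes each class's F1 independently and then takes the unweighted mean, so rare classes count as much as frequent ones. A toy illustration of the calculation (not the card's actual evaluation code):

```python
def macro_f1(y_true, y_pred, labels):
    """Compute per-class F1 and average with equal class weight."""
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["exact", "exact", "partial", "partial"]
y_pred = ["exact", "partial", "partial", "partial"]
print(round(macro_f1(y_true, y_pred, ["exact", "partial"]), 4))  # → 0.7333
```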

Training Details

  • Framework: Hugging Face Transformers + Accelerate
  • Dataset: KazakhTextDuplicates
  • Batch Size: 16
  • Epochs: 6
  • Learning Rate: 2e-5
  • Optimizer: AdamW
  • Max Seq Length: 512
  • Loss Function: CrossEntropyLoss
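The hyperparameters above can be collected into a plain settings dict; this is only a sketch with illustrative key names, since the card does not publish the actual Trainer/Accelerate setup:

```python
# Hypothetical config mirroring the hyperparameters listed above.
TRAIN_CONFIG = {
    "per_device_train_batch_size": 16,
    "num_train_epochs": 6,
    "learning_rate": 2e-5,
    "optimizer": "adamw",
    "max_seq_length": 512,
    "loss": "cross_entropy",
}
```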

Training was launched with multi-GPU support:

from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=2)

Intended Uses

  • Kazakh duplicate sentence classification
  • Plagiarism detection in Kazakh
  • NLP pre-processing for deduplication tasks
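For the deduplication use case, one possible pattern is to keep a sentence only if the classifier does not flag it as a duplicate of anything already kept. A sketch with a pluggable `classify(a, b)` callable, which in practice would wrap the pipeline shown above:

```python
def dedupe(sentences, classify, drop=("exact", "paraphrase")):
    """Keep each sentence unless classify() marks it as a duplicate
    (by default: exact or paraphrase) of a previously kept sentence."""
    kept = []
    for s in sentences:
        if not any(classify(s, k) in drop for k in kept):
            kept.append(s)
    return kept
```

Note that this compares each candidate against all kept sentences (quadratic in the worst case), so for large corpora a cheaper pre-filter is advisable.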

Limitations

  • Limited to the Kazakh language
  • May not generalize well to domain-specific text (e.g. legal or medical)
  • Sensitive to long or noisy inputs

Contact

For questions or collaborations, use the Home Page link on the model's Hugging Face page.
