Model Details

  • Developed by: Nikolas Ziartis
  • Institute: University of Cyprus
  • Model type: MarianMT (Transformer-based Seq2Seq)
  • Source language: Cypriot Greek (no ISO 639-1 code of its own; note that `cy` denotes Welsh)
  • Target language: Modern Standard Greek (ISO 639-1: el)
  • Fine-tuned from: Helsinki-NLP/opus-mt-en-grk
  • License: CC BY 4.0

Model Description

This model is a MarianMT transformer, fine-tuned via active learning to translate from the low-resource Cypriot Greek dialect into Modern Standard Greek. In nine iterative batches, we:

  1. Extracted high-dimensional embeddings for every unlabeled Cypriot sentence using the Greek LLM ilsp/Meltemi-7B-Instruct-v1.5.
  2. Applied k-means clustering to select the 50 most informative source sentences per batch.
  3. Had human annotators translate those 50 sentences into Standard Greek.
  4. Fine-tuned the MarianMT model on the accumulating parallel corpus, freezing and unfreezing layers to preserve learned representations.
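The card does not spell out how k-means clustering yields the 50 selected sentences per batch. A minimal sketch of one common reading, clustering the sentence embeddings into 50 groups and keeping the sentence nearest each centroid; the helper name, the one-pick-per-cluster rule, and the toy random vectors standing in for Meltemi-7B embeddings are all assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_informative(embeddings: np.ndarray, n_select: int = 50, seed: int = 0):
    """Cluster embeddings into n_select groups and return the index of the
    sentence closest to each centroid, approximating a diverse batch."""
    km = KMeans(n_clusters=n_select, n_init=10, random_state=seed).fit(embeddings)
    picks = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return sorted(picks)

# Toy embeddings standing in for LLM sentence vectors.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 32))
batch = select_informative(emb, n_select=5)
print(len(batch))  # 5 selected sentence indices
```

The selected indices would then be handed to annotators (step 3) and their translations appended to the parallel corpus.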

The result is a system that accurately captures colloquial Cypriot expressions while producing fluent Modern Greek.
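The freezing/unfreezing schedule in step 4 is likewise not detailed on this card. Mechanically, in PyTorch it amounts to toggling `requires_grad` on parameter groups; below is a hypothetical helper with a toy stand-in model (which MarianMT submodules were actually frozen, e.g. the encoder, is an assumption):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze (trainable=False) or unfreeze a submodule's parameters."""
    for p in module.parameters():
        p.requires_grad = trainable

# Toy stand-in for a seq2seq model: freeze the "encoder", train the "decoder".
model = nn.ModuleDict({"encoder": nn.Linear(8, 8), "decoder": nn.Linear(8, 8)})
set_trainable(model["encoder"], False)  # preserve learned representations
set_trainable(model["decoder"], True)   # keep adapting the decoder
print(any(p.requires_grad for p in model["encoder"].parameters()))  # False
```

In a real fine-tuning run, only the still-trainable parameters would be passed to the optimizer, e.g. `filter(lambda p: p.requires_grad, model.parameters())`.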

Usage

from transformers import MarianMTModel, MarianTokenizer

model_name = "ZiartisNikolas/NMT-cypriot-dialect-to-greek"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model     = MarianMTModel.from_pretrained(model_name)

src = ["Τζ̆αι φυσικά ήξερα ίνταμπου εγινίσκετουν."]  # Cypriot Greek: "And of course I knew what was happening."
batch = tokenizer(src, return_tensors="pt", padding=True)
gen   = model.generate(**batch)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))
Model size: 55.6M parameters (Safetensors, F32)