Model Details

  • Developed by: Nikolas Ziartis
  • Institute: University of Cyprus
  • Model type: MarianMT (Transformer-based Seq2Seq)
  • Source language: Cypriot Greek (no ISO 639-1 code of its own; note that `cy` denotes Welsh)
  • Target language: Modern Standard Greek (ISO 639-1: el)
  • Fine-tuned from: Helsinki-NLP/opus-mt-en-grk
  • License: CC BY 4.0

Model Description

This model is a MarianMT transformer, fine-tuned via active learning to translate from the low-resource Cypriot Greek dialect into Modern Standard Greek. In nine iterative batches, we:

  1. Extracted high-dimensional embeddings for every unlabeled Cypriot sentence using the Greek LLM ilsp/Meltemi-7B-Instruct-v1.5.
  2. Applied k-means clustering to select the 50 most informative source sentences per batch.
  3. Had human annotators translate those 50 sentences into Standard Greek.
  4. Fine-tuned the MarianMT model on the accumulating parallel corpus, freezing and unfreezing layers to preserve learned representations.
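The card does not spell out how k-means clustering yields the 50 selected sentences per batch. A minimal sketch of one common reading, clustering the sentence embeddings into 50 groups and keeping the sentence nearest each centroid; the helper name, the one-pick-per-cluster rule, and the toy random vectors standing in for Meltemi-7B embeddings are all assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_informative(embeddings: np.ndarray, n_select: int = 50, seed: int = 0):
    """Cluster embeddings into n_select groups and return the index of the
    sentence closest to each centroid, approximating a diverse batch."""
    km = KMeans(n_clusters=n_select, n_init=10, random_state=seed).fit(embeddings)
    picks = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return sorted(picks)

# Toy embeddings standing in for LLM sentence vectors.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 32))
batch = select_informative(emb, n_select=5)
print(len(batch))  # 5 selected sentence indices
```

The selected indices would then be handed to annotators (step 3) and their translations appended to the parallel corpus.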

The result is a system that accurately captures colloquial Cypriot expressions while producing fluent Modern Greek.
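The freezing/unfreezing schedule in step 4 is likewise not detailed on this card. Mechanically, in PyTorch it amounts to toggling `requires_grad` on parameter groups; below is a hypothetical helper with a toy stand-in model (which MarianMT submodules were actually frozen, e.g. the encoder, is an assumption):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze (trainable=False) or unfreeze a submodule's parameters."""
    for p in module.parameters():
        p.requires_grad = trainable

# Toy stand-in for a seq2seq model: freeze the "encoder", train the "decoder".
model = nn.ModuleDict({"encoder": nn.Linear(8, 8), "decoder": nn.Linear(8, 8)})
set_trainable(model["encoder"], False)  # preserve learned representations
set_trainable(model["decoder"], True)   # keep adapting the decoder
print(any(p.requires_grad for p in model["encoder"].parameters()))  # False
```

In a real fine-tuning run, only the still-trainable parameters would be passed to the optimizer, e.g. `filter(lambda p: p.requires_grad, model.parameters())`.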

Usage

from transformers import MarianMTModel, MarianTokenizer

model_name = "ZiartisNikolas/NMT-cypriot-dialect-to-greek"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model     = MarianMTModel.from_pretrained(model_name)

src = ["Τζ̆αι φυσικά ήξερα ίνταμπου εγινίσκετουν."]  # Cypriot Greek: "And of course I knew what was happening."
batch = tokenizer(src, return_tensors="pt", padding=True)
gen   = model.generate(**batch)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))
Model size: 55.6M parameters (Safetensors, F32)