---
language:
- ka
- en
license: apache-2.0
tags:
- translation
- evaluation
- comet
- mt-evaluation
- georgian
metrics:
- kendall_tau
- spearman_correlation
- pearson_correlation
model-index:
- name: Georgian-COMET
results:
- task:
type: translation-evaluation
name: Machine Translation Evaluation
dataset:
name: Georgian MT Evaluation Dataset
type: Darsala/georgian_metric_evaluation
metrics:
- type: pearson_correlation
value: 0.878
name: Pearson Correlation
- type: spearman_correlation
value: 0.796
name: Spearman Correlation
- type: kendall_tau
value: 0.603
name: Kendall's Tau
base_model: Unbabel/wmt22-comet-da
datasets:
- Darsala/georgian_metric_evaluation
---
# Georgian-COMET: Fine-tuned COMET for English-Georgian MT Evaluation
This is a COMET evaluation model fine-tuned specifically for English-Georgian machine translation evaluation. It receives a triplet with (source sentence, translation, reference translation) and returns a score that reflects the quality of the translation compared to both source and reference.
## Model Description
Georgian-COMET is a fine-tuned version of [Unbabel/wmt22-comet-da](https://huggingface.co/Unbabel/wmt22-comet-da) that has been optimized for evaluating English-to-Georgian translations through knowledge distillation from Claude Sonnet 4. The model shows consistent improvements over the base model when evaluating Georgian translations.
### Key Improvements over Base Model

| Metric   | Base COMET | Georgian-COMET | Improvement |
|----------|------------|----------------|-------------|
| Pearson  | 0.867      | 0.878          | +1.1%       |
| Spearman | 0.759      | 0.796          | +3.7%       |
| Kendall  | 0.564      | 0.603          | +3.9%       |
## Paper

- Base Model Paper: [COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task](https://aclanthology.org/2022.wmt-1.52) (Rei et al., WMT 2022)
- This Model: Paper coming soon
## Repository

[https://github.com/LukaDarsalia/nmt_metrics_research](https://github.com/LukaDarsalia/nmt_metrics_research)
## License

Apache-2.0
## Usage (unbabel-comet)

Using this model requires [unbabel-comet](https://github.com/Unbabel/COMET) to be installed:

```bash
pip install --upgrade pip  # ensures that pip is current
pip install unbabel-comet
```
### Option 1: Direct Download from HuggingFace

```python
import os

import requests
from comet import load_from_checkpoint

# Checkpoint location on the Hugging Face Hub
model_url = "https://huggingface.co/Darsala/georgian_comet/resolve/main/model.ckpt"
model_path = "georgian_comet.ckpt"

# Download only if the checkpoint is not already present locally
if not os.path.exists(model_path):
    response = requests.get(model_url)
    response.raise_for_status()
    with open(model_path, "wb") as f:
        f.write(response.content)

# Load the model
model = load_from_checkpoint(model_path)

# Prepare your data as (source, machine translation, reference) triplets
data = [
    {
        "src": "The cat sat on the mat.",
        "mt": "კატა ზის ხალიჩაზე.",
        "ref": "კატა იჯდა ხალიჩაზე."
    },
    {
        "src": "Schools and kindergartens were opened.",
        "mt": "სკოლები და საბავშვო ბაღები გაიხსნა.",
        "ref": "გაიხსნა სკოლები და საბავშვო ბაღები."
    }
]

# Get predictions (set gpus=0 to run on CPU)
model_output = model.predict(data, batch_size=8, gpus=1)
print(model_output)
```
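In recent unbabel-comet releases, `predict` returns a `Prediction` object carrying both per-segment scores and their corpus-level average; a minimal sketch of inspecting it (attribute names assume a recent unbabel-comet version):

```python
# Per-segment scores and the corpus-level average (their mean).
print(model_output.scores)        # e.g. [0.92, 0.85]
print(model_output.system_score)  # mean over all segments
```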
### Option 2: Using comet CLI

First download the model checkpoint:

```bash
wget https://huggingface.co/Darsala/georgian_comet/resolve/main/model.ckpt -O georgian_comet.ckpt
```

Then use it with the comet CLI:

```bash
comet-score -s {source-inputs}.txt -t {translation-outputs}.txt -r {references}.txt --model georgian_comet.ckpt
```
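For example, assuming `srcs.en`, `hyps.ka`, and `refs.ka` are hypothetical plain-text files with one segment per line:

```bash
comet-score -s srcs.en -t hyps.ka -r refs.ka --model georgian_comet.ckpt
```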
### Option 3: Integration with Evaluation Pipeline

```python
import pandas as pd
from comet import load_from_checkpoint

# Load the model from a local checkpoint
model = load_from_checkpoint("georgian_comet.ckpt")

# Load your evaluation data (CSV with sourceText, targetText, referenceText columns)
df = pd.read_csv("your_evaluation_data.csv")

# Prepare data in COMET format
data = [
    {
        "src": row["sourceText"],
        "mt": row["targetText"],
        "ref": row["referenceText"]
    }
    for _, row in df.iterrows()
]

# Get scores
scores = model.predict(data, batch_size=16)
print(f"Average score: {sum(scores['scores']) / len(scores['scores']):.3f}")
```
## Intended Uses

This model is intended to be used for **English-Georgian MT evaluation**.

Given a triplet (source sentence in English, translation in Georgian, reference translation in Georgian), it outputs a single score between 0 and 1, where 1 represents a perfect translation.
### Primary Use Cases

- **MT System Development**: Evaluate and compare different English-Georgian MT systems
- **Quality Assurance**: Automated quality checks for Georgian translations
- **Research**: Study MT evaluation for morphologically rich languages like Georgian
- **Production Monitoring**: Track translation quality in production environments (see the sketch below)
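As an illustration of the quality-assurance and monitoring use cases above, here is a minimal sketch that flags low-scoring segments for human review; the 0.8 threshold is an arbitrary assumption and should be calibrated against human judgments on your own data:

```python
from comet import load_from_checkpoint

THRESHOLD = 0.8  # assumption: calibrate on human-annotated data for your domain

model = load_from_checkpoint("georgian_comet.ckpt")
data = [
    {
        "src": "The cat sat on the mat.",
        "mt": "კატა ზის ხალიჩაზე.",
        "ref": "კატა იჯდა ხალიჩაზე."
    }
]
output = model.predict(data, batch_size=16, gpus=0)

# Collect the indices of segments that fall below the threshold.
flagged = [i for i, score in enumerate(output.scores) if score < THRESHOLD]
print(f"{len(flagged)} of {len(data)} segments flagged for review")
```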
### Out-of-Scope Use

- **Other Language Pairs**: This model is specifically fine-tuned for English-Georgian and may not perform well on other language pairs
- **Reference-Free Evaluation**: The model requires reference translations
- **Document-Level**: Optimized for sentence-level evaluation
## Training Details

### Training Data

- **Dataset**: 5,000 English-Georgian pairs from [corp.dict.ge](https://corp.dict.ge)
- **MT Systems**: Translations from SMaLL-100, Google Translate, and Ucraft Translate
- **Scoring Method**: Knowledge distillation from Claude Sonnet 4 with added Gaussian noise (σ=3); see the sketch after this list
- **Details**: See [Darsala/georgian_metric_evaluation](https://huggingface.co/datasets/Darsala/georgian_metric_evaluation)
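A sketch of the noise-injection step named above; σ=3 comes from the list, while the 0-100 score scale (and clipping to it) is our assumption:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

def add_label_noise(scores, sigma=3.0):
    """Perturb distilled scores with Gaussian noise of std sigma.

    Assumes scores on a 0-100 scale (an assumption here); clipping keeps
    the perturbed labels inside that range.
    """
    noisy = np.asarray(scores, dtype=float) + rng.normal(0.0, sigma, size=len(scores))
    return np.clip(noisy, 0.0, 100.0)

print(add_label_noise([87.0, 92.5, 78.0]))
```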
### Training Configuration

```yaml
regression_metric:
  init_args:
    nr_frozen_epochs: 0.3
    keep_embeddings_frozen: True
    optimizer: AdamW
    encoder_learning_rate: 1.5e-05
    learning_rate: 1.5e-05
    loss: mse
    dropout: 0.1
    batch_size: 8
```
### Training Procedure

- **Base Model**: Started from the [Unbabel/wmt22-comet-da](https://huggingface.co/Unbabel/wmt22-comet-da) checkpoint
- **Knowledge Distillation**: Used Claude Sonnet 4 scores as training targets
- **Robustness**: Added Gaussian noise to training scores to prevent overfitting
- **Optimization**: 8 epochs with early stopping (patience=4) on validation Kendall's tau (see the invocation sketch below)
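Fine-tuning used the standard COMET training entry point; a hypothetical invocation under the configuration fragment above (the config file path is an assumption, not the repository's actual layout):

```bash
# Hypothetical: point comet-train at a YAML containing the
# regression_metric block shown above.
comet-train --cfg configs/models/regression_metric.yaml
```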
## Evaluation Results

### Test Set Performance

Evaluated on 400 human-annotated English-Georgian translation pairs:

| Metric   | Score | p-value |
|----------|-------|---------|
| Pearson  | 0.878 | < 0.001 |
| Spearman | 0.796 | < 0.001 |
| Kendall  | 0.603 | < 0.001 |
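These correlations can be recomputed from paired metric and human scores with scipy; a minimal sketch with placeholder arrays:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Placeholder data: model scores and human judgments for the same segments.
metric_scores = [0.91, 0.75, 0.83, 0.60]
human_scores = [95, 70, 88, 55]

for name, corr in [("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)]:
    stat, p = corr(metric_scores, human_scores)
    print(f"{name}: {stat:.3f} (p = {p:.3g})")
```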
### Comparison with Other Metrics

| Metric              | Pearson | Spearman | Kendall |
|---------------------|---------|----------|---------|
| **Georgian-COMET**  | 0.878   | 0.796    | 0.603   |
| Base COMET          | 0.867   | 0.759    | 0.564   |
| LLM-Reference-Based | 0.852   | 0.798    | 0.660   |
| CHRF++              | 0.739   | 0.690    | 0.498   |
| TER                 | 0.466   | 0.443    | 0.311   |
| BLEU                | 0.413   | 0.497    | 0.344   |
## Languages Covered

While the base model (XLM-R) covers 100+ languages, this fine-tuned version is specifically optimized for:

- **Source Language**: English (en)
- **Target Language**: Georgian (ka)

For other language pairs, we recommend using the base [Unbabel/wmt22-comet-da](https://huggingface.co/Unbabel/wmt22-comet-da) model.
## Limitations

- **Language Specific**: Optimized only for English→Georgian evaluation
- **Domain**: Training data primarily from corp.dict.ge (general/literary domain)
- **Reference Required**: Cannot perform reference-free evaluation
- **Sentence Level**: Not optimized for document-level evaluation
## Citation

If you use this model, please cite:

```bibtex
@misc{georgian-comet-2025,
  title={Georgian-COMET: Fine-tuned COMET for English-Georgian MT Evaluation},
  author={Darsalia, Luka and Bakhturidze, Ketevan and Sturua, Saba},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Darsala/georgian_comet}
}
```
```bibtex
@inproceedings{rei-etal-2022-comet,
  title = "{COMET}-22: Unbabel-{IST} 2022 Submission for the Metrics Shared Task",
  author = "Rei, Ricardo and
    C. de Souza, Jos{\'e} G. and
    Alves, Duarte and
    Zerva, Chrysoula and
    Farinha, Ana C and
    Glushkova, Taisiya and
    Lavie, Alon and
    Coheur, Luisa and
    Martins, Andr{\'e} F. T.",
  booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
  year = "2022",
  address = "Abu Dhabi, United Arab Emirates",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.wmt-1.52",
  pages = "578--585",
}
```
## Acknowledgments

- **Unbabel team** for the base COMET model
- **Anthropic** for Claude Sonnet 4 used in knowledge distillation
- **corp.dict.ge** for the Georgian-English corpus
- All contributors to the [nmt_metrics_research](https://github.com/LukaDarsalia/nmt_metrics_research) project