Jeju Satoru

Project Overview

'Jeju Satoru' is a bidirectional Jeju-Standard Korean translation model developed to help preserve the Jeju language, which UNESCO classifies as 'critically endangered'. The model aims to bridge the digital divide for elderly Jeju dialect speakers by improving their digital accessibility.

Model Information

  • Base Model: KoBART (gogamza/kobart-base-v2)
  • Model Architecture: Seq2Seq (encoder-decoder)
  • Model Size: ~124M parameters (FP32, stored as Safetensors)
  • Training Data: Approximately 930,000 sentence pairs, built from the publicly available Junhoee/Jeju-Standard-Translation dataset, which is primarily based on the KakaoBrain JIT (Jeju-Island-Translation) corpus and transcribed data from the AI Hub Jeju dialect speech dataset.
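
The referenced dataset is publicly available on the Hugging Face Hub. A minimal sketch of loading it with the datasets library (the split name and column layout are assumptions; inspect the loaded object for the actual schema):

from datasets import load_dataset

# Sketch: load the public Jeju-Standard translation dataset referenced above.
# The "train" split is an assumption; check the dataset card for actual splits.
ds = load_dataset("Junhoee/Jeju-Standard-Translation", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one Jeju/Standard sentence pair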

Training Strategy and Parameters

Our model was trained with a two-stage strategy, domain adaptation followed by translation fine-tuning, to handle the complexities of the Jeju dialect.

  1. Domain Adaptation: The model was first trained separately on Standard Korean and Jeju dialect sentences so that it could internalize the grammar and style of each variety.
  2. Translation Fine-Tuning: The model was then trained on the bidirectional dataset, with [제주] (Jeju) and [표준] (Standard) tags prepended to each source sentence to explicitly signal the translation direction (see the sketch below).
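
For illustration, direction tags of this kind can be prepended as follows (a minimal sketch, not the authors' released preprocessing code):

def make_bidirectional_pairs(jeju: str, standard: str):
    """Turn one aligned sentence pair into two directed training examples."""
    return [
        (f"[제주] {jeju}", standard),    # Jeju -> Standard
        (f"[표준] {standard}", jeju),    # Standard -> Jeju
    ]

for src, tgt in make_bidirectional_pairs("우리 집이 펜안허다.", "우리 집은 편안하다."):
    print(src, "->", tgt)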

The following key hyperparameters and techniques were applied for performance optimization:

  • Learning Rate: 2e-5
  • Epochs: 3
  • Batch Size: 128
  • Weight Decay: 0.01
  • Generation Beams: 5
  • GPU Memory Efficiency: Mixed-precision (FP16) training was used to reduce memory use and training time, together with gradient accumulation (16 steps).
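
The settings above map onto transformers' Seq2SeqTrainingArguments roughly as follows. This is a sketch, not the authors' actual training script; the per-device batch size of 8 is an assumption chosen so that 8 x 16 accumulation steps matches the effective batch size of 128:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="jeju-satoru-checkpoints",  # hypothetical output path
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,         # assumption: 8 * 16 steps = 128
    gradient_accumulation_steps=16,
    weight_decay=0.01,
    fp16=True,                             # mixed-precision training
    predict_with_generate=True,
    generation_num_beams=5,
)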

Performance Evaluation

The model's performance was evaluated both quantitatively and qualitatively.

Quantitative Evaluation

Direction                   SacreBLEU   chrF    BERTScore
Jeju Dialect → Standard     77.19       83.02   0.97
Standard → Jeju Dialect     64.86       72.68   0.94
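
These scores can in principle be computed with Hugging Face's evaluate library; the authors' exact evaluation tooling is not specified, so the following is a minimal sketch on a toy example:

import evaluate

predictions = ["우리 집은 편안하다."]    # model output
references = [["우리 집은 편안하다."]]   # gold translation(s)

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

print(sacrebleu.compute(predictions=predictions, references=references)["score"])
print(chrf.compute(predictions=predictions, references=references)["score"])
print(bertscore.compute(predictions=predictions,
                        references=[r[0] for r in references],
                        lang="ko")["f1"])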

Qualitative Evaluation (Summary)

  • Adequacy: The model accurately captures the meaning of most source sentences.
  • Fluency: The translated sentences are grammatically correct and natural-sounding.
  • Tone: The model generally preserves the tone of the source, but it does not always reproduce the finer nuances and characteristic colloquial endings of the Jeju dialect.

How to Use

You can load the model and run inference with the transformers library's pipeline function.

1. Installation

pip install transformers torch

2. Inference

from transformers import pipeline

# Load the translation pipeline
translator = pipeline(
    "translation",
    model="sbaru/jeju-satoru"
)

# Example: Jeju Dialect -> Standard
# "우리 집이 펜안허다." is Jeju dialect for "Our home is comfortable."
jeju_sentence = '[제주] 우리 집이 펜안허다.'
result = translator(jeju_sentence, max_length=128)
print(f"Input: {jeju_sentence}")
print(f"Output: {result[0]['translation_text']}")

# Example: Standard -> Jeju Dialect
# "우리 집은 편안하다." is Standard Korean for the same sentence.
standard_sentence = '[표준] 우리 집은 편안하다.'
result = translator(standard_sentence, max_length=128)
print(f"Input: {standard_sentence}")
print(f"Output: {result[0]['translation_text']}")