# Jeju Satoru

## Project Overview
'Jeju Satoru' is a bidirectional Jeju-Standard Korean translation model developed to help preserve the Jeju language, which UNESCO classifies as critically endangered. The model aims to bridge the digital divide for elderly Jeju dialect speakers by improving their digital accessibility.
## Model Information

- Base Model: KoBART (`gogamza/kobart-base-v2`)
- Model Architecture: Seq2Seq (encoder-decoder)
- Training Data: approximately 930,000 sentence pairs. The dataset was built from the publicly available Junhoee/Jeju-Standard-Translation dataset, which draws primarily on the KakaoBrain JIT (Jeju-Island-Translation) corpus and transcribed data from the AI Hub Jeju dialect speech dataset.
## Training Strategy and Parameters
Our model was trained with a two-stage domain-adaptation method to handle the complexities of the Jeju dialect.

- Domain Adaptation: the model was first trained separately on Standard Korean and Jeju dialect sentences so that it learns the grammar and style of each variety.
- Translation Fine-Tuning: the final stage trained the model on the bidirectional dataset, with [제주] (Jeju) and [표준] (Standard) tags prepended to each sentence to explicitly guide the translation direction.
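As a sketch of how the second stage's direction tags can be attached to sentence pairs (the helper name and pair format here are illustrative assumptions, not the project's actual preprocessing code):

```python
# Minimal sketch of direction-tagged pair construction. The model card only
# states that [제주]/[표준] tags prefix each source sentence; the function
# below is a hypothetical helper showing that convention.

def make_tagged_pair(jeju: str, standard: str, direction: str) -> tuple[str, str]:
    """Return a (source, target) training pair with an explicit direction tag."""
    if direction == "jeju2std":
        return f"[제주] {jeju}", standard
    if direction == "std2jeju":
        return f"[표준] {standard}", jeju
    raise ValueError(f"unknown direction: {direction}")

src, tgt = make_tagged_pair("우리 집이 펜안하다.", "우리 집은 편안하다.", "jeju2std")
print(src)  # [제주] 우리 집이 펜안하다.
print(tgt)  # 우리 집은 편안하다.
```

Tagging the source side (rather than using separate models per direction) lets a single Seq2Seq model serve both translation directions.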
The following key hyperparameters and techniques were applied for performance optimization:
- Learning Rate: 2e-5
- Epochs: 3
- Batch Size: 128
- Weight Decay: 0.01
- Generation Beams: 5
- GPU Memory Efficiency: Mixed-precision training (FP16) was used to reduce training time, along with Gradient Accumulation (Steps: 16).
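The settings above can be expressed as a `transformers` training configuration. This is a hypothetical reconstruction, not the project's actual script: `output_dir` and the per-device batch size of 8 are assumptions (chosen so that 8 × 16 accumulation steps yields the effective batch size of 128).

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the reported hyperparameters.
training_args = Seq2SeqTrainingArguments(
    output_dir="jeju-satoru",            # assumption
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,       # 8 x 16 accumulation steps = effective batch of 128
    gradient_accumulation_steps=16,
    weight_decay=0.01,
    fp16=True,                           # mixed-precision training
    predict_with_generate=True,
    generation_num_beams=5,
)
```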
## Performance Evaluation

The model's performance was comprehensively evaluated using both quantitative and qualitative metrics.

### Quantitative Evaluation
| Direction | SacreBLEU | chrF | BERTScore |
|---|---|---|---|
| Jeju Dialect → Standard | 77.19 | 83.02 | 0.97 |
| Standard → Jeju Dialect | 64.86 | 72.68 | 0.94 |
### Qualitative Evaluation (Summary)
- Adequacy: The model accurately captures the meaning of most source sentences.
- Fluency: The translated sentences are grammatically correct and natural-sounding.
- Tone: While the model generally preserves tone, it has some trouble fully reproducing the nuances and characteristic colloquial endings of the Jeju dialect.
## How to Use

You can load the model and run inference with the `transformers` library's `pipeline` function.
1. Installation

```
pip install transformers torch
```

2. Inference

```python
from transformers import pipeline

# Load the model pipeline
translator = pipeline(
    "translation",
    model="sbaru/jeju-satoru"
)

# Example: Jeju Dialect -> Standard
jeju_sentence = '[제주] 우리 집이 펜안하다.'
result = translator(jeju_sentence, max_length=128)
print(f"Input: {jeju_sentence}")
print(f"Output: {result[0]['translation_text']}")

# Example: Standard -> Jeju Dialect
standard_sentence = '[표준] 우리 집은 편안하다.'
result = translator(standard_sentence, max_length=128)
print(f"Input: {standard_sentence}")
print(f"Output: {result[0]['translation_text']}")
```