---
license: mit
datasets:
- Junhoee/Jeju-Standard-Translation
language:
- ko
metrics:
- sacrebleu
- chrf
- bertscore
base_model:
- gogamza/kobart-base-v2
tags:
- nlp
- translation
- seq2seq
- low-resource-language
- korean-dialect
- jeju-dialect
- kobart
---
# Jeju Satoru
## Project Overview
'Jeju Satoru' is a **bidirectional Jeju-Standard Korean translation model** developed to help preserve the Jeju language, which UNESCO designates as an **'endangered language'**. The model aims to improve digital accessibility for elderly Jeju dialect speakers and thereby help bridge the digital divide.
## Model Information
* **Base Model**: KoBART (`gogamza/kobart-base-v2`)
* **Model Architecture**: Seq2Seq (Encoder-Decoder structure)
* **Training Data**: The model was trained using a large-scale dataset of approximately 930,000 sentence pairs. The dataset was built by leveraging the publicly available [Junhoee/Jeju-Standard-Translation](https://huggingface.co/datasets/Junhoee/Jeju-Standard-Translation) dataset, which is primarily based on text from the KakaoBrain JIT (Jeju-Island-Translation) corpus and transcribed data from the AI Hub Jeju dialect speech dataset.
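
For reference, the underlying sentence pairs can be inspected directly with the `datasets` library. The snippet below is a minimal sketch; the split name is assumed and the column names are whatever the dataset card defines.

```python
from datasets import load_dataset

# Load the public Jeju <-> Standard Korean parallel corpus referenced above.
dataset = load_dataset("Junhoee/Jeju-Standard-Translation")

# Print the split sizes and one raw sentence pair.
print(dataset)
print(dataset["train"][0])  # assumes a "train" split; check the dataset card for the actual schema
```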
## Training Strategy and Parameters
Our model was trained using a **two-stage domain adaptation method** to handle the complexities of the Jeju dialect.
1. **Domain Adaptation**: The model was separately trained on Standard Korean and Jeju dialect sentences to help it deeply understand the grammar and style of each language.
2. **Translation Fine-Tuning**: The final stage trained the model on the bidirectional dataset, with `[제주]` (Jeju) and `[ν‘œμ€€]` (Standard) tags prepended to each sentence to explicitly mark the translation direction (see the sketch below).
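
For illustration, the direction tags could be prepended with a helper like this; it is a hypothetical sketch of the tagging step, not the project's actual preprocessing code.

```python
# Hypothetical tagging helper: the tag strings come from the model card,
# but the function name and input format are illustrative assumptions.
def add_direction_tag(sentence: str, source: str) -> str:
    """Prepend the tag that tells the model which way to translate."""
    tag = "[제주]" if source == "jeju" else "[ν‘œμ€€]"
    return f"{tag} {sentence}"

print(add_direction_tag("우리 집이 νŽœμ•ˆν—ˆλ‹€.", "jeju"))      # [제주] 우리 집이 νŽœμ•ˆν—ˆλ‹€.
print(add_direction_tag("우리 집은 νŽΈμ•ˆν•˜λ‹€.", "standard"))  # [ν‘œμ€€] 우리 집은 νŽΈμ•ˆν•˜λ‹€.
```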
The following key hyperparameters and techniques were applied for performance optimization:
* **Learning Rate**: 2e-5
* **Epochs**: 3
* **Batch Size**: 128
* **Weight Decay**: 0.01
* **Generation Beams**: 5
* **GPU Memory Efficiency**: Mixed-precision training (FP16) was used to reduce training time, along with Gradient Accumulation (Steps: 16).
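
As a rough sketch, these settings map onto `transformers` `Seq2SeqTrainingArguments` as shown below. The output directory and the per-device batch size are assumptions, since the card does not state whether 128 is the per-device or the effective batch size.

```python
from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the reported settings (not the original training script).
# per_device_train_batch_size=8 * gradient_accumulation_steps=16 gives an effective batch of 128;
# whether this split matches the original setup is an assumption.
training_args = Seq2SeqTrainingArguments(
    output_dir="jeju-satoru-finetune",  # placeholder path
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    weight_decay=0.01,
    fp16=True,                          # mixed-precision training
    predict_with_generate=True,
    generation_num_beams=5,
)
```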
## Performance Evaluation
The model's performance was comprehensively evaluated using both quantitative and qualitative metrics.
### Quantitative Evaluation
| Direction | SacreBLEU | chrF | BERTScore |
|--------------------------|-----------|--------|-----------|
| Jeju Dialect β†’ Standard | 77.19 | 83.02 | 0.97 |
| Standard β†’ Jeju Dialect | 64.86 | 72.68 | 0.94 |
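
These scores can be recomputed with the `evaluate` library along the lines of the sketch below. The test split and the BERTScore backbone are not specified in this card, so the example uses placeholder data and the library's default setting for Korean.

```python
import evaluate

# Placeholder outputs: replace with the model's translations and the held-out references.
predictions = ["우리 집은 νŽΈμ•ˆν•˜λ‹€."]
references = [["우리 집은 νŽΈμ•ˆν•˜λ‹€."]]

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

print(sacrebleu.compute(predictions=predictions, references=references)["score"])
print(chrf.compute(predictions=predictions, references=references)["score"])
# lang="ko" selects a default multilingual checkpoint; the original BERTScore model is unknown.
print(bertscore.compute(predictions=predictions,
                        references=[r[0] for r in references],
                        lang="ko")["f1"])
```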
### Qualitative Evaluation (Summary)
* **Adequacy**: The model accurately captures the meaning of most source sentences.
* **Fluency**: The translated sentences are grammatically correct and natural-sounding.
* **Tone**: While it generally maintains the tone well, the model has some limitations in fully reflecting the nuances and characteristic colloquial sentence endings of the Jeju dialect.
## How to Use
You can load the model and run inference with the `transformers` library's `pipeline` function.
**1. Installation**
```bash
pip install transformers torch
```

**2. Inference**

```python
from transformers import pipeline

# Load the translation pipeline
translator = pipeline(
    "translation",
    model="sbaru/jeju-satoru"
)

# Example: Jeju Dialect -> Standard
jeju_sentence = '[제주] 우리 집이 νŽœμ•ˆν—ˆλ‹€.'
result = translator(jeju_sentence, max_length=128)
print(f"Input: {jeju_sentence}")
print(f"Output: {result[0]['translation_text']}")

# Example: Standard -> Jeju Dialect
standard_sentence = '[ν‘œμ€€] 우리 집은 νŽΈμ•ˆν•˜λ‹€.'
result = translator(standard_sentence, max_length=128)
print(f"Input: {standard_sentence}")
print(f"Output: {result[0]['translation_text']}")
```