---
license: mit
datasets:
- Junhoee/Jeju-Standard-Translation
language:
- ko
metrics:
- sacrebleu
- chrf
- bertscore
base_model:
- gogamza/kobart-base-v2
tags:
- nlp
- translation
- seq2seq
- low-resource-language
- korean-dialect
- jeju-dialect
- kobart
---
|
# Jeju Satoru |
|
|
|
## Project Overview |
|
'Jeju Satoru' is a **bidirectional Jeju-Standard Korean translation model** developed to help preserve the Jeju language, which UNESCO lists as **critically endangered**. The model aims to bridge the digital divide for elderly Jeju dialect speakers by improving their digital accessibility.
|
|
|
## Model Information |
|
* **Base Model**: KoBART (`gogamza/kobart-base-v2`) |
|
* **Model Architecture**: Seq2Seq (Encoder-Decoder structure) |
|
* **Training Data**: The model was trained on a large-scale dataset of approximately 930,000 sentence pairs, built from the publicly available [Junhoee/Jeju-Standard-Translation](https://huggingface.co/datasets/Junhoee/Jeju-Standard-Translation) dataset, which primarily combines the Kakao Brain JIT (Jejueo Interview Transcripts) corpus with transcribed data from the AI Hub Jeju dialect speech dataset.
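The dataset can be pulled directly from the Hugging Face Hub with the `datasets` library. A minimal sketch (the column names printed below depend on the dataset's actual schema; see its dataset card):

```python
from datasets import load_dataset

# Load the public Jeju-Standard parallel corpus from the Hugging Face Hub.
dataset = load_dataset("Junhoee/Jeju-Standard-Translation")

# Inspect the available splits and a sample pair.
print(dataset)
print(dataset["train"][0])
```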
|
|
|
## Training Strategy and Parameters |
|
Our model was trained using a **two-stage domain adaptation method** to handle the complexities of the Jeju dialect. |
|
|
|
1. **Domain Adaptation**: The model was first trained separately on Standard Korean sentences and on Jeju dialect sentences so that it could internalize the grammar and style of each variety.
|
2. **Translation Fine-Tuning**: In the final stage, the model was trained on the bidirectional dataset, with a `[제주]` (Jeju) or `[표준]` (Standard) tag prepended to each source sentence to explicitly guide the translation direction (a formatting sketch follows this list).
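As a sketch of this tagging scheme, each parallel pair can be turned into two tagged training examples, one per direction (the helper function and field names here are hypothetical, for illustration only):

```python
def build_example(src: str, tgt: str, direction: str) -> dict:
    """Prepend the direction tag to the source sentence.

    direction: "j2s" for Jeju -> Standard, "s2j" for Standard -> Jeju.
    """
    tag = "[제주]" if direction == "j2s" else "[표준]"
    return {"input_text": f"{tag} {src}", "target_text": tgt}

# One parallel pair yields two training examples, one per direction.
pair = {"jeju": "우리 집이 펜안하다.", "standard": "우리 집은 편안하다."}
examples = [
    build_example(pair["jeju"], pair["standard"], "j2s"),
    build_example(pair["standard"], pair["jeju"], "s2j"),
]
print(examples)
```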
|
|
|
The following key hyperparameters and techniques were applied for performance optimization (a configuration sketch follows the list):
|
* **Learning Rate**: 2e-5 |
|
* **Epochs**: 3 |
|
* **Batch Size**: 128 |
|
* **Weight Decay**: 0.01 |
|
* **Generation Beams**: 5 |
|
* **GPU Memory Efficiency**: Mixed-precision training (FP16) was used to reduce memory usage and training time, together with gradient accumulation (16 steps).
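These settings map onto `transformers` training arguments roughly as follows. This is a sketch, not the authors' exact configuration; in particular, the per-device batch size of 8 is an assumption chosen so that 8 × 16 accumulation steps gives the effective batch size of 128:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="jeju-satoru",
    learning_rate=2e-5,
    num_train_epochs=3,
    # Effective batch size: 8 (per device) x 16 (accumulation) = 128 (assumed split).
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    weight_decay=0.01,
    fp16=True,                  # mixed-precision training
    predict_with_generate=True,
    generation_num_beams=5,     # beam search during evaluation
)
```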
|
|
|
## Performance Evaluation |
|
The model's performance was evaluated comprehensively, using both quantitative metrics and qualitative review.
|
|
|
### Quantitative Evaluation |
|
| Direction | SacreBLEU | chrF | BERTScore |
|--------------------------|-----------|--------|-----------|
| Jeju Dialect → Standard | 77.19 | 83.02 | 0.97 |
| Standard → Jeju Dialect | 64.86 | 72.68 | 0.94 |
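For reference, scores like these can be computed with the `evaluate` library. A minimal sketch, assuming parallel lists of model outputs and gold translations (the example strings are placeholders):

```python
import evaluate

predictions = ["우리 집은 편안하다."]    # model outputs
references = [["우리 집은 편안하다."]]   # gold translations (one list per prediction)

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

print(sacrebleu.compute(predictions=predictions, references=references))
print(chrf.compute(predictions=predictions, references=references))
# BERTScore takes flat reference strings and a language code.
print(bertscore.compute(predictions=predictions,
                        references=[r[0] for r in references], lang="ko"))
```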
|
|
|
### Qualitative Evaluation (Summary) |
|
* **Adequacy**: The model accurately captures the meaning of most source sentences. |
|
* **Fluency**: The translated sentences are grammatically correct and natural-sounding. |
|
* **Tone**: The model generally preserves the tone of the source, but it does not always capture the nuances and characteristic colloquial endings of the Jeju dialect.
|
|
|
## How to Use |
|
You can load the model and run inference with the `transformers` library's `pipeline` function.
|
|
|
**1. Installation**

```bash
pip install transformers torch
```

**2. Inference**

```python
from transformers import pipeline

# Load the translation pipeline
translator = pipeline(
    "translation",
    model="sbaru/jeju-satoru"
)

# Example: Jeju Dialect -> Standard
jeju_sentence = '[제주] 우리 집이 펜안하다.'
result = translator(jeju_sentence, max_length=128)
print(f"Input: {jeju_sentence}")
print(f"Output: {result[0]['translation_text']}")

# Example: Standard -> Jeju Dialect
standard_sentence = '[표준] 우리 집은 편안하다.'
result = translator(standard_sentence, max_length=128)
print(f"Input: {standard_sentence}")
print(f"Output: {result[0]['translation_text']}")
```
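Note that every input must start with a direction tag, `[제주]` or `[표준]`, since the tag tells the model which way to translate. For finer control over decoding, you can also call the model and tokenizer directly; a sketch, with the 5-beam search mirroring the evaluation setting above (other parameter choices are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbaru/jeju-satoru")
model = AutoModelForSeq2SeqLM.from_pretrained("sbaru/jeju-satoru")

inputs = tokenizer('[제주] 우리 집이 펜안하다.', return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```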