---
license: mit
datasets:
- Junhoee/Jeju-Standard-Translation
language:
- ko
metrics:
- sacrebleu
- chrf
- bertscore
base_model:
- gogamza/kobart-base-v2
tags:
- nlp
- translation
- seq2seq
- low-resource-language
- korean-dialect
- jeju-dialect
- kobart
---
|
# Jeju Satoru |
|
|
|
## Project Overview |
|
'Jeju Satoru' is a **bidirectional Jeju-Standard Korean translation model** developed to help preserve the Jeju language, which UNESCO lists as **critically endangered**. The model aims to bridge the digital divide for elderly Jeju dialect speakers by improving their digital accessibility.
|
|
|
## Model Information |
|
* **Base Model**: KoBART (`gogamza/kobart-base-v2`) |
|
* **Model Architecture**: Seq2Seq (Encoder-Decoder structure) |
|
* **Training Data**: The model was trained on a large-scale dataset of approximately 930,000 sentence pairs, built from the publicly available [Junhoee/Jeju-Standard-Translation](https://huggingface.co/datasets/Junhoee/Jeju-Standard-Translation) dataset, which primarily combines the Kakao Brain JIT (Jejueo Interview Transcripts) corpus with transcribed data from the AI Hub Jeju dialect speech dataset.
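The dataset can be pulled directly from the Hugging Face Hub with the `datasets` library. A minimal sketch (the column names printed below depend on the dataset's actual schema; see its dataset card):

```python
from datasets import load_dataset

# Load the public Jeju-Standard parallel corpus from the Hugging Face Hub.
dataset = load_dataset("Junhoee/Jeju-Standard-Translation")

# Inspect the available splits and a sample pair.
print(dataset)
print(dataset["train"][0])
```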
|
|
|
## Training Strategy and Parameters |
|
Our model was trained using a **two-stage domain adaptation method** to handle the complexities of the Jeju dialect. |
|
|
|
1. **Domain Adaptation**: The model was first trained separately on Standard Korean sentences and on Jeju dialect sentences so that it could internalize the grammar and style of each variety.
|
2. **Translation Fine-Tuning**: In the final stage, the model was trained on the bidirectional dataset, with a `[제주]` (Jeju) or `[표준]` (Standard) tag prepended to each source sentence to explicitly guide the translation direction (a formatting sketch follows this list).
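As a sketch of this tagging scheme, each parallel pair can be turned into two tagged training examples, one per direction (the helper function and field names here are hypothetical, for illustration only):

```python
def build_example(src: str, tgt: str, direction: str) -> dict:
    """Prepend the direction tag to the source sentence.

    direction: "j2s" for Jeju -> Standard, "s2j" for Standard -> Jeju.
    """
    tag = "[제주]" if direction == "j2s" else "[표준]"
    return {"input_text": f"{tag} {src}", "target_text": tgt}

# One parallel pair yields two training examples, one per direction.
pair = {"jeju": "우리 집이 펜안하다.", "standard": "우리 집은 편안하다."}
examples = [
    build_example(pair["jeju"], pair["standard"], "j2s"),
    build_example(pair["standard"], pair["jeju"], "s2j"),
]
print(examples)
```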
|
|
|
The following key hyperparameters and techniques were applied for performance optimization (a configuration sketch follows the list):
|
* **Learning Rate**: 2e-5 |
|
* **Epochs**: 3 |
|
* **Batch Size**: 128 |
|
* **Weight Decay**: 0.01 |
|
* **Generation Beams**: 5 |
|
* **GPU Memory Efficiency**: Mixed-precision training (FP16) was used to reduce memory usage and training time, together with gradient accumulation (16 steps).
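These settings map onto `transformers` training arguments roughly as follows. This is a sketch, not the authors' exact configuration; in particular, the per-device batch size of 8 is an assumption chosen so that 8 × 16 accumulation steps gives the effective batch size of 128:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="jeju-satoru",
    learning_rate=2e-5,
    num_train_epochs=3,
    # Effective batch size: 8 (per device) x 16 (accumulation) = 128 (assumed split).
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    weight_decay=0.01,
    fp16=True,                  # mixed-precision training
    predict_with_generate=True,
    generation_num_beams=5,     # beam search during evaluation
)
```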
|
|
|
## Performance Evaluation |
|
The model's performance was evaluated comprehensively, using both quantitative metrics and qualitative review.
|
|
|
### Quantitative Evaluation |
|
| Direction | SacreBLEU | chrF | BERTScore |
|--------------------------|-----------|--------|-----------|
| Jeju Dialect → Standard | 77.19 | 83.02 | 0.97 |
| Standard → Jeju Dialect | 64.86 | 72.68 | 0.94 |
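For reference, scores like these can be computed with the `evaluate` library. A minimal sketch, assuming parallel lists of model outputs and gold translations (the example strings are placeholders):

```python
import evaluate

predictions = ["우리 집은 편안하다."]    # model outputs
references = [["우리 집은 편안하다."]]   # gold translations (one list per prediction)

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

print(sacrebleu.compute(predictions=predictions, references=references))
print(chrf.compute(predictions=predictions, references=references))
# BERTScore takes flat reference strings and a language code.
print(bertscore.compute(predictions=predictions,
                        references=[r[0] for r in references], lang="ko"))
```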
|
|
|
### Qualitative Evaluation (Summary) |
|
* **Adequacy**: The model accurately captures the meaning of most source sentences. |
|
* **Fluency**: The translated sentences are grammatically correct and natural-sounding. |
|
* **Tone**: The model generally preserves the tone of the source, but it does not always capture the nuances and characteristic colloquial endings of the Jeju dialect.
|
|
|
## How to Use |
|
You can load the model and run inference with the `transformers` library's `pipeline` function.
|
|
|
**1. Installation**

```bash
pip install transformers torch
```

**2. Inference**

```python
from transformers import pipeline

# Load the translation pipeline
translator = pipeline(
    "translation",
    model="sbaru/jeju-satoru"
)

# Example: Jeju Dialect -> Standard
jeju_sentence = '[제주] 우리 집이 펜안하다.'
result = translator(jeju_sentence, max_length=128)
print(f"Input: {jeju_sentence}")
print(f"Output: {result[0]['translation_text']}")

# Example: Standard -> Jeju Dialect
standard_sentence = '[표준] 우리 집은 편안하다.'
result = translator(standard_sentence, max_length=128)
print(f"Input: {standard_sentence}")
print(f"Output: {result[0]['translation_text']}")
```
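Note that every input must start with a direction tag, `[제주]` or `[표준]`, since the tag tells the model which way to translate. For finer control over decoding, you can also call the model and tokenizer directly; a sketch, with the 5-beam search mirroring the evaluation setting above (other parameter choices are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbaru/jeju-satoru")
model = AutoModelForSeq2SeqLM.from_pretrained("sbaru/jeju-satoru")

inputs = tokenizer('[제주] 우리 집이 펜안하다.', return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```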