# Obfuscated Korean Text Restoration
This repository provides tools for restoring obfuscated Korean text. It was developed and validated on the dataset from the 2024 Dacon Obfuscated Korean Review Restoration AI Competition; see the competition page for details on the dataset and modeling approach.
## Features
This repository includes the following components:

- **Pretrained Korean Text Restoration Model**: a Gemma model trained to restore obfuscated Korean text to its original, human-readable form.
- **Syllable-level Korean Tokenizer**: a tokenizer tailored to process Korean at the syllable level for improved granularity and performance.
- **Flexible Korean Sentence Splitter**: a sentence segmentation tool that handles the complexities of Korean syntax effectively.
- **Korean Text Obfuscator**: a module for simulating text obfuscation, useful for training and evaluation.
## 1. Pretrained Korean Text Restoration Model
This pretrained model restores obfuscated Korean text by converting broken or scrambled Hangul into fluent, natural Korean. It was fine-tuned on Korean tour review restoration data.
### Example Usage

#### For Short Text
```python
from transformers import AutoModel

# Load the tokenizer and model
hangul_tokenizer = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', subfolder='hangul_tokenizer', trust_remote_code=True)
hangul_deobfuscator = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', trust_remote_code=True)
hangul_deobfuscator.load_hangul_tokenizer(hangul_tokenizer)

# Example: obfuscated input
text = '์๋ํฅ์์ง.'
restored = hangul_deobfuscator.deobfuscate(text)
print(restored)  # '์๋ํ์ธ์.'
```
#### For Long Sentences
```python
from transformers import AutoModel

# Load models
hangul_tokenizer = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', subfolder='hangul_tokenizer', trust_remote_code=True)
sentence_tokenizer = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', subfolder='sentence_tokenizer', trust_remote_code=True)
hangul_deobfuscator = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', trust_remote_code=True)
hangul_deobfuscator.load_hangul_tokenizer(hangul_tokenizer)

# Example: obfuscated input
sentence = '''๋ณ ํ ๊ฒํ ์๊น๋. ์ ์ธ๋๋ฏ๋ฆญ ํผ 1์บ๋ฅผ ์ฅฐ๋์ง ์ปฅ๊บพํฐ ์ธ๋๋ฏ๋กฏ์ ๋ง๋ก ์ง๋ฉํฅ์๋ ํฏ๋๋ฃ๋ ๋๋ฎค ํผ๊ต... ์ผ๋ญํฐ ๋ ๋ณ ๋บ์จ ๊น๋ ์ซํ ๊ตฃ. ๊นธ์ฅ์ 20์ฌ ๋๋๋๋ตจ ๊ณง ์ค ์ฉจ์ ๊ทํผ ๋ํ ๋ค ๊ณถ.'''
restored = hangul_deobfuscator.deobfuscate(sentence, sentence_tokenizer)
print(restored)
# '๋ณ ํ ๊ฐ๋ ์๊น๋ค. ์ ์ฌ๋๋ค์ด ๋ณ 1๊ฐ๋ฅผ ์ฃผ๋์ง ๊ฒช์ด๋ณธ ์ฌ๋์ผ๋ก์ ๋ง๋ก ์ค๋ชํ์๋ ๋๊ธ๋ก๋ ๋๋ฌด ๊ธธ๊ณ ... ์๋ฌดํผ ๋ ๋ฒ ๋ค์ ๊ฐ๊ธธ ์ซ์ ๊ณณ. ์บ ํ์ 20์ฌ ๋๋ค๋๋ณธ ๊ณณ ์ค ์ ์ผ ๊ธฐ๋ถ ๋๋นด๋ ๊ณณ.'
```
## 2. Syllable-level Korean Tokenizer
A tokenizer tailored to process Korean at the syllable level for improved granularity and performance.
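As background on what syllable- and jamo-level processing means, here is a minimal standalone sketch (not this repository's tokenizer; the function name and jamo tables below are illustrative) that decomposes a precomposed Hangul syllable into its jamo using the standard Unicode arithmetic for the Hangul Syllables block:

```python
# Unicode precomposed Hangul: each syllable in U+AC00..U+D7A3 encodes
# (initial, medial, final) jamo as
#   code = 0xAC00 + (lead * 21 + vowel) * 28 + tail
CHOSEONG = list('ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ')           # 19 initials
JUNGSEONG = list('ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ')       # 21 medials
JONGSEONG = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')  # 28 finals (incl. none)

def decompose(syllable: str) -> tuple[str, str, str]:
    """Split one Hangul syllable into (initial, medial, final) jamo."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code <= 0xD7A3 - 0xAC00:
        raise ValueError(f'not a precomposed Hangul syllable: {syllable!r}')
    lead, rest = divmod(code, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return CHOSEONG[lead], JUNGSEONG[vowel], JONGSEONG[tail]
```

Working at this granularity is what lets a restoration model see and fix individual corrupted jamo; the repository's tokenizer exposes both levels through `encode_char` and `encode_jamo`.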
### Example Usage
```python
from transformers import AutoModel

hangul_tokenizer = AutoModel.from_pretrained(
    'jwengr/gemma2-2b-kor-deobfuscation',
    subfolder='hangul_tokenizer',
    trust_remote_code=True
)

# Syllable-level encode/decode round trip
encoded_ids, token_type_ids = hangul_tokenizer.encode_char('a์b๋cํd์ธe์!')
decoded_text = hangul_tokenizer.decode_char(encoded_ids, token_type_ids)

# Jamo-level encode/decode round trip
encoded_ids, token_type_ids = hangul_tokenizer.encode_jamo('a์b๋cํd์ธe์!')
decoded_text = hangul_tokenizer.decode_jamo(encoded_ids, token_type_ids)

print(decoded_text)
# Output: 'a์b๋cํd์ธe์!'
```
## 3. Flexible Korean Sentence Splitter
A sentence segmentation tool that handles the complexities of Korean syntax effectively.
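For contrast, the naive baseline would be a rule-based splitter on sentence-final punctuation. The sketch below (a hypothetical baseline, not part of this repository) shows that approach; it breaks down on obfuscated reviews, where punctuation is often missing or corrupted, which is what motivates a more flexible splitter:

```python
import re

def naive_split(text: str) -> list[str]:
    """Split on sentence-final punctuation followed by whitespace.

    A toy baseline: obfuscated text often lacks reliable punctuation,
    so a rule like this under- or over-segments it."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
```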
### Example Usage
```python
from transformers import AutoModel

sentence_tokenizer = AutoModel.from_pretrained(
    'jwengr/gemma2-2b-kor-deobfuscation',
    subfolder='sentence_tokenizer',
    trust_remote_code=True
)

text = '''์... ๊ฐ๊ฒฉ ์ข๊ณ ๋ทฐ๋ ๋ปฅ ๋ซ๋ ค์ ์์ํ์ง๋ง ๋ด๋ฐฐ ๋์ ๋ฏธ์ณ๋ฒ๋ฆผ. ์ธ๊ฒ ํ๋ฃจ๋ง ๋ฌต๊ฒ ๋ค! ํ๋ ์ฌ๋ํํ๋ง ์ถ์ฒ. ๋ด๋ฐฐ ๋์๊ฐ ๋ชจ๋ ์ฅ์ ์ ๊ฐ์ ธ๊ฐ๋ ๊ณณ. ๋ธ๋๋ฐฉ์์ ๊ฐ์ข๋ด๋ฐฐ์ ์ ํฅ์ ์ฉ์์ ๋ ๋๋ ๋์๊ฐ ๊ณ์ ๋ฐฉ์ ์์ ใ... ์ธ๋๊น ํ ๋ง ์์.'''

# Split the text into sentence chunks
chunks = sentence_tokenizer.split_text(text)
print(chunks)
# Output: [
#     '์... ๊ฐ๊ฒฉ ์ข๊ณ ๋ทฐ๋ ๋ปฅ ๋ซ๋ ค์ ์์ํ์ง๋ง ๋ด๋ฐฐ ๋์ ๋ฏธ์ณ๋ฒ๋ฆผ. ์ธ๊ฒ ํ๋ฃจ๋ง ๋ฌต๊ฒ ๋ค! ํ๋ ์ฌ๋ํํ๋ง ์ถ์ฒ. ',
#     '๋ด๋ฐฐ ๋์๊ฐ ๋ชจ๋ ์ฅ์ ์ ๊ฐ์ ธ๊ฐ๋ ๊ณณ. ๋ธ๋๋ฐฉ์์ ๊ฐ์ข๋ด๋ฐฐ์ ์ ํฅ์ ์ฉ์์ ๋ ๋๋ ๋์๊ฐ ๊ณ์ ๋ฐฉ์ ์์ ',
#     'ใ... ์ธ๋๊น ํ ๋ง ์์.'
# ]

# Apply overlap between adjacent chunks
chunks_overlapped = sentence_tokenizer.overlap(chunks)
print(chunks_overlapped)
# Output: [
#     (0, 64, '์... ๊ฐ๊ฒฉ ์ข๊ณ ๋ทฐ๋ ๋ปฅ ๋ซ๋ ค์ ์์ํ์ง๋ง ๋ด๋ฐฐ ๋์ ๋ฏธ์ณ๋ฒ๋ฆผ. ์ธ๊ฒ ํ๋ฃจ๋ง ๋ฌต๊ฒ ๋ค! ํ๋ ์ฌ๋ํํ๋ง ์ถ์ฒ.'),
#     (17, 86, '๋ซ๋ ค์ ์์ํ์ง๋ง ๋ด๋ฐฐ ๋์ ๋ฏธ์ณ๋ฒ๋ฆผ. ์ธ๊ฒ ํ๋ฃจ๋ง ๋ฌต๊ฒ ๋ค! ํ๋ ์ฌ๋ํํ๋ง ์ถ์ฒ. ๋ด๋ฐฐ ๋์๊ฐ ๋ชจ๋ ์ฅ์ ์ ๊ฐ์ ธ๊ฐ๋ ๊ณณ.'),
#     (42, 109, 'ํ๋ฃจ๋ง ๋ฌต๊ฒ ๋ค! ํ๋ ์ฌ๋ํํ๋ง ์ถ์ฒ. ๋ด๋ฐฐ ๋์๊ฐ ๋ชจ๋ ์ฅ์ ์ ๊ฐ์ ธ๊ฐ๋ ๊ณณ. ๋ธ๋๋ฐฉ์์ ๊ฐ์ข๋ด๋ฐฐ์ ์ ํฅ์ ์ฉ์์ ๋'),
#     (64, 125, '๋ด๋ฐฐ ๋์๊ฐ ๋ชจ๋ ์ฅ์ ์ ๊ฐ์ ธ๊ฐ๋ ๊ณณ. ๋ธ๋๋ฐฉ์์ ๊ฐ์ข๋ด๋ฐฐ์ ์ ํฅ์ ์ฉ์์ ๋ ๋๋ ๋์๊ฐ ๊ณ์ ๋ฐฉ์ ์์'),
#     (86, 130, '๋ธ๋๋ฐฉ์์ ๊ฐ์ข๋ด๋ฐฐ์ ์ ํฅ์ ์ฉ์์ ๋ ๋๋ ๋์๊ฐ ๊ณ์ ๋ฐฉ์ ์์ ใ...'),
#     (109, 134, '๋๋ ๋์๊ฐ ๊ณ์ ๋ฐฉ์ ์์ ใ... ์ธ๋๊น'),
#     (125, 141, 'ใ... ์ธ๋๊น ํ ๋ง ์์.')
# ]

# Print the restored text
decoded = sentence_tokenizer.decode_overlap(chunks_overlapped)
print(decoded)
# Output: '์... ๊ฐ๊ฒฉ ์ข๊ณ ๋ทฐ๋ ๋ปฅ ๋ซ๋ ค์ ์์ํ์ง๋ง ๋ด๋ฐฐ ๋์ ๋ฏธ์ณ๋ฒ๋ฆผ. ์ธ๊ฒ ํ๋ฃจ๋ง ๋ฌต๊ฒ ๋ค! ํ๋ ์ฌ๋ํํ๋ง ์ถ์ฒ. ๋ด๋ฐฐ ๋์๊ฐ ๋ชจ๋ ์ฅ์ ์ ๊ฐ์ ธ๊ฐ๋ ๊ณณ. ๋ธ๋๋ฐฉ์์ ๊ฐ์ข๋ด๋ฐฐ์ ์ ํฅ์ ์ฉ์์ ๋ ๋๋ ๋์๊ฐ ๊ณ์ ๋ฐฉ์ ์์ ใ... ์ธ๋๊น ํ ๋ง ์์.'
```
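The `(start, end, text)` tuples above can be merged back into one string by keeping, for each chunk, only the part past the previous chunk's end offset. The sketch below is a simplified stand-in for `decode_overlap` (the function name `merge_overlapped` is my own, and the real implementation may additionally reconcile disagreements between overlapping restorations), assuming each chunk's text length matches its offset span:

```python
def merge_overlapped(chunks: list[tuple[int, int, str]]) -> str:
    """Merge (start, end, text) chunks that overlap by character offsets."""
    chunks = sorted(chunks, key=lambda c: c[0])
    result = ''
    prev_end = 0
    for start, end, text in chunks:
        if end <= prev_end:
            continue  # chunk fully covered by what we already emitted
        # drop the prefix of this chunk that overlaps the previous one
        result += text[max(prev_end - start, 0):]
        prev_end = end
    return result
```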
## 4. Korean Text Obfuscator
A module for simulating Korean text obfuscation, useful for training, data augmentation, and evaluation.
It generates noisy or obfuscated versions of input text to mimic real-world corrupted or user-modified input.
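As a rough illustration of what such obfuscation can look like (the repository's augmentator is a loaded model; the rule-based stand-in below, including the function name `obfuscate_tails`, is purely hypothetical), one can perturb the final-consonant slot of each syllable using the same Unicode arithmetic as for decomposition:

```python
import random

def obfuscate_tails(text: str, seed: int = 0) -> str:
    """Randomly replace the final-consonant slot of each Hangul syllable.

    A toy stand-in for a learned obfuscator: real obfuscation also
    perturbs initials and vowels, and mimics human substitution patterns."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code <= 0xD7A3 - 0xAC00:          # precomposed Hangul only
            body = code - code % 28                # keep initial + vowel
            code = body + rng.randrange(28)        # random final (0 = none)
            out.append(chr(0xAC00 + code))
        else:
            out.append(ch)                         # pass non-Hangul through
    return ''.join(out)
```

Because the initial and vowel are preserved, the result stays partially readable, which is the property a restoration model exploits during training.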
### Example Usage
```python
from transformers import AutoModel

hangul_augmentator = AutoModel.from_pretrained(
    'jwengr/gemma2-2b-kor-deobfuscation',
    subfolder='hangul_augmentator',
    trust_remote_code=True
)

# Input sentence
text = '์๋ํ์ธ์'

# Obfuscated output
obfuscated = hangul_augmentator(text)
print(obfuscated)
# Output: '์๋ํจ์ท์ค'
```
**Base model:** google/gemma-2-2b