🧠 Obfuscated Korean Text Restoration

This repository is designed for restoring obfuscated Korean text.

It was developed and validated on the dataset from the 2024 Dacon Obfuscated Korean Review Restoration AI Competition; see the competition page for more details on the dataset and modeling approach.

🔧 Features

This repository includes the following components:

  1. Pretrained Korean Text Restoration Model
    A Gemma-based model trained to restore obfuscated Korean text to its original, human-readable form.

  2. Syllable-level Korean Tokenizer
    A tokenizer tailored to process Korean at the syllable level for improved granularity and performance.

  3. Flexible Korean Sentence Splitter
    A sentence segmentation tool that handles the complexities of Korean syntax effectively.

  4. Korean Text Obfuscator
    A module for simulating text obfuscation, useful for training and evaluation.

1. Pretrained Korean Text Restoration Model

This pretrained model restores obfuscated Korean text by converting broken or scrambled Hangul into fluent, natural Korean.
It was fine-tuned for restoring Korean tour reviews.

✅ Example Usage

For Short Text
from transformers import AutoModel

# Load the tokenizer and model
hangul_tokenizer = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', subfolder='hangul_tokenizer', trust_remote_code=True)
hangul_deobfuscator = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', trust_remote_code=True)
hangul_deobfuscator.load_hangul_tokenizer(hangul_tokenizer)

# Example
text = '얀녕향셈욧.'
restored = hangul_deobfuscator.deobfuscate(text)
print(restored)  # '안녕하세요.'
For Long Sentences
from transformers import AutoModel

# Load models
hangul_tokenizer = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', subfolder='hangul_tokenizer', trust_remote_code=True)
sentence_tokenizer = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', subfolder='sentence_tokenizer', trust_remote_code=True)
hangul_deobfuscator = AutoModel.from_pretrained('jwengr/gemma2-2b-kor-deobfuscation', trust_remote_code=True)
hangul_deobfuscator.load_hangul_tokenizer(hangul_tokenizer)

# Example
sentence = '''별 한 게토 았깝땀. 왜 싸람듯릭 펼 1캐를 쥰눈징 컥꺾폰 싸람믐롯섞 맒록 섧멍향쟈닝 탯뎐룐눈 녀뮤 퀼교... 야뭐튼 둠 변 닺씨 깐낄 싫훈 굣. 깸삥슝 20여 년 댜념뵨 곧 중 쩨윌 귑푼 낙팠떤 곶.'''
restored = hangul_deobfuscator.deobfuscate(sentence, sentence_tokenizer)
print(restored)
# '별 한 개도 아깝다. 왜 사람들이 별 1개를 주는지 겪어본 사람으로서 말로 설명하자니 댓글로는 너무 길고... 아무튼 두 번 다시 가길 싫은 곳. 캠핑을 20여 년 다녀본 곳 중 제일 기분 나빴던 곳.'

2. Syllable-level Korean Tokenizer

A tokenizer tailored to process Korean at the syllable level for improved granularity and performance.

✅ Example Usage

from transformers import AutoModel

hangul_tokenizer = AutoModel.from_pretrained(
    'jwengr/gemma2-2b-kor-deobfuscation',
    subfolder='hangul_tokenizer',
    trust_remote_code=True
)

encoded_ids, token_type_ids = hangul_tokenizer.encode_char('a안b녕c하d세e요!')
decoded_text = hangul_tokenizer.decode_char(encoded_ids, token_type_ids)
encoded_ids, token_type_ids = hangul_tokenizer.encode_jamo('a안b녕c하d세e요!')
decoded_text = hangul_tokenizer.decode_jamo(encoded_ids, token_type_ids)
print(decoded_text)
# Output: 'a안b녕c하d세e요!'
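Syllable-level and jamo-level processing work because precomposed Hangul syllables (U+AC00–U+D7A3) encode their jamo arithmetically, so any syllable can be split into (initial, medial, final) indices and recomposed losslessly. A minimal sketch of that arithmetic (standard Unicode math, independent of this repository's tokenizer):

```python
# Standard Unicode arithmetic for Hangul: a syllable's code point is
# SBASE + (initial * 21 + medial) * 28 + final.
SBASE, MCOUNT, FCOUNT = 0xAC00, 21, 28

def decompose(syllable):
    # Map one precomposed syllable to its (initial, medial, final) indices.
    offset = ord(syllable) - SBASE
    if not 0 <= offset < 19 * MCOUNT * FCOUNT:
        raise ValueError('not a precomposed Hangul syllable')
    return offset // (MCOUNT * FCOUNT), (offset // FCOUNT) % MCOUNT, offset % FCOUNT

def compose(initial, medial, final):
    # Inverse of decompose: rebuild the precomposed syllable.
    return chr(SBASE + (initial * MCOUNT + medial) * FCOUNT + final)

print(decompose('한'))            # (18, 0, 4): ㅎ + ㅏ + ㄴ
print(compose(*decompose('한')))  # '한'
```

The repository's encode_jamo/decode_jamo presumably round-trip through a representation like this; the indices above follow the Unicode standard's jamo ordering.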

3. Flexible Korean Sentence Splitter

A sentence segmentation tool that handles the complexities of Korean syntax effectively.

✅ Example Usage

from transformers import AutoModel

sentence_tokenizer = AutoModel.from_pretrained(
    'jwengr/gemma2-2b-kor-deobfuscation',
    subfolder='sentence_tokenizer',
    trust_remote_code=True
)

text = '''아... 가격 좋고 뷰도 뻥 뚫려서 시원하지만 담배 냄새 미쳐버림. 싸게 하루만 묵겠다! 하는 사람한테만 추천. 담배 냄새가 모든 장점을 가져가는 곳. 노래방에서 각종 담배와 유흥에 쩔었을 때 나는 냄새가 계속 방에 있음 ㅆ... 싸니까 할 말 없음.'''

# Split into sentence chunks
chunks = sentence_tokenizer.split_text(text)
print(chunks)
# Output: [
#   '아... 가격 좋고 뷰도 뻥 뚫려서 시원하지만 담배 냄새 미쳐버림. 싸게 하루만 묵겠다! 하는 사람한테만 추천. ',
#   '담배 냄새가 모든 장점을 가져가는 곳. 노래방에서 각종 담배와 유흥에 쩔었을 때 나는 냄새가 계속 방에 있음 ',
#   'ㅆ... 싸니까 할 말 없음.'
# ]

# Apply overlapping windows
chunks_overlapped = sentence_tokenizer.overlap(chunks)
print(chunks_overlapped)
# Output:
# [
#   (0, 64, '아... 가격 좋고 뷰도 뻥 뚫려서 시원하지만 담배 냄새 미쳐버림. 싸게 하루만 묵겠다! 하는 사람한테만 추천.'),
#   (17, 86, '뚫려서 시원하지만 담배 냄새 미쳐버림. 싸게 하루만 묵겠다! 하는 사람한테만 추천. 담배 냄새가 모든 장점을 가져가는 곳.'),
#   (42, 109, '하루만 묵겠다! 하는 사람한테만 추천. 담배 냄새가 모든 장점을 가져가는 곳. 노래방에서 각종 담배와 유흥에 쩔었을 때'),
#   (64, 125, '담배 냄새가 모든 장점을 가져가는 곳. 노래방에서 각종 담배와 유흥에 쩔었을 때 나는 냄새가 계속 방에 있음'),
#   (86, 130, '노래방에서 각종 담배와 유흥에 쩔었을 때 나는 냄새가 계속 방에 있음 ㅆ...'),
#   (109, 134, '나는 냄새가 계속 방에 있음 ㅆ... 싸니까'),
#   (125, 141, 'ㅆ... 싸니까 할 말 없음.')
# ]

# Print the reconstructed text
decoded = sentence_tokenizer.decode_overlap(chunks_overlapped)
print(decoded)
# Output:
# '아... 가격 좋고 뷰도 뻥 뚫려서 시원하지만 담배 냄새 미쳐버림. 싸게 하루만 묵겠다! 하는 사람한테만 추천. 담배 냄새가 모든 장점을 가져가는 곳. 노래방에서 각종 담배와 유흥에 쩔었을 때 나는 냄새가 계속 방에 있음 ㅆ... 싸니까 할 말 없음.'
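The overlap step emits (start, end, text) spans, and decode_overlap stitches them back into one string. A hypothetical sketch of that stitching idea (not the repository's actual implementation, which presumably also reconciles disagreements between overlapping deobfuscated chunks):

```python
def merge_spans(spans):
    # Write each span's characters into a position-indexed buffer;
    # overlapping regions simply coincide (later spans overwrite earlier ones).
    buffer = {}
    for start, _end, text in spans:
        for i, ch in enumerate(text):
            buffer[start + i] = ch
    return ''.join(buffer[i] for i in sorted(buffer))

# Toy spans with a two-character overlap at positions 3-4
print(merge_spans([(0, 5, 'hello'), (3, 11, 'lo world')]))  # 'hello world'
```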

4. Korean Text Obfuscator

A module for simulating Korean text obfuscation, useful for training, data augmentation, and evaluation.
It generates noisy or obfuscated versions of input text to mimic real-world corrupted or user-modified input.

✅ Example Usage

from transformers import AutoModel

hangul_augmentator = AutoModel.from_pretrained(
    'jwengr/gemma2-2b-kor-deobfuscation',
    subfolder='hangul_augmentator',
    trust_remote_code=True
)

# Input text
text = '안녕하세요'

# Obfuscated output
obfuscated = hangul_augmentator(text)
print(obfuscated)
# Output: '안녕함쒷오'
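For intuition, this kind of obfuscation can be imitated by perturbing jamo inside each syllable. The toy sketch below randomizes only the final-consonant slot using standard Unicode arithmetic; it is an illustration, not the repository's augmentator, whose actual perturbation strategy is not documented here:

```python
import random

SBASE, MCOUNT, FCOUNT = 0xAC00, 21, 28  # Hangul syllable arithmetic constants

def toy_obfuscate(text, seed=0):
    # Keep each syllable's initial+medial jamo, randomize its final consonant.
    rng = random.Random(seed)
    out = []
    for ch in text:
        offset = ord(ch) - SBASE
        if 0 <= offset < 19 * MCOUNT * FCOUNT:
            out.append(chr(SBASE + (offset // FCOUNT) * FCOUNT + rng.randrange(FCOUNT)))
        else:
            out.append(ch)  # pass non-Hangul characters through unchanged
    return ''.join(out)

print(toy_obfuscate('안녕하세요'))  # a randomly perturbed variant of the input
```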
Model size: 2.77B params (Safetensors, F32)

Model tree for jwengr/gemma2-2b-kor-deobfuscation

Base model: google/gemma-2-2b (this model is a fine-tune of it)