YunMin Korean Tokenizer (96k vocab)

A Korean language tokenizer with a 96,000-token vocabulary, optimized for Korean text processing.

File Descriptions

  • YunMin-tokenizer-96k.model - SentencePiece model file (2.0MB); see the loading sketch after this list
  • YunMin-tokenizer-96k.vocab - Vocabulary file (2.0MB)
  • tokenizer.json - Hugging Face tokenizer configuration
  • tokenizer_config.json - Tokenizer configuration parameters
  • special_tokens_map.json - Special tokens mapping
  • config.json - Model configuration
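
The SentencePiece model file can also be loaded directly with the sentencepiece library, independent of transformers. The sketch below is a minimal example, assuming the file is downloaded locally first (here via huggingface_hub's hf_hub_download); the printed values are expectations based on the model card, not guaranteed output.

from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Fetch the raw SentencePiece model file from the Hub
model_path = hf_hub_download("mrcha033/YunMin-tokenizer-96k", "YunMin-tokenizer-96k.model")

# Load it with the sentencepiece library
sp = spm.SentencePieceProcessor(model_file=model_path)
print(sp.get_piece_size())                    # 96,000 pieces per the model card
print(sp.encode("안녕하세요", out_type=str))  # subword pieces for a short Korean greeting ("hello")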

Usage

From Hugging Face Hub

from transformers import PreTrainedTokenizerFast

# Load the tokenizer from Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")

# Tokenize Korean text
text = "안녕하세요, 한국어 토크나이저입니다."  # "Hello, this is a Korean tokenizer."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print(f"Decoded: {decoded_text}")

Special Tokens

  • <unk> - Unknown token
  • <s> - Beginning of sequence
  • </s> - End of sequence
  • <pad> - Padding token
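
As a rough illustration of where these tokens come into play, the sketch below pads a two-sentence batch so the shorter sequence is filled with the <pad> token id. It assumes the special tokens above are registered in tokenizer_config.json and special_tokens_map.json.

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")

# Registered special tokens are exposed as attributes
print(tokenizer.unk_token, tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token)

# Padding a batch fills the shorter sequence with the <pad> token id
batch = tokenizer(["안녕하세요", "한국어 토크나이저입니다."], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])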

Vocabulary Size

96,000 tokens optimized for Korean language processing.
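
A quick way to check this from Python, assuming the tokenizer loads as in the Usage section: vocab_size reports the base vocabulary, while len() also counts any added special tokens.

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")

print(tokenizer.vocab_size)  # base vocabulary size
print(len(tokenizer))        # base vocabulary plus added special tokens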

Model Type

Unigram language model with whitespace pre-tokenization.
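
The model and pre-tokenizer types can be inspected from tokenizer.json with the tokenizers library. A minimal sketch; the exact pre-tokenizer class name depends on how the configuration was exported.

from tokenizers import Tokenizer

# Loads tokenizer.json directly from the Hub
tok = Tokenizer.from_pretrained("mrcha033/YunMin-tokenizer-96k")

print(type(tok.model).__name__)          # expected: Unigram
print(type(tok.pre_tokenizer).__name__)  # whitespace-style pre-tokenizer per the model card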
