YunMin Korean Tokenizer (96k vocab)

A Korean language tokenizer with a 96,000-token vocabulary, optimized for Korean text processing.

File Descriptions

  • YunMin-tokenizer-96k.model - SentencePiece model file (2.0MB); see the loading sketch after this list
  • YunMin-tokenizer-96k.vocab - Vocabulary file (2.0MB)
  • tokenizer.json - Hugging Face tokenizer configuration
  • tokenizer_config.json - Tokenizer configuration parameters
  • special_tokens_map.json - Special tokens mapping
  • config.json - Model configuration
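
The SentencePiece model file can also be loaded directly with the sentencepiece library, independent of transformers. The sketch below is a minimal example, assuming the file is downloaded locally first (here via huggingface_hub's hf_hub_download); the printed values are expectations based on the model card, not guaranteed output.

from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Fetch the raw SentencePiece model file from the Hub
model_path = hf_hub_download("mrcha033/YunMin-tokenizer-96k", "YunMin-tokenizer-96k.model")

# Load it with the sentencepiece library
sp = spm.SentencePieceProcessor(model_file=model_path)
print(sp.get_piece_size())                    # 96,000 pieces per the model card
print(sp.encode("안녕하세요", out_type=str))  # subword pieces for a short Korean greeting ("hello")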

Usage

From Hugging Face Hub

from transformers import PreTrainedTokenizerFast

# Load the tokenizer from Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")

# Tokenize Korean text
text = "안녕하세요, 한국어 토크나이저입니다."  # "Hello, this is a Korean tokenizer."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print(f"Decoded: {decoded_text}")

Special Tokens

  • <unk> - Unknown token
  • <s> - Beginning of sequence
  • </s> - End of sequence
  • <pad> - Padding token
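
As a rough illustration of where these tokens come into play, the sketch below pads a two-sentence batch so the shorter sequence is filled with the <pad> token id. It assumes the special tokens above are registered in tokenizer_config.json and special_tokens_map.json.

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")

# Registered special tokens are exposed as attributes
print(tokenizer.unk_token, tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token)

# Padding a batch fills the shorter sequence with the <pad> token id
batch = tokenizer(["안녕하세요", "한국어 토크나이저입니다."], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])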

Vocabulary Size

96,000 tokens optimized for Korean language processing.
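
A quick way to check this from Python, assuming the tokenizer loads as in the Usage section: vocab_size reports the base vocabulary, while len() also counts any added special tokens.

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")

print(tokenizer.vocab_size)  # base vocabulary size
print(len(tokenizer))        # base vocabulary plus added special tokens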

Model Type

Unigram language model with whitespace pre-tokenization.
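
The model and pre-tokenizer types can be inspected from tokenizer.json with the tokenizers library. A minimal sketch; the exact pre-tokenizer class name depends on how the configuration was exported.

from tokenizers import Tokenizer

# Loads tokenizer.json directly from the Hub
tok = Tokenizer.from_pretrained("mrcha033/YunMin-tokenizer-96k")

print(type(tok.model).__name__)          # expected: Unigram
print(type(tok.pre_tokenizer).__name__)  # whitespace-style pre-tokenizer per the model card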
