YunMin Korean Tokenizer (96k vocab)
A Korean tokenizer with a 96,000-token vocabulary, optimized for Korean text processing.
File Descriptions
- YunMin-tokenizer-96k.model - SentencePiece model file (2.0MB)
- YunMin-tokenizer-96k.vocab - Vocabulary file (2.0MB)
- tokenizer.json - Hugging Face tokenizer configuration
- tokenizer_config.json - Tokenizer configuration parameters
- special_tokens_map.json - Special tokens mapping
- config.json - Model configuration
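The raw SentencePiece files can also be used without transformers. The following is a minimal sketch, assuming the sentencepiece and huggingface_hub packages are installed; the repo and file names are taken from this card:
import sentencepiece as spm
from huggingface_hub import hf_hub_download
# Fetch the raw SentencePiece model file from the Hub and load it directly
model_path = hf_hub_download("mrcha033/YunMin-tokenizer-96k", "YunMin-tokenizer-96k.model")
sp = spm.SentencePieceProcessor(model_file=model_path)
print(sp.get_piece_size())                   # vocabulary size
print(sp.encode("안녕하세요", out_type=str))  # subword pieces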
Usage
From Hugging Face Hub
from transformers import PreTrainedTokenizerFast
# Load the tokenizer from Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")
# Tokenize Korean text
text = "안녕하세요, 한국어 토크나이저입니다."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print(f"Decoded: {decoded_text}")
Special Tokens
- <unk> - Unknown token
- <s> - Beginning of sequence
- </s> - End of sequence
- <pad> - Padding token
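The IDs assigned to these tokens can be inspected once the tokenizer is loaded; a short sketch (the exact IDs depend on the configuration shipped in this repo):
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("mrcha033/YunMin-tokenizer-96k")
# Print each special token alongside its ID
for token in ["<unk>", "<s>", "</s>", "<pad>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))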
Vocabulary Size
96,000 tokens optimized for Korean language processing.
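A quick sketch to confirm this with the tokenizer loaded as above (note that len(tokenizer) also counts any added special tokens):
print(tokenizer.vocab_size)  # expected: 96000
print(len(tokenizer))        # includes any added special tokens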
Model Type
Unigram language model with whitespace pre-tokenization.
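Because the underlying model is a unigram LM, SentencePiece can also sample alternative segmentations of the same text (subword regularization). A sketch, assuming the .model file has been downloaded locally:
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="YunMin-tokenizer-96k.model")
# Deterministic (most likely) segmentation
print(sp.encode("한국어 토크나이저입니다.", out_type=str))
# Sample alternative segmentations from the unigram model
for _ in range(3):
    print(sp.encode("한국어 토크나이저입니다.", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))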