A SentencePiece unigram tokenizer trained on Japanese text.
https://github.com/huggingface/tokenizers
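
For reference, a minimal sketch of how a unigram model like this can be trained with the tokenizers library. The corpus file, vocabulary size, and trainer options below are assumptions for illustration, not the exact settings used for this model.

from tokenizers import Tokenizer, models, trainers

# unigram model, the same algorithm SentencePiece uses by default
tokenizer = Tokenizer(models.Unigram())

# special tokens in this order receive ids <PAD>=0, <BOS>=1, <EOS>=2, <UNK>=3, <MASK>=4,
# matching the ids listed under "settings" below
trainer = trainers.UnigramTrainer(
    vocab_size=32000,  # hypothetical; the actual vocabulary size is not stated here
    special_tokens=["<PAD>", "<BOS>", "<EOS>", "<UNK>", "<MASK>"],
    unk_token="<UNK>",
)

# "corpus.txt" is a placeholder for text drawn from the datasets listed below
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")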

sample

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("if001/sentencepiece_ja", trust_remote_code=True)
print(tokenizer("hello world"))  

>> {'input_ids': [158, 8418, 1427, 15930, 866, 13782, 44, 15034, 1719, 16655, 8, 115, 5, 280, 17635, 94, 818, 2748, 1168, 1114], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

print(tokenizer.tokenize('それは九月初旬のある蒸し暑い晩のことであった。私は、D坂の大通りの中程にある'))
>> ['それは', '九月', '初', '旬', 'のある', '蒸', 'し', '暑い', '晩', 'のことであった', '。', '私は', '、', 'D', '坂の', '大', '通り', 'の中', '程', 'にある']
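
Ids can be mapped back to tokens and text with the standard transformers methods; a quick sketch using the tokenizer loaded above:

ids = tokenizer('それは九月初旬のある蒸し暑い晩のことであった。')['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))  # per-id surface tokens
print(tokenizer.decode(ids))                 # reassembled string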

dataset

https://huggingface.co/datasets/izumi-lab/wikinews-ja-20230728
https://huggingface.co/datasets/izumi-lab/wikinews-en-20230728
https://huggingface.co/datasets/if001/aozorabunko-clean-sin
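
These corpora can be pulled with the datasets library; a sketch, assuming the default config and split of each repository:

from datasets import load_dataset

wikinews_ja = load_dataset("izumi-lab/wikinews-ja-20230728")
wikinews_en = load_dataset("izumi-lab/wikinews-en-20230728")
aozora = load_dataset("if001/aozorabunko-clean-sin")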

settings

all_special_ids:  [1, 2, 3, 0, 4]
all_special_tokens:  ['<BOS>', '<EOS>', '<UNK>', '<PAD>', '<MASK>']
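
These values can be read off the loaded tokenizer directly; a minimal check with the standard transformers attributes:

print(tokenizer.all_special_ids)     # [1, 2, 3, 0, 4]
print(tokenizer.all_special_tokens)  # ['<BOS>', '<EOS>', '<UNK>', '<PAD>', '<MASK>']
print(tokenizer.pad_token_id)        # 0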