How to add multilinguality to the Llama 3 tokenizer?

#4
by richard-park

I'm a newbie in the LLM field.
I'm trying to add Korean to the Llama 3 tokenizer, but I can't figure out where the problem in my code is.
I would really appreciate it if you could tell me what's wrong with it.

I added the tokens as shown below, but tokenization still gives me the same result as the base tokenizer.
I'm currently using a training dataset of about 10 GB.

I used this Hugging Face course chapter as a reference:
https://huggingface.co/learn/nlp-course/chapter6/2


from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

ds = load_dataset("richard-park/sapie-dataset-for-tokenizer", cache_dir="./cache")

# Yield the training split in batches of 1,000 texts.
def get_training_corpus():
    return (
        ds["train"][i : i + 1000]["text"]
        for i in range(0, len(ds["train"]), 1000)
    )

training_corpus = get_training_corpus()

# Train a new tokenizer with a 12,000-token vocabulary on the Korean corpus.
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, 12_000)

# Tokens present in the new vocabulary but not in the base one.
new_tokens = set(new_tokenizer.vocab) - set(tokenizer.vocab)
print("new vocab size: ", len(new_tokens))

sample_text = [
    "์š”์ฆ˜ ๋‚ ์”จ๊ฐ€ ๋„ˆ๋ฌด ์˜ค๋ฝ๊ฐ€๋ฝํ•ด์„œ ์•„์ง๋„ ๊ฒจ์šธ์˜ท์„ ๋ชป์น˜์› ์–ด์š”..",
    "๋ง›์žˆ๋Š” ๋ฐฅ์„ ๋“œ์…จ์Šต๋‹ˆ๊นŒ? ๋ง›์ด ๊ถ๊ธˆํ•˜๋„ค์š”",
    "๋Œ€๋ฒ•์›๋ถ€ํ„ฐ ํ•˜๊ธ‰์‹ฌ ํŒ๋ก€๊นŒ์ง€ ์›ํ•˜๋Š” ํŒ๋ก€๋ฅผ ์ฐพ๋Š” ๊ฐ€์žฅ ๋น ๋ฅธ ๋ฐฉ๋ฒ• - ์„œ๋ฉด ๊ฒ€์ƒ‰, ์š”์ฒญ ํŒ๋ก€, ์œ ์‚ฌ ํŒ๋ก€, AI ์ถ”์ฒœ, ํŒ๋ก€ ๋ฐ ๋ฒ•๋ น ๊ฒ€์ƒ‰.",
    "๋ณธ ๋ฐœ๋ช…์€ ๊ธˆ์†ํŒ์˜ ๋‹ค์ˆ˜ ๋ถ€๋ถ„์„ ์—์นญ์‹œ์ผœ ํŠน์ • ๋ฌด๋Šฌ๋ชจ์–‘์„ ํ˜•์„ฑํ•˜๋Š” ๊ฑด์ถ•์šฉ ๊ธˆ์†์žฌ ์žฅ์‹ํŒ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๊ฒƒ์— ํŠน์ง•์ด ์žˆ๋‹ค.",
    "๊ณจ๋‹ค๊ณต์ฆ์€ ์™œ ์ƒ๊ธฐ๋Š”๊ฑฐ์—์š”? ๊ทธ๋ฆฌ๊ณ  ์น˜๋ฃŒํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒํ•ด์•ผํ•˜์ฃ ?",
]

# Add the new tokens to the base tokenizer.
tokenizer.add_tokens(list(new_tokens))

# Reload a pristine copy of the base tokenizer for comparison.
base_tokenizer = AutoTokenizer.from_pretrained(base_model)
for text in sample_text:
    print("old: ", [base_tokenizer.decode([id]) for id in base_tokenizer.encode(text)])
    print("new: ", [tokenizer.decode([id]) for id in tokenizer.encode(text)])
    print("-" * 100)
