How to add multilinguality to the Llama 3 tokenizer?
I am a newbie in the LLM field.
I'm trying to add Korean support to the Llama 3 tokenizer, but I can't figure out where the problem in my code is.
I would really appreciate it if you could point out what's wrong.
I added the new tokens as shown below, but tokenization still gives the same result as the base tokenizer.
I'm currently using a dataset of about 10 GB.
I used this Hugging Face course chapter as a reference:
https://huggingface.co/learn/nlp-course/chapter6/2
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Korean corpus used to learn the new subwords.
ds = load_dataset("richard-park/sapie-dataset-for-tokenizer", cache_dir="./cache")

# Yield the training split in batches of 1,000 texts.
def get_training_corpus():
    return (
        ds["train"][i : i + 1000]["text"]
        for i in range(0, len(ds["train"]), 1000)
    )

training_corpus = get_training_corpus()

# Train a new tokenizer (same algorithm as the base one) with a 12,000-token vocabulary.
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, 12_000)

# Tokens the new tokenizer learned that are missing from the base vocabulary.
new_tokens = set(new_tokenizer.vocab) - set(tokenizer.vocab)
print("new vocab size: ", len(new_tokens))
sample_text = [
    "์์ฆ ๋ ์จ๊ฐ ๋๋ฌด ์ค๋ฝ๊ฐ๋ฝํด์ ์์ง๋ ๊ฒจ์ธ์ท์ ๋ชป์น์ ์ด์..",
    "๋ง์๋ ๋ฐฅ์ ๋์จ์ต๋๊น? ๋ง์ด ๊ถ๊ธํ๋ค์",
    "๋๋ฒ์๋ถํฐ ํ๊ธ์ฌ ํ๋ก๊น์ง ์ํ๋ ํ๋ก๋ฅผ ์ฐพ๋ ๊ฐ์ฅ ๋น ๋ฅธ ๋ฐฉ๋ฒ - ์๋ฉด ๊ฒ์, ์์ฒญ ํ๋ก, ์ ์ฌ ํ๋ก, AI ์ถ์ฒ, ํ๋ก ๋ฐ ๋ฒ๋ น ๊ฒ์.",
    "๋ณธ ๋ฐ๋ช์ ๊ธ์ํ์ ๋ค์ ๋ถ๋ถ์ ์์นญ์์ผ ํน์ ๋ฌด๋ฌ๋ชจ์์ ํ์ฑํ๋ ๊ฑด์ถ์ฉ ๊ธ์์ฌ ์ฅ์ํ์ผ๋ก ์ด๋ฃจ์ด์ง ๊ฒ์ ํน์ง์ด ์๋ค.",
    "๊ณจ๋ค๊ณต์ฆ์ ์ ์๊ธฐ๋๊ฑฐ์์? ๊ทธ๋ฆฌ๊ณ ์น๋ฃํ๋ ค๋ฉด ์ด๋ป๊ฒํด์ผํ์ฃ ?",
]
# Add the newly learned Korean tokens to the base tokenizer.
tokenizer.add_tokens(list(new_tokens))

# Reload a pristine copy of the base tokenizer for comparison.
base_tokenizer = AutoTokenizer.from_pretrained(base_model)

for text in sample_text:
    print("old: ", [base_tokenizer.decode([token_id]) for token_id in base_tokenizer.encode(text)])
    print("new: ", [tokenizer.decode([token_id]) for token_id in tokenizer.encode(text)])
    print("-" * 100)