How to add multilinguality to the Llama 3 tokenizer?
I am a newbie in the LLM field.
I'm trying to add Korean support to the Llama 3 tokenizer, but I can't figure out where the problem in my code is.
I would really appreciate it if you could point out what's wrong.
I added the new tokens as shown below, but tokenization still gives the same result as the base tokenizer.
I'm currently using a dataset of about 10 GB.
I used this Hugging Face course chapter as a reference:
https://huggingface.co/learn/nlp-course/chapter6/2
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Korean corpus used to learn the new subwords.
ds = load_dataset("richard-park/sapie-dataset-for-tokenizer", cache_dir="./cache")

# Yield the training split in batches of 1,000 texts.
def get_training_corpus():
    return (
        ds["train"][i : i + 1000]["text"]
        for i in range(0, len(ds["train"]), 1000)
    )

training_corpus = get_training_corpus()

# Train a new tokenizer (same algorithm as the base one) with a 12,000-token vocabulary.
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, 12_000)

# Tokens the new tokenizer learned that are missing from the base vocabulary.
new_tokens = set(new_tokenizer.vocab) - set(tokenizer.vocab)
print("new vocab size: ", len(new_tokens))
sample_text = [
    "์์ฆ ๋ ์จ๊ฐ ๋๋ฌด ์ค๋ฝ๊ฐ๋ฝํด์ ์์ง๋ ๊ฒจ์ธ์ท์ ๋ชป์น์ ์ด์..",
    "๋ง์๋ ๋ฐฅ์ ๋์จ์ต๋๊น? ๋ง์ด ๊ถ๊ธํ๋ค์",
    "๋๋ฒ์๋ถํฐ ํ๊ธ์ฌ ํ๋ก๊น์ง ์ํ๋ ํ๋ก๋ฅผ ์ฐพ๋ ๊ฐ์ฅ ๋น ๋ฅธ ๋ฐฉ๋ฒ - ์๋ฉด ๊ฒ์, ์์ฒญ ํ๋ก, ์ ์ฌ ํ๋ก, AI ์ถ์ฒ, ํ๋ก ๋ฐ ๋ฒ๋ น ๊ฒ์.",
    "๋ณธ ๋ฐ๋ช์ ๊ธ์ํ์ ๋ค์ ๋ถ๋ถ์ ์์นญ์์ผ ํน์ ๋ฌด๋ฌ๋ชจ์์ ํ์ฑํ๋ ๊ฑด์ถ์ฉ ๊ธ์์ฌ ์ฅ์ํ์ผ๋ก ์ด๋ฃจ์ด์ง ๊ฒ์ ํน์ง์ด ์๋ค.",
    "๊ณจ๋ค๊ณต์ฆ์ ์ ์๊ธฐ๋๊ฑฐ์์? ๊ทธ๋ฆฌ๊ณ ์น๋ฃํ๋ ค๋ฉด ์ด๋ป๊ฒํด์ผํ์ฃ ?",
]
# Add the newly learned Korean tokens to the base tokenizer.
tokenizer.add_tokens(list(new_tokens))

# Reload a pristine copy of the base tokenizer for comparison.
base_tokenizer = AutoTokenizer.from_pretrained(base_model)

for text in sample_text:
    print("old: ", [base_tokenizer.decode([token_id]) for token_id in base_tokenizer.encode(text)])
    print("new: ", [tokenizer.decode([token_id]) for token_id in tokenizer.encode(text)])
    print("-" * 100)