Yokhal (욕쟁이 할머니)

Korean Chatbot based on Google Gemma

Model Details

Model Description

Fine-tuned by: Seonglae Cho
Model type: Gemma
Language(s) (NLP): Korean, English
Finetuned from model: Gemma-2b-it

Model Sources

Repository: https://github.com/seonglae/yokhal
Demo: https://huggingface.co/spaces/seonglae/yokhal

Uses

Direct Use

Korean Chatbot with Internet culture

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.


tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             device_map="auto" if device is None else device, 
                                             attn_implementation="flash_attention_2") # if flash enabled
sys_prompt = '한국어로 대답해'
texts = ['안녕', '서울은 오늘 어때']
chats = list(map(lambda t: [{'role': 'user', 'content': f'{sys_prompt}\n{t}'}], texts)) # ChatML format
prompts = list(map(lambda p: tokenizer.apply_chat_template(p, tokenize=False, add_generation_prompt=True), chats))
input_ids = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda" if device is None else device)
outputs = model.generate(**input_ids, max_new_tokens=100, repetition_penalty=1.05)
for output in outputs:
  print(tokenizer.decode(output, skip_special_tokens=True), end='\n\n')

Training Details

Trained on 2 x RTX3090

More Information on Github source code

Training Data

[More Information Needed]

Training Procedure

Weight Initialized from Internet comments dataset
Trained on Korean Namuwiki dataset until step 80000 (30000 step is on main branch because of repetition issue above there)

seq_length 1024 with dataset packing
batch 3 per device
lr 1e-5
optim adafactor

Instruction tuning on Korean Instruction Dataset using QLoRa (not on main)

seq_length 2048
lr 2e-4

Preprocessing [optional]

Gemma do not support explicit system prompt in ChatML, so I trained putting system prompt before user message like below

if (chat[0]['role'] == 'system'):
  chat[1]['content'] = f"{chat[0]['content']}\n{chat[1]['content']}"
  chat = chat[1:]
try:
  prompt = tokenizer.apply_chat_template(chat, tokenize=False)

Source Code

Training Hyperparameters

Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

seonglae
/

yokhal-md

Yokhal (욕쟁이 할머니)

Model Details

Model Description

Model Sources

Uses

Direct Use

Recommendations

How to Get Started with the Model

Training Details

Training Data

Training Procedure

Preprocessing [optional]

Training Hyperparameters

Speeds, Sizes, Times [optional]

Evaluation

Testing Data, Factors & Metrics

Testing Data

Factors

Metrics

Results

Summary

Spaces using seonglae/yokhal-md 2