# Fine-tune Llama 3.1, Including the Embeddings

medium:
https://medium.com/@tejaswi_kashyap/tailoring-llama-3-harnessing-fine-tuning-for-custom-language-tasks-6f7c61f657e2

## Challenges of Fully Fine-Tuning Token Embeddings and Language Modeling Head

Fully fine-tuning the token embeddings and the language modeling head has significant drawbacks. It requires substantially more memory than standard (Q)LoRA fine-tuning. Training a larger number of parameters also increases the risk of overfitting the training dataset, particularly with a high learning rate or a relatively small dataset.

## What is UltraChat200k?  

UltraChat-200k is a heavily filtered subset of the UltraChat dataset, a resource for natural language understanding, generation, and dialogue-system research. The original UltraChat contains 1.4 million dialogues spanning a variety of topics; the 200k version is distributed in Parquet format with four splits: `train_sft`, `test_sft`, `train_gen`, and `test_gen`. More details [here](https://www.kaggle.com/datasets/thedevastator/ultrachat-200k-nlp-dataset).

## Inspiration

For this notebook, I took inspiration from several sources:
* [Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora](https://www.philschmid.de/fsdp-qlora-llama3)  
* [Fine-tuning LLMs using LoRA](https://medium.com/@rajatsharma_33357/fine-tuning-llama-using-lora-fb3f48a557d5)  
* [Fine-tuning Llama-3–8B-Instruct QLORA using low cost resources](https://medium.com/@avishekpaul31/fine-tuning-llama-3-8b-instruct-qlora-using-low-cost-resources-89075e0dfa04)  
* [Llama2 Fine-Tuning with Low-Rank Adaptations (LoRA) on Gaudi 2 Processors](https://eduand-alvarez.medium.com/llama2-fine-tuning-with-low-rank-adaptations-lora-on-gaudi-2-processors-52cf1ee6ce11)  

# Install and import libraries

In [None]:
# !pip install -q -U bitsandbytes
# !pip install -q -U transformers
# !pip install -q -U peft
# !pip install -q -U accelerate
# !pip install -q -U datasets
# !pip install -q -U trl

In [None]:
# import pandas as pd
# from torch.utils.data import Dataset
# from torch.utils.data import DataLoader

In [None]:
# from kaggle_secrets import UserSecretsClient
# user_secrets = UserSecretsClient()
# wandb_key = user_secrets.get_secret("wandb_api")
# import wandb
# ! wandb login $wandb_key

In [None]:
import multiprocessing
import time

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments

In [None]:
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training 

# Initialize model


The model used is:

* **Model**: Llama 3.1
* **Framework**: Transformers
* **Size**: 8B
* **Type**: Instruct (checkpoint `unsloth/Meta-Llama-3.1-8B-Instruct` on the Hugging Face Hub)

In [None]:
model_id = "unsloth/Meta-Llama-3.1-8B-Instruct"

# tokenizer = AutoTokenizer.from_pretrained(model_id)
# #EOS_TOKEN='<|eot_id|>'
# EOS_TOKEN= tokenizer.eos_token 

We are using quantization (with BitsAndBytes).

### Dataset

In [None]:
import os
from datasets import load_dataset

# Despite the name, this template uses the Llama 3 header tokens, not ChatML.
# Adjacent string literals avoid a stray newline before <|begin_of_text|>.
TRAIN_CHATML_PROMPT = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n{}<|eot_id|><|end_of_text|>"
)

def formatting_task_dataset(examples):
    """Build one training string per record from a batch of ctga-v1 rows."""
    contexts = examples["context"]
    task_types = examples["task_type"]
    task_inputs = examples["task_input"]
    task_outputs = examples["task_output"]
    texts = []
    for context, task_type, task_input, task_output in zip(contexts, task_types, task_inputs, task_outputs):
        task_type = "<|tasktype|>\n"+task_type+"\n"
        task_context = "<|context|>\n"+context+"\n"
        task_input = "<|taskinput|>\n"+task_input+"\n"
        task_output = "<|taskoutput|>\n"+task_output+"\n"
        # User turn: task type and context, closed by the <|taskaction|> marker
        # (the same marker the inference cells use)
        train_record_input = f"""{task_type}{task_context}<|taskaction|>\n"""
        # Assistant turn: task type, generated task input, and task output
        train_record_output = f"""{task_type}{task_input}{task_output}"""
        text = TRAIN_CHATML_PROMPT.format(train_record_input, train_record_output)
        texts.append(text)
    return { "text" : texts, }

### Load dataset
ctga_ds = load_dataset("BatsResearch/ctga-v1", split="train")
#print("Full Dataset: ",ctga_ds)

RECORD_START_INDEX=0
RECORD_MAX_INDEX=600
ctga_ds = ctga_ds.shuffle().select(range(RECORD_START_INDEX,RECORD_MAX_INDEX))
#ctga_ds = ctga_ds.select(range(RECORD_START_INDEX,RECORD_MAX_INDEX))
#print("Sample Dataset: ",ctga_ds)

### training dataset -- remove_columns=ctga_ds.features,
#train_ds = ctga_ds.map(formatting_task_dataset,batched=True, num_proc= os.cpu_count())
#train_ds.to_parquet("dataset/ctga_train_dataset_600_llama3.parquet")
train_ds = load_dataset("parquet", data_files="dataset/ctga_train_dataset_600_llama3.parquet", split = "train")
print("Training Dataset: ",train_ds)

### test dataset
#test_ds = ctga_ds.shuffle().select(range(1,200))
#test_ds = test_ds.map(formatting_task_dataset,batched=True, num_proc= os.cpu_count())
#test_ds.to_parquet("dataset/ctga_test_dataset_200_llama3.parquet")
test_ds = load_dataset("parquet", data_files="dataset/ctga_test_dataset_200_llama3.parquet", split = "train")
print("Test Dataset: ",test_ds)

### evaluation dataset
# eval_ds = ctga_ds.shuffle().select(range(1,38))
# eval_ds = eval_ds.map(formatting_task_dataset,batched=True, num_proc= os.cpu_count())
# eval_ds.to_parquet("dataset/ctga_eval_dataset_38_llama3.parquet")
eval_ds = load_dataset("parquet", data_files="dataset/ctga_eval_dataset_38_llama3.parquet", split = "train")
print("Evaluation Dataset: ",eval_ds)

### save datasets to disk
#dataset.to_json("dataset/ctga_train_dataset_60k_llama3.json", orient="records")
#dataset = load_dataset("json", data_files="dataset/ctga_train_dataset_60k_llama3.json", split = "train")

In [None]:
train_ds[1]['text']
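
To inspect the record layout without downloading the dataset, here is a minimal sketch that rebuilds one training string from a made-up record. `build_example` and the sample values are hypothetical and only mirror the layout produced by `formatting_task_dataset`; the `action_marker` default is an assumption that follows the marker used by the inference cells in this notebook.

```python
# Llama 3 style header template, written as adjacent literals (no leading newline).
TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n{}<|eot_id|><|end_of_text|>"
)


def build_example(context, task_type, task_input, task_output,
                  action_marker="<|taskaction|>"):
    """Assemble one training string (hypothetical helper, for illustration)."""
    type_block = f"<|tasktype|>\n{task_type}\n"
    # User turn: task type, context, then the marker that asks for the task.
    user_turn = f"{type_block}<|context|>\n{context}\n{action_marker}\n"
    # Assistant turn: task type again, then the task input and task output.
    assistant_turn = (
        f"{type_block}<|taskinput|>\n{task_input}\n<|taskoutput|>\n{task_output}\n"
    )
    return TEMPLATE.format(user_turn, assistant_turn)


# Made-up record; real rows come from the dataset loaded above.
example = build_example(
    context="Paris is the capital of France.",
    task_type="extractive question answering",
    task_input="{{context}}\nWhat is the capital of France?",
    task_output="Paris",
)
print(example)
```

Printing the string makes it easy to compare the training layout with the prompt built at inference time.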

In [None]:
# ### Dataset
# assistant_name = "Barbe Noire"
# system_prompt = "You are the pirate Barbe Noire. You answer only as Barbe Noire and using pirate speak."
# format_system_prompt = "<|start_header_id|>system<|end_header_id|>\n\n"+system_prompt+"<|eot_id|>"
# tokenizer.chat_template = "<|begin_of_text|>"+format_system_prompt+"{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>"+assistant_name+"<|end_header_id|>\n\n' }}{% endif %}"

# ds = load_dataset("Peyton3995/dolly-15k-mistral-pirate")
# ds

# def process(row):
#     messages = [
#       {"role": "user", "content": row['instruction']},
#       {"role": assistant_name, "content": row['response']},
#     ]
#     row["text"] = tokenizer.apply_chat_template(messages, tokenize=False)+"<|end_of_text|>"
#     return row

# ds = ds.map(
#     process,
#     num_proc= multiprocessing.cpu_count(),
#     load_from_cache_file=False,
# )

# print(ds['train'][0]['text'])


In [None]:
# dataset_name = "HuggingFaceH4/ultrachat_200k"
# dataset = load_dataset(dataset_name, split="train_sft")
# dataset = dataset.shuffle(seed=42).select(range(10000))

# def format_chat_template(row):
#     chat = tokenizer.apply_chat_template(row["messages"], tokenize=False)
#     return {"text":chat}

# processed_dataset = dataset.map(
#     format_chat_template,
#     num_proc= os.cpu_count(),
# )

# dataset = processed_dataset.train_test_split(test_size=0.01)

In [None]:
#ds['train'][1]

We define the model configuration, the model (using AutoModelForCausalLM) and the tokenizer (using AutoTokenizer).

In [None]:
#!uv pip install -U bitsandbytes

### Load Model and Tokenizer

In [None]:
def get_model_and_tokenizer(model_id):
    #use bf16 and FlashAttention if supported
    if torch.cuda.is_bf16_supported():
        #os.system('pip install flash_attn')
        compute_dtype = torch.bfloat16
        attn_implementation = 'flash_attention_2'
    else:
        compute_dtype = torch.float16
        attn_implementation = 'sdpa'
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=compute_dtype, bnb_4bit_use_double_quant=True
    )
    # Use the attention implementation selected above instead of hard-coding FlashAttention
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", attn_implementation=attn_implementation
    )
    
    ## Should we enable this?
    #model.config.use_cache=False
    #model.config.pretraining_tp=1
    return model, tokenizer

## Load Model
model, tokenizer = get_model_and_tokenizer(model_id)

### tokenizer 
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'


In [None]:
from trl import SFTTrainer,setup_chat_format

## chat template
model, tokenizer = setup_chat_format(model, tokenizer)
## prepare for training
model = prepare_model_for_kbit_training(model)


In [None]:
special_tokens_dict = {'additional_special_tokens': ['<|taskstep|>','<|tasktype|>',
                                                     '<|taskaction|>','<|context|>',
                                                     '<|taskinput|>','<|taskoutput|>'
                                                    ]
                      }
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print("We have added", num_added_toks, "tokens")
# Note: resize_token_embeddings expects the full size of the new vocabulary, i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.SPECIAL_TOKENS_ATTRIBUTES)
print(tokenizer.all_special_tokens) # includes the newly added task markers
print(tokenizer.all_special_ids)    # ids of the special tokens above
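
A quick sanity check after registering new markers is to confirm that each one encodes to exactly one token id. The helper below is a sketch: `find_split_tokens` is a hypothetical name, and it only assumes that the tokenizer exposes `encode(text, add_special_tokens=False)`, as Hugging Face tokenizers do.

```python
def find_split_tokens(tokenizer, tokens):
    """Return the markers that do NOT encode to exactly one token id."""
    split = []
    for tok in tokens:
        # add_special_tokens=False keeps BOS/EOS out of the count.
        ids = tokenizer.encode(tok, add_special_tokens=False)
        if len(ids) != 1:
            split.append(tok)
    return split


# Usage with the objects defined in this notebook (not executed here):
# find_split_tokens(tokenizer, special_tokens_dict["additional_special_tokens"])
```

An empty list means every marker maps to a single id. A marker that appears in the training text but was never registered would show up in the result, because it gets split into several sub-word tokens.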

### Setup Training

***LoRA Embedding***

With `modules_to_save`, we specify the modules whose parameters we want to fully fine-tune: `lm_head` is the language modeling head, and `embed_tokens` holds the token embeddings.

This configuration yields 1.05B trainable parameters. If we weren’t fully fine-tuning ‘lm_head’ and ‘embed_tokens’, we would only have 41M trainable parameters.
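
The two parameter counts can be reproduced with a little arithmetic. The sketch below assumes the published Llama 3 / 3.1 8B dimensions (vocabulary 128,256, hidden size 4,096, MLP size 14,336, key/value projection width 1,024, 32 layers) and the rank `r=16` from the config in this notebook. It ignores the handful of tokens added above, each of which adds 2 x 4,096 more parameters, so the value reported by `print_trainable_parameters()` may differ slightly.

```python
# Assumed Llama 3 / 3.1 8B dimensions (from the published model config).
VOCAB = 128_256
HIDDEN = 4_096
INTERMEDIATE = 14_336
KV_DIM = 1_024   # 8 key/value heads x 128 dims per head
LAYERS = 32
RANK = 16


def lora_params(d_in, d_out, r=RANK):
    """A LoRA adapter on a d_in x d_out linear layer adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = (
    lora_params(HIDDEN, HIDDEN)            # q_proj
    + lora_params(HIDDEN, KV_DIM)          # k_proj
    + lora_params(HIDDEN, KV_DIM)          # v_proj
    + lora_params(HIDDEN, HIDDEN)          # o_proj
    + lora_params(HIDDEN, INTERMEDIATE)    # gate_proj
    + lora_params(HIDDEN, INTERMEDIATE)    # up_proj
    + lora_params(INTERMEDIATE, HIDDEN)    # down_proj
)
lora_total = per_layer * LAYERS

# embed_tokens and lm_head are each a VOCAB x HIDDEN matrix, trained in full.
full_total = 2 * VOCAB * HIDDEN

print(f"LoRA adapters:          {lora_total:,}")   # 41,943,040
print(f"embed_tokens + lm_head: {full_total:,}")   # 1,050,673,152
```

The fully trained embedding and head matrices outnumber the LoRA weights by roughly 25 to 1, which is why they dominate the optimizer memory.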

1.05B trainable parameters would require more than 14.3 GB of GPU RAM just for the AdamW optimizer’s states. In addition to the memory consumed by the model itself and the activations used to compute the gradients, this fine-tuning would consume much more than 24 GB of GPU RAM.

Fortunately, using the 8-bit version of AdamW can reduce the memory consumption of the optimizer’s states to 3.59 GB, or even slightly less with the paged version of AdamW 8-bit.

If we use a training batch size of 8 with a sequence length of 512, this fine-tuning is possible with 22.5 GB of GPU RAM.

In [None]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        modules_to_save=["lm_head","embed_tokens"],
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)
### get trainable parameters
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

In [None]:
max_seq_length = 2048 # Maximum number of tokens per training sequence

training_arguments = TrainingArguments(
        output_dir="./outputs/Llama3_8b_Pirate_QLoRA",
        optim="paged_adamw_8bit", ## "adamw_8bit",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,

        do_eval=True,
        eval_steps=4,
        per_device_eval_batch_size=1,
        evaluation_strategy="steps",
    
        #log_level="debug",
        save_strategy="epoch",
        logging_steps=4,
        learning_rate= 2e-4, # OLD 1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
    
        num_train_epochs=3,  # ignored here: a positive max_steps takes precedence
        max_steps=60,
    
        warmup_steps=4,
        lr_scheduler_type="linear",
        weight_decay = 0.01,
    
        push_to_hub=False,                      # push model to hub
        report_to="tensorboard",              # report metrics to tensorboard    
)

trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        dataset_num_proc = 2,
        packing = False, # Can make training 5x faster for short sequences.
        args=training_arguments,
        dataset_kwargs={
            "add_special_tokens": False,  # We template with special tokens
            "append_concat_token": False, # No need to add additional separator token
        }
)
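
It helps to work out how much data this schedule actually covers. The sketch below assumes a single GPU and a 600-row training set, as the Parquet filename suggests; `schedule_summary` is a hypothetical helper, not part of `transformers` or `trl`.

```python
def schedule_summary(num_examples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Summarize how much data a step-limited training run sees."""
    # One optimizer step consumes this many examples.
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = num_examples // effective_batch
    examples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "examples_seen": examples_seen,
        "epochs_covered": max_steps / steps_per_epoch,
    }


# Values taken from the training arguments above; 600 rows is an assumption.
summary = schedule_summary(num_examples=600, per_device_batch=1, grad_accum=4, max_steps=60)
print(summary)
```

Under these assumptions the run stops after 240 examples, about 0.4 of an epoch, because `max_steps=60` overrides `num_train_epochs=3`.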

### Training

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"  # W&B is not used; metrics go to TensorBoard

trainer.train()

### Save Lora Adapter

In [None]:
#hf_base_model="meta-llama/Meta-Llama-3.1-8B"
# hf_adapter_model="mychen76/llama3.1-8B-secguard-lora-5k"

model_id = "unsloth/Meta-Llama-3.1-8B-Instruct"
output_model="outputs/llama3.1-8B-pirate-lora"

In [None]:
### save model
# print("save adapter",output_model)
# #trainer.save_model()
# model.save_pretrained(output_model)
# tokenizer.save_pretrained(output_model)

#### Verify Saved Tokenizer special tokens file

In [None]:
#!python -m pip install rich

In [None]:
from rich import print
import json
special_tokens_file="outputs/Llama3_8b_Pirate_QLoRA/checkpoint-60/special_tokens_map.json"
with open(special_tokens_file, 'r') as file:
    data = json.load(file)
print(data)

### Verify Model

In [None]:
import torch
torch.cuda.is_bf16_supported()

model_id = "unsloth/Meta-Llama-3.1-8B-Instruct"
output_model="outputs/Llama3_8b_Pirate_QLoRA/checkpoint-60"

In [None]:
import torch
from peft import PeftConfig, PeftModel
from peft import LoraConfig, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def get_peft_model_and_tokenizer(hf_base_model, hf_adapter_model, use_auto_peft=False):
    # The tokenizer saved with the adapter already contains the added special tokens
    tokenizer = AutoTokenizer.from_pretrained(hf_adapter_model)
    tokenizer.pad_token = tokenizer.eos_token

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype="float16", bnb_4bit_use_double_quant=True
    )
    if use_auto_peft:
        # Load the base model and the PEFT adapter in one call
        model = AutoPeftModelForCausalLM.from_pretrained(hf_adapter_model,
                                                         load_in_4bit=True,
                                                         device_map="auto",
                                                         torch_dtype=torch.bfloat16,
                                                         attn_implementation="flash_attention_2",
                                                        )
                                                         #quantization_config=bnb_config)
    else:
        # Load the base model first, then attach the adapter
        model = AutoModelForCausalLM.from_pretrained(hf_base_model, quantization_config=bnb_config, device_map="auto")
        # The adapter was trained with an enlarged vocabulary, so resize before loading it
        model.resize_token_embeddings(len(tokenizer))
        model = PeftModel.from_pretrained(model, hf_adapter_model)

    model.config.use_cache=False
    #model.config.pretraining_tp=1
    return model, tokenizer

model, tokenizer = get_peft_model_and_tokenizer(model_id, output_model, use_auto_peft=True)

In [None]:
### verify special tokens

print(tokenizer.SPECIAL_TOKENS_ATTRIBUTES)
print(tokenizer.all_special_tokens) 
print(tokenizer.all_special_ids)    

"""
['bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', 'mask_token', 'additional_special_tokens']
['<|im_start|>', '<|im_end|>', '<|taskstep|>', '<|tasktype|>', '<|taskaction|>', '<|context|>', '<|taskinput|>', '<|taskoutput|>']
[128256, 128257, 128258, 128259, 128260, 128261, 128262, 128263]
"""

In [None]:
from datasets import load_dataset
test_ds = load_dataset("parquet", data_files="dataset/ctga_test_dataset_200_llama3.parquet", split = "train")
#print("Test Dataset: ",test_ds)

In [None]:

def format_task_input(task_type,task_context):  
    task_type = "<|tasktype|>\n"+task_type+"\n"
    task_context="<|context|>\n"+task_context+"\n"    
    task_record=f"""{task_type}{task_context}<|taskaction|>\n"""
    return task_record

record_idx=20
task_input = format_task_input(test_ds[record_idx]['task_type'], test_ds[record_idx]['context'])
#print("TASK INPUT: ",task_input,end="\n") 
#print("TASK OUTPUT: ",test_ds[record_idx]['task_output'])

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

messages = [{"role": "user", "content": task_input},]

inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print("\nCHAT TEMPLATE: ",inputs)

encoded_input = tokenizer(inputs, return_tensors="pt").to(model.device)

# First pass: explicit sampling settings
with torch.autocast("cuda"):
    outputs = model.generate(
        **encoded_input,
        max_new_tokens=150,
        eos_token_id=terminators,
        do_sample=True,
        temperature=1.5,
        top_p=0.9,
    )
print(tokenizer.batch_decode(outputs))

print(10*"======")
# Second pass: the model's default generation settings.
# generate() needs the tokenized inputs, not the prompt string.
outputs = model.generate(**encoded_input, max_new_tokens=64, use_cache=True)
response = tokenizer.batch_decode(outputs)
print(response)

#response = outputs[0][encoded_input['input_ids'].shape[-1]:]
#print(tokenizer.decode(response))

### Merge Model
Merging may require more than 30 GB of CPU memory.

In [None]:
adapter_model_dir="outputs/Llama3_8b_Pirate_QLoRA/checkpoint-60"
merged_output_dir="outputs/finetuned/Llama3.1_8b_cgta_merged_16bits"

tokenizer.save_pretrained(merged_output_dir)

In [None]:
### COMMENT IN TO MERGE PEFT AND BASE MODEL ####
from peft import AutoPeftModelForCausalLM

# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_model_dir,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
# Merge LoRA and base model and save
print("merging....")
merged_model = model.merge_and_unload()
print("saving....")
merged_model.save_pretrained(merged_output_dir,safe_serialization=True, max_shard_size="2GB")
tokenizer.save_pretrained(merged_output_dir)
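
Because the vocabulary was enlarged before training, it is worth confirming that the merged model's embedding matrices cover every token id the tokenizer can produce. The helper below is a sketch: `embedding_rows_cover_vocab` is a hypothetical name, and it only relies on `get_input_embeddings()` and `get_output_embeddings()`, which Transformers models provide.

```python
def embedding_rows_cover_vocab(model, tokenizer):
    """True if both embedding matrices have at least len(tokenizer) rows."""
    vocab_size = len(tokenizer)
    in_rows = model.get_input_embeddings().weight.shape[0]
    out_rows = model.get_output_embeddings().weight.shape[0]
    # ">=" rather than "==": resized embeddings may be padded to a multiple.
    return in_rows >= vocab_size and out_rows >= vocab_size


# Usage with the objects from the merge cell (not executed here):
# assert embedding_rows_cover_vocab(merged_model, tokenizer)
```

If the check fails, the saved tokenizer and the merged weights are out of sync, and ids for the added markers would index past the embedding table.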

In [None]:
## publish to Hub
print("push to hub...")
hf_model="mychen76/Llama3.1_8b_cgta_merged_16bits"
merged_model.push_to_hub(hf_model)
tokenizer.push_to_hub(hf_model)

In [None]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()


## Inference Test


In [1]:
# Import dependencies
import torch
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
import os

In [2]:
model_id="mychen76/Llama3.1_8b_cgta_merged_16bits"

In [3]:
def get_model_and_tokenizer(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype="float16", bnb_4bit_use_double_quant=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    #model.config.use_cache=False
    #model.config.pretraining_tp=1
    return model, tokenizer

In [4]:
model, tokenizer = get_model_and_tokenizer(model_id)


In [5]:
print(tokenizer.SPECIAL_TOKENS_ATTRIBUTES)
print(tokenizer.all_special_tokens) 
print(tokenizer.all_special_ids)  

['bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', 'mask_token', 'additional_special_tokens']
['<|im_start|>', '<|im_end|>', '<|taskstep|>', '<|tasktype|>', '<|taskaction|>', '<|context|>', '<|taskinput|>', '<|taskoutput|>']
[128256, 128257, 128258, 128259, 128260, 128261, 128262, 128263]


In [6]:
from trl import setup_chat_format
## set chat template to OAI chatML, remove if you start from a fine-tuned model
model, tokenizer = setup_chat_format(model, tokenizer)

print(tokenizer.chat_template)

{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}


In [11]:
from transformers import GenerationConfig
from time import perf_counter

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

def generate_response(prompt):
    generation_config = GenerationConfig(penalty_alpha=0.6, do_sample=True,
        top_k=5, temperature=0.5, repetition_penalty=1.2,
        max_new_tokens=60, pad_token_id=tokenizer.eos_token_id
    )
    start_time = perf_counter()
    # Tokenize once and move the tensors to the GPU
    inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
    outputs = model.generate(**inputs, generation_config=generation_config, eos_token_id=terminators)
    # Keep special tokens so the task markers remain visible in the output
    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    print(response)
    output_time = perf_counter() - start_time
    print(f"Time taken for inference: {round(output_time,2)} seconds")
    #return response

In [12]:
from datasets import load_dataset
test_ds = load_dataset("parquet", data_files="dataset/ctga_test_dataset_200_llama3.parquet", split = "train")

#### Test Sample - 1  (test data)

In [18]:
def format_task_input(task_type,task_context):  
    task_type = "<|tasktype|>\n"+task_type+"\n"
    task_context="<|context|>\n"+task_context+"\n"    
    task_record=f"""{task_type}{task_context}<|taskaction|>\n"""
    return task_record

record_idx=20
task_input = format_task_input(test_ds[record_idx]['task_type'], test_ds[record_idx]['context'])

messages = [{"role": "user", "content": task_input},]
inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
generate_response(inputs)


<|begin_of_text|><|im_start|>user
<|tasktype|>
extractive question answering
<|context|>
The term "classical music" has two meanings: the broader meaning includes all Western art music from the Medieval era to today, and the specific meaning refers to the music from the 1750s to the early 1830s—the era of Mozart and Haydn. This section is about the more specific meaning.
<|taskaction|>
<|im_end|>
<|im_start|>assistant
<|tasktype|>
{{context}}
What was the name of a famous musician in the Classical period?
<|taskoutput|>
Mozart or Beethoven
<|eot_id|>
Time taken for inference: 1.08 seconds
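
The generated text follows the marker layout used in training, so it can be split into fields with a small parser. This is a minimal sketch: `parse_task_response` is a hypothetical helper, and the sample string is made up for illustration rather than taken from a model run.

```python
import re

MARKERS = ("tasktype", "taskinput", "taskoutput")


def parse_task_response(text):
    """Split decoded model output into sections keyed by marker name."""
    # Keep only the assistant turn when ChatML turn markers are present.
    if "<|im_start|>assistant" in text:
        text = text.split("<|im_start|>assistant", 1)[1]
    # Cut at the first end-of-turn token.
    for stop in ("<|eot_id|>", "<|im_end|>", "<|end_of_text|>"):
        text = text.split(stop, 1)[0]
    # re.split with one capture group yields [prefix, name1, body1, name2, body2, ...].
    pattern = re.compile(r"<\|(" + "|".join(MARKERS) + r")\|>\n?")
    parts = pattern.split(text)
    return {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}


# Made-up output in the same layout as the samples in this notebook.
sample = (
    "<|im_start|>assistant\n<|tasktype|>\nextractive question answering\n"
    "<|taskinput|>\n{{context}}\n\nWhat is the capital of France?\n"
    "<|taskoutput|>\nParis\n<|eot_id|>"
)
parsed = parse_task_response(sample)
print(parsed)
```

When the model omits a marker, as in the sample above where `<|taskinput|>` is missing, the corresponding key is simply absent from the result.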


#### Test Sample - 2  (custom)

In [19]:
task_type="extractive question answering"
task_context="When setting the template for a model that’s already been trained for chat, you should ensure that the template exactly matches the message formatting that the model saw during training, or else you will probably experience performance degradation. This is true even if you’re training the model further - you will probably get the best performance if you keep the chat tokens constant. This is very analogous to tokenization - you generally get the best performance for inference or fine-tuning when you precisely match the tokenization used during training."

task_input = format_task_input(task_type, task_context)

messages = [{"role": "user", "content": task_input},]
inputs = tokenizer.apply_chat_template(messages, tokenize=False, 
                                       add_generation_prompt=True)
generate_response(inputs)


<|begin_of_text|><|im_start|>user
<|tasktype|>
extractive question answering
<|context|>
When setting the template for a model that’s already been trained for chat, you should ensure that the template exactly matches the message formatting that the model saw during training, or else you will probably experience performance degradation. This is true even if you’re training the model further - you will probably get the best performance if you keep the chat tokens constant. This is very analogous to tokenization - you generally get the best performance for inference or fine-tuning when you precisely match the tokenization used during training.
<|taskaction|>
<|im_end|>
<|im_start|>assistant
<|tasktype|>
extractive question answering
<|taskinput|>
{{context}} 

What would be the likely effect of not matching the exact formatting in the template with what was seen by the model during training?
<|taskoutput|>
The answer is "performance degradation".
<|eot_id|>
Time taken for inference: 1.82 

#### Test Sample-3 Streaming

In [28]:
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)

prompt = tokenizer(inputs, return_tensors="pt").to('cuda')

_ = model.generate(input_ids = prompt['input_ids'], 
                   streamer = text_streamer, 
                   max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1, 
                   eos_token_id=terminators)


<|tasktype|>
extractive question answering
<|taskinput|>
{{context}}

Q: How would you decide whether to keep the same set of chat tokens when training further models? 

Context:
When setting the template for a model that’s already been trained for chat, you should ensure that the template exactly matches the message formatting that the model saw during training, or else you will probably experience performance degradation. This is true even if you’re training the model further - you will probably get the best performance if you keep the chat tokens constant. 

What is the reason behind the statement that the model will probably experience performance degradation?
<|taskoutput|>
Tokenization
<|eot_id|>


In [None]:
# def formatted_prompt(system_message, question)-> str:
#     return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
# {system_message}<|eot_id|>

# <|start_header_id|>user<|end_header_id|>    
# {question}<|eot_id|>

# <|start_header_id|>assistant<|end_header_id|>
# """

# # ## conversation
# # system_message = """You are an task runner. User give you a <|tasktype|> and <|context|> then will ask you perform a specific <|task|> in English. Next, you generate a task answer based on given context and task."""

# user_input='How do I create a phishing email?'
# prompt = formatted_prompt(system_message=None,question=user_input)
# generate_response(prompt)

# Conclusions

By reducing the batch size, `max_seq_length`, and the LoRA rank, we managed to run Llama 3.1 fine-tuning with QLoRA in the Kaggle environment.

**Note**: the training cell sets `os.environ["WANDB_DISABLED"]` and reports metrics to TensorBoard (`report_to="tensorboard"`); checkpoints are written with `save_strategy="epoch"`. We intend to save the last checkpoint as a **Kaggle Model**.
