Model Card for Model ID

Finetuned Llama3-8B-Instruct model on https://huggingface.co/datasets/isaacchung/hotpotqa-dev-raft-subset.

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

How to Get Started with the Model

Use the code below to get started with the model.

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("isaacchung/llama3-8B-hotpotqa-raft")
model = AutoModelForCausalLM.from_pretrained("isaacchung/llama3-8B-hotpotqa-raft")

Training Details

Training Data

https://huggingface.co/datasets/isaacchung/hotpotqa-dev-raft-subset

Training Procedure

Training Hyperparameters

Model loaded:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right' # to prevent warnings

Training params:

# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
)

args = TrainingArguments(
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=3,          # batch size per device during training
    gradient_accumulation_steps=2,          # number of steps before performing a backward/update pass
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",           # use constant learning rate scheduler
)

max_seq_length = 3072 # max sequence length for model and packing of the dataset
 
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False, # No need to add additional separator token
    }
)

Speeds, Sizes, Times [optional]

  • train_runtime: 1148.4436
  • train_samples_per_second: 0.392
  • train_steps_per_second: 0.065
  • train_loss: 0.5639963404337565
  • epoch: 3.0

Training Loss

{'loss': 1.0092, 'grad_norm': 0.27965569496154785, 'learning_rate': 0.0002, 'epoch': 0.4}                                   
{'loss': 0.695, 'grad_norm': 0.17789314687252045, 'learning_rate': 0.0002, 'epoch': 0.8}
{'loss': 0.6747, 'grad_norm': 0.13655725121498108, 'learning_rate': 0.0002, 'epoch': 1.2}                                   
{'loss': 0.508, 'grad_norm': 0.14653471112251282, 'learning_rate': 0.0002, 'epoch': 1.6}                                    
{'loss': 0.4961, 'grad_norm': 0.14873674511909485, 'learning_rate': 0.0002, 'epoch': 2.0}                                   
{'loss': 0.3509, 'grad_norm': 0.1657964587211609, 'learning_rate': 0.0002, 'epoch': 2.4}                                    
{'loss': 0.3321, 'grad_norm': 0.1634644716978073, 'learning_rate': 0.0002, 'epoch': 2.8} 

Technical Specifications [optional]

Compute Infrastructure

Hardware

  • 1x NVIDIA RTX 6000 Ada

Model Card Contact

Isaac Chung

Downloads last month
27
Safetensors
Model size
8.03B params
Tensor type
FP16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.