# Training Large Language Models in 2bit with `aqlm`, `transformers` and `PEFT`

<a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_2bit_training.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Welcome to this notebook that goes through the recent `aqlm` integration that introduces minimal performance degradation 2bit quantization techniques.

In this notebook, we will learn how to load a large model in 2bit (`Mixtral-8x7b`) and train it using Google Colab and PEFT library from Hugging Face ðŸ¤—.


**Install the `aqlm` library**
- It's the only extra dependency to run AQLM models.
- Add `[gpu]` to install the required CUDA specific dependencies.
- Install the latest `accelerate` and `transformers` releases to properly support it.

In [None]:
%%capture
!pip install aqlm[gpu]>=1.1.0
!pip install git+https://github.com/huggingface/peft.git@main
!pip install accelerate>=0.27.0
!pip install git+https://github.com/huggingface/transformers.git@main
!pip install datasets
!pip install bitsandbytes
# for 8-bit optimizer only

First let's load the model we are going to use - `Mixtral-8x7b`! Note that the model itself is around 50GB in half precision

In [23]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="bfloat16", low_cpu_mem_usage=True)

**Add LoRA**

To alter model's behavior, we have to make it trainable. We can do that by addind a small set of trainable parameters on top of the untrainable quantized ones.

In [24]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj','k_proj','v_proj','o_proj','gate_proj','down_proj','up_proj', ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
model.enable_input_require_grads() # it's needed for gradient checkpointing

trainable params: 41,943,040 || all params: 2,084,114,432 || trainable%: 2.0125


Here we add a trainable adapter ontop of every `q_prok`, `k_proj` and `o_proj` linear layer.

**Loading a dataset**

Let's load a common dataset, english quotes, to fine tune our model on famous quotes.

In [13]:
from datasets import load_dataset, Dataset
import itertools
from transformers import AutoTokenizer

# Load the dataset in streaming mode
ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)

# Define the number of examples you want to load
num_examples = 100000  # Adjust this number as needed

# Create a subset by taking the first num_examples
subset = list(itertools.islice(ds, num_examples))

# Convert the subset to a Dataset object
data = Dataset.from_list(subset)
print(f"Loaded dataset with {len(data)} examples")

# Initialize tokenizer (replace 'gpt2' with your specific model if different)
tokenizer = AutoTokenizer.from_pretrained('gpt2')

max_seq_length = 2048
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = max_seq_length

def preprocess_function(examples):
    # Join the list of strings into a single string
    texts = [" ".join(text) for text in examples["text"]]
    return tokenizer(texts, truncation=True, max_length=max_seq_length, padding="max_length")

# Process the dataset
processed_dataset = data.map(preprocess_function, batched=True, remove_columns=data.column_names)

print(f"Processed dataset has {len(processed_dataset)} examples")
print(f"Features: {processed_dataset.features}")


Resolving data files:   0%|          | 0/114 [00:00<?, ?it/s]

Loaded dataset with 100000 examples


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Processed dataset has 100000 examples
Features: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}


In [15]:
import argparse
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
import transformers
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from transformers.trainer_callback import TrainerCallback
import os
import random
import subprocess
from huggingface_hub import HfApi, hf_hub_download


# Custom callback to push to Hub
class PushToHubCallback(TrainerCallback):
    def __init__(self, trainer, push_frequency):
        self.trainer = trainer
        self.push_frequency = push_frequency

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.push_frequency == 0:
            self.trainer.save_model()
            self.trainer.push_to_hub(
                commit_message=f"Training in progress - Step {state.global_step}"
            )


In [21]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.svâ€¦

In [None]:
hub_model_id = "davisrbr/math-lora"
tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()
trainer = transformers.Trainer(
    model=model,
    train_dataset=processed_dataset,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        gradient_checkpointing=True,
        warmup_steps=200,
        max_steps=10000,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=25,
        output_dir=".",
        optim="adamw_bnb_8bit",
        logging_first_step=True,
        push_to_hub=True,
        hub_model_id=hub_model_id,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False

push_frequency = 100
trainer.add_callback(PushToHubCallback(trainer, push_frequency,))

trainer.train()

final_commit_hash = trainer.push_to_hub("Training complete")
print(f"Training complete. Final commit hash: {final_commit_hash}")

max_steps is given, it will override any value given in num_train_epochs
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Step,Training Loss
1,5.5585


Run the cell below to run the training! For the sake of the demo, we just ran it for few steps just to showcase how to use this integration with existing tools on the HF ecosystem.

In [None]:
import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        gradient_checkpointing=True,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_bnb_8bit",
        logging_first_step=True,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()



Step,Training Loss
1,2.0422
2,1.2934
3,1.4475
4,1.4336
5,1.7259
6,1.5064
7,1.5496
8,1.0383
9,1.6033
10,1.6764


TrainOutput(global_step=10, training_loss=1.531658697128296, metrics={'train_runtime': 861.2678, 'train_samples_per_second': 0.046, 'train_steps_per_second': 0.012, 'total_flos': 56809829376000.0, 'train_loss': 1.531658697128296, 'epoch': 0.02})