PEFT
PEFT, a library of parameter-efficient fine-tuning methods, enables training and storing large models on consumer GPUs. These methods only fine-tune a small number of extra model parameters, also known as adapters, on top of the pretrained model. A significant amount of memory is saved because the GPU doesn’t need to store the optimizer states and gradients for the pretrained base model. Adapters are very lightweight, making it convenient to share, store, and load them.
This guide provides a short introduction to the PEFT library and how to use it for training with Transformers. For more details, refer to the PEFT documentation.
Install PEFT with the command below.
pip install -U peft
PEFT currently supports the LoRA, IA3, and AdaLoRA methods for Transformers. To use another PEFT method, such as prompt learning or prompt tuning, use the PEFT library directly.
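For example, here is a minimal sketch of using the PEFT library directly for prompt tuning; the base checkpoint and num_virtual_tokens value are placeholder assumptions, not recommendations.

from peft import PromptTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# prompt tuning is not available through add_adapter(), so wrap the model with get_peft_model()
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")  # example base checkpoint
peft_config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=8)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()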
Low-Rank Adaptation (LoRA) is a very common PEFT method that keeps the pretrained weights frozen and represents the weight update as the product of two smaller trainable low-rank matrices. Start by defining a LoraConfig object with the parameters shown below.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# create LoRA configuration object
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, # type of task to train on
    inference_mode=False, # set to False for training
    r=8, # dimension of the smaller matrices
    lora_alpha=32, # scaling factor
    lora_dropout=0.1 # dropout of LoRA layers
)
Add LoraConfig to the model with add_adapter(). The model is now ready to be passed to Trainer for training.
# model is the base model loaded with AutoModelForCausalLM.from_pretrained()
model.add_adapter(lora_config, adapter_name="lora_1")

trainer = Trainer(model=model, ...)
trainer.train()
To add an additional trainable adapter on top of a model with an existing adapter attached, specify the modules you want to train in the modules_to_save parameter. For example, to train the lm_head module on top of a causal language model with a LoRA adapter attached, set modules_to_save=["lm_head"]. Add the adapter to the model as shown below, and then pass it to Trainer.
from transformers import AutoModelForCausalLM, Trainer
from peft import LoraConfig

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj"],
    modules_to_save=["lm_head"],
)

model.add_adapter(lora_config)
trainer = Trainer(model=model, ...)
trainer.train()
Save your adapter with save_pretrained() to reuse it.
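A minimal sketch, assuming an example output directory; the adapter weights and adapter_config.json are written to the directory so they can be shared or reloaded later.

# save the trained adapter (directory name is an example)
model.save_pretrained("./gemma-2-2b-lora")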
Load adapter
To load an adapter with Transformers, the Hub repository or local directory must contain an adapter_config.json file and the adapter weights. Load the adapter with from_pretrained() or with load_adapter().
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("klcsp/gemma7b-lora-alpaca-11-v1")
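Alternatively, load the base model first and attach the adapter weights with load_adapter(). A short sketch; the base checkpoint name below is an assumption about which model the adapter was trained on.

from transformers import AutoModelForCausalLM

# load the base model, then attach the adapter weights on top of it
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")  # assumed base checkpoint
model.load_adapter("klcsp/gemma7b-lora-alpaca-11-v1")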
For very large models, it is helpful to load a quantized version of the model in 8 or 4-bit precision to save memory. Transformers supports quantization with its bitsandbytes integration. Specify in BitsAndBytesConfig whether you want to load a model in 8 or 4-bit precision.
For multiple devices, add device_map="auto" to automatically distribute the model across your hardware.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "klcsp/gemma7b-lora-alpaca-11-v1",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
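As a quick usage sketch, the quantized model with its adapter can then generate text. This assumes the adapter repository also hosts tokenizer files; otherwise, load the tokenizer from the base checkpoint. The prompt and generation settings are arbitrary.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klcsp/gemma7b-lora-alpaca-11-v1")  # assumes tokenizer files are in the adapter repo
inputs = tokenizer("What is PEFT?", return_tensors="pt").to(model.device)

# generate with the adapter applied on top of the quantized base weights
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))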
Set adapter
add_adapter() adds a new adapter to a model. To add a second adapter, the new adapter must be the same type as the first adapter. Use the adapter_name parameter to assign a name to the adapter.
model.add_adapter(lora_config, adapter_name="lora_2")
Once added, use set_adapter() to force a model to use the specified adapter and disable the other adapters.
model.set_adapter("lora_2")
Enable and disable adapter
Unlike set_adapter(), which activates a single adapter, enable_adapters() enables all adapters attached to a model, and disable_adapters() disables all attached adapters.
# lora_1 and lora_2 are two separate LoraConfig objects
model.add_adapter(lora_1, adapter_name="lora_1")
model.add_adapter(lora_2, adapter_name="lora_2")

# enable all attached adapters
model.enable_adapters()

# disable all adapters
model.disable_adapters()