Accelerate
Accelerate is a library that simplifies distributed training with PyTorch on any type of setup by unifying the most common distributed frameworks, Fully Sharded Data Parallel (FSDP) and DeepSpeed, under a single interface. Trainer is powered by Accelerate under the hood, enabling big model loading and distributed training.
This guide shows two ways to use Accelerate with Transformers, using FSDP as the backend. The first method demonstrates distributed training with Trainer, and the second demonstrates adapting a native PyTorch training loop. For more detailed information about Accelerate, refer to its documentation. Start by installing Accelerate.
pip install accelerate
Start by running accelerate config in the command line to answer a series of prompts about your training system. This creates and saves a configuration file to help Accelerate correctly set up training based on your setup.
accelerate config
Depending on your setup and the answers you provide, an example configuration file for distributed training with FSDP on one machine with two GPUs may look like the following.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
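By default, accelerate config saves this file to the Hugging Face cache (typically ~/.cache/huggingface/accelerate/default_config.yaml). You can print your environment details, including the current default configuration, to double-check the values Accelerate will use.
accelerate env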
Trainer
Pass the path to the saved configuration file to the fsdp_config parameter in TrainingArguments, then pass your TrainingArguments to Trainer.
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="your-model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    fsdp_config="path/to/fsdp_config",
    fsdp="full_shard",
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
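Run your script with accelerate launch to start distributed training across the GPUs described in the configuration file. The script name below is a placeholder for the file containing the Trainer code above.
# replace train.py with the name of your own training script
accelerate launch train.py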
Native PyTorch
Accelerate can also be added to any PyTorch training loop to enable distributed training. The Accelerator is the main entry point for adapting your PyTorch code to work with Accelerate. It automatically detects your distributed training setup and initializes all the necessary components for training. You don’t need to explicitly place your model on a device because Accelerator knows which device to move your model to.
from accelerate import Accelerator
accelerator = Accelerator()
device = accelerator.device
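The device attribute is still handy for objects you create yourself outside of Accelerate. A minimal sketch, assuming a hand-built tensor that is only here for illustration:
import torch

# a tensor created outside of prepare(), moved to the device Accelerate selected
labels = torch.tensor([0, 1, 1]).to(device)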
Next, pass all of the relevant PyTorch objects (model, optimizer, scheduler, dataloaders) to the prepare method. It moves your model to the appropriate device or devices, wraps the optimizer and scheduler in AcceleratedOptimizer and AcceleratedScheduler, and creates new dataloaders that can be sharded across processes.
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
Replace loss.backward() in your training loop with Accelerate's backward method, which scales the gradients and selects the appropriate backward implementation for your framework (for example, DeepSpeed or Megatron).
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
Combine everything into a function and make it callable as a script.
from accelerate import Accelerator
def main():
    accelerator = Accelerator()

    model, optimizer, training_dataloader, scheduler = accelerator.prepare(
        model, optimizer, training_dataloader, scheduler
    )

    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()

if __name__ == "__main__":
    main()
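Once training finishes, you will usually want to save the final weights from the main process only. A minimal sketch to place at the end of main(), assuming model is a Transformers PreTrainedModel and "your-model" is a placeholder output directory:
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "your-model",  # placeholder output directory
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)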
From the command line, call accelerate launch to run your training script. Any additional arguments or parameters can be passed here as well.
To launch your training script on two GPUs, add the --num_processes argument.
accelerate launch --num_processes=2 your_script.py
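Other accelerate launch options can be passed in the same way, for example the --mixed_precision flag to set the precision for the run.
accelerate launch --num_processes=2 --mixed_precision=bf16 your_script.py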
Refer to the Launching Accelerate scripts tutorial for more details.