Accelerate

Accelerate is a library that simplifies distributed training with PyTorch on any type of setup by uniting the most common distributed training frameworks, Fully Sharded Data Parallel (FSDP) and DeepSpeed, under a single interface. Trainer is powered by Accelerate under the hood, which is what enables loading big models and distributed training.

This guide shows two ways to use Accelerate with Transformers, both using FSDP as the backend. The first method demonstrates distributed training with Trainer, and the second demonstrates adapting a native PyTorch training loop. For more detailed information about Accelerate, refer to its documentation.

Make sure Accelerate is installed.

pip install accelerate

Next, run accelerate config in the command line to answer a series of prompts about your training system. This creates and saves a configuration file that Accelerate uses to correctly set up training for your hardware.

accelerate config

Depending on your setup and the answers you provide, an example configuration file for distributed training with FSDP on one machine with two GPUs may look like the following.

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
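
Accelerate saves this file to a default location (typically inside the Hugging Face cache directory) and picks it up automatically the next time you launch training. As an optional sanity check (not a required step), you can print your environment and the active configuration with accelerate env.

accelerate env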

Trainer

Pass the path to the saved configuration file to TrainingArguments, and from there, pass your TrainingArguments to Trainer.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="your-model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    fsdp_config="path/to/fsdp_config",
    fsdp="full_shard",
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
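
The Trainer example above assumes that model, dataset, tokenizer, data_collator, and compute_metrics have already been created. A minimal sketch of that setup, using a BERT sequence-classification model (to match the BertLayer wrapping policy in the FSDP config) and a placeholder text-classification dataset, could look like the following; swap in your own model, dataset, and metric.

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# any dataset with a text column and class labels works here; tokenize it so Trainer receives model-ready features
dataset = load_dataset("imdb")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)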

Native PyTorch

Accelerate can also be added to any PyTorch training loop to enable distributed training. The Accelerator is the main entry point for adapting your PyTorch code to work with Accelerate. It automatically detects your distributed training setup and initializes all the necessary components for training. You don’t need to explicitly place your model on a device because Accelerator knows which device to move your model to.

from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device

All PyTorch objects (model, optimizer, scheduler, dataloaders) should now be passed to the prepare() method. This method moves your model to the appropriate device or devices, adapts the optimizer and scheduler to use AcceleratedOptimizer and AcceleratedScheduler, and creates new dataloaders that shard data across processes.
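
The prepare() call below assumes these objects already exist as standard PyTorch objects. A minimal sketch of creating them, reusing the tokenized dataset and data collator from the Trainer example above (the model, batch size, learning rate, and schedule here are placeholders), might look like this.

import torch
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# drop the raw text column so the collator only sees model inputs
tokenized = dataset.remove_columns(["text"])
train_dataloader = DataLoader(tokenized["train"], shuffle=True, batch_size=16, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized["test"], batch_size=16, collate_fn=data_collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

num_epochs = 2
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
progress_bar = tqdm(range(num_training_steps))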

train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

Replace loss.backward() in your training loop with Accelerate's backward() method, which scales the gradients and uses the appropriate backward pass for your framework (for example, DeepSpeed or Megatron).

for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
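
After the loop finishes, you will usually want to save the trained model. One common pattern (a sketch based on Accelerate's standard saving utilities, not part of the original example) is to wait for all processes, unwrap the model, and save from the main process only, gathering the full state dict since FSDP shards the weights.

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "your-model",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)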

Combine everything into a function and make it callable as a script.

from accelerate import Accelerator

def main():
    accelerator = Accelerator()

    # define your model, optimizer, training_dataloader, scheduler, and loss_function here

    model, optimizer, training_dataloader, scheduler = accelerator.prepare(
        model, optimizer, training_dataloader, scheduler
    )

    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()

if __name__ == "__main__":
    main()

From the command line, call accelerate launch to run your training script. Any additional arguments or parameters can be passed here as well.

To launch your training script on two GPUs, add the --num_processes argument.

accelerate launch --num_processes=2 your_script.py
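
If you saved the configuration file somewhere other than the default location, you can also point accelerate launch at it explicitly with the --config_file argument (the path below is a placeholder).

accelerate launch --config_file path/to/fsdp_config.yaml your_script.py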

Refer to the Launching Accelerate scripts guide for more details.
