Transformers documentation

FSDP2

Fully Sharded Data Parallel (FSDP2) shards model parameters, gradients, and optimizer states across GPUs. Before computing a wrapped unit, each GPU all-gathers that unit’s full parameters from all shards, then frees them afterward. Sharding lets you train models larger than a single GPU’s memory, at the cost of more communication than DDP. Use FSDP2 when your model or optimizer states don’t fit on a single GPU.

                      ┌─────────────────┐
                      │  training data  │
                      └────────┬────────┘
            ┌──────────────────┼──────────────────┐
            │ shard 0          │ shard 1          │ shard 2
            ▼                  ▼                  ▼
     ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
     │  param      │    │  param      │    │  param      │
     │  shard 0    │    │  shard 1    │    │  shard 2    │
     │  GPU 0      │    │  GPU 1      │    │  GPU 2      │
     └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
            │                  │                  │
            └──────── all-gather (params) ────────┘
                               │
                    full params on each GPU
                               │
            ┌──────────────────┼──────────────────┐
            ▼                  ▼                  ▼
         forward            forward            forward
            │                  │                  │
            └───── reduce-scatter (grads) ────────┘
                               │
            ┌──────────────────┼──────────────────┐
            ▼                  ▼                  ▼
     grad shard 0       grad shard 1       grad shard 2
     optim shard 0      optim shard 1      optim shard 2
          step               step               step
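
The data flow in the diagram can be sketched without any GPU. The example below is a toy, single-process simulation, not FSDP2's implementation: the helper names (`shard`, `all_gather`, `reduce_scatter`), the quadratic loss, the gradient averaging, and the plain SGD step are all assumptions chosen for illustration.

```python
WORLD_SIZE = 3  # three simulated GPUs, as in the diagram


def shard(values, world_size):
    """Split a flat list into equal contiguous shards, one per rank."""
    n = len(values) // world_size
    return [values[r * n:(r + 1) * n] for r in range(world_size)]


def all_gather(shards):
    """Every rank receives the concatenation of all shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


def reduce_scatter(per_rank_grads):
    """Average gradients across ranks, then give each rank only its slice."""
    world_size = len(per_rank_grads)
    averaged = [sum(col) / world_size for col in zip(*per_rank_grads)]
    return shard(averaged, world_size)


params = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
param_shards = shard(params, WORLD_SIZE)   # each rank stores 2 of 6 values
full_params = all_gather(param_shards)     # temporary full copy per rank

# Each rank sees a different data shard. Toy loss: 0.5 * x * sum(w**2),
# so the gradient with respect to each w is x * w.
batches = [1.0, 2.0, 3.0]
per_rank_grads = [[x * w for w in full_params[r]] for r, x in enumerate(batches)]

grad_shards = reduce_scatter(per_rank_grads)  # each rank keeps 2 gradient values

# Each rank updates only the parameter shard it owns (plain SGD here).
LR = 0.25
new_shards = [
    [w - LR * g for w, g in zip(p, g_shard)]
    for p, g_shard in zip(param_shards, grad_shards)
]
print(new_shards)
```

Note that each simulated rank holds the full parameter list only between `all_gather` and the gradient computation. Everything it keeps long term (parameters, gradients, and the update) is one third of the model.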

Sharding strategies

FSDP2 controls sharding with TrainingArguments.fsdp_config. Set fsdp=True to enable FSDP, and set reshard_after_forward in the FSDP config to choose the memory and throughput tradeoff.

reshard_after_forward   behavior
true                    reshard parameters after the forward pass to save more memory
false                   keep parameters gathered between forward and backward to avoid the re-all-gather, at the cost of higher peak memory

auto_wrap_policy controls how modules are wrapped into FSDP units. It defaults to "TRANSFORMER_BASED_WRAP", which wraps the model’s transformer layers. Without wrapping ("NO_WRAP"), the entire model is one FSDP unit and you lose the memory benefit of sharding.
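
To make the tradeoff concrete, here is a back-of-the-envelope sketch of how many parameter elements sit on one GPU at the peak of the forward pass. It is a deliberately simplified model, not a measurement: the function `peak_param_elements` is hypothetical, and it ignores prefetching, activations, gradients, and optimizer state.

```python
def peak_param_elements(layer_sizes, world_size, reshard_after_forward,
                        wrap_per_layer=True):
    """Rough peak count of parameter elements resident on one GPU.

    Assumes each FSDP unit is either fully gathered or sharded evenly
    across `world_size` GPUs. Illustrative only.
    """
    total = sum(layer_sizes)
    # Without wrapping, the whole model is a single FSDP unit.
    units = list(layer_sizes) if wrap_per_layer else [total]
    if reshard_after_forward:
        # One unit is gathered at a time; everything else stays sharded.
        return max(u + (total - u) / world_size for u in units)
    # Units stay gathered until backward, so by the end of the forward
    # pass every unit is held in full.
    return float(total)


layers = [100, 100, 100, 100]  # four equally sized transformer layers
print(peak_param_elements(layers, 4, reshard_after_forward=True))
print(peak_param_elements(layers, 4, reshard_after_forward=False))
print(peak_param_elements(layers, 4, reshard_after_forward=True,
                          wrap_per_layer=False))
```

Under these assumptions, per-layer wrapping with resharding keeps well under half of the model resident at the peak, while either disabling resharding or skipping wrapping brings the peak back to the full model size.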

Configure FSDP

These fields control how FSDP2 wraps, shards, and loads the model. reshard_after_forward and auto_wrap_policy are covered in Sharding strategies.

  • cpu_offload offloads parameters and gradients to CPU when they aren’t in use to save GPU memory.

  • transformer_layer_cls_to_wrap defines the transformer layer to wrap into an FSDP unit when auto_wrap_policy is "TRANSFORMER_BASED_WRAP". Each unit manages its own gather and scatter ops. Only the current unit’s parameters are gathered during the forward pass. The previous units’ parameters are released to save memory.

    Wrapping only the top-level model yields no GPU memory savings, while wrapping every individual Linear layer makes inter-unit communication very expensive. Leave this field empty to let FSDP read the value from the model definition.

  • min_num_params sets the minimum number of parameters per module for size-based wrapping. It is only used when auto_wrap_policy is "SIZE_BASED_WRAP".

  • state_dict_type controls the checkpoint format. Defaults to "FULL_STATE_DICT" for a single Transformers-compatible checkpoint. Use "SHARDED_STATE_DICT" for one checkpoint file per rank, which is faster for large models. Sharded checkpoints only load back into FSDP, so save a "FULL_STATE_DICT" for the final checkpoint you want to share or load outside FSDP.

  • cpu_ram_efficient_loading loads the checkpoint from disk on rank 0 only. Other GPUs initialize an empty model and receive the weights by broadcast, avoiding multiple processes loading a large model into CPU RAM.

  • activation_checkpointing recomputes activations during the backward pass instead of storing them. Use this instead of gradient checkpointing in TrainingArguments. Setting both raises an error.
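
The fields above can be collected into a single config file. The sketch below builds one as a Python dict and writes it to JSON. The field names come from this page, but the specific values are illustrative assumptions: in particular, `"LlamaDecoderLayer"` is a hypothetical layer class name, and the value types (booleans, a list of class names) are assumed rather than confirmed here.

```python
import json
import tempfile
from pathlib import Path

# Illustrative values only; adjust them to your model and hardware.
fsdp_config = {
    "reshard_after_forward": True,               # favor memory over throughput
    "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    # Hypothetical class name; leave this out to read it from the model.
    "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
    "cpu_offload": False,
    "state_dict_type": "SHARDED_STATE_DICT",     # one checkpoint file per rank
    "cpu_ram_efficient_loading": True,           # rank 0 loads, others receive
    "activation_checkpointing": True,            # don't also enable gradient checkpointing
}

# Written to a temporary directory so the example leaves no files behind
# in the working directory.
config_path = Path(tempfile.mkdtemp()) / "fsdp_config.json"
config_path.write_text(json.dumps(fsdp_config, indent=2))

# Round-trip the file to confirm it is valid JSON.
loaded = json.loads(config_path.read_text())
print(config_path)
```

The resulting path is what you would pass as fsdp_config, together with fsdp=True, in TrainingArguments. Because this example uses "SHARDED_STATE_DICT", a final checkpoint meant for sharing would need to be saved with "FULL_STATE_DICT" instead.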

Configure FSDP training with either an Accelerate config file or an FSDP config file passed to fsdp_config.

Accelerate config file

Run the accelerate config command and answer questions about your hardware and training setup. This creates a default_config.yaml file in your cache.

Run accelerate launch with a Trainer-based script. The fsdp_config is unnecessary because the Accelerate config file covers the same settings.

accelerate launch train.py

Next steps

  • See DDP for data-parallel training when your model fits on one GPU.
  • See DeepSpeed for ZeRO optimization and NVMe offloading.
  • For FSDP on TPUs with PyTorch/XLA, set xla, xla_fsdp_settings, and xla_fsdp_grad_ckpt in TrainingArguments.fsdp_config.
  • Read the FSDP chapter from The Ultra-Scale Playbook for more information about how FSDP works.