FSDP2

Fully Sharded Data Parallel (FSDP2) shards model parameters, gradients, and optimizer states across GPUs. Before computation, each GPU all-gathers the full parameters it needs from all shards, then frees the gathered copies afterward so only its own shard stays in memory. Sharding lets you train models larger than a single GPU’s memory, at the cost of more communication than DDP. Use FSDP when your model or optimizer states don’t fit on a single GPU.
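To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes 16 bytes of model state per parameter, a figure often quoted for mixed-precision Adam; that number, the function name, and the 7B-parameter model are illustrative assumptions rather than values from this page, and activations are ignored.

```python
# Rough per-GPU memory for model states (parameters, gradients, optimizer states).
# ASSUMPTION: 16 bytes per parameter, a common figure for mixed-precision Adam
# (2 B bf16 weights + 2 B bf16 grads + 12 B fp32 master weights and Adam moments).
# Activations, temporarily gathered parameters, and buffers are ignored.
BYTES_PER_PARAM = 16
GIB = 2**30


def model_state_gib(num_params: int, num_gpus: int, sharded: bool) -> float:
    """Return the estimated GiB of model states held by each GPU."""
    total_bytes = num_params * BYTES_PER_PARAM
    # Replicated training (DDP) keeps all states on every GPU;
    # full sharding splits them evenly across GPUs.
    per_gpu_bytes = total_bytes / num_gpus if sharded else total_bytes
    return per_gpu_bytes / GIB


# Hypothetical 7B-parameter model on 8 GPUs.
replicated = model_state_gib(7_000_000_000, num_gpus=8, sharded=False)
fully_sharded = model_state_gib(7_000_000_000, num_gpus=8, sharded=True)
print(f"replicated: {replicated:.1f} GiB per GPU")
print(f"sharded:    {fully_sharded:.1f} GiB per GPU")
```

Under these assumptions the replicated states exceed the memory of most single GPUs, while the sharded states are one eighth of that size per GPU.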

                      ┌─────────────────┐
                      │  training data  │
                      └────────┬────────┘
            ┌──────────────────┼──────────────────┐
            │ shard 0          │ shard 1          │ shard 2
            ▼                  ▼                  ▼
     ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
     │  param      │    │  param      │    │  param      │
     │  shard 0    │    │  shard 1    │    │  shard 2    │
     │  GPU 0      │    │  GPU 1      │    │  GPU 2      │
     └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
            │                  │                  │
            └──────── all-gather (params) ────────┘
                               │
                    full params on each GPU
                               │
            ┌──────────────────┼──────────────────┐
            ▼                  ▼                  ▼
         fwd+bwd             fwd+bwd             fwd+bwd
            │                  │                  │
            └───── reduce-scatter (grads) ────────┘
                               │
            ┌──────────────────┼──────────────────┐
            ▼                  ▼                  ▼
     grad shard 0       grad shard 1       grad shard 2
     optim shard 0      optim shard 1      optim shard 2
        step               step               step
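The flow in the diagram can be sketched in plain Python, with list entries standing in for GPUs. This is a toy simulation, not how FSDP2 is implemented: the helper names, the gradient rule, and the learning rate are made up for illustration, and gradients are summed here for simplicity.

```python
# Pure-Python simulation of the collectives in the diagram; no GPUs or torch needed.
# Each "rank" is a list entry. Gradients are made up for illustration.

def all_gather(shards):
    """Give every rank the concatenation of all ranks' shards."""
    full = [value for shard in shards for value in shard]
    return [list(full) for _ in shards]


def reduce_scatter(full_grads_per_rank):
    """Sum full-size gradients across ranks, then hand each rank one equal slice."""
    world_size = len(full_grads_per_rank)
    length = len(full_grads_per_rank[0])
    summed = [sum(g[i] for g in full_grads_per_rank) for i in range(length)]
    shard_len = length // world_size
    return [summed[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


param_shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one shard per GPU

# 1. all-gather: every rank temporarily holds the full parameters.
full_params = all_gather(param_shards)

# 2. Each rank computes full-size gradients on its own data shard.
#    Toy rule (an assumption): rank r produces grad = (r + 1) * param.
full_grads = [[(rank + 1) * p for p in params] for rank, params in enumerate(full_params)]

# 3. reduce-scatter: gradients are summed and each rank keeps only its slice.
grad_shards = reduce_scatter(full_grads)

# 4. Each rank updates only its own parameter shard (plain SGD, lr is arbitrary).
lr = 0.5
param_shards = [
    [p - lr * g for p, g in zip(params, grads)]
    for params, grads in zip(param_shards, grad_shards)
]
print(grad_shards)
print(param_shards)
```

Each rank ends the step holding only its own slice of the parameters, gradients, and (in a real optimizer) optimizer state.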

Sharding strategies

FSDP2 controls sharding with TrainingArguments.fsdp_config. Set fsdp=True to enable FSDP, and set reshard_after_forward in the FSDP config to choose the tradeoff between memory and throughput.

`reshard_after_forward`	behavior
`true`	reshard parameters after the forward pass to save more memory
`false`	keep parameters gathered between forward and backward to avoid the re-all-gather, at the cost of higher peak memory

auto_wrap_policy controls how modules are wrapped into FSDP units. It defaults to "TRANSFORMER_BASED_WRAP", which wraps the model’s transformer layers. Without wrapping ("NO_WRAP"), the entire model is one FSDP unit and you lose the memory benefit of sharding.
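The tradeoffs above can be approximated with a small model of one training step. This sketch assumes units run one at a time with no prefetching, and the layer sizes and function name are hypothetical; real peak memory also depends on activations and on implementation details not covered here.

```python
# Simplified model of how wrapping and reshard_after_forward affect one training step.
# ASSUMPTIONS: units run one at a time with no prefetching, and sizes are made up.

def step_profile(unit_sizes, reshard_after_forward):
    """Return (peak gathered parameters, all-gathers per step)."""
    if reshard_after_forward:
        # Only the active unit is gathered; each unit is gathered again in backward.
        peak_gathered = max(unit_sizes)
        all_gathers = 2 * len(unit_sizes)
    else:
        # Units stay gathered after forward, so all of them coexist before backward.
        peak_gathered = sum(unit_sizes)
        all_gathers = len(unit_sizes)
    return peak_gathered, all_gathers


layers = [200_000_000] * 32          # hypothetical 32 transformer layers
per_layer_units = layers             # one unit per layer, like "TRANSFORMER_BASED_WRAP"
single_unit = [sum(layers)]          # whole model as one unit, like "NO_WRAP"

print(step_profile(per_layer_units, reshard_after_forward=True))
print(step_profile(per_layer_units, reshard_after_forward=False))
print(step_profile(single_unit, reshard_after_forward=True))
```

In this model, per-layer units with resharding keep only one layer's parameters gathered at a time, while a single unit gathers the whole model regardless of the resharding setting.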

Configure FSDP

These fields control how FSDP2 wraps, shards, and loads the model. reshard_after_forward and auto_wrap_policy are covered in Sharding strategies.

cpu_offload offloads parameters and gradients to CPU when they aren’t in use to save GPU memory.
transformer_layer_cls_to_wrap defines the transformer layer to wrap into an FSDP unit when auto_wrap_policy is "TRANSFORMER_BASED_WRAP". Each unit manages its own gather and scatter ops. Only the current unit’s parameters are gathered during the forward pass, and the previous units’ parameters are released to save memory. Wrapping only the top-level model yields no GPU memory savings, while wrapping every individual Linear layer makes inter-unit communication very expensive. Leave this field empty to have FSDP read the value from the model definition.
min_num_params sets the minimum number of parameters per module for size-based wrapping. It is only used when auto_wrap_policy is "SIZE_BASED_WRAP".
state_dict_type controls the checkpoint format. Defaults to "FULL_STATE_DICT" for a single Transformers-compatible checkpoint. Use "SHARDED_STATE_DICT" for one checkpoint file per rank, which is faster for large models. Sharded checkpoints only load back into FSDP, so save a "FULL_STATE_DICT" for the final checkpoint you want to share or load outside FSDP.
cpu_ram_efficient_loading loads the checkpoint from disk on rank 0 only. Other GPUs initialize an empty model and receive the weights by broadcast, avoiding multiple processes loading a large model into CPU RAM.
activation_checkpointing recomputes activations during the backward pass instead of storing them. Use this instead of gradient checkpointing in TrainingArguments. Setting both raises an error.

Configure FSDP training with either an Accelerate config file or an FSDP config file passed to fsdp_config.

Accelerate config file

FSDP config file
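As a minimal sketch of the FSDP config file route, the snippet below writes the fields described on this page to a JSON file. The key spellings and string values come from the text above; treating them as top-level JSON keys, the boolean value types, and the file path are assumptions, so check them against the TrainingArguments reference for your installed version.

```python
import json
import os
import tempfile

# Keys and values are taken from the field descriptions on this page.
# ASSUMPTION: they are written as top-level JSON keys with these exact spellings.
fsdp_config = {
    "reshard_after_forward": True,
    "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    # transformer_layer_cls_to_wrap is left out so it is read from the model definition.
    "state_dict_type": "SHARDED_STATE_DICT",
    "cpu_ram_efficient_loading": True,
    "activation_checkpointing": True,
    "cpu_offload": False,
}

config_path = os.path.join(tempfile.mkdtemp(), "fsdp_config.json")  # hypothetical path
with open(config_path, "w") as f:
    json.dump(fsdp_config, f, indent=2)

# Read the file back to confirm it is valid JSON with the expected content.
with open(config_path) as f:
    loaded = json.load(f)
print(json.dumps(loaded, indent=2))

# The path would then be passed to the trainer, for example:
#   TrainingArguments(..., fsdp=True, fsdp_config=config_path)
```

Because activation_checkpointing is enabled here, gradient checkpointing in TrainingArguments should stay off, since setting both raises an error.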

Next steps

See DDP for data-parallel training when your model fits on one GPU.
See DeepSpeed for ZeRO optimization and NVMe offloading.
For FSDP on TPUs with PyTorch/XLA, set xla, xla_fsdp_settings, and xla_fsdp_grad_ckpt in TrainingArguments.fsdp_config.
Read the FSDP chapter from The Ultra-Scale Playbook for more information about how FSDP works.
