To learn more about the other available FSDP options, take a look at the fsdp_config parameters.
Sharding strategy
FSDP offers a number of sharding strategies to select from:
FULL_SHARD - shard model parameters, gradients and optimizer states across workers; select 1 for this option
SHARD_GRAD_OP - shard gradients and optimizer states across workers; select 2 for this option
NO_SHARD - don't shard anything (this is equivalent to DDP); select 3 for this option
HYBRID_SHARD - shard model parameters, gradients and optimizer states within each node, with each node holding a full copy of the model; select 4 for this option
HYBRID_SHARD_ZERO2 - shard gradients and optimizer states within each node, with each node holding a full copy of the model; select 5 for this option
The strategy is selected with the fsdp_sharding_strategy flag.
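The numbered options listed above can be mapped to their strategy names with a small helper. The following is a minimal sketch that uses only the Python standard library; ShardingChoice and resolve_sharding_strategy are hypothetical names introduced for illustration and are not part of any library, and the resulting dictionary only shows where the fsdp_sharding_strategy value would go.

```python
from enum import IntEnum


class ShardingChoice(IntEnum):
    """Hypothetical enum mirroring the numbered options listed above."""
    FULL_SHARD = 1
    SHARD_GRAD_OP = 2
    NO_SHARD = 3
    HYBRID_SHARD = 4
    HYBRID_SHARD_ZERO2 = 5


def resolve_sharding_strategy(value):
    """Accept either the option number (1-5) or the name; return the name."""
    if isinstance(value, int):
        return ShardingChoice(value).name
    text = str(value).strip()
    if text.isdigit():
        return ShardingChoice(int(text)).name
    # Look the strategy up by name, ignoring case.
    return ShardingChoice[text.upper()].name


# Illustrative config fragment: the key matches the flag named in the text.
fsdp_config = {"fsdp_sharding_strategy": resolve_sharding_strategy(1)}
print(fsdp_config)
```

An unknown number raises ValueError and an unknown name raises KeyError, so a mistyped strategy is caught before training starts.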
CPU offload
You can also offload parameters and gradients to the CPU when they are not in use. This saves additional GPU memory and can help you fit large models for which FSDP alone is not sufficient.
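As a rough illustration, CPU offload can be treated as one more entry alongside the sharding strategy in an FSDP configuration. The sketch below builds such a dictionary in plain Python. The helper build_fsdp_config is hypothetical, and the key name fsdp_offload_params is an assumption made for this example; check the fsdp_config parameters for the exact option name in your version.

```python
def build_fsdp_config(sharding_strategy="FULL_SHARD", offload_params=False):
    """Hypothetical helper that assembles an FSDP config dictionary."""
    return {
        "fsdp_sharding_strategy": sharding_strategy,
        # Assumed key name for CPU offload; verify against the
        # fsdp_config parameters before relying on it.
        "fsdp_offload_params": bool(offload_params),
    }


# Enable CPU offload on top of full sharding.
config = build_fsdp_config(offload_params=True)
print(config)
```

Offloading trades GPU memory for host-device transfers, so it is usually worth enabling only when the model does not fit with sharding alone.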