|
If you don't configure the optimizer in the config, the [Trainer] automatically selects AdamW and uses the values supplied on the command line, or the defaults, for the following parameters: `lr`, `adam_beta1`, `adam_beta2`, `adam_epsilon`, `weight_decay`.
|
You can set the parameters to "auto" or manually input your own desired values. |
|
```yaml
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    }
}
```
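
For example, a minimal sketch with explicitly chosen values instead of `"auto"` (the numbers here are illustrative, not recommendations):

```yaml
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 2e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    }
}
```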
|
You can also use an unsupported optimizer by adding the following to the top-level configuration.
|
```yaml
{
    "zero_allow_untested_optimizer": true
}
```
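
For context, a sketch of where the flag sits relative to other top-level sections; the `zero_optimization` block here is illustrative, not a requirement:

```yaml
{
    "zero_allow_untested_optimizer": true,
    "zero_optimization": {
        "stage": 2
    }
}
```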
|
From DeepSpeed==0.8.3 on, if you want to use offload, you'll also need to add the following to the top-level configuration because offload works best with DeepSpeed's CPU Adam optimizer.
|
```yaml
{
    "zero_force_ds_cpu_optimizer": false
}
```
|
|
|
DeepSpeed supports the LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR learning rate schedulers. |
|
Transformers and DeepSpeed provide two of the same schedulers: |
|
|
|
- WarmupLR is the same as `--lr_scheduler_type constant_with_warmup` in Transformers (see the sketch after this list)
- WarmupDecayLR is the same as `--lr_scheduler_type linear` in Transformers (this is the default scheduler used in Transformers)
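
As an example, a minimal sketch that selects WarmupLR explicitly (the `constant_with_warmup` equivalent), with the `"auto"` values filled in by [Trainer]:

```yaml
{
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}
```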
|
|
|
If you don't configure the scheduler in the config, the [Trainer] automatically selects WarmupDecayLR and uses the values supplied on the command line, or the defaults, for the following parameters: `warmup_min_lr`, `warmup_max_lr`, `warmup_num_steps`, `total_num_steps` (automatically calculated at run time if `max_steps` is not provided).
|
You can set the parameters to "auto" or manually input your own desired values. |
|
```yaml
{
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}
```
|
|
|
## Precision
|
DeepSpeed supports fp32 full precision, and fp16 and bf16 mixed precision.
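
Precision is also controlled through the config. A minimal sketch with `"auto"` values, which lets [Trainer] set them from the command-line arguments (in practice only one of the two should end up enabled):

```yaml
{
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    }
}
```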
|
|
|
If your model doesn't work well with mixed precision, for example if it wasn't pretrained in mixed precision, you may encounter overflow or underflow issues, which can cause NaN loss.