If you don't configure the optimizer in the config, the [`Trainer`] automatically selects AdamW and either uses the supplied values or the default values for the following parameters from the command line: `lr`, `adam_beta1`, `adam_beta2`, `adam_epsilon`, `weight_decay`.
You can set the parameters to "auto" or manually input your own desired values.
```yaml
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    }
}
```
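For example, to pin the optimizer to explicit values instead of `"auto"`, the same block can carry concrete numbers. The values below are purely illustrative placeholders, not recommended settings:

```yaml
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    }
}
```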
You can also use an unsupported optimizer by adding the following to the top level configuration.
```yaml
{
    "zero_allow_untested_optimizer": true
}
```
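To make "top level" concrete, the flag sits next to the other top-level sections rather than inside any of them. A sketch, with the neighboring entries purely illustrative:

```yaml
{
    "zero_allow_untested_optimizer": true,
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto"
}
```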
Starting with DeepSpeed==0.8.3, if you want to use offload, you'll also need to add the following to the top level configuration because offload works best with DeepSpeed's CPU Adam optimizer.
```yaml
{
    "zero_force_ds_cpu_optimizer": false
}
```
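For instance, a config that offloads the optimizer states to CPU while keeping this flag might look roughly like the sketch below (the ZeRO stage and offload device are illustrative choices, not requirements):

```yaml
{
    "zero_force_ds_cpu_optimizer": false,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        }
    }
}
```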
DeepSpeed supports the LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR learning rate schedulers.
Transformers and DeepSpeed provide two of the same schedulers:
* WarmupLR is the same as `--lr_scheduler_type constant_with_warmup` in Transformers (see the sketch after this list)
* WarmupDecayLR is the same as `--lr_scheduler_type linear` in Transformers (this is the default scheduler used in Transformers)
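For instance, to get the equivalent of `--lr_scheduler_type constant_with_warmup`, the scheduler block would use WarmupLR instead; a sketch with the parameters left on `"auto"`:

```yaml
{
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}
```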
If you don't configure the scheduler in the config, the [`Trainer`] automatically selects WarmupDecayLR and either uses the supplied values or the default values for the following parameters from the command line: `warmup_min_lr`, `warmup_max_lr`, `warmup_num_steps`, `total_num_steps` (automatically calculated during run time if `max_steps` is not provided).
You can set the parameters to "auto" or manually input your own desired values.
```yaml
{
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}
```
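The `"auto"` values are filled in from the corresponding `Trainer` command-line arguments. As a rough sketch, assuming the config above is saved as `ds_config.json` (the file name and the numbers are only placeholders), the mapping looks like this:

```python
from transformers import TrainingArguments

# "auto" entries in ds_config.json are resolved from these arguments:
#   lr / warmup_max_lr -> learning_rate
#   warmup_num_steps   -> warmup_steps
#   total_num_steps    -> max_steps (or computed from the dataset length at runtime)
args = TrainingArguments(
    output_dir="output",
    deepspeed="ds_config.json",  # path to the DeepSpeed config
    learning_rate=3e-5,
    warmup_steps=500,
    max_steps=10_000,
)
```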
## Precision
DeepSpeed supports fp32, fp16, and bf16 mixed precision.
If your model doesn't work well with mixed precision, for example, if it wasn't pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss.
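In that case, one option is to fall back to full fp32 by keeping fp16 disabled in the config. A minimal sketch, assuming the rest of your config is unchanged:

```yaml
{
    "fp16": {
        "enabled": false
    }
}
```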