Learning rate

5.3. Learning rate#

5.3.1. Theory#

The learning rate schedule consists of two phases: an optional warmup phase followed by a decay phase.

5.3.1.1. Warmup phase (optional)#

During the warmup phase (steps \(0 \leq \tau < \tau^{\text{warmup}}\)), the learning rate increases linearly from an initial warmup learning rate to the target starting learning rate:

\[ \gamma(\tau) = \gamma^{\text{warmup}} + \frac{\gamma^0 - \gamma^{\text{warmup}}}{\tau^{\text{warmup}}} \tau,\]

where \(\gamma^{\text{warmup}} = f^{\text{warmup}} \cdot \gamma^0\) is the initial warmup learning rate, \(f^{\text{warmup}} \in [0, 1]\) is the warmup start factor (default 0.0), and \(\tau^{\text{warmup}} \in \mathbb{N}\) is the number of warmup steps.

5.3.1.2. Decay phase#

After the warmup phase (steps \(\tau \geq \tau^{\text{warmup}}\)), the learning rate decays according to the selected schedule type.

Exponential decay (type: "exp"):

The learning rate decays exponentially:

\[ \gamma(\tau) = \gamma^0 r ^ {\lfloor (\tau - \tau^{\text{warmup}})/s \rfloor},\]

where \(\tau \in \mathbb{N}\) is the index of the training step, \(\gamma^0 \in \mathbb{R}\) is the learning rate at the start of the decay phase (i.e., after warmup), and the decay rate \(r\) is given by

\[ r = {\left(\frac{\gamma^{\text{stop}}}{\gamma^0}\right )} ^{\frac{s}{\tau^{\text{decay}}}},\]

where \(\tau^{\text{decay}} = \tau^{\text{stop}} - \tau^{\text{warmup}}\) is the number of decay steps, \(\tau^{\text{stop}} \in \mathbb{N}\) is the total training steps, \(\gamma^{\text{stop}} \in \mathbb{R}\) is the stopping learning rate, and \(s \in \mathbb{N}\) is the decay steps.

Cosine annealing (type: "cosine"):

The learning rate follows a cosine annealing schedule:

\[ \gamma(\tau) = \gamma^{\text{stop}} + \frac{\gamma^0 - \gamma^{\text{stop}}}{2} \left(1 + \cos\left(\frac{\pi (\tau - \tau^{\text{warmup}})}{\tau^{\text{decay}}}\right)\right),\]

where the learning rate smoothly decreases from \(\gamma^0\) to \(\gamma^{\text{stop}}\) following a cosine curve over the decay phase.

For both schedule types, the stopping learning rate can be specified directly as \(\gamma^{\text{stop}}\) or as a ratio: \(\gamma^{\text{stop}} = \rho^{\text{stop}} \cdot \gamma^0\), where \(\rho^{\text{stop}} \in (0, 1]\) is the stopping learning rate ratio. [1]

5.3.2. Migration Guide#

5.3.2.1. Required parameters for learning rate configuration#

Starting from this version (3.1.3), the learning rate configuration has the following required parameters:

start_lr (required): The learning rate at the start of the decay phase (after warmup). This parameter no longer has a default value and must be explicitly specified in your configuration.
Either stop_lr or stop_lr_ratio (required): You must provide one of these two parameters:
- stop_lr: The target learning rate at the end of training
- stop_lr_ratio: The stopping learning rate as a ratio of start_lr

These parameters are mutually exclusive - you cannot specify both stop_lr and stop_lr_ratio at the same time.

5.3.2.1.1. Migration examples#

Before (legacy configuration):

"learning_rate": {
    "type": "exp",
    "decay_steps": 5000
}

After (updated configuration):

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 1e-6,
    "decay_steps": 5000
}

Or using stop_lr_ratio:

"learning_rate": {
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr_ratio": 1e-3,
    "decay_steps": 5000
}

Note: If you are upgrading from a previous version, please update your configuration files to include explicit values for start_lr and one of stop_lr or stop_lr_ratio. Failure to do so will result in a validation error.

5.3.3. Instructions#

DeePMD-kit supports two types of learning rate schedules: exponential decay (type: "exp") and cosine annealing (type: "cosine"). Both types support optional warmup and can use either absolute stopping learning rate or a ratio-based specification.

5.3.3.1. Exponential decay schedule#

The learning_rate section for exponential decay in input.json is given as follows

    "learning_rate" :{
        "type":        "exp",
        "start_lr":    0.001,
        "stop_lr":     1e-6,
        "decay_steps": 5000,
        "_comment":    "that's all"
    }

5.3.3.1.1. Basic parameters#

The following parameters are available for learning rate configuration.

Common parameters for both exp and cosine types:

start_lr gives the learning rate at the start of the decay phase (i.e., after warmup if enabled). It should be set appropriately based on the model architecture and dataset.
stop_lr gives the target learning rate at the end of the training. It should be small enough to ensure that the network parameters satisfactorily converge. This parameter is mutually exclusive with stop_lr_ratio.
stop_lr_ratio (optional) specifies the stopping learning rate as a ratio of start_lr. For example, stop_lr_ratio: 1e-3 means stop_lr = start_lr * 1e-3. This parameter is mutually exclusive with stop_lr. Either stop_lr or stop_lr_ratio must be provided.

Additional parameters for exp type only:

decay_steps specifies the interval (in training steps) at which the learning rate is decayed. The learning rate is updated every decay_steps steps during the decay phase. If decay_steps exceeds the decay phase steps (num_steps - warmup_steps) and decay_rate is not explicitly provided, it will be automatically adjusted to a sensible default value.
smooth (optional, default: false) controls the decay behavior. When set to false, the learning rate decays in a stepped manner (updated every decay_steps steps). When set to true, the learning rate decays smoothly at every step.

Learning rate formula for exp type:

During the decay phase, the learning rate decays exponentially from start_lr to stop_lr.

Stepped mode (smooth: false, default):

lr(t) = start_lr * decay_rate ^ floor((t - warmup_steps) / decay_steps)

Smooth mode (smooth: true):

lr(t) = start_lr * decay_rate ^ ((t - warmup_steps) / decay_steps)

where t is the current training step and warmup_steps is the number of warmup steps (0 if warmup is not enabled).

The formula for cosine annealing is as follows.

Learning rate formula for cosine type:

For cosine annealing, the learning rate smoothly decreases following a cosine curve:

lr(t) = stop_lr + (start_lr - stop_lr) / 2 * (1 + cos(pi * (t - warmup_steps) / decay_phase_steps))

where decay_phase_steps = numb_steps - warmup_steps is the number of steps in the decay phase.

5.3.3.1.2. Warmup parameters (optional)#

Warmup is a technique to stabilize training in the early stages by gradually increasing the learning rate from a small initial value to the target start_lr. The warmup parameters are optional and can be configured as follows:

warmup_steps (optional, default: 0) specifies the number of steps for learning rate warmup. During warmup, the learning rate increases linearly from warmup_start_factor * start_lr to start_lr. This parameter is mutually exclusive with warmup_ratio.
warmup_ratio (optional) specifies the warmup duration as a ratio of the total training steps. For example, warmup_ratio: 0.1 means the warmup phase will last for 10% of the total training steps. The actual number of warmup steps is computed as int(warmup_ratio * numb_steps). This parameter is mutually exclusive with warmup_steps.
warmup_start_factor (optional, default: 0.0) specifies the factor for the initial warmup learning rate. The warmup learning rate starts from warmup_start_factor * start_lr and increases linearly to start_lr. A value of 0.0 means the learning rate starts from zero.

5.3.3.1.3. Configuration examples#

The following examples demonstrate various learning rate configurations.

Example 1: Basic exponential decay without warmup

    "learning_rate": {
        "type":        "exp",
        "start_lr":    0.001,
        "stop_lr":     1e-6,
        "decay_steps": 5000
    }

Example 2: Using stop_lr_ratio instead of stop_lr

    "learning_rate": {
        "type":          "exp",
        "start_lr":      0.001,
        "stop_lr_ratio": 1e-3,
        "decay_steps":   5000
    }

This is equivalent to setting stop_lr: 1e-6 (i.e., 0.001 * 1e-3).

The following example shows exponential decay with warmup using a specific number of warmup steps.

Example 3: Exponential decay with warmup (using warmup_steps)

    "learning_rate": {
        "type":               "exp",
        "start_lr":           0.001,
        "stop_lr":            1e-6,
        "decay_steps":        5000,
        "warmup_steps":       10000,
        "warmup_start_factor": 0.1
    }

In this example, the learning rate starts from 0.0001 (i.e., 0.1 * 0.001) and increases linearly to 0.001 over the first 10,000 steps. After that, it decays exponentially to 1e-6.

The following example shows exponential decay with warmup using a ratio-based warmup duration.

Example 4: Exponential decay with warmup (using warmup_ratio)

    "learning_rate": {
        "type":          "exp",
        "start_lr":      0.001,
        "stop_lr_ratio": 1e-3,
        "decay_steps":   5000,
        "warmup_ratio":  0.05
    }

In this example, if the total training steps (numb_steps) is 1,000,000, the warmup phase will last for 50,000 steps (i.e., 0.05 * 1,000,000). The learning rate starts from 0.0 (default warmup_start_factor: 0.0) and increases linearly to 0.001 over the first 50,000 steps, then decays exponentially.

The following examples demonstrate cosine annealing configurations.

5.3.3.2. Cosine annealing schedule#

The learning_rate section for cosine annealing in input.json is given as follows

    "learning_rate": {
        "type":     "cosine",
        "start_lr": 0.001,
        "stop_lr":  1e-6
    }

Cosine annealing provides a smooth decay curve that often works well for training neural networks. Unlike exponential decay, it does not require the decay_steps parameter.

The following example shows basic cosine annealing without warmup.

Example 5: Basic cosine annealing without warmup

    "learning_rate": {
        "type":     "cosine",
        "start_lr": 0.001,
        "stop_lr":  1e-6
    }

The following example shows cosine annealing with stop_lr_ratio.

Example 6: Cosine annealing with stop_lr_ratio

    "learning_rate": {
        "type":          "cosine",
        "start_lr":      0.001,
        "stop_lr_ratio": 1e-3
    }

This is equivalent to setting stop_lr: 1e-6 (i.e., 0.001 * 1e-3).

The following example shows cosine annealing with warmup.

Example 7: Cosine annealing with warmup

    "learning_rate": {
        "type":               "cosine",
        "start_lr":           0.001,
        "stop_lr":            1e-6,
        "warmup_steps":       5000,
        "warmup_start_factor": 0.0
    }

In this example, the learning rate starts from 0.0 and increases linearly to 0.001 over the first 5,000 steps, then follows a cosine annealing curve down to 1e-6.

The following example shows exponential decay with smooth mode enabled.

Example 8: Exponential decay with smooth mode

    "learning_rate": {
        "type":        "exp",
        "start_lr":    0.001,
        "stop_lr":     1e-6,
        "decay_steps": 5000,
        "smooth":      true
    }

By setting smooth: true, the learning rate decays smoothly at every step instead of in a stepped manner. This provides a more gradual decay curve similar to PyTorch’s ExponentialLR, whereas the default stepped mode (smooth: false) is similar to PyTorch’s StepLR.