4.2. Advanced options
In this section, we will take $deepmd_source_dir/examples/water/se_e2_a/input.json as an example of the input file.
4.2.1. Learning rate
The learning_rate section in input.json is given as follows:
"learning_rate" :{
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 3.51e-8,
    "decay_steps": 5000,
    "_comment": "that's all"
}
start_lr gives the learning rate at the beginning of the training.
stop_lr gives the learning rate at the end of the training. It should be small enough to ensure that the network parameters satisfactorily converge.
During the training, the learning rate decays exponentially from start_lr to stop_lr following the formula:
\[ \alpha(t) = \alpha_0 \lambda^{t/\tau} \]
where \(t\) is the training step, \(\alpha\) is the learning rate, \(\alpha_0\) is the starting learning rate (set by start_lr), \(\lambda\) is the decay rate, and \(\tau\) is the decay steps (set by decay_steps), i.e.
```
lr(t) = start_lr * decay_rate ^ ( t / decay_steps )
```
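The decay rate is not set directly in input.json. The following is a minimal sketch of how it could be derived, assuming it is chosen so that the formula above reaches stop_lr exactly at the last training step (numb_steps), and assuming the exponent is applied continuously rather than in steps. The helper names `decay_rate` and `learning_rate` are hypothetical and are not part of DeePMD-kit.

```python
import math


def decay_rate(start_lr, stop_lr, decay_steps, numb_steps):
    """Decay rate for which lr(numb_steps) equals stop_lr (assumed relation)."""
    return math.exp(math.log(stop_lr / start_lr) * decay_steps / numb_steps)


def learning_rate(t, start_lr, rate, decay_steps):
    """lr(t) = start_lr * decay_rate ^ (t / decay_steps), as in the formula above."""
    return start_lr * rate ** (t / decay_steps)


# Values taken from the example input: start_lr, stop_lr, decay_steps, numb_steps.
rate = decay_rate(1e-3, 3.51e-8, 5000, 1_000_000)
lr_start = learning_rate(0, 1e-3, rate, 5000)
lr_end = learning_rate(1_000_000, 1e-3, rate, 5000)
print(rate, lr_start, lr_end)
```

Under these assumptions, the example values correspond to a decay rate of about 0.95 applied every 5000 steps.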
4.2.2. Training parameters
Other training parameters are given in the training section.
"training": {
    "training_data": {
        "systems": ["../data_water/data_0/", "../data_water/data_1/", "../data_water/data_2/"],
        "batch_size": "auto"
    },
    "validation_data":{
        "systems": ["../data_water/data_3"],
        "batch_size": 1,
        "numb_btch": 3
    },
    "mixed_precision": {
        "output_prec": "float32",
        "compute_prec": "float16"
    },
    "numb_steps": 1000000,
    "seed": 1,
    "disp_file": "lcurve.out",
    "disp_freq": 100,
    "save_freq": 1000
}
The sections training_data and validation_data give the training dataset and validation dataset, respectively. Taking the training dataset for example, the keys are explained below:
systems provides the paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a list or a str.
At each training step, DeePMD-kit randomly picks batch_size frame(s) from one of the systems. The probability of using a system is by default proportional to the number of batches in the system. More options are available for automatically determining the probability of using systems. One can set the key auto_prob to
"prob_uniform": all systems are used with the same probability.
"prob_sys_size": the probability of using a system is proportional to its size (number of frames).
"prob_sys_size; sidx_0:eidx_0:w_0; sidx_1:eidx_1:w_1;...": the list of systems is divided into blocks. Block i has systems ranging from sidx_i to eidx_i. The probability of using a system from block i is proportional to w_i. Within one block, the probability of using a system is proportional to its size.
An example of using "auto_prob" is given below. The probability of using systems[2] is 0.4, and the sum of the probabilities of using systems[0] and systems[1] is 0.6. If the number of frames in systems[1] is twice that in systems[0], then the probability of using systems[1] is 0.4 and that of systems[0] is 0.2.
"training_data": {
    "systems": ["../data_water/data_0/", "../data_water/data_1/", "../data_water/data_2/"],
    "auto_prob": "prob_sys_size; 0:2:0.6; 2:3:0.4",
    "batch_size": "auto"
}
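The arithmetic behind this example can be reproduced with a short sketch. It assumes that each block covers systems sidx_i up to but not including eidx_i (which is what the example implies, since 2:3 selects only systems[2]) and that the block weights are normalized to sum to one. The function `system_probabilities` and the frame counts are hypothetical and serve only as an illustration; this is not DeePMD-kit's own code.

```python
def system_probabilities(auto_prob, n_frames):
    """Return one probability per system for a block-style auto_prob string."""
    # Drop the leading "prob_sys_size" token and keep the "sidx:eidx:w" blocks.
    blocks = [part.strip() for part in auto_prob.split(";")[1:] if part.strip()]
    probs = [0.0] * len(n_frames)
    for block in blocks:
        sidx, eidx, weight = block.split(":")
        sidx, eidx, weight = int(sidx), int(eidx), float(weight)
        block_frames = sum(n_frames[sidx:eidx])
        for i in range(sidx, eidx):
            # Within a block, share the block weight in proportion to system size.
            probs[i] = weight * n_frames[i] / block_frames
    total = sum(probs)
    return [p / total for p in probs]


# Hypothetical frame counts: systems[1] has twice as many frames as systems[0].
probs = system_probabilities("prob_sys_size; 0:2:0.6; 2:3:0.4", [100, 200, 50])
print(probs)
```

With these frame counts the result is approximately 0.2, 0.4 and 0.4, matching the probabilities quoted in the text.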
The probability of using systems can also be specified explicitly with the key sys_probs, which is a list whose length equals the number of systems. For example:
"training_data": {
    "systems": ["../data_water/data_0/", "../data_water/data_1/", "../data_water/data_2/"],
    "sys_probs": [0.5, 0.3, 0.2],
    "batch_size": "auto:32"
}
The key batch_size specifies the number of frames used to train or validate the model in a training step. It can be set to
list: the length of which is the same as that of systems. The batch size of each system is given by the elements of the list.
int: all systems use the same batch size.
"auto": the same as "auto:32", see "auto:N".
"auto:N": automatically determines the batch size so that batch_size times the number of atoms in the system is no less than N.
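The "auto:N" rule can be illustrated with a small sketch. The text only states the lower bound (batch_size times the number of atoms is no less than N); the sketch assumes that the smallest batch size satisfying this bound is the one chosen. The function `auto_batch_size` is a hypothetical helper, not a DeePMD-kit API.

```python
import math


def auto_batch_size(n_atoms, n=32):
    """Smallest batch size such that batch_size * n_atoms >= n (assumed choice)."""
    return max(1, math.ceil(n / n_atoms))


# A system with 192 atoms already exceeds N = 32 with a single frame.
large_system = auto_batch_size(192)
# A system with 6 atoms needs 6 frames, since 5 * 6 = 30 is below 32.
small_system = auto_batch_size(6)
print(large_system, small_system)
```

Under this assumption, small systems get larger batches, so each training step sees a comparable number of atoms across systems.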
The key numb_btch in validation_data gives the number of batches used for model validation. Note that the batches may not be from the same system.
The section mixed_precision specifies the mixed precision settings, which will enable the mixed precision training workflow for DeePMD-kit. The keys are explained below:
output_prec: precision used in the output tensors; only float32 is supported currently.
compute_prec: precision used in the computing tensors; only float16 is supported currently.
Note there are several limitations of mixed precision training:
Only the se_e2_a type descriptor is supported by the mixed precision training workflow.
The precision of the embedding net and the fitting net is forced to be set to float32.
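These restrictions can be checked before starting a run. The sketch below is only an illustration: `check_mixed_precision` is a hypothetical helper, and it assumes that the descriptor type is stored under model.descriptor.type in the input file.

```python
def check_mixed_precision(config):
    """Return a list of problems found; an empty list means the checks pass."""
    problems = []
    mixed = config.get("training", {}).get("mixed_precision")
    if mixed is None:
        # Mixed precision is not requested, so there is nothing to check.
        return problems
    if mixed.get("output_prec") != "float32":
        problems.append("output_prec must be float32")
    if mixed.get("compute_prec") != "float16":
        problems.append("compute_prec must be float16")
    # Assumed location of the descriptor type in the input file.
    descriptor_type = config.get("model", {}).get("descriptor", {}).get("type")
    if descriptor_type != "se_e2_a":
        problems.append("descriptor type must be se_e2_a")
    return problems


config = {
    "model": {"descriptor": {"type": "se_e2_a"}},
    "training": {
        "mixed_precision": {"output_prec": "float32", "compute_prec": "float16"}
    },
}
problems = check_mixed_precision(config)
print(problems)
```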
Other keys in the training section are explained below:
numb_steps: the number of training steps.
seed: the random seed for getting frames from the training data set.
disp_file: the file for printing the learning curve.
disp_freq: the frequency of printing the learning curve, in units of training steps.
save_freq: the frequency of saving checkpoints.
4.2.3. Options and environment variables
Several command line options can be passed to dp train, which can be checked with
$ dp train --help
An explanation will be provided:
positional arguments:
  INPUT                 the input json database
optional arguments:
  -h, --help            show this help message and exit
  --init-model INIT_MODEL
                        Initialize a model by the provided checkpoint
  --restart RESTART     Restart the training from the provided checkpoint
  --init-frz-model INIT_FRZ_MODEL
                        Initialize the training from the frozen model.
  --skip-neighbor-stat  Skip calculating neighbor statistics. Sel checking, automatic sel, and model compression will be disabled. (default: False)
--init-model model.ckpt initializes the model training with an existing model that is stored in the checkpoint model.ckpt; the network architectures should match.
--restart model.ckpt continues the training from the checkpoint model.ckpt.
--init-frz-model frozen_model.pb initializes the training with an existing model that is stored in frozen_model.pb.
--skip-neighbor-stat skips calculating neighbor statistics if one is concerned about performance. Some features will be disabled.
To maximize the performance, one should follow FAQ: How to control the parallelism of a job to control the number of threads.
One can set other environmental variables:
| Environment variable | Allowed value | Default value | Usage |
| --- | --- | --- | --- |
| DP_INTERFACE_PREC | high, low | high | Control high (double) or low (float) precision of training. |
| DP_AUTO_PARALLELIZATION | 0, 1 | 0 | Enable auto parallelization for CPU operators. |
| DP_JIT | 0, 1 | 0 | Enable JIT. Note that this option may either improve or decrease the performance. Requires a TensorFlow build that supports JIT. |
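One way to apply these variables from a Python driver script is to build the environment explicitly and validate the values first. The sketch below is an assumption-laden illustration: `training_env` and the `ALLOWED` table are hypothetical, and the allowed values for DP_INTERFACE_PREC are taken from its usage description (high or low).

```python
import os

# Allowed values per variable, as described in the table above.
ALLOWED = {
    "DP_INTERFACE_PREC": {"high", "low"},
    "DP_AUTO_PARALLELIZATION": {"0", "1"},
    "DP_JIT": {"0", "1"},
}


def training_env(overrides, base=None):
    """Return a copy of the environment with validated DP_* settings applied."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if name not in ALLOWED:
            raise KeyError(f"unknown variable: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"{name} does not accept {value!r}")
        env[name] = value
    return env


# An empty base is used here only to keep the printed result small.
env = training_env({"DP_INTERFACE_PREC": "low", "DP_JIT": "1"}, base={})
print(env)
```

In practice one would keep the default base (the current process environment) and pass the result to a launcher, for example `subprocess.run(["dp", "train", "input.json"], env=env)`.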
4.2.4. Adjust sel of a frozen model
One can use the --init-frz-model feature to adjust (increase or decrease) sel of an existing model. Firstly, one needs to adjust sel in input.json. For example, adjust from [46, 92] to [23, 46].
"model": {
    "descriptor": {
        "sel": [23, 46]
    }
}
To obtain the new model at once, numb_steps should be set to zero:
"training": {
    "numb_steps": 0
}
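The two edits to input.json described above can also be scripted. The following is a minimal sketch using only the standard library; `adjust_sel` is a hypothetical helper and the dictionary stands in for a parsed input.json, of which only the relevant keys are shown.

```python
import copy
import json


def adjust_sel(config, new_sel):
    """Return a copy of the config with sel replaced and numb_steps set to 0."""
    updated = copy.deepcopy(config)
    updated.setdefault("model", {}).setdefault("descriptor", {})["sel"] = list(new_sel)
    # Zero training steps, so the new model is obtained at once.
    updated.setdefault("training", {})["numb_steps"] = 0
    return updated


# Stand-in for the parsed input.json.
original = {
    "model": {"descriptor": {"sel": [46, 92]}},
    "training": {"numb_steps": 1000000},
}
adjusted = adjust_sel(original, [23, 46])
print(json.dumps(adjusted, indent=4))
```

Working on a copy leaves the original settings untouched, so the same input can still be used for regular training.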
Then, one can initialize the training from the frozen model and freeze the new model at once:
dp train input.json --init-frz-model frozen_model.pb
dp freeze -o frozen_model_adjusted_sel.pb
The two models should give the same result when the input satisfies both constraints.
Note: At this time, this feature is only supported by the se_e2_a descriptor with set_davg_true enabled, or a hybrid descriptor composed of the above descriptors.