11.2. How to control the parallelism of a job?
DeePMD-kit has three levels of parallelism. To get the best performance, one should control the number of threads used by DeePMD-kit. One should make sure the product of the parallel numbers is less than or equal to the number of cores available.
11.2.1. MPI (optional)
Parallelism for MPI is optional and used for multiple nodes, multiple GPU cards, or sometimes multiple CPU cores.
To enable MPI support for training, one should install horovod in advance. Note that the parallelism mode is data parallelism, so it is not expected to see the training time per batch decreases.
MPI support for inference is not directly supported by DeePMD-kit, but indirectly supported by the third-party software. For example, LAMMPS enables running simulations in parallel using the MPI parallel communication standard with distributed data. That software has to build against MPI.
Set the number of processes with:
mpirun -np $num_nodes dp
mpirun here should be the same as the MPI used to build software. For example, one can use
mpirun --version and
lmp -h to see if
mpirun and LAMMPS has the same MPI version.
$num_nodes and the nodes information can be directly given by the HPC scheduler system, if the MPI used here is the same as the MPI used to build the scheduler system. Otherwise, one have to manually assign these information.
11.2.2. Parallelism between independent operators
For CPU devices, TensorFlow use multiple streams to run independent operators (OP).
However, for GPU devices, TensorFlow uses only one compute stream and multiple copy streams. Note that some of DeePMD-kit OPs do not have GPU support, so it is still encouraged to set environmental variables even if one has a GPU.
11.2.3. Parallelism within an individual operators
For CPU devices,
TF_INTRA_OP_PARALLELISM_THREADS controls parallelism within TensorFlow native OPs when TensorFlow is built against Eigen.
OMP_NUM_THREADS is threads for OpenMP parallelism. It controls parallelism within TensorFlow native OPs when TensorFlow is built by Intel OneDNN and DeePMD-kit custom CPU OPs. It may also control parallelsim for NumPy when NumPy is built against OpenMP, so one who uses GPUs for training should also care this environmental variable.
There are several other environmental variables for OpenMP, such as
KMP_BLOCKTIME. See Intel documentation for detailed information.
11.2.4. Tune the performance
There is no one general parallel configuration that works for all situations, so you are encouraged to tune parallel configurations yourself after empirical testing.
Here are some empirical examples. If you wish to use 3 cores of 2 CPUs on one node, you may set the environmental variables and run DeePMD-kit as follows:
dp train input.json
For a node with 128 cores, it is recommended to start with the following variables:
Again, in general, one should make sure the product of the parallel numbers is less than or equal to the number of cores available. In the above case, \(16 \times 8 = 128\), so threads will not compete with each other.