Running the DeePMD-kit on the Expanse cluster

Expanse is a cluster operated by the San Diego Supercomputer Center. Here we provide an example of running jobs on Expanse.

The machine parameters are provided below. Expanse uses the Slurm workload manager for job scheduling. The remote_root directory has been created in advance. It is worth mentioning that we do not recommend password authentication, so SSH keys are used instead to improve security.

{
  "batch_type": "Slurm",
  "local_root": "./",
  "remote_root": "/expanse/lustre/scratch/njzjz/temp_project/dpgen_workdir",
  "clean_asynchronously": true,
  "context_type": "SSHContext",
  "remote_profile": {
    "hostname": "login.expanse.sdsc.edu",
    "username": "njzjz",
    "port": 22
  }
}
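
As a quick sanity check, the dictionary above can be loaded through DPDispatcher's Python interface. A minimal sketch, assuming the parameters are saved locally as machine.json (the file name is only an illustration):

# Load the machine parameters with DPDispatcher (sketch).
import json

from dpdispatcher import Machine

with open("machine.json") as f:
    machine_dict = json.load(f)

machine = Machine.load_from_dict(machine_dict)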

Expanse’s standard compute nodes are each powered by two 64-core AMD EPYC 7742 processors and contain 256 GB of DDR4 memory. Here, we request one node with 32 cores and 16 GB of memory from the shared partition. Expanse does not support the --gres=gpu:0 option, so we use custom_gpu_line to customize the GPU request line in the submission script.

{
  "number_node": 1,
  "cpu_per_node": 1,
  "gpu_per_node": 0,
  "queue_name": "shared",
  "group_size": 1,
  "custom_flags": [
    "#SBATCH -c 32",
    "#SBATCH --mem=16G",
    "#SBATCH --time=48:00:00",
    "#SBATCH --account=rut149",
    "#SBATCH --requeue"
  ],
  "source_list": [
    "activate /home/njzjz/deepmd-kit"
  ],
  "envs": {
    "OMP_NUM_THREADS": 4,
    "TF_INTRA_OP_PARALLELISM_THREADS": 4,
    "TF_INTER_OP_PARALLELISM_THREADS": 8,
    "DP_AUTO_PARALLELIZATION": 1
  },
  "batch_type": "Slurm",
  "kwargs": {
    "custom_gpu_line": "#SBATCH --gpus=0"
  }
}
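
The resources dictionary can be loaded in the same way. A minimal sketch, assuming it is saved locally as resources.json (again, the file name is only an illustration):

# Load the resources with DPDispatcher (sketch).
import json

from dpdispatcher import Resources

with open("resources.json") as f:
    resources_dict = json.load(f)

resources = Resources.load_from_dict(resources_dict)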

The following task parameter runs a DeePMD-kit task, forwarding the input file and backwarding the graph files. Here, the data set is shared among all tasks, so it is not included in forward_files; instead, it should be included in the submission’s forward_common_files.

{
  "command": "dp train input.json && dp freeze && dp compress",
  "task_work_path": "model1/",
  "forward_files": [
    "input.json"
  ],
  "backward_files": [
    "frozen_model.pb",
    "frozen_model_compressed.pb"
  ],
  "outlog": "log",
  "errlog": "err"
}
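
Putting the three dictionaries together, the task can be submitted from Python roughly as follows. This is a sketch, not a definitive recipe: the file names machine.json, resources.json, and task.json, as well as the name of the shared data directory ("data"), are assumptions for illustration.

# Submit the DeePMD-kit task with DPDispatcher (sketch).
import json

from dpdispatcher import Machine, Resources, Submission, Task


def load(path):
    with open(path) as f:
        return json.load(f)


machine = Machine.load_from_dict(load("machine.json"))
resources = Resources.load_from_dict(load("resources.json"))
task = Task.load_from_dict(load("task.json"))

submission = Submission(
    # local directory that contains model1/ and the shared data set
    work_base="./",
    machine=machine,
    resources=resources,
    task_list=[task],
    # the data set is shared among all tasks, so it is uploaded once here
    forward_common_files=["data"],
    backward_common_files=[],
)
submission.run_submission()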