deepmd.tf.train.run_options

Module taking care of important package constants.

Module Contents

Classes

RunOptions

Class with info on how to run training (cluster, MPI and GPU config).

class deepmd.tf.train.run_options.RunOptions(init_model: str | None = None, init_frz_model: str | None = None, finetune: str | None = None, restart: str | None = None, log_path: str | None = None, log_level: int = 0, mpi_log: str = 'master')

Class with info on how to run training (cluster, MPI and GPU config).

Attributes:
gpus: Optional[List[int]]

list of GPUs if any are present else None

is_chief: bool

in distributed training it is true for the main MPI process; in serial training it is always true

world_size: int

total worker count

my_rank: int

index of the MPI task

nodename: str

name of the node

node_list_: List[str]

the list of nodes of the current mpirun

my_device: str

device type - gpu or cpu

property is_chief

Whether my rank is 0.
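
The rule above can be sketched in a few lines. This is an illustration only, not the library's code: the class name `RankInfo` is hypothetical, and only the names `my_rank` and `is_chief` come from this page.

```python
from dataclasses import dataclass


@dataclass
class RankInfo:
    """Hypothetical stand-in for the rank-related part of RunOptions."""

    my_rank: int = 0  # assumption: a serial run is treated as rank 0

    @property
    def is_chief(self) -> bool:
        # The chief is the process whose rank is 0.
        return self.my_rank == 0
```

With this sketch, `RankInfo().is_chief` is true, which matches the statement that a serial run is always the chief, while `RankInfo(my_rank=3).is_chief` is false.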

gpus: List[int] | None
world_size: int
my_rank: int
nodename: str
nodelist: List[int]
my_device: str
_HVD: horovod.tensorflow | None
_log_handles_already_set: bool = False
print_resource_summary()

Print build and current running cluster configuration summary.

_setup_logger(log_path: pathlib.Path | None, log_level: int, mpi_log: str | None)

Set up package loggers.

Parameters:
log_level: int

logging level

log_path: Optional[str]

path to log file; if None, logs will be sent only to console. If the parent directory does not exist, it will be created automatically. By default None

mpi_log: Optional[str], optional

MPI log type, one of three options. master writes logs to file and console only from rank==0. collect writes messages from all ranks to one file opened under rank==0 and to console. workers opens one log file for each worker, designated by its rank; console behaviour is the same as for collect.
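
The parameter descriptions above can be read as a small decision table. The following is a minimal sketch of that table using only the standard library; it is not the package's implementation. The function names are hypothetical, the per-rank file naming for workers (rank appended to the file name) is an assumption, and the forwarding of messages to rank 0 in collect mode is not modelled.

```python
import logging
from pathlib import Path
from typing import Optional


def logs_to_console(mpi_log: str, rank: int) -> bool:
    """master prints only from rank 0; collect and workers print from every rank."""
    return mpi_log != "master" or rank == 0


def plan_log_file(log_path: Optional[Path], mpi_log: str, rank: int) -> Optional[Path]:
    """Return the file this rank would open itself, or None for no file."""
    if log_path is None:
        return None  # console-only logging
    if mpi_log in ("master", "collect"):
        # Only rank 0 opens the file; in collect mode the other ranks'
        # messages would be forwarded to it (not modelled here).
        return log_path if rank == 0 else None
    if mpi_log == "workers":
        # Assumed naming scheme: append the rank to the file name.
        return log_path.with_name(f"{log_path.name}.{rank}")
    raise ValueError(f"unknown mpi_log mode: {mpi_log!r}")


def open_log_file(path: Path) -> logging.FileHandler:
    """Create missing parent directories, then open a file handler."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return logging.FileHandler(path, mode="w")
```

Under these assumptions, `plan_log_file(Path("out/train.log"), "workers", 2)` gives `out/train.log.2`, while in master mode every rank other than 0 gets neither a file nor console output.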

_try_init_distrib()
_init_distributed(HVD: RunOptions._init_distributed.HVD)

Initialize settings for distributed training.

Parameters:
HVD: HVD

horovod object

_init_serial()

Initialize settings for serial training.
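
The three initialization methods above suggest a "try distributed, otherwise serial" pattern. The sketch below shows one way such a fallback could look; it is an assumption about the general shape and not a copy of the package's code. The function `init_run_state` and the returned dictionary are hypothetical, and using the host name as `nodename` is likewise an assumption. `init()`, `size()` and `rank()` are existing Horovod calls.

```python
import importlib
import socket
from typing import Any, Dict


def init_run_state(module_name: str = "horovod.tensorflow") -> Dict[str, Any]:
    """Try a distributed backend first and fall back to serial settings."""
    try:
        hvd = importlib.import_module(module_name)
        hvd.init()
        distributed = hvd.size() > 1
    except ImportError:
        # Backend not installed: run serially.
        hvd, distributed = None, False

    if distributed:
        world_size, my_rank = hvd.size(), hvd.rank()
    else:
        world_size, my_rank = 1, 0  # serial: one worker with rank 0

    return {
        "world_size": world_size,
        "my_rank": my_rank,
        "is_chief": my_rank == 0,
        "nodename": socket.gethostname(),  # assumption: host name as node name
        "hvd": hvd if distributed else None,
    }
```

Calling `init_run_state("no_such_backend")` exercises the serial branch: the import fails, so the result reports a world size of 1, rank 0, and a true chief flag.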