deepmd.tf.train.run_options

Module defining RunOptions, which holds the information on how to run training (cluster, MPI and GPU configuration).

Classes

RunOptions

Class with info on how to run training (cluster, MPI and GPU config).

Module Contents

class deepmd.tf.train.run_options.RunOptions(init_model: str | None = None, init_frz_model: str | None = None, finetune: str | None = None, restart: str | None = None, log_path: str | None = None, log_level: int = 0, mpi_log: str = 'master')

Class with info on how to run training (cluster, MPI and GPU config).

Attributes:

gpus: Optional[list[int]]: list of GPUs if any are present, else None
is_chief: bool: in distributed training, True for the main MPI process; in serial training, always True
world_size: int: total worker count
my_rank: int: index of the MPI task
nodename: str: name of the node
nodelist: list[str]: the list of nodes of the current mpirun
my_device: str: device type, gpu or cpu
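The attributes above can be mirrored by a plain container to show how is_chief follows from my_rank. The sketch below is a hypothetical RunInfo dataclass, not the deepmd class; it assumes only what the list states: rank 0 is the main process, and a serial run has a single worker, so it is always the chief.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunInfo:
    """Hypothetical stand-in for the RunOptions attributes (not the deepmd class)."""

    gpus: Optional[list[int]]
    world_size: int
    my_rank: int
    nodename: str
    nodelist: list[str] = field(default_factory=list)
    my_device: str = "cpu"

    @property
    def is_chief(self) -> bool:
        # Rank 0 is the chief; a serial run only has rank 0.
        return self.my_rank == 0


# Serial run: one worker with rank 0, so it is always the chief.
serial = RunInfo(gpus=None, world_size=1, my_rank=0, nodename="node01", nodelist=["node01"])
# Third of four MPI tasks in a distributed run: not the chief.
worker = RunInfo(gpus=[0, 1], world_size=4, my_rank=2, nodename="node02", my_device="gpu")
print(serial.is_chief, worker.is_chief)  # True False
```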

gpus: list[int] | None

world_size: int

my_rank: int

nodename: str

nodelist: list[int]

my_device: str

_HVD: horovod.tensorflow | None

_log_handles_already_set: bool = False
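The _log_handles_already_set flag, False by default, suggests that logger handlers are configured at most once. A minimal sketch of that guard pattern with the standard logging module follows; the LoggerSetup class and its setup method are hypothetical, and reading the flag this way is an assumption based on its name.

```python
import logging


class LoggerSetup:
    """Hypothetical helper that attaches handlers only once per process."""

    # Class-level flag shared by all instances, mirroring the documented default.
    _log_handles_already_set: bool = False

    def setup(self, logger: logging.Logger) -> None:
        if LoggerSetup._log_handles_already_set:
            # Handlers exist already; adding more would print every message twice.
            return
        logger.addHandler(logging.StreamHandler())
        LoggerSetup._log_handles_already_set = True


demo = logging.getLogger("run_options_sketch")
LoggerSetup().setup(demo)
LoggerSetup().setup(demo)  # no-op: the flag is already set
print(len(demo.handlers))  # 1
```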

restart = None

init_model = None

init_frz_model = None

finetune = None

init_mode = 'init_from_scratch'
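The attributes above default to no restart or initialization file and to init_mode = 'init_from_scratch'. One plausible way to derive the mode from the restart, init_model, init_frz_model and finetune constructor arguments is sketched below; the resolve_init_mode helper is hypothetical, and its precedence order and every mode name other than 'init_from_scratch' are assumptions made for illustration.

```python
from typing import Optional


def resolve_init_mode(
    restart: Optional[str] = None,
    init_model: Optional[str] = None,
    init_frz_model: Optional[str] = None,
    finetune: Optional[str] = None,
) -> str:
    """Hypothetical helper: pick an init mode from the first option that is set.

    The precedence order and all mode names except 'init_from_scratch'
    are illustrative assumptions.
    """
    candidates = [
        ("restart", restart),
        ("init_from_model", init_model),
        ("init_from_frz_model", init_frz_model),
        ("finetune", finetune),
    ]
    for mode, path in candidates:
        if path is not None:
            return mode
    # No checkpoint or model to start from: the documented default.
    return "init_from_scratch"


print(resolve_init_mode())                            # init_from_scratch
print(resolve_init_mode(init_frz_model="frozen.pb"))  # init_from_frz_model
```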

property is_chief: Whether my_rank is 0, i.e. whether this process is the chief.

print_resource_summary() → None: Print a summary of the build and of the currently running cluster configuration.

_setup_logger(log_path: pathlib.Path | None, log_level: int, mpi_log: str | None) → None

Set up package loggers.

Parameters:

log_level: int: logging level
log_path: Optional[str]: path to the log file; if None, logs are sent only to the console. If the parent directory does not exist, it is created automatically. By default None.
mpi_log: Optional[str], optional: MPI log type, one of three options. master writes logs to file and console from rank 0 only. collect writes messages from all ranks to a single file opened by rank 0, and to the console. workers opens one log file per worker, identified by its rank; console behaviour is the same as for collect.
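The three mpi_log modes described above differ only in which ranks log and where their messages end up. The sketch below encodes one reading of those descriptions as a plain function, without MPI; plan_log_targets, LogTargets and prepare_log_dir are hypothetical names, and the per-rank file name used for workers is an assumption.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class LogTargets(NamedTuple):
    console: bool             # does this rank emit to the console?
    log_file: Optional[Path]  # file this rank's messages end up in, if any


def prepare_log_dir(log_path: Optional[Path]) -> None:
    """Create the parent directory of the log file if it is missing."""
    if log_path is not None:
        log_path.parent.mkdir(parents=True, exist_ok=True)


def plan_log_targets(rank: int, mpi_log: str, log_path: Optional[Path]) -> LogTargets:
    """Hypothetical reading of the documented master / collect / workers modes."""
    if mpi_log == "master":
        # Only rank 0 logs, both to file and to console.
        if rank != 0:
            return LogTargets(console=False, log_file=None)
        return LogTargets(console=True, log_file=log_path)
    if mpi_log == "collect":
        # Every rank's messages go to the single file opened by rank 0.
        return LogTargets(console=True, log_file=log_path)
    if mpi_log == "workers":
        # One file per worker, distinguished by rank (naming is an assumption).
        per_rank = None if log_path is None else log_path.with_name(f"{log_path.name}.rank{rank}")
        return LogTargets(console=True, log_file=per_rank)
    raise ValueError(f"unknown mpi_log mode: {mpi_log!r}")


log = Path("logs") / "train.log"
print(plan_log_targets(1, "master", log))                 # LogTargets(console=False, log_file=None)
print(plan_log_targets(2, "workers", log).log_file.name)  # train.log.rank2
```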

_try_init_distrib() → None

_init_distributed(HVD: RunOptions._init_distributed.HVD) → None

Initialize settings for distributed training.

Parameters:

HVD: HVD: horovod object
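A minimal sketch of what distributed initialization involves, assuming a Horovod-like object exposing rank(), size() and local_rank() (methods horovod.tensorflow provides). The distributed_settings helper is hypothetical, and mapping each local rank to one GPU, the gpu:<index> device string and the error on too many local processes are assumptions for illustration; the FakeHVD stub stands in for Horovod so the example runs without MPI.

```python
from typing import Optional


class FakeHVD:
    """Stub exposing the Horovod methods used below, so the sketch runs without MPI."""

    def __init__(self, rank: int, size: int, local_rank: int) -> None:
        self._rank, self._size, self._local_rank = rank, size, local_rank

    def rank(self) -> int:
        return self._rank

    def size(self) -> int:
        return self._size

    def local_rank(self) -> int:
        return self._local_rank


def distributed_settings(hvd, nodename: str, nodelist: list[str], gpus: Optional[list[int]]) -> dict:
    """Hypothetical helper filling RunOptions-style attributes from Horovod."""
    settings = {
        "nodename": nodename,
        "nodelist": nodelist,
        "gpus": gpus,
        "my_rank": hvd.rank(),     # global index of this MPI task
        "world_size": hvd.size(),  # total worker count
    }
    if gpus is None:
        settings["my_device"] = "cpu"
    else:
        # Assumed policy: one GPU per process on a node, chosen by local rank.
        local = hvd.local_rank()
        if local >= len(gpus):
            raise RuntimeError("more local processes than available GPUs")
        settings["my_device"] = f"gpu:{local}"
    return settings


hvd = FakeHVD(rank=3, size=4, local_rank=1)
print(distributed_settings(hvd, "node02", ["node01", "node02"], gpus=[0, 1])["my_device"])  # gpu:1
```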

_init_serial() → None: Initialize settings for serial training.
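For serial training the attributes follow directly from there being a single worker. The serial_settings helper below is a hypothetical sketch; setting nodelist to the current node only and the cpu / gpu:0 device strings are assumptions, while world_size 1 and rank 0 follow from the attribute descriptions (total worker count, and serial runs always being the chief).

```python
from typing import Optional


def serial_settings(nodename: str, gpus: Optional[list[int]]) -> dict:
    """Hypothetical helper: RunOptions-style attributes for a single-process run."""
    return {
        "gpus": gpus,
        "world_size": 1,         # a serial run has exactly one worker
        "my_rank": 0,            # whose index is 0, so it is the chief
        "nodename": nodename,
        "nodelist": [nodename],  # assumption: the only node is the current one
        "my_device": "cpu" if gpus is None else "gpu:0",  # assumed device naming
    }


print(serial_settings("node01", None)["my_device"])    # cpu
print(serial_settings("node01", [0, 1])["my_device"])  # gpu:0
```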