deepmd.tf.utils.data

Alias for backward compatibility.

Classes#

DeepmdData

Class for a data system.

Module Contents#

class deepmd.tf.utils.data.DeepmdData(sys_path: str, set_prefix: str = 'set', shuffle_test: bool = True, type_map: list[str] | None = None, optional_type_map: bool = True, modifier: Any | None = None, trn_all_set: bool = False, sort_atoms: bool = True)[source]#

Class for a data system.

It loads data from disk and maintains it in a data_dict.

Parameters:

sys_path: Path to the data system
set_prefix: Prefix for the directories of different sets
shuffle_test: Whether the test data are shuffled
type_map: Names of the atom types
optional_type_map: Whether type_map.raw is optional in each system
modifier: Data modifier that has the method modify_data
trn_all_set: Deprecated. All sets are now used for both training and testing.
sort_atoms (bool): Sort atoms by atom type. Must be enabled when the data is fed directly to descriptors, except for mixed types.
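The effect of sort_atoms can be pictured with a small sketch. It assumes that sorting is a stable reordering by type index recorded in an index map, in the spirit of the idx_map attribute listed below. The helper names make_idx_map and apply_idx_map are hypothetical and not part of deepmd.

```python
def make_idx_map(atom_type):
    """Return indices that reorder atoms so that equal types are contiguous.

    sorted() is stable, so atoms of the same type keep their original order.
    """
    return sorted(range(len(atom_type)), key=lambda i: atom_type[i])


def apply_idx_map(per_atom_values, idx_map):
    """Reorder one frame's per-atom values with the index map."""
    return [per_atom_values[i] for i in idx_map]


# Hypothetical system with two types (0 and 1) in arbitrary order.
atom_type = [1, 0, 1, 0, 0]
idx_map = make_idx_map(atom_type)
sorted_types = apply_idx_map(atom_type, idx_map)
```

The same index map would be applied to every per-atom quantity (coordinates, forces, and so on) so that all arrays stay aligned with the sorted types.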

dirs#

mixed_type#

atom_type#

natoms#

type_map#

pbc = True#

enforce_type_map = False#

sort_atoms = True#

idx_map#

data_dict#

set_count = 0#

iterator = 0#

shuffle_test = True#

modifier = None#

nframes#

prefix_sum#

use_modifier_cache = True#

add(key: str, ndof: int, atomic: bool = False, must: bool = False, high_prec: bool = False, type_sel: list[int] | None = None, repeat: int = 1, default: float = 0.0, dtype: numpy.dtype | None = None, output_natoms_for_type_sel: bool = False) → DeepmdData[source]#

Add a data item to be loaded.

Parameters:

key: The key of the item. The corresponding data is stored in sys_path/set.*/key.npy
ndof: The number of degrees of freedom (dof)
atomic: Whether the item is an atomic property. If False, the size of the data should be nframes x ndof. If True, the size of the data should be nframes x natoms x ndof.
must: The data file sys_path/set.*/key.npy must exist. If must is False and the data file does not exist, data_dict[find_key] is set to 0.0.
high_prec: If True, load and store the data in float64; otherwise in float32
type_sel: Select atoms of certain types
repeat: The data will be repeated repeat times.
default (float, default=0.0): Default value of the data
dtype (np.dtype, optional): The dtype of the data. Overrides high_prec if provided.
output_natoms_for_type_sel (bool, optional): If True and type_sel is set, the atomic dimension will be natoms instead of nsel
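Because the signature shows add returning a DeepmdData, calls can presumably be chained. The sketch below illustrates that pattern without importing deepmd: RecordingData is a hypothetical stand-in that only records its arguments, register_labels is a hypothetical helper, and the keys and ndof values (energy, force, virial) are chosen for illustration only.

```python
class RecordingData:
    """Hypothetical stand-in that mimics only the chaining behaviour of add()."""

    def __init__(self):
        self.items = {}

    def add(self, key, ndof, atomic=False, must=False, high_prec=False):
        # Record what would be loaded from sys_path/set.*/<key>.npy.
        self.items[key] = {
            "ndof": ndof,
            "atomic": atomic,
            "must": must,
            "high_prec": high_prec,
        }
        return self  # returning the object is what makes chaining possible


def register_labels(data):
    """Register three illustrative items on any object exposing add()."""
    return (
        data.add("energy", 1, atomic=False, must=False, high_prec=True)
        .add("force", 3, atomic=True, must=False, high_prec=False)
        .add("virial", 9, atomic=False, must=False, high_prec=False)
    )


data = register_labels(RecordingData())
```

With atomic=True the per-frame size scales with the number of atoms (nframes x natoms x ndof); with atomic=False it is nframes x ndof, as described in the parameter list above.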

reduce(key_out: str, key_in: str) → DeepmdData[source]#

Generate a new item from the reduction of another item.

Parameters:

key_out: The name of the reduced item
key_in: The name of the data item to be reduced
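As a rough illustration of what such a reduction could look like, the sketch below assumes the reduction is a sum over atoms that turns an atomic item (natoms x ndof values per frame) into a frame-level item (ndof values per frame). The function reduce_over_atoms and the sample values are hypothetical; the actual operation is defined by the library source.

```python
def reduce_over_atoms(atomic_values, natoms, ndof):
    """Sum flat per-frame lists of natoms * ndof values over the atom axis."""
    reduced = []
    for frame in atomic_values:
        if len(frame) != natoms * ndof:
            raise ValueError("frame length must equal natoms * ndof")
        # Component d of the result is the sum of component d over all atoms.
        reduced.append(
            [sum(frame[a * ndof + d] for a in range(natoms)) for d in range(ndof)]
        )
    return reduced


# Two frames, three atoms, one value per atom (for example an atomic energy).
atom_ener = [[-1.0, -2.0, -3.0], [0.5, 0.25, 0.25]]
energy = reduce_over_atoms(atom_ener, natoms=3, ndof=1)
```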

get_data_dict() → dict[source]#: Get the data_dict.

check_batch_size(batch_size: int) → bool[source]#: Check if the system can get a batch of data with batch_size frames.

check_test_size(test_size: int) → bool[source]#: Check if the system can get a test dataset with test_size frames.

get_item_torch(index: int, num_worker: int = 1) → dict[source]#

Get the data of a single frame. The frame is picked from the data system by index, which is counted across all sets.

Parameters:

index: index of the frame
num_worker: number of workers for parallel data modification
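An index counted across all sets has to be translated into a set and a position inside that set. One way to do this, sketched below under the assumption that a cumulative frame count per set is available (compare the prefix_sum attribute), is a binary search. The function locate_frame and the frame counts are hypothetical.

```python
import bisect
from itertools import accumulate


def locate_frame(index, frames_per_set):
    """Map a global frame index to (set index, local index within that set)."""
    prefix_sum = list(accumulate(frames_per_set))  # e.g. [3, 5, 2] -> [3, 8, 10]
    if not prefix_sum or not 0 <= index < prefix_sum[-1]:
        raise IndexError("frame index out of range")
    # The first cumulative count strictly greater than index identifies the set.
    set_idx = bisect.bisect_right(prefix_sum, index)
    frames_before = prefix_sum[set_idx - 1] if set_idx > 0 else 0
    return set_idx, index - frames_before


frames_per_set = [3, 5, 2]  # hypothetical sizes of set.000, set.001, set.002
first_of_second_set = locate_frame(3, frames_per_set)
last_frame = locate_frame(9, frames_per_set)
```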

get_item_paddle(index: int, num_worker: int = 1) → dict[source]#

Get the data of a single frame. The frame is picked from the data system by index, which is counted across all sets. Same as the PyTorch backend.

Parameters:

index: index of the frame
num_worker: number of workers for parallel data modification

get_batch(batch_size: int) → dict[source]#

Get a batch of data with batch_size frames. The frames are randomly picked from the data system.

Parameters:

batch_size: size of the batch

get_test(ntests: int = -1) → dict[source]#

Get the test data with ntests frames.

Parameters:

ntests: Size of the test data set. If ntests is -1, all test data are returned.

get_ntypes() → int[source]#: Number of atom types in the system.

get_type_map() → list[str][source]#: Get the type map.

get_atom_type() → list[int][source]#: Get atom types.

get_numb_set() → int[source]#: Get number of training sets.

get_numb_batch(batch_size: int, set_idx: int) → int[source]#: Get the number of batches in a set.

get_sys_numb_batch(batch_size: int) → int[source]#: Get the number of batches in the data system.

get_natoms() → int[source]#: Get number of atoms.

get_natoms_vec(ntypes: int) → numpy.ndarray[source]#

Get number of atoms and number of atoms in different types.

Parameters:

ntypes: Number of types (may be larger than the actual number of types in the system).

Returns:

natoms: A vector of length ntypes + 2. natoms[0] is the number of local atoms, natoms[1] is the total number of atoms held by this processor, and natoms[i + 2], for 0 <= i < ntypes, is the number of atoms of type i.
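The layout of the returned vector can be illustrated with a short sketch. It assumes a single process, so that the local and total atom counts coincide, and it assumes the per-type counts start at position 2. The function natoms_vec is a hypothetical re-implementation, not the library code.

```python
def natoms_vec(atom_type, ntypes):
    """Build [n_local, n_total, count_type_0, ..., count_type_{ntypes-1}]."""
    natoms = len(atom_type)
    counts = [0] * ntypes
    for t in atom_type:
        counts[t] += 1
    # Single-process assumption: local and total atom counts are equal.
    return [natoms, natoms] + counts


# Five atoms of two types, with ntypes larger than the types actually present.
vec = natoms_vec([0, 0, 1, 1, 1], ntypes=4)
```

Passing an ntypes larger than the number of types present simply yields trailing zero counts, which matches the note on the ntypes parameter above.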

get_single_frame(index: int, num_worker: int) → dict[source]#: Orchestrates loading a single frame efficiently using memmap.

preload_and_modify_all_data_torch(num_worker: int) → None[source]#

Preload all frames and apply modifier to cache them.

This method is useful when use_modifier_cache is True and you want to avoid applying the modifier repeatedly during training.

avg(key: str) → float[source]#: Return the average value of an item.

_idx_map_sel(atom_type: numpy.ndarray, type_sel: list[int]) → numpy.ndarray[source]#

_get_natoms_2(ntypes: int) → tuple[int, numpy.ndarray][source]#

_get_memmap(path: deepmd.utils.path.DPPath) → numpy.memmap[source]#: Get or create a memory-mapped object for a given npy file. Uses file path and modification time as cache keys to detect file changes and invalidate cache when files are modified.

_get_subdata(data: dict[str, Any], idx: numpy.ndarray | None = None) → dict[str, Any][source]#

_load_batch_set(set_name: deepmd.utils.path.DPPath) → None[source]#

reset_get_batch() → None[source]#

_load_test_set(shuffle_test: bool) → None[source]#

_shuffle_data(data: dict[str, Any]) → dict[str, Any][source]#

_get_nframes(set_name: deepmd.utils.path.DPPath | str) → int[source]#

reformat_data_torch(data: dict[str, Any]) → dict[str, Any][source]#

Modify the data format for the requirements of Torch backend.

Parameters:

data: original data

_load_set(set_name: deepmd.utils.path.DPPath) → dict[str, Any][source]#

_load_data(set_name: str, key: str, nframes: int, ndof_: int, atomic: bool = False, must: bool = True, repeat: int = 1, high_prec: bool = False, type_sel: list[int] | None = None, default: float = 0.0, dtype: numpy.dtype | None = None, output_natoms_for_type_sel: bool = False) → numpy.ndarray[source]#

_load_single_data(set_dir: deepmd.utils.path.DPPath, key: str, frame_idx: int, set_nframes: int) → tuple[numpy.float32, numpy.ndarray][source]#

Loads and processes data for a SINGLE frame from a SINGLE key, fully replicating the logic from the original _load_data method.

Parameters:

set_dir (DPPath): The directory path of the set
key (str): The key name of the data to load
frame_idx (int): The local frame index within the set
set_nframes (int): The total number of frames in this set (to avoid redundant _get_nframes calls)

_load_type(sys_path: deepmd.utils.path.DPPath) → numpy.ndarray[source]#

_load_type_mix(set_name: deepmd.utils.path.DPPath) → numpy.ndarray[source]#

_make_idx_map(atom_type: numpy.ndarray) → numpy.ndarray[source]#

_load_type_map(sys_path: deepmd.utils.path.DPPath) → list[str] | None[source]#

_check_pbc(sys_path: deepmd.utils.path.DPPath) → bool[source]#

_check_mode(set_path: deepmd.utils.path.DPPath) → bool[source]#

static _create_memmap(path_str: str, mtime_str: str) → numpy.memmap[source]#

A cached helper function to create memmap objects. Uses lru_cache to limit the number of open file handles.

Parameters:

path_str: The file path as a string.
mtime_str: The modification time as a string, used for cache invalidation.
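The caching idea described here, keying an lru_cache on both the path and the modification time, can be sketched with the standard library alone. In this sketch, reading the file's bytes stands in for creating a numpy.memmap, the cache size of 256 is arbitrary, and the names _cached_read and read_with_cache are hypothetical.

```python
import functools
import os
import tempfile


@functools.lru_cache(maxsize=256)
def _cached_read(path_str, mtime_str):
    # mtime_str is unused in the body; it only takes part in the cache key,
    # so a modified file produces a new key and therefore a cache miss.
    with open(path_str, "rb") as fh:
        return fh.read()


def read_with_cache(path_str):
    """Look up the file content, keyed by path and current modification time."""
    return _cached_read(path_str, str(os.path.getmtime(path_str)))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as fh:
        fh.write(b"old")
    os.utime(path, (1_000, 1_000))  # set an explicit modification time
    first = read_with_cache(path)
    again = read_with_cache(path)  # same path and mtime: served from the cache
    with open(path, "wb") as fh:
        fh.write(b"new")
    os.utime(path, (2_000, 2_000))  # a changed mtime invalidates the old entry
    second = read_with_cache(path)

cache_stats = _cached_read.cache_info()
```

Bounding the cache with maxsize means the least recently used entries are evicted, which is how the number of simultaneously held file objects stays limited.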