This module detects and returns system resources.
- class resources.resources.Resources(top_level_dir=None, central_mode=False, zero_resource_workers=, allow_oversubscribe=False, launcher=None, cores_on_node=None, node_file=None, nodelist_env_slurm=None, nodelist_env_cobalt=None, nodelist_env_lsf=None, nodelist_env_lsf_shortform=None)
Provides system resources to libEnsemble and executor.
This is initialized when the executor is created with auto_resources set to True.
The following attributes are set on initialization:
top_level_dir (string) – Directory that is searched for the node_list file
central_mode (boolean) – If true, then running in central mode; otherwise distributed
env_resources (EnvResources) – An object storing environment variables used by resources
global_nodelist (list) – A list of all nodes available for running user applications
logical_cores_avail_per_node (int) – Logical cores (including SMT threads) available on a node
physical_cores_avail_per_node (int) – Physical cores available on a node
worker_resources (WorkerResources) – An object that can contain worker specific resources
- __init__(top_level_dir=None, central_mode=False, zero_resource_workers=, allow_oversubscribe=False, launcher=None, cores_on_node=None, node_file=None, nodelist_env_slurm=None, nodelist_env_cobalt=None, nodelist_env_lsf=None, nodelist_env_lsf_shortform=None)
Initializes a new Resources instance.
Determines the compute resources available for the current allocation, including the node list and the cores/hardware threads available within nodes.
top_level_dir (string, optional) – Directory libEnsemble runs in (default is current working directory)
central_mode (boolean, optional) – If true, then running in central mode, otherwise distributed. Central mode means libE processes (manager and workers) are grouped together and do not share nodes with applications. Distributed mode means Workers share nodes with applications.
zero_resource_workers (list of ints, optional) – List of workers that require no resources.
allow_oversubscribe (boolean, optional) – If false, then resources will raise an error if task process counts exceed the CPUs available to the worker, as detected by auto_resources. Larger node counts will always raise an error. When auto_resources is off, this argument is ignored.
launcher (String, optional) – The name of the job launcher, such as mpirun or aprun. This may be used to obtain intranode information by launching a probing job onto the compute nodes. If not present, the local node will be used to obtain this information.
cores_on_node (tuple (int,int), optional) – If supplied gives (physical cores, logical cores) for the nodes. If not supplied, this will be auto-detected.
node_file (String, optional) – If supplied, gives the name of a file in the run directory to use as a node-list for use by libEnsemble. Defaults to a file named ‘node_list’. If the file does not exist, then the node-list will be auto-detected.
nodelist_env_slurm (String, optional) – The environment variable giving a node list in Slurm format (Default: uses SLURM_NODELIST). Note: This is queried only if a node_list file is not provided and auto_resources=True.
nodelist_env_cobalt (String, optional) – The environment variable giving a node list in Cobalt format (Default: uses COBALT_PARTNAME). Note: This is queried only if a node_list file is not provided and auto_resources=True.
nodelist_env_lsf (String, optional) – The environment variable giving a node list in LSF format (Default: uses LSB_HOSTS). Note: This is queried only if a node_list file is not provided and auto_resources=True.
nodelist_env_lsf_shortform (String, optional) – The environment variable giving a node list in LSF short-form format (Default: uses LSB_MCPU_HOSTS). Note: This is queried only if a node_list file is not provided and auto_resources=True.
Adds comms-specific information to resources
Removes libEnsemble nodes from nodelist if in central_mode.
- static get_MPI_variant()
Returns the MPI base implementation.
mpi_variant – MPI variant: ‘aprun’, ‘jsrun’, ‘mpich’, or ‘openmpi’
- Return type
- static is_nodelist_shortnames(nodelist)
Returns False if any entry contains a ‘.’ (that is, looks like a fully qualified name), else True
- static remove_nodes(global_nodelist_in, remove_list)
Removes any nodes in remove_list from the global nodelist
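As an illustration of the behavior described above, here is a minimal sketch. The name remove_nodes_sketch is hypothetical and this is not the library's own code; it assumes that node order is preserved and that names in remove_list which are absent from the global nodelist are simply ignored.

```python
def remove_nodes_sketch(global_nodelist_in, remove_list):
    """Return the nodes of global_nodelist_in that are not in remove_list.

    Hypothetical stand-in for the documented method: order is preserved and
    the input list is not modified (both are assumptions).
    """
    to_remove = set(remove_list)  # set membership tests are O(1)
    return [node for node in global_nodelist_in if node not in to_remove]


nodes = ["node-1", "node-2", "node-3", "node-4"]
remaining = remove_nodes_sketch(nodes, ["node-1", "node-9"])
print(remaining)
```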
- static best_split(a, n)
Creates the most even split of list a into n parts and returns a list of lists
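One common way to produce a most-even split is to give each part len(a) // n elements and hand one extra element to each of the first len(a) % n parts. The sketch below follows that approach; the name best_split_sketch is hypothetical, and placing the larger parts first is an assumption rather than documented behavior.

```python
def best_split_sketch(a, n):
    """Split list a into n contiguous parts whose lengths differ by at most 1.

    Hypothetical illustration: the first (len(a) % n) parts receive one extra
    element each.
    """
    k, m = divmod(len(a), n)  # k = base part size, m = number of larger parts
    return [a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


parts = best_split_sketch(list(range(7)), 3)
print(parts)
```

With seven items and three parts, the part sizes here are 3, 2, and 2.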
- static get_global_nodelist(node_file='node_list', rundir=None, env_resources=None)
Returns the list of nodes available to all libEnsemble workers.
If a node_file exists, it is used; otherwise the environment is interrogated for a node list. If a dedicated manager node is used, then a node_file is recommended.
In central mode, any node with a libE worker is removed from the list.
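The file-first, environment-second lookup described above can be sketched as follows. This is a simplified illustration, not the library's implementation: the function name and the env_var and environ parameters are hypothetical, only a whitespace-separated host list (the form used by LSB_HOSTS) is handled, and the parsing of Slurm, Cobalt, and LSF short-form node lists, as well as the central-mode filtering, is omitted.

```python
import os
import tempfile


def global_nodelist_sketch(node_file="node_list", rundir=None,
                           env_var="LSB_HOSTS", environ=None):
    """Return node names from a node file if present, else from the environment.

    Hypothetical helper: env_var and environ are illustration-only parameters.
    """
    rundir = rundir or os.getcwd()
    environ = os.environ if environ is None else environ
    path = os.path.join(rundir, node_file)
    if os.path.isfile(path):
        # One node name per line; blank lines are skipped.
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
    # Fall back to the environment; drop repeated hosts but keep their order.
    hosts = environ.get(env_var, "").split()
    return list(dict.fromkeys(hosts))


with tempfile.TemporaryDirectory() as rundir:
    fake_env = {"LSB_HOSTS": "n1 n1 n2 n2"}
    from_env = global_nodelist_sketch(rundir=rundir, environ=fake_env)
    with open(os.path.join(rundir, "node_list"), "w") as f:
        f.write("nodeA\nnodeB\n")
    from_file = global_nodelist_sketch(rundir=rundir, environ=fake_env)

print(from_env, from_file)
```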
- class resources.resources.WorkerResources(workerID, comm, resources)
Provides system resources per worker to libEnsemble and executor.
The following attributes are set on initialization:
num_workers (int) – Total number of workers
workerID (int) – workerID
local_nodelist (list) – A list of all nodes assigned to this worker
local_node_count (int) – The number of nodes available to this worker (rounded up to whole number)
workers_per_node (int) – The number of workers per node (if using subnode workers)
- __init__(workerID, comm, resources)
Initializes a new WorkerResources instance.
Determines the compute resources available for the current worker, including the node list and the cores/hardware threads available within nodes.
workerID (int) – workerID of current process
comm (Comm) – The Comm object for manager/worker communications
resources (Resources) – A Resources object containing global nodelist and intranode information
- static map_workerid_to_index(num_workers, workerID, zero_resource_list)
Maps a workerID to an index into a nodelist
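One plausible reading of this mapping is that zero-resource workers occupy no slot in the nodelist, so a worker's index is its position among the workers that do need resources. The sketch below works under that reading; the function name is hypothetical, worker IDs are assumed to start at 1, and raising an error for a zero-resource worker is an assumption, not documented behavior.

```python
def map_workerid_to_index_sketch(num_workers, workerID, zero_resource_list):
    """Return the 0-based position of workerID among resource-using workers.

    Hypothetical illustration: assumes worker IDs run from 1 to num_workers.
    """
    if not 1 <= workerID <= num_workers:
        raise ValueError("workerID out of range")
    if workerID in zero_resource_list:
        raise ValueError("zero-resource workers have no nodelist index")
    # Count the zero-resource workers that come before this worker.
    skipped = sum(1 for w in zero_resource_list if w < workerID)
    return (workerID - 1) - skipped


# Worker 1 needs no resources, so worker 3 is the second resource-using worker.
index = map_workerid_to_index_sketch(4, 3, [1])
print(index)
```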
- static get_workers2assign2(num_workers, resources)
Returns workers to assign resources to
- static even_assignment(nnodes, nworkers)
Returns True if workers are evenly distributed to nodes, else False
- static expand_list(nnodes, nworkers, nodelist)
Duplicates each element of nodelist to best map workers to nodes.
Returns node list with duplicates, and a list of local (on-node) worker counts, both indexed by worker.
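To make the two return values concrete, here is a minimal sketch of such an expansion. It is not the library's code: the name expand_list_sketch is hypothetical, it assumes nworkers >= nnodes and len(nodelist) == nnodes, and giving the extra workers to the first nodes is an assumption.

```python
def expand_list_sketch(nnodes, nworkers, nodelist):
    """Repeat each node once per worker it hosts.

    Returns (expanded_nodelist, local_worker_counts), both with one entry per
    worker. Hypothetical illustration assuming nworkers >= nnodes.
    """
    k, m = divmod(nworkers, nnodes)  # the first m nodes host k + 1 workers
    expanded, local_counts = [], []
    for i, node in enumerate(nodelist):
        count = k + 1 if i < m else k
        expanded.extend([node] * count)
        local_counts.extend([count] * count)
    return expanded, local_counts


expanded, counts = expand_list_sketch(2, 5, ["n1", "n2"])
print(expanded)
print(counts)
```

Here five workers on two nodes give the first node three workers and the second node two, so the worker at position 3 in the expanded list sees node n2 and a local worker count of 2.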
- static get_local_nodelist(num_workers, workerID, resources)
Returns the list of nodes available to the current worker.
Assumes that self.global_nodelist has been calculated (in __init__) and that non-application nodes have already been removed from it.