MPI Executor - MPI apps

This module launches and controls the running of MPI applications.

In order to create an MPI executor, the calling script should contain:

from libensemble.executors.mpi_executor import MPIExecutor
exctr = MPIExecutor()

The MPIExecutor will use system resource information supplied by the libEnsemble resource manager when submitting tasks.

See this example for usage.

class libensemble.executors.mpi_executor.MPIExecutor(custom_info={})

Bases: Executor

The MPI executor can create, poll, and kill runnable MPI tasks.

Parameters: custom_info (dict, Optional) – Provides custom overrides for selected variables that are usually auto-detected. See below.

submit(calc_type=None, app_name=None, num_procs=None, num_nodes=None, procs_per_node=None, num_gpus=None, machinefile=None, app_args=None, stdout=None, stderr=None, stage_inout=None, hyperthreads=False, dry_run=False, wait_on_start=False, extra_args=None, auto_assign_gpus=False, match_procs_to_gpus=False, env_script=None, mpi_runner_type=None)

Creates a new task, and either executes or schedules execution.

The created task object is returned.

The user must supply either the app_name or calc_type arguments (app_name is recommended). All other arguments are optional.

Parameters:

calc_type (str, Optional) – The calculation type: ‘sim’ or ‘gen’. Only used if app_name is not supplied, in which case the default sim or gen application is used.
app_name (str, Optional) – The application name. Must be supplied if calc_type is not.
num_procs (int, Optional) – The total number of processes (MPI ranks)
num_nodes (int, Optional) – The number of nodes
procs_per_node (int, Optional) – The processes per node
num_gpus (int, Optional) – The total number of GPUs
machinefile (str, Optional) – Name of a machinefile
app_args (str, Optional) – A string of application arguments to be added to the task's submit command line
stdout (str, Optional) – A standard output filename
stderr (str, Optional) – A standard error filename
stage_inout (str, Optional) – A directory to copy files from. By default, files are taken from the current directory.
hyperthreads (bool, Optional) – Whether to submit MPI tasks to hyperthreads
dry_run (bool, Optional) – Whether this is a dry run. If True, no task is launched; instead, the run line is printed to the logger (at INFO level).
wait_on_start (bool or int, Optional) – Whether to wait for the task to be polled as RUNNING (or another active or end state) before continuing. If an integer N is supplied, wait at most N seconds.
extra_args (str, Optional) – Additional command line arguments to supply to the MPI runner. If arguments are recognized as MPI resource configuration (num_procs, num_nodes, procs_per_node), they will be used when determining resources unless the same values are also supplied through the direct options.
auto_assign_gpus (bool, Optional) – Auto-assign GPUs available to this worker using either the method supplied in configuration or determined by detected environment. Default: False
match_procs_to_gpus (bool, Optional) – For use with auto_assign_gpus. Auto-assigns MPI processes to match the assigned GPUs. Default: False unless auto_assign_gpus is True and no other CPU configuration is supplied.
env_script (str, Optional) – The full path of a shell script to set up the environment for the launched task. This will be run in the subprocess, and not affect the worker environment. The script should start with a shebang.
mpi_runner_type (str or dict, Optional) – An MPI runner to be used for this submit only. Supply either a string for the MPI runner type or a dictionary for detailed configuration (see custom_info on the MPIExecutor constructor). This will not change the default MPI runner for the executor. Example string inputs are “mpich”, “openmpi”, “srun”, “jsrun”, “aprun”.

Returns:

task – The launched task object

Return type:

Task

Note that if some combination of num_procs, num_nodes, and procs_per_node is provided, these will be honored if possible. If resource detection is on and these are omitted, then the available resources will be divided among workers.
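To illustrate how these three options relate, the following minimal sketch fills in whichever value is missing from the other two, assuming processes are spread evenly across nodes. The helper fill_in_resources is hypothetical and is not libEnsemble's resource logic, which also accounts for detected resources and the number of workers.

```python
def fill_in_resources(num_procs=None, num_nodes=None, procs_per_node=None):
    """Derive a missing value, assuming an even spread of processes over nodes.

    Hypothetical helper for illustration only; not part of libEnsemble.
    """
    given = [v is not None for v in (num_procs, num_nodes, procs_per_node)]
    if sum(given) < 2:
        # Too little information to derive anything; return values unchanged.
        return num_procs, num_nodes, procs_per_node

    if num_procs is None:
        num_procs = num_nodes * procs_per_node
    elif num_nodes is None:
        # Ceiling division: enough nodes to hold every process.
        num_nodes = -(-num_procs // procs_per_node)
    elif procs_per_node is None:
        # Ceiling division: enough slots per node to hold every process.
        procs_per_node = -(-num_procs // num_nodes)
    elif num_procs > num_nodes * procs_per_node:
        raise ValueError("num_procs exceeds num_nodes * procs_per_node")

    return num_procs, num_nodes, procs_per_node


print(fill_in_resources(num_nodes=2, procs_per_node=4))   # (8, 2, 4)
print(fill_in_resources(num_procs=6, num_nodes=4))        # (6, 4, 2)
```

When all three values are supplied, the sketch only checks that they are mutually consistent; a real executor may handle such conflicts differently.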

manager_kill_received()

Returns True if a kill signal has been received from the manager.

Return type: bool

manager_poll()

Polls for a manager signal.

The executor's manager_signal attribute will be updated.

Return type: int

polling_loop(task, timeout=None, delay=0.1, poll_manager=False)

Optional, blocking, generic task status polling loop. Operates until the task finishes, times out, or is optionally killed via a manager signal. On completion, returns a presumptive calc_status integer. Useful for running an application via the Executor until it stops without monitoring its intermediate output.

Parameters:

task (object) – a Task object returned by the executor on submission
timeout (int, Optional) – Maximum number of seconds for the polling loop to run. Tasks that run longer than this limit are killed. Default: No timeout
delay (int or float, Optional) – Sleep duration in seconds between polling loop iterations. Default: 0.1
poll_manager (bool, Optional) – Whether to also poll the manager for ‘finish’ or ‘kill’ signals. If detected, the task is killed. Default: False.

Returns:

calc_status – presumptive integer attribute describing the final status of a launched task

Return type:

int
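The control flow described for polling_loop can be pictured with a small stand-alone sketch. It is not libEnsemble's implementation: FakeTask, simple_polling_loop, the kill_requested callback, and the string results are all hypothetical stand-ins (the real method returns an integer calc_status and polls the manager itself when poll_manager is set).

```python
import time


class FakeTask:
    """Hypothetical stand-in for a task; finishes after a fixed number of polls."""

    def __init__(self, polls_until_done):
        self.polls_left = polls_until_done
        self.finished = False
        self.killed = False

    def poll(self):
        self.polls_left -= 1
        if self.polls_left <= 0:
            self.finished = True

    def kill(self):
        self.killed = True
        self.finished = True


def simple_polling_loop(task, timeout=None, delay=0.1, kill_requested=lambda: False):
    """Poll until the task finishes, times out, or a kill is requested.

    Returns a descriptive string rather than an integer calc_status.
    """
    start = time.monotonic()
    while True:
        task.poll()
        if task.finished:
            return "finished"
        if kill_requested():
            # Mirrors the poll_manager behavior: a manager signal kills the task.
            task.kill()
            return "killed_by_manager"
        if timeout is not None and time.monotonic() - start > timeout:
            # Tasks that exceed the time limit are killed.
            task.kill()
            return "timed_out"
        time.sleep(delay)


print(simple_polling_loop(FakeTask(3), delay=0.01))                    # finished
print(simple_polling_loop(FakeTask(10**6), timeout=0.05, delay=0.01))  # timed_out
```

The sketch blocks the caller in the same way the documented loop does, which is why it suits applications whose intermediate output does not need to be monitored.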

register_app(full_path, app_name=None, calc_type=None, desc=None, precedent='')

Registers a user application to libEnsemble.

The full_path of the application must be supplied. Either app_name or calc_type can be used to identify the application in user scripts (in the submit function). app_name is recommended.

Parameters:

full_path (str) – The full path of the user application to be registered
app_name (str, Optional) – Name to identify this application.
calc_type (str, Optional) – Calculation type: Set this application as the default ‘sim’ or ‘gen’ function.
desc (str, Optional) – Description of this application
precedent (str, Optional) – Any str that should directly precede the application full path.

Return type:

None
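As a rough picture of what precedent does, the sketch below assembles a command line in which the precedent string sits directly before the application path. The helper build_command, the runner arguments, and the paths are hypothetical; the exact run line that libEnsemble produces depends on the MPI runner in use and is not reproduced here.

```python
def build_command(runner_args, full_path, precedent="", app_args=None):
    """Join the pieces of a run line, placing precedent directly before the app path.

    Hypothetical helper for illustration only; not part of libEnsemble.
    """
    parts = list(runner_args)
    if precedent:
        parts.append(precedent)
    parts.append(full_path)
    if app_args:
        parts.append(app_args)
    return " ".join(parts)


print(build_command(["mpirun", "-np", "4"], "/path/to/sim.py",
                    precedent="python", app_args="--size 10"))
# mpirun -np 4 python /path/to/sim.py --size 10
```

A typical use of precedent is to name an interpreter, such as python, for an application that is a script rather than a compiled executable.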

Class-specific Attributes

Class-specific attributes can be set directly to alter the behavior of the MPI Executor. However, they should be used with caution, because they may not be implemented in other executors.

max_submit_attempts: (int) Maximum number of launch attempts for a given task. Default: 5.
fail_time: (int or float) Only applies if wait_on_start is set. A task that fails within this many seconds of running is relaunched. Default: 2.
retry_delay_incr: (int or float) Delay increment between launch attempts in seconds. Default: 5. (That is, the first retry occurs after 5 seconds, the next after 10 seconds, then 15, and so on.)

Example. To increase resilience against submission failures:

from libensemble.executors.mpi_executor import MPIExecutor

taskctrl = MPIExecutor()
taskctrl.max_submit_attempts = 8
taskctrl.fail_time = 5
taskctrl.retry_delay_incr = 10
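To get a feel for the waiting time such settings imply, the retry schedule described for retry_delay_incr can be tabulated with a short stand-alone sketch. It assumes that max_submit_attempts counts the initial attempt and that the delay grows by one increment per retry; retry_delays is a hypothetical helper, not a libEnsemble function.

```python
def retry_delays(max_submit_attempts=5, retry_delay_incr=5):
    """Delays in seconds before each retry.

    Assumes the first attempt has no delay and each later retry waits one
    increment longer than the previous one (illustrative only).
    """
    num_retries = max(max_submit_attempts - 1, 0)
    return [retry_delay_incr * n for n in range(1, num_retries + 1)]


defaults = retry_delays()
print(defaults, sum(defaults))        # [5, 10, 15, 20] 50

tuned = retry_delays(max_submit_attempts=8, retry_delay_incr=10)
print(tuned, sum(tuned))              # [10, 20, 30, 40, 50, 60, 70] 280
```

Under these assumptions, the more resilient settings trade a worst-case wait of 50 seconds for one of 280 seconds before a task is given up on.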