libEnsemble with SLURM

SLURM is a popular open-source workload manager.

libEnsemble can read SLURM node lists and partition the nodes among workers. By default, it obtains the node list from an environment variable set by SLURM inside the job (SLURM_NODELIST).
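To make the idea concrete, the following is a minimal sketch of expanding a compressed SLURM node list and dividing the nodes among workers. It is not libEnsemble's implementation: the helper names `expand_nodelist` and `partition_nodes` are hypothetical, the sketch handles only simple `prefix[ranges]` names, and it assumes there are at least as many nodes as workers.

```python
import re


def expand_nodelist(nodelist):
    """Expand a compressed node list such as 'nid[0001-0003,0007],login1'."""
    nodes = []
    # Match either "prefix[ranges]" or a plain node name, so that commas
    # inside the brackets are not treated as separators between names.
    pattern = r"([^,\[\]]+)\[([^\]]+)\]|([^,\[\]]+)"
    for prefix, ranges, plain in re.findall(pattern, nodelist):
        if plain:
            nodes.append(plain.strip())
            continue
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            hi = hi or lo  # a single index such as "0007"
            width = len(lo)  # preserve zero padding
            nodes.extend(f"{prefix}{n:0{width}d}" for n in range(int(lo), int(hi) + 1))
    return nodes


def partition_nodes(nodes, num_workers):
    """Split nodes into contiguous, near-equal groups, one group per worker."""
    base, extra = divmod(len(nodes), num_workers)
    groups, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        groups.append(nodes[start:start + size])
        start += size
    return groups


# Inside a job, the string would typically come from os.environ["SLURM_NODELIST"].
nodes = expand_nodelist("nid[0001-0004]")
print(partition_nodes(nodes, 2))
```

In this sketch each worker receives a contiguous block of nodes. Sharing a node among several workers is not covered here.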

Example SLURM submission scripts for various systems are given in the examples. Further examples are given in some of the platform-specific guides (e.g., the Perlmutter guide).

By default, the MPIExecutor prefers mpirun over srun because mpirun works better in some cases. If mpirun does not work well, try telling the MPIExecutor to use srun when it is created in the calling script:

from libensemble.executors.mpi_executor import MPIExecutor
exctr = MPIExecutor(custom_info={"mpi_runner": "srun"})

Common Errors

SLURM systems are configured in different ways, and the configuration can affect what is required when assigning more than one worker to a node.

Note on Resource Binding

Note

Update: From version 0.10.0, it is recommended that GPUs are assigned automatically by libEnsemble. See the forces_gpu tutorial as an example.

Setting CUDA_VISIBLE_DEVICES and similar environment variables is often a highly portable way of assigning specific GPUs to workers, and it has worked on some systems where other methods did not. See the libEnsemble regression test test_persistent_sampling_CUDA_variable_resources.py for an example of setting CUDA_VISIBLE_DEVICES in the imported simulator function (CUDA_variable_resources).
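As a rough illustration of the environment-variable approach, the sketch below sets CUDA_VISIBLE_DEVICES from a list of GPU slot IDs before an application is launched. The helper `set_visible_gpus` and its `slots` argument are hypothetical stand-ins, not libEnsemble's API. In a real simulator function the slot IDs would come from the worker's resources, as in the regression test mentioned above.

```python
import os


def set_visible_gpus(slots, env=None):
    """Set CUDA_VISIBLE_DEVICES to the given GPU slot IDs, e.g. [2, 3] -> "2,3".

    `slots` is assumed to be the list of GPU IDs assigned to this worker.
    `env` defaults to the process environment; pass a dict to build an
    environment for a subprocess without touching os.environ.
    """
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(s) for s in slots)
    return env["CUDA_VISIBLE_DEVICES"]


# Example: a worker assigned GPUs 2 and 3 on its node.
launch_env = {}
set_visible_gpus([2, 3], launch_env)
print(launch_env)
```

Any application launched with this environment sees only the listed devices, which is why the approach tends to be portable across launchers.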

On other systems, like Perlmutter, using an option such as --gpus-per-task=1 or --gres=gpu:1 in extra_args is sufficient to allow SLURM to find the free GPUs.

Note that srun options such as:

--gpu-bind=map_gpu:2,3

do not necessarily provide absolute GPU slots when more than one concurrent job step (srun) is running on a node. If desired, such options could be set using the worker resources module, in a similar manner to how CUDA_VISIBLE_DEVICES is set in the example.
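One way this might look is sketched below: a small helper builds the option string from a worker's GPU slot IDs so that it can be passed through extra_args. The function name `gpu_bind_args` is hypothetical, and the slot list is assumed to come from the worker's resources. The caveat above still applies, since the IDs may not map to absolute GPU slots when several job steps share a node.

```python
def gpu_bind_args(slots):
    """Build an srun option string that maps tasks to the given GPU IDs."""
    if not slots:
        raise ValueError("at least one GPU slot is required")
    return "--gpu-bind=map_gpu:" + ",".join(str(s) for s in slots)


# Example: a worker assigned GPUs 2 and 3; the resulting string would be
# supplied as extra_args when the application is submitted.
extra_args = gpu_bind_args([2, 3])
print(extra_args)
```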

Some useful commands

Find SLURM version:

scontrol --version

Find SLURM system configuration:

scontrol show config

Find SLURM partition configuration for a partition called “gpu”:

scontrol show partition gpu