libEnsemble with SLURM
SLURM is a popular open-source workload manager.
libEnsemble can read SLURM node lists and partition these to workers. By default this is done by reading the SLURM_NODELIST environment variable.
Example SLURM submission scripts for various systems are given in the examples. Further examples are given in some of the specific platform guides (e.g., the Perlmutter guide).
By default, the MPIExecutor uses mpirun in preference to srun, as it works better in some cases. If mpirun does not work well, try telling the MPIExecutor to use srun when it is initialized in the calling script:
from libensemble.executors.mpi_executor import MPIExecutor

# Tell the MPIExecutor to launch tasks with srun instead of mpirun
exctr = MPIExecutor(custom_info={"mpi_runner": "srun"})
Common Errors
SLURM systems can have various configurations, which may affect what is required when assigning more than one worker to any given node.
srun: Job ****** step creation temporarily disabled, retrying (Requested nodes are busy)
You may also see: srun: Job ****** step creation still disabled, retrying (Requested nodes are busy)
It is recommended to add these to submission scripts to prevent resource conflicts:
export SLURM_EXACT=1
export SLURM_MEM_PER_NODE=0
Alternatively, the --exact option to srun, along with other relevant options, can be given on any srun lines, including the MPIExecutor submission lines, via the extra_args option (from version 0.10.0, these are added automatically).
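On older versions, a minimal sketch of passing these options at submission time might look as follows (the application name "sim" is an illustrative placeholder, assumed to have been registered with the executor):

# Assumes an application was registered under the name "sim" via exctr.register_app()
task = exctr.submit(
    app_name="sim",
    num_procs=4,
    extra_args="--exact --mem=0",  # --mem=0 mirrors the SLURM_MEM_PER_NODE=0 export above
)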
Secondly, while many configurations are possible, it is recommended to avoid #SBATCH commands that may limit resources available to srun job steps, such as:
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
Instead, provide these to sub-tasks via the extra_args option to the MPIExecutor submit function.
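For example, with the same placeholder application as above, the per-task options can be moved to the job step:

# Sketch: per-task options given to the job step rather than via #SBATCH
task = exctr.submit(
    app_name="sim",
    num_procs=4,
    extra_args="--ntasks-per-node=4 --gpus-per-task=1",
)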
GTL_DEBUG: [0] cudaHostRegister: no CUDA-capable device is detected
If using the environment variable MPICH_GPU_SUPPORT_ENABLED, then srun commands may expect an option for allocating GPUs (e.g., --gpus-per-task=1 would allocate one GPU to each MPI task of the MPI run). It is recommended that tasks submitted via the MPIExecutor specify this in the extra_args option to the submit function (rather than using an #SBATCH command).
If running the libEnsemble calling script with srun, then it is recommended that MPICH_GPU_SUPPORT_ENABLED is set in the user sim_f or gen_f function where GPU runs will be submitted, instead of in the batch script. For example:
os.environ["MPICH_GPU_SUPPORT_ENABLED"] = "1"
Note on Resource Binding
Note: From version 0.10.0, it is recommended that GPUs are assigned automatically by libEnsemble. See the forces_gpu tutorial as an example.
Note that the use of CUDA_VISIBLE_DEVICES and other environment variables is often a highly portable way of assigning specific GPUs to workers, and has been known to work on some systems when other methods do not. See the libEnsemble regression test test_persistent_sampling_CUDA_variable_resources.py for an example of setting CUDA_VISIBLE_DEVICES in the imported simulator function (CUDA_variable_resources).
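The pattern used there is roughly the following sketch, which assumes the worker resources module interface:

from libensemble.resources.resources import Resources

# Inside the simulator function: limit visible GPUs to this worker's assigned slots
resources = Resources.resources.worker_resources
resources.set_env_to_slots("CUDA_VISIBLE_DEVICES")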
On other systems, like Perlmutter, using an option such as --gpus-per-task=1 or --gres=gpu:1 in extra_args is sufficient to allow SLURM to find the free GPUs.
Note that srun options such as --gpu-bind=map_gpu:2,3 do not necessarily provide absolute GPU slots when more than one concurrent job step (srun) is running on a node. If desired, such options could be set using the worker resources module, in a similar manner to how CUDA_VISIBLE_DEVICES is set in the example above.
Some Useful Commands
Find SLURM version:
scontrol --version
Find SLURM system configuration:
scontrol show config
Find SLURM partition configuration for a partition called “gpu”:
scontrol show partition gpu