Summit
Summit is an IBM AC922 system located at the Oak Ridge Leadership Computing Facility (OLCF). Each of the approximately 4,600 compute nodes on Summit contains two IBM POWER9 processors and six NVIDIA Volta V100 accelerators.
Summit features three tiers of nodes: login, launch, and compute nodes.
Users on login nodes submit batch runs to the launch nodes.
Batch scripts and interactive sessions run on the launch nodes. Only the launch nodes can submit MPI runs to the compute nodes via jsrun.
Configuring Python
Begin by loading the Python 3 Anaconda module:
$ module load python
You can now create and activate your own custom conda environment:
conda create --name myenv python=3.9
export PYTHONNOUSERSITE=1 # Make sure packages come from the conda env (not user site)
. activate myenv
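You can confirm that the environment's Python is now first on your path (the reported path should point inside your conda environment):
$ which python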
If you are installing any packages with compiled extensions, ensure that the correct compiler module is loaded. If using mpi4py, it must be installed from source, referencing the compiler. Currently, mpi4py must be built with gcc:
module load gcc
With your environment activated, run
CC=mpicc MPICC=mpicc pip install mpi4py --no-binary mpi4py
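As an optional sanity check, you can confirm that the resulting mpi4py build links against the system MPI library (this uses mpi4py's MPI.Get_library_version(), which requires an MPI-3 library):
$ python -c "from mpi4py import MPI; print(MPI.Get_library_version())"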
Installing libEnsemble
Obtaining libEnsemble is now as simple as pip install libensemble.
Your prompt should be similar to the following line:
(my_env) user@login5:~$ pip install libensemble
Note
If you encounter pip errors, run python -m pip install --upgrade pip first.
Or, you can install via conda:
(my_env) user@login5:~$ conda config --add channels conda-forge
(my_env) user@login5:~$ conda install -c conda-forge libensemble
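Either way, a quick check from the activated environment confirms the installation, for example:
(my_env) user@login5:~$ python -c "import libensemble; print(libensemble.__version__)"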
See here for more information on advanced options for installing libEnsemble.
Special note on resource sets and Executor submit options
When passing the portable MPI run configuration options (e.g., num_nodes) to the MPIExecutor submit function, it is important to note that, due to the resource sets used on Summit, the options refer to resource sets as follows:
num_procs (int, optional) – The total number of resource sets for this run.
num_nodes (int, optional) – The number of nodes on which to submit the run.
procs_per_node (int, optional) – The number of resource sets per node.
It is recommended that the user define a resource set as the minimal configuration of CPU cores/processes and GPUs. These settings can be added to the extra_args option of the submit function. Alternatively, the portable options can be ignored and everything expressed in extra_args.
For example, the following jsrun line would run three resource sets, each having one core (with one process), and one GPU, along with some extra options:
jsrun -n 3 -a 1 -g 1 -c 1 --bind=packed:1 --smpiargs="-gpu"
Expressing this line via the submit function may look something like the following:
exctr = Executor.executor
task = exctr.submit(app_name="mycode",
                    num_procs=3,
                    extra_args='-a 1 -g 1 -c 1 --bind=packed:1 --smpiargs="-gpu"',
                    app_args="-i input")
This would be equivalent to:
exctr = Executor.executor
task = exctr.submit(app_name="mycode",
                    extra_args='-n 3 -a 1 -g 1 -c 1 --bind=packed:1 --smpiargs="-gpu"',
                    app_args="-i input")
The libEnsemble resource manager works out the resources available to each worker, but unlike on some other systems, jsrun on Summit dynamically schedules runs to available slots across and within nodes. It can also queue tasks. This allows variable-size runs to be handled easily on Summit. If oversubscription of the jsrun system is desired, then libEnsemble's resource manager can be disabled in the calling script via:
libE_specs["disable_resource_manager"] = True
In the above example, the submitted task uses three GPUs, which is half of those available on a Summit node, so two such tasks (from different workers) could be allocated to each node if they were running at the same time.
Job Submission
Summit uses LSF for job management and submission. For libEnsemble, the most important command is bsub, for submitting batch scripts from the login nodes to execute on the launch nodes.
It is recommended to run libEnsemble on the launch nodes (assuming workers are submitting MPI applications) using the local communications mode (multiprocessing).
Interactive Runs
You can run interactively with bsub by specifying the -Is flag, similar to the following:
$ bsub -W 30 -P [project] -nnodes 8 -Is
This will place you on a launch node.
Note
You will need to reactivate your conda virtual environment.
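For example, once on the launch node, reactivating the environment and starting libEnsemble with local comms might look like the following (my_env and calling_script.py are placeholder names):
$ module load python
$ export PYTHONNOUSERSITE=1
$ source activate my_env
$ python calling_script.py --comms local --nworkers 8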
Batch Runs
Batch scripts specify run settings using #BSUB statements. The following simple example depicts configuring and launching libEnsemble on a launch node with multiprocessing. This script also assumes the user is using the parse_args() convenience function from libEnsemble's tools module.
#!/bin/bash -x
#BSUB -P <project code>
#BSUB -J libe_mproc
#BSUB -W 60
#BSUB -nnodes 128
#BSUB -alloc_flags "smt1"
# --- Prepare Python ---
# Load conda module and gcc.
module load python
module load gcc
# Name of conda environment
export CONDA_ENV_NAME=my_env
# Activate conda environment
export PYTHONNOUSERSITE=1
source activate $CONDA_ENV_NAME
# --- Prepare libEnsemble ---
# Name of calling script
export EXE=calling_script.py
# Communication Method
export COMMS="--comms local"
# Number of workers.
export NWORKERS="--nworkers 128"
hash -r # Check no commands hashed (pip/python...)
# Launch libE
python $EXE $COMMS $NWORKERS > out.txt 2>&1
With this saved as myscript.sh, allocating, configuring, and queueing libEnsemble on Summit is achieved by running
$ bsub myscript.sh
Example submission scripts are also given in the examples.
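For reference, a minimal calling script using parse_args() might look like the sketch below, which is based on libEnsemble's bundled six_hump_camel example; exact imports and return values may differ between libEnsemble versions.
import numpy as np
from libensemble.libE import libE
from libensemble.sim_funcs.six_hump_camel import six_hump_camel
from libensemble.gen_funcs.sampling import uniform_random_sample
from libensemble.tools import parse_args, add_unique_random_streams

# parse_args() picks up --comms and --nworkers as passed by the batch script
nworkers, is_manager, libE_specs, _ = parse_args()

sim_specs = {"sim_f": six_hump_camel, "in": ["x"], "out": [("f", float)]}
gen_specs = {
    "gen_f": uniform_random_sample,
    "out": [("x", float, (2,))],
    "user": {"gen_batch_size": 50, "lb": np.array([-3.0, -2.0]), "ub": np.array([3.0, 2.0])},
}
persis_info = add_unique_random_streams({}, nworkers + 1)
exit_criteria = {"sim_max": 100}

H, persis_info, flag = libE(sim_specs, gen_specs, exit_criteria, persis_info, libE_specs=libE_specs)

if is_manager:
    print(H[["x", "f"]][:10])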
Launching User Applications from libEnsemble Workers
Only the launch nodes can submit MPI runs to the compute nodes via jsrun.
This can be accomplished in user sim_f functions directly. However, it is highly recommended that the Executor interface be used inside the sim_f or gen_f, because it provides a portable interface with many advantages, including automatic resource detection, launch failure resilience, and ease of use.
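As an illustration only (the app name, arguments, and output field below are placeholders, not a definitive implementation), a sim_f that submits the registered application through the Executor and polls it to completion might look something like this:
import time
import numpy as np
from libensemble.executors.executor import Executor

def run_mycode(H, persis_info, sim_specs, libE_info):
    # Retrieve the executor that was registered in the calling script
    exctr = Executor.executor

    # Submit the registered application ("mycode" and its arguments are placeholders)
    task = exctr.submit(app_name="mycode", num_procs=3,
                        extra_args='-a 1 -g 1 -c 1 --bind=packed:1 --smpiargs="-gpu"',
                        app_args="-i input")

    # Poll until the run finishes; a timeout or kill policy could be added here
    while not task.finished:
        time.sleep(1)
        task.poll()

    # Populate the output array expected by sim_specs["out"] (placeholder "f" field and value)
    H_o = np.zeros(1, dtype=sim_specs["out"])
    H_o["f"] = 0.0 if task.state == "FINISHED" else np.nan
    return H_o, persis_info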
Additional Information
See the OLCF guides for more information about Summit.