Executor - Assign GPUs
This tutorial shows the most portable way to assign tasks (user applications) to the GPU. The libEnsemble scripts in this example are available under forces_gpu in the libEnsemble repository.
This example is based on the simple forces tutorial, with a slightly modified simulation function (to assign GPUs) and a greatly increased number of particles (so that live GPU usage can be viewed).
In the first example, each worker will be using one GPU. The code will assign the GPUs available to each worker, using the appropriate method. This works on systems using NVIDIA, AMD, and Intel GPUs.
Videos demonstrate running this example on Perlmutter, Spock, and Polaris. The first two videos are from an earlier release (you no longer need to change the particle count or modify the `forces.c` file).
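The simulation function (forces_simf.py) is as follows. The lines that differ from the simple forces example are the GPU-related imports, the GPU options passed to exctr.submit, and the check_gpu_setting call: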
import numpy as np

# To retrieve our MPI Executor
from libensemble.executors.executor import Executor

# Optional - to print GPU settings
from libensemble.tools.test_support import check_gpu_setting


def run_forces(H, persis_info, sim_specs, libE_info):
    """Launches the forces MPI app and auto-assigns ranks and GPU resources.

    Assigns one MPI rank to each GPU assigned to the worker.
    """

    # Parse out num particles, from generator function
    particles = str(int(H["x"]))

    # app arguments: num particles, timesteps, also using num particles as seed
    args = particles + " " + str(10) + " " + particles

    # Retrieve our MPI Executor
    exctr = Executor.executor

    # Submit our forces app for execution.
    task = exctr.submit(
        app_name="forces",
        app_args=args,
        auto_assign_gpus=True,
        match_procs_to_gpus=True,
    )

    # Block until the task finishes
    task.wait()

    # Optional - prints GPU assignment (method and numbers)
    check_gpu_setting(task, assert_setting=False, print_setting=True)

    # Stat file to check for bad runs
    statfile = "forces.stat"

    # Read final energy
    data = np.loadtxt(statfile)
    final_energy = data[-1]

    # Define our output array, populate with energy reading
    output = np.zeros(1, dtype=sim_specs["out"])
    output["energy"] = final_energy

    return output
The check_gpu_setting call (line 37) simply prints out how the GPUs were assigned. If the assignment is not as desired, an option can be provided in the calling script. Alternatively, for known systems, the LIBE_PLATFORM environment variable can be set.
The user can also set num_gpus in the generator, as in the test_GPU_variable_resources.py example.
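As a rough illustration of requesting GPUs per point, the sketch below builds a generator output array that carries an integer num_gpus field alongside the input value. The function name sample_with_gpus, the field layout, and the rule tying the GPU request to problem size are all hypothetical; the generator used in test_GPU_variable_resources.py may differ. It is also assumed here that num_gpus would need to be listed in gen_specs["out"] in the calling script.

```python
import numpy as np

# Hypothetical output layout: one input value plus a per-point GPU request.
GEN_OUT_GPUS = [("x", float, (1,)), ("num_gpus", int)]


def sample_with_gpus(batch_size, lb, ub, max_gpus, rng):
    """Draw particle counts uniformly and request more GPUs for larger runs."""
    H_o = np.zeros(batch_size, dtype=GEN_OUT_GPUS)
    H_o["x"][:, 0] = rng.uniform(lb, ub, batch_size)

    # Illustrative rule only: scale the GPU request with the problem size,
    # always asking for at least one GPU and never more than max_gpus.
    frac = (H_o["x"][:, 0] - lb) / (ub - lb)
    H_o["num_gpus"] = np.clip(np.ceil(frac * max_gpus), 1, max_gpus).astype(int)
    return H_o


points = sample_with_gpus(4, lb=1000, ub=3000, max_gpus=4, rng=np.random.default_rng(0))
```

Each row of points then holds one simulation input and the number of GPUs requested for it.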
While this is sufficient for many users, note that it is possible to query the resources assigned to this worker (nodes and partitions of nodes), and use this information however you want.
How to query this worker’s resources
The example shown below implements a similar, but less portable, version of the above (excluding output lines).
import numpy as np

# To retrieve our MPI Executor and resources instances
from libensemble.executors.executor import Executor
from libensemble.resources.resources import Resources

# Optional status codes to display in libE_stats.txt for each gen or sim
from libensemble.message_numbers import WORKER_DONE, TASK_FAILED


def run_forces(H, _, sim_specs):
    calc_status = 0

    # Parse out num particles, from generator function
    particles = str(int(H["x"]))

    # app arguments: num particles, timesteps, also using num particles as seed
    args = particles + " " + str(10) + " " + particles

    # Retrieve our MPI Executor instance and resources
    exctr = Executor.executor
    resources = Resources.resources.worker_resources

    resources.set_env_to_slots("CUDA_VISIBLE_DEVICES")

    # Submit our forces app for execution. Block until the task starts.
    task = exctr.submit(
        app_name="forces",
        app_args=args,
        num_nodes=resources.local_node_count,
        procs_per_node=resources.slot_count,
        wait_on_start=True,
    )

    # Block until the task finishes
    task.wait()

    # Stat file to check for bad runs
    statfile = "forces.stat"

    # Read final energy
    data = np.loadtxt(statfile)
    final_energy = data[-1]

    # Define our output array, populate with energy reading
    output = np.zeros(1, dtype=sim_specs["out"])
    output["energy"] = final_energy

    return output
The above code will assign a GPU to each worker on CUDA-capable systems, so long as the number of workers is chosen to fit the resources.
If you want to have one rank with multiple GPUs, then change the num_nodes and procs_per_node arguments (source lines 30/31) accordingly.
The resource attributes used are:
local_node_count: The number of nodes available to this worker
slot_count: The number of slots per node for this worker
and the line:
resources.set_env_to_slots("CUDA_VISIBLE_DEVICES")
will set the environment variable CUDA_VISIBLE_DEVICES to match the assigned slots (partitions on the node).
Here, slots refers to the resource sets enumerated on a node (starting with zero). If a resource set has more than one node, then each node is considered to have slot zero.
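To make the slot-to-device mapping concrete, here is a conceptual sketch of what setting an environment variable from a worker's slots amounts to. The helper slots_to_env is hypothetical and is not libEnsemble's implementation of set_env_to_slots; it only illustrates the idea that a worker's slot numbers become the list of visible device IDs.

```python
import os


def slots_to_env(env_var, slots):
    """Export the worker's slot IDs as a comma-separated device list."""
    value = ",".join(str(s) for s in slots)
    os.environ[env_var] = value
    return value


# A worker holding slots 2 and 3 on its node would see only those two devices.
visible = slots_to_env("CUDA_VISIBLE_DEVICES", [2, 3])
```

Any application launched from this process afterwards inherits the variable, which is why the variable is set before the submit call in the example above.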
Note that if you are on a system that automatically assigns free GPUs on the node, then setting CUDA_VISIBLE_DEVICES is not necessary unless you want to ensure workers are strictly bound to GPUs. For example, on many SLURM systems (e.g., Perlmutter), you can use --gpus-per-task=1.
Such options can be added to the exctr.submit call via the extra_args argument:
task = exctr.submit( ... extra_args="--gpus-per-task=1" )
Alternative environment variables can be simply substituted in the set_env_to_slots call.
On some systems, CUDA_VISIBLE_DEVICES may be overridden by other assignments.
Compiling the Forces application
First, compile the forces application under the
Compile forces.x using one of the GPU build lines in build_forces.sh or similar for your platform.
Running the example
If you have been allocated two nodes, each with four GPUs, then assign eight workers. For example:
python run_libe_forces.py --comms local --nworkers 8
Note that if you are running one persistent generator that does not require resources, then assign nine workers and fix the number of resource sets in your calling script:
libE_specs["num_resource_sets"] = 8
See zero resource workers for more ways to express this.
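The worker counts above follow from simple arithmetic, sketched below. The helper worker_layout and its parameter names are hypothetical and only restate the reasoning in this section: one resource set per GPU, plus one extra worker for each generator that needs no resources.

```python
def worker_layout(nodes, gpus_per_node, zero_resource_gens=0):
    """Return (nworkers, num_resource_sets) for one GPU per simulation worker."""
    # One resource set (and one simulation worker) per GPU in the allocation.
    num_resource_sets = nodes * gpus_per_node
    # Generators that need no resources add workers but not resource sets.
    nworkers = num_resource_sets + zero_resource_gens
    return nworkers, num_resource_sets


# Two nodes with four GPUs each, no persistent generator.
sims_only = worker_layout(nodes=2, gpus_per_node=4)

# Same allocation with one persistent generator that requires no resources.
with_gen = worker_layout(nodes=2, gpus_per_node=4, zero_resource_gens=1)
```

Here sims_only corresponds to --nworkers 8, and with_gen corresponds to nine workers with libE_specs["num_resource_sets"] = 8.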
Changing the number of GPUs per worker
If you want to have two GPUs per worker on the same system (four GPUs per node), you could assign only four workers. You will see that two GPUs are used for each forces run.
The same code can be used when varying worker resources. In this case, you may add an integer field called resource_sets as a gen_specs["out"] in your calling script. In the generator function, assign the resource_sets field of H for each point generated. For example, if a larger simulation requires two MPI tasks (and two GPUs), set the field to 2 for that sim_id in the generator function.
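A minimal sketch of a generator output that carries a resource_sets request per point is shown below. The function name sample_with_resource_sets, the field layout, and the threshold rule are hypothetical and chosen only for illustration; a real generator would pick resource_sets according to the needs of each simulation.

```python
import numpy as np

# Hypothetical gen output layout: the input value plus a resource_sets request.
GEN_OUT_RSETS = [("x", float, (1,)), ("resource_sets", int)]


def sample_with_resource_sets(particle_counts, large_threshold):
    """Build generator output, giving larger simulations two resource sets."""
    H_o = np.zeros(len(particle_counts), dtype=GEN_OUT_RSETS)
    H_o["x"][:, 0] = particle_counts

    # Larger simulations get two resource sets (two MPI ranks and two GPUs);
    # all others get one.
    H_o["resource_sets"] = np.where(H_o["x"][:, 0] >= large_threshold, 2, 1)
    return H_o


batch = sample_with_resource_sets([1000, 2500, 3000], large_threshold=2000)
```

In this sketch, the first point requests one resource set and the other two request two each.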
The calling script run_libe_forces.py contains alternative commented-out lines for a variable resource example. Search for “Uncomment for var resources”.
In this case, the simulator function will work unmodified, assigning one CPU processor and one GPU to each MPI rank.
Further guidance on varying the resources assigned to workers can be found under the resource manager section.
Checking GPU usage
The output of forces.x will say if it has run on the host or device. When running
libEnsemble, this can be found under the
You can check you are running forces on the GPUs as expected by using profiling tools and/or by using a monitoring utility. For NVIDIA GPUs, for example, the Nsight profiler is generally available and can be run from the command line. To simply run forces.x stand-alone you could run:
nsys profile --stats=true mpirun -n 2 ./forces.x
To use the nvidia-smi monitoring tool while running, open another shell on the node where your code is running (this may entail using ssh to get on to the node), and run:
watch -n 0.1 nvidia-smi
This will update GPU usage information every 0.1 seconds. You would need to ensure the code runs for long enough to register on the monitor, so let’s try 100,000 particles:
mpirun -n 2 ./forces.x 100000
It is also recommended that you run without the profiler when using the nvidia-smi utility.
This can also be used when running via libEnsemble, so long as you are on the node where the forces applications are being run.
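If you would rather capture utilization from a script than watch it interactively, a sketch along the following lines may help. The helper names are hypothetical. It assumes that nvidia-smi is on the path and that its CSV query output contains one "index, utilization" row per GPU with integer values; devices that report a non-numeric utilization are not handled.

```python
import shutil
import subprocess


def parse_gpu_utilization(csv_text):
    """Parse 'index, utilization' CSV rows into a {gpu_index: percent} dict."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        usage[int(index)] = int(util)
    return usage


def query_gpu_utilization():
    """Return utilization per GPU, or None if nvidia-smi is not available."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(result.stdout)


# Parsing a sample of the assumed output format (two GPUs, one busy, one idle).
sample = parse_gpu_utilization("0, 97\n1, 0\n")
```

Calling query_gpu_utilization() in a loop on the compute node gives a simple log of which GPUs are busy while the forces runs are in progress.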
Alternative monitoring utilities include
rocm-smi (AMD) and
The latter does not need the watch command.
Example submission script
A simple example batch script for Perlmutter that runs 8 workers on 2 nodes:
#!/bin/bash
#SBATCH -J libE_small_test
#SBATCH -A <myproject>
#SBATCH -C gpu
#SBATCH --time 10
#SBATCH --nodes 2

export MPICH_GPU_SUPPORT_ENABLED=1
export SLURM_EXACT=1
export SLURM_MEM_PER_NODE=0

python run_libe_forces.py --comms local --nworkers 8
Here, SLURM_EXACT and SLURM_MEM_PER_NODE are set to prevent resource conflicts on each node.