Executor - Assign GPUs
This tutorial shows the most portable way to assign tasks (user applications) to the GPU. In the first example, each worker will use one GPU. We assume the workers are on a cluster with CUDA-capable GPUs, and we will assign GPUs by setting the environment variable CUDA_VISIBLE_DEVICES. An equivalent approach can be used with other devices.
This example is based on the simple forces tutorial with a slightly modified simulation function. To compile the forces application to use the GPU, ensure forces.c has the #pragma omp target line uncommented and comment out the equivalent #pragma omp parallel line. Then compile forces.x using one of the GPU build lines in build_forces.sh, or similar for your platform.
The libEnsemble scripts in this example are available under forces_gpu in the libEnsemble repository. Note that at the time of writing the calling script run_libe_forces.py is identical to that in forces_simple. The forces_simf file has slight modifications to assign GPUs.
Simulation function
The sim_f (forces_simf.py) is shown below. The new lines retrieve the worker resources and use them to assign GPUs:
 1 import numpy as np
 2
 3 # To retrieve our MPI Executor and resources instances
 4 from libensemble.executors.executor import Executor
 5 from libensemble.resources.resources import Resources
 6
 7 # Optional status codes to display in libE_stats.txt for each gen or sim
 8 from libensemble.message_numbers import WORKER_DONE, TASK_FAILED
 9
10 def run_forces(H, persis_info, sim_specs, libE_info):
11     calc_status = 0
12
13     # Parse out num particles, from generator function
14     particles = str(int(H["x"][0][0]))
15
16     # app arguments: num particles, timesteps, also using num particles as seed
17     args = particles + " " + str(10) + " " + particles
18
19     # Retrieve our MPI Executor instance and resources
20     exctr = Executor.executor
21     resources = Resources.resources.worker_resources
22
23     resources.set_env_to_slots("CUDA_VISIBLE_DEVICES")
24
25     # Submit our forces app for execution. Block until the task starts.
26     task = exctr.submit(
27         app_name="forces",
28         app_args=args,
29         num_nodes=resources.local_node_count,
30         procs_per_node=resources.slot_count,
31         wait_on_start=True,
32     )
33
34     # Block until the task finishes
35     task.wait(timeout=60)
36
37     # Stat file to check for bad runs
38     statfile = "forces.stat"
39
40     # Try loading final energy reading, set the sim's status
41     try:
42         data = np.loadtxt(statfile)
43         final_energy = data[-1]
44         calc_status = WORKER_DONE
45     except Exception:
46         final_energy = np.nan
47         calc_status = TASK_FAILED
48
49     # Define our output array, populate with energy reading
50     outspecs = sim_specs["out"]
51     output = np.zeros(1, dtype=outspecs)
52     output["energy"][0] = final_energy
53
54     # Return final information to worker, for reporting to manager
55     return output, persis_info, calc_status
The above code can be run on most systems, and will assign a GPU to each worker, so long as the number of workers is chosen to fit the resources.
The resource attributes used are:
local_node_count: The number of nodes available to this worker
slot_count: The number of slots per node for this worker
and the line:
resources.set_env_to_slots("CUDA_VISIBLE_DEVICES")
will set the environment variable CUDA_VISIBLE_DEVICES to match the assigned slots (partitions on the node).
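For illustration, in the two-node, four-GPUs-per-node, eight-worker run described below, each worker holds one slot on one node (so local_node_count is 1 and slot_count is 1), and set_env_to_slots exports the matching GPU index to the worker's environment, which launched tasks inherit. The slot number here is hypothetical:
# Hypothetical example: this worker has been assigned slot 2 on its node
resources.set_env_to_slots("CUDA_VISIBLE_DEVICES")
# Tasks launched by this worker then inherit:
#   CUDA_VISIBLE_DEVICES=2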
Note that if you are on a system that automatically assigns free GPUs on the node, then setting CUDA_VISIBLE_DEVICES is not necessary unless you want to ensure workers are strictly bound to GPUs. For example, on many SLURM systems (e.g., Perlmutter), you can use --gpus-per-task=1. Such options can be added to the exctr.submit call as extra_args:
task = exctr.submit(
    ...
    extra_args="--gpus-per-task=1"
)
Alternative environment variables can simply be substituted in set_env_to_slots (e.g., HIP_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES).
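For example, on an AMD (ROCm) platform, the substitution might look like this (a sketch; which variable to use depends on your system and runtime):
# Assumes an AMD GPU system that honors ROCR_VISIBLE_DEVICES
resources.set_env_to_slots("ROCR_VISIBLE_DEVICES")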
Note
On some systems, CUDA_VISIBLE_DEVICES may be overridden by other assignments such as --gpus-per-task=1.
Running the example
If you have been allocated two nodes, each with four GPUs, then assign eight workers. For example:
python run_libe_forces.py --comms local --nworkers 8
If you are running one persistent generator that does not require resources, then assign nine workers and set the following in your calling script:
libE_specs['zero_resource_workers'] = [1]
Alternatively, if you do not care which worker runs the generator, you could fix the number of resource sets:
libE_specs['num_resource_sets'] = 8
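As a minimal sketch (assuming the dictionary-style libE_specs shown above, set in the calling script before libE is called), the two options look like this, with nine workers in either case:
# Option 1: dedicate worker 1 to the resource-free persistent generator
libE_specs["zero_resource_workers"] = [1]

# Option 2 (instead): fix eight resource sets; any worker may run the generator
# libE_specs["num_resource_sets"] = 8
and then run:
python run_libe_forces.py --comms local --nworkers 9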
Changing the number of GPUs per worker
If you want to have two GPUs per worker on the same system (four GPUs per node), you could assign only four workers and change line 23 to:
resources.set_env_to_slots("CUDA_VISIBLE_DEVICES", multiplier=2)
In this case there are two GPUs per worker (and per slot).
Varying resources
The same code can be used when varying worker resources. In this case, you may choose to set one worker per GPU (as we did originally). Then add resource_sets as a gen_specs['out'] in your calling script, and assign the resource_sets field of H for each point generated, as sketched below.
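A minimal sketch of what this might look like (assuming numpy imported as np; the other gen_specs['out'] fields and the values assigned are illustrative only):
# In the calling script: add resource_sets to the generator output fields
gen_specs["out"] = [
    ("x", float, (1,)),        # as in the forces example
    ("resource_sets", int),    # resource sets (here, GPUs) requested per point
]

# In the generator function: assign resources for each generated point
H_o = np.zeros(4, dtype=gen_specs["out"])
H_o["x"][:, 0] = [1000, 2000, 3000, 4000]   # e.g., particle counts
H_o["resource_sets"] = [1, 1, 2, 2]         # e.g., vary GPUs per simulation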
In this case the above code would still work, assigning one CPU processor and one GPU to each rank. If you want to have one rank with multiple GPUs, then change source lines 29/30 accordingly.
Further guidance on varying the resources assigned to workers can be found under the resource manager.
Checking GPU usage
You can check that forces is running on the GPUs as expected by using profiling tools and/or a monitoring utility. For NVIDIA GPUs, for example, the Nsight profiler is generally available and can be run from the command line. To run forces.x stand-alone, you could use:
nsys profile --stats=true mpirun -n 2 ./forces.x
To use the nvidia-smi monitoring tool while the code is running, open another shell on the node where your code is running (this may entail using ssh to get onto the node), and run:
watch -n 0.1 nvidia-smi
This will update GPU usage information every 0.1 seconds. You need to ensure the code runs for long enough to register on the monitor, so let's try 100,000 particles:
mpirun -n 2 ./forces.x 100000
It is also recommended that you run without the profiler when using the nvidia-smi utility.
This can also be used when running via libEnsemble, so long as you are on the node where the forces applications are being run. As the default number of particles in the forces example is 1000, you will need to increase the particle count to see clear GPU usage in the live monitor. For example, change line 14 to multiply the particles by 10:
# Parse out num particles, from generator function
particles = str(int(H["x"][0][0]) * 10)
Alternative monitoring utilities include rocm-smi (AMD) and intel_gpu_top (Intel). The latter does not need the watch command.
Example submission script
A simple example batch script for Perlmutter that runs 8 workers on 2 nodes:
#!/bin/bash
#SBATCH -J libE_small_test
#SBATCH -A <myproject_g>
#SBATCH -C gpu
#SBATCH --time 10
#SBATCH --nodes 2

export MPICH_GPU_SUPPORT_ENABLED=1
export SLURM_EXACT=1
export SLURM_MEM_PER_NODE=0

python run_libe_forces.py --comms local --nworkers 8
where SLURM_EXACT and SLURM_MEM_PER_NODE are set to prevent resource conflicts on each node.