Executor - Assign GPUs

This tutorial shows the most portable way to assign tasks (user applications) to the GPU. The libEnsemble scripts in this example are available under forces_gpu in the libEnsemble repository.

This example is based on the simple forces tutorial with a slightly modified simulation function (to assign GPUs) and a greatly increased number of particles (allows live GPU usage to be viewed).

In the first example, each worker will use one GPU. The code assigns the GPUs available to each worker, using the method appropriate to the system. This works on systems with NVIDIA, AMD, and Intel GPUs without modifying the scripts.

A video demonstrates running this example on Frontier.

Simulation function

The sim_f (forces_simf.py) is as follows. The lines that differ from the simple forces example are discussed below the listing:

 1import numpy as np
 2
 3# Optional status codes to display in libE_stats.txt for each gen or sim
 4from libensemble.message_numbers import TASK_FAILED, WORKER_DONE
 5
 6# Optional - to print GPU settings
 7from libensemble.tools.test_support import check_gpu_setting
 8
 9
10def run_forces(H, persis_info, sim_specs, libE_info):
11    """Launches the forces MPI app and auto-assigns ranks and GPU resources.
12
13    Assigns one MPI rank to each GPU assigned to the worker.
14    """
15
16    calc_status = 0
17
18    # Parse out num particles, from generator function
19    particles = str(int(H["x"][0][0]))
20
21    # app arguments: num particles, timesteps, also using num particles as seed
22    args = particles + " " + str(10) + " " + particles
23
24    # Retrieve our MPI Executor
25    exctr = libE_info["executor"]
26
27    # Submit our forces app for execution.
28    task = exctr.submit(
29        app_name="forces",
30        app_args=args,
31        auto_assign_gpus=True,
32        match_procs_to_gpus=True,
33    )
34
35    # Block until the task finishes
36    task.wait()
37
38    # Optional - prints GPU assignment (method and numbers)
39    check_gpu_setting(task, assert_setting=False, print_setting=True)
40
41    # Try loading final energy reading, set the sim's status
42    statfile = "forces.stat"
43    try:
44        data = np.loadtxt(statfile)
45        final_energy = data[-1]
46        calc_status = WORKER_DONE
47    except Exception:
48        final_energy = np.nan
49        calc_status = TASK_FAILED
50
51    # Define our output array, populate with energy reading
52    output = np.zeros(1, dtype=sim_specs["out"])
53    output["energy"] = final_energy
54
55    # Return final information to worker, for reporting to manager
56    return output, persis_info, calc_status

Lines 31-32 tell the executor to use the GPUs assigned to this worker, and to match processors (MPI ranks) to GPUs.

The user can also set num_procs and num_gpus in the generator, as in the forces_gpu_var_resources example, and skip lines 31-32.
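For illustration, below is a minimal sketch of a generator producing such per-simulation resource requests. The dtype here is a hypothetical stand-in for gen_specs["out"] with num_procs and num_gpus fields added; it is not the forces_gpu_var_resources generator itself.

import numpy as np

# Sketch only: generator output array carrying per-simulation resource requests
rng = np.random.default_rng()
batch_size = 4
out_dtype = [("x", float, (1,)), ("num_procs", int), ("num_gpus", int)]  # mirrors gen_specs["out"]

H_o = np.zeros(batch_size, dtype=out_dtype)
H_o["x"] = rng.uniform(1000, 3000, (batch_size, 1))  # particle counts
H_o["num_gpus"] = 1 + np.arange(batch_size) % 4      # e.g., request 1-4 GPUs per simulation
H_o["num_procs"] = H_o["num_gpus"]                   # one MPI rank per GPU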

Line 39 simply prints out how the GPUs were assigned (the method and GPU numbers). If this is not as expected, platform configuration can be provided, as sketched below.
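A hedged sketch of supplying platform configuration via libE_specs in the calling script is shown below. The field names follow libEnsemble's platform specification at the time of writing, but the values (core and GPU counts, the environment-variable method, the MPI runner) are placeholders to adapt to your system; check the platform configuration documentation for your version.

# Sketch: describe the node layout and GPU-setting method explicitly
# if autodetection does not give the expected assignment.
libE_specs["platform_specs"] = {
    "cores_per_node": 64,                        # placeholder values - set for your system
    "gpus_per_node": 4,
    "gpu_setting_type": "env",                   # assign GPUs via an environment variable
    "gpu_setting_name": "ROCR_VISIBLE_DEVICES",  # variable to set (example for AMD GPUs)
    "mpi_runner": "srun",
}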

While this is sufficient for most users, note that it is possible to query the resources assigned to this worker (nodes and partitions of nodes), and use this information however you want.

How to query this worker’s resources

The example shown below implements a similar, but less portable, version of the above (excluding output lines).

 1import numpy as np
 2
 3# To retrieve our MPI Executor and resources instances
 4from libensemble.executors.executor import Executor
 5from libensemble.resources.resources import Resources
 6
 7# Optional status codes to display in libE_stats.txt for each gen or sim
 8from libensemble.message_numbers import WORKER_DONE, TASK_FAILED
 9
10
11def run_forces(H, _, sim_specs):
12    calc_status = 0
13
14    # Parse out num particles, from generator function
15    particles = str(int(H["x"][0][0]))
16
17    # app arguments: num particles, timesteps, also using num particles as seed
18    args = particles + " " + str(10) + " " + particles
19
20    # Retrieve our MPI Executor instance and resources
21    exctr = Executor.executor
22    resources = Resources.resources.worker_resources
23
24    resources.set_env_to_slots("CUDA_VISIBLE_DEVICES")
25
26    # Submit our forces app for execution. Block until the task starts.
27    task = exctr.submit(
28        app_name="forces",
29        app_args=args,
30        num_nodes=resources.local_node_count,
31        procs_per_node=resources.slot_count,
32        wait_on_start=True,
33    )
34
35    # Block until the task finishes
36    task.wait()
37
38    # Stat file to check for bad runs
39    statfile = "forces.stat"
40
41    # Read final energy
42    data = np.loadtxt(statfile)
43    final_energy = data[-1]
44
45    # Define our output array, populate with energy reading
46    output = np.zeros(1, dtype=sim_specs["out"])
47    output["energy"][0] = final_energy
48
49    # Return energy reading to the worker
50    return output

The above code will assign a GPU to each worker on CUDA-capable systems, so long as the number of workers is chosen to fit the resources.

If you want one MPI rank to use multiple GPUs, change source lines 30 and 31 accordingly; a sketch follows.
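For instance, a sketch of that modification (one rank per node, with CUDA_VISIBLE_DEVICES already listing all of this worker's slots via set_env_to_slots) might look like:

# Sketch: one MPI rank per node; the rank sees all GPUs assigned to this worker
task = exctr.submit(
    app_name="forces",
    app_args=args,
    num_nodes=resources.local_node_count,
    procs_per_node=1,  # single rank per node instead of one rank per slot
    wait_on_start=True,
)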

The resource attributes used are:

  • local_node_count: The number of nodes available to this worker

  • slot_count: The number of slots per node for this worker

and the line:

resources.set_env_to_slots("CUDA_VISIBLE_DEVICES")

will set the environment variable CUDA_VISIBLE_DEVICES to match the assigned slots (partitions on the node).
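As an illustration only (not the library's implementation), if a worker had been assigned slots 2 and 3 on a four-GPU node, the effect would be roughly equivalent to:

import os

my_slots = [2, 3]  # hypothetical slots assigned to this worker
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(s) for s in my_slots)  # -> "2,3"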

Note

slots refers to the resource sets enumerated on a node (starting with zero). If a resource set has more than one node, then each node is considered to have slot zero.

Note that if you are on a system that automatically assigns free GPUs on the node, then setting CUDA_VISIBLE_DEVICES is not necessary unless you want to ensure workers are strictly bound to GPUs. For example, on many SLURM systems, you can use --gpus-per-task=1 (e.g., Perlmutter). Such options can be added to the exctr.submit call as extra_args:

task = exctr.submit(
...
    extra_args="--gpus-per-task=1"
)

Alternative environment variables can simply be substituted in set_env_to_slots (e.g., HIP_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES).
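For example, on a system with AMD GPUs:

resources.set_env_to_slots("HIP_VISIBLE_DEVICES")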

Note

On some systems, CUDA_VISIBLE_DEVICES may be overridden by other assignment methods such as --gpus-per-task=1.

Compiling the Forces application

First, compile the forces application under the forces_app directory.

Compile forces.x using one of the GPU build lines in build_forces.sh or similar for your platform.

Running the example

As an example, if you have been allocated two nodes, each with four GPUs, then assign nine workers: eight run simulations and the extra worker runs the persistent generator.

For example:

python run_libe_forces.py --comms local --nworkers 9

See zero-resource workers for more ways to express this.
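One such option (a brief sketch; see the zero-resource workers documentation for the exact settings available) is to mark the generator's worker explicitly in libE_specs:

libE_specs["zero_resource_workers"] = [1]  # worker 1 (the persistent generator) is given no resources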

Changing the number of GPUs per worker

If you want to have two GPUs per worker on the same system (with four GPUs per node), you could assign only four workers. You will see that two GPUs are used for each forces run.
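Following the same pattern as before, and assuming the persistent generator again uses one extra worker, this would be, for example:

python run_libe_forces.py --comms local --nworkers 5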

Varying resources

A variant of this example where you may specify any number of processors and GPUs for each simulation is given in the forces_gpu_var_resources example.

In this example, when simulations are parameterized in the generator function, the gen_specs["out"] field num_gpus is set for each simulation (based on the number of particles). These values will automatically be used for each simulation (they do not need to be passed as a sim_specs["in"]).

Further guidance on varying the resources assigned to workers can be found under the resource manager section.

Multiple applications

Another variant of this example, forces_multi_app, has two applications, one that uses GPUs, and another that only uses CPUs. Dynamic resource management can manage both types of resources and assign these to the same nodes concurrently, for maximum efficiency.
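For reference, registering two applications under different names with the MPI executor looks like the following sketch (the executable paths and app names here are placeholders, not the exact ones used in forces_multi_app):

from libensemble.executors.mpi_executor import MPIExecutor

exctr = MPIExecutor()
exctr.register_app(full_path="/path/to/forces_gpu.x", app_name="forces_gpu")  # GPU application
exctr.register_app(full_path="/path/to/forces_cpu.x", app_name="forces_cpu")  # CPU-only application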

Checking GPU usage

The output of forces.x reports whether it ran on the host or the device. When running via libEnsemble, this output can be found in the simulation directories (under the ensemble directory).

You can check that forces is running on the GPUs as expected by using profiling tools and/or a monitoring utility. For NVIDIA GPUs, for example, the Nsight Systems profiler is generally available and can be run from the command line. To profile a stand-alone run of forces.x:

nsys profile --stats=true mpirun -n 2 ./forces.x

To use the nvidia-smi monitoring tool while running, open another shell where your code is running (this may entail using ssh to get on to the node), and run:

watch -n 0.1 nvidia-smi

This will update GPU usage information every 0.1 seconds. You would need to ensure the code runs for long enough to register on the monitor, so let’s try 100,000 particles:

mpirun -n 2 ./forces.x 100000

It is also recommended that you run without the profiler when using the nvidia-smi utility.

This can also be used when running via libEnsemble, so long as you are on the node where the forces applications are being run.

Alternative monitoring utilities include rocm-smi (AMD) and intel_gpu_top (Intel); the latter does not need the watch command.

Example submission script

A simple example batch script for Perlmutter that runs eight simulation workers (plus one worker for the persistent generator) on two nodes:

 1#!/bin/bash
 2#SBATCH -J libE_small_test
 3#SBATCH -A <myproject>
 4#SBATCH -C gpu
 5#SBATCH --time 10
 6#SBATCH --nodes 2
 7
 8export MPICH_GPU_SUPPORT_ENABLED=1
 9export SLURM_EXACT=1
10
11python run_libe_forces.py --comms local --nworkers 9

where SLURM_EXACT is set to help prevent resource conflicts on each node.