Spock/Crusher

Spock and Crusher are early-access testbed systems located at the Oak Ridge Leadership Computing Facility (OLCF).

Each Spock compute node consists of one 64-core AMD EPYC “Rome” CPU and four AMD MI100 GPUs.

Each Crusher compute node consists of one 64-core AMD EPYC CPU and four AMD MI250X GPUs. Each MI250X contains two Graphics Compute Dies (GCDs), giving eight GCDs per node.

These systems use the Slurm scheduler to submit jobs from the login nodes to run on the compute nodes.

Configuring Python and Installation

Begin by loading the python module:

module load cray-python

Job Submission

Slurm is used for job submission and management. libEnsemble runs on the compute nodes using either Python's multiprocessing or mpi4py.

If running more than one worker per node, setting the following environment variables is recommended to prevent resource conflicts:

export SLURM_EXACT=1
export SLURM_MEM_PER_NODE=0
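
The variables only take effect if they are exported, so that processes started from the shell inherit them. The following is a minimal sketch of how one might confirm this before starting a run. The `sh -c` child process is only a stand-in for whatever program is launched from the shell, and the name `slurm_env` is illustrative.

```shell
# Export the recommended settings in the current shell
export SLURM_EXACT=1
export SLURM_MEM_PER_NODE=0

# A child process (a stand-in for the program launched later) inherits
# the exported values; capture what it sees and print it
slurm_env=$(sh -c 'echo "SLURM_EXACT=$SLURM_EXACT SLURM_MEM_PER_NODE=$SLURM_MEM_PER_NODE"')
echo "$slurm_env"
```

If either value is empty in the printed line, the variable was set without `export` or in a different shell.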

Installing libEnsemble and dependencies

libEnsemble can be installed via pip:

pip install libensemble
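
As a quick sanity check after installation, one could try importing the package with the same `python` that will run the ensemble. This is a sketch that assumes the package exposes a `__version__` attribute; the variable name `libe_status` is illustrative.

```shell
# Try to import libEnsemble and print its version; fall back to a hint
# if the import fails (for example, if pip installed into a different Python)
libe_status=$(python -c "import libensemble; print(libensemble.__version__)" 2>/dev/null \
  || echo "libensemble is not importable in this environment")
echo "$libe_status"
```

If the fallback message appears, check that the cray-python module is loaded in the current shell.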

Example

The following steps run the forces_gpu tutorial on Spock or Crusher.

To obtain the example, clone the libEnsemble repository. Only the forces sub-directory is needed:

git clone https://github.com/Libensemble/libensemble
cd libensemble/libensemble/tests/scaling_tests/forces/forces_app

To compile forces, load the following modules in addition to the cray-python module:

module load rocm
module load craype-accel-amd-gfx90a # (craype-accel-amd-gfx908 on Spock)
cc -DGPU -I${ROCM_PATH}/include -L${ROCM_PATH}/lib -lamdhip64 -fopenmp -O3 -o forces.x forces.c

Now change to the forces_gpu directory:

cd ../forces_gpu

Next, request an interactive session on one node:

salloc --nodes=1 -A <project_id> --time=00:10:00

Then in the session run:

python run_libe_forces.py --comms local --nworkers 4

To monitor GPU usage, open another terminal window, ssh into the compute node running the session, and run:

module load rocm
watch -n 0.1 rocm-smi
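
As an alternative to the interactive session, the same run could be submitted as a batch job. The following is a minimal sketch, not a tested configuration. It writes a batch script whose file name (`submit_forces.sh`) and job name are hypothetical, and it reuses only the modules, environment variables, and run line shown earlier. The `<project_id>` placeholder must be replaced before submitting, and site-specific options such as a partition may also be required.

```shell
# Write a batch script that mirrors the interactive steps above.
# The quoted 'EOF' keeps the contents literal (no variable expansion here).
cat > submit_forces.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=libE_forces
#SBATCH --account=<project_id>
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# Same modules as used for the interactive run
module load cray-python
module load rocm

# Recommended when running more than one worker per node
export SLURM_EXACT=1
export SLURM_MEM_PER_NODE=0

# Four workers using local (multiprocessing) communications
python run_libe_forces.py --comms local --nworkers 4
EOF
```

Once the placeholder is filled in, the script would be submitted from the forces_gpu directory with `sbatch submit_forces.sh`, since a batch job starts in the directory it was submitted from by default.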