Spock/Crusher
Spock and Crusher are early-access testbed systems located at the Oak Ridge Leadership Computing Facility (OLCF).
Each Spock compute node consists of one 64-core AMD EPYC “Rome” CPU and four AMD MI100 GPUs.
Each Crusher compute node contains one 64-core AMD EPYC CPU and four AMD MI250X GPUs (eight Graphics Compute Dies).
These systems use the Slurm scheduler to submit jobs from the login nodes to run on the compute nodes.
Configuring Python and Installation
Begin by loading the Python module:
module load cray-python
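Optionally, you can create a virtual environment on top of cray-python so that libEnsemble and its dependencies are kept isolated. This is a common pattern rather than a requirement of this guide, and the environment path below is a placeholder:
python -m venv /path/to/libe_env    # placeholder path
. /path/to/libe_env/bin/activate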
Job Submission
Slurm is used for job submission and management. libEnsemble runs on the compute nodes using either Python's multiprocessing or mpi4py.
If running more than one worker per node, the following is recommended to prevent resource conflicts:
export SLURM_EXACT=1
export SLURM_MEM_PER_NODE=0
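For reference, a complete batch submission might look like the following sketch. The job name, script name, and project ID are placeholders, and the run line matches the example later on this page (it assumes the script is submitted from the forces_gpu directory):
#!/bin/bash
#SBATCH -J libe_forces          # placeholder job name
#SBATCH -A <project_id>
#SBATCH --nodes=1
#SBATCH --time=00:10:00

module load cray-python

# Recommended when running more than one worker per node
export SLURM_EXACT=1
export SLURM_MEM_PER_NODE=0

# Run libEnsemble with local (multiprocessing) comms and four workers
python run_libe_forces.py --comms local --nworkers 4
Submit the script with sbatch, e.g. sbatch submit_libe.sh (placeholder script name).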
Installing libEnsemble and Dependencies
libEnsemble can be installed via pip:
pip install libensemble
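As a quick sanity check (not required by the tutorial), you can import the package and print its version; this assumes the installed libensemble package exposes __version__, which recent releases do:
python -c "import libensemble; print(libensemble.__version__)"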
Example
This example shows how to run the forces_gpu tutorial on Spock or Crusher.
To obtain the example, git clone libEnsemble (only the forces sub-directory is needed):
git clone https://github.com/Libensemble/libensemble
cd libensemble/libensemble/tests/scaling_tests/forces/forces_app
To compile forces, load the following modules (in addition to cray-python) and build:
module load rocm
module load craype-accel-amd-gfx90a # (craype-accel-amd-gfx908 on Spock)
cc -DGPU -I${ROCM_PATH}/include -L${ROCM_PATH}/lib -lamdhip64 -fopenmp -O3 -o forces.x forces.c
Now go to the forces_gpu directory:
cd ../forces_gpu
Next, grab an interactive session on one node:
salloc --nodes=1 -A <project_id> --time=00:10:00
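Since the run below uses four workers on one node, the resource-conflict exports from the Job Submission section above can also be set inside the interactive session:
export SLURM_EXACT=1
export SLURM_MEM_PER_NODE=0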
Then run:
python run_libe_forces.py --comms local --nworkers 4
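This runs the forces application on four GPUs, one per worker. On Crusher, where each node presents its eight Graphics Compute Dies as eight GPUs, you could use eight workers instead (assuming, as in the forces_gpu tutorial, one worker per GPU):
python run_libe_forces.py --comms local --nworkers 8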
To see GPU usage, ssh into the node from another window and run:
module load rocm
watch -n 0.1 rocm-smi