libEnsemble: A Python Library for Dynamic Ensemble-Based Computations

David Bindel, Stephen Hudson, Jeffrey Larson, John-Luke Navarro and Stefan Wild

A PDF poster version of this content is available on figshare.

Overview

libEnsemble is a Python library for coordinating the concurrent evaluation of dynamic ensembles of calculations. It is designed to use massively parallel resources to accelerate the solution of design, decision, and inference problems, and to expand the class of problems that can benefit from increased concurrency.

libEnsemble aims for the following:

  • Extreme scaling

  • Resilience/fault tolerance

  • Monitoring/killing of tasks (and recovering resources)

  • Portability and flexibility

  • Exploitation of persistent data/control flow

libEnsemble is most commonly used to coordinate large numbers of parallel instances (ensembles) of simulations at huge scales.

Using libEnsemble

The user selects or supplies a gen_f function that generates simulation input and a sim_f function that performs and monitors simulations. The user parameterizes these functions and initiates libEnsemble in a calling script. Examples and templates of such scripts and functions are included in the library.
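The division of labor described above can be sketched in plain Python. This is a minimal illustration only, not libEnsemble's API: `run_ensemble`, the argument lists of `gen_f` and `sim_f`, and the layout of the `gen_specs` and `sim_specs` dictionaries are all assumptions made for this example.

```python
import random


def gen_f(rng, batch_size, lower, upper):
    """Hypothetical generator: draw uniform random simulation inputs."""
    return [rng.uniform(lower, upper) for _ in range(batch_size)]


def sim_f(x):
    """Hypothetical simulation: a cheap stand-in for an expensive run."""
    return (x - 1.0) ** 2


def run_ensemble(gen_specs, sim_specs, seed=0):
    """Stand-in for the library call made by a calling script (assumed name)."""
    rng = random.Random(seed)
    # The generator is parameterized through the user-supplied options.
    inputs = gen_specs["gen_f"](rng, **gen_specs["user"])
    # Each generated input is passed to the simulation function.
    return [(x, sim_specs["sim_f"](x)) for x in inputs]


# The "calling script" part: choose the functions and parameterize them.
gen_specs = {"gen_f": gen_f, "user": {"batch_size": 8, "lower": -3.0, "upper": 3.0}}
sim_specs = {"sim_f": sim_f}
history = run_ensemble(gen_specs, sim_specs)
```

Here the evaluations run serially; the point is only the split between a function that proposes inputs, a function that evaluates them, and a script that wires the two together.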

For example, the gen_f may contain an optimization routine that generates new simulation parameters on the fly, based on results from previous sim_f simulations.
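As a rough illustration of a generator that reacts to earlier results, the sketch below uses a simple local random search: each new batch is sampled around the best point evaluated so far. The search strategy, the function signatures, and the `optimize` driver are assumptions for this example and are not taken from libEnsemble.

```python
import random


def sim_f(x):
    """Hypothetical simulation whose output we want to minimize."""
    return (x - 2.0) ** 2


def gen_f(history, rng, batch_size=4, radius=1.0):
    """Hypothetical generator: propose points near the best result so far."""
    if not history:
        # No results yet, so sample the whole (assumed) domain.
        return [rng.uniform(-5.0, 5.0) for _ in range(batch_size)]
    best_x, _ = min(history, key=lambda record: record[1])
    return [best_x + rng.uniform(-radius, radius) for _ in range(batch_size)]


def optimize(rounds=20, seed=1):
    """Alternate generation and simulation, feeding results back to gen_f."""
    rng = random.Random(seed)
    history = []  # list of (input, output) pairs
    for _ in range(rounds):
        for x in gen_f(history, rng):
            history.append((x, sim_f(x)))
    return history


history = optimize()
best_x, best_f = min(history, key=lambda record: record[1])
```

Because every batch depends on the accumulated history, the set of simulations is not known up front, which is what makes the ensemble dynamic.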

Other potential use cases include:

Generator Functions    Simulation Functions
---------------------  ---------------------
Parameter estimation   Particle-in-cell
Surrogate models       Subsurface flow
Sensitivity analysis   PETSc simulations
Design optimization    DFT simulations
Supervised learning    Quantum chemistry

Manager and Workers

libEnsemble employs a manager/worker scheme that can communicate through MPI, Python’s multiprocessing, or TCP. Each worker can control and monitor any level of work, from small sub-node tasks to huge many-node simulations. The manager allocates workers to asynchronously execute gen_f generation functions and sim_f simulation functions based on produced output, directed by a provided alloc_f allocation function.
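The manager/worker pattern can be approximated with the standard library. In this sketch, a thread pool stands in for the workers (libEnsemble itself communicates through MPI, multiprocessing, or TCP), and the `alloc_f` rule, its arguments, and the `manager` loop are hypothetical simplifications rather than the library's implementation.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def gen_f(rng, n):
    """Hypothetical generator: produce n new simulation inputs."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def sim_f(x):
    """Hypothetical simulation."""
    return x * x


def alloc_f(pending_inputs, sims_started, sim_max):
    """Hypothetical allocation rule: choose the next job for an idle worker."""
    if sims_started >= sim_max:
        return None  # simulation budget exhausted
    return "sim" if pending_inputs else "gen"


def manager(num_workers=4, sim_max=12, seed=0):
    rng = random.Random(seed)
    pending, results = [], []
    running = {}  # future -> (kind, input)
    sims_started = 0
    gen_running = False
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while True:
            # Give work to idle workers, as directed by alloc_f.
            while len(running) < num_workers:
                kind = alloc_f(pending, sims_started, sim_max)
                if kind == "sim":
                    x = pending.pop()
                    running[pool.submit(sim_f, x)] = ("sim", x)
                    sims_started += 1
                elif kind == "gen" and not gen_running:
                    running[pool.submit(gen_f, rng, num_workers)] = ("gen", None)
                    gen_running = True
                else:
                    break
            if not running:
                break
            # Process results as they arrive instead of waiting for a full batch.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                kind, x = running.pop(fut)
                if kind == "gen":
                    pending.extend(fut.result())
                    gen_running = False
                else:
                    results.append((x, fut.result()))
    return results


results = manager()
```

The manager never blocks on a whole batch: any worker that finishes is immediately eligible for new work, and the allocation rule decides whether that work is generation or simulation.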

Flexible Run Mechanisms

libEnsemble has been developed, supported, and tested on systems of highly varying scales, from laptops to machines with thousands of compute nodes. On multi-node systems, there are two basic modes of configuring libEnsemble to run and launch tasks (user applications) on available nodes.

  • Distributed: Workers are distributed across allocated nodes and launch tasks in-place. Workers share nodes with their applications.

  • Centralized: Workers run on one or more dedicated nodes and launch tasks to the remaining allocated nodes.

Note

Dividing up workers and tasks to allocated nodes is highly configurable. Multiple workers (and thus multiple tasks or user function instances) can be assigned to a single node. Alternatively, multiple nodes may be assigned to a single worker and each routine it performs.
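Both cases in this note, several workers sharing one node and several nodes serving one worker, can be illustrated with a small partitioning helper. This is a sketch of the arithmetic only: `assign_nodes` and the node names are hypothetical, and libEnsemble's own resource management is not shown here.

```python
def assign_nodes(node_names, num_workers):
    """Map each worker to a list of nodes (hypothetical helper).

    With at least as many workers as nodes, workers share nodes.
    With fewer workers than nodes, each worker gets a contiguous group.
    """
    n = len(node_names)
    if n == 0 or num_workers <= 0:
        raise ValueError("need at least one node and one worker")
    if num_workers >= n:
        # Several workers per node: worker w lands on node floor(w * n / workers).
        return [[node_names[w * n // num_workers]] for w in range(num_workers)]
    # Several nodes per worker: slice the node list into near-equal groups.
    return [
        node_names[w * n // num_workers:(w + 1) * n // num_workers]
        for w in range(num_workers)
    ]


shared = assign_nodes(["n0", "n1"], 4)
grouped = assign_nodes(["n0", "n1", "n2", "n3"], 2)
```

In this example, `shared` places two workers on each of the two nodes, while `grouped` gives each of the two workers a pair of nodes.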

Executor Module

An Executor interface is provided to ensure libEnsemble routines that coordinate user applications are portable, resilient, and flexible. The Executor automatically detects allocated nodes and available cores and can split up tasks if resource data isn’t supplied.

The Executor is agnostic of both the job launch/management system and selected manager/worker communication method on each machine. The main functions are submit(), poll(), and kill().
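To make the submit/poll/kill cycle concrete, here is a toy executor built on `subprocess`. It is not libEnsemble's Executor: the class names, method signatures, state strings, and the `run_with_timeout` helper are assumptions chosen for this sketch.

```python
import subprocess
import sys
import time


class Task:
    """Handle for one launched application (hypothetical)."""

    def __init__(self, proc):
        self.proc = proc
        self.state = "RUNNING"

    def poll(self):
        """Refresh and return the task state without blocking."""
        if self.state == "RUNNING":
            code = self.proc.poll()
            if code is not None:
                self.state = "FINISHED" if code == 0 else "FAILED"
        return self.state

    def kill(self):
        """Terminate the application if it is still running."""
        if self.poll() == "RUNNING":
            self.proc.kill()
            self.proc.wait()  # reap the process so its resources are released
            self.state = "KILLED"


class ToyExecutor:
    """Launches applications as local subprocesses (hypothetical)."""

    def submit(self, argv):
        return Task(subprocess.Popen(argv))


def run_with_timeout(executor, argv, timeout, interval=0.05):
    """Submit a task, poll it, and kill it if it exceeds the time limit."""
    task = executor.submit(argv)
    deadline = time.monotonic() + timeout
    while task.poll() == "RUNNING":
        if time.monotonic() > deadline:
            task.kill()
            break
        time.sleep(interval)
    return task.state
```

A simulation function could call something like `run_with_timeout(ToyExecutor(), [sys.executable, "-c", "pass"], timeout=30)` to launch an application and recover its resources if the run stalls.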

On machines that do not support launches from compute nodes, libEnsemble’s Executor can interface with the Balsam library, which functions as a proxy job launcher that maintains and submits jobs from a database on front end launch nodes.

Supported Research Machines

libEnsemble is tested and supported on the following high-performance research machines:

Machine  Location                                              Facility  Info
-------  ----------------------------------------------------  --------  ------------------------------------------------
Summit   Oak Ridge National Laboratory                         OLCF      IBM AC922, IBM POWER9 nodes w/ NVIDIA Volta GPUs
Theta    Argonne National Laboratory                           ALCF      Cray XC40, Intel KNL nodes
Cori     National Energy Research Scientific Computing Center            Cray XC40, Intel Haswell & KNL nodes
Bridges  Pittsburgh Supercomputing Center                                HPE, Intel Haswell nodes, NVIDIA GPU nodes

Running at Scale

OPAL Simulations

  • ALCF/Theta (Cray XC40) with Balsam, at Argonne National Laboratory

  • 1030-node allocation, 511 workers, MPI communications

  • 2044 two-node simulations

  • Object Oriented Parallel Accelerator Library (OPAL) simulation functions
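The figures in this list can be cross-checked with a few lines of arithmetic. The sketch assumes that each worker runs one simulation at a time; the source does not say how the nodes left over after that were used.

```python
nodes_allocated = 1030
workers = 511
nodes_per_sim = 2
total_sims = 2044

# Assumption: one simulation per worker at any moment.
max_concurrent_sims = workers
nodes_busy_at_peak = max_concurrent_sims * nodes_per_sim
nodes_not_running_sims = nodes_allocated - nodes_busy_at_peak
average_sims_per_worker = total_sims / workers

print(nodes_busy_at_peak)        # 1022
print(nodes_not_running_sims)    # 8
print(average_sims_per_worker)   # 4.0
```

Under that assumption, the simulations can occupy at most 1022 of the 1030 nodes at once, and each worker handles four simulations on average.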

Figure: Histogram of completed and killed simulations, binned by run time.

Figure: Total number of Balsam-launched applications running over time.

Try libEnsemble Online

Try libEnsemble online with two Jupyter notebook examples.

The first notebook, a Simple Functions Tutorial, demonstrates the basics of parallel ensemble calculations with libEnsemble. The second notebook, an Executor Tutorial, contains an example closer to most use cases: simulation functions that launch and coordinate user applications.

Note

The Executor Tutorial notebook may take a couple of minutes to start.
