


libEnsemble: A Python Library for Dynamic Ensemble-Based Computations¶
David Bindel, Stephen Hudson, Jeffrey Larson, John-Luke Navarro and Stefan Wild
A PDF poster version of this content is available on figshare.
Overview¶
libEnsemble is a Python library for coordinating the concurrent evaluation of dynamic ensembles of calculations. The library is developed to use massively parallel resources to accelerate the solution of design, decision, and inference problems and to expand the class of problems that can benefit from increased concurrency levels.
libEnsemble aims for the following:
Extreme scaling
Resilience/fault tolerance
Monitoring/killing of tasks (and recovering resources)
Portability and flexibility
Exploitation of persistent data/control flow
libEnsemble is most commonly used to coordinate large numbers of parallel instances (ensembles) of simulations at huge scales.
Using libEnsemble¶
The user selects or supplies a gen_f
function that generates simulation
input and a sim_f
function that performs and monitors simulations. The user
parameterizes these functions and initiates libEnsemble in a calling script.
Examples and templates of such scripts and functions are included in the library.

For example, the gen_f
may contain an optimization routine to generate new
simulation parameters on-the-fly based on results from previous sim_f
simulations.
Other potential use-cases include:
Generator Functions: |
Simulation Functions: |
---|---|
Parameter estimation |
Particle-in-cell |
Surrogate models |
Subsurface flow |
Sensitivity analysis |
PETSc simulations |
Design optimization |
DFT simulations |
Supervised learning |
Quantum chemistry |
Manager and Workers¶
libEnsemble employs a manager/worker scheme that can communicate through MPI,
Python’s multiprocessing, or TCP. Each worker
can control and monitor any level of work, from small sub-node tasks to huge
many-node simulations. The manager allocates workers to asynchronously execute
gen_f
generation functions and sim_f
simulation functions based on
produced output, directed by a provided alloc_f
allocation function.

Flexible Run Mechanisms¶
libEnsemble has been developed, supported, and tested on systems of highly varying scales, from laptops to machines with thousands of compute nodes. On multi-node systems, there are two basic modes of configuring libEnsemble to run and launch tasks (user applications) on available nodes.
Distributed: Workers are distributed across allocated nodes and launch tasks in-place. Workers share nodes with their applications.

Centralized: Workers run on one or more dedicated nodes and launch tasks to the remaining allocated nodes.

Note
Dividing up workers and tasks to allocated nodes is highly configurable. Multiple workers (and thus multiple tasks or user function instances) can be assigned to a single node. Alternatively, multiple nodes may be assigned to a single worker and each routine it performs.
Executor Module¶
An Executor interface is provided to ensure libEnsemble routines that coordinate user applications are portable, resilient, and flexible. The Executor automatically detects allocated nodes and available cores and can split up tasks if resource data isn’t supplied.
The Executor is agnostic of both the job launch/management system and selected
manager/worker communication method on each machine. The main functions are
submit()
, poll()
, and kill()
.
On machines that do not support launches from compute nodes, libEnsemble’s Executor can interface with the Balsam library, which functions as a proxy job launcher that maintains and submits jobs from a database on front end launch nodes.

Supported Research Machines¶
libEnsemble is tested and supported on the following high-performance research machines:
Machine |
Location |
Facility |
Info |
---|---|---|---|
IBM AC922, IBM POWER9 nodes w/ NVIDIA Volta GPUs |
|||
Cray XC40, Intel KNL nodes |
|||
Cray XC40, Intel Haswell & KNL nodes |
|||
HPE, Intel Haswell nodes, NVIDIA GPU nodes |
Running at Scale¶
OPAL Simulations
ALCF/Theta (Cray XC40) with Balsam, at Argonne National Laboratory
1030 node allocation, 511 workers, MPI communications.
2044 2-node simulations
Object Oriented Parallel Accelerator Library (OPAL) simulation functions.
![]() Histogram of completed and killed simulations, binned by run time.¶ |
![]() Total number of Balsam-launched applications running over time.¶ |
Try libEnsemble Online¶
Try libEnsemble online with two Jupyter notebook examples.
The first notebook demonstrates the basics of parallel ensemble calculations with libEnsemble through a Simple Functions Tutorial. The second notebook, an Executor Tutorial, contains an example similar to most use-cases: simulation functions that launch and coordinate user applications.
Note
The Executor Tutorial notebook may take a couple minutes to initiate.