Balsam Executor - Remote apps

This module launches and controls tasks through Balsam. Most notably, it can submit tasks from any machine to any machine running a Balsam site.

At this time, access to Balsam is limited to those with valid organizational logins authenticated through Globus.

To create a Balsam executor, the calling script should contain:

from libensemble.executors import BalsamExecutor
exctr = BalsamExecutor()

There are two key differences between this executor and libEnsemble’s others: Balsam ApplicationDefinition instances are registered instead of paths, and task submissions will not run until Balsam reserves compute resources at a site.

This process may resemble:

from libensemble.executors import BalsamExecutor
from balsam.api import ApplicationDefinition

class HelloApp(ApplicationDefinition):
    site = "my-balsam-site"
    command_template = "/path/to/hello.app {{ my_name }}"

exctr = BalsamExecutor()
exctr.register_app(HelloApp, app_name="hello")

exctr.submit_allocation(
    site_id=999,  # corresponds to "my-balsam-site", found via ``balsam site ls``
    num_nodes=4,  # Total number of nodes requested for *all jobs*
    wall_time_min=30,
    queue="debug-queue",
    project="my-project",
)

Submitting registered apps works much as it does with the other executors, except that Balsam expects application arguments in dictionary form. The dictionary keys must match the template fields in each ApplicationDefinition’s command_template field:

args = {"my_name": "World"}

task = exctr.submit(
    app_name="hello",
    app_args=args,
    num_procs=4,
    num_nodes=1,
    procs_per_node=4,
)
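
Because the keys of app_args must line up with the template fields, it can help to compare the two before submitting. The following is a minimal sketch using only the standard library. The helpers template_fields and check_app_args are hypothetical names introduced here, and the regular expression only approximates the ``{{ field }}`` syntax shown above; it is not how Balsam itself renders templates.

```python
import re

# Matches simple "{{ name }}" placeholders. This is an approximation of the
# templating syntax shown above, used for illustration only.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_fields(command_template):
    """Return the set of placeholder names found in a command template."""
    return set(PLACEHOLDER.findall(command_template))


def check_app_args(command_template, app_args):
    """Return (missing, unexpected) field names for a given app_args dict."""
    expected = template_fields(command_template)
    supplied = set(app_args)
    return sorted(expected - supplied), sorted(supplied - expected)


template = "/path/to/hello.app {{ my_name }}"
print(check_app_args(template, {"my_name": "World"}))  # ([], [])
print(check_app_args(template, {"name": "World"}))     # (['my_name'], ['name'])
```

An empty pair means every template field has a value and no extra keys were supplied.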

Application instances submitted by the executor to the Balsam service will get scheduled within the reserved resource allocation. Each Balsam app can only be submitted to the site specified in its class definition. Output files will appear in the Balsam site’s data directory, but can be automatically transferred back via Globus.

Reading Balsam’s documentation is highly recommended.

class balsam_executor.BalsamExecutor

Bases: Executor

Inherits from Executor and wraps the Balsam service. Via this Executor, Balsam Jobs can be submitted to Balsam sites, either local or on remote machines.

Note

Task kills are not configurable in the Balsam executor.

__init__()

Instantiate a new BalsamExecutor instance.

register_app(BalsamApp, app_name, calc_type=None, desc=None)

Registers a Balsam ApplicationDefinition to libEnsemble. This class instance must have a site and command_template specified. See the Balsam docs for information on other optional fields.

Parameters
  • BalsamApp (ApplicationDefinition object) – A Balsam ApplicationDefinition instance.

  • app_name (String, optional) – Name to identify this application.

  • calc_type (String, optional) – Calculation type: Set this application as the default 'sim' or 'gen' function.

  • desc (String, optional) – Description of this application

revoke_allocation(allocation)

Terminates a Balsam BatchJob machine allocation remotely. Balsam apps should no longer be submitted to this allocation. This is best run after libEnsemble completes, or once the BatchJob is no longer needed, to avoid wasting machine time.

Parameters

allocation (BatchJob object) – a BatchJob with a corresponding machine allocation that should be cancelled.
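
One way to make sure an allocation is always released is to wrap the submission in try/finally. The sketch below is an illustration rather than a prescribed pattern: run_in_allocation is a hypothetical helper, exctr is assumed to be a BalsamExecutor with the app already registered, and the queue, project, and resource values are placeholders.

```python
def run_in_allocation(exctr, site_id, app_name, app_args, timeout=None):
    """Reserve nodes, run one registered app, then always release the nodes.

    Hypothetical helper: `exctr` is assumed to be a BalsamExecutor on which
    `app_name` has already been registered.
    """
    batch_job = exctr.submit_allocation(
        site_id=site_id,
        num_nodes=1,
        wall_time_min=30,
        queue="debug-queue",   # placeholder queue name
        project="my-project",  # placeholder project name
    )
    try:
        task = exctr.submit(
            app_name=app_name,
            app_args=app_args,
            num_procs=4,
            num_nodes=1,
            procs_per_node=4,
        )
        task.wait(timeout=timeout)
        return task
    finally:
        # Runs whether the task completed, timed out, or raised an error.
        exctr.revoke_allocation(batch_job)
```

Calling it might look like ``run_in_allocation(exctr, 999, "hello", {"my_name": "World"}, timeout=600)``.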

submit(calc_type=None, app_name=None, app_args=None, num_procs=None, num_nodes=None, procs_per_node=None, max_tasks_per_node=None, machinefile=None, gpus_per_rank=0, transfers={}, workdir='', dry_run=False, wait_on_start=False, extra_args={}, tags={})

Initializes and submits a Balsam Job based on a registered ApplicationDefinition and requested resources. A corresponding libEnsemble Task object is returned.

calc_type: String, optional

The calculation type: 'sim' or 'gen'. Only used if app_name is not supplied, in which case the default sim or gen application is used.

app_name: String, optional

The application name. Must be supplied if calc_type is not.

app_args: dict

A dictionary of options that correspond to fields to template in the ApplicationDefinition’s command_template field.

num_procs: int, optional

The total number of MPI ranks on which to submit the task

num_nodes: int, optional

The number of nodes on which to submit the task

procs_per_node: int, optional

The processes per node for this task

max_tasks_per_node: int

Instructs Balsam to schedule at most this many Jobs per node.

machinefile: string, optional

Name of a machinefile for this task to use. Unused by Balsam.

gpus_per_rank: int, optional

Number of GPUs to reserve for each MPI rank

transfers: dict, optional

A Job-specific Balsam transfers dictionary that corresponds with an ApplicationDefinition transfers field. See the Balsam docs for more information.

workdir: String

Specifies a name for the Job’s output directory within the Balsam site’s data directory. Default: libe_workflow

dry_run: boolean, optional

Whether this is a dry run. If so, no task is launched; instead, the runline is printed to the logger (at INFO level).

wait_on_start: boolean, optional

Whether to block, and wait for task to be polled as RUNNING (or other active/end state) before continuing

extra_args: dict, optional

Additional arguments to supply to MPI runner.

tags: dict, optional

Additional tags to organize the Job or restrict which BatchJobs run it.

Returns

  • task (obj: Task) – The launched task object

Note that since Balsam Jobs are often sent to entirely different machines than where libEnsemble is running, how libEnsemble’s resource manager has divided local resources among workers doesn’t impact what resources can be requested for a Balsam Job running on an entirely different machine.
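
Since these resource requests are not checked against libEnsemble’s local resources, a quick consistency check before calling submit() may catch mistakes early. The sketch below assumes that num_procs should equal num_nodes × procs_per_node, as it does in the submit() example earlier in this section. Balsam may accept other combinations, and resolve_task_resources is a hypothetical helper, not part of libEnsemble or Balsam.

```python
def resolve_task_resources(num_procs=None, num_nodes=None, procs_per_node=None):
    """Fill in one missing value and reject inconsistent requests.

    Hypothetical helper. Assumes num_procs == num_nodes * procs_per_node.
    """
    values = (num_procs, num_nodes, procs_per_node)
    if sum(v is not None for v in values) < 2:
        raise ValueError("supply at least two of num_procs, num_nodes, procs_per_node")
    if any(v is not None and v < 1 for v in values):
        raise ValueError("resource counts must be positive integers")

    # Derive whichever value was left out from the other two.
    if num_procs is None:
        num_procs = num_nodes * procs_per_node
    elif num_nodes is None:
        num_nodes = num_procs // procs_per_node
    elif procs_per_node is None:
        procs_per_node = num_procs // num_nodes

    # Catches both explicit mismatches and uneven divisions above.
    if num_procs != num_nodes * procs_per_node:
        raise ValueError("num_procs must equal num_nodes * procs_per_node")

    return {
        "num_procs": num_procs,
        "num_nodes": num_nodes,
        "procs_per_node": procs_per_node,
    }


resources = resolve_task_resources(num_nodes=1, procs_per_node=4)
print(resources)  # {'num_procs': 4, 'num_nodes': 1, 'procs_per_node': 4}
```

The returned dictionary can then be unpacked into the call, for example ``exctr.submit(app_name="hello", app_args=args, **resources)``.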

submit_allocation(site_id, num_nodes, wall_time_min, job_mode='mpi', queue='local', project='local', optional_params={}, filter_tags={}, partitions=[])

Submits a Balsam BatchJob machine allocation request to Balsam. Corresponding Balsam applications with a matching site can be submitted to this allocation. Effectively a wrapper for BatchJob.objects.create().

Parameters
  • site_id (int) – The corresponding site_id for a Balsam site. Retrieve via balsam site ls

  • num_nodes (int) – The number of nodes to request from a machine with a running Balsam site

  • wall_time_min (int) – The number of walltime minutes to request for the BatchJob allocation

  • job_mode (String, optional) – Either "serial" or "mpi". Default: "mpi"

  • queue (String, optional) – Specifies the queue from which the BatchJob should request nodes. Default: "local"

  • project (String, optional) – Specifies the project that should be charged for the requested machine time. Default: "local"

  • optional_params (dict, optional) – Additional system-specific parameters to set, based on fields in Balsam’s job-template.sh

  • filter_tags (dict, optional) – Directs the resultant BatchJob to only run Jobs with matching tags.

  • partitions (list of dicts, optional) – Divides the allocation into multiple launcher partitions, with differing job_mode, num_nodes, filter_tags, etc. See the Balsam docs.

Return type

The corresponding BatchJob object.
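
The filter_tags parameter here pairs with the tags parameter of submit(). As an illustration, the sketch below passes the same tag dictionary to both calls so that the allocation only runs Jobs carrying those tags. The function submit_tagged is a hypothetical helper, exctr is assumed to be a BalsamExecutor with the app already registered, and the queue, project, and resource values are placeholders.

```python
def submit_tagged(exctr, site_id, app_name, app_args, tags):
    """Request an allocation restricted to `tags`, then submit a matching Job.

    Hypothetical helper: `exctr` is assumed to be a BalsamExecutor on which
    `app_name` has already been registered.
    """
    batch_job = exctr.submit_allocation(
        site_id=site_id,
        num_nodes=2,
        wall_time_min=30,
        queue="debug-queue",     # placeholder queue name
        project="my-project",    # placeholder project name
        filter_tags=dict(tags),  # the allocation only runs Jobs with these tags
    )
    task = exctr.submit(
        app_name=app_name,
        app_args=app_args,
        num_procs=4,
        num_nodes=1,
        procs_per_node=4,
        tags=dict(tags),         # the Job carries the same tags
    )
    return batch_job, task
```

A call might look like ``submit_tagged(exctr, 999, "hello", {"my_name": "World"}, {"campaign": "demo"})``, where the tag name and value are arbitrary.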

class balsam_executor.BalsamTask(app=None, app_args=None, workdir=None, stdout=None, stderr=None, workerid=None)

Bases: Task

Wraps a Balsam Job from the Balsam service.

The same attributes and query routines are implemented. Use task.process to refer to the matching Balsam Job initialized by the BalsamExecutor, with every Balsam Job method invocable on it. Otherwise, libEnsemble task methods like poll() can be used directly.

poll()

Polls and updates the status attributes of the supplied task. Requests Job information from Balsam service.

wait(timeout=None)

Waits on completion of the task or raises TimeoutExpired.

Status attributes of task are updated on completion.

Parameters

timeout (float) – Time in seconds after which a TimeoutExpired exception is raised
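
When blocking in wait() is not desirable, a bounded polling loop built on poll() is an alternative. The sketch below assumes the task exposes a boolean finished attribute that poll() keeps up to date, as libEnsemble Task objects generally do; poll_until_finished and its interval and max_polls parameters are hypothetical names introduced for illustration.

```python
import time


def poll_until_finished(task, interval=5.0, max_polls=120):
    """Poll `task` until it reports finished, or give up after max_polls.

    Assumes `task.finished` is a boolean refreshed by `task.poll()`.
    Returns True if the task finished, False if polling was abandoned.
    """
    for _ in range(max_polls):
        task.poll()  # requests current Job information from the Balsam service
        if task.finished:
            return True
        time.sleep(interval)
    return False
```

With the defaults, this gives up after roughly ten minutes of polling. Because each poll() contacts the Balsam service, a longer interval is kinder to the service than a tight loop.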

kill()

Cancels the supplied task. Killing is unsupported at this time.