Balsam Executor - Remote apps

This module launches and controls tasks via Balsam, and can submit tasks from any machine, to any machine running a Balsam site.

At this time, access to Balsam is limited to those with valid organizational logins authenticated through Globus.

To initialize a Balsam executor:

from libensemble.executors.balsam_executor import BalsamExecutor
exctr = BalsamExecutor()

Note that Balsam ApplicationDefinition classes are registered instead of application paths, and that submitted tasks will not run until Balsam reserves compute resources at a site:

from libensemble.executors.balsam_executor import BalsamExecutor
from balsam.api import ApplicationDefinition

class HelloApp(ApplicationDefinition):
    site = "my-balsam-site"
    command_template = "/path/to/hello.app {{ my_name }}"

exctr = BalsamExecutor()
exctr.register_app(HelloApp, app_name="hello")

exctr.submit_allocation(
    site_id=999,  # corresponds to "my-balsam-site"; look it up with: balsam site ls
    num_nodes=4,  # Total number of nodes requested for *all jobs*
    wall_time_min=30,
    queue="debug-queue",
    project="my-project",
)

Submitting tasks for registered apps works much as it does with the other executors, except that Balsam expects application arguments as a dictionary. The dictionary keys must match the template fields in each ApplicationDefinition’s command_template:

args = {"my_name": "World"}

task = exctr.submit(
    app_name="hello",
    app_args=args,
    num_procs=4,
    num_nodes=1,
    procs_per_node=4,
)
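
Because the keys of app_args have to line up with the template fields, it can help to check them before submitting. The following is a minimal sketch, not part of libEnsemble or Balsam: check_app_args is a hypothetical helper, and it assumes fields are written in the simple "{{ name }}" form shown above.

```python
import re

# Matches simple template fields such as "{{ my_name }}".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def check_app_args(command_template, app_args):
    """Return (missing, unexpected) field names for a template/args pair.

    Hypothetical helper for illustration; not a libEnsemble or Balsam API.
    """
    fields = set(PLACEHOLDER.findall(command_template))
    missing = sorted(fields - set(app_args))
    unexpected = sorted(set(app_args) - fields)
    return missing, unexpected


template = "/path/to/hello.app {{ my_name }}"

print(check_app_args(template, {"my_name": "World"}))  # keys match the template
print(check_app_args(template, {"name": "World"}))     # wrong key name
```

An empty pair of lists means every template field has a matching key and no extra keys were passed.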

Application instances submitted by the executor to the Balsam service will get scheduled within the reserved resource allocation. Each Balsam app can only be submitted to the site specified in its class definition. Output files will appear in the Balsam site’s data directory, but can be automatically transferred back via Globus.

Reading Balsam’s documentation is highly recommended.

class libensemble.executors.balsam_executor.BalsamExecutor

Bases: Executor

Wraps the Balsam service. Via this Executor, Balsam Jobs can be submitted to Balsam sites, either local or on remote machines.

Note

Task kills are not configurable in the Balsam executor.

__init__()

Instantiate a new BalsamExecutor instance.

Return type:

None

register_app(BalsamApp, app_name=None, calc_type=None, desc=None, precedent=None)

Registers a Balsam ApplicationDefinition to libEnsemble. This class instance must have a site and command_template specified. See the Balsam docs for information on other optional fields.

Parameters:
  • BalsamApp (ApplicationDefinition object) – A Balsam ApplicationDefinition instance.

  • app_name (str, Optional) – Name to identify this application.

  • calc_type (str, Optional) – Calculation type: Set this application as the default 'sim' or 'gen' function.

  • desc (str, Optional) – Description of this application.

  • precedent (str | None) –

Return type:

None

submit_allocation(site_id, num_nodes, wall_time_min, job_mode='mpi', queue='local', project='local', optional_params={}, filter_tags={}, partitions=[])

Submits a Balsam BatchJob machine allocation request to Balsam. Corresponding Balsam applications with a matching site can be submitted to this allocation. Effectively a wrapper for BatchJob.objects.create().

Parameters:
  • site_id (int) – The corresponding site_id for a Balsam site. Retrieve via balsam site ls

  • num_nodes (int) – The number of nodes to request from a machine with a running Balsam site

  • wall_time_min (int) – The number of walltime minutes to request for the BatchJob allocation

  • job_mode (str, Optional) – Either "serial" or "mpi". Default: "mpi"

  • queue (str, Optional) – Specifies the queue from which the BatchJob should request nodes. Default: "local"

  • project (str, Optional) – Specifies the project that should be charged for the requested machine time. Default: "local"

  • optional_params (dict, Optional) – Additional system-specific parameters to set, based on fields in Balsam’s job-template.sh

  • filter_tags (dict, Optional) – Directs the resultant BatchJob to only run Jobs with matching tags.

  • partitions (List[dict], Optional) – Divides the allocation into multiple launcher partitions with differing job_mode, num_nodes, filter_tags, etc. See the Balsam docs.

Return type:

The corresponding BatchJob object.
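
As a rough illustration of the partitions parameter, the sketch below builds keyword arguments for submit_allocation with two partitions. The per-partition keys (job_mode, num_nodes, filter_tags) are the ones named in the parameter description; the tag names and values are made up, and the check that partitions fit inside the allocation is an assumed sanity check, not a documented Balsam rule.

```python
def partition_nodes_fit(num_nodes, partitions):
    """True if the partitions together request no more than num_nodes.

    Assumed sanity check for illustration; consult the Balsam docs for
    the actual constraints on partitions.
    """
    return sum(p["num_nodes"] for p in partitions) <= num_nodes


allocation_kwargs = {
    "site_id": 999,  # placeholder site ID, as in the earlier example
    "num_nodes": 4,
    "wall_time_min": 30,
    "partitions": [
        # Hypothetical tags: three nodes for simulations, one for post-processing.
        {"job_mode": "mpi", "num_nodes": 3, "filter_tags": {"kind": "sim"}},
        {"job_mode": "serial", "num_nodes": 1, "filter_tags": {"kind": "post"}},
    ],
}

fits = partition_nodes_fit(
    allocation_kwargs["num_nodes"], allocation_kwargs["partitions"]
)
print(fits)
```

With a real executor, these arguments would be passed as exctr.submit_allocation(**allocation_kwargs).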

revoke_allocation(allocation, timeout=60)

Terminates a Balsam BatchJob machine allocation remotely. Balsam apps should no longer be submitted to this allocation. To avoid wasting machine time, call this after libEnsemble completes or once the BatchJob is no longer needed.

Parameters:
  • allocation (BatchJob object) – A BatchJob with a corresponding machine allocation that should be cancelled.

  • timeout (int, Optional) – Timeout and warn user after this many seconds of attempting to revoke an allocation.

Return type:

bool
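
One way to make sure an allocation is always revoked is to pair submit_allocation and revoke_allocation in a context manager. This is a minimal sketch under stated assumptions: reserved_allocation is a hypothetical helper, and FakeExecutor is a stand-in so the example runs without a Balsam site. Only the two method names and the timeout parameter come from the documentation above.

```python
from contextlib import contextmanager


@contextmanager
def reserved_allocation(exctr, **allocation_kwargs):
    """Request an allocation and revoke it when the block exits.

    Hypothetical helper; exctr is any object with the documented
    submit_allocation and revoke_allocation methods.
    """
    allocation = exctr.submit_allocation(**allocation_kwargs)
    try:
        yield allocation
    finally:
        # Runs even if the body raises, so the reservation is not left idle.
        exctr.revoke_allocation(allocation, timeout=60)


class FakeExecutor:
    """Stand-in for illustration only; not a libEnsemble class."""

    def __init__(self):
        self.revoked = []

    def submit_allocation(self, **kwargs):
        return dict(kwargs)  # a real executor returns a BatchJob object

    def revoke_allocation(self, allocation, timeout=60):
        self.revoked.append(allocation)
        return True


exctr = FakeExecutor()
with reserved_allocation(exctr, site_id=999, num_nodes=4, wall_time_min=30):
    pass  # submit tasks and run the ensemble here
```

With a real BalsamExecutor in place of FakeExecutor, the same pattern would release the allocation whether the ensemble finishes normally or fails partway through.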

submit(calc_type=None, app_name=None, app_args=None, num_procs=None, num_nodes=None, procs_per_node=None, max_tasks_per_node=None, machinefile=None, gpus_per_rank=0, transfers={}, workdir='', dry_run=False, wait_on_start=False, extra_args={}, tags={})

Initializes and submits a Balsam Job based on a registered ApplicationDefinition and requested resources. A corresponding libEnsemble Task object is returned.

Parameters:
  • calc_type (str, Optional) – The calculation type: 'sim' or 'gen'. Only used if app_name is not supplied, in which case the default sim or gen application is used.

  • app_name (str, Optional) – The application name. Must be supplied if calc_type is not.

  • app_args (dict) – A dictionary of options that correspond to fields to template in the ApplicationDefinition’s command_template field.

  • num_procs (int, Optional) – The total number of MPI ranks on which to submit the task

  • num_nodes (int, Optional) – The number of nodes on which to submit the task

  • procs_per_node (int, Optional) – The processes per node for this task

  • max_tasks_per_node (int) – Instructs Balsam to schedule at most this many Jobs per node.

  • machinefile (str, Optional) – Name of a machinefile for this task to use. Unused by Balsam

  • gpus_per_rank (int, Optional) – Number of GPUs to reserve for each MPI rank

  • transfers (dict, Optional) – A Job-specific Balsam transfers dictionary that corresponds with an ApplicationDefinition transfers field. See the Balsam docs for more information.

  • workdir (str) – Specifies a name for the Job’s output directory within the Balsam site’s data directory. Default: libe_workflow

  • dry_run (bool, Optional) – Whether this is a dry run. If so, no task is launched; instead the run line is printed to the logger (at INFO level)

  • wait_on_start (bool, Optional) – Whether to block, and wait for task to be polled as RUNNING (or other active/end state) before continuing

  • extra_args (dict, Optional) – Additional arguments to supply to MPI runner.

  • tags (dict, Optional) – Additional tags to organize the Job or restrict which BatchJobs run it.

Returns:

task – The launched task object

Return type:

BalsamTask

Note that Balsam Jobs are often sent to machines other than the one where libEnsemble is running. How libEnsemble’s resource manager has divided local resources among workers therefore does not limit the resources that can be requested for a Balsam Job.

class libensemble.executors.balsam_executor.BalsamTask(app=None, app_args=None, workdir=None, stdout=None, stderr=None, workerid=None)

Bases: Task

Wraps a Balsam Job from the Balsam service.

The same attributes and query routines are implemented. Use task.process to refer to the matching Balsam Job initialized by the BalsamExecutor, with every Balsam Job method invocable on it. Otherwise, libEnsemble task methods like poll() can be used directly.

Parameters:
  • app (Application | None) –

  • app_args (dict) –

  • workdir (str | None) –

  • stdout (str) –

  • stderr (str) –

  • workerid (int) –

poll()

Polls and updates the status attributes of the supplied task. Requests Job information from Balsam service.

Return type:

None

wait(timeout=None)

Waits on completion of the task or raises TimeoutExpired.

Status attributes of task are updated on completion.

Parameters:

timeout (int or float, Optional) – Time in seconds after which a TimeoutExpired exception is raised. If not set, then simply waits until completion. Note that the task is not automatically killed on timeout.

Return type:

None

kill()

Cancels the task. Killing a running task is unsupported by Balsam at this time.

Return type:

None
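
The methods above can be combined into a manual polling loop when a task should be checked at intervals rather than blocked on with wait(). This is a minimal sketch: wait_for_task is a hypothetical helper, FakeTask is a stand-in so the example runs without Balsam, and the finished attribute is assumed to be one of the status attributes inherited from the base Task class.

```python
import time


def wait_for_task(task, timeout=60.0, interval=1.0):
    """Poll until the task reports finished or the timeout elapses.

    Returns True if the task finished in time, False otherwise.
    Hypothetical helper; assumes task has poll() and a finished attribute.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task.poll()  # for a BalsamTask, this queries the Balsam service
        if task.finished:
            return True
        time.sleep(interval)
    return False


class FakeTask:
    """Stand-in that finishes on the third poll; not a libEnsemble class."""

    def __init__(self):
        self.polls = 0
        self.finished = False

    def poll(self):
        self.polls += 1
        self.finished = self.polls >= 3


task = FakeTask()
done = wait_for_task(task, timeout=5.0, interval=0.01)
print(done, task.polls)
```

Since each poll() of a real BalsamTask requests Job information from the Balsam service, a polling interval of a second or more is likely a sensible starting point.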