Balsam Executor - Remote apps
This module launches and controls tasks via Balsam. Tasks can be submitted from any machine to any machine running a Balsam site.
![central_balsam](../_images/balsam2.png)
At this time, access to Balsam is limited to those with valid organizational logins authenticated through Globus.
To initialize a Balsam executor:
from libensemble.executors.balsam_executor import BalsamExecutor
exctr = BalsamExecutor()
Note that Balsam ApplicationDefinition instances are registered instead of paths, and task submissions will not run until Balsam reserves compute resources at a site:
from libensemble.executors.balsam_executor import BalsamExecutor
from balsam.api import ApplicationDefinition
# Describe the app: which Balsam site runs it and how its command line is built
class HelloApp(ApplicationDefinition):
    site = "my-balsam-site"
    command_template = "/path/to/hello.app {{ my_name }}"
exctr = BalsamExecutor()
exctr.register_app(HelloApp, app_name="hello")
# Reserve compute resources; submitted tasks will not run until this is granted
exctr.submit_allocation(
    site_id=999, # corresponds to "my-balsam-site", found via ``balsam site ls``
    num_nodes=4, # Total number of nodes requested for *all jobs*
    wall_time_min=30,
    queue="debug-queue",
    project="my-project",
)
Submitting tasks for registered apps works much as it does with the other executors, except that Balsam expects application arguments in dictionary form. The dictionary keys must match the template fields in each ApplicationDefinition’s command_template field:
args = {"my_name": "World"}
task = exctr.submit(
    app_name="hello",
    app_args=args,
    num_procs=4,
    num_nodes=1,
    procs_per_node=4,
)
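Because the keys in app_args have to line up with the placeholders in command_template, it can help to compare the two before submitting. The following is only a minimal sketch of such a pre-check: the helper names template_fields and check_app_args are hypothetical and not part of libEnsemble or Balsam, and the regular expression assumes simple placeholders of the form {{ name }}. Balsam performs its own templating; this does not replace it.

```python
import re

# Matches simple placeholders such as "{{ my_name }}" (an assumption about
# the template style; more elaborate template expressions are not handled).
FIELD_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_fields(command_template):
    """Return the set of placeholder names found in a command template."""
    return set(FIELD_PATTERN.findall(command_template))


def check_app_args(command_template, app_args):
    """Raise ValueError if the app_args keys differ from the template's fields."""
    expected = template_fields(command_template)
    supplied = set(app_args)
    missing = expected - supplied
    unexpected = supplied - expected
    if missing or unexpected:
        raise ValueError(
            f"missing fields: {sorted(missing)}; unexpected fields: {sorted(unexpected)}"
        )


command_template = "/path/to/hello.app {{ my_name }}"
check_app_args(command_template, {"my_name": "World"})  # passes silently
```

Calling check_app_args with a misspelled key, such as {"name": "World"}, raises a ValueError locally, before anything is sent to the Balsam service.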
Application instances submitted by the executor to the Balsam service will get
scheduled within the reserved resource allocation. Each Balsam app can only be
submitted to the site specified in its class definition. Output files will appear
in the Balsam site’s data directory, but can be automatically transferred back
via Globus.
Reading Balsam’s documentation is highly recommended.
- class libensemble.executors.balsam_executor.BalsamExecutor
Bases: Executor
Wraps the Balsam service. Via this Executor, Balsam Jobs can be submitted to Balsam sites, either local or on remote machines.
Note: Task kills are not configurable in the Balsam executor.
- __init__()
Instantiate a new BalsamExecutor instance.
- Return type:
None
- register_app(BalsamApp, app_name=None, calc_type=None, desc=None, precedent=None)
Registers a Balsam ApplicationDefinition to libEnsemble. This class instance must have a site and command_template specified. See the Balsam docs for information on other optional fields.
- Parameters:
BalsamApp (ApplicationDefinition object) – A Balsam ApplicationDefinition instance.
app_name (str, Optional) – Name to identify this application.
calc_type (str, Optional) – Calculation type: Set this application as the default 'sim' or 'gen' function.
desc (str, Optional) – Description of this application.
precedent (str | None)
- Return type:
None
- submit_allocation(site_id, num_nodes, wall_time_min, job_mode='mpi', queue='local', project='local', optional_params={}, filter_tags={}, partitions=[])
Submits a Balsam BatchJob machine allocation request to Balsam. Corresponding Balsam applications with a matching site can be submitted to this allocation. Effectively a wrapper for BatchJob.objects.create().
- Parameters:
site_id (int) – The corresponding site_id for a Balsam site. Retrieve via balsam site ls
num_nodes (int) – The number of nodes to request from a machine with a running Balsam site
wall_time_min (int) – The number of walltime minutes to request for the BatchJob allocation
job_mode (str, Optional) – Either "serial" or "mpi". Default: "mpi"
queue (str, Optional) – Specifies the queue from which the BatchJob should request nodes. Default: "local"
project (str, Optional) – Specifies the project that should be charged for the requested machine time. Default: "local"
optional_params (dict, Optional) – Additional system-specific parameters to set, based on fields in Balsam’s job-template.sh
filter_tags (dict, Optional) – Directs the resultant BatchJob to only run Jobs with matching tags.
partitions (List[dict], Optional) – Divides the allocation into multiple launcher partitions, with differing job_mode, num_nodes, filter_tags, etc. See the Balsam docs.
- Return type:
The corresponding BatchJob object.
- revoke_allocation(allocation, timeout=60)
Terminates a Balsam BatchJob machine allocation remotely. Balsam apps should no longer be submitted to this allocation. Best to run after libEnsemble completes, or after this BatchJob is no longer needed. Helps save machine time.
- Parameters:
allocation (BatchJob object) – a BatchJob with a corresponding machine allocation that should be cancelled.
timeout (int, Optional) – Timeout and warn user after this many seconds of attempting to revoke an allocation.
- Return type:
bool
- submit(calc_type=None, app_name=None, app_args=None, num_procs=None, num_nodes=None, procs_per_node=None, max_tasks_per_node=None, machinefile=None, gpus_per_rank=0, transfers={}, workdir='', dry_run=False, wait_on_start=False, extra_args={}, tags={})
Initializes and submits a Balsam Job based on a registered ApplicationDefinition and requested resources. A corresponding libEnsemble Task object is returned.
- Parameters:
calc_type (str, Optional) – The calculation type: 'sim' or 'gen'. Only used if app_name is not supplied. Uses default sim or gen application.
app_name (str, Optional) – The application name. Must be supplied if calc_type is not.
app_args (dict) – A dictionary of options that correspond to fields to template in the ApplicationDefinition’s command_template field.
num_procs (int, Optional) – The total number of MPI ranks on which to submit the task
num_nodes (int, Optional) – The number of nodes on which to submit the task
procs_per_node (int, Optional) – The processes per node for this task
max_tasks_per_node (int) – Instructs Balsam to schedule at most this many Jobs per node.
machinefile (str, Optional) – Name of a machinefile for this task to use. Unused by Balsam
gpus_per_rank (int, Optional) – Number of GPUs to reserve for each MPI rank
transfers (dict, Optional) – A Job-specific Balsam transfers dictionary that corresponds with an ApplicationDefinition transfers field. See the Balsam docs for more information.
workdir (str) – Specifies a name for the Job’s output directory within the Balsam site’s data directory. Default: libe_workflow
dry_run (bool, Optional) – Whether this is a dry run - no task will be launched; instead runline is printed to logger (at INFO level)
wait_on_start (bool, Optional) – Whether to block, and wait for task to be polled as RUNNING (or other active/end state) before continuing
extra_args (dict, Optional) – Additional arguments to supply to MPI runner.
tags (dict, Optional) – Additional tags to organize the Job or restrict which BatchJobs run it.
- Returns:
task – The launched task object
- Return type:
Task
Note: Because Balsam Jobs are often sent to machines other than the one running libEnsemble, the way libEnsemble’s resource manager divides local resources among workers does not limit the resources that can be requested for a Balsam Job.
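Since revoke_allocation() is best called once an allocation is no longer needed, one way to make sure that happens is to pair it with submit_allocation() in a context manager. The sketch below illustrates the idea under stated assumptions: reserved_allocation is a hypothetical helper, not part of libEnsemble, and StubExecutor is a stand-in that mimics only the two method names documented above so the example runs without a Balsam site.

```python
from contextlib import contextmanager


@contextmanager
def reserved_allocation(exctr, **allocation_kwargs):
    """Request an allocation and revoke it when the with-block exits."""
    allocation = exctr.submit_allocation(**allocation_kwargs)
    try:
        yield allocation
    finally:
        # Runs even if the body raises, so the reservation is not left idle.
        exctr.revoke_allocation(allocation, timeout=60)


class StubExecutor:
    """Stand-in for BalsamExecutor (assumption: only these two methods are mimicked)."""

    def __init__(self):
        self.revoked = []

    def submit_allocation(self, **kwargs):
        return dict(kwargs)  # a real executor returns a BatchJob object

    def revoke_allocation(self, allocation, timeout=60):
        self.revoked.append(allocation)
        return True


exctr = StubExecutor()
with reserved_allocation(exctr, site_id=999, num_nodes=4, wall_time_min=30) as allocation:
    pass  # submit tasks and wait on them here
```

With a real BalsamExecutor in place of the stub, the body of the with-block would hold the submit() calls and any waiting on tasks.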
- class libensemble.executors.balsam_executor.BalsamTask(app=None, app_args=None, workdir=None, stdout=None, stderr=None, workerid=None)
Bases: Task
Wraps a Balsam Job from the Balsam service. The same attributes and query routines are implemented. Use task.process to refer to the matching Balsam Job initialized by the BalsamExecutor, with every Balsam Job method invocable on it. Otherwise, libEnsemble task methods like poll() can be used directly.
- Parameters:
app (Application | None)
app_args (dict)
workdir (str | None)
stdout (str)
stderr (str)
workerid (int)
- poll()
Polls and updates the status attributes of the supplied task. Requests Job information from Balsam service.
- Return type:
None
- wait(timeout=None)
Waits on completion of the task or raises TimeoutExpired. Status attributes of task are updated on completion.
- Parameters:
timeout (int or float, Optional) – Time in seconds after which a TimeoutExpired exception is raised. If not set, then simply waits until completion. Note that the task is not automatically killed on timeout.
- Return type:
None
- kill()
Cancels the task. Killing a running task is unsupported by Balsam at this time.
- Return type:
None
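As noted for wait(), a task is not killed automatically when a timeout expires, so callers who want a hard deadline need to combine polling with an explicit kill(). The following is a rough sketch of that pattern rather than libEnsemble code. StubTask is a stand-in for BalsamTask, and its finished and state attributes and the state strings are assumptions made for illustration. poll_until_done is a hypothetical helper. Keep in mind that Balsam currently does not support killing a task that is already running, so kill() on a real task may have no effect at that stage.

```python
import time


class StubTask:
    """Stand-in for BalsamTask; finishes after a fixed number of polls."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done
        self.finished = False  # assumed status attribute
        self.state = "WAITING"  # assumed status attribute
        self.killed = False

    def poll(self):
        self._remaining -= 1
        if self._remaining <= 0:
            self.finished = True
            self.state = "FINISHED"
        else:
            self.state = "RUNNING"

    def kill(self):
        self.killed = True
        self.finished = True
        self.state = "USER_KILLED"


def poll_until_done(task, timeout, interval=0.01):
    """Poll the task until it finishes; cancel it if the deadline passes.

    Returns the task's final state string.
    """
    deadline = time.monotonic() + timeout
    while not task.finished:
        if time.monotonic() >= deadline:
            task.kill()  # the caller, not wait(), is responsible for this
            break
        task.poll()
        if not task.finished:
            time.sleep(interval)
    return task.state


final_state = poll_until_done(StubTask(polls_until_done=3), timeout=5)
```

For remote Balsam sites, a much longer polling interval than the one used here is likely appropriate, since each poll() requests Job information from the Balsam service.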