Balsam Executor - Remote apps
This module launches and controls tasks via Balsam. Tasks can be submitted from any machine to any machine running a Balsam site.
At this time, access to Balsam is limited to those with valid organizational logins authenticated through Globus.
To initialize a Balsam executor:
from libensemble.executors.balsam_executor import BalsamExecutor
exctr = BalsamExecutor()
Note that Balsam ApplicationDefinition instances are registered instead of paths, and task submissions will not run until Balsam reserves compute resources at a site:
from libensemble.executors.balsam_executor import BalsamExecutor
from balsam.api import ApplicationDefinition

class HelloApp(ApplicationDefinition):
    site = "my-balsam-site"
    command_template = "/path/to/hello.app {{ my_name }}"

exctr = BalsamExecutor()
exctr.register_app(HelloApp, app_name="hello")

exctr.submit_allocation(
    site_id=999,  # corresponds to "my-balsam-site", found via `balsam site ls`
    num_nodes=4,  # Total number of nodes requested for *all jobs*
    wall_time_min=30,
    queue="debug-queue",
    project="my-project",
)
Task submissions of registered apps resemble those of the other executors, except that Balsam expects application arguments as a dictionary. The dictionary keys must match the template fields in each ApplicationDefinition’s command_template field:
args = {"my_name": "World"}

task = exctr.submit(
    app_name="hello",
    app_args=args,
    num_procs=4,
    num_nodes=1,
    procs_per_node=4,
)
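A mismatch between the argument dictionary and the template fields is easy to introduce, so the requirement above can be checked before submitting. The following is a minimal sketch, not part of libEnsemble or Balsam: `PLACEHOLDER` and `check_app_args` are hypothetical helpers, and the regular expression is a stand-in that only recognizes simple `{{ name }}` placeholders rather than reproducing Balsam's own template rendering.

```python
import re

# Hypothetical helper: matches simple placeholders such as "{{ my_name }}".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def check_app_args(command_template, app_args):
    """Return (missing, unused) field names for a template/args pair."""
    fields = set(PLACEHOLDER.findall(command_template))
    keys = set(app_args)
    # missing: fields in the template with no matching key in app_args
    # unused: keys in app_args that the template never references
    return sorted(fields - keys), sorted(keys - fields)


template = "/path/to/hello.app {{ my_name }}"
missing, unused = check_app_args(template, {"my_name": "World"})
print(missing, unused)
```

Both lists are empty when the dictionary and the template agree, as they do for the `hello` app above.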
Application instances submitted by the executor to the Balsam service will get
scheduled within the reserved resource allocation. Each Balsam app can only be
submitted to the site specified in its class definition. Output files will appear
in the Balsam site’s data directory, but can be automatically transferred back
via Globus.
Reading Balsam’s documentation is highly recommended.
- class libensemble.executors.balsam_executor.BalsamExecutor
Bases: Executor
Wraps the Balsam service. Via this Executor, Balsam Jobs can be submitted to Balsam sites, either local or on remote machines.
Note
Task kills are not configurable in the Balsam executor.
- __init__()
Instantiate a new BalsamExecutor instance.
- Return type:
None
- register_app(BalsamApp, app_name=None, calc_type=None, desc=None, precedent=None)
Registers a Balsam ApplicationDefinition to libEnsemble. This class instance must have a site and command_template specified. See the Balsam docs for information on other optional fields.
- Parameters:
BalsamApp (ApplicationDefinition object) – A Balsam ApplicationDefinition instance.
app_name (str, Optional) – Name to identify this application.
calc_type (str, Optional) – Calculation type: Set this application as the default 'sim' or 'gen' function.
desc (str, Optional) – Description of this application.
precedent (str | None) –
- Return type:
None
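Because a registered app must specify both `site` and `command_template`, a quick pre-registration check can catch an incomplete definition early. This is only a sketch: `REQUIRED_FIELDS` and `missing_app_fields` are hypothetical names, and the two plain classes below are stand-ins for `ApplicationDefinition` subclasses so that the example runs without Balsam installed.

```python
# Attributes the text above says every registered app must specify.
REQUIRED_FIELDS = ("site", "command_template")


def missing_app_fields(app_cls):
    """List required attributes that are absent or empty on app_cls."""
    return [name for name in REQUIRED_FIELDS if not getattr(app_cls, name, None)]


class HelloApp:  # stand-in for an ApplicationDefinition subclass
    site = "my-balsam-site"
    command_template = "/path/to/hello.app {{ my_name }}"


class IncompleteApp:  # stand-in that forgot its command_template
    site = "my-balsam-site"


print(missing_app_fields(HelloApp))
print(missing_app_fields(IncompleteApp))
```

An empty list means both required fields are present; whether Balsam or libEnsemble performs an equivalent check internally is not assumed here.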
- submit_allocation(site_id, num_nodes, wall_time_min, job_mode='mpi', queue='local', project='local', optional_params={}, filter_tags={}, partitions=[])
Submits a Balsam BatchJob machine allocation request to Balsam. Corresponding Balsam applications with a matching site can be submitted to this allocation. Effectively a wrapper for BatchJob.objects.create().
- Parameters:
site_id (int) – The corresponding site_id for a Balsam site. Retrieve via balsam site ls.
num_nodes (int) – The number of nodes to request from a machine with a running Balsam site.
wall_time_min (int) – The number of walltime minutes to request for the BatchJob allocation.
job_mode (str, Optional) – Either "serial" or "mpi". Default: "mpi"
queue (str, Optional) – Specifies the queue from which the BatchJob should request nodes. Default: "local"
project (str, Optional) – Specifies the project that should be charged for the requested machine time. Default: "local"
optional_params (dict, Optional) – Additional system-specific parameters to set, based on fields in Balsam’s job-template.sh
filter_tags (dict, Optional) – Directs the resultant BatchJob to only run Jobs with matching tags.
partitions (List[dict], Optional) – Divides the allocation into multiple launcher partitions, with differing job_mode, num_nodes, filter_tags, etc. See the Balsam docs.
- Return type:
The corresponding BatchJob object.
- revoke_allocation(allocation, timeout=60)
Terminates a Balsam BatchJob machine allocation remotely. Balsam apps should no longer be submitted to this allocation. Best to run after libEnsemble completes, or after this BatchJob is no longer needed. Helps save machine time.
- Parameters:
allocation (BatchJob object) – A BatchJob with a corresponding machine allocation that should be cancelled.
timeout (int, Optional) – Timeout and warn user after this many seconds of attempting to revoke an allocation.
- Return type:
bool
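Since an allocation keeps consuming machine time until it is revoked, one way to pair `submit_allocation()` with `revoke_allocation()` is a context manager. The sketch below is an illustration rather than part of libEnsemble: `balsam_allocation` is a hypothetical helper, and it relies only on the two method signatures documented above.

```python
from contextlib import contextmanager


@contextmanager
def balsam_allocation(exctr, revoke_timeout=60, **allocation_kwargs):
    """Request an allocation and revoke it when the with-block exits."""
    # Keyword arguments are forwarded unchanged, e.g. site_id, num_nodes, wall_time_min.
    batch_job = exctr.submit_allocation(**allocation_kwargs)
    try:
        yield batch_job
    finally:
        # Runs on normal exit and on exceptions, so the reservation is not left open.
        exctr.revoke_allocation(batch_job, timeout=revoke_timeout)
```

With an initialized executor, this could be used as `with balsam_allocation(exctr, site_id=999, num_nodes=4, wall_time_min=30) as batch_job:` around the code that submits tasks and waits for them to finish.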
- submit(calc_type=None, app_name=None, app_args=None, num_procs=None, num_nodes=None, procs_per_node=None, max_tasks_per_node=None, machinefile=None, gpus_per_rank=0, transfers={}, workdir='', dry_run=False, wait_on_start=False, extra_args={}, tags={})
Initializes and submits a Balsam Job based on a registered ApplicationDefinition and requested resources. A corresponding libEnsemble Task object is returned.
- Parameters:
calc_type (str, Optional) – The calculation type: 'sim' or 'gen'. Only used if app_name is not supplied. Uses default sim or gen application.
app_name (str, Optional) – The application name. Must be supplied if calc_type is not.
app_args (dict) – A dictionary of options that correspond to fields to template in the ApplicationDefinition’s command_template field.
num_procs (int, Optional) – The total number of MPI ranks on which to submit the task.
num_nodes (int, Optional) – The number of nodes on which to submit the task.
procs_per_node (int, Optional) – The processes per node for this task.
max_tasks_per_node (int) – Instructs Balsam to schedule at most this many Jobs per node.
machinefile (str, Optional) – Name of a machinefile for this task to use. Unused by Balsam.
gpus_per_rank (int, Optional) – Number of GPUs to reserve for each MPI rank.
transfers (dict, Optional) – A Job-specific Balsam transfers dictionary that corresponds with an ApplicationDefinition transfers field. See the Balsam docs for more information.
workdir (str) – Specifies a name for the Job’s output directory within the Balsam site’s data directory. Default: libe_workflow
dry_run (bool, Optional) – Whether this is a dry run: no task will be launched; instead the runline is printed to the logger (at INFO level).
wait_on_start (bool, Optional) – Whether to block, and wait for task to be polled as RUNNING (or other active/end state) before continuing.
extra_args (dict, Optional) – Additional arguments to supply to MPI runner.
tags (dict, Optional) – Additional tags to organize the Job or restrict which BatchJobs run it.
- Returns:
task – The launched task object
- Return type:
Note that since Balsam Jobs are often sent to entirely different machines than where libEnsemble is running, how libEnsemble’s resource manager has divided local resources among workers doesn’t impact what resources can be requested for a Balsam Job running on an entirely different machine.
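Because the local resource manager does not constrain these requests, it can help to sanity-check `num_procs`, `num_nodes`, and `procs_per_node` before calling `submit()`. The sketch below assumes a uniform layout in which `num_procs == num_nodes * procs_per_node`; that relation is an assumption of this example, not something the executor is documented to enforce, and `complete_resources` is a hypothetical helper.

```python
def complete_resources(num_procs=None, num_nodes=None, procs_per_node=None):
    """Return a consistent dict of the three values, deriving one if it is missing."""
    values = (num_procs, num_nodes, procs_per_node)
    if sum(v is not None for v in values) < 2:
        raise ValueError("supply at least two of num_procs, num_nodes, procs_per_node")
    if any(v is not None and v < 1 for v in values):
        raise ValueError("resource counts must be positive")
    if num_procs is None:
        num_procs = num_nodes * procs_per_node
    elif num_nodes is None:
        num_nodes = num_procs // procs_per_node
    elif procs_per_node is None:
        procs_per_node = num_procs // num_nodes
    # Assumed uniform layout; this also rejects non-divisible inputs such as 5 ranks on 2 nodes.
    if num_procs != num_nodes * procs_per_node:
        raise ValueError("num_procs must equal num_nodes * procs_per_node")
    return {"num_procs": num_procs, "num_nodes": num_nodes, "procs_per_node": procs_per_node}


resources = complete_resources(num_procs=4, num_nodes=1)
print(resources)
```

The returned dictionary uses the same keyword names as `submit()`, so it could be passed along as `exctr.submit(app_name="hello", app_args=args, **resources)`.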
- class libensemble.executors.balsam_executor.BalsamTask(app=None, app_args=None, workdir=None, stdout=None, stderr=None, workerid=None)
Bases: Task
Wraps a Balsam Job from the Balsam service.
The same attributes and query routines are implemented. Use task.process to refer to the matching Balsam Job initialized by the BalsamExecutor, with every Balsam Job method invocable on it. Otherwise, libEnsemble task methods like poll() can be used directly.
- Parameters:
app (Application | None) –
app_args (dict) –
workdir (str | None) –
stdout (str) –
stderr (str) –
workerid (int) –
- poll()
Polls and updates the status attributes of the supplied task. Requests Job information from Balsam service.
- Return type:
None
- wait(timeout=None)
Waits on completion of the task or raises TimeoutExpired.
Status attributes of task are updated on completion.
- Parameters:
timeout (int or float, Optional) – Time in seconds after which a TimeoutExpired exception is raised. If not set, then simply waits until completion. Note that the task is not automatically killed on timeout.
- Return type:
None
- kill()
Cancels the task. Killing a running task is unsupported by Balsam at this time.
- Return type:
None
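Since `wait()` does not kill the task on timeout, a caller that wants a hard deadline can combine `poll()` and `kill()` explicitly. The loop below is a minimal sketch under stated assumptions: `poll_until_finished` is a hypothetical helper, and it assumes the task exposes a boolean `finished` status attribute that `poll()` keeps up to date. As noted above, `kill()` cancels the task but may not stop one that Balsam has already started running.

```python
import time


def poll_until_finished(task, timeout, interval=1.0):
    """Poll task until it finishes; cancel it and return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        task.poll()  # refreshes the task's status attributes
        if task.finished:  # assumed status attribute, see lead-in
            return True
        if time.monotonic() >= deadline:
            break
        time.sleep(interval)
    # Deadline passed without completion: request cancellation.
    task.kill()
    return False
```

A longer `interval` reduces the number of requests sent to the Balsam service, at the cost of noticing completion later.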