This page gives basic instructions on how to use the QCG PilotJob tool in the VECMA environment. For the complete reference documentation of the tool, please go to this link.
A QCG PilotJob is a computing job that acts as a container for a number of subordinate computing jobs. It allows many subordinate jobs to be executed within a single scheduling-system allocation. Submitting a number of separate jobs directly to a scheduling system can result in a long aggregated time to finish, because each job is scheduled independently and waits in a queue. Moreover, the submission of numerous jobs can be restricted or even forbidden by the administrative policies defined on clusters. One can argue that many systems offer job array mechanisms; however, a traditional job array can only run a set of jobs with the same resource requirements, whereas the jobs that make up a multiscale simulation vary in their requirements by nature and therefore need a more flexible solution. On a technical level, QCG PilotJob is exposed as a lightweight service: QCG PilotJob Manager.
QCG PilotJob Manager is implemented in Python and as such can be run with a plain Python interpreter. Typically it is started on computing nodes, where it uses a small part of the allocation. Running it by hand can be quite tedious, so for user convenience QCG PilotJob Manager has been integrated with the QCG middleware. Currently it can easily be used with the help of the QCG-Client tool, which is the approach we recommend and describe later.
QCG PilotJob Manager offers two basic modes of use: static and dynamic. In both modes the user starts the QCG PilotJob Manager service. In the static mode, the way the subordinate jobs are executed is known in advance; in the dynamic mode, the subordinate jobs are added to the QCG PilotJob Manager service on demand and programmatically, using a predefined network API (e.g. from a Python program).
At the moment, QCG PilotJob Manager is deployed on the Eagle cluster at PSNC. Two QCG applications have been registered, named qcg-pm and qcg-pm-client, for execution in the static and dynamic modes respectively.
Example files for the QCG PilotJob system are available in the /home/plgrid-groups/plggvecma/Common/QCGPilotJob/examples/ directory on the qcg.man.poznan.pl machine. To try them, please ssh to the machine:
ssh plg*@qcg.man.poznan.pl
and copy the directory /home/plgrid-groups/plggvecma/Common/QCGPilotJob/examples/ to a location of your choice under your $HOME.
In the rest of this tutorial we assume that the copied example files are located in ~/pj-examples.
Example input files: ~/pj-examples/static
To run QCG PilotJob with QCG-Client in the static mode, only the following two elements are needed: a QCG description file and a file with the PilotJob execution procedure.
The QCG description file should be formatted in the way typical for QCG jobs. For static execution, the application should be set to qcg-pm. An example QCG description file, example-static.qcg, looks as follows:
    #QCG note=example-static
    #QCG host=eagle
    #QCG walltime=PT10M
    #QCG nodes=1:4
    #QCG stage-in-file=example-static.json
    #QCG stage-out-dir=.->eagle.wd.${JOB_ID}
    #QCG application=qcg-pm
    #QCG argument=example-static.json
Here we specified that the calculations should be performed on Eagle, using 4 cores, with the walltime set to 10 minutes, and so on. We also specified that, when the job finishes, its output should be automatically downloaded to the eagle.wd.${JOB_ID} directory.
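A description file like the one above can also be generated programmatically, which may be convenient when many similar runs are prepared. The following is only a minimal sketch: the function name render_qcg_description and its parameters are hypothetical, and only the directive names and values shown in example-static.qcg are used.

```python
def render_qcg_description(note, json_file, host="eagle", walltime="PT10M",
                           nodes="1:4", application="qcg-pm",
                           stage_out=".->eagle.wd.${JOB_ID}"):
    """Return the text of a QCG description file for a static PilotJob run.

    Hypothetical helper: it only reproduces the directives that appear
    in example-static.qcg and knows nothing else about QCG.
    """
    directives = [
        ("note", note),
        ("host", host),
        ("walltime", walltime),        # PT10M is 10 minutes
        ("nodes", nodes),              # 1:4 as in the example
        ("stage-in-file", json_file),  # file copied to the cluster
        ("stage-out-dir", stage_out),  # where results are downloaded
        ("application", application),
        ("argument", json_file),       # argument passed to the application
    ]
    return "\n".join(f"#QCG {key}={value}" for key, value in directives) + "\n"


description = render_qcg_description("example-static", "example-static.json")
print(description)
```

The returned text could then be written to a .qcg file and submitted in the usual way.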
The argument given to the qcg-pm application is the PilotJob execution procedure, constructed according to the JSON-based file interface of QCG PilotJob Manager. To illustrate the basic capabilities of this interface, we uploaded the example-static.json file, which is a simple workflow of two /bin/date invocations separated by a 10-second sleep. Its content is presented below:
    [
      {
        "request": "submit",
        "jobs": [
          {
            "name": "date1",
            "execution": {
              "exec": "/bin/date",
              "stdout": "${jname}.stdout",
              "stderr": "${jname}.stderr"
            },
            "resources": {
              "numCores": { "exact": 1 }
            }
          },
          {
            "name": "sleep",
            "execution": {
              "exec": "/bin/sleep",
              "args": [ "10" ],
              "stdout": "${jname}.stdout",
              "stderr": "${jname}.stderr"
            },
            "resources": {
              "numCores": { "exact": 1 }
            },
            "dependencies": {
              "after": [ "date1" ]
            }
          },
          {
            "name": "date2",
            "execution": {
              "exec": "/bin/date",
              "stdout": "${jname}.stdout",
              "stderr": "${jname}.stderr"
            },
            "resources": {
              "numCores": { "exact": 1 }
            },
            "dependencies": {
              "after": [ "sleep" ]
            }
          }
        ]
      },
      {
        "request": "control",
        "command": "finishAfterAllTasksDone"
      }
    ]
In this example, the first job executed is date1, then sleep, and then date2. Note that the order in which jobs are specified matters when there are dependencies between them: a job must be defined after the jobs it depends on. Thus, sleep has to be defined after date1, and date2 after sleep.
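Because the ordering rule is easy to break when a file is edited by hand, it can be checked before submission. The sketch below is not part of QCG PilotJob; check_dependency_order and make_job are hypothetical helpers that rely only on the fields visible in example-static.json ("request", "jobs", "name", "dependencies", "after").

```python
def check_dependency_order(requests):
    """Return messages for 'after' entries that name a job not defined earlier."""
    problems = []
    seen = set()  # names of jobs defined so far, in file order
    for request in requests:
        if request.get("request") != "submit":
            continue  # e.g. the closing "control" request
        for job in request.get("jobs", []):
            for dep in job.get("dependencies", {}).get("after", []):
                if dep not in seen:
                    problems.append(
                        f"{job['name']}: '{dep}' is not defined earlier"
                    )
            seen.add(job["name"])
    return problems


def make_job(name, after=None):
    """Build a minimal job entry with the fields used in the example."""
    job = {
        "name": name,
        "execution": {"exec": "/bin/date"},
        "resources": {"numCores": {"exact": 1}},
    }
    if after:
        job["dependencies"] = {"after": list(after)}
    return job


good = [{"request": "submit",
         "jobs": [make_job("date1"), make_job("date2", ["date1"])]}]
bad = [{"request": "submit",
        "jobs": [make_job("date2", ["date1"]), make_job("date1")]}]

print(check_dependency_order(good))
print(check_dependency_order(bad))
```

For a real file, the requests list could be obtained with json.load from the standard library and passed to the same function.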
This is a very basic scenario, but the QCG PilotJob system supports the definition of more advanced use cases in a similar way, e.g. scenarios that include loops and/or parallel processing.
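As one illustration of a parallel scenario, the execution procedure can be generated by a script rather than written by hand. The following sketch builds a workflow in which several sleep jobs depend on a single start job and a final job waits for all of them. It is an assumption-laden example: the helpers job and fan_out_workflow are hypothetical, it uses only the JSON fields shown in example-static.json, and it does not rely on any loop construct the file interface itself may offer. Whether the middle jobs actually run concurrently depends on the cores available in the allocation.

```python
import json


def job(name, exec_path, args=None, after=None):
    """Build one job entry using only fields shown in example-static.json."""
    entry = {
        "name": name,
        "execution": {
            "exec": exec_path,
            "stdout": "${jname}.stdout",
            "stderr": "${jname}.stderr",
        },
        "resources": {"numCores": {"exact": 1}},
    }
    if args:
        entry["execution"]["args"] = list(args)
    if after:
        entry["dependencies"] = {"after": list(after)}
    return entry


def fan_out_workflow(n_workers):
    """start -> n_workers independent sleep jobs -> finish."""
    workers = [f"sleep{i}" for i in range(n_workers)]
    jobs = [job("start", "/bin/date")]
    # Each worker depends only on 'start', so workers do not block each other.
    for name in workers:
        jobs.append(job(name, "/bin/sleep", args=["10"], after=["start"]))
    # 'finish' is defined last and depends on every worker.
    jobs.append(job("finish", "/bin/date", after=workers))
    return [
        {"request": "submit", "jobs": jobs},
        {"request": "control", "command": "finishAfterAllTasksDone"},
    ]


workflow = fan_out_workflow(3)
print(json.dumps(workflow, indent=2))
```

The printed JSON could be saved to a file and referenced from the QCG description in place of example-static.json.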
Now that all inputs to QCG PilotJob are defined, the job description can be submitted to QCG with the qcg-sub command:
qcg-sub example-static.qcg
Example input files: ~/pj-examples/dynamic
QCG PilotJob Manager provides a Python API that can be used to dynamically add, delete and manage its sub-jobs via the network interface.
In the basic scenario, we can assume that QCG PilotJob Manager and the program using the API are executed in the same allocation. To make such scenarios easy to start, a dedicated QCG application wrapper called qcg-pm-client has been developed; it can be selected from the QCG-Client tool. This time, instead of the execution procedure, a file with Python code that uses the API should be given as the input argument.
Below we present the content of the example example-dynamic.qcg file:
    #QCG note=example-dynamic
    #QCG host=eagle
    #QCG walltime=PT10M
    #QCG nodes=1:4
    #QCG stage-in-file=example-dynamic.py
    #QCG stage-out-dir=.->eagle.wd.${JOB_ID}
    #QCG application=qcg-pm-client
    #QCG argument=example-dynamic.py
As you can see, the application is set to qcg-pm-client and the input argument points to the example-dynamic.py Python file. From this file, with the support of the qcg.appscheduler.api package, the program can communicate with QCG PilotJob Manager, e.g. to submit new sub-jobs. Below is the example code stored in the example-dynamic.py file. Note that it implements the same workflow as the one defined for the static execution above:
    import zmq

    from qcg.appscheduler.api.manager import Manager
    from qcg.appscheduler.api.job import Jobs

    # switch on debugging (by default in api.log file)
    m = Manager( cfg = { 'log_level': 'DEBUG' } )

    # get available resources
    print("available resources:\n%s\n" % str(m.resources()))

    # submit jobs and save their names in 'ids' list
    ids = m.submit(Jobs().
        add( name = 'date1', exec = '/bin/date',
             stdout = 'date1.stdout', stderr = 'date1.stderr' ).
        add( name = 'sleep', exec = '/bin/sleep', args = [ '10' ],
             stdout = 'sleep.stdout', stderr = 'sleep.stderr',
             after = 'date1' ).
        add( name = 'date2', exec = '/bin/date',
             stdout = 'date2.stdout', stderr = 'date2.stderr',
             after = 'sleep' )
    )

    # list submitted jobs
    print("submitted jobs:\n%s\n" % str(m.list()))

    # wait until submitted jobs finish
    m.wait4(ids)

    # get detailed information about submitted and finished jobs
    print("jobs details:\n%s\n" % str(m.info(ids)))
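When a chain has more than a few steps, the keyword arguments for the add calls can be prepared in a loop. The sketch below is only an illustration: chain_specs is a hypothetical helper, it uses just the keyword names that appear in example-dynamic.py (name, exec, args, stdout, stderr, after), and the part that talks to QCG PilotJob Manager is left as a comment because it can only run where qcg.appscheduler.api is available.

```python
def chain_specs(commands):
    """Turn [(name, exec_path, args), ...] into keyword dicts for Jobs().add().

    Each job is made to run after the previous one in the list.
    """
    specs = []
    previous = None
    for name, exec_path, args in commands:
        spec = {
            "name": name,
            "exec": exec_path,
            "stdout": f"{name}.stdout",
            "stderr": f"{name}.stderr",
        }
        if args:
            spec["args"] = list(args)
        if previous is not None:
            spec["after"] = previous  # same form as after = 'date1' in the example
        specs.append(spec)
        previous = name
    return specs


specs = chain_specs([
    ("date1", "/bin/date", None),
    ("sleep", "/bin/sleep", ["10"]),
    ("date2", "/bin/date", None),
])
for spec in specs:
    print(spec)

# Inside the allocation, assuming 'm' is the Manager from example-dynamic.py,
# the specs could be submitted along these lines:
#   jobs = Jobs()
#   for spec in specs:
#       jobs = jobs.add(**spec)
#   ids = m.submit(jobs)
```

Keeping the job definitions as plain dictionaries makes them easy to inspect or test before anything is sent to the service.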
For the complete documentation of the tool, please go to this link.