Evaluating Jobs
- Which time limits should I choose?
- How much memory do my jobs really need?
These are the basic questions for any new tool to be used in batch jobs. We usually advise launching a few test jobs with a representative parameterization. Based on these, a setup for the productive jobs can be chosen that includes a safety margin for the wall time and memory limits, without in turn throttling your own throughput.
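For example, a first test job could be submitted with modest, explicit limits and later inspected with seff (described below). The following batch script is only a sketch: the partition name, the requested resources, and the program call are placeholders that have to be adapted to the actual cluster and workload.

```bash
#!/bin/bash
#SBATCH --job-name=testrun        # hypothetical name for the test job
#SBATCH --partition=<partition>   # placeholder: pick a partition of your cluster
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64        # matches the example output below; adapt as needed
#SBATCH --time=00:15:00           # generous guess for a short, representative test case
#SBATCH --mem=16G                 # explicit memory request instead of the partition default

# Run the application with a small but representative input (placeholder names).
srun ./my_program --input test_case.dat
```

Once such a test job has finished, seff (see below) shows how much of the requested time and memory was actually used, and the limits for the productive jobs can be chosen from these values plus a safety margin.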
Efficiency
SLURM provides a built-in script, seff, which can be used to evaluate finished jobs. To invoke it, run
$ seff <jobid>
It will produce output like:
    Job ID: <given job ID>
    Cluster: <cluster>
    User/Group: <user>/<group>
    State: COMPLETED (exit code 0)
    Nodes: 1
    Cores per node: 64
    CPU Utilized: 05:04:22
    CPU Efficiency: 86.73% of 05:50:56 core-walltime
    Job Wall-clock time: 00:05:29
    Memory Utilized: 13.05 GB
    Memory Efficiency: 11.60% of 112.50 GB
The meaning of the individual fields is:
Key | Interpretation |
---|---|
<given job ID> | the job ID passed to $ seff <jobid> |
<cluster> | the cluster name |
<user> | the user name for the job |
<group> | the Unix group of the user |
State | one of COMPLETED, FAILED or CANCELLED |
Nodes | number of nodes reserved for the job |
Cores per node | number of cores per node reserved for the job |
CPU Utilized | the total CPU time used (used time per CPU * number of CPUs) |
CPU Efficiency | the apparent computational efficiency (CPU time utilized over core-walltime); the core-walltime is the wall-clock time multiplied by the number of cores and includes setup and cleanup |
Job Wall-clock time | the elapsed time of the job |
Memory Utilized | the peak memory usage |
Memory Efficiency | see below for an explanation |
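As a plausibility check, the relation between these fields can be reproduced by hand. The following shell snippet is a small sketch that recomputes the CPU efficiency from the example output above; the values are hard-coded from that output.

```bash
#!/bin/bash
# Values taken from the example seff output above.
cores=64                              # Nodes * Cores per node
wallclock=$(( 5*60 + 29 ))            # Job Wall-clock time 00:05:29 in seconds
cpu_used=$(( 5*3600 + 4*60 + 22 ))    # CPU Utilized 05:04:22 in seconds

core_walltime=$(( cores * wallclock ))   # 64 * 329 s = 21056 s = 05:50:56

awk -v used="$cpu_used" -v total="$core_walltime" \
    'BEGIN { printf "CPU Efficiency: %.2f%% of %d s core-walltime\n", 100*used/total, total }'
# Prints: CPU Efficiency: 86.73% of 21056 s core-walltime
```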
Obviously, the CPU efficiency should not be too low. In the example, about 13 % of the resources apparently go unused. Is this good or bad? And the reported “Memory Efficiency” is far below anything that could be considered “efficient”, right?
- The “CPU Efficiency” takes into account the node preparation before job start and the subsequent cleanup time. Hence, the value will always be below 100 %. In the example, with a turn-around time of about 5.5 minutes, 2 × 30 seconds for preparation and cleanup already amount to roughly 18 % of the time. Hence, this particular example can be considered very efficient. For longer turn-around times, this preparation/cleanup share becomes negligible (a small numeric sketch follows at the end of this section).
- Reporting “Memory Efficiency” the way SLURM does uses the wrong term: the 112.50 GB in the example is simply the default memory reservation of the partition used. Using less memory is not a sign of poor efficiency, but rather a sign that the job keeps its CPUs busy without needing all of the reserved memory.
Note, however, that SLURM samples the memory usage only at intervals, so short usage peaks may be missed.
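To illustrate the first point numerically, the share of a fixed preparation/cleanup overhead in the turn-around time can be estimated as follows. The 2 × 30 seconds of overhead are the assumption used in the example above; the 2-hour job is an additional hypothetical case for comparison.

```bash
#!/bin/bash
# Assumed fixed overhead: 30 s preparation + 30 s cleanup (as in the example above).
overhead=60

# Overhead share for the 5.5-minute example job and for a hypothetical 2-hour job.
for wallclock in $(( 5*60 + 29 )) $(( 2*3600 )); do
    awk -v o="$overhead" -v w="$wallclock" \
        'BEGIN { printf "wall-clock %5d s: overhead share %4.1f%%\n", w, 100*o/w }'
done
# Prints about 18.2% for the short job and 0.8% for the 2-hour job,
# i.e. the preparation/cleanup share becomes negligible for long jobs.
```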