Reviewing Jobs

How to evaluate your jobs

  • Which time limits should I choose?
  • How much memory do my jobs really need?

Questions like these arise for any new tool that is to be run in batch jobs. We usually advise launching a few test jobs with representative parameterization (hence, no toy data). Based on their results, a setup for the subsequent production jobs can be chosen that leaves a safety margin for wall time and memory without throttling your own throughput¹.

SLURM provides a built-in script, seff, which can be used to evaluate finished jobs. To invoke it, run

seff <jobid>

It will give an output like:

Job ID: <given job ID>
Cluster: <cluster>
User/Group: <user>/<group>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 05:04:22
CPU Efficiency: 86.73% of 05:50:56 core-walltime
Job Wall-clock time: 00:05:29
Memory Utilized: 13.05 GB
Memory Efficiency: 11.60% of 112.50 GB
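When evaluating many jobs, it can be convenient to extract the efficiency figures from seff's report programmatically. The following is a minimal Python sketch (not part of SLURM; the field names are taken from the sample output above):

```python
import re

# seff report as shown above, abbreviated to the two efficiency lines.
seff_output = """\
CPU Efficiency: 86.73% of 05:50:56 core-walltime
Memory Efficiency: 11.60% of 112.50 GB
"""

def parse_seff(text):
    """Pull the efficiency percentages out of seff's textual report."""
    fields = {}
    for key in ("CPU Efficiency", "Memory Efficiency"):
        match = re.search(rf"{key}: ([\d.]+)%", text)
        if match:
            fields[key] = float(match.group(1))
    return fields

effs = parse_seff(seff_output)
print(effs)  # {'CPU Efficiency': 86.73, 'Memory Efficiency': 11.6}
```

Fed with the output of `seff <jobid>` for each job of interest, such a helper makes it easy to spot jobs whose requests are far off the mark.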

The fields have the following meaning:

  • <given job ID>: the job ID passed to seff <jobid>
  • <cluster>: the cluster name
  • <user>: the user name for the job
  • <group>: a Unix group (due to our mapping, this can be any of the groups the user belongs to)
  • State: one of COMPLETED, FAILED, or CANCELLED
  • Nodes: the number of nodes reserved for the job
  • Cores per node: the number of cores per node for the job
  • CPU Utilized: the total CPU time used (time used per CPU × number of CPUs)
  • CPU Efficiency: an apparent computation efficiency (CPU Utilized over core-walltime); the core-walltime is the turn-around time of the job, including setup and cleanup
  • Job Wall-clock time: the elapsed time of the job
  • Memory Utilized: the peak memory usage
  • Memory Efficiency: see below for an explanation

Obviously, the CPU efficiency should not be too low. In the example, about 13% of the resources appear unused. Is this good or bad? And the reported "Memory Efficiency" is far below anything that could be called "efficient", right?

  • The CPU Efficiency includes the node preparation before job start and the subsequent cleanup time. Hence, the value will always be below 100%. In the example, with a turn-around time of about 5.5 minutes, two times 30 seconds for preparation and cleanup already account for roughly 18% of the time. The particular example can therefore be considered very efficient. For longer turn-around times, this preparation/cleanup share becomes negligible.
  • Reporting "Memory Efficiency" the way SLURM does is a misnomer: the default memory reservation for the partition used is 112.50 GB (actually GiB). Using less is not a sign of poor efficiency, but rather a sign of using the CPUs well without claiming all of the reserved memory.
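The overhead argument from the first point can be checked with simple arithmetic; the 30 seconds per preparation and cleanup step are assumed figures:

```python
# Overhead estimate for the example job: ~30 s each for node
# preparation and cleanup (assumed figures) against the job's
# measured wall-clock time of 00:05:29.
wall_clock_s = 5 * 60 + 29      # 329 s turn-around time
overhead_s = 2 * 30             # preparation + cleanup

share = overhead_s / wall_clock_s
print(f"{share:.0%}")           # → 18%
```

With a turn-around time of, say, several hours instead of minutes, the same 60 seconds of overhead would drop well below one percent.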

Still, the reported Memory Efficiency can be a useful measure of the memory actually used: it gives a hint at the peak memory usage of your job.

However, please note that SLURM samples memory usage at intervals; hence, short usage peaks may be missed.

Genuine profiling in order to optimize an application is not the purpose of this post-hoc job analysis. We offer various tools for that purpose and provide a wiki page on this topic.


  1. As a rule of thumb, a margin of a few percent on the maximum memory and 10-15% above the measured maximum time is sufficient. For a detailed analysis, however, please evaluate carefully or approach the HPC team. ↩︎
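Applied to the example job above, the rule of thumb could be turned into concrete requests as follows (a sketch only; it assumes the margins are added on top of the measured values, so adjust to your own policy):

```python
# Measured values taken from the seff example above.
peak_mem_gb = 13.05          # "Memory Utilized"
max_time_s = 5 * 60 + 29     # wall-clock time, 00:05:29

# Rule-of-thumb margins: a few percent on memory, 10-15% on time.
mem_request_gb = peak_mem_gb * 1.05   # ~5% memory margin
time_limit_s = max_time_s * 1.15      # 15% time margin

print(f"request ~{mem_request_gb:.1f} GB, ~{time_limit_s:.0f} s limit")
# → request ~13.7 GB, ~378 s limit
```

These margins keep a single slow node or a slightly larger input from killing the job, while still leaving the requests close enough to reality not to hurt scheduling throughput.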