Evaluating Jobs

  • Which time limits should I choose?
  • How much memory do my jobs really need?

Questions like these arise for any new tool that is to be used in batch jobs. We usually advise launching a few test jobs with a representative parameterization1). Based on these tests, a setup for subsequent production jobs can be chosen that includes a safety margin for wall time and memory limit without in turn throttling your own throughput2).
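
As an illustration, a corresponding job script header might look like the following sketch. All names and values (job name, time, memory, the application call) are placeholders, not recommendations; take the actual figures from your own test runs:

#!/bin/bash
#SBATCH --job-name=my_production_run   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --time=01:10:00                # measured maximum run time plus 10-15 %
#SBATCH --mem=14G                      # measured peak memory plus a few percent

srun ./my_application                  # placeholder for the actual tool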

Efficiency

SLURM provides a built-in script, seff, which can be used to evaluate jobs that have finished. To invoke it, run

$ seff <jobid>

It will give an output like:

Job ID: <given job ID>
Cluster: <cluster>
User/Group: <user>/<group>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 05:04:22
CPU Efficiency: 86.73% of 05:50:56 core-walltime
Job Wall-clock time: 00:05:29
Memory Utilized: 13.05 GB
Memory Efficiency: 11.60% of 112.50 GB

The meaning of the individual fields is:

Key                  | Interpretation
<given job ID>       | the job ID passed to $ seff <jobid>
<cluster>            | the cluster name
<user>               | the user name for the job
<group>              | a unix group3)
State                | any of the SLURM job states, e.g. COMPLETED, FAILED or CANCELLED
Nodes                | number of nodes reserved for the job
Cores per node       | number of cores per node reserved for the job
CPU Utilized         | the total CPU time used (time used per CPU * no. of CPUs)
CPU Efficiency       | apparent computation efficiency: CPU time utilized over core-walltime, where core-walltime is the turn-around time of the job (including setup and cleanup) multiplied by the number of cores
Job Wall-clock time  | elapsed (turn-around) time of the job
Memory Utilized      | peak memory usage
Memory Efficiency    | see below for an explanation
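
For reference, the efficiency figures can be recomputed by hand from the example output above (all numbers are taken from that listing):

$ # core-walltime = cores * wall-clock time = 64 * 329 s = 21056 s  (= 05:50:56)
$ # CPU Utilized  = 5 h 4 min 22 s = 18262 s
$ echo "scale=2; 18262 * 100 / 21056" | bc
86.73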

Obviously, the CPU efficiency should not be too low. In the example, roughly 13 % of the reserved CPU resources appear unused. Is this good or bad? And the reported “Memory Efficiency” seems way below anything that could be called “efficient”, right?

  • The “CPU Efficiency” takes into account the node preparation before job start and the subsequent cleanup time. Hence, the value will always be below 100 %. In the example, with a turn-around time of 5.5 minutes, two times 30 seconds for preparation and cleanup already account for about 18 % of the time. Hence, this particular example can be considered very efficient. For longer turn-around times, this preparation/cleanup overhead becomes negligible.
  • Reporting “Memory Efficiency” the way SLURM does is, strictly speaking, a misnomer: the default memory reservation for the used partition is 112.50 GB4). Using less than the reservation is not a sign of poor efficiency, but rather a sign that the job keeps the CPUs busy without needing all of the reserved memory.

Still, the reported “Memory Efficiency” can be an important measure of the memory actually used: if you want to know the peak memory usage of your job, this figure gives you a hint.

However, please note that SLURM samples the memory usage at intervals. Hence, short usage peaks may be missed.
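
If you want to cross-check the recorded values without seff, the SLURM accounting database can be queried directly with sacct. The following call is one possible way to do so (the format fields are standard SLURM accounting fields; the job ID is a placeholder):

$ sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State

Note that MaxRSS is typically reported for the individual job steps (e.g. the batch step) rather than for the overall job line, and it is subject to the same sampling interval mentioned above.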

Genuine profiling in order to optimize an application is not the purpose of such post-hoc job analysis. We offer various tools for this purpose and provide a wiki page on the topic.

1) hence, no toy data
2) As a rule of thumb, a few percent above the measured maximum memory and 10-15 % above the measured maximum time are sufficient. However, for a detailed analysis, please evaluate carefully or approach the HPC team.
3) due to our mapping, this can be any of the groups a user belongs to
4) actually GiB