====== Evaluating Jobs ======

  * Which time limits should I choose?
  * How much memory do my jobs really need?

These are the basic questions for any new tool to be used in batch jobs. We usually advise launching a //few// test jobs with representative parameterization((hence, no toy data)). Based on these test runs, a setup for production jobs can be chosen which places a safety margin on wall time and memory limit without throttling your own throughput((As a rule of thumb, a few percent above the measured maximum memory and 10-15% above the measured maximum time are sufficient. For a detailed analysis, please evaluate carefully or approach the HPC team.)).
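For illustration only, a job script with such margins might look like the following sketch. All concrete values are assumptions, not measurements from this cluster: the partition, the application name, and the hypothetical test-run results of about 4:30 h wall time and 12.5 GB peak memory.

<code bash>
#!/bin/bash
#SBATCH --job-name=production_run   # hypothetical job name
#SBATCH --partition=<partition>     # choose a suitable partition (see the partitions page)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64
#SBATCH --time=05:10:00             # ca. 15% above a measured maximum of 4:30 h
#SBATCH --mem=13G                   # a few percent above a measured peak of 12.5 GB

srun ./my_application               # hypothetical payload
</code>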
===== Efficiency =====

[[https://slurm.schedmd.com/|SLURM]] provides an on-board script, ''seff'', which can be used to evaluate finished jobs. To invoke it, run

<code bash>
$ seff <jobid>
</code>

It will give output like:

<code>
Job ID: <given job ID>
Cluster: <cluster>
User/Group: <user>/<group>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 05:04:22
CPU Efficiency: 86.73% of 05:50:56 core-walltime
Job Wall-clock time: 00:05:29
Memory Utilized: 13.05 GB
Memory Efficiency: 11.60% of 112.50 GB
</code>

The fields mean:

^ Key ^ Interpretation ^
| ''<given job ID>'' | the job ID passed to ''seff'' |
| ''<cluster>'' | the cluster name |
| ''<user>'' | the user name for the job |
| ''<group>'' | a unix group((due to our mapping, this can be any of the groups a user belongs to)) |
| State | can be any of ''COMPLETED'', ''FAILED'' or ''CANCELLED'' |
| Nodes | number of nodes reserved for the job |
| Cores per node | number of cores per node for the job |
| CPU Utilized | the total CPU time consumed (time used per CPU * number of CPUs) |
| CPU Efficiency | an //apparent// computation efficiency (CPU time utilized over core-walltime); the core-walltime is the turn-around time of the job multiplied by the number of cores, including setup and cleanup |
| Job Wall-clock time | elapsed time of the job |
| Memory Utilized | the sampled peak memory usage |
| Memory Efficiency | **see below** for an explanation |

Obviously, the CPU efficiency should not be too low. In the example, about 13% of the reserved core time appears unused. Is this good or bad? And the reported "Memory Efficiency" is far below anything which could be considered "efficient", right?

  * The "CPU Efficiency" includes the node preparation before job start and the subsequent cleanup time. Hence, the value will always be below 100%. In the example, with a turn-around time of about 5.5 minutes, two times 30 seconds for preparation and cleanup already account for roughly 18% of the time. This particular example can therefore be considered very efficient. For longer turn-around times, this preparation/cleanup overhead becomes negligible.
  * "Memory Efficiency" as reported by SLURM is **absolutely** the wrong term: the default [[partitions|memory reservation for the used partition]] is 112.50 GB((actually [[https://en.wikipedia.org/wiki/Byte#Unit_multiples|GiB]])). Using less is not a sign of meager efficiency, but rather a sign of using the CPUs well while not exhausting the reserved memory.

<WRAP center round important 90%>
Still, the reported "Memory Efficiency" can be an important measure of the memory actually used: if you want to know your peak memory usage, this value gives you a hint.
However, please note that **SLURM samples the memory usage in intervals**. Hence, short usage peaks may be missed.
</WRAP>

<WRAP center round info 90%>
Genuine profiling in order to optimize an application is beyond the scope of such post-hoc job analysis. We offer various tools for this purpose and provide a [[development:analysis_and_optimization:|wiki page on this topic]].
</WRAP>
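Short of genuine profiling, the sampled values can also be inspected or cross-checked on the command line. The following sketch assumes a hypothetical job ID ''12345678'' and application name: ''sacct'' reads SLURM's accounting data for a finished job, while wrapping the payload in GNU ''time'' records the true peak independently of SLURM's sampling interval.

<code bash>
# Query the accounting data of a finished job;
# MaxRSS is the sampled peak resident set size.
sacct -j 12345678 --format=JobID,Elapsed,TotalCPU,ReqMem,MaxRSS,State

# Inside a job script: record the true peak memory usage
# (reported as "Maximum resident set size" in the verbose output),
# unaffected by SLURM's sampling interval.
/usr/bin/time -v ./my_application
</code>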