====== Input and Output on HPC Systems ======

This page comments on general aspects of I/O workloads on Mogon I/II. More information on the file systems themselves can be found [[filesystems|here]].

===== Issues which may arise =====

Scientific applications perform I/O to the parallel file system in primarily one of two ways:

  * Shared-file (N-to-1):
    * This increases usability: there is only one file for the application to keep track of.
    * It may create lock contention and hinder performance.
  * File-per-process (N-to-N):
    * It may avoid lock contention on the application level, but increases the risk of file system stress when many processes write to one destination.
    * It is impossible to restart these applications with a different number of tasks.
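As a minimal sketch of the file-per-process pattern: each task derives a unique output name from its Slurm environment, so no two tasks ever contend for the same file. ''SLURM_JOB_ID'' and ''SLURM_PROCID'' are standard Slurm variables; the fallbacks are only there so the sketch runs outside a job as well.

```shell
#!/bin/bash
# File-per-process (N-to-N) sketch: every task writes its own file,
# avoiding lock contention on a single shared file.
# SLURM_JOB_ID / SLURM_PROCID are set by Slurm inside a job step;
# the :-0 fallbacks merely make the sketch runnable standalone.
JOB=${SLURM_JOB_ID:-0}
TASK=${SLURM_PROCID:-0}

OUT="result.${JOB}.task${TASK}.out"
echo "output of task ${TASK}" > "$OUT"
```

Keep the caveat from above in mind: output produced this way cannot simply be re-read by a restarted run that uses a different number of tasks.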

===== How can I - as a user - analyze issues? =====

Currently, when you suspect I/O problems, you should [[hpc@uni-mainz.de|contact the HPC team]]. There is no straightforward method available at the user level to analyze I/O problems of third-party applications.

<WRAP center round todo 90%>
We may provide more tools in the foreseeable future.
</WRAP>

===== Which solution may solve which issue? =====

The statements above may seem a little abstract, particularly when third-party applications have to be used and no decision can be made about the application architecture.

However, a few rules of thumb can be given:
  * [[node_local_scheduling|Pooling short jobs]] is generally a good idea with respect to scheduling and organizing your workflow. If all those application instances read identical input files, staging them in to a [[slurm_localscratch|node-local scratch]] or even into [[ramdisk|RAM]] may solve performance issues: with a temporary local copy, the global file system no longer needs to track accesses to these particular files.
  * Avoid keeping files open from many processes (within one directory). Violating this rule may cause delays, because the global file system needs to coordinate every writing process. A possible solution is to write into the job directory (see [[slurm_localscratch|node local scratch]]) and to copy this output to the global file system once a writing process has finished and released its file.
  * Avoid writing too many small files: the overhead of keeping track of the meta information for millions of small files can be bigger than the files themselves. The global file system is not optimized for this.
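The first two rules combine into a common stage-in / stage-out pattern. A minimal sketch, assuming a ''JOBDIR'' variable for the node-local scratch directory (on Mogon the actual path is provided as described under [[slurm_localscratch|node local scratch]]; here a temporary directory stands in for it so the sketch runs anywhere):

```shell
#!/bin/bash
# Stage-in / stage-out sketch. JOBDIR stands in for the node-local
# scratch directory; outside a job we fall back to a temp directory.
set -e
JOBDIR=${JOBDIR:-$(mktemp -d)}

# 1. Stage-in: copy the shared input once to the local disk.
echo "some input" > input.dat          # placeholder input for this sketch
cp input.dat "$JOBDIR/"

# 2. Compute: read and write on the local disk only, keeping open
#    file handles away from the global file system.
tr a-z A-Z < "$JOBDIR/input.dat" > "$JOBDIR/output.dat"

# 3. Stage-out: copy the finished output back to the global file
#    system only after the file has been written and closed.
cp "$JOBDIR/output.dat" .
```

This way the global file system sees exactly two bulk transfers instead of coordinating every read and write of the running job.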