  
^ Partition          ^ hosts                ^ GPUs                ^ RAM          ^ Access by ^
| ''deeplearning''   | [[start:mogon_cluster:nodes|dgx[01-02]]]   | V100 16G/32G          | 11550        | project on Mogon II |
| ''m2_gpu''         | [[start:mogon_cluster:nodes|s[0001-0030]]] | 6 GeForce GTX 1080 ti | 11550        | project on Mogon II |
  
  
Notes:
  * RAM displays the default memory per node in MiB (see the example below for requesting memory explicitly).
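If the default per-node memory does not fit your job, you can request memory explicitly in your job script (a sketch; adjust the value to your application):

<code bash>
# request 8 GiB for the whole job instead of the partition default
#SBATCH --mem=8G
</code>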
  
  
<callout type="warning" icon="true">
===== Access =====
  
  
To find out which account to use for the ''m2_gpu'' partition, log in and call:
<code bash>
sacctmgr list user $USER -s where Partition=m2_gpu format=User%10,Account%20,Partition%10
</code>
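The account listed there can then be used at submission time, e.g. (''m2_myproject'' and ''myjobscript'' are placeholders for your own account and job script):

<code bash>
# submit a job script to the GPU partition, charging your MOGON II project account
sbatch -A m2_myproject -p m2_gpu myjobscript
</code>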
===== Limitations =====
  
The ''m2_gpu'' is a single partition allowing a runtime of up to 5 days. To prevent single users or groups from flooding the entire partition with long-running jobs, a limit has been set so that other users also get a chance to run their jobs. This may result in pending reasons such as ''QOSGrpGRESRunMinutes''. For other pending reasons, see [[:start:working_on_mogon:slurm_manage|our page on job management]].
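If your job is pending, you can display the reason Slurm reports for it, e.g. with:

<code bash>
# list your jobs in the m2_gpu partition with state, elapsed time and pending reason
squeue -u $USER -p m2_gpu -o "%.18i %.9P %.12j %.8T %.10M %r"
</code>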
  
===== Compiling for GPUs =====
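A minimal sketch of compiling a CUDA source file with ''nvcc'' (''saxpy.cu'' is a placeholder for your own code), assuming the default CUDA module:

<code bash>
# load the default CUDA toolkit and compile with nvcc
module load system/CUDA
# sm_61 targets the GeForce GTX 1080 ti cards of the m2_gpu nodes
nvcc -O2 -arch=sm_61 -o saxpy saxpy.cu
</code>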
''--gres-flags=enforce-binding'' is currently not working properly in our Slurm version. You may try to use it with a multi-task GPU job, but it will not work with jobs reserving only part of a node. SchedMD seems to be working on a bug fix.
</callout>
==== Simple single GPU-Job ====

Take a single GPU on one GPU node and run an executable on it ((Be sure to set the amount of memory appropriately.)).

<file myjobscript>
#!/bin/bash
#-----------------------------------------------------------------
# Example SLURM job script to run serial applications on Mogon.
#
# This script requests one task using 2 cores on one GPU node.
#-----------------------------------------------------------------

#SBATCH -J mysimplegpujob        # Job name
#SBATCH -o mysimplegpujob.%j.out # Specify stdout output file (%j expands to jobId)
#SBATCH -p m2_gpu                # Partition name
#SBATCH -n 1                     # Total number of tasks
#SBATCH -c 2                     # CPUs per task
#SBATCH -t 00:30:00              # Run time (hh:mm:ss) - 0.5 hours
#SBATCH --gres=gpu:1             # Reserve 1 GPU
#SBATCH -A m2_account            # Specify allocation to charge against

# Load all necessary modules if needed (these are examples)
# Loading modules in the script ensures a consistent environment.
module load system/CUDA

# Launch the executable
srun <myexecutable>
</file>
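Submit the script and check its state as usual, e.g.:

<code bash>
sbatch myjobscript      # submit to the partition and account given in the script
squeue -u $USER         # check the job state
</code>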


==== Simple full node GPU-Job ====
  
Take a full GPU node and run an executable that uses all 6 GPUs ((Be sure that your application can utilize more than 1 GPU when you request it!)).
# Load all necessary modules if needed (these are examples)
# Loading modules in the script ensures a consistent environment.
module load system/CUDA
  
# Launch the executable
#SBATCH -N 1                     # Total number of nodes requested (48 cores/node per GPU node)
#SBATCH -n 6                     # Total number of tasks
#SBATCH -c 8                     # CPUs per task
#SBATCH -t 00:30:00              # Run time (hh:mm:ss) - 0.5 hours
#SBATCH --gres=gpu:6             # Reserve 6 GPUs
# Load all necessary modules if needed (these are examples)
# Loading modules in the script ensures a consistent environment.
module load system/CUDA
  
# Launch the tasks