To quote its website: "Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids." In other words, it shows at a glance all sorts of metrics that you would otherwise have to collect with numerous different shell tools.
We, the HPC team, use Ganglia on a daily basis to monitor the state of our cluster(s). As a user, you can monitor the state of the nodes on which your jobs are running.
Without further ado, here is the link to our Ganglia page: Ganglia Entry point for the Mogon Clusters.
top is the classic tool for monitoring the CPU behavior of your processes at a relatively fine-grained level. As a user you are allowed to log in (with ssh) to nodes where your jobs are running. Remember to log out afterwards.
An example is
top -u <username>
Specifying the username limits the view to your own processes.
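For logging or for a quick look over ssh, a non-interactive snapshot can be more convenient than the interactive display. The following is a minimal sketch, assuming the procps version of top; the function name `top_snapshot` and the hostname `node0123` are placeholders made up for illustration.

```shell
# Sketch: print one non-interactive snapshot of a user's processes.
# -b: batch mode (plain text output), -n 1: a single iteration,
# -u: restrict the view to one user. Defaults to the calling user.
top_snapshot() {
    user="${1:-$(id -un)}"
    top -b -n 1 -u "$user"
}

# Hypothetical usage on a node where one of your jobs is running
# ("node0123" is a placeholder, not a real hostname):
#   ssh node0123 "top -b -n 1 -u $(id -un)"
```

Because the ssh command runs top once and returns, there is no session left open that you could forget to log out of.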
The vmstat command displays statistics on virtual memory, kernel threads, disks, system processes, I/O blocks, interrupts, CPU activity and much more. This is a good example page.
The lsof command lists processes and their open files. The list includes disk files, network sockets, pipes, devices and processes.
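As a rough illustration, vmstat accepts a delay in seconds and a count of reports, for example `vmstat 5 3`. The sketch below extracts the CPU idle percentage from such output; the helper name `vmstat_idle` is made up for this example, and it assumes the usual procps layout with a header line that starts with `r` and contains an `id` column.

```shell
# Sketch: print the CPU idle percentage ("id" column) from vmstat output
# read on stdin. The column is located by its header name, not by position.
vmstat_idle() {
    awk '
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "id") col = i; next }
        col && $1 ~ /^[0-9]+$/ { print $col }
    '
}

# Usage: three reports, five seconds apart
#   vmstat 5 3 | vmstat_idle
```

Note that the first report vmstat prints gives averages since the last reboot; only the following reports cover the sampling interval.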
One example would be
$ lsof | head
COMMAND PID USER   FD TYPE     DEVICE SIZE/OFF    NODE NAME
init      1 root  cwd  DIR      253,0     4096       2 /
init      1 root  rtd  DIR      253,0     4096       2 /
init      1 root  txt  REG      253,0   145180  147164 /sbin/init
init      1 root  mem  REG      253,0  1889704  190149 /lib/libc-2.12.so
init      1 root   0u  CHR        1,3      0t0    3764 /dev/null
init      1 root   1u  CHR        1,3      0t0    3764 /dev/null
init      1 root   2u  CHR        1,3      0t0    3764 /dev/null
init      1 root   3r FIFO        0,8      0t0    8449 pipe
init      1 root   4w FIFO        0,8      0t0    8449 pipe
init      1 root   5r  DIR       0,10        0       1 inotify
init      1 root   6r  DIR       0,10        0       1 inotify
init      1 root   7u unix 0xc1513880      0t0    8450 socket
init      1 root  DEL  REG        8,2          2621484 /lib64/librt-2.12.so
Here FD stands for 'file descriptor'. Some of its values are:
|cwd||current working directory|
|txt||program text (code and data)|
Also in the FD column, entries like 1u are actual file descriptor numbers, followed by a letter (u, r or w) giving the access mode:
|r||for read access.|
|w||for write access.|
|u||for read and write access.|
TYPE gives the type of the file, for example:
|CHR||Character special file.|
|FIFO||First In First Out|
# to list all files of a particular user and all network connections, type:
lsof -u <username> -i
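Since lsof output can be long, a summary per file type is sometimes easier to read. The following is a sketch under stated assumptions: the helper name `lsof_count_types` is made up, and it assumes the default column layout shown above, where TYPE is the fifth field (layouts with extra columns, such as a thread ID, would need a different field number).

```shell
# Sketch: count open files per TYPE from lsof output read on stdin.
# Skips the header line and assumes TYPE is the fifth column.
lsof_count_types() {
    awk 'NR > 1 { n[$5]++ } END { for (t in n) print t, n[t] }' | sort
}

# Usage:
#   lsof -u "$(id -un)" | lsof_count_types
#
# lsof combines selection options with OR by default; adding -a combines
# them with AND, e.g. only the network connections of one user:
#   lsof -a -u "$(id -un)" -i
```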
I/O statistics are a little intricate in conjunction with parallel file systems. If you need detailed I/O statistics for the parallel file system, please do not hesitate to contact the HPC team.
iostat is a simple tool that collects and shows input and output statistics for storage devices. It is often used to trace storage performance issues on devices, local disks and remote disks. It is particularly useful if your job requires local scratch storage and you need to monitor how your application uses it.
for such a statistic.
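A minimal sketch of a repeated iostat call, assuming the sysstat version of iostat is installed on the node; the wrapper name `iostat_watch` and its default values are made up for illustration.

```shell
# Sketch: extended per-device statistics in MB/s.
# -d: device report only, -x: extended statistics, -m: megabytes per second.
# Arguments: interval in seconds (default 5), number of reports (default 3).
iostat_watch() {
    interval="${1:-5}"
    count="${2:-3}"
    iostat -d -x -m "$interval" "$count"
}

# Usage: six reports, ten seconds apart
#   iostat_watch 10 6
```

The first report covers the time since boot, so the later reports are the ones that reflect what your application is currently doing.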