Performance Optimization

Performance Analysis Tools and Optimization Tools / Aspects

The HPC Group provides a variety of profiling, optimization, and performance measurement tools:

Software Choices

perf

<code>perf</code> is a command-line performance analysis tool for Linux.

Because <code>perf</code> ships with the Linux kernel, no module needs to be loaded. You can use it at any time, interactively, inside or outside of jobs.
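
As a minimal sketch of a typical session, the commands below count basic events for one run, record a call-graph profile, and print a plain-text report. The program name <code>./my_app</code> is a placeholder and not something provided on MOGON; the script skips the run if either <code>perf</code> or the program is missing.

```shell
# Placeholder program name (an assumption); point APP at your own binary.
APP="${APP:-./my_app}"
# File that perf record writes its samples to.
PERF_DATA="${PERF_DATA:-perf.data}"

if command -v perf >/dev/null 2>&1 && [ -x "$APP" ]; then
    # Summary counters (cycles, instructions, ...) for one run.
    perf stat -- "$APP"
    # Sample the program and record call graphs.
    perf record -g -o "$PERF_DATA" -- "$APP"
    # Print the hottest functions as plain text.
    perf report -i "$PERF_DATA" --stdio | head -n 40
    STATUS="profiled"
else
    STATUS="skipped"
    echo "perf or $APP not available, nothing profiled"
fi
```

Compiling the program with <code>-g</code> lets <code>perf report</code> resolve function names and source lines.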

Documentation

This wiki does not aim to reproduce the excellent <code>perf</code> documentation, but rather refers to it, in particular to the <code>perf</code> tutorial.

OpenSpeedShop

OpenSpeedShop https://www.openspeedshop.org/ (further information soon)

Scalasca

Scalasca is provided as a module. Because it is mostly of interest to developers, it is currently available on MOGON II only:

perf/Scalasca

If you require assistance or updates, please contact us.

Intel’s Advisor

MOGON-specific documentation for this tool has not been written yet. Please do not hesitate to approach us for advice or recommendations.

Intel’s Inspector

MOGON-specific documentation for this tool has not been written yet. Please do not hesitate to approach us for advice or recommendations.

Intel’s VTune

Intel VTune Amplifier is a powerful serial and parallel profiler that collects performance statistics for your code. VTune can profile code written in C, C++, C#, Fortran, Java, and Assembly. VTune is designed for shared-memory machines, so code using MPI and/or OpenMP can be profiled as long as it runs on a single node.

Initial Setup

  1. Set up the VTune environment by loading the VTune module: <code>module load tools/VTune</code>
  2. Build your application as you normally would, but also turn on compiler debug symbols. This is typically done by adding the <code>-g</code> option to the compiler command (<code>icc</code>, <code>gcc</code>, <code>mpicc</code>, <code>ifort</code>, etc.) and enables source-level profiling. It is recommended to keep release-build optimization flags (e.g. <code>-O3 -xAVX</code>), so that your effort goes into regions the compiler has not already optimized.
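
A minimal sketch of such a build with GCC is shown below. The file names <code>sample_code.c</code> and <code>sample_code.exe</code> are placeholders; with the Intel compiler, <code>icc</code> and an option such as <code>-xAVX</code> would take the place of <code>gcc</code> here. The script does nothing if the compiler or the source file is missing.

```shell
# Placeholder names (assumptions); replace with your own source and binary.
SRC="${SRC:-sample_code.c}"
EXE="${EXE:-sample_code.exe}"
# -g keeps debug symbols for source-level profiling,
# -O3 keeps release-level optimization.
CFLAGS="-g -O3"

if command -v gcc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    # $CFLAGS is intentionally unquoted so it splits into two options.
    if gcc $CFLAGS -o "$EXE" "$SRC"; then
        BUILD="done"
    else
        BUILD="failed"
    fi
else
    BUILD="skipped"
    echo "gcc or $SRC not found, nothing built"
fi
```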

Serial Usage with the GUI

  1. Do not use this approach for jobs running longer than a few minutes. Instead, submit them to the scheduler and view the results in the GUI (see the section below).
  2. After loading the VTune module, start the GUI from the command line: <code>amplxe-gui</code>
  3. If this is the first time you have run VTune, click New Project.
    • Give your project a name and choose a location to store the analysis output
    • Choose the application that you built during the initial setup (e.g. ~/sample_code.exe).
    • Choose a working directory
  4. Choose New Analysis, then Basic Hotspots. It is recommended to start with basic hotspots and then move to more advanced profiling analyses if necessary.
  5. On the right-hand side, click the Start button.
    • Your application will now run in the background while VTune collects data. How long this takes depends on your application; VTune should not add a noticeable amount of overhead.
    • To stop your application prematurely, click Stop on the right-hand side. This stops the program and the collection, but the partial results are still displayed.
  6. VTune will then finalize the results and display a summary page.
    • Assuming the application was compiled with the -g flag, the Top Hotspots should point out the most time consuming functions/subroutines of your program.
    • The CPU usage histogram is not applicable for serial codes - you can safely ignore this.
  7. You can see more information by clicking on the bottom-up or top-down tree.
    • Double clicking on a line will bring you to the source code of the application and show CPU usage on a per source code line basis.
    • This will point you to the areas that should be the focus of your optimization efforts.

Serial/Parallel Usage Through Scheduler

  1. The instructions here detail how to run your program with the VTune collector on a remote compute node and then finalize and visualize the results in the GUI on the head node.
  2. Build your application as you normally would, but also turn on compiler debug symbols. This is typically done by adding the <code>-g</code> option to the compiler command (<code>icc</code>, <code>gcc</code>, <code>mpicc</code>, <code>ifort</code>, etc.) and enables source-level profiling. It is recommended to keep release-build optimization flags (e.g. <code>-O3 -xAVX</code>), so that your effort goes into regions the compiler has not already optimized.
  3. Load the VTune module and prepare your application as described above.
  4. Start an interactive job on a whole node. Always reserve a whole node (or multiple nodes) when profiling or benchmarking.
  5. Start the application as described above.
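
The steps above can be sketched on the command line as follows. This assumes the <code>amplxe-cl</code> collector that ships alongside <code>amplxe-gui</code>; the result directory <code>vtune_hotspots</code> and the program <code>./sample_code.exe</code> are placeholder names. How the interactive whole-node job is requested depends on the scheduler setup, so that step appears only as a comment.

```shell
# Run inside an interactive job that reserves a whole node (request it with
# your scheduler's usual command), after `module load tools/VTune`.

# Placeholder names (assumptions); adjust to your application.
APP="${APP:-./sample_code.exe}"
RESULT_DIR="${RESULT_DIR:-vtune_hotspots}"

if command -v amplxe-cl >/dev/null 2>&1 && [ -x "$APP" ]; then
    # Collect a hotspots profile into RESULT_DIR.
    amplxe-cl -collect hotspots -result-dir "$RESULT_DIR" -- "$APP"
    # Print a text summary of the collected result.
    amplxe-cl -report summary -result-dir "$RESULT_DIR"
    VTUNE="collected"
else
    VTUNE="skipped"
    echo "amplxe-cl or $APP not available, nothing collected"
fi
```

Afterwards, running <code>amplxe-gui vtune_hotspots</code> on the head node should open the collected result in the GUI.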

Getting Help

If you have specific questions about using VTune on MOGON, please see us at the HPC-Workshop (as announced on the wiki start page).


This page is based on the template of the Princeton Research Computing Facility page.