debug_tutorial

Differences

This shows you the differences between two versions of the page.

debug_tutorial [2013/08/02 00:28]
bogert [ltrace] Explain that ltrace is not interactive
debug_tutorial [2013/08/02 02:18] (current)
bogert [Debugger: strace] Explain that strace is explained by reading the ltrace section
Line 1: Line 1:
 ====== Tutorial: Debugging MPI programs at Mogon ====== ====== Tutorial: Debugging MPI programs at Mogon ======
 ===== Summary ===== ===== Summary =====
- 
 We show a method of attaching a debugger on the fly by request of your application.\\ We show a method of attaching a debugger on the fly by request of your application.\\
 With the shown method your application can determine on its own which nodes need to be debugged and request attachment of the debugger only on those nodes.\\ With the shown method your application can determine on its own which nodes need to be debugged and request attachment of the debugger only on those nodes.\\
Line 7: Line 6:
  
 The primary debugger which this tutorial explains is GDB.\\ The primary debugger which this tutorial explains is GDB.\\
-At the end of the tutorial, you will be shown how to use various other debuggers with the same technique: cgdb, ltrace and strace.+At the end of the tutorial, you will be shown how to use various other debuggers with the same technique: cgdb, ltrace and strace.\\ 
 +Notably, ltrace and strace support profiling the execution time of your program - you can use them for performance optimization.
  
 For questions, please contact: For questions, please contact:
Line 16: Line 16:
  
 ===== Setup ===== ===== Setup =====
-We now show you how to obtain and execute the sample program used in this tutorial. +==== Understanding & compiling the sample program ==== 
-Notice that lines starting with “#” are comments and do not need to be executed.+The tutorial is based on a sample program. This section shows how to obtain it, explains what it does, and tells you how to compile it.
  
 +First of all, we download the source code. Notice that lines starting with “#” are comments and do not need to be executed.
 <code bash> <code bash>
 ssh mogon.zdv.uni-mainz.de ssh mogon.zdv.uni-mainz.de
Line 123: Line 124:
 See the source code for how they are used. Also notice that they are most likely implemented with efficient parallelization as well. Therefore, it is a good idea to use those collective operations instead of manually distributing the chunks and adding up the sub sums. See the source code for how they are used. Also notice that they are most likely implemented with efficient parallelization as well. Therefore, it is a good idea to use those collective operations instead of manually distributing the chunks and adding up the sub sums.
  
-===== Execution ===== +==== Preparing for debugging ==== 
-We now compile the program and execute it as batch job on Mogon:+This section will show you how to provoke bugs in your programs. After that, you will be given an example of how to isolate the bugs to a small region in the program: getting a rough idea of where the bug might be before using the debugger reduces debugging time.
  
 +First of all, we schedule the program for execution as a batch job on Mogon:
 <code bash> <code bash>
 # Make the libraries available which are needed for compilation and execution # Make the libraries available which are needed for compilation and execution
Line 229: Line 231:
 | Once you have run tests for many different node counts, and you notice that they succeed for some of them, try to spot a pattern between the node count and boundary conditions of your program. Boundary conditions are likely to induce problems. | | Once you have run tests for many different node counts, and you notice that they succeed for some of them, try to spot a pattern between the node count and boundary conditions of your program. Boundary conditions are likely to induce problems. |
  
-Before we fulfill our promise to actually teach you something about GDB, we note that we have managed to reduce the line count where we suspect the problem to be to 2 **without** using GDB. This gives you:+Before we fulfill our promise to actually teach you something about using a debugger, we note that we have managed to reduce the line count where we suspect the problem to be to 2 **without** using a debugger. This gives you:
  
 ^ Lesson 6 ^ ^ Lesson 6 ^
 | A debugger is not a replacement for using your brain first to isolate the issue to a certain region in your program. However, please do not waste too much time trying to find the issue without debugging tools just for the sake of polishing your pride. Many issues are very easy to locate with a debugger even without thinking at all about why they might be happening. | | A debugger is not a replacement for using your brain first to isolate the issue to a certain region in your program. However, please do not waste too much time trying to find the issue without debugging tools just for the sake of polishing your pride. Many issues are very easy to locate with a debugger even without thinking at all about why they might be happening. |
  
-===== Actually using GDB =====+===== Debugger: GDB ===== 
 +One of the most well-known and powerful debuggers in the Unix world is GDB. It has been in development since 1986. You should definitely give it a chance.
  
 ^ Lesson 7 ^ ^ Lesson 7 ^
Line 463: Line 466:
 bsub […] ./selective-debug --xterm ./debug-tutorial</code>Your program then is allowed to send the debug signal to [[selective-debug]] on multiple nodes for being able to run multiple instances of GDB. | bsub […] ./selective-debug --xterm ./debug-tutorial</code>Your program then is allowed to send the debug signal to [[selective-debug]] on multiple nodes for being able to run multiple instances of GDB. |
  
-===== Other debuggers ===== +===== Debugger: CGDB =====
-While GDB is a very powerful tool, it is not the only powerful one.\\ +
-We designed [[selective-debug]] which allows easy addition of new debuggers. An introduction to each one which is currently supported will follow.\\ +
-If you want support for another debugger to be added, feel free to contact  [[zdv@leo.bogert.de|Leo Bogert]]. +
- +
-==== CGDB ====+
 CGDB is basically a wrapper around GDB. It amends GDB with a terminal-graphics interface.\\ CGDB is basically a wrapper around GDB. It amends GDB with a terminal-graphics interface.\\
 This interface splits the terminal in a top and bottom half: This interface splits the terminal in a top and bottom half:
Line 480: Line 478:
 bsub […] ./selective-debug --xterm --cgdb ./debug-tutorial</code> | bsub […] ./selective-debug --xterm --cgdb ./debug-tutorial</code> |
  
-==== ltrace ====+===== Debugger: ltrace =====
 ltrace means "library trace". It will show a timeline (trace) of all calls to library functions which your program does.\\ ltrace means "library trace". It will show a timeline (trace) of all calls to library functions which your program does.\\
 The "functions" in "library functions" refers to normal C functions.\\ The "functions" in "library functions" refers to normal C functions.\\
Line 490: Line 488:
  
 With the ability to show a trace of calls to MPI functions, we can get a very useful timeline of our program's execution. With the ability to show a trace of calls to MPI functions, we can get a very useful timeline of our program's execution.
 +
 +Another notable difference between ltrace and GDB you need to know is that ltrace is not an //interactive// debugger:
 +  * It will not halt the execution of your program.
 +  * You cannot enter any commands while it is running. All of its behavior is determined by its parameters. They can be passed to ltrace with the environment variable <nowiki>SELECTIVE_DEBUG__DEBUGGER_PARAMS</nowiki>. We will show you how to do that later.
  
 ^ Lesson 12 ^ ^ Lesson 12 ^
Line 495: Line 497:
 bsub […] ./selective-debug --ltrace ./debug-tutorial</code> | bsub […] ./selective-debug --ltrace ./debug-tutorial</code> |
  
-The first notable difference between ltrace and GDB you need to know is that ltrace is not an //interactive// debugger+==== Ltrace example ==== 
-  * It will not halt the execution of your program. +As with GDB, we amend our code with a //BREAKPOINT_AND_SLEEP(10)// which is executed only on one rank:
-  * You cannot enter any commands while it is running. All of its behavior is determined by its parameters. They can be passed to ltrace with the environment variable SELECTIVE_DEBUG__DEBUGGER_PARAMS. We will show you how to do that later.+<code c> 
 +if(my_rank == 0) 
 +  BREAKPOINT_AND_SLEEP(10); 
 +</code>
  
 +We have provided a file [[https://github.com/leo-bogert/mpi-debug-tutorial/blob/master/debug-tutorial-4-ltrace.c|debug-tutorial-4-ltrace.c]] which contains this modification. 
 +Notice that we have added the breakpoint close to the beginning of the program so we can see what a rather complete trace looks like.
 +
 +We compile the file **without** the **-g** switch, which would include debug information: ltrace does not need debug information!\\
 +This has the advantage that you can use it on programs whose source code you do not have.
 +<code bash>
 +mpicc debug-tutorial-4-ltrace.c -std=c99 -o debug-tutorial
 +</code>
 +
 +We execute [[selective-debug]] with the ltrace parameter:
 +<code bash>
 +bsub -I -n 10 -q short -a openmpi mpirun ./selective-debug --ltrace ./debug-tutorial
 +</code>
 +
 +The output will look similar to:
 +<code>
 +Job <23906720> is submitted to queue <short>.
 +<<Waiting for dispatch ...>>
 +<<Starting on a0398>>
 +[pid 63586] MPI_Alloc_mem(419428, 0x601660, 0x7ffff5a76900, 10, 0x7ffff5a766f0) = 0
 +[pid 63586] MPI_Scatter(0x6027b8, 104857, 0x601760, 0x2ad2c1e46000, 104857) = 0
 +[pid 63586] MPI_Free_mem(0x2ad2c1e46000, 0xffff8002, 0x2e717f67, 0x2ad2bb53e8b8, 0x1d2b0c0) = 0
 +[pid 63586] MPI_Reduce(0x7ffff5a768f8, 0x602790, 1, 0x602580, 0x601d60) = 0
 +[pid 63586] fwrite("Test FAILED!\n", 1, 13, 0x2ad2b7995860Test FAILED!
 +) = 13
 +[pid 63586] MPI_Barrier(0x601960, 0x401200, 13, -1, 0x4012a6) = 0
 +[pid 63586] MPI_Finalize(0x7ffff5a767f8, 0xffff8002, 18, 0, 0x1d2a2e0 <unfinished ...>
 +</code>
 +
 +==== Ltrace parameters: Profiling execution time ====
 +A very powerful feature of ltrace is the ability to measure how much execution time your program has spent in certain functions.\\
 +You can do this with the **-c** parameter of ltrace.
 +
 +But since ltrace is executed indirectly via [[selective-debug]], we need to use its facility of passing parameters through to the debugger.\\
 +This can be done with the environment variable <nowiki>SELECTIVE_DEBUG__DEBUGGER_PARAMS</nowiki>. To pass an environment variable to a command which you execute in the shell, you put the assignment to the variable before the command:
 +<code bash>
 +SELECTIVE_DEBUG__DEBUGGER_PARAMS='-c' bsub -I -n 10 -q short -a openmpi mpirun ./selective-debug --ltrace ./debug-tutorial
 +</code>
 +Please notice that we put the assignment of the variable at the very beginning of the command line, not before the "./selective-debug": The shell only treats an assignment as an environment variable if it precedes the command it executes. To the shell, "bsub" is the command, and everything which follows it is a parameter of bsub, not a command.
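The placement rule can be tried out locally without bsub. The following is a minimal sketch in plain POSIX shell; ''probe'' is a hypothetical stand-in for bsub and is not part of [[selective-debug]]. It merely prints the variable it finds in its environment and the arguments it received.

```shell
# Hypothetical stand-in for "bsub": report the environment variable and the arguments.
probe='printf "env=%s args=%s\n" "${SELECTIVE_DEBUG__DEBUGGER_PARAMS:-unset}" "$*"'

# Make sure the variable is not inherited from the current session.
unset SELECTIVE_DEBUG__DEBUGGER_PARAMS

# Assignment before the command word: it lands in the environment of the command.
before=$(SELECTIVE_DEBUG__DEBUGGER_PARAMS='-c' sh -c "$probe" probe ./debug-tutorial)

# Assignment after the command word: it is just another argument.
after=$(sh -c "$probe" probe SELECTIVE_DEBUG__DEBUGGER_PARAMS='-c' ./debug-tutorial)

echo "$before"
echo "$after"
```

The first call prints ''env=-c args=./debug-tutorial''. In the second call the assignment shows up among the arguments and the variable stays unset.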
 +
 +You will get an output similar to:
 +<code>
 +% time     seconds  usecs/call     calls      function
 +------ ----------- ----------- --------- --------------------
 + 96.12    2.050929     2050929         1 MPI_Finalize
 +  1.07    0.022868       22868         1 MPI_Scatter
 +  0.83    0.017779       17779         1 MPI_Free_mem
 +  0.52    0.011048       11048         1 MPI_Alloc_mem
 +  0.51    0.010905         170        64 
 +  0.49    0.010353       10353         1 MPI_Barrier
 +  0.27    0.005779        5779         1 MPI_Reduce
 +  0.19    0.003998        3998         1 fwrite
 +------ ----------- ----------- --------- --------------------
 +100.00    2.133659                    71 total
 +</code>
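If you save such a summary to a file, a small filter can pull out the functions which dominate the run time. This is only a sketch, assuming the column layout shown above; ''top_functions'' is a hypothetical helper name and ''profile.txt'' in the usage line is a hypothetical file name.

```shell
# top_functions MIN_PERCENT: read an "ltrace -c" summary on stdin and print
# "function percent" for every row whose "% time" is at least MIN_PERCENT.
# The header, the separator lines, rows without a function name and the
# "total" row do not have five fields with a numeric first field and are skipped.
top_functions() {
  awk -v min="$1" '
    NF == 5 && $1 ~ /^[0-9]+(\.[0-9]+)?$/ && $1 + 0 >= min + 0 { print $5, $1 }
  '
}

# Usage (hypothetical file name):
#   top_functions 1 < profile.txt
```

Applied to the summary above with a threshold of 1, this would list MPI_Finalize and MPI_Scatter only.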
  
-==== strace ==== +===== Debugger: strace ===== 
-strace means "system trace". It is very similar to ltrace which we explained in the [[debug_tutorial#ltrace|previous section]]:\\+strace means "system trace". It is very similar to ltrace which we explained in the [[debug_tutorial#Debugger: ltrace|previous section]]:\\
 It traces calls to system functions. It traces calls to system functions.
 [[http://en.wikipedia.org/wiki/System_call|System functions]] are functions which are implemented in the Linux kernel, as opposed to library functions which are implemented in [[http://en.wikipedia.org/wiki/User_space|user space]]. They are typically basic low level functions such as file access, multithreading and networking. [[http://en.wikipedia.org/wiki/System_call|System functions]] are functions which are implemented in the Linux kernel, as opposed to library functions which are implemented in [[http://en.wikipedia.org/wiki/User_space|user space]]. They are typically basic low level functions such as file access, multithreading and networking.
  
-^ Lesson 12 ^+^ Lesson 13 ^
 | To use strace, use the **- - strace** switch of [[selective-debug]]:\\ \\ <code bash> | To use strace, use the **- - strace** switch of [[selective-debug]]:\\ \\ <code bash>
 bsub […] ./selective-debug --strace ./debug-tutorial</code> | bsub […] ./selective-debug --strace ./debug-tutorial</code> |
 +
 +What you have learned about ltrace also applies to strace. Therefore, please make sure to read the [[debug_tutorial#Debugger: ltrace|ltrace section]] in depth.
 +===== Logging debugger output with selective-debug =====
 +As you have seen, ltrace and strace are non-interactive debuggers. Such debuggers usually produce a large amount of output which is difficult to follow live.\\
 +Therefore, we now show you how to use [[selective-debug]] to log the output of the debugger to one file per OpenMPI rank.
 +
 +^ Lesson 14 ^
 +| There are two switches of [[selective-debug]] which can be used to control logging:\\ \\ <code>
 +--log        If you are using OpenMPI and want to clone the output of the debugger to a per-rank log file.
 +             Filename will be "selective-debug-rankN.log". The file will be appended, not overwritten.
 +             The output will also go to the terminal so you can use --xterm.
 +
 +--log-quiet  Only log debugger output to files, do not show it on the terminal.
 +</code> |
 +
 +The possibilities offered by those switches will be explained in the following sections:
 +==== Logging on all ranks ====
 +They allow you to use [[debug_tutorial#Debugger: ltrace|ltrace]] or [[debug_tutorial#Debugger: strace|strace]] in parallel on **all ranks** because each rank gets a separate log file.\\ To do this, just call //BREAKPOINT_AND_SLEEP()// on every rank.
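Once every rank has written its own log, the files can be compared with ordinary shell tools. The following sketch assumes the file name pattern "selective-debug-rankN.log" given in Lesson 14 and that the logs are in the current directory; ''count_calls'' is a hypothetical helper name, not a feature of [[selective-debug]].

```shell
# count_calls FUNCTION: for every per-rank log in the current directory, print
# the file name and the number of lines which mention FUNCTION.
count_calls() {
  for log in selective-debug-rank*.log; do
    # Skip the unexpanded pattern if no log file exists.
    [ -e "$log" ] || continue
    printf '%s %s\n' "$log" "$(grep -c -- "$1" "$log")"
  done
}

# Usage:
#   count_calls MPI_Scatter
```

A rank whose count differs from the others is a good candidate for a closer look with an interactive debugger.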
 +
 +==== Separate program and debugger output ====
 +If your program produces output on the terminal, it would garble the output of the debugger because it would appear at random positions within it. The **- - log-quiet** switch fixes this: your program gets the terminal for its output and the debugger gets the log files.
 +
 +==== Non-interactive debugging ====
 +Using the **- - log-quiet** switch, you can run the batch job **non-interactively**, i.e. remove the **-I** switch of bsub: You will get the output of your program by e-mail, and the output of the debugger will be in the log files created on disk.
 +===== Other debuggers =====
 +We designed [[selective-debug]] so that new debuggers can be added easily.\\
 +If you want support for another debugger to be added, feel free to contact  [[zdv@leo.bogert.de|Leo Bogert]].
  
  • Last modified: 2013/08/02 00:28
  • by bogert