debug_tutorial

debug_tutorial [2013/08/02 01:14]
bogert [ltrace] Reorder stuff
debug_tutorial [2013/08/02 02:18] (current)
bogert [Debugger: strace] Explain that strace is explained by reading the ltrace section
Line 1: Line 1:
 ====== Tutorial: Debugging MPI programs at Mogon ====== ====== Tutorial: Debugging MPI programs at Mogon ======
 ===== Summary ===== ===== Summary =====
- 
 We show a method of attaching a debugger on the fly by request of your application.\\ We show a method of attaching a debugger on the fly by request of your application.\\
 With the shown method your application can determine on its own which nodes need to be debugged and request attachment of the debugger only on those nodes.\\ With the shown method your application can determine on its own which nodes need to be debugged and request attachment of the debugger only on those nodes.\\
Line 7: Line 6:
  
 The primary debugger which this tutorial explains is GDB.\\ The primary debugger which this tutorial explains is GDB.\\
-At the end of the tutorial, you will be shown how to use various other debuggers with the same technique: cgdb, ltrace and strace.+At the end of the tutorial, you will be shown how to use various other debuggers with the same technique: cgdb, ltrace and strace.\\ 
 +Notably, ltrace and strace support profiling the execution time of your program - you can use them for performance optimization.
  
 For questions, please contact: For questions, please contact:
Line 16: Line 16:
  
 ===== Setup ===== ===== Setup =====
-We now show you how to obtain and execute the sample program used in this tutorial. +==== Understanding & compiling the sample program ==== 
-Notice that lines starting with “#” are comments and do not need to be executed.+The tutorial is based on a sample program. This section shows how to obtain it, explains what it does, and tells you how to compile it.
  
 +First of all, we download the source code. Notice that lines starting with “#” are comments and do not need to be executed.
 <code bash> <code bash>
 ssh mogon.zdv.uni-mainz.de ssh mogon.zdv.uni-mainz.de
Line 123: Line 124:
 See the source code for how they are used. Also notice that they are most likely implemented with efficient parallelization as well. Therefore, it is a good idea to use those collective operations instead of manually distributing the chunks and adding up the sub sums. See the source code for how they are used. Also notice that they are most likely implemented with efficient parallelization as well. Therefore, it is a good idea to use those collective operations instead of manually distributing the chunks and adding up the sub sums.
  
-===== Execution ===== +==== Preparing for debugging ==== 
-We now compile the program and execute it as batch job on Mogon:+This section will show you how to provoke bugs in your programs. After that, you will be given an example of how to isolate the bugs to a small region in the program: getting a rough idea of where the bug might be before using the debugger reduces debugging time.
  
 +First of all, we schedule the program for execution as a batch job on Mogon:
 <code bash> <code bash>
 # Make the libraries available which are needed for compilation and execution # Make the libraries available which are needed for compilation and execution
Line 229: Line 231:
 | Once you have run tests for many different node counts, and you notice that they succeed for some of them, try to spot a pattern between the node count and boundary conditions of your program. Boundary conditions are likely to induce problems. | | Once you have run tests for many different node counts, and you notice that they succeed for some of them, try to spot a pattern between the node count and boundary conditions of your program. Boundary conditions are likely to induce problems. |
  
-Before we fulfill our promise to actually teach you something about GDB, we note that we have managed to reduce the line count where we suspect the problem to be to 2 **without** using GDB. This gives you:+Before we fulfill our promise to actually teach you something about using a debugger, we note that we have managed to reduce the line count where we suspect the problem to be to 2 **without** using a debugger. This gives you:
  
 ^ Lesson 6 ^ ^ Lesson 6 ^
 | A debugger is not a replacement for using your brain first to isolate the issue to a certain region in your program. However, please do not waste too much time trying to find the issue without debugging tools just for the sake of polishing your pride. Many issues are very easy to locate with a debugger even without thinking at all about why they might be happening. | | A debugger is not a replacement for using your brain first to isolate the issue to a certain region in your program. However, please do not waste too much time trying to find the issue without debugging tools just for the sake of polishing your pride. Many issues are very easy to locate with a debugger even without thinking at all about why they might be happening. |
  
-===== Actually using GDB =====+===== Debugger: GDB ===== 
 +One of the most well-known and powerful debuggers in the Unix world is GDB. It has been in development since 1986. You should definitely give it a chance.
  
 ^ Lesson 7 ^ ^ Lesson 7 ^
Line 463: Line 466:
 bsub […] ./selective-debug --xterm ./debug-tutorial</code>Your program can then send the debug signal to [[selective-debug]] on multiple nodes in order to run multiple instances of GDB. | bsub […] ./selective-debug --xterm ./debug-tutorial</code>Your program can then send the debug signal to [[selective-debug]] on multiple nodes in order to run multiple instances of GDB. |
  
-===== Other debuggers ===== +===== Debugger: CGDB =====
-While GDB is a very powerful tool, it is not the only powerful one.\\ +
-We designed [[selective-debug]] which allows easy addition of new debuggers. An introduction to each one which is currently supported will follow.\\ +
-If you want support for another debugger to be added, feel free to contact  [[zdv@leo.bogert.de|Leo Bogert]]. +
- +
-==== CGDB ====+
 CGDB is basically a wrapper around GDB. It amends GDB with a terminal-graphics interface.\\ CGDB is basically a wrapper around GDB. It amends GDB with a terminal-graphics interface.\\
 This interface splits the terminal in a top and bottom half: This interface splits the terminal in a top and bottom half:
Line 480: Line 478:
 bsub […] ./selective-debug --xterm --cgdb ./debug-tutorial</code> | bsub […] ./selective-debug --xterm --cgdb ./debug-tutorial</code> |
  
-==== ltrace ====+===== Debugger: ltrace =====
 ltrace means "library trace". It will show a timeline (trace) of all calls to library functions which your program makes.\\ ltrace means "library trace". It will show a timeline (trace) of all calls to library functions which your program makes.\\
 The "functions" in "library functions" refers to normal C functions.\\ The "functions" in "library functions" refers to normal C functions.\\
Line 499: Line 497:
 bsub […] ./selective-debug --ltrace ./debug-tutorial</code> | bsub […] ./selective-debug --ltrace ./debug-tutorial</code> |
  
-=== Ltrace example ===+==== Ltrace example ====
 As with GDB, we amend our code with a //BREAKPOINT_AND_SLEEP(10)// which is executed only on one rank: As with GDB, we amend our code with a //BREAKPOINT_AND_SLEEP(10)// which is executed only on one rank:
 <code c> <code c>
Line 535: Line 533:
 </code> </code>
  
-=== Ltrace parameters: Profiling execution time ===+==== Ltrace parameters: Profiling execution time ====
 A very powerful feature of ltrace is the ability to measure how much execution time your program has spent in certain functions.\\ A very powerful feature of ltrace is the ability to measure how much execution time your program has spent in certain functions.\\
 You can do this with the **-c** parameter of ltrace. You can do this with the **-c** parameter of ltrace.
Line 561: Line 559:
 100.00    2.133659                    71 total 100.00    2.133659                    71 total
 </code> </code>
-==== strace ==== + 
-strace means "system trace". It is very similar to ltrace which we explained in the [[debug_tutorial#ltrace|previous section]]:\\+===== Debugger: strace ===== 
 +strace means "system trace". It is very similar to ltrace which we explained in the [[debug_tutorial#Debugger: ltrace|previous section]]:\\
 It traces calls to system functions. It traces calls to system functions.
 [[http://en.wikipedia.org/wiki/System_call|System functions]] are functions which are implemented in the Linux kernel, as opposed to library functions which are implemented in [[http://en.wikipedia.org/wiki/User_space|user space]]. They are typically basic low level functions such as file access, multithreading and networking. [[http://en.wikipedia.org/wiki/System_call|System functions]] are functions which are implemented in the Linux kernel, as opposed to library functions which are implemented in [[http://en.wikipedia.org/wiki/User_space|user space]]. They are typically basic low level functions such as file access, multithreading and networking.
  
-^ Lesson 12 ^+^ Lesson 13 ^
 | To use strace, use the **- - strace** switch of [[selective-debug]]:\\ \\ <code bash> | To use strace, use the **- - strace** switch of [[selective-debug]]:\\ \\ <code bash>
 bsub […] ./selective-debug --strace ./debug-tutorial</code> | bsub […] ./selective-debug --strace ./debug-tutorial</code> |
 +
 +What you have learned about ltrace also applies to strace. Therefore, please make sure to read the [[debug_tutorial#Debugger: ltrace|ltrace section]] in depth.
 +===== Logging debugger output with selective-debug =====
 +As you have seen, ltrace and strace are non-interactive debuggers. Such debuggers usually produce a large amount of output which is difficult to follow live.\\
 +Therefore, we now show you how to use [[selective-debug]] to log the output of the debugger to one file per OpenMPI rank.
 +
 +^ Lesson 14 ^
 +| There are two switches of [[selective-debug]] which can be used to control logging:\\ \\ <code>
 +--log        If you are using OpenMPI and want to clone the output of the debugger to a per-rank log file.
 +             Filename will be "selective-debug-rankN.log". The file will be appended, not overwritten.
 +             The output will also go to the terminal so you can use --xterm.
 +
 +--log-quiet  Only log debugger output to files, do not show it on the terminal.
 +</code> |
 +
 +The possibilities offered by those switches will be explained in the following sections:
 +==== Logging on all ranks ====
 +The logging switches allow you to use [[debug_tutorial#Debugger: ltrace|ltrace]] or [[debug_tutorial#Debugger: strace|strace]] in parallel on **all ranks** because each rank gets a separate log file.\\ To do this, just call //BREAKPOINT_AND_SLEEP()// on every rank.
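
After a run that logs on all ranks, it can be useful to check that every rank really produced a log file. The following is a minimal sketch of such a check, not part of [[selective-debug]]: the helper name ''missing_rank_logs'' and its arguments are hypothetical, and it assumes that the N in "selective-debug-rankN.log" is the zero-based OpenMPI rank.

```shell
#!/bin/sh
# Hypothetical helper, not part of selective-debug.
# Usage: missing_rank_logs <rank count> [directory]
# Prints every rank in 0..count-1 for which no per-rank log file exists.
# Assumption: the N in "selective-debug-rankN.log" is the zero-based rank.
missing_rank_logs() (
    count="$1"
    dir="${2:-.}"
    rank=0
    while [ "$rank" -lt "$count" ]; do
        # Report the rank if its log file is absent
        [ -e "$dir/selective-debug-rank${rank}.log" ] || echo "$rank"
        rank=$((rank + 1))
    done
)
```

For a job with four ranks, calling ''missing_rank_logs 4'' in the working directory of the job would list the ranks without a log. Since the log files are appended rather than overwritten, files left over from an earlier run also count as present, so it may be sensible to remove old logs before starting a new job.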
 +
 +==== Separate program and debugger output ====
 +If your program produces output on the terminal, it would garble the output of the debugger because it would appear at random positions within it. The **- - log-quiet** switch is suitable for fixing this. Your program will get the terminal for output and the debugger will get the log files.
 +
 +==== Non-interactive debugging ====
 +Using the **- - log-quiet** switch, you can run the batch job **non-interactively**, i.e. remove the **-I** switch of bsub. You will get the output of your program by e-mail, and the output of the debugger will be in the log files created on disk.
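
Once a non-interactive job has finished, the per-rank log files can be inspected with ordinary shell tools. As a minimal sketch, the hypothetical helper below prints one summary line per log file. Only the file name pattern "selective-debug-rankN.log" is taken from the description of **- - log**; the function name and the output format are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of selective-debug.
# Usage: summarize_rank_logs [directory]
# Prints "rank N: M lines" for every per-rank log file in the directory.
summarize_rank_logs() (
    dir="${1:-.}"
    for file in "$dir"/selective-debug-rank*.log; do
        # If the pattern matched nothing, the literal pattern is left over
        [ -e "$file" ] || continue
        rank="${file##*/selective-debug-rank}"   # strip path and prefix
        rank="${rank%.log}"                      # strip the ".log" suffix
        lines=$(($(wc -l < "$file")))            # arithmetic strips padding
        printf 'rank %s: %d lines\n' "$rank" "$lines"
    done
)
```

Calling ''summarize_rank_logs'' without arguments looks in the current directory. The files are visited in the order the shell expands the pattern, so rank 10 is listed before rank 2.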
 +===== Other debuggers =====
 +We designed [[selective-debug]] to allow easy addition of new debuggers.\\
 +If you want support for another debugger to be added, feel free to contact [[zdv@leo.bogert.de|Leo Bogert]].
  
  • debug_tutorial.1375398879.txt.gz
  • Last modified: 2013/08/02 01:14
  • by bogert