Differences
This shows you the differences between two versions of the page.
debug_tutorial [2013/08/01 04:50] bogert [CGDB] Formatting |
debug_tutorial [2013/08/02 02:18] bogert [Debugger: strace] Explain that strace is explained by reading the ltrace section |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Tutorial: Debugging MPI programs at Mogon ====== | ====== Tutorial: Debugging MPI programs at Mogon ====== | ||
===== Summary ===== | ===== Summary ===== | ||
- | |||
We show a method of attaching a debugger on the fly at the request of your application.\\ | We show a method of attaching a debugger on the fly at the request of your application.\\ | ||
With the shown method your application can determine on its own which nodes need to be debugged and request attachment of the debugger only on those nodes.\\ | With the shown method your application can determine on its own which nodes need to be debugged and request attachment of the debugger only on those nodes.\\ | ||
Line 7: | Line 6: | ||
The primary debugger which this tutorial explains is GDB.\\ | The primary debugger which this tutorial explains is GDB.\\ | ||
- | At the end of the tutorial, you will be shown how to use various other debuggers with the same technique: cgdb, ltrace and strace. | + | At the end of the tutorial, you will be shown how to use various other debuggers with the same technique: cgdb, ltrace and strace.\\ |
+ | Notably, ltrace and strace can profile the execution time of your program, so you can also use them for performance optimization. | ||
For questions, please contact: | For questions, please contact: | ||
Line 16: | Line 16: | ||
===== Setup ===== | ===== Setup ===== | ||
- | We now show you how to obtain and execute | + | ==== Understanding & compiling |
- | Notice that lines starting with “#” are comments | + | The tutorial |
+ | First of all, we download the source code. Notice that lines starting with “#” are comments and do not need to be executed. | ||
<code bash> | <code bash> | ||
ssh mogon.zdv.uni-mainz.de | ssh mogon.zdv.uni-mainz.de | ||
Line 123: | Line 124: | ||
See the source code for how they are used. Also notice that they are most likely implemented with efficient parallelization as well. Therefore, it is a good idea to use those collective operations instead of manually distributing the chunks and adding up the sub sums. | See the source code for how they are used. Also notice that they are most likely implemented with efficient parallelization as well. Therefore, it is a good idea to use those collective operations instead of manually distributing the chunks and adding up the sub sums. | ||
- | ===== Execution ===== | + | ==== Preparing for debugging |
- | We now compile | + | This section will show you how to provoke bugs in your programs. After that, you will be given an example of how to isolate the bugs to a small region in the program |
+ | First of all, we schedule the program for execution as a batch job on Mogon: | ||
<code bash> | <code bash> | ||
# Make the libraries available which are needed for compilation and execution | # Make the libraries available which are needed for compilation and execution | ||
Line 229: | Line 231: | ||
| Once you have run tests for many different node counts and notice that they succeed for only some of them, try to spot a pattern between the node count and the boundary conditions of your program. Boundary conditions are likely to induce problems. | | | Once you have run tests for many different node counts and notice that they succeed for only some of them, try to spot a pattern between the node count and the boundary conditions of your program. Boundary conditions are likely to induce problems. | | ||
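One boundary condition worth checking is whether the work divides evenly among the processes. The following sketch is only an illustration: the element count of 1000 and the helper name `leftover_elements` are assumptions, not values taken from the tutorial program.

```shell
# Hypothetical helper: how many elements remain when "total" elements
# are split into equally sized chunks among "nodes" processes.
leftover_elements() {
    total=$1
    nodes=$2
    echo $(( total % nodes ))
}

# Assumed problem size, for illustration only.
total=1000
for nodes in 2 3 4 5 6 7 8 9 10; do
    echo "nodes=$nodes leftover=$(leftover_elements "$total" "$nodes")"
done
```

If the failing node counts are exactly those with a non-zero leftover, the code that distributes the chunks is a good place to start looking.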
- | Before we fulfill our promise to actually teach you something about GDB, we note that we have managed to reduce the line count where we suspect the problem to be to 2 **without** using GDB. This gives you: | + | Before we fulfill our promise to actually teach you something about using a debugger, we note that we have managed to reduce the line count where we suspect the problem to be to 2 **without** using a debugger. This gives you: |
^ Lesson 6 ^ | ^ Lesson 6 ^ | ||
| A debugger is not a replacement for using your brain first to isolate the issue to a certain region in your program. However, please do not waste too much time trying to find the issue without debugging tools just for the sake of polishing your pride. Many issues are very easy to locate with a debugger even without thinking at all about why they might be happening. | | | A debugger is not a replacement for using your brain first to isolate the issue to a certain region in your program. However, please do not waste too much time trying to find the issue without debugging tools just for the sake of polishing your pride. Many issues are very easy to locate with a debugger even without thinking at all about why they might be happening. | | ||
- | ===== Actually using GDB ===== | + | ===== Debugger: |
+ | One of the most well-known and powerful debuggers in the Unix world is GDB. It has been in development since 1986. You should definitely give it a chance. | ||
^ Lesson 7 ^ | ^ Lesson 7 ^ | ||
Line 251: | Line 254: | ||
“Wrapper” means that you put the script into the bsub command line as the executable to launch and add the actual program as parameter to it: | “Wrapper” means that you put the script into the bsub command line as the executable to launch and add the actual program as parameter to it: | ||
< | < | ||
- | bsub -I -n 10 -q long -a openmpi mpirun ./ | + | bsub -I -n 10 -q short -a openmpi mpirun ./ |
</ | </ | ||
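The wrapper idea can be illustrated with a few lines of shell. This is not the actual [[selective-debug]] script, only a minimal sketch of the pattern; the function name `wrapper` and its message are made up for illustration.

```shell
# Hypothetical stand-in for a wrapper script: it receives the real program
# and that program's arguments as its own arguments.
wrapper() {
    # A real wrapper would do its setup here (for example, prepare a debugger).
    echo "wrapper: launching $*" >&2
    # Run the actual program with its arguments unchanged.
    "$@"
}

# The launcher starts the wrapper; the wrapper starts the program.
wrapper echo "hello from the wrapped program"
```

Because the wrapper passes all arguments through unchanged and returns the program's exit status, the program behaves the same as if it had been launched directly.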
Line 272: | Line 275: | ||
The macro is called " | The macro is called " | ||
- | - It takes some time for gdb to start up and halt the program. If the program does not sleep for long enough, the debugger will halt it beyond the desired breakpoint. If you notice that this happens, please increase the sleep delay.\\ | + | - It takes some time for gdb to start up and halt the program. If the program does not sleep for long enough, the debugger will halt it beyond the desired breakpoint. If you notice that this happens, please increase the sleep delay. You can tell that it happened from " |
Also please do not just enter a very large value for the delay: When you are in the debugger and want to step through the program, you will have to wait for the whole delay to expire first! | Also please do not just enter a very large value for the delay: When you are in the debugger and want to step through the program, you will have to wait for the whole delay to expire first! | ||
Line 450: | Line 453: | ||
===== Debugging multiple nodes at once ===== | ===== Debugging multiple nodes at once ===== | ||
- | The technique we have shown you using the //[[selective-debug]]// wrapper has one disadvantage: | + | The technique we have shown you using the [[selective-debug]] wrapper has one disadvantage: |
- | To debug multiple processes at once, //[[selective-debug]]// does offer you an **- - xterm** switch: It will make it launch GDB inside of a new graphical terminal window using the X-Server on your client.\\ | + | To debug multiple processes at once, [[selective-debug]] does offer you an **- - xterm** switch: It will make it launch GDB inside of a new graphical terminal window using the X-Server on your client.\\ |
- | This will allow your program to tell //[[selective-debug]]// to launch GDB on multiple nodes:\\ | + | This will allow your program to tell [[selective-debug]] to launch GDB on multiple nodes:\\ |
Each instance of GDB will have its own Xterm terminal window. | Each instance of GDB will have its own Xterm terminal window. | ||
Line 461: | Line 464: | ||
| To use multiple instances of GDB, forward the connection to your X-Server via SSH and use the **- - xterm** switch of [[selective-debug]]: | | To use multiple instances of GDB, forward the connection to your X-Server via SSH and use the **- - xterm** switch of [[selective-debug]]: | ||
ssh -Y mogon.zdv.uni-mainz.de | ssh -Y mogon.zdv.uni-mainz.de | ||
- | bsub […] -a openmpi mpirun | + | bsub […] ./ |
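Before submitting a job that opens Xterm windows, it can help to confirm that X forwarding is active in your SSH session. A minimal sketch, assuming only that `ssh -Y` sets the `DISPLAY` variable on the remote side when forwarding works; the function name `x11_ready` is hypothetical.

```shell
# Hypothetical check: succeeds if DISPLAY is set to a non-empty value.
x11_ready() {
    [ -n "${DISPLAY:-}" ]
}

if x11_ready; then
    echo "X forwarding looks active (DISPLAY=$DISPLAY)"
else
    echo "DISPLAY is not set; reconnect with ssh -Y" >&2
fi
```

A set `DISPLAY` does not guarantee that windows can actually be opened, but an empty one is a reliable sign that the Xterm windows will not appear.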
- | ===== Other debuggers | + | ===== Debugger: CGDB ===== |
- | While GDB is a very powerful tool, it is not the only powerful one.\\ | + | |
- | We designed [[selective-debug]] which allows easy addition of new debuggers. An introduction to each one which is currently supported will follow.\\ | + | |
- | If you want support for another debugger to be added, feel free to contact | + | |
- | + | ||
- | ==== CGDB ==== | + | |
CGDB is basically a wrapper around GDB. It extends GDB with a terminal-graphics interface.\\ | CGDB is basically a wrapper around GDB. It extends GDB with a terminal-graphics interface.\\ | ||
This interface splits the terminal into a top and a bottom half: | This interface splits the terminal into a top and a bottom half: | ||
Line 478: | Line 476: | ||
| To use CGDB you will need to do two modifications to your debugging setup:\\ - Make [[selective-debug]] launch Xterms for debugging as described in the [[debug_tutorial# | | To use CGDB you will need to do two modifications to your debugging setup:\\ - Make [[selective-debug]] launch Xterms for debugging as described in the [[debug_tutorial# | ||
ssh -Y mogon.zdv.uni-mainz.de | ssh -Y mogon.zdv.uni-mainz.de | ||
- | bsub […] -a openmpi mpirun | + | bsub […] ./ |
- | ==== ltrace ==== | + | ===== Debugger: |
ltrace means " | ltrace means " | ||
The " | The " | ||
Line 490: | Line 488: | ||
With the ability to show a trace of calls to MPI functions, we can get a very useful timeline of our program's execution. | With the ability to show a trace of calls to MPI functions, we can get a very useful timeline of our program's execution. | ||
+ | |||
+ | Another notable difference between ltrace and GDB you need to know is that ltrace is not an // | ||
+ | * It will not halt the execution of your program. | ||
+ | * You cannot enter any commands while it is running. All of its behavior is determined by its parameters. They can be passed to ltrace with the environment variable < | ||
^ Lesson 12 ^ | ^ Lesson 12 ^ | ||
| To use ltrace, use the **- - ltrace** switch of [[selective-debug]]: | | To use ltrace, use the **- - ltrace** switch of [[selective-debug]]: | ||
- | bsub […] -a openmpi mpirun ./ | + | bsub […] ./ |
+ | |||
+ | ==== Ltrace example ==== | ||
+ | As with GDB, we amend our code with a // | ||
+ | <code c> | ||
+ | if(my_rank == 0) | ||
+ | BREAKPOINT_AND_SLEEP(10); | ||
+ | </ | ||
+ | |||
+ | We have provided a file [[https:// | ||
+ | Notice that we have added the breakpoint close to the beginning of the program so we can see what a rather complete trace looks like. | ||
+ | |||
+ | We compile the file **without** the **-g** switch, which would include debug information: Ltrace does not need debug information!\\ | ||
+ | This has the advantage that you can also use it on programs whose source code you do not have. | ||
+ | <code bash> | ||
+ | mpicc debug-tutorial-4-ltrace.c -std=c99 -o debug-tutorial | ||
+ | </ | ||
+ | |||
+ | We execute [[selective-debug]] with the ltrace parameter: | ||
+ | <code bash> | ||
+ | bsub -I -n 10 -q short -a openmpi mpirun ./ | ||
+ | </ | ||
+ | |||
+ | The output will look similar to: | ||
+ | < | ||
+ | Job < | ||
+ | << | ||
+ | << | ||
+ | [pid 63586] MPI_Alloc_mem(419428, | ||
+ | [pid 63586] MPI_Scatter(0x6027b8, | ||
+ | [pid 63586] MPI_Free_mem(0x2ad2c1e46000, | ||
+ | [pid 63586] MPI_Reduce(0x7ffff5a768f8, | ||
+ | [pid 63586] fwrite(" | ||
+ | ) = 13 | ||
+ | [pid 63586] MPI_Barrier(0x601960, | ||
+ | [pid 63586] MPI_Finalize(0x7ffff5a767f8, | ||
+ | </ | ||
+ | |||
+ | ==== Ltrace parameters: Profiling execution time ==== | ||
+ | A very powerful feature of ltrace is the ability to measure how much execution time your program has spent in certain functions.\\ | ||
+ | You can do this with the **-c** parameter of ltrace. | ||
+ | |||
+ | But since ltrace is executed indirectly via [[selective-debug]], | ||
+ | This can be done with the environment variable < | ||
+ | <code bash> | ||
+ | SELECTIVE_DEBUG__DEBUGGER_PARAMS=' | ||
+ | </ | ||
+ | Please notice that we put the assignment of the variable at the very beginning of the command line, not before the " | ||
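The effect of placing the assignment at the very beginning of the command line can be reproduced without the cluster. In this sketch, two nested `sh` processes stand in for the chain of launchers, and the innermost command stands in for the debugger wrapper; the value `-c` is used only as an illustration, and the sketch assumes that every launcher in the chain passes its environment on to its children.

```shell
# Innermost command: prints the parameters it finds in its environment.
inner='printf "%s\n" "${SELECTIVE_DEBUG__DEBUGGER_PARAMS:-unset}"'

# Assignment first: the variable is put into the environment of the launched
# command and is inherited by the processes which that command starts.
with_prefix=$(SELECTIVE_DEBUG__DEBUGGER_PARAMS='-c' sh -c "sh -c '$inner'")

# Without the assignment the innermost command sees nothing
# (assuming the variable is not already set in your shell).
without_prefix=$(sh -c "sh -c '$inner'")

echo "with assignment:    $with_prefix"
echo "without assignment: $without_prefix"
```

The assignment applies only to that one command line. It does not change the variable in your interactive shell, so later jobs are not affected.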
+ | |||
+ | You will get an output similar to: | ||
+ | < | ||
+ | % time | ||
+ | ------ ----------- ----------- --------- -------------------- | ||
+ | | ||
+ | 1.07 0.022868 | ||
+ | 0.83 0.017779 | ||
+ | 0.52 0.011048 | ||
+ | 0.51 0.010905 | ||
+ | 0.49 0.010353 | ||
+ | 0.27 0.005779 | ||
+ | 0.19 0.003998 | ||
+ | ------ ----------- ----------- --------- -------------------- | ||
+ | 100.00 | ||
+ | </ | ||
+ | |||
+ | ===== Debugger: strace ===== | ||
+ | strace means " | ||
+ | It traces the system calls of your program, that is, its calls into the operating system kernel. | ||
+ | [[http:// | ||
+ | |||
+ | ^ Lesson 13 ^ | ||
+ | | To use strace, use the **- - strace** switch of [[selective-debug]]: | ||
+ | bsub […] ./ | ||
+ | |||
+ | What you have learned about ltrace also applies to strace. Therefore, please make sure to read the [[debug_tutorial# | ||
+ | ===== Logging debugger output with selective-debug ===== | ||
+ | As you have seen, ltrace and strace are non-interactive debuggers. Such debuggers usually produce a large amount of output which is difficult to follow live.\\ | ||
+ | Therefore, we now show you how to use [[selective-debug]] to log the output of the debugger to one file per OpenMPI rank. | ||
+ | |||
+ | ^ Lesson 14 ^ | ||
+ | | There are two switches of [[selective-debug]] which can be used to control logging:\\ \\ < | ||
+ | --log If you are using OpenMPI and want to clone the output of the debugger to a per-rank log file. | ||
+ | | ||
+ | The output will also go to the terminal so you can use --xterm. | ||
+ | |||
+ | --log-quiet | ||
+ | </ | ||
+ | |||
+ | The possibilities offered by those switches will be explained in the following sections: | ||
+ | ==== Logging on all ranks ==== | ||
+ | They allow you to use [[debug_tutorial# | ||
+ | |||
+ | ==== Separate program and debugger output ==== | ||
+ | If your program produces output on the terminal, that output would appear at random places within the output of the debugger and make it hard to read. The **- - log-quiet** switch is suitable for fixing this. Your program will get the terminal for output and the debugger will get the log files. | ||
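The difference between the two logging modes can be sketched in plain shell. This is not how [[selective-debug]] is implemented; the function names, the `LOG_DIR` variable and the log file naming are assumptions made for illustration. `OMPI_COMM_WORLD_RANK` is the environment variable in which Open MPI's mpirun tells each process its rank.

```shell
# Hypothetical per-rank log file name, built from the Open MPI rank.
rank_log_file() {
    printf '%s/debugger.rank-%s.log\n' "${LOG_DIR:-.}" "${OMPI_COMM_WORLD_RANK:-unknown}"
}

# Like --log: copy the debugger output (stdin) to the log file and the terminal.
log_clone() {
    tee "$(rank_log_file)"
}

# Like --log-quiet: write the debugger output (stdin) to the log file only.
log_quiet() {
    cat > "$(rank_log_file)"
}
```

Piping a debugger's output into `log_quiet` keeps the terminal free for the program's own output, while `log_clone` keeps the live view and still leaves one file per rank to read afterwards.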
+ | |||
+ | ==== Non-interactive debugging ==== | ||
+ | Using the **- - log-quiet** switch, you can run the batch job **non-interactively**, | ||
+ | ===== Other debuggers ===== | ||
+ | We designed [[selective-debug]] so that new debuggers can be added easily.\\ | ||
+ | If you want support for another debugger to be added, feel free to contact | ||