debug_tutorial

# Differences

This shows you the differences between two versions of the page.

debug_tutorial [2013/07/24 00:10] bogert Linkify selective-gdb
debug_tutorial [2013/08/02 02:18] (current) bogert [Debugger: strace] Explain that strace is explained by reading the ltrace section

Line 1: Line 1:
- ====== Tutorial: Debugging MPI programs with GDB at Mogon ======
+ ====== Tutorial: Debugging MPI programs at Mogon ======
  ===== Summary =====
+ We show a method of attaching a debugger on the fly by request of your application.\\
+ With the shown method your application can determine on its own which nodes need to be debugged and request attachment of the debugger only on those nodes.\\
+ This allows you to debug while running on an amount of nodes which would be too large to attach the debugger to every single node.\\
- We show a method of attaching GDB on the fly by request of your application.\\
+ The primary debugger which this tutorial explains is GDB.\\
- With the shown method your application can determine on its own which node needs to be debugged and request attachment of GDB.\\
+ At the end of the tutorial, you will be shown how to use various other debuggers with the same technique: cgdb, ltrace and strace.\\
- This allows you to debug while running on an amount of nodes which would be too large to attach GDB to every single node.\\
+ Notably, ltrace and strace support profiling the execution time of your program - you can use them for performance optimization.
  For questions, please contact:
- --- //[[zdv@leo.bogert.de|Bogert, Leonhard]] 2013/07/20 23:11//
+ --- //[[zdv@leo.bogert.de|Bogert, Leo]] 2013/07/20 23:11//
  --- //[[t.suess@uni-mainz.de|Süß, Tim]] (not a signature) //
  ===== Setup =====
- We now show you how to obtain and execute the sample program used in this tutorial.
+ ==== Understanding & compiling the sample program ====
- Notice that lines starting with “#” are comments and do not need to be executed.
+ The tutorial is based on a sample program. This section shows how to obtain it, explains what it does, and tells you how to compile it.
+ First of all, we download the source code. Notice that lines starting with “#” are comments and do not need to be executed.
    ssh mogon.zdv.uni-mainz.de
Line 108: Line 112:
    }
+ The goal of the program is to add up a large array of numbers. This is very easy to parallelize:
- The goal of the program is to add up a large array of N numbers. This is very easy to parallelize:
+ Let N be the number of elements in the array. Let P be the number of workers. Then we do the following:
-
+
- Let P be the number of workers. Then we do the following:
+
  - We split up the array to P chunks of equal size N/P and distribute a chunk to each worker.
  - Each worker then adds up the N/P elements in its chunk.
Line 117: Line 120:
  Luckily, MPI even provides two collective operations which can do steps 1 and 3 for us. “Collective” means that we call the operations on all nodes and they yield a different result on each node:
- * The splitting of the array into chunks is done by MPI_Scatter. It yields the chunk of the node which executes it as a result.
+ * The splitting of the array into chunks is done by //MPI_Scatter//. It yields the chunk of the node which executes it as a result.
- * The addition of the chunk sub-sums is done by MPI_Reduce. It yields the sum of all sub sums on the root node (rank = 0).
+ * The addition of the chunk sub-sums is done by //MPI_Reduce//. It yields the sum of all sub sums on the root node (rank = 0).
  See the source code for how they are used. Also notice that they are most likely implemented with efficient parallelization as well. Therefore, it is a good idea to use those collective operations instead of manually distributing the chunks and adding up the sub sums.
- ===== Execution =====
+ ==== Preparing for debugging ====
- We now compile the program and execute it as a batch job on Mogon:
+ This section will show you how to provoke bugs in your programs. After that, you will be given an example of how to isolate the bugs to a small region in the program: getting a rough idea of where the bug might be before using the debugger reduces debugging time.
+ First of all, we schedule the program for execution as a batch job on Mogon:
    # Make the libraries available which are needed for compilation and execution
Line 139: Line 143:
    #   instead of being sent out by mail. This is necessary for being able to
    #   run GDB on a terminal.
-   # - Use 10 processes (-n 10). The fewer you choose the less you have to wait.
+   # - Use 2 processes (-n 2). The fewer you choose the less you have to wait.
    # - Run each process on a different computer (-R 'span[ptile=1]').
    #   This introduces latency which is good for provoking errors in parallel
Line 146: Line 150:
    #   time. ONLY USE THIS QUEUE IF YOUR JOB IS REALLY SHORT!
    # - (The “-a openmpi” is glue code to run MPI apps through bsub)
-   # - “mpirun ./debug-tutorial” is the actual command being executed 10 times.
+   # - “mpirun ./debug-tutorial” is the actual command being executed as each process.
    bsub -I -n 2 -R 'span[ptile=1]' -q short -a openmpi mpirun ./debug-tutorial
Line 184: Line 188:
  Because we don't spot a pattern between the node count at which it fails and the problem size, we now use the ability of the Linux shell to execute loops:
-   for i in {1..20} ; do bsub -n "$i" -R 'span[ptile=1]' -q long -a openmpi mpirun ./debug-tutorial ; done
+   for ((i=1; i <= 20; i++)) ; do bsub -n "$i" -R 'span[ptile=1]' -q long -a openmpi mpirun ./debug-tutorial ; done
Line 227: Line 231:
  | Once you have run tests for a wide range of node counts, and you notice that they succeed for some of them, try to spot a pattern between the node count and boundary conditions of your program. Boundary conditions are likely to induce problems. |
- Before we fulfill our promise to actually teach you something about GDB, we note that we have managed to reduce the line count where we suspect the problem to be to 2 **without** using GDB. This gives you:
+ Before we fulfill our promise to actually teach you something about using a debugger, we note that we have managed to reduce the line count where we suspect the problem to be to 2 **without** using a debugger. This gives you:
  ^ Lesson 6 ^
- | A debugger is not a replacement for using your brain first to isolate the issue to a certain region in your program. However, please do not waste too much time trying to find the issue without tools just for the sake of polishing your pride. Many issues are very easy to locate with a debugger even without thinking at all about why they might be happening. |
+ | A debugger is not a replacement for using your brain first to isolate the issue to a certain region in your program. However, please do not waste too much time trying to find the issue without debugging tools just for the sake of polishing your pride. Many issues are very easy to locate with a debugger even without thinking at all about why they might be happening. |
- ===== Actually using GDB =====
+ ===== Debugger: GDB =====
+ One of the most well-known and powerful debuggers in the Unix world is GDB. It has been in development since 1986. You should definitely give it a chance.
  ^ Lesson 7 ^
Line 243: Line 248:
  Unfortunately, when working with distributed computing, this is not sufficient:
- * Letting GDB start the processes would result in having a single GDB instance running for every process on every node. When using interactive mode with LSF (“-I”), we only get a single terminal. As GDB is a program which is controlled by terminal commands, every GDB command you enter into the terminal would be executed by every instance of GDB.\\ Further, it probably does not make sense to run GDB attached to every MPI process anyway: As our tutorial shows, problems often appear on a single node, and we therefore want to debug a single process at a time.
+ * Letting GDB start the processes would result in having a single GDB instance running for every process on every node. When using interactive mode with LSF (“-I”), we only get a single terminal. As GDB is a program which is controlled by terminal commands, every GDB command you enter into the terminal would be executed by //every// instance of GDB.\\ Further, it probably does not make sense to run GDB attached to every MPI process anyway: As our tutorial shows, problems often appear on a single node, and we therefore want to debug a single process at a time.
  * Attaching to an existing process would require that we run GDB on the same machine as the process is running on. It is acceptable to connect to a certain machine of Mogon via SSH as long as you ask the admin first. But having to connect to the affected machine first is quite annoying.
Line 249: Line 254:
  “Wrapper” means that you put the script into the bsub command line as the executable to launch and add the actual program as a parameter to it:
-   bsub -I -n 10 -q long -a openmpi mpirun ./selective-debug ./debug-tutorial
+   bsub -I -n 10 -q short -a openmpi mpirun ./selective-debug ./debug-tutorial
Line 267: Line 272:
    }
- The macro is called "BREAKPOINT..." because "setting a breakpoint" is the technical term for telling a debugger to halt the program at a certain breakpoint.
+ The macro is called "BREAKPOINT..." because "setting a breakpoint" is the technical term for telling a debugger to halt the program at a certain point - the breakpoint.
  The macro is called "..._AND_SLEEP" because after sending the breakpoint signal, it will call "sleep(10);" which causes the program to do nothing for 10 seconds.\\
- - It takes some time for gdb to start up and halt the program. If the program does not sleep for long enough, the debugger will halt it beyond the desired breakpoint. If you notice that this happens, please increase the sleep delay.\\
+ - It takes some time for gdb to start up and halt the program. If the program does not sleep for long enough, the debugger will halt it beyond the desired breakpoint. If you notice that this happens, please increase the sleep delay. You can tell that it happened from "bt" (explanation will follow) showing the wrong point of execution. It is also possible that your program has exited already, which the debugger will tell you with "No such process.".\\
  Also please do not just enter a very large value for the delay: When you are in the debugger and want to step through the program, you will have to wait for the whole delay to expire first!
Line 334: Line 339:
- Notice that what is displayed has NOT been executed yet; it is the next instruction which the processor will execute. We want to remember the “remainder_items” variable from which we compute the “index” variable so we look it up:
+ Notice that what is displayed has NOT been executed yet; it is the next instruction which the processor will execute. We want to remember the value of the “remainder_items” variable from which we compute the “index” variable so we look it up:
    (gdb) print remainder_items
Line 357: Line 362:
- What we did here is that we first looked up which line the execution is halted at using “bt” and then we used list to show lines from 39 to 59. Notice that you could as well just have shown some lines before the current location using “list” without parameters. To find out about more ways of specifying positions, use “help list”.
+ What we did here is that we first looked up which line the execution is halted at using “bt”. Then we used list to show the lines before that - from 39 to 59. Notice that you could as well just have shown some lines before the current location using “list” without parameters, but it might not show enough of them. To find out about more ways of specifying positions, use “help list”.
  Let's find the problem in our program now:
Line 375: Line 380:
  We stepped into the first execution of the while loop at line 57.\\
- What's the goal of that loop? Scatter will distribute an equal amount of items of the work to each node. Because the size of the array is not divisible by the amount of nodes, an amount N mod P items remains. The value N mod P has been computed at line 40 and stored in the variable //remainder_items//.\\
+ What's the goal of that loop? Scatter will distribute an equal amount of items of the work to each node. Because the size of the array is not divisible by the amount of nodes, an amount of N mod P items remains. The value N mod P has been computed at line 40 and stored in the variable //remainder_items//.\\
  The loop at line 57 which we step through shall process those //remainder_items// at the start of the work array((Notice that this is a poor algorithm: In the worst case, the amount of //remainder_items// will be as large as the size of all other work units minus one item. This means that when all nodes but the first one have finished computing already, the first node will continue computation for almost as long as it took the other nodes to finish.)).
Line 415: Line 420:
  In the bugged code, we obtain the index by doing
    remainder_items--
- This is the so-called "post-decrement" operator: It returns the current value of the variable and decrements the variable after that. If you want to obtain the decremented value, you have to use the "pre-decrement" operator:
+ This is the so-called "post-decrement" operator: It returns the current value of the variable and decrements the variable **after having returned the old value**. If you want to obtain the decremented value, you have to use the "pre-decrement" operator:
    --remainder_items
  This will decrement the variable and return the decremented value.
Line 448: Line 453: ===== Debugging multiple nodes at once ===== ===== Debugging multiple nodes at once ===== - The technique we have shown you using the //[[selective-debug]]// wrapper has one disadvantage: It only allows you to debug a single node at once because it is attached to the single terminal which was used to execute //bsub//. + The technique we have shown you using the [[selective-debug]] wrapper has one disadvantage: It only allows you to debug a single node at once because it is attached to the single terminal which was used to execute //bsub//. - To debug multiple processes at once, //[[selective-debug]]// does offer you an **- - xterm** switch: It will make it launch GDB inside of a new graphical terminal window using the X-Server on your client.\\ + To debug multiple processes at once, [[selective-debug]] does offer you an **- - xterm** switch: It will make it launch GDB inside of a new graphical terminal window using the X-Server on your client.\\ - This will allow your program to tell //[[selective-debug]]// to launch GDB on multiple nodes:\\ + This will allow your program to tell [[selective-debug]] to launch GDB on multiple nodes:\\ Each instance of GDB will have its own Xterm terminal window. Each instance of GDB will have its own Xterm terminal window. Line 457: Line 462: ^ Lesson 10 ^ ^ Lesson 10 ^ - | To use multiple instances of GDB, forward the connection to your X-Server via SSH and use the “- - xterm” switch of [[selective-debug]]:\\ \\ + | To use multiple instances of GDB, forward the connection to your X-Server via SSH and use the **- - xterm** switch of [[selective-debug]]:\\ \\ - ssh -Y bogert@mogon.zdv.uni-mainz.de + ssh -Y mogon.zdv.uni-mainz.de - bsub […] -a openmpi mpirun ./selective-debug --xterm ./debug-tutorialYour program then is allowed to send the debug signal to //[[selective-debug]]// on multiple nodes for being able to run multiple instances of GDB. 
| + bsub […] ./selective-debug --xterm ./debug-tutorialYour program then is allowed to send the debug signal to [[selective-debug]] on multiple nodes for being able to run multiple instances of GDB. | + + ===== Debugger: CGDB ===== + CGDB is basically a wrapper around GDB. It amends GDB with a terminal-graphics interface.\\ + This interface splits the terminal in a top and bottom half: + * The top always shows the source code of your program at the current point of execution. + * The bottom shows the GDB command prompt. + Having a large area of the source code always visible obviously is a huge advantage. + + ^ Lesson 11 ^ + | To use CGDB you will need to do two modifications to your debugging setup:\\ - Make [[selective-debug]] launch Xterms for debugging as described in the [[debug_tutorial#Debugging multiple nodes at once|previous section]]((This is needed because the standard SSH terminal does not provide sufficient functionality for the ncurses terminal-graphics interface of CGDB.)).\\ - Use the **- - cgdb** switch of [[selective-debug]]:\\ \\ + ssh -Y mogon.zdv.uni-mainz.de + bsub […] ./selective-debug --xterm --cgdb ./debug-tutorial | + + ===== Debugger: ltrace ===== + ltrace means "library trace". It will show a timeline (trace) of all calls to library functions which your program does.\\ + The "functions" in "library functions" refers to normal C functions.\\ + The "library" referes to all functions which are **not** implemented in the source code your program. + + In other words it will trace any functions which you //#include// from things such as: + * The [[http://en.wikipedia.org/wiki/C_standard_library|standard C library]] - it provides stuff such as //printf()//. + * The MPI library - **MPI functions are also library calls!** + + With the ability to show a trace of calls to MPI functions, we can get a very useful timeline of our programs execution. 
+
+ Another notable difference between ltrace and GDB you need to know is that ltrace is not an //interactive// debugger:
+ * It will not halt the execution of your program.
+ * You cannot enter any commands while it is running. All of its behavior is determined by its parameters. They can be passed to ltrace with the environment variable SELECTIVE_DEBUG__DEBUGGER_PARAMS. We will show you how to do that later.
+
+ ^ Lesson 12 ^
+ | To use ltrace, use the **- - ltrace** switch of [[selective-debug]]:\\ \\
+ bsub […] ./selective-debug --ltrace ./debug-tutorial |
+
+ ==== Ltrace example ====
+ As with GDB, we amend our code with a //BREAKPOINT_AND_SLEEP(10)// which is executed only on one rank:
+
+   if(my_rank == 0)
+     BREAKPOINT_AND_SLEEP(10);
+
+
+ We have provided a file [[https://github.com/leo-bogert/mpi-debug-tutorial/blob/master/debug-tutorial-4-ltrace.c|debug-tutorial-4-ltrace.c]] which contains this modification.
+ Notice that we have added the breakpoint close to the beginning of the program so we can see what a rather complete trace looks like.
+
+ We compile the file **without** the **-g** switch which would include debug information: Ltrace does not need debug information!\\
+ This provides the advantage that you can use it on programs whose source code you do not have.
+
+   mpicc debug-tutorial-4-ltrace.c -std=c99 -o debug-tutorial
+
+
+ We execute [[selective-debug]] with the ltrace parameter:
+
+   bsub -I -n 10 -q short -a openmpi mpirun ./selective-debug --ltrace ./debug-tutorial
+
+
+ The output will look similar to:
+
+   Job <23906720> is submitted to queue .
+   <>
+   <>
+   [pid 63586] MPI_Alloc_mem(419428, 0x601660, 0x7ffff5a76900, 10, 0x7ffff5a766f0) = 0
+   [pid 63586] MPI_Scatter(0x6027b8, 104857, 0x601760, 0x2ad2c1e46000, 104857) = 0
+   [pid 63586] MPI_Free_mem(0x2ad2c1e46000, 0xffff8002, 0x2e717f67, 0x2ad2bb53e8b8, 0x1d2b0c0) = 0
+   [pid 63586] MPI_Reduce(0x7ffff5a768f8, 0x602790, 1, 0x602580, 0x601d60) = 0
+   [pid 63586] fwrite("Test FAILED!\n", 1, 13, 0x2ad2b7995860Test FAILED!
+   ) = 13
+   [pid 63586] MPI_Barrier(0x601960, 0x401200, 13, -1, 0x4012a6) = 0
+   [pid 63586] MPI_Finalize(0x7ffff5a767f8, 0xffff8002, 18, 0, 0x1d2a2e0
+
+
+ ==== Ltrace parameters: Profiling execution time ====
+ A very powerful feature of ltrace is the ability to measure how much execution time your program has spent in certain functions.\\
+ You can do this with the **-c** parameter of ltrace.
+
+ But since ltrace is executed indirectly via [[selective-debug]], we need to use its facility of passing parameters through to the debugger.\\
+ This can be done with the environment variable SELECTIVE_DEBUG__DEBUGGER_PARAMS. To pass an environment variable to a command which you execute in the shell, you put the assignment to the variable before the command:
+
+   SELECTIVE_DEBUG__DEBUGGER_PARAMS='-c' bsub -I -n 10 -q short -a openmpi mpirun ./selective-debug --ltrace ./debug-tutorial
+
+ Please notice that we put the assignment of the variable at the very beginning of the command line, not before the "./selective-debug": It has to precede the command which the shell executes, and to the shell "bsub" is the command; everything which follows it is a parameter, not a command.
+
+ You will get an output similar to:
+
+   % time     seconds  usecs/call     calls      function
+   ------ ----------- ----------- --------- --------------------
+    96.12    2.050929     2050929         1 MPI_Finalize
+     1.07    0.022868       22868         1 MPI_Scatter
+     0.83    0.017779       17779         1 MPI_Free_mem
+     0.52    0.011048       11048         1 MPI_Alloc_mem
+     0.51    0.010905         170        64
+     0.49    0.010353       10353         1 MPI_Barrier
+     0.27    0.005779        5779         1 MPI_Reduce
+     0.19    0.003998        3998         1 fwrite
+   ------ ----------- ----------- --------- --------------------
+   100.00    2.133659                    71 total
+
+
+ ===== Debugger: strace =====
+ strace means "system trace". It is very similar to ltrace which we explained in the [[debug_tutorial#Debugger: ltrace|previous section]]:\\
+ It traces calls to system functions.
+ [[http://en.wikipedia.org/wiki/System_call|System functions]] are functions which are implemented in the Linux kernel, as opposed to library functions which are implemented in [[http://en.wikipedia.org/wiki/User_space|user space]]. They are typically basic low-level functions such as file access, multithreading and networking.
+
+ ^ Lesson 13 ^
+ | To use strace, use the **- - strace** switch of [[selective-debug]]:\\ \\
+ bsub […] ./selective-debug --strace ./debug-tutorial |
+
+ What you have learned about ltrace also applies to strace. Therefore, please make sure to read the [[debug_tutorial#Debugger: ltrace|ltrace section]] in depth.
+ ===== Logging debugger output with selective-debug =====
+ As you have seen, ltrace and strace are non-interactive debuggers. Such debuggers usually produce a large amount of output which is difficult to follow live.\\
+ Therefore, we now show you how to use [[selective-debug]] to log the output of the debugger to one file per OpenMPI rank.
+
+ ^ Lesson 14 ^
+ | There are two switches of [[selective-debug]] which can be used to control logging:\\ \\
+ --log        If you are using OpenMPI and want to clone the output of the debugger to a per-rank log file.
+              Filename will be "selective-debug-rankN.log". The file will be appended, not overwritten.
+              The output will also go to the terminal so you can use --xterm.
+
+ --log-quiet  Only log debugger output to files, do not show it on the terminal.
+ |
+
+ The possibilities offered by those switches will be explained in the following sections:
+ ==== Logging on all ranks ====
+ They allow you to use [[debug_tutorial#Debugger: ltrace|ltrace]] or [[debug_tutorial#Debugger: strace|strace]] in parallel on **all ranks** because each rank gets a separate log file.\\ To do this, just call //BREAKPOINT_AND_SLEEP()// on every rank.
+
+ ==== Separate program and debugger output ====
+ If your program produces output on the terminal, it would garble the output of the debugger because it would appear at random positions within it. The **- - log-quiet** switch is suitable for fixing this. Your program will get the terminal for output and the debugger will get the log files.
+
+ ==== Non-interactive debugging ====
+ Using the **- - log-quiet** switch, you can run the batch job **non-interactively**, i.e. remove the **-I** switch of bsub: You will get the output of your program by e-mail and the output of the debugger will be in the files created on disk.
+ ===== Other debuggers =====
+ We designed [[selective-debug]] to allow easy addition of new debuggers.\\
+ If you want support for another debugger to be added, feel free to contact [[zdv@leo.bogert.de|Leo Bogert]].