debug_tutorial

This is an old revision of the document!

# Tutorial: Debugging MPI programs with GDB at Mogon

We show a method of attaching GDB on the fly by request of your application.
With the shown method your application can determine on its own which node needs to be debugged and request attachment of GDB.
This allows you to debug while running on an amount of nodes which would be too large to attach GDB to every single node.

Bogert, Leonhard 2013/07/20 23:11

Süß, Tim (not a signature)

We now show you how to obtain and execute the sample program used in this tutorial. Notice that lines starting with “#” are comments and do not need to be executed.

ssh mogon.zdv.uni-mainz.de

git clone https://github.com/leo-bogert/mpi-debug-tutorial.git

# Change to the directory which contains it
cd mpi-debug-tutorial

Let's have a look at what it does - the source code is below but you can also open “debug-tutorial-1-bugged.c” with your favorite text editor:

emacs debug-tutorial-1-bugged.c

Here's the content of that file:

debug-tutorial-1-bugged.c
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define items (1024*1024)
int array[items]; // Goal of the program: Summing up this array. (Instead use allocated memory in real programs!)
long sum = 0; // The result of the computation

long sum__sequential_reference_implementation() { // Non-parallel reference implementation
long s = 0;
for(int item = 0; item < items; ++item)
s += array[item];
return s;
}

int main(int argc, char** argv) {
MPI_Init(&argc, &argv);

int my_rank; // Number of the node
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

int node_count; // Total number of nodes
MPI_Comm_size(MPI_COMM_WORLD, &node_count);

// The root must load the input data to distribute to the other nodes
if(my_rank == 0) {
// In our case it generates a random array as input data
srand(time(NULL));
for(int item = 0; item < items; ++item)
array[item] = rand();
}

int items_per_rank = items / node_count;
int remainder_items = items % node_count;
int* my_work;
MPI_Alloc_mem(items_per_rank * sizeof(int), MPI_INFO_NULL, &my_work);

// MPI_Scatter is a collective operation which distributes an equal-sized part of the given array to each node.
MPI_Scatter(&array[remainder_items] /* send buffer */, items_per_rank /* send count per node */, MPI_INT /* send type */,
my_work /* receive buffer on each node */, items_per_rank /* receive count */ , MPI_INT /* receive type */,
0 /* send buffer is stored on this rank */, MPI_COMM_WORLD /* communication channel */);

// This is the actual working-loop
long sub_sum = 0;
for(int i=0; i < items_per_rank; i++)
sub_sum += my_work[i];

if(my_rank == 0) { // Scatter cannot deal with a division remainder so we manually deal with it
while(remainder_items > 0)
sub_sum += array[remainder_items--];
}

MPI_Free_mem(my_work);

// MPI_Reduce with op-code MPI_SUM is a collective operation which sums up the input sub_sum of each node
// into single a resulting output sum on the master.
MPI_Reduce(&sub_sum /* input to sum up */, &sum /* output */, 1 /* input count */, MPI_LONG /* input type */,
MPI_SUM /* operation */, 0 /* output is stored on this rank */, MPI_COMM_WORLD /* communication channel */);

if(my_rank == 0) {
// The result of the computation now is available on rank 0.
// We compare it with the sequential reference implementation to test our parallel implementation.
if(sum == sum__sequential_reference_implementation())
fprintf(stderr, "Test OK.\n");
else
fprintf(stderr, "Test FAILED!\n");
}

MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
return EXIT_SUCCESS;
}

The goal of the program is to add up a large array of numbers. This is very easy to parallelize:

Let N be the number of elements in the array. Let P be the amount of workers. Then we do the following:

1. We split up the array to P chunks of equal size N/P and distribute a chunk to each worker.
2. Each worker then adds up the N/P elements in its chunk.
3. After that is done, we only have to add up the P chunk sums to obtain the result.

Luckily, MPI even provides two collective operations which can do steps 1 and 3 for us. “Collective” means that we call the operations on all nodes and they yield a different result on each node:

• The splitting of the array into chunks is done by MPI_Scatter. It yields the chunk of the node which executes it as a result.
• The addition of the chunk sub-sums is done by MPI_Reduce. It yields the sum of all sub sums on the root node (rank = 0).

See the source code for how they are used. Also notice that they are most likely implemented with efficient parallelization as well. Therefore, it is a good idea to use those collective operations instead of manually distributing the chunks and adding up the sub sums.

We now compile the program and execute it as a batch job on Mogon:

# Make the libraries available which are needed for compilation and execution

# Compile the source code with the following parameters:
# - Compile according to C99 standard for the C language (-std=c99).
#   Since C has been around for decades there are many dialects and you
#   should decide which one you are using.
# - Output filename is “debug-tutorial” (-o debug-tutorial)
mpicc debug-tutorial-1-bugged.c -std=c99 -o debug-tutorial

# Queue the job for execution with the following parameters:
# - Interactive mode (-I): Output and input are attached to the terminal
#   instead of being sent out by mail. This is necessary for being able to
#   run GDB on a terminal.
# - Use 2 processes (-n 2). The fewer you chose the less you have to wait.
# - Run each process on a different computer (-R 'span[ptile=1]').
#   This introduces latency which is good for provoking errors in parallel
#   programs.
# - Use the queue for short jobs (- q short). This should reduce your wait
#   time. ONLY USE THIS QUEUE IF YOUR JOB IS REALLY SHORT!
# - (The “-a openmpi” is glue code to run MPI apps through bsub)
# - “mpirun ./debug-tutorial” is the actual command being executed as each process.
bsub -I -n 2 -R 'span[ptile=1]' -q short -a openmpi mpirun ./debug-tutorial

We especially want you to notice the parameter -R 'span[ptile=1]' as your first lesson:

Lesson 1
You should ensure that your program works both when all processes are run on a single multi-processor machine and on different machines across the network. You can force LSF to put each process on a different machine with -R 'span[ptile=1]'. The latencies caused by the network connections can trigger synchronization issues in your program.
Notice that you should not enforce this parameter in productive use: The more restrictions you put onto selection of the machines to run your program on, the longer your queue time. Also, performance decreases very much when processes have to be connected over the network.

The output of the program you should get is:

Job <12859536> is submitted to queue <short>.
<<Waiting for dispatch ...>>
<<Starting on a0475>>
Test OK.

The self test of the program has succeeded: It has compared the result of the parallel computation to a sequential reference implementation.
Notice that this shows you:

Lesson 2
When parallelizing a problem which is solvable in a sequential (= non-parallel) manner by nature, you should always write a sequential reference implementation of your algorithm and compare its results with the results of the parallel algorithm.
Parallelization of a “sequential problem” is often difficult so the sequential implementation is less likely to have bugs.
Also, two implementations are less likely to yield the same erroneous results than one.

Now that we have observed that the program works, we would put real work onto it. In practice this would mean that we would increase the problem size and increase the node count to get more performance. Also, we use the “long” queue now as we are consuming more resources.
So lets increase the node count for a start:

bsub -I -n 10 -R 'span[ptile=1]' -q long -a openmpi mpirun ./debug-tutorial
Job <12871894> is submitted to queue <short>.
<<Waiting for dispatch ...>>
<<Starting on a0243>>
Test FAILED!

As we have caused the program to fail now, we notice…

Lesson 3
You should always test your program with different node counts to provoke problems in the division of labor.

Because we don't spot a pattern between the node count at which it fails and the problem size, we now use the ability of the Linux shell to enter loops:

for ((i=0; i < 20; i++)) ; do bsub -n "$i" -R 'span[ptile=1]' -q long -a openmpi mpirun ./debug-tutorial ; done This will queue 20 jobs with node counts from 1 to 20. Notice the necessary changes in bsub parameters: • We got rid of the interactive execution: Running many jobs will take a long time which we don't want to watch the terminal for. We will be notified about the results by E-Mail now. • We changed the queue from short to long: Even though each job is short, we enqueue a large amount of them. We should not block the short queue for a long time, there are other users at Mogon as well. Lesson 4 If you do not spot a pattern between the appearance of bugs and the node count, try large amounts of different node counts using the shell. Now lets look at the results we received by mail:  1 node : Test OK. 2 nodes : Test OK. 3 nodes : Test FAILED! 4 nodes : Test OK. 5 nodes : Test FAILED! 6 nodes : Test FAILED! 7 nodes : Test FAILED! 8 nodes : Test OK. 9 nodes : Test FAILED! 10 nodes : Test FAILED! 11 nodes : Test FAILED! 12 nodes : Test FAILED! 13 nodes : Test FAILED! 14 nodes : Test FAILED! 15 nodes : Test FAILED! 16 nodes : Test OK. 17 nodes : Test FAILED! 18 nodes : Test FAILED! 19 nodes : Test FAILED! 20 nodes : Test FAILED! What do 1, 2, 4, 8 and 16 have in common? - They are the 2x series! If you look at the #define items (1024*1024) you will notice that 1024 is 210. So the program fails if N mod P != 0. And in fact the program itself has a modulo operation in line 35: MPI_Scatter is only able to distribute an equal amount of elements to each node so the said line computes the division remainder of N/P. The remaining items are processed in lines 49 and following. If there are no remaining items, i.e. when N mod P == 0, those lines are not executed. So the bug must be in the loop inside of the if-condition at line 35. We learn the following from this: Lesson 5 Once you have run tests for a large different amount of node counts, and you notice that they succeed for some of them, try to spot a pattern between the node count and boundary conditions of your program. Boundary conditions are likely to induce problems. Before we fulfill our promise to actually teach you something about GDB, we note that we have managed to reduce the line count where we suspect the problem to be to 2 without using GDB. This gives you: Lesson 6 A debugger is not a replacement for using your brain first to isolate the issue to a certain region in your program. However, please do not waste too much time trying find the issue without debugging tools just for sake of polishing your pride. Many issues are very easy to locate with a debugger even without thinking at all why they might be happening. Lesson 7 For allowing GDB to navigate in the binary version of your program, you should compile it with debugging information included by adding the “-ggdb” switch to the compiler: mpicc debug-tutorial-1-bugged.c -std=c99 -ggdb -o debug-tutorial How does one attach GDB to a process now? There are two options which GDB offers in general, i.e. not specifically for MPI: • Letting GDB start the process so it can attach itself immediately. • Attaching to an existing process by its process ID (PID). Unfortunately, when working with distributed computing, this is not sufficient: • Letting GDB start the processes would result in having a single GDB instance running for every process on every node. When using interactive mode with LSF (“-I”), we only get a single terminal. As GDB is a program which is used by terminal commands, every GDB command you enter into the terminal would be executed by every instance of GDB. Further, it probably does not make sense to run GDB attached to every MPI process anyway: As our tutorial shows, problems often appear on a single node and we want to debug a single process at once therefore. • Attaching to an existing process would require that we run GDB on the same machine as the process is running on. It is acceptable to connect to a certain machine of Mogon via SSH as long as you ask the admin first. But having to connect to the affected machine first is quite annoying. To make attaching GDB to a certain process easy, we developed a wrapper script called “selective-debug”. “Wrapper” means that you put the script into the bsub command line as the executable to launch and add the actual program as parameter to it: bsub -I -n 10 -q long -a openmpi mpirun ./selective-debug ./debug-tutorial What the “selective-debug” script will do now on every node is the following: • It will launch your actual command. • It will wait for your program to send the POSIX signal SIGUSR1 to it. As soon as SIGUSR1 is received, it will attach GDB to the process which sent SIGUSR1. The attachment of GDB will halt your program so you can debug it. Selective-debug does provide a header file with a macro to do the job of sending SIGUSR1. You can now debug a single rank by modifying your program like this: // This must be added to the start of your #includes. // It might not work if you have other includes first! #include “selective-debug.h” if(my_rank == <rank on which the problem exists>) { BREAKPOINT_AND_SLEEP(10); } The macro is called “BREAKPOINT…” because “setting a breakpoint” is the technical term for telling a debugger to halt the program at a certain point - the breakpoint. The macro is called “…_AND_SLEEP” because after sending the breakpoint signal, it will do a call “sleep(10);” which causes the program to do nothing for 10 seconds. - It takes some time for gdb to start up and halt the program. If the program does not sleep for long enough, the debugger will halt it beyond the desired breakpoint. If you notice that this happens, please increase the sleep delay. Also please do not just enter a very large value for the delay: When you are in the debugger and want to step through the program, you will have to wait for the whole delay to expire first! Please remember the “if()” in the above sample code: You must only send the breakpoint signal on a single rank. Otherwise you will get multiple instances of GDB attached to a single terminal - this would not work properly. Lesson 8 To attach GDB on a certain rank, prefix the command with the “selective-debug” wrapper and modify your program to use the BREAKPOINT_AND_SLEEP() macro to halt where appropriate. Now what about our tutorial program? The erroneous code in line 49 only gets executed on the node with rank 0 so it is safe to execute the “Attach GDB” signal in line 50. The amended code would look like this: if(my_rank == 0) { // Scatter cannot deal with a division remainder so we manually deal with it BREAKPOINT_AND_SLEEP(10); while(remainder_items > 0) { int index = remainder_items--; sub_sum += array[index]; } } You should notice that we have separated the index of the array access to a variable: Index computations very often go wrong so we should look at the actual index which is being used. We have provided a file debug-tutorial-2-breakpoint.c which contains these modifications. We recompile and execute it: mpicc debug-tutorial-2-breakpoint.c -std=c99 -ggdb -o debug-tutorial bsub -I -n 10 -q short -a openmpi mpirun ./selective-debug ./debug-tutorial After your job has been started a GDB prompt should appear: (gdb)  The process with rank 0 is halted now and you can debug it. To see where the execution was halted, enter “bt” (backtrace). The result should be similar to: (gdb) bt #0 0x00002acb77d543bd in nanosleep () from /lib64/libc.so.6 #1 0x00002acb77d54230 in sleep () from /lib64/libc.so.6 #2 0x00000000004010ce in main (argc=1, argv=0x7fffeb0c3fd8) at debug-tutorial-2-breakpoint.c:55 This means that line 56 of your main() function has executed a function called sleep() and that function has executed a function called nanosleep() and the processing was halted in that nanosleep() function. We now want to continue execution in main() by single stepping through it. To do that, first we must get out of nanosleep() and sleep() using the “finish” command: (gdb) finish Run till exit from #0 0x00002aaf645243bd in nanosleep () from /lib64/libc.so.6 0x00002aaf64524230 in sleep () from /lib64/libc.so.6 (gdb) Run till exit from #0 0x00002aaf64524230 in sleep () from /lib64/libc.so.6 main (argc=1, argv=0x7fff14e21318) at debug-tutorial-2-breakpoint.c:55 57 while(remainder_items > 0) { Notice that we pressed ENTER the second time instead of entering the finish command again: Pressing ENTER will execute the previous command once more. Now we step into the “while(remainder_items > 0) {“ loop: (gdb) step 58 int index = remainder_items--; Notice that what is displayed has NOT been executed yet, it is the next instruction which the processor will execute. We want to remember the value of the “remainder_items” variable from which we compute the “index” variable so we look it up: (gdb) print remainder_items$1 = 6

This means that the remainder_items variable has a value of 6.

We now step over the current line so the result of the index variable is computed and look it up:

(gdb) step
59            sub_sum += array[index];
(gdb) print index
$2 = 6 We now want to think about whether this index is valid. Of course we have not memorized the full source code so we just use GDB to show us some of it: (gdb) bt #0 main (argc=1, argv=0x7fff54ca4ca8) at debug-tutorial-2-breakpoint.c:59 (gdb) list 39,59 What we did here is that we first looked up which line the execution is halted at using “bt” and then we used list to show lines from 39 to 59. Notice that you could as well just have shown some lines before the current location using “list” without parameters. To find out about more ways of specifying positions, use “help list”. Lets find the problem in our program now: (gdb) list 39,59 39 int items_per_rank = items / node_count; 40 int remainder_items = items % node_count; 41 int* my_work; ... 54 if(my_rank == 0) { // Scatter cannot deal with a division remainder so we manually deal with it 55 BREAKPOINT_AND_SLEEP(10); 56 57 while(remainder_items > 0) { 58 int index = remainder_items--; 59 sub_sum += array[index]; We stepped into the first execution of the while loop at line 57. Whats the goal of that loop? Scatter will distribute an equal amount of items of the work to each node. Because the size of the array is not divisible by the amount of nodes, an amount N mod P items remains. The value N mod P has been computed at line 40 and stored in the variable remainder_items. The loop at line 57 which we step through shall process those remainder_items at the start of the work array1). At the first execution of the loop, we have observed remainder_items == 6 and index == 6. We now want to know what happens in the last iteration of the loop: Problems are likely to occur at boundary conditions, the middle of the loop is not very interesting. To quickly reach the end of the loop, we tell GDB to halt on “remainder_items == 1”, which is the last value which satisfies the loop-condition: (gdb) watch remainder_items if remainder_items==1 Hardware watchpoint 1: remainder_items (gdb) c Continuing. Hardware watchpoint 1: remainder_items Old value = 2 New value = 1 main (argc=1, argv=0x7fff973b3c98) at debug-tutorial-2-breakpoint.c:59 59 sub_sum += array[index]; This has stopped inside of previous-to-last execution of the while() now because remainder_items is updated inside of it. We “step” some more to reach the last iteration and get the index which is used in the last iteration: (gdb) step 57 while(remainder_items > 0) { (gdb) 58 int index = remainder_items--; (gdb) 59 sub_sum += array[index]; (gdb) print index$3 = 1

Lets summarize what we have observed with GDB now:

• Before the first iteration of the loop, remainder_items was 6. The goal of the loop therefore was to process the first 6 items of the work array.
• At the first iteration, index was 6, i.e. the first iteration processed slot 6 of the array.
• At the last iteration, index was 1, i.e. the last iteration processed slot 1 of the array.

This shows that the bug is a classical off-by-one issue: C arrays start at index 0, not at index 1!
How do we fix this?

In the bugged code, we obtain the index by doing

remainder_items--

This is the so-called “post-decrement” operator: It returns the current value of the variable and decrements the variable after that. If you want to obtain the decremented value, you have to use the “pre-decrement” operator:

--remainder_items

This will decrement the variable and return the decremented value.

So the bugfixed code would look like this:

if(my_rank == 0) { // Scatter cannot deal with a division remainder so we manually deal with it
while(remainder_items > 0)
sub_sum += array[--remainder_items];
}

We have provided a file debug-tutorial-3-fixed.c which contains this fix.

Compiling and executing it with the originally failing parameters should show the output:

mpicc debug-tutorial-3-fixed.c -std=c99 -o debug-tutorial
bsub -I -n 10 -R 'span[ptile=1]' -q long -a openmpi mpirun ./debug-tutorial
Job <14863096> is submitted to queue <long>.
<<Waiting for dispatch ...>>
<<Starting on a0324>>
Test OK.

The self-test which compares the result of the parallel computation to the sequential reference implementation has succeeded now!

Notice that up to now we have only tested a single node count which had failed previously. In practice, you should as well repeat the shell for-loop which tries your program with a large amount of different node counts.

Before you would move on to productive use, you should always remember the following lesson:

Lesson 9
Don't forget to remove all debug tools before running your software productively as they might have a huge impact on performance:
- Remove the “-ggdb” switch of the compiler
- Remove “./selective-debug” from the command line
- Disable the execution of the self-test.
Also, make sure to test your bugfix as well as you have tested the bugged version.

The technique we have shown you using the selective-debug wrapper has one disadvantage: It only allows you to debug a single node at once because it is attached to the single terminal which was used to execute bsub.

To debug multiple processes at once, selective-debug does offer you an - - xterm switch: It will make it launch GDB inside of a new graphical terminal window using the X-Server on your client.
This will allow your program to tell selective-debug to launch GDB on multiple nodes:
Each instance of GDB will have its own Xterm terminal window.

For allowing the processes on Mogon to connect to your X-Server, you have to use the -Y option of SSH.

Lesson 10
To use multiple instances of GDB, forward the connection to your X-Server via SSH and use the “- - xterm” switch of selective-debug:

ssh -Y bogert@mogon.zdv.uni-mainz.de
bsub […] -a openmpi mpirun ./selective-debug --xterm ./debug-tutorial

Your program then is allowed to send the debug signal to selective-debug on multiple nodes for being able to run multiple instances of GDB.

1)
Notice that this is a poor algorithm: At the worst case, the amount of remainder_items will be as large as the size of all other work units minus one item. This means that when all nodes but the first one have finished computing already, the first node will continue computation for almost as long as it took the other nodes to finish.
• debug_tutorial.1374620009.txt.gz