====== Debugging HPC programs ======

This page is dedicated to debugging methods for HPC codes. New HPC developers should know these basic options to save time in their work. All methods here provide a way to trace a bug, which means finding the exact line of code that triggers it. I try to keep these pages up to date, but some flags may be deprecated.

Beware: some things called bugs here may not be bugs at all, but legitimate mathematical or physical results. For example, a calculation may end with a result too large to be stored in a 64-bit variable. This is not really a bug, just a limitation: the code and the calculation are correct.

If you are new to HPC programming or to debugging, a small tutorial on how to use the following flags is available. See [[software:development:debug:help|help debug]]. It also includes examples of FPE and uninitialized value debugging; all the other methods follow the same philosophy.

For reference:

Compilers used:
  * gcc and gfortran 4.8.2 (from Ubuntu 14.04 x86_64)
  * icc and ifort 14.0.3 (from Intel Parallel Studio 2013 SP1 x86_64)

Tools used:
  * valgrind 3.10.0
  * gdb 7.7

Files used to simulate the most common bugs: [[software:development:debug:debf|deb_f.f90]], [[software:development:debug:debc|deb_c.c]].

===== Main types of bugs =====

When developing HPC programs, the bugs encountered are often the same. Here is a list of the most common ones:
  * [[#floating_point_exception|Floating point exceptions, called FPE (Invalid, Overflow, Zero)]]
  * [[#Uninitialized_variables|Uninitialized values reading]]
  * [[#Allocation/deallocation_issues|Allocation/deallocation issues]]
  * [[#Array_out_of_bound_reading/writing|Array out of bound reading/writing]]
  * [[#IO_issues|IO issues]]
  * [[#Memory_leak|Memory leak]]
  * [[#Stack_overflow|Stack overflow]]
  * [[#Buffer_overflow|Buffer overflow]]

There are many other types of bugs, but these are the most common and the easiest to solve with the appropriate tools.
===== When could there be a bug ? =====

The first step is to identify the presence of a bug:
  * Program returns an error message
  * Program returns an error exit code (other than 0)
  * Program finishes with NaN or +Inf values
  * Program ends unexpectedly
  * Other cases: many scenarios are possible

How to get the exit code of a program?
  * $? gives you the exit code of the last executed command.
  * A value other than 0 means something went wrong, and the code may help you understand why.

  ~$ gfortran myokprog.f90
  ~$ ./a.out
  Hello world !
  ~$ echo $?
  0
  ~$ gfortran mybugprog.f90
  ~$ ./a.out
  Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
  Backtrace for this error:
  #0 0x7FFC993C87D7
  #1 0x7FFC993C8DDE
  #2 0x7FFC9901FC2F
  Segmentation fault (core dumped)
  ~$ echo $?
  139

Here 139 = 128 + 11: the shell reports that the program was killed by signal 11 (SIGSEGV).

===== How to find them =====

Here is the list of debug flags and tools to use to trace the bugs discussed above. The first part is generic (Quick debug strategy), while the second part is specific to each bug.

==== Quick debug strategy ====

Most of the time, these compilation options will find your bug (except for gcc, which has only a few debug options):

^ Compiler ^ Compiler options ^
| gfortran | -Wuninitialized -O -g -fbacktrace -ffpe-trap=zero,underflow,overflow,invalid -fbounds-check -fimplicit-none -ftrapv |
| gcc | -g -Wall |
| ifort | -g -traceback -fpe0 -check all -ftrapuv -fp-stack-check -warn all -no-ftz |
| icc | Test 1 : -g -traceback -check=uninit -fp-stack-check -no-ftz\\ Test 2 : -g -traceback -check-pointers=rw |

For C code, also try the FPE strategy (see below).

If this is not enough, compile with:

^ Compiler ^ Compiler options ^
| gfortran | -g -fbacktrace |
| gcc | -g |
| ifort | -g -traceback |
| icc | -g -traceback |

And launch the program with Valgrind:

  ~$ valgrind myprog.exe

Most of the time it will catch the error.

==== Floating Point exception ====

There are three types of FPE:
  * **Zero** : division by zero, very common in HPC. For example : A/0.0 = +∞ (for A > 0)
  * **Invalid** : the operation is mathematically impossible. For example : acos(10.0) = NaN
  * **Overflow/Underflow** : the result exceeds the largest (or falls below the smallest) number the system can represent. For example : exp(10E15) = a huge number = +Inf

**Behavior :** by default, an FPE generates no error, neither at compile time nor at runtime (GCC/Intel). The program silently continues with Inf or NaN values.

=== Tracing in Fortran ===

^ Compiler ^ Way to trace bug ^
| gfortran | Compiler flags : **-g -fbacktrace -ffpe-trap=zero,underflow,overflow,invalid**.\\ The FPE will be explicitly displayed at runtime. |
| ifort | Compiler flags : **-g -traceback -fpe0**.\\ The FPE will be explicitly displayed at runtime. |

=== Tracing in C ===

^ Compiler ^ Way to trace bug ^
| gcc and icc | Add **%%#include <fenv.h>%%** in the main source file, then call **%%feenableexcept(FE_DIVBYZERO|FE_INVALID|FE_OVERFLOW);%%** at the very beginning of main. feenableexcept is a GNU extension: define _GNU_SOURCE before the include and link with -lm.\\ Compiler flags : **-g**.\\ The FPE will then raise a floating point error at runtime. Use gdb to find the line of code that generates it. |

==== Uninitialized variables ====

This happens when the program reads a variable that has not been initialized. The program may not stop, and all subsequent calculations will be based on a random value. This is common with MPI programs (ghost cells, etc.).\\ Three main types of uninitialized variables:
  * **Static variable** : the uninitialized variable is static
  * **Dynamic variable** : the uninitialized variable is dynamic
  * **Not allocated variable** : a dynamic variable is used before being allocated

**Behavior :**
  * Static variable : no error at runtime
  * Dynamic variable : no error at runtime
  * Not allocated variable : segmentation fault at runtime

Valgrind's Memcheck lets the program run and use uninitialized values, while keeping track of these operations. It only complains when such a value "goes out" of the program (printing in the terminal, writing in a file, etc.). The error is reported at the line of this print/write.
To get more information about the uninitialized variable, use the Valgrind flag %%--%%track-origins=yes.

=== Tracing in Fortran ===

^ Compiler ^ Way to trace bug ^
| gfortran | - static variable : Compiler options : **-Wuninitialized -O -g -fbacktrace**. A warning is displayed at compile time.\\ To get more information, use **Valgrind**. The error will be "Conditional jump or move depends on uninitialised value(s)". |
| | - dynamic variable : Compiler options : **-g -fbacktrace**.\\ Use **Valgrind %%--%%track-origins=yes**. The error will be "Conditional jump or move depends on uninitialised value(s)". |
| | - not allocated variable : Compiler options : **-g -fbacktrace**. The error will be explicitly displayed at runtime. |
| ifort | - static variable : Compiler options : **-check all**. The error will be explicitly displayed at runtime.\\ It is also possible to replace all uninitialized values by a huge number with -ftrapuv. |
| | - dynamic variable : Compiler options : **-g -traceback**.\\ Use **Valgrind %%--%%track-origins=yes**. The error will be "Conditional jump or move depends on uninitialised value(s)". |
| | - not allocated variable : Compiler options : **-g -traceback**. The error will be explicitly displayed at runtime. |

=== Tracing in C ===

^ Compiler ^ Way to trace bug ^
| gcc | - static variable : Compiler options : **-Wuninitialized** or **-Wall**. A warning is displayed at compile time.\\ To get more information, use **Valgrind**. The error will be "Conditional jump or move depends on uninitialised value(s)". |
| | - dynamic variable : Compiler options : **-g**.\\ Use **Valgrind %%--%%track-origins=yes**. The error will be "Conditional jump or move depends on uninitialised value(s)". |
| | - not allocated variable : Compiler options : **-Wuninitialized** or **-Wall**. A warning is displayed at compile time.\\ To get more information, use **Valgrind**. The error will be "Conditional jump or move depends on uninitialised value(s)".\\ To get even more information, use **gdb** and ask for a **backtrace**. |
| icc | - static variable : Compiler options : **-Wuninitialized**. A warning is displayed at compile time.\\ **-g -traceback -check=uninit**. The error will be explicitly displayed at runtime. |
| | - dynamic variable : Compiler options : **-g -traceback**.\\ Use **Valgrind %%--%%track-origins=yes**. The error will be "Conditional jump or move depends on uninitialised value(s)". |
| | - not allocated variable : Compiler options : **-Wuninitialized**. A warning is displayed at compile time.\\ **-g -traceback -check=uninit**. The error will be explicitly displayed at runtime. |

==== Allocation/deallocation issues ====

=== Tracing in Fortran ===

^ Compiler ^ Way to trace bug ^
| gfortran | - free a non allocated variable : Compiler options : **-g -fbacktrace**. The error will be explicitly displayed at runtime. |
| | - allocate an already allocated variable : Compiler options : **-g -fbacktrace**. The error will be explicitly displayed at runtime. |
| | - not freed memory : Compiler options : **-g -fbacktrace**.\\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| ifort | - free a non allocated variable : Compiler options : **-g -traceback**. The error will be explicitly displayed at runtime. |
| | - allocate an already allocated variable : Compiler options : **-g -traceback**. The error will be explicitly displayed at runtime. |
| | - not freed memory : Compiler options : **-g -traceback**.\\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |

=== Tracing in C ===

^ Compiler ^ Way to trace bug ^
| gcc | - free a non allocated variable : Compiler options : **-Wuninitialized** or **-Wall**. A warning is displayed at compile time.\\ To get more information, use **Valgrind**. The error will be "Conditional jump or move depends on uninitialised value(s)". |
| | - allocate an already allocated variable : Compiler options : **-g** (-fbacktrace is a gfortran option and does not apply to C).\\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| | - not freed memory : Compiler options : **-g**.\\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| icc | - free a non allocated variable : Compiler options : **-Wuninitialized**. A warning is displayed at compile time.\\ **-g -traceback -check=uninit**. The error will be explicitly displayed at runtime. |
| | - allocate an already allocated variable : Compiler options : **-g -traceback**.\\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| | - not freed memory : Compiler options : **-g -traceback**.\\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |

==== Array out of bound reading/writing ====

=== Tracing in Fortran ===

^ Compiler ^ Way to trace bug ^
| gfortran | Compiler options : **-g -fbacktrace -fbounds-check** (-fbounds-check is a deprecated alias of -fcheck=bounds). The error will be explicitly displayed at runtime. |
| ifort | Compiler options : **-g -traceback -check all** (or -check bounds). The error will be explicitly displayed at runtime. |

=== Tracing in C ===

^ Compiler ^ Way to trace bug ^
| gcc | Compiler options : **-g**. Use **Valgrind**; the error will be "Invalid read/write of size 8/16".\\ Or patch gcc and recompile it with bounds checking (http://sourceforge.net/projects/boundschecking/). |
| icc | Compiler options : **-g -traceback -check-pointers=rw**. The error will be explicitly displayed at runtime.\\ Warning : when activated, check-pointers=rw prevents all other debugging options from working, so be careful. |

==== IO issues ====

IO errors are usually very explicit, so there is no need for a debugging tool. However, Valgrind and the FPE options can detect some related errors (a bad read leads to a badly initialized value or to an FPE, etc.).
Do not forget to set **-g -fbacktrace** (gfortran) or **-g -traceback** (icc/ifort) to get useful error information. Simply be careful to secure every read/write: get its return code and check it.

==== Memory leak ====

=== Tracing in Fortran ===

^ Compiler ^ Way to trace bug ^
| gfortran | Compiler options : **-g -fbacktrace**. Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| ifort | Compiler options : **-g -traceback**. Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |

=== Tracing in C ===

^ Compiler ^ Way to trace bug ^
| gcc | Compiler options : **-g**. Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| icc | Compiler options : **-g -traceback**. Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |

==== Stack overflow ====

=== Tracing in Fortran ===

^ Compiler ^ Way to trace bug ^
| gfortran | Compiler options : **-g -fbacktrace**. Use Valgrind. Look for "Stack overflow in thread X" or "Access not within mapped region".\\ **gdb** will also catch it and give a backtrace, but with little information. |
| ifort | Compiler options : **-g -traceback**. Use Valgrind. Look for "Stack overflow in thread X" or "Access not within mapped region".\\ **gdb** will also catch it and give a backtrace, but with little information. |

=== Tracing in C ===

^ Compiler ^ Way to trace bug ^
| gcc | Compiler options : **-g**. Use Valgrind. Look for "Stack overflow in thread X" or "Access not within mapped region".\\ **gdb** will also catch it and give a backtrace, but with little information. |
| icc | Compiler options : **-g -traceback**. Use Valgrind. Look for "Stack overflow in thread X" or "Access not within mapped region".\\ **gdb** will also catch it and give a backtrace, but with little information. |

==== Buffer overflow ====

=== Tracing in Fortran ===

^ Compiler ^ Way to trace bug ^
| gfortran | Compiler options : **-g -fbacktrace**. The error will be explicitly displayed at runtime. |
| ifort | Compiler options : **-g -traceback**. The error will be explicitly displayed at runtime. |

=== Tracing in C ===

^ Compiler ^ Way to trace bug ^
| gcc | Compiler options : **-g**. Use **gdb**. Ask for a **backtrace** after the error; it gives a lot of information. |
| icc | Compiler options : **-g -traceback -check-pointers=rw**. The error will be explicitly displayed at runtime.\\ Warning : when activated, check-pointers=rw prevents all other debugging options from working, so be careful. |