This chapter provides two tutorials for using the SpeedShop tools to gather and analyze performance data in a Fortran program. There are three versions of the first program:
The linpack directory contains files for the n32-bit ABI.
The linpack64 directory contains files for the 64-bit ABI.
The linpacko32 directory contains files for the o32-bit ABI.
The first tutorial covers the following topics:
“Tutorial Overview”, introduces the sample program and explains the different scenarios in which it will be used.
“Tutorial Setup”, leads you through the necessary setup for running the experiment.
“Analyzing Performance Data”, steps you through different experiments, discussing first how to do the experiments, then how to interpret the results.
The second tutorial creates a Message Passing Interface (MPI) experiment. The experiment file is generated by SpeedShop and displayed by the WorkShop performance analyzer. See “MPI Tracing Tutorial”.
| Note: Because of inherent differences between systems and also due to concurrent processes that may be running on your system, your experiment will produce different results from the one in this tutorial. However, the basic structure of the results should be the same. |
This tutorial is based on a standard benchmark program called linpackup. There are two versions of the program: the linpack directory contains files for the n32-bit ABI, and the linpacko32 directory contains files for the o32-bit ABI. Each linpack directory contains versions of the program for a single processor (linpackup) and for multiple processors (linpackd). When you work with the tutorial, choose the version of the program that is most appropriate for your system. A good guideline is to choose whichever version corresponds to the way you expect to develop your programs.
This tutorial was written and tested using the single-processor version of the program (linpackup) in the linpack directory.
The linpack program is a standard benchmark designed to measure CPU performance in solving dense linear equations. The program focuses primarily on floating-point performance.
Output from the linpackup program looks like the following:
.
.
.
norm. resid resid machep x(1) x(n)
5.35882395E+00 7.13873405E-13 2.22044605E-16 1.00000000E+00 1.00000000E+00
times are reported for matrices of order 300
dgefa dgesl total mflops unit ratio
times for array with leading dimension of 301
3.720E+00 4.000E-02 3.760E+00 4.835E+00 4.136E-01 6.714E+01
3.780E+00 3.000E-02 3.810E+00 4.772E+00 4.191E-01 6.804E+01
3.730E+00 4.000E-02 3.770E+00 4.822E+00 4.147E-01 6.732E+01
3.730E+00 4.000E-02 3.770E+00 4.822E+00 4.147E-01 6.732E+01
times for array with leading dimension of 300
3.800E+00 4.000E-02 3.840E+00 4.734E+00 4.224E-01 6.857E+01
3.810E+00 4.000E-02 3.850E+00 4.722E+00 4.235E-01 6.875E+01
3.770E+00 4.000E-02 3.810E+00 4.772E+00 4.191E-01 6.804E+01
3.782E+00 4.000E-02 3.822E+00 4.757E+00 4.205E-01 6.825E+01 |
Copy the program to a directory where you have write permission and compile it so that you can use it in the tutorial.
Change to the /usr/demos/SpeedShop directory.
Copy the appropriate linpack directory and its contents to a directory in which you have write permission:
% cp -r linpack your_dir |
Change to the directory you just created:
% cd your_dir/linpack |
Compile the program by entering:
% make all |
This provides an executable for the experiment.
This section lists the steps you need to perform the following experiments on the linpackup program, generate the experiment's results, and interpret the results:
The usertime experiment. It returns the CPU time used by each routine in your program. See “A usertime Experiment”.
The pcsamp experiment. It returns CPU time for each routine in your program. See “A pcsamp Experiment”.
The dsc_hwc (secondary data cache hardware counter) experiment. In a hardware counter experiment, the program counter is sampled every time a hardware counter exceeds a specified limit. In the experiment performed in this section, the hardware counter keeps track of the number of times a data item required in a calculation was not present in secondary data cache. When a data item is not in cache, it must be retrieved from memory, which is a more time-consuming process. See “A Hardware Counter Experiment”.
The bbcounts experiment. This experiment calculates the best time achievable. See “A bbcounts Experiment”.
This section lists the steps you need to perform a usertime experiment. The usertime experiment allows you to gather data on the amount of CPU time spent in each routine in your program. For more information, see “usertime Experiment” in Chapter 4.
From the command line, enter the following:
% ssrun -v -usertime linpackup |
This starts the experiment. The -v flag tells ssrun to print a log to stderr.
Output from linpackup and from ssrun is printed to stdout, as shown in the following example. A data file is also generated. The name consists of the process name (linpackup), the experiment type (usertime), and the experiment ID. In this example, the filename is linpackup.usertime.m12205.
ssrun: target PID 12205
ssrun: setenv _SPEEDSHOP_MARCHING_ORDERS ut:cu
ssrun: setenv _SPEEDSHOP_EXPERIMENT_TYPE usertime
ssrun: setenv _SPEEDSHOP_TARGET_FILE linpackup
ssrun: setenv _RLD_LIST libss.so:libssrt.so:DEFAULT
Please send the results of this run to:
Jack J. Dongarra
Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, Illinois 60439
Telephone: 312-972-7246
ARPAnet: DONGARRA@ANL-MCS
norm. resid resid machep x(1) x(n)
5.35882395E+00 7.13873405E-13 2.22044605E-16 1.00000000E+00 1.00000000E+00
times are reported for matrices of order 300
dgefa dgesl total mflops unit ratio
times for array with leading dimension of 301
3.960E+00 4.000E-02 4.000E+00 4.545E+00 4.400E-01 7.143E+01
3.960E+00 4.000E-02 4.000E+00 4.545E+00 4.400E-01 7.143E+01
3.970E+00 4.000E-02 4.010E+00 4.534E+00 4.411E-01 7.161E+01
3.960E+00 4.000E-02 4.000E+00 4.545E+00 4.400E-01 7.143E+01
times for array with leading dimension of 300
3.910E+00 4.000E-02 3.950E+00 4.603E+00 4.345E-01 7.054E+01
3.880E+00 8.000E-02 3.960E+00 4.591E+00 4.356E-01 7.071E+01
3.930E+00 4.000E-02 3.970E+00 4.579E+00 4.367E-01 7.089E+01
3.922E+00 3.800E-02 3.960E+00 4.591E+00 4.356E-01 7.071E+01 |
To generate a report on the data collected, enter the following at the command line:
% prof your_output_file_name > usertime.results |
The prof command interprets the type of experiment you have performed and prints results to stdout. The following report shows partial prof output.
| Note: Lines have been wrapped because of line width restrictions. |
-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb 2 13:37:38 1998
prof linpackup.usertime.m12205
linpackup (n32): Target program
usertime: Experiment name
ut:cu: Marching orders
R4400 / R4000: CPU / FPU
1: Number of CPUs
175: Clock frequency (MHz.)
Experiment notes--
From file linpackup.usertime.m12205:
Caliper point 0 at target begin, PID 12205
/usr/demos/SpeedShop/linpack.demos/fortran/linpackup
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical callstack sampling data (usertime)--
2777: Total Samples
0: Samples with incomplete traceback
83.310: Accumulated Time (secs.)
30.0: Sample interval (msecs.)
-------------------------------------------------------------------------
Function list, in descending order by exclusive time
-------------------------------------------------------------------------
[index] excl.secs excl.% cum.% incl.secs incl.% samples function
(dso: file, line)
[5] 78.090 93.7% 93.7% 78.090 93.7% 2603 daxpy
(linpackup: linpackup.f, 495)
[6] 2.730 3.3% 97.0% 2.730 3.3% 91 matgen
(linpackup: linpackup.f, 199)
[4] 1.920 2.3% 99.3% 79.680 95.6% 2656 dgefa
(linpackup: linpackup.f, 221)
[8] 0.270 0.3% 99.6% 0.270 0.3% 9 dscal
(linpackup: linpackup.f, 670)
[9] 0.180 0.2% 99.9% 0.180 0.2% 6 idamax
(linpackup: linpackup.f, 700)
[10] 0.090 0.1% 100.0% 0.090 0.1% 3 dmxpy
(linpackup: linpackup.f, 826)
[7] 0.030 0.0% 100.0% 0.810 1.0% 27 dgesl
(linpackup: linpackup.f, 324)
[1] 0.000 0.0% 100.0% 83.310 100.0% 2777 __start
(linpackup: crt1text.s, 101)
[2] 0.000 0.0% 100.0% 83.310 100.0% 2777 main
(libftn.so: main.c, 76)
[3] 0.000 0.0% 100.0% 83.310 100.0% 2777 linp
(linpackup: linpackup.f, 3) |
The report shows information for each function.
The index column, which enumerates the routines in the program, provides an index number for reference.
The excl.secs column shows how much time, in seconds, was spent in the routine itself (exclusive time). For example, less than one hundredth of a second was spent in linp, but 1.92 seconds were spent in dgefa.
The excl.% column shows the percentage of a program's total time that was spent in the function. For example, the daxpy routine consumed 93.7% of the program's time.
The cum.% column shows the percentage of the complete program time that has been spent in the routines that have been listed so far. For instance, when the dgefa routine completes, 99.3% of the program has completed by the routines listed so far.
The incl.secs column shows how much time, in seconds, was spent in the function and descendents of the function. For example, 0.81 seconds were spent in dgesl and the routines that were called from it.
The incl.% column shows the cumulative percentage of inclusive time spent in each routine and its descendents. For example, 1% of the time was spent in dgesl and all the routines that were called from it.
The samples column provides the number of samples taken from the function and all of its descendants.
The function (dso:file, line) column lists the routine name, its DSO name, its file name, and its line number. For example, the top line reports statistics for the routine daxpy, the DSO name linpackup, in the file linpackup.f, at line 495.
| Note: Many functions shown here have only one or two hits. The data for those functions is not statistically significant. (Routines that begin with an underscore, such as __start, are internal routines that you cannot change.) |
This section lists the steps you need to perform a pcsamp experiment. The pcsamp experiment allows you to gather information on actual CPU time for each source code line, machine line, and function in your program. For more information on pcsamp, see “pcsamp Experiment” in Chapter 4.
From the command line, enter the following:
% ssrun -pcsamp linpackup |
This starts the experiment.
Output from linpackup and from ssrun is printed to stdout, as shown in the following example. A data file is also generated. The name consists of the process name (linpackup), the experiment type (pcsamp), and the experiment ID. In this example, the file name is linpackup.pcsamp.m12333.
.
.
.
norm. resid resid machep x(1) x(n)
5.35882395E+00 7.13873405E-13 2.22044605E-16 1.00000000E+00 1.00000000E+00
.
.
. |
To generate a report on the data collected, enter the following at the command line:
% prof your_output_file_name > pcsamp.results |
The prof command interprets the type of experiment you have performed and prints results to stdout. The following report shows partial prof output.
-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb 2 13:52:27 1998
prof linpackup.pcsamp.m12333
linpackup (n32): Target program
pcsamp: Experiment name
pc,2,10000,0:cu: Marching orders
R4400 / R4000: CPU / FPU
1: Number of CPUs
175: Clock frequency (MHz.)
Experiment notes--
From file linpackup.pcsamp.m12333:
Caliper point 0 at target begin, PID 12333
/usr/demos/SpeedShop/linpack.demos/fortran/linpackup
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (pcsamp)--
8272: Total samples
82.720: Accumulated time (secs.)
10.0: Time per sample (msecs.)
2: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index] secs % cum.% samples function (dso: file, line)
[1] 77.440 93.6% 93.6% 7744 daxpy (linpackup: linpackup.f, 495)
[2] 2.690 3.3% 96.9% 269 matgen (linpackup: linpackup.f, 199)
[3] 1.940 2.3% 99.2% 194 dgefa (linpackup: linpackup.f, 221)
[4] 0.370 0.4% 99.7% 37 idamax (linpackup: linpackup.f, 700)
[5] 0.210 0.3% 99.9% 21 dscal (linpackup: linpackup.f, 670)
[6] 0.060 0.1% 100.0% 6 dmxpy (linpackup: linpackup.f, 826)
0.010 0.0% 100.0% 1 **OTHER** (includes excluded DSOs, rld, etc.)
82.720 100.0% 100.0% 8272 TOTAL |
The report has the following columns:
The index column assigns a reference number to each function.
The secs column shows the amount of CPU time spent in the routine.
The (%) column shows the percentage of the total program time that was spent in the function.
The cum.% column shows the percentage of the complete program time that has been spent by the routines listed so far.
The samples column shows how many samples were taken when the process was executing in the function.
The function (dso:file, line) columns list the routine name, its DSO name, its file name, and its line number. For example, the first line reports statistics for the routine daxpy, in the DSO linpackup, in the file linpackup.f, at line number 495.
| Note: This experiment can be performed only on systems that have built-in hardware counters (the R10000, R12000, R14000, and R16000 classes of machines). |
Hardware counters keep track of a variety of hardware information. For a complete list of hardware counter experiments, see the ssrun(1) man page.
This section lists the steps you need to perform a hardware counter experiment. The tutorial describes the steps involved in performing the dsc_hwc experiment. This experiment allows you to capture information about secondary data cache misses. For more information on hardware counter experiments, see “SpeedShop Hardware Counter Experiments” in Chapter 4.
From the command line, enter the following:
% ssrun -dsc_hwc linpackup |
This starts the experiment. Output from linpackup and from ssrun will be printed to stdout. A data file is also generated. The name consists of the process name (linpackup), the experiment type (dsc_hwc), and the experiment ID. In this example, the filename is linpackup.dsc_hwc.m438011.
To generate a report on the data collected, enter the following at the command line:
% prof your_output_file_name > dsc_hwc.results |
Output similar to the following is generated:
-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb 2 13:56:59 1998
prof linpackup.dsc_hwc.m438011
linpackup (n32): Target program
dsc_hwc: Experiment name
hwc,26,131:cu: Marching orders
R10000 / R10010: CPU / FPU
16: Number of CPUs
195: Clock frequency (MHz.)
Experiment notes--
From file linpackup.dsc_hwc.m438011:
Caliper point 0 at target begin, PID 438011
/usr/demos/SpeedShop/linpack.demos/fortran/linpackup
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of R10K perf. counter overflow PC sampling data (dsc_hwc)--
2929: Total samples
Sec cache D misses (26): Counter name (number)
131: Counter overflow value
383699: Total counts
-------------------------------------------------------------------------
Function list, in descending order by counts
-------------------------------------------------------------------------
[index] counts % cum.% samples function (dso: file, line)
[1] 309029 80.5% 80.5% 2359 daxpy (linpackup: linpackup.f, 495)
[2] 46636 12.2% 92.7% 356 dgefa (linpackup: linpackup.f, 221)
[3] 25938 6.8% 99.5% 198 matgen (linpackup: linpackup.f, 199)
[4] 1310 0.3% 99.8% 10 idamax (linpackup: linpackup.f, 700)
[5] 131 0.0% 99.8% 1 _FWF (libfortran.so: wf90.c, 47)
[6] 131 0.0% 99.9% 1 memset (libc.so.1: bzero.s, 98)
524 0.1% 100.0% 4 **OTHER** (includes excluded DSOs, rld, etc.)
383699 100.0% 100.0% 2929 TOTAL |
The information immediately above the function list displays the following:
The Total samples is the number of times the program counter was sampled. It is sampled once for each overflow, or each time the hardware counter exceeds the specified value.
The Counter name (number) indicates the hardware counter used in the experiment. In this case, hardware counter 26 counts the number of times a value required in a calculation was not available in secondary cache. For a complete list of the hardware counters and their numbers, see Table 4-3.
The Counter overflow value is the number at which the hardware counter overflows, or exceeds its preset value. In this case, the value is 131, which is the default. You can change the overflow value by setting the _SPEEDSHOP_HWC_COUNTER_OVERFLOW environment variable to a value larger than 0, the _SPEEDSHOP_HWC_COUNTER_NUMBER environment variable to 26, and running the prof_hwc experiment rather than dsc_hwc.
See “_hwctime Hardware Counter Experiments” in Chapter 4 to learn how to choose a counter overflow value.
The Total counts is the total number of times a value was not in secondary cache when needed. This value is determined by multiplying the total number of samples by the overflow value; extra counts that do not cause an overflow are not recorded.
The function list has the following columns:
The index column assigns a reference number to each function.
The counts column shows the number of times a data item was not in secondary cache when needed for a calculation during the execution of the routine. As with Total counts (described earlier), a routine's counts value is determined by multiplying its samples value (described later) by the overflow value.
The % column shows the percentage of the program's overflows that occurred in the routine.
The cum.% column shows the percentage of the program's overflows that occurred in the routines listed so far. For example, although the matgen routine had only 6.8% of the program's overflows, by the time it is encountered in the routine list, 99.5% of the program's total overflows have been recorded.
The samples column shows the number of times the program counter was sampled during execution of the routine. A sample is taken for each overflow of the hardware counter.
The function (dso:file, line) columns show the name, the DSO, the file name, and line number of the routine.
This section provides the steps you need to perform a bbcounts or basic block counts experiment. This experiment counts basic block usage and estimates a linear time. It also maps a complete call graph. See “bbcounts Experiment” in Chapter 4.
From the command line, enter the following:
% ssrun -bbcounts linpackup |
This starts the experiment. This entails making copies of the libraries and executables, giving them an extension that depends on the ABI, and inserting information into the copies. The extension is .pixbb for the executable, .pix32 for all o32 libraries, .pixn32 for all n32 libraries, and .pix64 for all 64 libraries.
Output from linpackup and from ssrun is printed to stdout, as shown in the following example. A data file is also generated. The name consists of the process name (linpackup), the experiment type (bbcounts), and the experiment ID. In this example, the file name is linpackup.bbcounts.m77549.
instrumenting /lib32/rld instrumenting /usr/lib32/libssrt.so instrumenting /usr/lib32/libss.so instrumenting /usr/lib32/libmp.so instrumenting /usr/lib32/libftn.so instrumenting /usr/lib32/libm.so instrumenting /usr/lib32/libc.so.1 |
To generate a report on the data collected, enter the following at the command line:
% prof your_output_file_name > bbcounts.results |
The prof command redirects output to a file called bbcounts.results. The file should contain results that look something like the following.
-------------------------------------------------------------------------
SpeedShop profile listing generated Tue Oct 16 16:02:42 2001
prof linpackup.bbcounts.m77549
linpackup (n32): Target program
bbcounts: Experiment name
it:cu: Marching orders
R5000 / R5000: CPU / FPU
1: Number of CPUs
180: Clock frequency (MHz.)
Experiment notes--
From file linpackup.bbcounts.m77549:
Caliper point 0 at target begin, PID 77549
/d3/jcarter/pixie/SpeedShop/linpack/linpackup
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of basic block time data (bbcounts)--
5112044612: Total number of instructions executed
6935000905: Total computed cycles
38.528: Total computed execution time (secs.)
1.357: Average cycles / instruction
-------------------------------------------------------------------------
Function list, in descending order by exclusive ideal time
-------------------------------------------------------------------------
[index] excl.secs excl.% cum.% cycles instructions calls function (dso:file, line)
[1] 36.341 94.3% 94.3% 6541395864 4828393890 772633 daxpy (linpackup.pixbb: linpackup.f, 495)
[2] 1.262 3.3% 97.6% 227149380 160652700 18 matgen (linpackup.pixbb: linpackup.f, 199)
[3] 0.638 1.7% 99.3% 114769057 80790171 17 dgefa (linpackup.pixbb: linpackup.f, 221)
[4] 0.130 0.3% 99.6% 23361468 18659693 5083 dscal (linpackup.pixbb: linpackup.f, 670)
[5] 0.109 0.3% 99.9% 19598722 16437793 5083 idamax (linpackup.pixbb: linpackup.f, 700)
[6] 0.027 0.1% 99.9% 4866714 3785181 1 dmxpy (linpackup.pixbb: linpackup.f, 826)
[7] 0.006 0.0% 100.0% 1149098 767142 17 dgesl (linpackup.pixbb: linpackup.f, 324)
[8] 0.004 0.0% 100.0% 710286 659635 2156 general_find_symbol (rld: rld.c, 2038)
[9] 0.003 0.0% 100.0% 481218 439585 4819 resolve_relocations (rld: rld.c, 2636)
[10] 0.001 0.0% 100.0% 237532 237532 2170 elfhash (rld: obj.c, 1184)
[11] 0.001 0.0% 100.0% 167623 167623 7295 obj_dynsym_got (rld: objfcn.c, 46)
[12] 0.001 0.0% 100.0% 156368 156368 3085 strcmp (rld: strcmp.s, 34)
[13] 0.001 0.0% 100.0% 128884 124596 2158 resolve_symbol (rld: rld.c, 1828)
[14] 0.001 0.0% 100.0% 123436 121312 2129 resolving (rld: rld.c, 1499)
[15] 0.001 0.0% 100.0% 122038 109016 3 fix_all_defineds (rld: rld.c, 3419)
[16] 0.000 0.0% 100.0% 64706 64365 7 search_for_externals (rld: rld.c, 3987)
[17] 0.000 0.0% 100.0% 61565 59304 1116 __flsbuf (libc.so.1: _flsbuf.c, 25)
[18] 0.000 0.0% 100.0% 56077 43137 4314 obj_set_dynsym_got (rld: objfcn.c, 82)
[19] 0.000 0.0% 100.0% 42883 38501 1 linp (linpackup.pixbb: linpackup.f, 3)
[20] 0.000 0.0% 100.0% 42595 42594 867 x_putc (libftn.so: wsfe.c, 177)
[21] 0.000 0.0% 100.0% 24161 21845 28 x_wEND (libftn.so: wsfe.c, 225)
[22] 0.000 0.0% 100.0% 17014 17000 8 memset (libc.so.1: bzero.s, 98)
[23] 0.000 0.0% 100.0% 14671 13537 71 do_fio64_mp (libftn.so: fmt.c, 626)
[24] 0.000 0.0% 100.0% 14575 11501 53 wrt_E (libftn.so: wrtfmt.c, 353)
[25] 0.000 0.0% 100.0% 13664 12432 28 getenv (libc.so.1: getenv.c, 25)
.
.
.
[329] 0.000 0.0% 100.0% 1 1 1 __istart (linpackup.pixbb: crt1tinit.s, 14) |
The report has the following columns:
The index column assigns a reference number to each function.
The excl.secs column shows the minimum number of seconds that might be spent in the routine under ideal conditions. For example, 52.073 seconds is optimal for the daxpy routine. The pcsamp experiment (see “A pcsamp Experiment”) times this routine at 77.44 seconds.
The excl.% column represents how much of the program's total time was spent in the routine.
The cum.% column shows the cumulative percentage of time spent in the routines listed so far.
The cycles column shows the total number of machine cycles used by the routine. For example, 91,12,833,799 CPU clock cycles were spent in the daxpy routine.
The instructions column shows the total number of instructions executed by a routine. For example, the dgefa routine executed 72,336,088 instructions.
The calls column shows the total number of calls to the routine. For example, there was just one call to the dmxpy routine.
The function (dso:file, line) column lists the name, the DSO name, the file name, and the line number for the routine.
The following steps generate tracing data for an MPI program:
First, set the MPI_RLD_HACK_OFF environment variable to prevent SpeedShop confusion over the organization of the DSOs.
% setenv MPI_RLD_HACK_OFF 1 |
Compile the matmul.f source file and include the MPI library:
% f90 -o matmul matmul.f -lmpi |
Now run the ssrun command as part of the mpirun(1) command on the executable file to generate experiment files:
% mpirun -np 4 ssrun -mpi matmul |
The result will be a series of experiment files, one for each process (the identifier begins with an f) and one for the master process (the identifier begins with an m):
| matmul.mpi.f9587021 |
| matmul.mpi.f9905720 |
| matmul.mpi.f9930637 |
| matmul.mpi.f9930718 |
| matmul.mpi.m9951566 |
Finally, display an experiment file with the WorkShop cvperf(1) command. Remember, you cannot use prof to display an MPI trace experiment.
% cvperf matmul.mpi.f9587021 |
To display the output, select either MPI Stats View (Graphs) or MPI Stats View (Numerical) from the Views menu. See Figure 3-1, for an illustration of the MPI Stats View (Graphs).