Chapter 3. Tutorial for Fortran Users

This chapter provides two tutorials for using the SpeedShop tools to gather and analyze performance data in a Fortran program. There are three versions of the first program:

The first tutorial covers the following topics:

The second tutorial creates a Message Passing Interface (MPI) experiment. The experiment file is generated by SpeedShop and displayed by the WorkShop performance analyzer. See “MPI Tracing Tutorial”.


Note: Because of inherent differences between systems and also due to concurrent processes that may be running on your system, your experiment will produce different results from the one in this tutorial. However, the basic structure of the results should be the same.


Tutorial Overview

This tutorial is based on a standard benchmark program called linpackup. There are two versions of the program: the linpack directory contains files for the n32-bit ABI, and the linpacko32 directory contains files for the o32-bit ABI. Each linpack directory contains versions of the program for a single processor (linpackup) and for multiple processors (linpackd). When you work with the tutorial, choose the version of the program that is most appropriate for your system. A good guideline is to choose whichever version corresponds to the way you expect to develop your programs.

This tutorial was written and tested using the single-processor version of the program (linpackup) in the linpack directory.

The linpack program is a standard benchmark designed to measure CPU performance in solving dense linear equations. The program focuses primarily on floating-point performance.

Output from the linpackup program looks like the following:

.
.
.
     norm. resid      resid           machep         x(1)          x(n)
  5.35882395E+00  7.13873405E-13  2.22044605E-16  1.00000000E+00  1.00000000E+00


    times are reported for matrices of order   300
      dgefa      dgesl      total     mflops       unit      ratio
 times for array with leading dimension of 301
  3.720E+00  4.000E-02  3.760E+00  4.835E+00  4.136E-01  6.714E+01
  3.780E+00  3.000E-02  3.810E+00  4.772E+00  4.191E-01  6.804E+01
  3.730E+00  4.000E-02  3.770E+00  4.822E+00  4.147E-01  6.732E+01
  3.730E+00  4.000E-02  3.770E+00  4.822E+00  4.147E-01  6.732E+01

 times for array with leading dimension of 300
  3.800E+00  4.000E-02  3.840E+00  4.734E+00  4.224E-01  6.857E+01
  3.810E+00  4.000E-02  3.850E+00  4.722E+00  4.235E-01  6.875E+01
  3.770E+00  4.000E-02  3.810E+00  4.772E+00  4.191E-01  6.804E+01
  3.782E+00  4.000E-02  3.822E+00  4.757E+00  4.205E-01  6.825E+01

Tutorial Setup

Copy the program to a directory where you have write permission and compile it so that you can use it in the tutorial.

  1. Change to the /usr/demos/SpeedShop directory.

  2. Copy the appropriate linpack directory and its contents to a directory in which you have write permission:

    % cp -r linpack your_dir

  3. Change to the directory you just created:

    % cd your_dir/linpack

  4. Compile the program by entering:

    % make all

    This provides an executable for the experiment.

Analyzing Performance Data

This section lists the steps you need to perform the following experiments on the linpackup program, generate the experiment's results, and interpret the results:

  • The usertime experiment. It returns the CPU time used by each routine in your program. See “A usertime Experiment”.

  • The pcsamp experiment. It returns CPU time for each routine in your program. See “A pcsamp Experiment”.

  • The dsc_hwc (secondary data cache hardware counter) experiment. In a hardware counter experiment, the program counter is sampled every time a hardware counter exceeds a specified limit. In the experiment performed in this section, the hardware counter keeps track of the number of times a data item required in a calculation was not present in secondary data cache. When a data item is not in cache, it must be retrieved from memory, which is a more time-consuming process. See “A Hardware Counter Experiment”.

  • The bbcounts experiment. This experiment calculates the best time achievable. See “A bbcounts Experiment”.

A usertime Experiment

This section lists the steps you need to perform a usertime experiment. The usertime experiment allows you to gather data on the amount of CPU time spent in each routine in your program. For more information, see “usertime Experiment” in Chapter 4.

Performing a usertime Experiment

From the command line, enter the following:

% ssrun -v -usertime linpackup

This starts the experiment. The -v flag tells ssrun to print a log to stderr.

Output from linpackup and from ssrun is printed to stdout, as shown in the following example. A data file is also generated. The name consists of the process name (linpackup), the experiment type (usertime), and the experiment ID. In this example, the filename is linpackup.usertime.m12205.

ssrun: target PID 12205
ssrun: setenv _SPEEDSHOP_MARCHING_ORDERS ut:cu
ssrun: setenv _SPEEDSHOP_EXPERIMENT_TYPE usertime
ssrun: setenv _SPEEDSHOP_TARGET_FILE linpackup
ssrun: setenv _RLD_LIST libss.so:libssrt.so:DEFAULT
 Please send the results of this run to:

 Jack J. Dongarra
 Mathematics and Computer Science Division
 Argonne National Laboratory
 Argonne, Illinois 60439

 Telephone: 312-972-7246

 ARPAnet: DONGARRA@ANL-MCS

     norm. resid      resid           machep         x(1)          x(n)
  5.35882395E+00  7.13873405E-13  2.22044605E-16  1.00000000E+00  1.00000000E+00


    times are reported for matrices of order   300
      dgefa      dgesl      total     mflops       unit      ratio
 times for array with leading dimension of 301
  3.960E+00  4.000E-02  4.000E+00  4.545E+00  4.400E-01  7.143E+01
  3.960E+00  4.000E-02  4.000E+00  4.545E+00  4.400E-01  7.143E+01
  3.970E+00  4.000E-02  4.010E+00  4.534E+00  4.411E-01  7.161E+01
  3.960E+00  4.000E-02  4.000E+00  4.545E+00  4.400E-01  7.143E+01

 times for array with leading dimension of 300
  3.910E+00  4.000E-02  3.950E+00  4.603E+00  4.345E-01  7.054E+01
  3.880E+00  8.000E-02  3.960E+00  4.591E+00  4.356E-01  7.071E+01
  3.930E+00  4.000E-02  3.970E+00  4.579E+00  4.367E-01  7.089E+01
  3.922E+00  3.800E-02  3.960E+00  4.591E+00  4.356E-01  7.071E+01

Generating a Report

To generate a report on the data collected, enter the following at the command line:

% prof your_output_file_name > usertime.results

The prof command interprets the type of experiment you have performed and prints results to stdout. The following report shows partial prof output.


Note: Lines have been wrapped because of line width restrictions.


-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb  2 13:37:38 1998
   prof linpackup.usertime.m12205
               linpackup (n32): Target program
                      usertime: Experiment name
                         ut:cu: Marching orders
                 R4400 / R4000: CPU / FPU
                             1: Number of CPUs
                           175: Clock frequency (MHz.)
  Experiment notes--
          From file linpackup.usertime.m12205:
        Caliper point 0 at target begin, PID 12205
                 /usr/demos/SpeedShop/linpack.demos/fortran/linpackup
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical callstack sampling data (usertime)--
                          2777: Total Samples
                             0: Samples with incomplete traceback
                        83.310: Accumulated Time (secs.)
                          30.0: Sample interval (msecs.)
-------------------------------------------------------------------------
Function list, in descending order by exclusive time
-------------------------------------------------------------------------
 [index]  excl.secs excl.%   cum.%  incl.secs incl.%    samples  function  
(dso: file, line)

     [5]     78.090  93.7%   93.7%     78.090  93.7%       2603  daxpy 
(linpackup: linpackup.f, 495)
     [6]      2.730   3.3%   97.0%      2.730   3.3%         91  matgen
(linpackup: linpackup.f, 199)
     [4]      1.920   2.3%   99.3%     79.680  95.6%       2656  dgefa 
(linpackup: linpackup.f, 221)
     [8]      0.270   0.3%   99.6%      0.270   0.3%          9  dscal 
(linpackup: linpackup.f, 670)
     [9]      0.180   0.2%   99.9%      0.180   0.2%          6  idamax
(linpackup: linpackup.f, 700)
    [10]      0.090   0.1%  100.0%      0.090   0.1%          3  dmxpy 
(linpackup: linpackup.f, 826)
     [7]      0.030   0.0%  100.0%      0.810   1.0%         27  dgesl 
(linpackup: linpackup.f, 324)
     [1]      0.000   0.0%  100.0%     83.310 100.0%       2777  __start 
(linpackup: crt1text.s, 101)
     [2]      0.000   0.0%  100.0%     83.310 100.0%       2777  main 
(libftn.so: main.c, 76)
     [3]      0.000   0.0%  100.0%     83.310 100.0%       2777  linp 
(linpackup: linpackup.f, 3)

Analyzing the Report

The report shows information for each function.

  • The index column, which enumerates the routines in the program, provides an index number for reference.

  • The excl.secs column shows how much time, in seconds, was spent in the routine itself (exclusive time). For example, less than one hundredth of a second was spent in linp, but 1.92 seconds were spent in dgefa.

  • The excl.% column shows the percentage of a program's total time that was spent in the function. For example, the daxpy routine consumed 93.7% of the program's time.

  • The cum.% column shows the percentage of the complete program time that has been spent in the routines that have been listed so far. For instance, when the dgefa routine completes, 99.3% of the program has completed by the routines listed so far.

  • The incl.secs column shows how much time, in seconds, was spent in the function and descendents of the function. For example, 0.81 seconds were spent in dgesl and the routines that were called from it.

  • The incl.% column shows the cumulative percentage of inclusive time spent in each routine and its descendents. For example, 1% of the time was spent in dgesl and all the routines that were called from it.

  • The samples column provides the number of samples taken from the function and all of its descendants.

  • The function (dso:file, line) column lists the routine name, its DSO name, its file name, and its line number. For example, the top line reports statistics for the routine daxpy, the DSO name linpackup, in the file linpackup.f, at line 495.


Note: Many functions shown here have only one or two hits. The data for those functions is not statistically significant. (Routines that begin with an underscore, such as __start, are internal routines that you cannot change.)


A pcsamp Experiment

This section lists the steps you need to perform a pcsamp experiment. The pcsamp experiment allows you to gather information on actual CPU time for each source code line, machine line, and function in your program. For more information on pcsamp, see “pcsamp Experiment” in Chapter 4.

Performing a pcsamp Experiment

From the command line, enter the following:

% ssrun -pcsamp linpackup

This starts the experiment.

Output from linpackup and from ssrun is printed to stdout, as shown in the following example. A data file is also generated. The name consists of the process name (linpackup), the experiment type (pcsamp), and the experiment ID. In this example, the file name is linpackup.pcsamp.m12333.

.
.
.
       norm. resid      resid           machep         x(1)          x(n)
    5.35882395E+00  7.13873405E-13  2.22044605E-16  1.00000000E+00  1.00000000E+00
.
.
.

Generating a Report

To generate a report on the data collected, enter the following at the command line:

% prof your_output_file_name > pcsamp.results

The prof command interprets the type of experiment you have performed and prints results to stdout. The following report shows partial prof output.

-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb  2 13:52:27 1998
   prof linpackup.pcsamp.m12333
               linpackup (n32): Target program
                        pcsamp: Experiment name
               pc,2,10000,0:cu: Marching orders
                 R4400 / R4000: CPU / FPU
                             1: Number of CPUs
                           175: Clock frequency (MHz.)
  Experiment notes--
          From file linpackup.pcsamp.m12333:
        Caliper point 0 at target begin, PID 12333
         /usr/demos/SpeedShop/linpack.demos/fortran/linpackup
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (pcsamp)--
                          8272: Total samples
                        82.720: Accumulated time (secs.)
                          10.0: Time per sample (msecs.)
                             2: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
 [index]      secs    %    cum.%   samples  function (dso: file, line)

     [1]    77.440  93.6%  93.6%      7744  daxpy (linpackup: linpackup.f, 495)
     [2]     2.690   3.3%  96.9%       269  matgen (linpackup: linpackup.f, 199)
     [3]     1.940   2.3%  99.2%       194  dgefa (linpackup: linpackup.f, 221)
     [4]     0.370   0.4%  99.7%        37  idamax (linpackup: linpackup.f, 700)
     [5]     0.210   0.3%  99.9%        21  dscal (linpackup: linpackup.f, 670)
     [6]     0.060   0.1% 100.0%         6  dmxpy (linpackup: linpackup.f, 826)
             0.010   0.0% 100.0%         1  **OTHER** (includes excluded DSOs, rld, etc.)

            82.720 100.0% 100.0%      8272  TOTAL

Analyzing the Report

The report has the following columns:

  • The index column assigns a reference number to each function.

  • The secs column shows the amount of CPU time spent in the routine.

  • The (%) column shows the percentage of the total program time that was spent in the function.

  • The cum.% column shows the percentage of the complete program time that has been spent by the routines listed so far.

  • The samples column shows how many samples were taken when the process was executing in the function.

  • The function (dso:file, line) columns list the routine name, its DSO name, its file name, and its line number. For example, the first line reports statistics for the routine daxpy, in the DSO linpackup, in the file linpackup.f, at line number 495.

A Hardware Counter Experiment


Note: This experiment can be performed only on systems that have built-in hardware counters (the R10000, R12000, R14000, and R16000 classes of machines).

Hardware counters keep track of a variety of hardware information. For a complete list of hardware counter experiments, see the ssrun(1) man page.

This section lists the steps you need to perform a hardware counter experiment. The tutorial describes the steps involved in performing the dsc_hwc experiment. This experiment allows you to capture information about secondary data cache misses. For more information on hardware counter experiments, see “SpeedShop Hardware Counter Experiments” in Chapter 4.

Performing a Hardware Counter Experiment

From the command line, enter the following:

% ssrun -dsc_hwc linpackup

This starts the experiment. Output from linpackup and from ssrun will be printed to stdout. A data file is also generated. The name consists of the process name (linpackup), the experiment type (dsc_hwc), and the experiment ID. In this example, the filename is linpackup.dsc_hwc.m438011.

Generating a Report

To generate a report on the data collected, enter the following at the command line:

% prof your_output_file_name > dsc_hwc.results

Output similar to the following is generated:

-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb  2 13:56:59 1998
   prof linpackup.dsc_hwc.m438011
               linpackup (n32): Target program
                       dsc_hwc: Experiment name
                 hwc,26,131:cu: Marching orders
               R10000 / R10010: CPU / FPU
                            16: Number of CPUs
                           195: Clock frequency (MHz.)
  Experiment notes--
          From file linpackup.dsc_hwc.m438011:
        Caliper point 0 at target begin, PID 438011
         /usr/demos/SpeedShop/linpack.demos/fortran/linpackup
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of R10K perf. counter overflow PC sampling data (dsc_hwc)--
                          2929: Total samples
       Sec cache D misses (26): Counter name (number)
                           131: Counter overflow value
                        383699: Total counts
-------------------------------------------------------------------------
Function list, in descending order by counts
-------------------------------------------------------------------------
 [index]    counts     %   cum.% samples  function (dso: file, line)

     [1]    309029  80.5%  80.5%    2359  daxpy (linpackup: linpackup.f, 495)
     [2]     46636  12.2%  92.7%     356  dgefa (linpackup: linpackup.f, 221)
     [3]     25938   6.8%  99.5%     198  matgen (linpackup: linpackup.f, 199)
     [4]      1310   0.3%  99.8%      10  idamax (linpackup: linpackup.f, 700)
     [5]       131   0.0%  99.8%       1  _FWF (libfortran.so: wf90.c, 47)
     [6]       131   0.0%  99.9%       1  memset (libc.so.1: bzero.s, 98)
               524   0.1% 100.0%       4  **OTHER** (includes excluded DSOs, rld, etc.)

                383699 100.0% 100.0%      2929  TOTAL

Analyzing the Report

The information immediately above the function list displays the following:

  • The Total samples is the number of times the program counter was sampled. It is sampled once for each overflow, or each time the hardware counter exceeds the specified value.

  • The Counter name (number) indicates the hardware counter used in the experiment. In this case, hardware counter 26 counts the number of times a value required in a calculation was not available in secondary cache. For a complete list of the hardware counters and their numbers, see Table 4-3.

  • The Counter overflow value is the number at which the hardware counter overflows, or exceeds its preset value. In this case, the value is 131, which is the default. You can change the overflow value by setting the _SPEEDSHOP_HWC_COUNTER_OVERFLOW environment variable to a value larger than 0, the _SPEEDSHOP_HWC_COUNTER_NUMBER environment variable to 26, and running the prof_hwc experiment rather than dsc_hwc.

    See “_hwctime Hardware Counter Experiments” in Chapter 4 to learn how to choose a counter overflow value.

  • The Total counts is the total number of times a value was not in secondary cache when needed. This value is determined by multiplying the total number of samples by the overflow value; extra counts that do not cause an overflow are not recorded.

The function list has the following columns:

  • The index column assigns a reference number to each function.

  • The counts column shows the number of times a data item was not in secondary cache when needed for a calculation during the execution of the routine. As with Total counts (described earlier), a routine's counts value is determined by multiplying its samples value (described later) by the overflow value.

  • The % column shows the percentage of the program's overflows that occurred in the routine.

  • The cum.% column shows the percentage of the program's overflows that occurred in the routines listed so far. For example, although the matgen routine had only 6.8% of the program's overflows, by the time it is encountered in the routine list, 99.5% of the program's total overflows have been recorded.

  • The samples column shows the number of times the program counter was sampled during execution of the routine. A sample is taken for each overflow of the hardware counter.

  • The function (dso:file, line) columns show the name, the DSO, the file name, and line number of the routine.

A bbcounts Experiment

This section provides the steps you need to perform a bbcounts or basic block counts experiment. This experiment counts basic block usage and estimates a linear time. It also maps a complete call graph. See “bbcounts Experiment” in Chapter 4.

Performing a bbcounts Experiment

From the command line, enter the following:

% ssrun -bbcounts linpackup

This starts the experiment. This entails making copies of the libraries and executables, giving them an extension that depends on the ABI, and inserting information into the copies. The extension is .pixbb for the executable, .pix32 for all o32 libraries, .pixn32 for all n32 libraries, and .pix64 for all 64 libraries.

Output from linpackup and from ssrun is printed to stdout, as shown in the following example. A data file is also generated. The name consists of the process name (linpackup), the experiment type (bbcounts), and the experiment ID. In this example, the file name is linpackup.bbcounts.m77549.

instrumenting /lib32/rld
instrumenting /usr/lib32/libssrt.so
instrumenting /usr/lib32/libss.so
instrumenting /usr/lib32/libmp.so
instrumenting /usr/lib32/libftn.so
instrumenting /usr/lib32/libm.so
instrumenting /usr/lib32/libc.so.1

Generating a Report

To generate a report on the data collected, enter the following at the command line:

% prof your_output_file_name > bbcounts.results

The prof command redirects output to a file called bbcounts.results. The file should contain results that look something like the following.

-------------------------------------------------------------------------
SpeedShop profile listing generated Tue Oct 16 16:02:42 2001
   prof linpackup.bbcounts.m77549
               linpackup (n32): Target program
                      bbcounts: Experiment name
                         it:cu: Marching orders
                 R5000 / R5000: CPU / FPU
                             1: Number of CPUs
                           180: Clock frequency (MHz.)
  Experiment notes--
	  From file linpackup.bbcounts.m77549:
	Caliper point 0 at target begin, PID 77549
			/d3/jcarter/pixie/SpeedShop/linpack/linpackup
	Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of basic block time data (bbcounts)--
                    5112044612: Total number of instructions executed
                    6935000905: Total computed cycles
                        38.528: Total computed execution time (secs.)
                         1.357: Average cycles / instruction
-------------------------------------------------------------------------
Function list, in descending order by exclusive ideal time
-------------------------------------------------------------------------
 [index]   excl.secs   excl.%     cum.%        cycles  instructions    calls  function  (dso:file, line)

     [1]      36.341    94.3%     94.3%    6541395864    4828393890   772633  daxpy (linpackup.pixbb: linpackup.f, 495)
     [2]       1.262     3.3%     97.6%     227149380     160652700       18  matgen (linpackup.pixbb: linpackup.f, 199)
     [3]       0.638     1.7%     99.3%     114769057      80790171       17  dgefa (linpackup.pixbb: linpackup.f, 221)
     [4]       0.130     0.3%     99.6%      23361468      18659693     5083  dscal (linpackup.pixbb: linpackup.f, 670)
     [5]       0.109     0.3%     99.9%      19598722      16437793     5083  idamax (linpackup.pixbb: linpackup.f, 700)
     [6]       0.027     0.1%     99.9%       4866714       3785181        1  dmxpy (linpackup.pixbb: linpackup.f, 826)
     [7]       0.006     0.0%    100.0%       1149098        767142       17  dgesl (linpackup.pixbb: linpackup.f, 324)
     [8]       0.004     0.0%    100.0%        710286        659635     2156  general_find_symbol (rld: rld.c, 2038)
     [9]       0.003     0.0%    100.0%        481218        439585     4819  resolve_relocations (rld: rld.c, 2636)
    [10]       0.001     0.0%    100.0%        237532        237532     2170  elfhash (rld: obj.c, 1184)
    [11]       0.001     0.0%    100.0%        167623        167623     7295  obj_dynsym_got (rld: objfcn.c, 46)
    [12]       0.001     0.0%    100.0%        156368        156368     3085  strcmp (rld: strcmp.s, 34)
    [13]       0.001     0.0%    100.0%        128884        124596     2158  resolve_symbol (rld: rld.c, 1828)
    [14]       0.001     0.0%    100.0%        123436        121312     2129  resolving (rld: rld.c, 1499)
    [15]       0.001     0.0%    100.0%        122038        109016        3  fix_all_defineds (rld: rld.c, 3419)
    [16]       0.000     0.0%    100.0%         64706         64365        7  search_for_externals (rld: rld.c, 3987)
    [17]       0.000     0.0%    100.0%         61565         59304     1116  __flsbuf (libc.so.1: _flsbuf.c, 25)
    [18]       0.000     0.0%    100.0%         56077         43137     4314  obj_set_dynsym_got (rld: objfcn.c, 82)
    [19]       0.000     0.0%    100.0%         42883         38501        1  linp (linpackup.pixbb: linpackup.f, 3)
    [20]       0.000     0.0%    100.0%         42595         42594      867  x_putc (libftn.so: wsfe.c, 177)
    [21]       0.000     0.0%    100.0%         24161         21845       28  x_wEND (libftn.so: wsfe.c, 225)
    [22]       0.000     0.0%    100.0%         17014         17000        8  memset (libc.so.1: bzero.s, 98)
    [23]       0.000     0.0%    100.0%         14671         13537       71  do_fio64_mp (libftn.so: fmt.c, 626)
    [24]       0.000     0.0%    100.0%         14575         11501       53  wrt_E (libftn.so: wrtfmt.c, 353)
    [25]       0.000     0.0%    100.0%         13664         12432       28  getenv (libc.so.1: getenv.c, 25)
   
.
.
.
   [329]       0.000     0.0%    100.0%             1             1        1  __istart (linpackup.pixbb: crt1tinit.s, 14)

Analyzing the Report

The report has the following columns:

  • The index column assigns a reference number to each function.

  • The excl.secs column shows the minimum number of seconds that might be spent in the routine under ideal conditions. For example, 52.073 seconds is optimal for the daxpy routine. The pcsamp experiment (see “A pcsamp Experiment”) times this routine at 77.44 seconds.

  • The excl.% column represents how much of the program's total time was spent in the routine.

  • The cum.% column shows the cumulative percentage of time spent in the routines listed so far.

  • The cycles column shows the total number of machine cycles used by the routine. For example, 91,12,833,799 CPU clock cycles were spent in the daxpy routine.

  • The instructions column shows the total number of instructions executed by a routine. For example, the dgefa routine executed 72,336,088 instructions.

  • The calls column shows the total number of calls to the routine. For example, there was just one call to the dmxpy routine.

  • The function (dso:file, line) column lists the name, the DSO name, the file name, and the line number for the routine.

MPI Tracing Tutorial

The following steps generate tracing data for an MPI program:

  1. First, set the MPI_RLD_HACK_OFF environment variable to prevent SpeedShop confusion over the organization of the DSOs.

    % setenv MPI_RLD_HACK_OFF 1

  2. Compile the matmul.f source file and include the MPI library:

    % f90 -o matmul matmul.f -lmpi

  3. Now run the ssrun command as part of the mpirun(1) command on the executable file to generate experiment files:

    % mpirun -np 4 ssrun -mpi matmul

    The result will be a series of experiment files, one for each process (the identifier begins with an f) and one for the master process (the identifier begins with an m):

    matmul.mpi.f9587021
    matmul.mpi.f9905720
    matmul.mpi.f9930637
    matmul.mpi.f9930718
    matmul.mpi.m9951566
  4. Finally, display an experiment file with the WorkShop cvperf(1) command. Remember, you cannot use prof to display an MPI trace experiment.

    % cvperf matmul.mpi.f9587021

    To display the output, select either MPI Stats View (Graphs) or MPI Stats View (Numerical) from the Views menu. See Figure 3-1, for an illustration of the MPI Stats View (Graphs).

Figure 3-1. An MPI Experiment in cvperf