As described in Chapter 4, “Compiler and Development Tools”, the development process can be thought of as a chain of processes aided by a variety of software tools. Table 5-1 summarizes this development tool chain.
Chapter 4, “Compiler and Development Tools” described the compilation process and tools available for Altix systems. This chapter outlines other development tools which are mainly used after an application has been built (or to automate the build process).
Table 5-1. Development Process
Activity | Tools | IRIX versions | Linux versions |
|---|---|---|---|
Source code development | Editors | vi, emacs, jot, etc | vi, emacs, etc. |
Executable creation | Compilers | cc, CC, f77, f90 | ecc, gcc, efc/ifort, g77 |
Object file creation | Assemblers | as | ias, as |
Linkage | Linker | ld | ld |
Archiving | Archiver | ar | ar |
Object file inspection | Object tools | elfdump, dwarfdump | objdump |
Debugging | Debuggers | dbx, cvd | gdb, idb, ddd, DDT |
Performance analysis | Profilers | SpeedShop, perfex | VTune, pfmon, histx |
Automation | Make | make, smake, pmake | gmake |
Environment configuration | Scripts | modules | modules |
The archiver (ar) maintains groups of files as a single archive file. Generally, you use this utility to create and update library files that the linker uses; however, you can use the archiver for any similar purpose. On Linux, ar is the GNU archiver. The archiver flags are similar on IRIX and Linux; Table 5-2 summarizes their common flags.
Table 5-2. IRIX and Linux Common Archiver Options
-d | Deletes specified object |
-m | Moves specified object to the end of the archive |
-p | Prints the specified members of the archive to stdout |
-q | Appends specified object to the end of the archive |
-r | Replaces an earlier version of the object in the archive |
-t | Lists the table of contents of the archive |
-x | Extracts a file from the archive |
Example 5-1 shows how to use the -q option to build an archive file and the -t option to list its contents.
Example 5-1. Building an Archive
%gcc -c foo1.c # creates foo1.o
%gcc -c foo2.c # creates foo2.o
%ar -q archive.a foo1.o foo2.o # creates archive.a
%ar -t archive.a # lists contents of archive
foo1.o
foo2.o
Table 5-3 summarizes other commands that can be used to inspect and manipulate object files. Like ar(1), the Linux versions of these commands are GNU based. Note that on Linux dis is an alias for objdump -d rather than a separate command, and that there is no elfdump; readelf and objdump take its place. The functionality and flags accepted by the various commands also differ between IRIX and Linux. For more information, see the man pages for the individual commands (for example, %man objdump).
Table 5-3. Additional Object File Tools
IRIX | Linux | Function |
|---|---|---|
file | file | Lists the general properties of the file |
size | size | Lists the size of each section of the object file |
elfdump | readelf | Lists the contents of an ELF object file |
ldd | ldd | Lists the shared library dependencies |
nm | nm | Lists the symbol table information |
elfdump | objdump | Dumps the contents of an object file |
dis | objdump -d | Disassembles the object code |
strip | strip | Removes the symbol table and relocation information |
c++filt | c++filt | Demangles names for C++ (nm -C) |
Debuggers on IRIX and Linux fall into two categories:
Command line (text based) debuggers
GUI (windowed) debuggers
On IRIX, the ProDev WorkShop tools provide the dbx command line debugger and the CaseVision cvd GUI debugger. Both can debug programs compiled by any MIPSpro compiler, and both support debugging of multithreaded code. A second GUI debugger, TotalView, is available from Etnus Corp (www.etnus.com) for both IRIX and Altix machines.
Debuggers that ship with Altix machines are provided by Intel and GNU. The Intel debugger is called idb. Like its GNU counterpart, gdb, it is a command line debugger that can attach to a running process or debug a core file. It supports debugging programs written in all of the languages supported by the Intel compilers and has been improved in the area of debugging multithreaded applications (OpenMP or pthreads). By default, it supports dbx commands though it can also (via option) support gdb commands. Table 5-4 lists some of the more commonly used commands of these debuggers.
Table 5-4. Command Line Debugger Commonly Used Commands
MIPSpro dbx and idb Default Command | gdb Command | Function |
|---|---|---|
run | run | Start program |
continue | continue | Continue stopped program |
attach pid | attach pid | Attach to running process |
stop in function | break func | Set breakpoint in function |
stop at line | break line | Set breakpoint on line # |
status | info breakpoints | Print breakpoints |
delete N | delete N | Delete breakpoint N |
print expr | print expr | Print expression value |
step | step | Single step (into functions) |
next | next | Single step (over functions) |
return | finish | Continue running until current function returns |
printregs | info registers | Print register values |
address/Ni | disassemble | Disassemble machine code |
list | list | List source code |
exit | quit | Exit debugger |
The full set of commands supported by the idb and gdb debuggers is described in their respective man pages, idb(1) and gdb(1). Documentation on gdb is also available at the GNU web site: http://www.gnu.org/software/gdb/documentation/.
In addition to the previously mentioned Etnus TotalView debugger (a discussion of which is beyond the scope of this manual), there is a graphical front end to either gdb (the default) or idb called the Data Display Debugger, or ddd. (For information, see http://www.gnu.org/software/ddd/.)
To invoke ddd with idb running in dbx mode, execute the following:
% ddd --debugger idb --dbx ./a.out
This creates a debugger console window where debugger commands can be typed. This also creates window panes for the source code, disassembled code, and array values. You can use the View menu to switch these panes on and off.
Figure 5-1 shows a typical ddd display.
Some commonly used commands from Table 5-4 are found in the Program pull-down menu and in the command tool.
The following site contains a thorough repository of information about ddd: http://www.gnu.org/manual/ddd/html_mono/ddd.html . The ddd(1) man page is also useful.
Another GUI debugger available for Altix is called the distributed debugging tool (DDT), available from Streamline Computing. DDT focuses on providing support for debugging parallel applications. For more information see: http://www.streamline-computing.com/softwaredivision_1.shtml .
A variety of hardware and software support is provided for timing on IRIX and Altix systems. Understanding their implementation is critical to avoiding faulty conclusions when measuring application performance. Under IRIX systems, this support is summarized in the timers(5) man page. The rest of this discussion focuses on Altix systems and briefly outlines the differences between the systems.
On Altix platforms, the IA64 processor has a high-resolution timer register that operates at the clock speed of the processor. This timer is available through the application register AR.ITC and is commonly referred to as the itc. While it provides 1 nanosecond resolution (at 1 GHz), the itc registers are not synchronized across processors. The Altix NUMAlink hardware provides a second timer, the SN.RTC, which currently gives 40 nanosecond resolution and whose value is synchronized across processors on Altix.
The basic Linux gettimeofday() system call takes a pointer to a timeval structure containing two long integers, in which it returns the time of day as seconds and microseconds since midnight (00:00) Coordinated Universal Time (UTC), January 1, 1970. The following example illustrates its use:
Example 5-2. Using gettimeofday()
%cat td.c
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    int i;
    struct timeval T;

    /* The second (time zone) argument is obsolete; pass a null pointer. */
    i = gettimeofday(&T, 0);
    if (i == 0)
        printf("gettimeofday returned %ld seconds and %ld microseconds\n",
               (long)T.tv_sec, (long)T.tv_usec);
    return 0;
}
%icc td.c
%./a.out
gettimeofday returned 1078969017 seconds and 370026 microseconds
The gettimeofday values are updated on every timer interrupt in the kernel. Currently this occurs at the rate of 1024 interrupts per second. If better resolution is required, variants of clock_gettime() can be used.
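Example 5-2 reads the clock once. To time a section of code, two readings are taken and subtracted. The following is a minimal sketch of that pattern; the helper names timeval_diff_seconds and seconds_since are hypothetical and are not part of any library. Because the values advance only on timer interrupts, intervals much shorter than about one millisecond cannot be measured reliably this way.

```c
#include <stddef.h>
#include <sys/time.h>

/* Returns (end - start) in seconds.
 * Hypothetical helper, shown only for illustration. */
double timeval_diff_seconds(const struct timeval *start,
                            const struct timeval *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) / 1.0e6;
}

/* Returns the wall-clock seconds elapsed since *start,
 * or -1.0 if gettimeofday() fails. */
double seconds_since(const struct timeval *start)
{
    struct timeval now;

    if (gettimeofday(&now, NULL) != 0)
        return -1.0;
    return timeval_diff_seconds(start, &now);
}
```

A caller would fill a struct timeval with gettimeofday() before the region of interest and pass its address to seconds_since() afterwards.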
The clock_gettime function returns the current value for the specified clock (passed in by the first parameter clock_id). The value is returned through a pointer to a timespec structure consisting of two long integers containing values for seconds and nanoseconds.
Depending on the clock's resolution, it may be possible to obtain the same time value with consecutive reads of the clock. The time value may also have a higher precision than the resolution of the clock.
The resolution of any clock can be obtained by calling the clock_getres() function. The resolution of the clock is returned through a pointer to a timespec structure.
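The effect of resolution on consecutive reads can be observed directly. The sketch below reads a clock repeatedly until the reported value changes and returns the size of that step in nanoseconds. The function names are hypothetical, and the step seen on a given system depends on its kernel and hardware, so no particular value should be expected. On older systems the program may need to be linked with -lrt, as in Example 5-3.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Returns (b - a) in nanoseconds. Hypothetical helper. */
static long long timespec_diff_ns(const struct timespec *a,
                                  const struct timespec *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (long long)(b->tv_nsec - a->tv_nsec);
}

/* Reads the clock until its value changes and returns the size of
 * the observed step in nanoseconds, or -1 if a read fails. */
long long observed_step_ns(clockid_t clock_id)
{
    struct timespec first, next;

    if (clock_gettime(clock_id, &first) != 0)
        return -1;
    do {
        if (clock_gettime(clock_id, &next) != 0)
            return -1;
    } while (next.tv_sec == first.tv_sec &&
             next.tv_nsec == first.tv_nsec);
    return timespec_diff_ns(&first, &next);
}
```

Comparing the result of observed_step_ns(CLOCK_REALTIME) with the value reported by clock_getres() shows whether the advertised resolution matches what the program actually sees.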
On Altix systems the list of supported clocks differs from that on IRIX. These clocks are:
CLOCK_REALTIME
CLOCK_PROCESS_CPUTIME_ID
CLOCK_THREAD_CPUTIME_ID
The CLOCK_SGI_CYCLE and CLOCK_SGI_FAST clocks supported on IRIX systems are not supported on Altix systems.
Example 5-3 gets the resolution of these clocks and then uses CLOCK_PROCESS_CPUTIME_ID to time how long it took to do so.
Example 5-3. Determining clock resolution time
%cat tr.c
#include <stdio.h>
#include <time.h>

int main(void)
{
    int i;
    struct timespec N;

    i = clock_getres(CLOCK_REALTIME, &N);
    if (i == 0)
        printf("Resolution is %ld seconds and %ld nanoseconds for CLOCK_REALTIME\n",
               (long)N.tv_sec, N.tv_nsec);
    i = clock_getres(CLOCK_PROCESS_CPUTIME_ID, &N);
    if (i == 0)
        printf("Resolution is %ld seconds and %ld nanoseconds for CLOCK_PROCESS_CPUTIME_ID\n",
               (long)N.tv_sec, N.tv_nsec);
    i = clock_getres(CLOCK_THREAD_CPUTIME_ID, &N);
    if (i == 0)
        printf("Resolution is %ld seconds and %ld nanoseconds for CLOCK_THREAD_CPU_TIME_ID\n",
               (long)N.tv_sec, N.tv_nsec);
    /* CPU time consumed by the process up to this point */
    i = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &N);
    if (i == 0)
        printf("Elapsed time is %ld seconds and %ld nanoseconds\n",
               (long)N.tv_sec, N.tv_nsec);
    return 0;
}
%icc tr.c -lrt
%./a.out
Resolution is 0 seconds and 976562 nanoseconds for CLOCK_REALTIME
Resolution is 0 seconds and 40 nanoseconds for CLOCK_PROCESS_CPUTIME_ID
Resolution is 0 seconds and 40 nanoseconds for CLOCK_THREAD_CPU_TIME_ID
Elapsed time is 0 seconds and 9125600 nanoseconds
As mentioned before, the CLOCK_REALTIME clock calls gettimeofday and has a resolution of 1/1024 seconds. The other two clocks provide much better resolution.
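Example 5-3 reads CLOCK_PROCESS_CPUTIME_ID once, which reports the CPU time consumed since the process started. To time a specific region, the clock can be read before and after it and the two values subtracted. The following is a minimal sketch of that approach; the function names and the summation loop used as a workload are hypothetical and serve only as illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Returns the CPU time consumed by this process so far, in seconds,
 * or -1.0 if the clock cannot be read. */
double process_cpu_seconds(void)
{
    struct timespec t;

    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t) != 0)
        return -1.0;
    return (double)t.tv_sec + (double)t.tv_nsec / 1.0e9;
}

/* Times a sample workload (a partial harmonic sum of n terms).
 * Stores the sum in *result and returns the CPU seconds used. */
double time_sum_loop(long n, double *result)
{
    double start = process_cpu_seconds();
    volatile double sum = 0.0;   /* volatile keeps the loop from being optimized away */
    long i;

    for (i = 1; i <= n; i++)
        sum += 1.0 / (double)i;

    *result = sum;
    return process_cpu_seconds() - start;
}
```

Because this clock counts CPU time rather than wall-clock time, time spent blocked or sleeping inside the region is not included in the result.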
MPI applications can use two portable timing routines provided by the MPI library, MPI_Wtime() and MPI_Wtick(). Both return double-precision floating-point numbers, which represent the time and the timer resolution in seconds, respectively. On Altix, they read the SN.RTC and so have submicrosecond resolution that is synchronized across nodes.
The following is a Fortran example that uses these timing routines.
Example 5-4. Using MPI Timing Routines
%cat m.f
      PROGRAM M
      INCLUDE "mpif.h"
      DOUBLE PRECISION TME1
      DOUBLE PRECISION TME2
      DOUBLE PRECISION ELAPSED
      DOUBLE PRECISION RES
      INTEGER error
      CALL MPI_INIT(error)
      TME1=MPI_WTIME()
      RES=MPI_WTICK()
      PRINT *,"RESOLUTION IS ",RES
      PRINT *,"TIME1 IS ",TME1
      TME2=MPI_WTIME()
      PRINT *,"TIME2 IS ",TME2
      ELAPSED = TME2-TME1
      PRINT *,"ELAPSED TIME IS ",ELAPSED
      CALL MPI_FINALIZE(error)
      END
%ifort m.f -o m -lmpi
%mpirun -np 1 m
RESOLUTION IS 4.000000000000000E-008
TIME1 IS 207612.193793240
TIME2 IS 207612.196069960
ELAPSED TIME IS 2.276720013469458E-003
Performance analysis tools typically work in two phases. First, the application is run and performance data is collected. Typically, this data is created by one of three methods:
Periodically interrupting a running application and capturing the program counter (PC-sampling) or the entire stack frame (Call-stack Sampling).
Instrumenting the executable program to generate performance data as it executes certain (or all) parts of the program.
Using hardware to detect and track certain events.
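The first method can be illustrated with a small sketch. It uses the profiling interval timer (ITIMER_PROF) to deliver SIGPROF periodically and, on each signal, adds one sample to a histogram bin. Capturing the real program counter is architecture dependent, so this sketch makes a simplifying assumption: the program records which region it is executing in a variable, and that variable stands in for the program counter. All function and variable names here are hypothetical; this is not how any of the tools described in this chapter are implemented.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

#define MAX_REGIONS 8

/* Region currently executing; a stand-in for the program counter. */
static volatile sig_atomic_t current_region = 0;
/* One histogram bin per region, incremented by the signal handler. */
static volatile sig_atomic_t samples[MAX_REGIONS];

/* SIGPROF handler: attribute one sample to the current region. */
static void on_sigprof(int signo)
{
    (void)signo;
    samples[current_region]++;
}

/* Marks the region the program is about to execute. */
void enter_region(int region)
{
    if (region >= 0 && region < MAX_REGIONS)
        current_region = region;
}

/* Returns the samples recorded for a region, or -1 if out of range. */
long region_samples(int region)
{
    if (region < 0 || region >= MAX_REGIONS)
        return -1;
    return samples[region];
}

/* Starts sampling every interval_usec microseconds of CPU time.
 * Returns 0 on success, -1 on failure. */
int sampling_start(long interval_usec)
{
    struct sigaction sa;
    struct itimerval timer;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    timer.it_interval.tv_sec = interval_usec / 1000000;
    timer.it_interval.tv_usec = interval_usec % 1000000;
    timer.it_value = timer.it_interval;
    return setitimer(ITIMER_PROF, &timer, NULL);
}

/* Stops sampling by disarming the timer. Returns 0 on success. */
int sampling_stop(void)
{
    struct itimerval off;

    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_PROF, &off, NULL);
}
```

After a run, the bins with the most samples identify the regions where the program spent most of its CPU time, which is the same principle a PC-sampling profiler applies at the granularity of instruction addresses.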
On Itanium based systems the third category is particularly important. The Itanium 2 Performance Monitoring Unit (PMU) defines over four hundred different events that can be measured in four 48-bit counters. The different types of events that can be measured fall into the following categories:
Basic Events (Clock cycles, Retired instructions)
Instruction Dispersal Events
(18 events; FP_OPS_RETIRED, FP_FLUSH_TO_ZERO)
Instruction Execution Events
Stall Events
Branch Events
Memory Hierarchy
System Events
TLB Events
System Bus Events
Register Stack Engine Events
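A counter of finite width eventually wraps around. As a hedged illustration of what the 48-bit width implies, the sketch below computes the number of events between two raw counter readings while allowing for a single wraparound, and estimates how long a counter that increments once per clock cycle takes to wrap. It assumes that software sees the raw 48-bit values; the tools described later may hide this detail, and the names used are hypothetical.

```c
#include <stdint.h>

#define PMU_COUNTER_BITS 48
#define PMU_COUNTER_MASK ((UINT64_C(1) << PMU_COUNTER_BITS) - 1)

/* Events counted between two raw readings of a 48-bit counter.
 * Correct as long as the counter wrapped at most once in between. */
uint64_t counter_delta(uint64_t before, uint64_t after)
{
    return (after - before) & PMU_COUNTER_MASK;
}

/* Seconds until a counter that increments once per clock cycle
 * wraps around, for a clock rate given in Hz. */
double seconds_to_wrap(double clock_hz)
{
    return (double)(PMU_COUNTER_MASK + 1) / clock_hz;
}
```

At 1 GHz, a cycle counter of this width wraps after roughly 281,000 seconds, or a little over three days, so wraparound matters only for long-running measurements.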
In the second phase, the collected data is analyzed and presented to the user. As with debuggers, the presentation of performance analysis tools is classified into two categories:
Command Line (text based)
GUI (windowed)
On IRIX, tools such as perfex(1), SpeedShop(1) and prof fall into the first category while cvperf (in ProDev WorkShop) falls in the latter.
Performance tools on Altix are available from both Intel and the open source community (including SGI contributions). The sections below briefly document the following tools:
VTune (Intel)
gprof (GNU)
pfmon (HP labs)
profile.pl (SGI)
histx (SGI)
VTune (see: http://www.intel.com/software/products/vtune/vlin/) provides call stack sampling as well as comprehensive support for event-based sampling of the Itanium PMU. Two versions of the tool are available. The first requires that the collected data be copied from the Altix to a Windows-based machine, where the analysis takes place under a GUI framework. The second is a command line tool natively hosted on the Itanium system where the data was collected.
For additional information, see http://ssales.corp.sgi.com/products/servers/altix350/intelfaq.html and scroll down to find the comparison chart for VTune 7.1 versus VTune 2.0.
The gprof tool requires that the application being analyzed be compiled with the -pg option of gcc. When run, the resulting program creates a gmon.out file which contains information that can be used to generate three types of reports by the command line based gprof tool:
Flat Profile | Shows how much time your program spent in each function, and how many times that function was called. |
Call Graph | Shows, for each function, which functions called it, which other functions it called, and how many times. |
Annotated Source | Shows how many times each line of the program's source code was executed. |
For further information, see the gprof man page (%man gprof).
The pfmon tool uses the Itanium Performance Monitoring Unit (PMU) to count and sample during runs made on unmodified binaries. It can function on a per-process basis or take a system-wide view on a dedicated CPU or a set of CPUs. It also can monitor events at the user level or at the system level.
The -l option to pfmon lists the (currently 475) supported events. These event names can then be used as arguments to the -e option which specifies which events to monitor. For example, executing the following command will monitor four different events:
%pfmon -ecpu_cycles,ia64_inst_retired_this,nops_retired,back_end_bubble_all a.out
Note that there is no space between the -e option and the name of the first event, or after the commas that separate the event names.
Counting cycles, retired instructions, NOPs, and back-end stall cycles in this way is recommended as the first step in using pfmon.
More information about pfmon can be found in its man page (%man pfmon) and a user guide normally installed under /usr/share/doc/pfmon-2.0/pfmon_usersguide.txt.
profile.pl is a Perl script interface to pfmon. It uses dplace to bind the application to specific processors and invokes other Perl scripts to generate a readable report. It requires that the application contain symbol table information (that is, that it not be stripped). Table 5-5 shows some commonly used options to profile.pl.
Option | Meaning |
|---|---|
-Cprocessor_list | Used by dplace to bind processes to processors |
-Eevent | pfmon event name (CPU_CYCLES is the default) |
-Nnumber | Controls how often sampling is done |
-Ofilename | Puts analysis file into filename (profile.out is the default) |
-K | Keep each CPU sample file and produce a separate report for each CPU |
For more information see the profile.pl(1), analyze.pl(1), and makemap.pl(1) man pages.
SGI Histx is a performance analysis tool designed to complement pfmon. The software is designed to run on Altix systems only. Used internally by SGI developers and benchmarkers, the product is offered as a service to SGI customers with a no fee end-user proprietary license via the SGI Download Cool Software (DCS) Web site. Customers wishing to use SGI Histx should be aware that there is no support planned for this product and customers who use it accept it “as is”.
Histx consists of a group of tools:
First there are three data collection programs:
libfpm | Resembles the perfex tool on IRIX. It supports individual threads and MPI processes, reporting counts of specified events for the entire run of the program. |
samppm | Similar to libfpm, but tracks counts of events as a function of time. The binary output file is then processed by dumppm into a report. |
histx | Provides PC (or, more accurately, instruction pointer, or ip) sampling and call stack sampling. |
Then there are three filters for performance data postprocessing and display:
dumppm | Formats samppm data into a report. |
iprep | Formats histx PC (ip) sampling data into a report. |
csrep | Formats histx call stack sampling data into a report that resembles an IRIX SpeedShop “butterfly” report. |
The histx command does not have a man page; however, typing the command by itself (or %histx -h) will print relevant options. For example:
%histx
usage: histx [-b width] [-f] [-e source] [-h] [-k] -o file [-s type] [-t signo] command args...
-b specify bin bits when using ip sampling: 16,32 or 64 (default: 16)
-e specify event source (default: timer@1)
-f follow fork (default: off)
-h this message (command not run)
-k also count kernel events for pm source (default: off)
-l include line level counts in ip sampling report (default: off)
-o send output to file.<prog>.<pid> (REQUIRED)
-s type of sampling (default: ip)
-t `toggle' signal number (default: none)
Event sources:
timer@N profiling timer events. A sample is recorded
every N ticks.
pm:<event>@N performance monitor events. A sample is
recorded whenever the number of occurrences of
<event> is N larger than the number of occurrences
at the time of the previous sample.
dlatM@N A sample is recorded whenever the number of
loads whose latency exceeded M cycles is N larger
than the number at the time of the previous
sample. M must be a power of 2 between 4 and
4096
Types of sampling:
ip Sample instruction pointer
callstack[N] Sample callstack. N, if given, specifies
the maximum callstack depth (default: 8)
Notes:
A list of valid performance monitor <event>s can be found
in Intel manuals.
`command' must not be compiled using the `-p' compiler flag
One tick is about 0.977 milliseconds
Thus
%histx -e timer@1 -o out ./a.out
will generate the output file out.a.out.XXXX (where XXXX is the process id) which provides the number of timer ticks for each function in the a.out file.
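Tick counts from a timer@1 run can be converted to time using the tick length quoted in the usage message (about 0.977 milliseconds, that is, 1/1024 of a second). The sketch below shows the arithmetic. The function names are hypothetical, and the conversion applies only when every tick is sampled (timer@1); for timer@N each sample represents N ticks.

```c
/* Timer ticks per second, from the tick length of about 0.977 ms
 * stated in the histx usage message (1/1024 second). */
#define TICKS_PER_SECOND 1024.0

/* Converts a tick count from a timer@1 run to seconds. */
double ticks_to_seconds(unsigned long ticks)
{
    return (double)ticks / TICKS_PER_SECOND;
}

/* Percentage of all ticks attributed to one function.
 * Returns 0.0 when the total is zero. */
double tick_share_percent(unsigned long ticks, unsigned long total)
{
    if (total == 0)
        return 0.0;
    return 100.0 * (double)ticks / (double)total;
}
```

For example, a function credited with 256 of 1024 ticks accounts for 25 percent of the samples, or 0.25 seconds of a 1 second run.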