Chapter 3. PCA Command-Line Options

This chapter explains how to use the Power C Analyzer and describes the pca command-line options (also see the pca(1) reference page).

Chapter 2 described how to use PCA by passing –pca and its options to the C compiler. In this mode, PCA is run as a phase of compilation. PCA analyzes the code, adds parallel directives, and then compiles the program.

This chapter explains how to run PCA as a standalone analysis tool. By using pca and reviewing its analysis report, you can try different options and/or code modifications to see their effect on your program. Once you have reached optimum parallelism, you can do a final compilation with the –pca option.

pca Command-Line Syntax

The pca command-line syntax is:

/usr/lib/pca [ options ] ... filename.c

When you specify a command-line option, you can use the long name, the short name, or any portion of the name that uniquely identifies the option (for example, –roundoff or –r). If an option appears more than once on the command line, PCA uses the last occurrence, except for input/output options: PCA does not accept multiple occurrences of input/output options.

The command-line options described in this chapter appear in lowercase letters; however, PCA is not case sensitive.

Table 3-1 lists the pca command-line options, grouped by functional category: concurrentization, general optimization, in-lining and interprocedural analysis, input/output, listing, memory management, invariant IF floating, and portability. For each option, the table gives the long name, short name, and default value.

Table 3-1. pca Command-Line Options

Purpose / Long Name             Short Name      Default Value
---------------------------------------------------------------------
Run code in parallel
  concurrentize                 conc            concurrentize
  noconcurrentize               nconc           concurrentize
  minconcurrent=n               mc=n            minconcurrent=1000

Optimize code
  arclimit=n                    arclm=n         arclimit=2000
  address_resolution_level=n    arl=n           arl=1
  limit=n                       lm=n            limit=5000
  machine=list                  ma=list         machine=s
  nomachine                     nma             machine=s
  optimize=n                    o=n             optimize=5
  roundoff=n                    r=n             roundoff=0
  scalaropt=n                   so=n            scalaropt=3
  syntax=[a|k]                  sy=[a|k]        syntax=a
  unroll=n                      ur=n            unroll=4
  unroll2=n                     ur2=n           unroll2=100

In-lining and Inter-procedural Analysis
  inline[=names]                inl[=names]     (off)
  ipa[=names]                   ipa[=names]     (off)
  inline_create=file            incr=file       (off)
  ipa_create=file               ipacr=file      (off)
  inline_from_files=list        inff=list       current source file
  inline_from_libraries=list    infl=list       (off)
  ipa_from_files=list           ipaff=list      current source file
  ipa_from_libraries=list       ipafl=list      (off)
  inline_depth[=n]              ind[=n]         ind=2
  inline_looplevel[=n]          inll[=n]        inll=2
  ipa_looplevel[=n]             ipall[=n]       ipall=2
  inline_manual                 inm             (off)
  ipa_manual                    ipam            (off)

Input/Output
  cmp=file                      cmp=file        see text
  nocmp                         ncmp            see text
  input[=file]                  i[=file]        see text
  list=file                     l=file          nolist
  nolist                        nl              nolist

Listing
  cmpoptions=list               cp=list         nocmpoptions
  nocmpoptions                  ncp             nocmpoptions
  lines=n                       ln=n            lines=55
  listoptions=list              lo=list         (no listing)
  listingwidth=<80|132>         lw=<80|132>     80

Memory Management
  cacheline=n                   chl=n           chl=64
  cachesize=n                   chs=n           chs=64
  dpregisters=n                 dpr=n           dpr=6
  fpregisters=n                 fpr=n           fpr=12
  setassociativity=n            sasc=n          sasc=1

Invariant IF Floating
  each_invariant_if_growth=n    eiifg=n         eiifg=20
  max_invariant_if_growth=n     miifg=n         miifg=500

Command Line Options for Portability
  DOLLAR                        (none)          off
  FLOAT                         (none)          off
  SIGNED                        (none)          off
  VOLATILE                      (none)          off
  PROCESSORS                    P               P=0
  INLINE_AND_COPY               INLC            off
  STDIO                         STDIO           off

The following sections explain each option, giving its long name, short name, and default value, and indicating whether you can disable it with the no ([n]) notation. The in-lining and interprocedural analysis options are explained in Chapter 7, “In-lining and Interprocedural Analysis.”

PCA runs after the standard C preprocessor. The code examples in this chapter show the original code (before the preprocessor) and the PCA-transformed code (with some of the C preprocessor additions stripped off for clarity).

Concurrentization Options

Concurrentization is the process by which PCA converts code to execute concurrently (in parallel) on multiple processors.

concurrentize

The syntax for this option is:

–[n]conc
–[no]concurrentize  (long name)
–conc               (default value)

The –concurrentize option tells PCA to mark eligible loops to run concurrently (in parallel). The –noconcurrentize option tells PCA not to mark loops to run in parallel but does not prohibit any of the other optimizations that PCA can make.

minconcurrent

The syntax for this option is:

–mc=n
–minconcurrent=n   (long name)
–mc=1000           (default value)

Executing a loop in parallel incurs overhead that varies from loop to loop. If a loop does little work, parallel execution might be slower than serial execution because of this overhead. Above a certain amount of work, however, parallel execution improves performance. You pass this threshold to PCA with the –minconcurrent option.

The value of –minconcurrent must be greater than or equal to 0.

The higher the –minconcurrent value, the larger (more iterations, more statements, or both) the loop body must be in order to run concurrently. To disable this feature and run all possible code in parallel, use the command-line option –minconcurrent=0.

At analysis time, PCA estimates the amount of computation inside a loop. You can see this estimate in the “iteration workload” column of the Loop Summary (see Chapter 8, “Loop Table (l)”). The estimate is roughly the number of operators plus the number of operands, excluding the loop index. PCA takes the workload of one iteration times the number of iterations as the amount of work in the loop, and compares this value with the –minconcurrent value. If the loop bounds are constant and the estimated amount of work is greater than the –minconcurrent value, PCA generates concurrent code for the loop; otherwise, it leaves the loop serial. If the for loop bounds are not known at compilation time, PCA instead generates an if expression in the parallel pragma. The compiler interprets this expression as a request to generate two versions of the loop, one concurrentized and one serial, and evaluates the expression at run time to decide which version to execute.

The following loop illustrates this feature with the –minconcurrent default:

int a[], b[], c[], n;
void example_4_2_2 ()
{
    int i;
    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];
    }
}

becomes:

int a[];
int b[];
int c[];
int n;
void example_4_2_2(  )
{
    int i;
#pragma parallel if(n > 201) byvalue(n) shared(a, b, c) local(i)
#pragma pfor iterate(i=0;n;1)
    for ( i = 0; i<n; i++ ) {
        a[i] = b[i] + c[i];
    }
}

The Loop Summary (from the listing file) shows what PCA concurrentized.

----------------------------Loop Table-------------------------
                              Nest
Loop         Message             Level    Contains Lines
===============================================================
for i                            1        5-7 "example_4_2_2.c"
   1. Concurrent                 1        5-7 "example_4_2_2.c"

PCA calculates that the amount of “work” being done by each iteration is 5 units. At run time, if the iteration count n is less than or equal to 200 (1000/5), the concurrent loop is executed serially; otherwise it is executed in parallel.

If you specify –minconcurrent=0 on the command line, the if(n > 201) clause will be left out of the #pragma parallel, and the loop will always execute in parallel.
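As a rough illustration, the comparison described above can be modelled in a few lines of C. This is a sketch that assumes the decision reduces to a single product-versus-threshold test; the function names are hypothetical, and the exact constant PCA writes into the if clause may differ slightly from this model.

```c
#include <stdbool.h>

/* Simplified model of the -minconcurrent rule (not PCA's actual code):
   a loop qualifies for parallel execution when
   workload-per-iteration * iteration-count > minconcurrent.
   With minconcurrent = 0, any loop with positive work qualifies. */
bool worth_parallelizing(long workload, long iterations, long minconcurrent)
{
    return workload * iterations > minconcurrent;
}

/* Smallest iteration count for which the rule above says "parallel",
   given a per-iteration workload (assumed to be > 0). */
long min_parallel_iterations(long workload, long minconcurrent)
{
    return minconcurrent / workload + 1;
}
```

With the default threshold of 1000 and a workload of 5 units, the model keeps loops of 200 or fewer iterations serial, matching the 1000/5 calculation in the text.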

Optimization Options

The following sections explain each optimization command-line option.

syntax

The syntax for the syntax option is:

-sy=[a|k]
-syntax=[a|k]            (long name)
-syntax=a                (default value)

The syntax option allows you to select the dialect of C that PCA expects. The default dialect is ANSI C (–syntax=a). Specifying –syntax=k instructs PCA to accept traditional, K&R C.

If you don't specify a dialect, PCA will adjust to the actual dialect used in your source.

address_resolution_level

The syntax for this option is:

–arl=n

–address_resolution_level=n        (long name)
–arl=1                             (default value)

The –address_resolution_level option lets you control the assumptions that PCA makes about memory aliases. Table 3-2 lists the levels of control.

An associated directive, #pragma arl=n, has the same meaning as the –arl command-line option (see Chapter 4, “Power C Analyzer Directives,” for details).

Each of the levels described in Table 3-2 is cumulative; that is, specifying arl=3 includes all the actions of arl=1, arl=2, as well as arl=3.

Table 3-2. Address Resolution Levels, arl

Value   Description

0       Make no assumptions about memory aliases.

1       Assume that there are no pointer self-references (the default);
        that is, a pointer will not contain its own address.
        Self-referencing pointers are not common, and this level avoids
        the problem in loops such as:

            int *p;
            ...
            for ( i=0; i<n; i++ ) {
                p[i] = a[i];
            }

        In the example, there could be dependencies from the first
        iteration to the other iterations, since p[0] might be &p.

2       Assume that none of the objects represented by the parameters
        overlap in memory; that is, each argument is distinct from the
        others. This is equivalent to #pragma distinct for all parameters
        (see Chapter 4, “Power C Analyzer Directives,” for a description
        of #pragma distinct).

        This assumption is not safe for most C functions, so PCA assumes
        there is (or could be) parameter aliasing unless you specify
        arl=2 or greater.

3       Assume that globals, parameters, and locals form distinct groups.
        The memory locations referred to using local variables will be
        different from the memory locations referred to using global
        variables, and both of these will be different from the memory
        locations referred to through parameters. For example:

            float *a;
            f(x)
            float x[1000];
            {
                int i;
                float f[1000];
                for ( i=0; i<1000; i++ ) {
                    a[i] = x[i] + f[i];
                }
            }

        PCA will not concurrentize this loop unless you specify arl=3 (or
        greater), which indicates that the arrays a, f, and x are distinct.

4       Assume that there are no aliases for objects; that is, all
        pointers and arrays are distinct from each other. If pointers are
        used, only one name is used to reference an object.
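To see why PCA must be conservative below arl=2, consider a function whose two pointer parameters may or may not overlap. The sketch below (hypothetical names, not taken from PCA) runs the same loop in two iteration orders, the reverse order standing in for one of the orders a parallel schedule might produce. With distinct arrays the results agree; when the caller passes overlapping pointers they differ, which is exactly the situation arl=2 tells PCA to rule out.

```c
#include <stddef.h>

/* Each element of dst receives the matching element of src plus one.
   If dst and src are distinct arrays, every iteration is independent.
   If they overlap, a later iteration can read a value that an earlier
   iteration just wrote, so the visiting order changes the result. */
void add_one_forward(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* Same computation with the iterations visited in reverse order. */
void add_one_backward(int *dst, const int *src, size_t n)
{
    for (size_t i = n; i-- > 0; )
        dst[i] = src[i] + 1;
}
```

Starting from an all-zero array a[5], add_one_forward(a + 1, a, 4) leaves a[4] equal to 4, while add_one_backward(a + 1, a, 4) leaves it equal to 1.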


scalaropt

The syntax for this option is:

–so=n

–scalaropt=n        (long name)
–scalaropt=2        (default value)

The –scalaropt option sets the level of scalar optimization PCA will perform. Scalar optimizations include dusty-deck transformations, dead code elimination, and loop unrolling.

The parameter sets the optimization level as described in Table 3-3.

Scalar optimizations are discussed in detail in “Scalar Optimizations” in Chapter 6.

Table 3-3. Scalar Optimization Levels

Value   Description

0       Perform no scalar optimizations.

1       Perform only simple scalar optimizations, such as dead-code
        elimination, global forward substitution, and dusty-deck IF
        transformations. Perform code floating if –roundoff >= 1.

2       Perform the full range of scalar optimizations. Float invariant
        IFs out of loops. Recognize induction variables. Reroll loops,
        expand arrays, peel loops, and perform loop fusion.

3       Enable memory management if –roundoff=3. Allow dead-code
        elimination of unnecessary program fragments during output
        conversion. Other optimizations might expose more dead code.


limit

The syntax for this option is:

–lm=n

–limit=n            (long name)
–lm=5000            (default value)

PCA estimates how much time it would need to analyze each loop-nest construct. If a nest of loops is too deep, PCA ignores the outer loop and recursively visits the inner loops until it finds a nest of loops that is not too deep. The –limit option is the upper threshold of the amount of work that controls what PCA thinks is “too deep.”

Larger loop-nest limits might allow PCA to convert the outer loops of a deeply nested loop structure to run in parallel. (Running the outermost loop in parallel usually results in the best performance increase.) But larger loop-nest limits can increase the analysis time. The limit does not correspond to the for loop-nest level. It is an estimate of the number of loop orderings that PCA can generate from a loop-nest. The –limit option resets this internal limit.


Note: This limit is adequate for most programs. If your program is extremely complex, you might want to increase this limit.


arclimit

The syntax for this option is:

–arclm=n

–arclimit=n         (long name)
–arclm=2000         (default value)

The –arclimit option sets the size of the “dependence arc data structure” that PCA uses to perform data-dependence analysis. (See Appendix B, “Data-Dependence Analysis,” for a description of data-dependence analysis.) This data structure is dynamically allocated on a loop-nest by loop-nest basis.

The formula PCA uses to estimate the number of dependence arcs for a given loop-nest is:

array_size = max (#_of_statements * 4, arclimit value)

PCA assumes that each statement will have four dependence arcs (a worst-case estimate).

When you include the Loop Summary in the listing file (–listoptions=l), PCA marks any loop that was too complex for the dependence data structure to hold the information. The following example shows the Loop Summary (from the listing file for a PCA run with the value –arclimit=200). In this example, PCA detected that the given loop, which contained 123 statements, had too many dependence arcs for the data structure as allocated. The storage that was allocated for the dependence arc array had been:

max(123 * 4 , 200) = 492
The Loop Summary looks like this:
---------------------------Loop Table--------------------------
                               Nest
Loop         Message          Level    Contains Lines
===============================================================
for i                         1      5-129 "example_4_3_5a.c"
    1. Scalar                 1      5-129 "example_4_3_5a.c"
             Line:5  Data dependence analysis aborted due to insufficient storage for graph arcs.

Suppose for the above example, you change the –arclimit value to something greater than 492. PCA might be able to optimize the given loop provided that there are no data-dependence violations. The next example shows the Loop Summary after setting –arclimit=2000.

--------------------------Loop Table---------------------------

                               Nest
Loop         Message          Level    Contains Lines
=============================================================== 
for i                           1      5-129 "example_4_3_5b.c"
    1. Concurrent               1      5-129 "example_4_3_5b.c"

The maximum valid –arclimit value is 2000. If you specify a value greater than 2000, PCA defaults to allocating 2000 for the data-dependence array. PCA gives no warning when it does this.
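The sizing rule can be written out as a small C function. This is a sketch, not PCA source: the function name is hypothetical, and it assumes that the silent cap of 2000 applies to the –arclimit value before the max() is taken.

```c
/* Hypothetical model of the dependence-arc sizing rule described above:
   budget four arcs per statement (a worst-case estimate), never allocate
   less than the -arclimit value, and cap -arclimit at 2000. */
#define ARCLIMIT_MAX 2000

long arc_array_size(long statements, long arclimit)
{
    long limit = arclimit > ARCLIMIT_MAX ? ARCLIMIT_MAX : arclimit;
    long estimate = statements * 4;
    return estimate > limit ? estimate : limit;
}
```

For the 123-statement loop in the example, arc_array_size(123, 200) gives 492, and arc_array_size(123, 2000) gives 2000.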

machine

The syntax for this option is:

–[n]ma=list

–[no]machine=list    (long name)
-ma=s                (default value)

The –machine option is list-valued. It has three valid values: n, o, and s. Table 3-4 defines these values.

Table 3-4. machine Values

Value   Description

n       Prefer nonstride-1 array access over stride-1 array access. For
        some arrays, nonstride-1 array access provides the best
        performance.

o       Do not consider innermost loops for parallel execution. If a loop
        does not do very much, running the loop in parallel might take
        longer than running the loop serially because of the overhead.
        PCA makes decisions concerning the overhead:benefit ratio when it
        evaluates a loop for parallel execution. If the loop bounds are
        unknown at analysis time, PCA might generate concurrent code for
        innermost loops (depending on the minconcurrent value), a
        practice that might be inefficient for the actual loop bounds.

s       Prefer a for loop that generates stride-1 (contiguous) references
        over one that generates nonstride-1 references when PCA must
        choose only one to mark for parallel execution. This value
        typically generates the most efficient code, and is the default.

If you change the machine option to include choices other than the default value (–machine=s), you must also include the default value s if it is still to be in effect. For instance, if you want to tell PCA not to try to run inner loops concurrently (option value o) but to consider all other eligible loops for parallel execution (option value s) you must specify

–machine=os

If you specify –machine=o, you enable NO-INNER-LOOPS, but disable the default (prefer stride-1) option. You can use any combination of the three choices, except for the self-contradicting combination of s (prefer stride-1) and n (prefer nonstride-1).

To disable all machine values, enter the following on the command line:

–nomachine
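The stride-1 versus nonstride-1 distinction comes from how C lays out arrays in row-major order. The sketch below (array sizes and function names are illustrative) walks the same two-dimensional array in two loop orders: one whose inner loop makes stride-1 references and one whose inner loop makes nonstride-1 references. Both compute the same sum; they differ only in the memory access pattern, which is the property the s and n values refer to.

```c
#include <stddef.h>

#define ROWS 4
#define COLS 3

/* Row-major layout: m[i][j] sits at offset i*COLS + j from m[0][0]. */
size_t row_major_offset(size_t i, size_t j)
{
    return i * COLS + j;
}

/* Inner loop over j: consecutive accesses are 1 element apart (stride-1). */
long sum_rows_first(int m[ROWS][COLS])
{
    long s = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Inner loop over i: consecutive accesses are COLS elements apart
   (nonstride-1). Same result, different memory access pattern. */
long sum_columns_first(int m[ROWS][COLS])
{
    long s = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```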

optimize

The syntax for this option is:

–o=n

–optimize=n        (long name)
–o=5               (default value)

The –optimize option sets the optimization level, ranging from the integer 0 (minimum optimization) to the integer 5 (maximum optimization). Each optimization level is cumulative: level 5 performs all optimizations made by the previous levels. Table 3-5 describes optimization levels.

A higher optimization level results in more optimization along with increased analysis time. Many programs written for a parallel processing environment do not need advanced transformations; with these programs, a lower optimization level is enough.

Table 3-5. Optimization Levels

Value   Description

0       Do not mark code for parallel execution.

1       Mark eligible code for parallel execution.

2       Apply for loop interchanging techniques and recognize sum
        reductions as safe for parallel execution. (PCA doesn't mark sum
        reduction loops for parallelization unless roundoff is 2 or
        greater.) Use lifetime analysis to determine when the code needs
        last-value assignment of scalars to make a loop safe to run in
        parallel. Use more powerful data-dependence tests to find more
        loops that can run safely in parallel.

3       Recognize linear recurrences as safe for parallel execution. Use
        loop interchanging, when possible, to improve memory referencing.
        This level also allows loop interchanging for triangular loops.
        Use special-case data-dependence tests to find more loops that
        can run safely in parallel. Recognize special index sets
        (wrap-around variables) as safe for parallel execution.

4       Split a loop in two, if necessary, to break a data-dependence
        arc. Use exact data-dependence tests to find more loops that are
        safe to run in parallel. Enable loop unrolling.

5       Transform two adjacent loops into a single loop. Use
        data-dependence tests to allow fusion of more loops than possible
        with standard techniques.


roundoff

The syntax for this option is:

–r=n

–roundoff=n         (long name)
–r=0                (default value)

The –roundoff option controls whether PCA runs reductions (for example, the summing of an array of values) in parallel. When a reduction runs serially, all operations occur in the same order, so the roundoff error is the same from one execution of the code to the next. But when a reduction runs in parallel, the separate threads of execution do not perform the operations in the same order as the serial version. Thus, the roundoff error can differ from that of the serial version. Furthermore, the roundoff error of the multiprocess version can vary from one run to the next. Often, roundoff error is not important.

Unfortunately, some algorithms (for example, branching on an exact match) are sensitive to even small differences in roundoff error. If your code is sensitive to roundoff error, you can tell PCA not to allow reductions in the code it converts to run in parallel. This guarantees that the results of the multiprocess code are always the same as those of the serial version. That is why the default value of –roundoff is 0 (no arithmetic reductions).
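The effect is easy to reproduce without any threads. The sketch below (function names are hypothetical) sums the same four values once in index order and once the way a parallel reduction with local accumulation might: as two partial sums combined at the end. On a typical IEEE 754 system with round-to-nearest double arithmetic, the results differ, because whether each added 1.0 survives rounding depends on what it is added to.

```c
#include <stddef.h>

/* Serial reduction: operations always occur in index order. */
double sum_serial(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Model of a parallel reduction with local accumulation: each of
   `nchunks` hypothetical threads sums a contiguous block into its own
   partial result, and the partial results are combined at the end.
   Assumes n is a multiple of nchunks, to keep the sketch short. */
double sum_chunked(const double *x, size_t n, size_t nchunks)
{
    size_t len = n / nchunks;
    double total = 0.0;
    for (size_t c = 0; c < nchunks; c++) {
        double partial = 0.0;
        for (size_t i = c * len; i < (c + 1) * len; i++)
            partial += x[i];
        total += partial;
    }
    return total;
}
```

For the values {1e17, 1.0, -1e17, 1.0}, the serial sum is 1.0 and the two-chunk sum is 0.0. Neither summation is wrong; they simply round differently, which is why –roundoff=0 keeps PCA from parallelizing arithmetic reductions.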

Each –roundoff level is cumulative (level 3 performs everything up to and including this level). Table 3-6 describes the roundoff levels.

Table 3-6. roundoff Levels

Value   Description

0       Do not convert reductions to run in parallel (the default). In
        particular, PCA does not convert arithmetic recurrences and
        arithmetic reductions (such as SUM and PRODUCT) to run in
        parallel. PCA can still convert nonarithmetic reductions (such as
        MAX of a vector) to run in parallel.

1       Allow PCA to simplify expressions with operands that are between
        binary and unary operators. Allow expression simplification due
        to forward substitution. Allow code floating if –scalaropt >= 1.
        The same as 0 for reductions.

2       Allow PCA to mark reductions to run concurrently. Allow loop
        interchanging around arithmetic reductions. Perform concurrent
        reductions with pre-scheduled concurrent loops and local
        accumulation of reduction results. Thus, the answers can vary
        from one execution to the next.

3       Recognize real (float) induction variables. Enable memory
        management if –scalaropt=3.


unroll and unroll2

The syntax is:

–ur=n
–unroll=n             (long name)
–ur=4                 (default value)
–ur2=n
–unroll2=n            (long name)
–ur2=100              (default value)

The –unroll and –unroll2 options control how PCA unrolls scalar inner loops. In most cases, when PCA cannot convert a loop to execute concurrently, it can unroll the loop to improve performance. (More work per iteration with fewer iterations gives less loop overhead.) Unrolling is enabled only at –optimize=4 or higher, so –unroll and –unroll2 have no effect at lower optimization levels. Table 3-7 describes the unroll values.

Table 3-7. unroll Values

Value   Description

0       Use the default values to unroll.

1       Do no unrolling.

n>=2    At most, unroll n iterations.

For example, the default (4,100) means at most four iterations, and a maximum work per unrolled iteration of 100.

You can control unrolling in two ways: by the number of iterations, and by the “work per unrolled iteration” factor. To use the “work per unrolled iteration” factor, PCA computes a rough estimate of the computational work inside the loop for one iteration, based on the following criteria:

    number of assignments +
    number of if statements +
    number of subscripts +
    number of arithmetic operations

The following example assumes unroll=8 and unroll2=100.

int a[], b[], n;
void example_4_3_9 ()
{
    int i;
    for (i=0; i<n; i++)
        a[i] = b[i] / a[i-1];
}

This example has:

1 assignment

0 ifs

3 subscripts

1 arithmetic operator

----------------------------

5 is the weighted sum (the work for 1 iteration)

PCA then divides this into 100 to give an unroll factor of 20. But eight was specified for the maximum number of unrolled iterations. PCA takes the minimum of the two values (8) and unrolls that many iterations. The maximum number of iterations that PCA can unroll is 100. If you request more than that number, PCA gives no warning of its inability to comply.
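The selection just described can be sketched as a small C function. It is a model of the text, not PCA source: the function name is hypothetical, and treating unroll=0 as "use the default of 4" is an assumption based on Table 3-7.

```c
/* Hypothetical model of unroll-factor selection: divide the work budget
   (unroll2) by the work of one iteration, take the smaller of that and
   the iteration limit (unroll), and never exceed 100 iterations. */
#define UNROLL_HARD_MAX 100

int unroll_factor(int unroll, int unroll2, int work_per_iteration)
{
    if (unroll == 0)            /* assumption: 0 selects the default of 4 */
        unroll = 4;
    if (unroll == 1 || work_per_iteration <= 0)
        return 1;               /* no unrolling */
    int by_work = unroll2 / work_per_iteration;
    int factor = by_work < unroll ? by_work : unroll;
    if (factor > UNROLL_HARD_MAX)
        factor = UNROLL_HARD_MAX;
    return factor < 1 ? 1 : factor;
}
```

For the example above, unroll_factor(8, 100, 5) compares 100/5 = 20 with 8 and returns 8.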

When the number of iterations is unknown, PCA generates two loops: the primary unrolled loop and a cleanup loop that ensures the number of iterations in the main loop is a multiple of the unrolling factor.

For example:

int a[];
int b[];
int n;
void example_4_3_9(  ) 
{
    int i;
    int _Kii1;
    _Kii1 = (n)%(8);
    for ( i = 0; i<_Kii1; i++ ) {
        a[i] = b[i] / a[i-1];
    }
    for ( i = _Kii1; i<n; i+=8 ) {
        a[i] = b[i] / a[i-1];
        a[i+1] = b[i+1] / a[i];
        a[i+2] = b[i+2] / a[i+1];
        a[i+3] = b[i+3] / a[i+2];
        a[i+4] = b[i+4] / a[i+3];
        a[i+5] = b[i+5] / a[i+4];
        a[i+6] = b[i+6] / a[i+5];
        a[i+7] = b[i+7] / a[i+6];
    }
}
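The same cleanup-then-main-loop structure can be checked on a loop that has no dependence. In the sketch below, the function names and the loop body are hypothetical, chosen only for illustration, and the unroll factor is 4 rather than 8 to keep it short. The unrolled version should produce the same results as the simple loop for any n.

```c
#include <stddef.h>

/* Reference version: y[i] = 2 * x[i] for every i. */
void scale_simple(int *y, const int *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = 2 * x[i];
}

/* Same loop unrolled by 4, structured like the PCA output above:
   a cleanup loop first handles n % 4 iterations, so the main loop
   always runs a whole number of 4-iteration blocks. */
void scale_unrolled4(int *y, const int *x, size_t n)
{
    size_t rem = n % 4;
    size_t i;
    for (i = 0; i < rem; i++)
        y[i] = 2 * x[i];
    for (i = rem; i < n; i += 4) {
        y[i]     = 2 * x[i];
        y[i + 1] = 2 * x[i + 1];
        y[i + 2] = 2 * x[i + 2];
        y[i + 3] = 2 * x[i + 3];
    }
}
```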

Input-Output Options

The following sections explain the function of each option that affects PCA's input-output file selection.

cmp

The syntax for this option is:

–[n]cmp=file

–[no]cmp=file              (long name)
standard output            (the default)

The –cmp (compile file) option tells PCA to write the optimized C program to a file. If you specify –cmp=file, PCA writes the transformed C to the specified file. The default file for the transformed code is standard output.

If you use –cmp without a file name, PCA writes the transformed code to file.M, where file is the input file name from the command line with the trailing .c (if any) stripped off. (See the following description of the –input option for a special case.)

To tell PCA not to generate a C output file, enter

–nocmp

on the command line.

input

The syntax for this option is:

–i=file

–input=file     (long name)
no default

Usually, you will simply include the input file name on the command line. The –input=file option is an alternative way of specifying the input file.

Specifying –input without a file name tells PCA to read the source file from standard input. Then PCA writes the transformed code and (optional) listing file to standard output unless you use the –cmp and –list options to give explicit file names.

list

The syntax for this option is:

–[n]l=file

–[no]list=file   (long name)
–nolist          (the default)

The –list option tells PCA where to write the listing you request when you use the –listoptions option. If you specify –list=file, PCA writes the listing to the specified file. To explicitly disable generation of the listing file, enter

–nolist

on the command line.

If you specify –list without a file name, PCA writes the listing file to file.L, where file is the input file name with the trailing .c (if any) stripped off. (See the previous description of the –input option for a special case.)

If you do not use the –list option, but do use –listoptions=list, PCA writes the listing file to standard output. PCA writes all diagnostic messages, syntax errors, and so forth, to standard error.
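The default naming rule for –cmp and –list is simple enough to express in a few lines. The sketch below is a hypothetical helper, not part of PCA: it strips one trailing .c from the input name and appends the suffix, .M for the transformed code or .L for the listing. Directory components are left untouched, since the text does not say otherwise.

```c
#include <stdio.h>
#include <string.h>

/* Build "<input without trailing .c><suffix>" into buf.
   Returns 0 on success, -1 if buf is too small. */
int pca_output_name(const char *input, const char *suffix,
                    char *buf, size_t size)
{
    size_t len = strlen(input);
    if (len >= 2 && strcmp(input + len - 2, ".c") == 0)
        len -= 2;                       /* drop the trailing ".c" */
    int written = snprintf(buf, size, "%.*s%s", (int)len, input, suffix);
    return (written < 0 || (size_t)written >= size) ? -1 : 0;
}
```

For example, prog.c yields prog.M for –cmp and prog.L for –list, while an input named prog (no .c) yields prog.M and prog.L.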

Listing Options

The following sections explain the function of each listing option. You must use these options in conjunction with the –list option.

listingwidth

This option sets the maximum line length for the listing file produced by PCA. The syntax for this option is:

–lw=[132|80]

–listingwidth=[132|80]   (long name)
–lw=80                   (the default)

The line length affects the format of the loops summary table (produced by –lo=l) and the PCA options table (–lo=k). The default line length is 80, convenient for use on most terminals. The 132 column width is optimal for most line printers. No other values are allowed at present.

cmpoptions

The syntax for this option is:

-[n]cp=i

-[no]cmpoptions=i      (long version)
-ncp                   (default value)

The cmpoptions flag specifies additional information for inclusion in the transformed (.cmp) file. PCA currently supports only the i value for cmpoptions, which directs PCA to include special line-number directives.

Special line numbers are # line directives that can appear in the transformed program file to reference line numbers of the original source code. The line in the transformed code immediately following a # line directive is either the transformed version of the referenced line or a line inserted by PCA just before the referenced line. PCA includes the name of the source file in the form in which it appeared on the command line.

In the unrolled loop below, the for in the original source code was on line 7, and the assignment on line 8:

# line 7 "../csource/unr5.c"
   for ( i = il + 1; i<=n; i+=3) {
# line 8 "../csource/unr5.c"
      a[i] = b[i] / a[i-1];
# line 8 "../csource/unr5.c"
      a[i+1] = b[i+1] / a[i];
# line 8 "../csource/unr5.c"
      a[i+2] = b[i+2] / a[i+1];
   }

lines

The syntax for this option is:

–ln=n

–lines=n             (long name)
–ln=55               (default value)

The –lines option tells PCA to paginate the listing file for printing. Use the –lines option to change the number of lines printed per page. The –lines=0 option tells PCA to paginate only at subroutine boundaries.

listoptions

The syntax for this option is:

–lo=list

–listoptions=list     (long name)
no listing            (the default)

The –listoptions option tells PCA what information to include in the listing file.

Table 3-8 describes the –listoptions values.

Table 3-8. listoptions Values

Value   Description

c       Print the Calling Tree of the entire program.

i       Insert line numbers into transformed code referencing line
        numbers of the original.

k       Print PCA options used at the end of the listing.

l       Print the loop-by-loop optimization table.

n       Print program unit names, as processed, in the error file.

p       Print the analysis performance statistics.

s       Summarize loop optimizations.

The transformed code is always recorded in the transformed code file, whether or not you request a listing file.

Memory Management Options

These options set parameters which PCA uses to optimize memory hierarchy usage. You can obtain better optimization of memory reference patterns if you know how much data can be kept in fast memory, such as cache or arithmetic registers, and the costs of moving data in the memory hierarchy. To enable memory management, you must set –scalaropt=3 and –roundoff=3.

cacheline

The syntax of this option is:

-chl=n

-cacheline=n        (long version)
-chl=16             (default value)

Use the cacheline option to inform PCA of the width in bytes of the memory channel between cache and main memory.

cachesize

The syntax of this option is:

-chs=n

-cachesize=n        (long version)
-chs=64             (default value)

Use the cachesize option to inform PCA of the size in kilobytes of the cache memory.

dpregisters

The syntax of this option is:

-dpr=n

-dpregisters=n       (long version)
-dpr=6               (default value)

The dpregisters option specifies the number of double-precision floating point registers each processor has.

spregisters

The syntax of this option is:

-spr=n

-spregisters=n      (long version)
-spr=12             (default value)

The spregisters option specifies the number of single-precision floating point registers each processor has.

setassociativity

The syntax of this option is:

-sasc=n

-setassociativity=n  (long version)
-sasc=1              (default value)

The setassociativity option provides information on the mapping of physical addresses in main memory to cache pages. The default, 1, specifies that a datum in main memory can be placed in only one place in cache. If this cache page is in use, its current contents must be dropped in order to copy the new page into cache.
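To make the default concrete: with a set associativity of 1 (a direct-mapped cache), every address maps to exactly one place in the cache, so two addresses that map to the same set evict each other. The sketch below uses the standard set-index formula as a generic illustration, not PCA's internal model; the function name is hypothetical, and its parameters correspond only loosely to –cachesize, –cacheline, and –setassociativity.

```c
/* Generic model of how a set-associative cache maps a byte address to a
   set. sets = cache size / (line size * ways); the set index is the line
   number modulo the number of sets. */
unsigned long cache_set(unsigned long addr, unsigned long cachesize_kb,
                        unsigned long line_bytes, unsigned long ways)
{
    unsigned long sets = cachesize_kb * 1024 / (line_bytes * ways);
    return (addr / line_bytes) % sets;
}
```

With a 64 KB direct-mapped cache and 16-byte lines, addresses 0 and 65536 land in the same set, so alternating between them evicts the line each time. With two ways they still share a set, but each set can hold both lines.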

Invariant IF Floating Options

You can use two options to control how much code expansion PCA will allow when expanding invariant-IF loops. The options are each_invariant_if_growth and max_invariant_if_growth. Use these options to control the code growth of a program unit, that is, a subroutine, function, or main procedure. Each option has a product-specific default.

The syntax of these options is given in Table 3-9.

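Table 3-9. Invariant-IF Options

Long Form                    Short Form   Valid Range   Default Value
each_invariant_if_growth=    eiifg=       0–100         50
max_invariant_if_growth=     miifg=       0–1000        500

Consider a loop that contains an IF whose condition does not change inside the loop (an invariant IF):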


for (i= …) {
   section-1
   if ()
      section-2
   else
      section-3
   section-4
}

The each_invariant_if_growth option controls the allowed sizes of sections 1 and 4, where size is the number of user-visible executable statements. If sections 1 and 4 are smaller than the value of each_invariant_if_growth, then the invariant IF will be floated as shown below:

if ()
   for (i= …) {
      section-1
      section-2
      section-4
   }
else
   for (i= …) {
      section-1
      section-3
      section-4
   }

The max_invariant_if_growth option sets a threshold that acts as a regulatory mechanism for the invariant-IF transformation. Whenever code growth (measured in user-visible executable statements) in a program unit has exceeded this threshold, PCA will only perform invariant-IF floating in that program unit if there is no code replication. In the example above, no code replication would be necessary in the original loop nest if sections 1 and 4 were absent.
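A concrete C version of this transformation (commonly called loop unswitching) may help. In the sketch below, the function names and loop bodies are hypothetical; each commented statement plays the role of one numbered section. The floated version replicates sections 1 and 4 into both loops, which is the code growth that each_invariant_if_growth and max_invariant_if_growth limit.

```c
#include <stddef.h>

/* Original form: the test on `flag` does not depend on i, yet it is
   evaluated on every iteration. */
void update_original(int *a, size_t n, int flag)
{
    for (size_t i = 0; i < n; i++) {
        a[i] += 1;            /* section-1 */
        if (flag)
            a[i] *= 2;        /* section-2 */
        else
            a[i] -= 3;        /* section-3 */
        a[i] += 10;           /* section-4 */
    }
}

/* Floated form: the invariant IF is tested once, and sections 1 and 4
   are replicated into both copies of the loop. */
void update_floated(int *a, size_t n, int flag)
{
    if (flag) {
        for (size_t i = 0; i < n; i++) {
            a[i] += 1;
            a[i] *= 2;
            a[i] += 10;
        }
    } else {
        for (size_t i = 0; i < n; i++) {
            a[i] += 1;
            a[i] -= 3;
            a[i] += 10;
        }
    }
}
```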

Command Line Options for Portability

These options are provided for easy portability among compilers. Note that most of these options currently have no short names.

DOLLAR, (no short name), (off)

The dollar command-line option allows dollar signs to be used in identifiers under both ANSI C and Kernighan and Ritchie C.

For example, the following program will work correctly under either ANSI mode or Kernighan and Ritchie mode if the dollar option is enabled.

#include <stdio.h>

int $i=121961;
int main()
{
    printf("$i is %d.\n",$i);
    return 0;
}

FLOAT, (no short name), (off)

Under Kernighan and Ritchie C, all variables declared as type float are promoted to type double before arithmetic operations are performed on them.

The float option prevents this promotion to double, that is, all variables declared as type float remain type float.

This option is ignored under ANSI C, since the default behavior of ANSI C treats float variables as float with no promotion to double.

SIGNED, (no short name), (off)

By default, a variable declared as type char is interpreted as an unsigned char. The signed option causes variables declared as type char to be interpreted as type signed char.

This option is sometimes necessary when porting code from other platforms whose C compiler defaults char to signed char.
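A quick way to see which convention a compiler uses, and why it matters when porting, is sketched below. The function names are hypothetical; CHAR_MIN from <limits.h> is negative exactly when plain char is signed.

```c
#include <limits.h>

/* Nonzero if plain char is signed on this compiler, zero if unsigned. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* The byte 0xFF read back as a plain char: -1 where char is signed
   (on two's-complement machines), 255 where it is unsigned. Code that
   tests such values (for example, c < 0) behaves differently on the two. */
int byte_ff_as_char(void)
{
    char c = (char)0xFF;
    return c;
}
```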

VOLATILE, (no short name), (off)

The volatile option indicates that all variables are implicitly volatile.

Use of this option severely limits the optimization that can be done.

PROCESSORS, P, P=0

By default (P=0), PCA optimizes for an unknown number of processors.

Some concurrency optimizations require knowing the number of processors that are available. If this number is known at compile time, the generated code is more efficient.

If you specify P=1, PCA turns off concurrency.

INLINE_AND_COPY, INLC, (off)

The inline_and_copy option functions like the inline option, except that if all calls or references to a subprogram are inlined, the text of the routine is not optimized but is instead copied unchanged to the transformed code file. This option is intended for use when inlining routines from the same file as the call, and has no special effect when the routines being inlined are taken from a library or another source file.

After a subprogram has been inlined everywhere it is used, leaving it unoptimized saves compilation time. When a program involves multiple source files, the unoptimized routine will still be available in case one of the other source files contains a reference to it, so no errors will result.


Note: The inline_and_copy algorithm assumes that all calls and references to the routine precede it in the source file.

If the routine is referenced after the text of the routine and that particular call site cannot be inlined, the unoptimized version of the routine will be invoked.

STDIO, STDIO, (off)

The stdio option instructs PCA to perform strength reduction on calls to certain functions in the standard I/O library.

Programs which use functions such as printf heavily will generally have improved I/O performance when this is done.

The -scalaropt=3 option is required to enable this transformation.

Summary

This chapter described the details of the pca command-line options and explained how to use PCA as a standalone analyzer to mark code to run on multiple processors. The next four chapters present additional ways of obtaining concurrentized code. These chapters describe:

  • PCA directives that you can insert into the code

  • Compiler directives that the multiprocessing C compiler recognizes

  • PCA transformations that optimize concurrentization of a loop

  • In-lining and interprocedural analysis that streamline function calls