This chapter describes how to use IRIS Power C to get the maximum amount of code to run in parallel. You can use the Power C Analyzer either in conjunction with the cc compiler (as Chapter 1, “Compiling with IRIS Power C” explained) or as a standalone analyzer (as described in Chapter 3, “PCA Command-Line Options”).
What does PCA do to your code? It inserts directives into the code that tell the compiler to run loops in parallel. It does not change the logic of your program, although, if requested, it rewrites portions of the program to run faster. This chapter explains what happens to the code when you use pca and explains how to use pca as an analyzer.
The Power C Analyzer is a powerful tool that can:
- direct C code to run in parallel
- determine data dependencies
- distribute well-behaved loops and certain other code across multiple processors
- optimize source code
Power C is neither an extension to the C language nor a way to concurrentize arbitrary C programs. If the Power C directives are removed or ignored, the code remains a valid serial program. This allows full source code compatibility with other non-multiprocessing systems.
This section explains some of the ways to use PCA and what happens during the stages of compilation. Figure 2-1 shows the program flow through the C compiler. When you compile a program you have the option of specifying –pca (the Power C Analyzer and multiprocessing compiler) or –mp (the multiprocessing compiler only).
The first stage invokes cpp to handle cpp directives. (For more information, see cpp(1) in the IRIX User's Reference Manual.) If you specify –pca, PCA puts directives into the cpp output that allow data-independent loops to run in parallel. As explained in Chapter 1, “Compiling with IRIS Power C” you can use the list and keep options to direct PCA to generate a listing file (with the suffix .L) and/or an intermediate multiprocessing source file (with the suffix .M). You can also hand-insert directives into the code prior to compilation. Finally, the multiprocessing C compiler compiles the PCA-transformed file to produce an executable object file. You can run this executable on any Silicon Graphics IRIS system, single-processor or multiprocessor; the executable adapts to the number of processors present at execution time.
Figure 2-2 shows how PCA internally processes the C source code. PCA parses the source into an internal representation, performs data dependency analysis and transformations, generates C source code from the internal representation, and produces intermediate C code.
If you specify the multiprocessing compiler with the –mp or –pca option, it identifies parallel directives, rewrites the parallel code with explicit runtime calls, processes the C code, and generates the executable object. Figure 2-3 shows this code flow.
You can use Power C in different ways to produce concurrentized code: automatically, manually, and iteratively. If you use Power C automatically as a phase of compilation, PCA concurrentizes parts of the code that it determines will safely run in parallel and then compiles the code. You will probably invoke pca this way when you first begin using Power C. Figure 2-4 shows the process flow.
Rather than using the Power C Analyzer, you can enter directives manually to concurrentize the code. This method is shown in “Examples” toward the end of this chapter. Figure 2-5 shows the program flow when you hand insert the multiprocessing directives.
In addition to the automatic and manual methods of compilation, you can also use Power C iteratively, as shown in Figure 2-6. This method is best for getting the maximum amount of code to run in parallel.
Now that you have seen the different ways you can use Power C, the next step is to learn how to concurrentize as much of your code as possible. The sections that follow illustrate the iterative method outlined in Figure 2-6 and describe how Power C can help you optimize a program and run more of it in parallel.
First, use the listing file to see exactly which loops were concurrentized. To generate a listing file, use the list option. (Refer to Chapter 1, “Compiling with IRIS Power C” for more information on the list option.) The listing file contains the annotated pca listing of the parts of the program that can (and cannot) run in parallel on multiprocessors. The content of this file varies depending on the value of the –lo command-line option (see “listoptions” in Chapter 3).
For example, suppose you have a simple C program named test.c that looks like this:
int a[1000], sum;
void example_3_2_2 ()
{
int i;
sum = 0;
for (i=0; i<1000; i++)
a[i] = i;
for (i=0; i<1000; i++)
sum += a[i];
printf ("The sum is %d\n", sum);
}
Now, let PCA do the work for you. Compile the program by passing the –pca option to cc, and generate a listing file by using the list option.
To compile the program, enter:
cc -c -pca list test.c
which produces a PCA listing file, test.L.
The listing file looks similar to this:
---------------------------Loop Table--------------------------
Nest
Loop Message Level Contains Lines
===============================================================
for i 1 6-7 "test.c"
Original Loop Split Into Sub-Loops
1. Concurrent & Enhanced Scalar
1 6-7, 9 "test.c"
Line:6 Loop unrolled 4 times to improve scalar performance.
Line:6 Loop has been fused with others to reduce overhead.
2. Enhanced Scalar 1 6, 9 "test.c"
Line:6 Loop unrolled 4 times to improve scalar performance.
Line:9 Data dependence involving this line
due to variable "sum".
Line:9 Loop has been fused with others to reduce overhead.
for i 1 8-9 "test.c"
2 loops total
1 loops concurrentized
1 this loop has been fused with other loops
PCA was able to concurrentize the first for loop; the initialization of array a executes in parallel on the available processors. PCA was unable to concurrentize the second loop because of the data dependence on the variable sum, as the listing reports. However, in subsequent sections, you will see how to get PCA to concurrentize the other for loop. (For more information on the listing file, see Chapter 8, “The PCA Listing.”)
To generate a multiprocessing source file as well as a listing file, use the keep option. (Refer to Chapter 1, “Compiling with IRIS Power C” for more information on the keep option.) This option produces a test.L file and an intermediate file, test.M.
To generate these files, enter:
cc -c -pca keep test.c
which yields a listing file, test.L, and an intermediate file containing concurrentized C source code, test.M. The content of the file, test.M, appears in the example that follows:
int a[1000];
int sum;
void example_3_2_2( )
{
int i;
int _Kii1;
sum = 0;
#pragma parallel shared(a) local(_Kii1)
#pragma pfor iterate(_Kii1=0;1000;1)
for ( _Kii1 = 0; _Kii1<=999; _Kii1++ ) {
a[_Kii1] = _Kii1;
}
for ( _Kii1 = 0; _Kii1<=999; _Kii1++ ) {
sum += a[_Kii1];
}
printf( "The sum is %d\n", sum );
}
The test.M file shows that PCA finds it can run a for loop in parallel. PCA inserts the directives:
#pragma parallel shared(a) local(_Kii1)
and
#pragma pfor iterate(_Kii1=0;1000;1)
into the code. The #pragma parallel tells the multiprocessing C compiler to start a parallel region. The phrase shared(a) tells the compiler that all processes that execute the for loop share the array a. The phrase local(_Kii1) indicates that every process executing the loop has its own local copy of the variable _Kii1.
The #pragma pfor tells the compiler that the next for loop can run in parallel, and the loop statements are executed on the processors available. The phrase iterate gives the multiprocessing C compiler the information it needs to uniquely identify the iterations of the loop and partition them to particular threads of execution.
As you can see in the previous examples, using PCA without options has some drawbacks. If roundoff error is not critical, the second for loop should also be able to safely run in parallel. All that's needed is to pass an option to pca, specifically the –roundoff (–r) option. To do this, enter:
cc -pca keep -WK,-r=2 test.c
The –r=2 option tells PCA to mark reductions to run concurrently. See Chapter 3, “PCA Command-Line Options” for details on the –roundoff option.
This time, PCA concurrentizes both loops by fusing them into a single parallel loop. The listing file looks similar to this:
---------------------------Loop Table--------------------------
Nest
Loop Message Level Contains Lines
===============================================================
for i 1 6-7 "test.c"
1. Concurrent & Enhanced Scalar 1 6-7, 9 "test.c"
Line:6 Loop has been fused with others to reduce overhead.
for i 1 8-9 "test.c"
2 loops total
1 loops concurrentized
1 this loop has been fused with other loops
The intermediate file, test.M, now looks like this:
int a[1000];
int sum;
void example_3_2_2( )
{
int i;
int _Kii1;
int sum1;
sum = 0;
#pragma parallel shared(a, sum) local(_Kii1, sum1)
{
sum1 = 0;
#pragma pfor iterate(_Kii1=0;1000;1)
for ( _Kii1 = 0; _Kii1<=999; _Kii1++ ) {
a[_Kii1] = _Kii1;
sum1 += a[_Kii1];
}
#pragma critical
{
sum += sum1;
}
}
printf( "The sum is %d\n", sum );
}
It's easy to see where PCA found it could run more code in parallel. PCA expanded the previously inserted directives:
#pragma parallel shared(a) local(_Kii1)
to
#pragma parallel shared(a, sum) local(_Kii1, sum1)
and inserted a new directive:
#pragma critical
into the code. The #pragma parallel is expanded to include sum and sum1. The #pragma critical instructs each thread to add its partial sum, sum1, to sum, one thread at a time.
As you learned in the previous section, the Power C Analyzer automatically identifies loops in a C program that can run safely in parallel. For a few programs, you may be able to use PCA automatically to make a significant part of the code run in parallel. To make the program more efficient, you can also pass PCA options to cc, such as the –roundoff option used in the previous example. For most programs, however, you will also want to make small code changes that let PCA run more of the code in parallel. Often, the changes are easy to make.
The PCA listing indirectly supplies information about when and where to modify your code. The more you know about where your program spends much of its time, and the better you understand the PCA listings, the easier it is to recognize where small changes to the source code can produce more parallel code.
When you use the Power C Analyzer, PCA does a data-dependency analysis on the code. During this analysis, PCA looks for for loops with the property that each iteration of the loop is independent of all other iterations. Because each iteration of the loop is self-contained, the system can execute the iterations in any order (or even simultaneously on separate processors) and get the same answer after running all iterations as it would running the iterations serially.
When PCA finds a loop that has the property of data independence, it knows it can safely run the loop in parallel. When PCA finds a loop that has iterations that could depend on other iterations, it cannot run the loop in parallel, but PCA can tell you what is causing the problem. You can use directives to get around the problem, if you know that there are no real data dependencies.
Finally, if PCA is unable to run the loop in parallel, you can often modify your program to make it safe to run in parallel.
When trying to find loops to run in parallel, review the listing file and focus your efforts on the areas of the code that use the bulk of the run time. You cannot significantly improve the performance of your program by parallelizing a routine that accounts for only 1% of the run time.
To find out where your code spends its time, take an execution profile of the program. Use either pc-sample profiling (cc(1) with the –p option) or basic block profiling (pixie(1)). Run the application program to gather performance data, then use prof(1) to generate a profile report from the collected data. For details on these commands, see the IRIX User's Reference Manual and the IRIS-4D Series Compiler Guide.
You can do the profiling in two ways. The first is the conservative approach, where you take a profile of the original (nonparallel) job. Then you run in parallel only the loops that account for most of the run time. This approach reduces the chances that something may go wrong because it makes fewer changes to the code. It also focuses most of the effort into the smallest number of lines of code.
The second approach is more aggressive. After running the program through PCA, profile the resultant multiprocessed job. Use this approach if you think that PCA does a good job with the existing program (for example, a converted program that already runs well on traditional vector architectures). Many such programs run in parallel very well without additional effort. You can then focus on the routines with which PCA has a problem.
After you decide on an approach, use the profile data to direct your efforts to the most time-consuming routines. Once you find such a routine, submit it alone to PCA. If the routine is in the middle of a large file, consider using csplit(1) to isolate that routine. Compile it with the –pca list option, and examine the listing file. The PCA listing identifies which loops PCA can and cannot run in parallel. For the latter, the listing also tells why PCA could not convert the loop for parallel execution.
When PCA cannot run a loop in parallel, it generates messages in the listing that explain what prevented the conversion. One message is:
dependencies prevent parallelism
This message means that PCA believes data dependencies exist in the loop. The listing also names the variables PCA believes are involved in the dependence. Dependencies range from the simple to the complex, so deal with each dependency on an individual basis.
PCA may generate another message:
unoptimizable function/subroutine call
This message means that the loop includes a call to an external procedure or function. Because PCA cannot see into that other routine, it must assume that the routine contains data dependencies. To change this assumption, either use in-lining or interprocedural analysis (described in Chapter 7, “In-lining and Interprocedural Analysis”) or, if you know the routine is safe to run in parallel, insert the concurrent, concurrent call, or no side effects directives.
The following examples show how inserting directives into the C source code produces code that runs in parallel.
This example uses the file permute.c, which contains the following C code:
void example_3_4_1 (double a[], double b[], long index[])
{
long i;
#pragma distinct (a[], b[], index[])
#pragma concurrent
for (i=0; i<10000; i++)
a[index[i]] += b[i];
}
Two directives have been inserted into the code; without these directives, PCA cannot produce concurrentized code. The #pragma distinct tells PCA that no data dependencies occur between a, b, and index; that is, these objects do not overlap. The #pragma concurrent tells PCA to ignore assumed dependencies in the next loop. (See Chapter 4, “Power C Analyzer Directives,” for a complete description of these pragmas.)
Compiling permute.c:
cc -pca keep permute.c
produces the listing file, permute.L, which looks similar to this:
----------------------- Loop Table --------------------------
Nest
Loop Message Level Contains Lines
===============================================================
for i 1 6-7 "permute.c"
1. Concurrent & Enhanced Scalar 1 6-7 "permute.c"
1 loops total
1 loops concurrentized
The intermediate multiprocessing file, permute.M, looks like this:
void example_3_4_1( double a[], double b[], long index[] )
{
long i;
#pragma parallel shared(index, a, b) local(i)
#pragma pfor iterate(i=0;10000;1)
for ( i = 0; i<=9999; i++ ) {
a[index[i]] += b[i];
}
}
PCA was able to safely run the loop in parallel. PCA inserted #pragma parallel shared and #pragma pfor iterate. If you can guarantee that the values of index[i] are always different for each value of i, then there is no dependence: each iteration updates a different location in a. A permutation vector is a list of numbers, each of which is different from all the others. If you know that index is a permutation vector, then the iterations are data independent. An example of a permutation index is a list of objects in which each object appears exactly once.
Note: PCA cannot check the truth of directives. When you use a directive, you must be certain that the directive is always true for all possible input data.
The next example has #pragma concurrent call inserted into the code. This pragma tells PCA that the function calls in the next loop are safe to run in parallel. The file sub.c contains:
double a[10000], b[10000];
void sub (double [], double [], long);
void example_3_4_2 ()
{
long i;
#pragma concurrent call
for (i=0; i<10000; i++)
sub (a, b, i);
}
The listing file, sub.L, looks similar to this:
---------------------- Loop Table ---------------------------
Nest
Loop Message Level Contains Lines
===============================================================
for i 1 7-8 "sub.c"
1. Concurrent & Enhanced Scalar 1 7-8 "sub.c"
1 loops total
1 loops concurrentized
After compilation, the intermediate file, sub.M, looks like this:
double a[10000];
double b[10000];
void sub( double [], double [], long );
void example_3_4_2( )
{
long i;
#pragma parallel shared(a, b) local(i)
#pragma pfor iterate(i=0;10000;1)
for ( i = 0; i<=9999; i++ ) {
sub( a, b, i );
}
}
PCA was able to safely run the loop in parallel. PCA inserted #pragma parallel shared and #pragma pfor iterate into the code.
PCA provides a great deal of information about the dependencies of loops in a C program. Often, PCA can use this information to run loops in parallel automatically. When PCA cannot convert the code automatically, you can usually supply additional information that lets it mark the code to run in parallel. In many cases, a few small code changes or a PCA command-line option is all it takes to transform a program into an efficient parallel version.
To learn more about PCA command-line options and how to use PCA as a standalone analyzer, turn to Chapter 3, “PCA Command-Line Options.”