Chapter 5. Multiprocessing C Compiler Directives

In addition to the usual interpretation performed by any other C compiler, the multiprocessing C compiler can process explicit multiprocessing directives to produce code that can run concurrently on multiple processors. Table 5-1 lists the multiprocessing directives used when processing code in parallel regions.

The multiprocessing C compiler does not know whether you or PCA (or a combination of the two) put the directives in the code. The multiprocessing C compiler does not check for or warn against data dependencies that have been violated. That kind of analysis is left to PCA.

Table 5-1. Multiprocessing C Compiler Directives

Pragma                   Description
#pragma parallel         Start a parallel region
#pragma pfor             Mark a for loop to run in parallel
#pragma one processor    Execute statement on only one processor
#pragma critical         Protect access to critical statement(s)
#pragma independent      Start independent code section that executes in
                         parallel with other code in the parallel region
#pragma synchronize      Stop threads until all threads reach here
#pragma enter gate       Note threads that have reached here
#pragma exit gate        Stop threads until all threads have passed the
                         matching #pragma enter gate
#pragma plist
#pragma ordered

After the multiprocessing directives are inserted (either by PCA or by you), you can pass the code through PCA. The directives and their associated code remain unchanged and pass directly through to the .out file, while unrelated sections of code are optimized.

Why Use Parallel Regions?

To understand many of the multiprocessing C compiler directives, consider the concept of a parallel region. On some systems, a parallel region is merely a single loop that runs in parallel. However, with Power C, a parallel region can include several loops and/or independent code segments that execute in parallel.

Using large parallel regions can improve the performance of your code in ways not possible merely by executing a series of isolated loops in parallel. For example, parallel regions save some of the processing overhead associated with preparing each region to run in parallel. In addition, parallel regions do not force synchronization at the end of each of the contained loops.

Thus, if a thread finishes its work early, it can go on to execute the next section of code—providing that the next section of code is not dependent on the completion of the previous section. However, when creating parallel regions, you need more sophisticated synchronization methods than you need for isolated parallel loops.

New Multiprocessing Compiler Directives

PCA does not recognize or generate directives that were only recently added to the multiprocessing C compiler. If PCA finds one of these new multiprocessing C compiler directives in your code, it prints a warning message and discards it. This guide clearly notes the new directives that are not processed by PCA. In future releases, PCA will recognize (and where appropriate, generate) these new directives. Thus, you should feel free to use these new directives in your code, but add them only after you have finished with PCA.

Coding Rules of Pragmas

Power C pragmas are modeled after the Parallel Computing Forum (PCF) directives for parallel FORTRAN. The PCF directives define a broad range of parallel execution modes and provide a framework for defining C pragmas.

Some changes have been made to make the pragmas more C-like:

  • Each pragma starts with #pragma and follows the ANSI C conventions for compiler directives. You may use white space before and after the #, and, as in C syntax, you must sometimes use white space to separate the words in a pragma. A line that contains a pragma can contain nothing else (no code or comments).

  • Pragmas apply to only one succeeding statement. If a pragma applies to more than one statement, you must make a compound statement. C syntax lets you use curly braces, { }, to do this. Because of the differences between C syntax and FORTRAN, C can omit the PCF directives that indicate the end of a range (for example, END PSECTIONS).

  • If you put a variable on a local list, it is as if you declared a variable of the same type and name inside the parallel statement.

  • The pfor pragma replaces the PARALLEL DO directive because the for statement in C is more loosely defined than the FORTRAN DO statement.

To make it easier to use pragmas, you can put several keywords on a single pragma line, or spread the keywords over several lines. In either case, you must put the keywords in the correct order, and each pragma must contain an initial keyword.

For example:

#pragma parallel shared(a,b,c, n) local(i) pfor
#pragma iterate(i=0;n;1)
for (i=0; i<n; i++) a[i]=b[i]+c[i];

does the same thing as:

#pragma parallel
#pragma shared( a )
#pragma shared( b, c, n )
#pragma local( i )
#pragma pfor
#pragma iterate(i=0;n;1)
   for (i=0; i<n; i++) a[i]=b[i]+c[i];

Parallel Regions

A parallel region consists of a number of work-sharing constructs. Currently, Power C supports the following work-sharing constructs:

  • a loop executed in parallel

  • an independent code section executed in parallel with the rest of the code in the parallel region

  • “local” code run (identically) by all threads

  • code executed by only one thread

  • code run in “protected mode” by all threads

In addition, Power C supports two types of explicit synchronization:

  • synchronize

  • enter/exit gate

A simple parallel region consists of only one work-sharing construct, usually a loop. (A parallel region consisting of only a serial section or independent code is a waste of time.)

A parallel region of code can contain sections that execute sequentially as well as sections that execute concurrently. A single large parallel region has a number of advantages over a series of isolated parallel regions: each isolated region executes a single loop in parallel. At the very least, the single large parallel region can help reduce the overhead associated with moving from serial execution to parallel execution.

Large mixed parallel regions also let you avoid the forced synchronization that occurs at the end of each parallel region. The large mixed parallel region also allows you to use pragmas that execute independent code sections that run concurrently.

To start a parallel region, use the parallel pragma. To mark a for loop to run in parallel, use the pfor pragma. To start an independent code section that executes in parallel with the rest of the code in the parallel region, use the independent pragma.

Figure 5-1 shows the execution of a typical parallel program with parts running in sequential and parallel mode.

Figure 5-1. Program Execution


When you or PCA start a program, nothing actually runs in parallel until it reaches a parallel region. Then multiple threads begin (or continue, if this isn't the first parallel region), and the program runs in parallel mode. When the program exits a parallel region, only a single thread continues (sequential mode) until the program again enters a parallel region and the process repeats.

The synchronization needs within a simple parallel region are simple; you can use the critical or one processor pragma to handle them.

The following subsections describe these directives.

#pragma parallel

To start a parallel region, use the parallel pragma. This pragma has a number of modifiers, but to run a single loop in parallel, the only modifiers you usually use are shared, byvalue, and local. These options tell the multiprocessing C compiler which variables to share between all threads of execution and which variables should be treated as local.

The code that comprises the parallel region is delimited by curly braces
({ }) and immediately follows the parallel pragma and its modifiers.

The syntax for this pragma is:

#pragma parallel shared (variables) byvalue (variables)
#pragma local (variables) optional modifiers
{ code }

The parallel pragma has six modifiers: shared, byvalue, local, if, ifinline, and numthreads.

Their syntax is:

shared ( variable names )
byvalue ( variable names )
local ( variable names )
if ( integer valued expr )
[no]ifinline
numthreads ( min=expr; max=expr )
numthreads ( percent=expr )
numthreads ( expr )

Where:

shared 

Tells the multiprocessing C compiler the names of all the variables that the threads must share. (If PCA creates a parallel region, it does this for you.)

byvalue 

Tells the multiprocessing C compiler that it can pass the shared variables listed after this option by value rather than by reference. This fine-tuning option helps the multiprocessing C compiler optimize code. PCA generates this option as appropriate. Used incorrectly, however, it can produce erroneous code, so be careful what you put in this variable list.

You can put a variable in this list only if the variable is:

  • a scalar

  • not already in the shared list

  • read only

local 

Tells the multiprocessing C compiler the names of all the variables that must be private to each thread. (When PCA sets up a parallel region, it does this for you.)

if 

Lets you set up a condition that is evaluated at run time to determine whether to run the statement(s) serially or in parallel. At compile time, it is not always possible to judge how much work a parallel region does (for example, loop indices are often calculated from data supplied at run time). Avoid running trivial amounts of code in parallel, because you cannot make up the overhead associated with running code in parallel. PCA also generates this condition as appropriate.

If the if condition is false (equal to zero), then the statement(s) runs serially. Otherwise, the statement(s) run in parallel.

[no]ifinline 

Helps the multiprocessing C compiler optimize code when you also use the if option. This option is a fine-tuning option. Using the ifinline option (which is the default unless you use noifinline) causes a slight increase in code size but faster execution. This feature is turned off if you use noifinline (that is, the code is smaller but a little slower).

numthreads ( min=expr; max=expr )
numthreads ( percent=expr )
numthreads ( expr )

Tells the multiprocessing C compiler the number of available threads to use when running this region in parallel. (The default is all the available threads.)

The min clause instructs the compiler that this section is not to run in parallel unless at least expr threads are available.

The max clause indicates that at most expr threads out of the available threads should be used. The actual number used is the smaller of expr and the number of threads available.

The percent clause instructs the compiler to use expr percent of the available threads.

In general, you should never have more threads of execution than you have processors, and you should specify the number of threads with the MPC_NUM_THREADS environment variable at run time (see Appendix C, “Run Time Environment Variables”). If you want to run a loop in parallel while you run some other code, you can use this option to tell the multiprocessing C compiler to use only some of the available threads.

The usage numthreads(expr) is equivalent to numthreads(max=expr).

expr should evaluate to a positive integer.
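As an illustrative sketch (not from this guide; a compiler without multiprocessing support simply ignores these pragmas and runs the loop serially), a percent clause might reserve threads for concurrent work elsewhere:

```c
/* Hypothetical example: use at most half the available threads for
   this loop, leaving the rest free for other work. */
void scale_half_threads(double *a, int n)
{
    int i;
#pragma parallel shared(a) byvalue(n) local(i) numthreads(percent=50)
#pragma pfor iterate(i=0; n; 1)
    for (i = 0; i < n; i++)
        a[i] *= 2.0;
}
```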

For example, to start a parallel region in which to run the following code in parallel:

for (idx=n; idx; idx--) {
   a[idx] = b[idx] + c[idx];
}

you or PCA must enter:

#pragma parallel shared( a, b, c ) byvalue(n) local( idx )

or:

#pragma parallel
#pragma shared( a, b, c )
#pragma byvalue(n)
#pragma local(idx)

before the statement or compound statement (code in curly braces, { }) that comprises the parallel region.
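To make the serial-versus-parallel decision at run time, you might attach an if clause to the same loop. The cutoff of 1000 below is an illustrative guess, not a measured threshold, and an ordinary compiler ignores the pragmas and runs the loop serially:

```c
/* Run in parallel only when the trip count is large enough to
   amortize the parallel startup overhead. Arrays are indexed 1..n,
   matching the countdown loop above. */
void vector_add(double *a, const double *b, const double *c, int n)
{
    int idx;
#pragma parallel shared(a, b, c) byvalue(n) local(idx) if(n > 1000)
#pragma pfor iterate(idx=n; n; -1)
    for (idx = n; idx; idx--)
        a[idx] = b[idx] + c[idx];
}
```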

Any code within a parallel region but not within any of the explicit parallel constructs (pfor, independent, one processor, and critical) is termed local code. Local code typically modifies only local data and is run by all threads.

Figure 5-2 shows local code execution.

Figure 5-2. Execution of Local Code Segments


#pragma pfor

Use #pragma pfor to run a for loop in parallel only if the loop meets all of these conditions:

  • All the values of the index variable can be computed independently of the iterations.

  • All iterations are independent of each other—that is, data used in one iteration does not depend on data created by another iteration. A quick test for independence: if the loop can be run backwards, then chances are good the iterations are independent.

  • The number of iterations is known (no infinite or data-dependent loops) at execution time.

  • The pfor is contained within a parallel region.

If the code after a pfor is not dependent on the calculations made in the pfor loop, there is no reason to synchronize the threads of execution before they continue. So, if one thread from the pfor finishes early, it can go on to execute the serial code without waiting for the other threads to finish their part of the loop.
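The independence test can be made concrete with a pair of hypothetical loops: the first can be run backwards with the same result, so its iterations qualify for pfor; the second cannot, because iteration i reads the value iteration i-1 just wrote:

```c
/* Safe for pfor: no iteration touches data produced by another. */
void doubles(double *out, const double *in, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * 2.0;
}

/* NOT safe for pfor: a[i] depends on a[i-1] from the previous
   iteration, so running iterations concurrently gives wrong results. */
void running_total(double *a, int n)
{
    int i;
    for (i = 1; i < n; i++)
        a[i] = a[i-1] + a[i];
}
```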

The #pragma pfor directive takes several modifiers; the only one that is required is iterate. Figure 5-3 shows #pragma parallel, which starts a parallel region and tells the multiprocessing C compiler that the i variable must be local (private) to each processor. #pragma pfor tells the compiler that each iteration of the loop is unique and to partition the iterations among the threads for execution.

The syntax for #pragma pfor is:

#pragma pfor iterate ( ) optional modifiers
for ...
  { code ... }

The pfor pragma has three modifiers. Their syntax is:

iterate( index variable=expr1; expr2; expr3 )
schedtype ( type )
chunksize  ( expr )

Figure 5-3 shows parallel code segments using #pragma pfor.

Figure 5-3. Parallel Code Segments Using #pragma pfor


Where:

iterate 

Gives the multiprocessing C compiler the information it needs to identify the unique iterations of the loop and partition them to particular threads of execution.

index variable is the index variable of the for loop you want to run in parallel.

expr1 is the starting value for the loop index.

expr2 is the number of iterations for the loop you want to run in parallel.

expr3 is the increment of the for loop you want to run in parallel.

For example, for the for loop

for (idx=n; idx; idx--) {
    a[idx] = b[idx] + c[idx];
}

the iterate modifier to pfor should be:

iterate(idx=n;n;-1)

This loop counts down from the value of n, so the starting value is the current value of n. The number of trips through the loop is n, and the increment is -1.

schedtype (type) 


Tells the multiprocessing C compiler how to share the loop iterations among the processors. The schedtype chosen depends on the type of system you are using and the number of programs executing (see Table 5-2).

Table 5-2. Choosing a schedtype

Single-User System *                                Multiuser System
simple (iterations take same amount of time)        gss (data-sensitive iterations vary slightly)
gss (data-sensitive iterations vary slightly)       dynamic (data-sensitive iterations vary greatly)
dynamic (data-sensitive iterations vary greatly)

* If you are on a single-user system but are executing multiple
programs, select the scheduling from the Multiuser column.

 


Figure 5-4 shows how loop iterations can vary.

Figure 5-4. Variance of Loop Iterations


You can use the following valid types to modify schedtype:

simple 

(the default) tells the run time scheduler to partition the iterations evenly among all the available threads.

runtime 

tells the compiler that the real schedule type will be specified at run time.

dynamic 

tells the run time scheduler to give each thread chunksize iterations of the loop. chunksize should be smaller than (number of total iterations)/(number of threads). The advantage of dynamic over simple is that dynamic helps distribute the work more evenly than simple.

Depending on the data, some iterations of a loop can take longer to compute than others, so some threads may finish long before the others. In this situation, if the iterations are distributed by simple, then the thread waits for the others. But if the iterations are distributed by dynamic, the thread does not wait, but goes back to get another chunksize iteration until the threads of execution have run all the iterations of the loop.

interleave 

tells the run time scheduler to give each thread chunksize iterations (described below) of the loop, which are then assigned to the threads in an interleaved way.

gss 

(guided self-scheduling) tells the run time scheduler to give each processor a varied number of iterations of the loop. This is like dynamic, but instead of a fixed chunksize, the chunk size varies: the scheduler hands out big pieces at first and small pieces toward the end.

If I iterations remain and P threads are working on them, the piece size is roughly:

I/(2P) + 1

Programs with triangular matrices should use gss.
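A hypothetical triangular loop nest shows why: row i does i+1 units of work, so fixed-size chunks would leave some threads with far more work than others, while gss shrinks the pieces to balance the load (an ordinary compiler ignores the pragmas and runs the loop serially):

```c
/* Row sums of a lower-triangular matrix: the inner loop grows with i,
   so iteration cost is uneven -- the shape gss schedules well. */
void tri_row_sums(double *sum, double (*a)[4], int n)
{
    int i, j;
#pragma parallel shared(sum, a) byvalue(n) local(i, j)
#pragma pfor iterate(i=0; n; 1) schedtype(gss)
    for (i = 0; i < n; i++) {
        sum[i] = 0.0;
        for (j = 0; j <= i; j++)
            sum[i] += a[i][j];
    }
}
```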

Figure 5-5 shows the effects of the different types of loop scheduling.

Figure 5-5. Loop Scheduling Types


chunksize (expr) 


Tells the multiprocessing C compiler how many iterations to define as a chunk when you use the dynamic or interleave modifier (described above).

expr should be a positive integer, and should evaluate to roughly:

number of iterations
--------------------
          X

where X ranges from 2 to 10 times the number of threads. Select 2 times the number of threads when iterations vary slightly and 10 times the number of threads when iterations vary greatly. Performance gains may diminish beyond 10 times the number of threads.
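Putting schedtype and chunksize together, a sketch (the names and the chunksize value are illustrative, not measured; without multiprocessing support the pragmas are ignored and the loop runs serially):

```c
/* Per-iteration cost varies with m[i], so hand out iterations in
   small chunks; chunksize(4) is an illustrative value chosen by the
   iterations/X rule of thumb. */
double harmonic(int m)
{
    double s = 0.0;
    int j;
    for (j = 1; j <= m; j++)
        s += 1.0 / j;
    return s;
}

void apply_harmonic(double *out, const int *m, int n)
{
    int i;
#pragma parallel shared(out, m) byvalue(n) local(i)
#pragma pfor iterate(i=0; n; 1) schedtype(dynamic) chunksize(4)
    for (i = 0; i < n; i++)
        out[i] = harmonic(m[i]);
}
```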

To run the example:

for (idx=n; idx; idx--){
   a[idx] = b[idx] + c[idx];
}

in parallel, PCA or you must enter the pragmas:

#pragma parallel
#pragma shared( a, b, c )
#pragma byvalue(n)
#pragma local(idx)
#pragma pfor iterate(idx=n;n;-1)
for (idx=n; idx; idx--){
   a[idx] = b[idx] + c[idx];
}

#pragma one processor

A #pragma one processor directive causes the statement that follows it to be executed by exactly one thread.

The syntax of this pragma is:

#pragma one processor
{ code }

Figure 5-6 shows code executed by only one thread. No thread may proceed past this code until it has been executed.

Figure 5-6. One Processor Segment


If a thread is executing the statement following this pragma, then other threads that encounter this statement must wait until the statement has been executed by the first thread, then skip the statement and continue on.

If a thread has completed execution of the statement preceded by this pragma, then all threads encountering this statement skip the statement and continue without pause.
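A typical use is one-time work, such as a progress message, inside an otherwise parallel region. This is a hedged sketch with hypothetical names; an ordinary compiler ignores the pragmas and runs everything on one thread:

```c
#include <stdio.h>

void fill_sequence(double *a, int n)
{
    int i;
#pragma parallel shared(a) byvalue(n) local(i)
    {
#pragma one processor
        printf("filling %d elements\n", n); /* executed by exactly one thread */

#pragma pfor iterate(i=0; n; 1)
        for (i = 0; i < n; i++)
            a[i] = (double)i;
    }
}
```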

#pragma critical

Sometimes the bulk of the work done by a loop can be done in parallel, but the entire loop cannot run in parallel because of a single data-dependent statement. Often, you can move such a statement out of the parallel region. When that is not possible, you can sometimes use a lock on the statement to preserve the integrity of the data.

In Power C, use the critical pragma to put a lock on a critical statement (or compound statement using { }). When you put a lock on a statement, only one thread at a time can execute that statement. If one thread is already working on a critical protected statement, any other thread that wants to execute that statement must wait until the other thread has finished executing it. Figure 5-7 shows critical segment execution.


Note: The current release of Power C allocates one global lock that is shared among all #pragma critical directives by default. Most uses of the critical pragma are to protect access to a very limited set of data, data that is usually referenced in many places in the program. By sharing one lock, all references by all guarded statements are properly protected. See “The lock Clause” for more information.

Figure 5-7. Critical Segment Execution


The syntax of the critical pragma is:

#pragma critical
{ code }

The statement(s) after the critical pragma will be executed by all threads, but only by one at a time.
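As a stripped-down sketch of the locking (the names are hypothetical; in real code you would accumulate per-thread partial sums, since a critical section on every element serializes the additions):

```c
double grand_total;  /* shared; updated only inside the critical section */

void sum_array(const double *a, int n)
{
    int i;
    grand_total = 0.0;
#pragma parallel shared(a, grand_total) byvalue(n) local(i)
#pragma pfor iterate(i=0; n; 1)
    for (i = 0; i < n; i++) {
#pragma critical
        grand_total += a[i];  /* only one thread at a time runs this */
    }
}
```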

#pragma independent

Running a loop in parallel is a class of parallelism sometimes called “fine-grained parallelism” or “homogeneous parallelism.” It is called homogeneous because all the threads execute the same code on different data. Another class of parallelism is called “coarse-grained parallelism” or “heterogeneous parallelism.” As the name suggests, the code in each thread of execution is different.

Ensuring data independence for heterogeneous code executed in parallel is not always as easy as it is for homogeneous code executed in parallel. (And ensuring data independence for homogeneous code is not a trivial task.)

The independent pragma has no modifiers. Use this pragma to tell the multiprocessing C compiler to run code in parallel with the rest of the code in the parallel region. Figure 5-8 shows an independent segment with execution by only one thread. However, other threads may proceed past this code as soon as it starts execution.

Figure 5-8. Independent Segment Execution


The syntax for #pragma independent is:

#pragma independent
{ code }


Note: The Power C Analyzer does not yet know how to generate this new pragma. Do not include it in code that you intend to pass through PCA. Insert this pragma only after you are finished with PCA.


Synchronization

To account for data dependencies, it is sometimes necessary for threads to wait for all other threads to complete executing an earlier section of code. Two sets of directives implement this coordination: #pragma synchronize and #pragma enter/exit gate.

#pragma synchronize

A #pragma synchronize tells the multiprocessing C compiler that, within a parallel region, no thread can execute the statement that follows this pragma until all threads have reached it. This directive is a classic barrier construct. Figure 5-9 shows this synchronization.

Figure 5-9. Synchronization


The syntax for this pragma is:

#pragma synchronize
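A sketch of the barrier in use, with hypothetical names: the second loop reads neighbors written by the first, so no thread may start it until every element of tmp exists. An ordinary compiler ignores the pragmas, runs the loops serially, and still gets the same answer:

```c
void neighbor_average(double *a, double *tmp, int n)
{
    int i;
#pragma parallel shared(a, tmp) byvalue(n) local(i)
    {
#pragma pfor iterate(i=0; n; 1)
        for (i = 0; i < n; i++)
            tmp[i] = a[i] * 2.0;

        /* barrier: every tmp[i] must be written before any is read */
#pragma synchronize

#pragma pfor iterate(i=1; n-2; 1)
        for (i = 1; i < n - 1; i++)
            a[i] = (tmp[i-1] + tmp[i+1]) / 2.0;
    }
}
```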

#pragma enter gate and #pragma exit gate

You can use two additional pragmas to coordinate the processing of code within a parallel region. These additional pragmas work as a matched set. They are #pragma enter gate and #pragma exit gate.

A gate is a special barrier. No thread may exit the gate until all threads have entered it. Figure 5-10 shows execution using gates.

Figure 5-10. Execution Using Gates


This construct gives you more flexibility when managing dependencies between the work-sharing constructs within a parallel region.

For example, suppose you have a parallel region consisting of the work-sharing constructs A, B, C, D, E, and so forth. A dependency might exist between B and E such that you could not execute E until all the work on B was completed, as shown below:

#pragma parallel ...
{
..A..
..B..
..C..
..D..
..E.. (depends on B)
}

One way to handle this would be to put a synchronize before E. But this directive is wasteful if all the threads have cleared B and are already in C or D. All the faster threads would pause before E until the slowest thread completed C and D:

#pragma parallel ...
{
..A..
..B..
..C..
..D..
#pragma synchronize
..E..
}

To reflect this dependency, put a #pragma enter gate (name) after B and a #pragma exit gate (name) before E. Putting the enter gate after B tells the system to note which threads have completed the B work-sharing construct. Putting the exit gate prior to the E work-sharing construct tells the system to allow no thread into E until all threads have cleared B.

#pragma parallel ...
{
..A..
..B..
#pragma enter gate (foo)
..C..
..D..
#pragma exit gate (foo)
..E..
}

#pragma enter gate

The syntax of this pragma is:

#pragma enter gate ( name )

name 

is a name you create to uniquely identify the work construct controlled by this pragma.

For example, construct D might be dependent on construct A, and construct F might be dependent on construct B, but you would not want to stop at construct D because all the threads had not cleared B. By using enter/exit gate pairs, you can make subtle distinctions about which construct is dependent on which other construct.

Put this pragma after the work-sharing construct that all threads must clear before the #pragma exit gate of the same name.


Note: The Power C Analyzer does not yet know how to generate this new pragma. Do not include it in code that you intend to pass through PCA. Insert this pragma only after you are finished with PCA.


#pragma exit gate

The syntax of this pragma is:

#pragma exit gate( name )

Put this pragma before the work-sharing construct that is dependent on the #pragma enter gate of the same name. No thread enters this work-sharing construct until all threads have cleared the work-sharing construct controlled by the corresponding #pragma enter gate of the same name.


Note: The Power C Analyzer does not yet know how to generate this new pragma. Do not include it in code that you intend to pass through PCA. Insert this pragma only after you are finished with PCA.


The lock Clause

The lock clause lets you control which lock is used during the execution of the various parallel code segments.

The syntax of this pragma is:

lock (locktype)

where locktype can have one of the following values:

block 

use a lock exclusively for the block representing this parallel code segment

region 

use a lock that is unique to the parallel region

global 

use a lock that is global to the entire parallel run time

others 

use a lock that is provided by you. The name, in this case, should correspond to a user-defined variable. It is your responsibility to acquire and dispose of the lock.

For a critical region outside of a parallel region, region and block are not valid lock types. If you do not specify the lock clause, the default values are:

  • For a critical segment, a global lock is assumed.

  • For other segments, a block lock is assumed.
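As a hedged sketch (the names are hypothetical, and it is assumed here that the lock clause is appended to the critical pragma), a region lock keeps one critical section from contending with unrelated critical sections elsewhere in the program; an ordinary compiler ignores the pragmas and runs the loop serially:

```c
double hit_count;  /* shared counter guarded by this region's lock */

void tally_hits(const int *flags, int n)
{
    int i;
    hit_count = 0.0;
#pragma parallel shared(flags, hit_count) byvalue(n) local(i)
#pragma pfor iterate(i=0; n; 1)
    for (i = 0; i < n; i++) {
        if (flags[i]) {
            /* lock(region): contend only with critical sections in
               this parallel region, not the program-wide global lock */
#pragma critical lock(region)
            hit_count += 1.0;
        }
    }
}
```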