Chapter 6. PCA Transformations

Source transformations applied by PCA convert ordinary C code into explicit concurrent syntax. This chapter describes the rules and conditions a loop must meet to be optimized successfully, and the transformations PCA applies in the process:

Loop Concurrentization

The following subsections discuss how PCA treats reductions and local variables when it converts a loop to run concurrently (in parallel).

PCA searches the submitted code for for loops that are safe to run in parallel and inserts a directive before eligible loops. The simplest cases occur when the statements of the loop have no loop-carried dependence. For example:

int a[], b[], c[], d[], n;
void example_7_1 ()
{
    int i;
    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];
        if (d[i])
            a[i] = a[i] / d[i];
    }
}
becomes:
int a[];
int b[];
int c[];
int d[];
int n;
void example_7_1(  )
{
    int i;
#pragma parallel if(n > 84) byvalue(n) shared(a, b, c, d) local(i)
#pragma pfor iterate(i=0;n;1)
    for ( i = 0; i<n; i++ ) {
        a[i] = b[i] + c[i];
        if (d[i]) {
            a[i] /=  d[i];
            }
    }
}

Reductions

Loops that involve reductions can run in parallel only if you synchronize access to the reduced value. For example, if a reduction performs some calculation on the elements of an array and then sums the results, access to the sum must be synchronized so that only one thread of execution at a time updates it. This type of synchronization is perhaps the most common. You must set –roundoff=3 to enable sum reductions. The following example illustrates how PCA typically transforms a sum reduction to run in parallel:

int b[], c[], n, sum;
void example_7_1_1 ()
{
    int i;
    for (i=0; i<n; i++) {
        sum += b[i] + c[i];
    }
}
becomes:
int b[];
int c[];
int n;
int sum;
void example_7_1_1(  )
{
    int i;
    int sum1;
#pragma parallel if(n > 126) byvalue(n) shared(c, b, sum) local(i) reduction(sum1)
    {
        sum1 = 0;
#pragma pfor iterate(i=0;n;1)
        for ( i = 0; i<n; i++ ) {
            sum1 +=  b[i] + c[i];
        }
#pragma critical
        {
            sum +=  sum1;
        }
    }
}

To produce the sum, PCA produces a local partial sum (sum1) for each thread of execution and then sums these partial sums in a controlled manner.

Local Variables

To generate concurrent code for a loop, PCA sometimes needs to use temporary variables. Each independent thread of execution needs its own set of temporary variables. PCA allocates local variables (temporaries) to the separate threads of execution by putting variable names in the local clause of the #pragma parallel that starts the parallel region. For example:

int a[], b[], c[], d[], n;
void example_7_1_2 ()
{
    int i, t;
    for (i=0; i<n; i++) {
        t = a[i] + b[i];
        c[i] = t * 2.335;
        d[i] = t * 3.221;
    }
}
becomes:
int a[], b[], c[], d[], n;
void example_7_1_2(  )
{
    int i;
    int t;
    int _Kii1;
    _Kii1 = (n)%(4);
    for ( i = 0; i<_Kii1; i++ ) {
        t = a[i] + b[i];
        c[i] = t * 2.335;
        d[i] = t * 3.221;
    }
#pragma parallel if(n>38) byvalue(_Kii1,n) shared(a,b,c,d) local(t,i)
#pragma pfor iterate(i=_Kii1;(n-_Kii1+3)/4;4)
    for ( i = _Kii1; i<n; i+=4 ) {
        t = a[i] + b[i];
        c[i] = t * 2.335;
        d[i] = t * 3.221;
        t = a[i+1] + b[i+1];
        c[i+1] = t * 2.335;
        d[i+1] = t * 3.221;
        t = a[i+2] + b[i+2];
        c[i+2] = t * 2.335;
        d[i+2] = t * 3.221;
        t = a[i+3] + b[i+3];
        c[i+3] = t * 2.335;
        d[i+3] = t * 3.221;
    }
}

Loop Reordering

Sometimes, PCA cannot convert an outer for loop to execute concurrently, but it can convert an inner loop. In these cases, PCA attempts to interchange the loops in the for nest, which increases the amount of work that each thread of execution does and improves the efficiency of the code. For example:

int a[][1000], b[][1000], n;
void example_7_2 ()
{
    int i, j;
    for (j=0; j<n; j++)
        for (i=0; i<n; i++)
            a[j][i] = a[j-1][i] + b[j][i];
}
becomes:
int a[][1000];
int b[][1000];
int n;
void example_7_2(  )
{
    int i;
    int j;
    int _Kii1;
#pragma parallel byvalue(n) shared(a, b) local(_Kii1, i, j)
#pragma pfor iterate(i=0;n;1)
    for ( i = 0; i<n; i++ ) {
        _Kii1 = (n)%(4);
        for ( j = 0; j<_Kii1; j++ ) {
            a[j][i] = a[j-1][i] + b[j][i];
        }
        for ( j = _Kii1; j<n; j+=4 ) {
            a[j][i] = a[j-1][i] + b[j][i];
            a[j+1][i] = a[j][i] + b[j+1][i];
            a[j+2][i] = a[j+1][i] + b[j+2][i];
            a[j+3][i] = a[j+2][i] + b[j+3][i];
        }
    }
 }

Scalar Optimizations

PCA uses the following standard optimizations to enhance performance of the concurrent code it generates.

  • Induction Variable Recognition

  • Global Forward Substitution

  • Loop Peeling

  • Lifetime Analysis

  • Invariant if Floating

  • Derived Assertions

  • Dead Code Elimination

Induction Variable Recognition

Induction variables are integers that are incremented or decremented by the same amount on each iteration of a for loop. PCA also recognizes auxiliary loop induction variables, such as j and k in the following example:

int a[], b[], n;
void example_7_3_1 ()
{
    int i, j, k;
    for (i=0; i<n; i++) {
        a[j] = b[k];
        j++;
        k += 2;
    }
}

becomes:

int a[];
int b[];
int n;
void example_7_3_1(  )
{
    int i;
    int j;
    int k;
    int _Kii2;
    _Kii2 = k;
#pragma parallel if(n > 53) byvalue(n, j, _Kii2) shared(a, b) local(i)
#pragma pfor iterate(i=0;n;1)
    for ( i = 0; i<n; i++ ) {
        a[j+i] = b[_Kii2+i*2];
    }
}

Global Forward Substitution

A global forward substitution pass finds relationships between variables that do not necessarily depend on the loop index variables. Consider n and np1 in this example:

int a[][1000], m, n;
void example_7_3_2 ()
{
    int i, np1;
    np1 = n + 1;
    for (i=0; i<m; i++) {
        a[i][n] = a[i-1][np1];
    }
}

PCA determines that np1 depends only on n, and not on the loop index variable. It breaks the apparent data dependence and makes the loop parallel:

int a[][1000];
int m;
int n;
void example_7_3_2(  )
 
{
    int i;
    int np1;
 #pragma parallel if(m > 91) byvalue(m, n) shared(a) local(i)
#pragma pfor iterate(i=0;m;1)
    for ( i = 0; i<m; i++ ) {
        a[i][n] = a[i-1][(n+1)];
    }
}

Loop Peeling

There are occasions when you will want to use a trick known as a “wrap-around” variable. For example, you might be using an array to simulate a cylindrical coordinate system (where the left edge of the array is adjacent to the right edge). In the following example, the variable jml wraps around partway through the loop.

int a[], b[], n;
void example_7_3_3 ()
{
    int j, jm1;
    jm1 = n;
    for (j=0; j<n; j++) {
        b[j] = (a[j] + a[jm1]) / 2;
        jm1 = j;
    }
}

In the first iteration, jm1 is n. In every iteration except the first (j=0), the value of jm1 is j-1. Thus, jm1 is an induction variable for the loop after the first iteration. By peeling off the first iteration of the loop, PCA can exploit the jm1 induction variable.

PCA peels off the first iteration and converts the remaining iterations to run concurrently. The exact thresholds PCA chooses vary, but the result has roughly this form:

int a[], b[], n;
void example_7_3_3(  )
{
    int j;
    int jm1;
    jm1 = n;
    if (n > 0) {
        b[0] = (a[0] + a[jm1]) / 2;
#pragma parallel if(n > 2) byvalue(n) shared(a, b) local(j)
#pragma pfor iterate(j=1;n-1;1)
        for ( j = 1; j<n; j++ ) {
            b[j] = (a[j] + a[j-1]) / 2;
        }
        jm1 = n - 1;
    }
}

PCA can peel off several iterations where multiple wrap-around variables exist.

Lifetime Analysis

In general, when PCA inserts a temporary local variable in a thread of execution, PCA must assign the last value of the temporary variable to the original scalar. When you set optimize to 2 or higher (see Chapter 3, “PCA Command-Line Options”), PCA does a lifetime analysis to determine whether the value of the original scalar is used outside the loop. It also eliminates dead code (see “Dead Code Elimination”).

If the variable is reused within the compilation unit, the last value of the scalar is assigned as shown in the following code segment. For example:

int a[], b[], c[], n;
void example_7_3_4 ()
{
    int i, x, y;
    for (i=0; i<n; i++) {
        x = a[i];
        y = b[i];
        c[i] = x + y;
    }
    printf (" x=%d \n", x);
}
becomes:
int a[],  b[],  c[];
int n;
void example_7_3_4(  )
 
{
    int i;
    int x;
    int y;
    int _Kii1;
    _Kii1 = ((n)>(0) ? (n) : (0));
#pragma parallel if(n > 101) byvalue(n) shared(c, a, b) local(i)
#pragma pfor iterate(i=0;n;1)
    for ( i = 0; i<n; i++ ) {
        c[i] = a[i] + b[i];
    }
    if (_Kii1 > 0)
        x = a[_Kii1-1];
    printf( " x=%d \n", x );
}

Invariant if Floating

When an if condition is invariant in the loop, PCA often can move the if statement outside the loop, further improving the performance of the code. You need to set –optimize=4 or higher for this optimization to occur.

For example:

int a[], b[], c[], n, x;
void example_7_3_5 ()
{
    int i;
    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];
        if (x > 0)
            a[i] *= x;
    }
}

becomes:

int a[],  b[],  c[];
int n,  x;
void example_7_3_5(  )
{
    int i;
    if (x > 0) {
#pragma parallel if(n > 101) byvalue(n, x) shared(a, b, c) local(i)
#pragma pfor iterate(i=0;n;1)
        for ( i = 0; i<n; i++ ) {
            a[i] = b[i] + c[i];
            a[i] *=  x;
        }
    } else {
#pragma parallel if(n > 201) byvalue(n) shared(a, b, c) local(i)
#pragma pfor iterate(i=0;n;1)
        for ( i = 0; i<n; i++ ) {
            a[i] = b[i] + c[i];
        }
    }
}

Derived Assertions

PCA can derive some information about the relative values of scalar integers from the if and assignment statements in the program. For example:

int a[], b[], m, n;
void example_7_3_6 ()
{
    int i;
    if (m > n) {
        for (i=0; i<n; i++) {
            a[i] = a[i+m] + b[i];
        }
    }
}

becomes:

int a[], b[], m, n;
void example_7_3_6(  )
{
    int i;
    if (m > n) {
#pragma parallel if(n > 143) byvalue(n, m) shared(a, b) local(i)
#pragma pfor iterate(i=0;n;1)
        for ( i = 0; i<n; i++ ) {
            a[i] = a[i+m] + b[i];
        }
    }
}

The transformation is legal because the loop can execute only when the value of m is greater than n: the loop writes elements a[0] through a[n-1] but reads elements a[m] through a[n-1+m], and because m > n the two index ranges cannot overlap.

Dead Code Elimination

When you set the -scalaropt option to 1 or higher, PCA performs dead code elimination. To get the full benefit of dead code elimination, combine it with optimizations such as subprogram in-lining (function call expansion) and forward substitution. These optimizations expose useless or unreachable code that PCA cannot otherwise see. PCA will also perform the following optimizations:

  • Removal of unreachable code. For example:

    float x, y;
    void example_7_3_7a ()
    {
        goto hop;
        x = 2.0;
    hop:
        y = 13.0;
    }
    

    becomes:

    float x;
    float y;
    void example_7_3_7a(  )
    { 
    hop:
        y = 13.0;
    }
    

  • Removal of zero-trip and empty for loops. For example, the following code is deleted completely:

    float x, y;
    void example_7_3_7b ()
    {
        int i;
        for (i=10; i<2; i++)
            x = x + y;
    }
    

  • Elimination of resolved conditionals. For example:

    float x;
    void example_7_3_7c ()
    {
        if (12 > 10)
            x = 1.0;
        else
            x = 2.0;
    }
    

    Because the true branch is always taken, this code becomes:

    float x;
    void example_7_3_7c(  )
    {
        x = (float)(1.0);
    }
    

  • Removal of unnecessary or unprofitable code. (scalaropt must be ≥ 2 and optimize must be ≥ 2 for this optimization.) PCA performs lifetime analysis to determine the reaching definitions of variables and removes unused definitions.

    void example_7_3_7d ()
    {
        float x, y;
        y = 5.0;     /* no subsequent use of local variable y */
        x = 3.0;     /* variable x redefined */
        x = 4.0;
        printf ("%g \n", x);
    }
    

    becomes:

    void example_7_3_7d(  )
    {
       float x;
       float y;
       x = (float )(4.0);
       printf( "%g \n", x );
    }
    

Loop Rerolling

Many programs have loops that were unrolled manually over several iterations to amortize the cost of the test and branch at each iteration of the for loop. Before PCA can evaluate these unrolled loops for parallel execution, it must reroll them into a simpler form, as the following “dusty-deck” transformations show.

int a[], b[], c[], n;
void example_7_4a ()
{
    int i;
    for (i=0; i<n; i+=2) {
        a[i] = b[i] + c[i];
        a[i+1] = b[i+1] + c[i+1];
    }
}

PCA recognizes this iteration as an unrolled loop and rerolls it before evaluating it for parallel execution, as follows:

int a[], b[], c[], n;
void example_7_4a(  )
 
{
    int i;
#pragma parallel if(n > 199) byvalue(n) shared(a, b, c) local(i)
#pragma pfor iterate(i=0;(n+1)/2*2;1)
    for ( i = 0; i<=(n + 1) / 2 * 2 - 1; i++ ) {
        a[i] = b[i] + c[i];
    }
}

PCA can also recognize unrolled summations (with –roundoff=3):

int b[], c[], n, sum;
void example_7_4b ()
{
    int i;
    for (i=0; i<n; i+=2) {
        sum += b[i] + c[i];
        sum += b[i+1] + c[i+1];
    }
}
and reroll them:
int b[], c[], n, sum;
void example_7_4b(  )
{
    int i;
    int sum1;
#pragma parallel if(n > 125) byvalue(n) shared(c, b, sum) local(i) reduction(sum1)
    {
        sum1 = 0;
#pragma pfor iterate(i=0;(n+1)/2*2;1)
        for ( i = 0; i<=(n + 1) / 2 * 2 - 1; i++ ) {
            sum1 +=  b[i] + c[i];
        }
#pragma critical
        {
            sum +=  sum1;
        }
    }
}

Loop Unrolling

Loop unrolling is a standard manual optimization technique that puts more work into a small loop by repeating the original statements. PCA can automatically unroll both serial and parallel loops to speed execution. Unrolling a loop involves duplicating the loop body one or more times within the loop, increasing the loop increment accordingly, and possibly inserting code before the loop to execute the excess iterations (the “cleanup code”).

If the loop bounds are constant and the iteration count of the loop is small, PCA can eliminate the loop entirely and replace it with copies of the loop body, or PCA can omit the cleanup code. To enable loop unrolling, set the –scalaropt command-line option to 2.

The following example was run with –unroll=8 and –unroll2=1000. (See Chapter 3, “PCA Command-Line Options” for more information on these command-line options.) Even when the loop bounds are unknown at compilation time, PCA can still unroll the loop, generating cleanup code for the excess iterations.

For example:

int a[], b[], c[], n;
void example_7_5a ()
{
    int i;
    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];
    }
}

is unrolled as:

int a[], b[], c[], n;
void example_7_5a(  )
{
    int i;
    int _Kii1;
    _Kii1 = (n)%(8);
    for ( i = 0; i<_Kii1; i++ ) {
        a[i] = b[i] + c[i];
    }
#pragma parallel if(n > 201) byvalue(_Kii1, n) shared(a, b, c) local(i)
#pragma pfor iterate(i=_Kii1;(n-_Kii1+7)/8;8)
    for ( i = _Kii1; i<n; i+=8 ) {
        a[i] = b[i] + c[i];
        a[i+1] = b[i+1] + c[i+1];
        a[i+2] = b[i+2] + c[i+2];
        a[i+3] = b[i+3] + c[i+3];
        a[i+4] = b[i+4] + c[i+4];
        a[i+5] = b[i+5] + c[i+5];
        a[i+6] = b[i+6] + c[i+6];
        a[i+7] = b[i+7] + c[i+7];
    }
 }

But if the loop iteration count is constant and small, PCA can remove the loop entirely. For example:

int a[], b[], c[];
void example_7_5b ()
{
    int i;
    for (i=0; i<5; i++) {
        a[i] = b[i] + c[i];
    }
}

becomes:

int a[], b[], c[];
void example_7_5b(  )
 
{
    int i;
    a[0] = b[0] + c[0];
    a[1] = b[1] + c[1];
    a[2] = b[2] + c[2];
    a[3] = b[3] + c[3];
    a[4] = b[4] + c[4];
}

Loop Fusion

Loop fusion is a conventional compiler optimization that transforms two adjacent loops into a single loop. The use of data-dependence tests allows fusion of more loops than is possible with standard techniques. You must use –scalaropt=2 to enable loop fusion.

In the following example, the first two loops are fused and concurrentized together. Fusing these loops reduces for-loop overhead and the amount of synchronization required. PCA recognizes that the third loop must execute after the first two and does not fuse it with the others.

For example:

int a[], b[], c[], d[], n;
void example_7_6 ()
{
    int i;
    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];
    }
    for (i=0; i<n; i++) {
        a[i] = a[i] + d[i];
    }
    for (i=0; i<n; i++) {
        d[i] = a[i+1];
    }
}

becomes:

int a[], b[], c[], d[], n;
void example_7_6(  )
{
    int i, _Kii1;
#pragma parallel if(n > 50) byvalue(n) shared(a, b, c, d) local(_Kii1, i)
    {
#pragma pfor iterate(_Kii1=0;n;1)
        for ( _Kii1 = 0; _Kii1<n; _Kii1++ ) {
            a[_Kii1] = b[_Kii1] + c[_Kii1];
            a[_Kii1] +=  d[_Kii1];
        }
#pragma synchronize
#pragma pfor iterate(i=0;n;1)
        for ( i = 0; i<n; i++ ) {
            d[i] = a[i+1];
        }
    }
}

Memory Management for Data Locality

The following sections describe PCA's memory management options and how it uses them. The options available are:

  • cachesize – use this option to pick block sizes, that is, the sizes of the sections in the cache.

  • cacheline – use this option to inform PCA of the width in bytes of the memory channel between cache and main memory.

  • fpregisters and spregisters – use these options to pick unrolling factors, and to make sure registers are not overflowed when unrolling.

  • setassociativity – use this option to decide which memory management algorithm to use.

Memory Management Techniques

PCA can enhance your program's performance on machines that use cache memory. It uses a combination of memory padding, loop blocking, loop interchanging, and outer loop unrolling to optimize re-use of operands in memory. You can enable memory management by setting scalaropt ≥ 3 and roundoff ≥ 3. (The default settings are scalaropt=3 and roundoff=0.)

PCA improves memory access patterns in loops that process arrays by working on small sections of the arrays that fit into cache and yield high cache-hit ratios. You can control the sizes of these array sections with the memory management command-line options. The default settings for these options match the standard characteristics of the target machine and produce the best results for most programs. In some specialized cases, however, you might want to modify the settings to adjust the array section sizes.

PCA determines at runtime which of two memory management algorithms to use, depending on the settings for cacheline and setassociativity. The default algorithm keeps square blocks of data in cache, while the other keeps long, narrow blocks of data in cache.

The example below shows how loop blocking, loop interchanging, and loop unrolling can improve the performance of this matrix multiplication code. The PCA command-line options used to analyze it were o=3, so=2, r=3, and arl=2. The arl option is set because the arrays are function arguments, and PCA assumes by default that they might overlap.

double matm(n,a,b,c)
int n;
double a[200][200],b[200][200],c[200][200];
{
   int i,j,k;
   for (i=0; i<n; i++) 
      for (j=0; j<n; j++) {
         a[i][j] = 0.0;
         for (k=0; k<n; k++)
            a[i][j] = a[i][j] + b[i][k]*c[k][j];
      }
   return (a[3][5]);
}

becomes:

double matm( n, a, b, c )
    int n;
    double  (*a)[200];
    double  (*b)[200];
    double  (*c)[200];
{
    int i;
    int j;
    int k;
    int _Kii1;
    int _Kii2;
    int _Kii3;
    int _Kii4;
    int _Kii5;
    int _Kii6;
    int _Kii9;
    double _Kdd1;
    int _Kii10;
    int _Kii11;
    double _Kdd2;
    int _Kii12;
    double _Kdd3;
    int _Kii13;
    double _Kdd4;
    int _Kii14;
    double _Kdd5;
    int _Kii15;
    double _Kdd6;
    int _Kii16;
    double _Kdd7;
    int _Kii17;
    double _Kdd8;
    int _Kii18;
    double _Kdd9;
    int _Kii19;
    double _Kdd10;
    int _Kii20;
    double _Kdd11;
    int _Kii21;
    int _Kii22;
    double _Kdd12;
    int _Kii23;
    int _Kii24;
    int _Kii25;
    int _Kii26;
    int _Kii27;
    int _Kii28;
    _Kii10 = n - 1;
    _Kii27 = n / 10;
#pragma parallel byvalue(n, _Kii10) shared(a) local(i, j)
#pragma pfor iterate(i=0;n/10;10)
    for ( i = 0; i<=n - 10; i+=10 ) {
        for ( j = 0; j<=_Kii10; j++ ) {
            a[i][j] = 0.0;
            a[i+1][j] = 0.0;
            a[i+2][j] = 0.0;
            a[i+3][j] = 0.0;
            a[i+4][j] = 0.0;
            a[i+5][j] = 0.0;
            a[i+6][j] = 0.0;
            a[i+7][j] = 0.0;
            a[i+8][j] = 0.0;
            a[i+9][j] = 0.0;
        }
    }
    _Kii1 = _Kii27 * 10;
    _Kii11 = n - 1;
    for ( i = _Kii1; i<=_Kii11; i++ ) {
        _Kii26 = (_Kii11 + 1)%(3);
        for ( j = 0; j<_Kii26; j++ ) {
            a[i][j] = 0.0;
        }
        for ( j = _Kii26; j<=_Kii11; j+=3 ) {
            a[i][j] = 0.0;
            a[i][j+1] = 0.0;
            a[i][j+2] = 0.0;
        }
    }
    _Kii2 = n;
    _Kii5 = 0;
    _Kii3 = (_Kii2 - 1)%(546) + 1;
    _Kii4 = _Kii3;
    _Kii22 = n - 1;
    _Kii25 = n - 10;
    _Kii24 = n - 1;
    _Kii28 = (_Kii25 + 10) / 10;
    for ( _Kii6 = 1; _Kii6<=_Kii2; _Kii6+=546 ) {
        _Kii21 = _Kii5 + _Kii4 - 1;
        for ( k = 0; k<=_Kii25; k+=10 ) {
            _Kii12 = k + 1;
            _Kii13 = k + 2;
            _Kii14 = k + 3;
            _Kii15 = k + 4;
            _Kii16 = k + 5;
            _Kii17 = k + 6;
            _Kii18 = k + 7;
            _Kii19 = k + 8;
            _Kii20 = k + 9;
#pragma parallel byvalue(_Kii22, k, _Kii12, _Kii13, _Kii14, _Kii15, _Kii16, _Kii17, _Kii18, _Kii19, _Kii20, _Kii5, _Kii21)
#pragma         shared(b, a, c) local(_Kdd1, _Kdd2, _Kdd3, _Kdd4, _Kdd5, _Kdd6, _Kdd7, _Kdd8, _Kdd9, _Kdd10, _Kdd11, i, j)
#pragma pfor iterate(i=0;_Kii22+1;1)
            for ( i = 0; i<=_Kii22; i++ ) {
                _Kdd2 = b[i][k];
                _Kdd3 = b[i][_Kii12];
                _Kdd4 = b[i][_Kii13];
                _Kdd5 = b[i][_Kii14];
                _Kdd6 = b[i][_Kii15];
                _Kdd7 = b[i][_Kii16];
                _Kdd8 = b[i][_Kii17];
                _Kdd9 = b[i][_Kii18];
                _Kdd10 = b[i][_Kii19];
                _Kdd11 = b[i][_Kii20];
                for ( j = _Kii5; j<=_Kii21; j++ ) {
                    _Kdd1 = a[i][j];
                    _Kdd1 +=  _Kdd2 * c[k][j];
                    _Kdd1 +=  _Kdd3 * c[_Kii12][j];
                    _Kdd1 +=  _Kdd4 * c[_Kii13][j];
                    _Kdd1 +=  _Kdd5 * c[_Kii14][j];
                    _Kdd1 +=  _Kdd6 * c[_Kii15][j];
                    _Kdd1 +=  _Kdd7 * c[_Kii16][j];
                    _Kdd1 +=  _Kdd8 * c[_Kii17][j];
                    _Kdd1 +=  _Kdd9 * c[_Kii18][j];
                    _Kdd1 +=  _Kdd10 * c[_Kii19][j];
                    _Kdd1 +=  _Kdd11 * c[_Kii20][j];
                    a[i][j] = _Kdd1;
                }
            }
        }
        _Kii9 = _Kii28 * 10;
        _Kii23 = _Kii5 + _Kii4 - 1;
#pragma parallel byvalue(_Kii24, _Kii5, _Kii23, _Kii9) shared(b, a, c) local(_Kdd12, i, j, k)
        {
            for ( k = _Kii9; k<=_Kii24; k++ ) {
#pragma pfor iterate(i=0;_Kii24+1;1)
                for ( i = 0; i<=_Kii24; i++ ) {
                    _Kdd12 = b[i][k];
                    for ( j = _Kii5; j<=_Kii23; j++ ) {
                        a[i][j] +=  _Kdd12 * c[k][j];
                    }
                }
#pragma synchronize
            }
        }
        _Kii5 +=  _Kii4;
        _Kii4 = 546;
    }
    return a[3][5];
}