Chapter 2. Analyzing Loops: 32-bit Sample Sessions

This chapter provides three interactive sample sessions that demonstrate most of the Parallel Analyzer View's features for the 32-bit version of MPF. These sessions also demonstrate various aspects of parallelization and the use of the POWER Fortran Accelerator (PFA).

The sample sessions consist of a step-by-step examination of three sample programs.

To use these sample sessions, the subsystem WorkShopMPF_sw.demos must be installed.


Note: These sample sessions are applicable for the 32-bit compilers only. For a discussion of the 64-bit version of the compilers, see Chapter 3, “Analyzing Loops: 64-bit Sample Sessions.”


Setting Up the Dummy Sample Session

The Parallel Analyzer View comes with a demonstration directory, /usr/demos/WorkShopMPF. It contains a subdirectory, tutorial, which holds a source file called dummy.f_orig and a Makefile. The source file contains 27 DO loops, each of which exemplifies one aspect of the parallelization process. In that directory, running make creates a scratch copy of the demonstration program, dummy.f, and then runs PFA on the copy. PFA produces a transformed source file dummy.m, a listing file dummy.l, and an “analysis” file dummy.anl.

Prepare for the session by opening a shell window and entering make in the /usr/demos/WorkShopMPF/tutorial directory:

% cd /usr/demos/WorkShopMPF/tutorial 
% make 

Once the demo directory has been prepared, start the session by entering:

% cvpav -f dummy.f

The main window of the Parallel Analyzer View opens, displaying the list of loops in the source file, dummy.f. Position the view at the upper left of the screen.


Note: If you receive a message related to licensing, refer to the NetLS License System Administration Guide or WorkShopProMPF Release Notes.

Figure 2-1 shows the Parallel Analyzer View with an alternative color scheme. To start a session in these colors, enter cvpav -scheme Potrero -f dummy.f. The black and white figures in the hard copy version of this guide were prepared using the Grayscale scheme. Another scheme used in this book is IndigoMagic.

Figure 2-1. Parallel Analyzer View Main Window


Using the Loop List Display

The loop list display shows information about each loop in the program with an icon next to it that reflects the parallelization status of the loop. Pull down the Admin menu and select “Icon Legend...” to bring up a legend dialog box that explains the meaning of the various icons (see Figure 2-2). Move the legend dialog box to the side, and scroll through the list of loops to see the various icons. When you are done, close the legend dialog box by clicking the Close button in the lower right of the dialog box.

Figure 2-2. Launching the “Icon Legend...” Dialog Box


The loop list display contains the following items:

Workload 

an estimate of the amount of work done in each iteration of the loop

Nest 

the nesting level for the loop

Loop-ID 

the FORTRAN description of the loop

Variable 

the loop index variable

Subroutine, Lines, File 

where the loop is located in the source code

Olid 

the original loop ID; an internal identifier for the loop (Please refer to this number when reporting bugs.)

Underneath the list display is a search field and a set of option menus and buttons that control the display of information in the loop list.

Sorting the Loop List

You can sort the list either in the order of the source code, or by loop workload, or (if you are running a performance experiment on the program using the WorkShop Performance Analyzer) by performance cost. You control sorting with the option menu to the left below the list.

Figure 2-3. Source Order Sort


When loops are sorted in source order, the Loop-ID is indented according to the nesting level of the loop. In the demonstration program, only the last several loops are nested, so you will have to scroll down to see the indentation (see Figure 2-3).

For other sorting, the list is not indented. Select “Sort by Workload” and notice the Loop-ID is no longer indented (see Figure 2-4). (The same is true of “Sort by Perf. Cost”. It is grayed out because there is no performance tool running at this time.) When you are done, select “Sort in Source Order” once again.

Figure 2-4. Sorting the Loop List by Workload


Filtering the Loop List

You may want to look at only some of the loops in large programs. The list can be filtered in two ways: by parallelization status or by origin of the loop.

Filtering by Parallelization Status

The parallelization status filtering is controlled by an option menu centered below the list. It initially reads “Show All Loop Types”.

You can filter the list to show only those loops that cannot be parallelized, those that are parallel, or those that are serial (see Figure 2-5).

Figure 2-5. Parallelization Status Option Menu


Try selecting each of these, and then return to “Show All Loop Types”. You can also filter the list to show only those loops for which you have requested modifications (requesting modifications to loops is described later in this section). Since you haven't yet requested any modifications, selecting this option results in a message saying that no loops meet the filter criterion.

Filtering by Loop Origin

Another way to filter is to show loops that come from a single file or a single subroutine:

  1. Open the Subroutines and Files View by pulling down the Views menu and selecting “Subroutines and Files View.” Alternatively, you may use the keyboard accelerator for this operation by typing <Ctrl>-F with the cursor anywhere in the main view. A subsidiary view opens that lists the subroutines and files in the fileset (see Figure 2-6).

    Figure 2-6. Subroutines and Files View


  2. From the Filter option menu (see Figure 2-7), select “Filter by File.”

    Figure 2-7. Filter Option Menu


  3. Double-click the line for the file dummy.f in the function/file list of the Subroutines and Files View window. The name appears in the filtering text field labeled Title: (see Figure 2-8) and the list is rescanned. Similarly, you may try selecting “Filter by Subroutine” from the main view option menu and double-clicking the line for subroutine DUMMY in the Subroutines and Files View.

    Figure 2-8. Filter by File Option Menu and Text Field


For this example, there is only one file and one subroutine, so the filtering is not very useful, but for large programs with many files and subroutines, it would be. When you are done, display all of the loops in the sample source file once again by selecting “No Filtering” from that option menu.

You won't be needing the Subroutines and Files View further, so close it by pulling down the Admin menu and selecting “Close.”

Viewing Source

The Parallel Analyzer View gives you access to views of both your original Fortran source and the source as it is transformed by the POWER Fortran Accelerator.

Viewing Original Source

Click the Source button to the left side of the main view to bring up the Source View, as shown in Figure 2-9. This view is the same Source View that is used in the WorkShop Debugger and Performance Analyzer.

Figure 2-9. Source View


When the source display opens, position it to the right of the main view. (On machines with low-resolution screens, the windows will overlap.) Scroll up and down in the file and observe that the source window displays colored brackets that mark the location of each loop. These colors match the colors of the parallelization icons and serve to indicate the parallelization status of each loop at a glance. The color indicates which loops are parallelized, which are unparallelizable, and which are left serial.

Viewing Transformed Source

PFA is a source-to-source translator that takes the various loops in the program and transforms them both for scalar optimization and for parallelization. Each loop may be rewritten into one, two, or more transformed loops or may be combined with others or optimized away. The result of these transformations is a transformed source file that you may examine.

Click the Transformed Source button. Another source window labeled “Parallel Analyzer View — Transformed Source” opens as shown in Figure 2-10.

Figure 2-10. Transformed Source Window


Position it below the Source View. Scroll through it, and notice that it, too, has bracketing marking the loops. The bracketing for the transformed source cannot always distinguish between serial loops and unparallelizable loops, so some unparallelizable loops will be displayed as serial (for example, those with data dependencies).

Viewing Detailed Information about a Loop

Each line in the loop list summarizes some information about a loop. Much more information is available, and this section will show you how to examine it.

Selecting a Loop

To get more information about a loop, you must select it by

  • double-clicking the text of the loop's line (but not its icon)

  • clicking the brackets in either of the source windows

  • stepping through the list with the Next Loop and Previous Loop buttons

Selecting a loop has a number of effects:

  • The previously empty display below the list fills with information on the selected loop.

  • The Source View scrolls to the selected loop and highlights the source code of the loop.

  • The Transformed Source window highlights the first of the loops into which the original selected loop was transformed and displays a bright vertical bar next to each transformed loop that came from the original loop.

If the Transformed Loops View or the PFA Analysis Parameters View is open, it too will be switched to show the selected loop. We will look at these views later. See Figure 2-11.

Figure 2-11. Global Effects of Selecting a Loop


In this figure and many of those that follow, the loop list is resized to reduce the number of loops displayed. The adjustment button is in the lower-right corner of the loop list display, just above the loop information display. Your screen shows the full list unless you resize it.

Try scrolling through the list and double-clicking various loops, and scrolling through the source displays and clicking the loop brackets to select loops. Note that when you select each loop, its icon acquires a check mark showing that you've looked at it. When you are done, scroll to the top of the loop list in the main view and double-click the first loop's line.

Using the Loop Information Display

The loop information display occupies the lower half of the main view (see Figure 2-12). It contains detailed information about the currently selected loop. It consists of a series of lines in several blocks.

Figure 2-12. Loop Information Display


Parallelization Controls

The first line of the display is labeled Parallelization Controls:. On the far right, the first line shows how many transformed loops were derived from the selected loop. When the session is run with a performance experiment, an additional block appears above the Parallelization Controls. It gives performance information for the loop (shown in Figure 2-39). Since we do not have an experiment on this program (which does not, in fact, execute), the performance information is absent.

Below this are two option menus, the first controlling parallelization status and the second controlling the loop MP scheduling (it is shown for all loops, but is applicable to parallel loops only), and a text input field for adding an expression for the scheduling chunk size. Text labels to the right of the option menus list the current values for parallelization and scheduling.

Loop Information Messages

Below the first separator line appear up to five blocks of additional information. These are lists of:

  • questions that PFA asked about the loops, if any

  • obstacles to parallelization, if any

  • assertions made about the loop, if any

  • directives applied to the loops, if any

  • messages about the loop, if any

Figure 2-13. Highlighting Button


Some of these lines may be accompanied by small “light bulb” highlighting buttons (see Figure 2-13). Each highlights a relevant part of the code in the Source View when clicked. The lines for assertions, directives, and questions also may have menus accompanying them. Lines that refer to parallelization status or PFA parameters will not have menus because they are controlled using the parallelization status menu or from the PFA Analysis Parameters View, respectively. You'll use these features later in the session. The first loop in the file (which you selected previously) has two messages and no highlighting buttons.

Using the PFA Analysis Parameters View

Figure 2-14. Views Menu


The PFA analysis parameters control what kinds of transformations PFA will make on the program. The values for the selected loop may be changed using the PFA Analysis Parameters View. To bring it up, pull down the Views menu and select “PFA Analysis Parameters View” (see Figure 2-14). Alternatively, you may use the keyboard accelerator for this operation by typing <Ctrl>-A with the cursor anywhere in the main view.

Figure 2-15. PFA Analysis Parameters View


A new view comes up, listing each of the parameters with a numeric input field to its right. Entering a new numeric value in an input field requests a change to the loop. Don't do this now; close the view by pulling down the view's Admin menu and selecting “Close.”

Using the Transformed Loops View

You can also see detailed information about the transformed loops coming from a particular loop (see Figure 2-16). To do so, pull down the Views menu and select “Transformed Loops View.” Alternatively, you may use the keyboard accelerator for this operation by typing <Ctrl>-T with the cursor anywhere in the main view.

Figure 2-16. Transformed Loops View for Loop do-1000


When the view opens, position it at the left of the screen, below the main view. It contains information about the loops into which the currently selected original loop was transformed. Each transformed loop has a block of information associated with it, and the blocks are separated by horizontal lines.

Transformed Loop Description

The first line in each block contains a parallelization status icon, a highlighting button, and the ID of the transformed loop. (The ID is assigned by PFA.) The button, if clicked, highlights the transformed loop in the Transformed Source window and the original loop in the Source View.

The next two lines describe the transformed loop. The first provides information such as whether it is a primary loop (directly transformed from the selected original loop) or secondary loop (transformed from a different original loop but incorporating some code from the selected original loop), its parallelization state, whether it is an ordinary loop or interchanged loop, its nesting level, and workload. The second line displays the location of the loop in the transformed source.

Following the description lines is a list of messages generated by PFA, if any. To the left of the message lines are buttons, and clicking them will highlight the part of the original source that relates to the message. Often it is the first line of the original loop that is shown, since the message refers to the entire loop.

For the currently selected loop (do-1000), the original loop was transformed into two loops, one that runs parallelized and one that runs serial. As the messages state, the original loop was unrolled 4 times, and a cleanup loop was added. Unrolling is described in “Loop Unrolling”.

Selecting Transformed Loops

Transformed loops can also be selected. By default, the first of the transformed loops is selected when the view is brought up, and the transformed source is highlighted to show it. At the same time, the color highlighting of the original source changes, although the highlighted lines do not. See Figure 2-17. You will later see that for loops with more extensive transformations the highlighted lines will be different (for example, loops do-1300 and do-1350, the fused loops).

Now click the button for the second transformed loop. The transformed source will highlight a different region (the cleanup loop), but the original source will highlight the same lines as before, as shown in Figure 2-18. This is because when a transformed loop is selected, those lines in the original source that go into the transformed loop will be highlighted. In this case, the same lines go into both the transformed loops. Transformed loops may also be selected by clicking the corresponding loop brackets in the Transformed Source window.

Figure 2-17. Transformed Loops in Source Windows


Figure 2-18. Second Transformed Loop Highlighting


You may either leave this window open or close it by selecting the “Close” command from its File menu.

Examining Loops

Now that you have familiarized yourself with the basic windows in the Parallel Analyzer View's user interface, you can start examining and analyzing loops. First you will look at a few simple loops, next at loops with obstacles to parallelization, then at loops for which PFA asks questions, and finally at more complex, nested loops.

Simple Loops

The six loops you will examine in this section are the simplest kind of Fortran loop.

A Simple Parallelizable Loop

Scroll the list of loops back to the top and select loop do-1000. As the two messages state, this loop is transformed into two loops, one an unrolled, parallelized loop, and the second a clean-up loop for unrolling. (Unrolling is discussed in “Loop Unrolling”.)

Move to loop do-1100 by clicking the Next Loop button.

A Preferably Serial Loop

Loop do-1100 is preferably serial, because the amount of work done is too little to justify the parallelization overhead. Unlike the previous loop, this loop has a known iteration count, so the total work can be computed. See Figure 2-19.

Figure 2-19. Preferably Serial Loop


Also note that this loop is unrolled as the previous one was, but that no cleanup loop is needed because the count is known to be a multiple of the unrolling factor.

Move to loop do-1200 by clicking the Next Loop button.

An Explicitly Parallelized Loop

Loop do-1200 is parallelized because it contains an explicit C$DOACROSS directive; PFA will pass the directive through in the transformed source but does nothing further with the loop, as the messages indicate. See Figure 2-20.

Figure 2-20. Explicitly Parallelized Loop


The loop status option menu is set to “C$DOACROSS...” and it is shown with a highlighting button. Clicking the button will bring up both the Source View and the Parallelization Control View, which shows more information about the parallelization directive. If you have clicked on the button, close the Parallelization Control View by pulling down its Admin menu and selecting “Close.” You will come back to the use of this view later. See “Building a Custom DOACROSS Directive”. Close the Source View by pulling down its File menu and selecting “Close.”

The C$DOACROSS directive is displayed with a highlighting button. Click it, and the Source View comes up. Notice the highlighting of the directive in the source. See Figure 2-21.

Figure 2-21. Source View of C$DOACROSS Directive


Move to loop do-1300 by clicking the Next Loop button.

A Pair of Fused Loops

Loop do-1300 is the first of two loops that can be fused. That is, the loops have the same bounds, and the code in the body of the two loops is independent, so they can be combined to save the loop overhead. Even when a loop has been fused, the Source View is highlighted to show only the selected loop, not the other loops that have been fused with it.
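The effect of fusion can be sketched outside of Fortran. The following Python fragment is only an illustration: the function names and loop bodies are hypothetical and are not taken from dummy.f. It shows two loops with identical bounds and independent bodies, and the single loop that results from combining them.

```python
def separate_loops(a, b):
    """Two loops over the same bounds with independent bodies (hypothetical)."""
    n = len(a)
    x = [0] * n
    y = [0] * n
    for i in range(n):          # first loop: touches only x and a
        x[i] = a[i] + 1
    for i in range(n):          # second loop: touches only y and b
        y[i] = b[i] * 2
    return x, y


def fused_loop(a, b):
    """The same work in one loop, paying the loop overhead once."""
    n = len(a)
    x = [0] * n
    y = [0] * n
    for i in range(n):
        x[i] = a[i] + 1         # body of the first original loop
        y[i] = b[i] * 2         # body of the second original loop
    return x, y
```

Because neither body reads what the other writes, both versions produce the same arrays. The sketch assumes that a and b have the same length.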

Notice that in the Transformed Source window, the highlighted loop has the bodies of the two original loops interleaved, and replicated for unrolling (see Figure 2-22). Click the bracket next to the loop in the transformed source. Now you see that the lines highlighted in the original source come from both loops. Then click the bracket for the loop below it in the transformed source (the cleanup loop for unrolling) and see that it, too, highlights source from both loops.

Figure 2-22. Fused Loops in Transformed Source Window


Move to loop do-1350 by clicking the Next Loop button. Loop do-1350 is the other half of the fused pair. Its icon indicates that it was fused, and the highlighting in the transformed source indicates that it was transformed into the same pair of loops as the previous one.

Move to loop do-1400 by clicking the Next Loop button.

Loop Unrolling

Unrolling is done to reduce the loop overhead relative to the real work of the loop. The simpler the body of the loop, the more profitable unrolling can be. In many cases, the loop iteration count is not known, so an additional loop, called a cleanup loop, is necessary to handle the last few iterations. Sometimes, the iteration count is known but is not a multiple of the unrolling; in such cases, PFA will usually explicitly add code for the last few iterations.
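As a rough illustration of unrolling with a cleanup loop, consider the Python sketch below. It is not PFA output; the function names and the loop body are hypothetical, and the unrolling factor of 4 is chosen to match the default mentioned in this chapter.

```python
def scale_rolled(a, factor):
    """Reference loop: one element per iteration."""
    out = list(a)
    for i in range(len(out)):
        out[i] = out[i] * factor
    return out


def scale_unrolled_by_4(a, factor):
    """Same computation with the body replicated 4 times, plus a cleanup loop."""
    out = list(a)
    n = len(out)
    main_end = n - (n % 4)  # largest multiple of 4 that does not exceed n
    # Main loop: four elements per trip, so loop overhead is paid once per four.
    for i in range(0, main_end, 4):
        out[i] = out[i] * factor
        out[i + 1] = out[i + 1] * factor
        out[i + 2] = out[i + 2] * factor
        out[i + 3] = out[i + 3] * factor
    # Cleanup loop: the 0 to 3 iterations left over when n is not a multiple of 4.
    for i in range(main_end, n):
        out[i] = out[i] * factor
    return out
```

When the iteration count is known to be a multiple of 4, the cleanup loop in this sketch executes zero times, which is why a transformer that knows the count can omit it.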

Loop do-1400 is the same as the first loop in the program, but a directive “SCALAR OPTIMIZE(1)” has been added. The loop is not unrolled. By default, the scalar optimization parameter is set to 3, which allows loop unrolling.

Move to loop do-1500 by clicking the Next Loop button.

A Loop That Is Optimized Away

Loop do-1500 is an example of a loop so unnecessary that PFA can get rid of it entirely. First, PFA sees that the body of the loop does not depend on the loop index, so the body can be moved out of the loop and the loop eliminated. Then it sees that the body sets a variable that is not subsequently used, so it can throw that out, too. The transformed source is not scrolled and highlighted because nothing is there. Scroll down a few lines from the previous loop, and note the absence of the code for the loop that was optimized away.

The loop also has a directive controlling scalar optimization, but it is there only to reset the default for subsequent loops.
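The two steps by which such a loop disappears can be pictured with a small Python sketch. The functions below are hypothetical stand-ins, not the code in dummy.f; they show a loop-invariant body, the result of moving it out of the loop, and the result of deleting the unused assignment.

```python
def original(x, n):
    """Loop whose body is the same on every iteration and whose result is unused."""
    t = 0.0
    for i in range(n):
        t = x * 2.0          # does not depend on i; t is never read afterwards
    return x + 1.0


def after_hoisting(x, n):
    """Step 1: the invariant body is moved out, so the loop itself vanishes."""
    if n > 0:
        t = x * 2.0          # still assigned, still never read
    return x + 1.0


def after_dead_code_removal(x, n):
    """Step 2: the assignment to the unused variable is removed as well."""
    return x + 1.0
```

All three functions return the same value for any inputs, which is what makes the transformation legal.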

Move to loop do-2000 by clicking the Next Loop button.

Loops with Obstacles to Parallelization

There are a number of reasons that a loop may not be parallelized. The following loops illustrate various of these reasons, along with variants that allow parallelization. You will step through each of them in turn.

Loops with Data Dependences

Loop do-2000 is an example of a loop that cannot be parallelized because of a data dependence. In this case, one element of an array is used to set another. (This construct is called a recurrence.) If the loop were to be parallelized, the iterations might execute out of order, and iteration 4, which sets A(4) to A(5), might occur after iteration 5, which would have reset the value of A(5). Consequently, the program would give the wrong answer. See Figure 2-23.
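To see why execution order matters, here is a minimal Python sketch of a recurrence of the form described above, in which each element is set from its right-hand neighbor. The exact statement in dummy.f is not shown in this chapter, so the loop body and the function names are assumptions. Running the iterations in a different order stands in for what could happen if they were distributed across processors.

```python
def shift_in_order(a):
    """Serial execution: iteration i reads a[i + 1] before iteration i + 1 changes it."""
    out = list(a)
    for i in range(len(out) - 1):
        out[i] = out[i + 1]
    return out


def shift_in_given_order(a, order):
    """Same loop body, but the iterations run in the order supplied by the caller."""
    out = list(a)
    for i in order:
        out[i] = out[i + 1]
    return out


data = [10, 20, 30, 40, 50]
serial = shift_in_order(data)                         # [20, 30, 40, 50, 50]
reordered = shift_in_given_order(data, [3, 2, 1, 0])  # [50, 50, 50, 50, 50]
```

In the reordered run, a later iteration overwrites an element before an earlier iteration has read it, so the old value is lost.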

Figure 2-23. Obstacle to Parallelization


There is a line listing the obstacle to parallelization; click the button that accompanies it. Two kinds of highlighting take place. The first is a line highlight showing the relevant line that has the dependence, and the second is a symbol (or token) highlight that shows the uses of the variable that is the obstacle to parallelization. Only the uses of the variable within the loop are highlighted.

Move to loop do-2100 by clicking the Next Loop button.

Not all loops with similar constructs are unparallelizable. Loop do-2100 is similar to loop do-2000, but the array elements used differ by an offset, M. If M is equal to NSIZE, for example, and the array has 2*NSIZE elements, the code is actually copying the upper half of the array into the lower half, and there is no reason why that cannot be run in parallel. PFA cannot recognize this from the source, but the author has added an assertion that there is no recurrence, so the loop is parallelized. See Figure 2-24. Click the highlighting button to show the assertion.
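The offset case can be checked with a short Python sketch. The function below is a hypothetical model of a loop that sets element i from element i + m; it is not the source of do-2100. With m equal to the number of iterations, the elements read and the elements written never coincide, so every iteration order gives the same result.

```python
def copy_with_offset(a, nsize, m, order):
    """Run out[i] = out[i + m] for each i in `order` (a subset of range(nsize))."""
    out = list(a)
    for i in order:
        out[i] = out[i + m]
    return out


data = [1, 2, 3, 4, 5, 6, 7, 8]      # array of length 2 * nsize
nsize = 4
forward = copy_with_offset(data, nsize, nsize, range(nsize))
backward = copy_with_offset(data, nsize, nsize, reversed(range(nsize)))
```

Here forward and backward are identical: the lower half receives a copy of the upper half regardless of order. The analyzer cannot know the value of the offset from the source alone, which is the gap the assertion fills.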

Figure 2-24. Parallelizable Data Dependence


Move to loop do-2200 by clicking the Next Loop button.

Data dependence can involve more than one line of a program. In loop do-2200, a similar dependence occurs, but the use of the variable occurs on a different line than its setting. Click the highlight button on the obstacle line, and note that both lines receive the line highlighting, and the token highlighting shows the dependency variable on the two lines (see Figure 2-25). Of course, real programs can, and typically do, have far more complex dependencies than this.

Figure 2-25. Highlighting on Multiple Lines


Move to loop do-2300 by clicking the Next Loop button.

Loops with Reductions

Loop do-2300 shows a data dependence that is called a reduction. In a reduction, the variable responsible for the data dependence is being accumulated or “reduced” in some fashion. Reductions can be summation, multiplication, or a minimum or maximum determination. For summation, as shown in this loop, PFA could accumulate partial sums in each processor, and then add the partial sums at the end. However, because floating-point arithmetic is inexact, a different order of addition might give a different answer because of round-off error.

This does not imply that the serial execution answer is “correct” and the parallel execution answer is “incorrect”; they are equally valid within the limits of round-off error. Since, by default, PFA assumes it is not OK to introduce round-off error, the loop is left serial. PFA does, however, have a parameter to allow you to say that such round-off error is OK.
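The partial-sum strategy can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and num_chunks merely stands in for the number of processors. Each chunk is summed independently and the partial sums are then combined, which changes the order in which the floating-point additions occur.

```python
def serial_sum(values):
    """Accumulate left to right, as the serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, num_chunks):
    """Sum each contiguous chunk separately, then add the partial sums."""
    n = len(values)
    chunk = max(1, (n + num_chunks - 1) // num_chunks)  # ceiling division, at least 1
    partials = []
    for start in range(0, n, chunk):
        partials.append(serial_sum(values[start:start + chunk]))
    return serial_sum(partials)


values = [0.1] * 10
print(repr(serial_sum(values)), repr(chunked_sum(values, 2)))
```

Comparing the two printed values shows whether the regrouping changed the low-order digits for a given input. Any difference is at the level of round-off, which is the behavior the round-off parameter lets you accept.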

Move to loop do-2400 by clicking the Next Loop button.

In loop do-2400, the author has added a directive controlling round-off error. The same loop that was left serial above is now parallelized. Click the button for the directive, and you can see how it is highlighted in the source. Refer to the PFA manual for a more detailed explanation of the meaning and use of this directive. The round-off setting will be left at this value for the remainder of the program.

Move to loop do-2500 by clicking the Next Loop button.

Loops with Input-output Operations

Loop do-2500 has an input/output (I/O) operation in it. It cannot be parallelized, because the output would appear in a different order depending on the scheduling of the individual CPUs. Click the button indicating the obstacle, and note the highlighting of the print statement. Also note that the transformed source shows that this loop is not unrolled, either. Actually, there is no real obstacle to unrolling, but it is not done because the cost of performing the I/O operation is so great compared to the loop iteration overhead that the savings gained are not worth the increase in the size of the program.

Move to loop do-2600 by clicking the Next Loop button.

Loops with Premature Exits

Loop do-2600 has a premature exit; that is, it cannot be determined at compilation time how many iterations will take place. If PFA did parallelize it, one thread might execute iterations past the point where another has determined to exit the loop.
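The hazard can be sketched in Python. The functions below are hypothetical and do not reproduce do-2600; they model a loop that stops at the first element exceeding a limit, and a naive split into two halves in which each half decides on its own when to exit.

```python
def serial_with_exit(a, limit):
    """Double each element until the first one that exceeds `limit`, then stop."""
    out = list(a)
    for i in range(len(out)):
        if out[i] > limit:
            break               # premature exit: later elements stay untouched
        out[i] *= 2
    return out


def naive_two_chunks(a, limit):
    """Each half runs independently, as two threads would without coordination."""
    out = list(a)
    mid = len(out) // 2
    for start, stop in ((0, mid), (mid, len(out))):
        for i in range(start, stop):
            if out[i] > limit:
                break           # only this half stops; the other half keeps going
            out[i] *= 2
    return out


data = [1, 9, 2, 3]
serial = serial_with_exit(data, 5)    # [2, 9, 2, 3]
split = naive_two_chunks(data, 5)     # [2, 9, 4, 6]
```

The second half has modified elements that the serial loop never reaches, so the two results differ.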

Click the button indicating the premature exit. Note that the line with the exit from the loop is highlighted in the source.

Move to loop do-2700 by clicking the Next Loop button.

Loops with Subroutine Calls

Loop do-2700 is also unparallelizable, because there is a call to a routine, RTC, and PFA cannot determine whether or not that call will have side effects. Click the obstacle line. Note the highlighting of the line containing the call and the subroutine name. Also note that the loop is not unrolled, as the presence of the call inhibits unrolling.

Move to loop do-2800 by clicking the Next Loop button.

Although loop do-2800 has a similar subroutine call in it, it can be parallelized because the author has asserted that the call has no side effects that will prevent it from running concurrently. Click the assertion line to highlight the source line containing the assertion.
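A short Python sketch may help show what a side effect means here. The routines below are hypothetical and are unrelated to the actual RTC routine; one keeps hidden state between calls and the other does not. The iteration order is varied to stand in for concurrent execution.

```python
def run_loop(values, order, func):
    """Apply func to values[i] for each i in `order`, storing results by index."""
    out = [None] * len(values)
    for i in order:
        out[i] = func(values[i])
    return out


def make_stateful():
    """Return a routine with a side effect: it counts its own calls."""
    calls = 0

    def stamped(x):
        nonlocal calls
        calls += 1
        return x + calls        # result depends on how many calls came before
    return stamped


def pure(x):
    """No side effects: the result depends only on the argument."""
    return x * x


values = [10, 20, 30]
stateful_forward = run_loop(values, [0, 1, 2], make_stateful())   # [11, 22, 33]
stateful_backward = run_loop(values, [2, 1, 0], make_stateful())  # [13, 22, 31]
pure_forward = run_loop(values, [0, 1, 2], pure)
pure_backward = run_loop(values, [2, 1, 0], pure)
```

The stateful routine gives order-dependent results, while the pure one does not. Without an assertion, the analyzer has to assume that any call might behave like the stateful routine.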

When you are done, move to loop do-3000 by clicking the Next Loop button.

Loops That Prompt Questions from PFA

Sometimes PFA can parallelize a loop more efficiently if it knows more information than it can infer from the source. In these cases, PFA asks questions that appear in the loop information display for the loop, along with a menu that allows you to answer the question.

Loops with Relationships between Variables

PFA can sometimes parallelize a loop if it can be told the relationship between variables in the program. Although you may know such relationships from the nature of the physical problem the program is dealing with, PFA cannot safely infer the information just from the code.

Loop do-3000 can be parallelized if it is known that the iterations do not overlap, but not otherwise. PFA will ask three questions, although for this type of construct, it actually generates code to determine the relationship at run time, and the program will execute one of the two sequences depending on that determination. You can see this by observing that the loop was transformed into four loops, one pair of unroll/cleanup loops when it can be parallelized, and a second when it cannot. Look at the transformed source code for each of these pairs.

For any such questions, the line asking them has an associated option menu that will allow you to answer. The generated code will be correct even if you do not answer or do not know. If PFA knows the answer, it can omit the alternate form and produce a tighter program.
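The idea of generating both forms and choosing between them at run time can be sketched in Python. Everything in this fragment is an assumption made for illustration: the loop body, the test m >= n, and the function names do not come from dummy.f or from PFA's generated code. Reverse iteration order stands in for concurrent execution.

```python
def iterations_independent(n, m):
    """Run-time test: reads touch indices m..m+n-1, writes touch 0..n-1."""
    return m >= n


def update(a, n, m):
    """Compute out[i] = out[i + m] + 1 for i in range(n), picking a version at run time."""
    out = list(a)
    if iterations_independent(n, m):
        version = "parallel"
        order = range(n - 1, -1, -1)   # any order is safe when nothing overlaps
    else:
        version = "serial"
        order = range(n)               # overlap possible: keep the original order
    for i in order:
        out[i] = out[i + m] + 1
    return out, version
```

For example, update([1, 2, 3, 4, 5, 6, 7, 8], 4, 4) selects the parallel form, while the same call with an offset of 1 selects the serial form. If the relationship were asserted in the source, only one of the two branches would need to be generated. The sketch assumes that m is non-negative and that a has at least n + m elements.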

Move to loop do-3100 by clicking the Next Loop button.

In loop do-3100, the author has added an assertion answering the question, and PFA has generated just one version of the loop, the one that runs in parallel. The menu next to the questions for the previous loop will generate such an assertion.

Move to loop do-3200 by clicking the Next Loop button.

Permutation Vectors

Loop do-3200 has a construct known as a permutation vector. In it, an array is referenced by an index value contained in another array. If the B(I) values are all distinct, the iterations do not depend on each other, and the loop can be parallelized; if the same value occurs in more than one B(I), it cannot. PFA asks the question but leaves the loop serial. Note that both the question and the data dependence message have associated highlighting buttons.
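The following Python sketch illustrates the difference between distinct and repeated index values. It is a hypothetical model of an indexed store, not the code of do-3200, and the function names are invented for the example. As elsewhere, changing the iteration order stands in for parallel execution.

```python
def scatter(values, index, size, order):
    """Store values[i] at position index[i], visiting i in the given order."""
    out = [0] * size
    for i in order:
        out[index[i]] = values[i]
    return out


def has_distinct_values(index):
    """True when no index value repeats, so no two iterations write the same element."""
    return len(set(index)) == len(index)


values = [10, 20, 30]

distinct = [2, 0, 1]
d_forward = scatter(values, distinct, 3, [0, 1, 2])   # [20, 30, 10]
d_backward = scatter(values, distinct, 3, [2, 1, 0])  # [20, 30, 10]

repeated = [0, 0, 1]
r_forward = scatter(values, repeated, 3, [0, 1, 2])   # [20, 30, 0]
r_backward = scatter(values, repeated, 3, [2, 1, 0])  # [10, 30, 0]
```

With distinct index values the result is the same in either order. With a repeated value, whichever iteration runs last determines the stored element, so the result depends on the order.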

Move to loop do-3300 by clicking the Next Loop button.

Here an assertion has been added that the index array, B(I), is indeed a permutation vector, and the loop is parallelized.

Move to loop do-4000 by clicking the Next Loop button.

Complex Loops and Loop Nests

Finally, let's look at somewhat more complicated, nested loops.

Doubly-nested Loops and Interchanges

Loop do-4000 is the outer loop of a pair of loops; it runs in parallel, and the inner loop runs serially, because one parallel loop cannot be nested inside another. Also note that the outer loop is not unrolled, but the inner loop is.

Move to loop do-4010 by clicking the Next Loop button to show the inner loop, and then click Next Loop again to select the outer loop of the next pair.

Note that this outer loop, loop do-4100, is shown as serial inside a parallel loop, and the following loop is parallel. How can this be? It happens because PFA has recognized that the two loops can be interchanged, and furthermore, that the CPU cache is likely to be more efficiently used if the loops are run in the interchanged order.
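Loop interchange itself is easy to picture with a Python sketch. The functions and loop body below are hypothetical and are not the code of do-4100. The two versions visit the same set of (i, j) pairs and differ only in which index varies fastest.

```python
def fill_original(rows, cols):
    """Outer loop over i, inner loop over j."""
    a = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            a[i][j] = i * 10 + j
    return a


def fill_interchanged(rows, cols):
    """Same body with the loops swapped: outer loop over j, inner loop over i."""
    a = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        for i in range(rows):
            a[i][j] = i * 10 + j
    return a
```

Both functions build the same array because each element is computed independently of the others. In Fortran, where arrays are stored in column-major order, making the first subscript vary fastest walks through memory contiguously, and that is the kind of cache benefit an interchange aims for. The Python lists here do not model that memory layout.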

Move to loop do-4110 to show the inner loop, and then click the Next Loop button once again to move to the following triply-nested loop.

Triply-nested Loops and Strip-mining

The next set of loops is a triply-nested matrix multiply. Just as PFA optimized a doubly-nested pair of loops by interchanging the loops, it will do even more to get optimal cache performance by “strip-mining” a triply-nested loop. In this case, different sections of the matrix will be processed by different threads, so that the threads will not cause cache conflicts among themselves.

The outer original loop, do-5000, is interchanged, unrolled, and split into block and strip loops, in a fairly complicated way; it is transformed into ten loops. The middle loop has part of its work in a second-level unrolled loop, and part of it in parallelized third-level loops. The inner loop is shown as unparallelizable, although it is actually preferably serial. (This is a bug in the current version of WorkShopProMPF.) Do not be surprised if the code seems difficult to understand; the strip-mining transformation is very complex and confusing.
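A much simplified picture of the block and strip structure is given by the Python sketch below. It is not a reproduction of the ten loops PFA generates: the function names and the block size of 2 are assumptions, and no unrolling or parallel execution is modeled. Each original loop is split into an outer loop that steps from block to block and an inner loop that runs within one block.

```python
def matmul_plain(a, b):
    """Straightforward triply-nested matrix multiply."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product computed block by block (hypothetical block size)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer three loops step from block to block.
    for ii in range(0, n, block):
        for jj in range(0, p, block):
            for kk in range(0, m, block):
                # Inner three loops stay inside one block of each matrix.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, p)):
                        for k in range(kk, min(kk + block, m)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

Working on one block at a time keeps the data in use small enough to stay in cache, and distinct blocks of the result can be assigned to different threads. The sketch assumes non-empty, rectangular matrices of compatible shape.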

Use the Next Loop button to first step to the middle of the three, loop do-5010, and then the inner one, loop do-5020. Notice how each of the loops is transformed into various combinations of loops at different nesting levels.

This brings you to the end of your examination of the loops under analysis. In the next section, you will find out how to modify your source code using the Parallel Analyzer.

Modifying Source Files

So far, you've ignored the controls that can be used to change the source file and allow a subsequent pass of PFA to do a better job. Now you will go back and make changes. There are two steps in modifying source files:

  1. Asking for the changes using the Parallel Analyzer View controls.

  2. Actually modifying the files and rebuilding the program and its analysis files.

Asking for Changes

You may ask for changes by answering any of the questions that PFA poses, by building a DOACROSS for a specific loop, by modifying the analysis parameters that PFA uses for its processing, or by adding or deleting assertions or directives. In this sample session, you will request changes to loops in the order they appear in the file, but they may be requested in any order.

Changing the PFA Analysis Parameters

Scroll to the top of the loop list and select the first loop, which was unrolled four times. Pull down the Views menu and select “PFA Analysis Parameters View” to open the PFA Analysis Parameters View. Locate the line that reads:

Unroll:

Figure 2-26. Changing a PFA Analysis Parameter


Enter 6 into the numeric field next to it (it contains 4 by default). First click the field and then type <Backspace> followed by 6. This changes the loop unrolling from 4 to 6. Note the turned-down corner in the text field as shown in Figure 2-26. Clicking this corner toggles between the old and new values in the field.
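To see what the unrolling value controls, consider this small Python sketch. It is only an illustration: the function names are hypothetical, and an inner loop stands in for the replicated loop body that a real unrolled loop would contain.

```python
# Illustrative sketch only: names are hypothetical; this mimics the idea
# of unrolling, not the code PFA emits.

def split_trips(n, factor):
    """Return (full unrolled trips, leftover iterations) for n iterations."""
    return n // factor, n % factor


def sum_rolled(x):
    """The original loop: one element per trip."""
    total = 0.0
    for v in x:
        total += v
    return total


def sum_unrolled(x, factor):
    """Process `factor` elements per trip, then finish any leftovers."""
    n = len(x)
    total = 0.0
    main_end = n - (n % factor)
    for i in range(0, main_end, factor):
        # In a real unrolled loop the body is written out `factor` times;
        # this inner loop stands in for that replication.
        for k in range(factor):
            total += x[i + k]
    # Cleanup loop for the iterations that do not fill a whole trip.
    for i in range(main_end, n):
        total += x[i]
    return total


x = [float(i) for i in range(23)]
```

With 23 iterations, an unrolling of 4 gives five full trips and three leftover iterations, while an unrolling of 6 gives three full trips and five leftovers. The result is the same either way.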

Close the View by pulling down the Admin menu and selecting “Close.” Notice that a red plus sign now appears in the icon next to the loop, indicating that a change has been requested for it as shown in Figure 2-27. Move to loop do-1100 by clicking the Next Loop button.

Figure 2-27. Effect of Changes on the Loop List Display


Building a Custom DOACROSS Directive

Figure 2-28. DOACROSS Menu


Loop do-1100 was left serial because it was too small; sometimes you might want such a loop parallelized anyway. Go to the Loop Status option menu (to the left of the loop status icon in the loop information display that reads “Default”), and select “C$DOACROSS...” as shown in Figure 2-28. This brings up the Parallelization Control View (see Figure 2-29), showing the loop that was selected, a parallelized condition input field into which you can type a condition for parallelization, an MP scheduling option menu, an MP chunk size input field, and a list of all the variables in the loop, with an icon indicating whether the variable was read, written, or both. (These icons are described in the Icon Legend.) Notice that each variable has a highlighting button that shows its use within the loop. Notice also the red plus sign next to this loop in the main view.
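To give a feel for what a chunk size means when a loop's iterations are divided among threads, here is a minimal Python sketch. The round-robin policy and the function name are assumptions made for illustration; this is not a description of how the MP runtime implements any particular scheduling option.

```python
# Illustrative sketch only: this round-robin policy and the function name
# are assumptions, not a description of the MP runtime.

def interleave_chunks(n_iters, n_threads, chunk):
    """Deal fixed-size chunks of iterations to threads in round-robin order."""
    assignment = [[] for _ in range(n_threads)]
    for c, start in enumerate(range(0, n_iters, chunk)):
        stop = min(start + chunk, n_iters)
        assignment[c % n_threads].extend(range(start, stop))
    return assignment


plan = interleave_chunks(n_iters=10, n_threads=2, chunk=3)
```

Here the first thread receives iterations 0 through 2 and 6 through 8, and the second receives 3 through 5 and 9. A larger chunk size means fewer, larger pieces of work per thread.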

Dismiss the View by pulling down the Admin menu and selecting “Close.”

Figure 2-29. Parallelization Control View for Loop do-1100


Adding a New Assertion

Now you will add an assertion to a loop. Find the loop with ID do-2700 by using the search feature of the loop list. Go to the search field, and enter 2700. You can double-click the highlighted line in the loop list to select the loop.

You're going to add a concurrent call assertion. To add the assertion, pull down the Operations menu, pull down the Add Assertion submenu, and select “C*$*ASSERT CONCURRENT CALL.”

This adds an assertion that the call to RTC(), which PFA thought to be an obstacle to parallelization, is actually safe to parallelize. When you add the assertion, the loop information display updates to show the new assertion, along with its menu labeled “Insert” as shown in Figure 2-30.
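The promise you make with such an assertion is that the called routine can safely run in several iterations at once. The following Python sketch shows the idea with a stand-in function; pure_work is hypothetical and has nothing to do with the real RTC() routine. Because it touches no shared state, running the iterations concurrently gives the same answers as running them in order.

```python
# Illustrative sketch only: pure_work is a hypothetical stand-in for a
# routine that is safe to call from several iterations at once.
from concurrent.futures import ThreadPoolExecutor


def pure_work(i):
    """No shared state is read or written, so calls cannot interfere."""
    return i * i


def run_serial(n):
    return [pure_work(i) for i in range(n)]


def run_concurrent(n, workers=4):
    # Executor.map returns results in iteration order, whichever
    # thread happened to compute each one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(pure_work, range(n)))
```

If the routine did update shared data without protection, the assertion would be false and the parallel loop could produce wrong results, so add it only when you know the call is safe.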

Figure 2-30. Adding an Assertion


Answering a Question

Now try answering a question. Put the cursor into the search field, backspace to remove the previous contents, and enter 3200 into the field. Select that loop by double-clicking. Loop do-3200 has a question about a permutation vector. Pull down the option menu next to the question in the loop information display, and select “Assert True” as shown in Figure 2-31.
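Why does it matter whether an index array is a permutation vector? The Python sketch below, with hypothetical names and data, stores through an index array in two different iteration orders. When no index value repeats, the order does not matter, so the iterations can run in parallel. When a value repeats, two iterations write the same element and the result depends on which runs last.

```python
# Illustrative sketch only: names and data are hypothetical.

def is_permutation_vector(b):
    """True if no index value appears twice in b."""
    return len(set(b)) == len(b)


def scatter(n, b, x, order):
    """Perform a(b(i)) = x(i) for each i, visiting i in the given order."""
    a = [0.0] * n
    for i in order:
        a[b[i]] = x[i]
    return a


x = [10.0, 20.0, 30.0, 40.0]
forward = [0, 1, 2, 3]
backward = [3, 2, 1, 0]

b_perm = [2, 0, 3, 1]   # every target index is distinct
b_dup = [2, 0, 2, 1]    # index 2 is written twice
```

Answering “Assert True” tells PFA that the array in question behaves like b_perm, not like b_dup.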

Figure 2-31. Answering a Question


Deleting an Existing Assertion

Now let's delete an existing assertion. Move to loop do-3300 using the Next Loop button, and go to the “ASSERT PERMUTATION(B)” assertion. Pull down its option menu and select “Delete”. Figure 2-32 shows the result. The same procedure can be used for directives.

Figure 2-32. Deleting an Assertion


Updating the File

Now you have made a set of changes and can update the file. Select “Update All Files” from the Update menu (see Figure 2-33); alternatively, you may use the keyboard accelerator for this operation by typing Ctrl-U with the cursor anywhere in the main view. The Parallel Analyzer View will generate a sed script to modify the source, rename the original file to one with the suffix .old, run sed on that file to produce a new version of the file dummy.f, and then spawn the WorkShop Build Manager to rerun PFA on the new version of the file.

Figure 2-33. Update All Files


The Parallel Analyzer View can also open a gdiff window showing the changes, but by default it does not. If you select the toggle labeled “Run gdiff After Update” from the Update menu, it will do so. If you have selected it, use the right mouse button to step through the changes, and then quit gdiff. If you always wish to see the gdiff window, you can set the resource in your .Xdefaults file:

cvpav*gDiff: True

Figure 2-34. Setting the Run Editor Toggle


The Parallel Analyzer View can also open an editor for you to make additional changes after running sed. To do so, select the toggle labeled “Run Editor After Update” in the Update menu (see Figure 2-34). If you do so, an xwsh window with vi running in it opens after you update the file.

If you always wish to run the editor, you can set the resource in your .Xdefaults file:

cvpav*runUserEdit: True

If you prefer a different window shell or a different editor, you can change the resource in your .Xdefaults file, changing the xwsh and/or vi as you prefer:

cvpav*userEdit: xwsh -e vi %s +%d

The +%d tells vi at what line to position itself in the file and is replaced with 1 by default (you can also omit the +%d parameter if you wish). The edited file's name will either replace any explicit %s, or if the %s is omitted, the file name will be appended to the command.
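The substitution rules just described can be summarized in a short Python sketch. The helper name expand_edit_command is hypothetical, and this only mirrors the behavior described above; it is not the Parallel Analyzer View's own code.

```python
# Illustrative sketch only: expand_edit_command is a hypothetical helper
# that mirrors the substitution rules described in the text.

def expand_edit_command(template, filename, line=1):
    """Fill in %d with a line number and %s with the file name.

    If the template has no %s, the file name is appended instead.
    """
    command = template.replace("%d", str(line))
    if "%s" in command:
        return command.replace("%s", filename)
    return command + " " + filename


default_cmd = expand_edit_command("xwsh -e vi %s +%d", "dummy.f")
no_line_cmd = expand_edit_command("xwsh -e vi %s", "dummy.f")
appended_cmd = expand_edit_command("xwsh -e vi +%d", "dummy.f")
```

For the default resource value, the expanded command is xwsh -e vi dummy.f +1.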

After you quit from the gdiff window and/or editor (if you have selected them), the program will spawn the WorkShop Build Manager. When it comes up, verify that the directory shown is the directory in which you are running the sample session; if not, change it. Then, click the Build button, and it will start to reprocess the changed file.

Examining the Modified File

When the build completes, the Parallel Analyzer View will update to reflect the changes that were made. You will now examine the new version of the file to see the effect of the changes requested above.

Unroll Change

Click the Next Loop button twice to select the first loop. Notice that loop do-1000 is now shown as being unrolled six times, not four as it was before. Also the loop has a directive, implementing the change in unrolling that was requested.

Move to loop do-1100 by clicking the Next Loop button.

New Custom DOACROSS

Loop do-1100 previously was serial because it had too little work in it, but is now parallel because it was explicitly parallelized.

New Assertion

Go to the search field and enter 2700. Double-click the line and notice that loop do-2700, which previously was unparallelizable because of the call to RTC(), is now parallel. It also has the assertion that was added.

Answered Question

Clear the search field, enter 3200 in it, and double-click the selected line. Notice that loop do-3200 now has an assertion in it, added as a result of your reply to the question. The loop is also now parallelized.

Move to loop do-3300 by clicking the Next Loop button.

Deleted Assertion

Loop do-3300 previously had the assertion that B was a permutation vector; note that the assertion is gone, and PFA now asks the question.

Examining Subroutines That Use PCF Directives

PCF directives are not supported by the current 32-bit PFA processor. If you put them into your code, they will be treated as comments, rather than properly interpreted. The six loops, do-6001 through do-6006, are processed with the directives ignored. To see the effect of the directives, see “Examining Subroutines That Use PCF Directives” in Chapter 3.

Examining a Subroutine That Contains Syntax Errors

The PFA preprocessor does not provide error messages in the analysis file to show what the syntax errors were, so WorkShopProMPF cannot show them. The routine itself is shown with the error indicator for it, but no highlighting button or messages will appear. To understand the errors, look at the listing file, dummy.l, in the directory. More information is provided in the 64-bit tutorial in Chapter 3.

Exiting from the Dummy Sample Session

This completes the first sample session. Quit the Parallel Analyzer View by pulling down the Admin menu and selecting “Exit.”

To clean up the directory, so that the session can be rerun, enter:

% make clean

in your shell window. All of the generated files will be removed.

Setting Up the linpackd Sample Session

The second sample session is a brief demonstration of the integration of WorkShopProMPF and the WorkShop performance tools. It requires that WorkShop also be installed.

Go to the subdirectory linpack in the /usr/demos/WorkShopMPF directory and run make:

% cd /usr/demos/WorkShopMPF/linpack
% make

This will update the directory by compiling the source program linpackd.f and creating the necessary files. The performance experiment you will use is already there. This operation will take a few minutes.

Starting the Parallel Analyzer View

Once the directory has been updated, start the demo by typing:

% cvpav -e linpackd 

from within the directory (note that the flag is -e, not -f as in the previous sample session). The main window of the Parallel Analyzer View will open, showing the list of loops in the program.

Scroll briefly through the list and bring up the source by clicking the Source button. Note that there are many unparallelized loops, but there is no way to know which are important. Also note that the second line in the main view shows that there is no performance experiment currently associated with the view.

Starting the Performance Analyzer

Start the Performance Analyzer by pulling down the Admin menu, selecting the Launch Tool submenu, and selecting “Performance Analyzer,” as shown in Figure 2-35.

The main window of the Performance Analyzer will open, although it will be empty. A small window labeled “Experiment:” will also open at the same time. This window is used to enter the name of an experiment. For this session, we will use the prerecorded experiment that is installed. Type:

test.linpack.cpu 

in the “Experiment Dir:” field in the Experiment: window, and click the OK button. See Figure 2-35. The Performance Analyzer will show a busy cursor, fill in its main window with the list of functions, and highlight the function main().

For more information about the Performance Analyzer and how it affects the user interface, see the Performance Analyzer User's Guide.

Figure 2-35. Starting the Performance Analyzer


Using the Parallel Analyzer with Performance Data

At the same time the Performance Analyzer window fills in, the Parallel Analyzer recognizes that there is now a performance analyzer, and posts a busy cursor with a message “Loading Performance Data.” When the message goes away, performance data will have been imported by the Parallel Analyzer, and a number of changes will have taken place as shown in Figure 2-36:

Figure 2-36. Performance Data — Parallel Analyzer View


  • The second column of the list of loops has changed from reading “Workload” to reading “Perf. Cost”, and the numbers below it are now percentages.

  • The second line in the view now shows the name of the performance experiment and shows the total cost of the run. In addition, the sort menu's second entry “Sort by Perf. Cost” is no longer grayed-out.

  • The Source View now has three additional columns to the left of the loop brackets that show the performance metrics, including the number of times the line has been executed and ideal CPU times as shown in Figure 2-37. Depending on the metric, the times shown are exclusive, inclusive, ideal, or CPU times, in milliseconds.

    These columns reflect the measured performance data. If you select loop do-30 of subroutine DAXPY from the main view, the Source View displays as shown in Figure 2-37.

    Figure 2-37. Source View for Performance Experiment


Select the “Sort by Perf. Cost” entry. Note that the top three lines now show three loops that represent approximately 85%, 82%, and 81% of the total time. These numbers are inclusive numbers, with each reflecting the time in the loop and in any nested loops or functions called from within the loop. See Figure 2-38.

Figure 2-38. Sort by Performance Cost


The first of these loops contains the second loop nested inside it. The second loop calls the subroutine DAXPY, which contains the third loop. The third loop is the heart of the linpack benchmark and is already parallel.
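The relationship between inclusive and exclusive costs can be sketched in Python as follows. The numbers are made up for illustration, chosen only so that the inclusive totals resemble the percentages mentioned above; they are not taken from the experiment, and the data layout is hypothetical.

```python
# Illustrative sketch only: the costs below are invented for the example
# and are not measurements from the linpackd experiment.

def inclusive(node):
    """Inclusive cost = the node's own cost plus everything nested in it."""
    return node["exclusive"] + sum(inclusive(c) for c in node["children"])


# The third loop (inside DAXPY) does nearly all of the work itself.
third = {"exclusive": 81.0, "children": []}
# The second loop calls DAXPY, so the third loop's cost counts toward it.
second = {"exclusive": 1.0, "children": [third]}
# The first loop contains the second loop.
first = {"exclusive": 3.0, "children": [second]}

costs = [inclusive(first), inclusive(second), inclusive(third)]
```

This is why three nested loops can each appear to account for most of the run: the outer figures largely repeat the cost of the innermost loop.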

Double-click the third loop. Note that the loop information display now contains an additional line of text listing the performance cost of the loop, both in time and as a percentage of the total time. See Figure 2-39.

Figure 2-39. Loop Information Display with Performance Data


Exiting from the linpackd Sample Session

This completes the second sample session. Quit by selecting the “Exit” command from the Project submenu of the Admin menu in the Parallel Analyzer View. All the windows will close.

You don't need to clean the directory, because you haven't made any changes in this session. If you do make changes, when you are finished you can clean up the directory by entering:

% make clean

in your shell window. All generated files will be removed.

Setting Up the f90 Sample Session

The f90 sample session is located in the directory /usr/demos/WorkShopMPF/cgdriver. Prepare for the session by changing directories to the demo directory and creating the needed files:

% cd /usr/demos/WorkShopMPF/cgdriver 
% make 

Once the demo directory has been prepared, start the session by entering:

% cvpav -f cgdriver.f

Notice that the loop list contains Fortran 90 array syntax statements. Double-click the first statement in CGTEST (b = 0). The loop information display shows that the array syntax statement is an implied loop and that it was converted from array notation into a serial loop.

Click the Source button. Notice that in the Source View, Fortran 90 array syntax statements (in the subroutine CGTEST) are bracketed in blue (they are shown as loops). Click the Transformed Source button to see the transformation that PFA has performed. Because b is a three-dimensional array that is initialized to 0, the transformed source contains three nested DO loops, one for each dimension.
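The correspondence between a whole-array statement and explicit loops can be sketched in Python. This is only an analogy with hypothetical function names and array sizes; the loop order shown is an assumption, and the Transformed Source window shows what PFA actually produced.

```python
# Illustrative sketch only: names, sizes, and loop order are hypothetical.

def zero_whole_array(shape):
    """Counterpart of the array statement b = 0: one whole-array operation."""
    n1, n2, n3 = shape
    return [[[0.0 for _ in range(n3)] for _ in range(n2)] for _ in range(n1)]


def zero_nested_loops(b, shape):
    """The same assignment written as three explicit loops, one per dimension."""
    n1, n2, n3 = shape
    for k in range(n3):
        for j in range(n2):
            for i in range(n1):
                b[i][j][k] = 0.0
    return b


shape = (2, 3, 4)
b = [[[1.0 for _ in range(shape[2])] for _ in range(shape[1])]
     for _ in range(shape[0])]
looped = zero_nested_loops(b, shape)
```

Both forms leave every element of the array set to zero; the array statement simply leaves the loops implicit.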

Exiting from the f90 Sample Session

This completes the third sample session. Quit the Parallel Analyzer View by selecting “Exit” from the Admin menu.

To clean up the directory, so that the session can be rerun, enter:

% make clean

in your shell window. All of the generated files will be removed.