This chapter provides three interactive sample sessions that demonstrate most of the Parallel Analyzer View's features for the 32-bit version of MPF. These sessions also demonstrate various aspects of parallelization and the use of the POWER Fortran Accelerator (PFA).
The sample sessions consist of a step-by-step examination of three sample programs. The sample sessions cover the following:
The dummy sample session is designed to show the various types of FORTRAN loops, how they are transformed by PFA, and how they are displayed by the Parallel Analyzer View. The sample session begins at “Setting Up the Dummy Sample Session”.
The linpackd sample session briefly illustrates how the Parallel Analyzer View can be used in conjunction with the WorkShop Performance Analyzer cvperf. The sample session begins at “Setting Up the linpackd Sample Session”.
The f90 sample session briefly illustrates how to use MPF with Fortran-90 code. The sample session begins at “Setting Up the f90 Sample Session”.
To use these sample sessions, the subsystem WorkShopMPF_sw.demos must be installed.
Note: These sample sessions are applicable for the 32-bit compilers only. For a discussion of the 64-bit version of the compilers, see Chapter 3, “Analyzing Loops: 64-bit Sample Sessions.”
The Parallel Analyzer View comes with a demonstration directory, /usr/demos/WorkShopMPF. It contains a subdirectory, tutorial, which holds a source file called dummy.f_orig and a Makefile. The source file contains 27 DO loops, each of which exemplifies one aspect of the parallelization process. Running make in that directory creates a scratch copy of the demonstration program, dummy.f, and then runs PFA on the copy. PFA produces a transformed source file dummy.m, a listing file dummy.l, and an “analysis” file dummy.anl.
Prepare for the session by opening a shell window and entering make in the /usr/demos/WorkShopMPF/tutorial directory:
% cd /usr/demos/WorkShopMPF/tutorial
% make
Once the demo directory has been prepared, start the session by entering:
% cvpav -f dummy.f
The main window of the Parallel Analyzer View opens, displaying the list of loops in the source file, dummy.f. Position the view at the upper left of the screen.
Note: If you receive a message related to licensing, refer to the NetLS License System Administration Guide or WorkShopProMPF Release Notes.
Figure 2-1 shows the Parallel Analyzer View with an alternative color scheme. To start a session in these colors, enter cvpav -scheme Potrero -f dummy.f. The black and white figures in the hard copy version of this guide were prepared using the Grayscale scheme. Another scheme used in this book is IndigoMagic.
The loop list display shows information about each loop in the program with an icon next to it that reflects the parallelization status of the loop. Pull down the Admin menu and select “Icon Legend...” to bring up a legend dialog box that explains the meaning of the various icons (see Figure 2-2). Move the legend dialog box to the side, and scroll through the list of loops to see the various icons. When you are done, close the legend dialog box by clicking the Close button in the lower right of the dialog box.
The loop list display contains the following items:
Underneath the list display is a search field and a set of option menus and buttons that control the display of information in the loop list.
You can sort the list either in the order of the source code, or by loop workload, or (if you are running a performance experiment on the program using the WorkShop Performance Analyzer) by performance cost. You control sorting with the option menu to the left below the list.
For other sorting, the list is not indented. Select “Sort by Workload” and notice the Loop-ID is no longer indented (see Figure 2-4). (The same is true of “Sort by Perf. Cost”. It is grayed out because there is no performance tool running at this time.) When you are done, select “Sort in Source Order” once again.
You may want to look at only some of the loops in large programs. The list can be filtered in two ways: by parallelization status or by origin of the loop.
The parallelization status filtering is controlled by an option menu centered below the list. It initially reads “Show All Loop Types”.
You can filter the list to show only those loops that cannot be parallelized, those that are parallel, or those that are serial (see Figure 2-5).
Another way to filter is to show loops that come from a single file or a single subroutine:
Open the Subroutines and Files View by pulling down the Views menu and selecting “Subroutines and Files View.” Alternatively, you may use the keyboard accelerator for this operation by typing Ctrl-F with the cursor anywhere in the main view. A subsidiary view opens that lists the subroutines and files in the fileset (see Figure 2-6).
From the Filter option menu (see Figure 2-7), select “Filter by File.”
Double-click the line for the file dummy.f in the function/file list of the Subroutines and Files View window. The name will appear in the filtering text field labeled Title: (see Figure 2-8) and the list will be rescanned. Similarly, you may try selecting “Filter by Subroutine” from the main view option menu and double-clicking the line for subroutine DUMMY in the Subroutines and Files View.
For this example, there is only one file and one subroutine, so the filtering is not very useful, but for large programs with many files and subroutines, it would be. When you are done, display all of the loops in the sample source file once again by selecting “No Filtering” from that option menu.
You won't be needing the Subroutines and Files View further, so close it by pulling down the Admin menu and selecting “Close.”
The Parallel Analyzer View gives you access to views of both your original Fortran source and the source as it is transformed by the POWER Fortran Accelerator.
Click the Source button to the left side of the main view to bring up the Source View, as shown in Figure 2-9. This view is the same Source View that is used in the WorkShop Debugger and Performance Analyzer.
When the source display opens, position it to the right of the main view. (On machines with low-resolution screens, the windows will overlap.) Scroll up and down in the file and observe that the source window displays colored brackets that mark the location of each loop. These colors match the colors of the parallelization icons and serve to indicate the parallelization status of each loop at a glance. The color indicates which loops are parallelized, which are unparallelizable, and which are left serial.
PFA is a source-to-source translator that takes the various loops in the program and transforms them both for scalar optimization and for parallelization. Each loop may be rewritten into one, two, or more transformed loops or may be combined with others or optimized away. The result of these transformations is a transformed source file that you may examine.
Click the Transformed Source button. Another source window labeled “Parallel Analyzer View — Transformed Source” opens as shown in Figure 2-10.
Position it below the Source View. Scroll through it, and notice that it, too, has bracketing marking the loops. The bracketing for the transformed source cannot always distinguish between serial loops and unparallelizable loops, so some unparallelizable loops will be displayed as serial (for example, those with data dependencies).
Each line in the loop list summarizes some information about a loop. Much more information is available, and this section will show you how to examine it.
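To get more information about a loop, you must select it, either by double-clicking its line in the loop list or by clicking its loop bracket in one of the source views.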
Selecting a loop has a number of effects:
The previously empty display below the list fills with information on the selected loop.
The Source View scrolls to the selected loop and highlights the source code of the loop.
The Transformed Source window highlights the first of the loops into which the original selected loop was transformed and displays a bright vertical bar next to each transformed loop that came from the original loop.
If the Transformed Loops View or the PFA Analysis Parameters View is open, it too will be switched to show the selected loop. We will look at these views later. See Figure 2-11.
In this figure and many of those following, the loop list is resized to reduce the number of loops displayed. The adjustment button is in the lower right hand corner of the loop list display, just above the loop information display. Your screen shows the full list unless you resize it.
Try scrolling through the list and double-clicking various loops, and scrolling through the source displays and clicking the loop brackets to select loops. Note that when you select each loop, its icon acquires a check mark showing that you've looked at it. When you are done, scroll to the top of the loop list in the main view and double-click the first loop's line.
The loop information display occupies the lower half of the main view (see Figure 2-12). It contains detailed information about the currently selected loop. It consists of a series of lines in several blocks.
The first line of the display is labeled Parallelization Controls:. On the far right, the first line shows how many transformed loops were derived from the selected loop. When the session is run with a performance experiment, an additional block appears above the Parallelization Controls. It gives performance information for the loop (shown in Figure 2-39). Since we do not have an experiment on this program (which does not, in fact, execute), the performance information is absent.
Below this are two option menus, the first controlling parallelization status and the second controlling the loop MP scheduling (it is shown for all loops, but is applicable to parallel loops only), and a text input field for adding an expression for the scheduling chunk size. Text labels to the right of the option menus list the current values for parallelization and scheduling.
Below the first separator line appear up to five blocks of additional information. These are lists of:
questions that PFA asked about the loops, if any
obstacles to parallelization, if any
assertions made about the loop, if any
directives applied to the loops, if any
messages about the loop, if any
A new view comes up, listing each of the parameters with a numeric input field to the right of each of them. Entering a new numeric value in the input field will request a change to the loop. Don't do this now; close the view by pulling down the View's Admin menu and selecting “Close.”
You can also see detailed information about the transformed loops coming from a particular loop (see Figure 2-16). To do so, pull down the Views menu and select “Transformed Loops View.” Alternatively, you may use the keyboard accelerator for this operation by typing Ctrl-T with the cursor anywhere in the main view.
When the view opens, position it at the left of the screen, below the main view. It contains information about the loops into which the currently selected original loop was transformed. Each transformed loop has a block of information associated with it, and the blocks are separated by horizontal lines.
The first line in each block contains a parallelization status icon, a highlighting button, and the ID of the transformed loop. (The ID is assigned by PFA.) The button, if clicked, highlights the transformed loop in the Transformed Source window and the original loop in the Source View.
The next two lines describe the transformed loop. The first provides information such as whether it is a primary loop (directly transformed from the selected original loop) or secondary loop (transformed from a different original loop but incorporating some code from the selected original loop), its parallelization state, whether it is an ordinary loop or interchanged loop, its nesting level, and workload. The second line displays the location of the loop in the transformed source.
Following the description lines is a list of messages generated by PFA, if any. To the left of the message lines are buttons, and clicking them will highlight the part of the original source that relates to the message. Often it is the first line of the original loop that is shown, since the message refers to the entire loop.
For the currently selected loop (do-1000), the original loop was transformed into two loops, one that runs parallelized and one that runs serial. As the messages state, the original loop was unrolled 4 times, and a cleanup loop was added. Unrolling is described in “Loop Unrolling”.
Transformed loops can also be selected. By default, the first of the transformed loops is selected when the view is brought up, and the transformed source is highlighted to show it. At the same time, the color highlighting of the original source changes, although the lines highlighted have not. See Figure 2-17. You will later see that for loops with more extensive transformations the highlighted lines will be different (for example, loops do-1300 and do-1350, the fused loops).
Now click the button for the second transformed loop. The transformed source will highlight a different region (the cleanup loop), but the original source will highlight the same lines as before, as shown in Figure 2-18. This is because when a transformed loop is selected, those lines in the original source that go into the transformed loop will be highlighted. In this case, the same lines go into both the transformed loops. Transformed loops may also be selected by clicking the corresponding loop brackets in the Transformed Source window.
You may either leave this window open or close it by selecting the “Close” command from its File menu.
Now that you have familiarized yourself with the basic windows in the Parallel Analyzer View's user interface, you can start examining and analyzing loops. First you will look at a few simple loops, next at loops with obstacles to parallelization, then at loops for which PFA asks questions, and finally at more complex, nested loops.
The six loops you will examine in this section are the simplest kind of Fortran loop.
Scroll the list of loops back to the top and select loop do-1000. As the two messages state, this loop is transformed into two loops, one an unrolled, parallelized loop, and the second a clean-up loop for unrolling. (Unrolling is discussed in “Loop Unrolling”.)
Move to loop do-1100 by clicking the Next Loop button.
Loop do-1100 is preferably serial, because the amount of work done is too little to justify the parallelization overhead. Unlike the previous loop, the iteration count is known, so the total work can be computed. See Figure 2-19.
Also note that this loop is unrolled as the previous one was but that no cleanup loop is needed because the count is known to be a multiple of the unrolling.
Move to loop do-1200 by clicking the Next Loop button.
Loop do-1200 is parallelized because it contains an explicit C$DOACROSS directive; PFA will pass the directive through in the transformed source but does nothing further with the loop, as the messages indicate. See Figure 2-20.
The loop status option menu is set to “C$DOACROSS...” and it is shown with a highlighting button. Clicking the button will bring up both the Source View and the Parallelization Control View, which shows more information about the parallelization directive. If you have clicked on the button, close the Parallelization Control View by pulling down its Admin menu and selecting “Close.” You will come back to the use of this view later. See “Building a Custom DOACROSS Directive”. Close the Source View by pulling down its File menu and selecting “Close.”
The C$DOACROSS directive is displayed with a highlighting button. Click it, and the Source View comes up. Notice the highlighting of the directive in the source. See Figure 2-21.
Move to loop do-1300 by clicking the Next Loop button.
Loop do-1300 is the first of two loops that can be fused. That is, the loops have the same bounds, and the code in the body of the two loops is independent, so they can be combined to save the loop overhead. Even when a loop has been fused, the Source View is highlighted to show only the selected loop, not the other loops that have been fused with it.
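As a rough illustration of fusion, the following Python sketch merges two loops that have the same bounds and independent bodies. It is not PFA output; the function names, arrays, and loop bodies are invented for the example and are not taken from dummy.f.

```python
def separate_loops(b, c):
    """Two loops over the same range with independent bodies."""
    n = len(b)
    a = [0.0] * n
    d = [0.0] * n
    for i in range(n):          # first loop
        a[i] = b[i] + 1.0
    for i in range(n):          # second loop, same bounds
        d[i] = c[i] * 2.0
    return a, d


def fused_loop(b, c):
    """The same work in a single loop, paying the loop overhead once."""
    n = len(b)
    a = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        a[i] = b[i] + 1.0       # body of the first loop
        d[i] = c[i] * 2.0       # body of the second loop
    return a, d


b = [1.0, 2.0, 3.0]
c = [4.0, 5.0, 6.0]
print(separate_loops(b, c) == fused_loop(b, c))
```

Fusion is safe here only because neither body reads anything the other writes.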
Notice that in the Transformed Source window, the highlighted loop has the bodies of the two original loops interleaved, and replicated for unrolling (see Figure 2-22). Click the bracket next to the loop in the transformed source. Now you see that the lines highlighted in the original source come from both loops. Then click the bracket for the loop below it in the transformed source (the cleanup loop for unrolling) and see that it, too, highlights source from both loops.
Move to loop do-1350 by clicking the Next Loop button. Loop do-1350 is the other half of the fused pair. Its icon indicates that it was fused, and the highlighting in the transformed source indicates that it was transformed into the same pair of loops as the previous one.
Move to loop do-1400 by clicking the Next Loop button.
Unrolling is done to reduce the loop overhead relative to the real work of the loop. The simpler the body of the loop, the more profitable unrolling can be. In many cases, the loop iteration count is not known, so an additional loop, called a cleanup loop, is necessary to handle the last few iterations. Sometimes, the iteration count is known but is not a multiple of the unrolling; in such cases, PFA will usually explicitly add code for the last few iterations.
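The shape of an unrolled loop and its cleanup loop can be sketched as follows. This is a Python illustration with an invented loop body and hypothetical function names, not the code PFA generates; the unrolling factor of 4 matches the factor reported for loop do-1000.

```python
def scale_rolled(a, factor):
    """Original form: one element per trip through the loop."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] * factor
    return out


def scale_unrolled_by_4(a, factor):
    """Unrolled form: four elements per trip, plus a cleanup loop."""
    n = len(a)
    out = [0.0] * n
    main_end = n - n % 4            # largest multiple of 4 not exceeding n
    for i in range(0, main_end, 4):
        out[i] = a[i] * factor
        out[i + 1] = a[i + 1] * factor
        out[i + 2] = a[i + 2] * factor
        out[i + 3] = a[i + 3] * factor
    for i in range(main_end, n):    # cleanup loop for the last n % 4 elements
        out[i] = a[i] * factor
    return out


data = [float(i) for i in range(10)]
print(scale_unrolled_by_4(data, 2.0) == scale_rolled(data, 2.0))
```

When the iteration count is known to be a multiple of 4, the cleanup loop runs zero times and can be dropped, which corresponds to the case described for loop do-1100.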
Loop do-1400 is the same as the first loop in the program, but a directive “SCALAR OPTIMIZE(1)” has been added. The loop is not unrolled. By default, the scalar optimization parameter is set to 3, which allows loop unrolling.
Move to loop do-1500 by clicking the Next Loop button.
Loop do-1500 is an example of a loop so unnecessary that PFA can get rid of it entirely. First, PFA sees that the body of the loop is independent of the loop, so it can be promoted out, and the loop eliminated. Then it sees that the body sets a variable that is not subsequently used, so it can throw that out, too. The transformed source is not scrolled and highlighted because nothing is there. Scroll down a few lines from the previous loop, and note the absence of the code for the loop that was optimized away.
The loop also has a directive controlling scalar optimization, but it is there only to reset the default for subsequent loops.
Move to loop do-2000 by clicking the Next Loop button.
There are a number of reasons that a loop may not be parallelized. The following loops illustrate several of these reasons, along with variants that allow parallelization. You will step through each of them in turn.
Loop do-2000 is an example of a loop that cannot be parallelized because of a data dependence. In this case, one element of an array is used to set another. (This construct is called a recurrence.) If the loop were to be parallelized, the iterations might execute out of order, and iteration 4, which sets A(4) to A(5), might occur after iteration 5, which would have reset the value of A(5). Consequently, the program would give the wrong answer. See Figure 2-23.
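The effect of executing such a recurrence out of order can be seen in a small Python sketch. The statement a[i] = a[i + 1] is assumed from the description above (the Fortran source is not reproduced here, and Python indices start at 0 rather than 1); reversing the iteration order stands in for an unlucky parallel schedule.

```python
def shift_left(a, order):
    """Apply a[i] = a[i + 1] for each i, visiting i in the given order."""
    a = list(a)                 # work on a copy
    for i in order:
        a[i] = a[i + 1]
    return a


data = [10, 20, 30, 40, 50]
in_order = shift_left(data, range(4))                # serial order
out_of_order = shift_left(data, reversed(range(4)))  # later iterations first

print(in_order)
print(out_of_order)
```

In serial order each element receives its neighbor's original value; in the reversed order the last value propagates through the whole array, so the two results differ.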
There is a line listing the obstacle to parallelization; click the button that accompanies it. Two kinds of highlighting take place. The first is a line highlight showing the relevant line that has the dependence, and the second is a symbol (or token) highlight that shows the uses of the variable that is the obstacle to parallelization. Only the uses of the variable within the loop are highlighted.
Move to loop do-2100 by clicking the Next Loop button.
Not all loops with similar constructs are unparallelizable. Loop do-2100 is similar to loop do-2000, but the array elements used differ by an offset, M. If M is equal to NSIZE, for example, and the array is twice NSIZE, the code is actually copying the upper half of the array into the lower half, and there is no reason why that cannot be run in parallel. PFA cannot recognize this from the source, but the author has added an assertion that there is no recurrence, so the loop is parallelized. See Figure 2-24. Click the highlighting button to show the assertion.
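One way to see why the offset matters is to compare the set of elements the loop writes with the set it reads. The Python sketch below does this for an assumed statement a[i] = a[i + m]; the function names are hypothetical, and the check is a conservative illustration, not the analysis PFA performs.

```python
def index_sets_overlap(nsize, m):
    """True if the elements read intersect the elements written (conservative)."""
    writes = set(range(nsize))                # a[i] is written
    reads = {i + m for i in range(nsize)}     # a[i + m] is read
    return bool(writes & reads)


def copy_with_offset(a, nsize, m, order):
    """Apply a[i] = a[i + m] for each i in the given iteration order."""
    a = list(a)
    for i in order:
        a[i] = a[i + m]
    return a


nsize = 4
data = list(range(2 * nsize))   # array twice as long as the loop range

# m == nsize: the upper half is copied into the lower half.
forward = copy_with_offset(data, nsize, nsize, range(nsize))
backward = copy_with_offset(data, nsize, nsize, reversed(range(nsize)))
print(index_sets_overlap(nsize, nsize), forward == backward)

# m == 1: reads and writes overlap, so iteration order matters.
print(index_sets_overlap(nsize, 1))
```

Because the value of M is not visible in the loop itself, this kind of check cannot be done from the source alone, which is why the assertion is needed.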
Move to loop do-2200 by clicking the Next Loop button.
Data dependence can involve more than one line of a program. In loop do-2200, a similar dependence occurs, but the use of the variable occurs on a different line than its setting. Click the highlight button on the obstacle line, and note that both lines receive the line highlighting, and the token highlighting shows the dependency variable on the two lines (see Figure 2-25). Of course, real programs can, and typically do, have far more complex dependencies than this.
Move to loop do-2300 by clicking the Next Loop button.
Loop do-2300 shows a data dependence that is called a reduction. In a reduction, the variable responsible for the data dependence is being accumulated or “reduced” in some fashion. Reductions can be summation, multiplication, or a minimum or maximum determination. For summation, as shown in this loop, PFA could accumulate partial sums in each processor, and then add the partial sums at the end. However, because floating-point arithmetic is inexact, the order of addition might give different answers because of round-off error.
This does not imply that the serial execution answer is “correct” and the parallel execution answer is “incorrect”; they are equally valid within the limits of round-off error. Since, by default, PFA assumes it is not OK to introduce round-off error, the loop is left serial. PFA does, however, have a parameter to allow you to say that such round-off error is OK.
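The order sensitivity of a floating-point sum can be demonstrated with a short Python sketch. The values are chosen to exaggerate the effect, and the chunking scheme is an assumption made for the illustration, not a description of how PFA distributes a reduction.

```python
def serial_sum(values):
    """Accumulate left to right, as a serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, nchunks):
    """Accumulate a partial sum per chunk, then add the partial sums."""
    size = (len(values) + nchunks - 1) // nchunks
    partials = [serial_sum(values[k:k + size])
                for k in range(0, len(values), size)]
    return serial_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]
print(serial_sum(values))        # one rounding pattern
print(chunked_sum(values, 2))    # a different rounding pattern
```

With these values neither ordering yields the mathematically exact sum of 2.0; each simply rounds at different points.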
Move to loop do-2400 by clicking the Next Loop button.
In loop do-2400, the author has added a directive controlling round-off error. The same loop that was left serial above is now parallelized. Click the button for the directive, and you can see how it is highlighted in the source. Refer to the PFA manual for a more detailed explanation of the meaning and use of this directive. The round-off setting will be left at this value for the remainder of the program.
Move to loop do-2500 by clicking the Next Loop button.
Loop do-2500 has an input/output (I/O) operation in it. It cannot be parallelized, because the output would appear in a different order depending on the scheduling of the individual CPUs. Click the button indicating the obstacle, and note the highlighting of the print statement. Also note that the transformed source shows that this loop is not unrolled, either. Actually, there is no real obstacle to unrolling, but it is not done because the cost of performing the I/O operation is so great compared to the loop iteration overhead that the savings gained are not worth the increase in the size of the program.
Move to loop do-2600 by clicking the Next Loop button.
Loop do-2600 has a premature exit; that is, it cannot be determined at compilation time how many iterations will take place. If PFA did parallelize it, one thread might execute iterations past the point where another has determined to exit the loop.
Click the button indicating the premature exit. Note that the line with the exit from the loop is highlighted in the source.
Move to loop do-2700 by clicking the Next Loop button.
Loop do-2700 is also unparallelizable, because there is a call to a routine, RTC, and PFA cannot determine whether or not that call will have side effects. Click the obstacle line. Note the highlighting of the line containing the call and the subroutine name. Also note that the loop is not unrolled, as the presence of the call inhibits unrolling.
Move to loop do-2800 by clicking the Next Loop button.
Although loop do-2800 has a similar subroutine call in it, it can be parallelized because the author has asserted that the call has no side effects that will prevent it from running concurrently. Click the assertion line to highlight the source line containing the assertion.
When you are done, move to loop do-3000 by clicking the Next Loop button.
Sometimes PFA can parallelize a loop more efficiently if it knows more information than it can infer from the source. In these cases, PFA asks questions that appear in the loop information display for the loop, along with a menu that allows you to answer the question.
PFA can sometimes parallelize a loop if it can be told the relationship between variables in the program. Although you may know such relationships from the nature of the physical problem the program is dealing with, PFA cannot safely infer the information just from the code.
Loop do-3000 can be parallelized if it is known that the iterations do not overlap, but not otherwise. PFA will ask three questions, although for this type of construct, it actually generates code to determine the relationship at run time, and the program will execute one of the two sequences depending on that determination. You can see this by observing that the loop was transformed into four loops, one pair of unroll/cleanup loops when it can be parallelized, and a second when it cannot. Look at the transformed source code for each of these pairs.
For any such questions, the line asking them has an associated option menu that will allow you to answer. The generated code will be correct even if you do not answer or do not know. If PFA knows the answer, it can omit the alternate form and produce a tighter program.
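The idea of generating two versions and choosing between them at run time can be sketched in Python as follows. The loop body, the test m >= n, and the function name are all assumptions made for the illustration; the actual test that PFA generates is not shown in this chapter.

```python
def update(a, n, m):
    """Apply a[i] = a[i + m] + 1 for i in range(n), choosing a version at run time.

    Assumes m >= 0 and len(a) >= n + m. Returns the name of the version used.
    """
    if m >= n:
        # Reads (a[m:m+n]) and writes (a[0:n]) do not overlap, so the
        # iterations are independent and could run in any order.
        a[:n] = [a[i + m] + 1 for i in range(n)]
        return "independent"
    # Otherwise fall back to the original serial order.
    for i in range(n):
        a[i] = a[i + m] + 1
    return "serial"


x = list(range(8))
print(update(x, 4, 4), x)

y = list(range(8))
print(update(y, 4, 1), y)
```

If the relationship is asserted in the source, only one branch is needed, which corresponds to the tighter program mentioned above.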
Move to loop do-3100 by clicking the Next Loop button.
In loop do-3100, the author has added an assertion answering the question, and PFA has generated just one version of the loop, the one that runs in parallel. The menu next to the questions for the previous loop will generate such an assertion.
Move to loop do-3200 by clicking the Next Loop button.
Loop do-3200 has a construct known as a permutation vector. In it, an array is referenced by an index value contained in another array. If the B(I) values are all distinct, the iterations do not depend on each other, and the loop can be parallelized; if the same value occurs in more than one B(I), it cannot. PFA asks the question but leaves the loop serial. Note that both the question and the data dependence message have associated highlighting buttons.
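The role of distinct index values can be illustrated with a Python sketch. The scatter statement a[index[i]] = values[i] is assumed from the description above, the function names are hypothetical, and a reversed iteration order again stands in for a different parallel schedule.

```python
def scatter(values, index, size, order):
    """Apply a[index[i]] = values[i] for each i in the given order."""
    a = [0] * size
    for i in order:
        a[index[i]] = values[i]
    return a


def all_distinct(index):
    """True if no index value occurs more than once."""
    return len(set(index)) == len(index)


values = [10, 20, 30, 40]
n = len(values)

perm = [2, 0, 3, 1]             # every target element is written once
print(all_distinct(perm),
      scatter(values, perm, n, range(n)) == scatter(values, perm, n, reversed(range(n))))

dup = [2, 0, 2, 1]              # element 2 is written twice
print(all_distinct(dup),
      scatter(values, dup, n, range(n)) == scatter(values, dup, n, reversed(range(n))))
```

With a repeated index value, the final content of that element depends on which iteration runs last.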
Move to loop do-3300 by clicking the Next Loop button.
Here an assertion has been added that the index array, B(I), is indeed a permutation vector, and the loop is parallelized.
Move to loop do-4000 by clicking the Next Loop button.
Finally, let's look at somewhat more complicated, nested loops.
Loop do-4000 is the outer loop of a pair of loops; it runs in parallel, while the inner loop runs serially, because one parallel loop cannot be nested inside another. Also note that the outer loop is not unrolled, but the inner loop is.
Move to loop do-4010 by clicking the Next Loop button to show the inner loop, and then click Next Loop again to select the outer loop of the next pair.
Note that this outer loop, loop do-4100, is shown as serial inside a parallel loop, and the following loop is parallel. How can this be? It happens because PFA has recognized that the two loops can be interchanged, and furthermore, that the CPU cache is likely to be more efficiently used if the loops are run in the interchanged order.
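Loop interchange itself is a simple reordering, as the following Python sketch shows. The loop body and function names are invented for the example. In Fortran, arrays are stored so that the first subscript varies fastest, so making the first subscript the inner loop index walks through memory contiguously; Python lists do not reproduce that cache effect, and the sketch only demonstrates that the result is unchanged.

```python
def add_original(a, b):
    """Outer loop over the first subscript, inner loop over the second."""
    n, m = len(a), len(a[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            c[i][j] = a[i][j] + b[i][j]
    return c


def add_interchanged(a, b):
    """The same body with the two loops swapped."""
    n, m = len(a), len(a[0])
    c = [[0] * m for _ in range(n)]
    for j in range(m):
        for i in range(n):
            c[i][j] = a[i][j] + b[i][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[10, 20, 30], [40, 50, 60]]
print(add_original(a, b) == add_interchanged(a, b))
```

The swap is legal here because each element of c depends only on the corresponding elements of a and b.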
Move to loop do-4110 to show the inner loop, and then click the Next Loop button once again to move to the following triply-nested loop.
The next set of loops is a triply-nested matrix multiply. Just as PFA optimized a doubly-nested pair of loops by interchanging the loops, it will do even more to get optimal cache performance by “strip-mining” a triply-nested loop. In this case, different sections of the matrix will be executed by different threads, so that the threads will not cause cache conflicts among themselves.
The outer original loop, do-5000, is interchanged, unrolled, and split into block and strip loops, in a fairly complicated way; it is transformed into ten loops. The middle loop has part of its work in a second-level unrolled loop, and part of it in parallelized third-level loops. The inner loop is shown as unparallelizable, although it is actually preferably serial. (This is a bug in the current version of WorkShopProMPF.) Do not be surprised if the code seems difficult to understand; the strip-mining transformation is very complex and confusing.
Use the Next Loop button to first step to the middle of the three, loop do-5010, and then the inner one, loop do-5020. Notice how each of the loops is transformed into various combinations of loops at different nesting levels.
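The general idea behind blocking a matrix multiply can be sketched in Python as below. This is a textbook-style blocked multiply with a hypothetical block size, written only to show how block loops wrap the original three loops; it is much simpler than the ten loops PFA produces and does not model the unrolling or the assignment of blocks to threads.

```python
def matmul_plain(a, b):
    """Triply nested matrix multiply."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_blocked(a, b, block=2):
    """The same multiply, visiting the matrices one block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for ii in range(0, n, block):            # block loops
        for jj in range(0, m, block):
            for pp in range(0, k, block):
                # loops over the elements inside one block
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, m)):
                        for p in range(pp, min(pp + block, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul_blocked(a, b) == matmul_plain(a, b))
```

Integer matrices are used so that the comparison is exact; with floating-point data a different accumulation order could change the rounding.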
This brings you to the end of your examination of the loops under analysis. In the next section, you will find out how to modify your source code using the Parallel Analyzer.
So far, you've ignored the controls that can be used to change the source file and allow a subsequent pass of PFA to do a better job. Now you will go back and make changes. There are two steps in modifying source files:
Asking for the changes using the Parallel Analyzer View controls.
Actually modifying the files and rebuilding the program and its analysis files.
You may ask for changes by answering any of the questions that PFA poses, by building a DOACROSS for a specific loop, by modifying the analysis parameters that PFA uses for its processing, or by adding or deleting assertions or directives. In this sample session, you will request changes to loops in the order they appear in the file, but they may be requested in any order.
Scroll to the top of the loop list and select the first loop, which was unrolled four times. Pull down the Views menu and select “PFA Analysis Parameters View” to open the PFA Analysis Parameters View. Locate the line that reads:
Unroll:
Close the View by pulling down the Admin menu and selecting “Close.” Notice that a red plus sign now appears in the icon next to the loop, indicating that a change has been requested for it as shown in Figure 2-27. Move to loop do-1100 by clicking the Next Loop button.
Dismiss the View by pulling down the Admin menu and selecting “Close.”
Now you will add an assertion to a loop. Find the loop with ID do-2700 by using the search feature of the loop list. Go to the search field, and enter 2700. You can double-click the highlighted line in the loop list to select the loop.
You're going to add a concurrent call assertion. To add the assertion, pull down the Operations menu, pull down the Add Assertion submenu, and select “C*$*ASSERT CONCURRENT CALL.”
This adds an assertion that the call to RTC(), which PFA thought to be an obstacle to parallelization, is actually safe to parallelize. When you add the assertion, the loop information display updates to show the new assertion, along with its menu labeled “Insert” as shown in Figure 2-30.
Now try answering a question. Put the cursor into the search field, backspace to remove the previous contents, and enter 3200 into the field. Select that loop by double-clicking. Loop do-3200 has a question about a permutation vector. Pull down the option menu next to the question in the loop information display, and select “Assert True” as shown in Figure 2-31.
Now let's delete an existing assertion. Move to loop do-3300 using the Next Loop button, and go to the “ASSERT PERMUTATION(B)” assertion. Pull down its option menu and select “Delete”. Figure 2-32 shows the result. The same procedure can be used for directives.
Now you have made a set of changes and can update the file. Select “Update All Files” from the Update menu (see Figure 2-33); alternatively, you may use the keyboard accelerator for this operation by typing Ctrl-U with the cursor anywhere in the main view. The Parallel Analyzer View will generate a sed script to modify the source, rename the original file to one with the suffix .old, run sed on that file to produce a new version of the file dummy.f, and then spawn the WorkShop Build Manager to rerun PFA on the new version of the file.
The Parallel Analyzer View can also open a gdiff window showing the changes, but by default it does not. If you select the toggle labeled “Run gdiff After Update” from the Update menu, it will do so. If you have selected it, use the right mouse button to step through the changes, and then quit gdiff. If you always wish to see the gdiff window, you can set the resource in your .Xdefaults file:
cvpav*gDiff: True
If you always wish to run the editor, you can set the resource in your .Xdefaults file:
cvpav*runUserEdit: True
If you prefer a different window shell or a different editor, you can change the resource in your .Xdefaults file, changing the xwsh and/or vi as you prefer:
cvpav*userEdit: xwsh -e vi %s +%d
The +%d tells vi at what line to position itself in the file and is replaced with 1 by default (you can also omit the +%d parameter if you wish). The edited file's name will either replace any explicit %s, or if the %s is omitted, the file name will be appended to the command.
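The substitution rules for the userEdit resource can be illustrated with a small shell function. The function name expand_user_edit is hypothetical and this is not how the Parallel Analyzer View implements the expansion; it only mimics the rules stated above (replace %s with the file name or append the name if %s is missing, and replace %d with a line number that defaults to 1).

```shell
#!/bin/sh
# expand_user_edit TEMPLATE FILE [LINE]
# Hypothetical helper that mimics the userEdit substitution rules.
# Caveat: file names containing '|' or '&' would confuse the sed commands.
expand_user_edit() {
    template=$1
    file=$2
    line=${3:-1}                            # %d becomes 1 by default

    case $template in
        *%s*) ;;                            # explicit %s: substitute in place
        *)    template="$template %s" ;;    # no %s: append the file name
    esac

    # Replace %d with the line number, then %s with the file name.
    printf '%s\n' "$template" |
        sed -e "s|%d|$line|g" -e "s|%s|$file|g"
}

expand_user_edit 'xwsh -e vi %s +%d' dummy.f    # prints: xwsh -e vi dummy.f +1
expand_user_edit 'xterm -e vi' dummy.f          # prints: xterm -e vi dummy.f
```

The second call uses xterm only as an example of a different window shell; any terminal and editor pair can be placed in the resource.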
After you quit from the gdiff window and/or editor (if you have selected them), the program will spawn the WorkShop Build Manager. When it comes up, verify that the directory shown is the directory in which you are running the sample session; if not, change it. Then, click the Build button, and it will start to reprocess the changed file.
When the build completes, the Parallel Analyzer View will update to reflect the changes that were made. You will now examine the new version of the file to see the effect of the changes requested above.
Click the Next Loop button twice to select the first loop. Notice that loop do-1000 is now shown as being unrolled six times, not four as it was before. The loop also has a directive that implements the requested change in unrolling.
Move to loop do-1100 by clicking the Next Loop button.
Loop do-1200 previously was serial because it had too little work in it, but is now parallel because it was explicitly parallelized.
Go to the search field and enter 2700. Double-click the line and notice that loop do-2700, which previously was unparallelizable because of the call to RTC(), is now parallel. It also has the assertion that was added.
Clear the search field, enter 3200 in it, and double-click the selected line. Notice that loop do-3200 now has an assertion in it, added as a result of your reply to the question. The loop is also now parallelized.
Move to loop do-3300 by clicking the Next Loop button.
PCF directives are not supported by the current 32-bit PFA processor. If you put them into your code, they are treated as comments rather than interpreted. The six loops, do-6001 through do-6006, are processed with the directives ignored. To see the effect of the directives, see “Examining Subroutines That Use PCF Directives” in Chapter 3.
The PFA preprocessor does not write error messages to the analysis file to show what the syntax errors were, so WorkShopProMPF cannot show them. The routine itself is shown with an error indicator, but no highlighting button or messages appear. To understand the errors, look at the listing file, dummy.l, in the directory. More information is provided in the 64-bit tutorial.
This completes the first sample session. Quit the Parallel Analyzer View by pulling down the Admin menu and selecting “Exit.”
To clean up the directory, so that the session can be rerun, enter:
% make clean
in your shell window. All of the generated files will be removed.
The second sample session is a brief demonstration of the integration of WorkShopProMPF and the WorkShop performance tools. It requires that WorkShop also be installed.
Go to the subdirectory linpack in the /usr/demos/WorkShopMPF directory and run make:
% cd /usr/demos/WorkShopMPF/linpack
% make
This will update the directory by compiling the source program linpackd.f and creating the necessary files. The performance experiment you will use is already there. This operation will take a few minutes.
Once the directory has been updated, start the demo by typing:
% cvpav -e linpackd
from within the directory (note the flag is -e, not -f as in the previous sample session). The main window of the Parallel Analyzer View will open, showing the list of loops in the program.
Scroll briefly through the list and bring up the source by clicking the Source button. Note that there are many unparallelized loops, but there is no way to know which are important. Also note that the second line in the main view shows that there is no performance experiment currently associated with the view.
Start the Performance Analyzer by pulling down the Admin menu, selecting the Launch Tool submenu, and selecting “Performance Analyzer,” as shown in Figure 2-35.
The main window of the Performance Analyzer will open, although it will be empty. A small window labeled “Experiment:” will also open at the same time. This window is used to enter the name of an experiment. For this session, we will use the prerecorded experiment that is installed. Type:
test.linpack.cpu
in the “Experiment Dir:” field in the Experiment: window, and click the OK button. See Figure 2-35. The Performance Analyzer will show a busy cursor, fill in its main window with the list of functions, and highlight the function main().
For more information about the Performance Analyzer and how it affects the user interface, see the Performance Analyzer User's Guide.
At the same time the Performance Analyzer window fills in, the Parallel Analyzer recognizes that there is now a performance analyzer, and posts a busy cursor with a message “Loading Performance Data.” When the message goes away, performance data will have been imported by the Parallel Analyzer, and a number of changes will have taken place as shown in Figure 2-36:
The second column of the list of loops has changed from reading “Workload” to reading “Perf. Cost”, and the numbers below it are now percentages.
The second line in the view now shows the name of the performance experiment and shows the total cost of the run. In addition, the sort menu's second entry “Sort by Perf. Cost” is no longer grayed-out.
The Source View now has three additional columns to the left of the loop brackets that show the performance metrics, including the number of times the line has been executed and ideal CPU times as shown in Figure 2-37. The times are exclusive, inclusive, ideal, or CPU time in milliseconds.
These columns reflect the measured performance data. If you select loop do-30 of subroutine DAXPY from the main view, the Source View displays as shown in Figure 2-37.
Select the “Sort by Perf. Cost” entry. Note that the top three lines now show three loops that represent approximately 85%, 82%, and 81% of the total time. These numbers are inclusive numbers, with each reflecting the time in the loop and in any nested loops or functions called from within the loop. See Figure 2-38.
The first of these loops contains the second loop nested inside it. The second loop calls the subroutine DAXPY, which contains the third loop. The third loop is the heart of the linpack benchmark and is already parallel.
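Because the three costs are inclusive and the loops are nested as described, subtracting adjacent values gives a rough idea of how much time each level spends outside the level below it. The sketch below uses the approximate percentages quoted above and assumes that each nested level is reached only through the level above it; the variable names are illustrative.

```shell
#!/bin/sh
# Approximate inclusive costs (percent of total run time) quoted above.
outer=85     # first loop: includes the second loop nested inside it
middle=82    # second loop: includes the call to DAXPY
inner=81     # third loop, inside DAXPY

# Cost attributable to each level alone, assuming the nested work is
# reached only through the enclosing level.
outer_only=$((outer - middle))
middle_only=$((middle - inner))

echo "first loop outside the second loop:       about ${outer_only}%"
echo "second loop and DAXPY outside third loop: about ${middle_only}%"
echo "third loop:                               about ${inner}%"
```

Under these assumptions almost all of the time falls in the third loop, which is consistent with it being the heart of the benchmark.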
Double-click the third loop. Note that the loop information display now contains an additional line of text listing the performance cost of the loop, both in time and as a percentage of the total time. See Figure 2-39.
This completes the second sample session. Quit by selecting the “Exit” command from the Project submenu of the Admin menu in the Parallel Analyzer View. All the windows will close.
You don't need to clean the directory, because you haven't made any changes in this session. If you do make changes, when you are finished you can clean up the directory by entering:
% make clean
The f90 sample session is located in the directory /usr/demos/WorkShopMPF/cgdriver. Prepare for the session by changing directories to the demo directory and creating the needed files:
% cd /usr/demos/WorkShopMPF/cgdriver
% make
Once the demo directory has been prepared, start the session by entering:
% cvpav -f cgdriver.f
Notice that the loop list contains Fortran 90 array syntax statements. Double-click the first statement in CGTEST (b = 0). You can see in the loop information display that the array-syntax statement is an implied loop and that it was converted from array notation into a serial loop.
Click the Source button. Notice that in the Source View, Fortran 90 array syntax statements (in the subroutine CGTEST) are bracketed in blue (they are shown as loops). Click the Transformed Source button to see the transformation that PFA has performed. Because b is a 3-dimensional array that is initialized to 0, the transformed source contains 3 nested do loops, each one spanning one dimension.