Chapter 6. Setting Environment Variables

This chapter describes the variables that specify the environment under which your MPI programs will run. Each environment variable has a default value that applies if you do not set it explicitly. You can change some variables to achieve particular performance objectives; others might need to be set for standard-compliant programs to run.

Setting MPI Environment Variables

Table 6-1 describes the MPI environment variables you can set for your programs. Unless otherwise specified, these variables are available for both Linux and IRIX systems.

Table 6-1. MPI Environment Variables

Variable

 

Description

 

Default

MPI_ARRAY
(IRIX systems only)

 

Sets an alternative array name to be used for communicating with Array Services when a job is being launched.

 

Default name set in the arrayd.conf file.

MPI_BAR_DISSEM

 

Specifies the use of the alternate barrier algorithm, dissemination/butterfly, within the MPI_Barrier(3) and MPI_Win_fence(3) functions. This alternate algorithm provides better performance on jobs with larger PE counts and is recommended for jobs with PE counts of 64 or greater.

 

Disabled if the job contains fewer than 64 PEs; otherwise, enabled.

MPI_BUFFER_MAX
(IRIX systems only)

 

Specifies a minimum message size, in bytes, for which the message will be considered a candidate for single-copy transfer. Currently, this mechanism is available only for communication between MPI processes on the same host. The sender data must reside in the symmetric data segment, the symmetric heap, or the global heap.

 

Not enabled.

 

 

If cross mapping of data segments is enabled at job startup, data in common blocks will reside in the symmetric data segment. On systems running IRIX 6.5.2 or later, this feature is enabled by default. You can employ the symmetric heap by using the shmalloc (shpalloc) functions in LIBSMA.

 

 

 

 

Most MPI applications benefit more from buffering of medium-sized messages than from buffering of large messages, even though buffering of medium-sized messages requires an extra copy of data. However, highly synchronized applications that perform large message transfers can benefit from the single-copy pathway.

 

 

MPI_BUFS_PER_HOST

 

Determines the number of shared message buffers (16 Kbytes each) that MPI is to allocate for each host. These buffers are used to send large messages.

 

16 pages (each page is 16 Kbytes)

MPI_BUFS_PER_PROC

 

Determines the number of private message buffers (16 Kbytes each) that MPI is to allocate for each process. These buffers are used to send large messages.

 

16 pages (each page is 16 Kbytes)

MPI_BYPASS_CRC
(IRIX systems only)

 

Adds a checksum to each large message sent via HIPPI bypass. If the checksum does not match the data received, the job is terminated. Use of this environment variable might degrade performance.

 

Not set.

MPI_BYPASS_DEVS
(IRIX systems only)

 

Sets the order for opening HIPPI adapters. The list of devices does not need to be space-delimited (0123 is also valid).

 

0 1 2 3

 

 

An array node usually has at least one HIPPI adapter, the interface to the HIPPI network. The HIPPI bypass is a lower software layer that interfaces directly to this adapter. The bypass sends MPI control and data messages that are 16 Kbytes or smaller.

  

 

 

When you know that a system has multiple HIPPI adapters, you can use the MPI_BYPASS_DEVS variable to specify the adapter that a program opens first. You can use this variable to ensure that multiple MPI programs distribute their traffic across the available adapters. If you prefer not to use the HIPPI bypass, you can turn it off by setting the MPI_BYPASS_OFF variable.

  

 

 

When a HIPPI adapter reaches its maximum capacity of four MPI programs, it is not available to additional MPI programs. If all HIPPI adapters are busy, MPI sends internode messages by using TCP over the adapter instead of the bypass.

  
MPI_BYPASS_OFF
(IRIX systems only)
 

Disables the HIPPI bypass.

 

Not enabled.

MPI_BYPASS_SINGLE
(IRIX systems only)

 

Disables the HIPPI OS bypass multiboard feature. This feature, which is enabled by default, allows MPI messages to be sent over multiple HIPPI connections if multiple connections are available. When you set this variable, MPI operates as it did in previous releases, using a single HIPPI adapter connection, if available.

 

 

MPI_BYPASS_VERBOSE
(IRIX systems only)

 

Allows additional MPI initialization information to be printed in the standard output stream. This information contains details about the HIPPI OS bypass connections and the HIPPI adapters that are detected on each of the hosts.

  

MPI_CHECK_ARGS

 

Enables checking of MPI function arguments. Segmentation faults might occur if bad arguments are passed to MPI, so this is useful for debugging purposes. Using argument checking adds several microseconds to latency.

 

Not enabled.

MPI_COMM_MAX

 

Sets the maximum number of communicators that can be used in an MPI program. Use this variable to increase internal default limits. (May be required by standard-compliant programs.)

 

256

MPI_DIR

 

Sets the working directory on a host. When an mpirun command is issued, the Array Services daemon on the local or distributed node responds by creating a user session and starting the required MPI processes. The user ID for the session is that of the user who invokes mpirun, so this user must be listed in the .rhosts file on the corresponding nodes. By default, the working directory for the session is the user's $HOME directory on each node. You can direct all nodes to a different directory (an NFS directory that is available to all nodes, for example) by setting the MPI_DIR variable to a different directory.

 

$HOME on the node. If using -np or -nt, the default is the current directory.

MPI_DSM_CPUCLUSTER
(IRIX systems only)

 

When set on an Origin 2000 or an Origin 3000 system running IRIX 6.5.11 or later, TOPOLOGY_CPUCLUSTER is used for memory locality domain (mld) placement, and the number of processes mapped to each mld (MPI_DSM_PPM) is set to 1.

 

Not enabled.

MPI_DSM_MUSTRUN
(IRIX systems only)

 

Enforces memory locality for MPI processes. Use of this feature ensures that each MPI process obtains a CPU and physical memory on the node to which it was originally assigned. This variable improves program performance on IRIX systems running release 6.5.7 and earlier, when running a program on a quiet system. With later IRIX releases, under certain circumstances, you do not need to set this variable. Internally, this feature directs the library to use the process_cpulink(3) function instead of process_mldlink(3) to control memory placement.

You should not use MPI_DSM_MUSTRUN when the job is submitted to Miser (see miser_submit(1)) because this might cause the program to hang.

 

Not enabled.

MPI_DSM_OFF
(IRIX systems only)

 

Turns off nonuniform memory access (NUMA) optimization in the MPI library.

 

Not enabled.

MPI_DSM_PPM
(IRIX systems only)

 

Sets the number of MPI processes per memory locality domain (mld). For Origin 2000 systems, values of 1 or 2 are allowed. For Origin 3000 systems, values of 1, 2, or 4 are allowed.

 

Origin 2000 systems, 2; Origin 3000 systems, 4.

MPI_DSM_VERBOSE
(IRIX systems only)

 

Instructs mpirun to print information about process placement for jobs running on NUMA systems.

 

Not enabled.

MPI_GM_ON

 

Enables use of GM (Myrinet) software. MPI attempts to establish Myrinet connections among all hosts involved in the job. If unable to do so, TCP/IP is used for interhost communication.

 

Not enabled.

MPI_GM_VERBOSE

 

Allows some diagnostic information concerning messaging between processes using GM (Myrinet) to be displayed on stderr.

 

Not enabled.

MPI_GROUP_MAX

 

Sets the maximum number of groups that can be used in an MPI program. Use this variable to increase internal default limits. (May be required by standard-compliant programs.)

 

256

MPI_GSN_DEVS
(IRIX 6.5.9 systems or later)

 

Sets the order for opening GSN adapters. The list of devices does not need to be quoted or space-delimited (0123 is valid).

 

MPI will use all available GSN devices.

MPI_GSN_ON
(IRIX 6.5.9 systems or later)

 

Enables use of the GSN (ST protocol) bypass. MPI attempts to establish GSN connections among all hosts in the job. If unable to do so, HIPPI bypass connections will be attempted. If HIPPI is unavailable on all hosts, TCP/IP will be used for interhost communication.

GSN imposes a limit of one MPI process using GSN per CPU on a system. So, for example, on a 128-CPU system, you can run multiple MPI jobs, as long as the total number of MPI processes using the GSN bypass does not exceed 128.

Once the maximum number of MPI processes allowed to use GSN is reached, subsequent MPI jobs return an error to the user output, such as the following:

MPI: gsn_endpoint/st_endpoint: Resource temporarily unavailable

An error will also be printed to the SYSLOG file.

If there are a few CPUs still available, but not enough to satisfy the entire MPI job, the error will still be issued and the MPI job terminated.

 

Not enabled.

MPI_GSN_VERBOSE
(IRIX 6.5.9 systems or later)

 

Allows additional MPI initialization information to be printed in the standard output stream. This information contains details about the GSN (ST protocol) OS bypass connections and the GSN adapters that are detected on each of the hosts.

 

Not enabled.

MPI_MSGS_MAX

 

Controls the total number of message headers that can be allocated. This allocation applies to messages exchanged between processes on a single host, or between processes on different hosts when using the GM (Myrinet) OS bypass protocol. Note that the initial allocation of memory for message headers is 128 Kbytes.

 

Allow up to 64 Mbytes to be allocated for message headers. If you set this variable, specify the maximum number of message headers.

MPI_MSG_RETRIES

 

Specifies the number of times the MPI library attempts to get a message header, if none are available. Each MPI message that is sent requires an initial message header. If one is not available after the specified number of attempts, the job will abort.

 

500

 

 

Note that this variable no longer applies to processes on the same host, or when using the GM (Myrinet) protocol. In these cases, message headers are allocated dynamically on an as-needed basis.

 

 

MPI_MSGS_PER_HOST

 

Sets the number of message headers to allocate for MPI messages on each MPI host. Space for messages that are destined for a process on a different host is allocated as shared memory on the host on which the sending processes are located. MPI locks these pages in memory. Use this variable to allocate buffer space for interhost messages.


Caution: If you set the memory pool for interhost packets to a large value, you can cause allocation of so much locked memory that total system performance is degraded.


 

1024

 

 

The previous description does not apply to processes that use the GM (Myrinet) OS bypass protocol. In this case, message headers are allocated dynamically as needed. See the MPI_MSGS_MAX variable description.

 

 

MPI_MSGS_PER_PROC

 

This variable is effectively obsolete. Message headers are now allocated on an as-needed basis for messaging either between processes on the same host, or between processes on different hosts when using the GM (Myrinet) OS bypass protocol. You can use the new MPI_MSGS_MAX variable to control the total number of message headers that can be allocated.

 

1024

MPI_REQUEST_MAX

 

Sets the maximum number of nonblocking sends and receives that can be active at one time. Use this variable to increase internal default limits. (May be required by standard-compliant programs.)

 

16384

MPI_SHARED_VERBOSE

 

Allows some diagnostic information concerning messaging within a host to be displayed on stderr.

 

Not enabled.

MPI_SLAVE_DEBUG_ATTACH

 

Specifies the MPI process to be debugged. If you set MPI_SLAVE_DEBUG_ATTACH to N, the MPI process with rank N prints a message during program startup, describing how to attach to it from another window using the dbx debugger on IRIX or the gdb debugger on Linux. You must attach the debugger to process N within ten seconds of the printing of the message.

 

Not enabled.

MPI_STATS

 

Enables printing of MPI internal statistics. During MPI_Finalize, each MPI process prints statistics about the amount of data it sent with MPI calls. The statistics are written to stderr. To prefix the statistics messages with the MPI rank, use the -p option on the mpirun command.


Note: Because the statistics-collection code is not thread-safe, this variable should not be set if the program uses threads.


 

Not enabled.

MPI_TYPE_DEPTH

 

Sets the maximum number of nesting levels for derived data types. (May be required by standard-compliant programs.) This variable limits the maximum depth of derived data types that an application can create. MPI logs error messages if the limit specified by MPI_TYPE_DEPTH is exceeded.

 

8 levels

MPI_TYPE_MAX

 

Sets the maximum number of derived data types that can be used in an MPI program. Use this variable to increase internal default limits. (May be required by standard-compliant programs.)

 

1024
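As a sketch of how the variables in Table 6-1 might be applied from a launcher script, the following Python helper builds a child-process environment and passes it to mpirun. The helper names (mpi_environment, run_job), the example values, and the program name are hypothetical and are not part of the MPI software. The only mpirun option used is -np, which the MPI_DIR entry mentions.

```python
import os
import subprocess


def mpi_environment(settings, base=None):
    """Return a copy of the base environment with MPI variables applied.

    settings maps variable names from Table 6-1 to values. Values are
    converted to strings; a value of None removes the variable so that
    the library default applies.
    """
    env = dict(os.environ if base is None else base)
    for name, value in settings.items():
        if not name.startswith("MPI_"):
            raise ValueError(f"not an MPI variable: {name}")
        if value is None:
            env.pop(name, None)
        else:
            env[name] = str(value)
    return env


def run_job(command, settings):
    """Run a launch command (for example mpirun) with the given settings."""
    return subprocess.run(command, env=mpi_environment(settings), check=True)


# Hypothetical usage: raise the communicator limit and use a shared directory.
# run_job(["mpirun", "-np", "4", "./a.out"],
#         {"MPI_COMM_MAX": 512, "MPI_DIR": "/nfs/shared/run"})
```

Building the environment in one place keeps the settings for a job explicit and leaves the caller's own shell environment unchanged.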


Internal Message Buffering in MPI

An MPI implementation can copy data that is being sent to another process into an internal temporary buffer so that the MPI library can return from the MPI function, giving execution control back to the user. However, according to the MPI standard, you should not assume that there is any message buffering between processes because the MPI standard does not mandate a buffering strategy. Some implementations choose to buffer user data internally, while other implementations block in the MPI routine until the data can be sent. These different buffering strategies have performance and convenience implications.

Most MPI implementations do use buffering for performance reasons and some programs depend on it. Table 6-2 illustrates a simple sequence of MPI operations that cannot work unless messages are buffered. If sent messages were not buffered, each process would hang in the initial MPI_Send call, waiting for an MPI_Recv call to take the message. Because most MPI implementations do buffer messages to some degree, a program like this does not usually hang. The MPI_Send calls return after putting the messages into buffer space, and the MPI_Recv calls get the messages. Nevertheless, program logic like this is not valid by the MPI standard.

The SGI implementation of MPI uses buffering under most circumstances. Short messages of 64 or fewer bytes are always buffered. On IRIX systems, longer messages are also buffered unless all of the following conditions hold: the data to be sent resides in a common block, the symmetric heap, or the global shared heap; the sending and receiving processes reside on the same host; the MPI data type on the send side is contiguous; and the message size is equal to or greater than the setting for MPI_BUFFER_MAX (see Table 6-1). Under these circumstances, the receiver copies the data directly into its receive message area without buffering. MPI applications with code segments equivalent to that shown in Table 6-2 will almost certainly deadlock if this bufferless pathway is available.


Note: This feature is not currently available on Linux systems.
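The buffering rules described above can be summarized as a small decision function. The following Python sketch only restates the conditions in the text; it is not the library's actual logic, and the names Send, is_buffered, and the region labels are hypothetical.

```python
from dataclasses import dataclass

SHORT_MESSAGE_LIMIT = 64  # bytes; messages this size or smaller are always buffered

# Hypothetical labels for the memory regions that the text names as eligible.
ELIGIBLE_REGIONS = {"common_block", "symmetric_heap", "global_shared_heap"}


@dataclass
class Send:
    size_bytes: int
    region: str            # where the send data resides (hypothetical label)
    same_host: bool        # sender and receiver are on the same host
    contiguous_type: bool  # the send-side MPI data type is contiguous


def is_buffered(send, buffer_max=None, irix=True):
    """Return True if the message would take the buffered pathway.

    buffer_max models the MPI_BUFFER_MAX setting in bytes; None means the
    variable is not set, so the single-copy pathway is not enabled.
    """
    if send.size_bytes <= SHORT_MESSAGE_LIMIT:
        return True
    if not irix or buffer_max is None:
        return True
    single_copy = (
        send.region in ELIGIBLE_REGIONS
        and send.same_host
        and send.contiguous_type
        and send.size_bytes >= buffer_max
    )
    return not single_copy
```

Under this model, a 64-Kbyte contiguous send from the symmetric heap to a process on the same host is unbuffered when MPI_BUFFER_MAX is 32768, and buffered if any one of the conditions fails.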


Table 6-2. Outline of Improper Dependence on Buffering

Process 1           Process 2

MPI_Send(2,....)    MPI_Send(1,....)

MPI_Recv(2,....)    MPI_Recv(1,....)
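The dependence on buffering in Table 6-2 can be explored with a toy model. The following Python sketch is a simplified simulation, not MPI code and not the SGI implementation. It assumes that a buffered send completes immediately and that an unbuffered send completes only when the peer is waiting in the matching receive. The names runs_to_completion, TABLE_6_2, and REORDERED are hypothetical.

```python
def runs_to_completion(programs, buffered):
    """Return True if every process finishes, False if the job deadlocks.

    programs maps a rank to its ordered list of ("send", peer) and
    ("recv", peer) operations.
    """
    next_op = {rank: 0 for rank in programs}    # index of each rank's next operation
    mailbox = {rank: [] for rank in programs}   # senders of buffered messages per rank

    def current(rank):
        ops = programs[rank]
        return ops[next_op[rank]] if next_op[rank] < len(ops) else None

    progress = True
    while progress:
        progress = False
        for rank in programs:
            op = current(rank)
            if op is None:
                continue
            kind, peer = op
            if kind == "send" and buffered:
                # The message is copied to a buffer and the send returns.
                mailbox[peer].append(rank)
                next_op[rank] += 1
                progress = True
            elif kind == "send" and current(peer) == ("recv", rank):
                # Unbuffered: sender and receiver complete together.
                next_op[rank] += 1
                next_op[peer] += 1
                progress = True
            elif kind == "recv" and peer in mailbox[rank]:
                mailbox[rank].remove(peer)
                next_op[rank] += 1
                progress = True
    return all(next_op[r] == len(programs[r]) for r in programs)


# The sequence from Table 6-2: both processes send before they receive.
TABLE_6_2 = {1: [("send", 2), ("recv", 2)], 2: [("send", 1), ("recv", 1)]}

# One possible repair: process 2 receives before it sends.
REORDERED = {1: [("send", 2), ("recv", 2)], 2: [("recv", 1), ("send", 1)]}
```

In this model, TABLE_6_2 completes only when buffered is True, whereas REORDERED completes in both cases because the program no longer depends on buffering.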