Note: This manual applies only to SGI® UV™ 100, SGI® UV™ 1000, and SGI® UV™ 2000 systems.
The GRU is part of the SGI UV Hub application-specific integrated circuit (ASIC). The UV Hub is the heart of the SGI UV system compute blade. It connects to two Intel® Xeon® processor sockets through the Intel QuickPath Interconnect (QPI) ports and to the high speed SGI NUMAlink® interconnect fabric through NUMAlink ports.
The UV Hub acts as a crossbar between the processors, local SDRAM memory, and the network interface. The Hub ASIC enables any processor in the single-system image (SSI) to access the memory of all processors in the SSI.
The GRU is a coprocessor that is effective at assisting the transfer of data between compute nodes at a higher speed than socket-level instructions. In addition, the GRU has support for the following:
Efficient atomic memory operations (AMOs)
Internode messages sent to message queues located on remote nodes
On partitioned systems, the GRU is able to access memory located on remote single system images (SSIs).
The GRU features are available to both the kernel and to user applications. Message Passing Toolkit (MPT), the SGI MPI library, is a primary user of the GRU. In addition, user applications can directly reference the GRU using header files and libraries that are provided by SGI.
The system architecture for the next generation SGI® UV™ system, the SGI UV 2000, is a sixth-generation NUMAflex® distributed shared memory (DSM) architecture known as NUMAlink 6. On SGI UV 2000 systems, the UV Hub board assembly has a HUB ASIC with two identical hubs. Each hub supports one 8.0 GT/s QPI channel to a processor socket. Each processor socket holds an eight-core Intel Xeon processor. The SGI UV 2000 series Hub has four NUMAlink 6 ports that connect with the NUMAlink 6 interconnect fabric.
In the NUMAlink architecture, all processors and memory can be tied together into a single logical system.
For more information on the SGI UV hub, SGI UV compute blades, QPI, NUMAlink 5, and NUMAlink 6, see the SGI Altix UV 1000 System User's Guide, the SGI Altix UV 100 System User's Guide, or the SGI UV 2000 System User Guide.
The low level SGI UV GRU API provides direct access to the full set of GRU instructions. Most of these instructions are not available through the use of the MPT, SHMEM, or UPC APIs. The full benefit of the GRU is in the ability to have the GRU asynchronously executing instructions in the background while the user application performs other work.
The GRU instruction set and hardware architecture provide the following capabilities:
Provide a large globally addressable memory
Take advantage of the available bandwidth and NUMAlink message efficiency with vector-like instructions
Take advantage of specific hardware mechanisms aimed at reducing network traffic
Expand the reach of the processor cores' limited outstanding references by bringing latency-tolerant remote references into the local hub resources, increasing sustained bandwidth in all but the smallest systems
Improve the apparent processor bus efficiency by compacting strided or random references into cache lines.
Provide efficient synchronization and communication hardware assisted primitives aimed at improving latency for common synchronization and messaging operations (including MPI applications) by reducing the number of network traversals between GRU users and target references in system memory.
Provide fast remote copy initiated by a CPU and performed by the GRU asynchronously
Provide scatter-gather, fast barriers, and AMO support
Provide external TLB with large page support
To access and use the SGI UV GRU direct access API, you need to install the following RPMs on your SGI UV system:
xpmem-devel
gru-devel
gru_alloc-devel
libgru-devel
Note: These RPMs are not installed by default.
Message Passing Interface (MPI), SHMEM, and Unified Parallel C (UPC) are high-level APIs and programming models, implemented and supported by SGI, that provide access to GRU functionality. For more information, see the mpi(1), shmem(3), or sgiupc(1) man pages, the Message Passing Toolkit (MPT) User Guide, and the Unified Parallel C (UPC) User Guide.
The Direct GRU Access API has four components, as follows:
GRU resource allocators
The GRU resource allocator functions manage the GRU resources so that independent software components in the same program can access the GRU without oversubscribing those resources.
GRU memory access functions
The GRU memory access functions perform GRU operations such as memory reads, memory writes, memory-to-memory copies, and atomic memory operations.
XPMEM address mapping functions
The XPMEM address mapping functions set up mappings to target memory throughout the system into local GRU-mapped virtual addresses.
MPT address mapping functions
The MPT address mapping functions are a layer on top of XPMEM, and expose mapped memory regions already set up for MPI and SHMEM to the user application.
The UV global reference unit (GRU) has control block (CB) and data segment (DSEG) resources associated with it. User applications need to allocate CB resources and usually DSEG resources for use in GRU memory access functions.
There are two categories of GRU resources used by any thread: temporarily and permanently allocated. A program starts running with all the available GRU resources being in the temporary pool until some resources are allocated permanently via the gru_pallocate() function.
The preferred way to get access to all the GRU temporary CBs and DSEG is through the use of the lightweight gru_temp_reserve() and gru_temp_release() functions. These functions should wrap any use of the GRU memory access functions, with an exception to be described later.
#include <gru_alloc.h>
void gru_temp_reserve(gru_alloc_thdata_t *gat);
typedef struct {
    gru_segment_t *gruseg;
    gru_control_block_t *cbp;
    void *dsegp;
    int cb_cnt;
    int dseg_size;
} gru_alloc_thdata_t;
The gru_alloc_thdata_t structure returned from this function will describe the GRU resources available for use until the next call to gru_temp_release().
The following code example calls gru_temp_reserve() to reserve the GRU resources, issues the GRU memory access function gru_gamirr(), and calls gru_wait_abort() to wait for the operation to complete. It then calls gru_temp_release() to release the temporary GRU resources.
Example 1-1. GRU Memory Access Function (gru_gamirr())
gru_alloc_thdata_t gat;
gru_temp_reserve(&gat);
gru_gamirr(gat.cbp, EOP_IRR_DECZ, address, XTYPE_DW, IMA_CB_DELAY);
gru_wait_abort(gat.cbp);
gru_temp_release();
The effect of the gru_temp_reserve() and gru_temp_release() functions is thread-private, so related POSIX threads or OpenMP threads can execute the above sequence concurrently.
An alternative allocation scheme is permanent allocation. The gru_pallocate() function returns CB and DSEG resources that can be used at any time thereafter. This can simplify the allocation strategy, but it has the disadvantage of reducing the number of GRU resources that can be used by other software. For example, gru_bcopy() lets you pass a DSEG work buffer of any size, and its achieved bandwidth is higher with larger DSEG work buffers.
See the gru(7) man page for a complete list of GRU man pages. You can also use the man gru command to view these pages.
The following functions are used to create and manage user access to the GRU. Each function has an associated man(3) page. For a list of GRU man pages, refer to the SEE ALSO section at the bottom of the gru(7) man page.
gru_create_context()
Creates a GRU context to allow a user access to the GRU
gru_get_data_pointer()
Gets a pointer to a GRU data segment
gru_wait_abort()
Waits for an active GRU instruction to complete. Aborts on error
gru_set_context_blade_chiplet()
Selects GRU blade and chiplet for context
gru_unload_context()
Unloads a GRU context
gru_check_status()
Checks the completion status of a GRU instruction
gru_wait()
Waits for an active GRU instruction to complete
gru_get_cb_exception_detail_str()
Gets string describing a GRU instruction exception
gru_print_cb_detail()
Prints detailed error information for GRU instruction failure
gru_flush_tlb()
Flushes a virtual address range from the GRU
gru_create_message_queue()
Creates a GRU message queue
gru_abort()
Causes abnormal process termination due to GRU instruction error
gru_destroy_message_queue()
Frees resources allocated to a GRU message queue
gru_get_cb_substatus()
Gets the GRU instruction sub-status
gru_get_amo_value()
Gets the AMO value from a GRU control block
gru_get_cb_status()
Gets the GRU instruction status
gru_free_message()
Frees a message from a GRU message queue
gru_get_amo_value_head()
Gets the head value for a GRU message queue AMO
gru_get_amo_value_limit()
Gets the limit value for a GRU message queue AMO
gru_get_next_message()
Gets the next message from a GRU message queue
gru_send_message()
Sends a message to a GRU message queue
gru_start_message()
Sends an asynchronous message to a GRU message queue
gru_wait_message()
Waits for asynchronous message sent to a GRU message queue
gru_mesq_head()
Returns a GRU message queue header value
gru_destroy_context()
Destroys a GRU context and frees the GRU resources
gru_get_thread_gru_segment()
Gets GRU context identifier to use to access a GRU context
gru_get_cb_pointer()
Gets a pointer to a GRU control block
gru_get_tri()
Gets a tri0/tri1 index to a GRU data segment element
For detailed information, see the gru(7) man page.
The GRU memory access functions perform GRU operations that include memory read, memory write, memory-to-memory copies, and atomic memory operations. These functions use an ordinary virtual address or a GRU-mapped virtual address to reference the remote memory. Each function has an associated man(3) page.
The following in-line functions are provided by gru_instructions.h and are used to initiate GRU instructions:
gru_bcopy()
Memory to memory copy using the GRU
gru_bstore()
Stores data from the GRU into system memory
gru_gamer()
GRU unregistered atomic memory operation with explicit data
gru_gamerr()
GRU registered atomic memory operation with explicit data
gru_ivload()
Indirectly load data into the GRU from system memory
gru_ivset()
Indirectly store a defined data value into system memory using the GRU
gru_ivstore()
Indirectly store data from the GRU into system memory
gru_nop()
No operation, cancel active GRU instruction
gru_vflush()
Flush cache lines from processor caches using the GRU
gru_vload()
Load data into the GRU from system memory
gru_vstore()
Store data from the GRU into system memory
gru_gamxr()
GRU unregistered atomic memory operation with extended data
gru_mesq()
Atomically send a message to a message queue using the GRU
gru_vset()
Store a defined data value into system memory using the GRU
For detailed information, see the gru(7) man page.
The interfaces to these functions are viewable in the uv/gru/gru_instructions.h header file installed by the gru-devel RPM.
The following code example of a GRU memory access function illustrates the basic call structure.
Example 1-2. GRU Memory Access Function Basic Call Structure
static inline
void gru_vload(gru_control_block_t *cb, void *mem_addr,
    unsigned int tri0, unsigned char xtype, unsigned long nelem,
    unsigned long stride, unsigned long hints);
Arguments are:
    cb       - pointer to CB
    mem_addr - address of targeted memory
    tri0     - index to DSEG buffer. Compute it using gru_get_tri().
    xtype    - log2 of data type byte size (XTYPE_B ...)
    nelem    - number of elements to transfer
    stride   - memory stride, scaled in elements
    hints    - IMA_CB_DELAY is commonly used
All memory access operations are asynchronous. The wait functions, such as gru_wait_abort(), take the CB handle and wait for the operation to complete.
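To make the xtype, nelem, and stride arguments listed in Example 1-2 concrete, the following sketch models their meaning in plain C on the CPU. It does not call the GRU. The names xtype_for_size() and model_strided_load() are hypothetical and introduced only for illustration. The sketch assumes only what the argument list states: xtype is log2 of the element size in bytes, and stride is counted in elements. The actual XTYPE_* constants are defined in uv/gru/gru_instructions.h and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: log2 of the element size in bytes, or -1 if the
 * size is not 1, 2, 4, or 8. Models the "log2 of data type byte size"
 * encoding described for the xtype argument. */
int xtype_for_size(size_t elem_bytes)
{
    switch (elem_bytes) {
    case 1: return 0;
    case 2: return 1;
    case 4: return 2;
    case 8: return 3;
    default: return -1;
    }
}

/* CPU-side model of a strided load (an assumption, not GRU code):
 * copy nelem elements of (1 << xtype) bytes each from src into the
 * contiguous buffer dst, stepping through src by stride elements. */
void model_strided_load(void *dst, const void *src, int xtype,
                        size_t nelem, size_t stride)
{
    size_t elem_bytes;
    const unsigned char *s = src;
    unsigned char *d = dst;
    size_t i;

    assert(xtype >= 0 && xtype <= 3);
    elem_bytes = (size_t)1 << xtype;

    for (i = 0; i < nelem; i++)
        memcpy(d + i * elem_bytes, s + i * stride * elem_bytes, elem_bytes);
}
```

In this model, loading 4 elements with a stride of 2 gathers source elements 0, 2, 4, and 6 into a contiguous buffer, which is the kind of compaction of strided references that the capability list describes.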
The XPMEM interface can map a virtual address range in one process into the GRU-mapped virtual address space of another process. The XPMEM interface was designed to meet the needs of MPI and SHMEM implementations and provides ways to map any data region. As a GRU API user, you need to find a way to map the needed memory regions into the processes or threads involved. The Linux operating system offers many options for doing this, as follows:
mmap
System V shared memory
memory sharing among pthreads
memory sharing among OpenMP threads
These methods are the likely first choice for most potential GRU users.
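As an illustration of the first option in the list above, the following minimal sketch shares one value between a parent and a child process through an anonymous shared mmap region. It shows only the memory-sharing step on Linux and makes no GRU calls. The helper name share_value_with_child() is hypothetical, and the choice of fork() to create the second process is an assumption made to keep the example self-contained.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS under strict -std modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map one shared long, let a child process store
 * value into it, and return what the parent reads back afterwards.
 * Returns -1 if the mapping, fork, or wait fails. */
long share_value_with_child(long value)
{
    long result = -1;
    pid_t pid;
    long *shared = mmap(NULL, sizeof(*shared), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid = fork();
    if (pid == 0) {
        /* Child: the MAP_SHARED region is the same physical memory. */
        *shared = value;
        _exit(0);
    }

    /* Parent: wait for the child, then read the value it stored. */
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        result = *shared;

    munmap(shared, sizeof(*shared));
    return result;
}
```

Because the region is mapped with MAP_SHARED, the store made by the child is visible to the parent once waitpid() returns.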
The sn/xpmem.h header file installed by the xpmem-devel RPM has interface definitions for all the XPMEM functions.
The following example shows the main XPMEM functions:
Example 1-3. Main XPMEM Functions
extern __s64 xpmem_make_2(void *, size_t, int, void *);
extern int xpmem_remove_2(__s64);
extern __s64 xpmem_get_2(__s64, int, int, void *);
extern int xpmem_release_2(__s64);
extern void *xpmem_attach_2(__s64, off_t, size_t, void *);
extern void *xpmem_attach_high_2(__s64, off_t, size_t, void *);
extern int xpmem_detach_2(void *, size_t size);
extern void *xpmem_reserve_high_2(size_t, size_t);
extern int xpmem_unreserve_high_2(void *, size_t);
For more information on using XPMEM, see the SGI UV Systems Configuration and Operations Guide.
The MPT libmpi library uses XPMEM to cross-map virtual memory between all the processes in an MPI job. Several functions are available to look up mapped virtual addresses that are pre-attached in the virtual address space of a process by MPI. The addresses returned by the lookups may be passed to the GRU library functions.
Not all GRU API users can require their code to execute in an MPI job, but if you can, you may find that the MPT address mapping functions are a convenient way to reference remote data arrays and objects.
The MPT address mapping functions are shown below. They reference ordinary virtual addresses or addresses of symmetric data objects. Symmetric data is static data or arrays, as defined in the intro_shmem(3) man page.
The MPI_SGI_gam_type function is as follows:
#include <mpi_ext.h>
int MPI_SGI_gam_type(int rank, MPI_Comm comm);
The return value is the XPMEM accessibility of the specified rank:
    MPI_GAM_NONE       - not referenceable by load/store or GRU
    MPI_GAM_CPU_NONCOH - Altix 3700 noncoherent
    MPI_GAM_CPU        - referenceable by load/store only
    MPI_GAM_GRU        - referenceable by GRU only
    MPI_GAM_CPU_PREF   - referenceable by either load/store or GRU, load/store preferred
    MPI_GAM_GRU_PREF   - referenceable by either load/store or GRU, GRU preferred
The MPT address mapping functions are influenced by the MPI_GSM_NEIGHBORHOOD environment variable. This variable may be used to specify the "neighborhood size" for shared memory accesses. Contiguous groups of ranks within a host can be considered to be in the same neighborhood. The MPI_GSM_NEIGHBORHOOD variable specifies the size of these neighborhoods, as follows:
MPI processes within a neighborhood will return gam_type MPI_GAM_CPU_PREF.
MPI processes outside a neighborhood but within a host will return gam_type MPI_GAM_GRU_PREF.
MPI processes from a different host within an SGI UV system will return gam_type MPI_GAM_GRU.
When MPI_GSM_NEIGHBORHOOD is not set, the neighborhood size defaults to all ranks in the current host.
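The three neighborhood rules above can be modeled as a small decision function. This is a sketch of the rules as stated, not MPT code. The enum values are hypothetical stand-ins for the constants in mpi_ext.h, classify_peer() is a hypothetical name, and the sketch assumes that neighborhoods are aligned blocks of neighborhood_size consecutive ranks counted from the first rank on each host, which the text does not spell out. A neighborhood_size of 0 or less models the unset case, where the neighborhood covers all ranks on the host.

```c
#include <assert.h>

/* Hypothetical stand-ins for the MPT gam_type constants; the real
 * values come from mpi_ext.h. */
enum model_gam {
    MODEL_GAM_CPU_PREF,  /* same host, same neighborhood */
    MODEL_GAM_GRU_PREF,  /* same host, different neighborhood */
    MODEL_GAM_GRU        /* different host in the same SGI UV system */
};

/* Classify a peer rank relative to the calling rank. Local ranks are
 * numbered from 0 within each host (an assumption of this model). */
enum model_gam classify_peer(int my_host, int my_local_rank,
                             int peer_host, int peer_local_rank,
                             int neighborhood_size)
{
    assert(my_local_rank >= 0 && peer_local_rank >= 0);

    if (my_host != peer_host)
        return MODEL_GAM_GRU;

    /* Unset variable: the neighborhood is every rank on the host. */
    if (neighborhood_size <= 0)
        return MODEL_GAM_CPU_PREF;

    if (my_local_rank / neighborhood_size ==
        peer_local_rank / neighborhood_size)
        return MODEL_GAM_CPU_PREF;

    return MODEL_GAM_GRU_PREF;
}
```

With a neighborhood size of 4, for example, local ranks 1 and 3 fall in the same block and prefer load/store, while local ranks 1 and 5 fall in different blocks and prefer the GRU.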
The MPI_SGI_gam_ptr function is as follows:
#include <mpi_ext.h>
void *MPI_SGI_gam_ptr(void *rem_addr, size_t len, int remote_rank, MPI_Comm comm, int acc_mode);
Given a virtual address in a specified MPI process rank, returns a general virtual address that may be used to directly reference the memory.
This function is for general users.
    acc_mode    - Chooses CPU or GRU addressable
    MPI_GAM_CPU - Requests CPU address that can be referenced
    MPI_GAM_GRU - Requests GRU address that can be referenced
This function prints an error message when error conditions occur and then aborts.
The MPI_SGI_symmetric_addr function is as follows:
void *MPI_SGI_symmetric_addr(void *local_addr, size_t len, int remote_rank, MPI_Comm comm);
For symmetric objects, returns the virtual address (VA) of the corresponding object in a specified MPI process.
A GRU library programming example follows:
/* This SHMEM program uses the GRU API gru_bcopy function to read the bbb
 * variable on PE N+1. This accomplishes a global circular shift into aaa.
 */
#include <stdio.h>
#include <mpi_ext.h>
#include <mpi.h>
#include <mpp/shmem.h>
#include <uv/gru/gru_alloc.h>
#include <uv/gru/gru_instructions.h>
int aaa, bbb; /* static data is remotely accessible */
int main ()
{
    int *gptr;
    gru_alloc_thdata_t thd;
    int tri;
    start_pes (0);
    bbb = _my_pe ();
    shmem_barrier_all ();
    gru_temp_reserve (&thd); /* reserve temp GRU resources */
    gptr = MPI_SGI_gam_ptr (
        &bbb, /* address of source */
        1, /* number of elements */
        (_my_pe () + 1) % _num_pes (), /* PE owner of data */
        MPI_COMM_WORLD, /* SHMEM uses MPI_COMM_WORLD */
        MPI_GAM_GRU); /* get GRU-accessible address */
    tri = gru_get_tri (thd.dsegp); /* get offset to DSR buffer */
    gru_bcopy (
        thd.cbp, /* CB 0 will be used */
        gptr, /* GRU pointer for source of copy */
        &aaa, /* pointer for destination of copy */
        tri, /* offset to DSR buffer */
        XTYPE_W, /* data type is 4 byte word */
        1, /* number of elements to copy */
        2, /* number of cache lines of DSR buffer */
        0); /* hints */
    gru_wait_abort (thd.cbp); /* wait for completion of gru_bcopy() */
    gru_temp_release (); /* release GRU resources */
    shmem_barrier_all ();
    printf ("pe %d aaa=%d bbb=%d\n", _my_pe(), aaa, bbb);
    return 0;
}
The GRU library programming example, shown in “GRU Library Program Example”, may be compiled and run on four processes, as follows:
% module load mpt
% cc prog.c -lmpi -lsma -lgru_alloc
% mpirun -np 4 ./a.out
This section provides a very simple example of a GRU program, as follows:
/*
 * Very simple example of a program that:
 * - creates a GRU context
 * - uses VSTORE to store data to memory
 * - uses VLOAD to load data
 * - validates data.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* strerror() */
#include <errno.h>
#include "uv/gru/gru.h"
#include "uv/gru/gru_instructions.h"
#define perrorx(s) do {printf("%s: %s\n", s, strerror(errno)); exit(EXIT_FAILURE);} while (0)
#define Dprintf(s...) do {if (verbose) printf(s);} while (0)
#define MAGIC 0xdeadbeef12345678UL
static int cbrs = 1;
static int dsrbytes = 64;
static unsigned long data;
int main(int argc, char **argv)
{
    gru_cookie_t cookie;
    gru_segment_t *gseg;
    gru_control_block_t *cb;
    unsigned long *dsr;
    /*
     * Create GRU context for accessing the GRU
     */
    if (gru_create_context(&cookie, NULL, cbrs, dsrbytes, 1, 0) < 0)
        perrorx("cant open gregs");
    if ((gseg = gru_get_thread_gru_segment(cookie, 0)) == NULL)
        perrorx("cant open gregs");
    /*
     * Get pointers to CBR & DSR space
     */
    cb = gru_get_cb_pointer(gseg, 0);
    dsr = gru_get_data_pointer(gseg, 0);
    /*
     * Initialize DSR0. Value equal MAGIC
     */
    dsr[0] = MAGIC;
    /*
     * Execute the VSTORE and wait for completion
     */
    gru_vstore(cb, &data, 0, XTYPE_DW, 1, 1, 0);
    gru_wait_abort(cb);
    /*
     * Execute the VLOAD and wait for completion
     */
    gru_vload(cb, &data, 64, XTYPE_DW, 1, 1, 0);
    gru_wait_abort(cb);
    /*
     * Validate data
     */
    if (dsr[0] != dsr[8] || dsr[0] != MAGIC)
        printf("miscompare: expected 0x%lx, found 0x%lx\n", dsr[0], dsr[8]);
    if (gru_destroy_context(cookie))
        perrorx("error closing gru segment");
    return 0;
}