Chapter 1. UV GRU Direct Access API


Note: This manual only applies to SGI® UV™ 100, SGI® UV™ 1000, and SGI® UV™ 2000 systems.


This chapter provides an overview of the SGI UV global reference unit (GRU) development kit. It describes the application programming interface (API) that allows an application direct access to GRU functionality.

Introduction

The GRU is part of the SGI UV Hub application-specific integrated circuit (ASIC). The UV Hub is the heart of the SGI UV system compute blade. It connects to two Intel® Xeon® processor sockets through the Intel QuickPath Interconnect (QPI) ports and to the high speed SGI NUMAlink® interconnect fabric through NUMAlink ports.

The UV Hub acts as a crossbar between the processors, local SDRAM memory, and the network interface. The Hub ASIC enables any processor in the single-system image (SSI) to access the memory of all processors in the SSI.

The GRU is a coprocessor that transfers data between compute nodes at a higher speed than socket-level instructions can. In addition, the GRU has support for the following:

  • Efficient atomic memory operations (AMOs)

  • Internode messages sent to message queues located on remote nodes

  • Access to memory located on remote single-system images (SSIs) on partitioned systems

The GRU features are available to both the kernel and to user applications. Message Passing Toolkit (MPT), the SGI MPI library, is a primary user of the GRU. In addition, user applications can directly reference the GRU using header files and libraries that are provided by SGI.

The system architecture for the next generation SGI® UV™ system, the SGI UV 2000, is a sixth-generation NUMAflex® distributed shared memory (DSM) architecture known as NUMAlink 6. On SGI UV 2000 systems, the UV Hub board assembly has a Hub ASIC with two identical hubs. Each hub supports one 8.0 GT/s QPI channel to a processor socket. Each socket holds an eight-core Intel Xeon processor. The SGI UV 2000 series Hub has four NUMAlink 6 ports that connect with the NUMAlink 6 interconnect fabric.

In the NUMAlink architecture, all processors and memory can be tied together into a single logical system.

For more information on the SGI UV hub, SGI UV compute blades, QPI, NUMAlink 5, and NUMAlink 6, see the SGI Altix UV 1000 System User's Guide, the SGI Altix UV 100 System User's Guide, or the SGI UV 2000 System User Guide.

Advantages Provided by Directly Programming the GRU

The low-level SGI UV GRU API provides direct access to the full set of GRU instructions. Most of these instructions are not available through the MPT, SHMEM, or UPC APIs. The full benefit of the GRU comes from its ability to execute instructions asynchronously in the background while the user application performs other work.

The GRU instruction set and hardware architecture provide the following capabilities:

  • Provide a large globally addressable memory

  • Take advantage of the available bandwidth and NUMAlink message efficiency with vector-like instructions

  • Take advantage of specific hardware mechanisms aimed at reducing network traffic

  • Expand the reach of the processor cores' limited outstanding references by bringing latency-tolerant remote references into the local hub resources, increasing sustained bandwidth in all but the smallest systems.

  • Improve the apparent processor bus efficiency by compacting strided or random references into cache lines.

  • Provide efficient synchronization and communication hardware assisted primitives aimed at improving latency for common synchronization and messaging operations (including MPI applications) by reducing the number of network traversals between GRU users and target references in system memory.

  • Provide fast remote copy initiated by a CPU and performed by the GRU asynchronously

  • Provide scatter-gather, fast barriers, and AMO support

  • Provide external TLB with large page support

Accessing the SGI UV GRU Direct Access API

In order to access and use the SGI UV GRU direct access API, you need to install the following RPMs on your SGI UV system:

  • xpmem-devel

  • gru-devel

  • gru_alloc-devel

  • libgru-devel


Note: These RPMs are not installed by default.


SGI High Level APIs Supporting GRU Access

Message Passing Interface (MPI), SHMEM, and Unified Parallel C (UPC) are high-level APIs and programming models, implemented and supported by SGI, that provide access to GRU functionality. For more information, see the mpi(1), shmem(3), or sgiupc(1) man pages, the Message Passing Toolkit (MPT) User Guide, and the Unified Parallel C (UPC) User Guide.

Overview of API for Direct GRU Access

The Direct GRU Access API has four components, as follows:

  • GRU resource allocators

    The GRU resource allocator functions manage the GRU resources so that independent software components in the same program can access the GRU without oversubscribing the GRU resources.

  • GRU memory access functions

    The GRU memory access functions perform GRU operations such as memory reads, memory writes, memory-to-memory copies, and atomic memory operations.

  • XPMEM address mapping functions

    The XPMEM address mapping functions set up mappings to target memory throughout the system into local GRU-mapped virtual addresses.

  • MPT address mapping functions

    The MPT address mapping functions are a layer on top of XPMEM, and expose mapped memory regions already set up for MPI and SHMEM to the user application.

GRU Resource Allocators

The UV global reference unit (GRU) has control block (CB) and data segment (DSEG) resources associated with it. User applications need to allocate CB resources and usually DSEG resources for use in GRU memory access functions.

There are two categories of GRU resources used by any thread: temporarily and permanently allocated. A program starts running with all the available GRU resources being in the temporary pool until some resources are allocated permanently via the gru_pallocate() function.

The preferred way to get access to all the GRU temporary CBs and DSEG is through the use of the lightweight gru_temp_reserve() and gru_temp_release() functions. These functions should wrap any use of the GRU memory access functions, with an exception to be described later.

#include <gru_alloc.h>

void gru_temp_reserve(gru_alloc_thdata_t *gat);

typedef struct {
      gru_segment_t       *gruseg;
      gru_control_block_t *cbp;
      void                *dsegp;
      int                 cb_cnt;
      int                 dseg_size;
} gru_alloc_thdata_t;

The gru_alloc_thdata_t structure filled in by this function describes the GRU resources available for use until the next call to gru_temp_release().

The following code example shows the GRU memory access function gru_gamirr() being called after gru_temp_reserve() reserves the GRU resources and before gru_wait_abort() waits for the operation to complete. A call to gru_temp_release() then releases the temporary GRU resources.

Example 1-1. GRU Memory Access Function (gru_gamirr())

gru_alloc_thdata_t gat;
gru_temp_reserve(&gat);
gru_gamirr(gat.cbp, EOP_IRR_DECZ, address, XTYPE_DW, IMA_CB_DELAY);
gru_wait_abort(gat.cbp);
gru_temp_release();


The effect of the gru_temp_reserve() and gru_temp_release() functions is thread-private, so related POSIX threads or OpenMP threads can execute the above sequence concurrently.

An alternative allocation scheme is permanent allocation. The gru_pallocate() function returns CB and DSEG resources that can be used at any time thereafter. This can simplify the allocation strategy but it has the disadvantage of reducing the number of GRU resources that can be used by other software. An example would be a call to gru_bcopy() which allows you to pass a DSEG work buffer of any size. The achieved bandwidth for gru_bcopy() is higher with larger DSEG work buffers.
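The program example later in this chapter passes the DSEG work buffer size to gru_bcopy() as a number of cache lines. The following minimal sketch shows how a buffer size could be converted to that unit. It assumes 64-byte cache lines and a buffer size expressed in bytes; neither is stated in this chapter, and the name dseg_cache_lines and the MODEL_CACHE_LINE_BYTES constant are hypothetical.

```c
#include <assert.h>

#define MODEL_CACHE_LINE_BYTES 64   /* assumption: 64-byte cache lines */

/* Number of whole cache lines that fit in a DSEG work buffer of
 * dseg_bytes bytes. Any partial trailing line is not counted. */
int dseg_cache_lines(int dseg_bytes)
{
    assert(dseg_bytes >= 0);
    return dseg_bytes / MODEL_CACHE_LINE_BYTES;
}
```

Under these assumptions, a 128-byte work buffer corresponds to the value 2 used for the buffer size argument in the program example.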

See the gru(7) man page for a complete list of GRU man pages. You can also use the man gru command to view these pages.

GRU Access Functions

The following functions are used to create and manage user access to the GRU. Each function has an associated man(3) page. For a list of GRU man pages, refer to the SEE ALSO section at the bottom of the gru(7) man page.

  • gru_create_context()

    Creates a GRU context to allow a user access to the GRU

  • gru_get_data_pointer()

    Gets a pointer to a GRU data segment

  • gru_wait_abort()

    Waits for an active GRU instruction to complete. Aborts on error

  • gru_set_context_blade_chiplet()

    Selects GRU blade and chiplet for context

  • gru_unload_context()

    Unloads a GRU context

  • gru_check_status()

    Checks the status of a GRU instruction

  • gru_wait()

    Waits for an active GRU instruction to complete

  • gru_get_cb_exception_detail_str()

    Gets string describing a GRU instruction exception

  • gru_print_cb_detail()

    Prints detailed error information for GRU instruction failure

  • gru_flush_tlb()

    Flushes a virtual address range from the GRU

  • gru_create_message_queue()

    Creates a GRU message queue

  • gru_abort()

    Causes abnormal process termination due to GRU instruction error

  • gru_destroy_message_queue()

    Frees resource allocated to a GRU message queue

  • gru_get_cb_substatus()

    Gets the GRU instruction sub-status

  • gru_get_amo_value()

    Gets the AMO value from a GRU control block

  • gru_get_cb_status()

    Gets the GRU instruction status

  • gru_free_message()

    Frees a message from a GRU message queue

  • gru_get_amo_value_head()

    Gets the head value for a GRU message queue AMO

  • gru_get_amo_value_limit()

    Gets the limit value for a GRU message queue AMO

  • gru_get_next_message()

    Gets the next message from a GRU message queue

  • gru_send_message()

    Sends a message to a GRU message queue

  • gru_start_message()

    Sends an asynchronous message to a GRU message queue

  • gru_wait_message()

    Waits for asynchronous message sent to a GRU message queue

  • gru_mesq_head()

    Returns a GRU message queue header value

  • gru_destroy_context()

    Destroys a GRU context and frees the GRU resources

  • gru_get_thread_gru_segment()

    Gets GRU context identifier to use to access a GRU context

  • gru_get_cb_pointer()

    Gets a pointer to a GRU control block

  • gru_get_tri()

    Gets a tri0/tri1 index to a GRU data segment element

For detailed information, see the gru(7) man page.

GRU Memory Access Functions

The GRU memory access functions perform GRU operations that include memory read, memory write, memory-to-memory copies, and atomic memory operations. These functions use an ordinary virtual address or a GRU-mapped virtual address to reference the remote memory. Each function has an associated man(3) page.

The following in-line functions are provided by gru_instructions.h and are used to initiate GRU instructions:

  • gru_bcopy()

    Memory to memory copy using the GRU

  • gru_bstore()

    Stores data from the GRU into system memory

  • gru_gamer()

    GRU unregistered atomic memory operation with explicit data

  • gru_gamerr()

    GRU registered atomic memory operation with explicit data

  • gru_ivload()

    Indirectly load data into the GRU from system memory

  • gru_ivset()

    Indirectly store a defined data value into system memory using the GRU

  • gru_ivstore()

    Indirectly store data from the GRU into system memory

  • gru_nop()

    No operation, cancel active GRU instruction

  • gru_vflush()

    Flush cache lines from processor caches using the GRU

  • gru_vload()

    Load data into the GRU from system memory

  • gru_vstore()

    Store data from the GRU into system memory

  • gru_gamxr()

    GRU unregistered atomic memory operation with extended data

  • gru_mesq()

    Atomically send a message to a message queue using the GRU

  • gru_vset()

    Store a defined data value into system memory using the GRU

For detailed information, see the gru(7) man page.

The interfaces to these functions are viewable in the uv/gru/gru_instructions.h header file installed by the gru-devel RPM.

The following code example of a GRU memory access function illustrates the basic call structure.

Example 1-2. GRU Memory Access Function Basic Call Structure

static inline 
void gru_vload(gru_control_block_t *cb, void *mem_addr,
         unsigned int tri0, unsigned char xtype, unsigned long nelem,
         unsigned long stride, unsigned long hints);

Arguments are:
	cb	 - pointer to CB
	mem_addr - address of targeted memory
	tri0     - index to DSEG buffer.  Compute it
		   using gru_get_tri().
	xtype	 - log2 of data type byte size (XTYPE_B ...)
	nelem	 - number of elements to transfer
	stride   - memory stride, scaled in elements
	hints    - IMA_CB_DELAY is commonly used
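The xtype, nelem, and stride arguments can be illustrated with a plain software model. The sketch below is not the GRU API and touches no GRU hardware: it only mimics, with memcpy(), how nelem elements of (1 << xtype) bytes, spaced stride elements apart, end up packed contiguously in a buffer. The names xtype_for_size and model_strided_load are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: log2 of the element size in bytes (1, 2, 4, or 8).
 * Returns -1 for sizes this sketch does not handle. */
int xtype_for_size(size_t bytes)
{
    switch (bytes) {
    case 1: return 0;
    case 2: return 1;
    case 4: return 2;
    case 8: return 3;
    default: return -1;
    }
}

/* Software model only: read nelem elements of (1 << xtype) bytes from
 * src, stepping stride elements between reads, and pack them
 * contiguously into dst. */
void model_strided_load(void *dst, const void *src, int xtype,
                        size_t nelem, size_t stride)
{
    assert(xtype >= 0 && xtype <= 3);

    size_t esize = (size_t)1 << xtype;
    const unsigned char *s = src;
    unsigned char *d = dst;

    for (size_t i = 0; i < nelem; i++)
        memcpy(d + i * esize, s + i * stride * esize, esize);
}
```

With a stride of 2, for example, the model copies every other element of the source array into consecutive slots of the destination, which is the compaction of strided references described earlier in this chapter.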

All memory access operations are asynchronous. The wait functions, such as gru_wait_abort(), take the CB handle and wait for the operation to complete.

XPMEM Library Functions

The XPMEM interface can map a virtual address range in one process into the GRU-mapped virtual address in another process. The XPMEM interface was designed to meet the needs of MPI and SHMEM implementations and provide ways to map any data region. As a GRU API user, you need to find a way to map the needed memory regions into the processes or threads involved. The Linux operating system offers many options for doing this, as follows:

  • mmap

  • System V shared memory

  • memory sharing among pthreads

  • memory sharing among OpenMP threads

These methods are the likely first choice for most potential GRU users.
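As an illustration of the first option, the following minimal sketch shares one integer between a parent and a child process through an anonymous shared mmap() mapping. It uses only standard POSIX and Linux calls and does not involve the GRU or XPMEM; the function name share_via_mmap is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create an anonymous shared mapping, let a child process store value
 * into it, and return what the parent reads back. Returns -1 on
 * failure, so value must be non-negative. */
int share_via_mmap(int value)
{
    assert(value >= 0);

    int *shared = mmap(NULL, sizeof(*shared), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof(*shared));
        return -1;
    }
    if (pid == 0) {
        *shared = value;        /* child: store through the mapping */
        _exit(0);
    }

    int status;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = *shared;       /* parent: sees the child's store */
    munmap(shared, sizeof(*shared));
    return result;
}
```

A region shared this way is referenced through ordinary virtual addresses in both processes.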

The sn/xpmem.h header file installed by the xpmem-devel RPM has interface definitions for all the XPMEM functions.

The following example shows the main XPMEM functions:

Example 1-3. Main XPMEM Functions

extern __s64 xpmem_make_2(void *, size_t, int, void *);
extern int xpmem_remove_2(__s64);
extern __s64 xpmem_get_2(__s64, int, int, void *);
extern int xpmem_release_2(__s64);
extern void *xpmem_attach_2(__s64, off_t, size_t, void *);
extern void *xpmem_attach_high_2(__s64, off_t, size_t, void *);
extern int xpmem_detach_2(void *, size_t size);
extern void *xpmem_reserve_high_2(size_t, size_t);
extern int xpmem_unreserve_high_2(void *, size_t);


For more information on using XPMEM, see SGI UV Systems Configuration and Operations Guide.

MPT Address Mapping Functions

The MPT libmpi library uses XPMEM to cross-map virtual memory between all the processes in an MPI job. Several functions are available to look up mapped virtual addresses that are pre-attached in the virtual address space of a process by MPI. The addresses returned by the lookups may be passed to the GRU library functions.

Not all GRU API users can require their code to execute in an MPI job, but if your code does, the MPT address mapping functions are a convenient way to reference remote data arrays and objects.

The MPT address mapping functions are shown below. They reference ordinary virtual addresses or addresses of symmetric data objects. Symmetric data is static data or arrays, as defined in the intro_shmem(3) man page.

The following example shows the MPI_SGI_gam_type function:

Example 1-4. MPI_SGI_gam_type

#include <mpi_ext.h>
   
int
MPI_SGI_gam_type(int rank, MPI_Comm comm);
      
Return value is the XPMEM accessibility of the specified rank.
    
  MPI_GAM_NONE       - not referenceable by load/store or GRU
  MPI_GAM_CPU_NONCOH - Altix 3700 noncoherent
  MPI_GAM_CPU        - if referenceable by load/store only
  MPI_GAM_GRU        - if referenceable by GRU only
  MPI_GAM_CPU_PREF   - if referenceable by either load/store
                         or GRU, preferred by load/store
  MPI_GAM_GRU_PREF   - if referenceable by either load/store
                         or GRU, preferred by GRU


The MPT address mapping functions are influenced by the MPI_GSM_NEIGHBORHOOD environment variable. This variable may be used to specify the "neighborhood size" for shared memory accesses. Contiguous groups of ranks within a host can be considered to be in the same neighborhood. The MPI_GSM_NEIGHBORHOOD variable specifies the size of these neighborhoods, as follows:

  • MPI processes within a neighborhood will return gam_type MPI_GAM_CPU_PREF.

  • MPI processes outside a neighborhood but within the same host will return gam_type MPI_GAM_GRU_PREF.

  • MPI processes from a different host within an SGI UV system will return gam_type MPI_GAM_GRU.

When MPI_GSM_NEIGHBORHOOD is not set, the neighborhood size defaults to all ranks in the current host.
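The neighborhood rules above can be sketched as a small classification function. This is a model of the description only, not MPT code. It assumes that neighborhoods are aligned, contiguous blocks of ranks counted from the first rank on each host, which this chapter does not state explicitly. The enum values and the function name model_gam_type are hypothetical stand-ins for the MPI_GAM_* values.

```c
#include <assert.h>

/* Hypothetical labels for this sketch; they mirror, but are not, the
 * MPI_GAM_* values returned by MPI_SGI_gam_type(). */
enum model_gam { MODEL_CPU_PREF, MODEL_GRU_PREF, MODEL_GRU };

/* Classify how the rank at (my_host, my_local) would see the rank at
 * (peer_host, peer_local). Local ranks count from 0 within each host.
 * neighborhood <= 0 models an unset MPI_GSM_NEIGHBORHOOD, where the
 * whole host is one neighborhood. */
enum model_gam model_gam_type(int my_host, int my_local,
                              int peer_host, int peer_local,
                              int neighborhood)
{
    assert(my_local >= 0 && peer_local >= 0);

    if (my_host != peer_host)
        return MODEL_GRU;           /* different host, same UV system */
    if (neighborhood <= 0)
        return MODEL_CPU_PREF;      /* default: all ranks on the host */
    if (my_local / neighborhood == peer_local / neighborhood)
        return MODEL_CPU_PREF;      /* same contiguous group of ranks */
    return MODEL_GRU_PREF;          /* same host, other neighborhood */
}
```

With a neighborhood size of 4, for example, local ranks 0 through 3 on a host would see each other as CPU preferred, and would see local rank 4 on the same host as GRU preferred.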

MPI_SGI_gam_ptr Function

The MPI_SGI_gam_ptr function is as follows:

#include <mpi_ext.h>

void * MPI_SGI_gam_ptr(void *rem_addr, size_t len, int remote_rank,
  MPI_Comm comm, int acc_mode);

Given a virtual address in a specified MPI process rank, returns a general virtual address that may be used to directly reference the memory.

This function is for general users.

The acc_mode argument chooses whether the returned address is CPU addressable or GRU addressable:

  • MPI_GAM_CPU

    Requests a CPU address that can be referenced

  • MPI_GAM_GRU

    Requests a GRU address that can be referenced

This function prints an error message when error conditions occur and then aborts.

MPI_SGI_symmetric_addr Function

The MPI_SGI_symmetric_addr function is as follows:

void *MPI_SGI_symmetric_addr(void *local_addr, size_t len,
	    int remote_rank, MPI_Comm comm);

For symmetric objects, returns the virtual address (VA) of the corresponding object in a specified MPI process.

shmem_ptr Function

The shmem_ptr function is as follows:

#include <mpp/shmem.h>

       void *shmem_ptr(void *target, int pe);

Returns a processor-referenceable address that can be used to reference the symmetric data object target on a specified MPI process. See shmem_ptr(3) for more details.

GRU Library Program Example

A GRU library programming example follows:

/* This SHMEM program uses the GRU API gru_bcopy function to read the bbb
 * variable on PE N+1.  This accomplishes a global circular shift into aaa.
 */
#include <stdio.h>
#include <mpi_ext.h>
#include <mpi.h>
#include <mpp/shmem.h>
#include <uv/gru/gru_alloc.h>
#include <uv/gru/gru_instructions.h>
int aaa, bbb;			/* static data is remotely accessible */
int main ()
{
    int *gptr;
    gru_alloc_thdata_t thd;
    int tri;
    start_pes (0);
    bbb = _my_pe ();
    shmem_barrier_all ();
    gru_temp_reserve (&thd);      /* reserve temp GRU resources */
    gptr = MPI_SGI_gam_ptr (
        &bbb,                           /* address of source */
        1,                              /* number of elements */
        (_my_pe () + 1) % _num_pes (),  /* PE owner of data */
        MPI_COMM_WORLD,                 /* SHMEM uses MPI_COMM_WORLD */
        MPI_GAM_GRU);                   /* get GRU-accessible address */
    tri = gru_get_tri (thd.dsegp);      /* get offset to DSR buffer */
    gru_bcopy (
        thd.cbp,                /* CB 0 will be used */
        gptr,                   /* GRU pointer for source of copy */
        &aaa,                   /* GRU pointer for destination of copy */
        tri,                    /* offset to DSR buffer */
        XTYPE_W,                /* data type is 4 byte word */
        1,                      /* number of elements to copy */
        2,                      /* number of cache lines of DSR buffer */
        0);                     /* hints */
    gru_wait_abort (thd.cbp);   /* wait for completion of gru_bcopy() */
    gru_temp_release ();        /* release GRU resources */
    shmem_barrier_all ();
    printf ("pe %d aaa=%d bbb=%d\n", _my_pe(), aaa, bbb);
    return 0;
}    

The GRU library programming example, shown in “GRU Library Program Example”, may be compiled and run on four processes, as follows:

% module load mpt
% cc prog.c -lmpi -lsma -lgru_alloc
% mpirun -np 4 ./a.out
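Each PE sets bbb to its own PE number and then copies the bbb of PE N+1 into aaa, so the value each PE prints for aaa can be predicted. The following sketch computes that value; the function name expected_aaa is hypothetical and the code does not use the GRU.

```c
#include <assert.h>

/* Value each PE should find in aaa: the bbb of its right-hand
 * neighbor, which that neighbor set to its own PE number. */
int expected_aaa(int pe, int npes)
{
    assert(npes > 0 && pe >= 0 && pe < npes);
    return (pe + 1) % npes;
}
```

For the four-process run, PEs 0 through 3 would be expected to report aaa values of 1, 2, 3, and 0. The order in which the processes print their lines may vary.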

Simple GRU Program

This section provides a very simple example of a GRU program, as follows:

/*
 * Very simple example of a program that:
 * 	- creates a GRU context
 * 	- uses VSTORE to store data to memory
 * 	- uses VLOAD to load data
 * 	- validates data.
 */

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include "uv/gru/gru.h"
#include "uv/gru/gru_instructions.h"

#define perrorx(s)      do {printf("%s: %s", s, strerror(errno)); exit(EXIT_FAILURE);} while (0)
#define Dprintf(s...)	do {if (verbose) printf(s);} while (0)

#define MAGIC		0xdeadbeef12345678UL

static int cbrs = 1;
static int dsrbytes = 64;

static unsigned long data;

int main(int argc, char **argv)
{
	gru_cookie_t cookie;
	gru_segment_t *gseg;
	gru_control_block_t *cb;
	unsigned long *dsr;

	/*
	 * Create GRU context for accessing the GRU
	 */
	if (gru_create_context(&cookie, NULL, cbrs, dsrbytes, 1, 0) < 0)
		perrorx("cant open gregs");
	if ((gseg = gru_get_thread_gru_segment(cookie, 0)) == NULL)
		perrorx("cant open gregs");

	/*
	 * Get pointers to CBR & DSR space
	 */
	cb = gru_get_cb_pointer(gseg, 0);
	dsr = gru_get_data_pointer(gseg, 0);

	/*
	 * Initialize DSR0. Value equal MAGIC
	 */
	dsr[0] = MAGIC;

	/*
	 * Execute the VSTORE and wait for completion
	 */
	gru_vstore(cb, &data, 0, XTYPE_DW, 1, 1, 0);
	gru_wait_abort(cb);

	/*
	 * Execute the VLOAD and wait for completion
	 */
	gru_vload(cb, &data, 64, XTYPE_DW, 1, 1, 0);
	gru_wait_abort(cb);

	/*
	 * Validate data
	 */
	if (dsr[0] != dsr[8] || dsr[0] != MAGIC)
		printf("miscompare: expected 0x%lx, found 0x%lx\n", dsr[0], dsr[8]);

	if (gru_destroy_context(cookie))
		perrorx("error closing gru segment");
}
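In this program, the VLOAD uses a tri0 value of 64 and the result is then read from dsr[8]. That pairing is consistent with tri0 being a byte offset into the data segment, because dsr points to 8-byte elements. The following sketch shows the conversion under that assumption; the function name dsr_index_for_offset is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a byte offset into the data segment to an index into an
 * array of elem_size-byte elements. Assumes the offset is aligned to
 * the element size. */
size_t dsr_index_for_offset(size_t byte_offset, size_t elem_size)
{
    assert(elem_size > 0 && byte_offset % elem_size == 0);
    return byte_offset / elem_size;
}
```

For an offset of 64 bytes and 8-byte elements, the resulting index is 8, which matches the dsr[8] reference in the validation step.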