Chapter 3. System Overview

This chapter provides an overview of the physical and architectural aspects of your SGI Altix Integrated Compute Environment (ICE) 8400 series system. The major components of the Altix ICE systems are described and illustrated.

Because the system is modular, it combines the advantages of lower entry-level cost with global scalability in processors, memory, InfiniBand connectivity and I/O. You can install and operate the Altix ICE 8400 series system in your lab or server room. Each 42U SGI rack holds from one to four 10U-high individual rack units (IRUs), each of which supports up to sixteen compute/memory cluster submodules known as “blades.” These blades are single printed circuit boards (PCBs) with ASICs, processors, memory components and I/O chip sets mounted on a mechanical carrier. The blades slide directly in and out of the IRU enclosures. Every compute blade contains at least two dual-inline memory module (DIMM) memory units.

Each blade supports two processor sockets, each of which can hold a four-core or six-core processor. Note that a maximum system size of 64 compute/memory blades (768 cores) per rack was supported at the time this document was published. Optional chilled water cooling may be required for large processor count rack systems. Customers wishing to emphasize memory capacity over processor count may request blades configured with only one processor installed per blade. Contact your SGI sales or service representative for the most current information on these topics.

The SGI Altix ICE 8400 series systems can run parallel programs using a message passing tool like the Message Passing Interface (MPI). The ICE blade system uses a distributed memory scheme as opposed to a shared memory system like that used in the SGI Altix UV series of high-performance compute servers. Instead of passing pointers into a shared virtual address space, parallel processes in an application pass messages and each process has its own dedicated processor and address space. This chapter consists of the following sections:

System Models

Figure 3-1 shows an example configuration of a single-rack Altix ICE 8400 server.

Figure 3-1. SGI Altix ICE 8400 Series System (Single Rack)

The 42U rack for this server houses all IRU enclosures, option modules, and other components, with up to 128 processor sockets (768 processor cores) in a single rack. The basic enclosure within the Altix ICE system is the 10U-high (17.5 inch or 44.45 cm) “individual rack unit” (IRU). The IRU enclosure supports a maximum of 16 single-wide compute/memory blades, up to eight power supplies, one chassis manager interface and two or four InfiniBand architecture I/O fabric switch interface blades. Note that optional water-chilled rack cooling is available for systems in environments where ambient temperatures do not meet adequate air cooling requirements.
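
The rack capacity figures quoted above follow directly from the per-IRU and per-blade limits. The short Python sketch below reproduces that arithmetic. The constant and function names are illustrative only and are not part of any SGI tool.

```python
# Capacity limits taken from the text above.
IRUS_PER_RACK = 4        # one to four IRUs fit in a 42U rack
BLADES_PER_IRU = 16      # single-wide compute/memory blades per IRU
SOCKETS_PER_BLADE = 2    # processor sockets per blade
IRU_HEIGHT_U = 10        # each IRU is 10U high
U_INCHES = 1.75          # one rack unit (U) is 1.75 inches


def rack_capacity(irus=IRUS_PER_RACK, cores_per_socket=6):
    """Return blade, socket, and core counts for one rack (hypothetical helper)."""
    if not 1 <= irus <= IRUS_PER_RACK:
        raise ValueError("a rack holds from one to four IRUs")
    if cores_per_socket not in (4, 6):
        raise ValueError("blades use four-core or six-core processors")
    blades = irus * BLADES_PER_IRU
    sockets = blades * SOCKETS_PER_BLADE
    return {"blades": blades, "sockets": sockets,
            "cores": sockets * cores_per_socket}


full_rack = rack_capacity()                   # four IRUs, six-core processors
iru_height_inches = IRU_HEIGHT_U * U_INCHES   # physical height of one IRU
print(full_rack, iru_height_inches)
```

Calling rack_capacity() with fewer IRUs or with four-core processors gives the corresponding totals for a partially populated rack.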

The system requires a minimum of one 42U tall rack with two PDUs per IRU. Each single-phase PDU has 8 outlets; two PDUs must be used with the first IRU and its support servers. Subsequent IRUs installed into the rack are supported by two single-phase PDUs each.

Figure 3-2 shows an IRU and Rack. The optional three-phase PDU has 18 outlets and two PDUs are installed in each ICE 8400 compute rack. You can also add additional RAID and non-RAID disk storage to your rack system and this should be factored into the number of required outlets.
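
When planning a rack, it can help to compare the outlets the PDUs provide with the outlets the installed equipment needs. The following Python sketch is one way to do that under a loudly stated assumption: every IRU power supply and every additional device (server or storage enclosure) consumes exactly one outlet. That mapping is not stated in this guide, and the function names are hypothetical, so confirm actual requirements with your SGI representative.

```python
SINGLE_PHASE_OUTLETS = 8     # outlets per single-phase PDU (from the text)
THREE_PHASE_OUTLETS = 18     # outlets per optional three-phase PDU (from the text)
MAX_IRU_POWER_SUPPLIES = 8   # an IRU supports up to eight power supplies


def outlets_available(irus, three_phase=False):
    """Outlets provided by the PDUs in one rack."""
    if three_phase:
        return 2 * THREE_PHASE_OUTLETS          # two three-phase PDUs per rack
    return 2 * SINGLE_PHASE_OUTLETS * irus      # two single-phase PDUs per IRU


def outlets_needed(irus, supplies_per_iru=MAX_IRU_POWER_SUPPLIES, other_devices=0):
    """ASSUMPTION: one outlet per power supply and one per additional device."""
    return irus * supplies_per_iru + other_devices


# Example: four fully powered IRUs plus three other devices on three-phase PDUs.
needed = outlets_needed(4, other_devices=3)
available = outlets_available(4, three_phase=True)
print(needed, available, needed <= available)
```

Additional RAID or non-RAID storage raises the other_devices count in this sketch, which is why storage should be included when counting required outlets.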

Figure 3-2. IRU and Rack Components Example

System and Blade Architectures

The Altix ICE 8400 series of computer systems is based on an InfiniBand I/O fabric. This concept is supported and enhanced by using the blade-level technologies described in the following subsections.

Depending on the configuration you ordered and your high-performance compute needs, your system may be equipped with different single-wide blade types. These compute blades all use Intel chip sets and different quad-data rate (QDR) InfiniBand on-board host channel adapters (HCAs) as follows:

  • The IP101 compute/memory blade uses a single-port InfiniBand host channel adapter (HCA).

  • The IP103 compute/memory blade is equipped with one dual-port InfiniBand HCA.

  • The IP105 version of the system blade uses two single-port InfiniBand HCAs.


    Note: The IP105 compute blade is approximately 1/2-inch (12.7 mm) longer than the IP101 or IP103. The faceplate of the IP105 blade will protrude outward from the face of the IRU about 1/2-inch (12.7 mm) while the IP101 and IP103 blades will fit in the IRU with their faceplates flush to the front of the unit.
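
The differences among the three blade types can be summarized as a small data structure. The Python sketch below is illustrative only: the class and field names are invented for this example, while the values come from this chapter (HCA and port counts from the list above, faceplate fit from the note, and on-board disk support from the blade overviews that follow).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BladeType:
    name: str
    hca_count: int          # on-board QDR InfiniBand HCAs
    ports_per_hca: int      # InfiniBand ports on each HCA
    onboard_disk: bool      # optional on-board hard disk or SSD supported
    flush_faceplate: bool   # faceplate sits flush with the front of the IRU

    @property
    def ib_ports(self):
        """Total InfiniBand ports available on the blade."""
        return self.hca_count * self.ports_per_hca


BLADE_TYPES = {
    "IP101": BladeType("IP101", 1, 1, onboard_disk=False, flush_faceplate=True),
    "IP103": BladeType("IP103", 1, 2, onboard_disk=False, flush_faceplate=True),
    "IP105": BladeType("IP105", 2, 1, onboard_disk=True, flush_faceplate=False),
}

for blade in BLADE_TYPES.values():
    print(blade.name, blade.ib_ports, blade.onboard_disk)
```

The single InfiniBand port on the IP101 is consistent with the statement later in this chapter that the IP101 can only be used in single-plane configurations.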


IP101 Blade Architecture Overview

An enhanced and updated version of the SGI Altix ICE compute blade is used in the ICE 8400 systems. Each blade uses one or two six-core or four-core Intel processors. The IP101 compute blade is not compatible with and cannot be used in “previous generation” Altix ICE 8200 series IRUs. This blade architecture is described in the following sections. The primary difference between the IP101 and IP103 blades is the single-port vs. dual-port embedded InfiniBand HCA. Note that the IP101 compute blade can only be used in single-plane configurations of the ICE 8400 system.

The compute blade contains the processors, memory, and one QDR InfiniBand single-port embedded HCA. As previously mentioned, each compute blade is configured with one or two six-core or quad-core Intel processors - a maximum of 12 processor cores per compute blade. A maximum of twelve DDR3 memory DIMMs are supported per compute blade.

The two processors on the IP101 maintain an interactive communication link using Intel's QuickPath Interconnect (QPI) technology. This high-speed interconnect technology provides data transfers between the processors, memory and I/O hub components. Note the IP101 blade does not support a native “on-board” hard disk drive option.

IP103 Blade Architecture Overview

An enhanced and updated six-core or quad-core version of the SGI Altix ICE compute blade is used in the ICE 8400 systems. The IP103 compute blade is not compatible with and cannot be used in “previous generation” Altix ICE 8200 series IRUs. This blade architecture is described in the following sections.

The compute blade contains the processors, memory, and one QDR InfiniBand dual-port embedded HCA. Each compute blade is configured with one or two six-core or quad-core Intel processors - a maximum of 12 processor cores per compute blade. A maximum of twelve DDR3 memory DIMMs are supported per compute blade.

The two processors on the IP103 maintain an interactive communication link using Intel's QuickPath Interconnect (QPI) technology. This high-speed interconnect technology provides data transfers between the processors, memory and I/O hub components. Note that the IP103 blade does not support a native “on-board” hard disk drive option.

IP105 Blade Architecture Overview

Although the compute blades used in the Altix ICE products are different physically, their basic compute architecture is nearly identical. The primary functional difference is that the larger IP105 blade supports two single-port QDR InfiniBand ASICs and an optional on-board hard disk drive or solid-state disk (SSD). The blades are virtually the same in terms of the number of processors available (maximum of 12 cores), on-board memory control, QPI interfaces, DIMM types used and I/O control interfaces.

Figure 3-3 shows a functional block diagram of the Altix ICE 8400 series system IRU using IP105 single-wide compute blades, InfiniBand interface, and component interconnects.

QuickPath Interconnect Features

Each processor on a blade uses two QuickPath Interconnect (QPI) links. The QPI link consists of two point-to-point 20-bit channels - one send channel and one receive channel. The QPI link has a theoretical maximum aggregate bandwidth of 25.6 GB/s. Each blade's I/O chip set supports two processors. Each processor is connected to one of the I/O chips with a QPI channel. The two processors and the I/O chips are also connected together with a single QPI channel.

The maximum bandwidth of a single QPI link is calculated as follows:

  • The QPI channel uses a 3.2 GHz clock, but the effective transfer rate is 6.4 GT/s because two bits are transmitted per lane in each clock period: one on the rising edge of the clock and one on the falling edge (DDR).

  • Of the 20 bits in the channel, 16 bits are data and 4 bits are error correction.

  • 6.4 GT/s times 16 data bits equals 102.4 Gb/s.

  • Convert to bytes: 102.4 Gb/s divided by 8 equals 12.8 GB/s (the maximum single-direction bandwidth).

  • The total aggregate bandwidth of the QPI link is 25.6 GB/s (12.8 GB/s times 2 channels).
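
The same calculation can be written as a few lines of Python. This is only a restatement of the steps listed above; the variable names are chosen for this example.

```python
CLOCK_GHZ = 3.2            # QPI channel clock
TRANSFERS_PER_CLOCK = 2    # rising and falling clock edge (DDR)
DATA_BITS = 16             # 16 of the 20 bits in a channel carry data
CHANNELS_PER_LINK = 2      # one send channel and one receive channel

transfer_rate_gt = CLOCK_GHZ * TRANSFERS_PER_CLOCK     # giga-transfers per second
one_way_gbit = transfer_rate_gt * DATA_BITS            # gigabits per second
one_way_gbyte = one_way_gbit / 8                       # gigabytes per second
aggregate_gbyte = one_way_gbyte * CHANNELS_PER_LINK    # both directions combined

print(f"{transfer_rate_gt:.1f} GT/s, {one_way_gbit:.1f} Gb/s, "
      f"{one_way_gbyte:.1f} GB/s one way, {aggregate_gbyte:.1f} GB/s aggregate")
```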

Blade Memory Features

The memory control circuitry is integrated into the processors and provides greater memory bandwidth and capacity than previous generations of ICE compute blades.

Each processor on a blade uses three DDR3 memory channels with one or more memory DIMMs on each channel (depending on the configuration selected). Each blade can support up to 12 DIMMs. Each DDR3 memory channel supports a maximum memory bandwidth of 10.66 GB per second, so the combined maximum bandwidth for all three memory channels on a single processor is 31.99 GB per second. It is highly recommended (though not required) that each processor on a blade be configured with a minimum of three DIMMs (one for each memory channel) to ensure the best DIMM data throughput.

The memory bandwidth is determined by three key factors:

  • The processor speed - different processor SKUs support different DIMM speeds.

  • The number of DIMMs per channel.

  • The DIMM speed - the DIMM itself has a maximum operating frequency or speed. At the time this document was published the ICE 8400 DIMM speed was 1333 MT/s.


    Note: A DIMM must be rated for the maximum speed to be able to run at the maximum speed. For example: a single 1066 MT/s DIMM on a channel will only operate at 1066 MT/s - not 1333 MT/s.


Populating one 1333 MT/s DIMM on each channel delivers a maximum of 10.66 GB/s per channel or 31.99 GB/s total memory bandwidth. The QuickPath Interconnect technology allows memory transfer or retrieval between the blade's two processors at up to 25.6 GB per second.
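
The per-channel and per-processor memory bandwidth figures can be reproduced with the short Python sketch below. It assumes the standard 64-bit (8-byte) DDR3 data bus per channel, which is not stated in this guide, and the function name is hypothetical.

```python
BYTES_PER_TRANSFER = 8        # ASSUMPTION: 64-bit DDR3 data bus per channel
CHANNELS_PER_PROCESSOR = 3    # three DDR3 memory channels per processor


def channel_bandwidth_gbs(dimm_speed_mts):
    """Peak bandwidth of one memory channel in GB/s for a given DIMM speed (MT/s)."""
    return dimm_speed_mts * BYTES_PER_TRANSFER / 1000


per_channel = channel_bandwidth_gbs(1333)
per_processor = per_channel * CHANNELS_PER_PROCESSOR
# The same processor populated with slower 1066 MT/s DIMMs.
per_processor_1066 = channel_bandwidth_gbs(1066) * CHANNELS_PER_PROCESSOR

print(f"{per_channel:.2f} GB/s per channel, {per_processor:.2f} GB/s per processor")
print(f"{per_processor_1066:.2f} GB/s per processor with 1066 MT/s DIMMs")
```

The second result illustrates the note above: DIMMs rated below the maximum speed reduce the peak bandwidth of every channel they occupy.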

A minimum of one dual-inline-memory module (DIMM) is required for each processor on a blade. An IRU example using single-wide blades is shown in Figure 3-10. Each of the DIMMs on a blade must be the same capacity and functional speed. When possible, it is generally recommended that all blades within an IRU use the same number and capacity (size) DIMMs.


Note: Regardless of the number of DIMMs installed, a minimum of 4 GB of DIMM memory is recommended for each compute blade. Failure to meet this recommendation may affect overall application performance.

Each blade in the IRU may have a different total DIMM capacity. For example, one blade may have 12 DIMMs, and another may have only six. Note that while this difference in capacity is acceptable functionally, it may have impact on compute “load balancing” within the system.
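
The DIMM population rules in this section can be collected into a simple checklist. The Python sketch below is a hypothetical helper, not an SGI utility. It treats the stated requirements as errors and the stated recommendations as warnings, and as a simplification it does not model which DIMM is attached to which processor or channel.

```python
MAX_DIMMS_PER_BLADE = 12
CHANNELS_PER_PROCESSOR = 3
RECOMMENDED_MIN_GB = 4


def check_blade_dimms(processors, dimms):
    """Check one blade. dimms is a list of (capacity_gb, speed_mts) tuples.

    Returns (errors, warnings) as two lists of strings.
    """
    errors, warnings = [], []
    if len(dimms) > MAX_DIMMS_PER_BLADE:
        errors.append("a blade supports at most 12 DIMMs")
    if len(dimms) < processors:
        errors.append("at least one DIMM is required per processor")
    if len(set(dimms)) > 1:
        errors.append("all DIMMs on a blade must match in capacity and speed")
    if sum(capacity for capacity, _ in dimms) < RECOMMENDED_MIN_GB:
        warnings.append("less than the recommended 4 GB per blade")
    if len(dimms) < processors * CHANNELS_PER_PROCESSOR:
        warnings.append("fewer than three DIMMs per processor")
    return errors, warnings


good = check_blade_dimms(2, [(4, 1333)] * 6)
sparse = check_blade_dimms(2, [(2, 1333)])
mixed = check_blade_dimms(2, [(4, 1333), (2, 1333)])
print(good, sparse, mixed, sep="\n")
```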

Figure 3-3. Functional Block Diagram of an ICE 8400 Individual Rack Unit (IRU)

System InfiniBand Switch Blades

Two or four quad-data-rate (QDR) InfiniBand switch blades can be used with each IRU configured in the Altix ICE 8400 system. Figure 3-4 shows an example IRU with four QDR switches installed. IRUs with four switch blades use a dual-plane topology that provides high-bandwidth communication between compute blades inside the IRU as well as blades in other IRUs.

IRUs using two QDR switch blades are available in certain specific configurations. The two-switch blade configuration supports a single-plane QDR InfiniBand topology only; check with your SGI sales or service representative for additional information on availability.

Each Altix ICE 8400 QDR switch blade has 21 external ports (four of these are mini-SAS ports) to support the InfiniBand fabric. Any external switch blade ports not used to support the IB system fabric may be connected to optional service nodes or InfiniBand mass storage. Check with your SGI sales or service representative for information on available options.

Figure 3-4. InfiniBand QDR Switch Numbering in IRUs

System Features and Major Components

The main features of the Altix ICE 8400 series server systems are introduced in the following sections:

Modularity and Scalability

The Altix ICE 8400 series systems are modular, blade-based, scalable, high-density cluster systems. The system rack components are primarily housed in building blocks referred to as individual rack units (IRUs). However, other “free-standing” Altix compute servers are used to administer, access and service the ICE 8400 series systems. Additional optional mass storage may be added to the system along with additional IRUs. You can add different types of stand-alone module options to a system rack to achieve the desired system configuration. You can configure and scale IRUs around processing capability, memory size or InfiniBand fabric I/O capability. The air-cooled IRU enclosure has redundant, hot-swap fans and redundant, hot-swap power supplies. A water-chilled rack option expands an ICE 8400 rack's heat dissipation capability for the IRU components without requiring lower ambient temperatures in the lab or server room. See Figure 4-3 for an example water-chilled rack configuration.

A number of free-standing (non-blade) compute and I/O servers (often referred to as nodes) are used with Altix ICE 8400 series systems in addition to the standard two-socket blade-based compute nodes. These free-standing units are:

  • System administration controller

  • System rack leader controller (RLC) server

  • Service nodes with the following functions:

    • Fabric management service node

    • Login node

    • Batch node

    • I/O gateway node

Each ICE system will have one system administration controller, one rack leader controller (RLC) per system rack, and at least one service node.
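
As a rough planning aid, the minimum set of free-standing servers can be derived from the number of racks. The Python sketch below uses a hypothetical function name and only the rules given in this chapter: one administration controller per system, one RLC per rack, at least one service node, and (as noted in the Service Nodes subsection) dedicated fabric management nodes on systems of eight or more racks. It does not state how many fabric management nodes are needed, because this guide does not say.

```python
def support_servers(racks, service_nodes=1):
    """Minimum free-standing servers for a system with the given rack count."""
    if racks < 1:
        raise ValueError("a system has at least one rack")
    if service_nodes < 1:
        raise ValueError("a system has at least one service node")
    return {
        "administration_controllers": 1,                 # one per system
        "rack_leader_controllers": racks,                # one RLC per rack
        "service_nodes": service_nodes,                  # at least one
        "dedicated_fabric_management_required": racks >= 8,
    }


small = support_servers(1)
large = support_servers(8, service_nodes=4)
print(small)
print(large)
```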

The administration controller is a 2U server and the RLCs are integrated stand-alone 1U servers. The service nodes are integrated stand-alone non-blade 1U, 2U, 3U or 4U servers.

The following subsections further define the free-standing unit functions described in the previous list.

System Administration Server

There is one stand-alone administration controller server and I/O unit per system. The system administration controller is a non-blade Altix 2U server system. The server is used to install ICE system software, administer that software and monitor information from all the compute blades in the system. Check with your SGI sales or service representative for information on “cold spare” options that provide a standby administration server on site for use in case of failure.

The administration server on ICE 8400 systems is connected to the external network and may be set up for interactive logins under specific circumstances. However, most ICE systems are configured with dedicated “login” servers for this purpose. In this case, you might configure multiple “service nodes” and have all but one devoted to interactive logins as “login nodes”; see the “Login Server Function” and “I/O Gateway Node” subsections.

Rack Leader Controller

A rack leader controller (RLC) server is generally used by administrators to provision and manage the system using SGI's cluster management (CM) software. There is generally only one leader controller per rack and it is a non-blade “stand-alone” 1U server. The rack leader controller is guided and monitored by the system administration server. It in turn monitors, pulls and stores data from the compute nodes of all the IRUs within the rack. The rack leader then consolidates and forwards data requests received from the IRU's blade compute nodes to the administration server. The leader controller may also supply boot and root file sharing images to the compute nodes in the IRUs.

For large systems or systems that run many MPI jobs, multiple RLC servers may be used to distribute the load (one RLC server per rack). The first RLC in the ICE system is the “master” controller server. Additional RLCs are slaved to the first RLC (normally installed in rack 1). The second RLC runs the same fabric management image as the primary “master” RLC. Check with your SGI sales or support representative for configurations that use a “cold spare” RLC or administration server. This option can provide rapid replacement for a failed RLC or administrative unit.

In most ICE configurations the fabric management function is handled by the rack leader controller (RLC) node. The RLC is an independent server that is not part of an IRU. See the “Rack Leader Controller” subsection for more detail. The fabric management software runs on one or more RLC nodes and monitors the function of and any changes in the InfiniBand fabrics of the system. It is also possible to host the fabric management function on a dedicated service node, thereby moving the fabric management function from the rack leader node and hosting it on one or more additional servers. A separate fabric management server would supply fabric status information to the RLC server periodically or upon request. As with the rack leader controller server, only one per rack is supported.

Service Nodes

The service “node” functions listed in this subsection are all services that can technically run on a single hardware server unit. As the system scales, you can add more servers (nodes) and dedicate them to these service functions if the size of the system requires it. However, you can also have a smaller system where many of the services are combined on just a single service node. Figure 3-5 shows an example rear view of a 1U service node. Note that dedicated fabric management nodes are required on 8-rack or larger systems.

Figure 3-5. Example Rear View of a 1U Service Node

Login Server Function

The login server function within the ICE system can be functionally combined with the I/O gateway server node function in some configurations. One or more login servers per system are supported. Very large systems with high levels of user logins may use multiple dedicated login server nodes. The login node functionality is generally used to create and compile programs, and additional login server nodes can be added as the total number of user logins increases. The login server is usually the point of submittal for all message passing interface (MPI) applications run in the system. An MPI job is started from the login node and the sub-processes are distributed to the ICE system's compute nodes. Another operating factor for a login server is the file system structure. If the node is NFS-mounting a network storage system outside the ICE system, input data and output results will need to pass through it for each job. Multiple login servers can distribute this load.

Figure 3-6 shows the rear connectors and interface slots on a 2U service node.

Figure 3-6. 2U Service Node Rear Panel

Batch Server Node

The batch server function may be combined with login or other service nodes for many configurations. Additional batch nodes can be added as the total number of user logins increases. Users log in or connect to a batch server to run batch scheduler programs (portable batch system or load-sharing facility, PBS/LSF) and to submit these jobs to the system compute nodes.

I/O Gateway Node

The I/O gateway server function may be combined with login or other service nodes for many configurations. If required, the I/O gateway server function can be an optional 1U, 2U or 3U stand-alone server within the ICE system. See Figure 3-7 for a rear view example of the 3U service node. One or more I/O gateway nodes are supported per system, based on system size and functional requirement. The node may be separated from login and/or batch nodes to scale to large configurations. Users log in or connect to submit jobs to the compute nodes. The node also acts as a gateway from InfiniBand to various types of storage, such as direct-attach, Fibre Channel, or NFS.

Figure 3-7. 3U Service Node Rear Panel Example

The 4U Service Node

An optional 4U service node is offered with the ICE 8400 systems. This server is a higher-performance system that can contain multiple processors (up to 4) and serve multiple purposes within the ICE system. The 4U server is not used as an administrative node or rack leader controller.

Figure 3-8 shows the rear panel of the 4U service node and Table 3-1 identifies the functional items on the back of the unit. See the SGI Altix UV 10 System User's Guide, (P/N 007-5645-00x) for details on operating the 4U server.

Figure 3-8. 4U Service Node Rear Panel Example

Table 3-1. 4U Service Node Rear Panel Items

Item   Description
A      SAS riser slot - PCIe Gen2 x8 half-height slot
B      I/O riser Gigabit Ethernet ports
C      I/O riser module
D      Serial port connector
E      PCIe Gen2 x8 slots
F      Power supply unit status LEDs
G      AC power input connectors
H      Hot-swap power supply
I      System ID on/off button
J      System status/fault LED
K      System ID LED (blue)
L      USB 2.0 ports
M      VGA video port (up to 1600x1200), 15-pin connector
N      8 power-on self-test (POST) status LEDs
O      I/O riser management Ethernet port


Multiple Chassis Manager Connections

In certain multiple-IRU configurations the chassis managers in each IRU may be interconnected and wired to the administrative server and the rack leader controller (RLC) server. Figure 3-9 shows an example diagram of the CMC interconnects between two ICE 8400 system racks.


Note: The unconnected chassis manager extension (blue) shown on the lower-right side of the figure illustrates a hypothetical continuation of the CMC network to a third ICE rack.

For more information on these and other topics related to the CMC, see Chapter 1 in the SGI Tempo System Administrator's Guide, (P/N 007-4993-00x).

Note also that the scale of the CMC drawings in Figure 3-9 is adjusted to clarify the interconnect locations.

Figure 3-9. Administration and RLC Cabling to Chassis Managers Example

Reliability, Availability, and Serviceability (RAS)

The Altix ICE 8400 server series components have the following features to increase the reliability, availability, and serviceability (RAS) of the systems.

  • Power and cooling:

    • IRU power supplies are redundant and can be hot-swapped under most circumstances. Note that this might not be possible in a “fully loaded” IRU.

    • A rack-level water chilled cooling option is available for all configurations.

    • IRUs have overcurrent protection at the blade and power supply level.

    • Fans are redundant and can be hot-swapped.

    • Fans run at multiple speeds in the IRUs. Speed increases automatically when temperature increases or when a single fan fails.

  • System monitoring:

    • Chassis managers monitor the internal voltage, power and temperature of the IRUs.

    • Redundant system management networking is available.

    • Each IRU and each blade/node installed has failure LEDs that indicate the failed part; LEDs are readable at the front of the IRU.

    • Systems support remote console and maintenance activities.

  • Error detection and correction:

    • External memory transfers are protected by cyclic redundancy check (CRC) error detection. If a memory packet fails its checksum, it is retransmitted.

    • Nodes within each IRU exceed SECDED standards by detecting and correcting 4-bit and 8-bit DRAM failures.

    • Detection of all double-component 4-bit DRAM failures occurs within a pair of DIMMs.

    • 32 bits of error-correcting code (ECC) are used on each 256 bits of data.

    • Automatic retry of uncorrected errors occurs to eliminate potential soft errors.

  • Power-on and boot:

    • Automatic testing (POST) occurs after you power on the system nodes.

    • Processors and memory are automatically de-allocated when a self-test failure occurs.

    • Boot times are minimized.

System Components

The Altix ICE 8400 series system features the following major components:

  • 42U rack. This is a custom rack used for both the compute and I/O rack in the Altix ICE 8400 series. Up to 4 IRUs can be installed in each rack. Note that the primary (first) rack must have 3U of space reserved for the 2U administrative controller server and 1U rack leader controller (RLC) server.

  • Individual Rack Unit (IRU). This enclosure contains the compute/memory blades, chassis manager, InfiniBand fabric I/O blades and front-access power supplies for the Altix ICE 8400 series computers. The enclosure is 10U high. Figure 3-10 shows the Altix ICE 8400 series IRU system components.

  • Single-wide compute/memory blade. Holds two (quad-core or six-core) processor sockets and up to 12 memory DIMMs.

  • 1U (Rack leader controller). The 1U rack leader server is required in each system rack.

  • 2U Administrative server with PCIe expansion. This server node supports an optional console, administrative software and PCI Express option cards. The administrative server is installed in the primary rack in the system.

  • 1U Service node. Additional 1U server(s) can be added to a system rack and used specifically as an optional login, batch, or fabric functional node. Note that these service functions cannot be incorporated as part of the system RLC server.

  • 2U Service node. An optional 2U service node may be used as a login, batch, or fabric functional node. In smaller systems these functions may be combined on one server. Note that the 2U service node function is never a shared part of the 2U administrative server.

  • 3U Service node. The optional 3U server node is offered with certain configurations needing higher performance I/O access for the ICE system. It offers multiple I/O options and graphics options not available with the 1U or 2U service nodes.

  • 4U Service node. The optional 4U server is offered as the highest overall performance service node available with the ICE 8400 system. It offers the highest processing power, best I/O performance and most flexible configuration options of the available service nodes.


    Note: PCIe options may be limited; check with your SGI sales or support representative.


    Figure 3-10. Altix ICE 8400 Series IRU System Components Example

IRU (Unit) Numbering

IRUs in the racks are not identified using standard units. A standard unit (SU) or unit (U) is equal to 1.75 inches (4.445 cm). IRUs within a rack are identified by the use of module IDs 0, 1, 2, and 3, with IRU 0 residing at the bottom of each rack. These module IDs are incorporated into the host names of the CMC (i0c, i1c, etc.) and the compute blades (r1i0n0, r1i1n0, etc.) in the rack.

Rack Numbering

Each rack in a multi-rack system is numbered with a single-digit number sequentially beginning with (1). A rack contains IRU enclosures, administrative and rack leader server nodes, service specific nodes, optional mass storage enclosures and potentially other options.


Note: In a single compute rack system, the rack number is always (1).

The number of the first IRU will always be zero (0). These numbers are used to identify components starting with the rack, including the individual IRUs and their internal compute-node blades. Note that these single-digit ID numbers are incorporated into the host names of the rack leader controller (RLC) (r1lead) as well as the compute blades (r1i0n0) that reside in that rack.
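
The host name patterns shown above (r1lead, i0c, r1i0n0) can be generated and parsed with a few helper functions. The Python sketch below is illustrative only and its function names are hypothetical. It assumes that the trailing n field identifies the compute blade (node) within the IRU, which this guide implies but does not spell out, so no range is enforced on that field.

```python
import re

BLADE_RE = re.compile(r"r(\d+)i(\d+)n(\d+)")


def blade_hostname(rack, iru, node):
    """Build a compute blade host name such as r1i0n0."""
    if rack < 1:
        raise ValueError("rack numbers start at 1")
    if iru not in range(4):
        raise ValueError("IRU module IDs are 0, 1, 2, and 3")
    # ASSUMPTION: 'node' is the blade index within the IRU.
    return f"r{rack}i{iru}n{node}"


def leader_hostname(rack):
    """Build a rack leader controller host name such as r1lead."""
    return f"r{rack}lead"


def cmc_hostname(iru):
    """Build a chassis manager host name such as i0c."""
    return f"i{iru}c"


def parse_blade_hostname(name):
    """Return (rack, iru, node) as integers for a compute blade host name."""
    match = BLADE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a compute blade host name: {name!r}")
    return tuple(int(group) for group in match.groups())


print(blade_hostname(1, 1, 0), leader_hostname(1), cmc_hostname(0))
print(parse_blade_hostname("r1i0n0"))
```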

Optional System Components

Availability of optional components for the SGI ICE 8400 series of systems may vary based on new product introductions or end-of-life components. Some options are listed in this manual; others may be introduced after this document goes to production status. Check with your SGI sales or support representative for the most current information on available product options not discussed in this manual.