An SGI Altix ICE 8000 series system is an integrated blade environment that can scale to thousands of nodes. The Scali Manage management software enables you to provision, install, configure, and manage your system. This chapter provides an overview of the SGI Altix ICE 8000 series system and covers the following topics:
This section provides a brief overview of the SGI Altix ICE 8000 series system hardware and covers the following topics:
For a detailed hardware description, see the SGI Altix ICE 8000 Series System Hardware User's Guide.
The SGI Altix ICE 8000 series system is a blade-based, scalable, high-density compute system. The basic building block is the individual rack unit (IRU). The IRU provides power, cooling, system control, and the network fabric for 16 compute blades, as shown in Figure 1-1. Each compute blade supports two processor sockets, each holding either a dual-core or a quad-core Xeon processor, and eight fully buffered, double-data-rate two (DDR2) dual in-line memory modules (DIMMs). Four IRUs can reside in a custom-designed 42U-high rack.
One rack supports a maximum of 512 processor cores and 2TB of memory.
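The rack maximum above follows from the building-block counts given in this section. The short Python sketch below reproduces the arithmetic. The constant names are illustrative, and the per-blade and per-DIMM memory figures are derived here from the 2TB rack total (taking 1TB as 1024GB) rather than quoted from this guide.

```python
# Building-block counts taken from the hardware overview above.
IRUS_PER_RACK = 4
BLADES_PER_IRU = 16
SOCKETS_PER_BLADE = 2
CORES_PER_SOCKET = 4       # quad-core Xeon; use 2 for dual-core parts
DIMMS_PER_BLADE = 8
RACK_MEMORY_GB = 2 * 1024  # 2TB per rack, assuming 1TB = 1024GB

blades_per_rack = IRUS_PER_RACK * BLADES_PER_IRU
cores_per_rack = blades_per_rack * SOCKETS_PER_BLADE * CORES_PER_SOCKET

# Derived values (assumptions for illustration, not quoted from this guide).
memory_per_blade_gb = RACK_MEMORY_GB // blades_per_rack
memory_per_dimm_gb = memory_per_blade_gb // DIMMS_PER_BLADE

print(blades_per_rack, cores_per_rack, memory_per_blade_gb, memory_per_dimm_gb)
```

With quad-core processors this gives 64 blades and 512 cores per rack, which matches the stated maximum.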
The SGI Altix ICE 8000 series system topology is based on an InfiniBand interconnect. Internal InfiniBand switch ASICs of the IRU eliminate the need for external InfiniBand switches. The dual high-speed, low-latency double data rate (DDR) InfiniBand backplanes built into the IRUs provide for fast communication between nodes and racks.
An InfiniBand switch blade provides the interface between compute blades within the same chassis and also between compute blades in separate IRUs. Fabric management software monitors and controls the InfiniBand fabric. SGI Altix ICE 8000 series systems are configured with two InfiniBand fabrics, designated as ib0 and ib1. In order to maximize performance, SGI advises that the ib0 fabric be used for all MPI traffic, such as Scali MPI or SGI Message Passing Toolkit (MPT). The ib1 fabric is reserved for storage related traffic. The default configuration for MPI is to use only the ib0 fabric. For more information on the InfiniBand fabric, see Chapter 3, “System Fabric Management”.
| Note: The “ib0 fabric” is a convenient shorthand for “the fabric which is connected to the ib0 interface on most of the nodes”. In the case of the storage service node, there are four interfaces called ib0 through ib3, all of which are connected to the ib1 fabric (see “Storage Service Node”). |
The SGI Altix ICE system is a distributed memory system as opposed to a shared memory system like that used in the SGI Altix 450 or SGI Altix 4700 high-performance compute servers. Instead of passing pointers into a shared virtual address space, parallel processes in an application pass messages and each process has its own dedicated processor and address space.
Just like a multi-processor shared memory system, an Altix ICE system can be shared among multiple applications. For instance, one application may run on 16 processors in the system while another application runs on a different set of eight processors. Very large systems may run dozens of separate, independent applications at the same time.
Typically, each process of an MPI job runs exclusively on a processor. Multiple processes can share a single processor, through standard Linux context switching, but this can have a significant effect on application performance. A parallel program can only finish when all of its sub-processes have finished. If one process is delayed because it is sharing a processor and memory with another application, then the entire parallel program is delayed. This gets slightly more complicated when systems have multiple processors (and/or multiple cores) that share memory, but the basic rule is that a process is run on a dedicated processor core.
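The cost of sharing a processor core can be illustrated with a toy model. In the Python sketch below, the function name and the timing figures are hypothetical. It shows only that the elapsed time of a parallel job is set by its slowest process.

```python
def job_elapsed_time(process_times):
    """A parallel job finishes only when its slowest process finishes."""
    return max(process_times)

# Hypothetical figures: four processes that each need 100 seconds
# on a dedicated core.
dedicated = [100.0, 100.0, 100.0, 100.0]

# Assume one process shares its core equally with another application,
# so it takes roughly twice as long (context-switch overhead ignored).
shared = [100.0, 100.0, 100.0, 200.0]

print(job_elapsed_time(dedicated))
print(job_elapsed_time(shared))
```

In this model, delaying a single process doubles the elapsed time of the whole job, even though the other three processes had dedicated cores.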
A Gigabit Ethernet network built into the backplane of the IRUs provides a control network isolated from application data. Traverse cables provide connections between IRUs and between racks. For more information on how the Gigabit Ethernet fabric is used, see “VLANs”.
Each IRU has one chassis management control (CMC) blade located directly below compute blade slot 0, as shown in Figure 1-1. This is the chassis manager that performs environmental control and monitoring of the IRU. The CMC controls master power to the compute blades under direction of the rack leader controller (leader node). The leader node can also query the CMC for monitored environmental data (temperatures, fan speeds, and so on) for the IRU. Power control for each blade is handled by its Baseboard Management Controller (BMC), also under direction of the rack leader controller. Once the leader node has asked the CMC to enable master power, the leader node can then command each BMC to power up its associated blade. The leader node can also query each BMC to obtain some environmental and error log information about each blade.
The IRU provides data collected from compute nodes within the IRU to the leader node upon request.
The CMC and BMCs are powered by what is called "AUX POWER". This power supply is live any time the rack is plugged in and the main breakers are on. The CMC and BMCs are not able to be powered off under software control.
The compute blades have MAIN POWER which is controlled by the blade BMC. You can send a command to the BMC and have the main power to the associated blade turned on or off by that BMC.
The IRU has a MAIN POWER bus that feeds all of the blades. This main power bus can be turned on and off with a software command to the CMC. This "powering up of the IRU" turns on this main power, the fans in the IRU, and the power to the IB switches. The CMC, itself, is always powered on. This includes the Ethernet switch that is a part of the CMC.
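The power-up ordering described above can be summarized with a small model. This is an illustrative Python sketch only: the class and method names are hypothetical and do not correspond to any SGI or Scali Manage interface. It assumes that a blade can be powered on only after the CMC has enabled the IRU main power bus, and that switching the bus off removes power from every blade.

```python
class IRU:
    """Toy model of the IRU power domains (all names are hypothetical)."""

    def __init__(self, blade_count=16):
        # AUX POWER feeds the CMC and BMCs whenever the rack is plugged in
        # and the breakers are on; software cannot switch it off.
        self.aux_power = True
        self.main_bus_on = False                # switched by the CMC
        self.blade_on = [False] * blade_count   # switched by each blade BMC

    def cmc_set_main_power(self, on):
        """Leader node asks the CMC to switch the IRU main power bus."""
        self.main_bus_on = on
        if not on:
            # Assumption: without the main bus, no blade stays powered.
            self.blade_on = [False] * len(self.blade_on)

    def bmc_set_blade_power(self, slot, on):
        """Leader node asks a blade BMC to switch that blade's main power."""
        if on and not self.main_bus_on:
            raise RuntimeError("enable IRU main power through the CMC first")
        self.blade_on[slot] = on


iru = IRU()
iru.cmc_set_main_power(True)     # step 1: the CMC enables master power
for slot in range(16):           # step 2: each BMC powers up its blade
    iru.bmc_set_blade_power(slot, True)
print(all(iru.blade_on))
```

The model keeps AUX POWER permanently on, which mirrors the statement that the CMC and BMCs cannot be powered off under software control.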
The SGI Altix ICE 8000 series system has a unique four-tier, hierarchical management framework as follows:
Unlike traditional, flat clusters, the SGI Altix ICE 8000 series system does not have a head node. The head node is replaced by a hierarchy of nodes that enables system resources to scale as you add processors. This hierarchy is, as follows:
System admin controller (admin node)
Rack leader controller (leader node)
Service nodes
    Login
    Batch
    Gateway
    Storage
The one system admin node can provision and control multiple leader nodes in the cluster. It receives aggregated cluster management data from the rack leader controllers (leader nodes).
Each system rack has its own leader node. The leader node holds the boot images for the compute blades and aggregates cluster management data for the rack.
Ethernet traffic for managing the nodes in a rack is constrained within the rack by the leader node. Communication and control are distributed across the entire cluster, thereby avoiding a communication bottleneck at the admin node. Administrative tasks, such as booting the cluster, can be done in parallel, rack by rack, in a matter of seconds. For very large configurations, the access infrastructure can also be scaled by adding login and batch service nodes. The VLAN logical networks help prevent network traffic bottlenecks.
| Note: Understanding the VLAN logical networks is critical to administering an SGI Altix ICE system. For more detailed information, see “VLANs” and “Network Interface Naming Conventions”. |
The rack leader controller (leader node) and admin node are described in the section that follows (“System Nodes”).
Figure 1-2 shows chassis manager cabling.
| Note: All nodes reside in the Altix ICE custom designed rack. Figure 1-2 and Figure 1-3 show how systems are cabled up prior to shipment. These figures are meant to give you a functional view of the Altix ICE hierarchical design. They are not meant as cabling diagrams. |
The chassis manager in each rack connects to the leader node in its own rack and also to the chassis manager in the adjacent rack. The admin node connects to one leader node in the rack. The admin node accesses the BMC on each compute node in the rack via a VLAN running over a Gigabit Ethernet (GigE) connection (see Figure 1-7).
Figure 1-3 shows cabling for a service node and storage service node (NAS cube).
This section describes the system nodes that are part of SGI Altix ICE 8000 series system and covers the following topics:
The system admin controller (admin node) is used by a system administrator to provision (install) and manage the SGI Altix ICE 8000 series system using Scali Manage systems management software. There is only one admin node per SGI Altix ICE 8000 series system, as shown in Figure 1-2, and it cannot be combined with any other nodes. A GigE connection provides the network connection between the admin node, leader nodes, and service nodes. Communication to and from the CMC and compute blades from the admin node is controlled by VLANs to reduce network traffic bottlenecks in the system. The admin node is used to provision and manage the leader nodes, compute nodes, and service nodes. It receives and holds aggregated management data from the leader nodes. The admin node is an appliance node. It always runs software specified by SGI.
The kernels, initrds and root filesystems (which together make up an "image") reside on the admin node.
When compute nodes are first set up with a new image, the leader nodes will cache this information to reduce the network load for the admin node.
The rack leader controller (leader node) is used to manage the nodes in a single rack. The rack leader controller is provisioned and controlled by the admin node. There is one leader node per rack, as shown in Figure 1-2. A GigE connection provides the network connection to other leader nodes and to the first IRU within its rack, as shown in Figure 1-3 and Figure 1-4. An InfiniBand fabric connects it to the compute nodes within its rack and to compute nodes in other racks. The leader node is an appliance node. It always runs software specified by SGI. The rack leader controller (leader node) does the following:
Runs the fabric management software to monitor and control the InfiniBand fabric on one or more leader nodes in your Altix ICE system
Monitors, controls, and receives data from the IRUs within its rack
Monitors, controls, and receives data from compute nodes within its rack
Consolidates and forwards data from the IRUs and compute nodes within its rack to the admin node upon request
| Note: The following CMC description is the same as the information presented in “Basic System Building Blocks”. |
Figure 1-1 shows an IRU with 16 compute nodes. Users submit MPI jobs to run in parallel on the Altix ICE system compute nodes using a public network connection via the service node. The service node provides login services and a batch scheduling service, such as the Scali MPI scheduling service. The compute nodes are controlled and monitored by the leader node in their rack, as shown in Figure 1-2. A compute node is diskless and its filesystem is in memory. In Scali Manage, diskless means a "memory resident" operating system: the operating system resides solely in system memory, so the compute nodes are re-imaged at each reboot. For Scali Manage Altix ICE systems, only random-access memory (RAM) is available on compute nodes. A power cycle installs a fresh image, and any changes to the compute node "filesystem" are volatile. The image that is loaded onto the Scali Manage Altix ICE compute nodes comes from the admin node and is cached on the leader node.
Actions for the CMC and compute blades are sent to the appropriate rack leader controller, which communicates to the appropriate CMC and compute blades. The compute nodes do not communicate directly to the CMC or admin nodes, or leader nodes outside their rack.
Generally, the CMC is not meant to be accessed directly by system administrators. In some situations, however, you may need to access it to change a configuration using the LCD control panel. For example, if you add a NAS cube to your system, you need to reconfigure the CMC.
| Note: The LCD control panel is not operational for the first release. |
The individual rack unit (IRU) is one of the basic building blocks of the SGI Altix ICE 8000 series system as shown in Figure 1-1. It is described in detail in “Basic System Building Blocks”.
The login service node allows users to log in to the system to create, compile, and run applications. The login node is usually combined with batch and gateway service nodes for most configurations. The login service node is connected to the Altix ICE system via the InfiniBand fabric and to the public customer network via GigE, as shown in Figure 1-4. Additional login service nodes can be added as the total number of user logins grows.
The batch service node provides a batch scheduling service, such as PBS Professional. PBS Professional is not supported by Scali Manage on SGI Altix ICE systems for the first release, so you need to install it separately. It is supported on the SGI Altix ICE software stack. The batch service node is commonly combined with login and gateway service nodes for most configurations. It is connected to the Altix ICE system via the InfiniBand fabric and to the public customer network via GigE. This node may be separated from gateway and/or login nodes to scale for large configurations or to run multiple batch schedulers.
The gateway service node is the gateway to services on the public network, such as storage, lightweight directory access protocol (LDAP) services, and file transfer protocol (FTP). Typically, it is combined with the login/batch service node. This node may be separated from login and/or batch nodes to scale for large configurations.
The storage service node is a network-attached storage (NAS) appliance bundle that provides InfiniBand attached storage for the Altix ICE system. There can be multiple storage service nodes for larger Altix ICE system configurations. Figure 1-3 shows a service node and a storage service node (NAS cube).
| Note: All nodes reside in the Altix ICE custom designed rack. Figure 1-2 and Figure 1-3 show how systems are cabled up prior to shipment. These figures are meant to give you a functional view of the Altix ICE hierarchical design. They are not meant as cabling diagrams. |
This section describes the Gigabit Ethernet (GigE) and 10/100 Ethernet connections and the InfiniBand fabric in an SGI Altix ICE 8000 series system and covers the following topics:
This section describes the various network connections in the SGI Altix ICE 8000 series system. Users access the system via a public network through service nodes such as the login node and the batch service node, as shown in Figure 1-4. A single service node can provide both login and batch services.
System administrators provision (install software) and manage the Altix ICE system via the logical VLAN network running over the GigE connection (see Figure 1-6, Figure 1-7, and Figure 1-8). The admin node is on the house network (public network) and you access it directly.
The leader node is connected to blades in its rack via the GigE VLAN. It is connected to all blades and service nodes via InfiniBand fabric. Leader nodes have access to compute nodes in other racks via the leader node in that rack.
The gateway service node is the gateway from the InfiniBand fabric to services on the public network, such as storage, lightweight directory access protocol (LDAP) services, and file transfer protocol (FTP). Typically, it is combined with the login/batch service node.
The admin node and service nodes communicate with the leader node over a GigE fabric that has logically separate, virtual local area networks (VLANs). This GigE fabric is embedded in the backplane of each IRU. This GigE fabric electrically connects much of the Altix ICE system (see Figure 1-4).
Users access compute nodes strictly from the service nodes. Jobs are started on compute nodes using commands on the service node, such as the OpenSSH client remote login program ssh(1), or the Scali Manage GUI, invoked with the following command: /opt/scali/bin/scalimanage-gui.
The SGI Altix ICE 8000 series system has several Ethernet networks that facilitate booting and managing the system. These networks are built onto the backplane of each IRU for connection to the compute blades and transverse cables between IRUs and between racks. Each compute blade has a Gigabit Ethernet (GigE) and 10/100 Ethernet connection to the backplane.
The GigE connection is an interface that is accessible to the operating system and the basic input/output system (BIOS) running on the blade. It is the interface over which the BIOS performs a preboot execution environment (PXE) boot, and it is eth0 to the Linux kernel.
The 10/100 Ethernet interface is accessible to the management interface (BMC) built onto each compute blade. The operating system running on the blade cannot directly access this 10/100 interface. It belongs to the processor on the BMC. Likewise, the BMC cannot access the GigE interface.
Figure 1-5 shows a more detailed view of the chassis manager.
The chassis management control (CMC) blade has two embedded Ethernet switches. One is a 24-port GigE switch and the other is a 24-port 10/100 switch. The 10/100 switch is a sub-switch of the GigE switch, hanging off one of its ports.
The primary GigE interface from each of the sixteen blades connects to the GigE switch, and the sixteen blade BMCs connect to the 10/100 switch. The GigE connections also connect the service nodes, including storage service nodes.
The GigE switches in each IRU are "stacked" using a special stacking connection between each IRU in a rack. This connection runs a special intra-switch protocol. All switches in a rack are ganged together to form one large 96-port switch. The connections from each CMC to another are labeled UP and DN, as shown in Figure 1-5. The switches are stacked in a ring, so failure of one link still allows traffic to flow in the opposite direction on the ring.
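The benefit of the ring arrangement can be checked with a short sketch. The Python below is illustrative only: it models the four CMC GigE switches of a full rack as nodes in a ring (the function name and the link representation are hypothetical) and confirms that every switch remains reachable after any single stacking link fails. It also reproduces the 96-port figure from four 24-port switches.

```python
from collections import deque

SWITCHES_PER_RACK = 4   # one CMC GigE switch per IRU, four IRUs per rack
PORTS_PER_SWITCH = 24

def all_reachable(node_count, links):
    """Breadth-first search: True if every node is reachable from node 0."""
    neighbors = {n: set() for n in range(node_count)}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = {0}
    queue = deque([0])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == node_count

# Ring of stacking links: 0-1, 1-2, 2-3, 3-0.
ring = [(i, (i + 1) % SWITCHES_PER_RACK) for i in range(SWITCHES_PER_RACK)]

# Remove each link in turn; the remaining links still form a connected chain.
survives_single_failure = all(
    all_reachable(SWITCHES_PER_RACK, [link for link in ring if link != failed])
    for failed in ring
)

print(SWITCHES_PER_RACK * PORTS_PER_SWITCH)
print(survives_single_failure)
```

A second simultaneous link failure would split the ring into two segments, which is why the model tests single failures only.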
The processor on the CMC manages these switches effectively forming a large, intelligent Ethernet switch. A VLAN mechanism runs on top of this network to allow management control software to query port statistics and other port metrics including the attached peer's MAC address.
The CMC has five additional RJ45 connections on its front panel as shown in Figure 1-5. The function of these jacks is, as follows:
Local
This is a connection to the leader node at the top of the rack in which this CMC is located. Only one CMC (of the possible four) is connected to the leader node, as shown in Figure 1-2.
LL
Used to connect service nodes and storage service nodes. The RL jack in the far left CMC connects to the LL jack of the right adjacent CMC to create or grow the Ethernet network. Figure 1-2 shows this daisy chaining.
RL
Used to connect service nodes and storage service nodes. The RL jack in the far left CMC connects to the LL jack of the right adjacent CMC to create or grow the Ethernet network. Figure 1-2 shows this daisy chaining.
L58
This is a connection for the IEEE 1588 timing protocol from this CMC to the one immediately to the left. If this is the left-most rack, this jack is unconnected.
R58
This is a connection for the IEEE 1588 timing protocol from this CMC to the one immediately to the right. If this is the right-most rack, this jack is unconnected.
A NAS cube storage service node uses both the LL and RL jacks to connect to the Altix ICE system as shown in Figure 1-3.
For small, one-IRU configurations, the L58 and R58 ports (see Figure 1-5) can be used to connect service nodes.
Several virtual local area networks (VLANs) are used to isolate Ethernet traffic domains within the cluster. The physical Ethernet is a shared network that has a connection to every node in the cluster. The admin node, leader nodes, service nodes, compute nodes, CMCs, and BMCs all have a connection to the Ethernet. To isolate the broadcast domains and other traffic within the cluster, the following VLANs are used to partition it:
VLAN_1588
Includes all 1588_left and 1588_right connections, as well as an internal port to the CMC processor. This VLAN carries all of the IEEE 1588 timing traffic.
VLAN_HEAD
Includes all leader_local, leader_left, and leader_right connections. The VLAN_HEAD VLAN connects the admin node to all of the leader nodes (including the leader nodes' BMCs) and the service nodes.
VLAN_BMC
Includes all 10/100 sub-switches and the leader_local ports. The VLAN_BMC VLAN connects the leader nodes to all of the BMCs on the compute blades and to the CMCs within each IRU. See Figure 1-6.
VLAN_GBE
Includes all GigE blade ports and the leader_local port. See Figure 1-6. The VLAN_GBE VLAN connects the leader nodes to the GigE interfaces of all the compute blades.
VLAN_GBE and VLAN_BMC do not extend outside of any rack. Therefore, traffic on those VLANs stays local to each rack.
Only VLAN_HEAD extends rack to rack. It is the network used by the admin node to communicate to the leader node of each rack and to each service node.
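One way to keep the management VLANs straight is to record what each one connects and whether it leaves the rack. The Python table below is a reading aid built from the descriptions above. The dictionary layout and field names are hypothetical, and VLAN_1588 is left out because its rack scope is not stated in this section.

```python
# Summary of the management VLANs described above (layout is illustrative).
VLANS = {
    "VLAN_HEAD": {
        "connects": "admin node to leader nodes, leader node BMCs, and service nodes",
        "crosses_racks": True,
    },
    "VLAN_BMC": {
        "connects": "leader node to compute blade BMCs and CMCs",
        "crosses_racks": False,
    },
    "VLAN_GBE": {
        "connects": "leader node to compute blade GigE interfaces",
        "crosses_racks": False,
    },
}

rack_local = sorted(name for name, info in VLANS.items() if not info["crosses_racks"])
inter_rack = sorted(name for name, info in VLANS.items() if info["crosses_racks"])

print(rack_local)
print(inter_rack)
```

Filtering on the crosses_racks field separates the two rack-local VLANs from VLAN_HEAD, the one network the admin node uses to reach every rack.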
The rack leader controllers (leader nodes) must run the 802.1Q VLAN protocol over their downstream GigE connection to the CMC, and the CMC LL port must also run 802.1Q. This is done for you when the rack leader controllers are installed from the system admin node (see “Installing Service and Leader Nodes” in Chapter 2). Each VLAN should present itself as a separate pseudo interface to the operating system kernel running on that leader node. VLAN_HEAD, VLAN_BMC, and VLAN_GBE must all traverse the single Ethernet segment that connects the leader to the CMC in the rack below it.
The VLAN_GBE and VLAN_BMC networks connect the leader node in a given rack with the compute nodes (blades). In the case of VLAN_BMC, the network also connects the CMC with the compute blades and rack leader controller (leader node).
In an SGI Altix ICE system with just one IRU, the CMC's R58 and L58 ports are assigned to VLAN_HEAD by a field-configurable setting. This provides two additional Ethernet ports that can be used to connect service nodes to your system.
The InfiniBand fabric connects the service nodes, leader nodes, and the compute blades. It does not connect to the admin node or the CMCs. The InfiniBand network has two separate network fabrics, ib0 and ib1. The host channel adapter (HCA) in the leader node has two ports that connect separately to the bottom IRU in the rack.
Each IRU has two 24-port switches (see Switch blade in Figure 1-9). Each switch is on a separate fabric.
On each switch, 16 ports go to the 16 compute blades. Each compute blade has two single-port HCAs, and each HCA connects to a fabric. Therefore, both switches connect to each blade.
Of the remaining eight ports on each switch, currently six of them are used to connect to either IRUs in the same rack or to IRUs in other racks. One port of one IRU in a rack (usually the first or 0th IRU) connects to the leader node in that rack.
To simplify the deployment and management of the Altix ICE system, the scaaltixice package includes functionality to automatically configure the system according to a fixed policy tailored for the hierarchical topology used in the Altix ICE system (see “Hardware Overview”).
The network policy implemented by the scaaltixice package is described in this section, as follows:
The Ethernet networks implemented are, as follows:
Corporate network, sometimes called the house network
Your site's existing corporate network to which the Altix ICE system is connected.
Head network
Network for communication between admin node, service nodes, and rack leader controllers (leader nodes). This is the inter-rack communication network.
Head BMC network
Network for communication between admin node and the BMCs on service nodes and leader nodes. This network is on the same VLAN as the head network.
Rack networks
One per rack. Provides the intra-rack network for communication between the leader node and all the blades in the rack.
Rack BMC network
One per rack. Provides the intra-rack network for communication between the leader node and the BMCs on all the blades in the rack.
The InfiniBand networks implemented are, as follows:
IB subnet1
Subnet for MPI communication. Default network.
IB subnet2
Subnet for network filesystems.
The system admin controller (admin node) is the Scali Manage server. Networks implemented are, as follows:
BMC
Connected to corporate network. You set the IP address and subnet mask.
eth0
Connected to corporate network. You set the IP address and subnet mask.
eth1
Connected to the head network (IP 172.16.0.1, name admin) and head BMC network (IP 172.17.0.1, name admin-mgm).
The service node networks implemented are, as follows:
BMC
Connected to the head BMC network (IP 172.17.0.[2-255])
eth0
Connected to the head network (IP 172.16.0.[2-255], name <hostname>)
eth1
Optionally, connected to the corporate network. IP address and subnet mask set by the customer. Name <hostname>-ext.
ib0
Connected to IB subnet1. (IP 10.0.0.[2-255], name <hostname>ib0)
ib1
Connected to IB subnet2. (IP 10.1.0.[2-255], name <hostname>ib1)
The rack leader controller (leader node) networks implemented are, as follows:
The leader nodes are Scali Manage Gateways. Hostname is rXXlead.
BMC
Connected to the head BMC network (IP 172.17.XX.[1-255])
eth0
Connected to the head network (IP 172.16.XX.[1-255], name rXXlead)
Tagged vlan 1: connected to rack BMC network (IP 192.168.1.1, name rXXlead-mgm)
Tagged vlan 2: connected to rack network (IP 192.168.0.1, name rXXlead-int)
ib0
Connected to IB subnet1. (IP 10.0.XX.1, name rXXlead-ib0)
ib1
Connected to IB subnet2. (IP 10.1.XX.1, name rXXlead-ib1)
The chassis management controller (Ethernet switch) networks implemented are, as follows:
Hostname is rXXcmc[01-04].
Connected to the rack BMC network (IP 192.168.1.[2-5], name rXXcmc[01-04])
The compute nodes (blades) networks implemented are, as follows:
Hostname is r[01-xx]i[01-04]n[01-16].
BMC
Connected to the rack BMC network (IP 192.168.1.[11-74], name r[01-xx]i[01-04]n[01-16]-bmc)
eth0
Connected to the rack network (IP 192.168.0.[11-74], name r[01-xx]i[01-04]n[01-16]-eth0)
ib0
Connected to IB subnet1. (IP 10.0.XX.[11-74], name r[01-xx]i[01-04]n[01-16])
ib1
Connected to IB subnet2. (IP 10.1.XX.[11-74], name r[01-xx]i[01-04]n[01-16]-ib1)
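The naming and addressing policy for compute blades can be expressed as a small helper. The Python sketch below is an illustration built on assumptions and is not part of the scaaltixice package. The function name is hypothetical. It assumes that the 64 addresses in the [11-74] range are assigned sequentially by IRU and then by blade slot, which this section does not state, and that XX in the InfiniBand addresses is the rack number.

```python
BLADES_PER_IRU = 16
FIRST_HOST_OCTET = 11   # addresses run from .11 to .74 (64 blades per rack)

def compute_node_addresses(rack, iru, node):
    """Return a name-to-IP mapping for one compute blade.

    rack, iru, and node are 1-based, matching r[01-xx]i[01-04]n[01-16].
    The sequential octet assignment is an assumption made for illustration.
    """
    if not (1 <= iru <= 4 and 1 <= node <= BLADES_PER_IRU):
        raise ValueError("iru must be 1-4 and node must be 1-16")
    host = f"r{rack:02d}i{iru:02d}n{node:02d}"
    octet = FIRST_HOST_OCTET + (iru - 1) * BLADES_PER_IRU + (node - 1)
    return {
        f"{host}-bmc": f"192.168.1.{octet}",    # rack BMC network
        f"{host}-eth0": f"192.168.0.{octet}",   # rack network
        host: f"10.0.{rack}.{octet}",           # IB subnet1 (ib0)
        f"{host}-ib1": f"10.1.{rack}.{octet}",  # IB subnet2 (ib1)
    }

first = compute_node_addresses(rack=1, iru=1, node=1)
last = compute_node_addresses(rack=1, iru=4, node=16)
print(first["r01i01n01-eth0"])
print(last["r01i04n16-bmc"])
```

Under these assumptions the first blade in a rack receives host octet 11 and the last receives 74, which matches the documented range. The rack and rack BMC addresses repeat in every rack because those networks do not extend outside the rack.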