Chapter 4. System Fabric Management

The InfiniBand network on SGI Altix ICE 8200 series systems uses Open Fabrics Enterprise Distribution (OFED) software. This section describes the InfiniBand fabric and how to manage it. For background information on OFED, see http://www.openfabrics.org.

InfiniBand Fabric Management

This section describes the InfiniBand fabric and the tools used to configure, administer, and verify it.

InfiniBand Fabric Overview

Fabric management on SGI Altix ICE 8200 series systems uses the OFED OpenSM software package. The InfiniBand fabric connects the service nodes, rack leader controllers (leader nodes), and the compute nodes. It does not connect to the system admin controller (admin node) or the chassis management control (CMC) blades. The InfiniBand network has two separate network fabrics, ib0 and ib1 (see “InfiniBand Fabric” in Chapter 1) with the following characteristics:

  • Each network fabric has its own subnet manager (SM).

  • For a system with two or more racks, one rack leader controller (leader node) runs an instance of SM to manage the ib0 fabric and a second leader node runs an instance of SM to manage the ib1 fabric. A database on the admin node records which rack leader nodes are running the fabric management software for ib0 and for ib1. The sgifmcli command has the logic to place opensm on the appropriate rack leader controller. If one of the rack leader controllers becomes unavailable, management of the fabric can be assigned to another available rack leader node in the system.


    Note: The LX series has only one InfiniBand fabric; therefore, the sgifmcli(8) command should only be run on ib0 (see “The InfiniBand Management Tool Graphical User Interface”).


  • ib0 is mapped to port 1 of the host channel adapter (HCA) on the SM node. ib1 is mapped to port 2 of the HCA on the SM node.

  • On a system with a single rack, both instances of opensm run on the same rack leader node.

  • Each instance of SM on the rack leader controller is controlled by the /etc/ofa/opensm-ib[01].conf configuration file.

  • Rack leader controllers run the opensm daemon for each fabric over separate HCA ports (see Figure 1-9).


    Note: After a system reboot, the opensm daemons start running automatically on the InfiniBand fabric.


  • Each fabric is addressed by a global unique identifier (GUID) and unique HCA port.

    The GUID and HCA port are set in the configuration file.

  • SGI supports the following topologies: hypercube, enhanced hypercube, and fat tree.

  • Each subnet manager (SM) has a failover mechanism. You can define a master / standby per InfiniBand plane for increased resiliency. For more information, see “InfiniBand Fabric Failover Mechanism”.
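The fabric-to-port and fabric-to-file relationships listed above can be captured in a small lookup. The following is a minimal sketch only: fabric_info is a hypothetical helper name, not part of the SGI software, and it simply restates the mapping given in this section (ib0 on HCA port 1, ib1 on HCA port 2, and the /etc/ofa/opensm-ib[01].conf files).

```shell
# fabric_info (hypothetical helper): print the HCA port and OpenSM configuration
# file that this section associates with a fabric name.
fabric_info() {
    case "$1" in
        ib0) echo "port=1 conf=/etc/ofa/opensm-ib0.conf" ;;
        ib1) echo "port=2 conf=/etc/ofa/opensm-ib1.conf" ;;
        *)   echo "unknown fabric: $1" >&2; return 1 ;;
    esac
}
```

For example, fabric_info ib1 prints the port and configuration file for the second fabric, and any other name is rejected with a nonzero return code.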

The InfiniBand Management Tool Graphical User Interface

You can use the InfiniBand management tool graphical user interface (GUI) to configure, administer, or verify the InfiniBand fabric on your SGI Altix ICE system. With it, you can configure, start, stop, restart, clean up, or get the status of the InfiniBand fabric.

From the system admin controller (admin node), enter the following command:

sys-admin:~ # tempo-configure-fabric

The InfiniBand Management Tool GUI appears, as shown in Figure 4-1.

You can also access this command from the configure-cluster GUI main menu F Configure Infiniband Fabric selection (see “configure-cluster Command Cluster Configuration Tool” in Chapter 2). For more information, see Figure 4-1.

Figure 4-1. InfiniBand Management Tool Screen


Use the InfiniBand Management GUI to manage your InfiniBand fabric. Use the Select button to select the action you want to perform; a submenu appears. Use the Quit button to return to the previous screen. Use the Help button to get online help for each of the GUI actions.

From the Configure InfiniBand screen, make sure you select the Configure Topology option to set the topology, as shown in Figure 4-2. For more information, see “Network Topology”.

Figure 4-2. Configure Topology Screen


Use the online help available with this tool to guide you through the InfiniBand configuration. After configuring and bringing up the InfiniBand network, select the Administer InfiniBand ib0 option or the Administer InfiniBand ib1 option. The Administer InfiniBand screen appears, as shown in Figure 4-3. You can use this screen to start, stop, restart, or refresh a fabric.

Figure 4-3. Administer InfiniBand Tool Screen


You can verify the status via the Status option, as shown in Figure 4-4.

Figure 4-4. Administer InfiniBand Status Option


Fabric Component sgifmcli Command


Note: The LX series has only one InfiniBand fabric; therefore, the sgifmcli(8) command described in this section should only be run on the ib0 fabric.


The sgifmcli software manages the cluster fabrics. For more advanced operations, use the sgifmcli(8) command to configure, administer, and verify the fabric or to integrate InfiniBand switches with your InfiniBand network. For more information, see the sgifmcli(8) man page.

Currently, the following switches are supported:

Switch Type          Description
voltaire-isr-9024    Voltaire ISR 9024
voltaire-isr-2004    Voltaire ISR 2004
voltaire-isr-2012    Voltaire ISR 2012
voltaire-isr-9096    Voltaire ISR 9096
voltaire-isr-9288    Voltaire ISR 9288

At the SGI Tempo 1.7 release, the smconfig and smadmin command functionality was integrated into the sgifmcli command. Use the tempo-configure-fabric command to configure the InfiniBand network. The sgifmcli command is used for the following:

  • Initialize and configure external InfiniBand switches

    This is done automatically by the Tempo discover script (see “InfiniBand Configuration” in Chapter 2) but can also be done manually by an administrator. For this operation, no cluster-wide InfiniBand connectivity needs to exist. The only necessity is that the supplied host name is resolvable and provides a working networking connection to the external InfiniBand switch.

  • Configure and administer the cluster fabric

  • Verify the InfiniBand fabric

    This operation requires that the InfiniBand network is configured properly using the tempo-configure-fabric command (see “The InfiniBand Management Tool Graphical User Interface”).

sgifmcli SGI Fabric Component Command

The syntax of the sgifmcli(8) command is as follows:

sgifmcli [type action [options]] | [options]


Note: You can use shortened versions of the following sgifmcli options as long as you supply enough letters to identify the option unambiguously. For example, you can enter sgifmcli --vers for sgifmcli --version.


It accepts the following general options:

General Option 

Description

-h, --help 

Displays a help message and then exits

-V, --version 

Shows the version number of the program

-v, --verbose [DEBUG | INFO | ERROR] 

Selects the verbosity level (default: ERROR). Most of the messages from sgifmcli are written to a log file named /var/log/sgifmcli.log. The default level reports error messages only. INFO provides the user with details about the operation of sgifmcli in addition to error messages. The DEBUG level produces output that is tailored toward the developer to help with bug fixing. In addition, the DEBUG level also produces INFO and ERROR messages.

It accepts the following detailed options:

Detailed Option 

Description

type 

The type option is one of the following:

  • --mastersm - Master subnet manager

  • --standby - Standby subnet manager

  • --ibswitch - InfiniBand switch

  • --ibfabric - InfiniBand fabric

action 

The action option is one of the following:

  • --init - Initializes the switch or fabric

  • --start - Starts a subnet manager

  • --stop - Stops a subnet manager

  • --status - Prints the status of a subnet manager

  • --verify - Verifies the fabric

  • --refresh - Updates an InfiniBand fabric (for Enhanced Hypercube)

  • --set - Sets specific SM configuration parameter (see arglist)

  • --add - Adds a subcomponent to its container, for example, add a switch to a fabric

  • --delete - Deletes a subcomponent from its container, for example, delete a switch from a fabric

  • --remove - Removes an entity, such as a switch or fabric

  • --showconfig - Prints fabric configuration

  • --switchlist - Lists switches in a fabric

options 

The options option is one or more of the following with no duplicates, for example, the --fabric option must be either ib0 or ib1, not both:

  • --id - Unique identifier, for example, host name

  • --hostname - Name of the node on which to run OpenSM

  • --switchtype - Type of switch (leaf or spine)

  • --model - Switch model (voltaire-isr-9024, voltaire-isr-2004, voltaire-isr-2012, voltaire-isr-9096, or voltaire-isr-9288)

  • --fabric - Fabric, either ib0 or ib1

  • --topology - InfiniBand topology, either hypercube, enhanced-hypercube, or ftree

  • --arglist - List of Subnet Manager configuration parameters: param_1=val_1, param_2=val_2, ...

EXIT CODES

To facilitate the use of the sgifmcli(8) command in shell scripts, an exit code is returned to indicate whether a given invocation succeeded.

The exit codes returned by sgifmcli are, as follows:

0      Successful termination.
255    Abnormal termination.
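In a shell script, the exit code can be tested directly after each invocation. The following is a minimal sketch under stated assumptions: run_checked is a hypothetical wrapper name, and it treats any nonzero code (including 255) as a failure. It works with any command, so it does not depend on sgifmcli being installed.

```shell
# run_checked (hypothetical wrapper): run the given command, report success or
# failure based on its exit code, and pass that exit code back to the caller.
run_checked() {
    "$@"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK: $*"
    else
        echo "FAILED (exit $rc): $*" >&2
    fi
    return "$rc"
}
```

A script on the admin node could then call, for example, run_checked sgifmcli --status --id master_ib0 and stop processing when the wrapper returns a nonzero value.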

To display the detailed man page, enter the following command from the admin node:

sys-admin:~ # man sgifmcli

The sgifmcli(8) fabric administration utilities man page appears.

sgifmdb Fabric Management Database Command

The fabric component maintains a database (DB) of the objects it manages (managed objects). The database version is automatically set during cluster install. You do not need to set it. Most likely, this database will change over time. To manage multiple database versions and also to aid in field support, SGI has added another command line tool that currently reports the managed objects database version.

The sgifmdb command is, as follows:

sgifmdb [--get|-g] [--dump|-d] [-v|--version] [-r|--reset] [--help|-h]

It accepts the following general options:

General Option 

Description

-g, --get 

Reads the database version object from the database

-d, --dump 

Dumps the database. This option allows you to see what fabric objects are currently stored in the fabric database.

-v, --version 

Prints version

-r, --reset 

Resets the database and starts clean

-h, --help 

Shows the usage statement

Example 4-1. Getting sgifmdb(8) Command Help

For a sgifmdb command usage statement, perform the following from the admin node:

sys-admin:~ # sgifmdb -h
SGI Fabric Component DB Version
Usage: sgifmdb [--get|-g] [--dump|-d] [-v|--version] [--help|-h]

        -g, --get       Read DB version object from DB
        -d, --dump      Dump the DB
        -v, --version   Print version
        -h, --help      Show this text


InfiniBand Fabric Management Configuration and Operation Overview

Each subnet manager (SM) performs a light sweep of the fabric it is managing every 10 seconds by default. To change the interval, set the sweep_interval variable in the /opt/sgi/var/sgifmcli/opensm-ib0.conf.templ file and then perform a Commit operation in the tempo-configure-fabric GUI. Alternatively, use the --arglist option of the sgifmcli command, which sets various subnet manager configuration parameters, including the sweep interval.


Note: If your cluster is larger than 256 nodes, SGI highly recommends increasing this variable to 90 seconds or more.


If an SM detects a change in the fabric during a light sweep, such as the addition or deletion of a node, it performs a heavy sweep. The heavy sweep actually changes the fabric configuration to reflect the current state of the system.
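The sizing guidance for the sweep interval can be expressed as a simple rule. The following is a minimal sketch, not an SGI-supplied tool: sweep_arglist is a hypothetical helper that returns a param=value string in the form accepted by --arglist, using the 10-second default for clusters of up to 256 nodes and the recommended 90 seconds for larger clusters. The exact sgifmcli invocation that consumes this string is not shown here; see the sgifmcli(8) man page.

```shell
# sweep_arglist (hypothetical helper): print a "sweep_interval=N" string for a
# cluster with the given number of nodes.
sweep_arglist() {
    nodes=$1
    if [ "$nodes" -gt 256 ]; then
        interval=90   # recommended value for clusters larger than 256 nodes
    else
        interval=10   # default light sweep interval in seconds
    fi
    echo "sweep_interval=$interval"
}
```

For a 512-node cluster, sweep_arglist 512 yields sweep_interval=90. Sites that need an even longer interval would adjust the value by hand.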

A sample opensm-ibx.conf configuration file is, as follows:

Example 4-2. opensm-ib0.conf and opensm-ib1.conf Configuration Files

#
# DEVICE ATTRIBUTES OPTIONS
#
# The port GUID on which the OpenSM is running
guid 0x0000000000000000

# M_Key value sent to all ports qualifying all Set(PortInfo)
m_key 0x0000000000000000

# The lease period used for the M_Key on this subnet in [sec]
m_key_lease_period 0

# SM_Key value of the SM used for SM authentication
sm_key 0x0000000000000001

# SM_Key value to qualify rcv SA queries as 'trusted'
sa_key 0x0000000000000001

# Note that for both values above (sm_key and sa_key)
# OpenSM version 3.2.1 and below used the default value '1'
# in a host byte order, it is fixed now but you may need to
# change the values to interoperate with old OpenSM running
# on a little endian machine.

# Subnet prefix used on this subnet
subnet_prefix 0xfec0000000000000

# The LMC value used on this subnet
lmc 0

# lmc_esp0 determines whether LMC value used on subnet is used for
# enhanced switch port 0. If TRUE, LMC value for subnet is used for
# ESP0. Otherwise, LMC value for ESP0s is 0.
lmc_esp0 FALSE

# The code of maximal time a packet can live in a switch
# The actual time is 4.096usec * 2^<packet_life_time>
# The value 0x14 disables this mechanism
packet_life_time 0x12

# The number of sequential packets dropped that cause the port
# to enter the VLStalled state. The result of setting this value to
# zero is undefined.
vl_stall_count 0x07

# The number of sequential packets dropped that cause the port
# to enter the VLStalled state. This value is for switch ports
# driving a CA or router port. The result of setting this value
# to zero is undefined.
leaf_vl_stall_count 0x07

# The code of maximal time a packet can wait at the head of
# transmission queue.
# The actual time is 4.096usec * 2^<head_of_queue_lifetime>
# The value 0x14 disables this mechanism
head_of_queue_lifetime 0x12

# The maximal time a packet can wait at the head of queue on
# switch port connected to a CA or router port
leaf_head_of_queue_lifetime 0x10

# Limit the maximal operational VLs
max_op_vls 5

# Force PortInfo:LinkSpeedEnabled on switch ports
# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port
# Otherwise, use value for PortInfo:LinkSpeedEnabled on switch port
# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 "PortInfo")
#    1: 2.5 Gbps
#    3: 2.5 or 5.0 Gbps
#    5: 2.5 or 10.0 Gbps
#    7: 2.5 or 5.0 or 10.0 Gbps
#    2,4,6,8-14 Reserved
#    Default 15: set to PortInfo:LinkSpeedSupported
force_link_speed 15

# The subnet_timeout code that will be set for all the ports
# The actual timeout is 4.096usec * 2^<subnet_timeout>
subnet_timeout 18

# Threshold of local phy errors for sending Trap 129
local_phy_errors_threshold 0x08

# Threshold of credit overrun errors for sending Trap 130
overrun_errors_threshold 0x08

#
# PARTITIONING OPTIONS
#
# Partition configuration file to be used
partition_config_file /etc/ofa/partitions.conf

# Disable partition enforcement by switches
no_partition_enforcement FALSE

#
# SWEEP OPTIONS
#
# The number of seconds between subnet sweeps (0 disables it)
sweep_interval 10

# If TRUE cause all lids to be reassigned
reassign_lids FALSE

# If TRUE forces every sweep to be a heavy sweep
force_heavy_sweep FALSE

# If TRUE every trap will cause a heavy sweep.
# NOTE: successive identical traps (>10) are suppressed
sweep_on_trap TRUE

#
# ROUTING OPTIONS
#
# If TRUE count switches as link subscriptions
port_profile_switch_nodes FALSE

# Name of file with port guids to be ignored by port profiling
port_prof_ignore_file (null)

# Routing engine
# Multiple routing engines can be specified separated by
# commas so that specific ordering of routing algorithms will
# be tried if earlier routing engines fail.
# Supported engines: minhop, updn, file, ftree, lash, dor
routing_engine (null)

# Connect roots (use FALSE if unsure)
connect_roots FALSE

# Use unicast routing cache (use FALSE if unsure)
use_ucast_cache FALSE

# Lid matrix dump file name
lid_matrix_dump_file (null)

# LFTs file name
lfts_file (null)

# The file holding the root node guids (for fat-tree or Up/Down)
# One guid in each line
root_guid_file /etc/ofa/switchguids-ib0.conf

# The file holding the fat-tree compute node guids
# One guid in each line
cn_guid_file (null)

# The file holding the node ids which will be used by Up/Down algorithm instead
# of GUIDs (one guid and id in each line)
ids_guid_file (null)

# The file holding guid routing order guids (for MinHop and Up/Down)
guid_routing_order_file (null)

# SA database file name
sa_db_file (null)

#
# HANDOVER - MULTIPLE SMs OPTIONS
#
# SM priority used for deciding who is the master
# Range goes from 0 (lowest priority) to 15 (highest).
sm_priority 0

# If TRUE other SMs on the subnet should be ignored
ignore_other_sm FALSE

# Timeout in [msec] between two polls of active master SM
sminfo_polling_timeout 10000

# Number of failing polls of remote SM that declares it dead
polling_retry_number 4

# If TRUE honor the guid2lid file when coming out of standby
# state, if such file exists and is valid
honor_guid2lid_file FALSE

#
# TIMING AND THREADING OPTIONS
#
# Maximum number of SMPs sent in parallel
max_wire_smps 4

# The maximum time in [msec] allowed for a transaction to complete
transaction_timeout 200

# Maximal time in [msec] a message can stay in the incoming message queue.
# If there is more than one message in the queue and the last message
# stayed in the queue more than this value, any SA request will be
# immediately returned with a BUSY status.
max_msg_fifo_timeout 10000

# Use a single thread for handling SA queries
single_thread FALSE

#
# MISC OPTIONS
#
# Daemon mode
daemon FALSE

# SM Inactive
sm_inactive FALSE

# Babbling Port Policy
babbling_port_policy FALSE

#
# Event Plugin Options
#
event_plugin_name (null)

#
# Node name map for mapping node's to more descriptive node descriptions
# (man ibnetdiscover for more information)
#
node_name_map_name (null)

#
# DEBUG FEATURES
#
# The log flags used
log_flags 0x03

# Force flush of the log file after each log message
force_log_flush FALSE

# Log file to be used
log_file /var/log/opensm-ib0.log

# Limit the size of the log file in MB. If overrun, log is restarted
log_max_size 0

# If TRUE will accumulate the log over multiple OpenSM sessions
accum_log_file TRUE

# The directory to hold the file OpenSM dumps
dump_files_dir /var/log/

# If TRUE enables new high risk options and hardware specific quirks
enable_quirks FALSE

# If TRUE disables client reregistration
no_clients_rereg FALSE

# If TRUE OpenSM should disable multicast support and
# no multicast routing is performed if TRUE
disable_multicast FALSE

# If TRUE opensm will exit on fatal initialization issues
exit_on_fatal TRUE

# console [off|local]
console off

# Telnet port for console (default 10000)
console_port 10000

#
# QoS OPTIONS
#
# Enable QoS setup
qos FALSE

# QoS policy file to be used
qos_policy_file /etc/ofa/qos-policy.conf

# QoS default options
qos_max_vls 0
qos_high_limit -1
qos_vlarb_high (null)
qos_vlarb_low (null)
qos_sl2vl (null)

# QoS CA options
qos_ca_max_vls 0
qos_ca_high_limit -1
qos_ca_vlarb_high (null)
qos_ca_vlarb_low (null)
qos_ca_sl2vl (null)

# QoS Switch Port 0 options
qos_sw0_max_vls 0
qos_sw0_high_limit -1
qos_sw0_vlarb_high (null)
qos_sw0_vlarb_low (null)
qos_sw0_sl2vl (null)

# QoS Switch external ports options
qos_swe_max_vls 0
qos_swe_high_limit -1
qos_swe_vlarb_high (null)
qos_swe_vlarb_low (null)
qos_swe_sl2vl (null)

# QoS Router ports options
qos_rtr_max_vls 0
qos_rtr_high_limit -1
qos_rtr_vlarb_high (null)
qos_rtr_vlarb_low (null)
qos_rtr_sl2vl (null)

# Prefix routes file name
prefix_routes_file /etc/ofa/prefix-routes.conf

#
# IPv6 Solicited Node Multicast (SNM) Options
#
consolidate_ipv6_snm_req FALSE
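
Several timing values in the sample file (packet_life_time, head_of_queue_lifetime, and subnet_timeout) are stored as exponent codes rather than as times, using the formula 4.096 usec * 2^code given in the file comments. The following is a minimal sketch for reading a value from such a file and converting it. Both function names, conf_get and code_to_usec, are hypothetical, and the sketch assumes the simple "keyword value" line layout shown above.

```shell
# conf_get (hypothetical helper): print the value of KEY from an
# opensm-ibX.conf style file. Comment lines never match because their
# first field is "#".
conf_get() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# code_to_usec (hypothetical helper): convert a timeout code, given in decimal
# or 0x-prefixed hexadecimal, to microseconds using 4.096 usec * 2^code.
# Note that for packet_life_time and head_of_queue_lifetime the code 0x14
# disables the mechanism instead of selecting a time.
code_to_usec() {
    code=$(( $1 ))
    awk -v c="$code" 'BEGIN { printf "%.3f\n", 4.096 * 2 ^ c }'
}
```

With the sample values, the packet_life_time code 0x12 (decimal 18) corresponds to 1073741.824 microseconds, or roughly 1.07 seconds, and the leaf_head_of_queue_lifetime code 0x10 corresponds to 268435.456 microseconds.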



Each fabric is addressed by a global unique identifier (GUID) and unique HCA port (see Figure 4-5). Each fabric has a unique GUID set in its respective configuration file.

Figure 4-5. Two InfiniBand Fabrics in a System with Two IRUs


Network Topology

For SGI Altix ICE systems with a hypercube topology, SGI uses the dimension order routing (DOR) algorithm.

The dimension order routing algorithm is based on the min hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions.

For SGI Altix ICE systems with a fat-tree topology, SGI uses updn as the default routing algorithm. The Up/Down unicast routing algorithm (updn) is also based on the minimum number of hops to each node, but it is constrained by ranking rules.

For more information on routing variables, see the opensm(8) man page.
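The pairing of topology and default routing engine described above can be summarized as a lookup. The following is a minimal sketch: routing_engine_for is a hypothetical helper name, the topology names are the values accepted by the --topology option, and only the two pairings stated in this section are included. No default is given here for enhanced-hypercube, so the sketch reports an error for it rather than guessing.

```shell
# routing_engine_for (hypothetical helper): print the default routing engine
# that this section documents for a given --topology value.
routing_engine_for() {
    case "$1" in
        hypercube) echo dor ;;    # dimension order routing
        ftree)     echo updn ;;   # Up/Down routing
        *) echo "no default routing engine documented here for: $1" >&2
           return 1 ;;
    esac
}
```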

Hypercube network topology is well suited for MPI jobs with smaller node counts or jobs whose communication patterns are not sensitive to bisection bandwidth. Fat-tree network topology is well suited for MPI jobs with large node counts that are sensitive to bisection bandwidth.

As stated above, there are two opensm daemons, one for each fabric: opensmd-ib0 and opensmd-ib1. They are controlled by init.d scripts. Each init.d script has a separate configuration file for its fabric, opensm-ib0 and opensm-ib1, respectively.

You can use the sminfo command to show the GUID of the SM master.

Configuring the InfiniBand Fabric

This section describes how to configure and administer the InfiniBand fabric using the sgifmcli(8) command.


Note: SGI highly recommends that you use the tempo-configure-fabric GUI to configure and administer the fabric (see “The InfiniBand Management Tool Graphical User Interface”).


Procedure 4-1. Configure the Master Subnet Manager

    When configuring the SM master, the following rules apply:

    • Each InfiniBand fabric needs to have a subnet manager (SM) master.

    • There can be at most one SM master per InfiniBand plane.

    • Fabric configuration and administration can only be done via the SM master.

    • Fabric configuration becomes active after (re)starting the SM master.

    • Deleting an SM master automatically deletes its standby, if it exists.

    The syntax to configure an SM master is, as follows:

    sgifmcli --mastersm --init --id identifier --hostname hostname --fabric fabric --topology topology

    This command creates a master with the name provided by the --id option. The identifier can be any arbitrary string. The --hostname option determines the host on which the SM master is launched. The --fabric option associates the SM master with either ib0 or ib1. The --topology option refers to the InfiniBand topology, which can be hypercube, enhanced hypercube, or fat tree.
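Because a mistyped fabric or topology name is easy to miss, it can help to validate the arguments before running the command. The following is a minimal sketch that only prints the command line and does not execute it. master_init_cmd is a hypothetical helper name, and the accepted values are the --fabric and --topology values listed in the option descriptions in this chapter.

```shell
# master_init_cmd (hypothetical helper): validate the fabric and topology
# names, then print the sgifmcli command that would define the SM master.
master_init_cmd() {
    id=$1 host=$2 fabric=$3 topology=$4
    case "$fabric" in
        ib0|ib1) ;;
        *) echo "fabric must be ib0 or ib1: $fabric" >&2; return 1 ;;
    esac
    case "$topology" in
        hypercube|enhanced-hypercube|ftree) ;;
        *) echo "unknown topology: $topology" >&2; return 1 ;;
    esac
    echo "sgifmcli --mastersm --init --id $id --hostname $host --fabric $fabric --topology $topology"
}
```

Printing the command first allows an administrator to review it, or to pass it to a shell only after the check succeeds.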

    To configure a master for the fabric ib0 on a hypercube cluster, perform the following steps:

    1. From the admin node, configure an SM master, as follows:

      # sgifmcli --mastersm --init --id master_ib0 --hostname r1lead --fabric ib0 --topology hypercube

      This creates an SM master for ib0. The underlying topology is a hypercube and thus the routing algorithm dor will be used. This SM master, named master_ib0, is configured to run on the host r1lead.

    2. The syntax to start an SM master is, as follows:

      sgifmcli --start --id identifier

      To start the master_ib0 SM master, perform the following:

      # sgifmcli --start --id master_ib0

      At this point, a master for the fabric ib0 is running on r1lead, and thus the fabric ib0 is available for compute jobs. If a standby has been defined, it is launched automatically in addition to the master.

    3. The syntax to stop an SM master is, as follows:

      sgifmcli --stop --id identifier

      To stop the master_ib0 SM master, perform the following:
      # sgifmcli --stop --id master_ib0

      The SM master master_ib0 running on host r1lead is stopped. If a standby has been defined then it will be stopped automatically, in addition to the master.

    4. The syntax to check the status of an SM master is, as follows:

      sgifmcli --status --id identifier

      To check the status of the master_ib0 SM master, perform the following:
      # sgifmcli --status --id master_ib0
      Master SM
      Host = r1lead
      Guid = 0x0002c902002838f5
      Fabric = ib0
      Topology = hypercube
      Routing Engine = dor
      OpenSM = running

      The status of the SM master master_ib0 running on host r1lead is reported. If a standby has been defined, its status will be reported in addition to the master.

    5. The syntax to remove an SM master is, as follows:

      sgifmcli --remove --id identifier

      To remove the master_ib0 SM master, first stop it and then use the --remove option, as follows:

      # sgifmcli --stop --id master_ib0
      
      # sgifmcli --remove --id master_ib0

      The SM master is removed from the entity list. If a standby has been defined, it is removed, in addition to the master.

    6. To print the fabric configuration, run the following:

      # sgifmcli --showconfig
      
      --------------
      NAME = ib1
      TYPE = ibfabric
      MASTER = 
      STANDBY = 
      SWITCH_LIST = 
      --------------
      NAME = ib0
      TYPE = ibfabric
      MASTER = 
      STANDBY = 
      SWITCH_LIST = 
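
The status report shown in step 4 uses "name = value" lines, which makes it straightforward to check from a script. The following is a minimal sketch, assuming the output format matches the example in this procedure: opensm_running is a hypothetical helper that reads status output on standard input and succeeds only if at least one OpenSM line is present and every OpenSM line reports running. Any other value is treated as not running.

```shell
# opensm_running (hypothetical helper): read "sgifmcli --status" output on
# stdin; return 0 only if every "OpenSM = ..." line says "running" and at
# least one such line was seen.
opensm_running() {
    awk -F' = ' '
        $1 ~ /OpenSM/ { seen = 1; if ($2 != "running") bad = 1 }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}
```

On the admin node, this could be used as sgifmcli --status --id master_ib0 | opensm_running. When a standby is defined, both the master and the standby lines must report running for the check to pass.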

    InfiniBand Fabric Failover Mechanism

    Each subnet manager (SM) has a failover mechanism. If the master SM fails, the standby SM takes over operation of the fabric. This failover operation is performed automatically by the opensm software. Typically, rack1 has the MASTER for the ib0 fabric and rack2 has the MASTER for the ib1 fabric, as shown in Figure 4-6.

    Figure 4-6. opensm Software Failover


    The following procedure describes how to set up the failover mechanism.

    Procedure 4-2. Enabling the InfiniBand Failover Mechanism

      When enabling the InfiniBand failover mechanism, the following rules apply:

      • Each InfiniBand fabric can optionally have exactly one standby.

      • A standby SM can only be created for a particular fabric when a master already exists.

      • When adding a standby after a master has already been defined and started, the master needs to be stopped before the standby is defined via the --init option. After defining the standby via --init, restart the master.

      • An SM master and an SM standby for a particular fabric cannot coexist on the same node.

      SGI highly recommends that you use the tempo-configure-fabric GUI to configure the failover mechanism. If it is necessary to use sgifmcli(8) to enable the InfiniBand failover mechanism, perform the following steps:

      1. If an SM master is defined and running, stop it, as follows:

        # sgifmcli --stop --id master_ib0

        If the SM master has not been defined, define it, as follows:
        # sgifmcli --mastersm --init --id master_ib0 --hostname r1lead --fabric ib0 --topology hypercube

      2. Define the SM standby, as follows:

        # sgifmcli --standbysm --init --id standby_ib0 --hostname r2lead --fabric ib0

      3. Start the SM master, as follows:

        # sgifmcli --start --id master_ib0

        This automatically starts the SM master and the SM standby for ib0.

      4. Now check the status for the subnet manager of ib0, as follows:

        # sgifmcli --status --id master_ib0
        
        Master SM
        Host = r1lead
        Guid = 0x0008f10403987da9
        Fabric = ib0
        Topology = hypercube
        Routing Engine = dor
        OpenSM = running
        Standby SM
        Host = r2lead
        Guid = 0x0008f10403987d25
        Fabric = ib0
        OpenSM = running

      5. To remove the standby_ib0 SM standby, first stop its master and then perform the remove option, as follows:

        # sgifmcli --stop --id master_ib0
        # sgifmcli --remove --id standby_ib0

        The SM standby is removed from the entity list. The SM master itself is not removed; start it again when you want it to resume managing the fabric.
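
The ordering rules for adding a standby to an existing master (stop the master, define the standby, start the master) can be sketched as a dry run. The following is a minimal sketch only: failover_plan is a hypothetical helper that prints the three commands from this procedure in order and enforces the rule that master and standby must be on different nodes. It does not run sgifmcli.

```shell
# failover_plan (hypothetical helper): print, in order, the commands that add
# a standby SM to a fabric whose master is already defined and running.
failover_plan() {
    fabric=$1 master_id=$2 master_host=$3 standby_id=$4 standby_host=$5
    if [ "$master_host" = "$standby_host" ]; then
        echo "master and standby must run on different nodes" >&2
        return 1
    fi
    echo "sgifmcli --stop --id $master_id"
    echo "sgifmcli --standbysm --init --id $standby_id --hostname $standby_host --fabric $fabric"
    echo "sgifmcli --start --id $master_id"
}
```

For the example in this procedure, failover_plan ib0 master_ib0 r1lead standby_ib0 r2lead prints the same stop, define, and start commands shown in steps 1 through 3.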

      Configuring the InfiniBand Fat-tree Network Topology

      This section describes how to configure InfiniBand fat-tree network topology. The fat-tree topology involves external InfiniBand switches. For the list of supported external switches, see “Fabric Component sgifmcli Command”. InfiniBand switches come in two types: leaf or spine. Some switches are called director switches; these fall into the spine category. A switch can have one or more spines and has multiple leaf or line switches. It is recommended to discover external IB switches using the Tempo discover command (see “discover Command” in Chapter 2). After discovery is completed, an external switch can also be initialized and added to the InfiniBand system using the sgifmcli command.

      Procedure 4-3. Configuring InfiniBand Fat-tree Network Topology

        To configure the InfiniBand fat-tree network topology on an SGI Altix ICE 8200 series system, perform the following steps:

        1. Make sure that your switch is properly connected to the InfiniBand network. Also, make sure that the admin port of the switch is properly connected to the Ethernet network.

        2. Power on the switch. See the switch manual for operation information.

        3. From the admin node, initialize the switch. The syntax to initialize the switch is, as follows:

          sgifmcli --init --ibswitch --model <model> --id <hostname> --switchtype [leaf | spine]

          An example command is, as follows:

          # sgifmcli --init --ibswitch --model voltaire-isr-2004  --id isr2004 --switchtype spine

          This configures a Voltaire ISR 2004 switch with the hostname isr2004 as a spine switch. The name isr2004 refers to the admin port of the switch, which must be configured beforehand to allow switch access. The switch is now initialized and the root GUIDs from the spine switches have been downloaded.

        4. From the admin node, add the switch to the fabric. The syntax to add the switch is, as follows:

          sgifmcli --add --id <fabric> --switch <hostname>

          An example command is, as follows:

          # sgifmcli --add --id ib0 --switch isr2004

          In this example, ISR2004 is connected to the ib0 fabric.

        5. For the new switch to be activated, the SM master and the optional SM standby need to be (re)started.

          # sgifmcli --start --id master_ib0

          If the SM master was running while the switch was added, you first need to stop and then start the master, as follows:

          # sgifmcli --stop --id master_ib0
          # sgifmcli --start --id master_ib0

          If a standby has been defined, then in case of an SM master failure the SM standby subnet manager will automatically take over and assume control over the switch.

        6. The switches related to a particular fabric can be listed, as follows:

          # sgifmcli --switchlist --id <fabric>
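
Before initializing a switch, a script can confirm that the model is one of the supported switch types and that the switch type is leaf or spine. The following is a minimal sketch that only prints the command from step 3: switch_init_cmd is a hypothetical helper name, and the model list is the supported-switch list given earlier in this chapter.

```shell
# switch_init_cmd (hypothetical helper): validate the switch model and type,
# then print the sgifmcli command that would initialize the switch.
switch_init_cmd() {
    model=$1 host=$2 swtype=$3
    case "$model" in
        voltaire-isr-9024|voltaire-isr-2004|voltaire-isr-2012|voltaire-isr-9096|voltaire-isr-9288) ;;
        *) echo "unsupported switch model: $model" >&2; return 1 ;;
    esac
    case "$swtype" in
        leaf|spine) ;;
        *) echo "switch type must be leaf or spine: $swtype" >&2; return 1 ;;
    esac
    echo "sgifmcli --init --ibswitch --model $model --id $host --switchtype $swtype"
}
```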

        Verifying the InfiniBand Network

        After your InfiniBand fabric has been configured and started, you can use the sgifmcli(8) command to verify the health of the fabric.

        Procedure 4-4. Verifying the InfiniBand Network

          The fabric can be either ib0 or ib1. This version of the InfiniBand verifier runs the recommended OFED test suite. In addition, the SGI Tempo cluster view is compared with the InfiniBand cluster view and potential differences are reported.

          To verify a fabric, enter the following command, replacing fabric with ib0 or ib1:

          # sgifmcli --verify --id fabric
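
On systems with two fabrics, the verification can be repeated for each fabric and the results summarized by exit code. The following is a minimal sketch under stated assumptions: verify_fabrics is a hypothetical helper that takes the command to run as its first argument (normally sgifmcli) followed by the fabric names, and it relies only on the exit codes described earlier in this chapter.

```shell
# verify_fabrics (hypothetical helper): run "CMD --verify --id FABRIC" for each
# fabric named on the command line and return nonzero if any verification fails.
verify_fabrics() {
    cmd=$1
    shift
    failed=0
    for fabric in "$@"; do
        if "$cmd" --verify --id "$fabric"; then
            echo "$fabric: verified"
        else
            echo "$fabric: verification failed"
            failed=1
        fi
    done
    return "$failed"
}
```

From the admin node, the call would be verify_fabrics sgifmcli ib0 ib1. On an LX series system, which has only one fabric, pass ib0 alone.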

          Useful Utilities and Diagnostics

          The openib-diags package contains useful tools and diagnostic software for Open Fabrics Enterprise Distribution (OFED). This section describes some of these tools. These tools reside on the rack leader controller (leader node) in the /usr/bin directory, as follows:

          r1lead:~ # cd /usr/bin
          r1lead:/usr/bin # ls ib*
          ibaddr            ibcheckstate     ibdiscover.pl        ibnetdiscover     ib_rdma_bw   ibstatus        ...
          ibcheckerrors     ibcheckwidth     ibdmchk              ibnlparse         ib_rdma_lat  ibswitches      ...
          ibcheckerrs       ibclearcounters  ibdmsh               ibnodes           ib_read_bw   ibsysstat       ...
          ibchecknet        ibclearerrors    ibdmtr               ibping            ib_read_lat  ibtopodiff      ...
          ibchecknode       ib_clock_test    ibfindnodesusing.pl  ibportstate       ibroute      ibtracert       ...
          ibcheckport       ibdiagnet        ibhosts              ibprintca.pl      ib_send_bw   ibv_asyncwatch  ...
          ibcheckportstate  ibdiagpath       ibis                 ibprintswitch.pl  ib_send_lat  ibv_devices     ...
          ibcheckportwidth  ibdiagui         iblinkinfo.pl        ibqueryerrors.pl  ibstat       ibv_devinfo

          This section covers the following topics:

          ibstat and ibstatus Commands

          You can use the ibstat command to see the current status of the host channel adapters (HCAs) in your InfiniBand fabric, including the HCAs on rack leader controllers. The following view is prior to starting the fabric management:

          r1lead:/usr/bin # ibstat
          CA 'mthca0'
                  CA type: MT25208 (MT23108 compat mode)
                  Number of ports: 2
                  Firmware version: 4.7.600
                  Hardware version: a0
                  Node GUID: 0x0008f104039881a8
                  System image GUID: 0x0008f104039881ab
                  Port 1:
                          State: Initializing
                          Physical state: LinkUp
                          Rate: 20
                          Base lid: 0
                          LMC: 0
                          SM lid: 0
                          Capability mask: 0x02510a68
                          Port GUID: 0x0008f104039881a9
                  Port 2:
                          State: Initializing
                          Physical state: LinkUp
                          Rate: 20
                          Base lid: 0
                          LMC: 0
                          SM lid: 0
                          Capability mask: 0x02510a68
                          Port GUID: 0x0008f104039881aa

          The following shows output from the ibstat command after the fabric management software has been started:

          r1lead:/opt/sgi/sbin # ibstat
          CA 'mthca0'
                  CA type: MT25208 (MT23108 compat mode)
                  Number of ports: 2
                  Firmware version: 4.7.600
                  Hardware version: a0
                  Node GUID: 0x0008f104039881a8
                  System image GUID: 0x0008f104039881ab
                  Port 1:
                          State: Active
                          Physical state: LinkUp
                          Rate: 20
                          Base lid: 1
                          LMC: 0
                          SM lid: 1
                          Capability mask: 0x02510a6a
                          Port GUID: 0x0008f104039881a9
                  Port 2:
                          State: Active
                          Physical state: LinkUp
                          Rate: 20
                          Base lid: 1
                          LMC: 0
                          SM lid: 1
                          Capability mask: 0x02510a6a
                          Port GUID: 0x0008f104039881aa
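The difference between the two listings (State changes from Initializing to Active once a subnet manager is running) can be checked mechanically. The following is a minimal sketch that reads ibstat output on standard input and prints each port whose State is not Active. The function name inactive_ports is hypothetical, and the parsing relies on the output layout shown above.

```shell
# Sketch only: report ports whose State is not Active.
# Reads ibstat output on stdin; exits 1 if any such port is found.
inactive_ports() {
    awk '
        # "CA <quoted name>" starts a new adapter section; strip the quotes.
        $1 == "CA" { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }

        # "Port N:" starts a port section ("Port GUID:" lines do not match).
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:$/, "", port) }

        # "State: X" is the logical state ("Physical state:" does not match).
        $1 == "State:" && $2 != "Active" {
            print ca, "port", port, "is", $2
            bad = 1
        }

        END { exit bad + 0 }
    '
}
```

On a leader node this might be run as ibstat | inactive_ports. No output and an exit status of 0 mean that every port listed is Active.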

          You can use the ibstatus command (less verbose than ibstat) to show the link rate, as follows:

          r1lead:/opt/sgi/sbin # ibstatus
          Infiniband device 'mthca0' port 1 status:
                  default gid:     fe80:0000:0000:0000:0008:f104:0398:81a9
                  base lid:        0x1
                  sm lid:          0x1
                  state:           4: ACTIVE
                  phys state:      5: LinkUp
                  rate:            20 Gb/sec (4X DDR)
          
          Infiniband device 'mthca0' port 2 status:
                  default gid:     fe80:0000:0000:0000:0008:f104:0398:81aa
                  base lid:        0x1
                  sm lid:          0x1
                  state:           4: ACTIVE
                  phys state:      5: LinkUp
                  rate:            20 Gb/sec (4X DDR)


          Note: If the link rate is not 20 Gb/sec (4X DDR) and you have a DDR-capable HCA, there is a physical link problem with your system.
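The check described in the note can be scripted. The following is a minimal sketch that reads ibstatus output on standard input and prints every port whose rate line differs from 20 Gb/sec (4X DDR). The function name slow_links is hypothetical, the parsing assumes the output layout shown above, and the expected rate string applies only to DDR-capable HCAs.

```shell
# Sketch only: report ports whose rate is not the expected DDR rate.
# Reads ibstatus output on stdin; exits 1 if any such port is found.
slow_links() {
    awk -v want="20 Gb/sec (4X DDR)" '
        # Header line: Infiniband device <quoted name> port N status:
        $1 == "Infiniband" && $2 == "device" {
            dev = $3; gsub(/[^A-Za-z0-9_]/, "", dev)
            port = $5
        }

        # Rate line: keep everything after the "rate:" label.
        $1 == "rate:" {
            rate = $0
            sub(/^[ \t]*rate:[ \t]*/, "", rate)
            sub(/[ \t]+$/, "", rate)
            if (rate != want) {
                print dev, "port", port, "rate is", rate
                bad = 1
            }
        }

        END { exit bad + 0 }
    '
}
```

On a leader node this might be run as ibstatus | slow_links.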


          perfquery Command

          The perfquery command is useful for finding errors on one or more HCA and switch ports. You can also use perfquery to reset HCA and switch port counters.

          To see a usage statement for the perfquery command, perform the following:

          r1lead:/opt/sgi/sbin # perfquery --help
          Usage: perfquery [-d(ebug) -G(uid) -a(ll_ports) -r(eset_after_read) -C ca_name -P ca_port -R(eset_only)
           -t(imeout) timeout_ms -V(ersion) -h(elp)] [<lid|guid> [[port] [reset_mask]]]
                  Examples:
                          perfquery               # read local port's performance counters
                          perfquery 32 1          # read performance counters from lid 32, port 1
                          perfquery -e 32 1       # read extended performance counters from lid 32, port 1
                          perfquery -a 32         # read performance counters from lid 32, all ports
                          perfquery -r 32 1       # read performance counters and reset
                          perfquery -e -r 32 1    # read extended performance counters and reset
                          perfquery -R 0x20 1     # reset performance counters of port 1 only
                          perfquery -e -R 0x20 1  # reset extended performance counters of port 1 only
                          perfquery -R -a 32      # reset performance counters of all ports
                          perfquery -R 32 2 0x0fff        # reset only error counters of port 2
                          perfquery -R 32 2 0xf000        # reset only non-error counters of port 2

          Sample output from the perfquery command is as follows:

          r1lead:/opt/sgi/sbin # perfquery
          # Port counters: Lid 1 port 1
          PortSelect:......................1
          CounterSelect:...................0x0000
          SymbolErrors:....................0
          LinkRecovers:....................0
          LinkDowned:......................0
          RcvErrors:.......................0
          RcvRemotePhysErrors:.............0
          RcvSwRelayErrors:................0
          XmtDiscards:.....................0
          XmtConstraintErrors:.............0
          RcvConstraintErrors:.............0
          LinkIntegrityErrors:.............0
          ExcBufOverrunErrors:.............0
          VL15Dropped:.....................0
          XmtData:.........................0
          RcvData:.........................0
          XmtPkts:.........................0
          RcvPkts:.........................0
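When many ports are checked, it helps to show only the counters that indicate a problem. The following is a minimal sketch that reads perfquery output on standard input and prints the non-zero counters, skipping the selector and traffic fields (PortSelect, CounterSelect, XmtData, RcvData, XmtPkts, RcvPkts). The function name nonzero_error_counters is hypothetical, and the parsing assumes the name:....value layout shown above.

```shell
# Sketch only: print error counters with a non-zero value.
# Reads perfquery output on stdin; exits 1 if any are found.
nonzero_error_counters() {
    awk -F'[:.]+' '
        /^[ \t]*#/ { next }    # skip the "# Port counters" header line
        {
            name = $1;  gsub(/[ \t]/, "", name)
            value = $2; gsub(/[ \t]/, "", value)
            if (name == "" || value == "") next

            # Selector and traffic fields are not error counters.
            if (name ~ /^(PortSelect|CounterSelect|XmtData|RcvData|XmtPkts|RcvPkts)$/) next

            if (value + 0 != 0) {
                print name, value
                found = 1
            }
        }
        END { exit found + 0 }
    '
}
```

For example, perfquery 32 1 | nonzero_error_counters would check port 1 of LID 32, using the LID and port values from the usage statement above.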

          ibnetdiscover Command

          The ibnetdiscover command allows you to discover the IB fabric.

          To see a usage statement for the ibnetdiscover command, perform the following:

          r1lead:/opt/sgi/sbin # ibnetdiscover --help
          Usage: ibnetdiscover [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) 
          -g(rouping) -H(ca_list) -S(witch_list) 
          -V(ersion) -C ca_name -P ca_port -t(imeout) timeout_ms 
          --switch-map switch-map] [<topology-file>]
          --switch-map <switch-map>  specify a switch-map file


          Note: Only abbreviated output is shown in this example.


          Sample output from the ibnetdiscover command is as follows:

          r1lead:/opt/sgi/sbin # ibnetdiscover
          #
          # Topology file: generated on Tue Jul 17 14:05:20 2007
          #
          # Max of 3 hops discovered
          # Initiated from node 0008f104039881a8 port 0008f104039881a9
          
          vendid=0x2c9
          devid=0xb924
          sysimgguid=0x8006900000000dd
          
          ...
          
          Switch   : 0x08006900000000dc ports 24 devid 0xb924 vendid 0x2c9 
          "MT47396 Infiniscale-III Mellanox Technologies"
          Switch   : 0x08006900000000a4 ports 24 devid 0xb924 vendid 0x2c9 
          "MT47396 Infiniscale-III Mellanox Technologies"
          
          To list only the HCAs, use the -H option:

          r1lead:/opt/sgi/sbin # ibnetdiscover -H
          Ca       : 0x0030487aa7940000 ports 1 devid 0x6274 vendid 0x2c9 "MT25204 InfiniHostLx Mellanox Technologies"
          Ca       : 0x0030487aa78c0000 ports 1 devid 0x6274 vendid 0x2c9 "r1i0n8-ib0 HCA-1"
          Ca       : 0x0008f10403988198 ports 2 devid 0x6278 vendid 0x8f1 " HCA-1"
          Ca       : 0x0030487aa7840000 ports 1 devid 0x6274 vendid 0x2c9 "r1i0n1-ib0 HCA-1"
          Ca       : 0x0030487aa79c0000 ports 1 devid 0x6274 vendid 0x2c9 "r1i1n0-ib0 HCA-1"
          Ca       : 0x0030487aa7900000 ports 1 devid 0x6274 vendid 0x2c9 "r1i1n8-ib0 HCA-1"
          Ca       : 0x0030487aa7980000 ports 1 devid 0x6274 vendid 0x2c9 "r1i1n1-ib0 HCA-1"
          Ca       : 0x0008f104039881a8 ports 2 devid 0x6278 vendid 0x8f1 " HCA-1"
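A quick way to compare the discovered fabric against the expected hardware is to count the entries in the listing. The following is a minimal sketch that reads list-style ibnetdiscover output on standard input and counts the lines that begin with Switch : and Ca :, as in the examples above. The function name count_fabric_nodes is hypothetical.

```shell
# Sketch only: count switch and channel adapter entries in list-style
# ibnetdiscover output (lines of the form "Switch : ..." and "Ca : ...").
count_fabric_nodes() {
    awk '
        $1 == "Switch" && $2 == ":" { switches++ }
        $1 == "Ca"     && $2 == ":" { cas++ }
        END { printf "switches=%d cas=%d\n", switches, cas }
    '
}
```

For example, ibnetdiscover -H | count_fabric_nodes counts only the HCAs, because the -H option lists HCAs only.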
          

          ibdiagnet Command

          The ibdiagnet command scans the fabric, reports on its connectivity and devices, and checks for duplicate GUIDs and suspected bad links.

          To see a usage statement for the ibdiagnet command, perform the following:

          r1lead:/opt/sgi/sbin # ibdiagnet --help
          Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
          NAME
            ibdiagnet
          SYNOPSYS
            ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>]
               [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>]
               [-pm] [-pc] [-P <<PM counter>=<Trash Limit>>]
               [-lw <1x|4x|12x>] [-ls <2.5|5|10>]
              
          
          DESCRIPTION
            ibdiagnet scans the fabric using directed route packets and extracts all the 
            available information regarding its connectivity and devices.
            It then produces the following files in the output directory defined by the
            -o option (see below): 
              ibdiagnet.lst    - List of all the nodes, ports and links in the fabric
              ibdiagnet.fdbs   - A dump of the unicast forwarding tables of the fabric
                                 switches
              ibdiagnet.mcfdbs - A dump of the multicast forwarding tables of the fabric
                                 switches
              ibdiagnet.masks  - In case of duplicate port/node Guids, these file include
                                 the map between masked Guid and real Guids 
              ibdiagnet.sm     - A dump of all the SM (state and priority) in the fabric
              ibdiagnet.pm     - In case -pm option was provided, this file contain a dump
                                 of all the nodes PM counters
            In addition to generating the files above, the discovery phase also checks for
            duplicate node/port GUIDs in the IB fabric. If such an error is detected, it 
            is displayed on the standard output.
            After the discovery phase is completed, directed route packets are sent
            multiple times (according to the -c option) to detect possible problematic 
            paths on which packets may be lost. Such paths are explored, and a report of
            the suspected bad links is displayed on the standard output.
            After scanning the fabric, if the -r option is provided, a full report of the
            fabric qualities is displayed.
            This report includes: 
              SM report
              Number of nodes and systems
              Hop-count information: 
                   maximal hop-count, an example path, and a hop-count histogram
              All CA-to-CA paths traced 
              Credit loop report
              mgid-mlid-HCAs matching table
            Note: In case the IB fabric includes only one CA, then CA-to-CA paths are not
            reported.
            Furthermore, if a topology file is provided, ibdiagnet uses the names defined
            in it for the output reports.
                
          OPTIONS
            -c <count>                     : The minimal number of packets to be sent
                                             across each link (default = 10)
            -v                             : Instructs the tool to run in verbose mode
            -r                             : Provides a report of the fabric qualities
            -o <out-dir>                   : Specifies the directory where the output
                                             files will be placed (default = /tmp)
            -t <topo-file>                 : Specifies the topology file name
            -s <sys-name>                  : Specifies the local system name. Meaningful
                                             only if a topology file is specified
            -i <dev-index>                 : Specifies the index of the device of the port
                                             used to connect to the IB fabric (in case of
                                             multiple devices on the local system)
            -p <port-num>                  : Specifies the local device's port number used
                                             to connect to the IB fabric
            -pm                            : Dumps all pmCounters values into ibdiagnet.pm
            -pc                            : reset all the fabric links pmCounters
            -P <<PM counter>=<Trash Limit>>: If any of the provided pm is greater then its
                                             provided value, print it to screen
            -lw <1x|4x|12x>                : Specifies the expected link width
            -ls <2.5|5|10>                 : Specifies the expected link speed
                                               
            -h|--help                      : Prints this help information
            -V|--version                   : Prints the version of the tool
               --vars                      : Prints the tool's environment variables and
                                             their values
          
          ERROR CODES
            1 - Failed to fully discover the fabric
            2 - Failed to parse command line options
            3 - Failed to interact with IB fabric
            4 - Failed to use local device or local port
            5 - Failed to use Topology File
            6 - Failed to load required Package
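In a script, the error codes listed above can be turned back into readable messages. The following is a minimal sketch; the function name ibdiagnet_error_text is hypothetical, and it assumes that the listed codes are returned as the exit status of the command and that 0 means no error, neither of which the help text states.

```shell
# Sketch only: translate an ibdiagnet error code into the text from the
# ERROR CODES section of the help output.
ibdiagnet_error_text() {
    case "$1" in
        0) echo "No error reported" ;;    # assumption: 0 means success
        1) echo "Failed to fully discover the fabric" ;;
        2) echo "Failed to parse command line options" ;;
        3) echo "Failed to interact with IB fabric" ;;
        4) echo "Failed to use local device or local port" ;;
        5) echo "Failed to use Topology File" ;;
        6) echo "Failed to load required Package" ;;
        *) echo "Unknown error code: $1" ;;
    esac
}
```

One way to use it is to run ibdiagnet, save $? in a variable, and pass that variable to ibdiagnet_error_text.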
          

          Output that shows no errors indicates that the fabric is operating correctly:

          r1lead:/opt/sgi/sbin # ibdiagnet
          Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
          Loading IBDM from: /usr/lib64/ibdm1.2
          -W- Topology file is not specified.
              Reports regarding cluster links will use direct routes.
          -W- A few ports of local device are up.
              Since port-num was not specified (-p option), port 1 of device 1 will be
              used as the local port.
          -I- Discovering the subnet ... 10 nodes (2 Switches & 8 CA-s) discovered.
          
          
          -I---------------------------------------------------
          -I- Bad Guids Info
          -I---------------------------------------------------
          -I- No bad Guids were found
          
          -I---------------------------------------------------
          -I- Links With Logical State = INIT
          -I---------------------------------------------------
          -I- No bad Links (with logical state = INIT) were found
          
          -I---------------------------------------------------
          -I- PM Counters Info
          -I---------------------------------------------------
          -I- No illegal PM counters values were found
          
          -I---------------------------------------------------
          -I- Bad Links Info
          -I---------------------------------------------------
          -I- No bad link were found
           
          -I- Done. Run time was 0 seconds.
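The four summary lines in this output can be checked automatically. The following is a minimal sketch that reads ibdiagnet output on standard input and succeeds only if all four "no problem" messages shown above are present. The function name ibdiagnet_is_clean is hypothetical, and the message strings are taken verbatim from this example, so they may differ in other ibdiagnet versions.

```shell
# Sketch only: succeed if the report contains every clean summary line
# shown in the example output; otherwise name the first one missing.
ibdiagnet_is_clean() {
    report=$(cat)
    for expected in \
        "No bad Guids were found" \
        "No bad Links (with logical state = INIT) were found" \
        "No illegal PM counters values were found" \
        "No bad link were found"
    do
        # Quoting $expected inside the pattern makes the match literal.
        case "$report" in
            *"$expected"*) ;;
            *) echo "missing: $expected" >&2
               return 1 ;;
        esac
    done
    return 0
}
```

On a leader node this might be run as ibdiagnet | ibdiagnet_is_clean.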
          

          You can use ibdiagnet to load the fabric to test it. The following example uses the -c option to send at least 5000 packets across each link:

          r1lead:/opt/sgi/sbin # ibdiagnet -c 5000
          Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
          Loading IBDM from: /usr/lib64/ibdm1.2
          -W- Topology file is not specified.
              Reports regarding cluster links will use direct routes.
          -W- A few ports of local device are up.
              Since port-num was not specified (-p option), port 1 of device 1 will be
              used as the local port.
          -I- Discovering the subnet ... 10 nodes (2 Switches & 8 CA-s) discovered.
          
          
          -I---------------------------------------------------
          -I- Bad Guids Info
          -I---------------------------------------------------
          -I- No bad Guids were found
          
          -I---------------------------------------------------
          -I- Links With Logical State = INIT
          -I---------------------------------------------------
          -I- No bad Links (with logical state = INIT) were found
          
          -I---------------------------------------------------
          -I- PM Counters Info
          -I---------------------------------------------------
          -I- No illegal PM counters values were found
          
          -I---------------------------------------------------
          -I- Bad Links Info
          -I---------------------------------------------------
          -I- No bad link were found
           
          -I- Done. Run time was 8 seconds.
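To compare runs made with different -c values, the discovery summary and the run time can be pulled out of the output. The following is a minimal sketch that reads ibdiagnet output on standard input. The function name ibdiagnet_summary is hypothetical, and the parsing assumes the wording of the Discovering and Done lines shown above.

```shell
# Sketch only: print node counts and run time from ibdiagnet output.
ibdiagnet_summary() {
    awk '
        # Discovery line ends: N nodes (S Switches & C CA-s) discovered.
        /Discovering the subnet/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "nodes") nodes = $(i - 1)
                if ($i == "Switches") { sw = $(i - 1); sub(/^\(/, "", sw) }
                if ($i ~ /^CA-s\)?$/) cas = $(i - 1)
            }
        }

        # Final line: -I- Done. Run time was T seconds.
        /Run time was/ {
            for (i = 2; i <= NF; i++)
                if ($i == "seconds.") secs = $(i - 1)
        }

        END {
            printf "nodes=%s switches=%s cas=%s seconds=%s\n", nodes, sw, cas, secs
        }
    '
}
```

For example, ibdiagnet -c 5000 | ibdiagnet_summary condenses a run like the one above into a single line.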