Appendix A. InfiniBand Fabric Details

This appendix provides a more detailed description of InfiniBand fabric management.

InfiniBand Fabric Management Configuration and Operation Overview

Each subnet manager (SM) performs a light sweep of the fabric it is managing every 10 seconds by default. The interval is set by the SWEEP variable in the opensm-ib0.conf and opensm-ib1.conf configuration files located in the /etc directory.


Note: SGI highly recommends that you do NOT change this variable.


If an SM detects a change in the fabric during a light sweep, such as the addition or deletion of a node, it performs a heavy sweep. The heavy sweep changes the fabric configuration to reflect the current state of the system.
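
To check which sweep interval is currently configured without opening the file in an editor, a small helper along the following lines can be used. This is a minimal sketch, not part of the opensm software: the conf_value function name is hypothetical, and it assumes the KEY=value layout of the opensm-ib0.conf and opensm-ib1.conf files. The demonstration reads a temporary file rather than a live file in /etc.

```shell
#!/bin/sh
# conf_value FILE KEY
# Hypothetical helper: print the value assigned to KEY in a KEY=value style
# configuration file. Comment lines are skipped, surrounding double quotes
# and trailing blanks are stripped, and the last assignment wins.
conf_value() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            line = $0
            sub(/^[ \t]+/, "", line)             # drop leading blanks
            if (index(line, key "=") == 1) {     # line starts with KEY=
                val = substr(line, length(key) + 2)
                sub(/[ \t]+$/, "", val)          # drop trailing blanks
                gsub(/^"|"$/, "", val)           # drop surrounding quotes
                found = val
            }
        }
        END { print found }
    ' "$1"
}

# Demonstration on a throwaway file that mimics two entries of the real file.
demo_conf=$(mktemp)
printf '%s\n' '# SWEEP' 'SWEEP=10' 'REASSIGN_LIDS="yes"' > "$demo_conf"
echo "sweep interval: $(conf_value "$demo_conf" SWEEP) seconds"
rm -f "$demo_conf"
```

On a leader node, the same function could be pointed at /etc/opensm-ib0.conf or /etc/opensm-ib1.conf. It only reads the file, so it does not conflict with the recommendation to leave the SWEEP variable unchanged.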

A sample opensm-ibx.conf configuration file follows:

Example A-1. opensm-ib0.conf and opensm-ib1.conf Configuration Files

# DEBUG mode
#  This option specifies a debug option.
#  These options are not normally needed.
#  The number following -d selects the debug
#  option to enable as follows:
#  OPT   Description
#  ---    -----------------
#  0  - Ignore other SM nodes.
#  1  - Force single threaded dispatching.
#  2  - Force log flushing after each log message.
#  3  - Disable multicast support.
#  4  - Put OpenSM in memory tracking mode.
#  10.. Put OpenSM in testability mode.
#  none, no debug options are enabled.
DEBUG=none

# LMC          
#  This option specifies the subnet's LMC value.
#  The number of LIDs assigned to each port is 2^LMC.
#  The LMC value must be in the range 0-7.
#  LMC values > 0 allow multiple paths between ports.
#  LMC values > 0 should only be used if the subnet
#  topology actually provides multiple paths between
#  ports, i.e. multiple interconnects between switches.
#  OpenSM defaults to LMC = 0, which allows
#  one path between any two ports.
LMC=0

# MAXSMPS
#  This option specifies the number of VL15 SMP MADs
#  allowed on the wire at any one time.
#  Specifying -maxsmps 0 allows unlimited outstanding SMPs.
#  Without -maxsmps, OpenSM defaults to a maximum of
#  one outstanding SMP.
MAXSMPS=0

# REASSIGN_LIDS
#  This option causes OpenSM to reassign LIDs to all
#  end nodes. Specifying "REASSIGN_LIDS=yes" on a running subnet
#  may disrupt subnet traffic.
#  With "REASSIGN_LIDS=no", OpenSM attempts to preserve existing
#  LID assignments resolving multiple use of same LID.
REASSIGN_LIDS="yes"

# SWEEP
#  This option specifies the number of seconds between
#  subnet sweeps.  Specifying SWEEP=0 disables sweeping.
#  OpenSM defaults to a sweep interval of 10 seconds.
SWEEP=10

# TIMEOUT
#  This option specifies the time in milliseconds
#  used for transaction timeouts.
#  Specifying -t 0 disables timeouts.
#  Without -t, OpenSM defaults to a timeout value of
#  200 milliseconds.
TIMEOUT=200

# OSM_LOG
#  This option defines the log to be the given file.
#  By default the log goes to /tmp/osm.log.
#  For the log to go to standard output use OSM_LOG=stdout.
OSM_LOG=/var/log/osm-ib0.log                                                                         

# VERBOSE
#  This option increases the log verbosity level.
#  The "-v" option may be specified multiple times
#  to further increase the verbosity level.
#   "-V" option sets the maximum verbosity level and
#   forces log flushing.
#   The "-V" is equivalent to "-vf 0xFF -d 2".
VERBOSE="none"

# ROUTING_ENGINE
#  This option chooses the routing engine instead of 
#  the Min Hop algorithm which is default.
#  Valid routing engines are :-
#         Min Hop, updn, file, ftree, lash
#  To switch to different routing engine set the engine
#  name in ROUTING_ENGINE (i.e.  ROUTING_ENGINE=lash).
#  For Min Hop use ROUTING_ENGINE="none" or ROUTING_ENGINE=
ROUTING_ENGINE="none"

# GUID_FILE
#  This option is only allowed when the UPDN algorithm is activated.
#  It specifies the file from which to fetch the GUID list.
#  Each line of the file contains only one valid GUID.
GUID_FILE="none"

#  This option specifies the local port GUID value
#  with which OpenSM should bind.  OpenSM may be
#  bound to 1 port at a time.
#  If the GUID given is 0, opensmd uses the PORT_NUM parameter.
#  Without -g (GUID="none"), OpenSM tries to use the default port.
#  example GUID="0x0005ad00000517c9"
GUID="none"

# OSM_HOSTS
#  The list of the IP addresses of all SMs in the InfiniBand subnet.
#  Used for the handover mechanism.
#  example OSM_HOSTS="128.162.246.221 128.162.246.42"
OSM_HOSTS="none"

# OSM_CACHE_DIR
OSM_CACHE_DIR="/var/cache/osm/ib0"

# CACHE_OPTIONS
#  Cache the given command line options into the file
#  /var/cache/osm/opensm-ib0.opts for use next invocation
#  The cache directory can be changed by the environment
#  variable OSM_CACHE_DIR
#  Set to '--cache-options' or '-c' in order to enable
CACHE_OPTIONS="-c"

# HONORE_GUID2LID 
#  This option forces OpenSM to honor the guid2lid file,
#  when it comes out of Standby state, if such file exists
#  under OSM_CACHE_DIR, and is valid.
#  Set to '--honor_guid2lid' or '-x' to enable.
#  By default this is FALSE. Will be set automatically to '--honor_guid2lid'
#  if OSM_HOSTS includes a list of more than one IP address.
HONORE_GUID2LID="-x"

# RCP
#  This option is used by the SLDD daemon for the handover mechanism
#  to copy local cache file to remote computer
RCP=/usr/bin/scp

# RSH
#  This option is used by the SLDD daemon for the handover mechanism
#  to execute commands on remote computer
RSH=/usr/bin/ssh

# RESCAN_TIME
#  This option is used by the SLDD daemon for the handover mechanism
#  Time between sweep of sldd daemon in seconds
RESCAN_TIME=60

# PORT_NUM
#  This option defines the HCA port number to which OpenSM should bind
PORT_NUM=1

# ONBOOT
#  To start OpenSM automatically set ONBOOT=yes
ONBOOT=yes

# MULTI_FABRIC
# Allow multiple fabrics (and copies of OpenSM) on the same SM host
MULTI_FABRIC=yes
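
The LMC comment in the sample file states that each port is assigned 2^LMC LIDs and that the LMC value must be in the range 0-7. The following sketch works through that arithmetic. The lids_per_port function name is hypothetical and is used only for illustration; it is not an opensm command.

```shell
#!/bin/sh
# lids_per_port LMC
# Hypothetical helper: print the number of LIDs assigned to each port (2^LMC),
# or fail if LMC is outside the documented range 0-7.
lids_per_port() {
    case "$1" in
        [0-7]) echo $((1 << $1)) ;;   # 2^LMC computed with a left shift
        *)     echo "LMC must be in the range 0-7, got: $1" >&2
               return 1 ;;
    esac
}

# Show the LID count for a few LMC settings.
for lmc in 0 1 2 7; do
    echo "LMC=$lmc -> $(lids_per_port "$lmc") LID(s) per port"
done
```

With the default LMC=0, each port has a single LID, which matches the comment that one path is allowed between any two ports.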

Each fabric is addressed by a globally unique identifier (GUID) and a unique HCA port (see Figure A-1). Each fabric has a unique GUID set in its respective configuration file.

Figure A-1. Two InfiniBand Fabrics in a System with Two IRUs


With Scali Manage, the routing engine is chosen automatically based on the number of racks in the system. For up to two racks, the “Min Hop” algorithm is used. For more than two racks, the “lash” algorithm is used, which enables LAyered SHortest Path Routing (LASH).
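
The selection rule described above can be expressed as a small helper. This only illustrates the rule; it is not Scali Manage's actual code, and the routing_engine_for_racks function name is hypothetical. The value "none" is used for Min Hop because the sample configuration file states that ROUTING_ENGINE="none" selects Min Hop.

```shell
#!/bin/sh
# routing_engine_for_racks N
# Hypothetical helper: print the ROUTING_ENGINE value that corresponds to a
# system with N racks, following the rule "up to two racks: Min Hop,
# more than two racks: lash".
routing_engine_for_racks() {
    case "$1" in
        ''|*[!0-9]*)
            echo "rack count must be a positive integer, got: $1" >&2
            return 1 ;;
    esac
    if [ "$1" -lt 1 ]; then
        echo "rack count must be a positive integer, got: $1" >&2
        return 1
    elif [ "$1" -le 2 ]; then
        echo "none"    # Min Hop is selected with ROUTING_ENGINE="none"
    else
        echo "lash"
    fi
}

for racks in 1 2 3 8; do
    echo "$racks rack(s): ROUTING_ENGINE=\"$(routing_engine_for_racks "$racks")\""
done
```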

When the lash routing algorithm is used, the subnet managers must be restarted after the entire Altix ICE system is up. To restart the subnet managers, run the following command:

scalimanage-cli restartaltixiceopensm  

There are two opensm daemons, one for each fabric: opensmd-ib0 and opensmd-ib1. They are controlled by init.d scripts. Each init.d script has its own configuration file, opensm-ib0.conf and opensm-ib1.conf, respectively.

Configuring and Initializing the InfiniBand Fabric Manually

This section describes the changes you need to make to the /etc/opensm-ib0.conf and /etc/opensm-ib1.conf configuration files to configure the opensm software, how to start the opensmd-ib0 and opensmd-ib1 daemons, and how to verify that the fabric is operating. For an overview of fabric configuration and management, see “InfiniBand Fabric Management Configuration and Operation Overview”.

Procedure A-1. Configuring and Initializing the InfiniBand Fabric Manually

    To configure, initialize, and verify the InfiniBand fabric, perform the following steps:

    1. From the admin node, connect to the leader node of rack 1, as follows:

      # ssh r01lead


      Note: Before you attempt to initialize the InfiniBand fabric, make sure all compute nodes are booted and operational.


    2. From the admin node, determine and record the IP addresses of the leader nodes, as follows:

      # ping -c 1 r01lead
      PING r01lead.ice.americas.sgi.com (172.16.0.2) 56(84) bytes of data.
      64 bytes from r01lead.ice.americas.sgi.com (172.16.0.2): icmp_seq=1 ttl=64 time=0.127 ms
      
      --- r01lead.ice.americas.sgi.com ping statistics ---
      1 packets transmitted, 1 received, 0% packet loss, time 0ms
      rtt min/avg/max/mdev = 0.127/0.127/0.127/0.000 ms
      # ping -c 1 r2lead
      PING r2lead.ice.americas.sgi.com (172.16.0.3) 56(84) bytes of data.
      64 bytes from r2lead.ice.americas.sgi.com (172.16.0.3): icmp_seq=1 ttl=64 time=0.089 ms
      
      --- r2lead.ice.americas.sgi.com ping statistics ---
      1 packets transmitted, 1 received, 0% packet loss, time 0ms
      rtt min/avg/max/mdev = 0.089/0.089/0.089/0.000 ms
      # ping -c 1 r3lead
      PING r3lead.ice.americas.sgi.com (172.16.0.4) 56(84) bytes of data.
      64 bytes from r3lead.ice.americas.sgi.com (172.16.0.4): icmp_seq=1 ttl=64 time=0.129 ms
      
      --- r3lead.ice.americas.sgi.com ping statistics ---
      1 packets transmitted, 1 received, 0% packet loss, time 0ms
      rtt min/avg/max/mdev = 0.129/0.129/0.129/0.000 ms
      # ping -c 1 r4lead
      PING r4lead.ice.americas.sgi.com (172.16.0.5) 56(84) bytes of data.
      64 bytes from r4lead.ice.americas.sgi.com (172.16.0.5): icmp_seq=1 ttl=64 time=0.136 ms
      
      --- r4lead.ice.americas.sgi.com ping statistics ---
      1 packets transmitted, 1 received, 0% packet loss, time 0ms
      rtt min/avg/max/mdev = 0.136/0.136/0.136/0.000 ms

    3. From the leader node, issue an ibstat command to determine the Port GUID values, as follows:

      r01lead:/ # ibstat
      CA 'mthca0'
              CA type: MT23108
              Number of ports: 2
              Firmware version: 3.3.3
              Hardware version: a1
              Node GUID: 0x0008f1040397b03c
              System image GUID: 0x0008f1040397b03f
              Port 1:
                      State: Active
                      Physical state: LinkUp
                      Rate: 10
                      Base lid: 1
                      LMC: 0
                      SM lid: 1
                      Capability mask: 0x02510a6a
                      Port GUID: 0x0008f1040397b03d <---<< goes into opensm-ib0.conf
              Port 2:
                      State: Initializing
                      Physical state: LinkUp
                      Rate: 10
                      Base lid: 0
                      LMC: 0
                      SM lid: 0
                      Capability mask: 0x02510a68
                      Port GUID: 0x0008f1040397b03e <---<< goes into opensm-ib1.conf


      Note: To get usage information on the ibstat command, enter the following:
      r01lead:/ # ibstat --help
      Usage: ibstat [-d(ebug) -l(ist_of_cas) -s(hort) -p(ort_list) -V(ersion)]  [portnum]
              Examples:
                      ibstat -l         # list all IB devices
                      ibstat mthca0 2 # stat port 2 of 'mthca0'



    4. From the leader node, change to the /etc directory, as follows:

      r01lead:/ # cd /etc

    5. Using a text editor, open the opensm-ib0.conf file and set the GUID variable to the Port GUID value for port 1 (in this example, 0x0008f1040397b03d), as follows:

      GUID="0x0008f1040397b03d"

    6. Using a text editor, open the opensm-ib1.conf file and set the GUID variable to the Port GUID value for port 2 (in this example, 0x0008f1040397b03e), as follows:

      GUID="0x0008f1040397b03e"

    7. In both the opensm-ib0.conf file and the opensm-ib1.conf file, enable the failover (handover) mechanism on the leader nodes by adding the IP addresses recorded in step 2 to the OSM_HOSTS variable, as follows:

      OSM_HOSTS="172.16.0.2 172.16.0.3 172.16.0.4 172.16.0.5"

    8. For systems with five or more racks, SGI recommends you change the ROUTING_ENGINE variable in both configuration files to lash, as follows:

      ROUTING_ENGINE="lash"

    9. To initialize the ib0 fabric, start the opensmd-ib0 daemon, as follows:

      # ./opensmd-ib0 start

    10. To initialize the ib1 fabric, start the opensmd-ib1 daemon, as follows:

      # ./opensmd-ib1 start

    11. Use the ibnetdiscover command to verify the fabric, as follows:

      r01lead:/ # ibnetdiscover -l
      Switch   : 0x08006900000000dc ports 24 devid 0xb924 vendid 0x2c9 "MT47396 Infiniscale-III Mellanox Technologies"
      Switch   : 0x08006900000000a4 ports 24 devid 0xb924 vendid 0x2c9 "MT47396 Infiniscale-III Mellanox Technologies"
      Ca       : 0x0030487aa7940000 ports 1 devid 0x6274 vendid 0x2c9 " HCA-1"
      Ca       : 0x0030487aa78c0000 ports 1 devid 0x6274 vendid 0x2c9 " HCA-1"
      Ca       : 0x0008f10403988198 ports 2 devid 0x6278 vendid 0x8f1 "service0-ib0 HCA-1"
      Ca       : 0x0030487aa7840000 ports 1 devid 0x6274 vendid 0x2c9 " HCA-1"
      Ca       : 0x0030487aa79c0000 ports 1 devid 0x6274 vendid 0x2c9 " HCA-1"
      Ca       : 0x0030487aa7900000 ports 1 devid 0x6274 vendid 0x2c9 " HCA-1"
      Ca       : 0x0030487aa7980000 ports 1 devid 0x6274 vendid 0x2c9 " HCA-1"
      Ca       : 0x0008f104039881a8 ports 2 devid 0x6278 vendid 0x8f1 " HCA-1"


      Note: To get usage information on the ibnetdiscover command, enter the following:
      r01lead:/ # ibnetdiscover --help
      Usage: ibnetdiscover [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) -g(rouping) -H(ca_list) 
      -S(witch_list) -V(ersion) -C ca_name -P ca_port -t(imeout) timeout_ms --switch-map switch-map]
       --switch-map  specify a switch-map file



    12. Exit the rack leader controller (leader node) and return to the admin node. The InfiniBand fabric is now configured and initialized.
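
The manual edits in steps 3 and 5 through 7 of Procedure A-1 can also be scripted. The following is a minimal sketch under stated assumptions, not an SGI-supplied tool: the port_guid and set_conf_value function names are hypothetical, ibstat output is assumed to have the layout shown in step 3 with a single HCA, and the configuration files are assumed to use one KEY="value" assignment per line. The demonstration works on an embedded excerpt of the step 3 output and on a temporary file, so it does not touch the files in /etc.

```shell
#!/bin/sh
# port_guid PORTNUM
# Hypothetical helper: read ibstat output on standard input and print the
# Port GUID reported for the given port number.
port_guid() {
    awk -v want="$1" '
        $1 == "Port" && $2 == (want ":")  { in_port = 1; next }  # "Port 1:"
        $1 == "Port" && $2 ~ /^[0-9]+:$/  { in_port = 0 }        # another port
        in_port && $1 == "Port" && $2 == "GUID:" { print $3; exit }
    '
}

# set_conf_value FILE KEY VALUE
# Hypothetical helper: replace the KEY=... line in FILE with KEY="VALUE",
# or append the assignment if FILE has no such line.
set_conf_value() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    awk -v key="$key" -v value="$value" '
        index($0, key "=") == 1 { print key "=\"" value "\""; seen = 1; next }
        { print }
        END { if (!seen) print key "=\"" value "\"" }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Demonstration with an excerpt of the ibstat output from step 3.
ibstat_sample="CA 'mthca0'
        Port 1:
                State: Active
                Port GUID: 0x0008f1040397b03d
        Port 2:
                State: Initializing
                Port GUID: 0x0008f1040397b03e"

conf0=$(mktemp)
printf '%s\n' 'GUID="none"' 'OSM_HOSTS="none"' 'ROUTING_ENGINE="none"' > "$conf0"

# Step 5: the port 1 GUID goes into the ib0 configuration.
set_conf_value "$conf0" GUID "$(printf '%s\n' "$ibstat_sample" | port_guid 1)"
# Step 7: the leader node addresses recorded in step 2.
set_conf_value "$conf0" OSM_HOSTS "172.16.0.2 172.16.0.3 172.16.0.4 172.16.0.5"

cat "$conf0"
rm -f "$conf0"
```

On a leader node, the sample text would be replaced by the live command (for example, ibstat | port_guid 1) and the temporary file by /etc/opensm-ib0.conf or /etc/opensm-ib1.conf. Back up each configuration file before modifying it this way.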