The InfiniBand network on SGI Altix ICE 8200 series systems uses Open Fabrics Enterprise Distribution (OFED) software. This section describes the InfiniBand fabric and how to manage it. For background information on OFED, see http://www.openfabrics.org.
Fabric management on SGI Altix ICE 8200 series systems uses the OFED OpenSM software package. The InfiniBand fabric connects the service nodes, rack leader controllers (leader nodes), and the compute nodes. It does not connect to the system admin controller (admin node) or the chassis management control (CMC) blades. The InfiniBand network has two separate network fabrics, ib0 and ib1 (see “InfiniBand Fabric” in Chapter 1) with the following characteristics:
Each network fabric has its own subnet manager (SM).
For a system with two racks or more, one rack leader controller (leader node) runs an instance of SM to manage the ib0 fabric and a second leader node runs an instance of SM to manage the ib1 fabric. A database on the admin node keeps a record of which rack leader nodes are running the fabric management software for ib0 and ib1, respectively. The sgifmcli command has the logic to place opensm on the appropriate rack leader controller. If one of the rack leader controllers becomes unavailable, management of the fabric can be assigned to another available rack leader node in the system.
Note: The LX series has only one InfiniBand fabric; therefore, the sgifmcli(8) command should be run only on ib0 (see “The InfiniBand Management Tool Graphical User Interface”).
ib0 is mapped to port 1 of the host channel adapter (HCA) on the SM node. ib1 is mapped to port 2 of the HCA on the SM node.
On a system with a single rack, both instances of opensm run on the same rack leader node.
Each instance of SM on the rack leader controller is controlled by the /etc/ofa/opensm-ib[01].conf configuration file.
Rack leader controllers run the opensm daemon for each fabric over separate HCA ports (see Figure 1-9).
Note: After a system reboot, the opensm daemons start running automatically on the InfiniBand fabric.
Each fabric is addressed by a global unique identifier (GUID) and unique HCA port.
The GUID and HCA port are set in the configuration file.
SGI supports the following topologies: hypercube, enhanced hypercube, and fat tree.
Each subnet manager (SM) has a failover mechanism. You can define a master and a standby SM per InfiniBand plane for increased resiliency. For more information, see “InfiniBand Fabric Failover Mechanism”.
You can use the InfiniBand management tool graphical user interface (GUI) to configure, administer, or verify the InfiniBand fabric on your SGI Altix ICE system. With it you can configure, start, stop, restart, clean up, or get status for the InfiniBand fabric.
From the system admin controller (admin node), enter the following command:
sys-admin:~ # tempo-configure-fabric
You can also access this command from the configure-cluster GUI main menu F Configure Infiniband Fabric selection (see “configure-cluster Command Cluster Configuration Tool” in Chapter 2). For more information, see Figure 4-1.
From the Configure InfiniBand screen, make sure you select the Configure Topology option to set the topology as shown in Figure 4-2. For more information, see “Network Topology”.
Use the online help available with this tool to guide you through the InfiniBand configuration. After configuring and bringing up the InfiniBand network, select the Administer InfiniBand ib0 option or the Administer InfiniBand ib1 option. The Administer InfiniBand screen appears, as shown in Figure 4-3. You can use this screen to start, stop, restart, or refresh a fabric.
Note: The LX series has only one InfiniBand fabric; therefore, the sgifmcli(8) command described in this section should be run only on the ib0 fabric.
Currently, the following switches are supported:
| Switch Type | Description |
| voltaire-isr-9024 | Voltaire ISR 9024 |
| voltaire-isr-2004 | Voltaire ISR 2004 |
| voltaire-isr-2012 | Voltaire ISR 2012 |
| voltaire-isr-9096 | Voltaire ISR 9096 |
| voltaire-isr-9288 | Voltaire ISR 9288 |
At the SGI Tempo 1.7 release, the smconfig and smadmin command functionality was integrated into the sgifmcli command. Use the tempo-configure-fabric command to configure the InfiniBand network. The sgifmcli command is used for the following:
Initialize and configure external InfiniBand switches
This is done automatically by the Tempo discover script (see “InfiniBand Configuration” in Chapter 2) but can also be done manually by an administrator. For this operation, no cluster-wide InfiniBand connectivity needs to exist. The only necessity is that the supplied host name is resolvable and provides a working networking connection to the external InfiniBand switch.
Configure and administer the cluster fabric
Verify the InfiniBand fabric
This operation requires that the InfiniBand network is configured properly using the tempo-configure-fabric command (see “The InfiniBand Management Tool Graphical User Interface”).
The sgifmcli(8) command is, as follows:
sgifmcli [type action [options]] | [options]
Note: You can use shortened versions of the following sgifmcli options as long as you supply enough letters to identify the option unambiguously. For example, sgifmcli --vers for sgifmcli --version.
It accepts the following general options:
| General Option | Description |
| -h, --help | Displays a help message and then exits |
| -V, --version | Shows the version number of the program |
| -v, --verbose [DEBUG | INFO | ERROR] | Selects the verbosity level (default: ERROR). Most of the messages from sgifmcli are written to a log file named /var/log/sgifmcli.log. The default level reports error messages only. INFO provides the user with details about the operation of sgifmcli in addition to error messages. The DEBUG level produces output that is tailored toward the developer to help with bug fixing; it also produces INFO and ERROR messages. |
It accepts the following detailed options:
| Detailed Option | Description | |
| type | The type option is one of the following:
| |
| action | The action option is one of the following:
| |
| options | The options option is one or more of the following with no duplicates, for example, the --fabric option must be either ib0 or ib1, not both:
|
EXIT CODES
To facilitate the use of the sgifmcli(8) command in shell scripts, an exit code is returned to give an indication of what occurred during a given connection.
The exit codes returned by sgifmcli are, as follows:
| 0 | Successful termination. |
| 255 | Abnormal termination. |
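Because sgifmcli reports success as 0 and abnormal termination as 255, a script can branch on the exit status. The following is only a minimal sketch of that pattern; run_fabric_step is a hypothetical helper name, not an SGI-provided tool, and it works with any command.

```shell
#!/bin/bash
# run_fabric_step is a hypothetical helper, not part of the SGI tooling.
# It runs the command given as its arguments and reports the outcome
# based on the exit status (0 = success, anything else = failure).
run_fabric_step() {
    "$@"
    local rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK: $*"
    else
        echo "FAILED (exit code $rc): $*" >&2
    fi
    return "$rc"
}

# Example use on the admin node (left commented out so that the sketch can
# be read and tried on a machine where sgifmcli is not installed):
# run_fabric_step sgifmcli --status --id master_ib0 || exit 1
```

Keeping the original exit status as the return value lets the caller chain steps with || and stop at the first failure.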
For a detailed man page, perform the following command from the admin node:
sys-admin:~ # man sgifmcli
The fabric component maintains a database (DB) of the objects it manages (managed objects). The database version is automatically set during cluster install. You do not need to set it. Most likely, this database will change over time. To manage multiple database versions and also to aid in field support, SGI has added another command line tool that currently reports the managed objects database version.
The sgifmdb command is, as follows:
sgifmdb [--get|-g] [--dump|-d] [-v|--version] [-r|--reset] [--help|-h]
It accepts the following general options:
| General Option | Description |
| -g, --get | Reads the database version object from the database |
| -d, --dump | Dumps the database. This option allows you to see what fabric objects are currently stored in the fabric database. |
| -v, --version | Prints the version |
| -r, --reset | Resets the database and starts clean |
| -h, --help | Shows the usage statement |
Example 4-1. Getting sgifmdb(8) Command Help
For a sgifmdb command usage statement, perform the following from the admin node:
sys-admin:~ # sgifmdb -h
SGI Fabric Component DB Version
Usage: sgifmdb [--get|-g] [--dump|-d] [-v|--version] [--help|-h]
-g, --get Read DB version object from DB
-d, --dump Dump the DB
-v, --version Print version
-h, --help Show this text
Each subnet manager (SM) performs a light sweep of the fabric it is managing every 10 seconds by default. To change the interval, set the sweep_interval variable in the /opt/sgi/var/sgifmcli/opensm-ib0.conf.templ file and then do a Commit operation in the tempo-configure-fabric GUI. Alternatively, the sgifmcli command has an --arglist option to set various subnet manager configuration parameters, including the sweep interval.
Note: If your cluster is larger than 256 nodes, SGI highly recommends increasing this variable to 90 seconds or more.
If an SM detects a change in the fabric during a light sweep, such as the addition or deletion of a node, it performs a heavy sweep. The heavy sweep actually changes the fabric configuration to reflect the current state of the system.
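Editing the sweep_interval line of the template can be scripted. The following is a minimal sketch, assuming the template uses the "keyword value" layout of an OpenSM configuration file; set_sweep_interval is a hypothetical helper name, and the demonstration works on a scratch file rather than on the real template. A Commit operation in the tempo-configure-fabric GUI is still needed afterward for a real change to take effect.

```shell
#!/bin/bash
# set_sweep_interval is a hypothetical helper: it rewrites the
# sweep_interval line of an OpenSM configuration template in place
# and leaves every other line untouched.
set_sweep_interval() {
    local file=$1 seconds=$2
    local tmp
    tmp=$(mktemp) || return 1
    # Replace only the line that starts with the sweep_interval keyword.
    sed "s/^sweep_interval[[:space:]].*/sweep_interval $seconds/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    local rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Demonstration on a scratch copy, not on the real template
# (/opt/sgi/var/sgifmcli/opensm-ib0.conf.templ).
demo=$(mktemp)
printf '%s\n' '# The number of seconds between subnet sweeps (0 disables it)' \
    'sweep_interval 10' > "$demo"
set_sweep_interval "$demo" 90
grep '^sweep_interval' "$demo"
rm -f "$demo"
```

The demonstration prints sweep_interval 90. Writing through a temporary file and copying the result back keeps the permissions of the original file.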
A sample opensm-ibx.conf configuration file is, as follows:
Example 4-2. opensm-ib0.conf and opensm-ib1.conf Configuration Files
#
# DEVICE ATTRIBUTES OPTIONS
#
# The port GUID on which the OpenSM is running
guid 0x0000000000000000

# M_Key value sent to all ports qualifying all Set(PortInfo)
m_key 0x0000000000000000

# The lease period used for the M_Key on this subnet in [sec]
m_key_lease_period 0

# SM_Key value of the SM used for SM authentication
sm_key 0x0000000000000001

# SM_Key value to qualify rcv SA queries as 'trusted'
sa_key 0x0000000000000001

# Note that for both values above (sm_key and sa_key)
# OpenSM version 3.2.1 and below used the default value '1'
# in a host byte order, it is fixed now but you may need to
# change the values to interoperate with old OpenSM running
# on a little endian machine.

# Subnet prefix used on this subnet
subnet_prefix 0xfec0000000000000

# The LMC value used on this subnet
lmc 0

# lmc_esp0 determines whether LMC value used on subnet is used for
# enhanced switch port 0. If TRUE, LMC value for subnet is used for
# ESP0. Otherwise, LMC value for ESP0s is 0.
lmc_esp0 FALSE

# The code of maximal time a packet can live in a switch
# The actual time is 4.096usec * 2^<packet_life_time>
# The value 0x14 disables this mechanism
packet_life_time 0x12

# The number of sequential packets dropped that cause the port
# to enter the VLStalled state. The result of setting this value to
# zero is undefined.
vl_stall_count 0x07

# The number of sequential packets dropped that cause the port
# to enter the VLStalled state. This value is for switch ports
# driving a CA or router port. The result of setting this value
# to zero is undefined.
leaf_vl_stall_count 0x07

# The code of maximal time a packet can wait at the head of
# transmission queue.
# The actual time is 4.096usec * 2^<head_of_queue_lifetime>
# The value 0x14 disables this mechanism
head_of_queue_lifetime 0x12

# The maximal time a packet can wait at the head of queue on
# switch port connected to a CA or router port
leaf_head_of_queue_lifetime 0x10

# Limit the maximal operational VLs
max_op_vls 5

# Force PortInfo:LinkSpeedEnabled on switch ports
# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port
# Otherwise, use value for PortInfo:LinkSpeedEnabled on switch port
# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 "PortInfo")
# 1: 2.5 Gbps
# 3: 2.5 or 5.0 Gbps
# 5: 2.5 or 10.0 Gbps
# 7: 2.5 or 5.0 or 10.0 Gbps
# 2,4,6,8-14 Reserved
# Default 15: set to PortInfo:LinkSpeedSupported
force_link_speed 15

# The subnet_timeout code that will be set for all the ports
# The actual timeout is 4.096usec * 2^<subnet_timeout>
subnet_timeout 18

# Threshold of local phy errors for sending Trap 129
local_phy_errors_threshold 0x08

# Threshold of credit overrun errors for sending Trap 130
overrun_errors_threshold 0x08

#
# PARTITIONING OPTIONS
#
# Partition configuration file to be used
partition_config_file /etc/ofa/partitions.conf

# Disable partition enforcement by switches
no_partition_enforcement FALSE

#
# SWEEP OPTIONS
#
# The number of seconds between subnet sweeps (0 disables it)
sweep_interval 10

# If TRUE cause all lids to be reassigned
reassign_lids FALSE

# If TRUE forces every sweep to be a heavy sweep
force_heavy_sweep FALSE

# If TRUE every trap will cause a heavy sweep.
# NOTE: successive identical traps (>10) are suppressed
sweep_on_trap TRUE

#
# ROUTING OPTIONS
#
# If TRUE count switches as link subscriptions
port_profile_switch_nodes FALSE

# Name of file with port guids to be ignored by port profiling
port_prof_ignore_file (null)

# Routing engine
# Multiple routing engines can be specified separated by
# commas so that specific ordering of routing algorithms will
# be tried if earlier routing engines fail.
# Supported engines: minhop, updn, file, ftree, lash, dor
routing_engine (null)

# Connect roots (use FALSE if unsure)
connect_roots FALSE

# Use unicast routing cache (use FALSE if unsure)
use_ucast_cache FALSE

# Lid matrix dump file name
lid_matrix_dump_file (null)

# LFTs file name
lfts_file (null)

# The file holding the root node guids (for fat-tree or Up/Down)
# One guid in each line
root_guid_file /etc/ofa/switchguids-ib0.conf

# The file holding the fat-tree compute node guids
# One guid in each line
cn_guid_file (null)

# The file holding the node ids which will be used by Up/Down algorithm instead
# of GUIDs (one guid and id in each line)
ids_guid_file (null)

# The file holding guid routing order guids (for MinHop and Up/Down)
guid_routing_order_file (null)

# SA database file name
sa_db_file (null)

#
# HANDOVER - MULTIPLE SMs OPTIONS
#
# SM priority used for deciding who is the master
# Range goes from 0 (lowest priority) to 15 (highest).
sm_priority 0

# If TRUE other SMs on the subnet should be ignored
ignore_other_sm FALSE

# Timeout in [msec] between two polls of active master SM
sminfo_polling_timeout 10000

# Number of failing polls of remote SM that declares it dead
polling_retry_number 4

# If TRUE honor the guid2lid file when coming out of standby
# state, if such file exists and is valid
honor_guid2lid_file FALSE

#
# TIMING AND THREADING OPTIONS
#
# Maximum number of SMPs sent in parallel
max_wire_smps 4

# The maximum time in [msec] allowed for a transaction to complete
transaction_timeout 200

# Maximal time in [msec] a message can stay in the incoming message queue.
# If there is more than one message in the queue and the last message
# stayed in the queue more than this value, any SA request will be
# immediately returned with a BUSY status.
max_msg_fifo_timeout 10000

# Use a single thread for handling SA queries
single_thread FALSE

#
# MISC OPTIONS
#
# Daemon mode
daemon FALSE

# SM Inactive
sm_inactive FALSE

# Babbling Port Policy
babbling_port_policy FALSE

#
# Event Plugin Options
#
event_plugin_name (null)

#
# Node name map for mapping node's to more descriptive node descriptions
# (man ibnetdiscover for more information)
#
node_name_map_name (null)

#
# DEBUG FEATURES
#
# The log flags used
log_flags 0x03

# Force flush of the log file after each log message
force_log_flush FALSE

# Log file to be used
log_file /var/log/opensm-ib0.log

# Limit the size of the log file in MB. If overrun, log is restarted
log_max_size 0

# If TRUE will accumulate the log over multiple OpenSM sessions
accum_log_file TRUE

# The directory to hold the file OpenSM dumps
dump_files_dir /var/log/

# If TRUE enables new high risk options and hardware specific quirks
enable_quirks FALSE

# If TRUE disables client reregistration
no_clients_rereg FALSE

# If TRUE OpenSM should disable multicast support and
# no multicast routing is performed if TRUE
disable_multicast FALSE

# If TRUE opensm will exit on fatal initialization issues
exit_on_fatal TRUE

# console [off|local]
console off

# Telnet port for console (default 10000)
console_port 10000

#
# QoS OPTIONS
#
# Enable QoS setup
qos FALSE

# QoS policy file to be used
qos_policy_file /etc/ofa/qos-policy.conf

# QoS default options
qos_max_vls 0
qos_high_limit -1
qos_vlarb_high (null)
qos_vlarb_low (null)
qos_sl2vl (null)

# QoS CA options
qos_ca_max_vls 0
qos_ca_high_limit -1
qos_ca_vlarb_high (null)
qos_ca_vlarb_low (null)
qos_ca_sl2vl (null)

# QoS Switch Port 0 options
qos_sw0_max_vls 0
qos_sw0_high_limit -1
qos_sw0_vlarb_high (null)
qos_sw0_vlarb_low (null)
qos_sw0_sl2vl (null)

# QoS Switch external ports options
qos_swe_max_vls 0
qos_swe_high_limit -1
qos_swe_vlarb_high (null)
qos_swe_vlarb_low (null)
qos_swe_sl2vl (null)

# QoS Router ports options
qos_rtr_max_vls 0
qos_rtr_high_limit -1
qos_rtr_vlarb_high (null)
qos_rtr_vlarb_low (null)
qos_rtr_sl2vl (null)

# Prefix routes file name
prefix_routes_file /etc/ofa/prefix-routes.conf

#
# IPv6 Solicited Node Multicast (SNM) Options
#
consolidate_ipv6_snm_req FALSE
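Several values in the configuration file (packet_life_time, head_of_queue_lifetime, leaf_head_of_queue_lifetime, subnet_timeout) are codes rather than times; the file comments give the conversion as 4.096 usec * 2^code. The following sketch applies that formula so the configured codes can be read as real durations. ib_code_to_usec is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/bash
# ib_code_to_usec is a hypothetical helper that converts an InfiniBand
# timeout code (decimal or 0x-prefixed hexadecimal) to microseconds using
# the formula from the configuration file comments: 4.096 usec * 2^code.
ib_code_to_usec() {
    local code=$(( $1 ))          # shell arithmetic accepts 0x12 as hex
    awk -v c="$code" 'BEGIN { printf "%.3f\n", 4.096 * 2 ^ c }'
}

ib_code_to_usec 0x12   # packet_life_time and head_of_queue_lifetime
ib_code_to_usec 0x10   # leaf_head_of_queue_lifetime
ib_code_to_usec 18     # subnet_timeout
```

With the sample values, code 0x12 (decimal 18) works out to 1073741.824 microseconds, or roughly 1.07 seconds, and code 0x10 to 268435.456 microseconds, or roughly 0.27 seconds.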
Each fabric is addressed by a global unique identifier (GUID) and unique HCA port (see Figure 4-5). Each fabric has a unique GUID set in its respective configuration file.
For SGI Altix ICE systems with a hypercube topology, SGI uses the dimension order routing (DOR) algorithm.
The dimension order routing algorithm is based on the min hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions.
For SGI Altix ICE systems with a fat-tree topology, SGI uses updn as the default routing algorithm. The Up/Down unicast routing algorithm (updn) is also based on the minimum number of hops to each node, but it is constrained by ranking rules.
For more information on routing variables, see the opensm(8) man page.
Hypercube network topology is well suited for smaller node count MPI jobs or jobs that have communication patterns that are not sensitive to bisection bandwidth. Fat-tree network topology is well suited for large node count MPI jobs that are sensitive to bisection bandwidth.
As stated above, there are two opensm daemons, one for each fabric: opensmd-ib0 and opensmd-ib1. They are controlled by init.d scripts. Each init.d script has a separate configuration file for its fabric, opensm-ib0 and opensm-ib1, respectively.
You can use the sminfo command to show the GUID of the SM master.
This section describes how to configure and administer the InfiniBand fabric using the sgifmcli(8) command.
Note: SGI highly recommends that you use the tempo-configure-fabric GUI to configure and administer the fabric (see “The InfiniBand Management Tool Graphical User Interface”).
When configuring the SM master, the following rules apply:
Each InfiniBand fabric needs to have a subnet manager (SM) master.
There can be at most one SM master per InfiniBand plane.
Fabric configuration and administration can only be done via the SM master.
Fabric configuration becomes active after (re)starting the SM master.
Deleting an SM master automatically deletes its standby, if it exists.
The syntax to configure an SM master is, as follows:
sgifmcli --mastersm --init --id identifier --hostname hostname --fabric fabric --topology topology
This command creates a master with the name provided by the --id option. The identifier can be any arbitrary string. The hostname determines the host on which the SM master manager is launched. The fabric option associates the SM master manager with either ib0 or ib1. The topology option refers to the InfiniBand topology, which can be either hypercube, enhanced hypercube, or fat tree.
To configure a master for the fabric ib0 on a hypercube cluster, perform the following steps:
From the admin node to configure an SM master, perform the following:
# sgifmcli --mastersm --init --id master_ib0 --hostname r1lead --fabric ib0 --topology hypercube
This creates an SM master for ib0. The underlying topology is a hypercube and thus the routing algorithm dor will be used. This SM master, named master_ib0, is configured to run on the host r1lead.
The syntax to start an SM master is, as follows:
sgifmcli --start --id identifier
To start the master_ib0 SM master, perform the following:
# sgifmcli --start --id master_ib0
At this point a master for the fabric ib0 is running on r1lead and thus the fabric ib0 is available for compute jobs. If a standby has been defined, it is launched automatically, in addition to the master.
The syntax to stop an SM master is, as follows:
sgifmcli --stop --id identifier
To stop the master_ib0 SM master, perform the following:
# sgifmcli --stop --id master_ib0
The SM master master_ib0 running on host r1lead is stopped. If a standby has been defined then it will be stopped automatically, in addition to the master.
The syntax to check the status of an SM master is, as follows:
sgifmcli --status --id identifier
To check the status of the master_ib0 SM master, perform the following:
# sgifmcli --status --id master_ib0
Master SM
Host = r1lead
Guid = 0x0002c902002838f5
Fabric = ib0
Topology = hypercube
Routing Engine = dor
OpenSM = running
The syntax to remove an SM master is, as follows:
sgifmcli --remove --id identifier
To remove the master_ib0 SM master, first stop it and then use the --remove option, as follows:
# sgifmcli --stop --id master_ib0
# sgifmcli --remove --id master_ib0
The SM master is removed from the entity list. If a standby has been defined, it is removed, in addition to the master.
To print the fabric configuration, run the following:
# sgifmcli --showconfig
--------------
NAME = ib1
TYPE = ibfabric
MASTER =
STANDBY =
SWITCH_LIST =
--------------
NAME = ib0
TYPE = ibfabric
MASTER =
STANDBY =
SWITCH_LIST =
Each subnet manager (SM) has a failover mechanism. If the master SM fails, the standby SM takes over operation of the fabric. This failover operation is performed automatically by the opensm software. Typically, rack 1 has the master SM for the ib0 fabric and rack 2 has the master SM for the ib1 fabric, as shown in Figure 4-6.
The following procedure describes how to set up the failover mechanism.
When enabling the InfiniBand failover mechanism, the following rules apply:
Each InfiniBand fabric can optionally have exactly one standby.
A standby SM can only be created for a particular fabric when a master already exists.
When adding a standby after a master has already been defined and started, the master needs to be stopped before the standby is defined via the --init option. After defining the standby via --init, restart the master.
An SM master and an SM standby for a particular fabric cannot coexist on the same node.
SGI highly recommends that you use the tempo-configure-fabric GUI to configure the failover mechanism. If it is necessary to use sgifmcli(8) to enable the InfiniBand failover mechanism, perform the following steps:
If an SM master is defined and running, stop it, as follows:
# sgifmcli --stop --id master_ib0
If no SM master has been defined yet, define one, as follows:
# sgifmcli --mastersm --init --id master_ib0 --hostname r1lead --fabric ib0 --topology hypercube
Define the SM standby, as follows:
# sgifmcli --standbysm --init --id standby_ib0 --hostname r2lead --fabric ib0
Start the SM master, as follows:
# sgifmcli --start --id master_ib0
This automatically starts the SM master and the SM standby for ib0.
Now check the status for the subnet manager of ib0, as follows:
# sgifmcli --status --id master_ib0
Master SM
Host = r1lead
Guid = 0x0008f10403987da9
Fabric = ib0
Topology = hypercube
Routing Engine = dor
OpenSM = running
Standby SM
Host = r2lead
Guid = 0x0008f10403987d25
Fabric = ib0
OpenSM = running
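For scripted health checks, the OpenSM lines of the status output are the ones of interest. The following is a minimal sketch, assuming the output keeps the "key = value" layout of the status example in this section; all_opensm_running is a hypothetical helper name and is not part of sgifmcli.

```shell
#!/bin/bash
# all_opensm_running is a hypothetical helper. It reads sgifmcli --status
# output on standard input and succeeds only if at least one "OpenSM ="
# line is present and every such line reports "running".
all_opensm_running() {
    awk '
        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }   # trim the line
        /^OpenSM[ \t]*=/ {
            seen++
            state = $0
            sub(/^OpenSM[ \t]*=[ \t]*/, "", state)   # keep only the value
            if (state != "running") bad++
        }
        END { if (seen > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Example use on the admin node (commented out so that the sketch can be
# tried on a machine where sgifmcli is not installed):
# sgifmcli --status --id master_ib0 | all_opensm_running && echo "ib0 SMs running"
```

Requiring at least one OpenSM line means that empty or unexpected output is treated as a failure rather than as a healthy fabric.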
To remove the standby_ib0 SM standby, first stop its master and then perform the remove option, as follows:
# sgifmcli --stop --id master_ib0
# sgifmcli --remove --id standby_ib0
The SM standby is removed from the entity list.
This section describes how to configure InfiniBand fat-tree network topology. The fat-tree topology involves external InfiniBand switches. For the list of supported external switches, see “Fabric Component sgifmcli Command”. InfiniBand switches come in two types: leaf or spine. Some switches are called director switches; these fall into the spine category. A switch can have one or more spines and has multiple leaf or line switches. It is recommended to discover external IB switches using the Tempo discover command (see “discover Command” in Chapter 2). After discovery is completed, an external switch can also be initialized and added to the InfiniBand system using the sgifmcli command.
To configure the InfiniBand fat-tree network topology on an SGI Altix ICE 8200 series system, perform the following steps:
Make sure that your switch is properly connected to the InfiniBand network. Also, make sure that the admin port of the switch is properly connected to the Ethernet network.
Power on the switch. See the switch manual for operation information.
From the admin node, initialize the switch. The syntax to initialize the switch is, as follows:
sgifmcli --init --ibswitch --model <model> --id <hostname> --switchtype [leaf | spine]
An example command is, as follows:
# sgifmcli --init --ibswitch --model voltaire-isr-2004 --id isr2004 --switchtype spine
This configures a Voltaire ISR 2004 switch with hostname isr2004 as a spine switch. isr2004 refers to the admin port of the switch, which must already be configured to allow switch access. The switch is now initialized and the root GUIDs from the spine switches have been downloaded.
From the admin node, add the switch to the fabric. The syntax to add the switch is, as follows:
sgifmcli --add --id <fabric> --switch <hostname>
An example command is, as follows:
# sgifmcli --add --id ib0 --switch isr2004
In this example, ISR2004 is connected to the ib0 fabric.
For the new switch to be activated, the SM master and the optional SM standby need to be (re)started.
# sgifmcli --start --id master_ib0
If the SM master was running while the switch was added, you first need to stop and then start the master, as follows:
# sgifmcli --stop --id master_ib0
# sgifmcli --start --id master_ib0
The switches related to a particular fabric can be listed, as follows:
# sgifmcli --switchlist --id <fabric>
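The switch steps above can be collected into one script. The following is a sketch only: add_switch and run are hypothetical helper names, the sgifmcli invocations are the ones shown in this procedure, and the script assumes that the SM master is already running (so it stops and then starts it). By default it runs in dry-run mode and only prints the commands.

```shell
#!/bin/bash
# Hypothetical wrapper around the fat-tree switch procedure. With DRY_RUN=1
# (the default) it prints the sgifmcli commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# add_switch MODEL HOSTNAME SWITCHTYPE FABRIC MASTER_ID
add_switch() {
    local model=$1 host=$2 type=$3 fabric=$4 master=$5
    run sgifmcli --init --ibswitch --model "$model" --id "$host" --switchtype "$type" || return
    run sgifmcli --add --id "$fabric" --switch "$host" || return
    # Restart the SM master so that the new switch becomes active.
    run sgifmcli --stop --id "$master" || return
    run sgifmcli --start --id "$master" || return
    run sgifmcli --switchlist --id "$fabric"
}

add_switch voltaire-isr-2004 isr2004 spine ib0 master_ib0
```

Reviewing the printed commands first and then rerunning with DRY_RUN=0 on the admin node keeps a typo in a model or host name from reaching the fabric database.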
After your InfiniBand fabric has been configured and started, you can use the sgifmcli(8) command to verify the health of the fabric.
The fabric can be either ib0 or ib1. This version of the InfiniBand verifier runs the recommended OFED test suite. In addition, the SGI Tempo cluster view is compared with the InfiniBand cluster view and potential differences are reported.
To verify a fabric, perform the following command, replacing fabric with ib0 or ib1:
# sgifmcli --verify --id fabric
The openib-diags package contains useful tools and diagnostic software for Open Fabrics Enterprise Distribution (OFED). This section describes some of these tools. These tools reside on the rack leader controller (leader node) in the /usr/bin directory, as follows:
r1lead:~ # cd /usr/bin
r1lead:/usr/bin # ls ib*
ibaddr ibcheckstate ibdiscover.pl ibnetdiscover ib_rdma_bw ibstatus ...
ibcheckerrors ibcheckwidth ibdmchk ibnlparse ib_rdma_lat ibswitches ...
ibcheckerrs ibclearcounters ibdmsh ibnodes ib_read_bw ibsysstat ...
ibchecknet ibclearerrors ibdmtr ibping ib_read_lat ibtopodiff ...
ibchecknode ib_clock_test ibfindnodesusing.pl ibportstate ibroute ibtracert ...
ibcheckport ibdiagnet ibhosts ibprintca.pl ib_send_bw ibv_asyncwatch ...
ibcheckportstate ibdiagpath ibis ibprintswitch.pl ib_send_lat ibv_devices ...
ibcheckportwidth ibdiagui iblinkinfo.pl ibqueryerrors.pl ibstat ibv_devinfo
You can use the ibstat command to see the current status of the host channel adapters (HCAs) in your InfiniBand fabric, including the HCAs on rack leader controllers. The following view is prior to starting the fabric management:
r1lead:/usr/bin # ibstat
CA 'mthca0'
CA type: MT25208 (MT23108 compat mode)
Number of ports: 2
Firmware version: 4.7.600
Hardware version: a0
Node GUID: 0x0008f104039881a8
System image GUID: 0x0008f104039881ab
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 20
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0008f104039881a9
Port 2:
State: Initializing
Physical state: LinkUp
Rate: 20
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0008f104039881aa
The following shows output from the ibstat command after the fabric management software has been started:
r1lead:/opt/sgi/sbin # ibstat
CA 'mthca0'
CA type: MT25208 (MT23108 compat mode)
Number of ports: 2
Firmware version: 4.7.600
Hardware version: a0
Node GUID: 0x0008f104039881a8
System image GUID: 0x0008f104039881ab
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x02510a6a
Port GUID: 0x0008f104039881a9
Port 2:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x02510a6a
Port GUID: 0x0008f104039881aa
You can use the ibstatus command (less verbose than ibstat) to show the link rate, as follows:
r1lead:/opt/sgi/sbin # ibstatus
Infiniband device 'mthca0' port 1 status:
default gid: fe80:0000:0000:0000:0008:f104:0398:81a9
base lid: 0x1
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
Infiniband device 'mthca0' port 2 status:
default gid: fe80:0000:0000:0000:0008:f104:0398:81aa
base lid: 0x1
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
Note: If the link rate is not 20 Gb/sec (4X DDR) and you have a DDR-capable HCA, there is a physical link problem with your system.
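The link rate check in the note can be automated by filtering ibstatus output. The following is a minimal sketch, assuming the "Infiniband device ... port N status:" and "rate:" line layout of the ibstatus example in this section; check_link_rate is a hypothetical helper name.

```shell
#!/bin/bash
# check_link_rate is a hypothetical helper. It reads ibstatus output on
# standard input and prints every port whose rate differs from the
# expected string. It exits with a nonzero status if any port deviates.
check_link_rate() {
    local expected=${1:-"20 Gb/sec (4X DDR)"}
    awk -v want="$expected" '
        /Infiniband device/ { port = $3 " port " $5 }
        /rate:/ {
            rate = $0
            sub(/^.*rate:[ \t]*/, "", rate)   # drop the label
            sub(/[ \t]+$/, "", rate)          # drop trailing blanks
            if (rate != want) { print port ": " rate; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Example use on a rack leader controller (commented out so that the
# sketch can be tried on a machine without InfiniBand hardware):
# ibstatus | check_link_rate || echo "check cabling on the ports listed above"
```

The expected rate is a parameter because the 20 Gb/sec (4X DDR) figure applies only to DDR-capable HCAs.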
The perfquery command is useful for finding errors on one or more HCA and switch ports. You can also use perfquery to reset HCA and switch port counters.
To see a usage statement for the perfquery command, perform the following:
r1lead:/opt/sgi/sbin # perfquery --help
Usage: perfquery [-d(ebug) -G(uid) -a(ll_ports) -r(eset_after_read) -C ca_name -P ca_port -R(eset_only)
-t(imeout) timeout_ms -V(ersion) -h(elp)] [<lid|guid> [[port] [reset_mask]]]
Examples:
perfquery # read local port's performance counters
perfquery 32 1 # read performance counters from lid 32, port 1
perfquery -e 32 1 # read extended performance counters from lid 32, port 1
perfquery -a 32 # read performance counters from lid 32, all ports
perfquery -r 32 1 # read performance counters and reset
perfquery -e -r 32 1 # read extended performance counters and reset
perfquery -R 0x20 1 # reset performance counters of port 1 only
perfquery -e -R 0x20 1 # reset extended performance counters of port 1 only
perfquery -R -a 32 # reset performance counters of all ports
perfquery -R 32 2 0x0fff # reset only error counters of port 2
perfquery -R 32 2 0xf000 # reset only non-error counters of port 2
r1lead:/opt/sgi/sbin # perfquery
# Port counters: Lid 1 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................0
RcvData:.........................0
XmtPkts:.........................0
RcvPkts:.........................0
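When many ports are checked, only the counters that are not zero matter. The following is a minimal sketch that filters perfquery output, assuming the "Name:....value" layout of the perfquery example in this section; nonzero_counters is a hypothetical helper name, and the choice to skip the selector and traffic counters is an assumption made for illustration.

```shell
#!/bin/bash
# nonzero_counters is a hypothetical helper. It reads perfquery output on
# standard input and prints each error counter whose value is not zero.
# The PortSelect and CounterSelect fields and the traffic counters
# (XmtData, RcvData, XmtPkts, RcvPkts) are skipped.
nonzero_counters() {
    awk -F'[.:]+' '
        /^#/ { next }
        $1 ~ /^(PortSelect|CounterSelect|XmtData|RcvData|XmtPkts|RcvPkts)$/ { next }
        NF >= 2 && $2 + 0 != 0 { print $1 "=" $2 }
    '
}

# Example use on a rack leader controller (commented out so that the
# sketch can be tried on a machine without InfiniBand hardware):
# perfquery 32 1 | nonzero_counters
```

An empty result means that no error counter on the queried port has incremented since the counters were last reset.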
The ibnetdiscover command allows you to discover the IB fabric.
To see a usage statement for the ibnetdiscover command, perform the following:
r1lead:/opt/sgi/sbin # ibnetdiscover --help
Usage: ibnetdiscover [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -V(ersion) -C ca_name -P ca_port -t(imeout) timeout_ms --switch-map switch-map] [<topology-file>]
--switch-map <switch-map> specify a switch-map file
Note: Only abbreviated output is shown in this example.
r1lead:/opt/sgi/sbin # ibnetdiscover
#
# Topology file: generated on Tue Jul 17 14:05:20 2007
#
# Max of 3 hops discovered
# Initiated from node 0008f104039881a8 port 0008f104039881a9
vendid=0x2c9
devid=0xb924
sysimgguid=0x8006900000000dd
...
Switch : 0x08006900000000dc ports 24 devid 0xb924 vendid 0x2c9 "MT47396 Infiniscale-III Mellanox Technologies"
Switch : 0x08006900000000a4 ports 24 devid 0xb924 vendid 0x2c9 "MT47396 Infiniscale-III Mellanox Technologies"
r1lead:/opt/sgi/sbin # ibnetdiscover -H (HCA's)
Ca : 0x0030487aa7940000 ports 1 devid 0x6274 vendid 0x2c9 "MT25204 InfiniHostLx Mellanox Technologies"
Ca : 0x0030487aa78c0000 ports 1 devid 0x6274 vendid 0x2c9 "r1i0n8-ib0 HCA-1"
Ca : 0x0008f10403988198 ports 2 devid 0x6278 vendid 0x8f1 " HCA-1"
Ca : 0x0030487aa7840000 ports 1 devid 0x6274 vendid 0x2c9 "r1i0n1-ib0 HCA-1"
Ca : 0x0030487aa79c0000 ports 1 devid 0x6274 vendid 0x2c9 "r1i1n0-ib0 HCA-1"
Ca : 0x0030487aa7900000 ports 1 devid 0x6274 vendid 0x2c9 "r1i1n8-ib0 HCA-1"
Ca : 0x0030487aa7980000 ports 1 devid 0x6274 vendid 0x2c9 "r1i1n1-ib0 HCA-1"
Ca : 0x0008f104039881a8 ports 2 devid 0x6278 vendid 0x8f1 " HCA-1"
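A quick way to compare the discovered fabric with the expected node count is to tally the Switch and Ca lines of the listing. The following is a minimal sketch, assuming the "Switch : ..." and "Ca : ..." line layout of the ibnetdiscover example in this section; count_fabric_nodes is a hypothetical helper name.

```shell
#!/bin/bash
# count_fabric_nodes is a hypothetical helper. It reads ibnetdiscover
# list output on standard input and prints how many switch and channel
# adapter (CA) entries it contains.
count_fabric_nodes() {
    awk '
        $1 == "Switch" { sw++ }
        $1 == "Ca"     { ca++ }
        END { printf "switches=%d cas=%d\n", sw, ca }
    '
}

# Example use on a rack leader controller (commented out so that the
# sketch can be tried on a machine without InfiniBand hardware):
# { ibnetdiscover -S; ibnetdiscover -H; } | count_fabric_nodes
```

A CA count lower than the number of HCAs expected on the fabric points to nodes or links that the sweep did not reach.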
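The ibdiagnet command is a diagnostic tool that scans the fabric, collects information about its connectivity and devices, and reports suspected bad links.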
To see a usage statement for the ibdiagnet command, perform the following:
r1lead:/opt/sgi/sbin # ibdiagnet --help
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
NAME
ibdiagnet
SYNOPSYS
ibdiagnet [-c ] [-v] [-r] [-o ]
[-t ] [-s ] [-i ] [-p ]
[-pm] [-pc] [-P <>]
[-lw <1x|4x|12x>] [-ls <2.5|5|10>]
DESCRIPTION
ibdiagnet scans the fabric using directed route packets and extracts all the
available information regarding its connectivity and devices.
It then produces the following files in the output directory defined by the
-o option (see below):
ibdiagnet.lst - List of all the nodes, ports and links in the fabric
ibdiagnet.fdbs - A dump of the unicast forwarding tables of the fabric
switches
ibdiagnet.mcfdbs - A dump of the multicast forwarding tables of the fabric
switches
ibdiagnet.masks - In case of duplicate port/node Guids, these file include
the map between masked Guid and real Guids
ibdiagnet.sm - A dump of all the SM (state and priority) in the fabric
ibdiagnet.pm - In case -pm option was provided, this file contain a dump
of all the nodes PM counters
In addition to generating the files above, the discovery phase also checks for
duplicate node/port GUIDs in the IB fabric. If such an error is detected, it
is displayed on the standard output.
After the discovery phase is completed, directed route packets are sent
multiple times (according to the -c option) to detect possible problematic
paths on which packets may be lost. Such paths are explored, and a report of
the suspected bad links is displayed on the standard output.
After scanning the fabric, if the -r option is provided, a full report of the
fabric qualities is displayed.
This report includes:
SM report
Number of nodes and systems
Hop-count information:
maximal hop-count, an example path, and a hop-count histogram
All CA-to-CA paths traced
Credit loop report
mgid-mlid-HCAs matching table
Note: In case the IB fabric includes only one CA, then CA-to-CA paths are not
reported.
Furthermore, if a topology file is provided, ibdiagnet uses the names defined
in it for the output reports.
OPTIONS
-c : The minimal number of packets to be sent
across each link (default = 10)
-v : Instructs the tool to run in verbose mode
-r : Provides a report of the fabric qualities
-o : Specifies the directory where the output
files will be placed (default = /tmp)
-t : Specifies the topology file name
-s : Specifies the local system name. Meaningful
only if a topology file is specified
-i : Specifies the index of the device of the port
used to connect to the IB fabric (in case of
multiple devices on the local system)
-p : Specifies the local device's port number used
to connect to the IB fabric
-pm : Dumps all pmCounters values into ibdiagnet.pm
-pc : reset all the fabric links pmCounters
-P <>: If any of the provided pm is greater then its
provided value, print it to screen
-lw <1x|4x|12x> : Specifies the expected link width
-ls <2.5|5|10> : Specifies the expected link speed
-h|--help : Prints this help information
-V|--version : Prints the version of the tool
--vars : Prints the tool's environment variables and
their values
ERROR CODES
1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package
|
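A wrapper script can translate the error codes listed above into messages for operators. The Python sketch below assumes that ibdiagnet returns these codes as its process exit status and that a status of 0 means no error code was raised; neither point is stated in the help text, so verify both on your system. The function names are hypothetical.

```python
import subprocess

# Descriptions copied from the ERROR CODES section of the ibdiagnet help text.
IBDIAGNET_ERRORS = {
    1: "Failed to fully discover the fabric",
    2: "Failed to parse command line options",
    3: "Failed to interact with IB fabric",
    4: "Failed to use local device or local port",
    5: "Failed to use Topology File",
    6: "Failed to load required Package",
}


def describe_exit_code(code):
    """Map an exit status to a message (assumes 0 means no error code)."""
    if code == 0:
        return "No error code reported"
    return IBDIAGNET_ERRORS.get(code, f"Unrecognized exit code {code}")


def run_ibdiagnet(extra_args=()):
    """Run ibdiagnet and return (exit status, captured standard output).

    Only meaningful on a node where ibdiagnet is installed, such as a
    rack leader controller.
    """
    result = subprocess.run(
        ["ibdiagnet", *extra_args],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    status, output = run_ibdiagnet()
    print(output)
    print(describe_exit_code(status))
```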
Output that reports no errors, as in the following example, indicates that the fabric is operating correctly:
r1lead:/opt/sgi/sbin # ibdiagnet
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
Loading IBDM from: /usr/lib64/ibdm1.2
-W- Topology file is not specified.
Reports regarding cluster links will use direct routes.
-W- A few ports of local device are up.
Since port-num was not specified (-p option), port 1 of device 1 will be
used as the local port.
-I- Discovering the subnet ... 10 nodes (2 Switches & 8 CA-s) discovered.
-I---------------------------------------------------
-I- Bad Guids Info
-I---------------------------------------------------
-I- No bad Guids were found
-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found
-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found
-I---------------------------------------------------
-I- Bad Links Info
-I---------------------------------------------------
-I- No bad link were found
-I- Done. Run time was 0 seconds.
|
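The clean report above can be checked automatically by looking for the four "No ... found" lines and counting messages by prefix. The following Python sketch does this. The -I- and -W- prefixes appear in the output above; the -E- prefix for errors is an assumption that does not appear in this example, and the function names are hypothetical.

```python
import re

# Lines copied from the clean ibdiagnet report shown above.
CLEAN_MARKERS = (
    "-I- No bad Guids were found",
    "-I- No bad Links (with logical state = INIT) were found",
    "-I- No illegal PM counters values were found",
    "-I- No bad link were found",
)

# -I- and -W- appear in the sample output; -E- is assumed for errors.
PREFIX_RE = re.compile(r"^-([IWE])-")


def count_messages(text):
    """Count informational, warning, and error messages by prefix."""
    counts = {"I": 0, "W": 0, "E": 0}
    for line in text.splitlines():
        line = line.strip()
        match = PREFIX_RE.match(line)
        if not match:
            continue  # continuation lines carry no prefix
        if set(line[3:]) <= {"-"}:
            continue  # separator rows such as "-I--------"
        counts[match.group(1)] += 1
    return counts


def is_clean(text):
    """True if every clean marker is present and no -E- line was seen."""
    lines = {line.strip() for line in text.splitlines()}
    has_all_markers = all(marker in lines for marker in CLEAN_MARKERS)
    return has_all_markers and count_messages(text)["E"] == 0


if __name__ == "__main__":
    import sys
    # Example: ibdiagnet | python check_report.py  (hypothetical file name)
    report = sys.stdin.read()
    print(count_messages(report))
    print("clean" if is_clean(report) else "needs attention")
```

Warnings are counted but do not make the report fail, because the sample run above is considered healthy even though it prints two -W- messages.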
To test the fabric under load, use the -c option to increase the number of packets that ibdiagnet sends across each link, as follows:
r1lead:/opt/sgi/sbin # ibdiagnet -c 5000
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
Loading IBDM from: /usr/lib64/ibdm1.2
-W- Topology file is not specified.
Reports regarding cluster links will use direct routes.
-W- A few ports of local device are up.
Since port-num was not specified (-p option), port 1 of device 1 will be
used as the local port.
-I- Discovering the subnet ... 10 nodes (2 Switches & 8 CA-s) discovered.
-I---------------------------------------------------
-I- Bad Guids Info
-I---------------------------------------------------
-I- No bad Guids were found
-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found
-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found
-I---------------------------------------------------
-I- Bad Links Info
-I---------------------------------------------------
-I- No bad link were found
-I- Done. Run time was 8 seconds. |
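When comparing a default run with a loaded run such as the one above, it can help to extract the discovery summary and the run time from each report. The Python sketch below pulls those two lines out of the text, assuming the wording shown in these examples; the function name and dictionary keys are hypothetical.

```python
import re

# Matches "... 10 nodes (2 Switches & 8 CA-s) discovered."
DISCOVERY_RE = re.compile(
    r"(\d+) nodes \((\d+) Switches & (\d+) CA-s\) discovered"
)
# Matches "Run time was 8 seconds."
RUNTIME_RE = re.compile(r"Run time was (\d+) seconds")


def summarize_run(text):
    """Return node counts and run time found in an ibdiagnet report.

    Keys are absent when the corresponding line is not present.
    """
    summary = {}
    discovery = DISCOVERY_RE.search(text)
    if discovery:
        nodes, switches, cas = (int(g) for g in discovery.groups())
        summary.update(nodes=nodes, switches=switches, cas=cas)
    runtime = RUNTIME_RE.search(text)
    if runtime:
        summary["seconds"] = int(runtime.group(1))
    return summary


if __name__ == "__main__":
    import sys
    # Example: ibdiagnet -c 5000 | python summarize.py  (hypothetical file name)
    print(summarize_run(sys.stdin.read()))
```

If the node counts differ between two runs on an unchanged system, the discovery step did not see the same fabric both times, which is worth investigating before reading the rest of the report.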