Chapter 5. System Maintenance, Monitoring, and Debugging

This chapter describes system monitoring and covers the following topics:

Maintenance Procedures

This section describes some common maintenance procedures, as follows:

Temporarily Take a Node Offline for Maintenance

This section describes how to temporarily take a node offline for maintenance.

Procedure 5-1. Temporarily Take a Node Offline for Maintenance

    To temporarily take a node offline for maintenance, perform the following steps:

    1. Disable the node in the batch scheduler (depends on your batch scheduler).

    2. Power off the node, as follows:

      # cpower --down r1i0n0

    3. Mark the node offline, as follows:

      # cadmin --set administrative_status=offline r1i0n0

    4. Perform any required maintenance on the blade.

    5. Mark the node online, as follows:

      # cadmin --set administrative_status=online r1i0n0

    6. Power up the node, as follows:

      # cpower --boot r1i0n0

    7. Enable the node in the batch scheduler (depends on your batch scheduler).
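The power and status steps of the procedure above (steps 2, 3, 5, and 6) can be wrapped in a small helper. The following is a minimal sketch, not part of SGI Tempo: the function names run, node_offline, and node_online and the DRY_RUN variable are hypothetical. It uses only the cpower and cadmin invocations shown in the procedure, and leaves the batch scheduler steps as comments because they depend on your scheduler.

```shell
# Hypothetical helpers; only the cpower/cadmin calls come from the procedure.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

node_offline() {
    # Step 1 (disable the node in the batch scheduler) is site-specific.
    run cpower --down "$1" &&
    run cadmin --set administrative_status=offline "$1"
}

node_online() {
    run cadmin --set administrative_status=online "$1" &&
    run cpower --boot "$1"
    # Step 7 (enable the node in the batch scheduler) is site-specific.
}
```

For example, running node_offline r1i0n0 with DRY_RUN=1 prints the two commands without touching the node, which is a convenient way to review them first.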

    Permanently Replace a Failed Blade


    Note: See your SGI field support person for the physical removal and replacement of SGI Altix ICE compute nodes (blades).


    This section describes how to permanently replace a failed blade.

    Procedure 5-2. Permanently Replace a Failed Blade

      To permanently replace a failed blade (compute node), perform the following steps:

      1. Disable the node in the batch scheduler (depends on your batch scheduler).

      2. Power off the node, as follows:

        # cpower --down r1i0n0

      3. Mark the node offline, as follows:

        # cadmin --set administrative_status=offline r1i0n0

      4. Physically remove and replace the failed blade.

      5. Rediscover the rack where the replacement blade lives (this example assumes it is rack 1), as follows:

        # discover-rack --rack 1

      6. Set the node to boot your desired compute image (see cimage --list-images and “cimage Command” in Chapter 3 for your options), as follows:

        # cimage --set mycomputeimage mykernel r1i0n0

      7. Power up the node, as follows:

        # cpower --boot r1i0n0

      8. Enable the node in the batch scheduler (depends on your batch scheduler).
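Once the blade has been physically replaced, steps 5 through 7 always follow the same pattern. As a rough sketch, the hypothetical function blade_bringup_cmds below only prints the three commands from the procedure for a given rack, image, kernel, and node, so that you can review them before running anything. The function name and argument order are assumptions made for illustration.

```shell
# Hypothetical helper: print (do not run) the commands for steps 5 through 7.
blade_bringup_cmds() {
    if [ "$#" -ne 4 ]; then
        echo "usage: blade_bringup_cmds RACK IMAGE KERNEL NODE" >&2
        return 1
    fi
    printf 'discover-rack --rack %s\n' "$1"
    printf 'cimage --set %s %s %s\n' "$2" "$3" "$4"
    printf 'cpower --boot %s\n' "$4"
}
```

For example, blade_bringup_cmds 1 mycomputeimage mykernel r1i0n0 prints the commands used in this procedure. After review, the output can be piped to sh -x to execute it.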

      Permanently Remove a Blade

      This section describes how to permanently remove a blade from your Altix ICE system.

      Procedure 5-3. Permanently Remove a Blade

        To permanently remove a blade from your system, perform the following steps:

        1. Disable the node in the batch scheduler (depends on your batch scheduler).

        2. Power off the node, as follows:

          # cpower --down r1i0n0

        3. Mark the node offline, as follows:

          # cadmin --set administrative_status=offline r1i0n0

        4. Physically remove the failed blade.

        5. Rediscover the rack where the removed blade previously resided.

        Add a New Blade

        This section describes how to add a new blade to an Altix ICE system.

        Procedure 5-4. Add a New Blade

          To add a new blade to your system, perform the following steps:

          1. Physically insert the new blade.

          2. Rediscover the rack where the new blade lives (this example assumes it is rack 1), as follows:

            # discover-rack --rack 1

          3. Set the node to boot your desired compute image (see cimage --list-images and “cimage Command” in Chapter 3 for your options), as follows:

            # cimage --set mycomputeimage mykernel r1i0n0

          4. Power up the node, as follows:

            # cpower --boot r1i0n0

          5. Enable the node in the batch scheduler (depends on your batch scheduler).
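The rack number passed to discover-rack can usually be read off the node name. The sketch below assumes that compute node names follow the r<rack>i<number>n<number> pattern seen in the examples in this chapter (for instance, r1i0n0 in rack 1). The function name rack_of is hypothetical, and the function prints nothing for names that do not match the pattern.

```shell
# Assumption: node names look like r<rack>i<number>n<number>, as in r1i0n0.
rack_of() {
    printf '%s\n' "$1" |
        sed -n 's/^r\([0-9][0-9]*\)i[0-9][0-9]*n[0-9][0-9]*$/\1/p'
}
```

With this helper, step 2 could be written as discover-rack --rack "$(rack_of r1i0n0)".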

          Inventory Verification Tool

          You can use the SGI Tempo inventory verification tool to take snapshots of, query, analyze, and compare the node and network inventory of a cluster. Various hardware, network, and operating system configuration properties are available and are presented in user-specified formats.

          To make an inventory snapshot of an Altix ICE system, use the following command from the system admin controller (admin node):

          system-admin:~ # ivt -M
          Making a cluster inventory snapshot.  Takes a couple of minutes...  

          Each snapshot is assigned a unique number and marked with the date and time it was taken. Use the ivt -L command to list active snapshot information, as follows:

          system-admin:~ # ivt -L
              1   2007-07-13.11:42:47

          You can query (-Q option), compare (-C option), and analyze (-S option) existing snapshots. A variety of system hardware and configuration properties can be displayed. You can compare two snapshots to see what has changed, or analyze a system snapshot to find failed nodes or to see network fabric links.

          You can use the ivt -S command to show general information about your system (note that only a portion of the output of this command is shown below), as follows:

          system-admin:~ # ivt -S
          
          Your system has 6 compute blades.
          
          All 6 blades have the following characteristics:
              bios_date: 05/29/2007
              cpu_core_count: 8
              cpu_model: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
              kernel: 2.6.16.46-0.12-smp
              memsize: 2059264
              os_product: SLES
              os_vendor: SUSE
              os_version: 10.1
          
          The following characteristics have different values for some blades.
          
            ib0_phys_state (State of InfiniBand ib0 physical link):
                    4 blades have ib0_phys_state == LinkUp (r1i0n0, r1i1n0, r1i0n8, ...)
                    2 blades have ib0_phys_state == unknown (r1i0n1, r1i1n1)
                Query the  value for all blades with:
                  ivt -Q -w blades -f 'blade $blade has ib0_phys_state $ib0_phys_state'
          
            ib0_rate (Rate of InfiniBand ib0 link - Gb/sec):
                    2 blades have ib0_rate == unknown (r1i0n1, r1i1n1)
                    4 blades have ib0_rate == 20 (r1i0n0, r1i1n0, r1i0n8, ...)
                Query the  value for all blades with:
                  ivt -Q -w blades -f 'blade $blade has ib0_rate $ib0_rate'
          ...
          
            ib_bios_rev (Revision of InfiniBand BIOS on blade):
                    2 blades have ib_bios_rev == unknown (r1i0n1, r1i1n1)
                    4 blades have ib_bios_rev == 1.2.0 (r1i0n0, r1i1n0, r1i0n8, ...)
                Query the  value for all blades with:
                  ivt -Q -w blades -f 'blade $blade has ib_bios_rev $ib_bios_rev'
          
            image (image provisioned on blade):
                    5 blades have image == compute-sles10sp1 (r1i0n1, r1i1n1, r1i1n0, ...)
                    1 blades have image == erikj-blade-mksiimage (r1i0n0)
                Query the  value for all blades with:
                  ivt -Q -w blades -f 'blade $blade has image $image'
          
            rack_blade_count (number of booted blades in this blades rack):
                    2 blades have rack_blade_count == 5 (r1i0n1, r1i1n1)
                    4 blades have rack_blade_count == 4 (r1i0n0, r1i1n0, r1i0n8, ...)
                Query the  value for all blades with:
                  ivt -Q -w blades -f 'blade $blade has rack_blade_count $rack_blade_count'
          
          InfiniBand GUID check:
            Do fabric (ibnetdiscover) and blades (ib stat) have same GUIDs?
              ib0 plane: unmatched GUIDs
              GUIDs seen on blade ports, missing on fabric: unknown 0030487aa7940000
              GUIDs see on fabric, missing on blade ports: 0030487aa7840000 0030487aa7980000
              ib1 plane: unmatched GUIDs
              GUIDs seen on blade ports, missing on fabric: unknown 0030487aa7950000
              GUIDs see on fabric, missing on blade ports: 0030487aa7850000 0030487aa7990000
          
          InfiniBand Link state check:
            Are any IB ports not ACTIVE, not 20 Gb/sec rate or not Up?
          ...
          

          You can use the ivt -c cpu command to show an inventory of the system compute blades and the number of CPUs each blade contains, as follows:

          system-admin:~ # ivt -c cpu
          r1i0n0 has 8 CPUs
          r1i0n1 has 8 CPUs
          r1i0n8 has 8 CPUs
          r1i1n0 has 8 CPUs
          r1i1n1 has 8 CPUs
          r1i1n8 has 8 CPUs

          You can use the ivt tool to determine which compute nodes (blades) are up or down, as follows:

          system-admin:~ #  ivt -Q -w blades -f '$blade $sshstate'
          r1i0n0 up
          r1i0n1 down
          r1i0n8 up
          r1i1n0 up
          r1i1n1 down
          r1i1n8 up
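Because the query output above is one blade and one state per line, it is easy to filter. The following sketch defines a hypothetical down_blades filter that reads that two-column output on standard input and prints only the blades reported as down.

```shell
# Reads "<blade> <state>" lines on stdin and prints the blades that are down.
down_blades() {
    awk '$2 == "down" { print $1 }'
}
```

It would typically be used as ivt -Q -w blades -f '$blade $sshstate' | down_blades.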

          You can use the ivt tool to determine the GigE Ethernet address for each compute node (blade), as follows:

          system-admin:~ # ivt -Q -w blades -f '$blade $gige_ip_addr'
          r1i0n0 192.168.159.10
          r1i0n1 192.168.159.11
          r1i0n8 192.168.159.18
          r1i1n0 192.168.159.26
          r1i1n1 192.168.159.27
          r1i1n8 192.168.159.34

          For detailed information on how to use the ivt tool, see the ivt(8) man page or the ivt -h (--help) usage statement.

          System Monitoring Overview

          Ganglia is a scalable, distributed monitoring system for high-performance computing systems, such as the SGI Altix ICE 8200 system. It displays web browser-based, real-time (on demand) histograms of system metrics, as shown in Figure 5-1.

          Figure 5-1. Ganglia System Monitor

          Detailed information about the Ganglia monitoring system is available at: http://ganglia.info/.

          SGI Tempo has devised a Ganglia model for the Altix ICE system that makes maximum use of Ganglia's highly scalable architecture: each compute node (blade) acts as a single monitoring source that sends its statistics to the rack leader controller. Therefore, the rack leader controller receives data from at most 64 blades. After collecting the data, the rack leader controller forwards aggregated rack statistics to the system admin controller (admin node). The rack leader controller also sends its own statistics to the system admin controller. The system admin controller acts as the meta-aggregator for the entire Altix ICE system. It collects data from all rack leaders and presents the cluster-wide metrics. This model enables SGI to scale out Ganglia to very large cluster deployments.

          The Node View, as shown in Figure 5-2, can aid in system troubleshooting. For every blade in the system, the Location field of the Node View shows the exact physical location of the blade. This is extremely useful when trying to locate a blade that is down.

          Figure 5-2. Ganglia System Monitoring Node View

          System Monitoring Operation

          This section describes the operation of the Ganglia system monitor and covers the following topics:

          Accessing the Ganglia System Monitor

          To access the Ganglia system monitor, point your browser to the following location: http://admin_pub_name/ganglia

          Monitoring System Metrics

          By default, Ganglia monitors standard operating system metrics such as CPU load and memory usage. The Grid Report view shows an overview of your system, such as the number of CPUs, the number of hosts (compute nodes) that are up or down, service node information, memory usage information, and so on.

          The Last pull-down menu allows you to view performance data on an hourly, daily, weekly, or yearly basis. The Sorted pull-down menu provides an ascending, descending, or by-host view of performance data. The Grid pull-down menu allows you to see performance data for a particular rack or service node. The Get Fresh Data button allows you to see current performance data.

          SEL/Hardware Event Monitoring

          The system admin controller, rack leader controllers, the service nodes, the chassis management controllers (CMCs), and all the compute nodes (blades) are equipped with a specialized controller called the baseboard management controller (BMC). This unit provides a broad set of functions as described in the IPMI 2.0 standard. SGI Tempo software uses the BMCs predominantly for remote power management, remote system configuration, and for gathering critical hardware events.

          Currently, critical hardware events are gathered for the following nodes: rack leader controllers (leader nodes), CMCs and compute nodes (blades). These events are logged in the following locations:

          • /var/log/messages via syslog

          • /var/log/sel/sel.log

          • Embedded Support Partner (ESP)

          Whenever a critical hardware event occurs, information about the event is forwarded to all three locations. You can observe a critical hardware event via syslog, via sel.log, or using ESP. Furthermore, administrator-defined actions can be triggered via ESP, for instance sending an e-mail notification to the system administrator. For more information on ESP, see the esp(5) man page and the SGI Embedded Support Partner User Guide.

          All critical hardware events are summarized under the BMC_CMC event type. One particular event holds the following useful information:

          MSG ::=  <syslog-prefix> TEMPO:<node> EVENT:<event> APP:<app> Date:<date> VERSION:<version> TEXT <text> 

          The following fields are all of the type string:

          <node> 

          node name, for example, r1i0n5

          <event> 

          BMC_CMC

          <app> 

          SEL-LOGGER

          <date> 

          date / time of the event

          <version> 

          1.0

          <text> 

          Exact copy of the hardware event description from the BMC

          After the events are read from the BMCs, the BMC event logs are cleared on the controller to avoid duplicate events.

          Node Availability Monitoring

          The availability of each node in the SGI Altix ICE system is monitored via Ganglia. A node is declared down if it does not send a heartbeat for approximately 80 seconds. In this event, a NODE_DOWN Embedded Support Partner (ESP) event is generated. You can observe this event via syslog or using ESP. Furthermore, administrator-defined actions can be triggered, for instance sending an e-mail notification to the system administrator. For more information on ESP, see the esp(5) man page and the SGI Embedded Support Partner User Guide.

          The NODE_DOWN event contains the following useful information:

          MSG ::=  <syslog-prefix> TEMPO:<node> EVENT:<event> APP:<app> Date:<date> VERSION:<version> TEXT <text> 

          The NODE_DOWN event is created only once for a failed node.

          The following fields are all of the type string:

          <node> 

          node name, for example, r1i0n5

          <event> 

          NODE_DOWN

          <app> 

          MIA

          <date> 

          date / time of the event

          <version> 

          1.0

          <text> 

          Ganglia Web link to failed node
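Both the BMC_CMC and NODE_DOWN events share the MSG layout given above, so the affected nodes can be pulled out of a syslog stream with a single filter. The following is a minimal sketch: the function name tempo_event_nodes is hypothetical, and it assumes that the fields are separated by single spaces exactly as in the MSG definition. The event name is inserted into a regular expression, so pass only plain names such as BMC_CMC or NODE_DOWN.

```shell
# Reads syslog lines on stdin; prints the node name of each Tempo event
# whose EVENT field equals $1 (for example BMC_CMC or NODE_DOWN).
# Assumption: fields are separated by single spaces, as in the MSG layout.
tempo_event_nodes() {
    sed -n "s/.*TEMPO:\([^ ]*\) EVENT:$1 APP:.*/\1/p"
}
```

For example, tempo_event_nodes NODE_DOWN < /var/log/messages | sort -u would list each node that has logged a NODE_DOWN event, provided those events reach that file on your system.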

          Troubleshooting

          This section describes some troubleshooting tools and covers these topics:

          dbdump Command

          You can run the dbdump script to see an inventory of the Altix ICE database.

          The dbdump command syntax is as follows:

          /opt/sgi/sbin/dbdump --admin
          /opt/sgi/sbin/dbdump --leader
          /opt/sgi/sbin/dbdump --rack rack_number [--rack rack_number]
          /opt/sgi/sbin/dbdump

          • Use the --admin argument to dump the system admin controller (admin node)

          • Use the --leader argument to dump all rack leader controllers (leader nodes)

          • Use the --rack argument to dump a specific rack

          • Use the dbdump command without any argument to dump the entire Altix ICE system.

          EXAMPLES

          Example 5-1. dbdump Command Examples

          To dump the entire database, perform the following:

          system-admin:~ # dbdump
          0 is { cluster=oscar ifname=service0-bmc dev=bmc0 ip=172.24.0.3 net=head-bmc node=service0
            nodetype=oscar_service mac=00:30:48:8e:
          1 is { cluster=oscar ifname=service0 dev=eth0 ip=172.23.0.3 net=head node=service0
            nodetype=oscar_service mac=00:30:48:33:53:2e }
          2 is { cluster=oscar ifname=service0-ib0 dev=ib0 ip=10.148.0.2 net=ib-0 node=service0
            nodetype=oscar_service }
          3 is { cluster=oscar ifname=service0-ib1 dev=ib1 ip=10.149.0.2 net=ib-1 node=service0
            nodetype=oscar_service }
          4 is { cluster=oscar dev=eth0 ip=128.162.244.86 net=public node=oscar_server
            nodetype=oscar_server mac=00:30:48:34:2B:E0 }
          ...


          Note: Some of the sample output in this section has been modified to fit the format of this manual.


          To dump just the rack leader controller, perform the following:

          system-admin:~ # /opt/sgi/sbin/dbdump --leader
          0 is { cluster=rack1 ifname=r1lead-bmc dev=bmc0 ip=172.24.0.2 net=head-bmc node=r1lead
            nodetype=oscar_leader mac=00:30:48:8a:a4:c2 }
          1 is { cluster=rack1 ifname=lead-bmc dev=eth0 ip=192.168.160.1 net=bmc node=r1lead
            nodetype=oscar_leader mac=00:30:48:33:54:9e }
          2 is { cluster=rack1 ifname=lead-eth dev=eth0 ip=192.168.159.1 net=gbe node=r1lead
            nodetype=oscar_leader mac=00:30:48:33:54:9e }
          3 is { cluster=rack1 ifname=r1lead dev=eth0 ip=172.23.0.2 net=head node=r1lead
            nodetype=oscar_leader mac=00:30:48:33:54:9e }
          4 is { cluster=rack1 ifname=r1lead-ib0 dev=ib0 ip=10.148.0.1 net=ib-0 node=r1lead
            nodetype=oscar_leader }
          5 is { cluster=rack1 ifname=r1lead-ib1 dev=ib1 ip=10.149.0.1 net=ib-1 node=r1lead
            nodetype=oscar_leader }

          To dump just one rack, perform the following:

          system-admin:~ # /opt/sgi/sbin/dbdump --rack 1
          0 is { cluster=rack1 ifname=i0n0-bmc dev=bmc0 ip=192.168.160.10 net=bmc node=r1i0n0
            nodetype=oscar_clients mac=00:30:48:7a:a7:96 }
          1 is { cluster=rack1 ifname=i0n0-eth dev=eth0 ip=192.168.159.10 net=gbe node=r1i0n0
            nodetype=oscar_clients mac=00:30:48:7a:a7:94 }
          2 is { cluster=rack1 ifname=r1i0n0-ib0 dev=ib0 ip=10.148.0.3 net=ib-0 node=r1i0n0
            nodetype=oscar_clients }
          3 is { cluster=rack1 ifname=r1i0n0-ib1 dev=ib1 ip=10.149.0.3 net=ib-1 node=r1i0n0
            nodetype=oscar_clients }
          4 is { cluster=rack1 ifname=i0n1-bmc dev=bmc0 ip=192.168.160.11 net=bmc node=r1i0n1
            nodetype=oscar_clients mac=00:30:48:7a:a7:86 slot=1 }
          5 is { cluster=rack1 ifname=i0n1-eth dev=eth0 ip=192.168.159.11 net=gbe node=r1i0n1
            nodetype=oscar_clients mac=00:30:48:7a:a7:84 slot=1 }
          6 is { cluster=rack1 ifname=r1i0n1-ib0 dev=ib0 ip=10.148.0.4 net=ib-0 node=r1i0n1
            nodetype=oscar_clients slot=1 }
          7 is { cluster=rack1 ifname=r1i0n1-ib1 dev=ib1 ip=10.149.0.4 net=ib-1 node=r1i0n1
            nodetype=oscar_clients slot=1 }
          8 is { cluster=rack1 ifname=i0n10-bmc dev=bmc0 ip=192.168.160.20 net=bmc node=r1i0n10
            nodetype=oscar_clients slot=10 }
          9 is { cluster=rack1 ifname=i0n10-eth dev=eth0 ip=192.168.159.20 net=gbe node=r1i0n10
            nodetype=oscar_clients slot=10 }
          10 is { cluster=rack1 ifname=r1i0n10-ib0 dev=ib0 ip=10.148.0.13 net=ib-0 node=r1i0n10
            nodetype=oscar_clients slot=10 }
          ...
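The dbdump records above are simple key=value lists, which makes them easy to post-process. The sketch below defines a hypothetical dbdump_ips filter that prints the node, net, and ip values of each record. Because the sample output in this manual is wrapped across lines, the filter accumulates input until it sees the closing brace of a record. Treat it as an illustration of the record layout shown here, not as a supported interface.

```shell
# Reads dbdump output on stdin and prints "<node> <net> <ip>" per record.
# A record may span several lines; it ends at the line containing "}".
dbdump_ips() {
    awk '
        { rec = rec " " $0 }
        index($0, "}") {
            node = ""; net = ""; ip = ""
            n = split(rec, f, " ")
            for (i = 1; i <= n; i++) {
                if (f[i] ~ /^node=/) node = substr(f[i], 6)
                else if (f[i] ~ /^net=/) net = substr(f[i], 5)
                else if (f[i] ~ /^ip=/) ip = substr(f[i], 4)
            }
            if (node != "") print node, net, ip
            rec = ""
        }'
}
```

A typical use would be /opt/sgi/sbin/dbdump --rack 1 | dbdump_ips.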


          tempo-info-gather Command

          The tempo-info-gather command enables you to collect vital system data, which is especially useful when troubleshooting problems. The tempo-info-gather command collects the following information:

          • Digital media dminfo files, syslogs, Dynamic Host Configuration Protocol (DHCP), network file system (NFS)

          • MySQL cluster database dump

          • Network service configuration files, for example, C3, Ganglia, DHCP, domain name service (DNS) configuration files

          • A list of installed system images

          • Log files in /var/log/messages

          • Chassis management controller (CMC) slot table for each rack

          • Basic input-output system (BIOS), baseboard management controller (BMC), CMC, and InfiniBand fabric software versions from all Altix ICE nodes

          To see a usage statement for the tempo-info-gather command, perform the following:

          system-admin:/opt/sgi/sbin # tempo-info-gather  -h
           usage: tempo-info-gather [-h] [-P path] [-o file]
                  tempo-info-gather -h            # Print this usage page
                  tempo-info-gather -o file       # Tar and gzip the directories into file (imply -n)
                  tempo-info-gather -p path       # Directory to write the data (default /var/tmp/tempo)
          

          cminfo Command

          The cminfo command is used internally by many of the SGI Tempo scripts that are used to discover, configure, and manage an SGI Altix ICE system.

          In a troubleshooting situation, you can use it to gather information about your system. To see a usage statement from a rack leader controller, perform the following:

          r1lead:~ # cminfo --help
          Usage: cminfo [--bmc_base_ip|--bmc_ifname|--bmc_iftype|--bmc_ip|--bmc_mac|--bmc_netmask|--bmc_nic|
          --dns_domain|--gbe_base_ip|--gbe_ifname|--gbe_iftype|--gbe_ip|--gbe_mac|--gbe_netmask|--gbe_nic|
          --head_base_ip|--head_bmc_base_ip|--head_bmc_ifname|--head_bmc_iftype|--head_bmc_ip|--head_bmc_mac|
          --head_bmc_netmask|--head_bmc_nic|--head_ifname|--head_iftype|--head_ip|--head_mac|--head_netmask|
          --head_nic|--ib_0_base_ip|--ib_0_ifname|--ib_0_iftype|--ib_0_ip|--ib_0_mac|--ib_0_netmask|--ib_0_nic|
          --ib_1_base_ip|--ib_1_ifname|--ib_1_iftype|--ib_1_ip|--ib_1_mac|--ib_1_netmask|--ib_1_nic|--name|--rack]

          EXAMPLES

          Example 5-2. cminfo Command Examples

          To see the base IP address of the rack leader BMC network, perform the following:

          r1lead:~ # cminfo --bmc_base_ip
          192.168.160.0

          To see the rack leader DNS domain, perform the following:

          r1lead:~ # cminfo --dns_domain
          ice.domain_name.mycompany.com

          To see the BMC network interface card (NIC), perform the following:

          r1lead:~ # cminfo --bmc_nic
          eth0

          To see the base IP address of the ib1 InfiniBand fabric, perform the following:

          r1lead:~ # cminfo --ib_1_base_ip
          10.149.0.0
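When several values are needed at once, the individual queries above can be looped. The following sketch defines a hypothetical cminfo_report function that takes option names without the leading dashes (any of the names listed in the usage statement) and prints one option=value line per query.

```shell
# Prints "option=value" for each cminfo option name given as an argument.
# Option names are passed without the leading "--".
cminfo_report() {
    for opt in "$@"; do
        printf '%s=%s\n' "$opt" "$(cminfo "--$opt")"
    done
}
```

For example, cminfo_report bmc_base_ip dns_domain bmc_nic ib_1_base_ip gathers the four values shown in Example 5-2 in one pass.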


          System Firmware


          Note: Your SGI Altix ICE system comes preinstalled with the appropriate firmware. See your SGI field support person for any BMC, BIOS, and CMC firmware updates.


          The SGI Altix ICE system firmware software consists of the following components:

          sgi-ice-blade-bmc-1.43.5-1.x86_64.rpm

          Blade BMC firmware and update tool

          sgi-ice-blade-bios-2007.08.10-1.x86_64.rpm

          Blade BIOS image and update tool

          sgi-ice-cmc-0.0.11-2.x86_64.rpm

          CMC firmware and update tool

          BIOS Version Interrogation

          To identify the BIOS, you need both the version and the release date. You can get these using the dmidecode command. Log on to the node whose BIOS level you want to check and perform the following:

          # dmidecode -s bios-version; dmidecode -s bios-release-date

          BMC Revision Interrogation

          The BMC firmware revision can be retrieved using the ipmitool command. For example, if you are logged onto the r1lead rack leader controller, the following command gets the BMC firmware revision:

          # ipmitool -U ADMIN -P ADMIN -I lanplus -H r1i0n0-bmc bmc info | grep 'Firmware Revision' 
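To check more than one blade from the rack leader controller, the same ipmitool invocation can be repeated per node. The following is a sketch only: the function name bmc_revisions is hypothetical, and it assumes that every BMC is reachable as <node>-bmc with the same credentials used in the command above.

```shell
# Prints "<node>: <Firmware Revision line>" for each node name given.
# Assumes each BMC answers at <node>-bmc with the credentials shown above.
bmc_revisions() {
    for node in "$@"; do
        rev=$(ipmitool -U ADMIN -P ADMIN -I lanplus -H "$node-bmc" bmc info |
              grep 'Firmware Revision')
        printf '%s: %s\n' "$node" "$rev"
    done
}
```

For example, bmc_revisions r1i0n0 r1i0n1 r1i0n8 queries three blades in turn.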

          CMC Version Interrogation

          The CMC firmware version can be retrieved by running the version command on the CMC. For example, if you are logged onto the r1lead rack leader controller, the following command gets the CMC firmware version:

          # ssh root@r1i0-cmc version 

          InfiniBand Version Interrogation

          The ibstat command retrieves information for the InfiniBand links including the firmware version. The following command gets the InfiniBand firmware version:

          # ibstat | grep Firmware 

          Getting Firmware Information for All System Nodes

          The firmware_revs script on the system admin controller (admin node) collects the firmware information for all nodes in the SGI Altix ICE system, as follows:

          system-admin:~ # firmware_revs 
          BIOS versions:
          --------------
          admin: 6.00
          r1lead: 6.00
          service0: 6.00
          r1i0n0: 6.00
          r1i0n1: 6.00
          r1i0n8: 6.00
          r1i1n0: 6.00
          r1i1n1: 6.00
          r1i1n8: 6.00
          
          
          BIOS release dates:
          -------------------
          admin: 05/10/2007
          r1lead: 05/10/2007
          service0: 05/10/2007
          r1i0n0: 05/29/2007
          r1i0n1: 05/29/2007
          r1i0n8: 05/29/2007
          r1i1n0: 05/29/2007
          r1i1n1: 05/29/2007
          r1i1n8: 05/29/2007
          
          
          BMC versions:
          -------------
          admin: 1.31
          r1lead: 1.31
          service0: 1.31
          r1i0n0: 1.29
          r1i0n1: 1.29
          r1i0n8: 1.29
          r1i1n0: 1.29
          r1i1n1: 1.29
          r1i1n8: 1.29
          
          
          CMC versions:
          -------------
          r1i0c: 0.0.9pre10
          r1i1c: 0.0.9pre10
          
          
          Infiniband versions:
          --------------------
          r1lead: 4.7.600
          service0: 4.7.600
          r1i0n0: 1.2.0
          r1i0n0: 1.2.0
          r1i0n1: 1.2.0
          r1i0n1: 1.2.0
          r1i0n8: 1.2.0
          r1i0n8: 1.2.0
          r1i1n0: 1.2.0
          r1i1n0: 1.2.0
          r1i1n1: 1.2.0
          r1i1n1: 1.2.0
          r1i1n8: 1.2.0
          r1i1n8: 1.2.0
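The firmware_revs report is convenient for spotting nodes whose firmware differs from the rest. As a rough sketch, the hypothetical firmware_values filter below reads a report in the layout shown above and prints the distinct values found in one named section. It assumes that section headings end with a colon and that each entry has the form name: value.

```shell
# Reads firmware_revs output on stdin and prints the distinct values
# reported in the section named by $1 (for example "BMC versions").
firmware_values() {
    awk -v want="$1" '
        /:[ \t]*$/ {                      # a heading such as "BMC versions:"
            section = $0
            gsub(/^[ \t]+|:[ \t]*$/, "", section)
            next
        }
        section == want && NF == 2 { print $2 }
    ' | sort -u
}
```

For example, firmware_revs | firmware_values "BMC versions" prints one line per distinct BMC version. More than one line means that the nodes do not all run the same level, which may be expected when different node types are involved.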