Chapter 7. Troubleshooting and Diagnostics

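This chapter provides the following sections to help you troubleshoot your system:

  • Troubleshooting Chart

  • L1 Controller Error Messages

  • LED Status Indicators

  • SGI Electronic Support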

Troubleshooting Chart

Table 7-1 lists recommended actions for problems that can occur. To solve problems that are not listed in this table, use the SGI Electronic Support system or contact your SGI system support engineer (SSE). For more information about the SGI Electronic Support system, see “SGI Electronic Support”.

Table 7-1. Troubleshooting Chart

Problem Description

Recommended Action

The system will not power on.

Ensure that the power cords of the IRU are seated properly in the power receptacles.

Ensure that the PDU circuit breakers are on and that the PDU is properly connected to the wall source.

If the power cord is plugged in and the circuit breaker is on, contact your SSE.

An individual IRU will not power on.

Ensure the power cables of the IRU are plugged in.

View the L1 display; see Table 7-2 if an error message is present.

If the L1 controller is not running, contact your SSE.

The system will not boot the operating system.

Ensure that the IA/IA2 (base I/O) blade that houses the system disk(s) is properly seated in the IRU. If the system still does not boot, contact your SSE.

The Service Required LED illuminates on an IRU.

View the L1 display of the failing IRU; see Table 7-2 for a description of the error message.

The Failure LED illuminates on an IRU.

View the L1 display of the failing IRU; see Table 7-2 for a description of the error message.

The green or yellow LED of a NUMAlink port is not illuminated.

Ensure that the NUMAlink cable is seated properly on both ends.

The PWR LED of a populated PCI slot is not illuminated.

Reseat the PCI card. Check to make sure the blade is seated fully in the IRU.

The Fault LED of a populated PCI slot is illuminated (on).

Reseat the PCI card. Check to make sure the blade is seated properly in the IRU. If the fault LED remains on, replace the PCI card.

The amber LED of a disk drive is on.

Replace the disk drive.


L1 Controller Error Messages

Table 7-2 lists error messages that the L1 controller generates and displays on the L1 display. This display is located on the front of the IRU.


Note: In Table 7-2, a voltage warning occurs when a supplied level of voltage is below or above the nominal (normal) voltage by 10 percent. A voltage fault occurs when a supplied level is below or above the nominal voltage by 20 percent.
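
The warning and fault thresholds in the note above can be expressed as a short calculation. The following Python sketch is illustrative only and is not part of the L1 controller firmware. The function name classify_voltage is hypothetical, and the sketch assumes that a deviation of exactly 10 or 20 percent already counts as a warning or fault.

```python
def classify_voltage(measured: float, nominal: float) -> str:
    """Classify a supplied voltage against its nominal level.

    Assumption: a deviation of at least 10 percent is a warning and a
    deviation of at least 20 percent is a fault, in either direction.
    """
    if nominal <= 0:
        raise ValueError("nominal voltage must be positive")
    # Relative deviation from nominal, regardless of direction (high or low).
    deviation = abs(measured - nominal) / nominal
    if deviation >= 0.20:
        return "fault"
    if deviation >= 0.10:
        return "warning"
    return "ok"
```

For example, with a nominal level of 3.3 V, a reading of 3.7 V (about 12 percent high) is classified as a warning, and a reading of 2.5 V (about 24 percent low) is classified as a fault.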


Table 7-2. L1 Controller Messages

L1 System Controller Message

Message Meaning and Action Needed

Internal voltage messages:

 

ATTN: <power VRM description> high fault limit reached @ x.xxV

The IRU begins a 30-second power-off sequence.

ATTN: <power VRM description> low fault limit reached @ x.xxV

The IRU begins a 30-second power-off sequence.

ATTN: <power VRM description> high warning limit reached @ x.xxV

A higher than nominal voltage condition is detected.

ATTN: <power VRM description> low warning limit reached @ x.xxV

A lower than nominal voltage condition is detected.

ATTN: <power VRM description> level stabilized @ x.xxV

A monitored voltage level has returned to within acceptable limits.

Fan messages:

 

ATTN: FAN <fan description> fault limit reached @ xx RPM

A fan has reached its maximum RPM level. The ambient temperature may be too high. Check to see if a fan has failed.

ATTN: FAN <fan description> warning limit reached @ xx RPM

A fan has increased its RPM level. Check the ambient temperature. Check to see if the fan stabilizes.

ATTN: FAN <fan description> stabilized @ xx RPM

An increased fan RPM level has returned to normal.

Temperature messages: low alt.

ATTN: <temp sensor description> advisory temperature reached @ xxC xxF

The ambient temperature at the IRU's air inlet has exceeded 30 °C.

ATTN: <temp sensor description> critical temperature reached @ xxC xxF

The ambient temperature at the IRU's air inlet has exceeded 35 °C.

ATTN: <temp sensor description> fault temperature reached @ xxC xxF

The ambient temperature at the IRU's air inlet has exceeded 40 °C.

Temperature messages: high alt.

 

ATTN: <temp sensor description> advisory temperature reached @ xxC xxF

The ambient temperature at the IRU's air inlet has exceeded 27 °C.

ATTN: <temp sensor description> critical temperature reached @ xxC xxF

The ambient temperature at the IRU's air inlet has exceeded 31 °C.

ATTN: <temp sensor description> fault temperature reached @ xxC xxF

The ambient temperature at the IRU's air inlet has exceeded 35 °C.

Temperature stable message:

 

ATTN: <temp sensor description> stabilized @ xxC/xxF

The ambient temperature at the IRU's air inlet has returned to an acceptable level.

Power-off messages:

 

Auto power down in xx seconds

The L1 controller has registered a fault and is shutting down. The message displays every five seconds until shutdown.

IRU appears to have been powered down

The L1 controller has registered a fault and has shut down.
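
If you record L1 display messages in a log, the message patterns in Table 7-2 can be sorted by category and severity with a small script. The following Python sketch is one possible approach and is not an SGI tool. The names L1Event and parse_l1_message are hypothetical, and the sample descriptions used with it ("FAN 1", "1.5V VRM", "Inlet temp") are placeholders, because the actual text of the <... description> fields is not specified in this chapter.

```python
import re
from typing import NamedTuple, Optional


class L1Event(NamedTuple):
    category: str  # "voltage", "fan", "temperature", or "power-off"
    severity: str  # "fault", "critical", "warning", "advisory", or "stabilized"
    source: str    # the <... description> field, or "L1" for power-off messages
    reading: str   # text after the "@" sign, empty for power-off messages


# Matches the "ATTN: <description> <event> @ <reading>" patterns of Table 7-2.
_ATTN = re.compile(
    r"^ATTN:\s+(?P<source>.+?)\s+"
    r"(?P<event>(?:high |low )?(?:fault|warning) limit reached"
    r"|(?:advisory|critical|fault) temperature reached"
    r"|level stabilized|stabilized)"
    r"\s+@\s+(?P<reading>.+)$"
)


def parse_l1_message(line: str) -> Optional[L1Event]:
    """Return an L1Event for a recognized message, or None otherwise."""
    # Collapse whitespace so a message wrapped across two lines still matches.
    line = " ".join(line.split())
    match = _ATTN.match(line)
    if match:
        source, event, reading = match.group("source", "event", "reading")
        if source.startswith("FAN "):
            category = "fan"
        elif "temperature" in event or reading.endswith("F"):
            category = "temperature"
        else:
            category = "voltage"
        for severity in ("fault", "critical", "warning", "advisory", "stabilized"):
            if severity in event:
                return L1Event(category, severity, source, reading)
    if line.startswith(("Auto power down in",
                        "IRU appears to have been powered down")):
        # Both power-off messages mean the L1 controller registered a fault.
        return L1Event("power-off", "fault", "L1", "")
    return None
```

For example, parse_l1_message("ATTN: FAN 1 fault limit reached @ 9000 RPM") returns an event with category "fan" and severity "fault". Lines that match none of the patterns return None, so they can be passed through unchanged.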


LED Status Indicators

There are a number of LEDs on the front of the IRUs that can help you detect, identify and potentially correct functional interruptions in the system. The following subsections describe these LEDs and ways to use them to understand potential problem areas.

IRU Power Supply LEDs

Each power supply installed in an IRU has a single bi-color (green/amber) status LED. The LED lights or flashes green or amber (yellow) to indicate the status of the individual supply. See Table 7-3 for a complete list of states.

Table 7-3. Power Supply LED States

Power supply status                                Green LED   Amber LED
No AC power to the supply                          Off         Off
Power supply has failed                            Off         On
Power supply problem warning                       Off         Blinking
AC available to supply (standby) but IRU is off    Blinking    Off
Power supply on (IRU on)                           On          Off
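
The LED combinations in Table 7-3 amount to a simple lookup. The following Python sketch shows one way to encode them, for example in a site-specific checklist script. The names POWER_SUPPLY_STATES and power_supply_status are hypothetical, and combinations that Table 7-3 does not list are reported as unknown rather than guessed.

```python
# Keys are (green LED state, amber LED state), taken from Table 7-3.
POWER_SUPPLY_STATES = {
    ("off", "off"): "No AC power to the supply",
    ("off", "on"): "Power supply has failed",
    ("off", "blinking"): "Power supply problem warning",
    ("blinking", "off"): "AC available to supply (standby) but IRU is off",
    ("on", "off"): "Power supply on (IRU on)",
}


def power_supply_status(green: str, amber: str) -> str:
    """Map the observed LED states to the status listed in Table 7-3."""
    key = (green.strip().lower(), amber.strip().lower())
    # Combinations not listed in the table are not interpreted here.
    return POWER_SUPPLY_STATES.get(key, "Unknown LED combination")
```

For example, power_supply_status("Blinking", "Off") returns "AC available to supply (standby) but IRU is off".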


IRU NUMAlink Router Port LEDs

Each IRU supports a total of four external NUMAlink connectors (located on the front of the unit). Each of these connectors has two status LEDs (one green and one amber).

  • The amber LED illuminates to indicate that both the Altix 450 IRU NUMAlink connector and the module to which it is connected are powered on.

  • The green LED illuminates when a link has been established between the Altix 450 NUMAlink connector and the module to which it is connected.

If both LEDs are dark, check the connections at both ends of the NUMAlink cable to ensure that they are firmly seated, and confirm that both units connected by the cable are powered on.
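
The meaning of the two port LEDs can be summarized as a small decision function. The following Python sketch is an illustration only; the function name numalink_port_advice is hypothetical. The case in which the green LED is on while the amber LED is off is not described in this chapter, so the sketch flags it as unexpected instead of interpreting it.

```python
def numalink_port_advice(amber_on: bool, green_on: bool) -> str:
    """Suggest an action from the two status LEDs of a NUMAlink port."""
    if amber_on and green_on:
        return "Both ends are powered on and a link is established."
    if amber_on:
        # Powered at both ends, but the link has not come up.
        return ("Both ends are powered on but no link is established; "
                "ensure that the NUMAlink cable is seated properly on both ends.")
    if green_on:
        # Assumption: this combination is not documented, so it is only flagged.
        return "Unexpected LED combination; contact your SSE."
    return ("Check that the NUMAlink cable is firmly seated at both ends "
            "and that both connected units are powered on.")
```

For example, numalink_port_advice(amber_on=True, green_on=False) points to the cable seating, which matches the corresponding entry in Table 7-1.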

Compute/Memory Blade LEDs

Each compute/memory blade installed in an IRU has a total of eight LED indicators, arranged in two rows of four behind the perforated sheet metal of the blade:

  • One green LED shows power-on complete status for the blade.

  • One red LED shows power failure or bad voltage status within the blade.

  • Two green NUMAlink indicators show NI0 and NI1 connection status between the blade and the router board within the IRU. Constant green is a good connection.

  • Four amber heartbeat LEDs indicate compute activity (the LEDs light up according to the number, activity, and type of processors installed in the blade). If the IRU is fully powered on and booted and none of the amber LEDs are lit, there is most likely a problem with the compute/memory blade. Try reseating the blade in the slot, and confirm that the two green NUMAlink status LEDs are on. If there is no LED activity on the blade, it must be replaced.

    Figure 7-1. Compute Blade Status LED Locations

    Compute Blade Status LED Locations
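
The blade LED checks described above can be written down as a short checklist function. The following Python sketch is a minimal illustration, not an SGI diagnostic. The names BladeLeds and blade_advice are hypothetical, and the order of the checks is an assumption based on the order in which the text describes them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BladeLeds:
    power_ok: bool                           # green power-on complete LED
    power_fault: bool                        # red power failure or bad voltage LED
    numalink: Tuple[bool, bool]              # green NI0 and NI1 connection LEDs
    heartbeat: Tuple[bool, bool, bool, bool] # four amber heartbeat LEDs


def blade_advice(leds: BladeLeds, iru_booted: bool) -> List[str]:
    """Return the checks suggested by the observed blade LED states."""
    advice: List[str] = []
    if leds.power_fault:
        advice.append("Red LED lit: power failure or bad voltage within the blade.")
    # The heartbeat check only applies once the IRU is fully powered on and booted.
    if iru_booted and not any(leds.heartbeat):
        advice.append("No heartbeat LEDs lit: try reseating the blade in the slot.")
        if not all(leds.numalink):
            advice.append("Confirm that both green NUMAlink LEDs (NI0, NI1) are on.")
        if not (leds.power_ok or leds.power_fault or any(leds.numalink)):
            advice.append("No LED activity on the blade: it must be replaced.")
    return advice
```

A blade with its power LED, both NUMAlink LEDs, and at least one heartbeat LED lit produces an empty list. A blade with no LEDs lit in a booted IRU produces the reseat, NUMAlink, and replacement entries in that order.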

SGI Electronic Support

SGI Electronic Support provides system support and problem-solving services that function automatically, which helps resolve problems before they can affect system availability or develop into actual failures. SGI Electronic Support integrates several services so they work together to monitor your system, notify you if a problem exists, and search for solutions to problems.

Figure 7-2 shows the sequence of events that occurs if you use all of the SGI Electronic Support capabilities.

Figure 7-2. Full Support Sequence

Full Support Sequence

The sequence of events can be described as follows:

  1. Embedded Support Partner (ESP) monitors your system 24 hours a day.

  2. When a specified system event is detected, ESP notifies SGI via e-mail (plain text or encrypted).

  3. Applications that are running at SGI analyze the information, determine whether a support case should be opened, and open a case if necessary. You and SGI support engineers are contacted (via pager or e-mail) with the case ID and problem description.

  4. SGI Knowledgebase searches thousands of tested solutions for possible fixes to the problem. Solutions that are located in SGI Knowledgebase are attached to the service case.

  5. Using Supportfolio Online, you and the SGI support engineers can view and manage the case, search for additional solutions, or schedule maintenance.

  6. Implement the solution.

Most of these actions occur automatically, and you may receive solutions to problems before they affect system availability. You also may be able to return your system to service sooner if it is out of service.

In addition to the event monitoring and problem reporting, SGI Electronic Support monitors both system configuration (to help with asset management) and system availability and performance (to help with capacity planning).

The following three components compose the integrated SGI Electronic Support system:

  • SGI Embedded Support Partner (ESP) is a set of tools and utilities that are embedded in the SGI Linux ProPack release. ESP can monitor a single system or group of systems for system events, software and hardware failures, availability, performance, and configuration changes, and then perform actions based on those events. ESP can detect system conditions that indicate potential problems, and then alert appropriate personnel by pager, console messages, or e-mail (plain text or encrypted). You also can configure ESP to notify an SGI call center about problems; ESP then sends e-mail to SGI with information about the event.

  • SGI Knowledgebase is a database of solutions to problems and answers to questions that can be searched by sophisticated knowledge management tools. You can log on to SGI Knowledgebase at any time to describe a problem or ask a question. Knowledgebase searches thousands of possible causes, problem descriptions, fixes, and how-to instructions for the solutions that best match your description or question.

  • Supportfolio Online is a customer support resource that includes the latest information about patch sets, bug reports, and software releases.

The complete SGI Electronic Support services are available to customers who have a valid SGI Warranty, FullCare, FullExpress, or Mission-Critical support contract. To purchase a support contract that allows you to use the complete SGI Electronic Support services, contact your SGI sales representative. For more information about the various support contracts, see the following Web page:

http://www.sgi.com/support/customerservice.html

For more information about SGI Electronic Support, see the following Web page:

http://www.sgi.com/support/es