Chapter 12. Troubleshooting

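This chapter provides information about the following:

  • “Best Practices”

  • “Recovery from a clulockd Failure”

  • “Watchdog Errors”

  • “Shared Quorum Partitions”

  • “Serial Cable or Reset issues”

  • “Failed State for a Service”

  • “Error Messages”

  • “Reporting Problems to SGI”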

Best Practices

If you run into problems, do the following:

  • Check the messages in /var/log/messages (see “Message Logging” in Chapter 5)

  • Use shutil(8) to see if shared quorum partitions are accessible

  • Use clufence(8) to check the status of the reset cable

  • Verify that the failover domain is defined correctly
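As a first pass over the log check in the list above, the following minimal sketch filters a log for lines written by the cluster daemons named in this chapter. The function name cluster_log_lines is hypothetical, and the daemon list (clulockd, clusvcmgrd, clumembd) is an assumption drawn from the examples in this chapter; extend it as needed.

```shell
# Hypothetical helper: print only the log lines written by the cluster
# daemons named in this chapter. Reads log text on standard input.
cluster_log_lines() {
    grep -E '(clulockd|clusvcmgrd|clumembd)\[[0-9]+\]:'
}

# Typical use on a member (not run here):
#   cluster_log_lines < /var/log/messages
```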

Recovery from a clulockd Failure

If the clulockd daemon dies unexpectedly, it freezes all of the locks on the shared quorum partition. clulockd will write a message similar to the following in the logs:

Feb  6 17:25:14 3U:nygaard clulockd[6924]:  Signal 11 received; freezing

The clusvcmgrd daemon will not be able to monitor, start, or stop services. Logs on all members will contain messages such as the following:

Feb  6 17:14:48 2U:dahl clusvcmgrd[3255]:  Couldn't connect to member #0: Connection timed out 
Feb  6 17:14:48 3U:dahl clusvcmgrd[3255]:  Unable to obtain cluster lock: No locks available

To recover from this situation, do the following:

  1. Stop cluster daemons on both members.

  2. Reinitialize the shared state from one member in the cluster:

    shutil -i

  3. Make sure that /etc/cluster.xml is the same on both members.

  4. Initialize the configuration on the shared quorum partition from one member in the cluster:

    shutil -s /etc/cluster.xml

  5. Verify that the configuration has been initialized correctly from one member in the cluster:

    shutil -p /cluster/config.xml

For more information, see the shutil(8) man page.
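Steps 2, 4, and 5 of the procedure above could be scripted roughly as follows. This is a sketch, not a supported tool: the run wrapper, the DRY_RUN switch, and the function name recover_shared_state are hypothetical. Steps 1 and 3 are left as manual prerequisites because they involve both members.

```shell
# Sketch of steps 2, 4, and 5. Run on ONE member only, and only after the
# cluster daemons are stopped on both members (step 1) and /etc/cluster.xml
# has been confirmed identical on both members (step 3).
# DRY_RUN is a hypothetical safety switch: unless it is set to 0, the
# commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

recover_shared_state() {
    run shutil -i                     || return 1   # step 2: reinitialize shared state
    run shutil -s /etc/cluster.xml    || return 1   # step 4: write the configuration
    run shutil -p /cluster/config.xml || return 1   # step 5: verify the result
}
```

Calling recover_shared_state with DRY_RUN unset only prints the three commands, which lets you review them before setting DRY_RUN=0.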

Watchdog Errors

Software and hardware watchdog timers are not supported. If a watchdog has been enabled on a member, you may see the following errors when the cluster daemons are starting:

Creating /dev/watchdog: execvp: No such file or directory
^[[FAILED]
Loading Watchdog Timer (softdog): modprobe: Can't locate module softdog
^[[FAILED]

You may also see a message similar to the following in the Cluster Manager log:

clumembd[21355]:  clumembd_sw_watchdog_stop: watchdog is not running.

To disable the software watchdog on a member, enter the following:

sgicm-cluster-manager-cmd --member=member_name --watchdog=no 

For example:

# sgicm-cluster-manager-cmd --member=member1 --watchdog=no 

Shared Quorum Partitions

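This section discusses the following:

  • “Verify Accessibility”

  • “Read the Configuration File”

  • “Verify Metadata Information is Consistent”

  • “Write the Configuration File”

  • “Displaying Metadata Remotely”

  • “Last Resort: Clear Information”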

For more information, see the shutil(8) man page.

Verify Accessibility

To see if shared quorum partitions are accessible, enter the following:

shutil

Read the Configuration File

To read the configuration file from the shared quorum partition, enter the following:

shutil -r -

Use this command to compare the configuration file stored in the shared quorum partitions with the local copy.
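One way to automate the comparison is sketched below. It assumes that shutil -r - writes the stored configuration to standard output (the - argument suggests this, but confirm it in the shutil(8) man page) and that the local copy is /etc/cluster.xml. The function name same_as_file is hypothetical.

```shell
# Hypothetical helper: succeed if the text on standard input is
# byte-for-byte identical to the file named by $1.
same_as_file() {
    cmp -s - "$1"
}

# Typical use on a member (not run here), assuming "shutil -r -" prints the
# stored configuration on standard output:
#   shutil -r - | same_as_file /etc/cluster.xml && echo "in sync"
```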

Verify Metadata Information is Consistent

To verify that the service metadata information is the same on all members, run the following command at the same time on each member:

shutil -m /service/0/status

For example, the following output from member jackhammer and member jackhammer2 indicates a problem:

  • jackhammer output:

    # shutil -m /service/0/status
    Metadata information for /service/0/status
    
    Data Length:   40 bytes
    Data CRC:      0x2dae1205
    Header CRC:    0x7c7185f1
    Last modified: 12:34:58 Mar 31 2004

  • jackhammer2 output:

    # shutil -m /service/0/status
    Metadata information for /service/0/status
    
    Data Length:   40 bytes
    Data CRC:      0x80711487
    Header CRC:    0x9ba9e2cf
    Last modified: 12:34:51 Mar 31 2004

In this case, the service metadata information reported by the two members is inconsistent: the Data CRC, the Header CRC, and the Last modified time stamp all differ. The information must be identical on all members.
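If you save the output from each member to a file, the comparison can be sketched as follows. The function names metadata_fields and compare_metadata are hypothetical, and the field labels are taken from the sample output above; the sketch assumes that output format is stable.

```shell
# Hypothetical helper: reduce "shutil -m" output to the fields that must
# match across members, with leading indentation removed.
metadata_fields() {
    grep -E '^[[:space:]]*(Data Length|Data CRC|Header CRC|Last modified):' |
        sed 's/^[[:space:]]*//'
}

# Compare two saved outputs ($1 and $2). Prints "consistent" and returns 0
# if the fields match; prints "inconsistent" and returns 1 otherwise.
compare_metadata() {
    a=$(metadata_fields < "$1")
    b=$(metadata_fields < "$2")
    if [ "$a" = "$b" ]; then
        echo "consistent"
    else
        echo "inconsistent"
        return 1
    fi
}
```

For the jackhammer and jackhammer2 outputs shown above, compare_metadata reports the files as inconsistent because the CRC values and time stamps differ.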

Write the Configuration File

To write the configuration file, use the following command:

shutil -s /etc/cluster.xml

You should use this command if one of the following is true:

  • The configuration file in the shared quorum partitions is not consistent with the /etc/cluster.xml file

  • The shared quorum partition was cleared using the shutil -i command

Displaying Metadata Remotely

To display the metadata information from the shared quorum partition, use the following command:

shutil -p /service/0/status

Last Resort: Clear Information


Caution: Do not run this command while the cluster is enabled.

To clear all cluster information, use the following command:

shutil -i

Serial Cable or Reset issues

The clufence command will fail with a nonzero error code for any of the following reasons:

  • The serial cable is not connected

  • The cable is faulty

  • The system controller is not responding

  • The tty device is not available because the serial port driver (ioc4_serial) is not loaded

The messages shown in the following output are also logged to /var/log/messages:

# clufence -s jackhammer2
[12314] info: STONITH: Power controller l2 connected to peer's /dev/ttyIOC1 controls jackhammer
[12314] info: STONITH: Power controller l2 connected to peer's /dev/ttyIOC1 controls jackhammer2
[12314] err: STONITH: Device at /dev/ttyIOC1 controlling jackhammer2 FAILED status check:
Timed out
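Because clufence signals these failures through its exit code, the status check can be wrapped in a small loop over members. The sketch below is an illustration only: the function name check_fencing and the CLUFENCE override variable are hypothetical, and the command output is discarded here because the same messages are logged to /var/log/messages.

```shell
# Hypothetical helper: run the clufence status check for each named member
# and report any that return a nonzero exit code. CLUFENCE may be set to
# another command for testing; it defaults to the real clufence.
check_fencing() {
    failed=0
    for member in "$@"; do
        if "${CLUFENCE:-clufence}" -s "$member" >/dev/null 2>&1; then
            echo "$member: status check passed"
        else
            echo "$member: status check FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Typical use (not run here):
#   check_fencing jackhammer2
```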

Failed State for a Service

The following output indicates that the action to disable a service (in this case, nfs_samba) has failed and that the service has been moved to the failed state:

# clusvcadm -d nfs_samba
Member machine1 disabling nfs_samba...failed
Service nfs_samba might be running in the cluster. Stop the service manually.

To recover, do the following:

  • Fix the problem.

  • Stop the resources in the service manually using the ifconfig(8), exportfs(8), and umount(8) commands.

  • Disable the service using the clusvcadm(8) command or the SGI Cluster Manager GUI.


    Note: SGI Cluster Manager does not verify that the service has been stopped before disabling.


For more information, see “Service Administration” in Chapter 5 and “Cluster Service States” in Chapter 5.

Error Messages

Following are common error messages.

[12314] err: STONITH: Device at /dev/ttyIOC1 controlling jackhammer2 FAILED status check:
Timed out

There is a problem with the serial cable or system controller. See “Serial Cable or Reset issues”.

clumembd[8431]:  No heartbeat channels available!
clumembd[8431]:  Heartbeat failed to initialize!

These messages (logged by the clumembd process when the local cluster manager daemons are started using the GUI or the command line) mean that the IP address for the hostname could not be determined or that the IP address assigned to the hostname is invalid. You can verify this by sending ping packets to the local machine's hostname. Fix the hostname IP address and restart the local cluster daemons.
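The ping check described above can be sketched as follows. The function name check_local_hostname and the PING override variable are hypothetical; the sketch assumes that uname -n reports the hostname the cluster uses and that ping accepts the -c count option, as it does on Linux.

```shell
# Hypothetical helper: send one ping packet to this machine's own hostname
# (or to the name given as $1). A failure suggests that the hostname does
# not resolve to a valid, reachable IP address. PING may be set to another
# command for testing; it defaults to the real ping.
check_local_hostname() {
    host=${1:-$(uname -n)}
    if "${PING:-ping}" -c 1 "$host" >/dev/null 2>&1; then
        echo "$host: reachable"
    else
        echo "$host: NOT reachable; check the IP address for this hostname"
        return 1
    fi
}
```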

Shared partition device file names must be defined.

An attempt was made to define the cluster before defining the shared state. Use the --sharedstate command line option or the shared state GUI menu to define the devices. See “Step 1: Define the Shared Quorum Partitions” in Chapter 4.

Shared partition device file names primary /dev/shared1 and shadow /dev/shared2 are not valid.

Shared storage initialization failed.
Fix shared storage and write configuration file to shared storage.
Continuing ...

A configuration change or query was made using the CLI while the shared quorum partitions were inaccessible or invalid.

Traceback (most recent call last):
   File "/usr/sbin/sgicm-config-cluster", line 47, in ?
     from clusterpkg.cluconfig_module  import cluconfig
   File "/usr/share/sgicm-config-cluster/configure/clusterpkg/cluconfig_module.py", line 2, in ?
     from clusterpkg.cluster_module    import cluster
   File "/usr/share/sgicm-config-cluster/configure/clusterpkg/cluster_module.py", line 2, in ?
     from xml.dom import minidom
 ImportError: No module named xml.dom

These messages will occur if you try to run SGI Cluster Manager without first installing the appropriate packages from the SUSE LINUX Enterprise Server 9 CDs. See the README file for a list of the RPMs.

Reporting Problems to SGI

If you encounter problems, collect the following data from each member:

  • Output from the following commands:

    exportfs   (in NFS configurations)
    chkconfig --list
    clufence -s other_members
    clustat
    cxfsdump    (in CXFS configurations)
    hwinfo
    ls -l each_shared_quorum_partition
    ls -lL each_shared_quorum_partition
    mount
    ps -ef | grep clu
    rpm -qa --last
    shutil -r -
    uname -a

  • Contents of the following files:

    /etc/cluster.xml
    /usr/lib/clumanager/create_device_links
    /var/log/messages
    /etc/samba/smb.conf.SambaShareName  (in Samba configurations)
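Part of this collection can be scripted. The sketch below gathers only the command outputs that need no site-specific arguments, plus two of the listed files; the clufence and ls commands are omitted because they require member and partition names, and exportfs and cxfsdump are omitted because they apply only to some configurations. The function names save_output and collect_cluster_data and the output file layout are hypothetical.

```shell
# Hypothetical helper: run a command and save its output (stdout and stderr)
# to OUTDIR/NAME.txt. A failing command is recorded rather than treated as
# fatal, so that one missing tool does not stop the collection.
save_output() {
    outdir=$1; name=$2; shift 2
    "$@" > "$outdir/$name.txt" 2>&1 ||
        echo "exit status: $?" >> "$outdir/$name.txt"
}

# Collect outputs into the directory named by $1. Run on each member.
collect_cluster_data() {
    dest=$1
    mkdir -p "$dest" || return 1
    save_output "$dest" chkconfig chkconfig --list
    save_output "$dest" clustat   clustat
    save_output "$dest" hwinfo    hwinfo
    save_output "$dest" mount     mount
    save_output "$dest" ps        sh -c 'ps -ef | grep clu'
    save_output "$dest" rpm       rpm -qa --last
    save_output "$dest" shutil    shutil -r -
    save_output "$dest" uname     uname -a
    # Copy the listed files that exist on every member; errors are ignored.
    cp /etc/cluster.xml /var/log/messages "$dest"/ 2>/dev/null
    return 0
}

# Typical use (not run here):
#   collect_cluster_data /tmp/cluster-report-$(uname -n)
```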