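This chapter provides information about the following: general troubleshooting steps, recovery after a clulockd failure, watchdog errors, using shutil(8), clufence failures, failed service disable actions, common error messages, and the data to collect when reporting a problem.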
If you run into problems, do the following:
Check the messages in /var/log/messages (see “Message Logging” in Chapter 5)
Use shutil(8) to see if shared quorum partitions are accessible
If the clulockd daemon dies unexpectedly, it freezes all of the locks on the shared quorum partition. clulockd will write a message similar to the following in the logs:
Feb 6 17:25:14 3U:nygaard clulockd[6924]: Signal 11 received; freezing
The clusvcmgrd daemon will not be able to monitor, start, or stop services. Logs on all members will have a message such as the following:
Feb 6 17:14:48 2U:dahl clusvcmgrd[3255]: Couldn't connect to member #0: Connection timed out
Feb 6 17:14:48 3U:dahl clusvcmgrd[3255]: Unable to obtain cluster lock: No locks available
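The two log signatures above can be searched for mechanically. The following is a minimal sketch, assuming the message format shown in the examples. The function name find_lock_failures is hypothetical and is not part of SGI Cluster Manager, and real messages may differ in detail.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that match the clulockd freeze
# message or the clusvcmgrd lock failure shown in the examples above.
# The patterns are derived from those sample messages only.
find_lock_failures() {
    logfile=${1:-/var/log/messages}
    grep -E 'clulockd\[[0-9]+\]: Signal [0-9]+ received; freezing|clusvcmgrd\[[0-9]+\]: Unable to obtain cluster lock' "$logfile"
}

# Usage: find_lock_failures /var/log/messages
```

The function returns a nonzero status when no matching line is found, so it can be used directly in a shell conditional.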
To recover from this situation, do the following:
Stop cluster daemons on both members.
Reinitialize the shared state from one member in the cluster:
shutil -i
Initialize the configuration on the shared quorum partition from one member in the cluster:
shutil -s /etc/cluster.xml
Verify that the configuration has been initialized correctly from one member in the cluster:
shutil -p /cluster/config.xml
For more information, see the shutil(8) man page.
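The shutil steps of the recovery procedure above can be sketched as a small script. This is an illustration under stated assumptions, not a supported tool: the function names and the DRY_RUN switch are hypothetical, stopping the cluster daemons on both members is left as a manual step, and the script must be run on one member only.

```shell
#!/bin/sh
# Sketch of the shutil recovery sequence described above.
# Run on ONE member only, after the cluster daemons have been stopped
# on both members (that step is not automated here).
# DRY_RUN is a hypothetical safety switch: by default commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_shared_state() {
    run shutil -i || return 1                      # reinitialize the shared state
    run shutil -s /etc/cluster.xml || return 1     # write the configuration to the shared quorum partition
    run shutil -p /cluster/config.xml || return 1  # verify the stored configuration
}

# Usage: recover_shared_state   (prints the commands; set DRY_RUN=0 to execute them)
```

Stopping at the first failing command avoids writing a configuration onto shared state that was not reinitialized successfully.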
Software and hardware watchdog timers are not supported. If a watchdog has been enabled on a member, you may see the following errors when the cluster daemons are starting:
Creating /dev/watchdog: execvp: No such file or directory ^[[FAILED]
Loading Watchdog Timer (softdog): modprobe: Can't locate module softdog ^[[FAILED]
You may also see a message similar to the following in the Cluster Manager log:
clumembd[21355]: clumembd_sw_watchdog_stop: watchdog is not running.
To disable the software watchdog on a member, enter the following:
sgicm-cluster-manager-cmd --member=member_name --watchdog=no
For example:
# sgicm-cluster-manager-cmd --member=member1 --watchdog=no
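This section discusses how to use shutil(8) to read the configuration file, verify the service metadata, and write the configuration file.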
For more information, see the shutil(8) man page.
To read the configuration file from the shared quorum partition, enter the following:
shutil -r -
Use this command to compare the configuration file stored in the shared quorum partitions with the local copy (/etc/cluster.xml).
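The comparison described above can be sketched as a pipeline into diff. This sketch assumes that shutil -r - writes the stored configuration to standard output, which is what the trailing "-" suggests but is not confirmed here. The helper name config_matches is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compare a configuration read from standard input
# with a local file. Prints a unified diff and returns nonzero on mismatch.
config_matches() {
    local_copy=${1:-/etc/cluster.xml}
    diff -u "$local_copy" -
}

# Usage (assumes "shutil -r -" writes the configuration to standard output):
#   shutil -r - | config_matches /etc/cluster.xml && echo "in sync"
```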
To verify that the service metadata information is the same on all members, run the following command at the same time on each member:
shutil -m /service/0/status
For example, the following output from member jackhammer and member jackhammer2 indicates a problem:
jackhammer output:
# shutil -m /service/0/status
Metadata information for /service/0/status
Data Length: 40 bytes
Data CRC: 0x2dae1205
Header CRC: 0x7c7185f1
Last modified: 12:34:58 Mar 31 2004
jackhammer2 output:
# shutil -m /service/0/status
Metadata information for /service/0/status
Data Length: 40 bytes
Data CRC: 0x80711487
Header CRC: 0x9ba9e2cf
Last modified: 12:34:51 Mar 31 2004
In this case, the service metadata from the two members is inconsistent (the CRC values and the Last modified time stamps differ). The metadata must be identical on all members.
To write the configuration file, use the following command:
shutil -s /etc/cluster.xml
You should use this command if one of the following is true:
The configuration file in the shared quorum partitions is not consistent with the /etc/cluster.xml file
The shared quorum partition was cleared using the shutil -i command
The clufence command will fail with a nonzero error code for any of the following reasons:
The serial cable is not connected
The cable is faulty
The system controller is not responding
The tty device is not available because the serial port driver (ioc4_serial) is not loaded
The messages shown in the following output are also logged to /var/log/messages:
# clufence -s jackhammer2
[12314] info: STONITH: Power controller l2 connected to peer's /dev/ttyIOC1 controls jackhammer
[12314] info: STONITH: Power controller l2 connected to peer's /dev/ttyIOC1 controls jackhammer2
[12314] err: STONITH: Device at /dev/ttyIOC1 controlling jackhammer2 FAILED status check: Timed out
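Two of the causes listed above (the serial port driver is not loaded, or the tty device is not available) can be checked locally before running clufence. The following is a sketch under assumptions: the function names are hypothetical, the driver is assumed to appear in lsmod output under the name ioc4_serial, and /dev/ttyIOC1 is simply the device from the sample output. Cable and system controller faults cannot be detected this way.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named module appears in lsmod-style
# output on standard input (the module name is the first column).
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Hypothetical pre-check before running "clufence -s". The default device
# path is taken from the sample output and may differ on other systems.
check_serial_path() {
    tty=${1:-/dev/ttyIOC1}
    lsmod | module_listed ioc4_serial || { echo "ioc4_serial driver not loaded"; return 1; }
    [ -c "$tty" ] || { echo "$tty is not available"; return 1; }
}

# Usage: check_serial_path /dev/ttyIOC1 && clufence -s other_member
```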
The following output indicates that the action to disable a service (in this case, nfs_samba) has failed and that the service has been moved to the failed state:
# clusvcadm -d nfs_samba
Member machine1 disabling nfs_samba...failed
Service nfs_samba might be running in the cluster. Stop the service manually.
To recover, do the following:
Fix the problem.
Stop the resources in the service manually using the ifconfig(8), exportfs(8), and umount(8) commands.
Disable the service using the clusvcadm(8) command or the SGI Cluster Manager GUI.
Note: SGI Cluster Manager does not verify that the service has been stopped before disabling it.
For more information, see “Service Administration” in Chapter 5 and “Cluster Service States” in Chapter 5.
Following are common error messages.
[12314] err: STONITH: Device at /dev/ttyIOC1 controlling jackhammer2 FAILED status check: Timed out
There is a problem with the serial cable or system controller. See “Serial Cable or Reset issues”.
clumembd[8431]: No heartbeat channels available!
clumembd[8431]: Heartbeat failed to initialize!
These messages (logged by the clumembd process when the local cluster manager daemons are started using the GUI or the command line) mean that the IP address for the hostname could not be determined or that the IP address assigned to the hostname is invalid. You can verify this by sending ping packets to the local machine's hostname. Fix the hostname's IP address and restart the local cluster daemons.
Shared partition device file names must be defined.
An attempt was made to define the cluster before defining the shared state. Use the --sharedstate command line option or the shared state GUI menu to define the devices. See “Step 1: Define the Shared Quorum Partitions” in Chapter 4.
Shared partition device file names primary /dev/shared1 and shadow /dev/shared2 are not valid.
Shared storage initialization failed.
Fix shared storage and write configuration file to shared storage.
Continuing ...
The shared quorum partitions are not accessible or not valid and a configuration change or query was made using the CLI.
Traceback (most recent call last):
  File "/usr/sbin/sgicm-config-cluster", line 47, in ?
    from clusterpkg.cluconfig_module import cluconfig
  File "/usr/share/sgicm-config-cluster/configure/clusterpkg/cluconfig_module.py", line 2, in ?
    from clusterpkg.cluster_module import cluster
  File "/usr/share/sgicm-config-cluster/configure/clusterpkg/cluster_module.py", line 2, in ?
    from xml.dom import minidom
ImportError: No module named xml.dom
These messages will occur if you try to run SGI Cluster Manager without first installing the appropriate packages from the SUSE LINUX Enterprise Server 9 CDs. See the README file for a list of the RPMs.
If you encounter problems, collect the following data from each member:
Output from the following commands:
exportfs (in NFS configurations)
chkconfig --list
clufence -s other_members
clustat
cxfsdump (in CXFS configurations)
hwinfo
ls -l each_shared_quorum_partition
ls -lL each_shared_quorum_partition
mount
ps -ef | grep clu
rpm -qa --last
shutil -r -
uname -a
Contents of the following files:
/etc/cluster.xml
/usr/lib/clumanger/create_device_links
/var/log/messages
/etc/samba/smb.conf.SambaShareName (in Samba configurations)
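Collecting this data by hand on every member is error-prone, so part of it can be scripted. The following is a minimal sketch, not a supported tool: the function names and the output layout are hypothetical. The site-specific items (exportfs, cxfsdump, clufence -s for other members, the ls commands for each shared quorum partition, and the Samba file) are omitted and must be added for your configuration.

```shell
#!/bin/sh
# Hypothetical helper: run one command and save its output (stdout and
# stderr) to OUTDIR/NAME.txt. A missing command is recorded, not fatal.
save_output() {
    outdir=$1; name=$2; shift 2
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$outdir/$name.txt" 2>&1
    else
        echo "command not found: $1" > "$outdir/$name.txt"
    fi
    return 0
}

# Hypothetical wrapper covering the configuration-independent items above.
collect_member_data() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    save_output "$outdir" chkconfig chkconfig --list
    save_output "$outdir" clustat clustat
    save_output "$outdir" hwinfo hwinfo
    save_output "$outdir" mount mount
    save_output "$outdir" rpm rpm -qa --last
    save_output "$outdir" shutil shutil -r -
    save_output "$outdir" uname uname -a
    ps -ef | grep clu > "$outdir/ps.txt" 2>&1
    # Copy the listed files that do not depend on the configuration, if readable.
    for f in /etc/cluster.xml /var/log/messages; do
        [ -r "$f" ] && cp "$f" "$outdir/"
    done
    return 0
}

# Usage: collect_member_data "/tmp/cluster-data.$(hostname)"
```

Recording a missing command instead of aborting keeps the collection usable on members where some packages are not installed.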