This chapter explains how to test the IRIS FailSafe system configuration. The tests in each section in this chapter, except the last section, are performed when IRIS FailSafe software is not running. The last section describes how to test the running IRIS FailSafe software.
To test the serial connections between the IRIS FailSafe nodes, follow these steps:
If a remote power control unit is used, confirm that it is powered on by checking that the display light on the front of the box is lit green. (The section “Replacing Batteries in the Remote Power Control Unit” in Chapter 8 explains how to change the batteries.)
Enter this command on one node:
# /usr/etc/ha_spng -i 10 -f reset-tty
reset-tty is the value of the reset-tty parameter in the configuration file /var/ha/ha.conf.
Check the return value of the command by entering the first command if you are using csh and the second command if you are using sh:
# echo $status
# echo $?
If the return value is 0, the connection is good.
If the return value is 1, verify the cable connections of the serial cable from each node's serial port to the remote power control unit or the other node's system controller port.
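The return-value check above can be wrapped in a small helper. This is a minimal sketch that assumes a POSIX-style sh. The function name check_serial is hypothetical. The message for values other than 0 and 1 is also an assumption, because the steps above describe only those two results.

```sh
# check_serial: run the given command, discard its output, and report
# its exit status using the meanings given in the steps above.
# (Hypothetical helper; not part of IRIS FailSafe.)
check_serial() {
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    case $status in
        0) echo "connection good" ;;
        1) echo "check serial cable connections" ;;
        *) echo "unexpected return value" ;;   # not described in the text
    esac
}
```

On a node, it could be called as check_serial /usr/etc/ha_spng -i 10 -f reset-tty, with reset-tty replaced by the value from /var/ha/ha.conf.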
To test the private (heartbeat) network, follow these steps:
Enter this command on one node:
# /usr/etc/ping -r -c 3 priv-xfs-ha1
PING priv-xfs-ha1.eng.sgi.com (190.0.3.1): 56 data bytes
64 bytes from 190.0.3.1: icmp_seq=0 ttl=254 time=3 ms
64 bytes from 190.0.3.1: icmp_seq=1 ttl=254 time=2 ms
64 bytes from 190.0.3.1: icmp_seq=2 ttl=254 time=2 ms
priv-xfs-ha1 is the private IP address of the other node. Typical ping output, such as that shown, should appear.
If the ping command fails, verify that the private network interface has been configured up using the ifconfig command, for example:
# /usr/etc/ifconfig ec3
ec3: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,FILTMULTI,MULTICAST>
inet 190.0.3.1 netmask 0xffffff00 broadcast 190.0.3.255
The UP in the first line of output indicates that the interface is configured up.
If the ping command fails and the private network interface has been configured up, verify that the private network cables are connected properly.
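The check for UP in the ifconfig output can also be scripted. The following is a sketch only; the function name iface_is_up is hypothetical, and it assumes that the flags are printed in the flags=...<...> form shown in the example output above.

```sh
# iface_is_up: read ifconfig output on standard input and succeed only
# if a "flags=" line lists UP among the flags.
# (Hypothetical helper name; assumes the output format shown above.)
iface_is_up() {
    grep -q 'flags=.*[<,]UP[,>]'
}
```

For example, /usr/etc/ifconfig ec3 | iface_is_up && echo "ec3 is configured up" prints the message only when the interface is up.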
The procedure below describes how to test the public interfaces on each node. It uses this interface as an example:
node xfs-ha1
interface xfs-ha1-ec0
{
name = ec0
ip-address = xfs-ha1
netmask = 0xffffff00
broadcast-addr = 190.0.2.255
}
...
}
node xfs-ha2
...
interface-pair one {
primary-interface = xfs-ha1-ec0
secondary-interface = xfs-ha2-ec0
re-mac = false
netmask = 0xffffff00
broadcast-addr = 190.0.2.255
ip-aliases ( stocks )
}
Follow these steps:
To test the public network interfaces on the first node (xfs-ha1), enter the following command from a client:
# /usr/etc/ping -c 3 xfs-ha1
PING xfs-ha1.engr.sgi.com (190.0.2.1): 56 data bytes
64 bytes from 190.0.2.1: icmp_seq=0 ttl=254 time=3 ms
64 bytes from 190.0.2.1: icmp_seq=1 ttl=254 time=2 ms
64 bytes from 190.0.2.1: icmp_seq=2 ttl=254 time=2 ms
xfs-ha1 is an IP address for an interface on the node xfs-ha1.
Repeat step 1 for the remaining public network interfaces on xfs-ha1.
Repeat step 1 for all public interfaces of the other node in the cluster.
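When a cluster has many public interfaces, the repeated ping steps above could be run in a loop from the client. The sketch below assumes a POSIX-style sh. The function name ping_all and the PING variable are hypothetical; the -c 3 option is taken from the example above, and the list of interface names must be supplied by the caller.

```sh
# Path to ping; override PING to use a different location.
PING=${PING:-/usr/etc/ping}

# ping_all: ping each name given as an argument three times and print
# one result line per name. Returns nonzero if any ping failed.
# (Hypothetical helper; not part of IRIS FailSafe.)
ping_all() {
    failed=0
    for host in "$@"; do
        if "$PING" -c 3 "$host" >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return $failed
}
```

For example, ping_all xfs-ha1 xfs-ha2 checks one interface on each node in the sample configuration. Any name reported as FAILED should be investigated with the individual ping command.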
Follow the procedure below to verify that XLV logical volumes have been configured properly. It uses this portion of a configuration file as an example:
volume sharedsybase_vol
{
server-node = xfs-ha1
backup-node = xfs-ha2
devname = /dev/dsk/xlv/shared_sybase
devname-owner = sybase
devname-group = sybase
devname-mode = 0664
}
On a node that is a primary node for volumes (xfs-ha1 in this example), enter this command to stop the RAID agent if the cluster uses a CHALLENGE RAID storage system:
# /etc/init.d/raid5 stop
On the same node, enter the following commands to assemble the XLV logical volume sharedsybase_vol:
# xlv_mgr -c "change nodename xfs-ha1 shared_sybase"
set node name xfs-ha1 for object shared_sybase done
# xlv_assemble -l -s shared_sybase
VOL shared_sybase flags=0x1, [complete] (node=xfs-ha1)
DATA flags=0x0() open_flag=0x0() device=(192, 4)
PLEX 0 flags=0x0
VE 0 [active]
start=0, end=3583999, (cat)grp_size=1
/dev/dsk/dks5d1s0 (3584000 blks)
Repeat step 2 for each of the other volumes with the same primary node.
If you stopped the RAID agent in step 1, restart the RAID agent by entering this command:
# /etc/init.d/raid5 start
On the same node, list all of the XLV logical volumes on the node:
# ls -l /dev/dsk/xlv
total 0
brw-rw-r-- 1 sybase sybase 192, 4 May 22 11:18 shared_sybase
...
You should see all volumes that have this node listed as their server-node in the configuration file.
Enter this command to read ten blocks from one of the XLV logical volumes (for example /dev/dsk/xlv/shared_sybase) and discard them:
# dd if=/dev/dsk/xlv/shared_sybase of=/dev/null count=10
10+0 records in
10+0 records out
The output should match the output shown.
Repeat step 6 for every volume in the configuration file for which this node is the primary node.
If the other node serves as the primary node for any XLV logical volumes, repeat steps 1 through 7.
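To know which volumes should appear in the /dev/dsk/xlv listing for a node, the volume blocks of the configuration file can be scanned for a matching server-node. The following is a rough sketch, not a full parser. The function name volumes_for_node is hypothetical. It assumes that each parameter is on its own line, written as name = value with spaces around the equal sign, and that the closing brace of a volume block is on a line by itself, as in the example above.

```sh
# volumes_for_node: print the devname of every volume block whose
# server-node is the given node.
#   $1 = node name, $2 = configuration file
# (Hypothetical helper; assumes the "name = value" layout shown above.)
volumes_for_node() {
    awk -v node="$1" '
        $1 == "volume"                { in_vol = 1; server = ""; dev = ""; next }
        in_vol && $1 == "server-node" { server = $3 }
        in_vol && $1 == "devname"     { dev = $3 }
        in_vol && $1 == "}"           {
            if (server == node && dev != "") print dev
            in_vol = 0
        }
    ' "$2"
}
```

For example, volumes_for_node xfs-ha1 /var/ha/ha.conf prints full device paths such as /dev/dsk/xlv/shared_sybase, whose final components can be compared with the names listed by ls.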
The procedure below tests filesystems configured for IRIS FailSafe by executing the mount commands that the IRIS FailSafe software would execute. These filesystem and volume sections of a configuration file are used as an example:
filesystem shared1
{
mount-point = /shared1
mount-info
{
fs-type = xfs
volume-name = shared1_vol
mode = rw,noauto
}
}
volume shared1_vol
{
server-node = xfs-ha1
backup-node = xfs-ha2
devname = /dev/dsk/xlv/shared1_vol
devname-owner = root
devname-group = sys
devname-mode = 0600
}
For each filesystem listed in the configuration file, follow this procedure:
Identify the primary node for the filesystem by looking up the primary node (the server-node) of the XLV logical volume used by this filesystem. In the example above, volume-name is shared1_vol; look for the volume block with the label shared1_vol. Its server-node (primary node) is xfs-ha1.
On the primary node, check to see if the XLV logical volume device name exists:
# ls /dev/dsk/xlv/shared1_vol
If the device name doesn't exist, enter the following commands to assemble the XLV logical volume shared1_vol:
# xlv_mgr -c "change nodename xfs-ha1 shared1_vol"
set node name xfs-ha1 for object shared1_vol done
# xlv_assemble -l -s shared1_vol
VOL shared1_vol flags=0x1, [complete] (node=xfs-ha1)
DATA flags=0x0() open_flag=0x0() device=(192, 4)
PLEX 0 flags=0x0
VE 0 [active]
start=0, end=3583999, (cat)grp_size=1
/dev/dsk/dks5d1s0 (3584000 blks)
On the primary node, mount the filesystem using a mount command that mimics the mount command given by IRIS FailSafe:
# mount -txfs -o rw,noauto /dev/dsk/xlv/shared1_vol /shared1
The mount should be successful.
Unmount the filesystem:
# umount /shared1
On the secondary node, check to see if the XLV logical volume device name exists:
# ls /dev/dsk/xlv/shared1_vol
If the device name doesn't exist, enter the following commands on the secondary node to assemble the XLV logical volume shared1_vol:
# xlv_mgr -c "change nodename xfs-ha2 shared1_vol"
set node name xfs-ha2 for object shared1_vol done
# xlv_assemble -l -s shared1_vol
VOL shared1_vol flags=0x1, [complete] (node=xfs-ha2)
DATA flags=0x0() open_flag=0x0() device=(192, 4)
PLEX 0 flags=0x0
VE 0 [active]
start=0, end=3583999, (cat)grp_size=1
/dev/dsk/dks5d1s0 (3584000 blks)
Mount the filesystem on the secondary node by entering the command from step 4 on the secondary node:
# mount -txfs -o rw,noauto /dev/dsk/xlv/shared1_vol /shared1
Unmount the filesystem:
# umount /shared1
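The mount and unmount commands in this procedure follow directly from four configuration values: fs-type, mode, devname, and mount-point. As an illustration, the sketch below prints the command pair for one filesystem instead of running it, so that the commands can be reviewed first. The function name fs_test_cmds is hypothetical, and the values are passed by hand; nothing here reads the configuration file.

```sh
# fs_test_cmds: print the mount and umount commands used to test one
# filesystem, in the same form as the examples above.
#   $1 = fs-type, $2 = mode, $3 = devname, $4 = mount-point
# (Hypothetical helper; it only prints the commands.)
fs_test_cmds() {
    echo "mount -t$1 -o $2 $3 $4"
    echo "umount $4"
}
```

For the example configuration, fs_test_cmds xfs rw,noauto /dev/dsk/xlv/shared1_vol /shared1 prints the same two commands that are entered in the steps above.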
The procedure below tests NFS configuration by exporting filesystems manually and determining if a client can access them.
It uses this NFS entry in ha.conf as an example:
nfs shared1
{
filesystem = shared1
export-point = /shared1
export-info = rw
ip-address = 190.0.2.3
}
For each NFS block in ha.conf, follow these steps:
Mount /shared1 on either node as described in the section “Testing Filesystems.”
Make sure the IP address is configured by entering this command on the node where /shared1 was mounted:
# /usr/etc/ping -c 3 190.0.2.3
PING 190.0.2.3 (190.0.2.3): 56 data bytes
64 bytes from 190.0.2.3: icmp_seq=0 ttl=254 time=3 ms
64 bytes from 190.0.2.3: icmp_seq=1 ttl=254 time=2 ms
64 bytes from 190.0.2.3: icmp_seq=2 ttl=254 time=2 ms
From the node on which it is mounted, export the filesystem:
# exportfs -i -o rw /shared1
Make sure the filesystem was exported:
# exportfs
/shared1 -rw
Verify that you can mount the exported filesystem on a client by entering these commands from a client:
# mkdir /tempmount
# mount 190.0.2.3:/shared1 /tempmount
On the client, unmount the filesystem and remove the temporary directory:
# umount /tempmount
# rmdir /tempmount
From the node on which the filesystem is mounted, unexport it and unmount it in preparation for running this test from the other node:
# exportfs -u /shared1
# umount /shared1
Repeat steps 1 through 7 on the other node. Make sure you do not mount the filesystem simultaneously from both nodes.
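The "make sure the filesystem was exported" step can be checked mechanically by looking for the export point in the list that exportfs prints. This is a minimal sketch; the function name is_exported is hypothetical, and it assumes that each exported directory is the first field on its own output line, as in the example above.

```sh
# is_exported: read the exportfs listing on standard input and succeed
# only if the given export point appears as the first field of a line.
#   $1 = export point
# (Hypothetical helper; assumes the listing format shown above.)
is_exported() {
    awk -v dir="$1" '$1 == dir { found = 1 } END { exit !found }'
}
```

For example, exportfs | is_exported /shared1 || echo "/shared1 is not exported" reports a missing export on the node where the filesystem is mounted.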
To test whether the Netscape servers are correctly configured, follow these steps:
Start the Netscape Communications and Commerce servers:
# /etc/init.d/ns_httpd start
# /etc/init.d/ns_commerce start
Run a Web browser, such as Netscape, on a client and try to access some Web pages served by the server.
Stop the Netscape Communications and Commerce servers:
# /etc/init.d/ns_httpd stop
# /etc/init.d/ns_commerce stop
Testing system behavior with IRIS FailSafe running is broken into four phases in the following subsections. The phases are: preparing for testing, checking normal operation, checking failover, and cleaning up after testing.
Edit the file /etc/init.d/failsafe on each node and change the value of MIN_UPTIME from the default (300 seconds) to 0. This lets you trigger several failovers in a row during testing without the FailSafe software disabling itself because of frequent failovers.
Bring up IRIS FailSafe software by entering these two commands on each node:
# chkconfig failsafe on
# /etc/init.d/failsafe start
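The MIN_UPTIME change can be made with any text editor. For administrators who prefer to script it, the sketch below shows one possible approach. It relies on an assumption that this chapter does not confirm: that the value appears in /etc/init.d/failsafe as a plain shell assignment of the form MIN_UPTIME=300 at the start of a line. The function name set_min_uptime is hypothetical. Check the file and try the function on a copy before using it on a node.

```sh
# set_min_uptime: replace the numeric value in a MIN_UPTIME=<number>
# assignment, keeping the previous contents in <file>.orig.
#   $1 = new value in seconds, $2 = file to edit
# (Hypothetical helper; ASSUMES the line has the form MIN_UPTIME=300.)
set_min_uptime() {
    cp "$2" "$2.orig" || return 1
    # Writing back into the existing file keeps its permissions.
    sed "s/^\([[:space:]]*MIN_UPTIME=\)[0-9][0-9]*/\1$1/" "$2.orig" > "$2"
}
```

For example, set_min_uptime 0 /etc/init.d/failsafe prepares a node for testing, and set_min_uptime 300 /etc/init.d/failsafe restores the default afterward.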
Follow this procedure to verify that the IRIS FailSafe cluster is operating normally:
Verify that the nodes are in normal state by entering this command on each node:
# /usr/etc/ha_admin -i
ha_admin: Node controller state normal
If either node has not reached normal state, wait a few minutes and try the command again. If normal state isn't reached, check the /var/adm/SYSLOG file on both nodes for errors. See Appendix B, “System Troubleshooting,” for troubleshooting information.
Verify that NFS filesystems are exported by the cluster by mounting them from a client.
Verify that Netscape servers on the cluster are working by running a browser on a client and viewing Web pages served by the Netscape servers.
Check any other highly available applications running on the cluster.
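Because a node can take a few minutes to reach normal state, the first check above can be repeated automatically. The following is a sketch that assumes a POSIX-style sh. The function name wait_for_normal and the HA_ADMIN variable are hypothetical, and the number of attempts and the delay between them are left to the caller. The text it looks for is the output line shown in the example above.

```sh
# Path to ha_admin; override HA_ADMIN to use a different location.
HA_ADMIN=${HA_ADMIN:-/usr/etc/ha_admin}

# wait_for_normal: run "ha_admin -i" up to $1 times, $2 seconds apart,
# and succeed as soon as the output reports normal state.
# (Hypothetical helper; not part of IRIS FailSafe.)
wait_for_normal() {
    tries=$1
    while [ "$tries" -gt 0 ]; do
        if "$HA_ADMIN" -i 2>/dev/null |
            grep -q 'Node controller state normal'; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$2"
        fi
    done
    return 1
}
```

For example, wait_for_normal 10 30 || echo "check /var/adm/SYSLOG" polls for up to about five minutes before pointing to the log file.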
After you have confirmed that the cluster operates correctly when both nodes are active, confirm that the cluster functions correctly in the face of failures by performing the tests below. Each test is an independent test and should be performed on an IRIS FailSafe cluster that is operating normally.
Power off one node in the cluster. The other node in the cluster should detect the failure and take over the services. If you have an active/backup configuration, power off the active node.
Disconnect the private network. If you have enabled heartbeat messages to be sent over the public network, the cluster should continue to function as before. Otherwise, one node takes over the services of the other node, and the node whose services are taken over is rebooted.
If heartbeat messages are sent over the public network (hb-public-ipname is set to a fixed IP address), enter this command after reconnecting the private network to switch heartbeat messages back to the private network:
# /usr/etc/ha_admin -x
Disconnect the public network from one of the active nodes in the cluster. The other node should take over the services.
Forcibly unmount a filesystem on an active node to make it unavailable. The other node should take over the services. The failed node enters standby state.
Kill the application daemons (for example, ns_httpd) on a node to make the service unavailable. The other node should take over the service.
Disconnect the serial line to the system controller port (if you are using CHALLENGE XL/L/DM) or the remote power control unit (if you are using CHALLENGE S). If you have configured the IRIS FailSafe software to send mail, it notifies the administrator of the failure and otherwise continues to function.
When the cluster is in this state, neither node can take over if another failure occurs. After you have reconnected the serial line, you can resume monitoring of the serial line by executing /usr/etc/ha_admin -m start <node_name>.
Follow this procedure to return the nodes to normal state after testing:
Edit the file /etc/init.d/failsafe on each node and return the value of MIN_UPTIME to its initial suggested value, 300.
Restart IRIS FailSafe on both nodes by entering these commands on each node:
# /etc/init.d/failsafe stop
# /etc/init.d/failsafe start