This chapter describes administrative commands and procedures for IRIS FailSafe.
Follow this procedure to start the IRIS FailSafe system:
To stop IRIS FailSafe software on one node or both nodes in the cluster, follow these steps:
On one node, shut down IRIS FailSafe by entering this command:
# /etc/init.d/failsafe stop
This command causes all highly available services to fail over from the node where you gave the command to the other node. All FailSafe processes (node controller, application monitor, and so on) on the node where you gave the command are terminated. The other node moves to degraded state.
Wait for the command to finish.
If you want to shut down IRIS FailSafe on the second node, enter this command on that node:
# /etc/init.d/failsafe stop
All highly available services are shut down and the FailSafe processes are terminated.
The subsections below provide information that system administrators need to know to successfully administer IRIS FailSafe nodes.
All messages from scripts and the IRIS FailSafe daemons go into the /var/adm/SYSLOG file. Check this file to get information about node state changes and errors. See the section “Errors Logged to /var/adm/SYSLOG” in Appendix B for descriptions of some of the SYSLOG messages.
When a node is in the process of changing from one state to another and you enter an ha_admin command, the ha_admin command can time out. Wait for a minute or two for the state change to complete and retry the command.
When the IRIS FailSafe software is running, some administrative procedures should not be used because they prevent proper operation of the IRIS FailSafe system. Follow these guidelines to avoid problems:
Do not reset the network (resetting the network is done by this sequence of commands: /etc/init.d/network stop; /etc/init.d/network start).
Do not set the XLV_ASSEMBLE_ARGS environment variable to change the default behavior of the xlv_assemble command (see the xlv_assemble(1M) reference page for information about XLV_ASSEMBLE_ARGS).
Do not stop highly available services without first stopping IRIS FailSafe. Because IRIS FailSafe monitors the highly available services, a stopped service can appear to be a failed service and cause IRIS FailSafe to perform an undesired failover.
When deciding which filesystems on the non-shared disks of a node in an IRIS FailSafe cluster to export, you need to follow this rule: don't export a filesystem on a non-shared disk if it is the parent of a highly available NFS filesystem. A parent filesystem contains the mount point of another filesystem. A highly available NFS filesystem is a filesystem on a shared disk that is configured as an NFS filesystem in the IRIS FailSafe configuration file. This rule prevents problems with automounting of the highly available NFS filesystem by clients. A simple example of this rule is that /, the root directory on a node, should not be exported.
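The export rule above can be checked mechanically. The following is a minimal sketch, not part of IRIS FailSafe: the function names and the mount points are hypothetical, and it treats every ancestor mount point as a parent, which is a conservative reading of the rule.

```python
import posixpath


def is_parent_of(parent_mount, child_mount):
    """Return True if child_mount lies strictly below parent_mount."""
    parent = posixpath.normpath(parent_mount)
    child = posixpath.normpath(child_mount)
    if parent == child:
        return False
    # Append a separator so that /usr is not treated as a parent of /usrlocal.
    prefix = parent if parent.endswith("/") else parent + "/"
    return child.startswith(prefix)


def unsafe_exports(non_shared_mounts, ha_nfs_mounts):
    """List the non-shared filesystems that the rule says not to export."""
    return [
        mount
        for mount in non_shared_mounts
        if any(is_parent_of(mount, ha) for ha in ha_nfs_mounts)
    ]


# Hypothetical mount points, for illustration only.
non_shared = ["/", "/usr", "/var/local"]
ha_nfs = ["/shared1", "/usr/shared2"]
print(unsafe_exports(non_shared, ha_nfs))
```

With these sample values, / and /usr are reported because each contains the mount point of a highly available NFS filesystem, while /var/local is not.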
If you are using a CHALLENGE RAID storage system, you must stop the RAID agent before running the xlv_assemble command and restart it after running xlv_assemble. (The /dev/scsi device nodes can be opened by only one process at a time. As a consequence, xlv_assemble does not correctly assemble the failover paths to a RAID device if the RAID agent is active.)
To stop the RAID agent, enter this command:
# /etc/init.d/raid5 stop
To restart the RAID agent, enter this command:
# /etc/init.d/raid5 start
If your cluster contains replicated files (for example, the same Netscape document root on each node), take care when updating the files, for example when adding new Web documents or installing a new software release. The important points are:
Identical changes to the files must be made on both nodes.
Changes should be made on both nodes as close to simultaneously as possible. This reduces the possibility that a failover occurs while the files differ, which would cause clients to see different pages after the failover.
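One way to confirm that replicated files are identical is to compare a checksum manifest produced on each node. The sketch below is an illustration under stated assumptions, not a FailSafe facility: the function names are hypothetical, and how the manifest from one node reaches the other is left out.

```python
import hashlib
import os


def manifest(root):
    """Map each file path (relative to root) to the SHA-256 digest of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(full, root)] = digest
    return digests


def differences(manifest_a, manifest_b):
    """Return the sorted relative paths that are missing or differ between two manifests."""
    paths = set(manifest_a) | set(manifest_b)
    return sorted(p for p in paths if manifest_a.get(p) != manifest_b.get(p))
```

An empty result from differences() indicates that the two document roots matched at the time the manifests were taken.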
When IRIS FailSafe is running on the nodes in a cluster, users need to know these things:
Which applications on the nodes are highly available applications
What to do if a highly available application needs to be stopped (for example, stop IRIS FailSafe first or contact a system administrator)
Which XLV logical volumes and XFS filesystems are on shared disks and which are on non-shared disks
Which IP address to use for various types of access to the nodes. Because there are two types of IP addresses for nodes in an IRIS FailSafe cluster, fixed IP addresses and high availability IP addresses, users must learn which IP address to use when they access the nodes for different purposes.
Users should use a fixed IP address, such as the hostname, in these situations:
When using the rcp command to copy files from a filesystem on a non-shared disk
When putting NFS filesystems on non-shared disks in /etc/fstab
When using automount to access a filesystem on a non-shared disk
When browsing a Web page whose document root is on a non-shared disk
Client users should use a high availability IP address for a node instead of its hostname in these situations that require hostnames:
When using the rcp command to copy files from a filesystem on a shared disk
When putting NFS filesystems on shared disks in /etc/fstab
When using automount to access a filesystem on a shared disk
When browsing a Web page whose document root is on a shared disk
The netstat -i command tells you which interfaces are ifconfig'ed up and the IP addresses they own. For each interface, the first line of output from netstat -i for that interface lists the fixed IP address for the interface. The IP addresses on the second and later lines of output for an interface are high availability IP addresses.
Because IP aliases are ifconfig'ed up only when a node is in normal or degraded state, the output of netstat -i varies. The examples below, from the configuration shown in Figure 2-4, show netstat -i output for both nodes for the three possible situations:
IRIS FailSafe is not running.
On xfs-ha1 each interface is listed once; the IP alias stocks isn't listed:
# /usr/etc/netstat -i
Name Mtu   Network    Address       Ipkts  Ierrs Opkts Oerrs Coll
ec0  1500  190.0.2.1  xfs-ha1       71174  0     645   0     210
ec3  1500  190.0.3.1  priv-xfs-ha1  1030   0     1031  0     0
lo0  8304  loopback   localhost     857    0     857   0     0
On xfs-ha2 each interface is listed once; the IP alias bonds isn't listed:
# /usr/etc/netstat -i
Name Mtu   Network    Address       Ipkts  Ierrs Opkts Oerrs Coll
ec0  1500  190.0.2.2  xfs-ha2       71174  0     645   0     200
ec3  1500  190.0.3.2  priv-xfs-ha2  1030   0     1031  0     0
lo0  8304  loopback   localhost     857    0     857   0     0
IRIS FailSafe is running, and both nodes are in normal state.
On xfs-ha1 the IP alias stocks is listed for ec0, as well as its IP address xfs-ha1:
# /usr/etc/netstat -i
Name Mtu   Network    Address       Ipkts  Ierrs Opkts Oerrs Coll
ec0  1500  190.0.2.1  xfs-ha1       87234  0     1445  0     218
ec0  1500  190.0.2.3  stocks        87234  0     1445  0     218
ec3  1500  190.0.3.1  priv-xfs-ha1  1090   0     1091  0     0
lo0  8304  loopback   localhost     857    0     857   0     0
On xfs-ha2 the IP alias bonds is listed for ec0, as well as its IP address xfs-ha2:
# /usr/etc/netstat -i
Name Mtu   Network    Address       Ipkts  Ierrs Opkts Oerrs Coll
ec0  1500  190.0.2.2  xfs-ha2       78972  0     1645  0     200
ec0  1500  190.0.2.4  bonds         78972  0     1645  0     200
ec3  1500  190.0.3.2  priv-xfs-ha2  1090   0     1091  0     0
lo0  8304  loopback   localhost     857    0     857   0     0
One node (xfs-ha1) is in degraded state, and the other node (xfs-ha2) is in standby state.
On xfs-ha1 the IP aliases stocks and bonds are both listed because the IP alias bonds has failed over from xfs-ha2:
# /usr/etc/netstat -i
Name Mtu   Network    Address       Ipkts  Ierrs Opkts Oerrs Coll
ec0  1500  190.0.2.1  xfs-ha1       87234  0     1445  0     218
ec0  1500  190.0.2.3  stocks        87234  0     1445  0     218
ec0  1500  190.0.2.4  bonds         87234  0     1445  0     218
ec3  1500  190.0.3.1  priv-xfs-ha1  1090   0     1091  0     0
lo0  8304  loopback   localhost     857    0     857   0     0
On xfs-ha2 no IP aliases are listed:
# /usr/etc/netstat -i
Name Mtu   Network    Address       Ipkts  Ierrs Opkts Oerrs Coll
ec0  1500  190.0.2.2  xfs-ha2       78972  0     1645  0     200
ec3  1500  190.0.3.2  priv-xfs-ha2  1030   0     1031  0     0
lo0  8304  loopback   localhost     857    0     857   0     0
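The rule for reading netstat -i output (the first line for an interface gives its fixed IP address, and later lines for the same interface give high availability IP addresses) can be applied by a short script. This is a minimal sketch that assumes the column layout shown in the examples in this section; the function name is hypothetical and the sample text is copied from the degraded-state example for xfs-ha1.

```python
def classify_addresses(netstat_output):
    """Split netstat -i addresses into fixed and high availability per interface."""
    result = {}
    for line in netstat_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and lines too short to hold Name, Mtu, Network, Address.
        if len(fields) < 4 or fields[0] == "Name":
            continue
        name, address = fields[0], fields[3]
        entry = result.setdefault(name, {"fixed": None, "ha": []})
        if entry["fixed"] is None:
            entry["fixed"] = address      # first line for this interface
        else:
            entry["ha"].append(address)   # second and later lines
    return result


SAMPLE = """\
Name Mtu   Network    Address       Ipkts  Ierrs Opkts Oerrs Coll
ec0  1500  190.0.2.1  xfs-ha1       87234  0     1445  0     218
ec0  1500  190.0.2.3  stocks        87234  0     1445  0     218
ec0  1500  190.0.2.4  bonds         87234  0     1445  0     218
ec3  1500  190.0.3.1  priv-xfs-ha1  1090   0     1091  0     0
lo0  8304  loopback   localhost     857    0     857   0     0
"""

for interface, addresses in classify_addresses(SAMPLE).items():
    print(interface, addresses)
```

For the sample, ec0 is reported with the fixed address xfs-ha1 and the high availability addresses stocks and bonds.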
To display the current state of a node (referred to as the node controller state in the output), enter this command on that node:
# /usr/etc/ha_admin -i
A possible response is:
ha_admin: Node controller state normal
Table 1-1 explains the possible states returned from this command.
To get the state of the other node in the cluster, enter this command:
# /usr/etc/ha_admin -i hostname
hostname is the hostname of the other node.
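A script that needs the node state can extract it from the ha_admin -i response. The sketch below assumes the response always has the single-line form shown in the example above; the function name is hypothetical, and the exact wording for states other than normal is not confirmed here.

```python
import re

# Matches the response line shown in the example above.
STATE_LINE = re.compile(r"^ha_admin: Node controller state (.+)$")


def node_state(output):
    """Return the state named in ha_admin -i output, or None if no state line is found."""
    for line in output.splitlines():
        match = STATE_LINE.match(line.strip())
        if match:
            return match.group(1).strip()
    return None


print(node_state("ha_admin: Node controller state normal"))
```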
The ha_admin -a command provides information about the cluster including each node's state and IP aliases. Each volume, filesystem, NFS filesystem, and Web server is listed along with its primary node. For example:
# /usr/etc/ha_admin -a
Node controller states
Node: xfs-ha1 State: normal
Node: xfs-ha2 State: normal
Interface-pairs
Interface-pair: one Owner: xfs-ha1
IP aliases in interface-pair one: stocks
Interface-pair: two Owner: xfs-ha2
IP aliases in interface-pair two: bonds
XLV Volumes
XLV volume: shared1_vol Owner: xfs-ha1
XLV volume: shared2_vol Owner: xfs-ha2
Filesystems
Filesystem: shared1_fs Owner: xfs-ha1
Filesystem: shared2_fs Owner: xfs-ha2
NFS
NFS: export1_fs Owner: xfs-ha1
NFS: export2_fs Owner: xfs-ha2
Webservers
Webserver: webha1 Owner: xfs-ha1
This output shows an NFS filesystem being exported from each node (the NFS portion shows two NFS filesystems with different owners) and one Netscape server (Webserver). The names in the middle column of the output (xfs-ha1, one, shared1_vol, and so on) are all labels for blocks (node, interface-pair, volume, and so on) in the configuration file.
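The ha_admin -a listing is regular enough to be read by a script, for example to find which node currently owns each resource. This is a minimal sketch that assumes the line formats shown in the example above ("Node: label State: state" and "kind: label Owner: node"); the function name is hypothetical and the sample is an excerpt of that example.

```python
import re

# kind: label State|Owner: value
ENTRY = re.compile(
    r"^(?P<kind>[^:]+):\s+(?P<label>\S+)\s+(?P<field>State|Owner):\s+(?P<value>.+)$"
)


def parse_cluster_status(output):
    """Return (states, owners): node label -> state, and (kind, label) -> owning node."""
    states, owners = {}, {}
    for line in output.splitlines():
        match = ENTRY.match(line.strip())
        if not match:
            continue  # section titles and "IP aliases in ..." lines are ignored
        if match.group("field") == "State":
            states[match.group("label")] = match.group("value")
        else:
            owners[(match.group("kind"), match.group("label"))] = match.group("value")
    return states, owners


SAMPLE = """\
Node controller states
Node: xfs-ha1 State: normal
Node: xfs-ha2 State: normal
Interface-pairs
Interface-pair: one Owner: xfs-ha1
IP aliases in interface-pair one: stocks
XLV Volumes
XLV volume: shared1_vol Owner: xfs-ha1
NFS
NFS: export2_fs Owner: xfs-ha2
"""

states, owners = parse_cluster_status(SAMPLE)
print(states)
print(owners)
```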
When a node is in standby state, it is not providing any highly available services. The -rf option and the -G option of ha_admin are used to tell the node to begin providing highly available services.
If both nodes are running, the ha_admin -rf command is used to move a node in standby state to a state that enables it to provide highly available services. The state that it moves to depends upon the state of the other node:
If the other node is in degraded state (it is providing all highly available services for which it is the primary or backup node), both nodes move to normal state (each node provides the highly available services for which it is the primary node).
If the other node is in standby state (it is not providing any highly available services), the node on which you enter the command moves to degraded state (it provides all highly available services for which it is the primary or backup node) and the other node remains in standby state.
On the node in standby state that you want to move to normal or degraded state, enter this command:
# /usr/etc/ha_admin -rf
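The two outcomes of ha_admin -rf described above depend only on the state of the other node. The sketch below restates that rule as a small function; it is a model of the description in this section, not code from IRIS FailSafe, and the function name is hypothetical.

```python
def states_after_rf(other_node_state):
    """Return (this_node, other_node) states after ha_admin -rf on a node in standby state."""
    if other_node_state == "degraded":
        # The other node gives back the services for which this node is primary.
        return ("normal", "normal")
    if other_node_state == "standby":
        # No node is providing services, so this node takes all of them.
        return ("degraded", "standby")
    raise ValueError("this section does not describe -rf for state: " + other_node_state)


print(states_after_rf("degraded"))
print(states_after_rf("standby"))
```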
If the other node is physically disconnected (no Ethernet connection, no FDDI connection, no serial line connection, and no connection to shared disk storage), you must use the ha_admin -G command to move the node in standby state to degraded state. In degraded state it provides all highly available services for which it is the primary or backup node. Follow this procedure:
Make sure that all the shared resources are not connected to the disconnected node: the disconnected node is not attached to the Ethernet or FDDI, the serial line is disconnected, and shared disk storage is disconnected.
Note: If you enter the command ha_admin -G while the other node is connected, even by just a serial line, the other node is rebooted.
Enter the ha_admin command:
# /usr/etc/ha_admin -G
CAUTION: The -G option must only be used when the other node has
been physically removed from the cluster. If this
condition is not met, data integrity could be compromised.
Read ha_admin(1m) or IRIS FailSafe release notes for more
information.
Continue (y/n)?
Respond to the prompt with y if you are ready to move the node to degraded state.
The -s option of ha_admin is used to move a node from normal state to standby state (a state where it provides no highly available services). When a node is in standby state, it is said to be removed from the cluster. All highly available services provided by the node are failed over to the other node.
You can remove a node that is in normal state by entering a command from either node:
Enter this command to remove this node:
# /usr/etc/ha_admin -s
Enter this command to remove the other node:
# /usr/etc/ha_admin -s hostname
hostname is the hostname of the other node.
The -sf option of ha_admin is used to move a node from degraded state (providing all highly available services for which it is the primary or backup node) to standby state (not providing any highly available services). Moving the node to standby state removes it from the cluster. The other node remains in standby state.
You can remove a node that is in degraded state by entering a command from either node:
Enter this command to remove this node:
# /usr/etc/ha_admin -sf
Enter this command to remove the other node:
# /usr/etc/ha_admin -sf hostname
hostname is the hostname of the other node.
The -r option of the ha_admin command is used to move a node from controlled failback state (the node does not provide highly available services and is actively monitoring the other node) to normal state (providing the highly available services for which it is the primary node). The other node (in degraded state) also moves to normal state.
You can move a node in controlled failback state to normal state by entering a command from either node:
Enter this command to move this node from controlled failback state to normal state and the other node from degraded state to normal state:
# /usr/etc/ha_admin -r
Enter this command to move this node from degraded state to normal state and the other node from controlled failback state to normal state:
# /usr/etc/ha_admin -r hostname
hostname is the hostname of the node in controlled failback state.
The -s option of the ha_admin command is used to move a node from controlled failback state to standby state (the node, which has not been providing highly available services, doesn't begin to provide them, but it does stop monitoring the other node). All the highly available services are provided by the other node (which remains in degraded state).
You can move a node from controlled failback state to standby state by entering a command from either node:
Enter this command to move this node from controlled failback state to standby state:
# /usr/etc/ha_admin -s
Enter this command to move the other node from controlled failback state to standby state:
# /usr/etc/ha_admin -s hostname
hostname is the hostname of the node in controlled failback state.
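The state changes described in the preceding sections can be summarized as a lookup from the node's current state and desired state to the ha_admin option. This is a minimal sketch built only from the options named in this chapter; the table and function names are hypothetical, and the -G option is deliberately left out because it requires the other node to be physically disconnected and asks for confirmation.

```python
# (current state, desired state) -> ha_admin option, as described in this chapter.
TRANSITIONS = {
    ("standby", "normal"): "-rf",       # applies when the other node is in degraded state
    ("standby", "degraded"): "-rf",     # applies when the other node is in standby state
    ("normal", "standby"): "-s",
    ("degraded", "standby"): "-sf",
    ("controlled failback", "normal"): "-r",
    ("controlled failback", "standby"): "-s",
}

# Options that this chapter shows with an optional hostname argument.
ACCEPTS_HOSTNAME = {"-s", "-sf", "-r"}


def ha_admin_command(current, desired, hostname=None):
    """Build the ha_admin command line for moving a node from current to desired state."""
    option = TRANSITIONS.get((current, desired))
    if option is None:
        raise ValueError("no ha_admin option described for %s -> %s" % (current, desired))
    if hostname is not None and option not in ACCEPTS_HOSTNAME:
        raise ValueError("%s is shown only for the local node" % option)
    parts = ["/usr/etc/ha_admin", option]
    if hostname is not None:
        parts.append(hostname)
    return " ".join(parts)


print(ha_admin_command("degraded", "standby", "xfs-ha2"))
```

The function only builds the command string; it does not run ha_admin or check the actual state of either node.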
If there is a failure of the private network between two nodes in a cluster, the heartbeat messages that are normally transmitted on the private network are switched automatically to the public network if the parameter hb-public-ipname is set to a fixed IP address. When the problem with the private network has been resolved, you must manually switch the heartbeat messages from the public network back to the private network.
To switch the heartbeat messages from the public network back to the private network, follow this procedure on either node:
Bring up the private interface by entering this command:
# /usr/etc/ifconfig interface inet IPaddress1 up
interface is the name of the private network interface (ec3 in the netstat -i examples in this chapter) and IPaddress1 is the value of the parameter hb-private-ipname for this node.
Wait 30 seconds and start a ping to the private IP address of the other node:
# ping IPaddress2
PING priv-xfs-ha1 (190.0.3.1): 56 data bytes
ping: sendto: No route to host
ping: wrote priv-xfs-ha1 64 chars, ret=-1
ping: sendto: No route to host
ping: wrote priv-xfs-ha1 64 chars, ret=-1
64 bytes from 190.0.3.1: icmp_seq=0 ttl=255 time=1 ms
64 bytes from 190.0.3.1: icmp_seq=1 ttl=255 time=1 ms
(In the example output, the command was entered on xfs-ha2 and IPaddress2 was priv-xfs-ha1.)
When the pings start succeeding, kill the command with EOF (<Ctrl-d> by default).
Switch the heartbeat messages back to the private network with this command:
# /usr/etc/ha_admin -x
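The waiting step in this procedure (keep pinging until the private network answers, and only then switch the heartbeat back) can be sketched as a polling loop. This is an illustration under stated assumptions: the function name is hypothetical, and probe stands for any caller-supplied check, such as one that runs a single ping to the other node's private IP address and reports whether it succeeded.

```python
import time


def wait_until_reachable(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns True; return False if every attempt fails."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # pause between attempts, but not after the last one
    return False


# Simulated probe for illustration: fails twice, then succeeds.
replies = iter([False, False, True])
if wait_until_reachable(lambda: next(replies), attempts=5, delay=0.0):
    print("private network reachable; heartbeat can be switched back with ha_admin -x")
```

Keeping the probe as a parameter leaves the choice of ping options to the administrator and lets the loop be exercised without a network.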