This chapter describes administrative tasks you perform to operate and monitor an IRIS FailSafe system. It describes how to perform tasks using the FailSafe Manager GUI and the cmgr(1M) command.
| Note: It is recommended that all FailSafe administration be done from one node in the pool so that the latest copy of the database will be available even when there are network partitions. |
On Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C systems, there is only one serial/USB port that provides both L1 and console support for the machine. In a FailSafe configuration, this port (the DB9 connector) is used for system reset. It is connected to a serial port in another node or to the Ethernet multiplexer.
To get access to console input and output, the console must be redirected to another serial port in the machine. Use the following procedure:
Edit the /etc/inittab file to use an alternate serial port.
Either issue an init q command or reboot.
For example, suppose you had the following in the /etc/inittab file (line breaks added for readability):
# on-board ports or on Challenge/Onyx MP machines, first IO4 board ports
t1:23:respawn:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
     -c "exec /sbin/getty ttyd1 console" # alt console
t2:23:off:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
     -c "exec /sbin/getty -N ttyd2 co_9600" # port 2
You could change it to the following:
# on-board ports or on Challenge/Onyx MP machines, first IO4 board ports
t1:23:off:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
     -c "exec /sbin/getty ttyd1 co_9600" # port 1
t2:23:respawn:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
     -c "exec /sbin/getty -N ttyd2 console" # alt console
| Caution: Redirecting the console by using the above method works only when IRIX is running. To access the console when IRIX is not running (miniroot), you must physically reconnect the machine: unplug the reset cable from the console/L1 port and then connect the console cable. |
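Before issuing init q, it can help to confirm which getty entry carries the console label after the edit. The following is a minimal sketch and not part of FailSafe: the function name console_entries is hypothetical, and it assumes that each entry occupies one physical line in the usual id:runlevels:action:process layout (that is, without the line breaks added above for readability).

```shell
# console_entries (hypothetical helper): list the inittab entries whose getty
# command uses the "console" label, printed as "id action tty".
# Assumes one entry per physical line in id:runlevels:action:process form.
console_entries() {
    awk -F: '
        /^[ \t]*#/ { next }                  # skip whole-line comments
        NF >= 4 {
            # Rebuild the process field in case it contains a colon.
            proc = $4
            for (i = 5; i <= NF; i++) proc = proc ":" $i
            sub(/[ \t]*#.*$/, "", proc)      # drop the trailing comment
            n = split(proc, word, /[ \t"]+/)
            tty = ""; is_console = 0; has_getty = 0
            for (i = 1; i <= n; i++) {
                if (word[i] ~ /getty$/)       has_getty = 1
                if (word[i] ~ /^ttyd[0-9]+$/) tty = word[i]
                if (word[i] == "console")     is_console = 1
            }
            if (has_getty && is_console) print $1, $3, tty
        }
    ' "${1:-/etc/inittab}"
}
```

For the edited example above, console_entries /etc/inittab would print t2 respawn ttyd2, which shows that the console is now on the second serial port.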
Once a FailSafe command is started, it may partially complete even if you interrupt the command by typing Ctrl-c. If you halt the execution of a command this way, you may leave the cluster in an indeterminate state and you may need to use the various status commands to determine the actual state of the cluster and its components.
If you have a two-node cluster, you should create an emergency failover policy (step 1 below) for each node in preparation for a time when it may need to run by itself. This situation can occur if the other node must stay down for maintenance or if it fails and cannot be brought up.
The following procedure describes the steps required to use just one node in the cluster.
Create an emergency failover policy for each node. Each policy should look like the following example when the cmgr command is issued, where Active_node is the name of the node using the policy (in the examples, nodeA) and Down_node is the name of the nonfunctioning node (in the examples, nodeB):
cmgr> show failover_policy emergency-Active_node
Failover Policy: emergency-Active_node
Version: 1
Script: ordered
Attributes: Controlled_Failback InPlace_Recovery
Initial AFD: Active_node
For example, suppose you have two nodes, nodeA and nodeB. You would have two emergency failover policies:
cmgr> show failover_policy emergency-nodeA
Failover Policy: emergency-nodeA
Version: 1
Script: ordered
Attributes: Controlled_Failback InPlace_Recovery
Initial AFD: nodeA
cmgr> show failover_policy emergency-nodeB
Failover Policy: emergency-nodeB
Version: 1
Script: ordered
Attributes: Controlled_Failback InPlace_Recovery
Initial AFD: nodeB
For more information, see “Define a Failover Policy” in Chapter 5.
At this point, the procedure assumes that the cluster has one node that has tried to come up but is now in the lonely state. The other node is down. The procedure is for recovering from this point.
Modify each resource group to use the appropriate single-node emergency failover policy, using the following cmgr commands or the GUI:
modify resource_group RG_name in cluster clustername set failover_policy to emergency-Active_node |
For example, on nodeA:
cmgr> set cluster test-cluster
cmgr> modify resource_group group1
Enter commands, when finished enter either "done" or "cancel"
resource_group group1 ? set failover_policy to emergency-nodeA
resource_group group1 ? done
Successfully modified resource group group1
Change the state of all of the resource groups to offline. (The last known state of these groups was online before the machines went down; however, the resource groups are not actually online at this point: the cluster was booted while the other node was not functional, so the active node entered the lonely state. This step just tells the database to label the state of the resource groups appropriately in preparation for later steps.)
Use the following command:
admin offline resource_group RG_name in cluster clustername |
For example:
cmgr> set cluster test-cluster
cmgr> show resource_groups in test-cluster
Resource Groups:
group1
group2
cmgr> admin offline resource_group group1
cmgr> admin offline resource_group group2 |
Force the stop of HA services for the down node:
stop ha_services on Down_node for cluster clustername force |
| Note: This is a long-running task that might take a few minutes to complete. cmgr will provide intermediate task status. |
For example:
cmgr> stop ha_services on nodeB for cluster test-cluster force |
Mark the resource groups as online in the database. When HA services are started in future steps, the services will come online using the emergency failover policies.
admin online resource_group RG_name in cluster clustername |
For example:
cmgr> set cluster test-cluster
cmgr> admin online resource_group group1
FailSafe daemon (ha_fsd) is not running on this local node or it is not ready to accept admin commands.
Resource Group (group1) is online-ready.
Failed to admin:
online
admin command failed
cmgr> show status of resource_group group1 in cluster test-cluster
State: Online Ready
Error: No error
Check resource group group1 status in an active node if HA services are active in cluster |
Start HA services on the active node:
start ha_services on Active_node for cluster clustername |
HA services are now active in this single node cluster. From this step through the rest of the recovery, services are active and there should be no downtime experienced.
For example:
cmgr> start ha_services on nodeA for cluster test-cluster |
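The sequence up to this point can be collected into a checklist of commands. The following is a minimal sketch under stated assumptions: single_node_cmds is a hypothetical helper, not a FailSafe tool, and it only prints the cmgr commands in the order described above, using the syntax shown in the examples. It assumes that the emergency-Active_node policy already exists. Review the output and enter the commands at the cmgr prompt yourself.

```shell
# single_node_cmds (hypothetical helper): print, in the order described above,
# the cmgr commands that put a two-node cluster into single-node operation.
# Usage: single_node_cmds CLUSTER ACTIVE_NODE DOWN_NODE RG_NAME...
single_node_cmds() {
    if [ $# -lt 4 ]; then
        echo "usage: single_node_cmds CLUSTER ACTIVE_NODE DOWN_NODE RG_NAME..." >&2
        return 1
    fi
    cluster=$1 active=$2 down=$3
    shift 3

    # Point each resource group at the emergency policy of the active node.
    for rg in "$@"; do
        echo "modify resource_group $rg in cluster $cluster"
        echo "set failover_policy to emergency-$active"
        echo "done"
    done

    # Label each group offline in the database.
    for rg in "$@"; do
        echo "admin offline resource_group $rg in cluster $cluster"
    done

    # Force HA services to stop for the node that is down.
    echo "stop ha_services on $down for cluster $cluster force"

    # Mark the groups online (they become online-ready), then start HA services.
    for rg in "$@"; do
        echo "admin online resource_group $rg in cluster $cluster"
    done
    echo "start ha_services on $active for cluster $cluster"
}
```

For the example cluster, single_node_cmds test-cluster nodeA nodeB group1 group2 prints twelve lines, beginning with the modify commands for group1 and ending with the start ha_services command for nodeA.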
Remove and reinitialize the database on the down node, which is now booted in multiuser mode:
# cd /var/cluster/cdb
# rm -rf /var/cluster/cdb/cdb*
# /usr/cluster/bin/cdbreinit /var/cluster/cdb/cdb.db
Wait for a few minutes and use the tail(1) command to watch the SYSLOG file for messages indicating that this node knows its identity and has joined the cluster. For example (line breaks added here for readability):
# tail -f /var/adm/SYSLOG
...
Dec 14 18:23:16 6D:Down_node cmond[1074]: Notification can not be processed,
     local machine and cluster name is not known.
Dec 14 18:23:16 6D:Down_node cmond[1074]: Local machine belongs to cluster clustername.
Dec 14 18:23:16 6D:Down_node cmond[1074]: Local machine name is Down_node
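Instead of watching the log by eye, the check can be scripted. The following is a minimal sketch: joined_cluster is a hypothetical helper, and the message wording and the ":NODE cmond[pid]:" layout it matches are assumptions taken from the sample messages above, so adjust the pattern if your SYSLOG entries differ.

```shell
# joined_cluster (hypothetical helper): succeed once LOGFILE contains the cmond
# message stating that the local machine on NODE belongs to CLUSTER.
# Usage: joined_cluster LOGFILE NODE CLUSTER
joined_cluster() {
    # Pattern follows the sample SYSLOG lines; the pid inside [] is ignored.
    grep ":$2 cmond\[[0-9]*]: Local machine belongs to cluster $3\." "$1" >/dev/null 2>&1
}
```

A polling loop such as until joined_cluster /var/adm/SYSLOG nodeB test-cluster; do sleep 10; done would then return only after the message has been logged.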
To resume using the down node, do the following:
Boot the down node into single user mode.
Within single-user mode, use the chkconfig(1M) command to list the cluster services, and set each of them to off:
# chkconfig | grep cluster |
Boot the Down_node to multiuser mode.
Start HA services for the down node:
cmgr> start ha_services on node Down_node for cluster clustername |
Use the tail(1) command to watch the SYSLOG file for messages that indicate the formerly down node is now in membership and that services are in the UP state. Do not continue until you see messages such as the following:
# tail -f /var/adm/SYSLOG
...
miredb node Active_node [81] : UP incarnation 7 age 2:0
miredb node Down_node [82] : UP incarnation 1 age 1:0
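The "do not continue until" condition can also be tested from a script. The following is a minimal sketch: all_nodes_up is a hypothetical helper, and it assumes that membership lines have the "node NAME [id] : STATE" layout of the sample above. It uses the most recent line for each node, so an earlier state for a node is overridden by a later UP line.

```shell
# all_nodes_up (hypothetical helper): succeed only if the most recent
# "node NAME [id] : STATE" line in LOGFILE reports UP for every named node.
# Usage: all_nodes_up LOGFILE NODE...
all_nodes_up() {
    log=$1
    shift
    awk -v want="$*" '
        {
            # Find "node NAME [id] : STATE" anywhere on the line.
            for (i = 1; i + 4 <= NF; i++)
                if ($i == "node" && $(i + 3) == ":")
                    state[$(i + 1)] = $(i + 4)   # later lines overwrite earlier ones
        }
        END {
            n = split(want, name, " ")
            if (n == 0) exit 1                   # no nodes named: treat as failure
            for (j = 1; j <= n; j++)
                if (state[name[j]] != "UP") exit 1
            exit 0
        }
    ' "$log"
}
```

For example, all_nodes_up /var/adm/SYSLOG nodeA nodeB returns success only when both nodes were last reported as UP.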
Modify the resource groups to restore the original failover policies they were using before the failure:
modify resource_group RG_name in cluster cluster_name set failover_policy to Original-Failover-Policy |
| Note: This only restores the configuration for the static environment. The runtime environment will still be using the single-node policy at this time. |
For example, if the normal failover policy was normal-fp:
cmgr> set cluster test-cluster
cmgr> modify resource_group group1
Enter commands, when finished enter either "done" or "cancel"
resource_group group1 ? set failover_policy to normal-fp
resource_group group1 ? done
Successfully modified resource group group1
cmgr> modify resource_group group2
Enter commands, when finished enter either "done" or "cancel"
resource_group group2 ? set failover_policy to normal-fp
resource_group group2 ? done
Successfully modified resource group group2
Perform an offline_detach command on the resource groups in the cluster. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. The resources will remain in service.
| Note: Because FailSafe is no longer monitoring the group after the offline_detach (or offline_detach_force) is executed, you should not allow FailSafe to recover on any node other than where it was running at the time the offline_detach was performed.
This also means that no other nodes should be allowed to rejoin the FailSafe membership, especially if Auto_Recovery is set in the resource group's failover policy. This restriction is required because the failover policy scripts are run whenever there is a change in membership; rerunning the scripts could cause your previously detached resource group to start on a node other than the node where the offline_detach was performed. FailSafe policy scripts are run only on nodes where FailSafe is running (nodes where HA services have been started). For example, suppose you have a four-node FailSafe cluster (with nodes A, B, C, and D), where nodes A, B, and C are in the UP state and node D is in the DOWN state. If resource group RG is made offline using the offline_detach or offline_detach_force command on node B and HA services are shut down on node B, node D should not rejoin the cluster before the resources in RG are stopped manually on node B. If node D rejoins the cluster, the resource group RG will be made online on node A, C, or D. |
Use the following command:
admin offline_detach resource_group resource_group [in cluster clustername] |
cmgr> admin offline_detach resource_group group1 in cluster test-cluster |
Show the status of the resource groups to be sure that they now show as offline.
| Note: The resources are still in service even though this command output shows them as offline. |
show status of resource_group resource_group in cluster clustername |
For example:
cmgr> show status of resource_group group1 in cluster test-cluster |
Make the resource groups online in the cluster:
admin online resource_group RG_name in cluster cluster_name |
For example:
cmgr> admin online resource_group group1 in cluster test-cluster
cmgr> admin online resource_group group2 in cluster test-cluster
Move the resources back to their original nodes. Because our original policies included the InPlace_Recovery attribute, all of the resources have remained on the node that has been active throughout this process.
admin move resource_group RG_name in cluster cluster_name to node Primary_owner |
For example:
cmgr> admin move resource_group group1 in cluster test-cluster to node nodeB |
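The restore steps above can likewise be printed as a checklist. The following is a minimal sketch: restore_cmds is a hypothetical helper that only prints cmgr commands in the documented order. As a simplifying assumption, every group is given the same normal failover policy, and a group is moved only when a primary owner is supplied in the form RG:PRIMARY.

```shell
# restore_cmds (hypothetical helper): print the cmgr commands that return
# resource groups to their normal failover policy once both nodes are UP.
# Usage: restore_cmds CLUSTER POLICY RG[:PRIMARY]...
restore_cmds() {
    if [ $# -lt 3 ]; then
        echo "usage: restore_cmds CLUSTER POLICY RG[:PRIMARY]..." >&2
        return 1
    fi
    cluster=$1 policy=$2
    shift 2

    # Restore the normal policy in the static configuration.
    for spec in "$@"; do
        echo "modify resource_group ${spec%%:*} in cluster $cluster"
        echo "set failover_policy to $policy"
        echo "done"
    done
    # Detach the groups; their resources keep running.
    for spec in "$@"; do
        echo "admin offline_detach resource_group ${spec%%:*} in cluster $cluster"
    done
    # Confirm that the groups are now reported as offline.
    for spec in "$@"; do
        echo "show status of resource_group ${spec%%:*} in cluster $cluster"
    done
    # Bring the groups online again.
    for spec in "$@"; do
        echo "admin online resource_group ${spec%%:*} in cluster $cluster"
    done
    # Move only the groups for which a primary owner was given.
    for spec in "$@"; do
        case $spec in
            *:*) echo "admin move resource_group ${spec%%:*} in cluster $cluster to node ${spec#*:}" ;;
        esac
    done
}
```

For the example, restore_cmds test-cluster normal-fp group1:nodeB group2 prints the commands for both groups and a single move command, for group1 to nodeB.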
While the FailSafe system is running, you can monitor the status of the FailSafe components to determine the state of the component. FailSafe allows you to view the system status in the following ways:
You can keep continuous watch on the state of a cluster using the cluster_status command, or the GUI.
You can query the status of an individual resource group, node, or cluster using either the GUI or cmgr.
You can use the haStatus script provided with the cmgr command to see the status of all clusters, nodes, resources, and resource groups in the configuration.
The following sections describe the procedures for performing these tasks.
You can use the cluster_status command to monitor the cluster using a curses(3X) interface. For example, the following shows a two-node cluster configured for FailSafe only, with the cluster_status help text displayed:
# /var/cluster/cmgr-scripts/cluster_status
* Cluster=nfs-cluster FailSafe=ACTIVE CXFS=Not Configured 08:45:12
Nodes = hans2 hans1
FailSafe = UP UP
CXFS =
ResourceGroup Owner State Error
bartest-group Offline No error
footest-group Offline No error
bar_rg2 hans2 Online No error
nfs-group1 hans2 Online No error
foo_rg hans2 Online No error
+-------+ cluster_status Help +--------+
| on s - Toggle Sound on event |
| on r - Toggle Resource Group View |
| on c - Toggle CXFS View |
| j - Scroll up the selection |
| k - Scroll down the selection |
| TAB - Toggle RG or CXFS selection |
| ENTER - View detail on selection |
| h - Toggle help screen |
| q - Quit cluster_status |
+--- Press 'h' to remove help window --+
The above shows that a sound will be activated when a node or the cluster changes status. You can override the s setting by invoking cluster_status with the -m (mute) option. You can also use the arrow keys to scroll the selection.
| Note: The cluster_status command can display no more than 128 CXFS filesystems. |
The easiest way to keep a continuous watch on the state of a cluster is to use the GUI tree view.
System components that are experiencing problems appear as blinking red icons. Components in transitional states also appear as blinking icons. If there is a problem in a resource group or node, the icon for the cluster turns red and blinks, as well as the resource group or node icon.
The cluster status can be one of the following:
ACTIVE, which means the cluster is up and running and there is a valid FailSafe membership.
INACTIVE, which means the start FailSafe HA services task has not been run and there is no FailSafe membership.
ERROR, which means that some nodes are in a DOWN state; that is, the cluster should be running, but it is not.
UNKNOWN, which means that the state cannot be determined because FailSafe HA services are not running on the node performing the query.
If you minimize the GUI window, the minimized icon shows the current state of the cluster. When the cluster goes into the error state, the icon shows a red cluster.
The following tables show keys to the icons and states used in the FailSafe Manager GUI.
The full legend for component states in the FailSafe Cluster View is as follows:
| Icon | Entity |
|---|---|
| | Node |
| | Cluster |
| | Resource |
| | Resource group |
| | Resource type |
| | Failover policy |
| | Expanded tree |
| | Collapsed tree |
| Icon | State |
|---|---|
| | Inactive or unknown (HA services may not be active) |
| | Online-ready state for a resource group |
| | Healthy and active or online |
| (blinking) | Transitioning to healthy and active/online or transitioning to offline |
| | Maintenance mode, in which the resource is not monitored by FailSafe |
| (blinking red) | Problems with the component |
To query node and cluster status, use the following command:
show status of cluster clustername |
You can use cmgr to query the status of a resource or to contact the system controller at a node, as described in the following subsections.
To query a resource status, use the following command:
show status of resource resource_name of resource_type RT_name [in cluster clustername] |
If you have specified a default cluster, you do not need to specify a cluster when you use this command and it will show the status of the indicated resource in the default cluster.
To query the status of a resource group, you provide the name of the resource group and the cluster which includes the resource group. Resource group status includes the following components:
Resource group state
Resource group error state
Resource owner
These components are described in the following subsections.
If a node on which a resource group is online has a status of UNKNOWN, the status of the resource group is reported either as not available or as ONLINE-READY.
A resource group state can be one of the following:
When a resource group is ONLINE, its error status is continually being monitored. A resource group error status can be one of the following:
You can use the tree view to monitor the status of the resources in a FailSafe configuration:
Select View: Resources in Groups to see the resources organized by the groups to which they belong.
Select View: Groups owned by Nodes to see where the online groups are running. This view lets you observe failovers as they occur.
To query a resource group status, use the following cmgr command:
show status of resource_group RG_name [in cluster clustername] |
If you have specified a default cluster, you do not need to specify a cluster when you use this command and it will show the status of the indicated resource group in the default cluster.
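When the status is needed in a script, the individual values can be extracted from the command output. The following is a minimal sketch: rg_field is a hypothetical helper that reads the status text on standard input and assumes the "Label: value" lines (State, Error) seen in the example output in this chapter. How you capture the output of cmgr non-interactively is left open; check cmgr(1M) for the options your release supports.

```shell
# rg_field (hypothetical helper): read "show status of resource_group" output
# on standard input and print the value of the first "LABEL: value" line.
# Usage: ... | rg_field State
rg_field() {
    awk -v label="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)              # ignore leading indentation
            if (index(line, label ":") == 1) {
                value = substr(line, length(label) + 2)
                sub(/^[ \t]+/, "", value)         # trim around the value
                sub(/[ \t]+$/, "", value)
                print value
                exit
            }
        }
    '
}
```

Given the lines State: Online Ready and Error: No error, rg_field State prints Online Ready and rg_field Error prints No error.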
To query the status of a node, you provide the logical node name of the node. The node status can be one of the following:
When you start HA services, node states transition from INACTIVE to UP. A node state may also pass through UNKNOWN on the way (INACTIVE to UNKNOWN to UP).
You can use the cluster_status command to monitor the status of the nodes in the cluster.
You can use the GUI tree view to monitor the status of the clusters in a FailSafe configuration. Select View: Groups owned by Nodes to monitor the health of the default cluster, its resource groups, and the group's resources.
To query node status, use the following command:
show status of node nodename |
When FailSafe is running, you can determine whether the system controller on a node is responding with the following command:
admin ping node nodename |
This command uses the FailSafe daemons to test whether the system controller is responding.
You can verify reset connectivity on a node in a cluster even when the FailSafe daemons are not running by using the standalone option of the admin ping command:
admin ping standalone node nodename |
This command does not go through the FailSafe daemons, but calls the ping command directly to test whether the system controller on the indicated node is responding.
The haStatus script provides status and configuration information about clusters, nodes, resources, and resource groups in the configuration. This script is installed in the /var/cluster/cmgr-scripts directory. You can modify this script to suit your needs. See the haStatus(1M) man page for further information about this script.
The following examples show the output of the different options of the haStatus script.
# haStatus -help
Usage: haStatus [-a|-i] [-c clustername]
where,
    -a prints detailed cluster configuration information and cluster status.
    -i prints detailed cluster configuration information only.
    -c can be used to specify a cluster for which status is to be printed.
    “clustername” is the name of the cluster for which status is to be printed.
# haStatus
Tue Nov 30 14:12:09 PST 1999
Cluster test-cluster:
    Cluster state is ACTIVE.
Node hans2:
    State of machine is UP.
Node hans1:
    State of machine is UP.
Resource_group nfs-group1:
    State: Online
    Error: No error
    Owner: hans1
    Failover Policy: fp_h1_h2_ord_auto_auto
    Resources:
        /hafs1 (type: NFS)
        /hafs1/nfs/statmon (type: statd)
        150.166.41.95 (type: IP_address)
        /hafs1 (type: filesystem)
        havol1 (type: volume)
# haStatus -i
Tue Nov 30 14:13:52 PST 1999
Cluster test-cluster:
Node hans2:
    Logical Machine Name: hans2
    Hostname: hans2.engr.sgi.com
    Is FailSafe: true
    Is CXFS: false
    Nodeid: 32418
    Reset type: powerCycle
    System Controller: msc
    System Controller status: enabled
    System Controller owner: hans1
    System Controller owner device: /dev/ttyd2
    System Controller owner type: tty
    ControlNet Ipaddr: 192.26.50.15
    ControlNet HB: true
    ControlNet Control: true
    ControlNet Priority: 1
    ControlNet Ipaddr: 150.166.41.61
    ControlNet HB: true
    ControlNet Control: false
    ControlNet Priority: 2
Node hans1:
    Logical Machine Name: hans1
    Hostname: hans1.engr.sgi.com
    Is FailSafe: true
    Is CXFS: false
    Nodeid: 32645
    Reset type: powerCycle
    System Controller: msc
    System Controller status: enabled
    System Controller owner: hans2
    System Controller owner device: /dev/ttyd2
    System Controller owner type: tty
    ControlNet Ipaddr: 192.26.50.14
    ControlNet HB: true
    ControlNet Control: true
    ControlNet Priority: 1
    ControlNet Ipaddr: 150.166.41.60
    ControlNet HB: true
    ControlNet Control: false
    ControlNet Priority: 2
Resource_group nfs-group1:
    Failover Policy: fp_h1_h2_ord_auto_auto
        Version: 1
        Script: ordered
        Attributes: Auto_Failback Auto_Recovery
        Initial AFD: hans1 hans2
    Resources:
        /hafs1 (type: NFS)
        /hafs1/nfs/statmon (type: statd)
        150.166.41.95 (type: IP_address)
        /hafs1 (type: filesystem)
        havol1 (type: volume)
Resource /hafs1 (type NFS):
    export-info: rw,wsync
    filesystem: /hafs1
    Resource dependencies
        statd /hafs1/nfs/statmon
        filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
    InterfaceAddress: 150.166.41.95
    Resource dependencies
        IP_address 150.166.41.95
        filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
    NetworkMask: 0xffffff00
    interfaces: ef1
    BroadcastAddress: 150.166.41.255
    No resource dependencies
Resource /hafs1 (type filesystem):
    volume-name: havol1
    mount-options: rw,noauto
    monitor-level: 2
    Resource dependencies
        volume havol1
Resource havol1 (type volume):
    devname-group: sys
    devname-owner: root
    devname-mode: 666
    No resource dependencies
Failover_policy fp_h1_h2_ord_auto_auto:
    Version: 1
    Script: ordered
    Attributes: Auto_Failback Auto_Recovery
    Initial AFD: hans1 hans2
# haStatus -a
Tue Nov 30 14:45:30 PST 1999
Cluster test-cluster:
    Cluster state is ACTIVE.
Node hans2:
    State of machine is UP.
    Logical Machine Name: hans2
    Hostname: hans2.engr.sgi.com
    Is FailSafe: true
    Is CXFS: false
    Nodeid: 32418
    Reset type: powerCycle
    System Controller: msc
    System Controller status: enabled
    System Controller owner: hans1
    System Controller owner device: /dev/ttyd2
    System Controller owner type: tty
    ControlNet Ipaddr: 192.26.50.15
    ControlNet HB: true
    ControlNet Control: true
    ControlNet Priority: 1
    ControlNet Ipaddr: 150.166.41.61
    ControlNet HB: true
    ControlNet Control: false
    ControlNet Priority: 2
Node hans1:
    State of machine is UP.
    Logical Machine Name: hans1
    Hostname: hans1.engr.sgi.com
    Is FailSafe: true
    Is CXFS: false
    Nodeid: 32645
    Reset type: powerCycle
    System Controller: msc
    System Controller status: enabled
    System Controller owner: hans2
    System Controller owner device: /dev/ttyd2
    System Controller owner type: tty
    ControlNet Ipaddr: 192.26.50.14
    ControlNet HB: true
    ControlNet Control: true
    ControlNet Priority: 1
    ControlNet Ipaddr: 150.166.41.60
    ControlNet HB: true
    ControlNet Control: false
    ControlNet Priority: 2
Resource_group nfs-group1:
    State: Online
    Error: No error
    Owner: hans1
    Failover Policy: fp_h1_h2_ord_auto_auto
        Version: 1
        Script: ordered
        Attributes: Auto_Failback Auto_Recovery
        Initial AFD: hans1 hans2
    Resources:
        /hafs1 (type: NFS)
        /hafs1/nfs/statmon (type: statd)
        150.166.41.95 (type: IP_address)
        /hafs1 (type: filesystem)
        havol1 (type: volume)
Resource /hafs1 (type NFS):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    export-info: rw,wsync
    filesystem: /hafs1
    Resource dependencies
        statd /hafs1/nfs/statmon
        filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    InterfaceAddress: 150.166.41.95
    Resource dependencies
        IP_address 150.166.41.95
        filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    NetworkMask: 0xffffff00
    interfaces: ef1
    BroadcastAddress: 150.166.41.255
    No resource dependencies
Resource /hafs1 (type filesystem):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    volume-name: havol1
    mount-options: rw,noauto
    monitor-level: 2
    Resource dependencies
        volume havol1
Resource havol1 (type volume):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    devname-group: sys
    devname-owner: root
    devname-mode: 666
    No resource dependencies
# haStatus -c test-cluster
Tue Nov 30 14:42:04 PST 1999
Cluster test-cluster:
    Cluster state is ACTIVE.
Node hans2:
    State of machine is UP.
Node hans1:
    State of machine is UP.
Resource_group nfs-group1:
    State: Online
    Error: No error
    Owner: hans1
    Failover Policy: fp_h1_h2_ord_auto_auto
    Resources:
        /hafs1 (type: NFS)
        /hafs1/nfs/statmon (type: statd)
        150.166.41.95 (type: IP_address)
        /hafs1 (type: filesystem)
        havol1 (type: volume)
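Because haStatus output is plain text, it can be condensed for use in monitoring scripts. The following is a minimal sketch: ha_summary is a hypothetical filter, not part of the haStatus script, and it assumes that the real output places each header (Cluster, Node, Resource_group) and each field on its own line, with the wording shown in the examples above.

```shell
# ha_summary (hypothetical helper): condense haStatus output read from
# standard input into one "kind name state [owner=NODE]" line per component.
ha_summary() {
    awk '
        function emit() {
            if (kind != "") {
                line = kind " " name " " state
                if (owner != "") line = line " owner=" owner
                print line
            }
            kind = ""; name = ""; state = ""; owner = ""
        }
        # A header such as "Node hans2:" starts a new component.
        NF == 2 && ($1 == "Cluster" || $1 == "Node" || $1 == "Resource_group") && $2 ~ /:$/ {
            emit()
            kind = $1; name = $2; sub(/:$/, "", name)
            next
        }
        # Per-resource and policy blocks end the current component.
        ($1 == "Resource" || $1 == "Failover_policy") && /:$/ { emit(); next }
        kind == "Cluster" && /Cluster state is/   { state = $NF; sub(/\.$/, "", state) }
        kind == "Node" && /State of machine is/   { state = $NF; sub(/\.$/, "", state) }
        kind == "Resource_group" && $1 == "State:" { state = $0; sub(/^[ \t]*State:[ \t]*/, "", state) }
        kind == "Resource_group" && $1 == "Owner:" { owner = $2 }
        END { emit() }
    '
}
```

Piping the default output shown above through ha_summary (for example, haStatus | ha_summary) would give one line each for the cluster, the two nodes, and nfs-group1, the last ending in owner=hans1.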
The Embedded Support Partner (ESP) consists of a set of daemons that perform various monitoring activities. You can choose to configure ESP so that it will log FailSafe events (the FailSafe ESP event profile is not configured in ESP by default).
FailSafe uses an event class ID of 77 and a description of IRIS FailSafe2.
If you want to use ESP for FailSafe, enter the following command to add the failsafe2 event profile to ESP:
# espconfig -add eventprofile failsafe2 |
FailSafe will then log ESP events for the following:
Daemon configuration error
Failover policy configuration error
Resource group allocation (start) failure
Resource group failures:
Allocation (start) failure
Release (stop) failure
Monitoring failure
Exclusivity failure
Failover policy failure
Resource group status:
online
offline
maintenance_on
maintenance_off
FailSafe shutdown (HA services stopped)
FailSafe started (HA services started)
You can use the espreport(1M) or launchESPartner(1) commands to see the logged ESP events. See the esp(5) man page and the Embedded Support Partner User Guide for more information about ESP.
While a FailSafe system is running, you can move a resource group online to a particular node, or you can take a resource group offline. In addition, you can move a resource group from one node in a cluster to another node in a cluster. The following subsections describe these tasks.
This section describes how to bring a resource group online.
Before you bring a resource group online for the first time, you should run the diagnostic tests on that resource group. Diagnostics check system configurations and perform some validations that are not performed when you bring a resource group online.
You cannot bring a resource group online in the following circumstances:
If the resource group has no members
If the resource group is currently running in the cluster
To bring a resource group fully online, HA services must be active. When HA services are active, an attempt is made to allocate the resource group in the cluster. However, you can also execute a command to bring the resource group online when HA services are not active. When HA services are not active, the resource group is marked to be brought online when HA services become active; the resource group is then in an ONLINE-READY state. FailSafe tries to bring a resource group in an ONLINE-READY state online when HA services are started.
You can disable resource groups from coming online when HA services are started by using the GUI or cmgr to take the resource group offline, as described in “Take a Resource Group Offline”.
| Caution: Before bringing a resource group online in the cluster, you must be sure that the resource group is not running on a disabled node (where HA services are not running). Bringing a resource group online while it is running on a disabled node could cause data corruption. For information on detached resource groups, see “Take a Resource Group Offline”. |
Do the following:
Group to Bring Online: use the pull-down list to select the name of the resource group you want to bring online. The menu displays only resource groups that are not currently online.
Click on OK to complete the task.
To bring a resource group online, use the following command:
admin online resource_group RG_name [in cluster clustername] |
If you have specified a default cluster, you do not need to specify a cluster when you use this command.
cmgr> set cluster test-cluster
cmgr> admin online resource_group group1
FailSafe daemon (ha_fsd) is not running on this local node or it is not ready to accept admin commands.
Resource Group (group1) is online-ready.
Failed to admin:
online
admin command failed
cmgr> show status of resource_group group1 in cluster test-cluster
State: Online Ready
Error: No error
Check resource group group1 status in an active node if HA services are active in cluster |
This section tells you how to take a resource group offline.
When you take a resource group offline, FailSafe takes each resource in the resource group offline in a predefined order. If any single resource gives an error during this process, the process stops, leaving all remaining resources allocated.
You can take a FailSafe resource group offline in any of the following ways:
Take the resource group offline. This physically stops the processes for that resource group and does not reset any error conditions. If this operation fails, the resource group will be left online in an error state.
Force the resource group offline. This physically stops the processes for that resource group but resets any error conditions. This operation cannot fail.
Detach the resource group. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. This operation should rarely fail.
Detach the resource group and force the error state to be cleared. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. In addition, all error conditions of the resource group will be reset. This operation should rarely fail.
If you do not need to stop the resource group and do not want FailSafe to monitor the resource group while you make changes, but you would still like to have administrative control over the resource group (for instance, to move that resource group to another node), you can put the resource group in maintenance mode using the Suspend Monitoring a Resource Group task on the GUI or the admin maintenance_on command of cmgr, as described in “Suspend and Resume Monitoring of a Resource Group”.
If the ha_fsd daemon is not running or is not ready to accept client requests, executing this task disables the resource group in the cluster database only. The resource group remains online and the command fails.
Enter the following:
Detach Only: check this box to stop monitoring the resource group. The resource group will not be stopped, but FailSafe will not have any control over the group.
Detach Force: check this box to stop monitoring the resource group. The resource group will not be stopped, but FailSafe will not have any control over the group. In addition, FailSafe will clear all errors.
| Caution: The Detach Only and Detach Force settings leave the resource group's resources running on the node where the group was online. After stopping HA services on that node, do not bring the resource group online on another node in the cluster; doing so can cause data integrity problems. Instead, make sure that no resources are running on a node before stopping HA services on that node. |
Force Offline: check this box to stop all resources in the group and clear all errors.
Group to Take Offline: select the name of the resource group you want to take offline. The menu displays only resource groups that are currently online.
Click on OK to complete the task.
To take a resource group offline, use the following command:
admin offline resource_group RG_name [in cluster clustername] |
To take a resource group offline with the force option in effect, forcing FailSafe to complete the action even if there are errors, use the following command:
admin offline_force resource_group RG_name [in cluster clustername] |
To detach a resource group, use the following command:
admin offline_detach resource_group RG_name [in cluster clustername] |
To detach the resource group and force the error state to be cleared:
admin offline_detach_force resource_group RG_name [in cluster clustername] |
This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. In addition, all error conditions of the resource group will be reset. This operation should rarely fail.
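The four offline variants above differ only in the verb, so a script that prepares cmgr input can make the choice explicit. The following is a minimal sketch and not part of FailSafe: the function name offline_cmd and its mode keywords (normal, force, detach, detach_force) are hypothetical, and the function only prints the command string so that you can review it before entering it in cmgr.

```sh
#!/bin/sh
# Hypothetical helper: print the cmgr command that takes a resource group
# offline. Usage: offline_cmd MODE RG_NAME [CLUSTER]
# MODE keywords are our own labels, not cmgr syntax.
offline_cmd() {
    case "${1:-}" in
        normal)       verb="offline" ;;
        force)        verb="offline_force" ;;
        detach)       verb="offline_detach" ;;
        detach_force) verb="offline_detach_force" ;;
        *) echo "offline_cmd: unknown mode '${1:-}'" >&2; return 1 ;;
    esac
    if [ -z "${2:-}" ]; then
        echo "offline_cmd: resource group name required" >&2
        return 1
    fi
    cmd="admin $verb resource_group $2"
    # The cluster clause is shown as optional in the syntax above.
    if [ -n "${3:-}" ]; then
        cmd="$cmd in cluster $3"
    fi
    echo "$cmd"
}

# Example with placeholder names:
offline_cmd detach nfs-group1 test-cluster
```

Because the two detach modes leave resources running, keeping the mode as an explicit argument makes it harder to choose one of them by accident.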
This section tells you how to move a resource group.
While FailSafe is active, you can move a resource group to another node in the same cluster.
| Note: When you move a resource group in an active system, the command may appear to succeed even though the resource group remains online on the same node in the cluster. This can occur if the resource group fails to start on the node to which you are moving it. In this case, FailSafe will fail over the resource group to the next node in the application failover domain, which may be the node on which the resource group was originally running. Because FailSafe kept the resource group online, the command succeeds. |
Do the following:
Group to Move: select the name of the resource group to be moved. Only resource groups that are currently online are displayed in the menu.
Failover Domain Node: (optional) select the name of the node to which you want to move the resource group. If you do not specify a node, FailSafe will move the resource group to the next available node in the failover domain.
Click on OK to complete the task.
This section describes how to stop monitoring of a resource group in order to put it into maintenance mode.
You can temporarily stop FailSafe from monitoring a specific resource group, which puts the resource group in maintenance mode. The resource group remains on the same node in the cluster but is no longer monitored by FailSafe for resource failures.
You can put a resource group into maintenance mode if you do not want FailSafe to monitor the group for a period of time. You may want to do this for upgrade or testing purposes, or if there is any reason that FailSafe should not act on that resource group. When a resource group is in maintenance mode, it is not being monitored and it is not highly available. If the resource group's owner node fails, FailSafe will move the resource group to another node and resume monitoring.
When you put a resource group into maintenance mode, resources in the resource group are in ONLINE-MAINTENANCE state. The ONLINE-MAINTENANCE state for the resource is seen only on the node that has the resource online. All other nodes will show the resource as ONLINE. The resource group, however, should appear as being in ONLINE-MAINTENANCE state in all nodes.
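The visibility rule in the preceding paragraph can be summarized as a small lookup. This sketch is illustrative only: the function displayed_state and the node names are hypothetical, and it does not query FailSafe. It encodes which state you should expect to see from a given node while the group is in maintenance mode.

```sh
#!/bin/sh
# Hypothetical helper illustrating the rule described above.
# Usage: displayed_state KIND VIEWING_NODE OWNER_NODE
# KIND is "resource" or "group".
displayed_state() {
    case "${1:-}" in
        group)
            # The resource group appears in maintenance state on every node.
            echo "ONLINE-MAINTENANCE" ;;
        resource)
            # A resource shows maintenance state only on its owner node.
            if [ "${2:-}" = "${3:-}" ]; then
                echo "ONLINE-MAINTENANCE"
            else
                echo "ONLINE"
            fi ;;
        *) echo "displayed_state: unknown kind '${1:-}'" >&2; return 1 ;;
    esac
}

# A resource owned by node1, viewed from node2:
displayed_state resource node2 node1
```

A resource reported as ONLINE on a non-owner node is therefore not evidence that maintenance mode failed to take effect. Check the resource group state, or check from the owner node.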
Do the following:
Group to Stop Monitoring: select the name of the group you want to stop monitoring. Only those resource groups that are currently online and monitored are displayed in the menu.
Click OK to complete the task.
This task lets you resume monitoring for a resource group that FailSafe is not monitoring currently. (All resource groups that are in online state without error are monitored by default.)
Once monitoring is resumed, if the resource group or one of its resources fails, FailSafe will restart each failed component based on the failover policy (assuming that the restart action is enabled).
Perform the following steps:
Group to Start Monitoring: select the name of the group you want to start monitoring. Only those resource groups that are currently online and not monitored are displayed in the menu.
Click OK to complete the task.
To put a resource group into maintenance mode, use the following command:
admin maintenance_on resource_group RG_name [in cluster clustername] |
If you have specified a default cluster, you do not need to specify a cluster when you use this command.
You can stop the execution of FailSafe on all the nodes in a cluster or on a specified node only. See “Stop FailSafe HA Services” in Chapter 5.
You can use FailSafe to reset nodes in a cluster. This sends a reset command to the system controller port on the specified node. When the node is reset, other nodes in the cluster will detect this and remove the node from the active cluster, reallocating any resource groups that were allocated on that node onto a backup node. The backup node that is used depends on how you have configured your system.
After the node reboots, it will rejoin the cluster. Some resource groups might move back to the node, depending on how you have configured your system.
You can use the GUI to reset nodes in a cluster. This sends a reset command to the system controller port on the specified node. When the node is reset, other nodes in the cluster will detect the change and remove the node from the active cluster. When the node reboots, it will rejoin the FailSafe membership.
To reset a node, do the following:
Node to Reset: use the pull-down menu to select the node to be reset.
Click on OK to complete the task.
When FailSafe is running, you can reset a node with the following command:
admin reset node nodename |
This command uses the FailSafe daemons to reset the specified node.
You can reset a node in a cluster even when the FailSafe daemons are not running by using the standalone option of the admin reset command:
admin reset standalone node nodename |
This command does not go through the FailSafe daemons.
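The choice between the two reset forms depends only on whether the FailSafe daemons are running. The sketch below makes that choice explicit. It is an assumption-laden illustration: the function reset_cmd is hypothetical, and the caller must supply the daemon state (yes or no), because this section does not describe how to detect it.

```sh
#!/bin/sh
# Hypothetical helper: print the cmgr reset command for a node.
# Usage: reset_cmd NODENAME [DAEMONS_RUNNING]
# DAEMONS_RUNNING is "yes" (default) or "no"; the caller determines it.
reset_cmd() {
    node="${1:-}"
    daemons="${2:-yes}"
    if [ -z "$node" ]; then
        echo "reset_cmd: node name required" >&2
        return 1
    fi
    case "$daemons" in
        yes) echo "admin reset node $node" ;;             # goes through the daemons
        no)  echo "admin reset standalone node $node" ;;  # bypasses the daemons
        *)   echo "reset_cmd: expected yes or no, got '$daemons'" >&2; return 1 ;;
    esac
}

# Example with a placeholder node name:
reset_cmd node2 no
```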
The cmgr command provides scripts that you can use to back up and restore your configuration: cdbBackup and cdbRestore. These scripts are installed in the /usr/cluster/bin directory. You can modify these scripts to suit your needs.
The cdbBackup script, as provided, creates compressed tar files of the /var/cluster/cdb/cdb.db# directory and the /var/cluster/cdb.db file.
The cdbRestore script, as provided, restores the compressed tar files of the /var/cluster/cdb/cdb.db# directory and the /var/cluster/cdb.db file.
When you use the cdbBackup and cdbRestore scripts, follow these guidelines:
Run the cdbBackup and cdbRestore scripts only when no administrative commands are running. Running them while administrative commands are in progress could result in an inconsistent backup.
You must back up the configuration of each node in the cluster separately. The configuration information is different for each node, and all node-specific information is stored locally only.
Run the backup procedure whenever you change your configuration.
The backups of all nodes in the pool taken at the same time should be restored together.
Cluster and FailSafe processes should not be running when you restore your configuration.
| Note: In addition to the above restrictions, you should not perform a cdbDump while information is changing in the cluster database. Check the SYSLOG file for information to help determine when cluster database activity is occurring. As a rule of thumb, you should be able to perform a cdbDump if at least 15 minutes have passed since the last node joined the cluster or the last administration command was run. |
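The 15-minute rule of thumb in the note above is simple enough to encode as a guard in a backup wrapper. The following sketch assumes that you have already determined the time of the last cluster database activity (for example, by reading SYSLOG yourself) and can express it in seconds. The function name quiet_long_enough and the sample timestamps are hypothetical.

```sh
#!/bin/sh
# Rule of thumb from the note above: wait at least 15 minutes after the
# last node join or administration command.
QUIET_SECONDS=900

# Usage: quiet_long_enough LAST_ACTIVITY NOW   (both in seconds)
# Succeeds when at least QUIET_SECONDS have elapsed.
quiet_long_enough() {
    last="$1"
    now="$2"
    [ $((now - last)) -ge "$QUIET_SECONDS" ]
}

# Example with made-up timestamps that are exactly 15 minutes apart:
if quiet_long_enough 1000 1900; then
    echo "ok to dump"
else
    echo "wait"
fi
```

The guard implements only the time-based heuristic. It does not replace checking that no administrative commands are running.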
You should rotate the log files at least weekly so that your disk will not become full.
The following sections provide example scripts. You may want to consider placing an entry in the root crontab(1) to run such scripts periodically.
For information about log levels, see “Set Log Configuration” in Chapter 5.
You can use a script such as the following to copy all files to a new location.
#!/bin/sh
# Copy each FailSafe log into an archive directory, then restart the live log.
DATE=`/sbin/date +'%U-%a'`    # week of year and weekday name
LOG_DIR="/var/cluster/ha/log"
HOST=`/usr/bsd/hostname -s`
# Logs whose names do not include the hostname
LOG_FILES="cad_log cmond_log fs2d_log"
# Logs whose names end in _<hostname>
LOG_HFILES="cli cmsd crsd failsafe gcd ifd script srmd clconfd"
LOG_ARCH="$LOG_DIR/Old-Log"
if [ ! -d "$LOG_ARCH" ] ; then
    mkdir "$LOG_ARCH"
fi
for file in $LOG_FILES
do
    rm -f "${LOG_ARCH}/${file}-${DATE}"
    cp "${LOG_DIR}/${file}" "${LOG_ARCH}/${file}-${DATE}"
    # Truncate the live log, leaving a marker line
    echo "Log Rotation at `date`" > "${LOG_DIR}/${file}"
done
for file in $LOG_HFILES
do
    rm -f "${LOG_ARCH}/${file}_${HOST}-${DATE}"
    cp "${LOG_DIR}/${file}_${HOST}" "${LOG_ARCH}/${file}_${HOST}-${DATE}"
    echo "Log Rotation at `date`" > "${LOG_DIR}/${file}_${HOST}"
done |
The script can be executed as a cron(1) job to regularly clean up log files. This script rotates log files when HA services are active in the FailSafe cluster. Default log levels do not create large log files.
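The archive names produced by the script above come from the date format %U-%a (week of the year, then abbreviated weekday name). This short sketch prints the names that would be generated today. The HOST value node1 is a placeholder, and date is called without the /sbin path so that the sketch runs on other systems.

```sh
#!/bin/sh
# Reproduce the archive naming used by the rotation script above.
DATE=`date +'%U-%a'`   # same format string as the rotation script
HOST="node1"           # placeholder; the script uses `hostname -s`

# A log without a host suffix, and one with it:
echo "cmond_log-${DATE}"
echo "cmsd_${HOST}-${DATE}"
```

Because the name contains no year, an archive written on the same weekday of the same week number a year later replaces the earlier one, and running the script twice in one day also overwrites that day's archive. If you need longer retention, a different date format is one option to consider.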