Chapter 7. IRIS FailSafe System Operation

This chapter describes administrative tasks you perform to operate and monitor an IRIS FailSafe system. It describes how to perform tasks using the FailSafe Manager GUI and the cmgr(1M) command.


Note: It is recommended that all FailSafe administration be done from one node in the pool so that the latest copy of the database will be available even when there are network partitions.


Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C Console Support

On Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C systems, there is only one serial/USB port that provides both L1 and console support for the machine. In a FailSafe configuration, this port (the DB9 connector) is used for system reset. It is connected to a serial port in another node or to the Ethernet multiplexer.

To get access to console input and output, the console must be redirected to another serial port in the machine. Use the following procedure:

  1. Edit the /etc/inittab file to use an alternate serial port.

  2. Either issue an init q command or reboot.

For example, suppose you had the following in the /etc/inittab file (line breaks added for readability):

# on-board ports or on Challenge/Onyx MP machines, first IO4 board ports
t1:23:respawn:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
-c "exec /sbin/getty ttyd1 console"    # alt console
t2:23:off:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip 
-c "exec /sbin/getty -N ttyd2 co_9600"     # port 2   

You could change it to the following:

# on-board ports or on Challenge/Onyx MP machines, first IO4 board ports
t1:23:off:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
-c "exec /sbin/getty ttyd1 co_9600"        # port 1
t2:23:respawn:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
-c "exec /sbin/getty -N ttyd2 console" # alt console
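
After editing /etc/inittab, it can be useful to confirm which port now carries the console before you run init q. The following Python sketch is an illustration only and is not part of FailSafe or IRIX: the function name console_port and the regular expression are assumptions, and the sample entries are the ones shown above, joined onto single lines as they would appear in the real file.

```python
import re

# Captures the tty device and the gettydefs label that follow "getty"
# and any option flags (such as -N) in an inittab process field.
GETTY_RE = re.compile(r'getty\s+(?:-\S+\s+)*(\S+)\s+([^\s"]+)')

def console_port(inittab_text):
    """Return the tty of the respawn entry whose getty label is 'console'."""
    for line in inittab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split(":", 3)  # id:runlevels:action:process
        if len(fields) != 4 or fields[2] != "respawn":
            continue  # entries set to "off" are not active
        match = GETTY_RE.search(fields[3])
        if match and match.group(2) == "console":
            return match.group(1)
    return None

CAPS = "CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip"
before = (
    't1:23:respawn:/sbin/suattr -C %s -c "exec /sbin/getty ttyd1 console"\n'
    't2:23:off:/sbin/suattr -C %s -c "exec /sbin/getty -N ttyd2 co_9600"\n'
) % (CAPS, CAPS)
after = (
    't1:23:off:/sbin/suattr -C %s -c "exec /sbin/getty ttyd1 co_9600"\n'
    't2:23:respawn:/sbin/suattr -C %s -c "exec /sbin/getty -N ttyd2 console"\n'
) % (CAPS, CAPS)

print(console_port(before))  # prints ttyd1
print(console_port(after))   # prints ttyd2
```

The check only reads the text it is given; it does not modify the file or signal init.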


Caution: Redirecting the console by using the above method works only when IRIX is running. To access the console when IRIX is not running (miniroot), you must physically reconnect the machine: unplug the serial hardware reset cable from the console/L1 port and then connect the console cable.


System Operation Considerations

After a FailSafe command is started, it may partially complete even if you interrupt the command by typing Ctrl-c. If you halt the execution of a command this way, you may leave the cluster in an indeterminate state and you may need to use the various status commands to determine the actual state of the cluster and its components.

Two-Node Clusters: Single-Node Use

If you have a two-node cluster, you should create an emergency failover policy (step 1 in “Using a Single Node”) for each node in preparation for a time when it may need to run by itself. This situation can occur if the other node must stay down for maintenance or if it fails and cannot be brought up.


Caution: Without these emergency failover policies and the appropriate set of procedures, the surviving node will be in what is called the lonely state, meaning that it will never form a cluster by itself.


Using a Single Node

The following procedure describes the steps required to use just one node in the cluster.

  1. Create an emergency failover policy for each node. Each policy should look like the following example when the cmgr command is issued, where ActiveNode is the name of the node using the policy (in the examples, nodeA) and DownNode is the name of the nonfunctioning node (in the examples, nodeB):

    cmgr> show failover_policy emergency-ActiveNode
    
    Failover Policy: emergency-ActiveNode
    Version: 1
    Script: ordered
    Attributes: Controlled_Failback InPlace_Recovery 
    Initial AFD: ActiveNode

    For example, suppose you have two nodes, nodeA and nodeB. You would have two emergency failover policies:

    cmgr> show failover_policy emergency-nodeA
    
    Failover Policy: emergency-nodeA
    Version: 1
    Script: ordered
    Attributes: Controlled_Failback InPlace_Recovery 
    Initial AFD: nodeA
    
    cmgr> show failover_policy emergency-nodeB
    
    Failover Policy: emergency-nodeB
    Version: 1
    Script: ordered
    Attributes: Controlled_Failback InPlace_Recovery 
    Initial AFD: nodeB

    For more information, see “Define a Failover Policy” in Chapter 5.

    At this point, the procedure assumes that the cluster has one node that has tried to come up but is now in the lonely state. The other node is down. The procedure is for recovering from this point.

  2. Modify each resource group to use the appropriate single-node emergency failover policy, using the following cmgr commands or the GUI:

    modify resource_group RGname in cluster Clustername
    set failover_policy to emergency-ActiveNode

    For example, on nodeA:

    cmgr> set cluster test-cluster
    cmgr> modify resource_group group1
    Enter commands, when finished enter either "done" or "cancel"
    
    resource_group group1 ? set failover_policy to emergency-nodeA
    resource_group group1 ? done
    Successfully modified resource group group1

  3. Change the state of all of the resource groups to offline. The last known state of these groups was online before the machines went down. However, the resource groups are not actually online at this point: the cluster was booted with the active node in the lonely state because the other node is not functional. This step tells the database to label the state of the resource groups appropriately in preparation for later steps.

    Use the following command:

    admin offline_force resource_group RGname in cluster Clustername

    For example:

    cmgr> set cluster test-cluster
    cmgr> show resource_groups in test-cluster
    
    Resource Groups: 
            group1
            group2
    
    cmgr> admin offline_force resource_group group1
    cmgr> admin offline_force resource_group group2

  4. Force the stop of HA services for the down node:

    stop ha_services on DownNode for cluster Clustername force


    Note: This is a long-running task that might take a few minutes to complete. cmgr will provide intermediate task status.

    For example:

    cmgr> stop ha_services on nodeB for cluster test-cluster force

  5. Mark the resource groups as online in the database. When HA services are started in future steps, the services will come online using the emergency failover policies.

    admin online resource_group RGname in cluster Clustername

    For example:

    cmgr> set cluster test-cluster
    cmgr> admin online resource_group group1
    FailSafe daemon (ha_fsd) is not running on this local node or it is not ready to accept admin commands.
    Resource Group (group1) is online-ready.
    
    Failed to admin:
            online
    
    admin command failed
    
    cmgr> show status of resource_group group1 in cluster test-cluster
    
    State: Online Ready
    Error: No error
    Check resource group group1 status in an active node if HA services are active in cluster

  6. Start HA services on the active node:

    start ha_services on ActiveNode for cluster Clustername

    HA services are now active in this single node cluster. From this step through the rest of the recovery, services are active and there should be no downtime experienced.

    For example:

    cmgr> start ha_services on nodeA for cluster test-cluster

  7. Remove and reinitialize the database on the down node, which is now booted in multiuser mode:

    # cd /var/cluster/cdb
    # rm -rf /var/cluster/cdb/cdb*
    # /usr/cluster/bin/cdbreinit /var/cluster/cdb/cdb.db

    Wait for a few minutes and use the tail(1) command to watch the SYSLOG file for messages indicating that this node knows its identity and has joined the cluster. For example (line breaks added here for readability):

    # tail -f /var/adm/SYSLOG
    ...
    Dec 14 18:23:16 6D:DownNode cmond[1074]:  Notification can not be processed, local machine and 
      cluster name is not known.
    Dec 14 18:23:16 6D:DownNode cmond[1074]:  Local machine belongs to cluster Clustername.
    Dec 14 18:23:16 6D:DownNode cmond[1074]:  Local machine name is DownNode
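
Steps 2 through 6 of the procedure above must be entered in a fixed order, once per resource group where noted. The following Python sketch is an illustration only: it prints the cmgr commands in sequence so that you can review them before typing them at the cmgr> prompt. It does not run cmgr. The function name is an assumption, and the emergency-ActiveNode policy name follows the naming convention used in step 1.

```python
def single_node_recovery_commands(cluster, active_node, down_node, groups):
    """Return the cmgr commands for steps 2 through 6 as a list of strings."""
    policy = "emergency-%s" % active_node  # naming convention from step 1
    commands = ["set cluster %s" % cluster]
    # Step 2: point every resource group at the emergency failover policy.
    for group in groups:
        commands.append("modify resource_group %s" % group)
        commands.append("set failover_policy to %s" % policy)
        commands.append("done")
    # Step 3: label every resource group as offline in the database.
    for group in groups:
        commands.append("admin offline_force resource_group %s" % group)
    # Step 4: force the stop of HA services for the down node.
    commands.append(
        "stop ha_services on %s for cluster %s force" % (down_node, cluster))
    # Step 5: mark every resource group as online in the database.
    for group in groups:
        commands.append("admin online resource_group %s" % group)
    # Step 6: start HA services on the surviving node.
    commands.append(
        "start ha_services on %s for cluster %s" % (active_node, cluster))
    return commands

for command in single_node_recovery_commands(
        "test-cluster", "nodeA", "nodeB", ["group1", "group2"]):
    print(command)
```

As the step 5 example shows, admin online may report a failure because ha_fsd is not yet running, while still leaving the group in the Online Ready state; the sketch makes no attempt to model that output.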

Resuming Two-Node Use

To resume using the down node, do the following:

  1. Boot the down node into single-user mode.

  2. Within single-user mode, use the chkconfig(1M) command to set all cluster services to off. To list the cluster-related flags and their current settings:

    # chkconfig | grep cluster

  3. Boot the down node into multiuser mode.

  4. Start HA services for the down node:

    cmgr> start ha_services on node DownNode for cluster Clustername

  5. Use the tail(1) command to watch the SYSLOG file for messages that indicate the formerly down node is now in FailSafe membership and that services are in the UP state. Do not continue until you see messages such as the following:

    # tail -f /var/adm/SYSLOG
    ...
    blue node ActiveNode [81] : UP  incarnation 7   age 2:0   
    blue node DownNode [82] : UP  incarnation 1   age 1:0

  6. Modify the resource groups to restore the original failover policies they were using before the failure:

    modify resource_group RGname in cluster Clustername
    set failover_policy to OriginalFailoverPolicy


    Note: This only restores the configuration for the static environment. The runtime environment will still be using the single-node policy at this time.

    For example, if the normal failover policy was normal-fp:

    cmgr> set cluster test-cluster
    cmgr> modify resource_group group1
    Enter commands, when finished enter either "done" or "cancel"
    
    resource_group group1 ? set failover_policy to normal-fp
    resource_group group1 ? done
    Successfully modified resource group group1
    
    cmgr> modify resource_group group2
    Enter commands, when finished enter either "done" or "cancel"
    
    resource_group group2 ? set failover_policy to normal-fp
    resource_group group2 ? done
    Successfully modified resource group group2

  7. Perform an offline_detach command on the resource groups in the cluster. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. The resources will remain in service.


    Note: Doing an offline_detach operation on a resource group leaves the resources in the resource group running on the node. FailSafe does not track these resources any longer. You should take care to prevent resources from running on multiple nodes in the cluster.

    Because FailSafe is no longer monitoring the group after the offline_detach (or offline_detach_force) is executed, you should not allow FailSafe to recover on any node other than where it was running at the time the offline_detach was performed.

    This also means that no other nodes should be allowed to rejoin the FailSafe membership, especially if Auto_Recovery is set in the resource group's failover policy. This restriction is required because the failover policy scripts are run whenever there is a change in membership; rerunning the scripts could cause your previously offline detached resource group to start on a node other than the node where the offline_detach was performed.

    FailSafe policy scripts are run only on nodes where FailSafe is running (that is, nodes where HA services have been started). For example, suppose you have a four-node FailSafe cluster (with nodes A, B, C, and D), where nodes A, B, and C are in the UP state and node D is in the DOWN state. If resource group RG is made offline using the offline_detach or offline_detach_force command on node B and HA services are shut down on node B, node D should not rejoin the cluster before the resources in RG are stopped manually on node B. If node D rejoins the cluster, the resource group RG will be made online on node A, C, or D.

    Use the following command:

    admin offline_detach resource_group RGname [in cluster Clustername]

    For example:

    cmgr> admin offline_detach resource_group group1 in cluster test-cluster

    Show the status of the resource groups to be sure that they now show as offline.


    Note: The resources are still in service even though this command output shows them as offline.


    show status of resource_group RGname in cluster Clustername

    For example:

    cmgr> show status of resource_group group1 in cluster test-cluster

  8. Make the resource groups online in the cluster:

    admin online resource_group RGname in cluster Clustername

    For example:

    cmgr> admin online resource_group group1 in cluster test-cluster
    cmgr> admin online resource_group group2 in cluster test-cluster

  9. Move the resources back to their original nodes. Because our original policies included the InPlace_Recovery attribute, all of the resources have remained on the node that has been active throughout this process.

    admin move resource_group RGname in cluster Clustername to node PrimaryOwner

    For example:

    cmgr> admin move resource_group group1 in cluster test-cluster to node nodeB

If you run into errors after entering the admin move command, see “Ensuring that Resource Groups are Deallocated” in Chapter 9.
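
Step 5 of the procedure above tells you not to continue until SYSLOG shows both nodes in the UP state. The following Python sketch is an illustration only of how that check could be applied to lines you have captured from SYSLOG. The function names and the regular expression are assumptions based solely on the two sample lines shown in step 5; real messages may carry additional fields.

```python
import re

# Matches "node <name> [<id>] : <state>" as it appears in the sample lines.
NODE_STATE_RE = re.compile(r'\bnode\s+(\S+)\s+\[\d+\]\s*:\s*(\S+)')

def membership_states(log_lines):
    """Return a dict of node name to the most recent state seen in the lines."""
    states = {}
    for line in log_lines:
        match = NODE_STATE_RE.search(line)
        if match:
            states[match.group(1)] = match.group(2)  # later lines win
    return states

def all_up(log_lines, nodes):
    """Return True only if every named node was last reported as UP."""
    states = membership_states(log_lines)
    return all(states.get(node) == "UP" for node in nodes)

sample = [
    "blue node ActiveNode [81] : UP  incarnation 7   age 2:0",
    "blue node DownNode [82] : UP  incarnation 1   age 1:0",
]
print(all_up(sample, ["ActiveNode", "DownNode"]))      # prints True
print(all_up(sample[:1], ["ActiveNode", "DownNode"]))  # prints False
```

A node that has not appeared in the captured lines is treated as not UP, which matches the instruction to wait rather than continue.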

System Status

While the FailSafe system is running, you can monitor the FailSafe components to determine the state of each component. FailSafe allows you to view the system status in the following ways:

  • You can keep continuous watch on the state of a cluster using the cluster_status command or the GUI.

  • You can query the status of an individual resource group, node, or cluster using either the GUI or the cmgr command.

  • You can use the haStatus script provided with the cmgr command to see the status of all clusters, nodes, resources, and resource groups in the configuration.

The following sections describe the procedures for performing these tasks.

Monitoring System Status with cluster_status

You can use the cluster_status command to monitor the cluster using a curses(3X) interface. For example, the following shows a two-node cluster configured for FailSafe only, with the cluster_status help text displayed:

# /var/cluster/cmgr-scripts/cluster_status
* Cluster=nfs-cluster  FailSafe=ACTIVE  CXFS=Not Configured         08:45:12 
   Nodes =   hans2    hans1
FailSafe =      UP       UP
    CXFS =
       ResourceGroup           Owner           State           Error         

       bartest-group                         Offline        No error
       footest-group                         Offline        No error
             bar_rg2           hans2          Online        No error
          nfs-group1           hans2          Online        No error
              foo_rg           hans2          Online        No error


                 +-------+ cluster_status Help +--------+
                 |  on s  - Toggle Sound on event       |
                 |  on r  - Toggle Resource Group View  |
                 |  on c  - Toggle CXFS View            |
                 |     j  - Scroll up the selection     |
                 |     k  - Scroll down the selection   |
                 |   TAB  - Toggle RG or CXFS selection |
                 | ENTER  - View detail on selection    |
                 |     h  - Toggle help screen          |
                 |     q  - Quit cluster_status         |
                 +--- Press 'h' to remove help window --+


The above shows that a sound will be activated when a node or the cluster changes status. You can override the s setting by invoking cluster_status with the -m (mute) option. You can also use the arrow keys to scroll the selection.


Note: The cluster_status command can display no more than 128 CXFS filesystems.


Monitoring System Status with the GUI

The easiest way to keep a continuous watch on the state of a cluster is to use the GUI view area.

System components that are experiencing problems appear as blinking red icons. Components in transitional states also appear as blinking icons. If there is a problem in a resource group or node, the icon for the cluster turns red and blinks, as well as the resource group or node icon.

The cluster status can be one of the following:

  • ACTIVE, which means the cluster is up and running and there is a valid FailSafe membership.

  • INACTIVE, which means the start FailSafe HA services task has not been run and there is no FailSafe membership.

  • ERROR, which means that some nodes are in a DOWN state; that is, the cluster should be running, but it is not.

  • UNKNOWN, which means that the state cannot be determined because FailSafe HA services are not running on the node performing the query.

If you minimize the GUI window, the minimized icon shows the current state of the cluster. Green indicates FailSafe HA services active without an error, gray indicates FailSafe HA services are inactive, and red indicates an error state.
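
The relationship between the cluster states listed above and the minimized icon colors can be summarized as a lookup. The following Python sketch is an illustration only. Pairing ACTIVE with green, INACTIVE with gray, and ERROR with red is a reading of the paragraph above; the text does not state a color for UNKNOWN, so the sketch returns None for it. The names are assumptions.

```python
# Colors as described for the minimized GUI icon. UNKNOWN is deliberately
# absent because the description above does not assign it a color.
ICON_COLOR = {
    "ACTIVE": "green",
    "INACTIVE": "gray",
    "ERROR": "red",
}

def minimized_icon_color(cluster_state):
    """Return the icon color for a cluster state, or None if not documented."""
    return ICON_COLOR.get(cluster_state.strip().upper())

for state in ("ACTIVE", "INACTIVE", "ERROR", "UNKNOWN"):
    print(state, minimized_icon_color(state))
```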

Key to Icons and States

The following tables show keys to the icons and states used in the FailSafe Manager GUI.

The full legend for component states is as follows:

Table 7-1. Key to Icons

  Icon     Entity
  (icon)   Node
  (icon)   Cluster
  (icon)   Resource
  (icon)   Resource group
  (icon)   Resource type
  (icon)   Failover policy
  (icon)   Expanded tree
  (icon)   Collapsed tree


Table 7-2. Key to States

  Icon                   State
  (icon)                 Inactive or unknown (HA services may not be active)
  (icon)                 Online-ready state for a resource group
  (icon)                 Healthy and active or online
  (icon, blinking)       Transitioning to healthy and active/online or transitioning to offline
  (icon)                 Maintenance mode, in which the resource is not monitored by FailSafe
  (icon, blinking red)   Problems with the component


Querying Cluster Status with cmgr

To query node and cluster status, use the following command:

show status of cluster Clustername

Monitoring Resource and Reset Serial Line with cmgr

You can use cmgr to query the status of a resource or to contact the system controller at a node, as described in the following subsections.

Querying Resource Status with cmgr

To query a resource status, use the following command:

show status of resource Resourcename of resource_type RTname [in cluster Clustername]

If you have specified a default cluster, you do not need to specify a cluster when you use this command and it will show the status of the indicated resource in the default cluster.

Performing a ping of a System Controller with cmgr

To perform a ping(1M) operation on a system controller by providing the device name, use the following command:

admin ping dev_name devicename of dev_type Devicetype with sysctrl_type SystemControllerType

Resource Group Status

To query the status of a resource group, you provide the name of the resource group and the cluster that includes the resource group. Resource group status includes the following components:

  • Resource group state

  • Resource group error state

  • Resource owner

These components are described in the following subsections.

If a node that has a resource group online has a status of UNKNOWN, the status of the resource group will be either unavailable or ONLINE-READY.

Resource Group State

A resource group state can be one of the following:

ONLINE 

FailSafe is running on the local nodes. The resource group is allocated on a node in the cluster and is being monitored by FailSafe. It is fully allocated if there is no error; otherwise, some resources may not be allocated or some resources may be in an error state.

ONLINE-PENDING 

FailSafe is running on the local nodes and the resource group is in the process of being allocated. This is a transient state.

OFFLINE 

The resource group is not running or the resource group has been detached, regardless of whether FailSafe is running. When FailSafe starts up, it will not allocate this resource group.

OFFLINE-PENDING 

FailSafe is running on the local nodes and the resource group is in the process of being released (becoming offline). This is a transient state.

ONLINE-READY 

FailSafe is not running on the local node. When FailSafe starts up, it will attempt to bring this resource group online. No FailSafe process is running on the current node if this state is returned.

ONLINE-MAINTENANCE 

The resource group is allocated on a node in the cluster but it is not being monitored by FailSafe. If a node failure occurs while a resource group in ONLINE-MAINTENANCE state resides on that node, the resource group will be moved to another node and monitoring will resume. An administrator may move a resource group to an ONLINE-MAINTENANCE state for upgrade or testing purposes, or if there is any reason that FailSafe should not act on that resource for a period of time.

INTERNAL ERROR 

An internal FailSafe error has occurred and FailSafe does not know the state of the resource group. Error recovery is required. This could result from a memory error, bugs in a program, or communication problems.

DISCOVERY (EXCLUSIVITY)  

The resource group is in the process of going online if FailSafe can correctly determine whether any resource in the resource group is already allocated on all nodes in the resource group's failover domain. This is a transient state.

INITIALIZING 

FailSafe on the local node has yet to get any information about this resource group. This is a transient state.
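
When you poll resource group status from a script, it helps to know whether a reported state is one of those described above as transient, in which case querying again later is reasonable. The following Python sketch is an illustration only. The labels "transient", "settled", and "unrecognized" are assumptions, not FailSafe terms. The normalization accepts both the hyphenated names used in this section and the spaced form that cmgr prints (for example, Online Ready).

```python
def normalize(state):
    """Uppercase a state name and treat hyphens and runs of spaces alike."""
    return " ".join(state.upper().replace("-", " ").split())

# States that the descriptions above call "a transient state".
TRANSIENT = {normalize(s) for s in (
    "ONLINE-PENDING", "OFFLINE-PENDING",
    "DISCOVERY (EXCLUSIVITY)", "INITIALIZING",
)}

# The remaining states listed above.
SETTLED = {normalize(s) for s in (
    "ONLINE", "OFFLINE", "ONLINE-READY",
    "ONLINE-MAINTENANCE", "INTERNAL ERROR",
)}

def classify_state(state):
    """Return 'transient', 'settled', or 'unrecognized' for a state name."""
    name = normalize(state)
    if name in TRANSIENT:
        return "transient"
    if name in SETTLED:
        return "settled"
    return "unrecognized"

print(classify_state("ONLINE-PENDING"))  # prints transient
print(classify_state("Online Ready"))    # prints settled
```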

Resource Group Error State

When a resource group is ONLINE, its error status is continually being monitored. A resource group error status can be one of the following:

NO ERROR 

Resource group has no error.

INTERNAL ERROR - NOT RECOVERABLE 

An internal error occurred; notify SGI if this condition arises.

NODE UNKNOWN 

The node that had the resource group online is in an unknown state. This occurs when the node is not part of the cluster. The last known state of the resource group is ONLINE, but the system cannot talk to the node.

SRMD EXECUTABLE ERROR 

The start or stop action has failed for a resource in the resource group.

SPLIT RESOURCE GROUP (EXCLUSIVITY) 

FailSafe has determined that part of the resource group was running on at least two different nodes in the cluster.

NODE NOT AVAILABLE (EXCLUSIVITY) 

FailSafe has determined that one of the nodes in the resource group's failover domain was not in the FailSafe membership. FailSafe cannot bring the resource group online until that node is removed from the failover domain or HA services are started on that node.

MONITOR ACTIVITY UNKNOWN 

In the process of turning maintenance mode on or off, an error occurred. FailSafe can no longer determine if monitoring is enabled or disabled. Retry the operation. If the error continues, report the error to SGI.

NO AVAILABLE NODES 

A monitoring error has occurred on the last valid node in the FailSafe membership.
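
Only some of the error states above come with an explicit corrective action. The following Python sketch is an illustration only: it records those documented actions in a lookup table so that a monitoring script can print them next to the error. The names are assumptions, and states for which the text gives no action return None rather than a guess.

```python
# Actions paraphrased from the error state descriptions above. Error states
# whose description gives no corrective action are intentionally omitted.
DOCUMENTED_ACTIONS = {
    "INTERNAL ERROR - NOT RECOVERABLE":
        "Notify SGI.",
    "NODE NOT AVAILABLE (EXCLUSIVITY)":
        "Remove the node from the failover domain or start HA services "
        "on that node.",
    "MONITOR ACTIVITY UNKNOWN":
        "Retry the operation; if the error continues, report it to SGI.",
}

def documented_action(error_state):
    """Return the documented action for an error state, or None."""
    key = " ".join(error_state.upper().split())
    return DOCUMENTED_ACTIONS.get(key)

print(documented_action("Monitor Activity Unknown"))
print(documented_action("No error"))  # prints None
```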

Resource Owner

The resource owner is the logical node name of the node that currently owns the resource.

Monitoring Resource Group Status with GUI

You can use the view area to monitor the status of the resources in a FailSafe configuration:

  • Select View: Resources in Groups to see the resources organized by the groups to which they belong.

  • Select View: Groups owned by Nodes to see where the online groups are running. This view lets you observe failovers as they occur.

Querying Resource Group Status with cmgr

To query a resource group status, use the following cmgr command:

show status of resource_group RGname [in cluster Clustername]

If you have specified a default cluster, you do not need to specify a cluster when you use this command and it will show the status of the indicated resource group in the default cluster.

Node Status

To query the status of a node, you provide the logical node name of the node. The node status can be one of the following:

UP 

This node is part of the FailSafe membership.

DOWN 

This node is not part of the FailSafe membership (no heartbeats) and this node has been reset. This is a transient state.

UNKNOWN 

This node is not part of the FailSafe membership (no heartbeats) and this node has not been reset (reset attempt has failed).

INACTIVE 

HA services have not been started on this node.

When you start HA services, node states transition from INACTIVE to UP. A node state may also transition from INACTIVE to UNKNOWN to UP.
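
The two startup paths described above can be checked against a sequence of states observed while HA services start. The following Python sketch is an illustration only; the function name is an assumption. Repeated observations of the same state are collapsed before the comparison, because polling will normally see each state more than once.

```python
# The startup paths described above: directly to UP, or through UNKNOWN.
STARTUP_PATHS = {
    ("INACTIVE", "UP"),
    ("INACTIVE", "UNKNOWN", "UP"),
}

def is_documented_startup(observed_states):
    """Return True if the observed sequence follows a documented startup path."""
    collapsed = []
    for state in observed_states:
        if not collapsed or collapsed[-1] != state:
            collapsed.append(state)  # drop consecutive repeats
    return tuple(collapsed) in STARTUP_PATHS

print(is_documented_startup(["INACTIVE", "INACTIVE", "UP"]))            # True
print(is_documented_startup(["INACTIVE", "UNKNOWN", "UNKNOWN", "UP"]))  # True
print(is_documented_startup(["INACTIVE", "DOWN", "UP"]))                # False
```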

Monitoring Node Status with cluster_status

You can use the cluster_status command to monitor the status of the nodes in the cluster.

Monitoring Cluster Status with the GUI

You can use the GUI view area to monitor the status of the clusters in a FailSafe configuration. Select View: Groups owned by Nodes to monitor the health of the default cluster, its resource groups, and the group's resources.

Querying Node Status with cmgr

To query node status, use the following command:

show status of node nodename
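
The show status commands given in this chapter for clusters, resources, resource groups, and nodes share one shape. The following Python sketch is an illustration only: it builds the command text from the forms shown in this chapter and does not run cmgr. The function name is an assumption. Because the in cluster clause is shown here only for resources and resource groups, the sketch refuses to add it to the other forms rather than guess at syntax.

```python
def status_query(kind, name, resource_type=None, cluster=None):
    """Build a cmgr 'show status of ...' command string."""
    if kind not in ("cluster", "node", "resource", "resource_group"):
        raise ValueError("unsupported kind: %r" % kind)
    command = "show status of %s %s" % (kind, name)
    if kind == "resource":
        if resource_type is None:
            raise ValueError("a resource query needs a resource type")
        command += " of resource_type %s" % resource_type
    if cluster is not None:
        if kind not in ("resource", "resource_group"):
            raise ValueError("'in cluster' is not shown for %s queries" % kind)
        command += " in cluster %s" % cluster
    return command

print(status_query("cluster", "test-cluster"))
print(status_query("node", "nodeA"))
print(status_query("resource", "/hafs1", resource_type="NFS"))
print(status_query("resource_group", "group1", cluster="test-cluster"))
```

Omitting the cluster argument corresponds to relying on a default cluster, as described for the resource and resource group queries.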

Performing a ping of the System Controller with cmgr

When FailSafe is running, you can determine whether the system controller on a node is responding with the following command:

admin ping node nodename

This command uses the FailSafe daemons to test whether the system controller is responding.

You can verify reset connectivity on a node in a cluster even when the FailSafe daemons are not running by using the standalone option of the admin ping command:

admin ping standalone node nodename

This command does not go through the FailSafe daemons, but calls the ping(1) command directly to test whether the system controller on the indicated node is responding.

Viewing System Status with the haStatus Script

The haStatus script provides status and configuration information about clusters, nodes, resources, and resource groups in the configuration. This script is installed in the /var/cluster/cmgr-scripts directory. You can modify this script to suit your needs. See the haStatus(1M) man page for further information about this script.

The following examples show the output of the different options of the haStatus script.

# haStatus -help
Usage: haStatus [-a|-i] [-c clustername]
where,
 -a prints detailed cluster configuration information and cluster
status.
 -i prints detailed cluster configuration information only.
 -c can be used to specify a cluster for which status is to be printed.
 “clustername” is the name of the cluster for which status is to be
printed.
# haStatus
Tue Nov 30 14:12:09 PST 1999
Cluster test-cluster:
        Cluster state is ACTIVE.
Node hans2:
        State of machine is UP.
Node hans1:
        State of machine is UP.
Resource_group nfs-group1:
        State: Online
        Error: No error
        Owner: hans1
        Failover Policy: fp_h1_h2_ord_auto_auto
        Resources:
                /hafs1  (type: NFS)
                /hafs1/nfs/statmon      (type: statd)
                150.166.41.95   (type: IP_address)
                /hafs1  (type: filesystem)
                havol1  (type: volume)
# haStatus -i
Tue Nov 30 14:13:52 PST 1999
Cluster test-cluster:
Node hans2:
        Logical Machine Name: hans2
        Hostname: hans2.dept.company.com
        Is FailSafe: true
        Is CXFS: false
        Nodeid: 32418
        Reset type: powerCycle
        System Controller: msc
        System Controller status: enabled
        System Controller owner: hans1
        System Controller owner device: /dev/ttyd2
        System Controller owner type: tty
        ControlNet Ipaddr: 192.26.50.15
        ControlNet HB: true
        ControlNet Control: true
        ControlNet Priority: 1
        ControlNet Ipaddr: 150.166.41.61
        ControlNet HB: true
        ControlNet Control: false
        ControlNet Priority: 2
Node hans1:
        Logical Machine Name: hans1
        Hostname: hans1.dept.company.com
        Is FailSafe: true
        Is CXFS: false
        Nodeid: 32645
        Reset type: powerCycle
        System Controller: msc
        System Controller status: enabled
        System Controller owner: hans2
        System Controller owner device: /dev/ttyd2
        System Controller owner type: tty
        ControlNet Ipaddr: 192.26.50.14
        ControlNet HB: true
        ControlNet Control: true
        ControlNet Priority: 1
        ControlNet Ipaddr: 150.166.41.60
        ControlNet HB: true
        ControlNet Control: false
        ControlNet Priority: 2
Resource_group nfs-group1:
        Failover Policy: fp_h1_h2_ord_auto_auto
                Version: 1
                Script: ordered
                Attributes: Auto_Failback Auto_Recovery
                Initial AFD: hans1 hans2
        Resources:
                /hafs1  (type: NFS)
                /hafs1/nfs/statmon      (type: statd)
                150.166.41.95   (type: IP_address)
                /hafs1  (type: filesystem)
                havol1  (type: volume)
Resource /hafs1 (type NFS):
        export-info: rw,wsync
        filesystem: /hafs1
        Resource dependencies
        statd /hafs1/nfs/statmon
        filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
        InterfaceAddress: 150.166.41.95
        Resource dependencies
        IP_address 150.166.41.95
        filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
        NetworkMask: 0xffffff00
        interfaces: ef1
        BroadcastAddress: 150.166.41.255
        No resource dependencies
Resource /hafs1 (type filesystem):
        volume-name: havol1
        mount-options: rw,noauto
        monitor-level: 2
        Resource dependencies
        volume havol1
Resource havol1 (type volume):
        devname-group: sys
        devname-owner: root
        devname-mode: 666
        No resource dependencies
Failover_policy fp_h1_h2_ord_auto_auto:
        Version: 1
        Script: ordered
        Attributes: Auto_Failback Auto_Recovery
        Initial AFD: hans1 hans2
# haStatus -a
Tue Nov 30 14:45:30 PST 1999
Cluster test-cluster:
        Cluster state is ACTIVE.
Node hans2:
        State of machine is UP.
        Logical Machine Name: hans2
        Hostname: hans2.dept.company.com
        Is FailSafe: true
        Is CXFS: false
        Nodeid: 32418
        Reset type: powerCycle
        System Controller: msc
        System Controller status: enabled
        System Controller owner: hans1
        System Controller owner device: /dev/ttyd2
        System Controller owner type: tty
        ControlNet Ipaddr: 192.26.50.15
        ControlNet HB: true
        ControlNet Control: true
        ControlNet Priority: 1
        ControlNet Ipaddr: 150.166.41.61
        ControlNet HB: true
        ControlNet Control: false
        ControlNet Priority: 2
Node hans1:
        State of machine is UP.
        Logical Machine Name: hans1
        Hostname: hans1.dept.company.com
        Is FailSafe: true
        Is CXFS: false
        Nodeid: 32645
        Reset type: powerCycle
        System Controller: msc
        System Controller status: enabled
        System Controller owner: hans2
        System Controller owner device: /dev/ttyd2
        System Controller owner type: tty
        ControlNet Ipaddr: 192.26.50.14
        ControlNet HB: true
        ControlNet Control: true
        ControlNet Priority: 1
        ControlNet Ipaddr: 150.166.41.60
        ControlNet HB: true
        ControlNet Control: false
        ControlNet Priority: 2
Resource_group nfs-group1:
        State: Online
        Error: No error
        Owner: hans1
        Failover Policy: fp_h1_h2_ord_auto_auto
                Version: 1
                Script: ordered
                Attributes: Auto_Failback Auto_Recovery
                Initial AFD: hans1 hans2
        Resources:
                /hafs1  (type: NFS)
                /hafs1/nfs/statmon      (type: statd)
                150.166.41.95   (type: IP_address)
                /hafs1  (type: filesystem)
                havol1  (type: volume)
Resource /hafs1 (type NFS):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        export-info: rw,wsync
        filesystem: /hafs1
        Resource dependencies
        statd /hafs1/nfs/statmon
        filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        InterfaceAddress: 150.166.41.95
        Resource dependencies
        IP_address 150.166.41.95
        filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        NetworkMask: 0xffffff00
        interfaces: ef1
        BroadcastAddress: 150.166.41.255
        No resource dependencies
Resource /hafs1 (type filesystem):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        volume-name: havol1
        mount-options: rw,noauto
        monitor-level: 2
        Resource dependencies
        volume havol1
Resource havol1 (type volume):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        devname-group: sys
        devname-owner: root
        devname-mode: 666
        No resource dependencies
# haStatus -c test-cluster
Tue Nov 30 14:42:04 PST 1999
Cluster test-cluster:
        Cluster state is ACTIVE.
Node hans2:
        State of machine is UP.
Node hans1:
        State of machine is UP.
Resource_group nfs-group1:
        State: Online
        Error: No error
        Owner: hans1
        Failover Policy: fp_h1_h2_ord_auto_auto
        Resources:
                /hafs1  (type: NFS)
                /hafs1/nfs/statmon      (type: statd)
                150.166.41.95   (type: IP_address)
                /hafs1  (type: filesystem)
                havol1  (type: volume)
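
The summary output lends itself to simple scripted checks. The following is a minimal sketch and is not part of FailSafe: a hypothetical helper, rg_state, reads haStatus-style text on standard input and prints the state and owner of one resource group. It assumes the layout shown in the output above (an unindented Resource_group line followed by indented State: and Owner: lines), which you should verify on your own system.

```shell
#!/bin/sh
# rg_state NAME: read haStatus-style text on stdin and print "STATE OWNER"
# for the named resource group. Hypothetical helper; it assumes the
# "Resource_group NAME:" / "State:" / "Owner:" layout shown above and
# keeps only the first word of the state.
rg_state() {
    awk -v rg="$1" '
        # Any unindented line starts a new section; note whether it is ours.
        /^[^ \t]/ { in_rg = ($1 == "Resource_group" && $2 == rg ":") }
        in_rg && $1 == "State:" { state = $2 }
        in_rg && $1 == "Owner:" { owner = $2 }
        END { if (state != "") print state, owner }
    '
}

# Sample input copied from the haStatus -c output above.
sample='Cluster test-cluster:
        Cluster state is ACTIVE.
Node hans2:
        State of machine is UP.
Resource_group nfs-group1:
        State: Online
        Error: No error
        Owner: hans1
        Failover Policy: fp_h1_h2_ord_auto_auto'

result=`printf '%s\n' "$sample" | rg_state nfs-group1`
echo "$result"
```

On a live system, the same function could be fed from the command itself, for example by piping haStatus -c test-cluster into rg_state nfs-group1.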

Embedded Support Partner (ESP) Logging of FailSafe Events

The Embedded Support Partner (ESP) consists of a set of daemons that perform various monitoring activities. You can choose to configure ESP so that it will log FailSafe events (the FailSafe ESP event profile is not configured in ESP by default).

FailSafe uses an event class ID of 77 and a description of IRIS FailSafe2.

If you want to use ESP for FailSafe, enter the following command to add the failsafe2 event profile to ESP:

# espconfig -add eventprofile failsafe2

FailSafe will then log ESP events for the following:

  • Daemon configuration error

  • Failover policy configuration error

  • Resource group allocation (start) failure

  • Resource group failures:

    • Allocation (start) failure

    • Release (stop) failure

    • Monitoring failure

    • Exclusivity failure

    • Failover policy failure

  • Resource group status:

    • online

    • offline

    • maintenance_on

    • maintenance_off

  • FailSafe shutdown (HA services stopped)

  • FailSafe started (HA services started)

You can use the espreport(1M) or launchESPartner(1) commands to see the logged ESP events. See the esp(5) man page and the Embedded Support Partner User Guide for more information about ESP.

Resource Group Failover

While a FailSafe system is running, you can move a resource group online to a particular node, or you can take a resource group offline. In addition, you can move a resource group from one node in a cluster to another node in a cluster. The following subsections describe these tasks.

Bring a Resource Group Online

This section describes how to bring a resource group online.

Bring a Resource Group Online with the GUI

Before you bring a resource group online for the first time, you should run the diagnostic tests on that resource group. Diagnostics check system configurations and perform some validations that are not performed when you bring a resource group online.

You cannot bring a resource group online in the following circumstances:

  • If the resource group has no members

  • If the resource group is currently running in the cluster

To bring a resource group fully online, HA services must be active. When HA services are active, an attempt is made to allocate the resource group in the cluster. However, you can also execute a command to bring the resource group online when HA services are not active. When HA services are not active, the resource group is marked to be brought online when HA services become active; the resource group is then in an ONLINE-READY state. FailSafe tries to bring a resource group in an ONLINE-READY state online when HA services are started.

You can disable resource groups from coming online when HA services are started by using the GUI or cmgr to take the resource group offline, as described in “Take a Resource Group Offline”.


Caution: Before bringing a resource group online in the cluster, you must be sure that the resource group is not running on a disabled node (where HA services are not running). Bringing a resource group online while it is running on a disabled node could cause data corruption. For information on detached resource groups, see “Take a Resource Group Offline”.

Do the following:

  1. Group to Bring Online: select the name of the resource group you want to bring online. The menu displays only resource groups that are not currently online.

  2. Click on OK to complete the task.

Bring a Resource Group Online with cmgr

To bring a resource group online, use the following command:

admin online resource_group RGname [in cluster Clustername]

If you have specified a default cluster, you do not need to specify a cluster when you use this command.

For example:

cmgr> set cluster test-cluster
cmgr> admin online resource_group group1
FailSafe daemon (ha_fsd) is not running on this local node or it is not ready to accept admin commands.
Resource Group (group1) is online-ready.

Failed to admin:
        online

admin command failed

cmgr> show status of resource_group group1 in cluster test-cluster

State: Online Ready
Error: No error
Check resource group group1 status in an active node if HA services are active in cluster
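
As the example shows, a request to bring a group online can leave it in the Online Ready state rather than Online. The following is a minimal sketch of how a script might tell the two apart. The helper name classify_rg_status is hypothetical, and the sketch assumes the State: line format shown in the status output above.

```shell
#!/bin/sh
# classify_rg_status: read "show status of resource_group" text on stdin
# and print one word: online, online-ready, or other.
# Hypothetical helper; it assumes a "State: ..." line as shown above.
classify_rg_status() {
    awk '
        $1 == "State:" {
            # Keep everything after the label, for example "Online Ready".
            sub(/^[ \t]*State:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            state = $0
        }
        END {
            if (state == "Online") print "online"
            else if (state == "Online Ready") print "online-ready"
            else print "other"
        }
    '
}

# Sample input copied from the status output above.
sample='State: Online Ready
Error: No error'

status=`printf '%s\n' "$sample" | classify_rg_status`
echo "$status"
```

A result of online-ready means that the group is only marked to come online and will be allocated when HA services are started.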

Take a Resource Group Offline

This section tells you how to take a resource group offline.

Take a Resource Group Offline with the GUI

When you take a resource group offline, FailSafe takes each resource in the resource group offline in a predefined order. If any single resource gives an error during this process, the process stops, leaving all remaining resources allocated.

You can take a FailSafe resource group offline in any of the following ways:

  • Take the resource group offline. This physically stops the processes for that resource group and does not reset any error conditions. If this operation fails, the resource group will be left online in an error state.

  • Force the resource group offline. This physically stops the processes for that resource group but resets any error conditions. This operation cannot fail.

  • Detach the resource group. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. This operation should rarely fail.

  • Detach the resource group and force the error state to be cleared. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. In addition, all error conditions of the resource group will be reset. This operation should rarely fail.

If you do not need to stop the resource group and do not want FailSafe to monitor the resource group while you make changes, but you would still like to have administrative control over the resource group (for instance, to move that resource group to another node), you can put the resource group in maintenance mode using the Suspend Monitoring a Resource Group task on the GUI or the admin maintenance_on command of cmgr, as described in “Suspend and Resume Monitoring of a Resource Group”.

If the ha_fsd daemon is not running or is not ready to accept client requests, executing this task disables the resource group in the cluster database only. The resource group remains online and the command fails.

Enter the following:

  1. Detach Only: check this box to stop monitoring the resource group. The resource group will not be stopped, but FailSafe will not have any control over the group.

  2. Detach Force: check this box to stop monitoring the resource group. The resource group will not be stopped, but FailSafe will not have any control over the group. In addition, FailSafe will clear all errors.


    Caution: The Detach Only and Detach Force settings leave the resource group's resources running on the node where the group was online. After stopping HA services on that node, do not bring the resource group online on another node in the cluster; doing so can cause data integrity problems. Instead, make sure that no resources are running on a node before stopping HA services on that node.


  3. Force Offline: check this box to stop all resources in the group and clear all errors.

  4. Group to Take Offline: select the name of the resource group you want to take offline. The menu displays only resource groups that are currently online.

  5. Click on OK to complete the task.

Take a Resource Group Offline with cmgr

To take a resource group offline, use the following command:

admin offline resource_group RGname [in cluster Clustername]

To take a resource group offline with the force option in effect, forcing FailSafe to complete the action even if there are errors, use the following command:

admin offline_force resource_group RGname [in cluster Clustername]


Note: Doing an offline_force operation on a resource group can leave resources in the resource group running on the cluster. The offline_force operation will succeed even though all resources in the resource group have not been stopped. FailSafe does not track these resources any longer. You should take care to prevent resources from running on multiple nodes in the cluster.

To detach a resource group, use the following command:

admin offline_detach resource_group RGname [in cluster Clustername]

To detach the resource group and force the error state to be cleared:

admin offline_detach_force resource_group RGname [in cluster Clustername]

This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. In addition, all error conditions of the resource group will be reset. This operation should rarely fail.
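
The four offline variants differ only in the admin keyword. The following is a minimal sketch that maps a mode name to the matching command text. The helper offline_cmd and its mode names (stop, force, detach, detach-force) are hypothetical and exist only for illustration. The helper prints the command and does not run cmgr.

```shell
#!/bin/sh
# offline_cmd MODE RGNAME [CLUSTER]: print the cmgr command line for one of
# the four offline variants described above. Hypothetical helper; it only
# builds the text and does not run cmgr.
offline_cmd() {
    case "$1" in
        stop)         verb=offline ;;
        force)        verb=offline_force ;;
        detach)       verb=offline_detach ;;
        detach-force) verb=offline_detach_force ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    cmd="admin $verb resource_group $2"
    # The cluster clause is optional when a default cluster has been set.
    if [ -n "$3" ]; then
        cmd="$cmd in cluster $3"
    fi
    echo "$cmd"
}

offline_cmd detach nfs-group1 test-cluster
offline_cmd force nfs-group1
```

The printed lines can be entered at the cmgr> prompt. Keeping the mode explicit in a script makes it harder to use a detach or force variant by accident, which matters because those variants can leave resources running.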

Move a Resource Group

This section tells you how to move a resource group.

Move a Resource Group with the GUI

While FailSafe is active, you can move a resource group to another node in the same cluster.


Note: When you move a resource group in an active system, the command may appear to succeed even though the resource group remains online on the same node in the cluster. This can occur if the resource group fails to start on the node to which you are moving it. In this case, FailSafe fails over the resource group to the next node in the application failover domain, which may be the node on which the resource group was originally running. Because FailSafe kept the resource group online, the command succeeds.

Enter the following:

  1. Group to Move: select the name of the resource group to be moved. Only resource groups that are currently online are displayed in the menu.

    Click Next to move to the next page.

  2. Failover Domain Node: (optional) select the name of the node to which you want to move the resource group. If you do not specify a node, FailSafe will move the resource group to the next available node in the failover domain.

  3. Click on OK to complete the task.

Move a Resource Group with cmgr

To move a resource group to another node, use the following command:

admin move resource_group RGname [in cluster Clustername] [to node Nodename]

For example, to move resource group nfs-group1 running on node primary to node backup in the cluster nfs-cluster, do the following:

cmgr> admin move resource_group nfs-group1 in cluster nfs-cluster to node backup

If you do not specify a node, the resource group's failover policy is used to determine the destination node for the resource group.

If you run into errors after entering the admin move command, see “Ensuring that Resource Groups are Deallocated” in Chapter 9.
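
Because a move can report success while the group stays on its original node, it can be worth comparing the owner before and after the move. The following is a minimal sketch of that comparison. The helper check_move is hypothetical, and the owner names are assumed to come from the Owner: line of the status output, taken once before and once after the admin move command.

```shell
#!/bin/sh
# check_move OLD_OWNER NEW_OWNER [TARGET]: compare the owner reported before
# and after an admin move. Hypothetical helper for illustration only.
# Returns 0 if the group moved (to TARGET, when one is given),
# 1 if it is still on the old owner, 2 if it landed on some other node.
check_move() {
    old=$1
    new=$2
    target=$3
    if [ "$new" = "$old" ]; then
        echo "unchanged: group is still on $old"
        return 1
    fi
    if [ -n "$target" ] && [ "$new" != "$target" ]; then
        echo "elsewhere: group is on $new, not $target"
        return 2
    fi
    echo "moved: group is now on $new"
}

check_move primary backup backup
check_move primary primary backup || echo "exit status $?"
```

An unchanged result after a move that reported success suggests that the group failed to start on the destination and was failed over again.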

Suspend and Resume Monitoring of a Resource Group

This section describes how to stop monitoring of a resource group in order to put it into maintenance mode.

Suspend Monitoring a Resource Group with the GUI

You can temporarily stop FailSafe from monitoring a specific resource group, which puts the resource group in maintenance mode. The resource group remains on the same node in the cluster but is no longer monitored by FailSafe for resource failures.

You can put a resource group into maintenance mode if you do not want FailSafe to monitor the group for a period of time. You may want to do this for upgrade or testing purposes, or if there is any reason that FailSafe should not act on that resource group. When a resource group is in maintenance mode, it is not being monitored and it is not highly available. If the resource group's owner node fails, FailSafe will move the resource group to another node and resume monitoring.

When you put a resource group into maintenance mode, resources in the resource group are in ONLINE-MAINTENANCE state. The ONLINE-MAINTENANCE state for the resource is seen only on the node that has the resource online. All other nodes will show the resource as ONLINE. The resource group, however, should appear as being in ONLINE-MAINTENANCE state in all nodes.

Do the following:

  1. Group to Stop Monitoring: select the name of the group you want to stop monitoring. Only those resource groups that are currently online and monitored are displayed in the menu.

  2. Click OK to complete the task.

Resume Monitoring of a Resource Group with the GUI

This task lets you resume monitoring a resource group.

Once monitoring is resumed and assuming that the restart action is enabled, if the resource group or one of its resources fails, FailSafe will restart each failed component based on the failover policy.

Perform the following steps:

  1. Group to Start Monitoring: select the name of the group you want to start monitoring. Only those resource groups that are currently online and not monitored are displayed in the menu.

  2. Click OK to complete the task.

Putting a Resource Group into Maintenance Mode with cmgr

To put a resource group into maintenance mode, use the following command:

admin maintenance_on resource_group RGname [in cluster Clustername]

If you have specified a default cluster, you do not need to specify a cluster when you use this command.

Resume Monitoring of a Resource Group with cmgr

To move a resource group back online from maintenance mode, use the following command:

admin maintenance_off resource_group RGname [in cluster Clustername]

Stopping FailSafe

You can stop the execution of FailSafe on all the nodes in a cluster or on a specified node only. See “Stop FailSafe HA Services” in Chapter 5.

Resetting Nodes

You can use FailSafe to reset nodes in a cluster. This sends a reset command to the system controller port on the specified node. When the node is reset, other nodes in the cluster will detect this and remove the node from the active cluster, reallocating any resource groups that were allocated on that node onto a backup node. The backup node that is used depends on how you have configured your system.

After the node reboots, it will rejoin the cluster. Some resource groups might move back to the node, depending on how you have configured your system.

Reset a Node with the GUI

You can use the GUI to reset nodes in a cluster. This sends a reset command to the system controller port on the specified node. When the node is reset, other nodes in the cluster will detect the change and remove the node from the active cluster. When the node reboots, it will rejoin the FailSafe membership.

To reset a node, do the following:

  1. Node to Reset: select the node to be reset.

  2. Click on OK to complete the task.

Reset a Node with cmgr

When FailSafe is running, you can reboot a node with the following command:

admin reset node nodename

This command uses the FailSafe daemons to reset the specified node.

You can reset a node in a cluster even when the FailSafe daemons are not running by using the standalone option of the admin reset command:

admin reset standalone node nodename

This command does not go through the FailSafe daemons.

Backing Up and Restoring Configuration with cmgr

The cmgr command provides scripts that you can use to back up and restore your configuration: cdbBackup and cdbRestore. These scripts are installed in the /usr/cluster/bin directory. You can modify these scripts to suit your needs.

The cdbBackup script, as provided, creates compressed tar(1) files of the /var/cluster/cdb/cdb.db# directory and the /var/cluster/cdb.db file.

The cdbRestore script, as provided, restores the compressed tar files of the /var/cluster/cdb/cdb.db# directory and the /var/cluster/cdb.db file.

When you use the cdbBackup and cdbRestore scripts, follow these guidelines:

  • Run the cdbBackup and cdbRestore scripts only when no administrative commands are running. If you run the scripts at the same time as the administrative commands, the result could be an inconsistent backup.

  • Back up the configuration of each node in the cluster separately. The configuration information is different for each node, and all node-specific information is stored locally only.

  • Run the backup procedure whenever you change your configuration.

  • Restore the backups of all nodes in the pool together, using backups that were taken at the same time.

  • Do not run cluster and FailSafe processes while you restore the configuration.

  • Do not perform a cdbDump while information is changing in the cluster database. Check the SYSLOG file for information to help determine when cluster database activity is occurring. As a rule of thumb, you should be able to perform a cdbDump if at least 15 minutes have passed since the last node joined the cluster or the last administration command was run.
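
The 15-minute rule of thumb can be expressed as a small guard in a backup wrapper. The following is a minimal sketch. The helper quiet_long_enough is hypothetical, and it assumes you have already obtained the time of the last node join or administration command (for example, by reading the SYSLOG file) as seconds since the epoch. How that timestamp is obtained is left open here.

```shell
#!/bin/sh
# Minimum quiet period before dumping the cluster database:
# 15 minutes, per the rule of thumb above.
QUIET_SECONDS=900

# quiet_long_enough LAST_ACTIVITY NOW: both arguments are seconds since the
# epoch. Succeeds when at least QUIET_SECONDS have elapsed.
# Hypothetical helper for illustration only.
quiet_long_enough() {
    elapsed=$(( $2 - $1 ))
    [ "$elapsed" -ge "$QUIET_SECONDS" ]
}

# Illustrative timestamps only.
if quiet_long_enough 1000 1900; then
    echo "quiet for 15 minutes or more: reasonable to proceed"
fi
if ! quiet_long_enough 1000 1500; then
    echo "recent activity: wait before dumping the database"
fi
```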

Log File Management

You should rotate the log files at least weekly so that your disk will not become full.

The following sections provide example scripts. You may want to consider placing an entry in the root crontab(1) to run such scripts periodically.

For information about log levels, see “Set Log Configuration” in Chapter 5.

Rotating All Log Files

You can use a script such as the following to copy all files to a new location.

#!/bin/sh

DATE=`/sbin/date +'%U-%a'`
LOG_DIR="/var/cluster/ha/log"
HOST=`/usr/bsd/hostname -s`
LOG_FILES="cad_log cmond_log fs2d_log"
LOG_HFILES="cli cmsd crsd failsafe gcd ifd script srmd clconfd"

LOG_ARCH=$LOG_DIR"/Old-Log"

if [ ! -d $LOG_ARCH ] ; then
   mkdir $LOG_ARCH
fi

for file in $LOG_FILES
do

  rm -f ${LOG_ARCH}/${file}-${DATE}
  cp ${LOG_DIR}/${file} ${LOG_ARCH}/${file}-${DATE}
  echo "Log Rotation at `date`" > ${LOG_DIR}/${file}
done

for file in $LOG_HFILES
do

  rm -f ${LOG_ARCH}/${file}_${HOST}-${DATE}
  cp ${LOG_DIR}/${file}_${HOST} ${LOG_ARCH}/${file}_${HOST}-${DATE}
  echo "Log Rotation at `date`" > ${LOG_DIR}/${file}_${HOST}
done

The script can be executed as a cron(1M) job to regularly clean up log files. This script rotates log files when HA services are active in the FailSafe cluster. Default log levels do not create large log files.
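
For example, assuming that the script above has been saved as /usr/local/bin/rotate_ha_logs (a hypothetical path) and made executable, an entry such as the following in the root crontab would run it at 02:00 every Sunday:

```text
# minute hour day-of-month month day-of-week command
0 2 * * 0 /usr/local/bin/rotate_ha_logs
```

Because the script names each archived copy with the week number and the day of the week, a weekly run produces one distinct set of archive files per week and overwrites them a year later.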