Chapter 5. Installing and Testing Scripts

This chapter describes how to name and install new scripts and how to test them. It also provides tips on how to debug problems that you may encounter.

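The major sections in this chapter are as follows:

  • Naming and Installing Monitoring Scripts

  • Choosing the Execution Order of Failover Scripts for Each Operation

  • Installing Failover Scripts

  • Modifying Application Startup Procedures

  • Testing New Scripts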

Naming and Installing Monitoring Scripts

Install monitoring scripts in the directory /var/ha/actions with owner root, group sys, and mode 700. Local monitoring scripts have a name of this form:

ha_service_lmon 

Remote monitoring scripts have a name of this form:

ha_service_rmon 
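
The naming and permission rules above can be sketched as two small sh functions. This is a minimal sketch, not part of IRIS FailSafe: the function names monitor_script_name and install_monitor_script are hypothetical, and the service name named in the usage note is only an example. The sketch sets the mode but leaves the owner (root) and group (sys) to be set separately by root.

```shell
# Hypothetical helper: build the monitoring script name for a service.
# kind is "local" (ha_service_lmon) or "remote" (ha_service_rmon).
monitor_script_name() {
    case "$2" in
        local)  echo "ha_${1}_lmon" ;;
        remote) echo "ha_${1}_rmon" ;;
        *)      echo "unknown kind: $2" >&2; return 1 ;;
    esac
}

# Hypothetical helper: copy a script into the actions directory under its
# conventional name and restrict it to mode 700.
# Owner root and group sys must still be set separately, as root,
# with chown and chgrp.
install_monitor_script() {
    src=$1
    service=$2
    kind=$3
    dest_dir=${4:-/var/ha/actions}
    name=$(monitor_script_name "$service" "$kind") || return 1
    cp "$src" "$dest_dir/$name" && chmod 700 "$dest_dir/$name"
}
```

For example, install_monitor_script ./mymonitor named local would install the file as /var/ha/actions/ha_named_lmon.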

Choosing the Execution Order of Failover Scripts for Each Operation

The section “IRIS FailSafe Scripts” in Chapter 1 described the organization of the failover scripts:

  • The scripts are stored in /var/ha/resources.

  • Each of the directories /var/ha/actions.d/giveaway, /var/ha/actions.d/giveback, /var/ha/actions.d/takeback, and /var/ha/actions.d/takeover contains links to each of the scripts.

  • The link names begin with S and a three-digit number.

  • Because the links (scripts) in each directory are executed in lexicographic order, the ordering of the three-digit numbers is the order in which the scripts are executed.
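
Because the execution order follows from the link names alone, you can preview it before running any operation. The following is a minimal sketch in sh. The function name execution_order is hypothetical, and the sketch assumes that only entries whose names begin with S and three digits take part in the ordering.

```shell
# Hypothetical helper: print the links in one operation directory in the
# order in which they would be executed.
execution_order() {
    dir=$1
    # Keep only names of the form Snnn..., then sort them.
    # LC_ALL=C forces plain byte-wise (lexicographic) sorting.
    ls "$dir" | grep '^S[0-9][0-9][0-9]' | LC_ALL=C sort
}
```

For example, execution_order /var/ha/actions.d/takeover lists the takeover links from first executed to last executed.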

In the giveaway and giveback directories, the order of execution of the standard, NFS, and Web server scripts is webserver, interfaces, statd, nfs, filesystems, and volumes. In the takeback and takeover directories, the order of execution is volumes, filesystems, nfs, interfaces, statd, and webserver.

Based on the tasks performed by each of these scripts (see the section “Tasks Performed by the Standard Failover Scripts” in Chapter 1) and the resources used by the application, you must choose where in the sequence of execution to insert your new script for each operation. For example, filesystems on which an application depends must be mounted before the application is started up. Thus, for takeover and takeback operations, the sequence number of the filesystem script (/var/ha/actions.d/takeover/S100filesystem) must be smaller than that of your new highly available service, so that filesystems are mounted before instances of the new highly available service are started up. Similarly, the application sequence number must be smaller than the filesystem sequence number for giveback and giveaway operations because the application must be stopped before filesystems are unmounted.

For a failover script for named, good choices are as follows:

/var/ha/actions.d/takeback/S850named
/var/ha/actions.d/takeover/S850named
/var/ha/actions.d/giveback/S700named
/var/ha/actions.d/giveaway/S700named

This ordering was chosen because the named process has to be started after the interfaces have been brought up and before NFS filesystems are mounted. It has to be stopped before interfaces are stopped and after NFS filesystems are unmounted.

For most applications, it is best not to insert their scripts into the middle of this execution order. Instead, they should be executed before the scripts provided by Silicon Graphics in the giveaway and giveback directories and after the scripts provided by Silicon Graphics in the takeback and takeover directories. Thus, for giveaway and giveback, applications are stopped before interfaces, filesystems, and volumes are stopped. For takeback and takeover, applications are started after the volumes, filesystems, and interfaces are started.

Installing Failover Scripts

After deciding the execution order of your failover script in each of the actions.d directories as described in the section “Choosing the Execution Order of Failover Scripts for Each Operation” in this chapter, you can complete the installation of your script:

  1. Copy the script to /var/ha/resources.

  2. Change the owner of the script to root, the group to sys, and the mode to 700.

  3. Choose a three-digit number that will ensure that the script is executed in the correct order in the giveaway and giveback directories.

  4. Create links in the giveaway and giveback directories. In each of these directories, enter this command:

    # ln -s ../resources/script Snnnscript
    

  5. Choose a three-digit number that will ensure that the script is executed in the correct order in the takeback and takeover directories.

  6. Create links in the takeback and takeover directories. In each of these directories, enter this command:

    # ln -s ../resources/script Smmmscript
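
The copy and link steps above can be combined into one sh function. This is a minimal sketch under stated assumptions: the function name install_failover_script and its parameters are hypothetical, the link target ../resources/script is copied verbatim from the commands in steps 4 and 6 (confirm that it resolves to the installed script on your system), and the owner and group change from step 2 is left to be done separately as root.

```shell
# Hypothetical helper: install a failover script and link it into the
# four operation directories.
install_failover_script() {
    src=$1                              # path to the new failover script
    give_num=$2                         # three digits for giveaway, giveback
    take_num=$3                         # three digits for takeback, takeover
    resources=${4:-/var/ha/resources}   # where the script itself is stored
    actions=${5:-/var/ha/actions.d}     # parent of the operation directories
    script=$(basename "$src")

    # Step 1 and the mode part of step 2.
    cp "$src" "$resources/$script" && chmod 700 "$resources/$script" || return 1

    # Steps 3 and 4: links named Snnnscript.
    for op in giveaway giveback; do
        (cd "$actions/$op" && ln -s "../resources/$script" "S${give_num}${script}") || return 1
    done

    # Steps 5 and 6: links named Smmmscript.
    for op in takeback takeover; do
        (cd "$actions/$op" && ln -s "../resources/$script" "S${take_num}${script}") || return 1
    done
}
```

With the numbers suggested earlier for named, the call would be install_failover_script ./named 700 850.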
    

Modifying Application Startup Procedures

Because highly available services are started up by IRIS FailSafe, rather than by scripts in /etc/init.d or by other automatic, non-IRIS FailSafe mechanisms, you must disable the normal startup procedure for the application you are making highly available.

For example, to turn off the automatic (non IRIS FailSafe) startup of named, use the chkconfig command to turn named off:

# chkconfig named off

Testing New Scripts

The subsections below describe strategies for testing new monitoring and failover scripts. To prepare for testing, take these steps:

  • Ensure that you have exclusive use of both nodes. Users logged in during testing could experience unavailability of highly available services.

  • Generate additional debugging information in /var/adm/SYSLOG by setting the variable TESTING in /var/ha/actions/common.vars:

    TESTING=ok 
    

General Testing and Debugging Techniques

Some general testing and debugging techniques you can use during testing are as follows:

  • While testing your scripts, you can get debugging information from these sources:

    • IRIS FailSafe writes messages in /var/adm/SYSLOG, which can be useful in debugging script problems. Running this command in a window dedicated to this command can help you keep track of the messages as they occur:

      # tail -f /var/adm/SYSLOG 
      

    • The ha_admin -i command reports the state of a node. Note that this command hangs if a node is in transition from one state to another.

    • The ha_admin -a command provides information about the cluster that includes node states for each node, IP addresses and the node that owns them, XLV volumes and the node that owns them, and filesystems and the node that owns them.

  • If your testing causes repeated failovers, IRIS FailSafe is disabled (chkconfig failsafe off), so that it is not started automatically at boot time. This is because IRIS FailSafe software is designed so that repeated failures do not result in repeated failovers. The criterion for disabling IRIS FailSafe is two failures within a set period of time. This period of time is specified by the variable MIN_UPTIME in the file /etc/init.d/failsafe. During testing, you can set MIN_UPTIME to 0 so that IRIS FailSafe is never disabled.

  • The procedures in the following subsections assume that you are using csh. If you are using sh, substitute echo $? for the echo $status commands that report the return value of the previous command. The return value should always be zero, which indicates success.

  • To check that an application is running on a node, you may be able to use a command provided by the application. For example, the IRIS FailSafe INFORMIX option uses the INFORMIX command onstat.

  • Another way to check that an application is running on a node is to enter this command on that node:

    # ps -ef | grep application 
    

    application is the name (or a portion of the name) of the executable for the application.
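
For the sh case mentioned above, the return value is only available in $? until the next command runs, so it is worth capturing it at once. The following is a minimal sketch. The function name run_and_report is hypothetical and the output format is an arbitrary choice.

```shell
# Hypothetical helper: run a command, print its return value, and pass
# the same value back to the caller.
run_and_report() {
    rc=0
    "$@" || rc=$?    # capture $? immediately; any later command overwrites it
    echo "return value: $rc"
    return "$rc"
}
```

A return value of zero indicates success, as described above. Any other value means that the command failed.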

Testing Monitoring Scripts

Monitoring scripts test the liveness of applications and resources. The best way to test them is to induce failures one at a time, run the script after each one, and check that the script detects the failure. Test monitoring scripts without IRIS FailSafe running on either node.

Use this checklist for testing a monitoring script:

  • Verify that the script detects failure of the application successfully.

  • Verify that the script always exits with a return value. See the section “Understanding the Monitoring Script Template” in Chapter 3 for a list of return values.

  • Verify that the script does not contain commands that can hang, such as commands that rely on DNS for name resolution, or commands that run indefinitely, such as ping.

  • Verify that the script completes before the timeout value specified in the configuration file.

  • Verify that the script's return codes are correct.

During testing, measure the time it takes for a script to complete and adjust the monitoring times in the configuration file, /var/ha/ha.conf, accordingly. To get a good estimate of the time required for the script to execute, run it under different system load conditions.
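
One way to collect these timings is a small sh wrapper that reports how long a script took and what it returned. This is a minimal sketch: the function name elapsed_seconds is hypothetical, the resolution is whole seconds, and it assumes that the date command on your system supports the %s format.

```shell
# Hypothetical helper: run a command and print "<elapsed seconds> <return value>".
# Output of the command itself is discarded.
elapsed_seconds() {
    start=$(date +%s)
    rc=0
    "$@" > /dev/null 2>&1 || rc=$?
    end=$(date +%s)
    echo "$((end - start)) $rc"
}
```

Run it several times under different system loads and compare the largest elapsed time with the timeout value in /var/ha/ha.conf.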

Testing Failover Scripts Without Starting IRIS FailSafe

You can test the operations giveaway, giveback, takeback, and takeover manually using the general procedure below. It refers to one node (either one) as Node A and the other as Node B.

  1. Before beginning this testing, ensure that the following are true:

    • The failover script you are testing is installed.

    • The configuration file (/var/ha/ha.conf) includes blocks for the application whose script you are testing.

    • IRIS FailSafe is not running on the cluster.

    • The application you are testing starts and stops correctly on each node.

    • The application you are testing is not running on either node in the cluster.

    • The logical volumes used by the application are assembled.

    • The filesystems used by the application are mounted.

    • The network interfaces used by the application are configured up.

  2. On each node, enter this command and check the return value:

    # /var/ha/actions/takeback `/usr/etc/ha_cfgchksum` 
    # echo $status 
    

    See the section “General Testing and Debugging Techniques” for information about the echo command.

  3. On each node, verify that all instances of the application for which this node is the primary node (server-node) have been started. See the section “General Testing and Debugging Techniques” for information.

  4. On Node A, enter this command and check the return value:

    # /var/ha/actions/giveaway `/usr/etc/ha_cfgchksum` 
    # echo $status 
    

  5. Verify that no instances of the application are running on Node A.

  6. On Node B, enter this command and check the return value:

    # /var/ha/actions/takeover `/usr/etc/ha_cfgchksum` 
    # echo $status 
    

  7. Verify that all instances of the application for which Node B is the backup node are now running on Node B.

  8. On Node B, enter this command and check the return value:

    # /var/ha/actions/giveback `/usr/etc/ha_cfgchksum` 
    # echo $status 
    

  9. Verify that Node B is running just the application instances for which it is the primary node.

  10. On Node A, enter this command and check the return value:

    # /var/ha/actions/takeback `/usr/etc/ha_cfgchksum` 
    # echo $status 
    

  11. Verify that Node A is running the application instances for which it is the primary node.

  12. On Node B, enter this command and check the return value:

    # /var/ha/actions/giveaway `/usr/etc/ha_cfgchksum` 
    # echo $status 
    

  13. Verify that no instances of the application are running on Node B.

  14. On Node A, enter this command and check the return value:

    # /var/ha/actions/takeover `/usr/etc/ha_cfgchksum` 
    # echo $status 
    

  15. Verify that all instances of the application for which Node A is the primary or backup node are now running on Node A.

  16. On Node A, enter this command and check the return value:

    # /var/ha/actions/giveback `/usr/etc/ha_cfgchksum` 
    # echo $status 
    

  17. Verify that Node A is running just the application instances for which it is the primary node.

  18. On Node B, enter this command and check the return value:

    # /var/ha/actions/takeback `/usr/etc/ha_cfgchksum` 
    # echo $status 
    

  19. Verify that Node B is running the application instances for which it is the primary node.
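
Each command step in this procedure has the same shape: run one operation with the configuration checksum, then check the return value. If you work in sh rather than csh, that shape can be sketched as a function. The function name run_action and the variables ACTIONS_DIR and CFGCHKSUM are hypothetical. Their defaults are the paths used in the steps above, and the sketch assumes that the checksum is passed as a single argument.

```shell
# Defaults match the paths used in the procedure; override them only
# when trying the sketch outside a cluster node.
ACTIONS_DIR=${ACTIONS_DIR:-/var/ha/actions}
CFGCHKSUM=${CFGCHKSUM:-/usr/etc/ha_cfgchksum}

# Hypothetical helper: run one operation (takeback, giveaway, takeover,
# or giveback) and report its return value.
run_action() {
    op=$1
    rc=0
    "$ACTIONS_DIR/$op" "$("$CFGCHKSUM")" || rc=$?
    echo "$op returned $rc"
    return "$rc"
}
```

For example, step 4 becomes run_action giveaway on Node A, and step 6 becomes run_action takeover on Node B.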

Testing Failover Scripts While Running IRIS FailSafe

You can test the operations giveaway, giveback, takeback, and takeover while IRIS FailSafe is running using the general procedure below. It refers to one node (either node) as Node A and the other as Node B.

  1. Before beginning this testing, ensure the following:

    • The failover script you are testing is installed.

    • The configuration file (/var/ha/ha.conf) includes blocks for the application whose script you are testing.

    • IRIS FailSafe is not running on the cluster.

    • The application you are testing is not running on either node in the cluster.

  2. Start up IRIS FailSafe and the applications whose script you are testing by entering these commands on both nodes:

    # chkconfig failsafe on 
    # /etc/init.d/failsafe start 
    

  3. Wait until both nodes reach normal state. You can verify this using this command on each node:

    # /usr/etc/ha_admin -i 
    ha_admin: Node controller state normal 
    

  4. Verify that Node A and Node B are running the instances of all applications for which they are the primary node.

  5. On Node A, enter this command:

    # /usr/etc/ha_admin -s 
    

  6. Verify that no highly available applications are running on Node A, and that all instances of the highly available applications are running on Node B. Node A must be in standby state and Node B must be in degraded state.

  7. On Node A, enter this command:

    # /usr/etc/ha_admin -fr 
    

  8. Verify that Node A and Node B are running the instances of all applications for which they are the primary node and both nodes are in normal state.

  9. On Node B, enter this command:

    # /usr/etc/ha_admin -s 
    

  10. Verify that no highly available applications are running on Node B, and that all instances of the highly available applications are running on Node A. Node B must be in standby state and Node A must be in degraded state.

  11. On Node B, enter this command:

    # /usr/etc/ha_admin -fr 
    

  12. Verify that Node A and Node B are running the instances of all applications for which they are the primary node and both nodes are in normal state.
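
The state checks in this procedure can be repeated automatically until a node reaches the expected state. The following is a minimal sh sketch. The function name wait_for_state, the variable HA_ADMIN, the one-second polling interval, and the default of 30 attempts are all hypothetical. The sketch also assumes that ha_admin -i reports the standby and degraded states in the same "state name" form that step 3 shows for the normal state. Because ha_admin -i hangs while a node is in transition, a single call may block for some time.

```shell
HA_ADMIN=${HA_ADMIN:-/usr/etc/ha_admin}

# Hypothetical helper: poll the node state until it matches the wanted
# state or the attempts run out. Returns 0 on a match, 1 otherwise.
wait_for_state() {
    want=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        # Assumed output form: "ha_admin: Node controller state <state>"
        if "$HA_ADMIN" -i | grep "state $want" > /dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

For example, after step 5 you could run wait_for_state standby on Node A and wait_for_state degraded on Node B.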