Chapter 3. Writing a Monitoring Script

This chapter provides information about writing local and remote monitoring scripts. It begins with a section that describes how to write a monitoring script. The remaining sections provide details about various aspects of monitoring scripts that will help you develop your script.

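The sections in this chapter are as follows:

  • “Writing a Monitoring Script”

  • “Preparing to Write a Monitoring Script”

  • “Understanding the Monitoring Script Template”

  • “Defining Variables for New Block, Section, and Parameter Names”

  • “Using ha_cfginfo to Get Configuration File Information”

  • “Understanding the Function of the Monitoring Script check() Function”

  • “Executing a Command in a Monitoring Script”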

Writing a Monitoring Script

Two types of monitoring scripts can be written. A local monitoring script monitors a particular resource (for example, network interfaces) or the instances of an application (for example, Netscape servers) on the local node. A remote monitoring script monitors a resource or the instances of an application on the other node in the cluster. The procedure below describes the steps to write a local or remote monitoring script, including choosing which type of script to write.

  1. Use the questions and information in the section “Preparing to Write a Monitoring Script” in this chapter to help you get information you may need about the application you are monitoring and make decisions about how to implement monitoring.

  2. Begin with the monitoring script template /var/ha/actions/ha_app_Xmon or with a copy of one of the monitoring scripts provided with the product if it is similar to the script you need.

  3. Review the monitoring script template /var/ha/actions/ha_app_Xmon. It is described in the section “Understanding the Monitoring Script Template” in this chapter.

  4. Become familiar with using the command ha_cfginfo to extract information from the configuration file. The use of this command is described in the section “Using ha_cfginfo to Get Configuration File Information” in this chapter.

  5. Review other monitoring scripts in /var/ha/actions to get an idea of how they perform their checking tests and to see if portions of these scripts can be reused in your script.

  6. Define variables in the monitoring script for any new block, section, and parameter names you added to the configuration file that you will use in the monitoring script. See the section “Defining Variables for New Block, Section, and Parameter Names” in this chapter for details.

  7. Write the function that performs the failure test, check(). It should contain code that searches for all instances of the application running on $HOST and, for each instance of the application, performs the failure test. The requirements for this function are described in the section “Understanding the Function of the Monitoring Script check() Function” in this chapter.

  8. Modify the remainder of the script as necessary.

The remaining sections in this chapter provide information that will help you perform these steps.

Preparing to Write a Monitoring Script

Here are some questions to think about before writing a monitoring script:

  • Is a monitoring script required?

    Monitoring scripts may not be needed at all in these situations:

    • Heartbeat monitoring is sufficient; simply verifying that the node is alive (provided automatically by IRIS FailSafe software) determines the health of the highly available service.

    • There is no process or resource that can be monitored. For example, the Silicon Graphics Gauntlet software performs IP filtering on firewall nodes. Because the filtering is done in the kernel, there is no process or resource to monitor.

    • The resource on which the application depends is already monitored. For example, monitoring some client-server applications might best be done by monitoring the filesystems, volumes, and network interfaces they use. Because this is already done by the IRIS FailSafe base software (the /var/ha/actions/ha_filesys_lmon and /var/ha/actions/ha_vol_lmon scripts and the ha_ifa interface agent), additional monitoring is not required.

  • Can a local monitoring script be written?

    If local monitoring is so expensive that it degrades system performance, it should not be done. Security restrictions can also make local monitoring very difficult.

    In some unusual situations, applications may not allow local monitoring. For example, the application may prevent local clients from connecting. In this case, only remote monitoring can be done.

  • Is a remote monitoring script necessary?

    There are generally two components to remote monitoring: testing the network between the two nodes and verifying that the application is running on the remote node. Because the network interfaces specified in the node blocks of the configuration file are monitored by the interface agent and the application can be monitored by a local monitoring script, a remote monitoring script may not be necessary.

  • What are the symptoms of failure for this application?

    Some possibilities include:

    • The application returns an error code.

    • The application returns the wrong result.

    • The application does not return quickly enough.

  • What is the test for failure?

    The test should be simple and complete quickly, whether it succeeds or fails. Some examples of tests are as follows:

    • For a client-server application that follows a protocol, the monitoring script can make a simple request and verify that the proper response is received.

    • For a web server, the monitoring script can request a home page, verify that the connection was made, and ignore the resulting home page.

    • For a database, a simple request such as querying a table can be made.

    • For NFS, more complicated end-to-end monitoring is required. The test might consist of mounting an exported filesystem, checking access to the filesystem with a stat() system call to the root of the filesystem, and undoing the mount.

    • For an application that writes to a log file, check that the size of the log file is increasing or use the grep command to check for a particular message.

    • The command

      # killall -0 processname 
      

      can be used to determine quickly whether a process exists. Using the ps command to check on a particular process is not a good test; its execution can be too slow.

  • What should the probe time be set to (the frequency of monitoring)?

    For local monitoring, the probe time should be a balance between the frequency of checking and the cost of checking. Monitoring reduces the performance of a node.

    For remote monitoring, the probe time should be longer than the probe time for local monitoring and longer than the heartbeat probe time. A good initial value for the probe time for remote monitoring is the value of long-timeout. Remote monitoring is much more likely to suffer from timeouts than local monitoring.

  • What should the timeout be (the period in which a test should complete)?

    This value must be determined by testing the monitoring script. It must be long enough to guarantee that occasional anomalies do not cause false failovers.

  • Should the failure test be executed multiple times so that a node is not declared dead after a single failure?

    Testing more than once before declaring failure is a good idea. One way to do this if the test is a single command is to use the ha_exec command. It is described in the section “Understanding the Function of the Monitoring Script check() Function.”

  • What values need to be customized or tuned and should therefore go into the configuration file as parameters?

    See Chapter 2, “Modifying the Configuration File for a New Highly Available Service,” for information on adding parameters to /var/ha/ha.conf.
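
As a rough illustration of the log-file test suggested in the list above, the following sketch compares the current size of a log file with the size recorded on the previous probe. The function name log_has_grown and its state-file argument are assumptions made for this example; they are not part of IRIS FailSafe software.

```shell
#!/bin/sh
# Hypothetical helper (not part of IRIS FailSafe): succeeds if the log file
# is larger than it was on the previous call. The size seen on the last
# probe is remembered in a small state file.
log_has_grown()
{
    LOG=$1          # log file written by the application
    STATE=$2        # file that remembers the size seen on the last probe

    # a missing log file counts as a failure
    [ -f "$LOG" ] || return 1

    # wc -c prints the size in bytes; tr removes padding some versions add
    NEW_SIZE=`wc -c < "$LOG" | tr -d ' '`

    OLD_SIZE=0
    if [ -f "$STATE" ]; then
        OLD_SIZE=`cat "$STATE"`
    fi
    echo "$NEW_SIZE" > "$STATE"

    # success (0) only if the file is larger than it was last time
    [ "$NEW_SIZE" -gt "$OLD_SIZE" ]
}
```

A check() function could call such a helper once per probe and treat a non-zero return as a failed test. Whether an unchanged log really indicates failure depends on how often the application writes, so the probe time has to be chosen with that in mind.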

Understanding the Monitoring Script Template

The monitoring script template /var/ha/actions/ha_app_Xmon is shown in Example 3-1. A description of the template follows the example.

Example 3-1. Monitoring Script Template With Line Numbers


  1 #!/sbin/sh 
  2 #
  3 ## Instructions for modifying this file are on lines that begin with ##.
  4 #
  5 ## Provide a description of this script including its name, installation
  6 ## location, purpose, and the monitoring tests performed.
  7 #
  8 # Usage:
  9 ## Replace <scriptname> in the next line with the name of this script.
 10 #        <scriptname> "<checksum> <nodename>"
 11 #
 12 # Exit codes:
 13 #        0: The local/remote monitor succeeded
 14 #        1: This script called illegally
 15 #        2: Configuration file is incorrect
 16 #        3: The local/remote monitoring failed
 17 #
 18 
 19 SUCCESS=0
 20 ILLEGAL_CALL=1
 21 INCORRECT_CONF_FILE=2
 22 FAILED=3
 23 
 24 HA_DIR=/var/ha/actions
 25 HAEXEC=/usr/etc/ha_exec
 26 CONF=$HA_DIR/common.vars
 27 
 28 ## Define other variables that are local to this script here.
 29 ## Use ${LOGGER} to print error and TESTING messages to /var/adm/SYSLOG
 30 ## file.
 31 
 32 # Source in common variables
 33 . $CONF
 34 
 35 if [ X$TESTING = Xok ]; then
 36 ## Replace <application> and <local/remote> in the next line.
 37     ${LOGGER} "Executing <application> <local/remote> monitor script"
 38 fi
 39 
 40 if [ $# -ne 1 ]; then
 41     ${LOGGER} "Illegal syntax: argument required"
 42     ${LOGGER} "Usage: $0 \"checksum nodename\""
 43     exit $ILLEGAL_CALL;
 44 fi
 45 
 46 # Get the checksum and nodename from the argument string.
 47 set $1
 48 
 49 if [ $# -ne 2 ]; then
 50     ${LOGGER} "Illegal syntax: argument required"
 51     ${LOGGER} "Usage: $0 \"checksum nodename\""
 52     exit $ILLEGAL_CALL;
 53 fi
 54 
 55 HOST=$2
 56 
 57 #
 58 # Compare the checksum argument (the checksum known by the node 
 59 # controller and application monitor) with the checksum of ha.conf
 60 # on this system.
 61 
 62 CNF_CHKSUM=$1
 63 CHKSUM=`$CFG_SUM`
 64 if [ $CNF_CHKSUM != $CHKSUM ]; then
 65     ${LOGGER} "Checksum mismatch [argument: $CNF_CHKSUM] [file: $CHKSUM]"
 66     exit $INCORRECT_CONF_FILE;
 67 fi
 68 
 69 ##
 70 ## Substitute ha_app_Xmon by the application name
 71 ##
 72 LOGFILE=/var/ha/logs/ha_app_Xmon.$HOST.log
 73 echo Started logging at `date`> $LOGFILE
 74 
 75 #
 76 # Executes the command $EXEC and prints the command, output and 
 77 # error to log file $LOGFILE. If the return value from the command 
 78 # is non-zero, the function exits with value 3. 
 79 # It takes one parameter, log message about the command.
 80 #
 81 execute_cmd()
 82 {
 83 
 84     echo $1 >> $LOGFILE;
 85     if [ X${TESTING} = Xok ]; then
 86         ${LOGGER} $1
 87     fi
 88 
 89     eval $EXEC >> $LOGFILE 2>&1;
 90 
 91     exit_code=$?;
 92 
 93     if [ $exit_code -ne 0 ]; then
 94         echo "ERROR: $EXEC,  exit_code: $exit_code" >> $LOGFILE;
 95         ${LOGGER} "ERROR: $EXEC"
 96         exit 3;
 97     fi
 98     
 99     echo "*** $EXEC completed with exit_code 0 ***" >> $LOGFILE;
100 
101 }
102 
103 ## Put the checking procedure(s) here.
104 
105 ## Comment about check() procedure. 
106 ## Use $HAEXEC for commands which have to be retried before declaring 
107 ## application monitor failure. 
108 ## Check to see if the application instances whose server-node is $HOST 
109 ## has failed.
110 ## The check() procedure should return $FAILED if the application 
111 ## instance has failed.
112 ## If the configuration file ha.conf is incorrect, check() procedure
113 ## should return $INCORRECT_CONF_FILE.
114 ## To read the configuration file ha.conf, use $CFG_INFO command. For
115 ## more information about the command, see ha_cfginfo(1M) manpage.
116 ## Use execute_cmd() to execute the commands in the script.
117 ## 
118 
119 ## check()
120 ## {
121 ##     ...
122 ## }
123 
124 ## Make call(s) to checking procedure(s) here.
125 
126 ## check;
127 
128 # Exit with SUCCESS 
129 
130 exit $SUCCESS;

The monitoring script template can be broken into these sections:

  • Lines 19 to 22 set variables for the script return values. Monitoring scripts have these return values:

    • 0 ($SUCCESS)—Success; the operation succeeded, so the application is running.

    • 1 ($ILLEGAL_CALL)—An invalid argument was passed to the script.

    • 2 ($INCORRECT_CONF_FILE)—The configuration file is invalid; either information in the configuration file is incorrect or some information is missing from the configuration file.

    • 3 ($FAILED)—The operation failed.

    If a monitoring script returns a non-zero value, the application is assumed to have failed.

  • Line 33 sources the file /var/ha/actions/common.vars, which assigns strings in /var/ha/ha.conf to variables. It also defines the ${LOGGER} command, which is used to write messages to /var/adm/SYSLOG; the ${TESTING} variable, which controls the debugging information written to /var/adm/SYSLOG; and the ${CFG_SEP} variable, which is set to the character #.

  • Lines 40 to 54 contain code for checking the monitoring script's command-line argument. The monitoring script must have one command line argument, a double-quoted argument that contains two strings separated by a blank:

    • The first string is the checksum of the /var/ha/ha.conf file, as generated by the ha_cfgchksum command.

    • The second string is a node name. This is the hostname of the node to be monitored. Line 55 sets $HOST to the node name.

  • Lines 62 to 67 compare the checksum argument with the checksum of /var/ha/ha.conf.

  • Lines 72 and 73 set $LOGFILE to the name of the log file and write a message to it. The directory for the log file is /var/ha/logs. The convention for the filename is the name of the application, $HOST, and the word log, separated by periods.

  • Lines 76 to 101 describe and define the execute_cmd() function. It writes information to the log file and executes the command specified by the variable $EXEC. It is described fully in the section “Executing a Command in a Monitoring Script.”

  • Lines 103 to 122 describe and define the check() function. The check() function is described fully in the next section, “Understanding the Function of the Monitoring Script check() Function.”
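
The argument handling on lines 40 to 55 of the template can be tried on its own with a sketch like the one below. The function name parse_monitor_arg is hypothetical, and the sketch uses set -- rather than the template's plain set so that an empty argument or one beginning with a dash is not misinterpreted; otherwise it follows the same steps.

```shell
#!/bin/sh
# Sketch of the template's argument handling, wrapped in a function so that
# it can be run outside a monitoring script. parse_monitor_arg is a
# hypothetical name, not part of the template.
parse_monitor_arg()
{
    # exactly one argument is expected: "checksum nodename"
    if [ $# -ne 1 ]; then
        return 1
    fi

    # $1 is left unquoted on purpose: the shell splits the string at the
    # blank, and the two words become the new $1 and $2
    set -- $1
    if [ $# -ne 2 ]; then
        return 1
    fi

    CNF_CHKSUM=$1
    HOST=$2
    return 0
}

# the checksum value here is made up for the example
parse_monitor_arg "12345 xfs-ha1" && echo "checksum=$CNF_CHKSUM host=$HOST"
```

The last line prints checksum=12345 host=xfs-ha1. Passing two separate arguments instead of one quoted string, or a string with only one word, makes the function return 1, which corresponds to the template exiting with $ILLEGAL_CALL.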

Defining Variables for New Block, Section, and Parameter Names

Each new block, section, or parameter name that you added to the configuration file (see the section “Choosing Parameters for a New Highly Available Service” in Chapter 2) must be assigned to a shell variable at the beginning of each script in which it is used. Scripts then use the variables, not the literal parameter, section, and block names.

When assigning a parameter, section, or block name to a variable, choose a variable name that starts with T_. You can see examples of these assignments in the file /var/ha/actions/common.vars. Define your variables in the scripts in which they are used; do not add them to /var/ha/actions/common.vars, because that file is updated by new releases and your modifications will be lost when a new release of IRIS FailSafe software is installed.

For example, say that you added this parameter to the configuration file:

process-name = named

To use this parameter in a script, add this line to the script you write (about line 31 in Example 3-1):

T_PROCNAME=process-name 
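
Once defined, the variable takes the place of the literal name wherever the script refers to the parameter, for instance when it builds the hierarchical path that is passed to ha_cfginfo (described in the next section). The following is a minimal sketch; the block name named and the label named1 are assumptions made for this example, and CFG_SEP is set here only because the sketch does not source common.vars.

```shell
#!/bin/sh
# common.vars normally sets this; it is repeated so the sketch runs alone.
CFG_SEP='#'

# Names from the configuration file, assigned once near the top of the script.
T_NAMED=named               # block name (assumed for this example)
T_PROCNAME=process-name     # the new parameter name

# Build the search path for one block label from the variables rather than
# from the literal names.
LABEL=named1                # hypothetical block label
SEARCH_PATH="${T_NAMED}${CFG_SEP}${LABEL}${CFG_SEP}${T_PROCNAME}"
echo "$SEARCH_PATH"
```

This prints named#named1#process-name. If the parameter is later renamed in the configuration file, only the T_PROCNAME assignment has to change.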

Using ha_cfginfo to Get Configuration File Information

The command ha_cfginfo is used in monitoring and failover scripts to obtain information from the configuration file /var/ha/ha.conf. The command is

# /usr/etc/ha_cfginfo [ -f filename ] [ string ]

filename is the name of a configuration file; by default it is /var/ha/ha.conf. If string isn't specified, the names of the blocks in the configuration file are listed. By specifying string, you can get any value in the file. For example, say that this is a portion of a configuration file:

volume shared1_vol
{
    server-node = xfs-ha1
    backup-node = xfs-ha2
    devname = /dev/dsk/xlv/shared1_vol
}

volume shared2_vol
{
    server-node = xfs-ha2
    backup-node = xfs-ha1
    devname = /dev/dsk/xlv/shared2_vol
    disks = (/dev/dsk/dks0d1s2 /dev/dsk/dks0d5s3 /dev/dsk/dks0d2s6)
}

Some example ha_cfginfo commands and their output are shown below. The string argument specifies the hierarchical path you are interested in, with the # character separating elements in the hierarchy.

# /usr/etc/ha_cfginfo volume
shared1_vol shared2_vol
# /usr/etc/ha_cfginfo volume#shared1_vol
server-node backup-node devname
# /usr/etc/ha_cfginfo volume#shared1_vol#server-node
xfs-ha1
# /usr/etc/ha_cfginfo volume#shared2_vol#server-node
xfs-ha2
# /usr/etc/ha_cfginfo volume#shared2_vol#disks
/dev/dsk/dks0d1s2 /dev/dsk/dks0d5s3 /dev/dsk/dks0d2s6 

A simple example of using ha_cfginfo in a script is this fragment that monitors each of the volumes defined in /var/ha/ha.conf:

EXEC=`/usr/etc/ha_cfginfo volume`

for VOL in $EXEC
do
    # monitor the volume $VOL here
    :
done

Scripts access the labels and parameter values in /var/ha/ha.conf by specifying the hierarchical path to the label or parameter they want—for example, the block, its label, a section, its label, and finally the parameter—as an argument to the ha_cfginfo command. However, there is a level of indirection in the naming of the blocks, sections, and parameters. In the shell script /var/ha/actions/common.vars, each block, section, and string name in /var/ha/ha.conf is assigned to a similarly named variable. These variables are used as arguments to ha_cfginfo in monitoring and failover scripts.

As an example of the use of ha_cfginfo, say that the configuration file contains this fragment:

nfs nfs1
{
    export-point = /shared1/export
    ...
}
nfs nfs2
{
    export-point = /shared2/export
    ...
}

The file /var/ha/actions/common.vars includes these lines:

CFG_FILE=/var/ha/ha.conf
CFG_INFO="/usr/etc/ha_cfginfo -f ${CFG_FILE}"
CFG_SEP=#
T_NFS=nfs
T_EXPORTPT=export-point

To perform an operation on each export point for NFS filesystems, use a shell script fragment such as this to get the value of each export-point parameter:

for FS in `$CFG_INFO ${T_NFS}`   # loop through each nfs block
do
    # set up the ha_cfginfo command line to get the export-point value of an nfs block
    SEARCH="$CFG_INFO ${T_NFS}${CFG_SEP}${FS}${CFG_SEP}${T_EXPORTPT}"

    # perform the ha_cfginfo command, assign the result to $EXPORT_PT
    EXPORT_PT=`$SEARCH`

    # perform operation on $EXPORT_PT 
    ...
done
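
To experiment with this loop on a system that does not have ha_cfginfo, you can substitute a stand-in for the command. In the sketch below, fake_cfginfo is a hypothetical shell function that knows only the two nfs blocks from the fragment above, and its return value of 1 for an unknown path is an assumption, not documented ha_cfginfo behavior. On a cluster node, $CFG_INFO and the other variables come from common.vars instead.

```shell
#!/bin/sh
# fake_cfginfo is a hypothetical stand-in for ha_cfginfo. It answers only
# for the two nfs blocks shown in the configuration fragment.
fake_cfginfo()
{
    case "$1" in
        nfs)                      echo "nfs1 nfs2" ;;
        'nfs#nfs1#export-point')  echo "/shared1/export" ;;
        'nfs#nfs2#export-point')  echo "/shared2/export" ;;
        *)                        return 1 ;;
    esac
}

CFG_INFO=fake_cfginfo     # on a real node, common.vars sets these four
CFG_SEP='#'
T_NFS=nfs
T_EXPORTPT=export-point

EXPORT_POINTS=""
for FS in `$CFG_INFO ${T_NFS}`      # loop through each nfs block
do
    EXPORT_PT=`$CFG_INFO ${T_NFS}${CFG_SEP}${FS}${CFG_SEP}${T_EXPORTPT}`
    if [ $? -ne 0 ]; then
        echo "no export-point found for nfs block $FS" >&2
        exit 2                      # $INCORRECT_CONF_FILE in the template
    fi
    # collect the values; a real script would operate on $EXPORT_PT here
    EXPORT_POINTS="$EXPORT_POINTS $EXPORT_PT"
done
echo "export points:$EXPORT_POINTS"
```

The final line prints export points: /shared1/export /shared2/export.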

Understanding the Function of the Monitoring Script check() Function

The checking function check() must do the following:

  • Check to see if the application instances whose server-node is $HOST have failed.

  • Exit the script with the return value $FAILED if the application instance has failed.

  • Exit the script with the return value $INCORRECT_CONF_FILE if the configuration file /var/ha/ha.conf is incorrect.

To extract information from /var/ha/ha.conf, use the ha_cfginfo command. (The common.vars file sets the variable $CFG_INFO to the ha_cfginfo command.) ha_cfginfo is described in the section “Using ha_cfginfo to Get Configuration File Information” in this chapter.

When executing each command used to check if an application instance has failed, you can use the ha_exec command, which provides automatic retry and timeout, and the execute_cmd() function, which provides automatic logging. See the subsection “Executing a Command in a Monitoring Script” for more information.

Shown below is the check() function for a named local monitoring script (to be installed as /var/ha/actions/ha_named_lmon).

check()
{
    NAMED=named
    # for each named block ... 
    for i in `$CFG_INFO ${NAMED}`
    do
        # get the server-node name 
        SEARCH="$CFG_INFO ${NAMED}${CFG_SEP}${i}${CFG_SEP}${T_SERVER}"
        SERVER_NODE=`$SEARCH`
        # if that failed, log a message and exit 
        if [ $? -eq 1 ]; then
            ${LOGGER} "$0: Trouble finding server node for named $i ($SEARCH)"
            exit $INCORRECT_CONF_FILE;
        fi

        # if this node is the server-node ... 
        if [ X${SERVER_NODE} = X${HOST} ]; then
            # get the value of process-name (T_PROCNAME=process-name is
            # defined near the top of the script)
            SEARCH="$CFG_INFO ${NAMED}${CFG_SEP}${i}${CFG_SEP}${T_PROCNAME}"
            PROC_NAME=`$SEARCH`
            # if that failed, log a message and exit 
            if [ $? -eq 1 ]; then
                ${LOGGER} "$0: Trouble finding process name for named $i ($SEARCH)"
                exit $INCORRECT_CONF_FILE;
            fi

            # set up and execute the command "killall -0 named", which checks to 
            # see if named is running 
            EXEC="${KILLALL} -0 ${PROC_NAME}"
            execute_cmd "check if ${PROC_NAME} is running"
        fi
    done
}

Executing a Command in a Monitoring Script

To execute each command that you add to a script, you have these choices:

  • Execute the command.

  • Use the ha_exec command to execute the command.

    ha_exec is used when the command has to be retried before declaring that the application has failed or when the command might not return quickly enough and you want to set a time limit.

    The syntax of the ha_exec command is

    ha_exec [ -p waitperiod ] timeout retry command 
    

    command is the command for the failure test, timeout is the length of time to wait without response before declaring a single test failure, retry is the number of times to retry the test, and waitperiod is the length of time to wait after a failure or command timeout before retrying command. waitperiod defaults to 0. (See the ha_exec(1M) reference page for more information.)

  • Use the execute_cmd() function (with or without ha_exec) to execute the command.

    The execute_cmd() function executes the command specified by the variable $EXEC, which you set in the check() function. It takes one parameter, a string that is a message of your choice. It writes the message, the command executed, the output of the command, and a message about the return value of the command to the log file /var/ha/logs/ha_<app>_lmon.<node_name>.log. If the command returns a non-zero value, execute_cmd() exits the script with the value 3. The log file makes debugging a monitoring script failure easier.
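
To make the retry and waitperiod arguments concrete, the following sketch shows one way retry-with-wait behavior could be written in plain shell. It is not the implementation of ha_exec: the function name retry_cmd is hypothetical, there is no timeout, and the number of attempts (one initial try plus the given number of retries) is an assumption about how retries are counted.

```shell
#!/bin/sh
# retry_cmd is a hypothetical illustration of retry-with-wait behavior.
# It is NOT ha_exec: it has no timeout, and ha_exec may count attempts
# differently.
# Usage: retry_cmd retries waitperiod command
retry_cmd()
{
    RETRIES=$1      # how many times to retry after the first failure
    WAIT=$2         # seconds to pause before each retry
    CMD=$3          # the failure test, as a single string

    ATTEMPT=0
    while [ $ATTEMPT -le $RETRIES ]
    do
        # success on any attempt ends the loop immediately
        if eval "$CMD"; then
            return 0
        fi
        ATTEMPT=`expr $ATTEMPT + 1`
        # pause before the next attempt, but not after the last one
        if [ $ATTEMPT -le $RETRIES ] && [ $WAIT -gt 0 ]; then
            sleep $WAIT
        fi
    done
    return 1
}
```

For example, retry_cmd 2 1 "killall -0 sendmail" would run the test up to three times, one second apart, and return non-zero only if every attempt failed. In a real monitoring script, ha_exec is the better choice because it also enforces the timeout.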

For example, say that you decide to use this command to determine if the sendmail process is running: killall -0 sendmail. This command returns 0 if sendmail is running and non-zero if it is not. Your choices are these:

  • Execute the command and check the return value with code such as this:

    killall -0 sendmail
    RESULT=$?
    

  • Use ha_exec to execute the command, giving it three seconds to return and trying twice if necessary:

    $HAEXEC 3 2 "killall -0 sendmail"
    RESULT=$?
    

  • Use execute_cmd() without ha_exec to execute the command:

    EXEC="killall -0 sendmail"
    # execute_cmd exits the script with the value 3 if the command fails
    execute_cmd "checking for sendmail"
    

  • Use execute_cmd() with ha_exec to execute the command:

    EXEC="$HAEXEC 3 2 \"killall -0 sendmail\""
    # execute_cmd exits the script with the value 3 if the command fails
    execute_cmd "checking for sendmail"
    

    Using execute_cmd() with ha_exec is recommended.

When choosing between these different methods of executing a command, keep these things in mind:

  • Use ha_exec when the command might fail and has to be retried or when the command might not return quickly and you want to set a time limit.

  • When you use ha_exec and execute_cmd(), the command must return 0 on success and non-zero on failure.

  • If you need to examine the output of the command, don't use execute_cmd() because the output goes to the log file, where it would be difficult to parse.
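
When the output does have to be examined, for example to detect an application that runs but returns the wrong result, the command can be executed directly and its output captured in a variable. The sketch below assumes a hypothetical helper named output_contains; the status text in the usage example is made up.

```shell
#!/bin/sh
# output_contains is a hypothetical helper: it runs a command, captures its
# output (stdout and stderr), and returns 0 only if the command succeeds
# and the output contains the expected text.
# Usage: output_contains expected-text command [arguments ...]
output_contains()
{
    EXPECTED=$1
    shift

    # a non-zero exit status from the command is a failure in itself
    OUTPUT=`"$@" 2>&1` || return 1

    case "$OUTPUT" in
        *"$EXPECTED"*) return 0 ;;   # the expected answer was received
        *)             return 1 ;;   # the command ran but gave another result
    esac
}

# echo stands in for a real status command here
if output_contains "status: OK" echo "server status: OK"; then
    echo "test passed"
fi
```

In a check() function, a non-zero return from such a helper would be followed by exit $FAILED. Because the helper bypasses execute_cmd(), nothing is written to the log file unless the script does so itself.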