This chapter provides information about writing a failover script for a resource or application that you want the IRIS FailSafe system to treat as a highly available service. It begins with a section that describes how to write a failover script. The remaining sections provide details about various aspects of failover scripts that will help you develop your script.
Follow these steps to write a failover script:
Use the questions and information in the section “Preparing to Write a Failover Script” in this chapter to help you get information about the application you are failing over and make decisions about how to implement the failover script.
Begin with the failover script template in /var/ha/resources/appclass or with a copy of one of the failover scripts in /var/ha/resources if it is similar to the script you need.
Review the failover script template /var/ha/resources/appclass. It is described in the section “Understanding the Failover Script Template” in this chapter.
If necessary, review how to extract information from /var/ha/ha.conf using the ha_cfginfo command. ha_cfginfo is described in the section “Using ha_cfginfo to Get Configuration File Information” in Chapter 3.
Review your choices for executing commands in the script, which are described in the section “Executing a Command in a Failover Script” in this chapter.
Define variables in the failover script for any new block, section, and parameter names that you added to the configuration file and will use in the failover script. See the section “Defining Variables for New Block, Section, and Parameter Names” in Chapter 3 for details.
Write the takeback(), takeover(), giveaway(), and giveback() functions. See the section “Writing the Failover Functions” in this chapter for more information.
Make each of the functions takeback(), takeover(), giveaway(), and giveback() in the script idempotent—if it is executed twice in a row and the first execution succeeds, the second time must also succeed.
For example, running the script with the giveaway argument should stop all instances of the highly available service. If it is run again immediately, it should return without error. If the giveaway argument is specified when no instances of the highly available service are running, the giveaway() function must still succeed. To achieve this, you may have to precede each command that halts an application with a check that tests whether the application is running, so that the halt command is executed only if the application is running.
Review each function, and test it if necessary, to verify that it completes in less time than the value of the long-timeout parameter in the internal block, which is 60 seconds by default.
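The idempotency requirement in the steps above can be met with a small guard around each halting command. The following is a minimal sketch and is not part of the FailSafe template: the helper name stop_if_running and its two arguments (a command that tests whether the application is running, and a command that halts it) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of the FailSafe template.
# $1: command that exits 0 only when the application is running
# $2: command that halts the application
stop_if_running()
{
    if eval "$1" > /dev/null 2>&1; then
        # The application is running: halt it and report the result.
        eval "$2"
        return $?
    fi
    # Nothing is running, so there is nothing to stop: treat as success.
    return 0
}

# Possible use inside giveaway(). The check command is an assumption;
# substitute whatever test suits your application:
#   stop_if_running "ps -e | grep named" "/sbin/killall -k 1 -TERM named"
```

Because the helper returns success when the check fails, a second giveaway or giveback in a row finds nothing to stop and still succeeds.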
The remaining sections in this chapter provide information that will help you perform these steps.
Each highly available service has a failover script in the /var/ha/resources directory. This script contains at least these four functions: takeover(), takeback(), giveaway(), and giveback().
Here are some questions to think about before writing a failover script:
How do I move this application from one machine to another?
Can this application be moved at any time?
Do any highly available services, such as filesystems on shared disks, need to exist before the application can be started on another node?
Do any actions need to be performed to recover lost transactions, data, or state before starting the application on another node?
For example, databases can recover lost transactions, and the script can execute the commands that perform this recovery. For NFS filesystems or a Netscape server, there is no automatic recovery; the client simply requests the data again.
How do you start and stop the application on a node? How do you start and stop a specific instance of the application?
Can the application be started and stopped as root or must it be another user?
If a user other than root must start and stop the application, should that user be specified in /var/ha/ha.conf?
Where is the configuration information for the application stored? Will it be on shared or local disks?
You may not have any flexibility about where the configuration information is stored. To store it on a shared disk, you may need to link or copy files. (Remember that shared disks don't allow concurrent access; they can be used by only one node at a time.)
Where is the data for the application stored? Will it be on shared or local disks?
Does the application both read and write data or just read it?
If the application doesn't write data (for example, a front-end Web server that has read-only data), duplicating the data on local disks might be the best choice.
Where is the log information for the application stored? Will it be on shared or local disks?
Where is the application itself stored? Will it be on shared or local disks?
Does information about the application, data, log, or configuration information need to be specified in /var/ha/ha.conf?
What tasks will be performed by other failover scripts executed prior to this script?
If log, data, or configuration information is stored in a raw volume or in a filesystem on a shared disk, the filesystems and volumes failover scripts must be run before the application is started by the takeover or takeback operations.
See “Tasks Performed by the Standard Failover Scripts” in Chapter 1 for information about the actions of each failover script.
What tasks will be performed by other failover scripts after this script is executed?
If log, data, or configuration information is stored in a raw volume or in a filesystem on a shared disk, the filesystems and volumes failover scripts must be run after the application is stopped by the giveback or giveaway operations.
See “Tasks Performed by the Standard Failover Scripts” in Chapter 1 for information about the actions of each failover script. See “Choosing the Execution Order of Failover Scripts for Each Operation” in Chapter 5 for information about specifying the ordering of execution of the script relative to other scripts.
What additional information about the application should be stored in /var/ha/ha.conf?
All shared filesystems and volumes must be specified in /var/ha/ha.conf. Command-line arguments for starting and stopping applications should be put in /var/ha/ha.conf if they will vary; otherwise they can be hardcoded in the failover script.
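As an illustration of the last point, a script can read start options from the configuration file and fall back to a hardcoded value when none is configured. This is only a sketch under stated assumptions: the block name myapp, the instance label inst1, the parameter name start-options, and the helper cfg_or_default are all hypothetical, and fake_cfginfo is a stand-in for ha_cfginfo so that the example runs outside a FailSafe cluster.

```shell
#!/bin/sh
# Stand-in for ha_cfginfo so the sketch runs anywhere. It knows one
# hypothetical key and exits with status 1 for anything else.
fake_cfginfo()
{
    case "$1" in
        "myapp#inst1#start-options") echo "--verbose" ;;
        *) return 1 ;;
    esac
}
CFG_INFO=fake_cfginfo   # a real script gets $CFG_INFO from common.vars
CFG_SEP="#"

# Hypothetical helper: print the configured value for key $1,
# or the default $2 if the lookup fails or returns nothing.
cfg_or_default()
{
    if value=`$CFG_INFO "$1"` && [ -n "$value" ]; then
        echo "$value"
    else
        echo "$2"
    fi
}

# Options that vary come from the configuration file; the second
# argument is the hardcoded fallback.
START_OPTS=`cfg_or_default "myapp${CFG_SEP}inst1${CFG_SEP}start-options" ""`
```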
The failover script template /var/ha/resources/appclass is shown in Example 4-1. A description of the template follows the example.
1 #!/sbin/sh
2 #
3 ## Instructions for modifying this file are on lines that begin with ##.
4 #
5 ## Provide a description of this script including its name, installation
6 ## location, purpose and the resource(s)/application(s) that it fails over.
7 #
8 # Usage:
9 ## Replace <scriptname> in the next line with the name of this script.
10 # <scriptname> <checksum> <argument>
11 #
12 # The <argument> can be one of the operations - giveback,
13 # giveaway, takeover or takeback.
14 #
15 # Exit codes:
16 # 0: The operation succeeded
17 # 1: This script called illegally
18 # 2: Configuration file is incorrect
19 # 3: Command exited with non-zero return code - the action
20 # failed.
21
22 SUCCESS=0
23 ILLEGAL_CALL=1
24 INCORRECT_CONF_FILE=2
25 FAILED=3
26
27 HA_DIR=/var/ha/actions
28 CONF=$HA_DIR/common.vars
29
30 ## Define other variables that are local to this script here.
31 ## Use ${LOGGER} to print error and TESTING messages to /var/adm/SYSLOG
32 ## file.
33
34 # Source in common variables
35 . $CONF
36
37 if [ X$TESTING = X"ok" ]; then
38     ## Replace <application> in the next line.
39     ${LOGGER} "Executing <application> script"
40 fi
41
42 if [ $# -ne 2 ]; then
43     ${LOGGER} "Illegal syntax: checksum and argument required"
44     ${LOGGER} "Usage: $0 <checksum> <argument>"
45     exit $ILLEGAL_CALL;
46 fi
47
48 if [ $2 != "giveback" -a $2 != "takeback" -a $2 != "takeover" -a $2 != "giveaway" ]; then
49     ${LOGGER} "Illegal argument: must be giveback, giveaway, takeback, or takeover"
50     ${LOGGER} "Usage: $0 <checksum> <argument>"
51     exit $ILLEGAL_CALL;
52 fi
53
54 #
55 # Compare the checksum argument (the checksum known by the node
56 # controller and application monitor) with the checksum of ha.conf
57 # on this system.
58 CNF_CHKSUM=$1
59 CHKSUM=`$CFG_SUM`
60 if [ $CNF_CHKSUM != $CHKSUM ]; then
61     ${LOGGER} "Checksum mismatch [argument: $CNF_CHKSUM] [file: $CHKSUM]"
62     exit $INCORRECT_CONF_FILE;
63 fi
64
65 HOST=`hostname`
66
67 ##
68 ## Replace appclass with the name of the application
69 ##
70 LOGFILE=/var/ha/logs/appclass.log
71 echo Started logging at `date`> $LOGFILE
72
73 #
74 # Executes the command $EXEC and prints the command, output and
75 # error to log file $LOGFILE. If the return value from the command
76 # is non-zero, the function exits with value 3.
77 # It takes one parameter, log message about the command.
78 #
79 execute_cmd()
80 {
81
82     echo $1 >> $LOGFILE;
83     if [ X${TESTING} = Xok ]; then
84         ${LOGGER} $1
85     fi
86
87     eval $EXEC >> $LOGFILE 2>&1;
88
89     exit_code=$?;
90
91     if [ $exit_code -ne 0 ]; then
92         echo "ERROR: $EXEC, exit_code: $exit_code" >> $LOGFILE;
93         ${LOGGER} "ERROR: $EXEC"
94         exit 3;
95     fi
96
97     echo "*** $EXEC completed with exit_code 0 ***" >> $LOGFILE;
98
99 }
100
101 ## Put the procedures here.
102
103 ## Comment about giveback() procedure.
104 ## Stop all the application instance(s) or resource(s) for which $HOST
105 ## is the backup-node.
106 ## Use $CFG_INFO to read information from the configuration file, ha.conf.
107 ## To get more information about the command, see ha_cfginfo(1M) manpage.
108 ## The procedure should return $FAILED on failure and $SUCCESS on
109 ## success of the operation.
110 ## giveback()
111 ## {
112 ## ...
113 ## }
114
115 ## Comment about giveaway() procedure.
116 ## Stop all the application instance(s) or resource(s) for which $HOST
117 ## is the server-node.
118 ## Use $CFG_INFO to read information from the configuration file, ha.conf.
119 ## The procedure should return $FAILED on failure and $SUCCESS on
120 ## success of the operation.
121 ## giveaway()
122 ## {
123 ## ...
124 ## }
125
126 ## Comment about takeover() procedure.
127 ## Start all the application instance(s) or resource(s) for which $HOST
128 ## is the backup-node.
129 ## Use $CFG_INFO to read information from the configuration file, ha.conf.
130 ## The procedure should return $FAILED on failure and $SUCCESS on
131 ## success of the operation.
132 ## takeover()
133 ## {
134 ## ...
135 ## }
136
137 ## Comment about takeback() procedure.
138 ## Start all the application instance(s) or resource(s) for which $HOST
139 ## is the server-node.
140 ## Use $CFG_INFO to read information from the configuration file, ha.conf.
141 ## The procedure should return $FAILED on failure and $SUCCESS on
142 ## success of the operation.
143 ## takeback()
144 ## {
145 ## ...
146 ## }
147
148 ## Make calls to operation procedures here.
149
150 if [ $2 = "giveback" ]; then
151     giveback;
152 elif [ $2 = "takeover" ]; then
153     takeover;
154 elif [ $2 = "takeback" ]; then
155     takeback;
156 elif [ $2 = "giveaway" ]; then
157     giveaway;
158 fi
159
160 # Exit with SUCCESS.
161
162 exit $SUCCESS;
The failover script template can be broken into these sections:
Lines 22 to 25 set variables for the script return values. Failover scripts have these return values:
0 ($SUCCESS)—Success; the operation succeeded.
1 ($ILLEGAL_CALL)—An invalid argument was passed to the script.
2 ($INCORRECT_CONF_FILE)—The configuration file is invalid; either information in the configuration file is incorrect, some information is missing from the configuration file, or the configuration file changed between starting up IRIS FailSafe and the execution of the script.
3 ($FAILED)—The operation failed.
If the failover script returns a non-zero value, the script is assumed to have failed.
Line 35 sources the file /var/ha/actions/common.vars, which assigns strings in /var/ha/ha.conf to variables. It also defines the ${LOGGER} command, which is used to write messages to /var/adm/SYSLOG, and the ${TESTING} variable, which controls whether debugging information is written to /var/adm/SYSLOG, and it sets the variable ${CFG_SEP} to the character #.
Lines 37 to 40 write a message to /var/adm/SYSLOG if ${TESTING} is set to ok.
Lines 42 to 63 check the command-line arguments. The script takes two arguments:
The first argument is the checksum for the configuration file. Lines 58 to 63 compare the checksum argument with the checksum of /var/ha/ha.conf.
The second argument is an operation: takeback, takeover, giveaway, or giveback. Lines 48 to 52 check this argument.
Line 65 sets $HOST to the result of the hostname command.
Lines 70 and 71 set $LOGFILE to the name of the log file and write a message to it. The directory for the log file is /var/ha/logs. The convention for the filename is the name of the application and the word log, separated by a period.
Lines 73 to 99 describe and define the execute_cmd() procedure. It takes one parameter, a message string of your choice. It executes the command stored in $EXEC and writes the message passed as the parameter, the command executed, the output of the command, and a message about the return value of the command to the log file. If the command returns a non-zero value, execute_cmd() exits the script with the value 3. You set $EXEC in the takeback(), takeover(), giveaway(), or giveback() function before calling execute_cmd().
Lines 103 to 113 describe and define the giveback() procedure. The giveback() procedure is described fully in the next subsection, “Writing the Failover Functions.”
Lines 115 to 124 describe and define the giveaway() procedure. The giveaway() procedure is described fully in the next subsection, “Writing the Failover Functions.”
Lines 126 to 135 describe and define the takeover() procedure. The takeover() procedure is described fully in the next subsection, “Writing the Failover Functions.”
Lines 137 to 146 describe and define the takeback() procedure. The takeback() procedure is described fully in the next subsection, “Writing the Failover Functions.”
Lines 150 to 158 call one of the failover functions—the one passed as an argument to the failover script.
Line 162 exits with the success return value.
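The argument check on lines 48 to 52 and the dispatch on lines 150 to 158 use unquoted tests, which can misbehave if the second argument is empty or contains spaces. As an alternative form, and not the template as shipped, both steps can be combined in a single case statement. In this sketch the wrapper name run_operation is hypothetical and the four operations are placeholders that only print their names.

```shell
#!/bin/sh
SUCCESS=0
ILLEGAL_CALL=1
FAILED=3

# Placeholder operations; a real script would start or stop the application.
giveback() { echo "giveback called"; }
giveaway() { echo "giveaway called"; }
takeover() { echo "takeover called"; }
takeback() { echo "takeback called"; }

# Hypothetical wrapper: validate the operation name and call the
# matching function, returning the exit codes used by the template.
run_operation()
{
    case "$1" in
        giveback|giveaway|takeover|takeback)
            "$1" || return $FAILED
            return $SUCCESS
            ;;
        *)
            echo "Illegal argument: must be giveback, giveaway, takeback, or takeover" >&2
            return $ILLEGAL_CALL
            ;;
    esac
}
```

In a script based on the template, the call would be run_operation "$2", followed by an exit with its return value.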
This section describes how to write the takeback(), takeover(), giveaway(), and giveback() functions. The purposes of these functions are described below:
| takeback() | Starts all instances of the application class for which this node is the primary node. |
| takeover() | Starts all instances of the application class for which this node is the backup node. |
| giveaway() | Stops all instances of the application class for which this node is the primary node. |
| giveback() | Stops all instances of the application class for which this node is the backup node. |
As an example, this section uses the named daemon as the application to be failed over. It will be run in an active/backup configuration—only one instance of named runs on the cluster. Follow these general steps to write the failover functions:
Determine the commands required to start and stop instances of the application.
The script /etc/init.d/network normally starts named on a standalone system. According to that script, the command to start named is
/usr/sbin/named `cat /etc/config/named.options 2> /dev/null` < /dev/null
The command to stop named is
/sbin/killall -k 1 -TERM named
Review the application's use of configuration files and their locations (on shared disks or non-shared disks?).
For example, named uses the configuration files /etc/named.boot and /etc/config/named.options. These files reside on non-shared disks and are identical on each node. Thus, named is not dependent upon filesystems that must be failed over.
Develop the takeback() function. Shown below is the body of this function for named, along with line numbers and comments.
1 NAMED=named
2 # for each named block ...
3 for i in `$CFG_INFO ${NAMED}`
4 do
5     # get the server-node name
6     SEARCH="$CFG_INFO ${NAMED}${CFG_SEP}${i}${CFG_SEP}${T_SERVER}"
7     SERVER_NODE=`$SEARCH`
8     # if that failed, log a message and exit
9     if [ $? -eq 1 ]; then
10         ${LOGGER} "$0: Trouble finding server node for named ($SEARCH)"
11         exit $INCORRECT_CONF_FILE;
12     fi
13
14     # if server-node matches $HOST ...
15     if [ X${SERVER_NODE} = X$HOST ]; then
16
17         # execute the command that starts the application
18         EXEC="/usr/sbin/named `cat /etc/config/named.options 2> /dev/null`< /dev/null"
19         execute_cmd "${EXEC}"
20     fi
21 done
22 # exit with success
23 exit $SUCCESS;
24 }
Develop the takeover() function. It is the same as takeback(), with this exception:
5 # get the backup-node name
6 SEARCH="$CFG_INFO ${NAMED}${CFG_SEP}${i}${CFG_SEP}${T_BACKUP}"
Develop the giveaway() function. It is the same as takeback(), with this exception:
17 # execute the command that stops the application
18 EXEC="/sbin/killall -k 1 -TERM named";
Develop the giveback() function. It is the same as takeback(), with these exceptions:
5 # get the backup-node name
6 SEARCH="$CFG_INFO ${NAMED}${CFG_SEP}${i}${CFG_SEP}${T_BACKUP}"
17 # execute the command that stops the application
18 EXEC="/sbin/killall -k 1 -TERM named";
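Because the four functions differ only in which node parameter they compare against $HOST and which command they run, they can share one loop. The following is a sketch of that idea rather than the documented approach. The helper for_each_instance is hypothetical, fake_cfginfo and the one-line execute_cmd() are stand-ins so that the example runs outside a cluster, the values given for $T_SERVER and $T_BACKUP are assumptions about what common.vars defines, and the start command is simplified.

```shell
#!/bin/sh
# Values normally supplied by /var/ha/actions/common.vars; the literal
# strings here are assumptions for this sketch.
CFG_SEP="#"
T_SERVER="server-node"
T_BACKUP="backup-node"
INCORRECT_CONF_FILE=2
HOST=nodeA                  # a real script uses `hostname`

# Stand-in for ha_cfginfo describing one named instance, ns1,
# with server node nodeA and backup node nodeB.
fake_cfginfo()
{
    case "$1" in
        named)                   echo "ns1" ;;
        "named#ns1#server-node") echo "nodeA" ;;
        "named#ns1#backup-node") echo "nodeB" ;;
        *) return 1 ;;
    esac
}
CFG_INFO=fake_cfginfo

# Stand-in for the template's execute_cmd(): report instead of run.
execute_cmd() { echo "would run: $EXEC"; }

# Hypothetical helper: for every named instance whose node parameter $1
# matches $HOST, run command $2 through execute_cmd().
for_each_instance()
{
    for i in `$CFG_INFO named`
    do
        NODE=`$CFG_INFO "named${CFG_SEP}${i}${CFG_SEP}$1"` || return $INCORRECT_CONF_FILE
        if [ "X$NODE" = "X$HOST" ]; then
            EXEC=$2
            execute_cmd "$EXEC"
        fi
    done
    return 0
}

START="/usr/sbin/named"     # simplified; options omitted
STOP="/sbin/killall -k 1 -TERM named"
takeback() { for_each_instance "$T_SERVER" "$START"; }
takeover() { for_each_instance "$T_BACKUP" "$START"; }
giveaway() { for_each_instance "$T_SERVER" "$STOP"; }
giveback() { for_each_instance "$T_BACKUP" "$STOP"; }
```

With the stand-in configuration, takeback and giveaway act on nodeA and do nothing on nodeB, while takeover and giveback do the reverse.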
To execute each command you add to a script, you have these choices:
Execute the command directly.
Use the execute_cmd() function to execute the command.
The execute_cmd() function executes the command specified by the variable $EXEC, which you set in the takeback(), takeover(), giveaway(), or giveback() function. It takes one parameter, a message string of your choice. It writes that message, the command executed, the output of the command, and a message about the return value of the command to the log file /var/ha/logs/<application_class>.log.
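The two choices can be compared side by side in a short sketch. The execute_cmd() below follows the logic of the template version with quoting added, while the $LOGFILE location, the no-op $LOGGER, and the echo commands standing in for a real application are assumptions made so that the example runs outside FailSafe.

```shell
#!/bin/sh
# Stand-ins so the sketch runs outside FailSafe; in a real script these
# come from the template and /var/ha/actions/common.vars.
LOGFILE=${TMPDIR:-/tmp}/appclass.log.$$
LOGGER=:                    # the shell no-op, in place of the real logger
TESTING=no

# Same logic as the template's execute_cmd(), with quoting added.
execute_cmd()
{
    echo "$1" >> "$LOGFILE"
    if [ "X${TESTING}" = Xok ]; then
        ${LOGGER} "$1"
    fi
    eval "$EXEC" >> "$LOGFILE" 2>&1
    exit_code=$?
    if [ $exit_code -ne 0 ]; then
        echo "ERROR: $EXEC, exit_code: $exit_code" >> "$LOGFILE"
        ${LOGGER} "ERROR: $EXEC"
        exit 3
    fi
    echo "*** $EXEC completed with exit_code 0 ***" >> "$LOGFILE"
}

# Choice 1: execute the command directly and check the result yourself.
echo "direct call" > /dev/null || exit 3

# Choice 2: set $EXEC, then let execute_cmd() run the command and log it.
EXEC="echo hello from the application"
execute_cmd "Starting the application"
```

The direct form leaves logging and error handling to you; execute_cmd() does both, but it ends the whole script on the first failing command.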