A FailSafe system has the following software layers:
Plug-ins, which create highly available services. The following table shows the provided and optional FailSafe plug-ins and their associated resource types.
Table A-1. Provided and Optional Plug-Ins
Provided Plug-In Resource Type | Optional Plug-In Resource Type |
|---|---|
CXFS file system | FailSafe/DMF |
IP addresses | FailSafe/NFS |
MAC addresses | FailSafe/Informix |
XFS file systems | FailSafe/Oracle |
XLV logical volumes | FailSafe/Samba |
| | FailSafe/TMF |
| | FailSafe/Web (Netscape) |
See the release notes for information about the specific releases of these products that are supported.
Note: The Samba interfaces parameter allows Samba to support multiple IP interfaces. It takes the following format, where IP must be a dotted decimal IP address and netmask must be a dotted decimal netmask such as 255.255.255.0:
|
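The constraint on the two values can be checked before they are placed in a Samba configuration. The following is a minimal sketch using Python's standard ipaddress module; the function name is hypothetical, and the sketch checks only the two values, not the format of the interfaces parameter itself.

```python
import ipaddress


def is_dotted_decimal_pair(ip: str, netmask: str) -> bool:
    """Return True if ip is a dotted decimal IPv4 address and netmask
    is a dotted decimal netmask such as 255.255.255.0.

    Hypothetical helper for illustration; not part of FailSafe or Samba.
    """
    try:
        ipaddress.IPv4Address(ip)
        # Parsing a network with this mask rejects non-contiguous masks.
        network = ipaddress.IPv4Network(f"0.0.0.0/{netmask}")
    except ValueError:
        return False
    # Reject prefix lengths ("24") and host masks ("0.0.0.255"), which
    # the ipaddress module would otherwise accept as a mask.
    return str(network.netmask) == netmask
```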
If the application you want is not available, you can hire the SGI Professional Services group to develop the required software, or you can use the IRIS FailSafe Version 2 Programmer's Guide to write the software yourself.
FailSafe base, which includes the ability to define resource groups and failover policies.
Cluster services, which let you define clusters, resources, and resource types. This layer consists of the cluster_services installation package.
Cluster software infrastructure, which lets you do the following:
Perform node logging
Administer the cluster
Define nodes
The cluster software infrastructure consists of the cluster_admin and cluster_control subsystems.
Figure A-1 shows a graphic representation of these layers. The cluster services and cluster software infrastructure layers are shared with CXFS. Table A-2 describes the contents of the /usr/cluster/bin directory. For more information about CXFS, see the CXFS Version 2 Software Installation and Administration Guide.
Table A-2. Contents of /usr/cluster/bin
Layer | Subsystem | Process | Description |
|---|---|---|---|
Plug-ins | failsafe_informix failsafe2_oracle | ha_ifmx2 | IRIS FailSafe database agents. Each database agent monitors all instances of one type of database. |
IRIS FailSafe Base | failsafe2 | ha_fsd | IRIS FailSafe daemon. Provides basic component of the IRIS FailSafe software. |
Cluster services (high-availability processes) | cluster_services | ha_cmsd | The FailSafe membership daemon. Provides the list of nodes, called FailSafe membership, available to the cluster. |
| | | ha_gcd | Group membership daemon. Provides group membership and reliable communication services in the presence of failures to IRIS FailSafe processes. |
| | | ha_srmd | System resource manager daemon. Manages resources, resource groups, and resource types. Executes action scripts for resources. |
| | | ha_ifd | Interface agent daemon. Monitors the local node's network interfaces. This daemon is described in detail in “Interface Agent Daemon (IFD)”. |
Cluster software infrastructure (cluster administrative processes) | cluster_admin | cad | Cluster administration daemon. Provides administration services. |
| | cluster_control | crsd | Node control daemon. Monitors the serial connection to other nodes. Has the ability to reset other nodes. |
| | | cmond | Daemon that manages all other daemons. This process starts other processes in all nodes in the cluster and restarts them on failures. |
| | | fs2d | Manages the cluster database and keeps each copy in sync on all nodes in the pool. |
The IFD is an agent that monitors network interfaces and IP addresses. The IFD monitors all network interfaces and IP addresses configured in the node even when there are no highly available IP addresses in the node.
The IFD checks the number of input packets for each interface. If the number of input packets does not increase for a 10-second period, the IFD contacts the broadcast address of the interface by using the ping(1M) command. If the input packet count does not increase in the next 10-second period, the network interface and all IP addresses on the interface are marked as bad.
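The packet-count check described above can be pictured as a small state machine that is advanced once per 10-second period. The following is a minimal sketch, not the IFD implementation: the class, state names, and ping_broadcast callback are hypothetical, and the return to the ok state when packets resume is an assumption that the text does not state.

```python
from dataclasses import dataclass
from typing import Callable

CHECK_PERIOD_SECONDS = 10  # period named in the text


@dataclass
class InterfaceCheck:
    """Hypothetical per-interface state for illustration only."""
    name: str
    last_count: int
    state: str = "ok"  # "ok", "probing", or "bad"

    def check(self, input_packets: int,
              ping_broadcast: Callable[[str], None]) -> str:
        """Advance one check period using the latest input packet count."""
        increased = input_packets > self.last_count
        self.last_count = input_packets
        if increased:
            # Assumption: any increase clears an earlier suspicion.
            self.state = "ok"
        elif self.state == "ok":
            # First quiet period: provoke traffic by pinging the
            # broadcast address of the interface.
            ping_broadcast(self.name)
            self.state = "probing"
        elif self.state == "probing":
            # Second quiet period: the interface and all of its
            # IP addresses are marked as bad.
            self.state = "bad"
        return self.state
```

A caller would invoke check() every CHECK_PERIOD_SECONDS with the current input packet counter of the interface.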
The IFD reads the configuration of IP addresses from the cluster database.
IP_address resource type action scripts use the ha_ifdadmin command to communicate with the IFD. Action scripts obtain the status and configuration of IP addresses from the IFD.
IFD logging can be controlled with the GUI and the cmgr command.
The following figures show communication paths in FailSafe.
Note: The following figures do not represent the cmond cluster manager daemon. The purpose of this daemon is to keep the other daemons running.
The order of execution is as follows:
FailSafe is started with the start ha_services command in cmgr or as part of the node bootup procedure. It then reads the resource group information from the cluster database.
FailSafe tells the system resource manager (SRM) to run exclusive scripts for all resource groups that are in the Online ready state.
SRM returns one of the following states for each resource group:
running
partially running
not running
If a resource group has a state of not running in a node where HA services have been started, the following occurs:
FailSafe runs the failover policy script associated with the resource group. The failover policy script takes the list of nodes that are capable of running the resource group (the failover domain) as a parameter.
The failover policy script returns an ordered list of nodes in descending order of priority (the run-time failover domain) where the resource group can be placed.
FailSafe sends a request to SRM to move the resource group to the first node in the run-time failover domain.
SRM executes the start action script for all resources in the resource group:
If the start script fails, the resource group is marked online on that node with the following error:
srmd executable error
If the start script is successful, SRM automatically starts monitoring those resources. After the specified start monitoring time passes, SRM executes the monitor action script for the resource in the resource group.
If the resource group has a state of running or partially running on only one node in the cluster, FailSafe runs the associated failover policy script:
If the highest priority node is the node where the resource group is running or partially running, the resource group is made online on that node. In the partially running case, FailSafe tells SRM to execute start scripts for all resources in the resource group.
If the highest priority node is another node in the cluster, FailSafe tells SRM to execute stop action scripts for resources in the resource group on other nodes. FailSafe then makes the resource group online in the highest priority node in the cluster.
If the state of the resource group is running or partially running in multiple nodes in the cluster, the resource group is marked with an exclusivity error. These resource groups will require operator intervention to become online in the cluster.
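The startup decisions above can be summarized as one function of the states that SRM reports for a resource group. The following is a minimal sketch under simplifying assumptions: the function name and the action labels are hypothetical, the policy argument stands in for a failover policy script, and the sketch treats "not running" as meaning not running on any node.

```python
def plan_startup(states, failover_domain, policy):
    """Decide what to do with one resource group at startup.

    states: dict mapping node name to "running", "partially running",
            or "not running", as returned by the exclusive scripts.
    failover_domain: nodes capable of running the resource group.
    policy: callable that takes the failover domain and returns the
            run-time failover domain, highest priority first.
            (Assumed to return at least one node.)
    Returns (action, node); the action labels are illustrative only.
    """
    active = [node for node, state in states.items()
              if state != "not running"]
    if len(active) > 1:
        # Running or partially running on multiple nodes:
        # operator intervention is required.
        return ("exclusivity error", None)

    target = policy(failover_domain)[0]
    if not active:
        # Not running anywhere: start on the first node.
        return ("start", target)

    current = active[0]
    if target == current:
        if states[current] == "partially running":
            # Start scripts are run for all resources in the group.
            return ("run start scripts", current)
        return ("keep online", current)

    # Highest priority node is elsewhere: stop, then bring online there.
    return ("stop then start", target)
```

For example, with a policy that returns the failover domain unchanged, a group that is running only on the second node of the domain yields a stop on that node followed by a start on the first node.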
Figure A-8 shows the message paths for action scripts and failover policy scripts.
When the start action script fails, the order of execution is as follows:
SRM notifies FailSafe of the start action script failure as a resource group failure.
FailSafe runs the failover policy script to determine the next node for the resource group.
FailSafe sends a request to SRM to release the resource group and allocate it on the next node in the cluster.
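The failure handling above amounts to a retry loop over the nodes chosen by the failover policy. The following is a minimal sketch, not FailSafe code: the function name and the start_on callback are hypothetical, and two behaviors are assumptions that the text does not state, namely that a node where the start failed is excluded from the next policy run and that the loop ends when no candidate node remains.

```python
def place_with_failover(failover_domain, policy, start_on):
    """Try to bring a resource group online, moving on after failures.

    failover_domain: nodes capable of running the resource group.
    policy: callable returning the given nodes in priority order,
            standing in for the failover policy script.
    start_on: callable taking a node name and returning True if the
              start action scripts succeeded there.
    Returns (node_or_None, attempted_nodes).
    """
    candidates = list(failover_domain)
    attempted = []
    while candidates:
        # Run the policy to determine the next node.
        node = policy(candidates)[0]
        attempted.append(node)
        if start_on(node):
            return node, attempted
        # Start failed: the group is released on this node and the
        # node is dropped from further consideration (assumption).
        candidates.remove(node)
    return None, attempted
```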
The cluster database is a key component of FailSafe software. It contains all information about the following:
Resources
Resource types
Resource groups
Failover policies
Nodes
Clusters
The cluster database daemon (fs2d) maintains identical databases on each node in the cluster.
The following table shows the contents of the /var/cluster/ha directory.
Table A-3. Contents of the /var/cluster/ha directory
Directory or File | Purpose |
|---|---|
comm/ | Directory that contains files that communicate between various daemons. FailSafe processes create temporary files in this directory. FailSafe interprocess communication will fail if there is not sufficient disk space for this directory (approximately 2-3 MB) in the root filesystem on every node in a FailSafe cluster. |
common_scripts/ | Directory that contains the script library (the common functions that may be used in action scripts). |
log/ | Directory that contains the logs of all scripts and daemons executed by IRIS FailSafe. The outputs and errors from the commands within the scripts are logged in the script_Nodename file. |
policies/ | Directory that contains the failover scripts used for resource groups. |
resource_types/template | Directory that contains the template action scripts. |
resource_types/RTname | Directory that contains the action scripts for the RTname resource type. For example, /var/cluster/ha/resource_types/filesystem. |
resource_types/RTname/exclusive | Script that verifies that a resource of this resource type is not already running. |
resource_types/RTname/monitor | Script that monitors a resource of this resource type. |
resource_types/RTname/restart | Script that restarts a resource of this resource type on the same node after a monitoring failure. |
resource_types/RTname/start | Script that starts a resource of this resource type. |
resource_types/RTname/stop | Script that stops a resource of this resource type. |
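The layout in Table A-3 lends itself to a quick check that a resource type directory contains the action scripts listed there. The following is a minimal sketch; the function name is hypothetical, and the table does not say whether every resource type must provide all five scripts, so a missing script is reported rather than treated as an error.

```python
from pathlib import Path

# Action script names listed in Table A-3.
ACTION_SCRIPTS = ("exclusive", "monitor", "restart", "start", "stop")


def missing_action_scripts(rt_name,
                           base="/var/cluster/ha/resource_types"):
    """Return the Table A-3 action scripts not found for a resource type.

    Hypothetical helper for illustration; base can be overridden
    so that the check can be tried outside a FailSafe node.
    """
    rt_dir = Path(base) / rt_name
    return [script for script in ACTION_SCRIPTS
            if not (rt_dir / script).is_file()]
```

For example, missing_action_scripts("filesystem") would report any of the five scripts absent from /var/cluster/ha/resource_types/filesystem.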