This chapter explains how to plan the configuration of highly available (HA) services on your FailSafe cluster.
You must first decide how you want to use the cluster. You can then configure the disks and interfaces to meet the needs of the HA services you want the cluster to provide.
Questions you must answer during the planning process are as follows:
How do you plan to use the nodes?
Your answers might include uses such as offering home directories for users, running particular applications, supporting an Oracle database, providing Netscape Web service, and providing file service.
Which of these uses will be provided as an HA service?
SGI has developed FailSafe software options for some HA applications; see “Software Layers” in Appendix A. To offer other applications as HA services, you must develop a set of application monitoring shell scripts as described in the IRIS FailSafe Version 2 Programmer's Guide. If you need assistance, contact SGI Professional Services, which offers custom FailSafe agent development and integration services.
Which node will be the primary node for each HA service?
The primary node is the node that provides the service (exports the filesystem, is a Netscape server, provides the database, and so on).
For each HA service, how will the software and data be distributed on shared and non-shared disks?
Each application has requirements and choices for placing its software on disks that are failed over (shared) or not failed over (non-shared).
Are the shared disks going to be part of a RAID storage system, or are they going to be disks in SCSI/Fibre Channel disk storage that have plexed XLV logical volumes on them?
Shared disks must be part of a RAID storage system or in SCSI/Fibre Channel disk storage with plexed XLV logical volumes on them.
How will shared disks be configured?
As raw XLV logical volumes?
As XLV logical volumes with XFS filesystems on them?
As CXFS filesystems, which use XVM logical volumes? For information on using FailSafe and CXFS, see “Coexecution of CXFS and FailSafe”.
The choice of volumes or filesystems depends on the application that is going to use the disk space.
Which IP addresses will be used by clients of HA services?
Multiple interfaces may be required on each node because a node could be connected to more than one network or because there could be more than one interface to a single network.
Which resources will be part of a resource group?
All resources that are dependent on each other must be in the same resource group.
What will be the failover domain of the resource group? (For more information about failover domains, see “Failover Domain” in Chapter 1.)
The failover domain determines the list of nodes in the cluster where the resource group can reside. For example, a volume resource that is part of a resource group can reside only in nodes from which the disks composing the volume can be accessed.
How many HA IP addresses on each network interface will be available to clients of the HA services?
At least one HA IP address must be available for each interface on each node that is used by clients of HA services.
Which HA IP addresses on nodes in the failover domain are going to be available to clients of the HA services?
For each HA IP address that is available on a node in the failover domain to clients of HA services, which interface on the other nodes will be assigned that IP address after a failover?
Every HA IP address used by an HA service must be mapped to at least one interface in each node that can take over the resource group service. The HA IP addresses are failed over from the interface in the primary node of the resource group to the interface in the replacement node.
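The mapping rule above can be checked mechanically during planning. The following is a minimal Python sketch, not part of FailSafe: the function name and the dictionary layout are hypothetical, and the interface assignments are made-up sample data. It reports every HA IP address that has no interface assigned on some node in the failover domain.

```python
def unmapped_ha_ips(failover_domain, ip_interface_map):
    """Return (ip, node) pairs where an HA IP address has no interface on a node.

    failover_domain:  list of node names that can host the resource group
    ip_interface_map: {ha_ip: {node_name: interface_name}} (hypothetical layout)
    """
    missing = []
    for ip, per_node in ip_interface_map.items():
        for node in failover_domain:
            if not per_node.get(node):
                missing.append((ip, node))
    return missing


# Sample planning data (assumed values, for illustration only).
plan = {
    "192.26.50.1": {"xfs-ha1": "ef0", "xfs-ha2": "ef0"},
    "192.26.50.2": {"xfs-ha1": "ef0"},  # no interface planned on xfs-ha2
}
print(unmapped_ha_ips(["xfs-ha1", "xfs-ha2"], plan))
# [('192.26.50.2', 'xfs-ha2')]
```

An empty result means that every HA IP address in the plan can be taken over by every node in the failover domain.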
As an example of the configuration planning process, suppose that you have a two-node FailSafe cluster that is a departmental server. You want to make four XFS filesystems available for NFS mounting and have two Netscape FastTrack servers, each serving a different set of documents. These applications will be HA services.
You decide to distribute the services across two nodes, so each node will be the primary node for two filesystems and one Netscape server. The filesystems and the document roots for the Netscape servers (on XFS filesystems) are each on their own plexed XLV logical volume. The logical volumes are created from disks in a Fibre Channel storage system connected to both nodes.
There are four resource groups:
NFSgroup1
NFSgroup2
Webgroup1
Webgroup2
NFSgroup1 and NFSgroup2 are the NFS resource groups; Webgroup1 and Webgroup2 are the Web resource groups. NFSgroup1 and Webgroup1 will have one node as the primary node. NFSgroup2 and Webgroup2 will have the other node as the primary node.
Two network interfaces are available on each node, ef0 and ef1. The ef1 interfaces in each node are connected to each other to form a private network.
Figure 2-1 depicts this configuration.
For each disk in a FailSafe cluster, you must choose whether to make it a shared disk, which enables it to be failed over, or a non-shared disk. Non-shared disks are not failed over.
The nodes in a FailSafe cluster must follow these requirements:
The system disk must be a non-shared disk
The FailSafe software must be on a non-shared disk
All system directories (/tmp, /var, /usr, /bin, /dev, and so on) should be on a non-shared disk
Only HA application data and configuration data can be placed on a shared disk. Choosing to make a disk shared or non-shared depends on the needs of the HA services that use the disk. Each HA service has requirements about the location of data associated with the service:
Some data must be placed on non-shared disks
Some data must be placed on shared disks
Some data can be on shared or non-shared disks
The figures in the remainder of this section show the basic disk configurations on FailSafe clusters before and after failover. A cluster can contain a combination of the following basic disk configurations:
A non-shared disk on each node
Multiple shared disks containing Web server and NFS file server documents
| Note: In each of the before and after failover diagrams, each disk shown can represent a set of disks. |
Figure 2-2 shows two nodes in a cluster, each of which has a non-shared disk, with two resource groups. When non-shared disks are used by HA applications, the data required by those applications must be duplicated on non-shared disks on both nodes. Clients should access the data using an HA IP alias. When a failover occurs, the HA IP aliases fail over.
| Note: The hostname is bound to a different IP address that never moves. |
The data that was originally available on the failed node is still available from the replacement node by using the HA IP alias to access it.
The configuration in Figure 2-2 contains two resource groups:
| Resource Group | Resource Type | Resource |
|---|---|---|
| Group1 | IP_address | 192.26.50.1 |
| Group2 | IP_address | 192.26.50.2 |
Figure 2-3 shows a two-node configuration with one resource group, Group1:
| Resource Group | Resource Type | Resource | Failover Domain |
|---|---|---|---|
| Group1 | IP_address | 192.26.50.1 | (xfs-ha1, xfs-ha2) |
| | filesystem | /shared | |
| | volume | shared_vol | |
In this configuration, the resource group Group1 has a primary node, which is the node that accesses the disk prior to a failover. It is shown by a solid line connection. The backup node, which accesses the disk after a failover, is shown by a dotted line. Thus, the disk is shared between the nodes. In an active/backup configuration, all resource groups have the same primary node. The backup node does not run any HA resource groups until a failover occurs.
Figure 2-4 shows two shared disks in a two-node cluster with two resource groups, Group1 and Group2:
| Resource Group | Resource Type | Resource | Failover Domain |
|---|---|---|---|
| Group1 | IP_address | 192.26.50.1 | (xfs-ha1, xfs-ha2) |
| | filesystem | /shared1 | |
| | volume | shared1_vol | |
| Group2 | IP_address | 192.26.50.2 | (xfs-ha2, xfs-ha1) |
| | filesystem | /shared2 | |
| | volume | shared2_vol | |
In this configuration, each node serves as a primary node for one resource group. The solid lines show the connections to the primary nodes prior to failover. The dotted lines show the connections to the backup nodes. After a failover, the surviving node has all the resource groups.
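The before-and-after behavior of this two-group configuration can be modeled in a few lines. The sketch below is illustrative Python, not FailSafe code. It assumes the simplest case, in which a resource group runs on the first node of its failover domain that is still up; actual FailSafe failover policies are configurable and may behave differently. The function name is hypothetical.

```python
def place_groups(failover_domains, up_nodes):
    """Place each group on the first node of its failover domain that is up.

    Assumption: ordered failover, first listed node is the primary.
    Returns None for a group when no node in its domain is available.
    """
    return {
        group: next((node for node in domain if node in up_nodes), None)
        for group, domain in failover_domains.items()
    }


domains = {
    "Group1": ["xfs-ha1", "xfs-ha2"],
    "Group2": ["xfs-ha2", "xfs-ha1"],
}

before = place_groups(domains, {"xfs-ha1", "xfs-ha2"})
after = place_groups(domains, {"xfs-ha2"})  # xfs-ha1 has failed

print(before)  # {'Group1': 'xfs-ha1', 'Group2': 'xfs-ha2'}
print(after)   # {'Group1': 'xfs-ha2', 'Group2': 'xfs-ha2'}
```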
Other sections in this chapter and similar sections in the IRIS FailSafe 2.0 Oracle Administrator's Guide and IRIS FailSafe 2.0 INFORMIX Administrator's Guide provide more specific information about choosing between shared and non-shared disks for various types of data associated with each HA service.
There are no configuration parameters associated with non-shared disks. They are not specified when you configure a FailSafe system. Only the XLV logical volumes on shared disks are specified at configuration. For more information, see “Resource Attributes for Logical Volumes”.
For information on using CXFS filesystems (which use XVM logical volumes) in a FailSafe configuration, see “Coexecution of CXFS and FailSafe”.
| Note: This section describes logical volume configuration using XLV logical volumes. See also “Coexecution of CXFS and FailSafe”, and “Local XVM Volumes”. |
All shared disks must contain XLV logical volumes. You can work with XLV logical volumes on shared disks as you would work with other disks. However, you must follow these rules:
All data that is used by HA applications on shared disks must be stored in XLV logical volumes.
If you create more than one XLV volume on a single physical disk, all of those volumes must be owned by the same node. For example, if a disk has two partitions that are part of two XLV volumes, both XLV volumes must be part of the same resource group. (See “Create XLV Logical Volumes and XFS Filesystems” in Chapter 3, for more information about XLV volume ownership.)
Each disk in a Fibre Channel or SCSI Vault or RAID logical unit number (LUN) must be part of one resource group. Therefore, you must divide the Fibre Channel or SCSI Vault disks and RAID LUNs into one set for each resource group. If you create multiple volumes on a Fibre Channel or SCSI Vault disk or RAID LUN, all of those volumes must be part of one resource group.
Do not simultaneously access a shared XLV volume from more than one node. Doing so causes data corruption.
The FailSafe software relies on the XLV naming scheme to operate correctly. A fully qualified XLV volume name uses one of the following formats:
    pathname/volname
    pathname/nodename.volname
The components are these:
pathname is /dev/xlv
nodename by default is the same as the hostname of the node on which the volume was created
volname is a name specified when the volume was created; this component is commonly used when a volume is to be operated on by any of the XLV tools
For example, if volume vol1 is created on node ha1 using disk partitions located on a shared disk, the raw character device name for the assembled volume is /dev/rxlv/vol1. On the peer ha2, however, the same raw character volume appears as /dev/rxlv/ha1.vol1, where ha1 is the nodename component and vol1 is the volname component. As can be seen from this example, when the nodename component is the same as the local hostname, it does not appear as part of the device node name.
One nodename is stored in each disk or LUN volume header. This is why all volumes with volume elements on any single disk must have the same nodename component.
| Caution: If this rule is not followed, FailSafe does not operate correctly. |
FailSafe modifies the nodename component of the volume header as volumes are transferred between nodes during failover and recovery operations. This is important because xlv_assemble assembles only those volumes whose nodename matches the local hostname. Some of the other XLV utilities allow you to see (and modify) all volumes, regardless of which node owns them.
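The naming rule described above can be summarized as a small function. This is a sketch in Python for illustration only; the function name and parameters are hypothetical, and it covers only the two directories named in this section (/dev/xlv and /dev/rxlv).

```python
def xlv_device_path(volname, owner_nodename, local_hostname, raw=False):
    """Return the device node name under which a volume appears on a node.

    The nodename component is omitted when it equals the local hostname.
    raw=True selects the raw character device directory (/dev/rxlv).
    """
    base = "/dev/rxlv" if raw else "/dev/xlv"
    if owner_nodename == local_hostname:
        return f"{base}/{volname}"
    return f"{base}/{owner_nodename}.{volname}"


# Volume vol1 created on ha1, viewed from ha1 and from its peer ha2.
print(xlv_device_path("vol1", "ha1", "ha1", raw=True))  # /dev/rxlv/vol1
print(xlv_device_path("vol1", "ha1", "ha2", raw=True))  # /dev/rxlv/ha1.vol1
```

Because FailSafe rewrites the nodename in the volume header during failover, the same volume appears under the short name on whichever node currently owns it.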
The resource name for a resource of resource type volume is the XLV volume name.
If you use XLV logical volumes as raw volumes (that is, with no filesystem) for storing database data, the database system may require that the device names in /dev/xlv have specific owners, groups, and modes. See the documentation provided by the database vendor to determine if the XLV logical volume device names must have owners, groups, and modes that are different from the default values (the defaults are root, sys, and 0600, respectively).
As an example of logical volume configuration, say that you have the following logical volumes on disks that we will call Disk1 through Disk5:
/dev/xlv/VolA (volume A) contains Disk1 and a portion of Disk2
/dev/xlv/VolB (volume B) contains the remainder of Disk2 and Disk3
/dev/xlv/VolC (volume C) contains Disk4 and Disk5
VolA and VolB must be part of the same resource group because they share a disk. VolC could be part of any resource group. Figure 2-5 illustrates this.
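For larger layouts, the sets of volumes that must travel together can be computed rather than worked out by hand. The following Python sketch (hypothetical function name, not a FailSafe tool) groups volumes that share a disk, directly or through a chain of other volumes, using a small union-find structure. The input mirrors the VolA, VolB, and VolC example.

```python
from collections import defaultdict


def colocated_volume_sets(volume_disks):
    """Group volumes that share any disk, directly or transitively.

    volume_disks: {volume_name: [disk_name, ...]}
    Each returned set must be placed in a single resource group.
    """
    parent = {vol: vol for vol in volume_disks}

    def find(vol):
        # Follow parent links to the representative, compressing the path.
        while parent[vol] != vol:
            parent[vol] = parent[parent[vol]]
            vol = parent[vol]
        return vol

    first_user = {}  # disk -> first volume seen using that disk
    for vol, disks in volume_disks.items():
        for disk in disks:
            if disk in first_user:
                parent[find(vol)] = find(first_user[disk])
            else:
                first_user[disk] = vol

    sets = defaultdict(set)
    for vol in volume_disks:
        sets[find(vol)].add(vol)
    return sorted(sorted(members) for members in sets.values())


layout = {
    "VolA": ["Disk1", "Disk2"],
    "VolB": ["Disk2", "Disk3"],
    "VolC": ["Disk4", "Disk5"],
}
print(colocated_volume_sets(layout))
# [['VolA', 'VolB'], ['VolC']]
```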
Table 2-1 lists the resource attributes for XLV logical volumes.
Table 2-1. XLV Logical Volume Resource Attributes
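| Resource Attribute | Default | Description |
|---|---|---|
| devname-owner | root | Owner of the XLV device name |
| devname-group | sys | Group of the XLV device name |
| devname-mode | 0600 | File permission mode of the XLV device name |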
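When a database vendor requires specific owners, groups, or modes, it can help to compare a device name against the planned attribute values before configuring the resource. The sketch below is a generic Python check under the assumption that standard UNIX file attributes are what matter; the helper names are hypothetical and nothing here is a FailSafe interface. The attribute keys reuse the names from Table 2-1.

```python
import grp
import os
import pwd
import stat


def _name(lookup, ident, field):
    """Resolve a numeric ID to a name, falling back to the number as text."""
    try:
        return getattr(lookup(ident), field)
    except KeyError:
        return str(ident)


def devname_attributes(path):
    """Read owner, group, and mode of a device name as Table 2-1 style values."""
    st = os.stat(path)
    return {
        "devname-owner": _name(pwd.getpwuid, st.st_uid, "pw_name"),
        "devname-group": _name(grp.getgrgid, st.st_gid, "gr_name"),
        "devname-mode": format(stat.S_IMODE(st.st_mode), "04o"),
    }


def mismatches(path, expected):
    """Return {attribute: (actual, expected)} for every value that differs."""
    actual = devname_attributes(path)
    return {k: (actual[k], v) for k, v in expected.items() if actual[k] != v}
```

For example, `mismatches("/dev/xlv/VolA", {"devname-owner": "root", "devname-group": "sys", "devname-mode": "0600"})` would return an empty dictionary for a device name that still has the default values.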
See the section “Create XLV Logical Volumes and XFS Filesystems” in Chapter 3, for information about creating XLV logical volumes.
This section describes filesystem configuration for FailSafe using XFS filesystems. For information on coexecution of FailSafe and CXFS filesystems, see “Coexecution of CXFS and FailSafe”.
FailSafe supports the failover of XFS filesystems on shared disks. Shared disks must be in Fibre Channel or SCSI JBOD storage or in RAID storage systems that are shared between nodes in the FailSafe cluster. Fibre Channel and SCSI JBOD storage systems must use XLV mirroring.
The following are special issues that you must be aware of when you are working with filesystems on shared disks in a cluster:
All filesystems to be failed over must be XFS filesystems.
All filesystems to be failed over must be created on XLV logical volumes on shared disks.
For availability, filesystems to be failed over in a cluster must be created on either mirrored disks (using the XLV plexing software) or on the Fibre Channel RAID storage system.
Create the mount points for the filesystems on all nodes in the failover domain.
When you set up the various FailSafe filesystems on each node, ensure that each filesystem uses a different mount point.
Do not simultaneously mount filesystems on shared disks on more than one node. Doing so causes data corruption. Normally, FailSafe performs all mounts of filesystems on shared disks. If you manually mount a filesystem on a shared disk, verify that it is not being used by another node.
Do not place filesystems on shared disks in the /etc/fstab file. FailSafe mounts these filesystems only after verifying that another node does not have these filesystems mounted.
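The /etc/fstab rule is easy to violate by accident, so a quick scan can be worthwhile. The following Python sketch is an illustration only (the function name and the sample fstab contents are made up). It looks at the second field of each non-comment line and reports any mount point that belongs to a shared filesystem.

```python
def shared_mounts_in_fstab(fstab_text, shared_mount_points):
    """Return shared-disk mount points that appear in fstab-style text."""
    found = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) >= 2 and fields[1] in shared_mount_points:
            found.append(fields[1])
    return found


# Made-up fstab contents for illustration.
sample_fstab = """\
/dev/root            /         xfs rw,raw=/dev/rroot 0 0
# /dev/xlv/VolB      /sharedB  xfs rw,noauto 0 0
/dev/xlv/VolA        /sharedA  xfs rw,noauto 0 0
"""
print(shared_mounts_in_fstab(sample_fstab, {"/sharedA", "/sharedB", "/sharedC"}))
# ['/sharedA']
```

On a real node, the text would come from reading /etc/fstab, and the result should be empty.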
The name of a resource of the filesystem resource type is the mount point of the filesystem.
When clients are actively writing to a FailSafe NFS filesystem during failover of filesystems, data corruption can occur unless filesystems are exported with the mode wsync. This mode requires that local mounts of the XFS filesystems use the wsync mount mode as well. Using wsync affects performance considerably.
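The dependency between the export mode and the mount mode can be stated as a one-line rule: if wsync appears in the export options, it must also appear in the mount options. The Python sketch below expresses that rule; the function name is hypothetical, and it assumes both option lists are comma-separated strings.

```python
def wsync_consistent(export_options, mount_options):
    """True unless the filesystem is exported wsync but not mounted wsync."""
    exported = "wsync" in export_options.split(",")
    mounted = "wsync" in mount_options.split(",")
    return (not exported) or mounted


print(wsync_consistent("rw,wsync", "rw,noauto,wsync"))  # True
print(wsync_consistent("rw,wsync", "rw,noauto"))        # False
print(wsync_consistent("rw", "rw,noauto"))              # True
```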
Continuing with the scenario from “Example Logical Volume Configuration”, suppose you have the following XFS filesystems:
xfsA on VolA is mounted at /sharedA with modes rw and noauto
xfsB on VolB is mounted at /sharedB with modes rw, noauto, and wsync
xfsC on VolC is mounted at /sharedC with modes rw and noauto
Table 2-2 lists a label and configuration parameters for each filesystem.
Table 2-2. Filesystem Configuration Parameters
Figure 2-6 shows the following:
Resource group 1 has two XFS filesystems (xfsA and xfsB) and two XLV volumes (VolA and VolB)
Resource group 2 has one XFS filesystem (xfsC) and one XLV volume (VolC)
See “Create XLV Logical Volumes and XFS Filesystems” in Chapter 3, for information about creating XFS filesystems.
Use the following guidelines when planning interface configuration for the private control network between nodes:
Each interface has one IP address.
The IP addresses used on each node for the interfaces to the private network are on a different subnet from the IP addresses used for public networks.
An IP name can be specified for each IP address in /etc/hosts.
A naming convention that identifies these IP addresses with the private network can be helpful. For example, precede the hostname with priv- (for private), as in priv-xfs-ha1 and priv-xfs-ha2.
Use the following guidelines when planning the interface configuration for one or more public networks:
If re-MACing is required, each interface to be failed over requires a dedicated backup interface on the other node (an interface that does not have an HA IP address). Thus, for each HA IP address on an interface that requires re-MACing, there should be one interface in each node in the failover domain dedicated for the interface.
Each interface has a primary IP address, also known as the fixed address. The primary IP address does not fail over.
The hostname of a node cannot be an HA IP address.
All HA IP addresses used by clients to access HA services must be part of the resource group to which the HA service belongs.
If re-MACing is required, all of the HA IP addresses must have the same backup interface.
Making good choices for HA IP addresses is important; these are the “hostnames” that will be used by users of the HA services, not the true hostnames of the nodes.
Make a plan for publicizing the HA IP addresses to the user community, because users of HA services must use HA IP addresses instead of the output of the hostname command.
HA IP addresses should not be configured in the /etc/config/netif.options file. HA IP addresses also should not be defined in the /etc/config/ipaliases.options file.
Use the following procedure to determine whether re-MACing is required. It requires the use of three nodes: node1, node2, and node3. node1 and node2 can be nodes of a FailSafe cluster, but they need not be. They must be on the same subnet. node3 is a third node. If you must verify that a router accepts gratuitous ARP packets (which means that re-MACing is not required), node3 must be on the other side of the router from node1 and node2. For more information about re-MACing, see “Network Interfaces and IP Addresses” in Chapter 1.
Configure an HA IP address on one of the interfaces of node1:
    # /usr/etc/ifconfig interface inet ip_address netmask netmask up
interface is the interface to be used to access the node. ip_address is an IP address for node1; this IP address is used throughout this procedure. netmask is the netmask of the IP address.
From node3, contact the HA IP address used in step 1 using the ping(1M) command:
    # ping -c 2 ip_address
For example, if the value for ip_address is 190.0.2.1:
    # ping -c 2 190.0.2.1
    PING 190.0.2.1 (190.0.2.1): 56 data bytes
    64 bytes from 190.0.2.1: icmp_seq=0 ttl=255 time=29 ms
    64 bytes from 190.0.2.1: icmp_seq=1 ttl=255 time=1 ms

    ----190.0.2.1 PING Statistics----
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 1/1/1 ms
Enter the following command on node1 to shut down the interface you configured in step 1:
    # /usr/etc/ifconfig interface down
On node2, enter the following command to move the HA IP address to node2:
    # /usr/etc/ifconfig interface inet ip_address netmask netmask up
On node3, contact the HA IP address:
    # ping -c 2 ip_address
If the ping(1M) command fails, gratuitous ARP packets are not being accepted and re-MACing is needed to fail over the HA IP address.
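The procedure above can be laid out as an ordered checklist. The Python sketch below does not run anything; it only builds the command lines from this procedure for a given interface, address, and netmask, and records the final decision rule. The function names are hypothetical, and the sketch assumes that the same interface name is used on node1 and node2, which need not be true on a real system.

```python
def remac_test_steps(interface, ip_address, netmask):
    """Return the (node, command) sequence for the re-MACing test."""
    up = f"/usr/etc/ifconfig {interface} inet {ip_address} netmask {netmask} up"
    down = f"/usr/etc/ifconfig {interface} down"
    ping = f"ping -c 2 {ip_address}"
    return [
        ("node1", up),    # step 1: configure the address on node1
        ("node3", ping),  # step 2: confirm node3 can reach it
        ("node1", down),  # step 3: shut down the interface on node1
        ("node2", up),    # step 4: move the address to node2
        ("node3", ping),  # step 5: try to reach it again from node3
    ]


def remacing_required(final_ping_succeeded):
    """Re-MACing is needed only if the ping in step 5 fails."""
    return not final_ping_succeeded


for node, command in remac_test_steps("ef0", "190.0.2.1", "0xffffff00"):
    print(f"{node}# {command}")
```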
Table 2-3 shows the FailSafe configuration parameters you could specify for these example HA IP addresses.
Table 2-3. HA IP Address Configuration Parameters
| Resource Attribute | Resource Name: | Resource Name: |
|---|---|---|
| Network mask | 0xffffff00 | 0xffffff00 |
| Broadcast address | 192.26.50.255 | 192.26.50.255 |
| Interface | ef0 | ef0 |
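The broadcast address in the table follows from the network mask by simple bit arithmetic: keep the network bits of the address and set all host bits to 1. The Python sketch below shows the calculation with the standard ipaddress module. The address 192.26.50.1 is borrowed from the earlier examples in this chapter and is an assumption about which resource the table describes; the function name is hypothetical.

```python
import ipaddress


def broadcast_address(ip, hex_mask):
    """Compute the broadcast address from an IP address and a hex netmask."""
    mask = int(hex_mask, 16)
    addr = int(ipaddress.IPv4Address(ip))
    host_bits = ~mask & 0xFFFFFFFF       # bits not covered by the mask
    return str(ipaddress.IPv4Address((addr & mask) | host_bits))


print(broadcast_address("192.26.50.1", "0xffffff00"))  # 192.26.50.255
```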
You can configure your system so that an HA IP address will fail over to a second interface within the same node, for example from ef0 to ef1. A configuration example that shows the steps you must follow for this configuration is provided in “Example: Local Failover of HA IP Address” in Chapter 6.
CXFS, the clustered XFS filesystem, allows groups of computers to coherently share large amounts of data while maintaining high performance. You can use FailSafe in a coexecution cluster to provide HA services (such as NFS or Web) running on a CXFS filesystem. This combination provides high-performance shared data access for HA applications.
CXFS 6.5.10 or later and IRIS FailSafe 2.1 or later (plus relevant patches) may be installed and run on the same system, which is known as coexecution. This allows you to have application-level high availability with a clustered filesystem.
A subset of nodes in a coexecution cluster can be configured to be used as FailSafe nodes; a coexecution cluster can have up to eight nodes that run FailSafe.
| Note: In IRIX 6.5.18f, CXFS provides a system tunable parameter (cxfs_relocation_ok) that allows users to disable CXFS metadata server relocation. CXFS filesystem relocation is disabled by default. In FailSafe/CXFS coexecution clusters, it is recommended that CXFS filesystem relocation be disabled using the tunable parameter.
The system tunable parameter is different from the relocate-mds attribute of a CXFS resource. To initiate metadata server relocation for a filesystem, FailSafe uses a different procedure, which is not affected by the value of the cxfs_relocation_ok system tunable parameter. |
See also “Communication Paths in a Coexecution Cluster” in Appendix A.
A coexecution cluster is supported with as many as 32 nodes. All of these nodes must run CXFS and up to eight can also run FailSafe. As many as 16 of the nodes can be CXFS administration nodes and all other nodes can be client-only nodes. FailSafe must be run on a CXFS administration node; FailSafe cannot run on a client-only node.
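These size limits can be collected into a single planning check. The Python sketch below is illustrative only: the function name, the tuple layout, and the sample node names are hypothetical, and the constants are simply the numbers stated in the paragraph above.

```python
MAX_NODES = 32           # total nodes in a coexecution cluster
MAX_ADMIN_NODES = 16     # CXFS administration nodes
MAX_FAILSAFE_NODES = 8   # nodes that also run FailSafe


def coexecution_violations(nodes):
    """Check a planned cluster against the coexecution limits.

    nodes: {name: (role, runs_failsafe)} with role 'admin' or 'client-only'.
    Returns a list of human-readable problems (empty if the plan is valid).
    """
    problems = []
    if len(nodes) > MAX_NODES:
        problems.append(f"{len(nodes)} nodes exceeds the limit of {MAX_NODES}")
    admins = [n for n, (role, _) in nodes.items() if role == "admin"]
    if len(admins) > MAX_ADMIN_NODES:
        problems.append(f"{len(admins)} administration nodes exceeds {MAX_ADMIN_NODES}")
    failsafe = [n for n, (_, runs_fs) in nodes.items() if runs_fs]
    if len(failsafe) > MAX_FAILSAFE_NODES:
        problems.append(f"{len(failsafe)} FailSafe nodes exceeds {MAX_FAILSAFE_NODES}")
    for name, (role, runs_fs) in nodes.items():
        if runs_fs and role != "admin":
            problems.append(f"{name}: FailSafe cannot run on a client-only node")
    return problems


planned = {
    "node-a": ("admin", True),
    "node-b": ("admin", True),
    "node-c": ("admin", False),
    "node-d": ("client-only", True),   # invalid: FailSafe on a client-only node
}
print(coexecution_violations(planned))
# ['node-d: FailSafe cannot run on a client-only node']
```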
It is recommended that a production cluster be configured with a minimum of three CXFS metadata server-capable administration nodes. (A cluster with serial hardware reset cables and only two server-capable nodes is supported, but there are inherent issues with this configuration; see the CXFS Version 2 Software Installation and Administration Guide.)
Even when you are running CXFS and FailSafe, there is still only one pool, one cluster, and one cluster configuration.
The cluster can be one of three types:
FailSafe. In this case, all nodes will also be of type FailSafe.
CXFS. In this case, all nodes will be of type CXFS.
CXFS and FailSafe (coexecution). In this case, the set of nodes will be a mix of type CXFS and type CXFS and FailSafe, using FailSafe for application-level high availability and CXFS for clustered filesystems.
| Note: Although it is possible to configure a coexecution cluster with nodes of type FailSafe only, SGI does not support this configuration. |
The metadata server list must exactly match the failover domain list (the names and the order of names).
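Because both the names and their order must agree, a set comparison is not enough; the two lists must be compared element by element. A minimal Python sketch follows (the function name is hypothetical, and the node names are reused from earlier examples in this chapter).

```python
def lists_match(metadata_servers, failover_domain):
    """True only if both lists contain the same names in the same order."""
    return list(metadata_servers) == list(failover_domain)


print(lists_match(["xfs-ha1", "xfs-ha2"], ["xfs-ha1", "xfs-ha2"]))  # True
print(lists_match(["xfs-ha1", "xfs-ha2"], ["xfs-ha2", "xfs-ha1"]))  # False
```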
FailSafe provides the CXFS resource type, which can be used to fail over applications that use CXFS filesystems. CXFS resources must be added to the resource group that contains the resources that depend on a CXFS filesystem. The name of the CXFS resource is the CXFS filesystem mount point.
The CXFS resource type has the following characteristics:
It does not start any resource that depends on the CXFS filesystem until the CXFS filesystem is mounted on the local node.
The start and stop action scripts for the CXFS resource type do not mount and unmount CXFS filesystems, respectively. (The start script waits for the CXFS filesystem to become available; the stop script does nothing but its existence is required by FailSafe.) Users should use the CXFS graphical user interface (GUI) or cmgr(1M) command to mount and unmount CXFS filesystems.
It monitors the CXFS filesystem for failures.
Optionally, for applications that must run on a CXFS metadata server, the CXFS resource type relocates the CXFS metadata server when there is an application failover. In this case, the application failover domain (AFD) for the resource group should consist of the CXFS metadata server and the metadata server backup nodes.
The CXFS filesystems that an NFS server exports should be mounted on all nodes in the failover domain using the CXFS GUI or the cmgr(1M) command.
For example, the following commands create resources of types CXFS, NFS, and statd_unlimited based on a CXFS filesystem mounted on /FC/lun0_s6. (This example assumes that you have defined a cluster named test-cluster and that you have already created a failover policy named cxfs-fp and a resource group named cxfs-group based on this policy.)
    cmgr> define resource /FC/lun0_s6 of resource_type CXFS in cluster test-cluster
    Enter commands, when finished enter either "done" or "cancel"

    Type specific attributes to create with set command:

    Type Specific Attributes - 1: relocate-mds

    No resource type dependencies to add

    resource /FC/lun0_s6 ? set relocate-mds to false
    resource /FC/lun0_s6 ? done

    ============================================

    cmgr> define resource /FC/lun0_s6 of resource_type NFS in cluster test-cluster
    Enter commands, when finished enter either "done" or "cancel"

    Type specific attributes to create with set command:

    Type Specific Attributes - 1: export-info
    Type Specific Attributes - 2: filesystem

    No resource type dependencies to add

    resource /FC/lun0_s6 ? set export-info to rw
    resource /FC/lun0_s6 ? set filesystem to /FC/lun0_s6
    resource /FC/lun0_s6 ? done

    ============================================

    cmgr> define resource /FC/lun0_s6/statmon of resource_type statd_unlimited in cluster test-cluster
    Enter commands, when finished enter either "done" or "cancel"

    Type specific attributes to create with set command:

    Type Specific Attributes - 1: ExportPoint

    Resource type dependencies to add:

    Resource Dependency Type - 1: NFS

    resource /FC/lun0_s6/statmon ? set ExportPoint to /FC/lun0_s6
    resource /FC/lun0_s6/statmon ? add dependency /FC/lun0_s6 of type NFS
    resource /FC/lun0_s6/statmon ? done

    ==============================================

    cmgr> define resource_group cxfs-group in cluster test-cluster
    Enter commands, when finished enter either "done" or "cancel"

    resource_group cxfs-group ? set failover_policy to cxfs-fp
    resource_group cxfs-group ? add resource /FC/lun0_s6 of resource_type NFS
    resource_group cxfs-group ? add resource /FC/lun0_s6 of resource_type CXFS
    resource_group cxfs-group ? add resource /FC/lun0_s6/statmon of resource_type statd_unlimited
    resource_group cxfs-group ? done
There is one cmgr(1M) command but separate GUIs for CXFS and for FailSafe. You must manage CXFS configuration with the CXFS GUI and FailSafe configuration with the FailSafe GUI; you can manage both with cmgr.
Using the CXFS GUI or cmgr(1M), you can convert an existing FailSafe cluster and nodes to type CXFS or to type CXFS and FailSafe. You can perform a parallel action using the FailSafe GUI. A converted node can be used by FailSafe to provide application-level high availability and by CXFS to provide clustered filesystems.
However:
You cannot change the type of a node if the respective high availability (HA) or CXFS services are active. You must first stop the services for the node.
The cluster must support all of the functionalities (FailSafe and/or CXFS) that are turned on for its nodes; that is, if your cluster is of type CXFS, then you cannot modify a node that is already part of the cluster so that it is of type FailSafe. However, the nodes do not have to support all the functionalities of the cluster; that is, you can have a CXFS node in a CXFS and FailSafe cluster.
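The type rule amounts to a subset test: the functionalities turned on for a node must be a subset of those supported by the cluster, but not the other way around. The Python sketch below models only that rule, with a hypothetical function name and with functionalities represented as sets of strings. It does not model the separate restriction that SGI does not support FailSafe-only nodes in a coexecution cluster.

```python
def node_type_allowed(cluster_functions, node_functions):
    """True if every functionality enabled on the node is supported by the cluster."""
    return set(node_functions) <= set(cluster_functions)


# A FailSafe node cannot belong to a cluster of type CXFS.
print(node_type_allowed({"CXFS"}, {"FailSafe"}))          # False
# A CXFS node can belong to a CXFS and FailSafe cluster.
print(node_type_allowed({"CXFS", "FailSafe"}, {"CXFS"}))  # True
```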
For FailSafe, you must have at least two network interfaces. However, CXFS uses only one interface for both heartbeat and control messages.
When using FailSafe and CXFS on the same node, only the priority 1 network will be used for CXFS and it must be set to allow both heartbeat and control messages.
| Note: CXFS will not fail over to the second network. If the priority 1 network fails, CXFS will fail but FailSafe services may move to the second network if the node is CXFS and FailSafe.
If CXFS resets the node due to the loss of the priority 1 network, it will cause FailSafe to remove the node from the FailSafe membership; this in turn will cause resource groups to fail over to other FailSafe nodes in the cluster. |
FailSafe also supports local XVM; you cannot use local XVM with CXFS XVM.
FailSafe provides the XVM resource type, which can be used to fail over applications that use local XVM volumes without CXFS. (Do not use the XVM resource type with the CXFS resource type.)
For each local XVM resource, the name of the resource is the name of the XVM volume without the preceding vol/ characters. The resource name must be unique for all XVM domains in the FailSafe cluster.
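The naming rule can be sketched as follows. This Python example is illustrative only: the function names and the volume names are hypothetical. It derives the resource name by dropping the leading vol/ and then reports any name that would be used more than once across the volumes being planned.

```python
from collections import Counter


def xvm_resource_name(volume_path):
    """Drop the leading 'vol/' from an XVM volume path, if present."""
    prefix = "vol/"
    if volume_path.startswith(prefix):
        return volume_path[len(prefix):]
    return volume_path


def duplicate_resource_names(volume_paths):
    """Return resource names that occur more than once (they must be unique)."""
    counts = Counter(xvm_resource_name(path) for path in volume_paths)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical volumes gathered from all XVM domains in the cluster.
volumes = ["vol/appdata", "vol/logs", "vol/appdata"]
print(xvm_resource_name("vol/appdata"))   # appdata
print(duplicate_resource_names(volumes))  # ['appdata']
```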
Table 2-4 provides the XVM resource attributes. There are no resource type dependencies for the XVM resource type.
Table 2-4. Local XVM Volume Resource Attributes