The SGI Cluster Manager for Linux provides highly available services that survive a single point of failure. It uses redundant components and special software to provide services for a cluster that contains two machines or system partitions, known as members.
All highly available services are owned by one member at a time. Highly available services are monitored by the SGI Cluster Manager software. If one member fails, the other member restarts the failed member's highly available applications; this is known as the failover process.
To clients, the services on the backup member are indistinguishable from the original services before failure occurred. It appears as if the original member has crashed and rebooted quickly. Clients that use User Datagram Protocol (UDP) for communication with the server will notice a brief interruption in the highly available service. Clients that use Transmission Control Protocol (TCP) for communication may have to reconnect to the server in case of failure.
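The reconnect behavior described above for TCP clients can be handled with a simple retry loop. The following is a minimal sketch in Python, not part of the SGI Cluster Manager product: the function name `connect_with_retry` and the `attempts` and `delay` values are hypothetical, and it assumes the client reaches the service through the highly available IP address, so the host and port do not change after failover.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=2.0,
                       connect=socket.create_connection):
    """Open a TCP connection, retrying while a failover may be in progress.

    attempts and delay are illustrative values, not SGI recommendations.
    connect is injectable so the retry logic can be exercised without a network.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # Assumption: the service is reached through its highly available
            # IP address, so the same host and port are used after failover.
            return connect((host, port), timeout=delay)
        except OSError as err:
            last_error = err
            if attempt < attempts - 1:
                # Wait before retrying; the other member may still be
                # restarting the service.
                time.sleep(delay)
    raise ConnectionError(f"no connection after {attempts} attempts") from last_error
```

A client would call this function again whenever an established connection is dropped, then resend any request that was in flight at the time of the failure.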
Although SGI Cluster Manager for Linux provides similar functionality to IRIX FailSafe, there are differences; see Appendix A, “FailSafe and SGI Cluster Manager”.
This chapter discusses the following:
The SGI Cluster Manager base product provides failover support for the following:
Filesystems (including XFS)
NFS
Samba
IP addresses
User-defined applications (that is, applications that are not provided by the SGI Cluster Manager product)
A plug-in is the set of software that allows a service to be highly available without modifying the application itself. An optional value-add product supplies plug-ins for the following:
CXFS clustered filesystems
Data Migration Facility (DMF)
XVM volume manager in local mode
This optional product also provides a failover script for the Tape Management Facility (TMF). You can modify your application to use this script to provide highly available services for TMF.
A highly available service consists of the following:
Disks (such as XVM volumes)
IP address
Filesystem (such as XFS or CXFS)
NFS (if used)
Samba (if used)
User applications (if used)
SGI Cluster Manager requires a cluster of exactly two members. The following SGI Altix servers are supported:
An Altix 330 server with a USB-to-Ethernet adapter connected to the L1 system controller so that the brick emulates an L2 controller and becomes an L1/L2 controller. (Separate physical L2 controllers are not used with the Altix 330 systems.) Access to the L2 functionality is made by way of an Ethernet connection to a PC or laptop. An Altix 330 server must use the L2 Ethernet reset configuration (l2network) for remote resets.
An Altix 350 server with an IO10 PCI card, which may use either of the following for remote resets:
An Altix 350 server with an IO9 PCI card, which must use the L2 Ethernet reset configuration (l2network) for remote resets. This requires a hardware L2 system controller that must be separately purchased.
An Altix 3700 server, which can use either the L2 Ethernet reset configuration (l2network) or the L2 serial reset configuration (l2). These servers may be partitioned; each system partition is an individual member.
An Altix 3700 Bx2 server with a USB-to-Ethernet adapter connected to the L1 system controller so that the brick emulates an L2 controller and becomes an L1/L2 controller. (Separate physical L2 controllers are not used with the Altix 3700 Bx2 systems.) Access to the L2 functionality is made by way of an Ethernet connection to a PC or laptop. An Altix 3700 Bx2 server must use the L2 Ethernet reset configuration (l2network). See “l2network Ethernet Connection” in Chapter 2.
An Altix 4700 server with a USB-to-Ethernet adapter connected to the L1 system controller so that the brick emulates an L2 controller and becomes an L1/L2 controller. (Separate physical L2 controllers are not used with the Altix 4700 systems.) Access to the L2 functionality is made by way of an Ethernet connection to a PC or laptop. An Altix 4700 server must use the L2 Ethernet reset configuration (l2network) for remote resets.
SGI Cluster Manager also requires the following:
Shared quorum partitions without filesystems where configuration, cluster, and service status information is kept by SGI Cluster Manager. For more information, see “Shared Quorum Partitions” in Chapter 2.
Network cabling: you can connect the members with private network cables or a cross-over cable. You can use either an Ethernet cable from each server to a hub or a 20-ft cross-over Ethernet cable.
Note: To use a private network, you must have a second NIC whether you use a cross-over cable or a switch/hub.
Figure 1-1 shows an example configuration using CXFS. A private network is recommended for SGI Cluster Manager. The SGI Cluster Manager members should be able to communicate with the SGI Cluster Manager tiebreaker via the network. The tiebreaker can be a machine or a router or any device that can be connected via the network. (For more information about tiebreakers, see “Step 6: Set the Tiebreakers” in Chapter 4.)
The failover domain is the list of members in the cluster where a service can be online.
Each failover domain has two failover options that are considered when a new membership is formed or a failure occurs and a new target member for the service must be determined:
Restricted failover permits failover only to the members listed. If all of the members in the domain are unavailable, the service will stop.
If a domain is not restricted, a service can run on the member that is not in the domain if there is a failure and the member that is in the domain is unavailable. (However, administrative commands cannot relocate the service to a member that is not in the domain, whether or not this option is used.)
Ordered failover causes the service to start on the first member defined (the lowest-ordered) if it is available; if that member is unavailable, the other member will be used. If controlled failback is not set, the service will automatically fail back from the second member to the original member when the original member is rebooted after a failure or maintenance period.
Each failover domain also has a failback option, which is considered when a member rejoins the cluster. The controlled failback option specifies that a service will not be moved back to the original member when it rejoins the cluster, even if it is the preferred member in the list (when ordered failover is used). The system administrator must manually relocate the service in order for it to run on the original member without an intervening failure. Only a new failure will cause a service to be automatically moved.
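The failback rule can be modeled in a few lines. The following Python sketch is a simplified illustration of the rule as described in this section, not the SGI Cluster Manager implementation; the function name `should_fail_back` and its parameters are hypothetical. It also assumes that a member outside the failover domain ranks after every listed member, which this section does not state explicitly.

```python
def should_fail_back(domain, ordered, controlled_failback, current, rejoining):
    """Return True if the service would move automatically to `rejoining`.

    domain: list of member names in preference order (first is preferred).
    ordered: True if the ordered failover option is set.
    controlled_failback: True if the controlled failback option is set.
    current: member that is running the service now.
    rejoining: member that has just rejoined the cluster.
    """
    # Controlled failback leaves relocation to the administrator, and
    # without ordered failover there is no preferred member to return to.
    if controlled_failback or not ordered:
        return False
    if rejoining not in domain:
        return False

    def rank(member):
        # Assumption: a member outside the domain ranks after all listed members.
        return domain.index(member) if member in domain else len(domain)

    # Move only when the rejoining member is preferred over the current one.
    return rank(rejoining) < rank(current)
```

For example, with the domain `["B", "A"]` and ordered failover, a service running on A moves back when B rejoins only if controlled failback is not set.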
Suppose you have a cluster with members A and B. Table 1-1 describes some of the possible results from using various options under different circumstances for the nfs service.
Table 1-1. Failover Domain and Option Results
| Failover Domain | Options | Circumstance | Results |
|---|---|---|---|
| (none) | (none) | Newly formed membership | The service will be started on either A or B, randomly chosen. |
| B | (none) | Newly formed membership | The service will be started on B if it is available. If B is not available, the service will be started on A. |
| B, A | (none) | Newly formed membership | The service will be started on either A or B, randomly chosen. If that member is unavailable, the other will be used. This situation is similar to having no failover domain. |
| B | (none) | The service is running on B and then B fails | The service will be started on A. The service will remain on A even after B restarts. |
| B, A | Ordered | Newly formed membership | The service will be started on B if it is available. If B is not available, the service will be started on A. |
| B, A | Restricted failover and controlled failback | Newly formed membership | The service will be started on either A or B, randomly chosen. If that member fails, the service will be restarted on the other member and will remain there until the system administrator manually intervenes. |
| B | Restricted | The service is running on B and then B fails | The service will stop. |
| B, A | Ordered | The service is running on B and then B fails | The service will be started on A. The service will be moved back to B as soon as it restarts. |
| B, A | Ordered failover and controlled failback | The service is running on B and then B fails | The service will be started on A. The service will remain on A even after B restarts. To go back to B, the system administrator must manually move the service. |
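The placement results in Table 1-1 can be reproduced with a small model. The following Python sketch is an illustration under stated assumptions, not the algorithm used by the cluster daemons; the function name `choose_member` and its parameters are hypothetical, and an empty domain is treated as "no failover domain".

```python
import random


def choose_member(domain, available, restricted=False, ordered=False, rng=random):
    """Pick the member that should run a service, or None if it must stop.

    domain: list of member names in the failover domain (may be empty).
    available: collection of members currently in the cluster membership.
    restricted, ordered: the two failover options of the domain.
    """
    in_domain = [m for m in domain if m in available]
    if in_domain:
        # Ordered failover prefers the first listed member that is available;
        # otherwise any available member of the domain may be chosen.
        return in_domain[0] if ordered else rng.choice(in_domain)
    if domain and restricted:
        # Restricted failover: no domain member is available, so the service stops.
        return None
    # No domain, or an unrestricted domain whose members are all unavailable:
    # fall back to any other available member.
    others = sorted(available)
    return rng.choice(others) if others else None
```

Calling `choose_member(["B"], {"A"}, restricted=True)` returns `None`, which corresponds to the row where the service stops, while the same call without `restricted=True` returns `"A"`.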
Following is an overview of the cluster daemons:
clumembd(8) is the cluster membership daemon. It performs network heartbeats and checks the liveness of other members in the cluster.
cluquorumd(8) is the cluster quorum daemon. It computes new membership and implements quorum. It also implements I/O fencing by resetting members that are in a failed state, and it reads and writes membership information on the shared quorum partitions.
clurmtabd(8) is the cluster remote NFS mount table daemon. It synchronizes NFS mount point entries by polling the /var/lib/nfs/rmtab file.
clusvcmgrd(8) is the cluster service manager daemon. It starts/stops and checks the status of services running in the cluster.
clulockd(8) is the cluster global lock manager daemon. The locks are stored on the shared quorum partitions.
For more information, see the man pages.