Chapter 1. Introduction

The SGI Cluster Manager for Linux provides highly available services that survive a single point of failure. It uses redundant components and special software to provide services for a cluster that contains two machines or system partitions, known as members.

All highly available services are owned by one member at a time and are monitored by the SGI Cluster Manager software. If one member fails, the other member restarts the highly available applications of the failed member; this is known as the failover process.

To clients, the services on the backup member are indistinguishable from the original services before the failure occurred. It appears as if the original member crashed and rebooted quickly. Clients that use User Datagram Protocol (UDP) for communication with the server will notice a brief interruption in the highly available service. Clients that use Transmission Control Protocol (TCP) for communication may have to reconnect to the server after a failure.
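
The TCP behavior described above can be handled on the client side with a simple reconnect loop. The following Python sketch is illustrative only and is not part of SGI Cluster Manager; the function name connect_with_retry and its retry count, delay, and timeout defaults are assumptions chosen for the example.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=1.0, timeout=3.0):
    """Open a TCP connection, retrying while the service may be failing over.

    Returns a connected socket, or raises the last OSError if every
    attempt fails. All defaults are illustrative assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_error = err
            if attempt < attempts - 1:
                # Give the backup member time to start the service.
                time.sleep(delay)
    raise last_error
```

A client would call this function again whenever an established connection is reset, using the highly available IP address of the service as the host.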

Although SGI Cluster Manager for Linux provides similar functionality to IRIX FailSafe, there are differences; see Appendix A, “FailSafe and SGI Cluster Manager”.

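This chapter discusses the following:

  • Base Product

  • Optional SGI Software Storage Plug-In Product

  • Highly Available Services

  • Hardware Requirements

  • Software Requirements

  • Failover Domains

  • Cluster Daemons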

Base Product

The SGI Cluster Manager base product provides failover support for the following:

  • Filesystems (including XFS)

  • NFS

  • Samba

  • IP addresses

  • User-defined applications (that is, applications that are not provided by the SGI Cluster Manager product)

Optional SGI Software Storage Plug-In Product

A plug-in is the set of software that allows a service to be highly available without modifying the application itself. An optional value-add product supplies plug-ins for the following:

  • CXFS clustered filesystems

  • Data Migration Facility (DMF)

  • XVM volume manager in local mode

This optional product also provides a failover script for the Tape Management Facility (TMF). You can modify your application to use this script to provide highly available services for TMF.

Highly Available Services

A highly available service consists of the following:

  • Disks (such as XVM volumes)

  • IP address

  • Filesystem (such as XFS or CXFS)

  • NFS (if used)

  • Samba (if used)

  • User applications (if used)
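
The components listed above can be pictured as a single record per service. The following Python sketch is a hypothetical in-memory model only; the class name, field names, and example values are assumptions and do not reflect the SGI Cluster Manager configuration format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HighlyAvailableService:
    """Hypothetical model of one highly available service."""
    name: str
    disks: List[str]        # for example, XVM volume names
    ip_address: str         # address that moves with the service
    filesystem: str         # for example, "xfs" or "cxfs"
    nfs_exports: List[str] = field(default_factory=list)        # if used
    samba_shares: List[str] = field(default_factory=list)       # if used
    user_applications: List[str] = field(default_factory=list)  # if used

    def optional_components(self) -> List[str]:
        """Return the names of the optional components that are in use."""
        used = []
        if self.nfs_exports:
            used.append("NFS")
        if self.samba_shares:
            used.append("Samba")
        if self.user_applications:
            used.append("User applications")
        return used


# Example with made-up values:
svc = HighlyAvailableService(
    name="nfs",
    disks=["vol1"],
    ip_address="192.0.2.10",
    filesystem="xfs",
    nfs_exports=["/export/home"],
)
```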

Hardware Requirements

SGI Cluster Manager requires a cluster of exactly two members. The following SGI Altix servers are supported:

  • An Altix 330 server with a USB-to-Ethernet adapter connected to the L1 system controller so that the brick emulates an L2 controller and becomes an L1/L2 controller. (Separate physical L2 controllers are not used with the Altix 330 systems.) Access to the L2 functionality is made by way of an Ethernet connection to a PC or laptop. An Altix 330 server must use the L2 Ethernet reset configuration (l2network) for remote resets.

  • An Altix 350 server with an IO10 PCI card, which may use either of the following for remote resets:

    • Network connection (l2network), which requires an additional PCI network interface card (which must be purchased). This is the preferred method.

    • Serial connection (l2), which requires the following:

      • Multiport serial adapter cable, a device that provides four DB9 serial ports from a 36-pin connector (which must be purchased; part number CBL-SATA-SERIAL)

      • Hardware L2 system controller (which must be purchased)

  • An Altix 350 server with an IO9 PCI card, which must use the L2 Ethernet reset configuration (l2network) for remote resets. This requires a hardware L2 system controller that must be separately purchased.


    Note: Customers cannot replace the IO9 PCI card in the Altix 350 with the IO10 PCI card. This procedure requires a new interface board and cables as well as a drive swap from SCSI to SATA. This procedure can only be done by SGI service personnel.


  • An Altix 3700 server, which can use either the L2 Ethernet reset configuration (l2network) or the L2 serial reset configuration (l2). These servers may be partitioned; each system partition is an individual member.

  • An Altix 3700 Bx2 server with a USB-to-Ethernet adapter connected to the L1 system controller so that the brick emulates an L2 controller and becomes an L1/L2 controller. (Separate physical L2 controllers are not used with the Altix 3700 Bx2 systems.) Access to the L2 functionality is made by way of an Ethernet connection to a PC or laptop. An Altix 3700 Bx2 server must use the L2 Ethernet reset configuration (l2network). See “l2network Ethernet Connection” in Chapter 2.

  • An Altix 4700 server with a USB-to-Ethernet adapter connected to the L1 system controller so that the brick emulates an L2 controller and becomes an L1/L2 controller. (Separate physical L2 controllers are not used with the Altix 4700 systems.) Access to the L2 functionality is made by way of an Ethernet connection to a PC or laptop. An Altix 4700 server must use the L2 Ethernet reset configuration (l2network) for remote resets.

SGI Cluster Manager also requires the following:

  • Shared quorum partitions without filesystems where configuration, cluster, and service status information is kept by SGI Cluster Manager. For more information, see “Shared Quorum Partitions” in Chapter 2.

  • Network cabling. You can connect the members through a private network or with a cross-over cable: use either an Ethernet cable from each server to a hub or a 20-ft cross-over Ethernet cable between the members.


    Note: To use a private network, you must have a second NIC whether you use a cross-over cable or a switch/hub.


Figure 1-1 shows an example configuration using CXFS. A private network is recommended for SGI Cluster Manager. The SGI Cluster Manager members should be able to communicate with the SGI Cluster Manager tiebreaker via the network. The tiebreaker can be a machine or a router or any device that can be connected via the network. (For more information about tiebreakers, see “Step 6: Set the Tiebreakers” in Chapter 4.)

Figure 1-1. An Example CXFS and SGI Cluster Manager Configuration


Software Requirements

SGI Cluster Manager requires the following:

  • SGI ProPack 4 Service Pack 3

  • SUSE LINUX Enterprise Server 9 (SLES9) Service Pack 2

This release also supports the following:

  • Samba 3.0.20b or later

  • CXFS 4.0 or later


    Note: Use of clustered XVM volumes with SGI Cluster Manager requires the CXFS plug-in. The SGI Cluster Manager base product supports local XVM volumes.


  • DMF 3.3 or later

  • TMF 1.4.5 or later

See the README file for a list of the RPMs included on the CDs.

Failover Domains

The failover domain is the list of members in the cluster where a service can be online.

Each failover domain has two failover options that are considered when a new membership is formed or a failure occurs and a new target member for the service must be determined:

  • Restricted failover permits failover only to the members listed. If all of the members in the domain are unavailable, the service will stop.

    If a domain is not restricted, a service can run on the member that is not in the domain if there is a failure and the member that is in the domain is unavailable. (However, administrative commands cannot relocate the service to a member that is not in the domain, whether or not this option is used.)

  • Ordered failover causes the service to start on the first member defined (the lowest-ordered) if it is available; if that member is unavailable, the other member will be used. If controlled failback is not set, the service will automatically fail back from the second member to the original member when the original member is rebooted after a failure or maintenance period.


    Note: Lowest-ordered means a higher preference for a service to be started on that member.


Each failover domain also has a failback option, which is considered when a member rejoins the cluster. The controlled failback option specifies that a service will not be moved back to the original member when that member rejoins the cluster, even if it is the preferred member in the list (when ordered failover is used). The system administrator must manually relocate the service in order for it to run on the original member without an intervening failure. Only a new failure will cause a service to be automatically moved.

Suppose you have a cluster with members A and B. Table 1-1 describes some of the possible results from using various options under different circumstances for the nfs service.

Table 1-1. Failover Domain and Option Results

| Failover Domain | Options | Circumstance | Results |
|---|---|---|---|
| (none) | (none) | Newly formed membership | The service will be started on either A or B, randomly chosen. |
| B | (none) | Newly formed membership | The service will be started on B if it is available. If B is not available, the service will be started on A. |
| B, A | (none) | Newly formed membership | The service will be started on either A or B, randomly chosen. If that member is unavailable, the other will be used. This situation is similar to having no failover domain. |
| B | (none) | The service is running on B and then B fails | The service will be started on A. The service will remain on A even after B restarts. |
| B, A | Ordered | Newly formed membership | The service will be started on B if it is available. If B is not available, the service will be started on A. |
| B, A | Restricted failover and controlled failback | Newly formed membership | The service will be started on either A or B, randomly chosen. If that member fails, the service will be restarted on the other member and will remain there until the system administrator manually intervenes. |
| B | Restricted | The service is running on B and then B fails | The service will stop. |
| B, A | Ordered | The service is running on B and then B fails | The service will be started on A. The service will be moved back to B as soon as it restarts. |
| B, A | Ordered failover and controlled failback | The service is running on B and then B fails | The service will be started on A. The service will remain on A even after B restarts. To go back to B, the system administrator must manually move the service. |
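
The rules in this section can be summarized as a small decision procedure. The following Python sketch is a simplified model written for illustration, not the SGI Cluster Manager implementation; the function names choose_target and should_fail_back and their parameters are assumptions. It also assumes that an empty domain list means that no failover domain is defined.

```python
import random
from typing import Optional, Sequence, Set


def choose_target(domain: Sequence[str], available: Set[str],
                  restricted: bool = False, ordered: bool = False,
                  rng: Optional[random.Random] = None) -> Optional[str]:
    """Pick the member that should run a service, or None if it must stop."""
    rng = rng or random.Random()
    in_domain = [m for m in domain if m in available]
    if in_domain:
        # Ordered: lowest-ordered available member; otherwise a random one.
        return in_domain[0] if ordered else rng.choice(in_domain)
    if domain and restricted:
        # Restricted: never run outside the domain, so the service stops.
        return None
    # Unrestricted (or no domain): any available member may be used.
    others = sorted(available)
    return rng.choice(others) if others else None


def should_fail_back(domain: Sequence[str], current: str, rejoined: str,
                     ordered: bool = False,
                     controlled_failback: bool = False) -> bool:
    """Decide whether a service moves automatically to a rejoining member."""
    if not ordered or controlled_failback or rejoined not in domain:
        return False
    if current not in domain:
        # Assumption: a member in the domain is preferred over one outside it.
        return True
    return domain.index(rejoined) < domain.index(current)
```

For example, with domain ["B", "A"] and ordered failover, choose_target selects B whenever B is available, and should_fail_back returns True when B rejoins while the service is on A, unless controlled failback is set.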


Cluster Daemons

Following is an overview of the cluster daemons:

  • clumembd(8) is the cluster membership daemon. It performs network heartbeats and checks the liveness of other members in the cluster.

  • cluquorumd(8) is the cluster quorum daemon. It computes new membership and implements quorum. It also implements I/O fencing by resetting members that are in a failed state, and it reads and writes membership information on the shared quorum partitions.

  • clurmtabd(8) is the cluster remote NFS mount table daemon. It synchronizes NFS mount point entries by polling the /var/lib/nfs/rmtab file.

  • clusvcmgrd(8) is the cluster service manager daemon. It starts/stops and checks the status of services running in the cluster.

  • clulockd(8) is the cluster global lock manager daemon. The locks are stored on the shared quorum partitions.

For more information, see the man pages.
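
As a conceptual illustration of the heartbeat checking that a membership daemon performs, the following Python sketch tracks the last heartbeat seen from each member and reports members whose heartbeats have timed out. It is not how clumembd is implemented; the class name, method names, and the 5-second timeout are assumptions.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Track the last heartbeat seen from each member (illustrative only)."""

    def __init__(self, timeout_seconds: float = 5.0) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def record_heartbeat(self, member: str,
                         now: Optional[float] = None) -> None:
        """Note that a heartbeat from the member arrived at time 'now'."""
        self._last_seen[member] = time.monotonic() if now is None else now

    def failed_members(self, now: Optional[float] = None) -> List[str]:
        """Return members whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(member for member, seen in self._last_seen.items()
                      if now - seen > self.timeout_seconds)
```

In a real cluster, a member reported as failed would then be handled by the quorum logic (membership recomputation and I/O fencing) described above for cluquorumd(8).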