Chapter 3. Installation and System Preparation


Note: The procedures in this chapter assume that you have done the work described in Chapter 2, “Configuration Planning”.

IRIS FailSafe installation and system preparation consists of the steps described in the following sections of this chapter.

Install Software

Installing the IRIS FailSafe base CD requires about 10 MB of free space.

To install the required software, do the following:

  1. On each node in the pool, upgrade to a supported release of IRIX according to the IRIX 6.5 Installation Instructions and the FailSafe product release notes:

    # relnotes failsafe2 [chapter_number]

    To verify that a given node has been upgraded, use the following command to display the currently installed system:

    # uname -aR

  2. Depending on the servers and storage in the configuration and the IRIX revision level, install the latest recommended patches. For information on recommended patches for each platform, see: http://bits.csd.sgi.com/digest/patches/recommended/

  3. On each node, install the version of the serial port server driver that is appropriate to the operating system. Use the CD that accompanies the serial port server. Reboot the system after installation.

    For more information, see the following documentation provided with the serial port server:

    • EL Serial Port Server Installation Guide (provided by Digi Corporation)

    • EL Serial Port Server Installation Guide Errata

  4. On each node, install the following software, in the order shown:

    sysadm_base.sw.dso 
    sysadm_base.sw.server 
    sysadm_cluster.sw.server
    cluster_admin.sw.base 
    cluster_control.sw.base
    cluster_services.sw.base
    cluster_services.sw.cli 
    failsafe2.sw 
    sysadm_failsafe2.sw.server

    When sysadm_base is installed, the tcpmux service is added to the /etc/inetd.conf file.


    Note: For systems that do not have sysadmdesktop installed, inst reports missing prerequisites. Resolve this conflict by installing sysadm_base.sw.priv, which provides a subset of the functionality of sysadmdesktop.sw.base and is included in this distribution, or by installing sysadmdesktop.sw.base from the IRIX distribution.

    If you try to install sysadm_base.sw.priv on a system that already has sysadmdesktop.sw.base, inst reports incompatible subsystems. Resolve this conflict by not installing sysadm_base.sw.priv. Similar conflicts occur if you try to install sysadmdesktop.sw.base on a system that already has sysadm_base.sw.priv.

    If the nodes are to be administered by a Web-based version of the GUI, install these subsystems, in the order shown:

    java_eoe.sw  version 3.1.1
    sysadm_base.sw.client 
    sysadm_cluster.sw.client
    sysadm_failsafe2.sw.client
    sysadm_failsafe2.sw.web


    Caution: The GUI only operates with Java 1.1.8. This is the version of Java that is provided with the IRIX 6.5.x release.

    The SGI Web site also contains Java 2. However, you cannot use this version of Java with the GUI. Using a Java version other than 1.1.8 will cause the GUI to fail.


  5. On each node, install the following additional software, in the order shown:

    nfs.sw.nfs (If necessary; from IRIX, might already be present)
    failsafe2_nfs.sw
    ns_admin.sw.server (If necessary; from Netscape, might already be present)
    ns_fasttrack.sw.server OR ns_enterprise.sw.server (If necessary;
         from Netscape, might already be present)
    failsafe2_web.sw

    To install nfs.sw.nfs, you should have purchased the optional FailSafe/NFS software to make the NFS server highly available. To install ns_fasttrack.sw.server or ns_enterprise.sw.server, you should have purchased the optional FailSafe/Web software to make Netscape servers highly available.

  6. If you want to run the administrative workstation (GUI client) from an IRIX desktop, install the following subsystems on the desktop:

    sysadm_failsafe2.sw.desktop
    sysadm_failsafe2.sw.client
    sysadm_base.sw.client
    sysadm_cluster.sw.client
    java_eoe.sw version 3.1.1

    • If the administrative workstation is an IRIX machine that launches the GUI client from a Web browser that supports Java, install the java_plugin from the CXFS CD. (However, launching the GUI from a Web browser is not the recommended method on IRIX. Running the GUI client from an IRIX desktop is preferred.)

      If you try to install all subsystems in java_plugin, inst reports incompatible subsystems (java_plugin.sw.swing101, java_plugin.sw.swing102, and java_plugin.sw.swing103). Do not install these three subsystems because the GUI does not use them.

      After installing the Java plug-in, you must close all browser windows and restart the browser.

  7. On the appropriate nodes, install other optional software, such as storage management or network board software.

  8. If the cluster is using plexed XLV logical volumes, do the following:

    1. Install a disk plexing license on each node in the /var/flexlm/license.dat file. For more information on XLV logical volumes and on XFS plexing and filesystems, see Chapter 2, “Configuration Planning”.

    2. Verify that the license has been successfully installed on each node in the cluster by using the xlv_mgr(1M) command:

      # xlv_mgr
      xlv_mgr> show config

      If the license is successfully installed, the following line appears:

      Plexing license: present

    3. Quit xlv_mgr.

  9. Install recommended patches for FailSafe.

    For instructions on installing a FailSafe patch, see “Install Patches”.

Set the AutoLoad variable to Yes; this can be done when you set host SCSI IDs, as explained in “Set NVRAM Variables”.


Note: For reference, Appendix C, “IRIS FailSafe 2.1.x Software”, summarizes the subsystems to install on each component of a cluster or node.
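
As a rough aid for step 4, the following sketch compares a list of required subsystems with a list of installed subsystems and prints whatever is missing. It assumes you have already produced both lists yourself as plain files with one subsystem name per line (for example, by trimming your software inventory output). The function name missing_subsystems and both file arguments are illustrative and are not part of FailSafe.

```sh
# missing_subsystems REQUIRED_FILE INSTALLED_FILE
# Prints each name in REQUIRED_FILE that does not appear as a whole line
# in INSTALLED_FILE. Both files are hypothetical inputs that you prepare,
# one subsystem name per line, with no trailing blanks.
missing_subsystems() {
    # -F: fixed strings, -x: whole-line match, -v: keep non-matching lines.
    # grep exits non-zero when nothing is missing; treat that as success.
    grep -F -x -v -f "$2" "$1" || true
}
```

An empty result means every required name was found in the installed list. The check says nothing about installation order, which still has to follow the sequence shown in step 4.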


Configure System Files

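This section discusses hostname resolution (/etc/sys_id, /etc/hosts, and /etc/nsswitch.conf) and the /etc/services, /etc/config/cad.options, /etc/config/fs2d.options, and /etc/config/cmond.options files.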

Hostname Resolution: /etc/sys_id, /etc/hosts, /etc/nsswitch.conf


Caution: It is critical that you understand these rules before attempting to configure a FailSafe cluster.

The following hostname resolution rules and recommendations apply to FailSafe clusters:

  • Hostnames cannot begin with an underscore (_) or include any whitespace characters.

  • The value of the /etc/sys_id file must match the node's primary hostname in the /etc/hosts file (that is, the first field after the node's IP address in /etc/hosts) for all nodes in the cluster. This field can be either the hostname or the fully qualified domain name.

    The /etc/hosts file has the following format, where primary_hostname can be the simple hostname or the fully qualified domain name:

    IP_address primary_hostname aliases

    For example, suppose your /etc/hosts file contains the following:

    # The public interface:
    128.2.3.4  color-green.sgi.com color-green green
    
    # The private interface:
    192.0.1.1  color-green-private.sgi.com  color-green-private green-private

    The /etc/sys_id file could contain either the hostname color-green or the fully qualified domain name color-green.sgi.com. It cannot contain the alias green.

    In this case, you would enter the hostname color-green or the fully qualified domain name color-green.sgi.com for the Server field in the login screen and for the Hostname field in the Define a new node window.

  • If you use the nsd(1M) name service daemon, you must configure your system so that local files are accessed before either the network information service (NIS) or the domain name service (DNS). That is, the hosts line in /etc/nsswitch.conf must list files first. For example:

    hosts:      files nis dns 

    (The order of nis and dns is not significant to FailSafe; files must be first.)

    In the /etc/config/netif.options file, the address of one of the interfaces must be equal to the value of /etc/sys_id ($HOSTNAME).

    For more information about the Unified Name Service (UNS) and the name service daemon, see the nsd(1M) man page.

  • If you change the /etc/nsswitch.conf or /etc/hosts files, you must restart nsd by using the nsadmin restart command, which also flushes its cache.

    The reason you must restart nsd(1M) after making a change to these files is that the nsd name service daemon actually takes the contents of /etc/hosts and places the contents in its memory cache in a format that is faster to search. Thus, you must restart nsd in order for it to see that change and place the new /etc/hosts information into RAM cache. If /etc/nsswitch.conf is changed, nsd must re-read this file so that it knows what type of files (for example, hosts or passwd) to manage, what services it should call to get information, and in what order those services should be called.

    The IP addresses on a running node in the cluster and the IP address of the first node in the cluster cannot be changed while cluster services are active.

  • You should be consistent when using fully qualified domain names in the /etc/hosts file. If you use fully qualified domain names in /etc/sys_id on a particular node, then all of the nodes in the cluster should use the fully qualified name of that node when defining the IP/hostname information for that host in their /etc/hosts file.

    The decision to use fully qualified domain names is usually a matter of how the clients (such as NFS) are going to resolve names for their client-server programs, how their default resolution is done, and so on.

  • If you change hostname resolution settings in the /etc/nsswitch.conf file after you have defined the first node (which creates the cluster database), you must recreate the database.

  • When using coexecution with CXFS, never add an /etc/hosts entry that associates the value of /etc/sys_id with an IP address alias. You must use the primary address.
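
Two of the rules above lend themselves to a quick mechanical check. The following is a minimal sketch, not a supported tool: the function names are hypothetical, and the files to inspect are passed as arguments so that you can try the checks on copies. Following the color-green example, the first function accepts either the primary hostname field itself or its leading label (the part before the first dot), and rejects other aliases; this is one reading of the rule and may be stricter or looser than your site requires.

```sh
# sys_id_matches_hosts SYS_ID_FILE HOSTS_FILE
# Succeeds if the name in SYS_ID_FILE equals the primary hostname (the
# first field after the IP address) of some entry in HOSTS_FILE, or the
# leading label of that primary hostname.
sys_id_matches_hosts() {
    id=$(cat "$1")
    awk -v id="$id" '
        /^[ \t]*#/ { next }                  # skip comment lines
        NF >= 2 {
            split($2, label, ".")
            if ($2 == id || label[1] == id) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$2"
}

# hosts_files_first NSSWITCH_FILE
# Succeeds if the hosts line exists and lists files as its first source.
hosts_files_first() {
    awk '
        $1 == "hosts:" { seen = 1; first = $2 }
        END { exit((seen && first == "files") ? 0 : 1) }
    ' "$1"
}
```

On a node, these would typically be called with /etc/sys_id, /etc/hosts, and /etc/nsswitch.conf as the arguments. Both functions report only through their exit status.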

/etc/services

Edit the /etc/services file so that it contains entries for sgi-cad and sgi-crsd before you install the cluster_admin product on each node in the pool. The port numbers assigned for these processes must be the same in all nodes in the pool.


Note: sgi-cad requires a TCP port for communication between FailSafe nodes.

The following shows an example of /etc/services entries for sgi-cad and sgi-crsd:

sgi-crsd        7500/udp           # Cluster Reset Services Daemon
sgi-cad         9000/tcp           # Cluster Admin daemon

Edit the /etc/services file so that it contains entries for sgi-cmsd and sgi-gcd on each node before starting highly available (HA) services on the node. The port numbers assigned for these processes must be the same in all nodes in the cluster.

The following shows an example of /etc/services entries for sgi-cmsd and sgi-gcd:

sgi-cmsd        7000/udp         # SGI FailSafe Membership Daemon
sgi-gcd         8000/udp         # SGI Group Communication Daemon
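
Because the port numbers must agree on every node, it can help to print the four entries in a fixed format and compare the output between nodes. The sketch below is one way to do that. The function names are hypothetical, the services file is passed as an argument, and the protocol for each service is taken from the example entries above (only the TCP requirement for sgi-cad is stated as a rule).

```sh
# service_port NAME PROTO SERVICES_FILE
# Prints the port number of the first NAME entry that uses PROTO.
# Fails if there is no such entry.
service_port() {
    awk -v name="$1" -v proto="$2" '
        /^[ \t]*#/ { next }                  # skip comment lines
        $1 == name {
            split($2, pp, "/")               # pp[1] = port, pp[2] = protocol
            if (pp[2] == proto) { print pp[1]; found = 1; exit }
        }
        END { if (!found) exit 1 }
    ' "$3"
}

# check_failsafe_services SERVICES_FILE
# Prints one line per FailSafe service and fails if any entry is missing.
check_failsafe_services() {
    status=0
    for entry in sgi-crsd/udp sgi-cad/tcp sgi-cmsd/udp sgi-gcd/udp; do
        name=${entry%/*}
        proto=${entry#*/}
        if port=$(service_port "$name" "$proto" "$1"); then
            echo "$name $port/$proto"
        else
            echo "$name MISSING"
            status=1
        fi
    done
    return $status
}
```

Running check_failsafe_services against /etc/services on each node and comparing the output is a simple way to spot a mismatched or missing port.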

/etc/config/cad.options

The /etc/config/cad.options file contains the list of parameters that the cad(1M) cluster administration daemon reads when the process is started. cad provides cluster information to the GUI.

The following options can be set in the cad.options file:

--append_log 

Append cad logging information to the cad log file instead of overwriting it.

--log_file filename 

cad log file name. Alternatively, this can be specified as -lf filename.

-vvvv 

Verbosity level. The number of v characters indicates the level of logging. Setting -v logs the fewest messages. Setting -vvvv logs the highest number of messages.

The following example shows an /etc/config/cad.options file:

-vv -lf /var/cluster/ha/log/cad_nodename --append_log

The contents of the /etc/config/cad.options file cannot be modified using the cmgr(1M) command or the GUI.


Note: If you make a change to the cad.options file at any time other than initial configuration, you must restart the cad processes in order for these changes to take effect. You can do this by rebooting the nodes or by entering the following command:
# /etc/init.d/cluster restart



If you execute this command on a running cluster, it will remain up and running. However, the GUI will lose connection with the cad(1M) daemon; the GUI will prompt you to reconnect.


/etc/config/fs2d.options

The /etc/config/fs2d.options file contains the list of parameters that the fs2d daemon reads when the process is started. The fs2d daemon is the cluster database daemon that manages the distribution of cluster database across the nodes in the pool.

Table 3-1 shows the options that can be set in the fs2d.options file.

Table 3-1. fs2d.options File Options

Option

Description

-logevents event name

Log selected events. The following event names may be used: all, internal, args, attach, chandle, node, tree, lock, datacon, trap, notify, access, storage. The default is all.

-logdest log destination

Set log destination. The following log destinations may be used: all, stdout, stderr, syslog, logfile. If multiple destinations are specified, the log messages are written to all of them. If logfile is specified, it has no effect unless the -logfile option is also specified. The default is logfile.

-logfile filename

Set log filename. The default is /var/cluster/ha/log/fs2d_log.

-logfilemax maximum size

Set log file maximum size (in bytes). If the file exceeds the maximum size, any preexisting filename.old will be deleted, the current file will be renamed to filename.old, and a new file will be created. A single message will not be split across files. If -logfile is set, the default is 10000000.

-loglevel loglevel

Set log level. The following log levels may be used: always, critical, error, warning, info, moreinfo, freq, morefreq, trace, busy. The default is info.

-trace trace_class

Trace selected events. The following trace classes may be used: all, rpcs, updates, transactions, monitor. If you specify this option, you must also specify -tracefile and/or -tracelog. No tracing is done, even if it is requested for one or more classes of events, unless either or both of -tracefile or -tracelog is specified. The default is transactions.

-tracefile filename

Set trace filename. There is no default.

-tracefilemax maximum_size

Set trace file maximum size (in bytes). If the file exceeds the maximum size, any preexisting filename.old will be deleted, the current file will be renamed to filename.old, and a new file will be created.

-[no]tracelog

[Do not] trace to log destination. When this option is set, tracing messages are directed to the log destination or destinations. If there is also a trace file, the tracing messages are written there as well. The default is -tracelog.

-[no]parent_timer

[Do not] exit when parent exits. The default is -noparent_timer.

-[no]daemonize

[Do not] run as a daemon. The default is -daemonize.

-l

Do not run as a daemon.

-h

Print usage message.

-o help

Print usage message.

If you use the default values for these options, the system will be configured so that all log messages of level info or less, and all trace messages for transaction events, are sent to the /var/cluster/ha/log/fs2d_log file. When the file size reaches 10MB, this file will be moved to its namesake with the .old extension and logging will roll over to a new file of the same name. A single message will not be split across files.


Note: If you make a change to the fs2d.options file at any time other than initial configuration, you must restart the fs2d processes in order for those changes to take effect. You can do this by rebooting the nodes or by entering the following command:
# /etc/init.d/cluster restart



If you execute this command on a running cluster, it should remain up and running. However, the GUI will lose connection with the cad(1M) daemon; the GUI will prompt you to reconnect.
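
The size-based rollover described for -logfilemax and -tracefilemax can be pictured with a few lines of shell. This is only an illustration of the documented behavior (delete any existing filename.old, rename the current file to filename.old, create a new file); it is not how fs2d itself is implemented, and the function name rotate_if_over is hypothetical.

```sh
# rotate_if_over FILE MAX_BYTES
# If FILE is larger than MAX_BYTES, replace FILE.old with FILE and
# start a new, empty FILE. Otherwise leave everything untouched.
rotate_if_over() {
    file=$1
    max=$2
    [ -f "$file" ] || return 0
    # The arithmetic expansion strips any padding that wc adds.
    size=$(($(wc -c < "$file")))
    if [ "$size" -gt "$max" ]; then
        rm -f "$file.old"          # any preexisting .old copy is deleted
        mv "$file" "$file.old"     # the current file becomes the .old copy
        : > "$file"                # a new, empty file is created
    fi
}
```

Only one previous generation is kept, so allow for roughly twice the maximum size in disk space for each log or trace file.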


Example 1

The following example shows an /etc/config/fs2d.options file that directs logging and tracing information as follows:

  • All log events are sent to /var/adm/SYSLOG.

  • Tracing information for RPCs, updates, and transactions is sent to /var/cluster/ha/log/fs2d_ops1.

    When the size of this file exceeds 100,000,000 bytes, this file is renamed to /var/cluster/ha/log/fs2d_ops1.old and a new file /var/cluster/ha/log/fs2d_ops1 is created. A single message is not split across files.

(Line breaks added here only for readability.)

-logevents all -loglevel trace -logdest syslog -trace rpcs 
-trace updates -trace transactions -tracefile /var/cluster/ha/log/fs2d_ops1 
-tracefilemax 100000000

Example 2

The following example shows an /etc/config/fs2d.options file that directs all log and trace messages into one file, /var/cluster/ha/log/fs2d_chaos6, for which a maximum size of 100,000,000 bytes is specified. -tracelog directs the tracing to the log file.

(Line breaks added here only for readability.)

-logevents all -loglevel trace -trace rpcs -trace updates 
-trace transactions -tracelog -logfile /var/cluster/ha/log/fs2d_chaos6 
-logfilemax 100000000 -logdest logfile

/etc/config/cmond.options

The /etc/config/cmond.options file contains the list of parameters that the cmond(1M) cluster monitor daemon reads when the process is started. It also specifies the name of the file that logs cmond events. cmond provides a framework for starting, stopping, and monitoring process groups. See the cmond(1M) man page for more information.

The following options can be set in the cmond.options file:

-L log_level 

Set log level to log_level. The legal values for log_level are normal, critical, error, warning, info, frequent, and all.

-d 

Run in debug mode.

-l 

Lazy mode, where cmond does not validate its connection to the cluster database.

-t nap_interval 

The time interval, in milliseconds, after which cmond checks the liveness of the process groups it is monitoring.

-s 

Log messages to standard error.

A default cmond.options file is shipped with the following options. This default options file logs cmond events to the /var/cluster/ha/log/cmond_log file.

-L info -f /var/cluster/ha/log/cmond_log

Set the corepluspid System Parameter

Use the systune(1M) command to set the corepluspid flag to 1 on every node. If this flag is set, IRIX will suffix all core files with a process ID (PID). This prevents a core dump from being overwritten by another process core dump.

Set NVRAM Variables

During the hardware installation of FailSafe nodes, two non-volatile random-access memory (NVRAM) variables must be set:

  • The boot parameter AutoLoad must be set to yes. FailSafe requires the nodes to be automatically booted when they are reset or when the node is powered on.

  • The SCSI IDs of the nodes, specified by the scsihostid variable, must be different. This variable is important only when a cluster is configured with shared SCSI storage. If a cluster has no shared storage or is using shared Fibre Channel storage, setting scsihostid is not important.

You can check the setting of these variables with the following commands:

# nvram AutoLoad
Y
# nvram scsihostid 
0

To set these variables, use the following commands:

# nvram AutoLoad yes
# nvram scsihostid number 

number is the SCSI ID you choose. A node uses its SCSI ID on all buses attached to it. Therefore, you must ensure that no device attached to a node has number as its SCSI unit number. If you change the value of the scsihostid variable, you must reboot the system for the change to take effect.
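
When several nodes share SCSI storage, a duplicate scsihostid is easy to overlook. The following sketch assumes you have collected the value reported by nvram scsihostid on each node into lines of the form "nodename id". The function name duplicate_scsi_ids and that input format are illustrative choices, not something FailSafe provides.

```sh
# duplicate_scsi_ids
# Reads "nodename scsihostid" pairs on standard input and prints every
# ID that is claimed by more than one node, in numeric order.
duplicate_scsi_ids() {
    awk '
        NF >= 2 { count[$2]++ }
        END { for (id in count) if (count[id] > 1) print id }
    ' | sort -n
}
```

No output means that all of the collected IDs are distinct. The check does not cover the other requirement in this section, that no device on a shared bus uses the same number as its SCSI unit number.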

Create XLV Logical Volumes and XFS Filesystems

You can create XLV logical volumes by following the instructions in the guide IRIX Admin: Disks and Filesystems.


Note: This section describes logical volume configuration using XLV logical volumes. For information on coexecution of FailSafe and CXFS filesystems (which use XVM logical volumes), see “Coexecution of CXFS and FailSafe” in Chapter 2. For information on creating CXFS filesystems, see the CXFS Version 2 Software Installation and Administration Guide. For information on creating XVM logical volumes, see the XVM Volume Manager Administrator's Guide.

When you create XLV logical volumes and XFS filesystems, remember the following important points:

  • If the shared disks are not in a RAID storage system, you should create plexed XLV logical volumes.

  • Each XLV logical volume must be owned by the same node that is the primary node for the resources that use the logical volume (see “Planning XLV Logical Volumes” in Chapter 2). To simplify the management of the owners of volumes on shared disks, use the following recommendations:

    • Work with the volumes on a shared disk from only one node in the cluster.

    • After you create all the volumes on one node, you can selectively change the nodename to the other node using xlv_mgr.

  • If the XLV logical volumes you create are used as raw volumes (that is, with no filesystem) for storing database data, the database system may require that the device names (in /dev/rxlv and /dev/xlv) have specific owners, groups, and modes. If this is the case (see the documentation provided by the database vendor), use the chown(1) and chmod(1) commands to set the owner, group, and mode as required.

  • No filesystem entries are made in /etc/fstab for XFS filesystems on shared disks; FailSafe software mounts the filesystems on shared disks. However, to simplify system administration, consider adding comments to /etc/fstab that list the XFS filesystems configured for FailSafe. Thus, a system administrator who sees mounted FailSafe filesystems in the output of the df command and looks for the filesystems in the /etc/fstab file will learn that they are filesystems managed by FailSafe.

  • Be sure to create the mount point directory for each filesystem on all nodes.
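
The last point is simple to verify on each node. The sketch below reads a list of mount point paths, one per line, and prints those that do not exist as directories on the local node. The function name is hypothetical, and the list of paths is something you maintain yourself.

```sh
# missing_mount_points
# Reads mount point paths on standard input, one per line, and prints
# each path that is not an existing directory on this node.
missing_mount_points() {
    while IFS= read -r dir; do
        [ -n "$dir" ] || continue      # ignore blank lines
        [ -d "$dir" ] || echo "$dir"
    done
}
```

Running the same list through the function on every node shows where a mount point directory still has to be created.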

Configure Network Interfaces

This section describes how to configure the network interfaces. The example shown in Figure 3-1 is used in the procedure.

Figure 3-1. Example Interface Configuration


  1. If possible, add every IP address, IP name, and IP alias for the nodes to /etc/hosts on one node.

    For example:

    190.0.2.1 xfs-ha1.company.com xfs-ha1
    190.0.2.3 stocks
    190.0.3.1 priv-xfs-ha1
    190.0.2.2 xfs-ha2.company.com xfs-ha2
    190.0.2.4 bonds
    190.0.3.2 priv-xfs-ha2


    Note: IP aliases that are used exclusively by HA services are not added to the file /etc/config/ipaliases.options. Similarly, if all IP aliases are used only by HA services, the ipaliases chkconfig flag should be off.


  2. Add all of the IP addresses from step 1 to /etc/hosts on the other nodes in the cluster.

  3. If there are IP addresses, IP names, or IP aliases that you did not add to /etc/hosts in steps 1 and 2, verify that NIS is configured on all nodes by entering the following command on each node:

    # chkconfig | grep yp
    ...
            yp           on

    If the output shows that yp is off, you must start NIS. See the NIS Administrator's Guide for details.

  4. For IP addresses, IP names, and IP aliases that you did not add to /etc/hosts on the nodes in steps 1 and 2, verify that they are in the NIS database by entering the following command for each address:

    # ypmatch address hosts
    190.0.2.1 xfs-ha1.company.com xfs-ha1

    address is an IP address, IP name, or IP alias. If ypmatch(1M) reports that address does not match, it must be added to the NIS database. See the NIS Administrator's Guide for details.

  5. On one node, add that node's interfaces and their IP addresses to the file /etc/config/netif.options. However, highly available (HA) IP addresses are not added to the netif.options file.

    For the example in Figure 3-1, the public interface name and IP address lines are as follows:

    if1name=ec0
    if1addr=$HOSTNAME

    $HOSTNAME is an alias for an IP address that appears in /etc/hosts.

    If there are additional public interfaces, their interface names and IP addresses appear on lines such as the following:

    if2name=
    if2addr=

    In the example, the control network name and IP address are as follows:

    if3name=ec3
    if3addr=priv-$HOSTNAME

    The control network IP address in this example, priv-$HOSTNAME, is an alias for an IP address that appears in /etc/hosts.

  6. If there are more than eight interfaces on the node, change the value of if_num to the number of interfaces. For eight or fewer interfaces (as in the example in Figure 3-1), the line is as follows:

    if_num=8

  7. Repeat Steps 5 and 6 on the other nodes.

  8. Edit the /etc/config/routed.options file on each node so that the routes are not advertised over the control network. See the routed(1M) man page for a list of options.

    For example:

    -q -h -Prdisc_interval=45


    Note: The -q option is required for FailSafe to function correctly. This ensures that the heartbeat network does not get loaded with packets that are not related to the cluster.

    The options do the following:

    • Turn off advertising of routes

    • Cause host or point-to-point routes to not be advertised (provided there is a network route going the same direction)

    • Set the normal interval with which router discovery advertisements are transmitted to 45 seconds (and their lifetime to 135 seconds)

  9. Verify that IRIS FailSafe 2.x is turned off on each node, using the chkconfig(1M) command:

    # chkconfig | grep failsafe2
    ...
            failsafe2          off
    ...

    If failsafe2 is set to on on a node, enter this command on that node:

    # chkconfig failsafe2 off

    If FailSafe 1.x is present, you must also ensure that it is turned off on every node:

    # chkconfig | grep failsafe
    ...
            failsafe             off
    ...

    If failsafe is on on any node, enter this command on that node:

    # chkconfig failsafe off

  10. Configure an e-mail alias on each node that sends the FailSafe e-mail notifications of cluster transitions to a user outside of the cluster and to a user on the other nodes in the cluster.

    For example, if there are two nodes called xfs-ha1 and xfs-ha2, add the following to /usr/lib/aliases on xfs-ha1:

    fsafe_admin:operations@console.xyz.com,admin_user@xfs-ha2.xyz.com 

    On xfs-ha2, add the following line to /usr/lib/aliases:

    fsafe_admin:operations@console.xyz.com,admin_user@xfs-ha1.xyz.com 

    The alias you choose, fsafe_admin in this case, is the value you will use for the mail destination address when you configure your system. In this example, operations is the user outside the cluster and admin_user is a user on each node.

  11. If the nodes use NIS -- that is, yp has been set to on using chkconfig(1M) -- or the BIND domain name server (DNS), switching to local name resolution is recommended. Modify the /etc/nsswitch.conf file so that it reads as follows:

    hosts:                  files nis dns 


    Note: Exclusive use of NIS or DNS for IP address lookup for the nodes has been shown to reduce availability in situations where the NIS service becomes unreliable.


  12. If you are using FDDI, finish configuring and verifying the new FDDI station, as explained in the FDDIXpress release notes and the FDDIXPress Administration Guide.

  13. Reboot all nodes to put the new network configuration into effect.
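
Two of the steps above can be cross-checked with small helper functions before you reboot. This is a sketch under stated assumptions: the function names are hypothetical, and the files are passed as arguments so that copies collected from several nodes can be compared in one place. The first function reduces a hosts file to a canonical form so that the files from steps 1 and 2 can be compared; the second counts the interfaces that are actually named in a netif.options file, for comparison with if_num in step 6.

```sh
# normalized_hosts HOSTS_FILE
# Prints the entries of HOSTS_FILE with comments and blank lines removed,
# runs of whitespace squeezed to single spaces, and lines sorted.
normalized_hosts() {
    sed -e 's/#.*//' "$1" | awk 'NF { $1 = $1; print }' | sort
}

# configured_interfaces NETIF_OPTIONS_FILE
# Prints the number of ifNname= lines that have a non-empty value.
configured_interfaces() {
    # grep exits non-zero when the count is 0; the count is still printed.
    grep -c -E '^if[0-9]+name=[^[:space:]]' "$1" || true
}
```

If normalized_hosts prints the same text for the hosts files of two nodes, the entries agree even when spacing, ordering, or comments differ. If configured_interfaces reports more than eight, if_num needs to be raised as described in step 6.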

Configure the Serial Ports for a Ring Reset

When using a ring reset configuration, you must turn off the getty process for the tty ports to which the reset serial cables are connected. Perform the following steps on each node:

  1. Determine which port is used for the reset serial line.

  2. Open the file /etc/inittab for editing.

  3. Find the line for the port by looking at the comments on the right for the port number from step 1.

  4. Change the third field of this line to off. For example:

    t2:23:off:/sbin/getty -N ttyd2 co_9600          # port 2

  5. Save the file.

  6. Enter these commands to make the change take effect:

    # killall getty
    # init q
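
If you prefer to prepare the /etc/inittab change with a script rather than an editor, the edit in step 4 amounts to replacing the third colon-separated field of one entry. The following is a minimal sketch, assuming you identify the entry by its first field (t2 in the example). The function name is hypothetical, and the result is written to standard output so that you can review it before replacing the real file.

```sh
# inittab_set_off ID INITTAB_FILE
# Prints INITTAB_FILE with the action (third field) of the entry whose
# first field is ID changed to off. All other lines are printed unchanged.
inittab_set_off() {
    awk -F: -v id="$1" '
        BEGIN { OFS = ":" }
        $1 == id { $3 = "off" }
        { print }
    ' "$2"
}
```

After reviewing the output and installing it as /etc/inittab, the commands in step 6 are still needed to make the change take effect.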

Install Patches

The procedures in this section describe how to install a FailSafe patch. The patch should be installed on all nodes.

Installing FailSafe 2.x and a FailSafe Patch at the Same Time

When you install FailSafe 2.x images and an upgrade patch together, the cluster processes must be stopped and started on each node after patch installation. This is because the FailSafe 2.x installation automatically starts the cluster processes and the patch installation does not automatically stop them, so the cluster processes will continue to run the unpatched shared libraries unless you restart them.

Do the following on each node:

  1. Install FailSafe 2.x images on the node. This includes the following products:

    cluster_admin
    cluster_control
    cluster_services
    failsafe2
    sysadm_base
    sysadm_failsafe2

  2. Install the FailSafe 2.x patch.

  3. In a UNIX shell, stop all cluster processes on the node:

    # /etc/init.d/cluster stop

  4. Verify that the cluster processes (cad, cmond, crsd, and fs2d) have stopped:

    # ps -ef | egrep '(cad|cmond|crsd|fs2d)'

  5. Start cluster processes on the node:

    # /etc/init.d/cluster start

You are now ready to run the FailSafe Manager GUI or the cmgr(1M) command to set up a FailSafe cluster.
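
The egrep pattern in step 4 matches substrings, so its output can include the egrep process itself or unrelated commands whose names happen to contain one of the daemon names. Where that is confusing, a stricter filter on exact command names is one option. The sketch below assumes you can feed it bare command names or paths, one per line (for example, from a ps invocation that prints only the command column; check ps(1) on your release for a suitable option). The function name is hypothetical.

```sh
# running_cluster_daemons
# Reads command names or paths on standard input, one per line, and
# prints the cluster daemons (cad, cmond, crsd, fs2d) found among them.
running_cluster_daemons() {
    # Strip any leading directory, then keep only exact daemon names.
    # grep exits non-zero when nothing matches; that is not an error here.
    sed 's|.*/||' | { grep -E -x 'cad|cmond|crsd|fs2d' || true; } | sort -u
}
```

Empty output means that none of the four cluster processes appeared in the list that was supplied.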

Installing a FailSafe Patch on an Existing FailSafe 2.x Cluster

Using these instructions, you can install a FailSafe patch on each FailSafe 2.x node in turn, without shutting down the entire cluster and without interrupting the HA services provided by the cluster.


Note: Before installing a FailSafe patch, you should read the patch's release notes. These release notes may contain special instructions that are not provided in this procedure.

To install a FailSafe patch on each node in your FailSafe cluster, follow these steps:

  1. If you have the FailSafe GUI client software installed on a machine that is not a node, first install the patch client subsystems on that machine. The GUI client software subsystems are as follows, where xxxxxxx is the patch number:

    patchSGxxxxxxx.sysadm_base_sw.client
    patchSGxxxxxxx.sysadm_failsafe2_sw.client
    patchSGxxxxxxx.sysadm_failsafe2_sw.desktop

  2. Choose a node on which to install the patch. Start up the FailSafe GUI or cmgr(1M) command on that node.

    For convenience, connect the GUI to a node that you are not upgrading.


    Note: If you connect to the node that you are upgrading, then in a later step (when you stop HA services), FailSafe will no longer report accurate status to the GUI; in another later step (when you stop cluster services), the GUI will lose its connection.


    Use the following cmgr command to specify a default cluster (later commands in this procedure assume the cluster name has already been set):

    cmgr> set cluster clustername

  3. (Optional) If you wish to keep all resource groups running on the node during installation, take the resource groups offline using the detach option (that is, detach the resource groups). If you do this, FailSafe will stop monitoring the resources, which will continue to run on the node, and will not have any control over the resource groups. Otherwise, in the next step, the resources should migrate to another node automatically, assuming the failover policy is defined that way.

    If you are using the GUI, run the Take Resource Group Offline task and check the Detach Only checkbox.

    If you are using cmgr, execute the following command:

    cmgr> admin offline_detach resource_group groupname

  4. Stop HA services on the node. (When HA services stop, FailSafe will no longer be able to report current cluster and node state if the FailSafe GUI is connected to that node. To monitor the cluster state during installation, connect the FailSafe GUI to the node that you are not upgrading.)

    If you are using the GUI, run the Stop FailSafe HA Services task, specifying the node you are patching in the One Node Only field.

    If you are using cmgr, execute the following command:

    cmgr> stop ha_services on node nodename

    If you skipped optional step 3, FailSafe will attempt to migrate all resource groups off that node, but this will fail if there are no other available nodes in the resource group's failover domain. If an error occurs, either complete step 3 or move the resource group to the other node:

    If you are using the GUI, run the Move Resource Group task, specifying the node you are not patching in the Failover Domain Node field.

    If you are using cmgr, execute the following command:

    cmgr> admin move resource_group groupname to node nodename

  5. In a UNIX shell on the node you are upgrading, stop all cluster processes:

    # /etc/init.d/cluster stop

    If you are using the GUI and the connection lost dialog appears, click No. If you wish to continue using the GUI, restart it and connect to a node that you are not patching.

  6. Verify that the cluster processes (cad, cmond, crsd, and fs2d) have stopped:

    # ps -ef | egrep '(cad|cmond|crsd|fs2d)'

  7. Use chkconfig(1M) to turn off the cluster flag:

    # chkconfig cluster off


    Note: You cannot use the failsafe2 flag to turn off the HA services on a node. You must use the GUI or cmgr commands to stop HA services; these commands can be run from any node in the pool. If necessary, you can use the force option. For more information, see “Stop FailSafe HA Services” in Chapter 5.


  8. Install the patch on the node.

  9. Use chkconfig to turn on the cluster flag:

    # chkconfig cluster on

  10. Start cluster processes on the node:

    # /etc/init.d/cluster start

  11. Start HA services on the node.

    If you are using the GUI and you are running the GUI in a Web browser, do the following:

    1. Exit your browser.

    2. Restart the Web server on the node you have just patched.

    3. Restart the GUI, connecting to the patched node.

    4. Run the Start FailSafe HA Services task, specifying the node that you just patched in the One Node Only field.

      If the GUI claims that FailSafe HA services are active on the cluster, then you are using an unpatched client; in this case, run the cmgr command instead, run the GUI on a patched client, or run the GUI in a Web browser from the patched node.

    If you are using cmgr, execute the following command:

    cmgr> start ha_services on node nodename

  12. Monitor the resource groups and verify that they come back online on the upgraded node. This may take several minutes, depending on the types and numbers of resources in the groups.

    If you are using the GUI, select View: Groups Owned by Nodes in the view area. Confirm that the resource group icons indicate online status.


    Note: When you restart HA services on the upgraded node, it can take several minutes for the node and cluster to return to normal active state.

    If you are using cmgr, execute the following command:

    cmgr> show status of resource_group groupname

Repeat the above process for the other nodes. If you are using the GUI, remember to reconnect to the node that you have just upgraded. After completing the process for all nodes, you can continue to monitor and administer your upgraded cluster, defining additional new nodes if desired.
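The check in step 6 of the procedure above can be scripted. The following is a minimal sketch, not part of the FailSafe distribution: the function name running_cluster_daemons is hypothetical, and it assumes a POSIX awk is available. It reads ps -ef style output on standard input and prints the name of each cluster daemon (cad, cmond, crsd, fs2d) that is still listed. Because it compares whole words rather than substrings, it does not report the egrep process itself, which the simple pipeline in step 6 may do.

```shell
#!/bin/sh
# Sketch only: running_cluster_daemons is a hypothetical helper name.
# Reads `ps -ef` style output on standard input and prints, one per line,
# each cluster daemon (cad, cmond, crsd, fs2d) that still appears in it.
running_cluster_daemons() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                name = $i
                sub(/^.*\//, "", name)   # strip any leading directory path
                if (name == "cad" || name == "cmond" || name == "crsd" || name == "fs2d")
                    seen[name] = 1
            }
        }
        END {
            # Report in a fixed order so the output is predictable.
            n = split("cad cmond crsd fs2d", order, " ")
            for (i = 1; i <= n; i++)
                if (order[i] in seen)
                    print order[i]
        }
    '
}
```

For example, run ps -ef | running_cluster_daemons on the node you are upgrading; no output means that none of the four daemons was found. A word-level match can still be fooled by an unrelated process that takes one of these names as an argument, so treat any output as a prompt to inspect the ps listing directly.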

Install Performance Co-Pilot (PCP) Software

You can deploy Performance Co-Pilot (PCP) for FailSafe as a collector agent or as a monitor client:

  • Collector agents are installed on collector hosts, which are the nodes in the FailSafe cluster itself from which you want to gather statistics. Typically, each node in a FailSafe cluster is designated as a collector host.

  • A monitor client is installed on the monitor host, which is typically a workstation that has a display and is running the IRIS Desktop.

Installing the Collector Host

To install PCP for FailSafe on the designated collector hosts, the following software components must already be installed:

  • The pcp_eoe.sw subsystem from IRIX 6.5.11 or later

  • IRIS FailSafe 2.1 or later

  • PCP 2.1 or later

A collector license (PCPCOL) must also be installed on each of these nodes.

After this software is installed, you must install the following subsystems of PCP for FailSafe on each collector host. Table 3-2 lists the subsystems required for a collector host and their approximate sizes.

Table 3-2. PCP for FailSafe Collector Subsystems

Subsystem                 Size in KB

pcp_fsafe.man.pages       40

pcp_fsafe.man.relnotes    32

pcp_fsafe.sw.collector    128


To install the required subsystems on a collector host, do the following:

  1. Mount the FailSafe CD-ROM by inserting it into an available drive. You can access a local CD-ROM drive or a remote CD-ROM drive of another host over the network.

  2. Log in as root.

  3. Start the inst(1) command:

    # inst

  4. Specify the installation location:

    • If you are installing from the local CD-ROM drive, enter the following:

      Inst> from /CDROM/dist

    • If you are installing from a remote drive, enter the following, where host is the name of the host with the CD-ROM drive that contains a mounted FailSafe CD-ROM:

      Inst> from host:/CDROM/dist

  5. Select the default subsystems in the pcp_fsafe package. The default subsystems are provided for easy installation onto multiple collector hosts:

    Inst> install default

  6. Ensure that there are no conflicts:

    Inst> conflicts

  7. Install the software:

    Inst> go

  8. Change to the /var/pcp/pmdas/fsafe directory:

    # cd /var/pcp/pmdas/fsafe

  9. Run the Install utility, which installs the FailSafe performance metrics into the PCP performance metrics namespace:

    # ./Install

  10. Choose an appropriate configuration for installation of the fsafe Performance Metrics Domain Agent (PMDA):

    • collector, which collects performance statistics on this system

    • monitor, which allows this system to monitor local and/or remote systems

    • both, which allows collector and monitor configuration for this system

    For example, to choose just the collector, enter the following:

    Please enter c(ollector) or m(onitor) or b(oth) [b] c

Removing Performance Metrics from a Collector Host

If you wish to remove PCP for FailSafe from a collector host, you must remove the PCP for FailSafe metrics from the performance metrics namespace of that host. Do this before removing the pcp_fsafe subsystem, using the following steps:

  1. Change to the /var/pcp/pmdas/fsafe directory:

    # cd /var/pcp/pmdas/fsafe

  2. Run the Remove utility:

    # ./Remove

Installing the Monitor Host

To install PCP for FailSafe on a designated monitor host, the following software components must already be installed on the node:

  • The pcp_eoe.sw subsystem of IRIX 6.5.11 or later, including the subsystem pcp_eoe.sw.monitor

  • PCP 2.1 or later, including the subsystem pcp.sw.monitor

The monitor license (PCPMON) must also be installed on the monitor host.

After this software is installed, install the subsystems of PCP for FailSafe listed in Table 3-3 on the monitor host.

Table 3-3. PCP for FailSafe Monitor Subsystems

Subsystem                 Size in KB

pcp_fsafe.man.pages       40

pcp_fsafe.man.relnotes    32

pcp_fsafe.sw.monitor      516


To install the required subsystems for PCP for FailSafe on a monitor host, do the following:

  1. Mount the PCP for FailSafe CD-ROM by inserting it into an available drive. You can access a local CD-ROM drive or a remote CD-ROM drive of another host over the network.

  2. Log in as root.

  3. Start the inst(1) command:

    # inst

  4. Specify the installation location:

    • If you are installing from the local CD-ROM drive, enter the following:

      Inst> from /CDROM/dist

    • If you are installing from a remote drive, enter the following, where host is the name of the host with the CD-ROM drive that contains a mounted PCP for FailSafe CD-ROM:

      Inst> from host:/CDROM/dist

  5. Select the required subsystems in the pcp_fsafe package for a monitor configuration:

    Inst> keep pcp_fsafe.sw.collector
    Inst> install pcp_fsafe.sw.monitor

  6. Ensure that there are no conflicts before you install PCP for FailSafe:

    Inst> conflicts

  7. Install the software:

    Inst> go

Test the System

This section discusses the following ways of testing the system:

  • Private Network Interface

  • Serial Reset Connection

Private Network Interface

For each private network on each node in the pool, enter the following, where nodeIPaddress is the IP address of the node:

# /usr/etc/ping -c 3 nodeIPaddress

Typical ping(1M) output should appear, such as the following:

PING IPaddress (190.x.x.x): 56 data bytes
64 bytes from 190.x.x.x: icmp_seq=0 ttl=254 time=3 ms
64 bytes from 190.x.x.x: icmp_seq=1 ttl=254 time=2 ms
64 bytes from 190.x.x.x: icmp_seq=2 ttl=254 time=2 ms

If ping fails, follow these steps:

  1. Verify that the network interface was configured up using ifconfig; for example:

    # /usr/etc/ifconfig ec3
    ec3: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,FILTMULTI,MULTICAST>
    inet 190.x.x.x netmask 0xffffff00 broadcast 190.x.x.x

    The UP in the first line of output indicates that the interface was configured up.

  2. Verify that the cables are correctly seated.

Repeat this procedure on each node.
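When a pool has several private network addresses, the ping and ifconfig checks above can be wrapped in a small script. The following is a minimal sketch under stated assumptions: the function names check_private_networks and iface_is_up and the PING variable are hypothetical conveniences, not FailSafe or IRIX features. The default ping path and the -c 3 option are taken from the command shown above.

```shell
#!/bin/sh
# Sketch only: PING, check_private_networks, and iface_is_up are
# hypothetical names introduced for this example.

# Path to ping; defaults to the path used in this section.
PING=${PING:-/usr/etc/ping}

# Ping each address given as an argument three times.
# Prints one result line per address and returns non-zero if any ping failed.
check_private_networks() {
    failed=0
    for addr in "$@"; do
        if "$PING" -c 3 "$addr" >/dev/null 2>&1; then
            echo "ok: $addr"
        else
            echo "FAILED: $addr"
            failed=1
        fi
    done
    return $failed
}

# Succeeds if the ifconfig output read from standard input lists UP
# among the interface flags on its flags= line.
iface_is_up() {
    grep 'flags=' | grep '[<,]UP[,>]' >/dev/null
}
```

For example, check_private_networks followed by the private network IP addresses of the nodes reports each address in turn, and /usr/etc/ifconfig ec3 | iface_is_up tests the interface from the example above. If an address is reported as FAILED, continue with the manual steps (interface state, then cabling).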

Serial Reset Connection

To test the serial hardware reset connections, do the following:

  1. Ensure that the nodes and the serial multiplexer are powered on.

  2. Start the cmgr(1M) command on one of the nodes in the pool:

    # cmgr

  3. Stop HA services on each node:

    stop ha_services for cluster clustername

    For example:

    cmgr> stop ha_services for cluster fs6-8

    Wait until the node has successfully transitioned to inactive state and the FailSafe processes have exited. This process can take a few minutes.

  4. Test the serial connections by entering one of the following:

    • To test the whole cluster, enter the following:

      test serial in cluster clustername

      For example:

      cmgr> test serial in cluster fs6-8
      Status: Testing serial lines ...
      Status: Checking serial lines using crsd (cluster reset services) from node fs8
      Success: Serial ping command OK.
      
      Status: Checking serial lines using crsd (cluster reset services) from node fs6
      Success: Serial ping command OK.
      
      Status: Checking serial lines using crsd (cluster reset services) from node fs7
      Success: Serial ping command OK.
      
      Notice: overall exit status:success, tests failed:0, total tests executed:1

    • To test an individual node, enter the following:

      test serial in cluster clustername node machinename

      For example:

      cmgr> test serial in cluster fs6-8 node fs7
      Status: Testing serial lines ...
      Status: Checking serial lines using crsd (cluster reset services) from node fs6
      Success: Serial ping command OK.
      
      Notice: overall exit status:success, tests failed:0, total tests executed:1

    • To test an individual node using just a ping, enter the following:

      admin ping node nodename

      For example:

      cmgr> admin ping node fs7
      
      ping operation successful

  5. If a command fails, make sure all the cables are seated properly and rerun the command.

  6. Repeat the process on other nodes in the cluster.
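If you save the output of the serial test to a file, you can summarize it with a short script rather than reading it by eye. The following is a minimal sketch that assumes the output has the same form as the examples above; the function name serial_test_summary is hypothetical, and a POSIX awk is assumed. It prints one line for each node from which the serial ping succeeded, and it succeeds only if the final Notice line reports overall success with no failed tests.

```shell
#!/bin/sh
# Sketch only: serial_test_summary is a hypothetical helper name.
# Reads saved "test serial" output on standard input, in the form shown
# in the examples in this section.
serial_test_summary() {
    awk '
        # Remember which node the current check is being run from.
        /^ *Status: Checking serial lines/   { node = $NF }
        # Report each successful serial ping with its originating node.
        /^ *Success: Serial ping command OK/ { print "ok from " node }
        # The final Notice line carries the overall result.
        /overall exit status:/ {
            if ($0 ~ /exit status:success, tests failed:0,/)
                ok = 1
        }
        END { exit (ok ? 0 : 1) }
    '
}
```

For example, if the output was saved in a file named serial_test.out (a hypothetical name), run serial_test_summary < serial_test.out and check the exit status. A non-zero status means that the Notice line was missing or reported a failure; in that case, check the cabling and rerun the test as described in step 5.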