This chapter tells you how to use Performance Co-Pilot (PCP) for FailSafe to monitor the availability of an IRIX FailSafe cluster. For information about installing PCP for FailSafe, see “Install Performance Co-Pilot (PCP) Software” in Chapter 3.
PCP provides the following:
An agent for exporting FailSafe heartbeat and resource monitoring statistics to the PCP framework
3-D visualization tools for displaying these statistics in an intuitive presentation
The visualization of statistics provides valuable information about the availability of nodes and resources monitored by FailSafe. For example, it can highlight a degradation in monitoring response times that may indicate problems in the availability of services provided by the cluster.
Because PCP for FailSafe is an extension to the PCP framework, you can use other PCP tools to analyze or present FailSafe monitoring statistics, and record PCP for FailSafe metrics as archives for deferred analysis. You can also use PCP to gather statistics about CPU and memory utilization, network and disk activity, and other performance metrics for each node in the cluster.
To view statistics about the FailSafe cluster, use the hbvis(1) and rmvis(1) commands.
The hbvis(1) command constructs a display showing the distribution of heartbeat response times for every node in the cluster. Figure 11-1 shows an example display.
Key features of the display include the frequency of heartbeat responses that arrive at particular intervals within the timeout period and the frequency of heartbeat responses that have been missed (determined not to have arrived). The bar representing the frequency of missed heartbeat responses changes color to indicate the urgency of problems with availability of a node.
The rmvis(1) command constructs a display of the resource monitoring response times for resources monitored on every node of the cluster. Figure 11-2 shows an example display.
The display is similar in concept to that of hbvis(1), showing the frequency of resource monitoring responses that arrive within the timeout period, and the frequency of responses that have timed out. The bar representing the frequency of resource responses that have timed out also changes color to indicate the urgency of problems with the availability of particular resources.
If a node has failed or a resource has failed over, its statistics will disappear from the display.
To run a visualization tool on the monitor host, use the -h option to specify an available collector host in the cluster (host):
% hbvis -h host
or
% rmvis -h host
The collector host specified can be any collector host that is a member of the cluster for which you wish to view statistics.
There are various options available to alter the display provided by hbvis(1) and rmvis(1):
| -H hostfile | Provides a file that lists the nodes that are to appear in the visualization. This is useful in limiting the number of nodes in the display, because it takes more time to construct the display for clusters with more nodes. |
| -t interval | Sets the sampling interval of the visualization. Lengthening the sampling interval may improve application responsiveness, particularly for clusters with many nodes. Because FailSafe maintains the statistics, hbvis(1) and rmvis(1) always show the latest statistics available for the selected sampling interval. For details about the interval option, see the pmview(1) and PCPIntro(1) man pages. |
| -r | Selects the FailSafe metrics that present a sampling of statistics taken from the time of the last statistical reset. This enables hbvis(1) and rmvis(1) to improve the sensitivity of the visualization when abrupt changes appear in the FailSafe monitoring statistics. Without the -r option, the statistics presented are from a sampling of FailSafe metrics collected from the time ha_cmsd(1m) and/or ha_srmd(1m) was last restarted. |
| -R | Starts a new statistical sampling. |
| -v | (hbvis(1) only) Provides a visualization of heartbeat statistics for each node in the cluster, from the point of view of the selected collector host only. (The collector host is selected using the -h option.) There is one graphical representation of heartbeat statistics for each node in the cluster, as observed by the selected collector host. |
| -w | (hbvis(1) only) Provides a visualization of the aggregate of heartbeat statistics for all nodes in the cluster, from the point of view of the selected collector host only. (The collector host is selected using the -h option.) There is only one graphical representation of heartbeat statistics for the entire cluster, as observed by the selected collector host. |
For a complete description of options, see the hbvis(1) and rmvis(1) man pages.
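As an illustration of combining these options, the following sketch writes a host file and prints an hbvis(1) command line that uses it. The node names (node1, node2), the file location, and the one-name-per-line file format are assumptions made for this example; check the hbvis(1) man page for the format that -H actually expects. The sketch also assumes that a bare number given to -t is read as seconds, as described in PCPIntro(1). The command is printed rather than executed so that it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical node names; substitute the nodes you want displayed.
nodes="node1 node2"
collector="node1"

# Assumed host file format: one node name per line (verify in hbvis(1)).
hostfile="${TMPDIR:-/tmp}/fsafe_nodes.$$"
for n in $nodes; do
    echo "$n"
done > "$hostfile"

# Limit the display to the listed nodes and use a 5-second sampling interval.
cmd="hbvis -h $collector -H $hostfile -t 5"
echo "$cmd"
```

The same host file can be passed to rmvis(1) by replacing the command name.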
hbvis(1) and rmvis(1) use the command pmview(1) to display the 3-D visualization of FailSafe performance metrics. For a description of the various menu commands and controls in the visualization window, consult the man pages for pmview(1).
PCP tools such as pmlogger(1), pmchart(1), and pminfo(1) can use the metrics exported by PCP for FailSafe.
Appendix C, “Metrics Exported by PCP for FailSafe”, provides a description of PCP for FailSafe metrics. You can also display a description of metrics by using the following command:
% pminfo -tT -h host
(If you are logged in to a collector host, you can omit the -h option.)
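A small wrapper can make the optional -h argument explicit. The function name pminfo_cmd below is hypothetical, and the function only prints the command line so that it can be inspected before it is run.

```shell
#!/bin/sh
# Print the pminfo command line for an optional collector host.
# With no argument, -h is omitted, which suits a session on a collector host.
pminfo_cmd() {
    if [ -n "${1:-}" ]; then
        echo "pminfo -tT -h $1"
    else
        echo "pminfo -tT"
    fi
}

pminfo_cmd host    # viewing from a monitor host; "host" is the collector
pminfo_cmd         # already logged in to a collector host
```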
A gray display (that is, no colored rectangular bars appear on the node's gray baseplane) when using hbvis(1) or rmvis(1) may indicate one of the following:
The node is down.
If you wish to see only the nodes that are up, create a file containing a list of nodes that are to be displayed and pass it as an option to hbvis(1)/rmvis(1) using the -H option (or the environment variable PCP_FSAFE_NODES) so that a new picture of the cluster can be generated. Please refer to the hbvis(1)/rmvis(1) man pages for more details on the -H option.
The collector daemons have been killed on that node.
To solve this problem, restart pmdafsafe(1) in one of the following ways:
If pmcd(1) is still running, send pmcd(1) the SIGHUP signal by entering the following:
# killall -HUP pmcd
If pmcd(1) is not running, restart PCP by entering the following:
# /etc/init.d/pcp start
The timeout and sampling settings are too short.
To change the sampling time, use the time controls available in the pmview(1) window. By default, this is two seconds; you may need to lengthen the sampling period if you are getting an unsatisfactory display.
Alternatively, there may be timeout issues between pmdafsafe(1) and pmcd(1), or between pmcd(1) and pmview(1). Refer to the man pages for pmcd(1) and PCPIntro(1) for information on how to change the timeout settings for the various PCP tools.
The resource has failed over (for rmvis(1)).
In this case, restart rmvis(1) so that a new picture of the cluster can be generated.
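For the case where the collector daemons have been killed, the choice between the two recovery commands given earlier can be sketched as a short script. The function name choose_recovery and the ps/awk process check are assumptions made for illustration and are not part of PCP for FailSafe. The script only prints the suggested command, which must then be run as root.

```shell
#!/bin/sh
# Map the number of running pmcd processes to the recovery command.
choose_recovery() {
    if [ "$1" -gt 0 ]; then
        echo "killall -HUP pmcd"        # pmcd is running: send it SIGHUP
    else
        echo "/etc/init.d/pcp start"    # pmcd is not running: restart PCP
    fi
}

# Count processes whose command name is exactly "pmcd".
count=$(ps -e | awk '$NF == "pmcd" { n++ } END { print n + 0 }')
echo "Suggested command: $(choose_recovery "$count")"
```

Keeping the decision in a function separate from the process check makes it easy to substitute a different way of detecting pmcd if ps output differs on your system.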