|Document revision date: 15 July 2002|
If a computer fails to join the cluster, follow the procedures in this
section to determine the cause.
C.4.1 Verifying OpenVMS Cluster Software Load
To verify that OpenVMS Cluster software has been loaded, follow these instructions:
|1||Look for connection manager (%CNXMAN) messages like those shown in Section C.1.2.|
|2||If no such messages are displayed, OpenVMS Cluster software probably was not loaded at boot time. Reboot the computer in conversational mode. At the SYSBOOT> prompt, set the VAXCLUSTER parameter to 2.|
|3||For OpenVMS Cluster systems communicating over the LAN or mixed interconnects, set NISCS_LOAD_PEA0 to 1 and VAXCLUSTER to 2. These parameters should also be set in the computer's MODPARAMS.DAT file. (For more information about booting a computer in conversational mode, consult your installation and operations guide.)|
|4||For OpenVMS Cluster systems on the LAN, verify that the cluster security database file (SYS$COMMON:CLUSTER_AUTHORIZE.DAT) exists and that you have specified the correct group number for this cluster (see Section 10.9.1).|
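The parameter settings named in the steps above can be checked offline against a copy of MODPARAMS.DAT. The following Python sketch is illustrative only: it assumes the simple NAME = value line form with "!" starting a comment, the function names are hypothetical, and the sample file contents are invented for the example.

```python
# Minimal sketch: report cluster parameters in a MODPARAMS.DAT excerpt that
# differ from the values this section calls for on LAN or mixed interconnects.
# Assumption: only plain "NAME = value" lines matter; "!" starts a comment.

REQUIRED_LAN = {"VAXCLUSTER": "2", "NISCS_LOAD_PEA0": "1"}


def parse_modparams(text):
    """Return {PARAM: value} for simple NAME = value lines."""
    params = {}
    for line in text.splitlines():
        line = line.split("!", 1)[0].strip()   # drop trailing comment
        if "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        params[name.upper()] = value.strip('"')
    return params


def missing_cluster_settings(text, required=REQUIRED_LAN):
    """Return the required settings that are absent or set to another value."""
    params = parse_modparams(text)
    return {name: want for name, want in required.items()
            if params.get(name) != want}


# Hypothetical file contents, for illustration only.
sample = """\
! hypothetical MODPARAMS.DAT excerpt
VAXCLUSTER = 2
NISCS_LOAD_PEA0 = 0   ! port emulator driver not loaded
"""
print(missing_cluster_settings(sample))
```

An empty result means both parameters already carry the values this section recommends. The sketch does not replace checking the active values at the SYSBOOT> prompt.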
To verify that the computer has booted from the correct disk and system root, follow these instructions:
|1||If %CNXMAN messages are displayed, and if, after the conversational reboot, the computer still does not join the cluster, check the console output on all active computers and look for messages indicating that one or more computers found a remote computer that conflicted with a known or local computer. Such messages suggest that two computers have booted from the same system root.|
|2||Review the boot command files for all CI computers and ensure that all are booting from the correct disks and from unique system roots.|
If you find it necessary to modify the computer's bootstrap command
procedure (console media), you may be able to do so on another
processor that is already running in the cluster.
Replace the running processor's console media with the media to be modified, and use the Exchange utility and a text editor to make the required changes. Consult the appropriate processor-specific installation and operations guide for information about examining and editing boot command files.
To be eligible to join a cluster, a computer must have unique SCSNODE and SCSSYSTEMID parameter values.
|1||Check that the current values do not duplicate any values set for existing OpenVMS Cluster computers. To check values, you can perform a conversational bootstrap operation.|
If the values of SCSNODE or SCSSYSTEMID are not unique, do either of
Note: To modify values, you can perform a conversational bootstrap operation. However, for reliable future bootstrap operations, specify appropriate values for these parameters in the computer's MODPARAMS.DAT file.
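If you keep a list of the SCSNODE and SCSSYSTEMID values assigned to each computer, the uniqueness requirement can be checked before booting a new member. The Python sketch below shows one way to do this. The inventory, node names, and function name are hypothetical, and comparing node names without regard to case is an assumption of the sketch.

```python
# Minimal sketch: find SCSNODE or SCSSYSTEMID values shared by more than one
# computer in a hand-maintained inventory. All data below is invented.
from collections import defaultdict


def find_duplicates(nodes):
    """nodes: iterable of (label, scsnode, scssystemid) tuples."""
    seen = {"SCSNODE": defaultdict(list), "SCSSYSTEMID": defaultdict(list)}
    for label, scsnode, scssystemid in nodes:
        # Assumption: node names are compared case-insensitively.
        seen["SCSNODE"][scsnode.strip().upper()].append(label)
        seen["SCSSYSTEMID"][int(scssystemid)].append(label)
    return {
        param: {value: labels for value, labels in values.items()
                if len(labels) > 1}
        for param, values in seen.items()
    }


inventory = [
    ("root SYS0", "JUPITR", 1025),
    ("root SYS1", "SATURN", 1026),
    ("new node", "jupitr", 1027),   # clashes with root SYS0 on SCSNODE
]
print(find_duplicates(inventory))
```

Any non-empty entry in the result identifies a value that must be changed before the computer is eligible to join the cluster.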
To verify the cluster group code and password, follow these instructions:
|1||Verify that the database file SYS$COMMON:CLUSTER_AUTHORIZE.DAT exists.|
For clusters with multiple system disks, ensure that the correct (same)
group number and password were specified for each.
Reference: See Section 10.9 to view the group number and to reset the password in the CLUSTER_AUTHORIZE.DAT file using the SYSMAN utility.
If a computer boots and joins the cluster but appears to hang before startup procedures complete---that is, before you are able to log in to the system---be sure that you have allowed sufficient time for the startup procedures to execute.
|The startup procedures fail to complete after a period that is normal for your site.||Try to access the procedures from another OpenVMS Cluster computer and make appropriate adjustments. For example, verify that all required devices are configured and available. One cause of such a failure could be the lack of some system resource, such as NPAGEDYN or page file space.|
|You suspect that the value for the NPAGEDYN parameter is set too low.||Perform a conversational bootstrap operation to increase it. Use SYSBOOT to check the current value, and then double the value.|
|You suspect a shortage of page file space, and another OpenVMS Cluster computer is available.||Log in on that computer and use the System Generation utility (SYSGEN) to provide adequate page file space for the problem computer. Note: Insufficient page-file space on the booting computer might cause other computers to hang.|
|The computer still cannot complete the startup procedures.||Contact your Compaq support representative.|
Section D.5 provides troubleshooting techniques for LAN component failures (for example, broken LAN bridges). That appendix also describes techniques for using the Local Area OpenVMS Cluster Network Failure Analysis Program.
Intermittent LAN component failures (for example, packet loss) can
cause problems in the NISCA transport protocol that delivers System
Communications Services (SCS) messages to other nodes in the OpenVMS
Cluster. Appendix F describes troubleshooting techniques and
requirements for LAN analyzer tools.
C.7 Diagnosing Cluster Hangs
Conditions like the following can cause an OpenVMS Cluster computer to suspend process or system activity (that is, to hang):
|Cluster quorum is lost.||Section C.7.1|
|A shared cluster resource is inaccessible.||Section C.7.2|
The OpenVMS Cluster quorum algorithm coordinates activity among OpenVMS Cluster computers and ensures the integrity of shared cluster resources. (The quorum algorithm is described fully in Chapter 2.) Quorum is checked after any change to the cluster configuration---for example, when a voting computer leaves or joins the cluster. If quorum is lost, process and I/O activity on all computers in the cluster are blocked.
Information about the loss of quorum and about clusterwide events that cause loss of quorum is sent to the OPCOM process, which broadcasts messages to designated operator terminals. The information is also broadcast to each computer's operator console (OPA0), unless broadcast activity is explicitly disabled on that terminal. However, because quorum may be lost before OPCOM has been able to inform the operator terminals, the messages sent to OPA0 are the most reliable source of information about events that cause loss of quorum.
If quorum is lost, you might add or reboot a node with additional votes.
Reference: See also the information about cluster
quorum in Section 10.12.
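The effect of a voting computer leaving the cluster can be worked through with a small calculation. The Python sketch below assumes the commonly documented rule that quorum equals (EXPECTED_VOTES + 2) / 2 with the fraction discarded. That rule is not stated in this appendix, so treat it as an assumption and confirm it against Chapter 2. The function names and vote counts are hypothetical.

```python
# Minimal sketch: decide whether the votes still present meet quorum.
# Assumption (not stated in this appendix): quorum = (EXPECTED_VOTES + 2) // 2.


def quorum_for(expected_votes):
    """Quorum value under the assumed rule."""
    return (expected_votes + 2) // 2


def activity_blocked(expected_votes, present_votes):
    """True when the votes of the remaining members fall below quorum."""
    return sum(present_votes) < quorum_for(expected_votes)


# Hypothetical three-member cluster, one vote each, EXPECTED_VOTES = 3.
print(quorum_for(3))                 # 2
print(activity_blocked(3, [1, 1]))   # False: two members still hold quorum
print(activity_blocked(3, [1]))      # True: one member alone is blocked
```

In the last case, adding or rebooting a node that contributes one more vote brings the total back to quorum, which is the recovery action this section suggests.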
C.7.2 Inaccessible Cluster Resource
Access to shared cluster resources is coordinated by the distributed lock manager. If a particular process is granted a lock on a resource (for example, a shared data file), other processes in the cluster that request incompatible locks on that resource must wait until the original lock is released. If the original process retains its lock for an extended period, other processes waiting for the lock to be released may appear to hang.
Occasionally, a system activity must acquire a restrictive lock on a resource for an extended period. For example, to perform a volume rebuild, system software takes out an exclusive lock on the volume being rebuilt. While this lock is held, no processes can allocate space on the disk volume. If they attempt to do so, they may appear to hang.
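The waiting behavior described above follows from lock-mode compatibility. The Python sketch below models it with the six lock modes usually documented for the distributed lock manager (NL, CR, CW, PR, PW, EX). The compatibility table is background knowledge rather than something stated in this appendix, so verify it against the lock manager documentation. The function name is hypothetical.

```python
# Minimal sketch: does a new lock request have to wait behind granted locks?
# Assumption: the standard six-mode compatibility table, reproduced from
# general lock manager documentation rather than from this appendix.

MODES = ("NL", "CR", "CW", "PR", "PW", "EX")

# COMPATIBLE[requested] = set of already granted modes it can coexist with.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def must_wait(requested, granted_modes):
    """True if any granted lock is incompatible with the requested mode."""
    return any(held not in COMPATIBLE[requested] for held in granted_modes)


# An exclusive lock, such as the one taken for a volume rebuild, blocks
# every request except a null-mode one.
for mode in MODES:
    print(mode, "waits" if must_wait(mode, ["EX"]) else "granted")
```

A process that issues a blocked request simply waits, which is why it can appear to hang even though the lock manager is working as designed.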
Access to files that contain data necessary for the operation of the system itself is coordinated by the distributed lock manager. For this reason, a process that acquires a lock on one of these resources and is then unable to proceed may cause the cluster to appear to hang.
For example, this condition may occur if a process locks a portion of the system authorization file (SYS$SYSTEM:SYSUAF.DAT) for write access. Any activity that requires access to that portion of the file, such as logging in to an account with the same or similar user name or sending mail to that user name, is blocked until the original lock is released. Normally, this lock is released quickly, and users do not notice the locking operation.
However, if the process holding the lock is unable to proceed, other
processes could enter a wait state. Because the authorization file is
used during login and for most process creation operations (for
example, batch and network jobs), blocked processes could rapidly
accumulate in the cluster. Because the distributed lock manager is
functioning normally under these conditions, users are not notified by
broadcast messages or other means that a problem has occurred.
C.8 Diagnosing CLUEXIT Bugchecks
The operating system performs bugcheck operations only
when it detects conditions that could compromise normal system activity
or endanger data integrity. A CLUEXIT bugcheck is a
type of bugcheck initiated by the connection manager, the OpenVMS
Cluster software component that manages the interaction of cooperating
OpenVMS Cluster computers. Most such bugchecks are triggered by
conditions resulting from hardware failures (particularly failures in
communications paths), configuration errors, or system management errors.
C.8.1 Conditions Causing Bugchecks
The most common conditions that result in CLUEXIT bugchecks are as follows:
|Possible Bugcheck Causes||Recommendations|
The cluster connection between two computers is broken for longer than
RECNXINTERVAL seconds. Thereafter, the connection is declared
irrevocably broken. If the connection is later reestablished, one of
the computers shuts down with a CLUEXIT bugcheck.
This condition can occur:
|Determine the cause of the interrupted connection and correct the problem. For example, if recovery from a power failure is longer than RECNXINTERVAL seconds, you may want to increase the value of the RECNXINTERVAL parameter on all computers.|
|Cluster partitioning occurs. A member of a cluster discovers or establishes connection to a member of another cluster, or a foreign cluster is detected in the quorum file.||Review the setting of EXPECTED_VOTES on all computers.|
|The value specified for the SCSMAXMSG system parameter on a computer is too small.||Verify that SCSMAXMSG on all OpenVMS Cluster computers is set to at least the default value.|
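The RECNXINTERVAL condition in the table above amounts to comparing the length of each connection outage with the parameter value. The Python sketch below illustrates that comparison. The outage durations, the RECNXINTERVAL value of 20, the safety margin, and the function names are all hypothetical choices made for the example.

```python
# Minimal sketch: classify connection outages against RECNXINTERVAL and
# estimate a value that would ride out the longest one observed.
# All durations and the margin below are invented for illustration.


def classify_outages(outages, recnxinterval):
    """Map each outage label to 'recoverable' or 'irrevocably broken'."""
    return {
        label: "irrevocably broken" if seconds > recnxinterval else "recoverable"
        for label, seconds in outages.items()
    }


def suggested_recnxinterval(outages, margin=10):
    """Longest observed outage plus a hypothetical safety margin, in seconds."""
    return max(outages.values()) + margin


observed = {"bridge reboot": 12, "power-failure recovery": 45}
print(classify_outages(observed, recnxinterval=20))
print(suggested_recnxinterval(observed))
```

An outage classified as irrevocably broken is the case in which a later reconnection ends with a CLUEXIT bugcheck on one of the two computers. Any new value should be applied on all computers, as the table recommends.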
These sections provide detailed information about port communications
to assist in diagnosing port communication problems.
C.9.1 Port Polling
Shortly after a CI computer boots, the CI port driver (PADRIVER) begins configuration polling to discover other active ports on the CI. Normally, the poller runs every 5 seconds (the default value of the PAPOLLINTERVAL system parameter). In the first polling pass, all addresses are probed over cable path A; on the second pass, all addresses are probed over path B; on the third pass, path A is probed again; and so on.
The poller probes by sending Request ID (REQID) packets to all possible port numbers, including its own. Active ports that receive a REQID return an ID Received (IDREC) packet to the port that issued it. A port might respond to a REQID even if the computer attached to the port is not running.
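The alternation between cable paths can be laid out as a simple timetable. The Python sketch below generates one, using the 5-second default interval given above. The port count of 16, the start time of zero, and the function name are assumptions made for the example.

```python
# Minimal sketch: timetable of configuration polling passes, alternating
# between cable paths A and B. port_count=16 is a hypothetical value.
from itertools import cycle


def polling_schedule(passes, interval=5, paths=("A", "B"), port_count=16):
    """Return (start_time_seconds, path, probed_ports) for each pass."""
    probed = list(range(port_count))   # every port number, including our own
    return [
        (n * interval, path, probed)
        for n, path in zip(range(passes), cycle(paths))
    ]


for start, path, ports in polling_schedule(4):
    print(f"t={start:>2}s  path {path}  REQID to {len(ports)} port numbers")
```

With these defaults a given path is probed once every two passes, so a repaired cable may take up to two polling intervals to be noticed in this simplified model.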
For OpenVMS Cluster systems communicating over the CI, DSSI, or a
combination of these interconnects, the port drivers perform a start
handshake when a pair of ports and port drivers has successfully
exchanged ID packets. The port drivers exchange datagrams containing
information about the computers, such as the type of computer and the
operating system version. If this exchange is successful, each computer
declares a virtual circuit open. An open virtual circuit is
prerequisite to all other activity.
C.9.2 LAN Communications
For clusters that include Ethernet or FDDI interconnects, a multicast scheme is used to locate computers on the LAN. Approximately every 3 seconds, the port emulator driver (PEDRIVER) sends a HELLO datagram message through each LAN adapter to a cluster-specific multicast address that is derived from the cluster group number. The driver also enables the reception of these messages from other computers. When the driver receives a HELLO datagram message from a computer with which it does not currently share an open virtual circuit, it attempts to create a circuit. HELLO datagram messages received from a computer with a currently open virtual circuit indicate that the remote computer is operational.
A standard, three-message exchange handshake is used to create a
virtual circuit. The handshake messages contain information about the
transmitting computer and its record of the cluster password. These
parameters are verified at the receiving computer, which continues the
handshake only if its verification is successful. Thus, each computer
authenticates the other. After the final message, the virtual circuit
is opened for use by both computers.
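The HELLO processing and handshake described above can be summarized as a small decision procedure. The Python sketch below is a simplified model: it compares stored passwords directly, whereas the real handshake messages carry each computer's record of the password in a form this appendix does not describe. The class, function, node names, and passwords are hypothetical.

```python
# Minimal sketch: what a computer does when it receives a HELLO datagram.
# Simplification: passwords are compared directly; message contents and
# names are not modeled. All node names and passwords are invented.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    password: str
    open_circuits: set = field(default_factory=set)

    def verifies(self, peer):
        """Check the peer's record of the cluster password against our own."""
        return peer.password == self.password


def on_hello(receiver, sender):
    """Handle one HELLO from sender as seen by receiver."""
    if sender.name in receiver.open_circuits:
        return "peer alive"                 # circuit already open
    # Three-message exchange: each side verifies the other before the
    # final message opens the circuit.
    if not sender.verifies(receiver):       # peer checks first message
        return "handshake abandoned"
    if not receiver.verifies(sender):       # we check the reply
        return "handshake abandoned"
    receiver.open_circuits.add(sender.name) # final message: both sides open
    sender.open_circuits.add(receiver.name)
    return "circuit opened"


a, b, c = Node("NODEA", "secret"), Node("NODEB", "secret"), Node("NODEC", "other")
print(on_hello(a, b))   # circuit opened
print(on_hello(a, b))   # peer alive
print(on_hello(a, c))   # handshake abandoned
```

The third case corresponds to a computer whose CLUSTER_AUTHORIZE.DAT holds a different password: it is seen on the LAN but never gets a virtual circuit.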
C.9.3 System Communications Services (SCS) Connections
System services such as the disk class driver, connection manager, and the MSCP and TMSCP servers communicate between computers with a protocol called System Communications Services (SCS). SCS is responsible primarily for forming and breaking intersystem process connections and for controlling flow of message traffic over those connections. SCS is implemented in the port driver (for example, PADRIVER, PBDRIVER, PEDRIVER, PIDRIVER), and in a loadable piece of the operating system called SCSLOA.EXE (loaded automatically during system initialization).
When a virtual circuit has been opened, a computer periodically probes
a remote computer for system services that the remote computer may be
offering. The SCS directory service, which makes known services that a
computer is offering, is always present on both computers and HSC
subsystems. As system services discover their counterparts on other
computers and HSC subsystems, they establish SCS connections to each
other. These connections are full duplex and are associated with a
particular virtual circuit. Multiple connections are typically
associated with a virtual circuit.
C.10 Diagnosing Port Failures
This section describes the hierarchy of communication paths and
describes where failures can occur.
C.10.1 Hierarchy of Communication Paths
Taken together, SCS, the port drivers, and the port itself support a hierarchy of communication paths. Starting with the most fundamental level, these are as follows:
Failures can occur at each communication level and in each component. Failures at one level translate into failures elsewhere, as described in Table C-3.
|Wires||If the LAN fails or is disconnected, LAN traffic stops or is interrupted, depending on the nature of the failure. For the CI, either path A or B can fail while the virtual circuit remains intact. All traffic is directed over the remaining good path. When the wire is repaired, the repair is detected automatically by port polling, and normal operations resume on all ports.|
If no path works between a pair of ports, the virtual circuit fails and
is closed. A path failure is discovered as follows:
When a virtual circuit fails, every SCS connection on it is closed. The software automatically reestablishes connections when the virtual circuit is reestablished. Normally, reestablishing a virtual circuit takes several seconds after the problem is corrected.
|CI port||If a port fails, all virtual circuits to that port fail, and all SCS connections on those virtual circuits are closed. If the port is successfully reinitialized, virtual circuits and connections are reestablished automatically. Normally, port reinitialization and reestablishment of connections take several seconds.|
|LAN adapter||If a LAN adapter device fails, attempts are made to restart it. If repeated attempts fail, all channels using that adapter are broken. A channel is a pair of LAN addresses, one local and one remote. If the last open channel for a virtual circuit fails, the virtual circuit is closed and the connections are broken.|
|SCS connection||When the software protocols fail or, in some instances, when the software detects a hardware malfunction, a connection is terminated. Other connections are usually unaffected, as is the virtual circuit. Breaking of connections is also used under certain conditions as an error recovery mechanism---most commonly when there is insufficient nonpaged pool available on the computer.|
|Computer||If a computer fails because of operator shutdown, bugcheck, or halt, all other computers in the cluster record the shutdown as failures of their virtual circuits to the port on the computer that was shut down.|
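The way failures at one level translate into failures at other levels can be modeled with a few lines of code. The Python sketch below is a simplified data model, not a description of the driver internals: the class and function names are hypothetical, "paths" stands for either CI cable paths or LAN channels, and the connection labels are invented.

```python
# Minimal sketch: propagation of failures between paths, virtual circuits,
# and SCS connections. Names and labels are hypothetical.
from dataclasses import dataclass, field


@dataclass
class VirtualCircuit:
    paths: set                                    # usable paths or channels
    connections: set = field(default_factory=set) # SCS connections in use
    is_open: bool = True

    def close(self):
        """Close the circuit and every connection on it."""
        closed = sorted(self.connections)
        self.connections.clear()
        self.is_open = False
        return closed

    def path_failed(self, path):
        """Losing the last working path closes the circuit."""
        self.paths.discard(path)
        return self.close() if self.is_open and not self.paths else []

    def connection_failed(self, name):
        """One connection breaks; the circuit and other connections remain."""
        self.connections.discard(name)


def port_failed(circuits):
    """A port failure closes every virtual circuit to that port."""
    return {remote: vc.close() for remote, vc in circuits.items()}


vc = VirtualCircuit(paths={"A", "B"},
                    connections={"connection manager", "disk class driver"})
print(vc.path_failed("A"), vc.is_open)   # [] True
vc.connection_failed("disk class driver")
print(vc.is_open)                        # True
print(vc.path_failed("B"), vc.is_open)   # ['connection manager'] False
```

The model leaves out recovery. In the real system, connections are reestablished automatically once the virtual circuit reopens.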
Before you boot a CI-connected computer into a cluster, if the computer is new, just repaired, or suspected of having a problem, have Compaq services verify that it runs correctly on its own.