|Document revision date: 15 July 2002|
Port drivers detect certain error conditions and attempt to log them. The port driver attempts both OPA0 error broadcasting and standard error logging under any of the following circumstances:
Note the implicit assumption that the system and error-logging devices are one and the same.
The following table describes error-logging methods and their reliability.
|Standard error logging to an error-logging device.||Under some circumstances, attempts to log errors to the error-logging device can fail. Such failures can occur because the error-logging device is not accessible when attempts are made to log the error condition.||Because of the central role that the port device plays in clusters, the loss of error-logged information in such cases makes it difficult to diagnose and fix problems.|
|Broadcasting selected information about the error condition to OPA0. (This is in addition to the port driver's attempt to log the error condition to the error-logging device.)||This method of reporting errors is not entirely reliable, because some error conditions may not be reported due to the way OPA0 error broadcasting is performed. This situation occurs whenever a second error condition is detected before the port driver has been able to broadcast the first error condition to OPA0. In such a case, only the first error condition is reported to OPA0, because that condition is deemed to be the more important one.||This second, redundant method of error logging captures at least some of the information about port-device error conditions that would otherwise be lost.|
Note: Certain error conditions are always broadcast to
OPA0, regardless of whether the error-logging device is accessible. In
general, these are errors that cause the port to shut down either
permanently or temporarily.
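The loss of a second error condition described in the table can be illustrated with a small model. This is only a sketch, assuming a single pending-broadcast slot; the class and method names are hypothetical, and the port driver's actual mechanism is not described in this manual.

```python
from typing import List, Optional


class Opa0Broadcaster:
    """Toy model (assumption): one pending slot; later errors are dropped while it is full."""

    def __init__(self) -> None:
        self.pending: Optional[str] = None  # error waiting to be broadcast
        self.console: List[str] = []        # messages that reached OPA0
        self.dropped: List[str] = []        # errors that were never broadcast

    def detect(self, error: str) -> None:
        # A new error is kept only if no earlier error is still waiting.
        if self.pending is None:
            self.pending = error
        else:
            self.dropped.append(error)

    def broadcast(self) -> None:
        # Deliver the waiting error, if any, and free the slot.
        if self.pending is not None:
            self.console.append(self.pending)
            self.pending = None


b = Opa0Broadcaster()
b.detect("Virtual Circuit Timeout")
b.detect("Path #0. Has gone from GOOD to BAD")  # detected before the first is broadcast
b.broadcast()
print(b.console)
print(b.dropped)
```

In this model only the first condition reaches the console, which matches the behavior the table describes; the second condition would still be recorded if standard error logging succeeded.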
C.12.1 OPA0 Error Messages
One OPA0 error message for each error condition is always logged. The text of each error message is similar to the text in the summary displayed by formatting the corresponding standard error-log entry using the Error Log utility. (See Section C.11.7 for a list of Error Log utility summary messages and their explanations.)
Table C-8 lists the OPA0 error messages. The table is divided into units by error type. Many of the OPA0 error messages contain some optional information, such as the remote port number, CI packet information (flags, port operation code, response status, and port number fields), or specific CI port registers. The codes specify whether the message is always logged on OPA0 or is logged only when the system device is inaccessible.
|Error Message||Logged or Inaccessible|
|Software Errors During Initialization|
|%Pxxn, Insufficient Non-Paged Pool for Initialization||Logged|
|%Pxxn, Failed to Locate Port Micro-code Image||Logged|
|%Pxxn, SCSSYSTEMID has NOT been set to a Non-Zero Value||Logged|
|%Pxxn, BIIC failure---BICSR/BER/CNF xxxxxx/xxxxxx/xxxxxx||Logged|
|%Pxxn, Micro-code Verification Error||Logged|
|%Pxxn, Port Transition Failure---CNF/PMC/PSR xxxxxx/xxxxxx/xxxxxx||Logged|
|%Pxxn, Port Error Bit(s) Set---CNF/PMC/PSR xxxxxx/xxxxxx/xxxxxx||Logged|
|%Pxxn, Port Power Down||Logged|
|%Pxxn, Port Power Up||Logged|
|%Pxxn, Unexpected Interrupt---CNF/PMC/PSR xxxxxx/xxxxxx/xxxxxx||Logged|
|%Pxxn, CI Port Timeout||Logged|
|%Pxxn, CI port ucode not at required rev level.---RAM/PROM rev is xxxx/xxxx||Logged|
|%Pxxn, CI port ucode not at current rev level.---RAM/PROM rev is xxxx/xxxx||Logged|
|%Pxxn, CPU ucode not at required rev level for CI activity||Logged|
|Queue Interlock Failures|
|%Pxxn, Message Free Queue Remove Failure||Logged|
|%Pxxn, Datagram Free Queue Remove Failure||Logged|
|%Pxxn, Response Queue Remove Failure||Logged|
|%Pxxn, High Priority Command Queue Insert Failure||Logged|
|%Pxxn, Low Priority Command Queue Insert Failure||Logged|
|%Pxxn, Message Free Queue Insert Failure||Logged|
|%Pxxn, Datagram Free Queue Insert Failure||Logged|
|Errors Signaled with a CI Packet|
|%Pxxn, Unrecognized SCA Packet---FLAGS/OPC/STATUS/PORT xx/xx/xx/xx||Logged|
|%Pxxn, Port has Closed Virtual Circuit---REMOTE PORT 1 xxx||Logged|
|%Pxxn, Software Shutting Down Port||Logged|
|%Pxxn, Software is Closing Virtual Circuit---REMOTE PORT 1 xxx||Logged|
|%Pxxn, Received Connect Without Path-Block---FLAGS/OPC/STATUS/PORT xx/xx/xx/xx||Logged|
|%Pxxn, Inappropriate SCA Control Message---FLAGS/OPC/STATUS/PORT xx/xx/xx/xx||Logged|
|%Pxxn, No Path-Block During Virtual Circuit Close---REMOTE PORT 1 xxx||Logged|
|%Pxxn, HSC Error Logging Datagram Received---REMOTE PORT 1 xxx||Inaccessible|
|%Pxxn, Remote System Conflicts with Known System---REMOTE PORT 1 xxx||Logged|
|%Pxxn, Virtual Circuit Timeout---REMOTE PORT 1 xxx||Logged|
|%Pxxn, Parallel Path is Closing Virtual Circuit---REMOTE PORT 1 xxx||Logged|
|%Pxxn, Insufficient Nonpaged Pool for Virtual Circuits||Logged|
|Cable Change-of-State Notification|
|%Pxxn, Path #0. Has gone from GOOD to BAD---REMOTE PORT 1 xxx||Inaccessible|
|%Pxxn, Path #1. Has gone from GOOD to BAD---REMOTE PORT 1 xxx||Inaccessible|
|%Pxxn, Path #0. Has gone from BAD to GOOD---REMOTE PORT 1 xxx||Inaccessible|
|%Pxxn, Path #1. Has gone from BAD to GOOD---REMOTE PORT 1 xxx||Inaccessible|
|%Pxxn, Cables have gone from UNCROSSED to CROSSED---REMOTE PORT 1 xxx||Inaccessible|
|%Pxxn, Cables have gone from CROSSED to UNCROSSED---REMOTE PORT 1 xxx||Inaccessible|
|%Pxxn, Path #0. Loopback has gone from GOOD to BAD---REMOTE PORT 1 xxx||Logged|
|%Pxxn, Path #1. Loopback has gone from GOOD to BAD---REMOTE PORT 1 xxx||Logged|
|%Pxxn, Path #0. Loopback has gone from BAD to GOOD---REMOTE PORT 1 xxx||Logged|
|%Pxxn, Path #1. Loopback has gone from BAD to GOOD---REMOTE PORT 1 xxx||Logged|
|%Pxxn, Path #0. Has become working but CROSSED to Path #1.---REMOTE PORT 1 xxx||Inaccessible|
|%Pxxn, Path #1. Has become working but CROSSED to Path #0.---REMOTE PORT 1 xxx||Inaccessible|
PMC---port maintenance and control register
PSR---port status register
See also the CI hardware documentation for a detailed description of the CI port registers.
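When console output is captured for later review, the optional register fields in these messages can be separated from the message text. The following is a minimal sketch, assuming the slash-separated layout shown in Table C-8; the function name, the device name PAA0, and the register values are hypothetical.

```python
from typing import Dict


def parse_opa0(line: str) -> Dict[str, object]:
    """Split an OPA0 port message into device, text, and optional fields."""
    head, _, detail = line.partition("---")
    device, _, text = head.partition(", ")
    fields: Dict[str, str] = {}
    detail = detail.strip()
    if detail:
        # Labels and values are each slash-separated, e.g. CNF/PMC/PSR a/b/c.
        labels, _, values = detail.rpartition(" ")
        names = labels.split("/")
        vals = values.split("/")
        if len(names) == len(vals):
            fields = dict(zip(names, vals))
        else:
            fields = {labels: values}  # keep unparsed detail as a single field
    return {"device": device.lstrip("%"), "text": text, "fields": fields}


# Hypothetical sample lines; the register values are made up for illustration.
with_regs = parse_opa0("%PAA0, Port Error Bit(s) Set---CNF/PMC/PSR 00ABCD/001234/000001")
plain = parse_opa0("%PAA0, CI Port Timeout")
print(with_regs)
print(plain)
```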
Two other messages concerning the CI port appear on OPA0:
%Pxxn, CI port is reinitializing (xxx retries left.)
%Pxxn, CI port is going off line.
The first message indicates that a previous error requiring the port to shut down is recoverable and that the port will be reinitialized. The "xxx retries left" specifies how many more reinitializations are allowed before the port must be left permanently off line. Each reinitialization of the port (for reasons other than power fail recovery) causes approximately 2 KB of nonpaged pool to be lost.
The second message indicates that a previous error is not recoverable and that the port will be left off line. In this case, the only way to recover the port is to reboot the computer.
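The nonpaged pool cost of repeated reinitializations can be estimated from the retry count in the first message. This is a rough sketch: the initial retry count is not stated here and is passed in as a parameter, and the estimate assumes that none of the reinitializations were for power-fail recovery.

```python
POOL_LOST_PER_REINIT_KB = 2  # approximate per-reinitialization loss quoted in the text


def pool_lost_kb(initial_retries: int, retries_left: int) -> int:
    """Approximate nonpaged pool (in KB) lost to reinitializations so far."""
    if not 0 <= retries_left <= initial_retries:
        raise ValueError("retries_left must be between 0 and initial_retries")
    return (initial_retries - retries_left) * POOL_LOST_PER_REINIT_KB


# Hypothetical numbers: 50 retries allowed at the start, 47 reported as left.
print(pool_lost_kb(initial_retries=50, retries_left=47))
```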
Sample programs are provided to start and stop the NISCA protocol on a LAN adapter, and to enable LAN network failure analysis. The following programs are located in SYS$EXAMPLES:
|LAVC$START_BUS.MAR||Starts the NISCA protocol on a specified LAN adapter.|
|LAVC$STOP_BUS.MAR||Stops the NISCA protocol on a specified LAN adapter.|
|LAVC$FAILURE_ANALYSIS.MAR||Enables LAN network failure analysis.|
|LAVC$BUILD.COM||Assembles and links the sample programs.|
Reference: The NISCA protocol, responsible for
carrying messages across Ethernet and FDDI LANs to other nodes in the
cluster, is described in Appendix F.
D.1 Purpose of Programs
The port emulator driver, PEDRIVER, starts the NISCA protocol on all of the LAN adapters in the cluster. LAVC$START_BUS.MAR and LAVC$STOP_BUS.MAR are provided for cluster managers who want to split the network load according to protocol type and therefore do not want the NISCA protocol running on all of the LAN adapters.
Reference: See Section D.5 for information about
editing and using the network failure analysis program.
D.2 Starting the NISCA Protocol
The sample program LAVC$START_BUS.MAR, provided in SYS$EXAMPLES, starts the NISCA protocol on a specific LAN adapter.
To build the program, perform the following steps:
|1||Copy the files LAVC$START_BUS.MAR and LAVC$BUILD.COM from SYS$EXAMPLES to your local directory.|
|2||Assemble and link the sample program using the following command:|
$ @LAVC$BUILD.COM LAVC$START_BUS.MAR
To start the protocol on a LAN adapter, perform the following steps:
|1||Use an account that has the PHY_IO privilege---you need this to run LAVC$START_BUS.EXE.|
|2||Define the foreign command (DCL symbol).|
|3||Execute the foreign command (LAVC$START_BUS.EXE), followed by the name of the LAN adapter on which you want to start the protocol.|
Example: The following example shows how to start the NISCA protocol on LAN adapter ETA0:
$ START_BUS:==$SYS$DISK:[ ]LAVC$START_BUS.EXE
$ START_BUS ETA
D.3 Stopping the NISCA Protocol
The sample program LAVC$STOP_BUS.MAR, provided in SYS$EXAMPLES, stops the NISCA protocol on a specific LAN adapter.
Caution: Stopping the NISCA protocol on all LAN adapters causes satellites to hang and could cause cluster systems to fail with a CLUEXIT bugcheck.
Follow the steps below to build the program:
|1||Copy the files LAVC$STOP_BUS.MAR and LAVC$BUILD.COM from SYS$EXAMPLES to your local directory.|
|2||Assemble and link the sample program using the following command:|
$ @LAVC$BUILD.COM LAVC$STOP_BUS.MAR
D.3.1 Stop the Protocol
To stop the NISCA protocol on a LAN adapter, perform the following steps:
|1||Use an account that has the PHY_IO privilege---you need this to run LAVC$STOP_BUS.EXE.|
|2||Define the foreign command (DCL symbol).|
|3||Execute the foreign command (LAVC$STOP_BUS.EXE), followed by the name of the LAN adapter on which you want to stop the protocol.|
Example: The following example shows how to stop the NISCA protocol on LAN adapter ETA0:
$ STOP_BUS:==$SYS$DISK:[ ]LAVC$STOP_BUS.EXE
$ STOP_BUS ETA
When the LAVC$STOP_BUS module executes successfully, the following device-attention entry is written to the system error log:
DEVICE ATTENTION...
NI-SCS SUB-SYSTEM...
FATAL ERROR DETECTED BY DATALINK...
In addition, the following hexadecimal values are written to the STATUS field of the entry:
First longword (00000001)
Second longword (00001201)
The error-log entry indicates expected behavior and can be ignored.
However, if the first longword of the STATUS field contains a value
other than hexadecimal value 00000001, an error has occurred and
further investigation may be necessary.
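The check on the first STATUS longword can be expressed as a one-line comparison. The sketch below assumes the longword has already been copied from the formatted error-log entry as a hexadecimal string; the function name is hypothetical.

```python
EXPECTED_FIRST_LONGWORD = 0x00000001  # value the text gives for a successful stop


def stop_bus_entry_is_expected(first_longword_hex: str) -> bool:
    """Return True if the first STATUS longword indicates expected behavior."""
    return int(first_longword_hex, 16) == EXPECTED_FIRST_LONGWORD


print(stop_bus_entry_is_expected("00000001"))
print(stop_bus_entry_is_expected("00000000"))  # any other value warrants investigation
```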
D.4 Analyzing Network Failures
LAVC$FAILURE_ANALYSIS.MAR is a sample program, located in SYS$EXAMPLES,
that you can edit and use to help detect and isolate a failed network
component. When the program executes, it provides the physical
description of your cluster communications network to the set of
routines that perform the failure analysis.
D.4.1 Failure Analysis
Using the network failure analysis program can help reduce the time
necessary for detection and isolation of a failing network component
and, therefore, significantly increase cluster availability.
D.4.2 How the LAVC$FAILURE_ANALYSIS Program Works
The following table describes how the LAVC$FAILURE_ANALYSIS program works.
|1||The program groups channels that fail and compares them with the physical description of the cluster network.|
|2||The program then develops a list of nonworking network components related to the failed channels and uses OPCOM messages to display the names of components with a probability of causing one or more channel failures.
If the network failure analysis cannot verify that a portion of a path (containing multiple components) works, the program:|
|3||When the component works again, OPCOM displays the message %LAVC-S-WORKING.|
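The idea of comparing failed channels against the physical description can be sketched as a simple set operation. This is an illustration only, not the algorithm used by LAVC$FAILURE_ANALYSIS: it assumes that a component is a suspect if it lies on a failed channel and on no working channel. All labels are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Channel = Tuple[str, str]


def suspects(paths: Dict[Channel, List[str]], failed: Set[Channel]) -> List[str]:
    """Components on failed channels that no working channel vouches for."""
    working_parts: Set[str] = set()
    for channel, parts in paths.items():
        if channel not in failed:
            working_parts.update(parts)
    result: Set[str] = set()
    for channel in failed:
        result.update(p for p in paths[channel] if p not in working_parts)
    return sorted(result)


# Hypothetical physical description: each channel lists the components on its path.
paths = {
    ("A1", "B1"): ["A1", "CABLE1", "HUB", "CABLE2", "B1"],
    ("A1", "C1"): ["A1", "CABLE1", "HUB", "CABLE3", "C1"],
    ("B1", "C1"): ["B1", "CABLE2", "HUB", "CABLE3", "C1"],
}
failed = {("A1", "B1"), ("A1", "C1")}
found = suspects(paths, failed)
print(found)
```

Because the B1-to-C1 channel still works, the hub and the other cables are cleared, leaving only adapter A1 and its cable as suspects.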
Table D-1 describes the steps you perform to edit and use the network failure analysis program.
|1||Collect and record information specific to your cluster communications network.||Section D.5.1|
|2||Edit a copy of LAVC$FAILURE_ANALYSIS.MAR to include the information you collected.||Section D.5.2|
|3||Assemble, link, and debug the program.||Section D.5.3|
|4||Modify startup files to run the program only on the node for which you supplied data.||Section D.5.4|
|5||Execute the program on one or more of the nodes where you plan to perform the network failure analysis.||Section D.5.5|
|6||Modify MODPARAMS.DAT to increase the values of nonpaged pool parameters.||Section D.5.6|
|7||Test the Local Area OpenVMS Cluster Network Failure Analysis Program.||Section D.5.7|
D.5.1 Create a Network Diagram
Follow the steps in Table D-2 to create a physical description of
the network configuration and include it in electronic form in the
LAVC$FAILURE_ANALYSIS.MAR program.
|1||Draw a diagram of your OpenVMS Cluster communications network.||
When you edit LAVC$FAILURE_ANALYSIS.MAR, you include this drawing (in
electronic form) in the program. Your drawing should show the physical
layout of the cluster and include the following components:
For large clusters, you may need to verify the configuration by tracing the cables.
|2||Give each component in the drawing a unique label.||If your OpenVMS Cluster contains a large number of nodes, you may want to replace each node name with a shorter abbreviation. Abbreviating node names can help save space in the electronic form of the drawing when you include it in LAVC$FAILURE_ANALYSIS.MAR. For example, you can replace the node name ASTRA with A and call node ASTRA's two LAN adapters A1 and A2.|
List the following information for each component:
||Devices such as DELNI interconnects, DEMPR repeaters, and cables do not have LAN addresses.|
Classify each component into one of the following categories:
||The cloud component is necessary only when multiple paths exist between two points within the network, such as with redundant bridging between LAN segments. At a high level, multiple paths can exist; however, during operation, this bridge configuration allows only one path to exist at one time. In general, this bridge example is probably better handled by representing the active bridge in the description as a component and ignoring the standby bridge. (You can identify the active bridge with such network monitoring software as RBMS or DECelms.) With the default bridge parameters, failure of the active bridge will be called out.|
|5||Use the component labels from step 3 to describe each of the connections in the OpenVMS Cluster communications network.|
|6||Choose a node or group of nodes to run the network failure analysis program.||
You should run the program only on a node that you included in the
physical description when you edited LAVC$FAILURE_ANALYSIS.MAR. The
network failure analysis program on one node operates independently
from other systems in the OpenVMS Cluster. So, for executing the
network failure analysis program, you should choose systems that are
not normally shut down. Other good candidates for running the program
are systems with the following characteristics:
Note: The physical description is loaded into nonpaged pool, and all processing is performed at IPL 8. CPU use increases as the average number of network components in the network path increases. CPU use also increases as the total number of network paths increases.
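Before transcribing the information collected in Table D-2 into LAVC$FAILURE_ANALYSIS.MAR, it can help to check it for consistency. The following sketch records labeled components and their connections and reports duplicate or unknown labels. The category names, the label DELNI1, and the LAN addresses are assumptions made for illustration; they are not the program's own data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Component:
    label: str                         # unique label from the drawing
    category: str                      # assumed names: "node", "adapter", "component"
    lan_address: Optional[str] = None  # None for cables, repeaters, and the like


def check_description(components: List[Component],
                      connections: List[Tuple[str, str]]) -> List[str]:
    """Report duplicate labels and connections that use unknown labels."""
    problems: List[str] = []
    labels = [c.label for c in components]
    for label in sorted(set(labels)):
        if labels.count(label) > 1:
            problems.append(f"duplicate label: {label}")
    known = set(labels)
    for a, b in connections:
        for end in (a, b):
            if end not in known:
                problems.append(f"connection uses unknown label: {end}")
    return problems


# Node ASTRA abbreviated to A, with adapters A1 and A2 (addresses are made up).
components = [
    Component("A", "node"),
    Component("A1", "adapter", "08-00-2B-00-00-01"),
    Component("A2", "adapter", "08-00-2B-00-00-02"),
    Component("DELNI1", "component"),
]
connections = [("A", "A1"), ("A", "A2"), ("A1", "DELNI1"), ("A2", "DELNI2")]
problems = check_description(components, connections)
print(problems)
```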