B    Troubleshooting Cluster Installation

This appendix describes problems that can occur during installation and how to deal with them.

B.1    Troubleshooting the LAN Interconnect

This section discusses the problems that can occur due to a misconfigured LAN interconnect and how you can resolve them.

B.1.1    Conflict with Default Physical Cluster Interconnect IP Name

In clusters with a LAN interconnect, the default physical cluster interconnect IP name has the form membermemberID-icstcp0, where memberID is the member's member ID (for example, member2-icstcp0 for the member with member ID 2).

The clu_create and clu_add_member commands use the ping command to determine whether the default name is already in use on the network. If this check finds a host that is already using the default IP name, you see the following prompt:

Enter the physical cluster interconnect interface device name []
 

After displaying this prompt, the command fails. Depending on which command was executing at the time of the failure, you see one of the following messages:

Error: clu_create: Bad configuration
 
Error: clu_add_member: Bad configuration
 

If you see either of these messages, look in /cluster/admin/clu_create.log or /cluster/admin/clu_add_member.log, as appropriate, for the following error message:

Error: A system with the name 'membermemberID-icstcp0' is currently running on your network.
 

If you find this message, contact your network administrator about changing the hostname of the non-cluster system already using the default IP name. The clu_create and clu_add_member commands do not allow you to change the default physical cluster interconnect IP name.
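
You can confirm the conflict manually before rerunning the command. The following check is a sketch that assumes the new member will have member ID 2; substitute the default name for your member ID:

# ping -c 2 member2-icstcp0

If a host answers, the default name is in use and the conflict must be resolved before clu_create or clu_add_member can complete. (If your version of ping does not support the -c option, interrupt the command with Ctrl/C.)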

B.1.2    Booting Member Joins Cluster But Appears to Hang Before Reaching Multi-User Mode

If a new member appears to hang at boot time sometime after joining the cluster, the speed or operational mode of the booting member's LAN interconnect adapter is probably inconsistent with that of the LAN interconnect. This problem can result from the adapter failing to autonegotiate properly, from improper hardware settings, or from faulty Ethernet hardware. To determine whether this problem exists, pay close attention to console messages of the following form on the booting member:

ee0: Parallel Detection, 10 Mbps half duplex
ee0: Autonegotiated, 100 Mbps full duplex
 

For a cluster interconnect running at 100 Mb/s in full-duplex mode, the first message indicates a problem: parallel detection means that autonegotiation did not take place and the adapter has fallen back to 10 Mb/s half-duplex operation. The second message indicates that autonegotiation completed successfully.

The autonegotiation behavior of the Ethernet adapters and switches that are configured in the interconnect can cause unexpected hangs at boot time. In general, the adapter and the switch port must agree: either both must autonegotiate, or both must be hard-set to the same speed and duplex mode, as in the sketch that follows.
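
If an adapter does not autonegotiate reliably, one option is to hard-set its speed and mode with the lan_config command. The following is a sketch only; the interface name ee0 is an assumption, and you should verify the options against lan_config(8) before using them:

# lan_config -i ee0 -a 0 -s 100 -x 1

This disables autonegotiation (-a 0) and forces 100 Mb/s (-s 100) in full-duplex mode (-x 1). If you hard-set the adapter, you must also hard-set the switch port to the same speed and mode.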

B.1.3    Booting Member Hangs While Trying to Join Cluster

If a new member hangs at boot time while trying to join the cluster, the new member might be disconnected from the cluster interconnect. Typical causes include a loose or faulty cable, a connection to the wrong switch or hub, or incorrect interconnect attributes in the member's /etc/sysconfigtab file.

One of the following messages is typically displayed on the console:

CNX MGR: cannot form: quorum disk is in use.  Unable to establish contact
         with members using disk.
 

CNX MGR: Node pepperoni id 2 incarn 0xa3a71 attempting to form or join cluster deli
 

Perform the following steps to resolve this problem:

  1. Halt the booting member.

  2. Make sure the adapter is properly connected to the LAN interconnect.

  3. Mount the new member's boot partition on another member. For example, for the member with member ID 2:

    # mount root2_domain#root /mnt
     
    

  4. Examine the /mnt/etc/sysconfigtab file. The ics_ll_tcp attributes listed in Table C-2 must be set correctly to reflect the member's LAN interconnect interface (a sample stanza follows these steps).

  5. Edit /mnt/etc/sysconfigtab as appropriate.

  6. Unmount the member's boot partition:

    # umount /mnt
     
    

  7. Reboot the member.
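
For reference, the interconnect definition in /mnt/etc/sysconfigtab is a stanza of the following general shape. This sample is illustrative only; the interface name and the addresses are assumptions, and Table C-2 is the authoritative list of the ics_ll_tcp attributes and their meanings:

ics_ll_tcp:
    ics_tcp_adapter0 = ee0
    ics_tcp_inetaddr0 = 10.0.0.2
    ics_tcp_netmask0 = 255.255.255.0

The adapter name must match the device through which the member is actually cabled to the LAN interconnect.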

B.1.4    Booting Member Panics with "ics_ll_tcp" Message

If you boot a new member into the cluster and it panics with an "ics_ll_tcp: Unable to configure cluster interconnect network interface" message, you may have specified to clu_add_member, as the member's physical cluster interconnect interface, a device that does not exist; or the booting kernel may not contain the device driver that supports the cluster interconnect device.

Perform the following steps to resolve this problem:

  1. Halt the booting member.

  2. Mount the new member's boot partition on another member. For example:

    # mount root2_domain#root /mnt
     
    

  3. Examine the /mnt/etc/sysconfigtab file. The ics_ll_tcp attributes listed in Table C-2 must be set to correctly reflect the member's LAN interconnect interface.

If the specified interface does not exist, do the following:

  1. Edit /mnt/etc/sysconfigtab as appropriate.

  2. Unmount the member's boot partition:

    # umount /mnt
     
    

  3. Reboot the member.

If the interface name is correct, the vmunix kernel may not contain the device driver for the LAN interconnect device. To rectify this problem, do the following:

  1. Boot the member on the genvmunix kernel.

  2. Edit the /sys/conf/HOSTNAME file, where HOSTNAME is the member's hostname in uppercase, and add the missing driver. (A condensed session follows these steps.)

  3. Rebuild the vmunix kernel using the doconfig command.

  4. Copy the new kernel to the root (/) directory.

  5. Reboot the member from its vmunix kernel.
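
The following condensed session sketches steps 1 through 5, assuming the member's hostname is PEPPERONI; the boot device and the editor are illustrative, and the console boot syntax varies by platform:

>>> boot -file genvmunix dka0
# vi /sys/conf/PEPPERONI
# doconfig -c PEPPERONI
# cp /sys/PEPPERONI/vmunix /vmunix
# shutdown -r now

The doconfig -c command rebuilds the kernel from the existing configuration file and reports the location of the new vmunix, which you then copy to / before rebooting.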

B.1.5    Booting Member Displays "ics_ll_tcp: ERROR: Could not create a NetRAIN set with the specified members" Message

If you boot a new member into the cluster and it displays the "ics_ll_tcp: ERROR: Could not create a NetRAIN set with the specified members" message shortly after the installation tasks commence, a NetRAIN virtual interface used for the cluster interconnect may have been misconfigured. You will also see this message if a member of the NetRAIN set has been misconfigured.

This situation typically arises when you have edited the /etc/rc.config file to apply traditional NetRAIN administration to the LAN interconnect. In this case, the NetRAIN configuration in the /etc/rc.config file is ignored and the NetRAIN interface defined in /etc/sysconfigtab is used as the cluster interconnect.

You must never configure a NetRAIN set that is used for a cluster interconnect in the /etc/rc.config file. A NetRAIN device for the cluster interconnect is set up completely within the ics_ll_tcp kernel subsystem in /etc/sysconfigtab and not in /etc/rc.config.

Perform the following steps to resolve this problem:

  1. Use the rcmgr delete command to edit the newly booted member's /cluster/members/{memb}/etc/rc.config file, removing the NRDEV_x, NRCONFIG_x, NETDEV_x, and IFCONFIG_x variables associated with the device (see the example after these steps).

  2. Use the rcmgr set command to decrement the NR_DEVICES and NUM_NETCONFIG variables so that they no longer count the NetRAIN device that is doubly defined as the cluster interconnect.

  3. Reboot the member.
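
For example, if the offending device was defined by the variables with index 1, commands like the following remove them and adjust the counts. The index and the count values shown are assumptions; check the member's rc.config file for the actual values:

# rcmgr delete NRDEV_1
# rcmgr delete NRCONFIG_1
# rcmgr delete NETDEV_1
# rcmgr delete IFCONFIG_1
# rcmgr get NR_DEVICES
1
# rcmgr set NR_DEVICES 0
# rcmgr get NUM_NETCONFIG
2
# rcmgr set NUM_NETCONFIG 1

Because /etc/rc.config is a context-dependent symbolic link (CDSL), running rcmgr on the member itself edits that member's /cluster/members/{memb}/etc/rc.config file.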

B.2    Dealing with Other Issues

B.2.1    Booting a New Member Without a Cluster License Displays ATTENTION Message

When you boot a newly added member, the clu_check_config utility performs a series of configuration checks. If you have not yet installed the TruCluster Server license, the TCS-UA product authorization key (PAK), on the member, the boot procedure will display the following messages:

Starting Cluster Configuration Check...
The boottime cluster check found a potential problem.
For details search for !!!!!ATTENTION!!!!! in /cluster/admin/clu_check_log_hostname
check_cdsl_config : Boot Mode : Running /usr/sbin/cdslinvchk in the background
check_cdsl_config : Results can be found in : /var/adm/cdsl_check_list
clu_check_config : no configuration errors or warnings were detected
 

The following message appears in the /cluster/admin/clu_check_log_hostname file:

/usr/sbin/caad is NOT_RUNNING !!!!!ATTENTION!!!!!
 

When the TruCluster Server license is not configured on a member, the cluster application availability (CAA) daemon (caad) is not automatically started on that member. This is normal and expected behavior.

If you did not configure the license from within clu_add_member when you added the new member (as discussed in Chapter 5), you can configure it later using the lmf register command. After the license has been installed, start the CAA daemon on that member using the /usr/sbin/caad command.
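
For example, assuming the PAK data has been saved in a file named /tmp/tcs-ua.pak (the file name is an assumption; running lmf register with no arguments instead opens an editor template for you to fill in):

# lmf register - < /tmp/tcs-ua.pak
# lmf reset
# /usr/sbin/caad

The lmf reset command copies the registered license data into the kernel license cache, after which the CAA daemon can be started by hand.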