B    Troubleshooting Cluster Installation

This appendix describes problems that can occur during installation and how to deal with them.

B.1    Troubleshooting the LAN Interconnect

This section discusses the problems that can occur due to a misconfigured LAN interconnect and how you can resolve them.

B.1.1    Conflict with Default Physical Cluster Interconnect IP Name

In clusters with a LAN interconnect, the default physical cluster interconnect IP name has the form membermemberID-icstcp0, where memberID is the member's member ID (for example, member2-icstcp0 for the member with member ID 2).

The clu_create and clu_add_member commands use the ping command to determine whether the default name is already in use on the network. If this check finds a host that is already using the default IP name, you see the following prompt:

Enter the physical cluster interconnect interface device name []
 

After displaying this prompt, the command fails. Depending on which command was executing at the time of the failure, you see one of the following messages:

Error: clu_create: Bad configuration
 
Error: clu_add_member: Bad configuration
 

If you see either of these messages, look in /cluster/admin/clu_create.log or /cluster/admin/clu_add_member.log, as appropriate, for the following error message:

Error: A system with the name 'membermemberID-icstcp0' is currently running on your network.
 

If you find this message, contact your network administrator about changing the hostname of the non-cluster system already using the default IP name. The clu_create and clu_add_member commands do not allow you to change the default physical cluster interconnect IP name.
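
You can confirm the conflict manually before rerunning the command. The following check is a sketch that assumes the new member will have member ID 2; substitute the default name for your member ID:

# ping -c 2 member2-icstcp0

If a host answers, the default name is in use and the conflict must be resolved before clu_create or clu_add_member can complete. (If your version of ping does not support the -c option, interrupt the command with Ctrl/C.)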

B.1.2    Booting Member Joins Cluster But Appears to Hang Before Reaching Multi-User Mode

If a new member appears to hang at boot time sometime after joining the cluster, the speed or operational mode of the booting member's LAN interconnect adapter is probably inconsistent with that of the LAN interconnect. This problem can result from the adapter failing to autonegotiate properly, from improper hardware settings, or from faulty Ethernet hardware. To determine whether this problem exists, pay close attention to console messages of the following form on the booting member:

ee0: Parallel Detection, 10 Mbps half duplex
ee0: Autonegotiated, 100 Mbps full duplex
 

For a cluster interconnect running at 100 Mb/s in full-duplex mode, the first message indicates a problem: parallel detection means that autonegotiation did not take place and the adapter has fallen back to 10 Mb/s half-duplex operation. The second message indicates that autonegotiation completed successfully.

The autonegotiation behavior of the Ethernet adapters and switches that are configured in the interconnect can cause unexpected hangs at boot time. In general, the adapter and the switch port must agree: either both must autonegotiate, or both must be hard-set to the same speed and duplex mode, as in the sketch that follows.
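
If an adapter does not autonegotiate reliably, one option is to hard-set its speed and mode with the lan_config command. The following is a sketch only; the interface name ee0 is an assumption, and you should verify the options against lan_config(8) before using them:

# lan_config -i ee0 -a 0 -s 100 -x 1

This disables autonegotiation (-a 0) and forces 100 Mb/s (-s 100) in full-duplex mode (-x 1). If you hard-set the adapter, you must also hard-set the switch port to the same speed and mode.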

B.1.3    Booting Member Hangs While Trying to Join Cluster

If a new member hangs at boot time while trying to join the cluster, the new member might be disconnected from the cluster interconnect. Typical causes include a loose or faulty cable, a connection to the wrong switch or hub, or incorrect interconnect attributes in the member's /etc/sysconfigtab file.

One of the following messages is typically displayed on the console:

CNX MGR: cannot form: quorum disk is in use.  Unable to establish contact
         with members using disk.
 

CNX MGR: Node pepperoni id 2 incarn 0xa3a71 attempting to form or join cluster deli
 

Perform the following steps to resolve this problem:

  1. Halt the booting member.

  2. Make sure the adapter is properly connected to the LAN interconnect.

  3. Mount the new member's boot partition on another member. For example, for the member with member ID 2:

    # mount root2_domain#root /mnt
     
    

  4. Examine the /mnt/etc/sysconfigtab file. The ics_ll_tcp attributes listed in Table C-2 must be set correctly to reflect the member's LAN interconnect interface (a sample stanza follows these steps).

  5. Edit /mnt/etc/sysconfigtab as appropriate.

  6. Unmount the member's boot partition:

    # umount /mnt
     
    

  7. Reboot the member.
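
For reference, the interconnect definition in /mnt/etc/sysconfigtab is a stanza of the following general shape. This sample is illustrative only; the interface name and the addresses are assumptions, and Table C-2 is the authoritative list of the ics_ll_tcp attributes and their meanings:

ics_ll_tcp:
    ics_tcp_adapter0 = ee0
    ics_tcp_inetaddr0 = 10.0.0.2
    ics_tcp_netmask0 = 255.255.255.0

The adapter name must match the device through which the member is actually cabled to the LAN interconnect.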

B.1.4    Booting Member Panics with "ics_ll_tcp" Message

If you boot a new member into the cluster and it panics with an "ics_ll_tcp: Unable to configure cluster interconnect network interface" message, you may have specified to clu_add_member, as the member's physical cluster interconnect interface, a device that does not exist; or the booting kernel may not contain the device driver that supports the cluster interconnect device.

Perform the following steps to resolve this problem:

  1. Halt the booting member.

  2. Mount the new member's boot partition on another member. For example:

    # mount root2_domain#root /mnt
     
    

  3. Examine the /mnt/etc/sysconfigtab file. The ics_ll_tcp attributes listed in Table C-2 must be set to correctly reflect the member's LAN interconnect interface.

If the specified interface does not exist, do the following:

  1. Edit /mnt/etc/sysconfigtab as appropriate.

  2. Unmount the member's boot partition:

    # umount /mnt
     
    

  3. Reboot the member.

If the interface name is correct, the vmunix kernel may not contain the device driver for the LAN interconnect device. To rectify this problem, do the following:

  1. Boot the member on the genvmunix kernel.

  2. Edit the /sys/conf/HOSTNAME file, where HOSTNAME is the member's hostname in uppercase, and add the missing driver. (A condensed session follows these steps.)

  3. Rebuild the vmunix kernel using the doconfig command.

  4. Copy the new kernel to the root (/) directory.

  5. Reboot the member from its vmunix kernel.
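
The following condensed session sketches steps 1 through 5, assuming the member's hostname is PEPPERONI; the boot device and the editor are illustrative, and the console boot syntax varies by platform:

>>> boot -file genvmunix dka0
# vi /sys/conf/PEPPERONI
# doconfig -c PEPPERONI
# cp /sys/PEPPERONI/vmunix /vmunix
# shutdown -r now

The doconfig -c command rebuilds the kernel from the existing configuration file and reports the location of the new vmunix, which you then copy to / before rebooting.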

B.1.5    Booting Member Displays "ics_ll_tcp: ERROR: Could not create a NetRAIN set with the specified members" Message

If you boot a new member into the cluster and it displays the "ics_ll_tcp: ERROR: Could not create a NetRAIN set with the specified members" message shortly after the installation tasks commence, a NetRAIN virtual interface used for the cluster interconnect may have been misconfigured. You will also see this message if a member of the NetRAIN set has been misconfigured.

This situation typically arises when you have edited the /etc/rc.config file to apply traditional NetRAIN administration to the LAN interconnect. In this case, the NetRAIN configuration in the /etc/rc.config file is ignored and the NetRAIN interface defined in /etc/sysconfigtab is used as the cluster interconnect.

You must never configure a NetRAIN set that is used for a cluster interconnect in the /etc/rc.config file. A NetRAIN device for the cluster interconnect is set up completely within the ics_ll_tcp kernel subsystem in /etc/sysconfigtab and not in /etc/rc.config.

Perform the following steps to resolve this problem:

  1. Use the rcmgr delete command to edit the newly booted member's /cluster/members/{memb}/etc/rc.config file, removing the NRDEV_x, NRCONFIG_x, NETDEV_x, and IFCONFIG_x variables associated with the device (see the example after these steps).

  2. Use the rcmgr set command to decrement the NR_DEVICES and NUM_NETCONFIG variables so that they no longer count the NetRAIN device that is doubly defined as the cluster interconnect.

  3. Reboot the member.
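
For example, if the offending device was defined by the variables with index 1, commands like the following remove them and adjust the counts. The index and the count values shown are assumptions; check the member's rc.config file for the actual values:

# rcmgr delete NRDEV_1
# rcmgr delete NRCONFIG_1
# rcmgr delete NETDEV_1
# rcmgr delete IFCONFIG_1
# rcmgr get NR_DEVICES
1
# rcmgr set NR_DEVICES 0
# rcmgr get NUM_NETCONFIG
2
# rcmgr set NUM_NETCONFIG 1

Because /etc/rc.config is a context-dependent symbolic link (CDSL), running rcmgr on the member itself edits that member's /cluster/members/{memb}/etc/rc.config file.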

B.2    Dealing with Other Issues

B.2.1    Booting a New Member Without a Cluster License Displays ATTENTION Message

When you boot a newly added member, the clu_check_config utility performs a series of configuration checks. If you have not yet installed the TruCluster Server license, the TCS-UA product authorization key (PAK), on the member, the boot procedure will display the following messages:

Starting Cluster Configuration Check...
The boottime cluster check found a potential problem.
For details search for !!!!!ATTENTION!!!!! in /cluster/admin/clu_check_log_hostname
check_cdsl_config : Boot Mode : Running /usr/sbin/cdslinvchk in the background
check_cdsl_config : Results can be found in : /var/adm/cdsl_check_list
clu_check_config : no configuration errors or warnings were detected
 

The following message appears in the /cluster/admin/clu_check_log_hostname file:

/usr/sbin/caad is NOT_RUNNING !!!!!ATTENTION!!!!!
 

When the TruCluster Server license is not configured on a member, the cluster application availability (CAA) daemon (caad) is not automatically started on that member. This is normal and expected behavior.

If you did not configure the license from within clu_add_member when you added the new member (as discussed in Chapter 5), you can configure it later using the lmf register command. After the license has been installed, start the CAA daemon on that member using the /usr/sbin/caad command.
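
For example, assuming the PAK data has been saved in a file named /tmp/tcs-ua.pak (the file name is an assumption; running lmf register with no arguments instead opens an editor template for you to fill in):

# lmf register - < /tmp/tcs-ua.pak
# lmf reset
# /usr/sbin/caad

The lmf reset command copies the registered license data into the kernel license cache, after which the CAA daemon can be started by hand.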