5. Troubleshooting

The Coda filesystem is still under development, and there certainly are
several bugs which can crash both clients and servers. However, many
problems users observe are related to semantic differences between the Coda
filesystem and well-known network filesystems such as NFS or SMB.

This section points out several logs to look at for identifying the cause
of problems. Even if the source of the problem cannot be found, the
information gathered from Coda's logging mechanisms will make it easier for
people on the Coda mailing list <coda-discuss@coda.cs.cmu.edu> to assist in
solving the problem(s).

Some of the more common problems are illustrated in detail. Towards the end
of this section some of the more involved debugging techniques are
addressed, which should help developers isolate problems more easily.

The final subsection describes how to solve some problems on Windows 95,
covering only the Coda-related issues.

5.1 Basic Troubleshooting

Most problems can be solved, or at least recognized, by using the
information logged by the clients and servers. The first step in finding
out where a problem stems from is doing a tail -f on the logfiles.

Note also that when Coda clients and servers crash they do not `dump
core', but start sleeping so that developers can attach a debugger. As a
result, a crashed client or server still shows up in the ps auxwww output,
and only the combination of a lack of file service and error messages in
the logfiles indicates that something is really wrong.

Client debugging output

   * codacon is a program which connects to venus and provides the user
     with run-time information. It is the initial source of information,
     but cannot be used to look back into the history. It is therefore
     advisable to always have a codacon running in a dedicated xterm.

          client$ xterm -e codacon

   * /usr/coda/etc/console is a logfile which contains mostly error or
     warning messages, and is a place to look for errors which might have
     occurred. When assertions in the code fail, they are logged here.
   * /usr/coda/venus.cache/venus.log contains more in-depth information
     about the running system, which can be helpful to find out what the
     client is or was doing.
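
Following both client logs at once can be scripted. The sketch below is
only an illustration: the helper names readable_logs and follow_client_logs
are hypothetical (they are not part of Coda), and the paths are the two
client logfiles listed above.

```shell
# Print each argument that names a readable file, one per line.
# (Hypothetical helper; not part of Coda.)
readable_logs() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Follow whichever of the client logs exist on this machine.
# Assumes the log paths contain no whitespace.
follow_client_logs() {
    set -- $(readable_logs /usr/coda/etc/console \
                           /usr/coda/venus.cache/venus.log)
    if [ "$#" -eq 0 ]; then
        echo "no readable client logs found" >&2
        return 1
    fi
    tail -f "$@"
}
```

Running follow_client_logs in a spare terminal complements codacon, since
the logfiles keep the history that codacon cannot show.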

Server logs

   * cmon is an ncurses program that can be run on a client to gather and
     display statistics from a group of servers. When a server goes down it
     will not respond to the statistics requests, which makes this a simple
     method for monitoring server availability.

          client$ xterm -e cmon server1:100 server2:100 server3:100
          ...

   * /vice/srv/SrvLog and /vice/srv/SrvErr are the server logfiles.
   * /vice/auth2/AuthLog
   * /vice/srv/portmaplog
   * /vice/srv/UpdateClntLog
   * /vice/srv/UpdateLog

5.2 Client Problems

Client does not connect to testserver.coda.cs.cmu.edu.

When you have set up your client for the first time and it cannot connect
to the testserver at CMU, there are a couple of possible reasons. You might
be running an old release of Coda; check the Coda web site to see what the
latest release is.

Another common reason is that your site is behind a firewall which blocks
UDP traffic, or allows only outgoing UDP traffic. Either try Coda on a
machine outside of the firewall, or set up your own server.

The third reason is that the testserver might be down, for maintenance or
upgrades. That does not happen often, but you can check whether it is up,
and how long it has been running using cmon.

cmon testserver.coda.cs.cmu.edu:100

Venus comes up but prints cannot find RootVolume

All of the reasons in the previous item could be the cause. It is also
possible that your /etc/services file is incomplete. It needs the
entries:

     # Iana allocated Coda filesystem port numbers
     rpc2portmap     369/tcp
     rpc2portmap     369/udp    # Coda portmapper
     codaauth2       370/tcp
     codaauth2       370/udp    # Coda authentication server

     venus           2430/tcp   # codacon port
     venus           2430/udp   # Venus callback/wbc interface
     venus-se        2431/tcp   # tcp side effects
     venus-se        2431/udp   # udp sftp side effect
     codasrv         2432/tcp   # not used
     codasrv         2432/udp   # server port
     codasrv-se      2433/tcp   # tcp side effects
     codasrv-se      2433/udp   # udp sftp side effect
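
Checking for these entries by eye is error-prone, so a small script can
help. The following is a sketch under stated assumptions: the function name
missing_coda_services is hypothetical, and the name/port pairs are copied
from the listing above. It only reports entries that are absent; it does
not modify the file.

```shell
# Print every Coda service entry that is absent from a services file.
# Usage: missing_coda_services [file]   (defaults to /etc/services)
# (Hypothetical helper; not part of Coda.)
missing_coda_services() {
    services_file=${1:-/etc/services}
    for entry in \
        rpc2portmap:369 codaauth2:370 \
        venus:2430 venus-se:2431 codasrv:2432 codasrv-se:2433
    do
        name=${entry%%:*}
        port=${entry##*:}
        for proto in tcp udp; do
            # Match "name  port/proto" on a line that is not commented out.
            pattern="^[[:space:]]*${name}[[:space:]]+${port}/${proto}([[:space:]]|\$)"
            if ! grep -Eq "$pattern" "$services_file"; then
                printf '%s %s/%s\n' "$name" "$port" "$proto"
            fi
        done
    done
}
```

Calling missing_coda_services with no argument inspects /etc/services and
prints one line per missing entry; no output means all twelve entries were
found.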

Trying to access a file returns Connection timed out (ETIMEDOUT).

The main reason for getting Connection timed out errors is that the volume
where the file is located is disconnected from the servers. However, it can
also occur in some cases when the client is in write-disconnected mode, and
there is an attempt to read a file which is open for writing. See Volume is
disconnected/Volume is write-disconnected for more information.

Commands do not return, except by using ^C.

When commands hang it is likely that venus has crashed. Check
/usr/coda/etc/console and /usr/coda/venus.cache/venus.log.

Venus fails when restarted.

If venus complains (in venus.log) about not being able to open /dev/cfs0,
it is because /coda is still mounted.

# umount /coda

Another reason for not restarting is that another copy of venus is still
around, and venus is unable to open its network socket. In this case there
will be a message in venus.log stating that RPC2_CommInit has failed.
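
Both cases can be recognized by searching venus.log. The sketch below is an
assumption-laden illustration: the function name diagnose_venus_restart is
hypothetical, and since the exact wording of the log messages is not given
here, it only searches for the two strings named above (/dev/cfs0 and
RPC2_CommInit).

```shell
# Suggest a likely cause when venus refuses to restart.
# Usage: diagnose_venus_restart [logfile]
# (Hypothetical helper; matches substrings only, not exact messages.)
diagnose_venus_restart() {
    log=${1:-/usr/coda/venus.cache/venus.log}
    if grep -q '/dev/cfs0' "$log"; then
        echo "/coda may still be mounted: run 'umount /coda' as root"
    fi
    if grep -q 'RPC2_CommInit' "$log"; then
        echo "another venus may still hold the network socket: check 'ps auxwww'"
    fi
}
```

Because the log also holds messages from earlier runs, a hint printed here
may refer to an old failure; compare it with the end of the log.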

Venus doesn't start.

A reason is that you do not have the correct kernel module. This can be
tested by inserting the module by hand, and then listing the available
modules. `coda' should show up in that listing. Otherwise reinstall (or
recompile) a new module.

# depmod -a
# insmod coda.o
# lsmod
Module                  Size  Used by
coda                   50488   2
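
The lsmod check can also be done without reading the listing by eye. This
is a minimal sketch; the function name has_coda_module is hypothetical, and
it assumes the module name appears in the first column of the lsmod output,
as in the listing above.

```shell
# Succeed if the lsmod output on standard input lists a module named coda.
# (Hypothetical helper; not part of Coda.)
has_coda_module() {
    awk '$1 == "coda" { found = 1 } END { exit !found }'
}

# Example use (commented out so that defining the function has no effect):
#   lsmod | has_coda_module || echo "coda module is not loaded"
```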

If the kernel module can be loaded without errors, check venus.log. A
message stating `Cannot get rootvolume name' indicates either a
misconfigured server or that the codasrv/codasrv-se ports are not defined
in /etc/services. See above for the entries needed.

I'm disconnected and Venus doesn't start

Put the hostnames of your servers in /etc/hosts.

I cannot get tokens while disconnected.

Take vacation until we release version 5.2 of Coda. We will add this
feature.

Hoard doesn't work

Make sure you have version 5.0 of Coda or later. Before you can hoard you
must make sure that:

   * You started Venus with the flag -primaryuser "youruid"
   * You have tokens

5.3 Server Problems

The server crashed and prints messages about "AllocViaWrapAround"

This happens when you have a resolution log that is full. In the SrvLog
file you will usually be able to see which volume is affected; note its
volume id (you may need to consult /vice/vol/VRList on the SCM to do
this). Kill the dead (zombied) server, and restart it. The moment it is up
you do:

filcon isolate -s "this server" # to prevent clients from again
                                # overwriting the log
volutil setlogparms "volid" reson 4 logsize 16384
filcon clear -s "this server"

Unless you do "huge" things 16k will be plenty.

Server doesn't start due to salvaging problems

If this happens you have several options. If the server has crashed during
salvaging it will not come up by trying again; you must either repair the
damaged volume or not attach that volume.

Not attaching the volume is done as follows. Find the volume id of the
damaged volume in the SrvLog. Create a file named /vice/vol/skipsalvage
with the lines:

1
0xdd000123

Here 1 indicates that a single volume is to be skipped and 0xdd000123 is
the volume id of the replica that should not be attached. If this volume is
a replicated volume, take all replicas offline, since otherwise the clients
will get very confused.
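
The file can be generated rather than typed. The sketch below uses a
hypothetical helper, make_skipsalvage, which writes the count followed by
the volume ids. Only the single-volume form is shown in this section;
listing several ids one per line after the count is an assumption here.

```shell
# Print skipsalvage contents: the number of volumes, then one id per line.
# (Hypothetical helper; the multi-volume layout is an assumption.)
make_skipsalvage() {
    printf '%s\n' "$#"
    for volid in "$@"; do
        printf '%s\n' "$volid"
    done
}

# Example use (commented out; run as root on the affected server):
#   make_skipsalvage 0xdd000123 > /vice/vol/skipsalvage
```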

You can also try to repair the volume with norton. Norton is invoked as:

norton LOG DATA DATA-SIZE

These parameters can be found in /vice/srv.conf.

The Norton manual pages give details about norton's operation and there is
online guidance available which is possibly more helpful.

NOTES:

  1. Often corruption is replicated. This means that if you find a server
     has crashed and does not want to salvage a volume, your other replicas
     may suffer the same fate: the risk is that you may have to go back to
     tape (you do make tapes, right?). Therefore first copy out good data
     from the available replicas, then attend to repairing or skipping them
     in salvage.
  2. Very often you have to take both a volume and its most recent clone
     (generated during backup) offline, since corruption in a volume is
     inherited by the clone.
  3. If you find that a replica of a volume is corrupt, do not attempt to
     merely replace that replica. We have found that this corrupts the
     volume databases. It is better to make a new replicated volume and
     copy of the data from the healthy replicas (keep the server with the
     bad replica down).

How to restore a backup from tape

Tuesday I lost my email folder - the whole volume moose:braam.life was
corrupted on server moose, it wouldn't salvage. Here is how I got it back.

First I tried mounting moose.braam.life.0.backup but this was corrupted
too.

On the SCM in /vice/vol/VRList I found the replicated volume number
f0000427 and the volume number ce000011 (fictitious) for the volume.

I logged in as root to bison, our backup controller. I read the backuplog
for Tuesday morning in /vice/backuplogs/backuplog.DATE and saw that the
incremental dump for August 31st had been fine. At the end of that log, I
saw the name f0000427.ce000011 listed as dumped under /backup (a mere
symlink) and /backup2 as spool directory with the actual file. The backup
log almost shows how to move the tape to the correct place and invoke
restore:

     cd /backup2
     mt -f /dev/nst0 rewind
     restore -b 500 -f /dev/nst0 -s 3 -i

The -s 3 option varies according to which /backup[123] volume the backup is
restored from. This invokes the restore command. Typing help allowed me to
add and then extract the file I wanted. It took a little while before the
file was back. From the restore prompt do:

     restore> cd 31Aug1998
     restore> add viotti.coda.cs.cmu.edu-f0000427.ce000011
     restore> extract
     Specify volume #: 1

In /vice/db/dumplist I saw that the last full backup had been on Friday
Aug 28. I went to the machine room and inserted that tape (recent tapes are
above bison). This time f0000427.ce000011 was a 200MB file (the last full
dump) in /backup3. I extracted the file as above.

Then I merged the two dumps:

     merge /restore/peter.mail /backup2/28Aug1998/f0000427.ce000011 \
           /backup3/31Aug1998/f0000427.ce000011

This took a minute or two to create /restore/peter.mail. Now all that was
needed was to upload that to a volume:

     volutil -h moose restore /restore/peter.mail /vicepa vio:braam.mail.restored

Back to the SCM, to update the volume databases:

     bldvldb.sh viotti

Now I could mount the restored volume:

     cfs mkm restored-mail vio:braam.mail.restored

and copy it into a read-write volume using cpio or tar.

createvol_rep reports RPC2_NOBINDING.

When trying to create volumes, and createvol_rep reports RPC2_NOBINDING, it
is an indication that the server is not (yet) accepting connections.

It is useful to look at /vice/srv/SrvLog. The server performs the
equivalent of fsck on startup, which might take some time. Only when the
server logs `Fileserver Started' in SrvLog does it start accepting incoming
connections.
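
Waiting for that message can be automated before running createvol_rep.
This is a sketch, not a supported tool: the function names server_ready and
wait_for_server are hypothetical, and the 5 second polling interval is an
arbitrary choice.

```shell
# Succeed if the server log contains the startup message.
# (Hypothetical helper; not part of Coda.)
server_ready() {
    grep -q 'Fileserver Started' "${1:-/vice/srv/SrvLog}"
}

# Poll the log until the server is ready or the attempts run out.
# Usage: wait_for_server [logfile] [attempts]
wait_for_server() {
    log=${1:-/vice/srv/SrvLog}
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        if server_ready "$log"; then
            return 0
        fi
        tries=$((tries - 1))
        # Only sleep when another attempt will follow.
        if [ "$tries" -gt 0 ]; then
            sleep 5
        fi
    done
    return 1
}
```

If SrvLog still holds output from an earlier run, the message from that run
would also match, so this check is only meaningful on a log that starts
with the current server instance.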

Another reason is that an old server is still around, blocking the new
server from accessing the network ports.

RPC2_DUPLICATESERVER in the rpc2portmap/auth2 logs

Some process has the UDP port open which rpc2portmap or auth2 is trying to
obtain. In most cases this is an already running copy of rpc2portmap or
auth2. Kill all running copies of the program in question and restart them.

Server crashed shortly after updating files in /vice/db.

Servers can crash when they are given inconsistent or bad data-files. You
should check whether updateclnt and updatesrv are both running on the SCM
and the machine that has crashed. You can kill and restart them. Then
restart codasrv and it should come up.

Users cannot authenticate or created volumes are not mountable.

Check whether auth2, updateclnt, and updatesrv are running on all
fileservers. Also check their logfiles for possible errors.

5.4 Disconnections

As most common problems are related to the semantic differences arising
as a result of `involuntary' disconnections, this section contains some
background information on why volumes become disconnected or
write-disconnected, and how to get them to reconnect again.

Volume is fully disconnected.

There are several reasons why a coda client may have disconnected some or
all volumes from an accessible server.

   * Pending reintegration.

     When modifications have been made to the volume in disconnected mode,
     the client will not reconnect the volume until all changes have been
     reintegrated. Also, reintegration will not occur without proper user
     authentication tokens. Furthermore, reintegration is suspended as long
     as there are objects in conflict.

     The most important item here is to have a codacon process running,
     since it will give up-to-date information on what venus is doing.
     Venus will inform the user about missing coda authentication tokens,
     `Reintegration: pending tokens for user <uid>'. In this case the user
     should authenticate himself using the clog command.

     Conflicts, which require us to use the repair tool, are conveyed using
     the `local object <pathname> inconsistent' message. Otherwise codacon
     should show messages about backfetches, and how many modifications
     were successfully reintegrated.

   * Access permissions.

     The client may also disconnect when a server reports an error for an
     operation that the client considers valid. One cause is an
     authentication failure; check tokens using ctokens and, if necessary,
     obtain new tokens using clog. Another is an inconsistency between the
     data cached on the client and the actual data stored on the server;
     this will reveal itself as an inconsistent object during subsequent
     reintegration.

   * Lost connections.

     Sometimes the client does not receive a prompt reply from an
     accessible server, and marks the server as dead. This will of course
     disconnect the volume if the last server is lost. Once every five
     minutes, the client automatically verifies connectivity with all known
     servers, and can thus recover from lost connections. However, this
     action can also be triggered by the user by executing the cfs
     checkservers command.

     If cfs checkservers reports that servers are unreachable, it might be
     interesting to check with cmon if the server is responding at all,
     since we might be faced with a crashed server. When a server was
     considered unreachable, but is successfully contacted after `cfs
     checkservers', reintegration will automatically start (when a user has
     tokens, and there are no inconsistencies).
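
The checks described in the list above can be run in one go. The following
is a minimal sketch: the wrapper name coda_recheck is hypothetical, the two
commands are the ones named in the text, and their output is shown to the
user rather than parsed, since its format is not described here.

```shell
# Show the current tokens, then ask venus to probe all known servers.
# (Hypothetical wrapper; it only runs the commands named in the text.)
coda_recheck() {
    ctokens            # list current tokens; use clog if none are valid
    cfs checkservers   # trigger the connectivity check immediately
}
```

Watching codacon while this runs shows whether reintegration starts
afterwards.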

Volume is write-disconnected.

The terms write-disconnected operation and weakly connected mode are used
interchangeably to describe this volume state; they are effectively the
same. This is the special situation where a client observes weak
connectivity with a server, and therefore forces the associated volumes
into weakly connected mode. Weakly connected volumes postpone writing to
the server to significantly reduce waiting on a slow network connection.
Read operations are still serviced by the local cache and the servers, as
in fully connected mode, which is why this mode of operation is also called
write-disconnected operation.

The write operations are effectively a continuous reintegration
(trickle-reintegration) in the background. This mode, therefore, requires
users to be authenticated and gives more chance for possible file
conflicts. The following points are several reasons for write-disconnected
operation.

   * Weak network connectivity.

     Venus uses bandwidth estimates made by the rpc2 communication layer to
     decide on the quality of the network connection with the servers. As
     soon as the connectivity to one of the servers drops below the
     weakly connected threshold (currently 50 KB/s), it will force all
     volumes associated with that server into weakly-connected mode. The
     cfs wr command can be used to force the volumes back into fully
     connected mode, and immediately reintegrate all changes.

     When the user was not authenticated, or conflicts were created during
     the write-disconnected operation, the user must first obtain proper
     authentication tokens or repair any inconsistent objects before the
     volume becomes fully connected again. Here again codacon is an
     invaluable tool for obtaining insight into the client's behaviour.

   * User requested write-disconnect mode.

     Users can ask venus to force volumes into write-disconnected mode,
     exchanging high consistency for significantly improved performance. By
     using the -age and -time flags on the cfs wd command line, some control
     is given over the speed at which venus performs the
     trickle-reintegration. By default only mutations to the filesystem
     older than 15 minutes are reintegrated. To reintegrate more quickly
     you could use cfs wd -age 5, which will reintegrate all mutations
     older than 5 seconds.

   * Pending reintegration.

     When a volume is write-disconnected, it will stay write-disconnected
     until a user properly authenticates using clog.
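
Switching between the two modes can be wrapped in a pair of shell
functions. This is a sketch only: the function names are hypothetical, and
the cfs invocations are exactly the forms mentioned in the list above (cfs
wd -age and cfs wr), without any further arguments.

```shell
# Enter write-disconnected mode, reintegrating mutations older than the
# given number of seconds (5 if no argument is supplied).
# (Hypothetical wrapper around the cfs wd form shown in the text.)
coda_write_disconnect() {
    cfs wd -age "${1:-5}"
}

# Return to fully connected mode and reintegrate all changes immediately.
coda_write_reconnect() {
    cfs wr
}
```

Remember that cfs wr can only complete once the user has tokens and any
conflicts have been repaired.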

5.5 Advanced Troubleshooting

Troubleshooting with rpc2tcpdump

rpc2tcpdump is the regular tcpdump, which is modified to decode rpc2
protocol headers. This makes it a very useful tool for analyzing why
programs fail to work.

All traffic between venus and the coda servers can be viewed using the
following command.

# tcpdump -s120 -Trpc2 port venus or port venus-se

To identify problems with clog, for instance to see which server it is
trying to get tokens from, use:

# tcpdump -s120 -Trpc2 port codaauth2

debugging with gdb

To be able to debug programs that use RVM, most Coda related applications
will go into an endless sleep when something goes really wrong. They print
their process-id in the log (for instance venus.log or SrvLog), and a user
can attach a debugger to the crashed, but still running, program.

# gdb /usr/sbin/venus `pidof venus`

This makes it possible to get a stack backtrace (where), go to a specific
stack frame (frame <x>), or view the contents of variables (print
<varname>). By installing the Coda sources in the same place as where the
binaries were initially built from, it is possible to view the surrounding
code fragment from within the debugger using the list command.
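
When only a backtrace is needed, for example to attach to a mailing list
report, gdb can be run non-interactively. This is a sketch under
assumptions: the function name venus_backtrace is hypothetical, the binary
path is the one used in the gdb command above, and gdb's -batch and -ex
options must be supported by the installed gdb.

```shell
# Print a stack backtrace of the sleeping venus process and exit gdb.
# Usage: venus_backtrace [path-to-venus-binary]
# (Hypothetical helper; not part of Coda.)
venus_backtrace() {
    binary=${1:-/usr/sbin/venus}
    pid=$(pidof venus) || { echo "venus is not running" >&2; return 1; }
    pid=${pid%% *}    # pidof may print several ids; use the first one
    gdb -batch -ex where "$binary" "$pid"
}
```

The same approach works for a crashed server by substituting the server
binary and its process id as printed in SrvLog.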

When using RedHat Linux rpms, you can install the sources in the right
place by installing the coda source rpm file.

# rpm -i coda-x.x.x.src.rpm
# rpm -bp /usr/src/redhat/SPECS/coda.spec

On other platforms look at the paths reported in the backtrace and unpack
the source tarball in the correct place.

(gdb) where
#0  CommInit () at /usr/local/src/coda-4.6.5/coda-src/venus/comm.cc:175
#1  0x80fa8c3 in main (argc=1, argv=0xbffffda4)
    at /usr/local/src/coda-4.6.5/coda-src/venus/venus.cc:168
(gdb) quit
# cd /usr/local/src
# tar -xvzf coda-4.6.5.tgz

5.6 Troubleshooting on Windows 95

Common problems

Unable to mount Coda.

     It is only possible to mount Coda when relay.exe and venus.exe are
     running. Do not type mount.exe n: before venus has printed the message

          venus starting...

     In some installations the DPMI DOS Extender window in which Venus is
     running suspends when it is not active. Because Venus serves the
     mount.exe n: call, check your DOS Window settings. Untick the window
     property Properties->Misc->Background->Always Suspend. If it is
     already unticked, ticking and unticking it again might help.

Unable to shutdown Windows95.

     This occurs when the device codadev.vxd could not be unloaded after
     the request issued by typing unmount.exe. relay.exe prints the message

          UNLOAD RESULT 0
          UNLOAD RESULT 8

     when the request was successful. There is no way out of this. Just
     reboot your machine.

I cannot reboot Windows95 and I think it is due to the VXDs loaded for
Coda.

     Boot your system in DOS mode by pressing F8 at boot time. Cd to the
     windows directory and type edit system.ini. In the section [386Enh]
     you will find the entries

          device=c:\usr\coda\bin\mmap.vxd
          device=c:\usr\coda\bin\mcstub.vxd

     Comment them out by using a ; in front of the lines. Try to restart
     Windows again.

How can I find out why relay.exe crashed?

     relay.exe is a very tiny program which hands requests from the file
     system driver up to venus.exe and the other way around. Crashes in
     relay.exe are most likely due to buffer overflows or null pointer
     referencing. relay.exe prints debug information in a file called
     relay.log which can be found in the directory where relay was started
     from.

     When relay.exe crashes, the file system driver is still loaded. Just
     restart relay.exe. It may well not be possible to unmount the
     system by calling unmount.exe.

How can I find out why venus.exe crashed?

     See troubleshooting venus. When this happens it is possible to restart
     venus. The file system doesn't need to be unmounted before.

How can I find out more about what has happened?

     Look in the file c:\vxd.log. The file system driver codadev.vxd prints
     information about all requests and answers in this file. Check the
     requestnumbers to match with those in relay.log.

Restrictions

   * hoard.exe does not work so far.
   * Handling large files (in particular executables) does not work well in
     a low bandwidth scenario.
   * cfs.exe uses absolute pathnames so far.
   * Long filenames are not supported under DOS environment yet. You can
     access files, but you need to use the long filenames.
   * Reintegration after disconnection does not work at present.
