The Coda filesystem is still under development, and there certainly are several bugs which can crash both clients and servers. However, many problems users observe are related to semantic differences between Coda and the well-known NFS or SMB network filesystems.
This section points out several logs to look at when identifying the cause of problems. Even if the source of the problem cannot be found, the information gathered from Coda's logging mechanisms will make it easier for people on the Coda mailing list <[email protected]> to assist in solving the problem(s).
Some of the more common problems are illustrated in detail. Towards the end of this section some of the more involved debugging techniques are addressed, which will help developers isolate problems more easily.
The final part describes how to solve some problems on Windows 95 (only the Coda-related ones).
Most problems can be solved, or at least recognized, by using the information logged by the clients and servers. The first step in finding out where a problem stems from is doing a tail -f on the logfiles.
It must also be noted that when Coda clients and servers crash, they do not `dump core' but start sleeping so that we developers can attach debuggers. As a result, a crashed client or server still shows up in the ps auxwww output, and only the combination of lack of file service and error messages in the logfiles indicates that something is really wrong.
Since release 5.3.4, servers actually exit when they crash. Create a file /vice/srv/ZOMBIFY to force a server to go into an endless sleep again.
client$ xterm -e codacon
/usr/coda/etc/console is a logfile which contains mostly error or warning messages, and is a place to look for errors which might have occurred. When assertions in the code fail, this is logged here.
/usr/coda/venus.cache/venus.log contains more in-depth information about the running system, which can be helpful to find out what the client is or was doing.
client$ xterm -e cmon server1 server2 server3
...
/vice/srv/SrvLog and /vice/srv/SrvErr are the server logfiles.
Other logfiles that could be helpful in discovering problems are:
/vice/auth2/AuthLog
/vice/srv/portmaplog
/vice/srv/UpdateClntLog
/vice/srv/UpdateLog
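A first pass over the logfiles listed above could look like the following sketch. The function name scan_coda_log and the keyword list are assumptions made for illustration; Coda does not ship such a helper, and the real log messages may use different wording.

```shell
#!/bin/sh
# Hypothetical helper: print suspicious lines from one logfile.
# The keywords (error, warning, assert) are an assumption, not a
# documented Coda log format; adjust them to what your logs contain.
scan_coda_log() {
    logfile=$1
    if [ ! -r "$logfile" ]; then
        echo "skipping $logfile (not readable)"
        return 0
    fi
    # -n prefixes each hit with its line number, -i ignores case.
    grep -i -n -E 'error|warning|assert' "$logfile" \
        || echo "no matches in $logfile"
}

# Run the scan over the client and server logfiles named in this section.
scan_all_coda_logs() {
    for f in /usr/coda/etc/console /usr/coda/venus.cache/venus.log \
             /vice/srv/SrvLog /vice/srv/SrvErr; do
        scan_coda_log "$f"
    done
}
```

Calling scan_all_coda_logs on a client or a server skips the files that do not exist on that machine, so the same function can be used on both.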
testserver.coda.cs.cmu.edu.
When you have set up your client for the first time and it cannot connect to the testserver at CMU, there are a couple of possible reasons. You might be running an old release of Coda; check the Coda web site to see what the latest release is.
Another common reason is that your site is behind a firewall which blocks UDP traffic, or allows only outgoing UDP traffic. Either try Coda on a machine outside of the firewall, or set up your own server.
The third reason is that the testserver might be down, for maintenance or upgrades. That does not happen often, but you can check whether it is up, and how long it has been running using cmon.
cmon testserver.coda.cs.cmu.edu
All of the reasons in the previous item could be the cause. It is also possible that your /etc/services file is incomplete. It needs the entries:
# IANA allocated Coda filesystem port numbers
rpc2portmap     369/tcp
rpc2portmap     369/udp    # Coda portmapper
codaauth2       370/tcp
codaauth2       370/udp    # Coda authentication server
venus           2430/tcp   # codacon port
venus           2430/udp   # Venus callback/wbc interface
venus-se        2431/tcp   # tcp side effects
venus-se        2431/udp   # udp sftp side effect
codasrv         2432/tcp   # not used
codasrv         2432/udp   # server port
codasrv-se      2433/tcp   # tcp side effects
codasrv-se      2433/udp   # udp sftp side effect
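Whether these entries are present can be checked mechanically. The sketch below is only an illustration: check_coda_services is a hypothetical helper, and as a simplification it looks only at the UDP entries from the list above.

```shell
#!/bin/sh
# Report which Coda UDP entries are missing from a services file.
# check_coda_services is a hypothetical helper, not part of Coda.
check_coda_services() {
    services_file=${1:-/etc/services}
    missing=0
    for entry in rpc2portmap:369/udp codaauth2:370/udp venus:2430/udp \
                 venus-se:2431/udp codasrv:2432/udp codasrv-se:2433/udp; do
        name=${entry%%:*}    # service name, before the colon
        port=${entry#*:}     # port/protocol, after the colon
        # Match "name <whitespace> port" at the start of a line.
        if ! grep -q -E "^${name}[[:space:]]+${port}([[:space:]]|\$)" "$services_file"; then
            echo "missing: $name $port"
            missing=1
        fi
    done
    return $missing
}
```

Run without arguments it inspects /etc/services; it prints one line per missing entry and returns a non-zero status if anything is missing.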
The main reason for getting Connection timed out errors is that the volume where the file is located is disconnected from the servers. However, it can also occur in some cases when the client is in write-disconnected mode, and there is an attempt to read a file which is open for writing. See Volume is disconnected/Volume is write-disconnected for more information.
When commands hang, it is likely that venus has crashed. Check /usr/coda/etc/console and /usr/coda/venus.cache/venus.log.
If venus complains (in venus.log) about not being able to open /dev/cfs0, it is because /coda is still mounted.
# umount /coda
Another reason for not restarting is that another copy of venus is still around, and venus is unable to open its network socket. In this case there will be a message in venus.log stating that RPC2_CommInit has failed.
A reason is that you do not have the correct kernel module. This can be tested by inserting the module by hand, and then listing the available modules. `coda' should show up in that listing. Otherwise reinstall (or recompile) a new module.
# depmod -a
# insmod coda.o
# lsmod
Module                  Size  Used by
coda                   50488   2
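The check on the lsmod listing can be scripted. The following is a minimal sketch; module_loaded is a hypothetical helper that reads lsmod-style output on standard input, so it can also be tried on saved output.

```shell
#!/bin/sh
# module_loaded is a hypothetical helper: it reads `lsmod`-style output
# on standard input and succeeds if the named module is listed in the
# first column.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Example use on a Linux client:
#   lsmod | module_loaded coda || echo "coda module is not loaded"
```

Comparing the whole first column, rather than grepping for the string, avoids false hits on other modules whose names merely contain "coda".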
If the kernel module can be loaded without errors, check venus.log. A message stating `Cannot get rootvolume name' indicates either a misconfigured server or that the codasrv/codasrv-se ports are not defined in /etc/services. See above for the entries needed.
Put the hostnames of your servers in /etc/hosts.
Take a vacation until we release a version of Coda which uses its telepathic abilities to contact the auth2 server. We will add this feature.
Make sure you have version 5.0 of Coda or later. Before you can hoard you must make sure that:
-primaryuser "youruid"
This happens when you have a resolution log that is full. In the SrvLog file you will usually be able to see which volume is affected; take down its volume id (you may need to consult /vice/vol/VRList on the SCM to do this). Kill the dead (zombied) server, and restart it. The moment it is up, do:
filcon isolate -s "this server"   # to prevent clients from again
                                  # overwriting the log
volutil setlogparms "volid" reson 4 logsize 16384
filcon clear -s "this server"
Unless you do "huge" things 16k will be plenty.
If this happens you have several options. If the server has crashed during salvaging, it will not come up by trying again; you must either repair the damaged volume or not attach that volume.
Not attaching the volume is done as follows. Find the volume id of the damaged volume in the SrvLog. Create a file named /vice/vol/skipsalvage with the lines:
1
0xdd000123
Here 1 indicates that a single volume is to be skipped and 0xdd000123 is the volume id of the replica that should not be attached. If this volume is a replicated volume, take all replicas offline, since otherwise the clients will get very confused.
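Generating the skipsalvage file could be sketched as follows. write_skipsalvage is a hypothetical helper; it assumes that the count and each volume id go on separate lines, and the layout for more than one volume is extrapolated from the single-volume example rather than taken from the Coda documentation.

```shell
#!/bin/sh
# write_skipsalvage is a hypothetical helper. It prints the number of
# volumes to skip on the first line, followed by one volume id per
# line. The multi-volume layout is an assumption.
write_skipsalvage() {
    echo "$#"
    for volid in "$@"; do
        echo "$volid"
    done
}

# Example (as root on the server):
#   write_skipsalvage 0xdd000123 > /vice/vol/skipsalvage
```

Deriving the count from the number of arguments keeps the first line consistent with the list of ids that follows it.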
You can also try to repair the volume with norton. Norton is invoked as:
norton LOG DATA DATA-SIZE
These parameters can be found in /vice/srv.conf.
The Norton manual pages give details about norton's operation, and there is online guidance available which is possibly more helpful.
NOTES:
On Tuesday I lost my email folder: the whole volume moose:braam.life was corrupted on server moose and would not salvage. Here is how I got it back.
First I tried mounting moose.braam.life.0.backup, but this was corrupted too.
On the SCM in /vice/vol/VRList I found the replicated volume number f0000427 and the volume number ce000011 (fictitious) for the volume.
I logged in as root to bison, our backup controller. I read the backuplog for Tuesday morning in /vice/backuplogs/backuplog.DATE and saw that the incremental dump for August 31st had been fine. At the end of that log, I saw the name f0000427.ce000011 listed as dumped under /backup (a mere symlink) and /backup2 as the spool directory with the actual file. The backup log almost shows how to move the tape to the correct place and invoke restore:
cd /backup2
mt -f /dev/nst0 rewind
restore -b 500 -f /dev/nst0 -s 3 -i
The -s 3 option varies according to which /backup[123] volume the backup is restored from. This invokes the restore command. Typing help allowed me to add and then extract the file I wanted. It took a little while before the file was back. From the restore prompt do:
restore > cd 31Aug1998
restore > add viotti.coda.cs.cmu.edu-f0000427.ce000011
restore > extract
Specify volume #: 1
In /vice/db/dumplist I saw that the last full backup had been on Friday Aug28. I went to the machine room and inserted that tape (recent tapes are above bison). This time f0000427.ce000011 was a 200MB file (the last full dump) in /backup3. I extracted the file as above. Then I merged the two dumps:
merge /restore/peter.mail /backup2/28Aug1998/f0000427.ce000011 \
      /backup3/31Aug1998/f0000427.ce000011
This took a minute or two to create /restore/peter.mail. Now all that was needed was to upload that to a volume:
volutil -h moose restore /restore/peter.mail /vicepa vio:braam.mail.restored
Back to the SCM, to update the volume databases:
bldvldb.sh viotti
Now I could mount the restored volume:
cfs mkm restored-mail vio:braam.mail.restored
and copy it into a read write volume using cpio or tar.
When trying to create volumes, and createvol_rep reports RPC2_NOBINDING, it is an indication that the server is not (yet) accepting connections.
It is useful to look at /vice/srv/SrvLog. The server performs the equivalent of fsck on startup, which might take some time. Only once the server logs `Fileserver Started' in SrvLog does it start accepting incoming connections.
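Waiting for that message can be automated. The sketch below is an illustration only: wait_for_fileserver and its default timeout are assumptions, and it presumes that the SrvLog being read belongs to the current server run and does not still contain the message from an earlier start.

```shell
#!/bin/sh
# wait_for_fileserver is a hypothetical helper: it polls a server log
# until the "Fileserver Started" message appears or a timeout expires.
wait_for_fileserver() {
    srvlog=${1:-/vice/srv/SrvLog}
    timeout=${2:-300}    # seconds to wait before giving up (assumed default)
    waited=0
    while ! grep -q 'Fileserver Started' "$srvlog" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "server not ready after ${timeout}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "server is accepting connections"
}
```

Running it before createvol_rep avoids retrying by hand while the server is still salvaging.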
Another reason is that an old server is still around, blocking the new server from accessing the network ports.
Some process has the UDP port open which rpc2portmap or auth2 is trying to obtain. In most cases this is an already running copy of rpc2portmap or auth2. Kill all running copies of the program in question and restart them.
/vice/db.
Servers can crash when they are given inconsistent or bad data files. You should check whether updateclnt and updatesrv are both running on the SCM and the machine that has crashed. You can kill and restart them. Then restart codasrv and it should come up.
Check whether auth2, updateclnt, and updatesrv are running on all fileservers. Also check their logfiles for possible errors.
As most common problems are related to the semantic differences arising from `involuntary' disconnections, this section contains some background information on why volumes become disconnected or write-disconnected, and how to get them to reconnect.
Volume is fully disconnected.
There are several reasons why a coda client may have disconnected some or all volumes from an accessible server.
When modifications have been made to the volume in disconnected mode, the client will not reconnect the volume until all changes have been reintegrated. Also, reintegration will not occur without proper user authentication tokens. Furthermore, reintegration is suspended as long as there are objects in conflict.
The most important item here is to have a codacon process running, since it will give up-to-date information on what venus is doing. Venus will inform the user about missing Coda authentication tokens with `Reintegration: pending tokens for user <uid>'. In this case the user should authenticate using the clog command.
Conflicts, which require us to use the repair tool, are conveyed using the `local object <pathname> inconsistent' message. Otherwise codacon should show messages about backfetches, and how many modifications were successfully reintegrated.
The client may also disconnect when a server reports an error for an operation which, according to the client, is valid. One cause is authentication failure; check tokens using ctokens and optionally obtain new tokens using clog. Another is an inconsistency between the data cached on the client and the actual data stored on the server; this will reveal itself as an inconsistent object during subsequent reintegration.
Sometimes the client does not receive a prompt reply from an accessible server, and marks the server as dead. This will of course disconnect the volume if the last server is lost. Once every five minutes, the client automatically verifies connectivity with all known servers, and can thus recover from lost connections. However, this action can also be triggered by the user by executing the cfs checkservers command.
If cfs checkservers reports that servers are unreachable, it might be interesting to check with cmon if the server is responding at all, since we might be faced with a crashed server. When a server was considered unreachable, but is successfully contacted after `cfs checkservers', reintegration will automatically start (when a user has tokens, and there are no inconsistencies).
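The mapping from codacon messages to remedies described above can be summarized in a small sketch. suggest_action is a hypothetical helper; the patterns are taken from the messages quoted in this section, and real codacon output may differ in detail.

```shell
#!/bin/sh
# suggest_action is a hypothetical helper that maps one codacon line to
# the remedy described in this section.
suggest_action() {
    case $1 in
        *"Reintegration: pending tokens for user"*)
            echo "obtain tokens with clog" ;;
        *"local object"*"inconsistent"*)
            echo "resolve the conflict with the repair tool" ;;
        *)
            echo "no action suggested" ;;
    esac
}

# Example: annotate codacon output line by line (assumes codacon
# writes its messages to standard output).
#   codacon | while read -r line; do
#       echo "$line => $(suggest_action "$line")"
#   done
```

Lines about backfetches and successful reintegration fall through to the default branch, since they need no user action.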
Volume is write-disconnected.
Write-disconnected operation and weakly connected mode are used interchangeably to describe this volume state; they are effectively the same. This is the special situation where a client observes weak connectivity with a server, and therefore forces the associated volumes into weakly connected mode. Weakly connected volumes postpone writing to the server to significantly reduce waiting on a slow network connection. Read operations are still serviced by the local cache and the servers, as in fully connected mode, which is why this mode of operation is also called write-disconnected operation.
The write operations are effectively a continuous reintegration (trickle-reintegration) in the background. This mode therefore requires users to be authenticated and increases the chance of file conflicts. The following are several reasons for write-disconnected operation.
Venus uses bandwidth estimates made by the rpc2 communication layer to decide on the quality of the network connection with the servers. As soon as the connectivity to one of the servers drops below the weakly connected threshold (currently 50 KB/s), it will force all volumes associated with that server into weakly connected mode. The cfs wr command can be used to force the volumes back into fully connected mode, and immediately reintegrate all changes.
To avoid switching to weakly connected mode, use cfs strong. This way venus ignores bandwidth estimates. cfs adaptive will make venus revert to interpreting bandwidth estimates.
When the user was not authenticated, or conflicts were created during the write-disconnected operation, the user must first obtain proper authentication tokens or repair any inconsistent objects before the volume becomes fully connected again. Here again codacon is an invaluable tool for obtaining insight into the client's behaviour.
Users can ask venus to force volumes into write-disconnected mode, exchanging high consistency for significantly improved performance. The -age and -time flags on the cfs wd command line give some control over the speed at which venus performs the trickle-reintegration. By default, only mutations to the filesystem older than 15 minutes are reintegrated. To trickle-reintegrate more quickly, you could for instance use cfs wd -age 5, which will attempt to reintegrate all mutations older than 5 seconds.
When a volume is write-disconnected, it will stay write-disconnected until a user properly authenticates using clog.
rpc2tcpdump is the regular tcpdump, modified to decode rpc2 protocol headers. This makes it a very useful tool for analyzing why programs fail to work.
All traffic between venus and the Coda servers can be viewed using the following command.
# tcpdump -s120 -Trpc2 port venus or port venus-se
To identify problems with clog, for instance to see which server it is trying to get tokens from, use:
# tcpdump -s120 -Trpc2 port codaauth2
To make it possible to debug programs that use RVM, most Coda-related applications will go into an endless sleep when something goes really wrong. They print their process id in the log (for instance venus.log or SrvLog), and a user can attach a debugger to the crashed, but still running, program.
# gdb /usr/sbin/venus `pidof venus`
This makes it possible to get a stack backtrace (where), go to a specific stack frame (frame <x>), or view the contents of variables (print <varname>). By installing the Coda sources in the same place as where the binaries were initially built from, it is possible to view the surrounding code fragment from within the debugger using the list command.
When using RedHat Linux rpms, you can install the sources in the right place by installing the coda source rpm file.
# rpm -i coda-x.x.x.src.rpm
# rpm -bp /usr/src/redhat/SPECS/coda.spec
On other platforms look at the paths reported in the backtrace and unpack the source tarball in the correct place.
(gdb) where
#0  CommInit () at /usr/local/src/coda-4.6.5/coda-src/venus/comm.cc:175
#1  0x80fa8c3 in main (argc=1, argv=0xbffffda4) at /usr/local/src/coda-4.6.5/coda-src/venus/venus.cc:168
(gdb) quit
# cd /usr/local/src
# tar -xvzf coda-4.6.5.tgz
Check the DOS Windows settings of Venus and Relay. The check box Properties->Misc->Termination must be unticked.
Boot your system in DOS mode by pressing F8 at boot time. Cd to the windows directory and type edit system.ini. In the section [386Enh] you will find the entries
device=c:\usr\coda\bin\mmap.vxd
device=c:\usr\coda\bin\mcstub.vxd
Comment them out by putting a ; in front of the lines. Try to restart Windows again.
venus.exe crashed.
See troubleshooting venus. When this happens it might not be possible to restart Venus if it is still mounted. In this case try to unmount by typing
unmount <drive>:
If that does not work, reboot the machine.
Look in the file c:\vxd.log. The file system driver codadev.vxd prints information about all requests and answers in this file. The information is only stored if the debug level has been turned on. The debug level is specified in the registry:
HKLM/System/CurrentControlSet/Services/VxD/Codadev/Debuglevel
Set the debug level higher than 0 to receive messages in the debug file.
Venus switches to disconnected mode after a short timeout. After that it should work fine. If it doesn't, check if you have 'network connections' set up in the explorer (e.g. samba drive). 'Network connections' block your system, when no network is available.
cfs.exe and hoard.exe use absolute pathnames so far.