Coda File System

RE: Doubts on replication strategy in CODA

From: Sundaram, Arunkumar <arunkumar_sundaram_at_adaptec.com>
Date: Wed, 19 May 2004 19:35:50 +0530
Thanks Jan,

> I believe there is an implicit assumption that write-write sharing is
> uncommon.

Is there any thought of transparently supporting write-write sharing in CODA? 

We could apply a collision detection algorithm (with a little modification) when two or more clients try to acquire a write lock on an object simultaneously.

Thanks again,
Arun

-----Original Message-----
From: Jan Harkes [mailto:jaharkes_at_cs.cmu.edu] 
Sent: Tuesday, May 18, 2004 11:01 PM
To: codadev_at_TELEMANN.coda.cs.cmu.edu; codalist_at_TELEMANN.coda.cs.cmu.edu
Subject: Re: Doubts on replication strategy in CODA

On Tue, May 18, 2004 at 09:58:21PM +0530, Sundaram, Arunkumar wrote:
> In one of the documents I read that CODA uses read-one, write-all
> approach for server replication. I am a little bit concerned here.
>  
> Let's assume that there are three replicated servers A, B and C.
> And let's assume that client Ac is connected to A, client Bc is
> connected to B and client Cc is connected to C.

First problem: clients Ac, Bc, and Cc are each connected to all of the
servers A, B and C. The only way they can be connected to a single
server is when there is a network split. This doesn't really affect
your scenario all that much.

> In this scenario let's say there is a file named "CommonFile" under
> root directory of CODA share. Consider a sequence of following
> operations on this file
>  
> Ac opens "CommonFile" for write
> Bc opens "CommonFile" for write
> Ac appends some content ΔA to "CommonFile"
> Bc appends some content ΔB to "CommonFile"
> Ac propagates ΔA to other replicated servers
> Bc propagates ΔB to other replicated servers
> let's say at this time ΔA is propagated to servers in the following order; first to A then to C and then to B

There is no ordering; a client writes simultaneously to all available
servers. It sends a STORE rpc call to all available servers, which
(try to) get a write lock on the volume or object. Once the object or
volume is locked, the servers all fetch the data associated with the
file from the client. Once all fetches have completed, the servers
bump their own slot in the version vector. After that the STORE rpc
returns, and the return code indicates whether the transfer succeeded
and which of the available servers committed the update.
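
Very roughly, a sketch of that flow in illustrative Python (the object
and method names here are made up for illustration; they are not
Coda's actual RPC2/MultiRPC interfaces):

    # Sketch of the read-one, write-all STORE described above.
    # In reality the STORE is sent to all servers in parallel.
    def store(client, available_servers, fid, data):
        committed = []
        for server in available_servers:
            if not server.lock(fid):          # write lock on the object/volume
                continue
            server.fetch_from(client, fid, data)          # server pulls the data
            server.version_vector(fid)[server.id] += 1    # bump its own slot
            committed.append(server.id)
        return committed   # tells the client which replicas took the update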

The client then piggybacks a COP2 (2nd phase commit) message on the
next operation, which tells the servers which replicas also committed
the operation; this brings the version vectors on the replicas in
sync. If either Ac or Bc goes first, the modified version vector will
cause the other update to be declared 'in conflict', and the second
writer will have to decide which version to keep (manual repair by
the user).
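
In the same sketchy style (the real COP2 message carries a store
identifier and the set of servers that committed):

    # Sketch of the COP2 update: every replica that committed learns the
    # full update set and also bumps the slots of the other committed
    # replicas, so all of them end up with identical version vectors.
    def cop2(servers, fid, update_set):
        for server in servers:
            if server.id not in update_set:
                continue
            vv = server.version_vector(fid)
            for member in update_set:
                if member != server.id:
                    vv[member] += 1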

When sending out operations that tend to grab locks, we often order
outgoing messages by IP address; this way all clients will try to
lock the volume on servers A, B and C in the same order. For the few
situations where this doesn't help (Ac and Bc see/use different sets
of addresses to talk to the servers, or a network routing problem has
separated client Ac and server A from Bc and the other servers) we
rely on the version vectors.
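
The ordering itself is nothing fancy, just something along the lines
of:

    # Deterministic lock ordering: every client sorts the replica list
    # the same way, so lock requests arrive at A, B and C in the same
    # order.  (available_servers is a stand-in for the client's list of
    # reachable replicas.)
    servers_in_lock_order = sorted(available_servers,
                                   key=lambda s: s.ip_address)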

Remember that a server that commits an operation bumps its own slot
in the version vector. So suppose Ac commits its write operation at
server A, and Bc commits its write at server B.

If the original version of the file was (1,1), the version of Ac's
update on server A will become (2,1), while the version of Bc's
update on server B will be (1,2). If Bc was somehow still able to
talk to server A and its update came in later, it would still be
detected as a local-global (client/server) conflict. But even if we
can't detect the client/server conflict at this time and both
operations are committed, when the network connectivity is restored
and any client accesses the object it will receive both version
vectors from the servers and notice the discrepancy. At this point
the client will inform the servers of the detected server/server
conflict.
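
The comparison the client does is simple enough to sketch here (plain
Python for illustration, not the actual client code):

    # Compare two version vectors: if each one has some slot that is
    # strictly larger than the other's, neither update subsumes the
    # other and we have a conflict.
    def compare_vv(v1, v2):
        ge = all(a >= b for a, b in zip(v1, v2))
        le = all(a <= b for a, b in zip(v1, v2))
        if ge and le:
            return 'equal'
        if ge:
            return 'dominant'      # v1 has seen everything v2 has
        if le:
            return 'submissive'    # v1 simply missed some updates
        return 'conflict'

    compare_vv((2, 1), (1, 2))     # -> 'conflict'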

When servers are told there is a conflict they attempt an automatic
resolution process. Some cases are quite simple: it could be that
server A was down for maintenance and simply missed several updates
(all the slots in its version vector are lower). Or it missed the
piggybacked 2nd phase commit message, which looks like a conflict but
leaves an identical client identifier/operation sequence number on
all servers. These cases are quickly recognised and the missed update
is simply copied from the most up-to-date replica.
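
Sketched in the same way (reusing compare_vv from above; this is only
an illustration of the two easy cases, not the actual resolution
code):

    # Easy-case resolution: either only the COP2 message was lost, or
    # one replica simply missed updates and can be refreshed.
    def try_simple_resolution(replicas, fid):
        vvs = [r.version_vector(fid) for r in replicas]
        # Case 1: the last client identifier/operation sequence number
        # is identical everywhere, so only the COP2 was lost; merge the
        # vectors by taking the per-slot maximum.
        if len({r.last_storeid(fid) for r in replicas}) == 1:
            merged = [max(slots) for slots in zip(*vvs)]
            for r in replicas:
                r.set_version_vector(fid, merged)
            return True
        # Case 2: some replica has seen every update (its vector
        # dominates all others); copy the object from it to the rest.
        for candidate in replicas:
            cvv = candidate.version_vector(fid)
            if all(compare_vv(cvv, vv) in ('dominant', 'equal') for vv in vvs):
                for r in replicas:
                    if r is not candidate:
                        r.copy_object_from(candidate, fid)
                return True
        return False   # genuinely conflicting updates; fall through to
                       # log-based resolution or manual repair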

Worst case, we have the given (1,2) and (2,1) conflicting version
vectors and different operation identifiers. Even here we can
automatically resolve most directory conflicts, because all servers
keep a log of recent operations and try to merge these logs and apply
the missed individual updates. Only once all resolution attempts have
failed is the object marked as a conflict on the servers and the
client notified. At this point a user will have to expand the
conflict into the various versions with the manual repair tool and
decide which replica he wants to keep, or create a merged copy of the
file to replace the existing versions.

> Or is it a limitation that files cannot be used simultaneously at two
> clients connected to two different duplicated servers.

I believe there is an implicit assumption that write-write sharing is
uncommon. But multiple readers and a single writer sharing the same
files is definitely not a problem and should not lead to conflicts,
unless the implementation has a bug somewhere.

There is some interesting reading about optimistic replication and
server resolution in Puneet's thesis,

    Mitigating the Effects of Optimistic Replication in a
    Distributed File System - Kumar, P.
    School of Computer Science, Carnegie Mellon University
    Dec. 1994, CMU-CS-94-215

It can be found among the other PhD theses on this page,
http://www-2.cs.cmu.edu/afs/cs/project/coda-www/ResearchWebPages/docs-coda.html

Jan