From:	MERC::"uunet!CRVAX.SRI.COM!RELAY-INFO-VAX" 25-AUG-1992 19:47:06.40
To:	info-vax@kl.sri.com
CC:	
Subj:	Re: RE: Clustered Global Section

In article <9208190446.AA26060@uu3.psi.com>, leichter@lrw.com (Jerry Leichter) writes:
> 	I'm just about to cluster a couple of VAXs.  On one VAX an application
> 	uses global sections to communicate between processes.  After
> 	clustering we want this application to be available across the
> 	cluster.
>
> 	Is there a problem?
>
> 	Yes, I know that you can't share memory across a cluster, but since
> 	global sections are mapped to files (but are all sections mapped to
> 	files?), and assuming that the files are available clusterwide then
> 	will it work?
>
> Well, it'll WORK, but it won't do what you seem to think!
>
> 	I guess it would be somewhat (or a lot) slower since it'd be disk I/O
> 	rather than memory access.
>
> When you share a global section among processes on a single CPU, they are
> accessing the same memory locations.  A change made by one process is visible
> to all the other processes within a few cycles.  (In theory, at the end of
> the writing instruction, but in fact only after a synchronization point is
> reached is this guaranteed.  For a VAX, synchronization points include all
> the interlocked instructions, and a couple of other things that don't matter
> for non-privileged code.)
>
> If you have a global section open on the same file in multiple members of a
> cluster, each member of the cluster is accessing local memory.  If one makes
> a change to its own local memory, that change will not be visible until the
> processor making the change pages that page out, and the process looking for
> the change pages it in.  VMS will NOT force this to happen in any way that
> would be useful for this kind of programming.  (It's possible to force half
> of it to happen yourself by doing a $UPDSEC on the writing processor, which
> will write the new data to disk.  There's no simple way to force the page to
> be read back in from the disk.  Performance would be terrible in any case.)
>
> For all practical purposes, global sections are limited to single nodes (with
> multiple CPU's perhaps) in a cluster.
>
> Sorry.
> 							-- Jerry
>
--

Jerry got it right.  However, this being one of those rare times I can add
something to what Jerry writes, I will do so:

I've implemented an application which works just as you seek.  There are
basically two solutions, which are very similar.  However, neither is for the
faint of heart, and if you are not very comfortable with the following
description you should probably pass on these solutions.

Further, if your application is not properly designed, these solutions will
offer poor performance (turning memory-speed accesses into two disk-I/O times).  Jerry
is (of course) correct that performance might be terrible, but for a
properly-designed application, this is not necessarily the case.

As further proof that this solution works, I note that Rdb global buffering
works in essentially the same fashion.  I think that Oracle parallel servers
work the same way, too, although I am not sure whether they use a section file as
opposed to direct I/Os from disk.

The basic idea is this:  Set up a lock to coordinate access to the global
section between cooperating processes. When an application updates data that it
wants to make available to other processes (on any node), it simply uses the
$UPDSEC system service to write the appropriate addresses to backing store on
disk.  This synchs the disk file with memory, at the cost of one disk IO.
Next, the updater takes out a lock that uses blocking ASTs to notify the other
(interested) processes that an update has occurred.  The lock value block
includes the address range of the interesting updates.

The "interested" processes hold locks that block the updater's request and
therefore have their blocking ASTs fired.  If they don't care about the update,
they simply do nothing and
re-queue their "interested" lock.  If they care, then one process on each node
needs to take out another lock (all try, but only one succeeds) and updates
memory from the section file.  The cost of this is another disk IO.  (There are
some rather nasty synchronization issues here, and not a few race conditions,
but it is definitely do-able with the lock manager and some creative
instantiation of lock names and parent-child stuff.)  Another solution is to
dedicate only one process on each node as the one who is responsible for
keeping memory synched, possibly even making that a separate process from user
processes.  But this is not strictly necessary.

The really tricky part is that seemingly innocuous line "updates memory from
the section file".  There is no system service to do this, for architectural
reasons dating back to VMS V1.0.  (This comes direct from a long discussion at
one DECUS symposium with Larry Kenah.)  The easiest way to do it is to (gulp)
post a QIO from the
correct addresses in the section file directly to the correct addresses in
memory.  Except for a small offset at the beginning of the section file (four
bytes, as I recall) the addresses are a direct map.  When this QIO has
completed, then memory on that node will now reflect the contents of the
section file, which, as you recall, was recently updated to reflect the
contents of memory on the updater node.

There are some possible gotchas here with respect to privileges, depending on
the nature of your section file, but nothing that can't be handled
straightforwardly.

Obviously, such a method will work only when the updates can be coordinated
between processes explicitly, not when arbitrary updates are happening randomly
and without explicit "knowledge" of the updater.  That is, the updater needs to
have relatively few places where update is happening, and update cannot be "all
the time".  If your application spends all its time writing all over the
section file, then this method isn't for you and you should stick to a single
node solution.  (As an aside, should the previous sentence describe your
application, then you will likely have some difficulties moving this
application to Alpha from VAX, so you may want to look at a re-design of your
application anyway.)

As I said, I've done exactly the above on two occasions.  With good application
design, you can in fact use section files and two disk IOs to keep memory on
different nodes in synch.

Good luck!

Phil

_________________________________________________________________________
Philip A. Naecker                Consulting Software Engineer
Internet: pan@propress.com       1010 East Union Street, Suite 101
          uunet!prowest!pan      Pasadena, CA  91106-1756
Voice:    +1 818 577 4820        FAX: +1 818 577 0073

Also:     Technology Editor, DEC Professional Magazine
          VAX Professional Magazine Review Board Member
_________________________________________________________________________