From - Tue Feb 23 19:33:22 1999
Path: reader1.news.rcn.net!feed1.news.rcn.net!rcn!feeder.qis.net!newsfeed.cwix.com!204.238.120.130!news-feeds.jump.net!nntp2.dejanews.com!nnrp1.dejanews.com!not-for-mail
From: kparris@my-dejanews.com
Newsgroups: comp.os.vms
Subject: Advice for Disaster-Tolerant and High-Availability VMS sites
Date: Tue, 23 Feb 1999 23:20:37 GMT
Organization: Deja News - The Leader in Internet Discussion
Lines: 69
Message-ID: <7avd3s$8ta$1@nnrp1.dejanews.com>
NNTP-Posting-Host: 205.226.153.127
X-Article-Creation-Date: Tue Feb 23 23:20:37 1999 GMT
X-Http-User-Agent: Mozilla/4.03 [en] (Win95; U ;Nav)
X-Http-Proxy: 1.0 x10.dejanews.com:80 (Squid/1.1.22) for client 205.226.153.127
Xref: reader1.news.rcn.net comp.os.vms:222476

As a consultant, I'd like to offer the following valuable but free advice to Disaster-Tolerant and high-availability VMS Cluster sites.

If anyone suggests that Unix can do any or all of the work your VMS cluster does, this is a sign that their brain is gone and their body has been invaded by aliens. Have no mercy; kill them immediately, for if you don't they will multiply rapidly and be an ever-increasing drain on monetary resources and management mind-share, possibly preventing you from having enough reserve capacity to ride smoothly through your next big surge in workload.

In a disaster-tolerant VMS Cluster, the GIGAswitches and WAN links between sites, despite bearing a vague resemblance to network equipment, are actually Cluster Interconnect Hardware. Don't let anyone connect anything to them.
That way, if some joker someday accidentally bridges your DECnet network to that of another company with the same DECnet area, and its DECnet router tries to take over and route your traffic from the far end of a T1 link, then at least your VMScluster nodes will be able to continue communicating unaffected, and one of your GIGAswitch pairs won't become confused by all the duplicate MAC addresses it is suddenly seeing and stop forwarding your SCS data.

If your workload is so intense, particularly in locking, that you've taken advantage of the Fast_Path feature to shift interrupt workload from CI adapters off of the primary CPU onto secondary CPUs, and thus avoid saturation of CPU 0 in interrupt state, be aware of a side effect of adding Memory Channel. Memory Channel might be very attractive because it can do lock requests with 260-microsecond latencies, compared with 485 microseconds for CI, and it can help off-load a star coupler which is near saturation. But it will shift the interrupt-state workload caused by inter-node SCS traffic, such as lock requests, back onto the primary CPU, which may saturate it again.

This will not prevent you from using Memory Channel, but it may force you to re-balance your locking workload in the short term, and to find other alternatives, such as additional CI adapters and star couplers, for nodes which, due to application design, may be the focus of more lock-mastership activity and thus most prone to interrupt-state saturation. In the longer term, it may even force you to modify certain pieces of an application which was initially developed in a non-clustered environment, so that it takes better advantage of the cluster environment and spreads the locking workload across more nodes.

Despite a strong Change Management process, little things can happen gradually over time and come back and bite you.
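For reference, the Fast_Path balancing described above is done with a couple of DCL commands. This is a rough sketch from memory of OpenVMS V7.x; the CI port name (PNA0) and the CPU number are illustrative, so check HELP SET DEVICE and your own configuration before using any of it:

```dcl
$! Enable the Fast_Path feature (SYSGEN parameter; takes effect at reboot)
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SET FAST_PATH 1
SYSGEN> WRITE CURRENT
SYSGEN> EXIT
$!
$! Move a CI port's interrupt servicing off CPU 0 onto a secondary CPU
$! (PNA0 is a hypothetical CI port device; 2 is a hypothetical CPU id)
$ SET DEVICE PNA0 /PREFERRED_CPUS=2
$!
$! Confirm which CPU is now handling the port
$ SHOW DEVICE PNA0 /FULL
```

The point of spreading ports across secondary CPUs is exactly the one made above: interrupt-state load that cannot be shifted (such as Memory Channel traffic landing on the primary) leaves you less headroom on CPU 0, so anything that can be moved should be.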
The very existence of a strong CM process may even hurt in this case, because it may concentrate too much attention on recent changes, such as an application software modification, which may be totally unrelated to the real problems. In fact, the real source of an I/O performance problem might be something like gradual accretion of files over months in a crucial directory file, such that it becomes so large that RMS can no longer cache it effectively. Gradual file fragmentation with growth, and gradually increasing internal disorganization of RMS files over time, might also contribute to an I/O slowdown. And firing off several copies of a batch job which heavily accesses files, at just the wrong time, can have unfortunate consequences if your I/O subsystem is near saturation.

I have become very skeptical about anything I read in the news now. Even if you have UPS systems and generators at two separate sites 130 miles apart, your systems may be reported as being down due to a power failure!

Finally, if you're a customer of Compaq's, the level of support you can receive when you're in trouble can bring tears of gratitude to your eyes. They might even send 8 or 10 people to your site on short notice, and have them work long hours to help you identify causes and find solutions, even while inaccurate news reports are flying around that could potentially damage a vendor's reputation.

-----------------------------------------------------------------------
Keith Parris|Integrity Computing,Inc.|parris@decuserve.decus.org-nospam
VMS Consulting: Clusters, Perf., Alpha porting, Storage&I/O, Internals

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/       Search, Read, Discuss, or Start Your Own
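P.S. For the gradual-accretion and RMS-disorganization problems described above, a couple of hedged DCL sketches. The file names, disk name, and size threshold are all illustrative, and qualifier details should be verified against HELP on your own system:

```dcl
$! Find directory files that have grown large enough to worry about
$! (an oversized .DIR file can defeat caching and slow every lookup)
$ DIRECTORY /SIZE=ALL /SELECT=SIZE=MINIMUM=127 DISK$DATA:[000000...]*.DIR
$!
$! Check an indexed RMS file for internal disorganization
$! (CUSTOMER.IDX is a hypothetical file name)
$ ANALYZE /RMS_FILE /STATISTICS CUSTOMER.IDX
$!
$! Rebuild a disorganized file: capture its structure in an FDL,
$! optimize the FDL, then CONVERT into a freshly organized copy
$ ANALYZE /RMS_FILE /FDL /OUTPUT=CUSTOMER.FDL CUSTOMER.IDX
$ EDIT /FDL /ANALYSIS=CUSTOMER.FDL /NOINTERACTIVE CUSTOMER.FDL
$ CONVERT /FDL=CUSTOMER.FDL /STATISTICS CUSTOMER.IDX CUSTOMER_NEW.IDX
```

Running checks like these on a schedule, rather than only after a slowdown appears, is one way to catch the "little things that happen gradually" before they bite.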