From - Tue Feb 23 19:33:22 1999
Path: reader1.news.rcn.net!feed1.news.rcn.net!rcn!feeder.qis.net!newsfeed.cwix.com!204.238.120.130!news-feeds.jump.net!nntp2.dejanews.com!nnrp1.dejanews.com!not-for-mail
From: kparris@my-dejanews.com
Newsgroups: comp.os.vms
Subject: Advice for Disaster-Tolerant and High-Availability VMS sites
Date: Tue, 23 Feb 1999 23:20:37 GMT
Organization: Deja News - The Leader in Internet Discussion
Lines: 69
Message-ID: <7avd3s$8ta$1@nnrp1.dejanews.com>
NNTP-Posting-Host: 205.226.153.127
X-Article-Creation-Date: Tue Feb 23 23:20:37 1999 GMT
X-Http-User-Agent: Mozilla/4.03 [en] (Win95; U ;Nav)
X-Http-Proxy: 1.0 x10.dejanews.com:80 (Squid/1.1.22) for client 205.226.153.127
Xref: reader1.news.rcn.net comp.os.vms:222476

As a consultant, I'd like to offer the following valuable but free advice to Disaster-Tolerant and high-availability VMS Cluster sites.

If anyone suggests that Unix can do any or all of the work your VMS cluster does, this is a sign that their brain is gone and their body has been invaded by aliens. Have no mercy; kill them immediately, for if you don't they will multiply rapidly and be an ever-increasing drain on monetary resources and management mind-share, possibly preventing you from having enough reserve capacity to ride smoothly through your next big surge in workload.

In a disaster-tolerant VMS Cluster, the GIGAswitches and WAN links between sites, despite bearing a vague resemblance to network equipment, are actually Cluster Interconnect Hardware. Don't let anyone connect anything to them.
That way, if some joker someday accidentally bridges your DECnet network to that of another company with the same DECnet area, and its DECnet router tries to take over and route your traffic from the far end of a T1 link, then at least your VMScluster nodes will be able to continue communicating unaffected, and one of your GIGAswitch pairs won't become confused by all the duplicate MAC addresses it is suddenly seeing and stop forwarding your SCS data.

If your workload is so intense, particularly in locking, that you've taken advantage of the Fast_Path feature to shift interrupt workload from CI adapters off of the primary CPU onto secondary CPUs, and thus avoid saturation of CPU 0 in interrupt state, be aware of a side effect of adding Memory Channel. Memory Channel might be very attractive because it can do lock requests with 260-microsecond latencies, compared with 485 microseconds for CI, and it can help off-load a star coupler which is near saturation. But it will shift the interrupt-state workload caused by inter-node SCS traffic, such as lock requests, back onto the primary CPU, which may saturate it again.

This will not prevent you from using Memory Channel, but it may force you to re-balance your locking workload in the short term, and to find other alternatives, such as additional CI adapters and star couplers, for nodes which, due to application design, may be the focus of more lock-mastership activity and thus most prone to interrupt-state saturation. In the longer term, it may even force you to modify certain pieces of an application which was initially developed in a non-clustered environment, so that it takes better advantage of the cluster environment and spreads the locking workload across more nodes.

Despite a strong Change Management process, little things can happen gradually over time and come back and bite you.
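For reference, the Fast_Path balancing described above is done with a couple of DCL commands. This is a rough sketch from memory of OpenVMS V7.x; the CI port name (PNA0) and the CPU number are illustrative, so check HELP SET DEVICE and your own configuration before using any of it:

```dcl
$! Enable the Fast_Path feature (SYSGEN parameter; takes effect at reboot)
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SET FAST_PATH 1
SYSGEN> WRITE CURRENT
SYSGEN> EXIT
$!
$! Move a CI port's interrupt servicing off CPU 0 onto a secondary CPU
$! (PNA0 is a hypothetical CI port device; 2 is a hypothetical CPU id)
$ SET DEVICE PNA0 /PREFERRED_CPUS=2
$!
$! Confirm which CPU is now handling the port
$ SHOW DEVICE PNA0 /FULL
```

The point of spreading ports across secondary CPUs is exactly the one made above: interrupt-state load that cannot be shifted (such as Memory Channel traffic landing on the primary) leaves you less headroom on CPU 0, so anything that can be moved should be.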
The very existence of a strong CM process may even hurt in this case, because it may concentrate too much attention on recent changes, such as an application software modification, which may be totally unrelated to the real problems. In fact, the real source of an I/O performance problem might be something like gradual accretion of files over months in a crucial directory file, such that it becomes so large that RMS can no longer cache it effectively. Gradual file fragmentation with growth, and gradually increasing internal disorganization of RMS files over time, might also contribute to an I/O slowdown. And firing off several copies of a batch job which heavily accesses files, at just the wrong time, can have unfortunate consequences if your I/O subsystem is near saturation.

I have become very skeptical about anything I read in the news now. Even if you have UPS systems and generators at two separate sites 130 miles apart, your systems may be reported as being down due to a power failure!

Finally, if you're a customer of Compaq's, the level of support you can receive when you're in trouble can bring tears of gratitude to your eyes. They might even send 8 or 10 people to your site on short notice, and have them work long hours to help you identify causes and find solutions, even while inaccurate news reports are flying around that could potentially damage a vendor's reputation.

-----------------------------------------------------------------------
Keith Parris|Integrity Computing,Inc.|parris@decuserve.decus.org-nospam
VMS Consulting: Clusters, Perf., Alpha porting, Storage&I/O, Internals

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/       Search, Read, Discuss, or Start Your Own
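P.S. For the gradual-accretion and RMS-disorganization problems described above, a couple of hedged DCL sketches. The file names, disk name, and size threshold are all illustrative, and qualifier details should be verified against HELP on your own system:

```dcl
$! Find directory files that have grown large enough to worry about
$! (an oversized .DIR file can defeat caching and slow every lookup)
$ DIRECTORY /SIZE=ALL /SELECT=SIZE=MINIMUM=127 DISK$DATA:[000000...]*.DIR
$!
$! Check an indexed RMS file for internal disorganization
$! (CUSTOMER.IDX is a hypothetical file name)
$ ANALYZE /RMS_FILE /STATISTICS CUSTOMER.IDX
$!
$! Rebuild a disorganized file: capture its structure in an FDL,
$! optimize the FDL, then CONVERT into a freshly organized copy
$ ANALYZE /RMS_FILE /FDL /OUTPUT=CUSTOMER.FDL CUSTOMER.IDX
$ EDIT /FDL /ANALYSIS=CUSTOMER.FDL /NOINTERACTIVE CUSTOMER.FDL
$ CONVERT /FDL=CUSTOMER.FDL /STATISTICS CUSTOMER.IDX CUSTOMER_NEW.IDX
```

Running checks like these on a schedule, rather than only after a slowdown appears, is one way to catch the "little things that happen gradually" before they bite.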