From: "Bill Todd"
Newsgroups: comp.os.vms
Subject: Re: Whither VMS?
Date: Thu, 15 Jul 1999 18:53:05 -0400
Organization: Why won't Outlook let me leave this blank?
Message-ID: <7mloh2$5ig$1@pyrite.mv.net>
References: <910612C07BCAD1119AF40000F86AF0D802CCDDE2@kaoexc4.kao.dec.com> <1999Jul13.013336.1@eisner> <7mh6hh$b66$1@pyrite.mv.net> <1999Jul15.013239.1@eisner>

Rob Young wrote in message news:1999Jul15.013239.1@eisner...

...

> Yes.. and in another thread without directly addressing this issue
> Hein writes:

Thought about responding to that, but it wasn't in this thread, and I
kind of felt that if Hein wanted to become involved here, he would
have. Nice to see he's still around: I remember working with him on a
few RMS-11 issues a long time ago.

> Before RMS global buffers go Galactic, I'd like to see them first
> go really big, outside of the working set, mapped with granularity
> hints (4MB superpages). That should be a super CPU win already.
>
> Next, I'd be inclined to think along the lines of what Oracle has
> called 'cache fusion'. That is, if a bucket is in some buffer on
> some system in the cluster, and needed somewhere else, then do NOT
> 'ping' it through the disk, but use a better communication mechanism
> to get it across. That mechanism might be a memory channel, galactic
> memory, or a kernel-assisted process-to-process copy in a single
> system.

Yup, that's the kind of distributed sharable cache I've been
advocating - but if you integrate it with some new locking mechanisms,
you get other significant improvements as well, both in the cache
itself and with RMS record locking.

> Yeah, you might still want to write to the disk, but the reader
> should not have to wait for it to come back from the disk, loading
> up the IO bus / controller / adapter twice.

The problem, at least in RMS indexed files, is that the careful-update
sequences required to maintain (possibly sub-optimal) index structure
integrity across crashes often have to keep buckets locked after a
modification is done until the disk transfer completes (something
similar is true for the single-block careful-update sequences ODS-2
performs). So the ability to pass dirty data without first writing it
to disk is largely restricted to sequential-file write-shared user
data, or to indexed-file data that a) does not cause bucket splits or
reclamations and b) has write-back caching semantics that allow
changes to be made visible to others before they are committed to
disk. Unless you change RMS to use a log to protect its integrity...
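To make that ordering constraint concrete, here's a minimal
single-process sketch in POSIX C. The names (bucket_lock,
update_bucket, BUCKET_SIZE) are hypothetical, not actual RMS
internals; the point is just that the lock cannot come off until the
disk has acknowledged the write:

/* Toy careful-update sequence: the bucket lock stays held until the
 * disk write is known durable, so a crash can never expose another
 * process to a half-updated bucket.  Hypothetical names throughout -
 * nothing here is actual RMS code. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUCKET_SIZE 512

static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;

static void update_bucket(int fd, off_t bucket_no, const char *rec)
{
    char bucket[BUCKET_SIZE];

    pthread_mutex_lock(&bucket_lock);              /* take bucket lock */
    if (pread(fd, bucket, BUCKET_SIZE,
              bucket_no * BUCKET_SIZE) != BUCKET_SIZE)
        memset(bucket, 0, BUCKET_SIZE);            /* fresh bucket     */
    strncpy(bucket, rec, BUCKET_SIZE - 1);         /* modify in cache  */

    /* Careful update: the modified bucket must be on disk before any
     * other process may see it, so the lock cannot be dropped until
     * the write *and* the flush have completed. */
    pwrite(fd, bucket, BUCKET_SIZE, bucket_no * BUCKET_SIZE);
    fsync(fd);                                     /* wait on the disk */
    pthread_mutex_unlock(&bucket_lock);            /* only now release */
}

int main(void)
{
    int fd = open("bucket.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    update_bucket(fd, 0, "record #1");
    close(fd);
    return 0;
}

With a recovery log protecting integrity, the unlock could move up to
the point where the log record is durable - which is what would let
dirty buckets be shipped node-to-node without the disk in the middle.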
> >> As the VMS DLM moves into shared memory and is distributed
> >> (surely must be, my conjecture) across several nodes (see Galaxy
> >> Locks thread) the CF suddenly isn't so hot after all.
> >
> > If you look at the relative performance of CF vs. locks in Galaxy
> > shared memory (including the mechanisms required to keep one
> > crashed system from bringing down the entire lock database),
> > they're likely about equal. But the CF still has the advantage
> > that it supports shared caching as well.
>
> But isn't it a natural move to migrate the XFC (eXtended File
> Cache) into shared memory?

Yup, just as natural a move as a distributed shared cache has been for
the last 15 years - i.e., being natural doesn't guarantee it will
happen any time soon.

> >> So when you talk about "going up against".. (hee-hee-hee) the
> >> IBM mainframe folks we'll see Alpha systems with Galaxy in the
> >> next several years that will be truly monstrous in comparison to
> >> large Sysplexes.
> >
> > As I suggested above, maybe, and maybe not. Not to mention RS/6000
> > clusters, which aren't likely to stand still in the interim (and,
> > of course, are already a full 64-bit architecture): they are not
> > an unreasonable alternative to VMS today for many (not all: Unix
> > still doesn't treat in-process asynchrony very well) applications,
> > and the Monterey initiative is going to make them significantly
> > more attractive in terms of providing a compatible application
> > architecture from the x86 on up. Give them a top-notch cluster
> > file system and the (appropriately modified) Sysplex version of
> > DB2 (or just run Oracle, since they can already run Oracle
> > Parallel Server on [truly] shared disks the same way VMS can) and
> > they may well be somewhat superior to VMS for the majority of
> > applications - they already support distributed shared memory and
> > a high-performance interconnect which can doubtless be improved if
> > seriously challenged by the Galaxy shared-memory speed.
>
> Yes... and of course at that point IBM must make a choice. As
> RS/6000 moves into and surpasses mainframe performance and what-not
> ... what to do, what to do.

Having problems competing with yourself sure beats having problems
competing with others. However, it does raise the interesting point
that IBM may try to walk a rather fine line between sharing the AIX
*interface* with its Monterey partners and sharing its underlying
*technology*, so as to maintain whatever advantages it can in the
latter while participating in a 'standard' environment - the same sort
of thing I was suggesting Compaq might try if it fielded a Monterey of
its own...

> >> > The underlying VMS cluster facilities are up to the task, but
> >> > the file system falls a bit short in areas of performance
> >>
> >> Think we've beat on that one a while before but worth mentioning
> >> again. If you have a Terabyte of memory isn't the filesystem
> >> mostly for writes? And if I have VCC_WRITEDELAY and VCC_WRITEBACK
> >> enabled, I'll race you, okay? :-)
> >
> > I'd be more than happy to race (metaphorically), as long as you
> > let me pull the power switch on both systems in the middle so we
> > can see how their robustness compares. What's that? You didn't
> > write back your file system meta-data and didn't have it logged?
> > Too bad... When it comes to performance and availability, I prefer
> > to have my cake and eat it too - especially if I can get better
> > performance on a given amount of hardware simply by bringing my
> > software up to contemporary designs.
>
> If I go back 2 years and change, and my memory serves me correctly
> (unwilling to do a Deja search, gotta get up at 5), a fellow
> mentioned to this group that many of the trickier aspects of Galaxy
> were "error-pathing". I believe that fellow reads these threads and
> may chime in.
>
> If I were to design the VCC_WRITE* stuff, I would take advantage
> of fault tolerant memory. Galactic slides from DFWLUG of May 98
> point out that memory allocation (in a future Galaxy phase) will
> include "fault tolerant" among the choices.
>
> Let's run a scenario.. I've got VCC_ turned on.. you pull the
> plug. I've got redundant battery backup, and I also have my
> VCC_ designed such that it uses fault tolerant memory allocation
> (typically 64 to 128 MByte is all that is needed, let's say)..
> as soon as my batteries kick in, the VCC_ master node flushes
> writes to recovery log(s) while attempting to post the writes. Ah,
> you say .. let's introduce a total power outage so you also lost
> your disk farm. Okay, but you made sure that your recovery log(s)
> (which is a search list of files) were on locally attached disk(s)
> powered by the batteries?

Yup, that works, but it requires yet more special hardware and the
special code to manage it. Whereas each cluster member in the
architecture I'm suggesting has a dedicated small, circular recovery
log on a shared, mirrored disk (or at least the log is mirrored if the
system has at least 2 disks). For better performance, place the logs
on their own dedicated disks. If you're after the best performance,
the log disks are solid-state: the logs are small, and you can place
multiple logs on each one. A transaction completes when its commit
record hits the persistent log, which is how consistency is maintained
regardless of the kind of failure, and does not require battery-backed
disks.
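Here's a minimal sketch of that per-node circular log, again in POSIX
C with invented names (log_rec, commit_txn - no actual VMS or RMS
format is implied). The only thing that matters is the ordering: the
transaction exists once, and only once, the fsync() behind its commit
record returns:

/* Minimal per-node circular recovery log: a transaction is complete
 * only once its commit record is durable; data pages can be written
 * back lazily afterwards.  Invented record layout. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define LOG_SIZE (64 * 1024)           /* small, fixed-size log      */

struct log_rec {
    uint32_t txn_id;
    uint32_t type;                     /* 0 = update, 1 = commit     */
    char     payload[56];
};

static off_t log_head;                 /* next write position        */

static void log_append(int fd, const struct log_rec *r)
{
    pwrite(fd, r, sizeof *r, log_head);
    log_head = (log_head + sizeof *r) % LOG_SIZE;  /* circular: wrap */
}

static void commit_txn(int fd, uint32_t txn, const char *change)
{
    struct log_rec r = { txn, 0, "" };
    snprintf(r.payload, sizeof r.payload, "%s", change);
    log_append(fd, &r);                /* update record; made durable
                                        * by the fsync below         */
    r.type = 1;                        /* commit record              */
    log_append(fd, &r);
    fsync(fd);                         /* transaction completes HERE */
}

int main(void)
{
    int fd = open("recovery.log", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    commit_txn(fd, 1, "bucket 42: insert record");
    close(fd);
    return 0;
}

Recovery then just scans the log and redoes any transaction whose
commit record is present, ignoring the rest - no batteries required.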
Other systems have used stable, mirrored memory to substitute for a
logging approach, but it has some drawbacks. First of all, as Larry K.
has recently pointed out, the mirroring of such memory must span
multiple Galaxy boxes if you're going to be able to survive a site (or
any single-box) disaster - which slows things down to a point
comparable to the speed of a solid-state disk access through a
VI-style connection. Then, of course, you need a bunch of special
recovery logic to make the mirrored stable memory work (as just one
example, when a single box fails, you need to elect another to
continue mirroring, allocate memory on it, and set up the new
correspondence...), whereas with a log you have a standard mechanism
available that anything (well, at least any system thing: unrestricted
user access could clog it) can use and that is not restricted to
environments supporting special hardware.
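Just to illustrate how much machinery even the simple failure case
drags in, here's a toy, in-memory simulation of that re-mirroring
chore. Everything in it (struct box, elect_new_mirror, remirror) is
invented for illustration; none of it corresponds to any actual Galaxy
interface:

/* Toy simulation of re-mirroring stable memory after a box failure. */
#include <stdio.h>
#include <string.h>

#define NBOXES 4
#define PAGES  8

struct box {
    int  alive;
    char pages[PAGES][64];             /* the mirrored stable memory */
};

static struct box boxes[NBOXES];

static int elect_new_mirror(int primary, int dead)
{
    for (int b = 0; b < NBOXES; b++)   /* simplest possible election */
        if (b != primary && b != dead && boxes[b].alive)
            return b;
    return -1;
}

static int remirror(int primary, int dead)
{
    int b = elect_new_mirror(primary, dead);
    if (b < 0)
        return -1;                     /* no survivor: mirror lost   */
    memcpy(boxes[b].pages, boxes[primary].pages,
           sizeof boxes[primary].pages); /* propagate surviving copy */
    printf("box %d is the new mirror of box %d\n", b, primary);
    return b;
}

int main(void)
{
    for (int b = 0; b < NBOXES; b++) boxes[b].alive = 1;
    strcpy(boxes[0].pages[0], "commit state");
    memcpy(boxes[1].pages, boxes[0].pages, sizeof boxes[0].pages);

    boxes[1].alive = 0;                /* the mirror box dies...     */
    remirror(0, 1);                    /* ...so rebuild the pair     */
    return 0;
}

And even this toy ignores the hard parts: fencing writes while the
copy is in flight, and surviving a second failure in the middle of
recovering from the first.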
Incidentally, I'm not acquainted with the VCC_* stuff you have been
referring to (though I can make some educated guesses): are any
descriptions available?

> > Then again, this isn't the first time that people have prophesied
> > that increasing amounts of memory would make file systems
> > effectively write-only - e.g., this was a large part of the
> > rationale behind log-structured file systems. Recent papers seem
> > to be backing away from this position, after having evaluated the
> > LSFSs that are available for inspection.
>
> Point me to a paper where they are talking about a Terabyte
> of memory and hundreds of Gigabytes of shared cache, not
> distributed. Where do they describe the drawbacks of that? I am
> curious and am willing to read such a paper.

Sorry - I would have included a reference if I had one. It's just
something I ran across fairly recently. While I'm sure it wasn't
talking about a TB of cache, it was discussing caches of reasonable
size (i.e., where the cost of the cache memory exceeded that of the
disk storage backing it up - which is the point where customers start
asking for more cost-efficient approaches than brute-force caching).

> I have followed Spiralog's ups and downs. I don't understand all
> that took place. I thought for sure Spiralog was the next great
> thing. Apparently, read caching was a pain or limiter (my fuzzy
> recollection). Sometimes much can be learned from failures or
> abandoned projects. Project Prism's cancellation was the ashes
> that Alpha rose from. *Apparently* the next wave of IO caching
> for VMS (VCC) is an outgrowth of Spiralog IIRC.

Ahem. I'm afraid I've always held the opinion that log-structured
approaches were suitable only for specialized applications (in-memory
databases are one, being the logical equivalent of a cache large
enough to hold all the data), so I'll try not to be too smug. There
are just too many ways to avoid write overhead in update-in-place
environments, and so while Spiralog missed some opportunities of its
own, the fact that ODS-2 remained competitive while also lacking many
possible enhancements suggests that for general-purpose applications
update-in-place retains its superiority.
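For anyone who hasn't stared at the two disciplines side by side,
here's a toy contrast (C again, invented names, nothing Spiralog- or
ODS-2-specific). Update-in-place rewrites a block at its home address;
a log-structured scheme appends the new version, remaps the block, and
leaves a dead copy behind for a cleaner to reclaim - which is where
much of the LSFS write advantage gets paid back:

/* Toy contrast of update-in-place vs. log-structured writes. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 8
#define BSIZE   32

static char disk[NBLOCKS][BSIZE];
static int  map[NBLOCKS];              /* logical -> physical block  */
static int  log_tail;                  /* next free slot in the log  */

static void write_in_place(int blk, const char *data)
{
    strncpy(disk[map[blk]], data, BSIZE - 1); /* one write, no remap */
}

static void write_log_structured(int blk, const char *data)
{
    strncpy(disk[log_tail], data, BSIZE - 1); /* append at the tail  */
    map[blk] = log_tail++;             /* old copy is now dead space;
                                        * a cleaner must reclaim it
                                        * (no wrap in this toy)      */
}

int main(void)
{
    for (int i = 0; i < NBLOCKS; i++) map[i] = i;
    log_tail = 2;                      /* pretend blocks 0-1 are live */

    write_in_place(0, "v2 of block 0");
    write_log_structured(1, "v2 of block 1");
    printf("block 1 now lives at physical block %d\n", map[1]);
    return 0;
}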
Prism's cancellation was a monumental triumph of corporate stupidity
and politics. The fact that Alpha was retrieved from its ashes was a
lucky accident aided by Bob Supnik's ability to push the hardware
through a first pre-production phase to prove its value. It's not as
if the company *learned* something from Prism which then enabled it to
create Alpha: Alpha is basically mildly-massaged Prism hardware
retrieved after the fact, though of course it has since benefited from
several additional generations of development. Too bad it doesn't have
the rest of Prism as well...

> So maybe my crutch is large memories. So? Big systems of
> the future (2-5 years) will have massive memory and hundreds
> of CPUs, unless you subscribe to the "attack of the killer 8-CPU
> node cluster" school.

And petabytes of disk storage, which will keep caches at about the
same percentage of disk capacity that they are today: everything gets
cheaper at a roughly similar rate, data items become larger (not all
that space is lengthy video clips: a lot of it is static images or
similar items accessed randomly, just like text is today), and the
balance doesn't change much.

> Give me 2 systems and I can run the New York Stock Exchange off of
> them. Maybe not so far fetched.

Yup - data in special-purpose systems won't change as fast as
general-purpose data, so increased caching will provide at least a
temporary advantage.

> I think Unix has lost its momentum. NT is rocketing, and unless
> the government pulls the plug NT will dominate the high end
> 5 years and out. Marketing, market share, development, etc.
> 8-CPU boxes today, more nodes in the cluster for NT.
>
> Desktop to DataCenter, one OS. Unix lost the desktop (never had
> it), is losing midrange penetration hand over fist (file and
> print), and the DataCenter is only a matter of time.

Though MS would like to have people believe otherwise, NT is rocketing
only into non-critical off-the-desktop areas: it's simply not stable
enough for important use. I'm beginning to believe that MS is
intrinsically incapable of producing a competitively stable offering -
ever - and if I'm wrong about that, I'll assert that NT itself cannot
be made stable enough, hence a new OS will be required, hence
opportunity exists in that space. And if I'm even wrong there, given
how long it has taken to get W2K out the door (assuming it doesn't
slip yet again), just how long do you think it will take to make it
stable, let alone produce a 64-bit successor?

[I *still* think there may be some value in having VMS cluster in a
file-system-only manner with NT, though: in that context it doesn't
matter whether NT crashes a lot, and there's a lot less indication
that NT is prone to bouts of insanity where it deliberately trashes
shared data - though I'd have some reservations about using it in a
Galaxy environment without hard memory fencing.]

Lowish-end file and print service is about NT's limit - and even that
requires NT clusters to provide adequate availability. The more
centralized department-level and higher facilities are far less
dependent on a common GUI (NT's principal advantage over other
systems) and far more dependent upon availability, scalability, and
performance.

> > behemoth on the other, can VMS really afford to pass up
> > opportunities for significant improvements? If VMS does pass them
> > up, can Compaq convince customers that it's really backing VMS
> > for the long term?
>
> What do you mean? Can you cite a specific example? What
> opportunities has VMS passed up at an engineering level that
> would make VMS stronger?

I listed 6 areas (7/10/99 7:56 P.M., then added one more - 7/11/99
2:01 P.M.) where VMS could improve its existing data management
offerings more than trivially, and have yet to see a detailed
response. I'm not well-acquainted with the system-management end of
things, but my impression is that while VMS has done a fair amount of
work in this area it could still learn some things from IBM about
reducing or simplifying human involvement.