From: "Bill Todd"
Newsgroups: comp.os.vms
Subject: Re: Whither VMS?
Date: Thu, 15 Jul 1999 18:53:05 -0400
Organization: Why won't Outlook let me leave this blank?
Message-ID: <7mloh2$5ig$1@pyrite.mv.net>
References: <910612C07BCAD1119AF40000F86AF0D802CCDDE2@kaoexc4.kao.dec.com> <1999Jul13.013336.1@eisner> <7mh6hh$b66$1@pyrite.mv.net> <1999Jul15.013239.1@eisner>

Rob Young wrote in message news:1999Jul15.013239.1@eisner...

...

> Yes.. and in another thread without directly addressing this issue
> Hein writes:

Thought about responding to that, but it wasn't in this thread, and I
kind of felt that if Hein wanted to become involved here, he would
have. Nice to see he's still around: I remember working with him on a
few RMS-11 issues a long time ago.

> Before RMS global buffers go Galactic, I'd like to see them first
> go really big, outside of the working set, mapped with granularity
> hints (4MB superpages). That should be a super CPU win already.
>
> Next, I'd be inclined to think along the lines of what Oracle has
> called 'cache fusion'. That is, if a bucket is in some buffer on
> some system in the cluster, and needed somewhere else, then do NOT
> 'ping' it through the disk, but use a better communication mechanism
> to get it across. That mechanism might be a memory channel, galactic
> memory, or a kernel-assisted process-to-process copy in a single
> system.

Yup, that's the kind of distributed sharable cache I've been
advocating - but if you integrate it with some new locking mechanisms,
you get other significant improvements as well, both in the cache
itself and with RMS record locking.

> Yeah, you might still want to write to the disk, but the reader
> should not have to wait for it to come back from the disk, loading
> up the IO bus / controller / adapter twice.

The problem, at least in RMS indexed files, is that the careful-update
sequences required to maintain (possibly sub-optimal) index structure
integrity across crashes often have to keep buckets locked after a
modification is done until the disk transfer completes (something
similar is true for the single-block careful-update sequences ODS-2
performs). So the ability to pass dirty data without first writing it
to disk is largely restricted to sequential-file write-shared user
data, or to indexed-file data that a) does not cause bucket splits or
reclamations and b) has write-back caching semantics that allow
changes to be made visible to others before they are committed to
disk. Unless you change RMS to use a log to protect its integrity...
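To make that ordering constraint concrete, here's a minimal
single-process sketch in POSIX C. The names (bucket_lock,
update_bucket, BUCKET_SIZE) are hypothetical, not actual RMS
internals; the point is just that the lock cannot come off until the
disk has acknowledged the write:

/* Toy careful-update sequence: the bucket lock stays held until the
 * disk write is known durable, so a crash can never expose another
 * process to a half-updated bucket.  Hypothetical names throughout -
 * nothing here is actual RMS code. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUCKET_SIZE 512

static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;

static void update_bucket(int fd, off_t bucket_no, const char *rec)
{
    char bucket[BUCKET_SIZE];

    pthread_mutex_lock(&bucket_lock);              /* take bucket lock */
    if (pread(fd, bucket, BUCKET_SIZE,
              bucket_no * BUCKET_SIZE) != BUCKET_SIZE)
        memset(bucket, 0, BUCKET_SIZE);            /* fresh bucket     */
    strncpy(bucket, rec, BUCKET_SIZE - 1);         /* modify in cache  */

    /* Careful update: the modified bucket must be on disk before any
     * other process may see it, so the lock cannot be dropped until
     * the write *and* the flush have completed. */
    pwrite(fd, bucket, BUCKET_SIZE, bucket_no * BUCKET_SIZE);
    fsync(fd);                                     /* wait on the disk */
    pthread_mutex_unlock(&bucket_lock);            /* only now release */
}

int main(void)
{
    int fd = open("bucket.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    update_bucket(fd, 0, "record #1");
    close(fd);
    return 0;
}

With a recovery log protecting integrity, the unlock could move up to
the point where the log record is durable - which is what would let
dirty buckets be shipped node-to-node without the disk in the middle.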
> >> As the VMS DLM moves into shared memory and is distributed
> >> (surely must be, my conjecture) across several nodes (see Galaxy
> >> Locks thread) the CF suddenly isn't so hot after all.
> >
> > If you look at the relative performance of CF vs. locks in Galaxy
> > shared memory (including the mechanisms required to keep one
> > crashed system from bringing down the entire lock database),
> > they're likely about equal. But the CF still has the advantage
> > that it supports shared caching as well.
>
> But isn't it a natural move to migrate the XFC (eXtended File
> Cache) into shared memory?

Yup, just as natural a move as a distributed shared cache has been for
the last 15 years - i.e., being natural doesn't guarantee it will
happen any time soon.

> >> So when you talk about "going up against".. (hee-hee-hee) the
> >> IBM mainframe folks we'll see Alpha systems with Galaxy in the
> >> next several years that will be truly monstrous in comparison to
> >> large Sysplexes.
> >
> > As I suggested above, maybe, and maybe not. Not to mention RS/6000
> > clusters, which aren't likely to stand still in the interim (and,
> > of course, are already a full 64-bit architecture): they are not
> > an unreasonable alternative to VMS today for many (not all: Unix
> > still doesn't treat in-process asynchrony very well) applications,
> > and the Monterey initiative is going to make them significantly
> > more attractive in terms of providing a compatible application
> > architecture from the x86 on up. Give them a top-notch cluster
> > file system and the (appropriately modified) Sysplex version of
> > DB2 (or just run Oracle, since they can already run Oracle
> > Parallel Server on [truly] shared disks the same way VMS can) and
> > they may well be somewhat superior to VMS for the majority of
> > applications - they already support distributed shared memory and
> > a high-performance interconnect which can doubtless be improved if
> > seriously challenged by the Galaxy shared-memory speed.
>
> Yes... and of course at that point IBM must make a choice. As
> RS/6000 moves into and surpasses mainframe performance and what-not
> ... what to do, what to do.

Having problems competing with yourself sure beats having problems
competing with others. However, it does raise the interesting point
that IBM may try to walk a rather fine line between sharing the AIX
*interface* with its Monterey partners and sharing its underlying
*technology*, so as to maintain whatever advantages it can in the
latter while participating in a 'standard' environment - the same sort
of thing I was suggesting Compaq might try if it fielded a Monterey of
its own...

> >> > The underlying VMS cluster facilities are up to the task, but
> >> > the file system falls a bit short in areas of performance
> >>
> >> Think we've beat on that one a while before but worth mentioning
> >> again. If you have a Terabyte of memory isn't the filesystem
> >> mostly for writes? And if I have VCC_WRITEDELAY and VCC_WRITEBACK
> >> enabled, I'll race you, okay? :-)
> >
> > I'd be more than happy to race (metaphorically), as long as you
> > let me pull the power switch on both systems in the middle so we
> > can see how their robustness compares. What's that? You didn't
> > write back your file system meta-data and didn't have it logged?
> > Too bad... When it comes to performance and availability, I prefer
> > to have my cake and eat it too - especially if I can get better
> > performance on a given amount of hardware simply by bringing my
> > software up to contemporary designs.
>
> If I go back 2 years and change, and my memory serves me correctly
> (unwilling to do a Deja search, gotta get up at 5), a fellow
> mentioned to this group that many of the trickier aspects of Galaxy
> were "error-pathing". I believe that fellow reads these threads and
> may chime in.
>
> If I were to design the VCC_WRITE* stuff, I would take advantage
> of fault tolerant memory. Galactic slides from DFWLUG of May 98
> point out that memory allocation (in a future Galaxy phase) will
> include "fault tolerant" among the choices.
>
> Let's run a scenario.. I've got VCC_ turned on.. you pull the
> plug. I've got redundant battery backup, and I also have my
> VCC_ designed such that it uses fault tolerant memory allocation
> (typically 64 to 128 MByte is all that is needed, let's say)..
> as soon as my batteries kick in, the VCC_ master node flushes
> writes to recovery log(s) while attempting to post the writes. Ah,
> you say .. let's introduce a total power outage so you also lost
> your disk farm. Okay, but you made sure that your recovery log(s)
> (which is a search list of files) were on locally attached disk(s)
> powered by the batteries?

Yup, that works, but it requires yet more special hardware and the
special code to manage it. Whereas each cluster member in the
architecture I'm suggesting has a dedicated small, circular recovery
log on a shared, mirrored disk (or at least the log is mirrored if the
system has at least 2 disks). For better performance, place the logs
on their own dedicated disks. If you're after the best performance,
the log disks are solid-state: the logs are small, and you can place
multiple logs on each one. A transaction completes when its commit
record hits the persistent log, which is how consistency is maintained
regardless of the kind of failure, and does not require battery-backed
disks.
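Here's a minimal sketch of that per-node circular log, again in POSIX
C with invented names (log_rec, commit_txn - no actual VMS or RMS
format is implied). The only thing that matters is the ordering: the
transaction exists once, and only once, the fsync() behind its commit
record returns:

/* Minimal per-node circular recovery log: a transaction is complete
 * only once its commit record is durable; data pages can be written
 * back lazily afterwards.  Invented record layout. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define LOG_SIZE (64 * 1024)           /* small, fixed-size log      */

struct log_rec {
    uint32_t txn_id;
    uint32_t type;                     /* 0 = update, 1 = commit     */
    char     payload[56];
};

static off_t log_head;                 /* next write position        */

static void log_append(int fd, const struct log_rec *r)
{
    pwrite(fd, r, sizeof *r, log_head);
    log_head = (log_head + sizeof *r) % LOG_SIZE;  /* circular: wrap */
}

static void commit_txn(int fd, uint32_t txn, const char *change)
{
    struct log_rec r = { txn, 0, "" };
    snprintf(r.payload, sizeof r.payload, "%s", change);
    log_append(fd, &r);                /* update record; made durable
                                        * by the fsync below         */
    r.type = 1;                        /* commit record              */
    log_append(fd, &r);
    fsync(fd);                         /* transaction completes HERE */
}

int main(void)
{
    int fd = open("recovery.log", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    commit_txn(fd, 1, "bucket 42: insert record");
    close(fd);
    return 0;
}

Recovery then just scans the log and redoes any transaction whose
commit record is present, ignoring the rest - no batteries required.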
Other systems have used stable, mirrored memory to substitute for a
logging approach, but it has some drawbacks. First of all, as Larry K.
has recently pointed out, the mirroring of such memory must span
multiple Galaxy boxes if you're going to be able to survive a site (or
any single-box) disaster - which slows things down to a point
comparable to the speed of a solid-state disk access through a
VI-style connection. Then, of course, you need a bunch of special
recovery logic to make the mirrored stable memory work (as just one
example, when a single box fails, you need to elect another to
continue mirroring, allocate memory on it, and set up the new
correspondence...), whereas with a log you have a standard mechanism
available that anything (well, at least any system thing: unrestricted
user access could clog it) can use and that is not restricted to
environments supporting special hardware.
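Just to illustrate how much machinery even the simple failure case
drags in, here's a toy, in-memory simulation of that re-mirroring
chore. Everything in it (struct box, elect_new_mirror, remirror) is
invented for illustration; none of it corresponds to any actual Galaxy
interface:

/* Toy simulation of re-mirroring stable memory after a box failure. */
#include <stdio.h>
#include <string.h>

#define NBOXES 4
#define PAGES  8

struct box {
    int  alive;
    char pages[PAGES][64];             /* the mirrored stable memory */
};

static struct box boxes[NBOXES];

static int elect_new_mirror(int primary, int dead)
{
    for (int b = 0; b < NBOXES; b++)   /* simplest possible election */
        if (b != primary && b != dead && boxes[b].alive)
            return b;
    return -1;
}

static int remirror(int primary, int dead)
{
    int b = elect_new_mirror(primary, dead);
    if (b < 0)
        return -1;                     /* no survivor: mirror lost   */
    memcpy(boxes[b].pages, boxes[primary].pages,
           sizeof boxes[primary].pages); /* propagate surviving copy */
    printf("box %d is the new mirror of box %d\n", b, primary);
    return b;
}

int main(void)
{
    for (int b = 0; b < NBOXES; b++) boxes[b].alive = 1;
    strcpy(boxes[0].pages[0], "commit state");
    memcpy(boxes[1].pages, boxes[0].pages, sizeof boxes[0].pages);

    boxes[1].alive = 0;                /* the mirror box dies...     */
    remirror(0, 1);                    /* ...so rebuild the pair     */
    return 0;
}

And even this toy ignores the hard parts: fencing writes while the
copy is in flight, and surviving a second failure in the middle of
recovering from the first.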
Incidentally, I'm not acquainted with the VCC_* stuff you have been
referring to (though I can make some educated guesses): are any
descriptions available?

> > Then again, this isn't the first time that people have prophesied
> > that increasing amounts of memory would make file systems
> > effectively write-only - e.g., this was a large part of the
> > rationale behind log-structured file systems. Recent papers seem
> > to be backing away from this position, after having evaluated the
> > LSFSs that are available for inspection.
>
> Point me to a paper where they are talking about a Terabyte
> of memory and hundreds of Gigabytes of shared cache, not
> distributed. Where do they describe the drawbacks of that? I am
> curious and am willing to read such a paper.

Sorry - I would have included a reference if I had one. It's just
something I ran across fairly recently. While I'm sure it wasn't
talking about a TB of cache, it was discussing caches of reasonable
size (i.e., where the cost of the cache memory exceeded that of the
disk storage backing it up - which is the point where customers start
asking for more cost-efficient approaches than brute-force caching).

> I have followed Spiralog's ups and downs. I don't understand all
> that took place. I thought for sure Spiralog was the next great
> thing. Apparently, read caching was a pain or limiter (my fuzzy
> recollection). Sometimes much can be learned from failures or
> abandoned projects. Project Prism's cancellation was the ashes
> that Alpha rose from. *Apparently* the next wave of IO caching
> for VMS (VCC) is an outgrowth of Spiralog IIRC.

Ahem. I'm afraid I've always held the opinion that log-structured
approaches were suitable only for specialized applications (in-memory
databases are one, being the logical equivalent of a cache large
enough to hold all the data), so I'll try not to be too smug. There
are just too many ways to avoid write overhead in update-in-place
environments, and so while Spiralog missed some opportunities of its
own, the fact that ODS-2 remained competitive while also lacking many
possible enhancements suggests that for general-purpose applications
update-in-place retains its superiority.
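For anyone who hasn't stared at the two disciplines side by side,
here's a toy contrast (C again, invented names, nothing Spiralog- or
ODS-2-specific). Update-in-place rewrites a block at its home address;
a log-structured scheme appends the new version, remaps the block, and
leaves a dead copy behind for a cleaner to reclaim - which is where
much of the LSFS write advantage gets paid back:

/* Toy contrast of update-in-place vs. log-structured writes. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 8
#define BSIZE   32

static char disk[NBLOCKS][BSIZE];
static int  map[NBLOCKS];              /* logical -> physical block  */
static int  log_tail;                  /* next free slot in the log  */

static void write_in_place(int blk, const char *data)
{
    strncpy(disk[map[blk]], data, BSIZE - 1); /* one write, no remap */
}

static void write_log_structured(int blk, const char *data)
{
    strncpy(disk[log_tail], data, BSIZE - 1); /* append at the tail  */
    map[blk] = log_tail++;             /* old copy is now dead space;
                                        * a cleaner must reclaim it
                                        * (no wrap in this toy)      */
}

int main(void)
{
    for (int i = 0; i < NBLOCKS; i++) map[i] = i;
    log_tail = 2;                      /* pretend blocks 0-1 are live */

    write_in_place(0, "v2 of block 0");
    write_log_structured(1, "v2 of block 1");
    printf("block 1 now lives at physical block %d\n", map[1]);
    return 0;
}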
Prism's cancellation was a monumental triumph of corporate stupidity
and politics. The fact that Alpha was retrieved from its ashes was a
lucky accident aided by Bob Supnik's ability to push the hardware
through a first pre-production phase to prove its value. It's not as
if the company *learned* something from Prism which then enabled it to
create Alpha: Alpha is basically mildly-massaged Prism hardware
retrieved after the fact, though of course it has since benefited from
several additional generations of development. Too bad it doesn't have
the rest of Prism as well...

> So maybe my crutch is large memories. So? Big systems of
> the future (2-5 years) will have massive memory and hundreds
> of CPUs, unless you subscribe to the "attack of the killer 8-CPU
> node cluster" school.

And petabytes of disk storage, which will keep caches at about the
same percentage of disk capacity that they are today: everything gets
cheaper at a roughly similar rate, data items become larger (not all
that space is lengthy video clips: a lot of it is static images or
similar items accessed randomly, just like text is today), and the
balance doesn't change much.

> Give me 2 systems and I can run the New York Stock Exchange off of
> them. Maybe not so far fetched.

Yup - data in special-purpose systems won't change as fast as
general-purpose data, so increased caching will provide at least a
temporary advantage.

> I think Unix has lost its momentum. NT is rocketing, and unless
> the government pulls the plug NT will dominate the high end
> 5 years and out. Marketing, market share, development, etc.
> 8-CPU boxes today, more nodes in the cluster for NT.
>
> Desktop to DataCenter, one OS. Unix lost the desktop (never had
> it), is losing midrange penetration hand over fist (file and
> print), and the DataCenter is only a matter of time.

Though MS would like to have people believe otherwise, NT is rocketing
only into non-critical off-the-desktop areas: it's simply not stable
enough for important use. I'm beginning to believe that MS is
intrinsically incapable of producing a competitively stable offering -
ever - and if I'm wrong about that, I'll assert that NT itself cannot
be made stable enough, hence a new OS will be required, hence
opportunity exists in that space. And if I'm even wrong there, given
how long it has taken to get W2K out the door (assuming it doesn't
slip yet again), just how long do you think it will take to make it
stable, let alone produce a 64-bit successor?

[I *still* think there may be some value in having VMS cluster in a
file-system-only manner with NT, though: in that context it doesn't
matter whether NT crashes a lot, and there's a lot less indication
that NT is prone to bouts of insanity where it deliberately trashes
shared data - though I'd have some reservations about using it in a
Galaxy environment without hard memory fencing.]

Lowish-end file and print service is about NT's limit - and even that
requires NT clusters to provide adequate availability. The more
centralized department-level and higher facilities are far less
dependent on a common GUI (NT's principal advantage over other
systems) and far more dependent upon availability, scalability, and
performance.

> > behemoth on the other, can VMS really afford to pass up
> > opportunities for significant improvements? If VMS does pass them
> > up, can Compaq convince customers that it's really backing VMS
> > for the long term?
>
> What do you mean? Can you cite a specific example? What
> opportunities has VMS passed up at an engineering level that
> would make VMS stronger?

I listed 6 areas (7/10/99 7:56 P.M., then added one more - 7/11/99
2:01 P.M.) where VMS could improve its existing data management
offerings more than trivially, and have yet to see a detailed
response. I'm not well-acquainted with the system-management end of
things, but my impression is that while VMS has done a fair amount of
work in this area it could still learn some things from IBM about
reducing or simplifying human involvement.