From: Bill Todd [billtodd@foo.mv.com]
Sent: Saturday, July 10, 1999 7:56 PM
To: Info-VAX@Mvb.Saic.Com
Subject: Re: Whither VMS?

Main, Kerry wrote in message
news:910612C07BCAD1119AF40000F86AF0D802CCDDE2@kaoexc4.kao.dec.com...

> Just so I understand a bit more .. You do realize that within an OpenVMS
> cluster, the common clustered file system that is part of OpenVMS and
> various shadowing products (both host based and controller based) means that
> no matter what node a user logs into (even in a multi-site cluster), their
> files are local to them ? In a properly configured multi-site cluster, the
> user does not even know which system or which datacenter is providing his
> computing resources.

Yes, and the truly shared-disk nature of VMS clusters (at least when using HSx controllers - privately-served pseudo-shared disks are the next-best alternative), carried all the way up through the file system (and, I believe, DBMS and Rdb), is one of their greatest virtues over most of the competition.

S/390 Parallel Sysplex works the same way for low-level access, and VSAM and IMS (I think - it's possible IMS works more like DB2 below) work pretty much the same way up through the higher levels: cluster members have their own private caches which cannot share each other's data but which are kept coherent through invalidation mechanisms. Oracle has also worked this way until very recently: they may have just implemented the ability to share undirtied cache contents across cluster members, and reportedly are working on being able to share dirty data as well before long.

But I don't believe *any* existing cluster implementation supports such facilities at the file level, and they can offer real improvements in both access speed and total throughput. Having to go to disk every time you need data that isn't in your own cache is expensive when any real inter-node sharing is going on. And since you have to take the hit of distributed interlocks in any event, a sharable distributed cache is virtually free (at least with the speed of today's interconnects): with only minor additions to the total message count, the existing locking messages can be extended to support the cache mechanism as well (sketched below).

I seem to remember (dimly, now) that RMS went through an analogous progression long ago. In The Beginning, any RMS caching was strictly per-process. Somewhere around VMS V2 a global buffering mechanism was added so that processes on the same machine could share RMS data. Then along came clusters, and the global buffer mechanism may have had to be scrapped, at least temporarily, since the new distributed facilities had trouble dealing with the additional level of organization. Later it may have returned - and more recently some kind of node-local sharable file system cache (I forget what it's called) was added so that processes, though again only on a single node, can share cached data. But even today, more than 15 years after clusters appeared, *nodes* still can't share cached data with each other at either the file system or the RMS level. If sharing within a node is good, and distributed processing over the same data is valuable (in some situations it's absolutely *essential*), then sharing between nodes is important.
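Purely to make the mechanism concrete, here's a toy sketch in Python of the idea: the messages you already pay for to get a distributed lock can also confirm, ship, or invalidate cached copies. Everything here is invented for the example (the names, the message shapes, the write-through policy) - it is not how VMS, the DLM, or any of the products above actually implement this.

    # Illustrative sketch only: piggybacking cache coherence on existing
    # lock-manager traffic.  All names (Cluster, Node, Grant) are made up.

    class Grant:
        def __init__(self, status, version, data=None):
            self.status = status      # "valid", "shipped", or "miss"
            self.version = version    # current version of the block
            self.data = data          # block contents, when shipped from a peer

    class Cluster:
        """Stands in for the lock manager plus the shared disk."""
        def __init__(self, disk):
            self.disk = disk          # block_id -> data (the 'shared disk')
            self.version = {}         # block_id -> current version number
            self.nodes = []

        def acquire_read(self, block_id, held_version):
            # The read-lock request reports the version the requester already
            # caches, so the grant can answer in one of three ways - with no
            # messages beyond the lock traffic itself.
            current = self.version.get(block_id, 0)
            if held_version == current:
                return Grant("valid", current)          # your copy is still good
            for node in self.nodes:                     # ship from a peer's cache
                entry = node.cache.get(block_id)
                if entry is not None and entry[1] == current:
                    return Grant("shipped", current, entry[0])
            return Grant("miss", current)               # nobody has it: go to disk

        def acquire_write(self, block_id):
            # An exclusive lock bumps the version, implicitly invalidating
            # every other node's cached copy.
            self.version[block_id] = self.version.get(block_id, 0) + 1
            return self.version[block_id]

    class Node:
        def __init__(self, cluster):
            self.cluster = cluster
            self.cache = {}           # block_id -> (data, version)
            cluster.nodes.append(self)

        def read(self, block_id):
            held = self.cache.get(block_id)
            grant = self.cluster.acquire_read(block_id,
                                              held[1] if held else -1)
            if grant.status == "valid":
                return held[0]                          # confirmed cache hit
            if grant.status == "shipped":
                self.cache[block_id] = (grant.data, grant.version)
                return grant.data                       # from a peer, no disk I/O
            data = self.cluster.disk[block_id]          # only a true miss hits disk
            self.cache[block_id] = (data, grant.version)
            return data

        def write(self, block_id, data):
            version = self.cluster.acquire_write(block_id)
            self.cache[block_id] = (data, version)
            self.cluster.disk[block_id] = data          # write-through, for simplicity

    # Two nodes sharing one block:
    cluster = Cluster(disk={7: b"payroll record"})
    a, b = Node(cluster), Node(cluster)
    a.read(7)            # true miss: touches 'disk'
    b.read(7)            # shipped from a's cache, no disk I/O
    b.write(7, b"updated payroll record")
    a.read(7)            # a's copy is stale; shipped fresh from b

The thing to notice is that the only time a node touches the disk in this toy is on a true cluster-wide miss; every other case is resolved by the same messages that carried the lock traffic.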
DB2 on a Sysplex, however, can use the shared cache provided by the Coupling Facility (CF) both to share data between members (at least in a limited fashion) and, perhaps more significantly, as a shared write-back cache for dirtied data (which is protected by the logs against crashes before being written back to disk).

The situation on the RS/6000 SP2 is different. It doesn't support truly shared disks, but does provide 'virtual shared disks', served at the disk level from private disks on individual nodes over a high-performance switched cluster interconnect (a rough sketch of the arrangement appears below). I think (I can't find the reference right now) that the AIX Journaled File System (JFS - a log-backed file system) was *not* significantly extended to support clustering, and hence is definitely sub-optimal (though fully functional). Early on, the 'Vesta' high-performance-computing shared file system was an option, but it's definitely special-purpose (and I would say also sub-optimal for most use) in nature; more recently the 'General Parallel File System' - based on a special-purpose system for shared non-linear video editing - has appeared, but again it's sub-optimal. And DB2/6000, unlike DB2 on S/390, uses a function-shipping (partitioned) model rather than making use of the virtual shared disk facilities. So the RS/6000 looks pretty good when compared with other Unix clusters, and has a more modern (though less distributed) base file system than VMS's, but is at best competitive with current VMS facilities overall.

Sun's clusters use node-private disks that export private portions of their file system in a client/server manner to other nodes - neither ideal for performance nor for scaling. HP's cluster file system is a derivative of Veritas' VxFS, so if I had to guess I'd say it may work somewhat like the Sun cluster architecture, but one really should look into it.

> It also means that if I try to update a user file on one system at the same
> time as someone else on another system is, then whoever got there last will
> get an error message preventing the data corruption that would occur on a
> non-clustered system with uncoordinated writes across multiple systems. The
> various banks, stock exchanges, lotteries, manufacturing, billing and many
> other environments use these lock management features in their OpenVMS
> Cluster applications today.

Extension of single-system locking semantics to an entire cluster is not exactly unusual technology any more - though it certainly was when VAXclusters first appeared.

> Given that single server, single site solutions are likely to become
> unacceptable to serious e*Commerce players very shortly, how does an IBM
> mainframe implement transparent application and user access to data at both
> sites ?

Exactly as VMS does - see above. A 'single system image' is also no longer a particularly novel feature in any cluster implementation (though use of shared disks to optimize access to the data is still an unusual strength). Sysplex members can be geographically separated (I forget the maximum distance, but it qualifies as disaster-tolerant). So can RS/6000 SP cluster members.

> That is not to say that file system improvements are not being looked at.
>
> On the contrary, file system improvements are ongoing and will continue to
> improve. As an example, check out this update on the recent DECUS page:
> http://ww2.decus.org/saag/Abstract.asp?Code=OV114

Thanks for the reference. However, while such work doubtless has value, it's far from the kind of core performance issues I was talking about. Sort of like re-painting a house that's fairly close to falling down from old age (no, that analogy's not entirely fair, but it's also not entirely inapt, either).
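An aside, to make the 'virtual shared disk' arrangement mentioned above concrete. This is a toy sketch with invented names and interfaces (it is not the actual IBM VSD API): the owning node serves raw blocks over the interconnect, and every other node sees an ordinary disk-level interface. Note that coherence above the block level remains the client's problem - which is part of why a file system that wasn't extended for clustering stays sub-optimal on top of it.

    # Toy sketch of the 'virtual shared disk' idea: each node owns its
    # disks privately but exports them at the block level, so any node
    # can issue reads and writes as if the disk were shared.

    class ServingNode:
        """Owns a physical disk and serves block I/O requests from peers."""
        def __init__(self, nblocks, block_size=512):
            self.blocks = [bytes(block_size) for _ in range(nblocks)]

        def handle_read(self, lbn):           # lbn = logical block number
            return self.blocks[lbn]

        def handle_write(self, lbn, data):
            self.blocks[lbn] = data

    class VirtualSharedDisk:
        """What a client node sees: a disk-level interface that is really
        a request/response exchange over the cluster interconnect."""
        def __init__(self, server):
            self.server = server

        def read(self, lbn):
            # In a real implementation this is one message over the switch
            # and one reply; here it's a direct call.
            return self.server.handle_read(lbn)

        def write(self, lbn, data):
            self.server.handle_write(lbn, data)

    # Any node in the cluster can now treat the served disk as its own:
    owner = ServingNode(nblocks=1024)
    vsd = VirtualSharedDisk(owner)
    vsd.write(42, b"metadata block".ljust(512, b"\x00"))
    assert vsd.read(42).startswith(b"metadata block")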
> >>> The underlying VMS cluster facilities are up to the task, but the file
> system falls a bit short in areas of performance - which can also be viewed
> as a scaling (and cost) issue, since the more performance you have in each
> cluster member, the fewer members you need. <<<
>
> Can you expand on what you mean here ? Are there specific areas of the
> OpenVMS file system that you feel are big issues ?

1. Sharable distributed data caching, mentioned above. Not only can this improve response time markedly, but it can significantly off-load hot-spot activity on individual disks, improving scalability (after all, that's perhaps the major strength of the shared-disk configuration: processing - and caching - facilities can scale independently of the storage).

2. Journaled file and record managers. Logging meta-data changes cuts *'way* down on hot-spot disk write activity - e.g., file-header information updates and allocation operations. The logs can be spread across the cluster members, further improving scalability (Oracle's main transaction-processing bottleneck is reportedly its use of a single system-wide log file). Logging also makes it possible to use a write-back cache for changes, without needing something like mirrored NVRAM (and its associated recovery complexities) to protect the cached data. That in turn makes lazy updates to mirrored (and possibly even parity-protected) disk arrays feasible without affecting application response time, and takes things like RMS index (and RRV) updates out of perceived response time. It also eliminates the ordering constraints on those disk writes, which removes the potential need for horizontal scans to recover from possibly incomplete index updates - which in turn makes it easier to reclaim buckets, to implement less restrictive index-update locking, and possibly to get better trailing-byte key compression... (A sketch of the logging idea follows this list.)

3. As I've mentioned in passing, once you've got effective distributed data caching plus a journaling facility, you've got the basis for other data management systems as well: object managers, database managers, special-purpose products... Even if Compaq isn't interested in developing such products itself, if VMS is going to be positioned largely as a high-performance, high-availability, scalable server, providing the best platform in the industry for third-party development of such products wouldn't hurt: by definition, they largely don't care what platform they're running on, since they just provide services over some kind of wire. If I were such a product, I'd much prefer to run as a kernel component, but that's not much different from installable file systems: possibly a support headache, but few worthwhile things aren't.

4. Specific improvements in directory access, such as b-tree name-ordered structures and 'clustering' of the child file headers to promote faster directory-list operations (the latter gets messier with multiply-linked files, but those are sufficiently rare that they won't compromise the performance boost).

5. Enhancements such as additional (and possibly user-defined) attributes and multi-'stream' files.

If you're going to re-do the file system, you might as well bring it up to current facility levels - and if you do, then you've got something you might be able to interest other systems in as well, either as an added compatible file system or as something they could potentially share with VMS in a heterogeneous cluster. You could say that current VMS users don't care about things like this, but DEC went to some trouble to emulate such facilities so that VMS could act as a file server to other systems - and as we already agreed, if VMS is attractive only to current users, its fate is likely sealed.

6. Support for transactional (logged) sets of operations at the user level. If the file and record sub-systems already use transactions internally, and VMS already supports user-level transactions at a higher level for database-like entities, it seems silly not to integrate the two somewhat better than they are today.
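To pin down what 'logging meta-data changes' buys in item 2, here's a minimal write-ahead-log sketch in Python. The record format, operation names, and file layout are all invented for the example; a real implementation would log binary records and tie them into cache write-back and recovery in ways this toy omits.

    # Minimal write-ahead-logging sketch for meta-data changes.  The point:
    # a batch of scattered hot-spot updates (file headers, allocation maps)
    # becomes one sequential append, and the in-place writes happen lazily.

    import json, os

    class MetadataJournal:
        def __init__(self, path):
            self.path = path
            self.log = open(path, "ab")

        def commit(self, changes):
            # One sequential, forced append makes the whole batch durable...
            self.log.write(json.dumps(changes).encode("ascii") + b"\n")
            self.log.flush()
            os.fsync(self.log.fileno())
            # ...after which the home-location meta-data writes can be done
            # lazily, off the application's perceived response time, and a
            # write-back cache of dirty meta-data is safe without NVRAM.

        def replay(self, apply):
            # Crash recovery: re-apply every logged batch to the home
            # locations (apply() must be idempotent for this to be safe).
            with open(self.path, "rb") as f:
                for line in f:
                    for change in json.loads(line):
                        apply(change)

    # Example: one operation's worth of meta-data updates, logged together.
    journal = MetadataJournal("metadata.log")
    journal.commit([
        {"op": "alloc",   "lbn": 1042, "file": 7},
        {"op": "set_eof", "file": 7, "eof": 81920},
        {"op": "touch",   "file": 7},
    ])

In a conventional design those three updates would hit three different hot spots on disk, in a constrained order; here they cost one sequential append, and the home-location writes can be deferred and batched.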
I may have missed something important above, and there are a whole bunch of more minor secondary effects from some of these that have additional impact, but that's at least a start.

There's no untried, frightening technology in any of this: it's just a bunch of things that no one else has gotten around to implementing - yet - because they can't look beyond the next quarter's profits, and some of this stuff, while eminently do-able, isn't as easy to implement as less general alternatives. But the overall effect is at least as great as the sum of its parts, and could put VMS at the forefront of cluster/server data management for some time to come.

- bill