From: Bill Todd [billtodd@foo.mv.com]
Sent: Saturday, July 10, 1999 7:56 PM
To: Info-VAX@Mvb.Saic.Com
Subject: Re: Whither VMS?

Main, Kerry wrote in message
news:910612C07BCAD1119AF40000F86AF0D802CCDDE2@kaoexc4.kao.dec.com...

> Just so I understand a bit more .. You do realize that within an OpenVMS
> cluster, the common clustered file system that is part of OpenVMS and
> various shadowing products (both host based and controller based) means that
> no matter what node a user logs into (even in a multi-site cluster), their
> files are local to them ? In a properly configured multi-site cluster, the
> user does not even know which system or which datacenter is providing his
> computing resources.

Yes, and the truly shared-disk nature of VMS clusters (at least when using HSx controllers - privately-served pseudo-shared disks are the next-best alternative), carried all the way up through the file system (and, I believe, DBMS and Rdb), is one of their greatest virtues over most of the competition.

S/390 Parallel Sysplex works the same way for low-level access, and VSAM and IMS (I think - it's possible IMS works more like DB2 below) work pretty much the same way up through the higher levels: cluster members have their own private caches which cannot share each other's data but which are kept coherent through invalidation mechanisms. Oracle has also worked this way until very recently: they may have just implemented the ability to share undirtied cache contents across cluster members, and reportedly are working on being able to share dirty data as well before long.

But I don't believe *any* existing cluster implementation supports such facilities at the file level, and they can offer real improvements in both access speed and total throughput. Having to go to disk every time you need data that isn't in your own cache is expensive when any real inter-node sharing is going on. And since you have to take the hit of distributed interlocks in any event, a sharable distributed cache is virtually free (at least with the speed of today's interconnects): with only minor additions to the total message count, the existing locking messages can be extended to support the cache mechanism as well (sketched below).

I seem to remember (dimly, now) that RMS went through an analogous progression long ago. In The Beginning, any RMS caching was strictly per-process. Somewhere around VMS V2 a global buffering mechanism was added so that processes on the same machine could share RMS data. Then along came clusters, and the global buffer mechanism may have had to be scrapped, at least temporarily, since the new distributed facilities had trouble dealing with the additional level of organization. Later it may have returned - and more recently some kind of node-local sharable file system cache (I forget what it's called) was added so that processes, though again only on a single node, can share cached data. But even today, more than 15 years after clusters appeared, *nodes* still can't share cached data with each other at either the file system or the RMS level. If sharing within a node is good, and distributed processing over the same data is valuable (in some situations it's absolutely *essential*), then sharing between nodes is important.
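Purely to make the mechanism concrete, here's a toy sketch in Python of the idea: the messages you already pay for to get a distributed lock can also confirm, ship, or invalidate cached copies. Everything here is invented for the example (the names, the message shapes, the write-through policy) - it is not how VMS, the DLM, or any of the products above actually implement this.

    # Illustrative sketch only: piggybacking cache coherence on existing
    # lock-manager traffic.  All names (Cluster, Node, Grant) are made up.

    class Grant:
        def __init__(self, status, version, data=None):
            self.status = status      # "valid", "shipped", or "miss"
            self.version = version    # current version of the block
            self.data = data          # block contents, when shipped from a peer

    class Cluster:
        """Stands in for the lock manager plus the shared disk."""
        def __init__(self, disk):
            self.disk = disk          # block_id -> data (the 'shared disk')
            self.version = {}         # block_id -> current version number
            self.nodes = []

        def acquire_read(self, block_id, held_version):
            # The read-lock request reports the version the requester already
            # caches, so the grant can answer in one of three ways - with no
            # messages beyond the lock traffic itself.
            current = self.version.get(block_id, 0)
            if held_version == current:
                return Grant("valid", current)          # your copy is still good
            for node in self.nodes:                     # ship from a peer's cache
                entry = node.cache.get(block_id)
                if entry is not None and entry[1] == current:
                    return Grant("shipped", current, entry[0])
            return Grant("miss", current)               # nobody has it: go to disk

        def acquire_write(self, block_id):
            # An exclusive lock bumps the version, implicitly invalidating
            # every other node's cached copy.
            self.version[block_id] = self.version.get(block_id, 0) + 1
            return self.version[block_id]

    class Node:
        def __init__(self, cluster):
            self.cluster = cluster
            self.cache = {}           # block_id -> (data, version)
            cluster.nodes.append(self)

        def read(self, block_id):
            held = self.cache.get(block_id)
            grant = self.cluster.acquire_read(block_id,
                                              held[1] if held else -1)
            if grant.status == "valid":
                return held[0]                          # confirmed cache hit
            if grant.status == "shipped":
                self.cache[block_id] = (grant.data, grant.version)
                return grant.data                       # from a peer, no disk I/O
            data = self.cluster.disk[block_id]          # only a true miss hits disk
            self.cache[block_id] = (data, grant.version)
            return data

        def write(self, block_id, data):
            version = self.cluster.acquire_write(block_id)
            self.cache[block_id] = (data, version)
            self.cluster.disk[block_id] = data          # write-through, for simplicity

    # Two nodes sharing one block:
    cluster = Cluster(disk={7: b"payroll record"})
    a, b = Node(cluster), Node(cluster)
    a.read(7)            # true miss: touches 'disk'
    b.read(7)            # shipped from a's cache, no disk I/O
    b.write(7, b"updated payroll record")
    a.read(7)            # a's copy is stale; shipped fresh from b

The thing to notice is that the only time a node touches the disk in this toy is on a true cluster-wide miss; every other case is resolved by the same messages that carried the lock traffic.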
DB2 on a Sysplex, however, can use the shared cache provided by the Coupling Facility (CF) both to share data between members (at least in a limited fashion) and, perhaps more significantly, as a shared write-back cache for dirtied data (which is protected by the logs against crashes before being written back to disk).

The situation on the RS/6000 SP2 is different. It doesn't support truly shared disks, but does provide 'virtual shared disks', served at the disk level from private disks on individual nodes over a high-performance switched cluster interconnect (a rough sketch of the arrangement appears below). I think (I can't find the reference right now) that the AIX Journaled File System (JFS - a log-backed file system) was *not* significantly extended to support clustering, and hence is definitely sub-optimal (though fully functional). Early on, the 'Vesta' high-performance-computing shared file system was an option, but it's definitely special-purpose (and I would say also sub-optimal for most use) in nature; more recently the 'General Parallel File System' - based on a special-purpose system for shared non-linear video editing - has appeared, but again it's sub-optimal. And DB2/6000, unlike DB2 on S/390, uses a function-shipping (partitioned) model rather than making use of the virtual shared disk facilities. So the RS/6000 looks pretty good when compared with other Unix clusters, and has a more modern (though less distributed) base file system than VMS's, but is at best competitive with current VMS facilities overall.

Sun's clusters use node-private disks that export private portions of their file system in a client/server manner to other nodes - neither ideal for performance nor for scaling. HP's cluster file system is a derivative of Veritas' VxFS, so if I had to guess I'd say it may work somewhat like the Sun cluster architecture, but one really should look into it.

> It also means that if I try to update a user file on one system at the same
> time as someone else on another system is, then whoever got there last will
> get an error message preventing the data corruption that would occur on a
> non-clustered system with uncoordinated writes across multiple systems. The
> various banks, stock exchanges, lotteries, manufacturing, billing and many
> other environments use these lock management features in their OpenVMS
> Cluster applications today.

Extension of single-system locking semantics to an entire cluster is not exactly unusual technology any more - though it certainly was when VAXclusters first appeared.

> Given that single server, single site solutions are likely to become
> unacceptable to serious e*Commerce players very shortly, how does an IBM
> mainframe implement transparent application and user access to data at both
> sites ?

Exactly as VMS does - see above. A 'single system image' is also no longer a particularly novel feature in any cluster implementation (though use of shared disks to optimize access to the data is still an unusual strength). Sysplex members can be geographically separated (I forget the maximum distance, but it qualifies as disaster-tolerant). So can RS/6000 SP cluster members.

> That is not to say that file system improvements are not being looked at.
>
> On the contrary, file system improvements are ongoing and will continue to
> improve. As an example, check out this update on the recent DECUS page:
> http://ww2.decus.org/saag/Abstract.asp?Code=OV114

Thanks for the reference. However, while such work doubtless has value, it's far from the kind of core performance issues I was talking about. Sort of like re-painting a house that's fairly close to falling down from old age (no, that analogy's not entirely fair, but it's also not entirely inapt, either).
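An aside, to make the 'virtual shared disk' arrangement mentioned above concrete. This is a toy sketch with invented names and interfaces (it is not the actual IBM VSD API): the owning node serves raw blocks over the interconnect, and every other node sees an ordinary disk-level interface. Note that coherence above the block level remains the client's problem - which is part of why a file system that wasn't extended for clustering stays sub-optimal on top of it.

    # Toy sketch of the 'virtual shared disk' idea: each node owns its
    # disks privately but exports them at the block level, so any node
    # can issue reads and writes as if the disk were shared.

    class ServingNode:
        """Owns a physical disk and serves block I/O requests from peers."""
        def __init__(self, nblocks, block_size=512):
            self.blocks = [bytes(block_size) for _ in range(nblocks)]

        def handle_read(self, lbn):           # lbn = logical block number
            return self.blocks[lbn]

        def handle_write(self, lbn, data):
            self.blocks[lbn] = data

    class VirtualSharedDisk:
        """What a client node sees: a disk-level interface that is really
        a request/response exchange over the cluster interconnect."""
        def __init__(self, server):
            self.server = server

        def read(self, lbn):
            # In a real implementation this is one message over the switch
            # and one reply; here it's a direct call.
            return self.server.handle_read(lbn)

        def write(self, lbn, data):
            self.server.handle_write(lbn, data)

    # Any node in the cluster can now treat the served disk as its own:
    owner = ServingNode(nblocks=1024)
    vsd = VirtualSharedDisk(owner)
    vsd.write(42, b"metadata block".ljust(512, b"\x00"))
    assert vsd.read(42).startswith(b"metadata block")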
> >>> The underlying VMS cluster facilities are up to the task, but the file
> system falls a bit short in areas of performance - which can also be viewed
> as a scaling (and cost) issue, since the more performance you have in each
> cluster member, the fewer members you need. <<<
>
> Can you expand on what you mean here ? Are there specific areas of the
> OpenVMS file system that you feel are big issues ?

1. Sharable distributed data caching, mentioned above. Not only can this improve response time markedly, but it can significantly off-load hot-spot activity on individual disks, improving scalability (after all, that's perhaps the major strength of the shared-disk configuration: processing - and caching - facilities can scale independently of the storage).

2. Journaled file and record managers. Logging meta-data changes cuts *'way* down on hot-spot disk write activity - e.g., file-header information updates and allocation operations. The logs can be spread across the cluster members, further improving scalability (Oracle's main transaction-processing bottleneck is reportedly its use of a single system-wide log file). Logging also makes it possible to use a write-back cache for changes, without needing something like mirrored NVRAM (and its associated recovery complexities) to protect the cached data. That in turn makes lazy updates to mirrored (and possibly even parity-protected) disk arrays feasible without affecting application response time, and takes things like RMS index (and RRV) updates out of perceived response time. It also eliminates the ordering constraints on those disk writes, which removes the potential need for horizontal scans to recover from possibly incomplete index updates - which in turn makes it easier to reclaim buckets, to implement less restrictive index-update locking, and possibly to get better trailing-byte key compression... (A sketch of the logging idea follows this list.)

3. As I've mentioned in passing, once you've got effective distributed data caching plus a journaling facility, you've got the basis for other data management systems as well: object managers, database managers, special-purpose products... Even if Compaq isn't interested in developing such products itself, if VMS is going to be positioned largely as a high-performance, high-availability, scalable server, providing the best platform in the industry for third-party development of such products wouldn't hurt: by definition, they largely don't care what platform they're running on, since they just provide services over some kind of wire. If I were such a product, I'd much prefer to run as a kernel component, but that's not much different from installable file systems: possibly a support headache, but few worthwhile things aren't.

4. Specific improvements in directory access, such as b-tree name-ordered structures and 'clustering' of the child file headers to promote faster directory-list operations (the latter gets messier with multiply-linked files, but those are sufficiently rare that they won't compromise the performance boost).

5. Enhancements such as additional (and possibly user-defined) attributes and multi-'stream' files.

If you're going to re-do the file system, you might as well bring it up to current facility levels - and if you do, then you've got something you might be able to interest other systems in as well, either as an added compatible file system or as something they could potentially share with VMS in a heterogeneous cluster. You could say that current VMS users don't care about things like this, but DEC went to some trouble to emulate such facilities so that VMS could act as a file server to other systems - and as we already agreed, if VMS is attractive only to current users, its fate is likely sealed.

6. Support for transactional (logged) sets of operations at the user level. If the file and record sub-systems already use transactions internally, and VMS already supports user-level transactions at a higher level for database-like entities, it seems silly not to integrate the two somewhat better than they are today.
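To pin down what 'logging meta-data changes' buys in item 2, here's a minimal write-ahead-log sketch in Python. The record format, operation names, and file layout are all invented for the example; a real implementation would log binary records and tie them into cache write-back and recovery in ways this toy omits.

    # Minimal write-ahead-logging sketch for meta-data changes.  The point:
    # a batch of scattered hot-spot updates (file headers, allocation maps)
    # becomes one sequential append, and the in-place writes happen lazily.

    import json, os

    class MetadataJournal:
        def __init__(self, path):
            self.path = path
            self.log = open(path, "ab")

        def commit(self, changes):
            # One sequential, forced append makes the whole batch durable...
            self.log.write(json.dumps(changes).encode("ascii") + b"\n")
            self.log.flush()
            os.fsync(self.log.fileno())
            # ...after which the home-location meta-data writes can be done
            # lazily, off the application's perceived response time, and a
            # write-back cache of dirty meta-data is safe without NVRAM.

        def replay(self, apply):
            # Crash recovery: re-apply every logged batch to the home
            # locations (apply() must be idempotent for this to be safe).
            with open(self.path, "rb") as f:
                for line in f:
                    for change in json.loads(line):
                        apply(change)

    # Example: one operation's worth of meta-data updates, logged together.
    journal = MetadataJournal("metadata.log")
    journal.commit([
        {"op": "alloc",   "lbn": 1042, "file": 7},
        {"op": "set_eof", "file": 7, "eof": 81920},
        {"op": "touch",   "file": 7},
    ])

In a conventional design those three updates would hit three different hot spots on disk, in a constrained order; here they cost one sequential append, and the home-location writes can be deferred and batched.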
I may have missed something important above, and there are a whole bunch of more minor secondary effects from some of these that have additional impact, but that's at least a start.

There's no untried, frightening technology in any of this: it's just a bunch of things that no one else has gotten around to implementing - yet - because they can't look beyond the next quarter's profits, and some of this stuff, while eminently do-able, isn't as easy to implement as less general alternatives. But the overall effect is at least as great as the sum of its parts, and could put VMS at the forefront of cluster/server data management for some time to come.

- bill