From: young_r@eisner.decus.org
Sent: Tuesday, July 20, 1999 11:47 PM
To: Info-VAX@Mvb.Saic.Com
Subject: Re: Whither VMS?

In article <7n00vp$k8l$1@pyrite.mv.net>, "Bill Todd" writes:

>
> Rob Young wrote in message
> news:1999Jul19.090336.1@eisner...
>
>>
>> If that cache were in shared memory, a node dropping out
>> would not nearly be as painful as the lock structures are
>> in shared memory also
>
> Placing the lock structures in mirrored or fault-tolerant shared memory (to
> give the DLM the same resilience to failures that it enjoys today with its
> distributed, duplicated state) is a win, but requires significant algorithm
> adjustments. Of course, if you're talking specifically about cache-related
> locks, then new algorithms could allow that traffic to merge with the cache
> traffic, in which case only the cache management is relevant (see below) and
> shared Galaxy-style memory for lock management has no particular advantage -
> in this specific context.
>

I understand the first part of this paragraph, but you lost me beginning
with "Of course, if you're talking ..."

>> plus with 200-300 gigs of cache
>> (maybe not unreasonable in 3 years) doesn't have to be reloaded.
>
> I could make exactly the same argument for a distributed shared cache today,
> but that never caused it to materialize. Remember that a very large
> percentage of the 'hot' data at any one node is also cached at some other
> node(s) unless use is highly partitioned (i.e., such that a shared-nothing
> approach would work as well as VMS's), so *any* mechanism supporting
> distributed sharing will avoid most of the cache reloading disk access pain
> (the things you re-load will be the 'cooler' ones that no one else is
> interested in, which at least distributes the impact over time).
>

Wait a second.. you can't have it both ways. If I have a 200 gig cache
in shared memory, then in order for you to have an equivalent "hot" cache,
each node would need somewhere near that. So if one of your 8 nodes with
100 gig each took a hit, you avoid the reloads because "a very large
percentage of the 'hot' data at any one node is also cached at some other
node(s)". But if that cache is hot and has a modest amount of write
activity, isn't there a good deal of locking to contend with? And aren't
we back to a scaling problem, because realistically you would have 32 gigs
per node for 8 nodes, whereas I have a single box with 256 gigs and a
200 gig shared memory section? Am I missing the obvious? Remember, I
thought yesterday that NT is in the DataCenter :-).

> A shared-memory cache is unquestionably more space-efficient than a
> distributed shared cache, since no such duplication occurs. It's also
> slower by the ratio of local to remote memory access speeds: if you're
> touching much data in a given cached 'page', it's entirely possible that
> taking a single latency hit moving it into your local memory will speed up
> execution overall - almost certain for 'hot' pages that you would keep
> around in your local cache and access frequently. The shared-memory cache
> is also more fragile, unless you make it stable and fault-tolerant (when did
> you say fault-tolerant memory will appear?): duplication isn't all bad...
>

Okay, I should have read a little further.

> Oh yes - scaling. A shared distributed cache scales automatically as you
> add more nodes, a shared-memory cache does not - an issue not only of size
> but of contention/bandwidth (in the shared-memory version) as well.
>

Lost me here. You add more nodes. I slide in a processor board and a
few memory boards. Aren't you spending more money?
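Just to put rough numbers on the duplication point, here is a
back-of-the-envelope C sketch comparing the distinct data held by one
big shared-memory cache against 8 nodes whose local caches all replicate
some hot set. The 20 gig hot-set figure is purely my own guess, there
only to make the effect concrete:

/* Back-of-the-envelope arithmetic for the cache comparison above.
   The 20 GB "hot set" is my assumption, not a measured figure. */
#include <stdio.h>

int main(void)
{
    double shared_gb   = 200.0;  /* one Galaxy-wide shared-memory cache      */
    int    nodes       = 8;
    double per_node_gb = 32.0;   /* local cache per node, distributed case   */
    double hot_gb      = 20.0;   /* assumed hot set replicated on every node */

    /* Each node holds the replicated hot set plus its own cooler data. */
    double distinct_gb = hot_gb + nodes * (per_node_gb - hot_gb);

    printf("shared-memory cache : %.0f GB, all of it distinct\n", shared_gb);
    printf("distributed caches  : %.0f GB raw, about %.0f GB distinct\n",
           nodes * per_node_gb, distinct_gb);
    return 0;
}

With those guesses the eight local caches spend 256 gigs of memory to hold
about 116 gigs of distinct data, which is the space-efficiency point
conceded above anyway.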
> Given distributed caching algorithms (which exist) that give some preference
> to data that no one else duplicates (hence reducing the level of duplication
> to the hotter items that you may well *want* duplicated), I can make a
> pretty good argument that the Galaxy configuration would be better off just
> using its shared memory as a fast interconnect and implementing a
> distributed shared cache exactly as could be done today in non-Galaxy
> configurations.
>

And I bet you're right, regardless of what I say above. After all, I was
paying attention when Fred Kleinsorge in comp.arch a month or so back said
he is "a big fan of distributed caches." The catch may be that the cache is
physically distributed across several machines but managed as a single
space. I think that is the trick. I may be off the mark, of course. I
think your argument for wanting them duplicated applies to slower, skinnier
interconnects. I suspect very-high-bandwidth, low-latency switches are
changing OS thinking fundamentally. The fun thing about Usenet is that you
can go back and read some of this years later and cringe. We shall see how
it works out.

>> Guess I was looking at a cheaper way to go. I'm assuming the
>> large Wildfire has batteries. Seems silly to spend that kind
>> of money and not get something that "oozes" non-stop at a
>> hardware level. Yes solid state disks would work. I am looking
>> at a large write-back cache and maybe a large write-back delay.
>> And yes a locally attached "write-only" disk would work and
>> I mentioned this in another discussion.
>
> The question is really whether one can find a general-purpose, flexible,
> cost-effective-across-a-wide-range-of-system-scales approach which comes
> quite close to offering optimal performance across this same range of system
> sizes. If you can, it's awfully attractive; if not, you have to implement a
> half-dozen different ways of doing just about the same things to satisfy
> each niche.
>

And provide the appropriate SYSGEN parameters to turn them off and on.

> Stable fault-tolerant/mirrored write-back caches obviously can work very
> well, but require a fair amount of configuration/administration as loads and
> system sizes fluctuate. They are also tricky to manage under failure
> conditions: they are a part of the processor complex that is logically
> instead part of the storage system, and no failure can be allowed to
> separate them from the disk storage until the two have been reconciled
> (long-established application - e.g., database - persistence semantics with
> respect to data 'known' to be on disk must be honored).
>
> Log-based solutions can also work very well, tend to scale more
> automatically, provide a common and well-understood recovery mechanism (I
> would say in contrast to stable memory), and admit to more flexibility
> (e.g., the performance option of using solid-state instead of conventional
> log disks). They have the down-side of being less transparent than a
> stable, fault-tolerant write-back cache, but this of course applies only to
> applications that need not run in any other environments where that option
> is not available: the rest of the applications will already have done what
> they need to do to run in such other environments, and may be unlikely to
> create a special version for Galaxy - especially if the performance won't be
> affected much.
>

I like what you say above. Good food for thought.
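To make sure I follow the log-based approach, here is a bare-bones C
sketch of the ordering I understand it to rely on: the intent record is
forced to the log before the operation is acknowledged, and the data page
itself stays dirty in an ordinary write-back cache. The file name and
record format are mine, purely illustrative:

/* Minimal sketch of "write-back cache protected by a log": append an
   intent record, force it toward stable storage, acknowledge the
   operation, and let the data page be written back whenever convenient.
   "recovery.log" and the record text are illustrative only. */
#include <stdio.h>

static int log_update(FILE *log, const char *record)
{
    if (fprintf(log, "%s\n", record) < 0)
        return -1;
    if (fflush(log) != 0)   /* push from the stdio buffer to the OS */
        return -1;
    /* A real implementation would also force the OS cache to disk here
       (fsync or the platform equivalent) before acknowledging. */
    return 0;
}

int main(void)
{
    FILE *log = fopen("recovery.log", "a");
    if (log == NULL)
        return 1;

    /* The modified page stays dirty in normal memory; only this intent
       record has to reach the log synchronously. */
    if (log_update(log, "UPDATE vbn=42 before=<old> after=<new>") == 0)
        printf("operation acknowledged; lazy write-back can follow\n");

    fclose(log);
    return 0;
}

Recovery then replays (or undoes) from the log, which I take to be the
"common and well-understood recovery mechanism" part.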
> Of course, there is also a class of (often Unix-style) applications that use
> write-back caching as a matter of course, but they don't need any special
> Galaxy help for this: good old normal memory is what they expect, and if
> they lose the data before writing it to disk, well, they just handle it. So
> the real question is whether there is a significant intermediate class of
> applications that use 'careful update' force-to-disk sequencing rather than
> a log to protect their data: these, and only these, applications are the
> ones that a special-purpose, reliable write-back cache would help optimize,
> and their data is the only data that would allow improvements in overall
> system throughput if handled in a reliable write-back cache.
>
> The current file system and RMS obviously fall into that special category:
> while changing them to use logging mechanisms would be equally effective, a
> reliable write-back cache would help a lot and be transparent. So I can't
> argue against its expediency - but I wouldn't count on it to improve all
> that many other applications, and it likely won't improve database
> performance at all.
>

There may be a good deal of impetus to go with something that transparently
boosts RMS. Maybe, with write_delay and write_back and the need for
integrity, you offer several variations with explicit setup instructions.
Maybe for moderate sites the recovery logs could be on dedicated disks on
their own controllers. This too will be interesting to see.. RMS
improvements and how they interact with new caching, etc.

>>
>> Maybe what I am thinking isn't practical. e.g. the recovery log
>> really isn't touched, nor would it need to be UNLESS something
>> catastrophic occurs. If the batteries kick in, it is assumed
>> a catastrophe is underway. Guess this would gain you a few things..
>>
>> o Writes would occur in under a millisecond (1)
>
> Low-latency VI access to a solid-state log disk should equal this. You
> could even implement such a 'disk' in reliable shared memory. For that
> matter, as soon as log activity to a non-solid-state log disk starts to
> increase, you can start grouping multiple operations in each log-disk write:
> while individual operation latency isn't as good as with a solid-state log
> disk, system throughput limits tend to be about the same (gated by the log
> disk's maximum large-sequential-write bandwidth, not its individual
> operation latency).
>

Funny that you mention reliable shared memory for the disk. Since a RAMDISK
in shared memory is a future (10), you could shadow 2 smaller ones. What I
have been kicking around and neglecting to ask is: Will/can shared memory be
identified? Will the RAMDISK driver be "smart" enough to go to a different
memory bank (or banks) for subsequent disks? It wouldn't make much sense to
shadow RAMDISKs, blow out a board, and have them *both* go down. Since
fault-tolerant allocation is a future, there must be better/nicer/smarter
hardware/OS interaction in this regard. Something I know very little about.
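The "grouping multiple operations in each log-disk write" point a couple
of paragraphs up is worth a sketch too. This is only my toy rendering of
the batching idea in C - buffer size, record format and file name are all
invented - but it shows why throughput ends up gated by the log disk's
sequential-write bandwidth rather than by per-operation latency:

/* Rough sketch of grouping operations into one log-disk write: pending
   records accumulate in a buffer and are flushed together, so one disk
   write carries many operations' worth of records. */
#include <stdio.h>
#include <string.h>

#define GROUP_BUF_SIZE 65536

static char   group_buf[GROUP_BUF_SIZE];
static size_t group_len = 0;

/* Queue one record; when the buffer fills, flush the whole group to the
   log with a single large sequential write. */
static void log_record(FILE *log, const char *rec)
{
    size_t n = strlen(rec);

    if (group_len + n + 1 > GROUP_BUF_SIZE) {
        fwrite(group_buf, 1, group_len, log);
        fflush(log);
        group_len = 0;
    }
    memcpy(group_buf + group_len, rec, n);
    group_buf[group_len + n] = '\n';
    group_len += n + 1;
}

int main(void)
{
    FILE *log = fopen("group.log", "a");
    char  rec[64];
    int   i;

    if (log == NULL)
        return 1;

    for (i = 0; i < 1000; i++) {
        sprintf(rec, "op %d: small update", i);
        log_record(log, rec);
    }
    if (group_len > 0) {          /* flush the final partial group */
        fwrite(group_buf, 1, group_len, log);
        fflush(log);
    }
    fclose(log);
    return 0;
}

Each physical write to the log carries many operations' worth of records,
so as log traffic climbs the batches simply get bigger.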
>> o Writes could be reordered
>
> Same with write-back caching in normal memory protected by a log - that's
> how databases do it. And of course also true for unprotected Unix-style
> write-back caching in normal memory.
>
>> o Writes could be combined
>
> Ditto, at least if you're talking about combining logically-adjacent
> writes - and once you can defer writes, you can also often defer some
> allocation decisions to *make* a bunch of small writes logically adjacent:
> one of many things an update-in-place approach can do to retain its own
> advantages while appropriating some that are usually credited to
> log-structured approaches.
>
>> o Writes occur during quiet times
>
> Ditto.
>
>> o Temp files don't hit disks
>
> Ditto: temp files are a fine example of Unix-style unprotected write-back
> caching. The common thread in all this is that you don't need reliable
> memory to do write-back caching: many applications don't need that degree
> of safety, and logs are an already widely-used alternative for those that
> do.
>

One comment here though.. for the paranoid VMS manager (many of us are, as
we are in mission-critical/bet-the-business/"write up a report if you go
down" situations) I don't want to go to someone and say, "well, you know, I
had write-back caching turned on and the procedure that creates the nightly
financials crapped out because the temp file wasn't created, yadda yadda."
When Spiralog talk was thick, one of the hot subthreads was write-back
caching. Who could use it, and why should you? Why go through all the work
if the vast majority of VMS manglers can't use it? So when I see
write_back/write_delay in SYSGEN I think they are hot on the trail of
engineering it for Joe experienced manager who can set it up correctly.
Especially with fault-tolerant memory lurking out there. Maybe
fault-tolerant memory doesn't play a part. But whatever they are doing, I
hope it can be "set it and forget it."

>>
>> Much of those bullets being Spiralog stuff.
>
> One of the drawbacks of treating your entire file system as one big log is
> the lack of discrimination between what can wait to be written to disk and
> what needs to get there *now* because some operation is waiting
> synchronously for write-completion. Separating out the log function
> clarifies this and keeps lazy writes out of the critical disk path.
> Granted, you could pass down a flag to indicate the difference - but then
> your critical writes start screwing up your queue optimization and still get
> delayed by in-progress lazy operations.
>

Okay. Learned something else.

>>
>> Realistically, I admit a recovery log would be active constantly
>> to reassure folks. Unless it can be somehow "proven" writes will
>> end up on magnetic material somewhere regardless of what happens.
>>
>> One thing that differs versus a distributed cache is economy of scale.
>> If writes are occurring to shared memory, all the Galaxy members would
>> benefit from the scaling.
>
> And if you use solid-state log disks, multiple members can share each: in
> both cases, the limits are those imposed by the bandwidth of the shared
> resource - and while the bandwidth of a shared disk is much less than that
> of shared memory in Galaxy (unless, of course, you implement that shared
> disk in reliable shared Galaxy memory...), the *amount* of data written
> there is much less as well, at least for typical log use.
>

You mentioned a Galaxy disk again. Why a disk? Why can't it be at the OS
level, as a carved-out portion of memory? Aren't most Unix recovery logs
transparent to the end user, i.e., not locatable?

[snip NT wandering, see other thread]

Rob