Steve & Andy - I've written a bit about some issues that might be involved in getting I/O to scale better. My current thinking is more or less in the space of a layering: one or a few huge RAID systems at the bottom; snappy disks and virtual disks above that (order unclear), so you break the huge area into usably small pieces and keep the ability to back them up with snaps; and the disk melding stuff on top, so it all looks to the user like one huge file structure regardless of the number of volumes underneath, while new storage can still be added easily. I also mention the notion of a possible cache using solid state disk, which might be of value as a way of automatically keeping "hot" sectors available on SSD as an option. The multipath work I'm currently doing incorporates a number of key fixes to the I/O subsystem that will make this kind of stacking of functions work efficiently, BTW, and the code I handed Steve Flynn is designed to work properly with those changes. I have some VD (virtual disk) drivers that do and others that don't, so it'd be desirable to pick one of those that do... Steve Sicola (peaks::sicola) can give you info on the HSZ plans, which involve multipathed controllers with sizeable cache and hardware RAID. I believe however that in Wildfire type configs it'll be essential to have some form of recoverable storage at the bottom. I have ideas of how to put HSM into this picture too; Safety does this already, but my (currently - you should pardon the pun - shelved) plans for an "HSM phase II" that can remove headers completely from disk are partly superseded by disk melding. Anyhow, I wanted to get this into your hands while I'm still around, so that if there are any bits I can add, that can be done in timely fashion.

Huge I/O Systems
Glenn C. Everhart, 2 June 1997

This is intended to be a sketch of ideas on how one might reasonably build an I/O file system that will work for very large disk farms (i.e., on the order of 10,000 disks of a few GB each). The focus is tactical; that is, the idea here is to ask what can be done without designing a complete new file system, instead mostly using code that can be assembled relatively quickly for OpenVMS. The resulting system should be simple to manage and use from the user's perspective, i.e., look like one big system, not lots of little ones, and should permit growth easily (shrinkage too?). Data recovery and speed improvements need to be thought about. I will make a bunch of assumptions about what exists and that the reader understands the limitations of ODS-2 (and the planned ODS-5), especially the million-cluster limit, which COULD be lifted, supposedly fairly easily, and really OUGHT to be, by making the one 8-bit lock value block field represent not the relative block number of the bitmap but the relative 1/256th of the bitmap currently being used.
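To make the scale of that limit concrete, here is a back-of-envelope C fragment based on my reading of the sentence above (an 8-bit field naming a relative block of the storage bitmap caps the bitmap at 256 blocks); the names are mine, and it is meant as arithmetic, not as a reference to the actual file system code.

    #include <stdio.h>

    int main(void)
    {
        /* Assumption, from the text above: an 8-bit field can name at most 256
           bitmap blocks, and each 512-byte bitmap block holds 4096 cluster bits. */
        const unsigned long bitmap_blocks     = 1ul << 8;   /* 256  */
        const unsigned long clusters_per_block = 512 * 8;   /* 4096 */

        printf("max clusters = %lu\n", bitmap_blocks * clusters_per_block); /* 1048576 */

        /* If the field instead named 1/256th of the bitmap, each 256th could span
           many blocks, so the cluster count would no longer be capped here. */
        return 0;
    }

On a multi-GB disk (say 8 GB, roughly 16 million blocks) that cap is what pushes the minimum cluster factor up toward 16, which is part of why splitting huge volumes into smaller pieces matters below.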
----------------------------------------------------------------

Bits of Technology Available

1. RAID n in software. Shadowing, striping, and RAID drivers are available. These can be used in combination if need be.

2. Partitioning drivers. VD: is the best known example; it allows any contiguous segment of a disk to be treated as a separate volume. It works with any disk (including RAID, shadow, and stripe disks). Useful for shadowing disks of different sizes, or for making a huge RAID disk look like a number of smaller disks where the cluster factor would get too large.

3. Snappy Disks. These have an underlying file system which allows "snapshots" to be declared so as to enable "instantaneous" images of disks to be created for backup purposes. (It is unknown how, in the limiting case, this underlying system will affect performance.)

4. HSM. Moves files around; no strategy is known which lets it deal with removing all the headers from disk. (In principle, if one creates directories on the fly, a directory open could be used to signal an operation to pull such things back in.) One problem is that directory and index file entries all stay where they are...

5. A utility which presents BACKUP/PHYSICAL or HSC type backup tapes to VMS as disks exists on the Freeware CD. It has lots of cache and uses XOR and CRC info. The idea was that BACKUP/PHYSICAL runs maybe twice as fast as normal BACKUP; this scheme still allows recovery of individual files.

6. A journalling virtual disk driver exists. Writes get written to the underlying disk but also to a journal file, timestamped and block-number stamped. The journal amounts to a "continuous backup" in combination with a physical backup. The most useful case could be a variant where the journal goes to a solid state disk...

7. Kernel code, and some of the user code, exists for a directory melder which would use a Spiralog volume of 0-length files for directories and keep the actual data on "N" conventional file structure disks. The conventional disks could be ODS-2 or ODS-5 (or whatever...) and each would be a complete file structure. Thus, adding new disks would be easy to do. A bit of create overhead gets added, but directory search overhead goes down (Spiralog is 10 times faster than ODS-n for this function and scales well, since its directories are B-trees) and open overhead is very low. The user sees ONE large directory structure with no visible storage boundaries, but it gets hosted on N volumes. Each volume can be backed up or restored separately (or in parallel) without special synchronization among them all. This same hook can be used to generalize from Spiralog to a full RDBMS later on, and to add some remote access hooks, at file system level rather than RMS level.

(SLIGHT digression) The idea of extending this is that the code has the option of sending requests to a server which can run in user mode and having the server return data, as well as running kernel execution threads (not to be confused with VMS kernel threads... I mean a sequence of AST-driven code). If, instead of passing an I/O request (in kernel mode) to a Spiralog file system to find the file location given a name or ID, one passes it to a server running a DBMS that holds a relation containing pathname, underlying device, and underlying FID, the data can be returned from the DBMS. Pick a DBMS that scales well with size, so that directory access then scales well with size too, and with this scheme you can drop the DBMS version in when the system gets big enough to need it. What is more, the relation can have other fields in it that can be included in a query, and it is possible to do a "set default" (or "cd") with syntactic sugar so that operations like "set default (containing=payroll)" (not exactly how it might look, of course) make DIRECTORY and so on display only files which have the keyword "payroll" in their keyword list. (Also include files that haven't yet had their keyword list built, to keep from confusing folks.) This is a kind of V2 capability... Spiralog is probably good enough to gain better directory scalability and speed at first, and it saves one from writing support for read-directory. However, it means that one generalizes the notion of directory.
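To picture the kind of mapping record the melder (or its DBMS-backed variant) would traffic in, here is a small, purely illustrative C sketch. The record layout, the names (meld_entry, meld_lookup), and the toy in-memory table are my assumptions for illustration; the real mapping lives in Spiralog 0-length files or a DBMS relation reached through the kernel intercept and its user-mode server.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical record for the melder's mapping relation: one entry per
       file, keyed by the pathname the user sees in the single big tree. */
    struct meld_entry {
        const char *pathname;    /* visible path in the melded tree          */
        const char *device;      /* which underlying ODS-2/ODS-5 volume      */
        unsigned short fid[3];   /* the file's FID on that volume            */
        const char *keywords;    /* optional, for "containing=payroll" style */
    };

    /* Toy in-memory "relation"; stands in for Spiralog or the DBMS. */
    static const struct meld_entry relation[] = {
        { "[USERS.SMITH]REPORT.TXT", "DKA100", { 42, 7, 0 }, "payroll,1997" },
        { "[USERS.JONES]NOTES.TXT",  "DKB200", { 99, 3, 0 }, "" },
    };

    /* Resolve a pathname (optionally filtered by keyword) to device + FID;
       the open then proceeds against the conventional file structure. */
    static const struct meld_entry *meld_lookup(const char *path, const char *kw)
    {
        size_t i;
        for (i = 0; i < sizeof relation / sizeof relation[0]; i++) {
            if (strcmp(relation[i].pathname, path) != 0)
                continue;
            if (kw != NULL && *kw != '\0' && strstr(relation[i].keywords, kw) == NULL)
                continue;
            return &relation[i];
        }
        return NULL;
    }

    int main(void)
    {
        const struct meld_entry *e = meld_lookup("[USERS.SMITH]REPORT.TXT", "payroll");
        if (e != NULL)
            printf("open goes to %s, FID (%u,%u,%u)\n", e->device,
                   (unsigned)e->fid[0], (unsigned)e->fid[1], (unsigned)e->fid[2]);
        return 0;
    }

The point is only that a create adds one record insert, while directory scans and keyword-style queries run against a structure that scales, regardless of how many underlying volumes hold the data.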
I believe the code that has been furnished can be used without too much grief even to allow the DBMS features for only SOME directory accesses, should that be desired. Opening up the notion of directory in this way is however a potential advantage, because it means finding information can be readily machine-aided, and programs require NO change. If they can select a subdirectory, they can select on additional criteria, either locally or from any remote clients. Thus a W95 or NT client would gain this capability automatically, provided its storage were on VMS servers. (end of SLIGHT digression)

8. Various cache products exist. All use memory at present.

9. (Not an existing system, but an idea worth having around.) It is possible to cache slow disks on faster ones; a cache or journal on solid state disk would, for example, tend to be faster than normal disks. Let block accesses be written first to such a solid state cache so writing can complete fast (and be elevator-scheduled out to the underlying disks, as SCSI disks supporting tagged command queueing tend to do), and let reads come preferentially from there. If it's a real cache, it isn't volatile, so writing to the underlying media can be done only when needed. The cache can be a larger fraction of the underlying size, and file headers or directories that are wanted can tend to be there. It is possible, generally speaking, to distinguish file system reads from user reads... the file system uses read/write logical, users use read/write virtual... so one could prefer to keep file system info in the cache...

The basics... bottom level stuff

Underneath, we need a physical interconnect. The stuff the HSZ folks are working on looks good, and with path failover it'll be feasible to build reliable RAID disks with some hardware cache and even hardware writeback. Having the writeback in intelligent controllers is optimal in some ways. However, let's suppose we want to support 3rd party disks and/or controllers that just give plain disk access, like direct parallel SCSI disks do now. We should have a way to deal with that too. A software RAID driver exists, but we'll think about stuff that does RAID or whatnot in hardware, giving some number of large reliable things that look like single disks.

My earlier notion of an intercept which allows one Spiralog directory structure to serve for lots of disks is a good starting point for handling thousands of volumes tactically. (To be sure, something needs to be done to facilitate mounting them all by doing many mounts in parallel. Even a DCL script could handle this, though, using programmatic volume names in a config file.)

To make this sort of thing work best, one needs to split large volumes (e.g., a huge RAID set) up into a bunch of smaller virtual disks. VMS should be altered a bit so that its disk drivers can be told to start at some LBN and to limit their access to some maximum number of LBNs for a disk unit, so that booting from a virtual disk would work. The edits needed at the start of START_IO to handle this are really trivial, especially for drivers like DKDRIVER that don't convert the LBN. The minimum requirement is under half a dozen lines of macro around START_IO to relocate where on the disk the unit sits, plus small checks to allow the bootable volume to look smaller (sketched below). For simplicity, leave the boot volume as ODS-5 or ODS-2.
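The real edit would be a few lines of MACRO-32 wrapped around each driver's START_IO path; the C-flavored sketch below only illustrates the arithmetic and the bounds check. The structure and names (unit_partition, base_lbn, max_lbns, relocate_lbn) are made-up stand-ins, not actual VMS driver data structures.

    /* Illustrative only: per-unit partition bounds that a START_IO intercept
       would apply before handing the transfer to the real device. */
    struct unit_partition {
        unsigned int base_lbn;    /* where this virtual unit starts on the real disk */
        unsigned int max_lbns;    /* how many LBNs the unit is allowed to cover      */
    };

    /* Returns 0 and relocates *lbn on success; -1 if the request runs off the
       end of the partition (the driver would reject it with an error status). */
    static int relocate_lbn(const struct unit_partition *up,
                            unsigned int *lbn, unsigned int block_count)
    {
        if (*lbn >= up->max_lbns || block_count > up->max_lbns - *lbn)
            return -1;                /* bounds check: keep the unit "small" */
        *lbn += up->base_lbn;         /* relocate into the unit's region     */
        return 0;
    }

A unit with a base_lbn of 0 and a max_lbns smaller than the real device is the "bootable volume that looks smaller" case from the paragraph above.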
If underneath there's just a big stripeset instead of RAID, it's desirable to have error recovery and to speed things up. I suspect the XFC will push writeback cache; however, why not allow a site to have a larger writeback cache on solid state disk?

Thus you'd split volumes up with a virtual disk driver that clones each IRP and sends one copy off to the SSD with a time tag and LBN, and the other to the real disk if the real disk does writeback. If there is no writeback, notify a catch-up process to get the data from the solid state disk. The SSD buffer can be used circularly; when it fills, don't overwrite, but wait. On a cluster you'd have N of these circular caches, where N is less than or equal to the number of nodes. Keep track of the read and write positions with locks. If you come up after a crash, you'd have time-tagged records pertaining to the disks' data, and could find the last-written data using the time tags. If data always went user program -> SSD cache -> normal disk, this could be used to catch everything. Thus on reboot and mount you'd use this to put all the (non-system) data onto disk and zero the SSD.

Reading the data would need a list of in-cache data, so that whatever wasn't yet on disk could be grabbed from the SSD. Normally this would be a short list. If spinlocks are an issue, different lock tables or locks can be used for different LBN ranges. Lock 1/8192nd of a disk or some such...? So on write: send the data to the cache, grab the lock, write the LBN of the cached data to the short list, bump the pointers, release the lock. On read: check the list before reading from the normal disk. With everything in common memory it might be possible to have no locks at all for reading... just check the pointers. (Normally RMS locks etc. keep conflicts out above this level.) Thus the problem of one processor writing to the cache while another reads stale data off disk can be avoided, so long as each processor sees the block number list once the write to the cache is done. Remember, "cache" = solid state disk (SSD) here! The assumption is that SSD is cheaper per GB than memory; otherwise one uses memory for this. (A sketch of this write and read protocol appears below.)

If you don't care about booting from virtual disks, VDDRIVER in my latest version (the one that needn't grab I/O postprocessing), with a minor edit to incorporate the I/O post hook, should be as fast as needed, since it can keep its context info in the IRP itself and can be used to partition disks.
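To make the write/read protocol above concrete, here is a minimal C sketch under heavy assumptions: a single node, a single lock stub, a fixed-size block list, and toy in-memory stand-ins (ssd_sim, disk_sim, ssd_write, disk_write, etc.) in place of the real SSD and disk drivers. It is meant only to show the ordering (stamp and write to the SSD ring, publish the LBN to the short list, then let the catch-up path drain to disk), not anything resembling the real IRP-level code.

    #include <string.h>
    #include <time.h>

    #define RING_SLOTS 1024             /* circular SSD journal, one block per slot */
    #define BLOCK      512

    struct slot {                       /* what goes to the SSD for each write      */
        unsigned int  lbn;
        time_t        stamp;            /* time tag, used for post-crash recovery   */
        unsigned char data[BLOCK];
    };

    /* Toy stand-ins: a simulated SSD, a small simulated disk, and a lock stub.
       A real version talks to drivers and takes a spinlock or cluster lock
       per LBN range. */
    static struct slot    ssd_sim[RING_SLOTS];
    static unsigned char  disk_sim[4096][BLOCK];      /* toy-sized disk */
    static void ssd_write(unsigned int n, const struct slot *s)      { ssd_sim[n] = *s; }
    static void ssd_read(unsigned int n, struct slot *s)             { *s = ssd_sim[n]; }
    static void disk_write(unsigned int lbn, const unsigned char *d) { memcpy(disk_sim[lbn], d, BLOCK); }
    static void disk_read(unsigned int lbn, unsigned char *d)        { memcpy(d, disk_sim[lbn], BLOCK); }
    static void lock(void)   { }
    static void unlock(void) { }

    /* The "short list": LBNs that are in the SSD ring but not yet on disk.
       head is the next write slot, tail the oldest undrained slot. */
    static unsigned int pending_lbn[RING_SLOTS];
    static unsigned int head, tail;

    /* Write path: stamp the block, write it to the SSD first, then publish
       its LBN on the short list so readers know the newest copy is cached. */
    static int cached_write(unsigned int lbn, const unsigned char *data)
    {
        struct slot s;
        lock();
        if ((head + 1) % RING_SLOTS == tail) {   /* ring full: wait, never overwrite */
            unlock();
            return -1;
        }
        s.lbn   = lbn;
        s.stamp = time(NULL);
        memcpy(s.data, data, BLOCK);
        ssd_write(head, &s);
        pending_lbn[head] = lbn;
        head = (head + 1) % RING_SLOTS;
        unlock();
        return 0;
    }

    /* Read path: check the short list before going to the normal disk, and
       prefer the newest cached copy of the block. */
    static void cached_read(unsigned int lbn, unsigned char *data)
    {
        unsigned int i, found = RING_SLOTS;      /* RING_SLOTS means "not found"     */
        struct slot s;
        lock();
        for (i = tail; i != head; i = (i + 1) % RING_SLOTS)
            if (pending_lbn[i] == lbn)
                found = i;                       /* keep scanning: newest copy wins  */
        if (found != RING_SLOTS) {
            ssd_read(found, &s);
            memcpy(data, s.data, BLOCK);
            unlock();
            return;
        }
        unlock();
        disk_read(lbn, data);
    }

    /* Catch-up path: drain the oldest pending block out to the real disk. */
    static void catch_up_one(void)
    {
        struct slot s;
        lock();
        if (tail == head) { unlock(); return; }  /* nothing pending */
        ssd_read(tail, &s);
        disk_write(s.lbn, s.data);
        tail = (tail + 1) % RING_SLOTS;
        unlock();
    }

    int main(void)
    {
        unsigned char buf[BLOCK] = { 0 }, back[BLOCK];
        buf[0] = 42;
        cached_write(7, buf);     /* goes to the SSD ring first              */
        cached_read(7, back);     /* served from the ring, not from the disk */
        catch_up_one();           /* later, the catch-up drains it to disk   */
        return back[0] == 42 ? 0 : 1;
    }

On a real cluster there would be one such ring per node, cluster or spin locks per LBN range in place of the lock() stub, and crash recovery would walk the ring by time tag rather than draining it live.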