File System Extension
Glenn C. Everhart

This document is meant to suggest a couple of variant schemes that may
enhance the manageability and usability of the VMS file system.

It would seem clear that a disk farm of dozens or more volumes in which
each volume is a separate entity has some disadvantages as well as
advantages.

The advantages lie in backup or error recovery, where a file structure
that becomes toast can be recovered in a more reasonable time frame than
would be the case if the file structure spanned the whole farm. Burroughs
learned that a long time ago...

There are also security advantages, since volumes can be protected
and volume access serves as a kind of "mandatory" protection for the
volume contents.

These advantages, however, tend not to be widely visible.

The disadvantages are in managing the thing. With systems like NTFS or
Files-11, once a disk runs out of capacity, files must be migrated,
usually manually. These considerations are constantly visible to
everyone and represent an operational disadvantage vis-a-vis unix.

This group of disadvantages can be dealt with by allocating disk control
structures (notably, bitmaps) sized for a larger space than is actually
there, and permitting use only according to what storage is really present.
(If all the virtual space is visible, a driver can return the device-full
error code, and VMS will respond sensibly. Of course, one could also modify
the VCB slightly at mount time to adjust the free-block count. My compressing
disk did the first, not the second, since its situation was too dynamic.)
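
Here is a minimal sketch of that check in C, with hypothetical
structures and status values standing in for the real VMS definitions:
the bitmap advertises more blocks than physically exist, and a transfer
aimed past the real storage gets the device-full treatment.

    #include <stdint.h>

    #define MY_SS_NORMAL      1   /* stand-ins for real status codes */
    #define MY_SS_DEVICEFULL  0

    typedef struct {
        uint32_t virt_blocks;     /* size the bitmap was built for   */
        uint32_t phys_blocks;     /* storage actually present        */
    } rubber_vol;

    /* Called per transfer: pass the request through if the blocks are
     * backed by real storage, otherwise report the volume as full so
     * the file system fails or retries the allocation sensibly.      */
    int rubber_check(const rubber_vol *v, uint32_t lbn, uint32_t count)
    {
        if ((uint64_t)lbn + count > v->phys_blocks)
            return MY_SS_DEVICEFULL;
        return MY_SS_NORMAL;
    }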

However, if building huge "rubber disks" like this is all you do, the
advantages above will tend to be lost. (Not that it's bad to do, just
that one loses the advantages.)

Possibility 1:

I will mention another possibility: a disk-backed LBN cache, where the
cache lives on disks...as many as you like...and the file structure is
again of the "rubber" class. However, the backing store need not all
be of equal speed, since the cache system could keep recently
accessed data on fast disks while using slower store for
older data. The resulting system would logically span many disks,
the cache system handling device-boundary issues as a side effect,
and be sized for many disks, but most of the underlying store could
be quite a bit slower than usual, possibly even residing partly
on things like tapes. My design for such a beast involves letting a
cache server handle the actual work of swapping things around, and
providing a second path to get to real disks which only the cache
server would use. (Details would in some ways resemble what mount
verify does, with a bit more locking and general usability, but the
cache server would use normal (well, almost normal) $qio functions,
and the whole thing can be constructed with straightforward i/o
interception techniques, not really needing mods to the filesystem or
the executive.)

Disk access for in-cache storage (and the cache could be gigabytes
or larger) would be very fast, handled by tiny mods to IRPs on the
way to the appropriate storage device. The cache server gets into the
picture only for cache misses (and must coordinate across cluster when
the cache map changes). This kind of thing is wonderful if you have
a solid state disk, by the way; it can be used in layers if you want,
though logically it will look like a huge volume, or maybe several such.
(It would also adapt easily for WORMs.)
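
A minimal sketch of the fast path, all names hypothetical: a cache map
translates a virtual LBN to a (device, LBN) pair on fast store; a hit
is handled by a small edit to the request, and a miss is handed to the
cache server to stage the data and update the map.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t virt_lbn;    /* start of a cached extent (virtual)   */
        uint32_t count;       /* length of the extent in blocks       */
        int      fast_unit;   /* unit number of the fast backing disk */
        uint32_t fast_lbn;    /* where the extent lives on that disk  */
    } cache_extent;

    typedef struct {
        cache_extent *ext;    /* extent table, cluster-coordinated    */
        size_t        n;
    } cache_map;

    /* Returns 1 on a hit and rewrites *unit/*lbn to the fast copy;
     * returns 0 on a miss, in which case the caller queues the
     * request to the cache server instead of the backing device.     */
    int cache_lookup(const cache_map *m, uint32_t vlbn,
                     int *unit, uint32_t *lbn)
    {
        for (size_t i = 0; i < m->n; i++) {
            const cache_extent *e = &m->ext[i];
            if (vlbn >= e->virt_lbn && vlbn < e->virt_lbn + e->count) {
                *unit = e->fast_unit;
                *lbn  = e->fast_lbn + (vlbn - e->virt_lbn);
                return 1;             /* hit: tiny mod to the IRP */
            }
        }
        return 0;                     /* miss: wake the cache server */
    }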

This would be useful and interesting, but still has the problems one
has with single huge file structures in some ways. (The backup/restore
problem does become easier, since "new stuff" is on a much smaller store
than older things, and older things can be backed up as they migrate
to slower store.)

Possibility 2:
Another possibility exists that I'd like to suggest.

My idea here is that a collection of disks will be handled by two
directory structures, one on a "master" volume, and one on each
individual volume. The "master" volume will either have only
directories, or will have directories pointing to files on itself
and on everything else. Like a volume set, this structure will be
managed as one file structure. Unlike a volume set, it will have
individual volumes as sensible entities unto themselves, so that the
unit of backup will be the individual volume. The total directory
tree would resemble one in unix, with a root directory and all other
volumes falling at mount points somewhere below root.
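
A minimal sketch of that view, with every volume and directory name
invented for illustration: the master volume owns the root of the
tree, and a small table records where each self-contained volume
hangs below it.

    typedef struct {
        const char *mount_point;  /* directory in the master tree       */
        const char *volume;       /* self-contained volume rooted there */
    } mount_entry;

    /* Illustrative farm: three ordinary volumes hanging off one master. */
    static const mount_entry farm[] = {
        { "[000000]",   "MASTER" },   /* root lives on the master volume */
        { "[USERS]",    "USER1"  },
        { "[PROJECTS]", "USER2"  },
        { "[ARCHIVE]",  "USER3"  },
    };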

One key to this is to add a restart capability to open. A second
key is to maintain both sets of directories in parallel, with the
clear rule that the per-volume directories get updated first and
the master directories second, so that after a crash some
inconsistency can exist (and be fixed later).

The third key is that old files, when opened, need to be found
where they are, whereas new files should be created where free
space exists.

The space trick might be doable using the volume set logic, but in fact
can be done by a front end.

The way you'd do this with a front end looks something like this:

You insert some processing by intercepting the FDT routines for the XQP
calls, so as to be filesystem independent. (Yeah, other possibilities
exist too; I'll describe this one.)
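
A minimal sketch of that interception, where every structure and name
is a hypothetical stand-in rather than the real driver tables: the
shim replaces the original dispatch entry, handles only the file
operations it cares about, and chains everything else through
untouched.

    enum { FN_CREATE = 1, FN_ACCESS = 2, FN_DEACCESS = 3 }; /* stand-ins */

    typedef struct request {
        int   func;      /* I/O function code from the caller       */
        void *channel;   /* channel the request arrived on          */
        void *context;   /* state saved by the intercept layer      */
    } request;

    typedef int (*fdt_entry)(request *rp);

    static fdt_entry original_entry;   /* saved when the shim goes in */

    /* The three interesting paths, filled in per the Create and Open
     * sections below; here they simply chain through so the sketch
     * stands alone.                                                  */
    static int handle_create(request *rp)   { return original_entry(rp); }
    static int handle_access(request *rp)   { return original_entry(rp); }
    static int handle_deaccess(request *rp) { return original_entry(rp); }

    /* Shim placed ahead of the real file-system dispatch. */
    int intercept_entry(request *rp)
    {
        switch (rp->func) {
        case FN_CREATE:   return handle_create(rp);
        case FN_ACCESS:   return handle_access(rp);
        case FN_DEACCESS: return handle_deaccess(rp);
        default:          return original_entry(rp); /* pass through */
        }
    }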

Create:
For non-kernel-mode channels, you save the user open request in a pool
data structure and keep all context (including the previous-mode PSL)
there. Having found the disk with the most free space, save the original
channel's UCB and point the CCB at the desired disk. Now enter the
directory entry in the master disk's directory, with an ACE (or other
marking if you like) telling where the real file is. Then let the
original operation run and update the directory on the disk where the
file will have its data. Capture deaccess (close) so the channel can be
put back as the program expects.
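
A minimal sketch of the volume-selection step, with hypothetical types
and names: scan the candidate volumes, pick the one with the most free
blocks, and hand that choice back so the caller can repoint the
channel and write the marking into the master directory.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        const char *name;        /* e.g. a device name; illustrative  */
        uint32_t    free_blocks; /* from a free-space query           */
    } volume_info;

    typedef struct {
        void              *saved_ucb; /* channel's original device    */
        const volume_info *target;    /* volume chosen for the data   */
    } create_ctx;

    /* Choose the emptiest volume; the caller then points the channel
     * at it, lets the real create run there, and enters a marking in
     * the master directory naming that volume and the resulting FID. */
    const volume_info *pick_target(const volume_info *vols, size_t n)
    {
        const volume_info *best = NULL;
        for (size_t i = 0; i < n; i++)
            if (best == NULL || vols[i].free_blocks > best->free_blocks)
                best = &vols[i];
        return best;
    }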

Open:
Save context as before, and issue a read-ACL (or otherwise read the file
markings) to get the file's location. Now again replace the channel's UCB,
saving the original one, and let the user open proceed on the right disk
with the user FIB's file ID filled in and its DID cleared. If the marking
carries the FID, this is direct; it works well but is not symbolic. A
symbolic access will want to do a full lookup of the file (a kernel
thread can do it from the intercept) and then alter the open. (Note that
the open IRP must be pointed at the right disk too.) When you use a
kernel thread, get an AST and issue the next processing step from there.
You can replace the previous mode in the PSL temporarily to reissue the
user I/O from inside AST context. (I've used special kernel ASTs
for this.)
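
A minimal sketch of the direct, FID-based case, again with
hypothetical structures: the marking read from the master entry names
the target volume and file ID, the FIB handed to the real open gets
that FID filled in and its DID cleared, and the caller repoints the
channel (and the open IRP) at the returned unit.

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint16_t num, seq, rvn; } file_id; /* stand-in FID */

    typedef struct {
        file_id fid;       /* file to open directly                   */
        file_id did;       /* directory ID; zero means "use the FID"  */
    } fib_like;

    typedef struct {
        int     target_unit; /* volume holding the real file data     */
        file_id target_fid;  /* its file ID on that volume            */
    } marking;

    /* Rewrite the caller's FIB and report which unit the channel and
     * the open request itself must be redirected to.                 */
    int redirect_open(fib_like *fib, const marking *m)
    {
        fib->fid = m->target_fid;
        memset(&fib->did, 0, sizeof fib->did); /* DID clear: open by FID */
        return m->target_unit;
    }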

If you care to read a marking and interpret it as a filename or filepath
somewhere, the kernel mode state machine gets more complex, but at
each step in the path you issue an io$_access for the desired file
on the desired disk and check for markings on the final access, possibly
repeating the process. This amounts to a facility to restart an open
request, even though it gets layered ahead of the "real" open. It can
be done using the existing facilities if need be.
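
A minimal sketch of that restartable lookup, where the step callback
stands in for the real per-path-element access (it is a hypothetical
hook, not implemented here): follow markings from volume to volume
until a file with no further redirection is found, with a depth limit
to guard against marking loops.

    #include <string.h>

    #define MAX_LINK_DEPTH 8

    typedef struct {
        int  has_marking;    /* does the accessed file redirect again? */
        int  next_unit;      /* volume named by the marking            */
        char next_path[256]; /* path named by the marking              */
    } access_result;

    /* Opens the named file on one volume and reports any marking
     * found on it; returns nonzero on failure.                       */
    typedef int (*access_step_fn)(int unit, const char *path,
                                  access_result *out);

    /* Follow markings until a file with no further redirection is
     * found (return its unit) or the depth limit is hit (return -1). */
    int resolve_symbolic(access_step_fn step, int unit, const char *path)
    {
        access_result r;
        char cur[256];

        strncpy(cur, path, sizeof cur - 1);
        cur[sizeof cur - 1] = '\0';

        for (int depth = 0; depth < MAX_LINK_DEPTH; depth++) {
            if (step(unit, cur, &r) != 0)
                return -1;                   /* access failed        */
            if (!r.has_marking)
                return unit;                 /* real file found here */
            unit = r.next_unit;              /* restart the open...  */
            strncpy(cur, r.next_path, sizeof cur - 1);
            cur[sizeof cur - 1] = '\0';      /* ...on that volume    */
        }
        return -1;
    }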

Directory access in this case presumes that create, open, delete,
etc. all perform their operations on the per-volume ("one-volume")
directory structures, and also on a set of master directories, leaving
markings in the master-directory files that point to the
files in the one-volume directories.

This is closely akin to unix softlinks, with the ability to select
a site automatically. Reading directories linked in this way could
probably be handled too, by keeping track of the volume used to
read a directory. However, RMS' insistence on directory caching
gets in the way here. It would be desirable to somehow flag that
a directory belongs to a different volume, possibly by hacking off a
high bit or two of the file ID and keeping an internal table of
volumes and currently accessed directories. That, however, is a sidelight
and not the main thing I am suggesting.

Maintaining the duplicated directories will add some time (the amount
depending on the underlying system) to file creation, and following
links will take a bit of time, but no process-context mods really
are needed here, and header and directory caching will reduce the
time needed. What is gained is very flexible filesystem space
allocation across many volumes, with the ability to use the entire
filesystem name space as a single entity. The added linking for
file access could be avoided by accessing a particular disk, of course,
but none of this depends on what filesystems are in use. If all
filesystems supported the ods-2 ACP interface, conceivably the
access could be transparent across all of them. Rules for volume
selection do not need to be based only on space, either.


While these systems can be implemented with heavy hacking in the
file system or the I/O system, you'll note that they can also be
handled with a layer between the two; done that way, they need
no modifications to the VMS kernel or the file system.
The per-volume directory is treated as primary here, as currently,
and the master directory is the secondary path. In addition, provision
of softlinks would be trivial for files, and probably manageable
for directories.

There would of course need to be some facility for detecting when
the master directory was not fully updated, much as mount verification
is done now.

The advantage of a system like this is that the entire disk farm would
appear to be one large tree structure, and adding new disks or even
new filesystems would be simple. Initially I might not bother with
letting files cross disk boundaries in this scheme, though it might
turn out not to be too hard to handle; the link headers might carry
the marking pointing at the next disk. In such a system, too, each
individual disk would have valid file structures of its own, and
it would not be necessary to worry so much about volume size limits
so long as aggregate store remained.