Glenn C. Everhart
						18-May-1996
Once in a while the group should have a "blue sky" discussion (just as we
have in magic sessions at DECUS symposia at times). If you don't care
for this one you can skip it now...

If on the other hand you have any thoughts about these kinds of issues,
perhaps this can turn into a useful discussion about methods.

For the last several years I've thought of ways to make the remote
virtual disks I've given out work r/w from many locations. Since at this
point I've not seen anyone else's code that does this, it seems
appropriate to share some design info I've worked out, for what anyone
wants to do with it.

The FDdriver based remote disk uses a very simple protocol talking to its
server, which just does logical I/O. Therefore, logical I/O is
effectively moved from one machine to another. This is fine as far as it
goes, and is what the MSCP server does also. Inside a cluster there are
locking mechanisms the file system uses to coordinate file accesses, so
any disk with a unique name & allocation class gets treated as the same
storage, regardless of underlying mechanism. I've used this to have
VDdriver do cluster virtual disks without serving the virtual disks, or
with VEdriver or VQdriver or WQdriver to do shadowing with each node
having its own shadow driver, or with SDdriver or VWdriver to do striping
with each node using its own stripe driver. In networks however, this
locking traffic does not exist.
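To make the "server just does logical I/O" idea concrete, here is a toy sketch in Python. Nothing here is the actual FDdriver protocol; the opcodes, header layout, and block size are invented for illustration, and the "disk" is just a byte array.

```python
import struct

# Hypothetical message framing for a minimal remote logical-I/O protocol,
# in the spirit of the FDdriver/server scheme described above.  The real
# protocol differs; this just shows how little a logical-I/O server needs.
OP_READ, OP_WRITE = 1, 2
HEADER = struct.Struct(">BIH")        # opcode, starting LBN, block count

def pack_request(opcode, lbn, count, data=b""):
    """Build a request: fixed header plus (for writes) the block data."""
    return HEADER.pack(opcode, lbn, count) + data

def unpack_request(msg):
    """Server side: split a request back into its fields."""
    opcode, lbn, count = HEADER.unpack(msg[:HEADER.size])
    return opcode, lbn, count, msg[HEADER.size:]

class ToyServer:
    """A toy server doing nothing but logical I/O on a flat disk image."""
    BLOCK = 512
    def __init__(self, nblocks):
        self.disk = bytearray(nblocks * self.BLOCK)
    def handle(self, msg):
        op, lbn, count, data = unpack_request(msg)
        off, length = lbn * self.BLOCK, count * self.BLOCK
        if op == OP_READ:
            return bytes(self.disk[off:off + length])
        self.disk[off:off + length] = data
        return b""
```

The point of the sketch is how thin the server is: it knows nothing about files, only about block numbers, which is exactly why the file systems on each side must coordinate by some other means.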

Now, it is conceivable to try to export the lock traffic by keeping
"shadow" locks around: a server on the system where the disk resides
uses blocking ASTs to determine when the various locks are taken, and
sends messages telling a remote server to grab and release the
corresponding disk locks. This has some dangers, though: the XQP's use
of locks is not a documented interface, it is likely to change with
time (think about it...), and the scheme IMPORTS into the local file
system any flakiness from the network's glitches. Still, the notion
of transmitting when the bitmap lock is passed around, and sending
updates out when this happens, has its attractions. But you must also
sense when the index file, directories, and (if you want to get this
deep) records
of files are written. The extent cache is used precisely to avoid hitting
the bitmap every time something needs to be written. If we continue to
export logical I/O, this would allow some operation in a network with
shared writing as if it were in a local system. The export would then
be in both directions. Note of course that trying to track all access
locking for records is a hard problem, because there's locking that goes
on within the XQP and RMS which is not explicitly available outside.
Providing shared reading (readonly access) or exclusive access is much
simpler since the locknames used for files are generally known.

Another approach is also feasible, and perhaps easier to understand.

Recall for this that every file access to a disk is via the ACP interface
(see the I/O user manual). A protocol that allows one system to handle
create, delete, extend, and truncate operations need not worry about
the bitmap; those are the only operations that affect the bitmap. If we
export logical I/O as the existing drivers do, then AS LONG AS THE
UNDERLYING FILE STRUCTURE IS THE SAME the disks will be handled so the
storage will not be corrupted. It is not quite enough in general, since
directory updates, for example, might not be coordinated. Some of this
can be handled by insisting we mount /nocache, so that the system has to
"go to the well" to get information off the disk every time. But it is
also worthwhile to be able to transport information about virtual I/O.
If we do this enough, and do NOT transport logical I/O, it is indeed
possible (but more work) to have most file operations work across a
network invisibly. You lose certain backup operations (so that, for
example, backup/physical would not work as it does for fddriver.) You
gain in principle the ability to make any file system on "the other end"
act like normal ODS-2. (Operations like this can be useful if you're
trying to do things like NFS clients.) For example, backup/record does
virtual I/O to the index file to write dates. What's important is to
ensure that writes get propagated around; for reading, provided the
systems speak the same file-structure language, you may be able to get
away with leaving logical I/O transported. Thus transporting virtual
writes and logical reads is one workable combination.
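The division of labor that paragraph proposes can be written down as a small routing table. The operation names below are illustrative stand-ins, not real ACP or driver function codes; the rule is simply that anything touching the bitmap or file structure goes to the disk system.

```python
# Illustrative routing for the "virtual write + logical read" scheme.
# Operation names are invented; they stand in for real function codes.
REMOTE_OPS = {"create", "delete", "extend", "truncate", "write_virtual"}
LOCAL_OPS = {"read_logical"}

def route(op):
    """Decide where an operation must run: anything that can change the
    bitmap or on-disk structure goes remote; plain reads stay local."""
    if op in REMOTE_OPS:
        return "remote"
    if op in LOCAL_OPS:
        return "local"
    raise ValueError("unknown operation: " + op)
```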

It is possible to intercept I/O operations in a driver by intercepting
at driver FDT time, or to steal the XQP entry and gain access there when
the IRP is fully formed. Those are the convenient points. At XQP entry
you are in a normal kernel AST context within your process and have the
full IRP to work with, and can still access data buffers.

Suppose I want to open a file for read now and I'm on a system remote
from the actual disk. I have a server running on the disk system (the
one with the actual disk), plus, locally, a virtual driver and a
process it talks to that do the actual work. Now I can have my local
system send a message to the
server to tell it to open the file on my behalf (transmitting security
information) so that it stays locked on the disk system. Generally I'd
do this for exclusive or r/o access (there's a bit in the FIB one can
look at to find about this quickly). The disk system need not send any
more info back; but it will "know" that file is open. I then just let my
normal local open run, and it reads the disk with read-logical and opens
the file locally. (Alternatively I can have the remote system send back
info about the file if I plan to handle only virtual I/O.)
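Here is a toy model of that "open on my behalf" exchange. The server's only real job in this scheme is to keep the file open (and hence locked) on the disk system; the file names, modes, and conflict rule below are invented for the sketch.

```python
# Toy model of the disk-side server holding files open on behalf of
# remote clients.  A real server would check the user's security info
# and do a real open; here we just record who has what, and refuse
# anything that conflicts with an exclusive access.
class DiskSideServer:
    def __init__(self):
        self.open_files = {}          # filename -> access mode

    def open_for(self, user, filename, mode):
        """Open 'filename' on behalf of 'user'; the file then stays
        locked on the disk system as the text describes."""
        held = self.open_files.get(filename)
        if held is not None and "exclusive" in (mode, held):
            return "conflict"
        self.open_files[filename] = mode
        return "ok"

    def close_for(self, filename):
        self.open_files.pop(filename, None)
```

After an "ok" reply the client's normal local open runs with read-logical, exactly as in the text; the server never needs to send file data back.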

To create a file, I need to tell the disk system to do the create, and
get it to tell me the resulting name (or report error) and if the create
went well, I need to locally open an existing file. (The disk system
would open the file at its end if needed.) Fortunately the arguments to
io$_access and io$_create are nearly alike so this can be handled.
It is permissible to wait until the disk system completes its operation
before proceeding with the local one, so things like the io$m_create
modifier to io$_access can be handled.
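The create sequence, with the remote create completing before the local open begins, can be sketched like this. The version-number convention mimics VMS file naming, but the classes and calls are invented for illustration.

```python
# Sketch of the create flow: the disk system does the create and reports
# the resulting name; only then does the client open the (now existing)
# file locally.  Names and the version scheme are illustrative.
class CreateServer:
    def __init__(self):
        self.files = {}               # name -> highest version so far

    def create(self, name):
        """Create the file and return the resulting name, with a
        VMS-style ;version suffix."""
        ver = self.files.get(name, 0) + 1
        self.files[name] = ver
        return "%s;%d" % (name, ver)

def client_create(server, name, local_open):
    """Wait for the remote create to finish, then do the local open of
    the now-existing file -- the ordering the text calls permissible."""
    resulting = server.create(name)
    return local_open(resulting)
```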

To delete a file, I need to get both ends to delete it, but only the
remote end will actually do the delete. I may handle the deletion by
simply telling the local system that the disk cache is now invalid (as
clusters do) so it will forget about the file and any directory entries
it thought might contain pointers to it.

To truncate a file, the disk system gets told to do the truncate, and the
local system just gets told to do a window turn (invalidating its
windows so it gets them loaded off disk again). Extend works the same
way.
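The local side of delete, truncate, and extend is pure invalidation, which a toy cache makes plain. The structure names below only mirror the real window blocks and directory caches in spirit.

```python
# Toy local cache showing the invalidation the text describes: the
# remote end does the actual delete/truncate/extend, and the local end
# just forgets what it cached so the next access rereads the disk.
class LocalCache:
    def __init__(self):
        self.windows = {}             # filename -> cached mapping info
        self.dir_entries = set()      # directory entries we believe in

    def remember(self, name, window):
        self.windows[name] = window
        self.dir_entries.add(name)

    def invalidate(self, name):
        """Delete: forget everything about 'name', directory entries
        included, as clusters do when a cache goes invalid."""
        self.windows.pop(name, None)
        self.dir_entries.discard(name)

    def window_turn(self, name):
        """Truncate/extend: drop only the mapping window so it gets
        reloaded off disk; the directory entry is still good."""
        self.windows.pop(name, None)
```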

Both the server process and the local process that talks to it should
be guarded against being clobbered. Setting nodelete (and maybe forcex
pending and delete pending) is fairly effective. Tricks like those used
by Ehud Gavron's "invisible" program could be used, but I consider doing
that on someone's system to be a slimy trick; if something goes wrong,
having a hidden process makes it hard to find out what's wrong, and a
process cannot be hidden from its necessary network connections anyway.
It is possible to monitor the io$_available, io$_unload, and io$_packack
functions to see when a disk is dismounted or mounted, and to use, for
example, io$_unload as a way to achieve a clean exit. Dismount/nounload
generates io$_available, and dismount (if the disk is marked as a
removable device) without /nounload generates io$_unload. Mount will
generate io$_packack. (This same pattern is true for non-disks, e.g.
tapes, as well.)
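The mount/dismount pattern the text gives maps directly onto a small event dispatch, sketched here (the action strings are just placeholders for whatever the server would actually do):

```python
# Sketch of the monitoring the text describes, using its stated pattern:
# mount generates io$_packack; dismount/nounload generates io$_available;
# dismount of a removable device without /nounload generates io$_unload.
EVENTS = {
    "io$_packack":   "disk mounted; start serving",
    "io$_unload":    "disk dismounted with unload; clean exit",
    "io$_available": "disk dismounted /nounload; stop serving",
}

def on_function(code):
    """Map an intercepted I/O function code to a server action."""
    return EVENTS.get(code, "ignore")
```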

To handle virtual r/w one must send it across and have the remote
system do it, and (for read) return the data. It is also possible to
mark a window block on the local system such that your "ACP" intercept
code gets called for every operation. At FDT time, of course, virtual I/O
entries are used for all I/O. If you want the XQP to see the file, you
must be sure that window blocks are marked so the XQP is always called.
This can be done, if needed, by setting IRP$L_PID to point at your
completion routine (save the original value somewhere first!); in that
routine, reset the structures as needed, then restore IRP$L_PID and
perform the actual completion. All XQP paths check this completion path
explicitly so that "system" routine completion code can be invoked. If
of course you propose to use all
virtual I/O, you can set up your own local window block that will have
this effect, since the disk system will be handling your actual I/O. (The
server had better have lots of channels available...). 
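The save/redirect/restore dance around the completion field is a generic hook technique, which this toy (and decidedly non-VMS) model illustrates; the field name echoes IRP$L_PID in spirit only.

```python
# Toy model of the completion-hook trick described above: save the
# original completion field, point it at our routine, and restore it
# before performing the real completion.  Nothing here is real VMS;
# 'pid' merely stands in for IRP$L_PID.
class ToyIRP:
    def __init__(self, pid):
        self.pid = pid                # stands in for IRP$L_PID

def intercept(irp, fixup, real_complete):
    saved = irp.pid                   # save the value somewhere first!
    def our_completion():
        fixup(irp)                    # reset structures as needed
        irp.pid = saved               # restore IRP$L_PID
        return real_complete(irp)     # then do the actual completion
    irp.pid = our_completion          # redirect completion to us
    return irp
```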

Note that I do not consider using MSCP or TMSCP protocols a viable
approach for any of this transport. The reason is that these protocols
are complex and documentation of them is not generally available except
at Digital. Moreover, like other undocumented interfaces they will change
now and again, and for disks MSCP is not sufficient for the stuff
discussed above. Remote tape access is best handled by using something
like ZT_driver, which has gotten very good. (Over TCP/IP one can use the
same code, but one must add some code to break up large records and
recombine them, and handle retries.) While this means adding one's own
drivers and so on, talking to the TMSCP server is not something that I
believe can be done safely by a third party. The effort of trying to
reverse engineer the Digital protocols would be enormous, violate
licenses in any case, and would break when the protocol changes. (This
even though the listings are on the listing CDs; they are not there to be
lifted completely, and other methods of figuring out the protocols would
be on very questionable ground if the intent was to duplicate the effect
totally.) Rather, a remote driver is perfectly feasible. Some functions
are possible to add, of course. One could, for example, add functions to
a server to report ucb$l_record from the real tape's UCB instead of
computing it. One could also then profitably allow some control remotely
over things like tape density, compression, compression algorithm, and so
on, so long as the underlying drivers have these capabilities. (On scsi
tapes, io$_diagnose can be used once you know the "magic densities" to
send to different drives; see examples on the sigtapes.) While using
zt_driver in a for-sale package has its questionable side too, I'm
speaking for public consumption here and don't think there'd be any
problems with someone adding to it as a service. Building something
comparable looks feasible too. 
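The "break up large records and recombine them" layer mentioned for TCP/IP transport is easy to sketch. The 8-byte fragment header below (record id, last-fragment flag, payload length) and the fragment size are invented; retries are omitted.

```python
import struct

# Sketch of the extra layer TCP/IP transport needs: large tape records
# broken into wire fragments and recombined on the far side.  Header
# layout and fragment size are illustrative only, and no retry logic
# is shown.
FRAG = struct.Struct(">IHH")          # record id, last flag, payload len
MAXFRAG = 4096

def fragment(recid, record):
    """Split one tape record into wire fragments; an empty record still
    produces one (empty) fragment so the record boundary survives."""
    out = []
    for i in range(0, len(record) or 1, MAXFRAG):
        chunk = record[i:i + MAXFRAG]
        last = 1 if i + MAXFRAG >= len(record) else 0
        out.append(FRAG.pack(recid, last, len(chunk)) + chunk)
    return out

def reassemble(frags):
    """Recombine fragments back into the original record."""
    record = b""
    for f in frags:
        recid, last, length = FRAG.unpack(f[:FRAG.size])
        record += f[FRAG.size:FRAG.size + length]
    return record
```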

Private protocols are handy for this, since they can readily be added to
and are not subject to change. The approaches I have suggested depend by
and large on documented VMS interfaces which (now that the step 2 driver
interface is here) are likely to be stable for a while.

It is interesting to think about approaches like this in terms of network
independence as well. Personally I'm inclined to wish such methods were
used instead of the current VMS scheme of having network code
incestuously involved with RMS code; it would seem to make both harder to
update. (Also, as I've speculated before, VMS I/O at driver level is very
fast, and user I/O can be fast also, by using the system/acp interface.
If one cared to, say, port a C runtime from Linux or some other system
where the underlying services needed are block I/O (remember RT11's
runtime?), it ought to run exceedingly fast.)

(Another aside: don't you wish other vendors would publish enough
interfaces to allow this kind of thing?)