Steve & Andy - I've written a bit about some issues that might be involved in getting I/O to scale better. My current thinking is more or less in the space of a layering: one or a few huge RAID systems at the bottom; snappy disks and virtual disks above that (order unclear), so you break the huge area into usably small pieces and keep the ability to back them up with snaps; and the disk melding stuff on top, so it all looks to the user like one huge file structure regardless of the number of volumes underneath, while new storage can still be added easily. I also mention the notion of a possible cache using solid state disk, which might be of value as a way of automatically keeping "hot" sectors available on SSD as an option. The multipath work I'm currently doing incorporates a number of key fixes to the I/O subsystem that will make this kind of stacking of functions work efficiently, BTW, and the code I handed Steve Flynn is designed to work properly with those changes. I have some VD (virtual disk) drivers that do and others that don't, so it'd be desirable to pick one of those that do... Steve Sicola (peaks::sicola) can give you info on the HSZ plans, which involve multipathed controllers with sizeable cache and hardware RAID. I believe however that in Wildfire type configs it'll be essential to have some form of recoverable storage at the bottom. I have ideas of how to put HSM into this picture too; Safety does this already, but my (currently - you should pardon the pun - shelved) plans for an "HSM phase II" that can remove headers completely from disk are partly superseded by disk melding. Anyhow, I wanted to get this into your hands while I'm still around, so that if there are any bits I can add, that can be done in timely fashion.

Huge I/O Systems
Glenn C. Everhart, 2 June 1997

This is intended to be a sketch of ideas on how one might reasonably build an I/O file system that will work for very large disk farms (i.e., on the order of 10,000 disks of a few GB each). The focus is tactical; that is, the idea here is to ask what can be done without designing a complete new file system, instead mostly using code that can be assembled relatively quickly for OpenVMS. The resulting system should be simple to manage and use from the user's perspective, i.e., look like one big system, not lots of little ones, and should permit growth easily (shrinkage too?). Data recovery and speed improvements need to be thought about. I will make a bunch of assumptions about what exists and that the reader understands the limitations of ODS-2 (and the planned ODS-5), especially the million-cluster limit, which COULD be lifted, supposedly fairly easily, and really OUGHT to be, by making the one 8-bit lock value block field represent not the relative block number of the bitmap but the relative 1/256th of the bitmap currently being used.
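To make the scale of that limit concrete, here is a back-of-envelope C fragment based on my reading of the sentence above (an 8-bit field naming a relative block of the storage bitmap caps the bitmap at 256 blocks); the names are mine, and it is meant as arithmetic, not as a reference to the actual file system code.

    #include <stdio.h>

    int main(void)
    {
        /* Assumption, from the text above: an 8-bit field can name at most 256
           bitmap blocks, and each 512-byte bitmap block holds 4096 cluster bits. */
        const unsigned long bitmap_blocks     = 1ul << 8;   /* 256  */
        const unsigned long clusters_per_block = 512 * 8;   /* 4096 */

        printf("max clusters = %lu\n", bitmap_blocks * clusters_per_block); /* 1048576 */

        /* If the field instead named 1/256th of the bitmap, each 256th could span
           many blocks, so the cluster count would no longer be capped here. */
        return 0;
    }

On a multi-GB disk (say 8 GB, roughly 16 million blocks) that cap is what pushes the minimum cluster factor up toward 16, which is part of why splitting huge volumes into smaller pieces matters below.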
----------------------------------------------------------------

Bits of Technology Available

1. RAID n in software. Shadowing, striping, and RAID drivers are available. These can be used in combination if need be.

2. Partitioning drivers. VD: is the best known example; it allows any contiguous segment of a disk to be treated as a separate volume. It works with any disk (including RAID, shadow, and stripe disks). Useful for shadowing disks of different sizes, or for making a huge RAID disk look like a number of smaller disks where the cluster factor would get too large.

3. Snappy Disks. These have an underlying file system which allows "snapshots" to be declared so as to enable "instantaneous" images of disks to be created for backup purposes. (It is unknown how, in the limiting case, this underlying system will affect performance.)

4. HSM. Moves files around; no strategy is known which lets it deal with removing all the headers from disk. (In principle, if one creates directories on the fly, a directory open could be used to signal an operation to pull such things back in.) One problem is that directory and index file entries all stay where they are...

5. A utility which presents BACKUP/PHYSICAL or HSC type backup tapes to VMS as disks exists on the Freeware CD. It has lots of cache and uses XOR and CRC info. The idea was that BACKUP/PHYSICAL runs maybe twice as fast as normal BACKUP; this scheme still allows recovery of individual files.

6. A journalling virtual disk driver exists. Writes get written to the underlying disk but also to a journal file, timestamped and block-number stamped. The journal amounts to a "continuous backup" in combination with a physical backup. The most useful case could be a variant where the journal goes to a solid state disk...

7. Kernel code, and some of the user code, exists for a directory melder which would use a Spiralog volume of 0-length files for directories and keep the actual data on "N" conventional file structure disks. The conventional disks could be ODS-2 or ODS-5 (or whatever...) and each would be a complete file structure. Thus, adding new disks would be easy to do. A bit of create overhead gets added, but directory search overhead goes down (Spiralog is 10 times faster than ODS-n for this function and scales well, since its directories are B-trees) and open overhead is very low. The user sees ONE large directory structure with no visible storage boundaries, but it gets hosted on N volumes. Each volume can be backed up or restored separately (or in parallel) without special synchronization among them all. This same hook can be used to generalize from Spiralog to a full RDBMS later on, and to add some remote access hooks, at file system level rather than RMS level.

(SLIGHT digression) The idea of extending this is that the code has the option of sending requests to a server which can run in user mode and having the server return data, as well as running kernel execution threads (not to be confused with VMS kernel threads... I mean a sequence of AST-driven code). If, instead of passing an I/O request (in kernel mode) to a Spiralog file system to find the file location given a name or ID, one passes it to a server running a DBMS that holds a relation containing pathname, underlying device, and underlying FID, the data can be returned from the DBMS. Pick a DBMS that scales well with size, so that directory access then scales well with size too, and with this scheme you can drop the DBMS version in when the system gets big enough to need it. What is more, the relation can have other fields in it that can be included in a query, and it is possible to do a "set default" (or "cd") with syntactic sugar so that operations like "set default (containing=payroll)" (not exactly how it might look, of course) make DIRECTORY and so on display only files which have the keyword "payroll" in their keyword list. (Also include files that haven't yet had their keyword list built, to keep from confusing folks.) This is a kind of V2 capability... Spiralog is probably good enough to gain better directory scalability and speed at first, and it saves one from writing support for read-directory. However, it means that one generalizes the notion of directory.
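To picture the kind of mapping record the melder (or its DBMS-backed variant) would traffic in, here is a small, purely illustrative C sketch. The record layout, the names (meld_entry, meld_lookup), and the toy in-memory table are my assumptions for illustration; the real mapping lives in Spiralog 0-length files or a DBMS relation reached through the kernel intercept and its user-mode server.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical record for the melder's mapping relation: one entry per
       file, keyed by the pathname the user sees in the single big tree. */
    struct meld_entry {
        const char *pathname;    /* visible path in the melded tree          */
        const char *device;      /* which underlying ODS-2/ODS-5 volume      */
        unsigned short fid[3];   /* the file's FID on that volume            */
        const char *keywords;    /* optional, for "containing=payroll" style */
    };

    /* Toy in-memory "relation"; stands in for Spiralog or the DBMS. */
    static const struct meld_entry relation[] = {
        { "[USERS.SMITH]REPORT.TXT", "DKA100", { 42, 7, 0 }, "payroll,1997" },
        { "[USERS.JONES]NOTES.TXT",  "DKB200", { 99, 3, 0 }, "" },
    };

    /* Resolve a pathname (optionally filtered by keyword) to device + FID;
       the open then proceeds against the conventional file structure. */
    static const struct meld_entry *meld_lookup(const char *path, const char *kw)
    {
        size_t i;
        for (i = 0; i < sizeof relation / sizeof relation[0]; i++) {
            if (strcmp(relation[i].pathname, path) != 0)
                continue;
            if (kw != NULL && *kw != '\0' && strstr(relation[i].keywords, kw) == NULL)
                continue;
            return &relation[i];
        }
        return NULL;
    }

    int main(void)
    {
        const struct meld_entry *e = meld_lookup("[USERS.SMITH]REPORT.TXT", "payroll");
        if (e != NULL)
            printf("open goes to %s, FID (%u,%u,%u)\n", e->device,
                   (unsigned)e->fid[0], (unsigned)e->fid[1], (unsigned)e->fid[2]);
        return 0;
    }

The point is only that a create adds one record insert, while directory scans and keyword-style queries run against a structure that scales, regardless of how many underlying volumes hold the data.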
I believe the code that has been furnished can be used without too much grief even to allow the DBMS features for only SOME directory accesses, should that be desired. Opening up the notion of directory in this way is however a potential advantage, because it means finding information can be readily machine-aided, and programs require NO change. If they can select a subdirectory, they can select on additional criteria, either locally or from any remote clients. Thus a W95 or NT client would gain this capability automatically, provided its storage were on VMS servers. (end of SLIGHT digression)

8. Various cache products exist. All use memory at present.

9. (Not an existing system, but an idea worth having around.) It is possible to cache slow disks on faster ones; a cache or journal on solid state disk would, for example, tend to be faster than normal disks. Let block accesses be written first to such a solid state cache so writing can complete fast (and be elevator-scheduled out to the underlying disks, as SCSI disks supporting tagged command queueing tend to do), and let reads come preferentially from there. If it's a real cache, it isn't volatile, so writing to the underlying media can be done only when needed. The cache can be a larger fraction of the underlying size, and file headers or directories that are wanted can tend to be there. It is possible, generally speaking, to distinguish file system reads from user reads... the file system uses read/write logical, users use read/write virtual... so one could prefer to keep file system info in the cache...

The basics... bottom level stuff

Underneath, we need a physical interconnect. The stuff the HSZ folks are working on looks good, and with path failover it'll be feasible to build reliable RAID disks with some hardware cache and even hardware writeback. Having the writeback in intelligent controllers is optimal in some ways. However, let's suppose we want to support 3rd party disks and/or controllers that just give plain disk access, like direct parallel SCSI disks do now. We should have a way to deal with that too. A software RAID driver exists, but we'll think about stuff that does RAID or whatnot in hardware, giving some number of large reliable things that look like single disks.

My earlier notion of an intercept which allows one Spiralog directory structure to serve for lots of disks is a good starting point for handling thousands of volumes tactically. (To be sure, something needs to be done to facilitate mounting them all by doing many mounts in parallel. Even a DCL script could handle this, though, using programmatic volume names in a config file.)

To make this sort of thing work best, one needs to split large volumes (e.g., a huge RAID set) up into a bunch of smaller virtual disks. VMS should be altered a bit so that its disk drivers can be told to start at some LBN and to limit their access to some maximum number of LBNs for a disk unit, so that booting from a virtual disk would work. The edits needed at the start of START_IO to handle this are really trivial, especially for drivers like DKDRIVER that don't convert the LBN. The minimum requirement is under half a dozen lines of macro around START_IO to relocate where on the disk the unit sits, plus small checks to allow the bootable volume to look smaller (sketched below). For simplicity, leave the boot volume as ODS-5 or ODS-2.
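The real edit would be a few lines of MACRO-32 wrapped around each driver's START_IO path; the C-flavored sketch below only illustrates the arithmetic and the bounds check. The structure and names (unit_partition, base_lbn, max_lbns, relocate_lbn) are made-up stand-ins, not actual VMS driver data structures.

    /* Illustrative only: per-unit partition bounds that a START_IO intercept
       would apply before handing the transfer to the real device. */
    struct unit_partition {
        unsigned int base_lbn;    /* where this virtual unit starts on the real disk */
        unsigned int max_lbns;    /* how many LBNs the unit is allowed to cover      */
    };

    /* Returns 0 and relocates *lbn on success; -1 if the request runs off the
       end of the partition (the driver would reject it with an error status). */
    static int relocate_lbn(const struct unit_partition *up,
                            unsigned int *lbn, unsigned int block_count)
    {
        if (*lbn >= up->max_lbns || block_count > up->max_lbns - *lbn)
            return -1;                /* bounds check: keep the unit "small" */
        *lbn += up->base_lbn;         /* relocate into the unit's region     */
        return 0;
    }

A unit with a base_lbn of 0 and a max_lbns smaller than the real device is the "bootable volume that looks smaller" case from the paragraph above.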
If underneath there's just a big stripeset instead of RAID, it's desirable to have error recovery and to speed things up. I suspect the XFC will push writeback cache; however, why not allow a site to have a larger writeback cache on solid state disk?

Thus you'd split volumes up with a virtual disk driver that clones each IRP and sends one copy off to the SSD with a time tag and LBN, and the other to the real disk if the real disk does writeback. If there is no writeback, notify a catch-up process to get the data from the solid state disk. The SSD buffer can be used circularly; when it fills, don't overwrite, but wait. On a cluster you'd have N of these circular caches, where N is less than or equal to the number of nodes. Keep track of the read and write positions with locks. If you come up after a crash, you'd have time-tagged records pertaining to the disks' data, and could find the last-written data using the time tags. If data always went user program -> SSD cache -> normal disk, this could be used to catch everything. Thus on reboot and mount you'd use this to put all the (non-system) data onto disk and zero the SSD.

Reading the data would need a list of in-cache data, so that whatever wasn't yet on disk could be grabbed from the SSD. Normally this would be a short list. If spinlocks are an issue, different lock tables or locks can be used for different LBN ranges. Lock 1/8192nd of a disk or some such...? So on write: send the data to the cache, grab the lock, write the LBN of the cached data to the short list, bump the pointers, release the lock. On read: check the list before reading from the normal disk. With everything in common memory it might be possible to have no locks at all for reading... just check the pointers. (Normally RMS locks etc. keep conflicts out above this level.) Thus the problem of one processor writing to the cache while another reads stale data off disk can be avoided, so long as each processor sees the block number list once the write to the cache is done. Remember, "cache" = solid state disk (SSD) here! The assumption is that SSD is cheaper per GB than memory; otherwise one uses memory for this. (A sketch of this write and read protocol appears below.)

If you don't care about booting from virtual disks, VDDRIVER in my latest version (the one that needn't grab I/O postprocessing), with a minor edit to incorporate the I/O post hook, should be as fast as needed, since it can keep its context info in the IRP itself and can be used to partition disks.
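To make the write/read protocol above concrete, here is a minimal C sketch under heavy assumptions: a single node, a single lock stub, a fixed-size block list, and toy in-memory stand-ins (ssd_sim, disk_sim, ssd_write, disk_write, etc.) in place of the real SSD and disk drivers. It is meant only to show the ordering (stamp and write to the SSD ring, publish the LBN to the short list, then let the catch-up path drain to disk), not anything resembling the real IRP-level code.

    #include <string.h>
    #include <time.h>

    #define RING_SLOTS 1024             /* circular SSD journal, one block per slot */
    #define BLOCK      512

    struct slot {                       /* what goes to the SSD for each write      */
        unsigned int  lbn;
        time_t        stamp;            /* time tag, used for post-crash recovery   */
        unsigned char data[BLOCK];
    };

    /* Toy stand-ins: a simulated SSD, a small simulated disk, and a lock stub.
       A real version talks to drivers and takes a spinlock or cluster lock
       per LBN range. */
    static struct slot    ssd_sim[RING_SLOTS];
    static unsigned char  disk_sim[4096][BLOCK];      /* toy-sized disk */
    static void ssd_write(unsigned int n, const struct slot *s)      { ssd_sim[n] = *s; }
    static void ssd_read(unsigned int n, struct slot *s)             { *s = ssd_sim[n]; }
    static void disk_write(unsigned int lbn, const unsigned char *d) { memcpy(disk_sim[lbn], d, BLOCK); }
    static void disk_read(unsigned int lbn, unsigned char *d)        { memcpy(d, disk_sim[lbn], BLOCK); }
    static void lock(void)   { }
    static void unlock(void) { }

    /* The "short list": LBNs that are in the SSD ring but not yet on disk.
       head is the next write slot, tail the oldest undrained slot. */
    static unsigned int pending_lbn[RING_SLOTS];
    static unsigned int head, tail;

    /* Write path: stamp the block, write it to the SSD first, then publish
       its LBN on the short list so readers know the newest copy is cached. */
    static int cached_write(unsigned int lbn, const unsigned char *data)
    {
        struct slot s;
        lock();
        if ((head + 1) % RING_SLOTS == tail) {   /* ring full: wait, never overwrite */
            unlock();
            return -1;
        }
        s.lbn   = lbn;
        s.stamp = time(NULL);
        memcpy(s.data, data, BLOCK);
        ssd_write(head, &s);
        pending_lbn[head] = lbn;
        head = (head + 1) % RING_SLOTS;
        unlock();
        return 0;
    }

    /* Read path: check the short list before going to the normal disk, and
       prefer the newest cached copy of the block. */
    static void cached_read(unsigned int lbn, unsigned char *data)
    {
        unsigned int i, found = RING_SLOTS;      /* RING_SLOTS means "not found"     */
        struct slot s;
        lock();
        for (i = tail; i != head; i = (i + 1) % RING_SLOTS)
            if (pending_lbn[i] == lbn)
                found = i;                       /* keep scanning: newest copy wins  */
        if (found != RING_SLOTS) {
            ssd_read(found, &s);
            memcpy(data, s.data, BLOCK);
            unlock();
            return;
        }
        unlock();
        disk_read(lbn, data);
    }

    /* Catch-up path: drain the oldest pending block out to the real disk. */
    static void catch_up_one(void)
    {
        struct slot s;
        lock();
        if (tail == head) { unlock(); return; }  /* nothing pending */
        ssd_read(tail, &s);
        disk_write(s.lbn, s.data);
        tail = (tail + 1) % RING_SLOTS;
        unlock();
    }

    int main(void)
    {
        unsigned char buf[BLOCK] = { 0 }, back[BLOCK];
        buf[0] = 42;
        cached_write(7, buf);     /* goes to the SSD ring first              */
        cached_read(7, back);     /* served from the ring, not from the disk */
        catch_up_one();           /* later, the catch-up drains it to disk   */
        return back[0] == 42 ? 0 : 1;
    }

On a real cluster there would be one such ring per node, cluster or spin locks per LBN range in place of the lock() stub, and crash recovery would walk the ring by time tag rather than draining it live.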