SCSI Naming DRAFT

Problem:

SCSI-3 widens IDs and LUNs to 64 bits each, making the current device naming scheme unworkable. Also, some Alphas can in principle take many more than 26 SCSI controllers, so the controller letter scheme for SCSI port drivers is also broken. A new scheme is needed in which a booting cluster member can determine device names, and in which all members will find the same stable, nonconflicting names.

The current DKdriver incarnation supports only 16 device IDs (on a wide bus) and 8 LUNs per ID, though many disks have no LUNs. Moreover, the entire boot path presumes that a single letter suffices to identify a port. (The boot code generally uses two letters, a la many spreadsheets, when the number of controllers gets over 26.)

As new devices are added to a cluster, the names chosen must be consistent clusterwide lest file structures be corrupted. The naming mechanism must have a way by which names can be consistently chosen. For user simplicity, this mechanism must also be largely automatic and able to work in the face of devices being moved from one connection point to another.

A corollary problem is that a very large Fibre Channel network might in principle have hundreds (or more) of devices, while a particular computer connected to that network might have interest in only a few, and indeed might not have sufficient memory even to hold UCBs for all the disks in such a configuration. In the past, the SCSI bus has been small enough that this has not been an issue. Fibre Channel in principle means there can be an enormous number of devices, and a configuration scheme must not only ensure that all device names are uniform across a cluster, but must be able to do so where not every device is made known to every machine.

Goals:

1. Come up with a way to allow more SCSI ports than current limits.
2. Come up with a way to permit more units on SCSI busses.
3. Ensure that some provision is present for selectively not configuring EVERY device that is findable.
4. Ensure that provision for uniformity across the cluster is well supported and at least mostly automatic.
5. Be able to interface with switching and with servers.
6. Ensure that a cross reference from device name to (worldwide) SCSI device ID exists.
7. Design so the system does not require huge efforts to configure or maintain.

Non-Goals:

1. Devise details of how one sets up names when autoconfiguring a Fibre Channel.
2. Specify all details of how SCSI devices are found on a bus.

Background:

VMS device names are limited to 15 characters by lots of existing code. Current SCSI device names look, in principle, like $12345$DKB1203:, where the allocation class is large enough and where the unit number is computed as SCSI ID * 100 + SCSI LUN (see the sketch below). In this naming, the "$12345$" part is the allocation class, the "DK" part is the class driver name, the "B" part is the port driver letter, and the rest is the ID/LUN number, where SCSI IDs may currently be in the 0-15 range and LUNs in the 0-7 range. Note that the port letter is currently used to identify which port driver is being used, so the expectation is that driver PKB0: will control devices named *DKB*. If some naming commonality is not maintained, another scheme to associate port and class drivers must be found. Since the number of controllers and the number of SCSI devices on SCSI busses are expected to grow beyond 100, and the number of ports that can exist on Alphas can already be in the hundreds, the current naming scheme falls short both in SCSI device count and in number of ports.
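For concreteness, the current unit-number encoding can be stated in a few lines of C. This is a sketch of the arithmetic described above, not actual driver source:

    #include <stdio.h>

    /* Current DK naming: unit number = SCSI ID * 100 + LUN,
       with IDs 0-15 and LUNs 0-7 today. 64-bit SCSI-3 IDs and
       LUNs clearly cannot be packed this way. */
    static unsigned dk_unit(unsigned scsi_id, unsigned lun)
    {
        return scsi_id * 100 + lun;
    }

    int main(void)
    {
        /* ID 12, LUN 3 on controller B with allocation class 12345
           yields the name $12345$DKB1203: */
        printf("$12345$DKB%u:\n", dk_unit(12, 3));
        return 0;
    }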
Discussion:

Number of Port Drivers Limitation:

There are a few possible approaches.

Approach 1

In principle, port drivers for SCSI could simply be assigned different unit numbers rather than different port letters, so there would be a way to load as many as were wanted. Some early tests demonstrated, however, that there are a number of assumptions in the code which make this choice not totally trivial to implement. The major problem is that we have generally loaded each port driver without regard to its type, and port unit numbers would require new code to keep track of type. That is, it would make sense that ten units of PKSdriver should all be connected to a loaded PKSdriver, but the usual way that drivers are loaded does not make it straightforward to have PKC0: connected to PKSdriver, PKC1: connected to PKCdriver, PKC2: connected to PKJdriver, and so on.

Approach 2

As an alternative, port drivers (where the port letter is 'Z' or would need to go above it) can be modified to also have port allocation classes, so that they would be made unique via the port allocation class. By so doing, the paradigm introduced for 7.1 is reused, reducing user confusion, and by leaving port letters alone where there are 25 or fewer ports, many smaller configurations will notice no change. Since in fact the support for SCSI ports in OVMS currently runs out at around 18, this means no change for anyone today. Instead of finding a new controller letter for every port driver, a new allocation class will be needed after the 26th port. This appears to be supported by the driver_load routine as an alternative.

While this will allow more unique port driver names to be added, it must also be noted that this solution is not completely clean. Port letters are generally assigned very early in the boot sequence, and in this case a range of port driver allocation classes could be needed before any cluster activity could be started to check such access. This is not a direct problem, inasmuch as port driver names are not exported to the cluster, but if class driver names generally would match this allocation class to permit simple port-class name association, it can be an issue. The simplest solution in this case is to require that port allocation classes be supplied for all such cases, and to alter the PAC recognition code to check for port drivers by allocation class as well as by letter. A unique range of automatically generated port driver allocation classes would facilitate this without much loss of generality in the use of allocation classes (since 16 bits exist to hold them). Thus port drivers would be given names PKA, PKB, PKC, ... PKY, $16000$PKA, $16001$PKA, $16002$PKA, and so on, as an example (see the sketch below). Because of the folding of letters to A where port letters were Z or larger, naming can be made consistent for high port names, even as port allocation classes fold port letters to A currently. (In practice we would fold the names to A for class drivers.) In this way, problems with port driver unit numbers in INIT_IO_DB can be skirted, and devices can initially be assigned these port allocation classes, with checks to ensure that they do not collide implemented at the time the class devices are created (as is currently done for port allocation classes). Should the port allocation class naming not be available, ports beyond PKZ would simply not be configured. Should ports be added or removed, the port identifiers will change, as has been the case in the past.
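A minimal sketch of that progression, assuming the $16000$ base shown above and 25 plain letters before allocation classes kick in (the base and the cutoff are the example values from the text, not settled choices):

    #include <stdio.h>

    #define AUTO_PAC_BASE 16000u   /* example base from the text */

    /* Ports 0-24 get plain letters PKA..PKY; later ports get an
       automatically generated port allocation class, with the
       controller letter folded to A. */
    static void port_name(unsigned port_index, char *buf, size_t len)
    {
        if (port_index < 25)
            snprintf(buf, len, "PK%c", 'A' + port_index);
        else
            snprintf(buf, len, "$%u$PKA",
                     AUTO_PAC_BASE + (port_index - 25));
    }

    int main(void)
    {
        char name[16];
        for (unsigned i = 23; i < 28; i++) {
            port_name(i, name, sizeof name);
            printf("%s\n", name);  /* PKX PKY $16000$PKA $16001$PKA ... */
        }
        return 0;
    }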
Considering that no current port has an architected identity, attempting to record the order of ports automatically is not feasible. It is however possible that some way to assign port letters or allocation classes "out of order" may become desirable. This port naming approach could be used, but producing the "right" (for the cluster) classes would be a problem when it came time to relate the port and class driver names, and may serve only to introduce confusion.

Approach 3

It is possible for port drivers to use two letters for the port identity, since there would be only one port driver per port and thus all could be unit zero. If class units reflected this naming, class device names would be unacceptably constrained to 3 digits for the unit number. But there is no need for class drivers to follow suit; recall that port allocation classes break that correspondence already. If port drivers just use two letters, of course, they could be related to class devices by an algorithm, rather than by having identical names. Since port drivers are local to each system, cross-cluster coordination is not needed. But once class driver unit names are generated, a fixed algorithm becomes hard to obtain.

One is strongly tempted to abandon naming consistency between port and class devices, and this in fact is what I recommend. We will need to force port allocation classes on class units matching these, and thus the "controller letter" would be forced to "A" regardless. This will be simpler in some ways than using allocation classes for port drivers, since it means that the controller letters will not need to somehow match across a cluster. Matching them is very hard, since early in the boot path not all information may be available. The simplest solution is to record the port-to-class association somehow in each node and not even attempt to make port names consistent across a cluster. Where there are shared busses in large configurations, it becomes overwhelmingly likely that port allocation classes will be used for class devices, where a mechanism for coordinating the names exists. Port drivers don't need to be visible across the cluster, so all that should be needed is enough information for SDA and similar applications to find the match.

Of the three solutions here (port driver unit numbers, allocation classes for port driver names, and multi-letter port "letters" together with required port allocation classes for controllers above Z), the multi-letter port letters seem the simplest. In init_io_db, a full longword is available for the port "letter", so the progression A, B, C ... Z, AA, AB, AC, ... AZ, BA, BB, ... ZZ seems feasible (see the sketch below). There are assumptions scattered here and there that the name length is 4 characters, but if the unit number parsing is arranged so that a name like PKAC: is treated as though it had been PKAC0: (with all unit numbers being zero), this can be lived with for the currently foreseeable future. This encoding can handle up to 702 ports (26 one-letter names plus 676 two-letter names), enough for the machines we have and likely enough for some time to come. (By using 3 letters, the number of possibilities grows to 18,278, should the 4-letter name assumptions in init_io_db be relaxed. Enough other assumptions in VMS file handling would need to be updated to handle these numbers of controllers that it seems unlikely to be necessary to exceed that for at least the next 5 years.)
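A sketch of the index-to-letters mapping (bijective base-26, giving A..Z and then AA..ZZ); this illustrates the progression only and is not the INIT_IO_DB code:

    #include <stdio.h>

    /* Map port index 0,1,...,701 to A,B,...,Z,AA,...,ZZ
       (bijective base-26). Three letters would extend the same
       scheme to 18,278 names. */
    static void port_letters(unsigned n, char *buf)
    {
        char tmp[4];
        int  i = 0;
        n += 1;                       /* bijective: A is 1, not 0 */
        while (n > 0) {
            n -= 1;
            tmp[i++] = 'A' + (n % 26);
            n /= 26;
        }
        while (i > 0)
            *buf++ = tmp[--i];        /* most significant letter first */
        *buf = '\0';
    }

    int main(void)
    {
        char s[4];
        unsigned samples[] = { 0, 25, 26, 51, 701 };
        for (int i = 0; i < 5; i++) {
            port_letters(samples[i], s);
            printf("%u -> PK%s\n", samples[i], s);  /* A, Z, AA, AZ, ZZ */
        }
        return 0;
    }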
Number of Units Limitation:

Unit numbers currently use a 100 by 100 space (ID*100 + LUN), which is in general inadequate for switched fabrics (which may have many LUNs and only a few IDs).

Approach 1 (not best)

Somehow completely virtualize the names, with a start and end unit number for each node as new SYSGEN parameters and some convention for numbering units on HSZs and similar controllers. Similarly, SYSGEN parameters would be needed for other controllers or for FC loops. In this mode, we have a SYSGEN parameter for intelligent controller nodes' starting and ending device numbers and require that these be adjusted so that the numbering matches clusterwide. This is already a massive problem, since one can imagine 8400-class machines with scores of HSZs. Add other controllers and an intractable problem becomes completely unmanageable. Over and above the difficulties of the plethora of new and obscure SYSGEN parameters, getting the convention to pick the HSZ order always the same could be tricky unless we assume all HSZs are connected and can neither be added nor removed without rebooting the cluster, and even then more external unit number information would seem to be needed to ensure it is stable... possibly a SYSGEN parameter per HSZ, or its equivalent in some configuration data file. Using large numbers of SYSGEN parameters looks infeasible.

A variant approach simply uses a configuration file (replicated if need be, and with names checked across the cluster when new devices are found), in which a worldwide unique ID maps to a name, but the names are completely abstract rather than showing a relation to their ports (a sketch of such a record appears below). The difficulty here is that as new devices are found, some way is needed to prevent two systems from using the same abstract name for different devices. There are ways to do this; for example, a common lock used to regulate who has name choice rights, with the lock value holding the last name. Once a choice is made it should be placed in the config file(s) so that subsequent accesses do not need to choose it, and will know of it. The difference is that we don't try to assign an a priori part of the namespace per cluster member, but negotiate it as the system comes up, using a config file to avoid having to negotiate already-negotiated names.

There are two ways to handle the boot disk name problem. One is to require that the boot disk for each node pre-exist in the configuration file each node can reach directly on its boot disk when the system boots, backed by a parameter which could fill in where this is absent. The other is to assign a unique name to the boot disk which cannot ever conflict with the finally negotiated name, and use that during boot. Once the "real" name becomes available, just rename the class UCB appropriately by changing its DDB name, UCB unit number, DDB allocation class (if need be), and so on. Pointers to the UCB would remain valid, and one can construct an exec logical name to point the early boot name at the new device name so that any residual attempts to open the old name would reach the correct device. This renaming would only be necessary if the configuration file did not have a valid device name. If it had something that looked valid but which (once cluster communications were up) turned out to be in use elsewhere, the node would have to crash itself. (More will be said about cross-cluster checks below.) This is feasible.
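A sketch of what one record of such a fully abstract configuration file might hold; the field names and the text form are purely illustrative, not a committed layout:

    #include <stdint.h>

    /* One device entry in the hypothetical clusterwide config file.
       A text form of the same record might read:
         WWID=0123456789ABCDEF NAME=$57$DKA1234 FLAGS=00000001
    */
    struct abstract_cfg_entry {
        uint64_t wwid;        /* worldwide unique device ID        */
        char     name[16];    /* abstract VMS name, 15 chars + NUL */
        uint32_t owner_sysid; /* SCSSYSTEMID of the naming node    */
        uint32_t flags;       /* e.g. bit 0: name is in use        */
    };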
While it is a disadvantage that this needs the configuration file to have a line for every device in the cluster, it can be done, and where a local config file has a name for a device found by worldwide ID, that name can be used directly.

Approach 2 (not best)

If we use the port letter or allocation class, requiring the latter where the former is ambiguous, we can relate ports to class devices. To handle the large number of device names possible, we will require a configuration file which contains a cross reference of the worldwide IDs of devices and their names. This configuration file will normally be generated automatically by the configuration process in VMS, being built as devices are seen and written automatically once the system is configured. Any configuration file scheme has similar needs about getting things synchronized across the cluster. A scheme that needs a configuration file needs it created in memory early on (presuming one started with none on disk); it would contain the WW ID, the port identification (a letter or an allocation class, with space for the longer of the two, plus the "identity" of the port where it is feasible to have one, so that we are not dependent on device scan order in the future), and the unit number on that port. When the code was going to assign a unit number, it would need to acquire a lock exclusively, assign it, and notify the rest of the cluster of the configuration. This can be implemented in a number of ways, and presuming one starts from a valid configuration, a chosen way must end in a valid configuration. The port name would be available to associate port and class devices and would be needed per node where more than one node could directly access a device. (Once a unique local name is arrived at, often with a port allocation class so that ports above "Z" can be handled, servers will propagate it. Our config database need only be concerned with devices on direct paths.)

Since it is not possible to have common storage always available to all cluster nodes, it will be necessary either to check all name consistency on every boot, or to have a parameter which can be set to indicate that the configuration has changed, so that a total cross check is done when it is set. The default would be for such a flag to be set, so that positive action would be needed to reset it. (This is controversial; maybe it is best to take longer but allow no chance of accidentally corrupting disks.)

A machine entering (or forming) a cluster would first attempt to lock and read the configuration file and find a worldwide ID there (unlocking when done). If it succeeded, it would be able to assign the unit number directly with no arbitration. If it did not, it would need to arbitrate using locks (or other convenient mechanisms as may appear) to select a unit. A lock value block or similar would need to hold the current highest unit number being assigned, so that it would be readily available. (The first thing a new machine should do is read the entire configuration file, so it can also fill in holes, avoiding running out of unit numbers; see the sketch below.) This configuration step would also be skipped if a flag in the config file said to do so. If of course the unit were known, the new machine would instead find the generally used name and set that up. Where a configuration file disagreed in the driver name and had no port allocation class to fix it, it would be necessary for a new node finding such a name to hang and do no work, lest it corrupt a disk.
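A sketch of the hole-filling choice mentioned above, done against the in-memory copy of the configuration; the record layout and names are hypothetical:

    #include <stdint.h>
    #include <stdio.h>

    struct unit_entry {
        uint64_t wwid;   /* worldwide ID bound to this unit */
        uint16_t unit;   /* assigned unit number, 0-9999    */
    };

    /* Pick the lowest unassigned unit number rather than only bumping
       the high-water mark kept in the lock value block, so holes left
       by removed devices are reused and the 0-9999 space is not
       exhausted prematurely. */
    static int next_free_unit(const struct unit_entry *tab, int n)
    {
        for (int unit = 0; unit <= 9999; unit++) {
            int used = 0;
            for (int i = 0; i < n; i++)
                if (tab[i].unit == unit) { used = 1; break; }
            if (!used)
                return unit;
        }
        return -1;   /* unit number space exhausted */
    }

    int main(void)
    {
        struct unit_entry tab[] = { {0x1111, 0}, {0x2222, 1}, {0x3333, 3} };
        printf("next unit: %d\n", next_free_unit(tab, 3));  /* prints 2 */
        return 0;
    }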
Port allocation classes must be mandatory for port letters above "Z". Once the machine had finished setup, it would proceed to write out its new configuration (if it had changed) to its storage site, doing so late enough in system uptime that the full filesystem would be available for locking. The configuration would however remain in memory, and would be reworked any time new devices were added. The config file would be written by every node sensing changes, so that all copies on disk would be maintained. In essence this configuration file would be "shadowed" all over the cluster. The structure of such a file must be such that each node's identity is also present, for most devices may be known on one node only. It must further be set up so that any node can read it, merge in its changes, and write it out again. By setting up the file with the SCSSYSTEMID of each node this could be done. Reading the file and writing it need to be done again once the full system is up, to ensure that no race conditions occur that are not protected against by cluster locking protocols.

A node would thus, once its ports were configured:

* Read the config file, if any, into memory
* Arbitrate with other nodes, if any, to add in all new devices found in its device scans, building a memory structure of all devices in the cluster using its local devices and what it found in the config file
* Lock its config file with a clusterwide lock (which needs to be able to prevent any new machine from reading the database, by whatever means may be convenient)
* Reread its config file (so nothing can have been missed)
* Merge in its changes and rewrite its config file
* Release its lock so others could update their files

When this is done, all nodes will have had a chance to update their names and be synchronized. The arbitration must of course use a SYSAP or the lock manager to communicate name choices. To ensure that common busses are known, the current port allocation class logic will be used, but with the possibility that there might be more than one port letter.

Approach 3 (best)

Remembering that we need only concern ourselves with directly connected devices, we can consider the naming in two steps. It will be necessary to use port allocation classes for shared busses; those are not controversial. We can however also use allocation classes for nonshared busses by arbitrating the classes and keeping a record of what we use in a local configuration file on every node. The configuration file will store the values assigned locally for device names per allocation class. Since the allocation class separates them, we need only cluster-arbitrate for an allocation class (easily done with the lock manager or a SYSAP which needs to be listening) and do not need to arbitrate local bus names. These can be kept in a local config file for devices not on a shared bus, and its naming choice can be definitive. For shared bus devices, the nodes sharing the devices need to check the choice of names. This is proposed because hashing a 64-bit quantity into a unit number in the range 0-9999 cannot reliably produce something unique (see the sketch below), and we will want the device name to be related to its worldwide ID, not some more evanescent value.
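To make the hashing objection concrete, a birthday-bound estimate (a sketch; the 0-9999 space is the one from the text, and a uniform hash is assumed) shows collisions become more likely than not at only about 119 devices:

    #include <stdio.h>

    int main(void)
    {
        /* Probability that n devices hash to n distinct unit numbers
           out of 10000, assuming a uniform hash. */
        double p_all_unique = 1.0;
        for (int n = 1; n <= 10000; n++) {
            p_all_unique *= (10000.0 - (n - 1)) / 10000.0;
            if (p_all_unique < 0.5) {
                printf("collision more likely than not at %d devices\n", n);
                break;   /* fires at n == 119 */
            }
        }
        return 0;
    }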
It would be attractive to find some way to have most device names on shared busses be automatically assigned, but again, if a hash were used and devices whose names don't hash uniquely were arbitrated, even if this could be made reliable, device unit numbers would be semi-randomly scattered over the number space, a situation customers are unlikely to want. Therefore a system is needed which can use the lock manager to select a "naming master" for each shared bus (by acquisition of a lock), and to define a "right to talk to the naming master". These rights would be defined after the (current) arbitration of the port allocation class, so that we know that only names on "this" shared bus need to be unique.

The naming master will initially be the first system to configure the bus. The naming master will read its configuration file (and note that the config files are supposed to all be identical, though each may be a copy of the others). It will then find what WW IDs are on this bus, and will assign device names using its config file to cross reference the names and assigned unit numbers. Where a vacancy has been created, it will note this with a flag added to that record, so that the number can be reused later by a cleanup of the record (manually or by some automated process to be defined). This will account for devices which may be powered down or otherwise temporarily unavailable. Should a device reappear, it will be flagged present again, and in any case its unit number assignment will remain reserved. New devices will also be assigned unit numbers and have records created.

A node will in all cases attempt to acquire the naming master lock and the "communicate with naming master" lock (the latter in a mode to block others too, with a blocking AST, if it acquires the naming master lock). By storing a tag like SCSSYSTEMID in the lock value block, a node can identify when the lock was acquired by itself, and others can sense race conditions. Now when another system comes up and tries to grab the naming master lock, it will find it in use. (Some work with lock values must be done to guard against race conditions.) Thus it will acquire the lock that allows it to communicate with the naming master, thereby notifying the naming master via blocking AST that someone has appeared (and also handling race conditions in case the master has not fully initialized). Then it will receive, and the master will send, a copy of the master's configuration file (which will be in the master's memory by then). This can be written to the slave's configuration file and used to set things up. The one exception is that the slave must ensure that its system disk, if on the same bus, has been named compatibly. This is the one item it may need to pull from its local configuration file prior to the opening of cluster communications. Should the system disk be misnamed, the node so affected must simply hang, and the local configuration file will have to be edited to clear the conflict. A sketch of this election and handoff appears below.

The great advantage of this scheme is that the configuration file information needs to be maintained only per bus (no grand global naming concordance is needed), and only one configuration file is treated as authoritative, recorded, and duplicated. Also, because single-system busses will be distinguished by a port allocation class if a letter cannot be used, only one file will be the naming authority. This provides boot-to-boot name stability with checking.
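A control-flow sketch of the election and handoff. The lock helpers below are hypothetical stand-ins for $ENQ/$DEQ calls with value blocks and blocking ASTs, stubbed here only so the sketch is self-contained; the resource names are illustrative:

    #include <stdio.h>
    #include <stdint.h>

    typedef enum { GRANTED, BUSY } lockstat;

    /* Hypothetical lock-manager wrappers, stubbed for the sketch. */
    static lockstat try_master_lock(const char *bus, uint32_t sysid)
    { (void)bus; (void)sysid; return GRANTED; }     /* EX mode, no-wait */
    static void get_talk_lock(const char *bus, void (*blkast)(void))
    { (void)bus; (void)blkast; }                    /* wait, with blocking AST */

    static void master_blkast(void)
    {
        /* A new node has queued for the talk lock: ship it the
           in-memory configuration for this bus (bulk transfer preferred). */
        printf("master: sending configuration to new node\n");
    }

    static void configure_shared_bus(const char *bus, uint32_t my_sysid)
    {
        if (try_master_lock(bus, my_sysid) == GRANTED) {
            /* First to configure the bus: we are naming master. Hold the
               talk lock too so a newcomer's request fires our blocking
               AST. Read the config file, scan the bus for WW IDs, and
               assign unit numbers, flagging vacancies for later reuse. */
            get_talk_lock(bus, master_blkast);
            printf("naming master for %s (sysid %u)\n", bus,
                   (unsigned)my_sysid);
        } else {
            /* A master exists: queue for the talk lock, receive its
               configuration, write the local copy, then verify that our
               system disk (if on this bus) is named compatibly; if not,
               hang and require manual repair of the local config file. */
            get_talk_lock(bus, 0);
            printf("slave on %s: configuration received\n", bus);
        }
    }

    int main(void)
    {
        configure_shared_bus("SCSI$PKA", 1001);
        return 0;
    }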
It is desirable of course to use a bulk transfer to move the configuration file information, rather than a long handshake with locks, to reduce the time needed. It is conceivable that two configuration files which are badly out of synch might be used, so that you might boot nodes A, B, C, and D in that order in the cluster, then boot E, B, C, and D later with a different master configuration. The names will still be unique, so no corruption of disks will happen, but they will not be stable. This needs to be warned against. However, should the cluster in question ever boot with A and E in the cluster at the same time, the disagreement will automatically be cleared. This situation can only happen if A, B, C, D, and E are all on one shared bus, by the way. In such cases, the more common situation is likely to be that the boot disk is shared and there is only one config file.

Of the schemes presented, the third is far and away the simplest in logic and overhead. It factors out most of the naming conflicts by making all decisions only on directly connected nodes, and only per bus. Note too that the notion of a simple flag in a configuration file preventing a device from being configured still works in this scheme, as config files are still used to provide boot-to-boot naming stability. Thus the third scheme is preferred.

Implementation:

To implement Approach 3, some mods in INIT_IO_DB will be required to handle multi-letter port names (guess a month at most to do this), some locking logic needs to be added, more or less in the same locations now being used for Port Allocation Classes, and some code to force PACs to be used where the bus letter is over 'Z' will be required also. This looks like a couple of months' work, with a considerable part in the research into getting SCS messages transferred to efficiently move data... an area not often used in SCSI-land. (One reason to bite the bullet now on this is that its efficiency in galaxies stands to gain tremendously; requiring full target mode in SCSI could do the job, but could be much less efficient.) (shhhhh!)

To integrate with older devices, the current ID/LUN numbers together with the port PAC or letter can be used in place of the WW IDs; numbering here need not even change, so older systems will see NO alterations to what they have grown accustomed to (see the sketch at the end of this document). Test mode code will need to fake large port letters (perhaps arranging to bump the letters by 20 instead of by 1 count, or something similar, to ensure that the code works for large numbers of ports) and to fake WW IDs. This does not, in prospect, seem overly complex.

Risks:

The use of PAC logic and its prior implementation substantially cuts the risk of this approach. The MSCP server should need no changes at all, and the QIO server does not grow any more complex. Since all names are assigned relative to a SCSI port, the problems of a completely virtualized disk naming scheme envisioned in John Hallyburton's investigations are simply bypassed; if some busses aren't there, it has no effect on any others' namings (so that, for example, if node E above were a large node with many devices, it wouldn't matter so long as the PACs had been set up correctly). The possibility of name instability where clusters are partly booted exists only for shared busses, and in almost every case a scan of the same bus using the same logic will come up with the same results even though done independently.
One would have to go out of one's way to louse up such a configuration, and even then it would self-repair the first time all systems were up on the shared bus together. Configuration files will in general be created automatically and will need manual maintenance only where a system should not configure all the devices it finds. This is currently rare, and should it become the rule later for, say, FC nets, it should suffice to provide a tool when the need arises, and perhaps to default in those cases to device names being known but not used. The fact that we use a naming master in this architecture should also segue neatly into using a name server on an FC net, should that need arise.
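Finally, a sketch of a per-bus configuration record carrying the flags discussed above (presence, reserved vacancy, do-not-configure), and of synthesizing a stand-in WW ID for pre-Fibre-Channel devices from the port PAC (or letter) plus ID/LUN. The field names and the bit packing are illustrative only, not a committed on-disk format:

    #include <stdint.h>
    #include <stdio.h>

    #define CFG_PRESENT   0x1   /* seen on the most recent bus scan       */
    #define CFG_RESERVED  0x2   /* vacancy: unit number held for reuse    */
    #define CFG_NOCONFIG  0x4   /* known, but deliberately not configured */

    struct bus_cfg_record {
        uint64_t wwid;     /* worldwide ID, or synthesized for old devices  */
        uint32_t pac;      /* port allocation class (0 if a letter is used) */
        uint16_t unit;     /* assigned class-device unit number             */
        uint16_t flags;
    };

    /* Old SCSI devices have no WW ID; fold the port PAC (or letter) and
       the ID/LUN into one value so their numbering need not change. */
    static uint64_t legacy_wwid(uint32_t pac, unsigned id, unsigned lun)
    {
        return ((uint64_t)pac << 32) | ((uint64_t)id << 8) | lun;
    }

    int main(void)
    {
        struct bus_cfg_record r = {
            .wwid  = legacy_wwid(16000, 12, 3),
            .pac   = 16000,
            .unit  = 1203,              /* unchanged ID*100+LUN numbering */
            .flags = CFG_PRESENT,
        };
        printf("wwid=%016llx unit=%u\n",
               (unsigned long long)r.wwid, r.unit);
        return 0;
    }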