SCSI Naming DRAFT

Problem:

SCSI-3 widens IDs and LUNs to 64 bits each, making the current device naming scheme unworkable. Also, some Alphas can in principle take many more than 26 SCSI controllers, so the controller letter scheme for SCSI port drivers is also broken. A new scheme is needed in which a booting cluster member can determine device names, and in which all members will find the same stable, nonconflicting names.

The current DKdriver incarnation supports only 16 device IDs (on a wide bus) and 8 LUNs per ID, though many disks have no LUNs. Moreover, the entire boot path presumes that a single letter suffices to identify a port. (The boot code generally uses two letters, a la many spreadsheets, when the number of controllers gets over 26.)

As new devices are added to a cluster, the names chosen must be consistent clusterwide lest file structures be corrupted. The naming mechanism must have a way by which names can be consistently chosen. For user simplicity, this mechanism must also be largely automatic and able to work in the face of devices being moved from one connection point to another.

A corollary problem is that a very large Fibre Channel network might in principle have hundreds (or more) of devices, while a particular computer connected to that network might have interest in only a few, and indeed might not have sufficient memory even to hold UCBs for all the disks in such a configuration. In the past, the SCSI bus has been small enough that this has not been an issue. Fibre Channel in principle means there can be an enormous number of devices, and a configuration scheme must not only ensure that all device names are uniform across a cluster, but must be able to do so where not every device is made known to every machine.

Goals:

1. Come up with a way to allow more SCSI ports than current limits.
2. Come up with a way to permit more units on SCSI busses.
3. Ensure that some provision is present for selectively not configuring EVERY device that is findable.
4. Ensure that provision for uniformity across the cluster is well supported and at least mostly automatic.
5. Be able to interface with switching and with servers.
6. Ensure that a cross reference from device name to (worldwide) SCSI device ID exists.
7. Design so the system does not require huge efforts to configure or maintain.

Non-Goals:

1. Devise details of how one sets up names when autoconfiguring a Fibre Channel.
2. Specify all details of how SCSI devices are found on a bus.

Background:

VMS device names are limited to 15 characters by lots of existing code. Current SCSI device names look, in principle, like $12345$DKB1203:, where the allocation class is large enough and where the unit number is computed as SCSI ID * 100 + SCSI LUN (see the sketch below). In this naming, the "$12345$" part is the allocation class, the "DK" part is the class driver name, the "B" part is the port driver letter, and the rest is the ID/LUN number, where SCSI IDs may currently be in the 0-15 range and LUNs in the 0-7 range. Note that the port letter is currently used to identify which port driver is being used, so the expectation is that driver PKB0: will control devices named *DKB*. If some naming commonality is not maintained, another scheme to associate port and class drivers must be found. Since the number of controllers and the number of SCSI devices on SCSI busses are expected to grow beyond 100, and the number of ports that can exist on Alphas can already be in the hundreds, the current naming scheme falls short both in SCSI device count and in number of ports.
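For concreteness, the current unit-number encoding can be stated in a few lines of C. This is a sketch of the arithmetic described above, not actual driver source:

    #include <stdio.h>

    /* Current DK naming: unit number = SCSI ID * 100 + LUN,
       with IDs 0-15 and LUNs 0-7 today. 64-bit SCSI-3 IDs and
       LUNs clearly cannot be packed this way. */
    static unsigned dk_unit(unsigned scsi_id, unsigned lun)
    {
        return scsi_id * 100 + lun;
    }

    int main(void)
    {
        /* ID 12, LUN 3 on controller B with allocation class 12345
           yields the name $12345$DKB1203: */
        printf("$12345$DKB%u:\n", dk_unit(12, 3));
        return 0;
    }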
Discussion:

Number of Port Drivers Limitation:

There are a few possible approaches.

Approach 1

In principle, port drivers for SCSI could simply be assigned different unit numbers rather than different port letters, so there would be a way to load as many as were wanted. Some early tests demonstrated, however, that there are a number of assumptions in the code which make this choice not totally trivial to implement. The major problem is that we have generally loaded each port driver without regard to its type, and port unit numbers would require new code to keep track of type. That is, it would make sense that ten units of PKSdriver should all be connected to a loaded PKSdriver, but the usual way that drivers are loaded does not make it straightforward to have PKC0: connected to PKSdriver, PKC1: connected to PKCdriver, PKC2: connected to PKJdriver, and so on.

Approach 2

As an alternative, port drivers (where the port letter is 'Z' or would need to go above it) can be modified to also have port allocation classes, so that they would be made unique via the port allocation class. By so doing, the paradigm introduced for 7.1 is reused, reducing user confusion, and by leaving port letters alone where there are 25 or fewer ports, many smaller configurations will notice no change. Since in fact the support for SCSI ports in OVMS currently runs out at around 18, this means no change for anyone today. Instead of finding a new controller letter for every port driver, a new allocation class will be needed after the 26th port. This appears to be supported by the driver_load routine as an alternative.

While this will allow more unique port driver names to be added, it must also be noted that this solution is not completely clean. Port letters are generally assigned very early in the boot sequence, and in this case a range of port driver allocation classes could be needed before any cluster activity could be started to check such access. This is not a direct problem, inasmuch as port driver names are not exported to the cluster, but if class driver names generally would match this allocation class to permit simple port-class name association, it can be an issue. The simplest solution in this case is to require that port allocation classes be supplied for all such cases, and to alter the PAC recognition code to check for port drivers by allocation class as well as by letter. A unique range of automatically generated port driver allocation classes would facilitate this without much loss of generality in the use of allocation classes (since 16 bits exist to hold them). Thus port drivers would be given names PKA, PKB, PKC, ... PKY, $16000$PKA, $16001$PKA, $16002$PKA, and so on, as an example (see the sketch below). Because of the folding of letters to A where port letters were Z or larger, naming can be made consistent for high port names, even as port allocation classes fold port letters to A currently. (In practice we would fold the names to A for class drivers.) In this way, problems with port driver unit numbers in INIT_IO_DB can be skirted, and devices can initially be assigned these port allocation classes, with checks to ensure that they do not collide implemented at the time the class devices are created (as is currently done for port allocation classes). Should the port allocation class naming not be available, ports beyond PKZ would simply not be configured. Should ports be added or removed, the port identifiers will change, as has been the case in the past.
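A minimal sketch of that progression, assuming the $16000$ base shown above and 25 plain letters before allocation classes kick in (the base and the cutoff are the example values from the text, not settled choices):

    #include <stdio.h>

    #define AUTO_PAC_BASE 16000u   /* example base from the text */

    /* Ports 0-24 get plain letters PKA..PKY; later ports get an
       automatically generated port allocation class, with the
       controller letter folded to A. */
    static void port_name(unsigned port_index, char *buf, size_t len)
    {
        if (port_index < 25)
            snprintf(buf, len, "PK%c", 'A' + port_index);
        else
            snprintf(buf, len, "$%u$PKA",
                     AUTO_PAC_BASE + (port_index - 25));
    }

    int main(void)
    {
        char name[16];
        for (unsigned i = 23; i < 28; i++) {
            port_name(i, name, sizeof name);
            printf("%s\n", name);  /* PKX PKY $16000$PKA $16001$PKA ... */
        }
        return 0;
    }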
Considering that no current port has an architected identity, attempting to record the order of ports automatically is not feasible. It is however possible that some way to assign port letters or allocation classes "out of order" may become desirable. This port naming approach could be used, but producing the "right" (for the cluster) classes would be a problem when it came time to relate the port and class driver names, and may serve only to introduce confusion.

Approach 3

It is possible for port drivers to use two letters for the port identity, since there would be only one port driver per port and thus all could be unit zero. If class units reflected this naming, class device names would be unacceptably constrained to 3 digits for the unit number. But there is no need for class drivers to follow suit; recall that port allocation classes break that correspondence already. If port drivers just use two letters, of course, they could be related to class devices by an algorithm, rather than by having identical names. Since port drivers are local to each system, cross-cluster coordination is not needed. But once class driver unit names are generated, a fixed algorithm becomes hard to obtain.

One is strongly tempted to abandon naming consistency between port and class devices, and this in fact is what I recommend. We will need to force port allocation classes on class units matching these, and thus the "controller letter" would be forced to "A" regardless. This will be simpler in some ways than using allocation classes for port drivers, since it means that the controller letters will not need to somehow match across a cluster. Matching them is very hard, since early in the boot path not all information may be available. The simplest solution is to record the port-to-class association somehow in each node and not even attempt to make port names consistent across a cluster. Where there are shared busses in large configurations, it becomes overwhelmingly likely that port allocation classes will be used for class devices, where a mechanism for coordinating the names exists. Port drivers don't need to be visible across the cluster, so all that should be needed is enough information for SDA and similar applications to find the match.

Of the three solutions here (port driver unit numbers, allocation classes for port driver names, and multi-letter port "letters" together with required port allocation classes for controllers above Z), the multi-letter port letters seem the simplest. In init_io_db, a full longword is available for the port "letter", so the progression A, B, C ... Z, AA, AB, AC, ... AZ, BA, BB, ... ZZ seems feasible (see the sketch below). There are assumptions scattered here and there that the name length is 4 characters, but if the unit number parsing is arranged so that a name like PKAC: is treated as though it had been PKAC0: (with all unit numbers being zero), this can be lived with for the currently foreseeable future. This encoding can handle up to 702 ports (26 one-letter names plus 676 two-letter names), enough for the machines we have and likely enough for some time to come. (By using 3 letters, the number of possibilities grows to 18,278, should the 4-letter name assumptions in init_io_db be relaxed. Enough other assumptions in VMS file handling would need to be updated to handle these numbers of controllers that it seems unlikely to be necessary to exceed that for at least the next 5 years.)
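A sketch of the index-to-letters mapping (bijective base-26, giving A..Z and then AA..ZZ); this illustrates the progression only and is not the INIT_IO_DB code:

    #include <stdio.h>

    /* Map port index 0,1,...,701 to A,B,...,Z,AA,...,ZZ
       (bijective base-26). Three letters would extend the same
       scheme to 18,278 names. */
    static void port_letters(unsigned n, char *buf)
    {
        char tmp[4];
        int  i = 0;
        n += 1;                       /* bijective: A is 1, not 0 */
        while (n > 0) {
            n -= 1;
            tmp[i++] = 'A' + (n % 26);
            n /= 26;
        }
        while (i > 0)
            *buf++ = tmp[--i];        /* most significant letter first */
        *buf = '\0';
    }

    int main(void)
    {
        char s[4];
        unsigned samples[] = { 0, 25, 26, 51, 701 };
        for (int i = 0; i < 5; i++) {
            port_letters(samples[i], s);
            printf("%u -> PK%s\n", samples[i], s);  /* A, Z, AA, AZ, ZZ */
        }
        return 0;
    }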
Number of Units Limitation:

Unit numbers currently use a 100 by 100 space (ID*100 + LUN), which is in general inadequate for switched fabrics (which may have many LUNs and only a few IDs).

Approach 1 (not best)

Somehow completely virtualize the names, with a start and end unit number for each node as new SYSGEN parameters and some convention for numbering units on HSZs and similar controllers. Similarly, SYSGEN parameters would be needed for other controllers or for FC loops. In this mode, we have a SYSGEN parameter for intelligent controller nodes' starting and ending device numbers and require that these be adjusted so that the numbering matches clusterwide. This is already a massive problem, since one can imagine 8400-class machines with scores of HSZs. Add other controllers and an intractable problem becomes completely unmanageable. Over and above the difficulties of the plethora of new and obscure SYSGEN parameters, getting the convention to pick the HSZ order always the same could be tricky unless we assume all HSZs are connected and can neither be added nor removed without rebooting the cluster, and even then more external unit number information would seem to be needed to ensure it is stable... possibly a SYSGEN parameter per HSZ, or its equivalent in some configuration data file. Using large numbers of SYSGEN parameters looks infeasible.

A variant approach simply uses a configuration file (replicated if need be, and with names checked across the cluster when new devices are found), in which a worldwide unique ID maps to a name, but the names are completely abstract rather than showing a relation to their ports (a sketch of such a record appears below). The difficulty here is that as new devices are found, some way is needed to prevent two systems from using the same abstract name for different devices. There are ways to do this; for example, a common lock used to regulate who has name choice rights, with the lock value holding the last name. Once a choice is made it should be placed in the config file(s) so that subsequent accesses do not need to choose it, and will know of it. The difference is that we don't try to assign an a priori part of the namespace per cluster member, but negotiate it as the system comes up, using a config file to avoid having to negotiate already-negotiated names.

There are two ways to handle the boot disk name problem. One is to require that the boot disk for each node pre-exist in the configuration file each node can reach directly on its boot disk when the system boots, backed by a parameter which could fill in where this is absent. The other is to assign a unique name to the boot disk which cannot ever conflict with the finally negotiated name, and use that during boot. Once the "real" name becomes available, just rename the class UCB appropriately by changing its DDB name, UCB unit number, DDB allocation class (if need be), and so on. Pointers to the UCB would remain valid, and one can construct an exec logical name to point the early boot name at the new device name so that any residual attempts to open the old name would reach the correct device. This renaming would only be necessary if the configuration file did not have a valid device name. If it had something that looked valid but which (once cluster communications were up) turned out to be in use elsewhere, the node would have to crash itself. (More will be said about cross-cluster checks below.) This is feasible.
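A sketch of what one record of such a fully abstract configuration file might hold; the field names and the text form are purely illustrative, not a committed layout:

    #include <stdint.h>

    /* One device entry in the hypothetical clusterwide config file.
       A text form of the same record might read:
         WWID=0123456789ABCDEF NAME=$57$DKA1234 FLAGS=00000001
    */
    struct abstract_cfg_entry {
        uint64_t wwid;        /* worldwide unique device ID        */
        char     name[16];    /* abstract VMS name, 15 chars + NUL */
        uint32_t owner_sysid; /* SCSSYSTEMID of the naming node    */
        uint32_t flags;       /* e.g. bit 0: name is in use        */
    };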
While it is a disadvantage that this needs the configuration file to have a line for every device in the cluster, it can be done, and where a local config file has a name for a device found by worldwide ID, that name can be used directly.

Approach 2 (not best)

If we use the port letter or allocation class, requiring the latter where the former is ambiguous, we can relate ports to class devices. To handle the large number of device names possible, we will require a configuration file which contains a cross reference of the worldwide IDs of devices and their names. This configuration file will normally be generated automatically by the configuration process in VMS, being built as devices are seen and written automatically once the system is configured. Any configuration file scheme has similar needs about getting things synchronized across the cluster. A scheme that needs a configuration file needs it created in memory early on (presuming one started with none on disk); it would contain the WW ID, the port identification (a letter or an allocation class, with space for the longer of the two, plus the "identity" of the port where it is feasible to have one, so that we are not dependent on device scan order in the future), and the unit number on that port. When the code was going to assign a unit number, it would need to acquire a lock exclusively, assign it, and notify the rest of the cluster of the configuration. This can be implemented in a number of ways, and presuming one starts from a valid configuration, a chosen way must end in a valid configuration. The port name would be available to associate port and class devices and would be needed per node where more than one node could directly access a device. (Once a unique local name is arrived at, often with a port allocation class so that ports above "Z" can be handled, servers will propagate it. Our config database need only be concerned with devices on direct paths.)

Since it is not possible to have common storage always available to all cluster nodes, it will be necessary either to check all name consistency on every boot, or to have a parameter which can be set to indicate that the configuration has changed, so that a total cross check is done when it is set. The default would be for such a flag to be set, so that positive action would be needed to reset it. (This is controversial; maybe it is best to take longer but allow no chance of accidentally corrupting disks.)

A machine entering (or forming) a cluster would first attempt to lock and read the configuration file and find a worldwide ID there (unlocking when done). If it succeeded, it would be able to assign the unit number directly with no arbitration. If it did not, it would need to arbitrate using locks (or other convenient mechanisms as may appear) to select a unit. A lock value block or similar would need to hold the current highest unit number being assigned, so that it would be readily available. (The first thing a new machine should do is read the entire configuration file, so it can also fill in holes, avoiding running out of unit numbers; see the sketch below.) This configuration step would also be skipped if a flag in the config file said to do so. If of course the unit were known, the new machine would instead find the generally used name and set that up. Where a configuration file disagreed in the driver name and had no port allocation class to fix it, it would be necessary for a new node finding such a name to hang and do no work, lest it corrupt a disk.
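A sketch of the hole-filling choice mentioned above, done against the in-memory copy of the configuration; the record layout and names are hypothetical:

    #include <stdint.h>
    #include <stdio.h>

    struct unit_entry {
        uint64_t wwid;   /* worldwide ID bound to this unit */
        uint16_t unit;   /* assigned unit number, 0-9999    */
    };

    /* Pick the lowest unassigned unit number rather than only bumping
       the high-water mark kept in the lock value block, so holes left
       by removed devices are reused and the 0-9999 space is not
       exhausted prematurely. */
    static int next_free_unit(const struct unit_entry *tab, int n)
    {
        for (int unit = 0; unit <= 9999; unit++) {
            int used = 0;
            for (int i = 0; i < n; i++)
                if (tab[i].unit == unit) { used = 1; break; }
            if (!used)
                return unit;
        }
        return -1;   /* unit number space exhausted */
    }

    int main(void)
    {
        struct unit_entry tab[] = { {0x1111, 0}, {0x2222, 1}, {0x3333, 3} };
        printf("next unit: %d\n", next_free_unit(tab, 3));  /* prints 2 */
        return 0;
    }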
Port allocation classes must be mandatory for port letters above "Z". Once the machine had finished setup, it would proceed to write out its new configuration (if it had changed) to its storage site, doing so late enough in system uptime that the full filesystem would be available for locking. The configuration would however remain in memory, and would be reworked any time new devices were added. The config file would be written by every node sensing changes, so that all copies on disk would be maintained. In essence this configuration file would be "shadowed" all over the cluster. The structure of such a file must be such that each node's identity is also present, for most devices may be known on one node only. It must further be set up so that any node can read it, merge in its changes, and write it out again. By setting up the file with the SCSSYSTEMID of each node this could be done. Reading the file and writing it need to be done again once the full system is up, to ensure that no race conditions occur that are not protected against by cluster locking protocols.

A node would thus, once its ports were configured:

* Read the config file, if any, into memory
* Arbitrate with other nodes, if any, to add in all new devices found in its device scans, building a memory structure of all devices in the cluster using its local devices and what it found in the config file
* Lock its config file with a clusterwide lock (which needs to be able to prevent any new machine from reading the database, by whatever means may be convenient)
* Reread its config file (so nothing can have been missed)
* Merge in its changes and rewrite its config file
* Release its lock so others could update their files

When this is done, all nodes will have had a chance to update their names and be synchronized. The arbitration must of course use a SYSAP or the lock manager to communicate name choices. To ensure that common busses are known, the current port allocation class logic will be used, but with the possibility that there might be more than one port letter.

Approach 3 (best)

Remembering that we need only concern ourselves with directly connected devices, we can consider the naming in two steps. It will be necessary to use port allocation classes for shared busses; those are not controversial. We can however also use allocation classes for nonshared busses by arbitrating the classes and keeping a record of what we use in a local configuration file on every node. The configuration file will store the values assigned locally for device names per allocation class. Since the allocation class separates them, we need only cluster-arbitrate for an allocation class (easily done with the lock manager or a SYSAP which needs to be listening) and do not need to arbitrate local bus names. These can be kept in a local config file for devices not on a shared bus, and its naming choice can be definitive. For shared bus devices, the nodes sharing the devices need to check the choice of names. This is proposed because hashing a 64-bit quantity into a unit number in the range 0-9999 cannot reliably produce something unique (see the sketch below), and we will want the device name to be related to its worldwide ID, not some more evanescent value.
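To make the hashing objection concrete, a birthday-bound estimate (a sketch; the 0-9999 space is the one from the text, and a uniform hash is assumed) shows collisions become more likely than not at only about 119 devices:

    #include <stdio.h>

    int main(void)
    {
        /* Probability that n devices hash to n distinct unit numbers
           out of 10000, assuming a uniform hash. */
        double p_all_unique = 1.0;
        for (int n = 1; n <= 10000; n++) {
            p_all_unique *= (10000.0 - (n - 1)) / 10000.0;
            if (p_all_unique < 0.5) {
                printf("collision more likely than not at %d devices\n", n);
                break;   /* fires at n == 119 */
            }
        }
        return 0;
    }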
It would be attractive to find some way to have most device names on shared busses be automatically assigned, but again, if a hash were used and devices whose names don't hash uniquely were arbitrated, even if this could be made reliable, device unit numbers would be semi-randomly scattered over the number space, a situation customers are unlikely to want. Therefore a system is needed which can use the lock manager to select a "naming master" for each shared bus (by acquisition of a lock), and to define a "right to talk to the naming master". These rights would be defined after the (current) arbitration of the port allocation class, so that we know that only names on "this" shared bus need to be unique.

The naming master will initially be the first system to configure the bus. The naming master will read its configuration file (and note that the config files are supposed to all be identical, though each may be a copy of the others). It will then find what WW IDs are on this bus, and will assign device names using its config file to cross reference the names and assigned unit numbers. Where a vacancy has been created, it will note this with a flag added to that record, so that the number can be reused later by a cleanup of the record (manually or by some automated process to be defined). This will account for devices which may be powered down or otherwise temporarily unavailable. Should a device reappear, it will be flagged present again, and in any case its unit number assignment will remain reserved. New devices will also be assigned unit numbers and have records created.

A node will in all cases attempt to acquire the naming master lock and the "communicate with naming master" lock (the latter in a mode to block others too, with a blocking AST, if it acquires the naming master lock). By storing a tag like SCSSYSTEMID in the lock value block, a node can identify when the lock was acquired by itself, and others can sense race conditions. Now when another system comes up and tries to grab the naming master lock, it will find it in use. (Some work with lock values must be done to guard against race conditions.) Thus it will acquire the lock that allows it to communicate with the naming master, thereby notifying the naming master via blocking AST that someone has appeared (and also handling race conditions in case the master has not fully initialized). Then it will receive, and the master will send, a copy of the master's configuration file (which will be in the master's memory by then). This can be written to the slave's configuration file and used to set things up. The one exception is that the slave must ensure that its system disk, if on the same bus, has been named compatibly. This is the one item it may need to pull from its local configuration file prior to the opening of cluster communications. Should the system disk be misnamed, the node so affected must simply hang, and the local configuration file will have to be edited to clear the conflict. A sketch of this election and handoff appears below.

The great advantage of this scheme is that the configuration file information needs to be maintained only per bus (no grand global naming concordance is needed), and only one configuration file is treated as authoritative, recorded, and duplicated. Also, because single-system busses will be distinguished by a port allocation class if a letter cannot be used, only one file will be the naming authority. This provides boot-to-boot name stability with checking.
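A control-flow sketch of the election and handoff. The lock helpers below are hypothetical stand-ins for $ENQ/$DEQ calls with value blocks and blocking ASTs, stubbed here only so the sketch is self-contained; the resource names are illustrative:

    #include <stdio.h>
    #include <stdint.h>

    typedef enum { GRANTED, BUSY } lockstat;

    /* Hypothetical lock-manager wrappers, stubbed for the sketch. */
    static lockstat try_master_lock(const char *bus, uint32_t sysid)
    { (void)bus; (void)sysid; return GRANTED; }     /* EX mode, no-wait */
    static void get_talk_lock(const char *bus, void (*blkast)(void))
    { (void)bus; (void)blkast; }                    /* wait, with blocking AST */

    static void master_blkast(void)
    {
        /* A new node has queued for the talk lock: ship it the
           in-memory configuration for this bus (bulk transfer preferred). */
        printf("master: sending configuration to new node\n");
    }

    static void configure_shared_bus(const char *bus, uint32_t my_sysid)
    {
        if (try_master_lock(bus, my_sysid) == GRANTED) {
            /* First to configure the bus: we are naming master. Hold the
               talk lock too so a newcomer's request fires our blocking
               AST. Read the config file, scan the bus for WW IDs, and
               assign unit numbers, flagging vacancies for later reuse. */
            get_talk_lock(bus, master_blkast);
            printf("naming master for %s (sysid %u)\n", bus,
                   (unsigned)my_sysid);
        } else {
            /* A master exists: queue for the talk lock, receive its
               configuration, write the local copy, then verify that our
               system disk (if on this bus) is named compatibly; if not,
               hang and require manual repair of the local config file. */
            get_talk_lock(bus, 0);
            printf("slave on %s: configuration received\n", bus);
        }
    }

    int main(void)
    {
        configure_shared_bus("SCSI$PKA", 1001);
        return 0;
    }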
It is desirable of course to use a bulk transfer to move the configuration file information, rather than a long handshake with locks, to reduce the time needed. It is conceivable that two configuration files which are badly out of synch might be used, so that you might boot nodes A, B, C, and D in that order in the cluster, then boot E, B, C, and D later with a different master configuration. The names will still be unique, so no corruption of disks will happen, but they will not be stable. This needs to be warned against. However, should the cluster in question ever boot with A and E in the cluster at the same time, the disagreement will automatically be cleared. This situation can only happen if A, B, C, D, and E are all on one shared bus, by the way. In such cases, the more common situation is likely to be that the boot disk is shared and there is only one config file.

Of the schemes presented, the third is far and away the simplest in logic and overhead. It factors out most of the naming conflicts by making all decisions only on directly connected nodes, and only per bus. Note too that the notion of a simple flag in a configuration file preventing a device from being configured still works in this scheme, as config files are still used to provide boot-to-boot naming stability. Thus the third scheme is preferred.

Implementation:

To implement Approach 3, some mods in INIT_IO_DB will be required to handle multi-letter port names (guess a month at most to do this), some locking logic needs to be added, more or less in the same locations now being used for Port Allocation Classes, and some code to force PACs to be used where the bus letter is over 'Z' will be required also. This looks like a couple of months' work, with a considerable part in the research into getting SCS messages transferred to efficiently move data... an area not often used in SCSI-land. (One reason to bite the bullet now on this is that its efficiency in galaxies stands to gain tremendously; requiring full target mode in SCSI could do the job, but could be much less efficient.) (shhhhh!)

To integrate with older devices, the current ID/LUN numbers together with the port PAC or letter can be used in place of the WW IDs; numbering here need not even change, so older systems will see NO alterations to what they have grown accustomed to (see the sketch at the end of this document). Test mode code will need to fake large port letters (perhaps arranging to bump the letters by 20 instead of by 1 count, or something similar, to ensure that the code works for large numbers of ports) and to fake WW IDs. This does not, in prospect, seem overly complex.

Risks:

The use of PAC logic and its prior implementation substantially cuts the risk of this approach. The MSCP server should need no changes at all, and the QIO server does not grow any more complex. Since all names are assigned relative to a SCSI port, the problems of a completely virtualized disk naming scheme envisioned in John Hallyburton's investigations are simply bypassed; if some busses aren't there, it has no effect on any others' namings (so that, for example, if node E above were a large node with many devices, it wouldn't matter so long as the PACs had been set up correctly). The possibility of name instability where clusters are partly booted exists only for shared busses, and in almost every case a scan of the same bus using the same logic will come up with the same results even though done independently.
One would have to go out of one's way to louse up such a configuration, and even then it would self-repair the first time all systems were up on the shared bus together. Configuration files will in general be created automatically and will need manual maintenance only where a system should not configure all the devices it finds. This is currently rare, and should it become the rule later for, say, FC nets, it should suffice to provide a tool when the need arises, and perhaps to default in those cases to device names being known but not used. The fact that we use a naming master in this architecture should also segue neatly into using a name server on an FC net, should that need arise.
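Finally, a sketch of a per-bus configuration record carrying the flags discussed above (presence, reserved vacancy, do-not-configure), and of synthesizing a stand-in WW ID for pre-Fibre-Channel devices from the port PAC (or letter) plus ID/LUN. The field names and the bit packing are illustrative only, not a committed on-disk format:

    #include <stdint.h>
    #include <stdio.h>

    #define CFG_PRESENT   0x1   /* seen on the most recent bus scan       */
    #define CFG_RESERVED  0x2   /* vacancy: unit number held for reuse    */
    #define CFG_NOCONFIG  0x4   /* known, but deliberately not configured */

    struct bus_cfg_record {
        uint64_t wwid;     /* worldwide ID, or synthesized for old devices  */
        uint32_t pac;      /* port allocation class (0 if a letter is used) */
        uint16_t unit;     /* assigned class-device unit number             */
        uint16_t flags;
    };

    /* Old SCSI devices have no WW ID; fold the port PAC (or letter) and
       the ID/LUN into one value so their numbering need not change. */
    static uint64_t legacy_wwid(uint32_t pac, unsigned id, unsigned lun)
    {
        return ((uint64_t)pac << 32) | ((uint64_t)id << 8) | lun;
    }

    int main(void)
    {
        struct bus_cfg_record r = {
            .wwid  = legacy_wwid(16000, 12, 3),
            .pac   = 16000,
            .unit  = 1203,              /* unchanged ID*100+LUN numbering */
            .flags = CFG_PRESENT,
        };
        printf("wwid=%016llx unit=%u\n",
               (unsigned long long)r.wwid, r.unit);
        return 0;
    }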