             <<< EVMS::DOCD$:[NOTES$LIBRARY]SCSI_ARCHITECTURE.NOTE;1 >>>
                            -< SCSI ARCHITECTURE >-
================================================================================
Note 50.2                 SCSI I/O Flow and Control                      2 of 2
EVMS::RLORD                                   545 lines  12-OCT-1995 14:21:29.79
        -< A bit more detail and a few ideas on some of the same topics >-
--------------------------------------------------------------------------------

12 October 1995, 14:10:26

I've added this as a reply to Jim's note because I think it covers (or will
cover when it's done) a lot of the same material.  A lot of what he talks about
I think is implementation specific - for instance, I don't think I can say much
about flow control, except in a very general way, without some idea of what the
design is going to look like.  Yes, it might be useful to have flow control
based on a quota scheme for each level of object (bus, LUN, ID) - but where
that would be implemented I don't know.

In order to better understand what is expected of the new SCSI architecture, I
thought it would be a good idea to describe what the current one does.  I
believe that going to, for instance, mount verification documentation and
experts would not necessarily tell us everything that we need to know about how
the architecture should work with mount verification.  This is even more true
of shadowing, which I don't think is particularly well documented.  I'd like to
describe how things work now and how they might be better implemented in the
next architecture.  Even if all we do is preserve the current functionality, it
can't hurt to have it documented somewhere outside of the driver code itself
for maintenance.

This first pass focuses on DKDRIVER because I think it's the most complex of
the SCSI class drivers - or at least the one with the most undocumented or
poorly documented connections to other parts of the OpenVMS Exec.  Some of the
sections in here are currently just placeholders for brain dumps, too.


1  STALLING I/O OUTSIDE OF THE CLASS DRIVER

There are times when it's necessary or convenient to stall I/O outside the
class driver.  Stalling I/O outside the class driver means that OpenVMS cannot
hand the driver any new work: it can't call the driver's Start I/O routine with
a new IRP.

1.1  Using The OpenVMS Exec Busy Bit

The ability to block new I/O is provided by the UCB$V_BSY bit in the UCB field
UCB$L_STS.  When OpenVMS has an I/O request for a driver which has made it
successfully through the driver's FDT routine, OpenVMS checks the driver's busy
bit to see if it's OK to call the driver's Start I/O routine.  If the bit is
set then the driver is assumed to be busy with another request and the new one
is placed on the UCB$L_IOQFL queue, to be processed by the driver when it's
done with whatever is making it busy.  If the bit is not set, then OpenVMS sets
it - because now the driver really is busy - and calls the driver's Start I/O
routine.

Completing an I/O is a bit more complicated than this brief description, but
essentially when you call IOC_STD$REQCOM in IOSUBNPAG, UCB$V_BSY in UCB$L_STS
is cleared if there is not another IRP to start - that is, if there's nothing
for the routine to remove from UCB$L_IOQFL and send to the driver's Start I/O
routine.  This is the only time it's cleared outside of the driver.
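
Roughly, in C pseudocode (the real code is MACRO-32 in IOSUBNPAG; the structure
layouts and helper names here are just placeholders, not the actual Exec
routines), the busy bit logic looks something like this:

    /* Minimal sketch of the Exec busy-bit behavior described above.  This is
     * NOT the actual IOSUBNPAG code; the types and helpers are placeholders. */

    #include <stddef.h>

    typedef struct irp IRP;

    typedef struct ucb {
        unsigned int sts;           /* stands in for UCB$L_STS           */
        IRP         *ioqfl;         /* stands in for UCB$L_IOQFL         */
    } UCB;

    #define UCB_M_BSY 0x1           /* stands in for UCB$V_BSY           */

    /* Placeholder helpers. */
    void ioq_insert_tail(UCB *ucb, IRP *irp);
    IRP *ioq_remove_head(UCB *ucb);
    void driver_startio(UCB *ucb, IRP *irp);
    void send_to_postprocessing(IRP *irp);

    /* Hand a request that passed FDT processing to the driver. */
    void exe_insioq(UCB *ucb, IRP *irp)
    {
        if (ucb->sts & UCB_M_BSY)
            ioq_insert_tail(ucb, irp);  /* driver busy: park it on the queue */
        else {
            ucb->sts |= UCB_M_BSY;      /* driver is now busy                */
            driver_startio(ucb, irp);   /* call the driver's Start I/O entry */
        }
    }

    /* Request completion, as in IOC_STD$REQCOM. */
    void ioc_reqcom(UCB *ucb, IRP *done_irp)
    {
        IRP *next;

        send_to_postprocessing(done_irp);

        if ((next = ioq_remove_head(ucb)) != NULL)
            driver_startio(ucb, next);  /* busy bit stays set                */
        else
            ucb->sts &= ~UCB_M_BSY;     /* only place the Exec clears it     */
    }
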
Note that UCB$V_ALTBSY and UCB$L_ALTIOWQ are not simply the parallels to
UCB$V_BSY and UCB$L_IOQFL for the driver's Alternate Start I/O routine; these
fields are used by IOSUBNPAG to protect the ordering of I/O requests to the
Alternate Start I/O routine when it's necessary to switch CPUs while queueing a
request to that routine.  The state of UCB$V_ALTBSY means nothing to the
driver, and there is currently no way to block new I/O requests from being
started through this driver entry point.

The model assumed by VMS is that the driver is either busy working on an I/O or
ready to be handed an I/O to work on.  The SCSI disk class driver, however,
runs multi-threaded internally, meaning that at any given time, any given
device (at least those which support command queuing) may have dozens of
requests active on it.  There are times when, for reasons of error recovery or
I/O sequencing, the driver wants to complete an I/O but has enough to work on
without allowing a new I/O to be started.  OpenVMS does not provide DKDRIVER
with that option: if DKDRIVER sends a request off to be post processed
(completed), it had better be ready to handle another request if there's one on
its I/O queue.

1.2  The Busy Bit IRP

The OpenVMS busy bit mechanism is what forced us to implement the Golden IRP,
aka the Busy Bit IRP.  DKDRIVER places this dummy IRP at the head of its I/O
queue before a request is to be completed if it wants to block new I/O from
being started for any reason.  OpenVMS's request completion routine removes the
busy bit IRP from the I/O queue and passes it to the driver's Start I/O
routine.  The busy bit IRP is never (intentionally) sent to I/O post processing
by DKDRIVER, so the busy bit is left set and the driver is free to do whatever
it needs to do until it sends the next real I/O request off to be post
processed.  This allows the driver to complete requests without being forced to
handle any new requests, and preserves the ordering of the IRPs on the I/O
queue.

Is there a better way for us to prevent new I/O from being started when we
complete an I/O?

Why does the driver set UCB$V_BSY prior to RET when processing the busy bit
IRP?  At this point the bit is already set, and there's no chance of it being
cleared due to this IRP being post processed, because this IRP never _is_ post
processed.

1.3  Using DKDRIVER's Class Busy Bit

This is a composite bit which indicates why the Exec busy bit is set.

TBW

2  CANCELLING I/O

It is possible for a given process to cancel all of the I/O it currently has
outstanding on a channel.

Matching (PID and CHAN) I/O which is on the driver's I/O queue when
cancellation is requested is simply completed immediately; the driver never has
to worry about it at all.  I/O requests which are in progress - that is, which
have already been handed to the driver's Start I/O routine - are different:
they can be cancelled only by going through I/O completion, so the goal is to
get the driver to recognize that an I/O (IRP) is now persona non grata and to
complete the I/O request immediately, or as soon as practical.  In a
single-threaded driver a process will only have to wait for a single I/O to be
cancelled - everything else would have been in the I/O queue and so would have
been cancelled immediately - but in the multi-threaded SCSI disk class driver a
process might have dozens of I/O requests active on a single channel, each one
of which must be sent to post processing.
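
What cancelling in-flight I/O boils down to, in both paths described below, is
a matching walk over the driver's list of active requests.  A minimal C sketch
of that idea - the structures and field names are placeholders, not the real
$IRPDEF layout:

    /* Hedged sketch: marking in-flight requests for early completion in a
     * multi-threaded class driver.  Structures and names are illustrative
     * placeholders, not the actual OpenVMS IRP/UCB definitions. */

    #include <stddef.h>

    typedef struct irp {
        struct irp    *next;        /* stands in for the UCB$Q_IRP_LIST link */
        unsigned int   pid;         /* IRP$L_PID  */
        unsigned short chan;        /* IRP$W_CHAN */
        unsigned int   sts;         /* abort/flag bits */
    } IRP;

    #define IRP_M_SRV_ABORT 0x1     /* stands in for IRP$M_SRV_ABORT */

    /* Walk the list of active requests and mark every one issued by the
     * cancelling process on the cancelled channel.  The driver then checks
     * the bit at convenient points in each I/O thread and aborts early. */
    int mark_matching_irps(IRP *active_list, unsigned int pid,
                           unsigned short chan)
    {
        int marked = 0;

        for (IRP *irp = active_list; irp != NULL; irp = irp->next) {
            if (irp->pid == pid && irp->chan == chan) {
                irp->sts |= IRP_M_SRV_ABORT;
                marked++;
            }
        }
        return marked;  /* number of in-flight requests flagged for abort */
    }
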
The granularity of a cancel operation is basically one SCSI command: if an I/O
request is being cancelled, then a SCSI command in progress for it (quite a lot
of I/O requests actually cause several SCSI commands to be executed) won't be
terminated, but the next SCSI command, if any, will not be issued.

When DKDRIVER's cancel routine is invoked, it checks to see whether the I/O
being cancelled is for the MSCP server or not, and takes one of two different
paths based on the result:

MSCP invokes a driver's cancel routine by calling the driver's entry point
directly through DDT$PS_CANCEL_2 with a cancel code of CAN$C_MSCPSERVER and the
address of an IRP which contains the PID and CHAN to be used as criteria for
cancelling I/O.  DKDRIVER will walk through all of the IRPs on its
UCB$Q_IRP_LIST and set the IRP$M_SRV_ABORT bit in every IRP whose channel and
PID match those in the IRP passed in.  At specific points in the processing of
an I/O request, DKDRIVER will check for this bit in the IRP it's working on,
and if it's found to be set will abort the request immediately.

                                     NOTE

    In the MSCP code which invokes DKDRIVER's cancel routine (MSCP.LIS line
    11466, label 1060$) it appears that it is possible for R3, which is
    supposed to be an IRP address, to instead contain the address of
    SCS$GL_MSCP (a DSRV).  This can happen if the UCB's I/O queue is empty
    and the branch at line 11387 is taken before R3 is loaded with a valid
    IRP address.  As long as this is a safely addressable value it's (almost)
    certain to be harmless, because the values at IRP$L_PID and IRP$W_CHAN
    are unlikely to both match any IRP in the queue being scanned.

In the current (V6.2, X61Q) DKDRIVER there appears to be a bug at line 16624,
label 440$, in that it should be doing a MOVL instead of a MOVAB.  As currently
coded, all it's ever going to do is move the address of the IRP to itself;
fortunately, rather than show up as an infinite loop, the very first test will
succeed.  R6 will always and forever be equal to R7 at that point, and so cause
the loop to be terminated.  Right now we're never setting IRP$M_SRV_ABORT.

If the reason code is not CAN$C_MSCPSERVER (there are no checks for other
specific cancel codes), then the OpenVMS routine IOC_STD$CANCELIO is called
with the same arguments as were passed in to the class driver's cancel I/O
routine.  This OpenVMS routine will set the cancel bit in the UCB (UCB$V_CANCEL
in UCB$L_STS) if appropriate, and that bit will (should) be checked by other
parts of the driver, causing the current I/O to be terminated as soon as
convenient.

There are a few problems with this, none of them critical, because cancellation
is not a guaranteed service - but if we're going to support it at all we should
do it as well as we can:

1.  DKDRIVER never actually checks the UCB cancel bit, so as is there's little
    sense in calling IOC_STD$CANCELIO.

2.  The SYS$CANCEL service calls the driver's cancel I/O routine with the IRP
    address which is in UCB$L_IRP (which may be 0).  DKDRIVER, being a
    multi-threaded driver, makes very selective use of this field, basically
    only loading it when it's required by an OpenVMS routine the driver is
    invoking.  For instance, when sending a request off to be post processed
    DKDRIVER will load the IRP address into UCB$L_IRP; when invoking one of
    the error logging routines it will do the same.  At all other times it's
    0, because the driver works on lots of different IRPs and it just doesn't
    need the overhead of keeping it current.
    This makes it hit or miss (actually extremely unlikely) that the IRP
    address it's called with has any meaning at all.  It's almost certain to
    be 0.

3.  It seems that if the IRP address passed in for a non-MSCP cancel does meet
    the criteria for being cancelled, we're missing an opportunity to treat it
    the same way as an MSCP cancel: we should walk UCB$Q_IRP_LIST and mark
    each matching I/O so as to cause its early exit.

It's too bad that OpenVMS's cancel routine doesn't just call the driver with a
PID and the channel and let it be the driver's problem to ferret out any
matching I/O requests.  As is, it's not only almost useless to us, it can (and
has) caused problems in leaving the IO$_DIAGNOSE interface locked up when a
process which got a CHECK CONDITION status terminated after the address of an
IRP from another process had become current.

Since I/O really has to be cancelled from the port up, we should have a port
driver entry point accessible to the class driver which would mark each
matching IRP.  If the SCSI command was in the port queue, a last minute check
before sending it out could be done.  The time gained by cancelling a SCSI
command would probably not be worthwhile for things like INQUIRY, TEST UNIT
READY, START UNIT, etc. - but for huge READs or WRITEs, or for FORMAT commands,
which can be very time consuming, it may be worthwhile to have IRPs marked as
being cancelled issue an ABORT or ABORT TAG message; we wouldn't necessarily
wake those stalled threads up to do it, but we could do it the next time the
device reselected the host.  Of course, each SCSI command would have to be
evaluated to see what effect its cancellation might have on the integrity of
the media or on later SCSI commands; aborting a REWIND on a tape would likely
leave us with position lost, and aborting a FORMAT might leave the media only
partially formatted.

3  WORKING WITH SHDRIVER

This is a summary of what DKDRIVER currently does to work and play well with
shadowing.

UNIT INITIALIZATION:  During unit initialization the current disk UCB address
is compared to SYS$AR_BOOTUCB to see if the unit being initialized is the one
the system is booting from; if so, then the low bit of the global cell
EXE$GL_SHADOW_SYS_DISK is tested - if set, then the driver knows that the
system disk is shadowed.  To test whether or not SHDRIVER unit initialization
of this unit has completed, SYS$AR_BOOTUCB is compared to EXE$GL_SYSUCB - if
they're not equal then it has (my guess is that EXE$GL_SYSUCB is written with
the address of the shadow set UCB instead of with the address of either
physical unit's UCB at that point), and the initialization continues - if not,
then a fork-wait loop is entered until it has completed.

TESTING HOST BASED SHADOWING SUPPORT:  In order to support HBS a drive must
support the optional READ LONG and WRITE LONG commands; these commands allow
the driver to read a block and its associated ECC data and write it back to
the disk corrupted - that is, so that the ECC and data segments of the block
do not agree.  This is necessary to allow all of the members of a shadow set
to be made block-for-block identical.  The test for host based shadowing
support is made in the routine CHECK_HBS, which is invoked only once per boot
by the IO_PACKACK routine.
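
To make the role of READ LONG and WRITE LONG concrete, here's a hedged sketch
of a forced-error sequence along the lines of what FORCE_ERROR (described next)
does.  The CDB layouts are per SCSI-2, but send_scsi_cmd and the transfer
length handling are simplified placeholders, not DKDRIVER code:

    /* Hedged sketch of the READ LONG / WRITE LONG forced-error sequence.
     * CDB layout is per SCSI-2; everything else (send_scsi_cmd, buffer
     * management, the exact data+ECC transfer length) is a simplified
     * placeholder and not DKDRIVER's actual implementation. */

    #include <stdint.h>
    #include <string.h>

    #define SCSI_READ_LONG   0x3E   /* 10-byte CDB, returns data + ECC bytes   */
    #define SCSI_WRITE_LONG  0x3F   /* 10-byte CDB, writes data + ECC verbatim */

    /* Placeholder: issues a CDB, transfers 'len' bytes; returns 0 on success. */
    int send_scsi_cmd(const uint8_t cdb[10], uint8_t *buf, uint32_t len,
                      int write);

    /* Force a "bad block": read the block plus its ECC, flip some data bits
     * without recomputing the ECC, and write the inconsistent image back. */
    int force_error(uint32_t lba, uint8_t *buf, uint16_t xfer_len)
    {
        uint8_t cdb[10];

        memset(cdb, 0, sizeof cdb);
        cdb[0] = SCSI_READ_LONG;
        cdb[2] = (uint8_t)(lba >> 24);      /* logical block address    */
        cdb[3] = (uint8_t)(lba >> 16);
        cdb[4] = (uint8_t)(lba >> 8);
        cdb[5] = (uint8_t)(lba);
        cdb[7] = (uint8_t)(xfer_len >> 8);  /* byte count: data + ECC   */
        cdb[8] = (uint8_t)(xfer_len);

        if (send_scsi_cmd(cdb, buf, xfer_len, 0) != 0)
            return -1;              /* device may not support READ LONG */

        buf[0] ^= 0xFF;             /* corrupt the data, leave ECC alone */

        cdb[0] = SCSI_WRITE_LONG;
        return send_scsi_cmd(cdb, buf, xfer_len, 1);
    }
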
CORRUPTING A BLOCK:  If the I/O function code is IO$_WRITEPBLK, and if the
IO$V_MSCPMODIFS bit is set in IRP$L_FUNC, and if the MSCP$V_MD_ERROR bit is set
in IRP$L_MEDIA+6 (which should be identified as IRP$W_SHD_DEV_TYPE or
IRP$W_SHD_MSCP_DISK_MODIFIER, since both symbols are in $IRPDEF), then the
DKDRIVER routine FORCE_ERROR is called to corrupt the specified block with a
READ LONG, WRITE LONG sequence; this routine will fail with SS$_UNSUPPORTED if
the UCB$V_HBS_CHECK bit in UCB$L_DK_FLAGS is not set, because then it is
unknown whether or not the device supports the optional READ LONG and WRITE
LONG commands.

WRITE BLOCK:  Only processes with the SYSPRV privilege are allowed to write to
HBS members; DK_SHAD_WRITE is an FDT routine which checks to see if the device
is a shadow set member, and if so verifies that the process has SYSPRV before
passing the request on to EXE$WRITEBLK; if not, the request is failed with
SS$_ILLIOFUNC.  This FDT routine is dispatched to on I/O function codes of
IO$_WRITEPBLK, IO$_WRITELBLK, and IO$_WRITEVBLK.

ADD/REMOVE SHADOW SET MEMBER:  To add a SCSI disk to a shadow set or to remove
one from a shadow set, DK_CRESHAD and DK_REMSHAD are included in DKDRIVER.
These are two names for the same routine, which checks the global cell
EXE$GL_HBS_PTR for a negative value (a system address), and if one is found
assumes that it is the address of a shadow dispatcher; it calls this routine
with the CCB, UCB, IRP and PCB addresses.  If the global cell contains a value
greater than or equal to 0, the request is failed with SS$_ILLIOFUNC.  This
two-named FDT routine is dispatched to on I/O function codes of IO$_REMSHAD or
IO$_CRESHAD.

Presently the only thing I have to say about how DKDRIVER considers shadowing
and mount verification together is that in DK_BEGIN_MNTVER, if the IRP is found
to be a shadow IRP (IRP$V_SHDIO in IRP$L_STS2 is set), then the request is not
processed as described below but instead is just passed to IOC_STD$POST_IRP.
There's more to come on this subject though.

4  LOAD BALANCING

When you have a lot of I/O going to one or more fast devices it can dominate
the SCSI bus and starve out slower devices, more easily so if the slower
devices are at lower IDs.  We've seen this before, with RZ29s starving out all
of the devices at lower IDs and causing hundreds of errors to be logged for
those devices.  This is what initiated a change in DKDRIVER's error logging
scheme, where rather than log an error we now convert certain port status
codes to completion codes which we know will invoke mount verification and
cause the I/O to be retried.  Failing a port operation all the way back to
mount verification should obviously be avoided if at all possible.  Because
this is actually a bus arbitration problem, it may not be as bad for SCSI-3,
where even though faster devices will likely be arbitrating for the bus more
frequently, they will not automatically win when competing with lower-ID
devices.

Implementing a quota-based flow control scheme on the driver side won't
necessarily help with load balancing, because even if we're choking I/O to a
device down to 1 at a time, the device may complete requests so quickly that
the driver spends all of its time maintaining a constant depth of 1 to that
device.
This depends on the algorithm that we use to feed I/O to a device, of course,
but if completing an I/O drops a quota and causes us to immediately issue
another I/O to the same device, we may never (or only rarely) get past that
device.  This has been supported (I wouldn't call it rigorously proved) by
experiments where we've dropped the queue depth of a fast device with no
improvement in throughput or reduction in errors on lower-ID devices.  The
problem is that once an initiator has issued a SCSI command to a target - even
just one command - there's absolutely nothing it can do to prevent the target
from reselecting the initiator and trying to complete the command (or the next
segment of it, as in a large data transfer); and it will always win when
competing with a lower-ID device.

I think that load balancing requires the introduction of fast device idle
periods: periods where, for at least several arbitration delays (2.4 us), the
faster devices on the bus have zero SCSI commands outstanding on them; this
would give slower devices on the bus a chance to arbitrate competing only with
each other.  These idle periods should only be introduced as needed, though,
or else we take away the benefit of having a fast device.

The current implementation contains a routine which scans those SCDRPs
representing commands actually issued to devices so as to anticipate
disconnect timeouts; this is a good thing to keep.  The question is what to do
when commands are at risk of timing out.  The high ID choke, which basically
just set the highest ID to which the port driver would issue SCSI commands to
that of the lowest-ID device with commands at risk, was an attempt to correct
for these anticipated timeouts.  The idea was that it would allow higher-ID
devices to complete I/O they had already been issued and leave the lower-ID
device with at-risk commands nothing to compete against in trying to reselect
the initiator to complete those commands.  Whenever a command was found to be
at risk of timing out it was marked as such, and the port driver, recognizing
that corrective action was now being taken, gave it one more TQE scan period
to complete.  The assumption is that when an I/O is at risk the command is
really done, and the device just can't get the bus to let the initiator know
it; at that point all we want to do is drain outstanding I/O from the higher
IDs.

The performance of some I/O devices (at least the faster disks), meaning the
speed at which they can complete a SCSI command, is closer now to significant
bus periods than it was when SCSI first appeared; the delay between their
receiving a SCSI command and their completing it (or generating bus activity
on its behalf) is much smaller - a single device can now probably generate as
high a rate of bus activity as was formerly generated by several devices.  Add
that to the delta between these fast devices and older, slower devices, and I
don't believe that even putting the slower devices at higher IDs will solve
the problem.  So some adapter level work is needed to recognize and resolve
these problems, or customers won't realize the full benefit of having a fast
disk.
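
Going back to the TQE scan and the high ID choke, here's a rough C sketch of
that scan-and-choke idea.  The structures, names and timing are illustrative
only; this is not the actual port or class driver code:

    /* Hedged sketch of the at-risk scan / high ID choke idea described
     * above.  Structures, field names and timing values are placeholders. */

    #include <stdbool.h>
    #include <stddef.h>

    #define SCSI_MAX_ID 16          /* wide bus; 8 on a narrow bus */

    typedef struct scdrp {          /* stands in for one issued command */
        struct scdrp *next;
        int           target_id;    /* SCSI ID the command was issued to */
        int           ticks_left;   /* TQE scan periods until disconnect
                                       timeout */
        bool          at_risk;      /* already marked on a previous scan */
    } SCDRP;

    typedef struct port {
        SCDRP *issued_cmds;         /* commands outstanding on devices */
        int    max_issue_id;        /* highest ID we will currently issue to */
    } PORT;

    /* Run once per TQE scan period: mark commands close to timing out, give
     * them one more scan period, and choke new command issue down to the
     * lowest at-risk ID so higher-ID devices simply drain. */
    void tqe_scan(PORT *port, int risk_threshold_ticks)
    {
        int lowest_at_risk = SCSI_MAX_ID;

        for (SCDRP *cmd = port->issued_cmds; cmd != NULL; cmd = cmd->next) {
            if (!cmd->at_risk && cmd->ticks_left <= risk_threshold_ticks) {
                cmd->at_risk = true;
                cmd->ticks_left += 1;           /* one more scan period */
            }
            if (cmd->at_risk && cmd->target_id < lowest_at_risk)
                lowest_at_risk = cmd->target_id;
        }

        /* New SCSI commands are only started for IDs <= max_issue_id; once
         * the at-risk commands complete this would be raised back up. */
        if (lowest_at_risk < SCSI_MAX_ID)
            port->max_issue_id = lowest_at_risk;
    }
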
This is a bit more complicated in a SCSI cluster with two or three hosts on
the bus.  As far as each system is concerned, it's got its own SCSI bus which
contains its own SCSI adapter and just the devices it knows about.  In real
terms, however, each initiator and each device is competing for the bus with
every other device on the bus, and a fast device controlled by host B can
starve devices at IDs that host B doesn't even know about.  In a situation
like this it may be necessary to provide support for the SEND command between
initiators, so that when one processor detects that a device it owns at ID 3
has commands about to time out it can tell the other processors on the bus to
stall I/O to devices at IDs above 3; but then it has to tell them when the
risk has passed, too.  If processors knew which devices were owned by which
other processors, they could select just the processor which controlled
devices above their at-risk ID.  That implies that each processor would have
to make sure to notify other processors as its configuration changed.

The mechanism for this, the SEND command with the AEN bit set, is pretty well
defined by the SCSI standard.  I haven't thought at all about where in the
SCSI architecture this maintenance of knowledge of other processors' SCSI
configurations would be implemented, or how it might be referenced outside of
the driver.  It's pretty clear that AEN SEND commands would be targeted
towards whatever entity maintained such knowledge; non-AEN SEND commands could
also be used, but if it's possible that the SCSI bus will ever be used as a
medium for cluster traffic then something would have to be done to
differentiate data intended for the SCSI subsystem from that intended for some
higher level entity.  This could be done either by encapsulating the data like
a network packet or by reserving some non-0 LUN for such data.  In either
case, since it's currently possible to issue SEND commands through the
IO$_DIAGNOSE interface, it would have to be able to coexist with
user-generated SEND data.

5  SPECIAL CONSIDERATION FOR MSCP AND CLUSTER I/O

SET_UNIT_ONLINE, notifies SCS$GL_MSCP_NEWDEV ...

IOAVAILABLE, invoked with IO$_AVAILABLE ...

IOUNLOAD, invoked with IO$_UNLOAD ...

IONOP, invoked with IO$_NOP ...

6  MOUNT VERIFICATION

Technically, returning any other-than-success status in R0 when an I/O
completes can cause mount verification - if it's a disk or tape device then
post processing code simply calls EXE$MOUNT_VER with the status and allows
that routine to decide whether or not mount verification is appropriate.
DKDRIVER should not know or care what might cause mount verification and
should instead try to return meaningful status codes which indicate the nature
of an error - in practice, however, it knows that certain status codes will
invoke mount verification and some will not, and may use this knowledge to
force or inhibit it in some cases.

The address of the routine DK_MOUNT_VER is stored in DKDRIVER's DDT.  This
routine is invoked by EXE$MOUNT_VER when an I/O which is completing meets
whatever criteria it decides is appropriate for the device type.  DK_MOUNT_VER
simply invokes either DK_BEGIN_MNTVER or DK_END_MNTVER, depending on whether
the value passed to it in R3 is non-0 or 0, respectively.
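
A minimal sketch of that dispatch, in C rather than the driver's MACRO-32,
with the R3 argument modeled as an ordinary parameter (the signatures here are
assumptions, not the real DDT interface):

    /* Hedged sketch of the DK_MOUNT_VER dispatch described above.  The real
     * routine is reached through the DDT and takes its argument in R3; the
     * C signature and parameter names are illustrative assumptions. */

    typedef struct irp IRP;
    typedef struct ucb UCB;

    void dk_begin_mntver(UCB *ucb, IRP *irp);   /* placeholder prototypes */
    void dk_end_mntver(UCB *ucb);

    /* EXE$MOUNT_VER calls this per-driver entry; a non-zero "r3" value means
     * mount verification is starting for the unit, zero means it is ending. */
    void dk_mount_ver(UCB *ucb, IRP *irp, unsigned long r3)
    {
        if (r3 != 0)
            dk_begin_mntver(ucb, irp);  /* begin: handle the mount-ver IRP */
        else
            dk_end_mntver(ucb);         /* end: resume normal operation */
    }
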
DK_BEGIN_MNTVER:  This routine is called by EXE_STD$MOUNT_VER with the address
of an IRP which was allocated just for this purpose, and is so indicated by
the IRP$V_MVIRP bit in IRP$L_STS being set.  The mount verification in
progress bit in the UCB has been set at this point.  When processing an IRP,
if it is found that the device is a served device - that is, if the host is
serving the device to other nodes - then SCS$DISKMSCPDRIVERMV is invoked to
inform the server that the device is being mount verified; this is determined
by checking the DEV$V_SRV bit in UCB$L_DEVCHAR2.

DK_END_MNTVER:  TBW

There are checks for various mount verification related bits scattered
throughout DKDRIVER: UCB$V_MNTVERIP in UCB$L_STS, UCB$V_CBMNTVERIP in
UCB$L_CLASSBUSY, IRP$V_MVIRP in IRP$L_STS, and both IRP$V_SHDIO and
IRP$V_SRVIO in IRP$L_STS2, the last two of which will inhibit the start of
mount verification for the IRP.  I can't explain them all yet.

The approach I'd like to take to understanding mount verification is to assume
the following situation:

1.  There are 13 active, disconnected SCSI commands on a device, each
    described by an SCDRP on the device queue

2.  There are 6 pending requests on the device queue

3.  There are 19 pending requests on the class driver's UCB$L_IOQFL

Now, if a request completes with an error which invokes mount verification,
say a bus reset, what happens or should happen to each of these commands and
requests?

7  SINGLE-THREADED VS MULTI-THREADED OPERATION

What is meant by single-threaded and multi-threaded?

Why must some sequences be performed single-threaded?

8  QUEUE TAG TYPES

What determines the type of tag associated with a given SCSI command?

How are tagging and single- or multi-threaded operation related?
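
For reference, SCSI-2 defines three queue tag messages, and the choice among
them is one obvious input to these questions.  The message codes below are
from the standard; the selection policy is purely illustrative and is not a
statement of what DKDRIVER or the port drivers actually do:

    /* SCSI-2 queue tag message codes (values per the standard); the
     * selection policy below is an illustration only, not a description of
     * DKDRIVER's or any port driver's behavior. */

    #define MSG_SIMPLE_QUEUE_TAG   0x20  /* device may reorder freely        */
    #define MSG_HEAD_OF_QUEUE_TAG  0x21  /* execute before other queued cmds */
    #define MSG_ORDERED_QUEUE_TAG  0x22  /* execute in the order received    */

    typedef enum {
        CMD_NORMAL_RW,      /* ordinary READ/WRITE that can be reordered     */
        CMD_BARRIER,        /* ordering matters, e.g. during error recovery
                               or a single-threaded sequence                 */
        CMD_URGENT          /* something we want serviced immediately        */
    } cmd_class_t;

    /* One plausible mapping from command class to tag type. */
    int choose_queue_tag(cmd_class_t cls)
    {
        switch (cls) {
        case CMD_URGENT:  return MSG_HEAD_OF_QUEUE_TAG;
        case CMD_BARRIER: return MSG_ORDERED_QUEUE_TAG;
        default:          return MSG_SIMPLE_QUEUE_TAG;
        }
    }
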