            <<< EVMS::DOCD$:[NOTES$LIBRARY]SCSI_ARCHITECTURE.NOTE;1 >>>
                             -< SCSI ARCHITECTURE >-
================================================================================
Note 50.1                 SCSI I/O Flow and Control                       1 of 1
STAR::DUNHAM "Jim Dunham"                    470 lines  26-SEP-1995 08:47:54.53
               -< SCSI I/O Flow and Control - (with change bars) >-
--------------------------------------------------------------------------------

Abstract:
=========

This paper discusses I/O flow throughout the OpenVMS I/O subsystem as it
relates to SCSI I/O. It covers previous, current and future directions in
what was, is and could be done to address I/O flow control and its
relationship with resource utilization in the EXEC I/O.

Overview of tagged command queuing
==================================

OpenVMS Alpha SCSI device drivers support the tagged command queuing
architecture of the SCSI-2 standard. Tagged command queuing allows a class
driver to pass multiple queued I/O requests directly to a port driver
without waiting for any one I/O request to complete. The SCSI-2 standard
states that tagged queuing allows a target (a SCSI disk or RAID subsystem)
to accept multiple I/O processes (up to 255 SCSI commands) from each
initiator (a SCSI adapter on an AlphaGeneration system) to each logical
unit (a disk or RAID set).

OpenVMS Alpha SCSI device drivers also support the SCSI-1 standard for
non-queuing SCSI devices and non-queuing SCSI adapters, and as a model for
contingent allegiance processing. This is the non-tagged command queuing
model.

Non-Tagged Command Queuing OpenVMS I/O Model
============================================

The OpenVMS SCSI non-tagged command queuing I/O subsystem consists of a
single-threaded I/O processing queue, driven by I/O requests issued to the
SCSI class driver. This allows only a single I/O request to be processed
at a time by the associated SCSI device.

If the SCSI class driver is idle (UCB$V_BSY not set), then UCB$V_BSY is
set and the current I/O request (IRP) is sent directly to the class
driver's start-I/O routine. These actions cause I/O processing to commence
on the device associated with the class driver. If the class driver is
busy (UCB$V_BSY set), then the current I/O request is placed on the I/O
pending queue of the associated class driver UCB.

Each IRP is processed one at a time from "start-I/O" to "request
complete". During "request complete" processing, if the pending I/O queue
is not empty, then the next IRP is dequeued and again sent to the driver's
start-I/O routine. If the pending queue is empty, UCB$V_BSY is cleared and
the class driver is now idle.

The disk class SCSI device driver (DKDRIVER) performs an "elevator" seek
optimization at the end of each I/O request (just before "request
complete" for the current I/O). Queue reordering is done on reads and
writes such that the LBN of the next I/O to execute is numerically closest
to the LBN of the last I/O, in the current direction of head movement.
DKDRIVER will only look ahead 4 I/Os in its pending queue, thus limiting
the maximum realized benefit.

While this approach gains some amount of performance by helping to
optimize seek latency, much greater benefit may be realized by sending all
outstanding I/O requests to the device. This functionality is called
tagged command queuing (TCQ). It allows the SCSI device to alter the
queuing of I/Os and perform device-specific latency optimization.
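To make the 4-deep lookahead concrete, here is a minimal user-mode C
sketch of one plausible reading of the elevator reordering described
above. The IRP structure, its fields, and pick_next are simplified
placeholders for illustration, not the actual DKDRIVER code:

    /* Sketch of DKDRIVER-style 4-deep "elevator" reordering.
     * Simplified stand-in structures, not the real OpenVMS IRP/UCB. */
    #include <stddef.h>

    #define LOOKAHEAD 4            /* examine at most 4 queued I/Os */

    typedef struct irp {
        struct irp  *next;         /* pending queue link (simplified) */
        unsigned int lbn;          /* starting LBN of this transfer   */
    } IRP;

    /* Pick the next IRP from the first LOOKAHEAD entries: prefer the
     * LBN closest to last_lbn in the current direction of head
     * movement; fall back to the closest overall if none qualifies. */
    IRP *pick_next(IRP *head, unsigned int last_lbn, int moving_up)
    {
        IRP *best = NULL, *best_any = NULL;
        unsigned int best_dist = ~0u, best_any_dist = ~0u;
        int depth = 0;

        for (IRP *p = head; p != NULL && depth < LOOKAHEAD;
             p = p->next, depth++) {
            unsigned int dist = (p->lbn > last_lbn) ? p->lbn - last_lbn
                                                    : last_lbn - p->lbn;
            int in_dir = moving_up ? (p->lbn >= last_lbn)
                                   : (p->lbn <= last_lbn);

            if (dist < best_any_dist) { best_any = p; best_any_dist = dist; }
            if (in_dir && dist < best_dist) { best = p; best_dist = dist; }
        }
        return best != NULL ? best : best_any;  /* reverse if needed */
    }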
An additional benefit is gained by eliminating the latency associated with
passing each request through the port/class device driver layers one at a
time, prior to starting the next I/O.

Tagged Queued I/O Model
=======================

The OpenVMS SCSI tagged command queuing I/O subsystem consists of a
multi-threaded I/O processing queue, driven by I/O requests issued to the
SCSI class driver. This allows multiple I/Os to be queued concurrently to
the associated SCSI device.

If the SCSI class driver is idle (UCB$V_BSY not set), then UCB$V_BSY is
set and the current I/O request (IRP) is sent directly to the class
driver's start-I/O routine. These actions cause I/O processing to commence
on the device associated with the class driver. If the class driver is
busy (UCB$V_BSY set), then the current I/O request is placed on the I/O
pending queue of the associated class driver UCB.

As each IRP starts processing, if there are no outstanding reasons to
stall (resource contention, flow control, etc.), UCB$V_BSY is cleared. If
the pending I/O queue is not empty, then the next IRP is dequeued and
again sent to the driver's start-I/O routine. This loop of dequeuing IRPs
continues until one or more reasons exist for which it would not be
advantageous to start the next IRP. At "request complete" processing, or
when an outstanding reason no longer exists, an attempt is again made to
restart queued I/O.

As each concurrent IRP is processed, the class and port drivers
collectively process and initiate each SCSI I/O request, tagging each one
and sending it directly to the target device for processing, without
waiting for the I/O to complete. These SCSI I/O requests are then queued
on the SCSI device's internal command queue for processing. The SCSI
device can then perform optimization based on a priori knowledge of its
hardware capabilities, dynamic positioning and latency schedules, spindle
configuration, etc., thereby gaining performance benefits over single
I/Os.

In addition, I/Os pending in the device queue may begin immediately upon
completion of the current I/O, without waiting for the current I/O and its
status to be returned back up through the port and class driver levels.
This yields a significant performance gain, since I/O completion
processing will be performed at the same time the SCSI device is working
on its internally queued I/Os.

When the SCSI I/O request completes, its results, status and tag are
returned through the SCSI port and class drivers, which complete the I/O
request.

Calling Driver's STARTIO Routine
================================

To initiate I/O to a device driver, EXE$QIODRVPKT is invoked, which in
turn calls the routine EXE$INSIOQ. EXE$INSIOQ carries out the following
actions, which may insert the I/O in the device's unit queue (UCB$L_IOQFL)
or call the driver at its STARTIO routine (see the sketch following this
list):

o acquires the device's fork lock (UCB$B_FLCK)

o increments the device's queue length (UCB$L_QLEN)

o if the device is not busy (UCB$V_BSY clear), sets busy, then calls
  IOC$INITIATE, which calls the device driver's STARTIO entry point

o if the device is busy (UCB$V_BSY set), calls EXE$INSERT_IRP to insert
  the current IRP into the device's I/O queue, in priority order

o releases the device's fork lock (UCB$B_FLCK)
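A minimal C sketch of the EXE$INSIOQ decision listed above follows. The
UCB structure and the lock/queue/initiate primitives are illustrative
placeholder names, not the real OpenVMS routines or data structures:

    #include <stdbool.h>

    typedef struct irp IRP;
    typedef struct ucb {
        bool bsy;              /* stand-in for UCB$V_BSY  */
        int  qlen;             /* stand-in for UCB$L_QLEN */
    } UCB;

    /* Placeholders for the fork lock, priority-ordered queue
     * insertion (EXE$INSERT_IRP), and IOC$INITIATE. */
    void forklock_acquire(UCB *ucb);
    void forklock_release(UCB *ucb);
    void insert_irp_by_priority(UCB *ucb, IRP *irp);
    void initiate_startio(UCB *ucb, IRP *irp);

    /* Mirror of the EXE$INSIOQ steps in the list above. */
    void insioq(UCB *ucb, IRP *irp)
    {
        forklock_acquire(ucb);      /* UCB$B_FLCK */
        ucb->qlen++;                /* UCB$L_QLEN */
        if (!ucb->bsy) {
            ucb->bsy = true;        /* device idle: start this IRP now */
            initiate_startio(ucb, irp);
        } else {
            insert_irp_by_priority(ucb, irp);  /* device busy: queue */
        }
        forklock_release(ucb);      /* UCB$B_FLCK */
    }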
Single threaded class driver, STARTIO routine
=============================================

In the single-threaded class drivers (GKDRIVER and MKDRIVER), each IRP is
processed one at a time through STARTIO processing, invoking
function-specific device processing, and completing I/O processing by
calling REQCOM. During the time the I/O processing is active (STARTIO ->
REQCOM), UCB$V_BSY is set, which causes any new I/Os to be queued to the
device's unit queue (UCB$L_IOQFL). REQCOM may start the next I/O, if one
was queued (see REQCOM processing).

Multi-threaded class driver, STARTIO routine
============================================

The multi-threaded class driver's (DKDRIVER) STARTIO routine is
implemented as a loop, where IRPs are processed concurrently until either
there are no more IRPs to process, or an internal event (resource
contention, flow control, etc.) exists which causes the device driver to
be busy (see class busy bits).

Within STARTIO, when the next IRP (if one exists) is dequeued and there
are no outstanding events to prevent starting it, a KPB thread is created
and associated with this IRP. This KPB thread performs the IRP's
function-specific processing until such time as the I/O completes (see
REQCOM processing below). When the KPB thread stalls for the first time,
the class driver's STARTIO loop is resumed, which attempts to dequeue the
next IRP for processing.

The STARTIO routine exits if internal events exist that prevent further
I/O processing, by placing the IRP back at the front of the unit queue and
setting the device's busy flag. The STARTIO routine also exits when there
are no more IRPs to process, but with the device's busy flag clear.

EXE$KP_STARTIO Startup
======================

During the transition of an I/O from single-threaded IRP processing (via
the device driver's STARTIO routine) to multi-threaded IRP processing (via
a call to EXE$KP_STARTIO), there is a possibility that the resources
needed to start the KPB may not be available. This fact needs to be
reflected in the class driver's class busy bit processing. Specific
routine callbacks were added to EXE$KP_STARTIO that are invoked when the
resources are not available, and again when they become available.

Completion of KP_STARTIO processing associates a KPB (kernel process
block) and its associated stack with each active IRP. Once the KPB thread
becomes active, the class driver allocates off the KPB stack a SCSI class
driver request packet (SCDRP), based on information contained in the I/O
request packet (IRP). Additional processing is performed on the SCDRP, and
it is then passed from the SCSI class driver to the SCSI port driver for
processing. The SCDRP contains the addresses of the class driver's unit
control block (UCB), SCSI connection descriptor table (SCDT), I/O request
packet (IRP), and critical IRP data. Once the port driver queues the I/O
for processing, the class driver KPB thread is suspended.
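The following C sketch pulls together the multi-threaded STARTIO loop and
KP thread startup described above. UCB, IRP, and every routine name here
are simplified placeholders, not the actual DKDRIVER or EXEC definitions:

    #include <stddef.h>
    #include <stdbool.h>

    typedef struct irp IRP;
    typedef struct ucb {
        bool     bsy;            /* stand-in for UCB$V_BSY        */
        unsigned class_busy;     /* stand-in for UCB$L_CLASS_BUSY */
    } UCB;

    IRP  *dequeue_next(UCB *ucb);            /* from UCB$L_IOQFL  */
    void  requeue_front(UCB *ucb, IRP *irp); /* back at queue head */
    void  kp_startio(UCB *ucb, IRP *irp);    /* stand-in for
                                                EXE$KP_STARTIO: makes a
                                                KP thread for the IRP and
                                                returns when the thread
                                                first stalls */

    void startio(UCB *ucb, IRP *irp)
    {
        while (irp != NULL) {
            if (ucb->class_busy != 0) {
                /* Internal event: put the IRP back at the head of
                 * the unit queue and leave the device marked busy. */
                requeue_front(ucb, irp);
                ucb->bsy = true;
                return;
            }
            ucb->bsy = false;    /* nothing prevents concurrent I/O */
            kp_startio(ucb, irp);/* runs until the thread stalls    */
            irp = dequeue_next(ucb);  /* try to start the next IRP  */
        }
        /* No more IRPs: exit with the busy flag clear. */
    }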
Driver's Request Complete (REQCOM) processing
=============================================

To complete device driver I/O processing, the driver calls the routine
IOC$REQCOM, which carries out the following actions:

o If error, call Mount Verification (clearing pending and in progress).
  Depending on the specific error, Mount Verification may be started

o Insert IRP in "correct" post-processing queue

o If Mount Verification in progress, determine if entry is already in
  queue. If not, start Mount Verification

o If Mount Verification not in progress, attempt to dequeue next entry in
  unit queue (UCB$L_IOQFL)

o If queue is empty, clear busy and exit

o If entry dequeued, call IOC$INITIATE (see above)

Block unwanted I/O Initiation from REQCOM
=========================================

The VMS I/O executive assumes that all drivers are single-threaded
drivers, and that a call to REQCOM means the driver has just completed its
one and only outstanding unit of work, and is therefore capable of having
new I/O initiated to it.

For a SCSI multi-threaded class driver, this presents a problem in that
internally the driver is considered busy, but externally (based solely on
I/O completion and UCB$L_IOQFL) new I/O may be started and/or the
UCB$V_BSY bit will get cleared, which may allow future I/O, or worse,
Mount Verification I/O, to be started. Prior to returning control of I/O
processing to REQCOM, the class busy bits are tested to determine if there
is any reason why this driver does not want UCB$V_BSY cleared. If so, the
"golden" IRP is checked to see whether it is in the UCB$L_IOQFL queue, and
it is placed there if not.

Draining I/O for single threaded I/O
====================================

Within a multi-threaded class driver, there exist I/O functions which
require a level of sequentiality (IO$_NOP, IO$_UNLOAD, IO$_AVAIL). To
implement this functionality, the class driver must transition from a
multi-threaded to a single-threaded I/O processor to perform the specific
operation.

The transition from multi-threaded to single-threaded operation is done as
follows (see the sketch after this list):

o Set a class busy bit; this will prevent any new I/Os from being started.

o If the current I/O is the only I/O active, return, as we are
  single-threaded. (Note: After the current I/O is complete, the class
  busy bit will be cleared and multi-threaded I/O will continue.)

o Since there are more I/Os active than this one, place this I/O on the
  drain queue, and stall waiting for I/Os to complete.

o As each active I/O is completed, the following sequence of checks is
  made:
  - Is a class busy bit still set?
  - Are the only I/Os left the draining I/Os?
  - Are there any I/Os on the drain queue?
  - If so, restart the threads associated with the draining I/Os.

  Note: After the draining I/Os are complete, the class busy bit will be
  cleared and multi-threaded I/O will continue.
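A C sketch of this drain transition, under the assumption of a
hypothetical CB_DRAIN class busy bit; all names and primitives below are
illustrative, not the real driver interfaces:

    #include <stdbool.h>

    typedef struct irp IRP;
    typedef struct ucb UCB;

    #define CB_DRAIN 0x1                 /* hypothetical busy bit     */

    void set_class_busy(UCB *ucb, unsigned bit);
    bool class_busy_set(UCB *ucb, unsigned bit);
    int  active_io_count(UCB *ucb);
    int  drain_queue_len(UCB *ucb);
    void enqueue_drain(UCB *ucb, IRP *irp);
    void kp_stall(IRP *irp);             /* suspend this KP thread    */
    void resume_drain_threads(UCB *ucb);

    /* Transition from multi-threaded to single-threaded operation
     * for a sequential function (IO$_NOP, IO$_UNLOAD, IO$_AVAIL). */
    void drain_to_single_thread(UCB *ucb, IRP *irp)
    {
        set_class_busy(ucb, CB_DRAIN);   /* block any new I/O         */

        if (active_io_count(ucb) == 1)
            return;                      /* already single-threaded   */

        enqueue_drain(ucb, irp);         /* park this request...      */
        kp_stall(irp);                   /* ...until the rest finish  */
    }

    /* Checks made as each active I/O completes. */
    void drain_check(UCB *ucb)
    {
        if (class_busy_set(ucb, CB_DRAIN) &&       /* still draining? */
            active_io_count(ucb) == drain_queue_len(ucb) &&
            drain_queue_len(ucb) > 0)
            resume_drain_threads(ucb);   /* run the drained I/Os      */
    }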
Cancel I/O
==========

On behalf of layered subsystems above DKDRIVER (MSCP, SHADOWING, etc.),
there is a need to cancel an I/O in progress. The function IO$_CANCEL
locates the associated I/O request and, if found, 'marks' the SCDRP with a
cancel flag. At various points throughout the SCSI class driver, this bit
is tested, and if set, the I/O is canceled with a SS$_CANCEL status.

Mount Verification of SCSI devices (disks and tapes)
====================================================

Mount verification is a sequence of events which, if successful, should
restore 'normal' I/O processing. Within the SCSI subsystem, various
anomalies in I/O processing (bus resets, timeouts, bus errors, etc.)
disrupt the current I/O (and possibly all outstanding I/Os), such that a
mechanism is needed to methodically and reliably restore I/O processing. A
golden rule is that if the SCSI I/O failure is transient in nature, it
should be retried after successful completion of Mount Verification.

NOTE:
    Due to the variety of SCSI port driver implementations and the lack of
    overall SCSI port driver knowledge by any single individual,
    implementation of a tiered approach to resolving I/O anomalies was
    considered a deliverable of unknown complexity for SCSI-2. Also, if an
    individual 'tier' were to fail to restore normal I/O processing, the
    transition from tier to tier would ultimately end at a 'last chance'
    tier, which must at all costs resolve the problem. This is Mount
    Verification. At this point in time, individual tiers could be added
    at key points in the SCSI subsystem to address specific failures.
    Typically these points are the places marked by the IF_CANCEL macro,
    and would follow directly from implementing IO$_CANCEL atomically,
    versus the deferred method used today via the IF_CANCEL macro
    functionality. The risk in doing this is bounded, since there is now a
    'last chance' tier in place.

Mount Verification as implemented for SCSI is a BIG HAMMER solution. The
cost in I/O serialization and SCSI I/O database cleanup is vast.

NOTE:
    Also, this discussion refers heavily to DKDRIVER, since it is the only
    multi-threaded class driver. This implies that port driver callbacks
    to set class busy bits are NOPs for MKDRIVER and GKDRIVER.

As noted above, Mount Verification is called from I/O request completion
processing (REQCOM) by the disk and tape class drivers whenever an I/O
operation completes with an error status. Initial mount verification
processing determines if the current I/O can be recovered (based on I/O
status values, whether it is a served I/O, a shadowing I/O, etc.). If the
I/O is not recoverable, or another subsystem (MSCP, shadowing) is better
suited to recovering the I/O, control is returned back to I/O request
completion for additional post-processing operations.

Note:
    A consistent, documented policy should be created and agreed upon
    between all OpenVMS drivers which implement some sort of recovery, so
    that Mount Verification, shadowing's volume validation, the MSCP
    server's failover, and other I/O subsystems behave in a predictable
    manner. This was not done, although it was hinted at during OpenVMS
    Alpha V6.2 development. The functional/design specification for the
    'new' QIO Server project should be a good source of information here,
    as 'they' must resolve this for a consistent and correct
    implementation.

If the I/O has the possibility of being recovered, either internally by
the port driver or above by the class driver, it should be done there,
keeping in mind issues around DK_DRAIN, IO_CANCEL, CLASS_BUSY bits, etc.
If all else fails and the recovery can be accomplished using mount
verification, then a recovery thread may be initiated (if not already in
progress), or a mount verification pending status is set, such that when
the device becomes quiescent, mount verification can be initiated.

Regardless of how mount verification gets initiated, the rules for driving
it to completion are specific.

o The device must continually assert a busy status (UCB$V_BSY) until Mount
  Verification is completed, or the device (UCB) is taken off-line after
  MVTIMEOUT.

o Depending on the event (I/O failure, bus reset, etc.), typically the
  first failing I/O triggers Mount Verification. Subsequent failures
  typically reconfirm this failure, and are dismissed by the pending bit
  being set.

o For single-threaded device drivers, the EXEC I/O supplied routine
  (IOC$MNTVER) does the correct things with regard to triggering Mount
  Verification. For each I/O which fails (and is not served), it is
  reinserted into the device's unit queue (UCB$L_IOQFL), in the exact
  order in which EXE$INSIOQ would have placed it there.
  For served I/O, the failure is not handled by EXEC I/O's mount
  verification processing, but is allowed to continue through
  post-processing to be handled by the serving node.

o For multi-threaded device drivers, a special version of IOC$MNTVER is
  needed to transition the driver from multi-threaded to single-threaded
  mount verification processing. For each I/O which fails (and is not
  served or shadowed), it is reinserted into the device's unit queue
  (UCB$L_IOQFL), in the exact order in which EXE$INSIOQ would have placed
  it there.

  For served or shadowed I/O, the failure is not handled by EXEC I/O's
  mount verification processing, but the I/O is allowed to continue
  through post-processing to be handled by the MSCP or shadowing
  subsystem. If the I/O is being handled by the class driver, a CLASS_BUSY
  bit is set and MSCP Mount Verification processing (if this node is the
  serving node) is invoked.

Note:
    It is a violation of Mount Verification processing for a Mount
    Verification IRP to retrigger Mount Verification.

The sequence of I/O operations which by their successful completion will
eventually restore a device to normal operation (i.e., the clearing of
Mount Verification in progress) must not start until the device is
quiescent. Upon completion of Mount Verification processing, if
successful, multi-threaded I/O operations are restarted; otherwise all
I/Os are failed through I/O post-processing with SS$_VOLINV.

Class driver Busy processing
============================

Note:
    The class busy bit solution is a well-proven mechanism for addressing
    and managing a driver's processing state. All failures detected by
    this mechanism (either BUG_CHECKs or a hung system) are easy to
    diagnose, and used together with class busy bit tracing during debug
    and development, it offers a very supportable mechanism. It is also a
    replacement for the RWAITCNT mechanism implemented in the Cluster I/O
    subsystem, which has proven hard to maintain, understand, support and
    diagnose.

When a $QIO system service posts an I/O, the $QIO sets the UCB$M_BSY
(busy) bit. The class driver STARTIO routine clears this bit for I/O
requests that it queues to the port driver. If a $QIO posts another I/O
before the queued I/O completes, the $QIO again sets the busy bit and the
class driver subsequently clears it again. Clearing the busy bit in
STARTIO bypasses the system wait cycle for I/O completion.

This sequence can continue until a condition occurs that causes the class
driver to leave the busy bit set (for example, error recovery in progress
or mount verification). In such cases, a $QIO will queue the new I/O to
the UCB busy queue (whose pointer is UCB$L_IOQFL), as in the standard
OpenVMS driver sequence. There are several cases where the class driver
would not clear the busy bit, all of which are summarized in the
UCB$L_CLASS_BUSY longword described below. If any of the bits are set in
this longword, the class driver will not clear UCB$V_BSY.

As I/O completes, the port driver resumes the appropriate class driver
thread to complete the I/O to the user process. At I/O post-processing
time, the class driver checks for I/O on the UCB busy queue using
UCB$L_IOQFL. If this list is not empty, the class driver initiates the
next I/O on the list through the STARTIO routine. The STARTIO routine may
then post the I/O to the port driver. If STARTIO clears the busy bit, the
class driver continues to process the UCB busy queue until the list is
empty or it encounters a request that does not clear the busy bit.
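The busy-bit fast path above can be summarized in a few lines of C. This
is a minimal sketch with placeholder names and a simplified UCB, not the
real class driver code:

    #include <stdbool.h>

    typedef struct irp IRP;
    typedef struct ucb {
        bool     bsy;            /* stand-in for UCB$V_BSY        */
        unsigned class_busy;     /* stand-in for UCB$L_CLASS_BUSY */
    } UCB;

    void queue_to_port(UCB *ucb, IRP *irp);  /* hand to port driver */

    /* $QIO has already set ucb->bsy before STARTIO is entered.
     * STARTIO clears it again once the request is queued to the
     * port driver, unless a class busy reason is outstanding. */
    void class_startio(UCB *ucb, IRP *irp)
    {
        queue_to_port(ucb, irp);

        if (ucb->class_busy == 0)
            ucb->bsy = false;    /* bypass the I/O-completion wait:
                                    the next $QIO starts at once    */
        /* else: leave UCB$V_BSY set; new $QIOs queue to
         * UCB$L_IOQFL as in the standard driver sequence.          */
    }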
The UCB$L_CLASS_BUSY longword is a bitmask of reasons why the class driver
should leave UCB$V_BSY set, and subsequently not initiate another I/O
until the conditions which set the bits are no longer valid. Some of the
bits are set based on specific routines and/or functions being executed,
which by their very nature must be single-threaded. Other bits are set by
callbacks from the SCSI port drivers, to indicate that a transient
condition exists and for flow control reasons. The mechanisms regarding
setting, clearing and testing of the class busy bits are architecture
specific, but the actual usage of each bit is "implementation" specific,
and need not be discussed here.

Note:
    These bits must be altered by interlocked bit instructions, to
    synchronize potential IPL 31 access to UCB$V_CB_INIT during a
    powerfail. This does not represent a performance problem because no
    alterations occur in mainline paths.

Port driver common code flow control
====================================

The processing of I/O requests from the SCSI class drivers to the SCSI
port drivers is modeled after the OpenVMS $QIO / $SYNCH model, where an
I/O is issued and then the thread of execution must call a synchronization
routine to wait for the I/O to complete. Due to the queuing nature of the
SCSI I/O ports and their associated threading package (KPBs), support is
provided for synchronous completion of I/O (SS$_SYNCH status), such that a
call to the synchronization routine may not be required.

Port driver queue depth flow control
====================================

I/Os will continue to be sent to the SCSI I/O ports until such time as the
class drivers have no more I/O, or they are stalled based on events
occurring at either the class driver level or the port driver level.

One event which controls overall class/port driver flow control is queue
depth management. A class driver sets and/or adjusts a value which is
equal to the desired "running" queue depth of a SCSI device (via
SET_CONNECTION_CHAR). I/O will be accepted from an associated class driver
as long as the total number of I/Os does not exceed the current 'queue
depth' value. When the value is reached, the port driver (via a callback
to the class driver) will set a busy bit in the class driver, which will
stall "new" I/Os from being started, although the threading model will
allow "in progress" I/Os to still be initiated.

Today, as each I/O in progress is completed, the total number of active
I/Os is compared to the desired queue depth, and if the difference
(including underflow) is large enough, the port driver (via a callback to
the class driver) will clear a busy bit in the class driver, which will
start "new" I/Os.

This mechanism works, but it is based on a hard-coded table and does not
take into account differences in 'queue depth' across each vendor's TCQ
disks. I would strongly suggest that driving this functionality off of the
QUEUE FULL SCSI status is the optimal solution.

NOTE: In my opinion, QUEUE FULL processing today is a weak point in our
SCSI implementation.
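A minimal C sketch of the queue depth accounting described above; the
connection structure, RESTART_SLACK threshold, and callback names are all
assumptions made for illustration, not the real port driver interfaces:

    #include <stdbool.h>

    typedef struct scdrp SCDRP;
    typedef struct conn {
        int  active;             /* I/Os currently queued to device */
        int  queue_depth;        /* desired "running" depth, from
                                    SET_CONNECTION_CHAR              */
        bool stalled;            /* class driver stalled by callback */
    } CONN;

    #define RESTART_SLACK 2      /* hypothetical restart threshold   */

    void set_class_busy_callback(CONN *c);   /* stall new class I/O  */
    void clear_class_busy_callback(CONN *c); /* allow new class I/O  */
    void send_to_target(CONN *c, SCDRP *r);
    void return_status(CONN *c, SCDRP *r);

    void port_accept(CONN *c, SCDRP *r)
    {
        if (++c->active >= c->queue_depth) {
            set_class_busy_callback(c);  /* depth reached: stall     */
            c->stalled = true;
        }
        send_to_target(c, r);
    }

    void port_complete(CONN *c, SCDRP *r)
    {
        c->active--;
        /* Restart only once enough headroom exists below the depth. */
        if (c->stalled && c->active + RESTART_SLACK <= c->queue_depth) {
            clear_class_busy_callback(c);
            c->stalled = false;
        }
        return_status(c, r);
    }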
Port driver ACA and SCSI-1 flow control
=======================================

Both ACA (the processing of a REQUEST SENSE command after a CHECK
CONDITION) and non-queued I/O require the SCSI port driver to do only one
I/O at a time. In both cases, after the I/O is issued, the class driver is
already stalled behind a CLASS_BUSY callback, and no future I/O will be
seen.

At I/O completion time, adjustments to the CLASS_BUSY bits are made, which
let through another I/O, which again re-evaluates the class of I/O. This
re-evaluation process (at least for ACA I/O) is the method which
transitions the SCSI port from single-threaded ACA operations back to
multi-threaded I/O operations.
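A short sketch of that completion-side re-evaluation, again with
hypothetical names and a simplified connection structure rather than the
real port driver state:

    #include <stdbool.h>

    typedef struct conn {
        bool aca_active;         /* contingent allegiance in effect? */
    } CONN;

    void clear_class_busy_callback(CONN *c); /* lets one I/O through */
    void resume_tagged_queuing(CONN *c);     /* multi-threaded mode  */

    /* Completion side of an ACA (REQUEST SENSE) or non-queued I/O:
     * adjust the CLASS_BUSY state so one more I/O comes down, and
     * let the re-evaluation on that I/O decide whether tagged
     * queuing may resume. */
    void aca_io_complete(CONN *c)
    {
        clear_class_busy_callback(c);    /* admit the next single I/O */

        if (!c->aca_active)              /* condition cleared?        */
            resume_tagged_queuing(c);    /* back to multi-threaded    */
    }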