CHAPTER 1

XQP MECHANISM IMPLEMENTATION

The basic flow of requests into and out of the XQP is described below.

1.1 Mapping The Code Into P1 And Allocating Impure Storage

The F11BXQP.EXE image, which contains only pure code, is mapped at process creation time into P1 space. There is a routine XQPMERGE in the SYS facility module PROCSTRT which knows how to do this. It is functionally equivalent to a LIB$P1MERGE call, but is optimized; SYSINIT has already looked up the image and set up a permanent global section for it. All that XQPMERGE has to do is map it. If the sysgen parameter ACP_XQP_RES is set, SYSINIT has wired the code into physical memory and global valid page faults are also avoided.

Once the code has been mapped, XQPMERGE jumps to the lowest address mapped, the initialization routine. The module DISPATCH is linked to be the first module in the image and the code there does a CMKRNL, specifying the INIT_FCP routine. This routine does an EXPREG in P1 space to get the impure storage area allocated, including space for a private kernel stack. It then locks down the areas for the kernel stack and the parts of the impure area referenced at elevated IPL, fabricates a channel for use by the XQP, and initializes some queue headers and a handful of other locations in the impure area. It also notes the address of the impure area in the process cell CTL$GL_F11BXQP. In addition, the very first process so merged when booting the system (the SYSINIT process) creates a permanent mailbox (ACP$BADBLOCK_MBX, channel MBX_CHAN) for talking to the bad block scanner.

[Alternatively, XQPMERGE (the F11X module) can be called to take a given XQP (logical name TESTXQP), force map it into P1 space, and jump to its initialization entry point.]

Now the file system is ready for business.

1.2 Getting To The XQP From QIO

All file system functions are QIOs.
They start off life with QIO pre-processing in the SYS module SYSQIOREQ. An IRP is allocated and is the basic argument block passed to the file system for all functions. This IRP is first processed by the file system FDT routines, which eventually get the request to the XQP, if necessary.

1.2.1 FDT Routine Processing

The FDT routines for file system functions are in the SYS module SYSACPFDT (which also handles the system ACPs). These routines perform various setup and initialization functions. The IRP is queued (sent) to the XQP for processing by these routines.

1.2.1.1 Access and Create - These functions are performed by ACP$ACCESS. The steps are as follows.

1. Check for a file already open (CCB$L_WIND non-zero).
2. Check that JIB$W_FILCNT won't be exceeded.
3. Build the XQP packet.
4. Check for dismount.
5. Decrement JIB$W_FILCNT.
6. Interlock the channel (increment CCB$L_WIND).
7. Queue the XQP packet.

1.2.1.2 De-access - This function is performed by ACP$DEACCESS. The steps are as follows.

1. Check for a file open (a positive CCB$L_WIND implies a process section, which can't be de-accessed in this way; a negative value implies a window pointer; 0 implies no file).
2. Build the XQP packet.
3. Interlock the channel (increment CCB$L_WIND).
4. Update the volume transaction count.

If this is the only activity on the channel (CCB$W_IOC is 1), the XQP packet is queued. If other activity exists on the channel, CCB$W_IOC is decremented to account for this de-access, but the de-access IRP is placed on CCB$L_DIRP so as to be performed when the channel goes idle.

The de-access function is special if the WCB indicates a non-FCP window (WCB$B_ACCESS bit WCB$M_NOTFCP set) or a shareable window (WCB$B_ACCESS bit WCB$M_SHRWCB set). If the window is shareable but is an FCP window, the reference count (WCB$W_REFCNT) is decremented. If the count is non-zero, we simply clear CCB$L_WIND to remove us as a user of the window. If the count is zero, then the XQP de-access packet is queued for processing.
Processing it will clear CCB$L_WIND. If the window is not an FCP window and not shareable, the "file" is closed by clearing CCB$L_WIND and incrementing JIB$W_FILCNT. If the window is both a non-FCP and a shared window, WCB$W_REFCNT is decremented and CCB$L_WIND is cleared to remove our reference to it.

1.2.1.3 Modify and Delete - These functions are performed by ACP$MODIFY. The XQP packet is built, we check that the volume is mounted, and the XQP packet is queued.

1.2.1.4 Mount - The mount function is provided by ACP$MOUNT. The steps are as follows.

1. Build the XQP packet.
2. Check for MOUNT privilege.
3. Check for a mountable device (UCB$V_MOUNTING set in UCB$W_STS).
4. Update the volume transaction count.
5. Queue the XQP packet.

1.2.1.5 Read and Write - The read function is implemented in ACP$READBLK, write in ACP$WRITEBLK. The read function checks for read access to the file (WCB$B_ACCESS bit WCB$V_READ set). The read check enabled state is checked at this time. The write function checks for file write access (WCB$B_ACCESS bit WCB$V_WRITE set). The write data check state is checked at this time.

Common processing includes checking the access to the user's buffer and mapping the virtual block number range into logical blocks. If the mapping (at least partially) succeeds, the request (now converted to a physical request) is queued to the driver. If the mapping fails, the request must be queued to the XQP for a window turn. (Non-FCP devices return SS$_ENDOFFILE.) If this is a write function and WCB$V_WRITE_TURN is set in WCB$W_ACON, then the mapping always fails, and the write request is sent to the XQP. This is done for direct writes to directories, and to the INDEXF, BITMAP and QUOTA files.

Erase functions are coded as special write block functions. The difference is that a user buffer is not involved. If the erase pattern is zero (almost always true), a pre-allocated zero area is used.
(This pre-allocated area consists of a single zero page of memory, but with a page of page table entries each pointing to it, thereby giving an erase buffer of up to 127 pages.) If the erase pattern is non-zero, a page is allocated to hold the pattern (replicated from the four bytes) and a page of page table entries is allocated to map over it.

Assuming that the virtual to logical mapping succeeded, the request is queued to the driver. Eventually, the driver will process the request and the I/O will complete. When the I/O post interrupt reaches IOC$IOPOST (SYS module IOCIOPOST), the VBN and count specified by the user are checked against the updated processed count. The buffer virtual address is also updated. (Note that for an erase function, the address is not updated since the buffer is a pseudo buffer.) If the byte count remaining is non-zero (indicating that the request did not map completely, could not be performed as a single contiguous transfer, or that the request exceeded the capacity of a single I/O), the remaining request is re-mapped and re-queued to the driver. If the mapping fails, the request must be queued to the XQP. In this case, the XQP queuing is done by IOC$WAKACP instead of EXE$QIOACPPKT (described below).

Virtual to logical mapping is done by IOC$MAPVBLK (in the SYS module IOSUBRAMS). It walks down the WCB list associated with the request to find the mapping pointers that locate the desired VBN. The routine returns the starting LBN and number of bytes that can be mapped. Also, the UCB corresponding to the volume holding this extent (for a volume set) is returned. On a total mapping failure, the original UCB (WCB$L_ORGUCB) is returned.

1.2.1.6 XQP Packet Building - The XQP packet is built by the internal routine BUILDACPBUF. This routine allocates the space for the XQP packet (address placed in IRP$L_SVAPTE). The COMPLX, FILACP and VIRTUAL bits are set in IRP$W_STS.
Accounting for buffer byte count quota is done here. The user arguments to the QIO (FIB, etc.) are checked for access. Descriptors for each user argument are created in the XQP packet. The number of descriptors is placed in IRP$L_BCNT. Finally, CCB$L_UCB is placed into IRP$L_MEDIA.

1.2.1.7 Volume Status - The FDT routines ensure that the volume has the correct state for the request. The dismount check makes sure that the volume isn't being dismounted (DEV$V_DMT set in UCB$L_DEVCHAR) and then checks for mounted. The mounted check verifies that the device is mounted (DEV$V_MNT set in UCB$L_DEVCHAR), that the device is not a member of a shadow set, that the device is not in the dismount state (UCB$V_DISMOUNT set in UCB$W_STS) and that the volume is not mounted foreign.

Once the volume checks (if any) pass, the volume transaction count is incremented (VCB$W_TRANS). This is normally done for the volume describing the desired UCB, but will be done to the UCB on which a file is open if the WCB so indicates. The IRP$L_UCB field is updated to this value. The IRP$L_MEDIA field is updated to this UCB if the device is not spooled.

1.2.2 XQP Packet Processing

XQP packets (the IRP with the added XQP buffer descriptors) are queued to the XQP by IOC$WAKACP (SYS module IOCIOPOST) or by EXE$QIOACPPKT (SYS module SYSQIOREQ). In the case of the queuing of an XQP packet by the FDT routines (within the context of the requesting process), EXE$QIOACPPKT will generate a kernel mode AST specifying as the routine the value found in F11B$L_DISPATCH. For the case of I/O post processing, in which we are probably not in the context of the target process, a special kernel mode AST is queued to the process with an AST routine address of EXE$QXQPPKT (SYS module SYSQIOREQ) and an AST parameter of the IRP. EXE$QXQPPKT then performs the queuing of the kernel mode AST to the XQP dispatcher.
To avoid the necessity of allocating an AST control block (ACB), the CDRP extension to the IRP is used as an ACB. There is a very definite assumption that this area is always there and not used by anything else for virtual I/O functions. This area is normally used by the disk class driver when processing disk I/O requests.

1.3 Initial Setup For F11BXQP Code Execution

When the kernel AST specified above begins execution, we are finally executing code in the F11BXQP image. The routine called is the DISPATCH routine. The argument to this routine (the AST parameter) is the IRP. This routine queues the IRP onto a per-process queue. If there are no other requests being processed, the normal case, the routine enables the special XQP channel for use by stuffing the CCB$B_AMOD field so it appears to be a normal kernel mode channel. When the XQP is not using it, the CCB$B_AMOD field contains a -1, which makes it inaccessible to anyone at any mode because the privilege check for channels in IOC$VERIFYCHAN does a signed comparison against access mode. The system run-down routine, EXE$RUNDWN, in the SYS module SYSRUNDWN, also does signed comparisons against access mode to determine if a given channel should be de-assigned, and the negative access mode in the special XQP channel when the XQP is not actively processing a request prevents it from being de-assigned.

At this point, the PCB$B_DPC cell is incremented. This is done to prevent process deletion while any file system request is being processed. The EXE$DELPRC routine, in the SYS module SYSDELPRC, waits for this cell to be zero before actually proceeding with process deletion. It waits at IPL 0 to allow kernel ASTs to be delivered, so that pending file system requests can in fact complete. Similar code in the process suspension service prevents a process from being suspended until pending file system requests are completed.
The reason for blocking process suspension while file system requests are active is that random synchronization locks could be held indefinitely in that case, potentially locking up the entire cluster. The reason for blocking process deletion while processing a file system request is to minimize problems potentially caused by half completed operations.

The DISPATCH routine then saves the current kernel stack limits and current frame pointer (FP) prior to setting the kernel stack limits to be the private XQP kernel stack and setting the stack pointer to the base of the private XQP kernel stack. It also sets up a register (R10, called "BASE") to point to the impure area (the contents of CTL$GL_F11BXQP). All XQP routines run assuming that R10 points to the XQP_QUEUE variable in the XQP storage area. DISPATCH then calls the DISPATCHER routine, and from now on we are on the private XQP kernel stack.

The DISPATCHER routine de-queues the IRP from the queue we just stuck it on above and proceeds to call the appropriate routines to execute the desired function. After completing the request, it attempts to de-queue another request and process it. That is how pending requests are eventually processed.

1.4 What Happens When The XQP Needs To Wait For Something

To stall in the caller's mode, the file system dismisses the kernel AST it is in when it needs to stall for either I/O or a lock request that is queued. A completion AST resumes the thread of execution. The initial entry into the XQP was also via an AST, so that the entire XQP operation is done at AST level. XQP activity is generally asynchronous with respect to normal process operation. The XQP is itself a serial function, though.

Two routines in the DISPATCH module are used to accomplish stalls. Immediately after a QIO or ENQ request is queued for which the XQP must stall, the WAIT_FOR_AST routine is called.
This saves the current frame pointer, makes the XQP channel inaccessible as described above, restores the previous stack limits and frame pointer and does a RET to dismiss the AST. Because the frame pointer has been restored, the RET picks up where we left off on the original kernel stack. PMS metering is stopped for the duration of the stall.

The QIO or ENQ request that was queued specifies the impure pointer as the AST parameter, and the routine CONTINUE_THREAD as the AST routine. When the AST arrives at the CONTINUE_THREAD routine, the impure pointer is restored from the AST parameter, the stack limits are pointed back to the XQP stack, the saved XQP frame pointer is restored, the XQP channel is made accessible again, PMS monitoring is restarted, and we RET from that routine, which returns us to the caller of the WAIT_FOR_AST routine.

1.5 Finishing Up XQP Processing

After all the actual work has been done, e.g., a new file has been created and all buffers written back to disk, or a file has been accessed, or whatever, the IO_DONE routine is called from the DISPATCHER routine.

IO_DONE moves USER_STATUS into IRP$L_MEDIA (actually a quadword), decrements the transaction count for the VCB, clears the name string descriptor length in the complex buffer packet to prevent write back of the name, copies the local FIB back into the complex buffer packet, if any, sets IRP$L_BCNT to ABD$C_ATTRIB for non-read functions so that the attributes don't get written back to the user's buffer, and calls the CHECK_DISMOUNT routine.
With the XQP, we are already in the correct process context, so instead of issuing an IOPOST interrupt, this routine calls (via JSB) the special entry point IOC$BUFPOST in the SYS module IOCIOPOST which executes the same code (resetting PCB quotas, setting up what would be the special kernel mode AST completion routine (BUFPOST for XQP functions requiring a complex buffer (all except window turns) and buffer I/O, DIRPOST for direct I/O)) otherwise executed by the IOPOST software interrupt. After coming back to IO_DONE from that, the event flag is posted, and then another JSB executes the special kernel AST code.

BUFPOST copies the IRP described buffers (FIB, etc.) back to the user buffers, de-allocates the complex buffer, and flows into DIRPOST. DIRPOST updates PHD quotas, decrements the channel activity count, sends a de-access request to the XQP if the activity count goes to zero and CCB$L_DIRP indicates de-access pending, writes the user IOSB, sets the event flag, queues the user AST and ends by freeing the I/O packet.

The IOPOST software interrupt is signaled instead if the IRP$L_PID field is negative, indicating special post processing. It is not known if this feature is used.

Finally, the DISPATCHER routine does a RET which gets us back into the DISPATCH routine (where the initial kernel AST came to in the first place). It makes the XQP channel inaccessible again, decrements the PCB$B_DPC cell to allow process deletion and suspension again, restores the original kernel stack limits and frame pointer, and does a RET.

1.6 Request Dispatching

The next request to be performed is obtained by GET_REQUEST. This routine also initializes the impure area.
Initialization consists of:

o starting PMS monitoring
o zeroing the impure area
o setting USER_STATUS[0] to 1
o setting the BFR_LIST queue heads to empty lists
o setting the value for PACKET (current IRP), CURRENT_UCB, CURRENT_WINDOW (if the low bit is set (no real window) then CURRENT_WINDOW is 0), CURRENT_FIB, CURRENT_UCB and PRIMARY_FCB if a window exists (a window doesn't exist for access/create/mount functions), ORB, CURRENT_VCB, CURRENT_RVT, CURRENT_RVN, and IO_CCB$L_UCB (from CURRENT_UCB)
o clearing the byte count for the window block descriptor (to prevent I/O completion from writing it back)
o setting the spoolfile cleanup flag (if IRP$L_MEDIA is not equal to IRP$L_UCB)
o copying the ARB into LOCAL_ARB
o setting the SYSPRV flag in the ARB if appropriate, also the VOLOWNER and GROUPOWNER cleanup flags
o setting the SYSPRV cleanup flag if the SYSPRV, BYPASS or READALL privileges are set

Returning to the main flow of DISPATCHER, the XQP$_FILE_NAME (message buffer) is cleared to avoid possible confusion. Other message variables are set (FUNC_DESC and SUB1_FUNC_DESC). A minimum set of buffers is obtained (GET_REQD_BFR_CREDITS) as described under buffer management. The main function is dispatched upon. The readpblk, writepblk, acpcontrol and mount functions are done directly. All others must first check the activity block lock (which blocks all XQP activity on the volume). The create/access decision is made here.

After the operation has completed, NOTIFY_USER and PERFORM_AUDIT are done (because of the perturbations that the FID_TO_SPEC call in PERFORM_AUDIT will have). Cleanup is done: if status indicates success, then a normal cleanup is done; any error invokes ERR_CLEANUP. This is repeated, trying for a successful cleanup for a very large but finite number of tries. UNLOCK_XQP releases all XQP locks. PMS monitoring is ended. IO_DONE is called as described above. The activity block lock is released (refer to the block lock later) if taken (BLOCK_CHECK set).
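The dispatch decision and the bounded cleanup retry described in this section can be modelled roughly as follows. This is an illustrative Python sketch, not the XQP code: the function name strings, the MAX_CLEANUP_TRIES value, and the idea that a failed cleanup attempt is retried through the error cleanup path are all assumptions made for the example.

```python
# Illustrative model only; every name here is hypothetical unless it is
# quoted from the text in a comment.

# Functions the text says are "done directly", without the block lock check.
DIRECT_FUNCTIONS = {"readpblk", "writepblk", "acpcontrol", "mount"}

# Assumed stand-in for "a very large but finite number of tries".
MAX_CLEANUP_TRIES = 1000


def needs_block_check(function):
    """Return True if the function must first check the activity block lock."""
    return function not in DIRECT_FUNCTIONS


def run_cleanup(status_ok, cleanup, err_cleanup, max_tries=MAX_CLEANUP_TRIES):
    """Run cleanup until one attempt succeeds, and return the tries used.

    A successful request gets the normal cleanup first. A failed request,
    or a failed cleanup attempt, goes through err_cleanup (this retry
    policy is an assumption about how ERR_CLEANUP is re-invoked).
    Both callbacks return True on success.
    """
    ok = status_ok
    for attempt in range(1, max_tries + 1):
        ok = cleanup() if ok else err_cleanup()
        if ok:
            return attempt
    raise RuntimeError("cleanup did not succeed within the try limit")
```

The point of the finite limit is that a cleanup that can never succeed does not hang the process at AST level forever.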
1.7 CHECK_DISMOUNT

CHECK_DISMOUNT (in CHKDMO) performs deferred dismount processing. Walking down the UCB list for the volume [set], a dismount is performed for each UCB that has DEV$V_DMT set and a transaction count of 1 (us).

1. Dismounting starts by setting the V_DISMOUNT bit under the IODB lock to prevent other people from trying to start I/O on the volume.
2. An IO$_UNLOAD/IO$_AVAILABLE function is issued.
3. For shared devices, get the value block for the device lock.
4. Clear the high bit of UCB$W_DIRSEQ to warn RMS of the volume dismount (refer to the RMS directory cache).
5. Decrement UCB$W_REFC.
6. Decrement AQB$W_MNTCNT; if zero, remove from the AQB list.
7. De-allocate all FCBs, ACLs, WCBs, and de-queue access locks (by forcing FCB$W_REFCNT to 0).
8. De-queue FID and extent cache locks, and de-allocate the cache.
9. De-queue the quota cache lock, and de-allocate the quota cache.
10. De-queue the volume lock (VOLLKID).
11. De-queue the shadow lock.
12. For volume sets, clear the RVT list entry and decrement RVT$W_REFC; if zero, de-queue the structure lock, de-queue the block lock (and clear BLOCK_CHECK so DISPATCHER won't try to release the block lock), and de-allocate the RVT. For single volumes, just de-queue the block lock as for a volume set.
13. De-allocate the VCB.
14. If the device lock exists (promoted when getting the value block above), demote it to CR (keep it at EX if allocated to someone).
15. Post de-allocation on dismount (performed by IOC$DALLOC_DMT in the SYS module IOSUBPAGD).
16. De-allocate the buffer cache and AQB.

1.8 Volume Activity Blocking

The ability to cause the file system to stall requests cluster-wide is necessary to allow storage and index BITMAP, plus quota cache rebuilds while a volume is active and in use. Any changes that potentially modify those structures must be blocked. The REBUILD module in the MOUNT facility is used by MOUNT, SET VOLUME /REBUILD, and the DISKQUOTA utility to rebuild the BITMAPs and the QUOTA.SYS file.
ANALYZE /DISK /REPAIR performs a similar function (plus others) with a different chunk of code. Both of them use the ACPCONTROL LOCK_VOL control function to prevent file creation, deletion, extension, and truncation activity.

The LOCK_VOL function stalls activity with help from a system blocking routine on a special lock called the activity blocking lock (block lock). Its name is F11B$b followed by a 12 byte unique volume or volume set identifier, as discussed under serialization of conflicting activity.

In the VCB (or RVT for a volume set) there is an activity count field, VCB$W_ACTIVITY or RVT$W_ACTIVITY, and a field to store the lock ID of the blocking lock, either VCB$L_BLOCKID or RVT$L_BLOCKID. A non-zero BLOCKID and an odd (low bit set) ACTIVITY count means that everything is normal and can proceed. The START_REQUEST routine is called from the DISPATCHER routine to check this before anyone starts taking out serialization locks or doing anything interesting. If the above conditions are true, START_REQUEST simply adds 2 to ACTIVITY (preserving its oddity) and returns. The FINISH_REQUEST routine is called from DISPATCHER when a request is completed, and subtracts the 2 out again. The F11B$b lock in this situation is a system owned lock held in CR mode with XQP$BLOCK_ROUTINE (in the SYS module SYSACPFDT) as the blocking routine address and the VCB address as the parameter.

A process that comes along and does a LOCK_VOL function will call the routine TAKE_BLOCK_LOCK. This will queue an EX mode lock for the F11B$b lock. This is of course incompatible with the system owned CR lock, and on every node that holds that lock (normally all the nodes that have the volume mounted), the lock manager calls the XQP$BLOCK_ROUTINE entry point at IPL$_SYNCH with the VCB address in R1. That routine, in turn, gets itself pointed to either the VCB or RVT ACTIVITY count, as appropriate. It then decrements ACTIVITY, now making the field even (it was odd).
If that causes it to go to zero, the volume is now idle, and further activity is blocked. If so, it queues an AST to the swapper process, with the lock id of the F11B$b lock as the AST parameter, after clearing the BLOCKID field, with the AST entry point of XQP$DEQBLOCKER (also in SYSACPFDT). (The ACB used in this process is contained in the VCB/RVT.) XQP$DEQBLOCKER then actually de-queues the system owned CR mode F11B$b lock. If XQP$BLOCK_ROUTINE did not see ACTIVITY go to zero, it simply RSBs and the process calling FINISH_REQUEST that sees it go to zero will de-queue the lock. Once all nodes release their locks, the volume is idle and TAKE_BLOCK_LOCK will get its EX mode lock and proceed. The START_REQUEST and FINISH_REQUEST routines make their tests and changes at IPL$_SYNCH to interlock with the blocking routine correctly.

Once the ACTIVITY field is even (low bit clear), subsequent callers to START_REQUEST will not add to it, but call the BLOCK_WAIT routine. This will queue for the F11B$b lock in PW mode, which will not be granted until the process that queued for it in EX mode de-queues that lock by doing an ACPCONTROL UNLK_VOL function. When the EX mode F11B$b lock is de-queued, the first waiting process to get the PW lock will convert that lock to the CR mode system owned lock and set the low bit of the ACTIVITY count, thus allowing things to proceed again. In fact, the F11B$b lock is initially armed in this fashion by the first function that calls START_REQUEST after the volume is mounted. Other users on the same node waiting for the lock will find the blocked value non-zero, and so they will de-queue their (redundant) version of the lock.

Another thing that BLOCK_WAIT will do is return the buffer credits it had already acquired while waiting for the blocking lock to be granted.
This is to avoid having the CACHE_SERVER (discussed further on) unable to flush caches due to lack of available buffers, which would cause a deadlock.

Note that the process holding the EX F11B$b lock (via the LOCK_VOL function) and cache flushes are allowed to proceed, because they are necessary for the rebuild to work. File creates, etc., are still prevented via the VCB$V_NOALLOC flag, however. Allowing the process holding the block lock to proceed is done by virtue of the fact that the process has a non-zero value for BLOCK_LOCKID (one of the non impure XQP variables). DISPATCHER will not ask for the block lock before invoking a function. QUOTAUTIL requires the process to hold the block lock to perform a modify use function. All acpcontrol functions avoid the request for the block lock in DISPATCHER. ADD_QUOTA is an exception, though. The block lock is obtained in ACPCONTROL. BLOCK_LOCKID is also checked in the lock/unlock acpcontrol functions. This has the disadvantage, though, that effectively only one volume can be locked at a time. Also, failure of the process to unlock the volume prevents the volume from ever being unlocked.

1.9 Secondary Operations

Various functions require so called secondary operations. As an example, bad block processing is done in secondary context to avoid confusing the main function. This secondary context is provided by copying the CONTEXT_START;CONTEXT_SIZE area into and out of CONTEXT_SAVE. This is done by the SAVE_CONTEXT and RESTORE_CONTEXT routines in GETREQ. The secondary save area allows for only one secondary operation nested within the primary. Various restrictions apply to secondary context.
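The save and restore mechanism just described amounts to copying one fixed slice of the impure area aside and back, with room for exactly one level of nesting. A minimal Python model follows. The class, its sizes and its method names are assumptions for illustration; only the names in the comments come from the text, and the sketch does not model what the secondary operation finds in the context area when it starts.

```python
class ImpureArea:
    """Toy model of the impure area with a single secondary save slot."""

    def __init__(self, size, context_start, context_size):
        self.storage = bytearray(size)
        self.context_start = context_start  # models CONTEXT_START
        self.context_size = context_size    # models CONTEXT_SIZE
        self.context_save = None            # models CONTEXT_SAVE (empty)

    def _span(self):
        # The slice of the impure area that makes up the context.
        return slice(self.context_start,
                     self.context_start + self.context_size)

    def save_context(self):
        """Models SAVE_CONTEXT: copy the primary context aside."""
        if self.context_save is not None:
            # Only one secondary operation may be nested in the primary.
            raise RuntimeError("secondary context already active")
        self.context_save = bytes(self.storage[self._span()])

    def restore_context(self):
        """Models RESTORE_CONTEXT: put the primary context back."""
        if self.context_save is None:
            raise RuntimeError("no secondary context to restore")
        self.storage[self._span()] = self.context_save
        self.context_save = None
```

A single save slot keeps the impure area small, at the price of forbidding a secondary operation from starting another one.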
The usages of the secondary context area are:

o operating upon the BADLOG file (SCAN_BADLOG and DEALLOCATE_BAD)
o marking for deletion a file being removed/superseded during a file creation (CREATE)
o opening a file from which attributes are being propagated (CREATE)
o extending the index file (EXTEND_INDEX)
o opening a file to determine placement (GET_LOC)
o extending or compressing a directory (SHUFFLE_DIR)

The primary context must be restored when done. Note, though, that ERR_CLEANUP will detect if we were in secondary context and clean up secondary context first, and then move onto primary context. Secondary context may leave around buffers waiting to be written. However, secondary context must release all serialization locks obtained in secondary context, and must therefore write out any buffers protected by these locks (refer to serialization of activity and buffer management). Also, any unrecorded blocks (refer to cleanup processing) must be recorded before leaving secondary context.

CHAPTER 2

SERIALIZATION OF CONFLICTING ACTIVITY

The procedure based XQP design requires explicit serialization of file system processing. The distributed lock manager is the mechanism used to do this.

2.1 Naming Of Serialization Lock Resources

Everything of interest on a disk volume is a file. Files are files. The allocation storage BITMAP is a file. Directories are files. Every file has at least one file header. Every file header has a unique number which identifies it. All the file headers are in fact contained in a file called INDEXF.SYS, and knowing what a given file number is, you can calculate what block within the INDEXF file contains its file header. The file header for the INDEXF file is itself one of the blocks within the INDEXF file.

Given that everything the file system wants to do has something to do with a file or files, serializing operations by taking out locks based on those file numbers is the natural thing to do.
For example, say you want to rewrite the record attributes (which are contained in the file header) of file number 47 to update the end of file information (which is part of the record attributes). First you take out an exclusive lock whose name is based on "47" that has a parent lock associated with the specific volume. You then read the file header from the disk into a buffer, modify the record attributes in the buffer, and write the changed file header back to disk. When done, you release the exclusive lock "47". Someone else trying to do the same thing at the same time has to do the same thing. First they try to obtain an exclusive lock on "47", but you already have it, so they wait until you are completely done. When you finally release your lock, they are granted the lock, read the header, etc. Locks that are used to serialize processing in this manner are known as serialization locks.

Each of the file serialization locks has a parent lock associated with the volume. We need to pick a name for the parent lock that is unique for any given volume or volume set, and can also be generated identically from any node in the cluster that mounts a given shared volume. The volume label was chosen somewhat arbitrarily over the allocation class form of the device name. At 12 bytes, it is slightly shorter than the possible 15 byte device name, so that gives it an edge. It is also under the control of the file system. In conjunction with the device name, it is used by MOUNT to enforce the requirement that only one volume with a given name can be mounted shared at any given time in the cluster. Any number of volumes with the same volume label can be mounted privately. In that case, a combination of the node name plus the address of the UCB for the device forms a name guaranteed to be unique throughout the cluster.
The resource name used is the character string "F11B$v" followed by either the volume label if mounted shared, or the node name-UCB address combination if mounted privately. That is, the name has the form F11B$v followed by a 12 byte volume identifier.

The 12 byte volume identifier part of this lock name is generated by the routine GET_VOLUME_LOCK_NAME in the MOUNT facility module CLUSTRMNT when the volume is mounted. These 12 bytes are stored in the VCB field VCB$T_VOLCKNAM for subsequent use by MOUNT and the XQP. This volume identifier is available using the DEVLOCKNAM item code with the $GETDVI system service. That item code actually returns a 16 byte field. One of the extra bytes is used to distinguish the shared from privately mounted cases to preclude any possible name space collisions. The file system avoids this by using the node specific lock as a parent for privately mounted volumes. The node lock id is contained in the exec cell EXE$GL_SYSID_LOCK. This also has the advantage of not requiring cluster traffic to determine master-ship of the lock.

The volume lock itself is initially acquired in PW mode by the MOUNT routine GET_VOLUME_LOCK, also in the CLUSTRMNT module. When MOUNT processing is essentially complete, the volume lock is converted to a system owned lock in CR mode by the STORE_CONTEXT routine in the same module. This lock remains granted in that mode until the volume is dismounted and it is de-queued by the CHECK_DISMOUNT routine when the volume becomes idle after it is marked for dismount. Its lock ID is stored in the field VCB$L_VOLLKID.

This volume lock, then, remains permanently granted as long as the volume is mounted, and is used as the parent lock for the file number serialization locks discussed above. A similar lock is taken out on the volume set name to provide a parent lock for operations on volume sets. Its lock ID is stored in the RVT$L_STRUCLKID field.
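As a concrete illustration of the shared case, the volume lock resource name can be put together as below. This is a sketch in Python under stated assumptions: the helper names are made up, and padding the label with spaces to 12 bytes is an assumption, since the text only says that the identifier is 12 bytes long. The privately mounted case is left out because the text does not give the layout of the node name-UCB address combination.

```python
VOLUME_LOCK_PREFIX = b"F11B$v"   # prefix quoted from the text
VOLUME_ID_LENGTH = 12            # identifier length quoted from the text


def shared_volume_id(label):
    """Build the 12 byte identifier for a volume mounted shared.

    Assumption: the label is ASCII and is padded with spaces on the right.
    """
    raw = label.encode("ascii")
    if len(raw) > VOLUME_ID_LENGTH:
        raise ValueError("volume label longer than 12 bytes")
    return raw.ljust(VOLUME_ID_LENGTH, b" ")


def volume_lock_name(volume_id):
    """Return the prefix followed by a 12 byte volume identifier."""
    if len(volume_id) != VOLUME_ID_LENGTH:
        raise ValueError("volume identifier must be exactly 12 bytes")
    return VOLUME_LOCK_PREFIX + volume_id
```

Because every node that mounts the volume shared derives the same bytes from the same label, they all end up naming the same lock resource.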
File number serialization locks are taken out in the routine SERIAL_FILE. Either the volume lock or the volume set lock is specified as a parent lock depending on whether it is a single or multi-volume disk volume. The resource name used for file number serialization locks is the string "F11B$s" followed by 3 bytes of file number plus 1 byte of relative volume number. This uniquely identifies any given file on a volume set (or single volume, obviously).

2.2 Serialization Strategy

On any given file system function, any or all of the following disk volume structures may be referenced:

o A directory file - including both the directory header and/or the directory file data blocks. This is the file specified by the FIB$W_DID, or directory ID, and may be used to look up the target file in an ACCESS function, or to make a directory entry in a CREATE operation.

o The target file header and its possible extension headers. This is the file you wish to do something with, such as ACCESS it, write attributes, extend, truncate, etc.

o The storage and index file bitmaps, and the QUOTA.SYS diskquota file. The storage bitmap is involved when free storage is mapped to a file being extended, or returned from a file being truncated or deleted. The index file bitmap is touched when a new file or extension header is being created, or headers are being deleted. The QUOTA.SYS file reflects allowed and current diskquota usages.

For most operations, these structures are always accessed, if at all, in the following order:

1. The directory file is looked at to do a directory lookup, if any.

2. The target file header and its extension headers, if any, are looked at. The file header is calculated from its file number, which is either the result of the directory lookup above, or explicitly specified by FIB$W_FID.

3. Storage allocation changes, either extension, truncation, or deletion of the target file, including quota checking, if enabled.
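To make the F11B$s resource name described before section 2.2 concrete, here is a small Python sketch that packs a file number and relative volume number into the 4 byte suffix. The function name is hypothetical, and the little-endian byte order of the file number is an assumption; the text only gives the field widths.

```python
FILE_LOCK_PREFIX = b"F11B$s"   # prefix quoted from the text


def file_lock_name(file_number, rvn):
    """Build a file serialization lock name.

    The suffix is 3 bytes of file number followed by 1 byte of relative
    volume number, as the text states. The byte order used for the file
    number is an assumption made for this example.
    """
    if not 0 <= file_number < (1 << 24):
        raise ValueError("file number must fit in 3 bytes")
    if not 0 <= rvn < 256:
        raise ValueError("relative volume number must fit in 1 byte")
    return FILE_LOCK_PREFIX + file_number.to_bytes(3, "little") + bytes([rvn])
```

For the file number 47 used in the example in section 2.1, on a single volume (taking the RVN as 0), this produces a 10 byte name. Two processes on any nodes that build the name from the same file number and RVN contend for the same resource under the same parent lock.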
2.3 Serialization On Specific Files

Serialization locks for the directory and target files are handled by the SERIAL_FILE routine, which takes the file ID as input, extracts the file number and relative volume number to construct the F11B$s lock resource name, and then takes the lock out. The SERIAL_FILE routine returns an index into the vector LB_LOCKID, which keeps track of the lock ID of the lock granted. This index is stored in the cell DIR_LCKINDX for the directory serialization lock, and in the cell PRIM_LCKINDX for the target, or primary, file.

In order to minimize both the number of locking operations and the number of locks required to perform a given operation, it was decided not to serialize access to extension headers with a separate lock, but rather to serialize access to all extension headers with a serialization lock on the primary header. In normal operations this works out nicely because one always goes after the primary header first and follows a link from each header to the next header.

Serialization locks on the directory and primary file are normally held until the completion of the entire operation. All locks are released in the routine UNLOCK_XQP. This routine is called from the DISPATCHER routine.

2.4 Serialization Of Volume Changes

Operations that involve the storage or index file bitmaps are serialized with an F11B$v lock. This lock is taken out by the ALLOCATION_LOCK routine, which is very similar to the SERIAL_FILE routine. The lock ID of this allocation lock is always stored in element 0 of the LB_LOCKID vector. Note, though, that manipulating the headers of the storage or bitmap files requires taking out the serialization lock on the file, as well as the allocation lock.

2.5 Deadlock Considerations

The file system is designed to be deadlock free. By assuming a hierarchical directory and file structure, taking out locks in the above order results in a deadlock free system.
Certain operations, such as creating a new file, must access files in a different order. Specifically, the allocation lock must be taken out first to determine what the file number of the to-be-created primary file is, and then the directory entry can be made. In this case, deadlock is avoided by releasing the allocation lock prior to acquiring the serialization lock on the new file header. In general, the allocation lock is always released before acquiring a new file number serialization lock. The ALLOCATION_UNLOCK routine does this.

It is considered okay to hold the serialization lock on a newly created file and then go after the directory serialization lock (even though it violates the ordering rule above) on the theory that if this is a new file, nobody should be able to find it in the directory. Note that deleting a file removes the directory entry first, so that even if a system crashes while deleting a file, if anything is gone, it will be the directory entry, so that helps also. You could probably go out of your way to construct a directory with dangling entries, and then try to force a deadlock by having one process look up the non-existent files while another is creating them and causing new directory entries to be made, but the odds seem very low that this is a real problem.

2.6 Internal Serialization Checks

To enforce the requirement that an appropriate serialization lock is held when a given header is read from disk, there is another vector, indexed by the lock index returned from SERIAL_FILE and called LB_BASIS, which contains the lock basis, or file number + RVN field, of a given serialization lock. The CURR_LCKINDX field contains the last lock index returned by SERIAL_FILE. The READ_HEADER routine uses these bits of information, in addition to looking at the header just returned from the READ_BLOCK routine (discussed later), to determine if in fact the correct serialization lock is held for the header just requested.
The READ_HEADER routine is used by all code in the XQP to read a header. For example, if a user specifies an extension header by file ID and attempts a DELETE function on the extension header, the MARK_DELETE routine will first serialize on the given file ID by calling SERIAL_FILE, which sets the CURR_LCKINDX and LB_BASIS fields up, as noted above. It will then call READ_HEADER to actually get the header. READ_HEADER, however, will note that it has an extension header (based on the FH2$W_SEG_NUM field in the header), and that the lockbasis for the serialization lock held does not match the primary header for that file (determined from the FH2$W_BK_FIDNUM and FH2$B_BK_FIDNMX fields). It will therefore exit with an SS$_NOSUCHFILE status, thereby making direct access to extension headers impossible.

There is an exception to this, however. BACKUP and DUMP, to name two, will perform an ACCESS function explicitly on extension headers, with a read attributes list that returns the complete file header, because they want to know exactly what is in all the extension headers of a given file. The ACCESS routine works together with the READ_HEADER routine to handle that case.

To perform an ACCESS function on an extension header, the ACCESS routine will first take the serialization lock on the given file ID as if it were a primary header - it has no choice because it cannot tell yet. Then it calls READ_HEADER with an extra argument, an optional output from READ_HEADER. The extra argument tells READ_HEADER not to simply return SS$_NOSUCHFILE if it encounters a lockbasis mismatch, but rather to return what the real lockbasis for that extension header is, derived from the primary header backlink field noted above. The ACCESS routine then releases the incorrect serialization lock it had acquired, gets the right one based on what READ_HEADER told it, and tries again.
It must actually retry the READ_HEADER, of course, because until it has the correct serialization lock, what that header actually is could change out from underneath it.

2.7 Serializing Access To Shared Data Structures

Besides serializing access to file headers, the F11B$s locks also serialize access to the File Control Block, or FCB. This is a shared structure, and it must not be changed by some other process while the current process is unscheduled. The serialization lock works fine for this once you've found the FCB corresponding to the file being operated on. Access to buffers associated with a given file is also serialized by the F11B$s lock. This is not the same as locating the buffers in the cache, of course, and those mechanisms are discussed later.

The FCBs are in a doubly linked list off the VCB. To scan that list, the XQP raises IPL to SCHED to prevent rescheduling while it is scanning. There are currently no consistency bugchecks within the XQP to validate that an appropriate serialization lock is held when referencing an FCB.

The file extent and file number caches (pointed to by VCB$L_VCA) are similarly serialized by the F11B$v allocation lock. That is, access to those shared structures by multiple processes is controlled by the allocation lock.

2.8 CURR_LCKINDX versus PRIM_LCKINDX versus DIR_LCKINDX

DIR_LCKINDX records the index into the lock arrays (LB_BASIS and LB_LOCKID) for the parent directory of the operation. It is not in the context area saved for sub-operations.

PRIM_LCKINDX is the index corresponding to the lock on the primary file header.

CURR_LCKINDX is set by SERIAL_FILE to record the last index returned by SERIAL_FILE. Calling RELEASE_SERIAL_LOCK with a lock index equal to CURR_LCKINDX will zero CURR_LCKINDX.

Zero (the allocation lock) is not a valid value for these lock index variables.

READ_BLOCK always uses the CURR_LCKINDX when reading random file headers and blocks.
READ_HEADER also uses the CURR_LCKINDX value when checking for correct lockbasis.

These locks are not normally released until request cleanup time. However, routines that take out such a lock in secondary context, or that operate on the index file or similar files in primary context (moving EOF), must release their lock separately. DEALLOCATE_BAD and MARK_DELETE perform explicit writes of their modified buffers and explicitly release the lock (clearing PRIM_LCKINDX also).

Normally, PRIM_LCKINDX has the same value as CURR_LCKINDX. PRIM_LCKINDX is normally not itself referenced, although ERR_CLEANUP forces CURR_LCKINDX to equal PRIM_LCKINDX. There are several cases in which PRIM_LCKINDX and CURR_LCKINDX are not related.

In SEARCH_QUOTA, if it is necessary to serialize on the quota file (to rebuild stale FCBs for it), the CURR_LCKINDX value must be saved during the rebuild since it will refer to the quota file. Quota file operations (QUOTA_FILE_OP) run with the quota file serialization lock (as well as the allocation lock) using CURR_LCKINDX.

When advancing the index file EOF (not currently needed), CURR_LCKINDX will refer to the index file serialization lock itself during the header write. Likewise, while re-mapping the index file, CURR_LCKINDX will refer to the index file serialization lock.

In PROPAGATE_ATTR (in CREATE), attributes are being copied from one file to another. In this routine, executed in secondary context, PRIM_LCKINDX points to the file from which attributes are being copied. CURR_LCKINDX is saved across the OPEN_FILE call and is kept pointing to the target file, in case its headers must be re-read (when we go to find their buffers to write in attributes).

Entering the CREATE function, a serialization lock may be held from a previous ACCESS attempt (if this was a create-if access) and so any PRIM_LCKINDX lock is released. (ACCESS, like most routines, doesn't clean up after itself.)
The DELETE_FILE routine, when purging the buffers for the extension headers, fabricates a serialization lock on the extension header file ID as a basis for purging the buffers. This lock requires saving the value of CURR_LCKINDX.

In DIR_ACCESS, CURR_LCKINDX is saved while DIR_LCKINDX is being established (in a call to SERIAL_FILE).

FID_TO_SPEC releases the PRIM_LCKINDX lock to avoid synchronization deadlocks with processes walking down the hierarchy toward this file. The reference count is incremented on the FCB, though, to keep it alive. CURR_LCKINDX will refer to the various directories in the back link chain. PRIM_LCKINDX is re-determined when we return to the file after the search.

READ_WRITEVB will obtain a serialization lock (CURR_LCKINDX) on a file ID when it determines that a process is trying to directly write a file header.

SHUFFLE_DIR resets (in secondary context) CURR_LCKINDX to DIR_LCKINDX so that READ_BLOCK will work.

The lockbasis corresponding to PRIM_LCKINDX can be wrong if we try to access directly an extension file header. The code to correct PRIM_LCKINDX (to get the correct lockbasis and lock) is in ACCESS.

CHAPTER 3

XQP I/O BUFFER CACHING

The file system manages its I/O buffers as an LRU cache. The intent is to retain, in memory, the buffers corresponding to the disk blocks the file system has most recently referenced, and thus avoid actually moving the data from disk after it has been read from disk once. All XQP I/O is performed to the buffers in the cache.

There are two major problems faced by the XQP I/O buffer cache.

1. Providing a shared, system wide cache in a multi-threaded, procedure based environment.

2. Cluster wide validation/invalidation of buffer contents.

3.1 Shared, System-wide I/O Buffer Cache

Each node maintains a system-wide I/O buffer cache. The contents of the buffers are copies of the corresponding disk blocks.
This first section discusses the management of this cache on a single node. The next section discusses the mechanisms used to validate these buffers against operations performed by other nodes in a cluster.

3.1.1 Allocation And Initialization Of I/O Buffer Cache

A system wide (single node) I/O buffer cache is used by the XQP. The buffers are allocated from paged pool. This is done by the SETUP_BLOCKCACHE routine in the MOUNT module STACP. MOUNT qualifiers are used to control buffer cache creation.

By default, all mounted volumes share the same buffer cache that is allocated when the system disk is mounted during the boot process. The number of pages allocated for each of the buffer pools is taken from the active values of the SYSGEN parameters ACP_MAPCACHE (storage bitmap pool), ACP_DIRCACHE (directory and quota file data blocks), ACP_HDRCACHE (file headers and index file bitmap), and ACP_DINDXCACHE (directory index cache).

A separate I/O buffer cache can be specified by use of the MOUNT qualifier /PROCESSOR=UNIQUE. A specific I/O buffer cache can be specified by using the /PROCESSOR=SAME:mntdev qualifier, where "mntdev" is the name of an already mounted device.

If an attempt is made to allocate a separate cache, but the allocation fails (lack of enough contiguous space in paged pool), MOUNT will try to allocate a minimal size cache instead. If the minimal size cache can be allocated, the REDCACHE (reduced cache) message will be issued and the volume will be mounted successfully. If the minimal cache allocation attempt also fails, the mount fails with an error.

3.1.2 Finding The I/O Buffer Cache

The cache for a given mounted device is found by following the UCB$L_VCB pointer to the VCB, then the VCB$L_AQB pointer to the AQB, and finally the AQB$L_BUFCACHE pointer to the cache header. There is a single AQB for each buffer cache. However, multiple VCBs may (and usually do) point to a single AQB.
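The pointer chain just described can be modelled in a few lines of Python. This is only an illustrative sketch: the class and attribute names are stand-ins for the UCB$L_VCB, VCB$L_AQB and AQB$L_BUFCACHE fields, and the device and cache labels are made up.

```python
from dataclasses import dataclass


@dataclass
class CacheHeader:
    label: str              # illustrative only; not a real field


@dataclass
class AQB:
    bufcache: CacheHeader   # stands in for AQB$L_BUFCACHE


@dataclass
class VCB:
    aqb: AQB                # stands in for VCB$L_AQB


@dataclass
class UCB:
    vcb: VCB                # stands in for UCB$L_VCB


def find_cache(ucb: UCB) -> CacheHeader:
    """Follow UCB -> VCB -> AQB -> cache header."""
    return ucb.vcb.aqb.bufcache


# Two volumes mounted with the default shared cache, and one mounted
# with its own cache (as /PROCESSOR=UNIQUE would request).
shared_aqb = AQB(CacheHeader("default cache"))
disk_a = UCB(VCB(shared_aqb))
disk_b = UCB(VCB(shared_aqb))
disk_c = UCB(VCB(AQB(CacheHeader("unique cache"))))
```

Here `find_cache(disk_a)` and `find_cache(disk_b)` yield the same cache header object, because both VCBs point at one AQB, while `disk_c` resolves to a different one.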
3.1.3 Layout Of The I/O Buffer Cache

The I/O buffer cache itself consists of a fixed overhead area (F11BC structure), a variable size buffer descriptor array (BFRD structures), a variable size lock descriptor array (BFRL structures), a variable size buffer LBN hash table, a variable size lock basis hash table, and finally, an array of page aligned I/O buffers. Each area performs the following functions.

1. The fixed overhead area contains pointers to the variable areas that follow and their sizes. It also contains a number of queue headers discussed later.

2. The next area is an array of buffer descriptors, or BFRDs. These describe what disk block a given buffer belongs to (by LBN and UCB address), whether it is valid or modified or being used, and what type of buffer it is. Each BFRD also has an index to its associated BFRL, discussed later.

3. The BFRLs describe the locks associated with the buffers in the cache. They are discussed further on.

4. The buffer LBN hash table is an array of word indices into the BFRD array. It reduces the time required to determine whether a given LBN is already in the cache, compared with a linear search of all the descriptors. The hash function is a modulo function using the desired LBN and the size of the hash table in words. Overflows are handled by chaining through the BFRDs.

5. The lock basis hash table serves a similar function. It allows a relatively quick search of the BFRLs to determine if one already exists for a given lock basis. This is discussed later.

There are pointers in the fixed overhead area to the variable areas that follow it. There are as many BFRDs and BFRLs as buffers, so the size of those areas is directly proportional to the number of buffers in the cache. The buffer LBN and lock basis hash areas have a minimum of one word each per buffer.
The minimum size requirements are thus calculated by the SETUP_BLOCKCACHE routine with an extra page thrown in so we will have enough room to always page align the buffers themselves regardless of where the space is actually allocated in paged pool. Any extra room between the lock descriptors (BFRLs) and the start of the I/O buffers is split up between the two hash tables.

The entire overhead area and the buffers themselves are currently allocated as a single contiguous chunk of paged pool. However, the implementation allows for the descriptor area to be allocated separately from the buffers. The total overhead area is about 10 percent of the size of the buffers themselves.

3.1.4 Segregation Of Buffers Into Pools

The XQP divides all buffers in the cache into 3 pools for purposes of LRU replacement. They are:

1. Storage bitmap blocks and the Storage Control Block (SCB). These are all the data blocks mapped by the file [0,0]BITMAP.SYS.

2. Directory data blocks and data blocks of the [0,0]QUOTA.SYS file. This is the only pool that performs multi-block reads.

3. File headers and index file bitmap blocks. These are all data blocks mapped by the [0,0]INDEXF.SYS file.

In addition there is a fourth pool of pages used by the directory index caching mechanism. These pages are not I/O buffers, but are managed by the buffer caching routines because they provide the necessary cluster validation. This will be discussed later.

3.1.5 Buffer Replacement

The replacement algorithm for buffers is Least Recently Used (LRU). When the desired disk block cannot be found in the cache, the oldest buffer is tossed out and replaced with the desired block. This is accomplished by linking all BFRDs for a given pool onto a queue header for that pool. F11BC$Q_POOL_LRU is a vector of four queue headers for that purpose.
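The per-pool LRU replacement described above can be sketched in Python as follows. This is a simplified model, not the XQP's implementation: the class name, the (ucb, lbn) key, and the `read_from_disk` callback are all assumptions made for illustration, and the sketch ignores in-process lists, dirty buffers and cluster validation.

```python
from collections import OrderedDict

POOL_COUNT = 4  # bitmap, directory/quota data, headers/index bitmap, directory index


class PoolLru:
    """One LRU queue per pool, loosely modelling the F11BC$Q_POOL_LRU vector."""

    def __init__(self, buffers_per_pool):
        self.capacity = list(buffers_per_pool)
        # Each queue maps (ucb, lbn) -> block contents, oldest entry first.
        self.queues = [OrderedDict() for _ in range(POOL_COUNT)]

    def read_block(self, pool, ucb, lbn, read_from_disk):
        queue = self.queues[pool]
        key = (ucb, lbn)
        if key in queue:
            queue.move_to_end(key)          # hit: becomes most recently used
            return queue[key]
        if len(queue) >= self.capacity[pool]:
            queue.popitem(last=False)       # miss on a full pool: toss the oldest
        queue[key] = read_from_disk(ucb, lbn)
        return queue[key]
```

Keeping a separate queue per pool means that heavy traffic on one kind of block, say directory data, can never push file headers or bitmap blocks out of the cache.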
Since the buffer manager can release a buffer at any time (only if you ask it to read something, of course), it is possible for local variables (and globals such as FILE_HEADER) to no longer point to the buffer desired. If it is necessary to read a set of blocks, it may therefore be necessary to re-ask for the original block afterwards.

3.1.6 Serialization Of Cache Manipulation

Changing the state of the cache descriptors in the overhead area must be done atomically, i.e., any process needing to use it must always see a consistent picture. Searching or manipulating the cache must therefore be serialized. This only needs to be done for the processes on a given node, however, not across an entire cluster. The lock manager is therefore not required, and the function can be performed faster without it in this restricted case.

There are two routines, SERIAL_CACHE and RELEASE_CACHE, that acquire and release the cache interlock. These routines are called by the other routines in the RDBLOK module.

SERIAL_CACHE queues the IRP for the current function onto the queue header of the AQB. If it is at the head of the queue, it returns from that routine and its caller may proceed. If it is not at the head of the queue, the process puts itself to sleep until it is at the head of the queue.

The RELEASE_CACHE routine removes the IRP from the head of the queue. If another element remains, i.e., is now at the head, RELEASE_CACHE queues an AST to that process so that it will proceed. The CDRP area of the IRP is used as an ACB, just as it is used to start the whole AST thread in the first place, as discussed in an earlier section.

The cache serialization interlock is only held while searching the cache or changing the state of the buffer descriptors. It is never held when the XQP must stall for I/O, or anything else for that matter.
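The queue discipline used by SERIAL_CACHE and RELEASE_CACHE can be modelled with a simple FIFO. The Python sketch below is a hypothetical single-threaded model: sleeping is represented by a False return value and the wake-up AST by an entry appended to a list; none of these names come from the XQP sources.

```python
from collections import deque


class CacheInterlock:
    """FIFO interlock: whoever is at the head of the queue holds the cache."""

    def __init__(self):
        self.queue = deque()    # stands in for the IRP queue on the AQB
        self.awakened = []      # requests that would have been sent an AST

    def serial_cache(self, irp):
        """Queue the request; True means it is at the head and may proceed."""
        self.queue.append(irp)
        return self.queue[0] == irp

    def release_cache(self, irp):
        """Remove the holder from the head and wake the next waiter, if any."""
        if not self.queue or self.queue[0] != irp:
            raise RuntimeError("only the request at the head may release")
        self.queue.popleft()
        if self.queue:
            self.awakened.append(self.queue[0])   # stand-in for queuing an AST


lock = CacheInterlock()
first_may_run = lock.serial_cache("irp-A")    # at the head: proceeds
second_may_run = lock.serial_cache("irp-B")   # behind irp-A: would sleep
lock.release_cache("irp-A")                   # irp-B is now at the head
```

Because waiters are served strictly in arrival order, no request can be starved by later arrivals, and the holder never needs the lock manager for this node-local interlock.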
3.1.7 Reserving Buffers

Before a given file system operation is allowed to use any buffers in the cache, it must first "reserve" the minimum number of buffers required to perform the operation. This is done by maintaining counters (BFR_CREDITS, one for each pool) in the fixed overhead area that represent the number of buffers currently reserved by concurrent file system activity. This is the F11BC$L_POOLAVAIL vector.

The GET_REQD_BFR_CREDITS routine performs this function. The currently required buffer credits are: 1 bitmap block buffer, 2 directory data block buffers, 3 file header buffers, 1 directory index buffer. CACHE_HDR and AQB are initialized here.

This routine will stall a process until enough buffers are available. The F11BC$Q_POOL_WAITQ vector has listheads for each pool for the IRPs to be queued on while they wait. The RETURN_CREDITS routine will send them an AST when buffer credits are returned.

The reason for reserving buffers in advance is that deadlocks could result if a partially completed operation already held buffers that in turn were required by another process, which was itself waiting for those buffers before releasing its own.

The obtaining of credits is done under the cache interlock (routines SERIAL_CACHE and RELEASE_CACHE).

RETURN_CREDITS returns the buffer credits to the free pool counts, under the cache interlock and only if the buffers are not in use. It will wake up some process if there is a process on the pool wait queue or the ambiguity queue (discussed below) by adding such a process after our entry on the cache interlock queue. This causes them to be awakened when we release the cache interlock.

3.1.8 Free Versus In-process Buffers

When all is quiet, that is, there is no file system activity, the four values in the F11BC$L_POOLAVAIL vector will equal the four values in the F11BC$W_POOLCNT (pool counts). In addition, all buffers for a given pool will be linked onto their respective F11BC$Q_POOL_LRU queue header.
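The credit accounting above, including the idle-state invariant that the available counts equal the pool counts, can be sketched in Python. This is a simplified model under stated assumptions: the class and method names are hypothetical, a single waiter list replaces the per-pool wait queues, and a request either gets its minimum credits in all four pools or waits.

```python
# Minimum credits per pool, in the order given in the text:
# bitmap, directory data, file header, directory index.
REQUIRED_CREDITS = (1, 2, 3, 1)


class CreditPools:
    def __init__(self, pool_counts):
        self.pool_count = list(pool_counts)   # models F11BC$W_POOLCNT
        self.pool_avail = list(pool_counts)   # models F11BC$L_POOLAVAIL
        self.waiters = []                     # simplified stand-in for POOL_WAITQ

    def get_required_credits(self, request):
        """Reserve the minimum credits, or queue the request and return False."""
        short = any(avail < need
                    for avail, need in zip(self.pool_avail, REQUIRED_CREDITS))
        if short:
            self.waiters.append(request)      # the real routine stalls the process
            return False
        for pool, need in enumerate(REQUIRED_CREDITS):
            self.pool_avail[pool] -= need
        return True

    def return_credits(self):
        """Give back one request's credits; return a waiter to wake, if any."""
        for pool, need in enumerate(REQUIRED_CREDITS):
            self.pool_avail[pool] += need
        return self.waiters.pop(0) if self.waiters else None
```

With a file header pool of 6 buffers, two requests can hold their 3 header credits at once and a third must wait until one of them returns its credits. Once every request has returned its credits, `pool_avail` again equals `pool_count`.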
When a buffer is being used by a particular process during an operation, it is removed from the POOL_LRU queue, and inserted onto a per process BFR_LIST queue. The BFR_LIST structure is itself a vector of queue headers, one for each pool. Each process also has four element BFR_CREDITS and BFRS_USED vectors representing the number of buffers reserved and the number actually in use. The number of BFRDs on each queue header in BFR_LIST must always correspond to the BFRS_USED value. When a BFRD is on the BFR_LIST queue, the BFRD$W_CURPID field will contain the internal PID index for that process.

3.1.9 Extending Buffer Credits During An Operation

As mentioned earlier, a minimal number of buffers must be reserved before any operation is allowed to proceed; in fact, before any operation is allowed to hold any locks. For example, 3 buffers from the file header pool are always reserved. If there were only 6 buffers in the file header pool (ACP_HDRCACHE sysgen parameter), only two processes would be allowed to proceed concurrently. Until one of them completes, another process coming along will be stalled.

In that situation, if a file with 4 headers is being accessed, the process will have to discard the first header read from its BFR_LIST and re-use that buffer to read the fourth header. The FREE_ONE routine in RDBLOK will do this. The BFR_LIST itself is managed LRU so that the oldest buffer gets tossed. All callers of the READ_BLOCK routine must be prepared for this possibility.

However, if there are more than 6 unreserved buffers in a given pool, additional buffer credits will be extended to a process to avoid invalidating a recently accessed buffer, as the above example did. This is done by decrementing POOLAVAIL and incrementing BFR_CREDITS when the additional buffer is desired, subject to POOLAVAIL being greater than or equal to 6. The number six is somewhat arbitrary.
The intent is to preserve a certain amount of concurrency under all conditions.

3.1.10 The Ambiguity Queue

The queue header F11BC$Q_AMBIGQFL is the ambiguity queue. As mentioned earlier in the serialization discussion, it is possible to serialize on the wrong lock basis when attempting to access an extension header directly. If that same extension header is concurrently being accessed by another process as an extension header, using the correct lockbasis, it is possible for one of those processes to locate the buffer as being "in use" by the other process in the cache. Normal file number serialization usually makes this impossible, and, except for the specific case of file headers, would cause an XQPERR bugcheck.

However, in this case, the process will put itself on the ambiguity queue and go to sleep. This is done in the RESOLVE_AMBIGUITY routine, which queues the IRP onto the F11BC$L_AMBIGQFL queue. When the next operation completes, it will be awakened and look again. This is done by the RETURN_CREDITS routine. This can happen any number of times until the ambiguity is resolved.

FIND_BUFFER detects the ambiguity case (when it finds a buffer in use in some other process). WRONG_LOCKBASIS and RETURN_CREDITS check the ambiguity queue for processes to waken.

3.1.11 Multi-block Disk Reads

The directory and quota file data block pool allows multi-block reads. A contiguous group of buffers will be assembled in the FIND_BUFFER routine to be used in a single multi-block QIO when the desired buffer was not already in the cache and the caller requested it. Directory and quota file processing will request it. The number of buffers assembled is limited by the sysgen parameter ACP_MAXREAD.

The starting point for the contiguous assembly is the BFRD pulled from the POOL_LRU list. We first try to assemble adjacent BFRDs in ascending memory sequence.
If we bump into the end of the pool, we attempt to proceed from the starting BFRD in descending memory sequence. If any BFRD is already in use (BFRD$W_CURPID non-zero), we quit. If the LBN we intend to read into that BFRD is already in the cache somewhere, we quit. If we exceed our buffer credits and no further credits are extended, we quit.

3.1.12 Disk Writes

All writing to disk (except for normal virtual write functions and erase functions) is performed by WRITE_BLOCK (in RDBLOK) or by WRITE_HEADER (also in RDBLOK), which performs a checksum first. Buffers can be explicitly written in this way.

WRITE_BLOCK is invoked automatically when it is necessary to remove a buffer from the in-process list (dirty buffers must only be on the in-process list).

WRITE_DIRTY can be called to write out all dirty buffers associated with a lockbasis (0 implies write all buffers). TOSS_CACHE_DATA will do the same given a lock array index, except that it also invalidates the cache buffers. This is done when closing a file opened using OPEN_FILE.

Most operations that modify buffers will simply mark them as dirty and allow CLEANUP to write them (WRITE_DIRTY (0)). There are various exceptions.

ERR_CLEANUP force writes the current directory buffer when it performs a re-enter function.

CREATE_HEADER force writes the index file header when advancing the EOF (not currently ever done).

CREATE_HEADER force writes blocks of the index file bitmap when filling the FID cache. DELETE_FID performs likewise when returning FIDs to the index file bitmap.

DEALLOCATE_BAD force writes (WRITE_DIRTY (lockbasis)) the modified file headers itself.

SCAN_BADLOG will force write the BADLOG file header when extending its header.

MARK_DELETE force writes the updated (marked as deleted or actually deleted) headers out to disk. DELETE_FILE does likewise.

WRITE_AUDIT performs a WRITE_DIRTY given the primary lockbasis before doing the FID_TO_SPEC translation which will release the lockbasis.
EXTEND_CONTIG force writes data blocks as it copies them to the new extended contiguous file. The new header is force written. Likewise, SHUFFLE_DIR force writes directory blocks during its copy.

TRUNCATE force writes the file header with the map pointers truncated so that it guarantees that the header is updated before the storage map shows the blocks as free.

3.1.13 System Wide Buffer State

Buffers are either in the system list, possibly marked as validly containing the data described from disk, or in an in-process list, again possibly valid and also possibly marked dirty (modified, not yet written to disk).

READ_BLOCK takes a buffer and moves it from the system list to the in-process list. In the process, READ_BLOCK will read the block if the buffer descriptor describes it as invalid (refer also to cluster wide cache validation, later).

CREATE_BLOCK does the same, except that it is called when it is known that the disk block's contents are meaningless (such as for a block within a new file extension). Here, CREATE_BLOCK simply zeros the block returned. It is marked as dirty and valid. (An exception is when the desired LBN is -1, indicating that we simply want a free scratch buffer. This is done in SHUFFLE_DIR, where spare blocks are needed to hold directory entries being moved. The correct LBN for the buffers is established with RESET_LBN.)

The buffers are moved back to the system list via RELEASE_LOCKBASIS, performed by RELEASE_SERIAL_LOCK (see below).

3.1.13.1 INVALIDATE - INVALIDATE will move a buffer to the front of the in-process LRU list and mark it as not valid (and not dirty). Several operations do this.

The various readers and writers of the SCB (DISMOUNT in ACPCNTRL) do this to make sure that the SCB is not cached for a shadow set (since mount verification writes asynchronously to the SCB).

CREATE_HEADER calls INVALIDATE upon a file header if it finds it doesn't want to use it.
This helps avoid confusion if the header is found in the cache when it shouldn't be. READ_HEADER also invalidates the buffers of headers that it finds to be invalid.

When reading a new header (CREATE_HEADER), if the read fails, we want to test to see if we can read/write the block. So, something is written (WRITE_BLOCK), the buffer is invalidated, and a READ_BLOCK is done again.

SHUFFLE_DIR will do an invalidate on a buffer being squished out.

MARK_DELETE reads, as a data block, the first block of a directory being deleted to make sure it is empty. The buffer read is invalidated since it will not be needed.

3.1.13.2 RESET_LBN - RESET_LBN changes the LBN recorded with a buffer.

When modifying the index file header (CREATE_HEADER), the LBN is changed to reflect the alternate index file header so that a new WRITE_BLOCK will get it. INVALIDATE is called immediately thereafter to avoid screwing up.

A similar operation is performed when reading the index file header (READ_IDX_HEADER). If the file size from the header is incorrect, the alternate index file header is read by doing a READ_BLOCK of the alternate index file header and performing a RESET_LBN on the buffer if that succeeds. (If it fails, the buffer is simply invalidated and we punt.) EXTEND_INDEX will also do this when actually extending the index file.

RESET_LBN is used when copying a contiguous file when it is extended. READ_BLOCK reads the old file as data blocks, RESET_LBN is called, and the blocks are explicitly written. They can remain in the cache since the LBN recorded does match their new location. This operation is also done when compressing a directory.

3.1.13.3 KILL_CACHE - KILL_CACHE invalidates all buffers associated with a particular UCB. Buffers in the system list are purged, buffers in our process list are marked invalid, buffers in other process lists are left alone.

KILL_CACHE is called to flush the cache of any buffers when the volume is being dismounted or is flagged as nocache (CLEANUP).
3.1.13.4 KILL_BUFFERS - KILL_BUFFERS performs the same function as KILL_CACHE except that it takes a pool number and a lockbasis (directory data and directory index pools only) and works against CURRENT_UCB.

KILL_BUFFERS is called to flush directory blocks when we find that a directory is write accessed (CLEANUP). This is done when not in a cluster, since when in a cluster the sequence numbers associated with the serialization locks will protect these buffers (refer to cluster validation below).

KILL_BUFFERS is also called when deleting a directory, to flush its data blocks out of the cache. Also, turning off the directory bit for a directory flushes directory blocks.

The special file write virtual function (READ_WRITEVB) performs a KILL_BUFFERS when the user writes to the index file, bitmap file, quota file or a directory. (In a cluster, this is done through the buffer sequence numbers, described below.)

An explicit WRITE_DIRTY (-1) is performed when de-accessing the quota file (QUOTAUTIL) to write out the quota file blocks (after clearing quota cache). KILL_BUFFERS (1, -1) purges the buffers associated with the quota file (data blocks).

3.2 Cluster Wide Buffer Validation

In a cluster, there is a separate I/O buffer cache on each processor. The contents of a given buffer in the I/O cache on a specific processor will become stale if that disk block is modified by the file system on another processor. Because each buffer corresponds to some on-disk structure the file system manipulates, the reading and writing of those buffers must be serialized by one of the serialization locks discussed in the previous section. Those locks, and their value blocks, are the key to cluster wide buffer validation.

3.2.1 Use Of Value Blocks

The basic scheme is to maintain a sequence number in the value block of serialization locks (which are associated with specific buffers in the cache).
This sequence number is incremented whenever any buffer associated with that lock is modified. All buffers associated with a given lock in a given cache retain a copy of the sequence number as of the last time those buffers were used and valid. When a buffer is subsequently found in the cache by a later operation, the retained sequence number is compared to the current value from the serialization lock. If they match, no other processor has modified the associated disk block, and hence the contents of the cached buffer are valid. If the retained sequence number and the current sequence number do not match, the contents of the cached buffer are stale, and it must be refreshed by reading the current contents from disk.

Different parts of different value blocks are used to validate different buffers. The following buffers are validated by the F11B$s file number serialization lock:

o File headers are validated by the FC_HDRSEQ field in the F11B$s lock for that file. Note that a single sequence number is used to validate all headers for a given file, therefore modifying just the primary header would cause all cached headers for that file elsewhere to become invalid.

o Directory data blocks are validated by the FC_DATASEQ field in the F11B$s lock for a given directory file. Same comment as above - all data blocks are validated by a single sequence number. Note, however, that a directory file header and its data blocks are validated by different parts of the same value block, hence can be independently modified without invalidating each other. Data blocks of any file opened by the internal OPEN_FILE routine are also validated by this field.

The actual fields in the value blocks are only referenced by the SERIAL_FILE and RELEASE_SERIAL_FILE routines. They are referenced elsewhere by the LB_HDRSEQ and LB_DATASEQ vectors, indexed by the lock index returned by SERIAL_FILE.
(The LB_FILESIZE values, also corresponding to value block fields, are not used.)

The following buffers are validated by the F11B$v allocation lock:

o Storage bitmap blocks (BITMAP.SYS data blocks) use the low word of the VC_SEQNUM field.

o Index file bitmap blocks use the high word of the VC_SEQNUM field.

o Quota file data blocks use bits 1 through 15 of the VC_FLAGS field.

The validation for buffers found in the cache is done by the FIND_BUFFER routine in the RDBLOK module. Modification of the sequence numbers is done by the WRITE_BLOCK routine. This is the only routine that writes modified buffers to disk.

Note that when a node fails, the very latest copy of the value block may be lost. If that is possibly the case, the lock manager will return an SS$_VALNOTVALID warning status on $ENQ operations requesting the value block. The SERIAL_FILE and ALLOCATION_LOCK routines check for this status and increment all of the sequence number fields in the value block to force a cache miss if that happens. The SS$_VALNOTVALID condition is cleared by rewriting the value block.

3.2.2 Volume Status Value Block Fields

The allocation lock value block also contains the fields IBMAPVBN, SBMAPVBN, VOLFREE and IDXFILEOF. These fields are used as follows.

3.2.2.1 Free Volume Block Count - VOLFREE is passed around so that the last node to update the volume free block count can reflect that to other nodes. Note that this must be considered only approximate, since a node may crash holding the value block, thereby not reflecting the true last value. The description that follows assumes that VOLFREE is good.

EXTEND_INDEX uses the volume free value in its algorithm to estimate the number of files likely to yet be created on the volume, when deciding how much to extend the index file. Likewise, this figure is used by SELECT_VOLUME to pick a likely victim for a file. The free figure is used when deciding how many blocks to record in the extent cache (SMALOC).
When a volume is unlocked, the unlocking node (which must also have been the locking node) has the only good notion of free space. So, LOCK_VOLUME saves the free space figure from the VCB and writes it back into the VCB under the allocation lock. (Acquiring the allocation lock will update the volume free count from some random value block.)

3.2.2.2 Index Map VBN - The index map VBN is maintained (FILL_FID_CACHE and REMOVE_FILE_NUM) in the VCB as a starting point for header allocation (CREATE_HEADER). This value is used since it reflects the last point of interest in the index file map, a likely place to look for new headers.

When allocating file headers, the value is incremented to the index file map block from which we succeed in performing an allocation. If we return to the map (from the FID cache) a FID below this value, the value is decremented so that other nodes filling their FID caches will start from here. (Refer to the FID cache in the cache chapter.)

3.2.2.3 Storage Map VBN - In a similar manner to the index map VBN, the storage map VBN is kept to record a starting point of interest in the storage map. It is updated only during storage map allocations. If the desired blocks are not found from this point to the end of the map, a scan is started from the beginning. As such, this value may be reset to a lower value if the desired blocks are found lower in the map.

3.2.2.4 Index File EOF - The index file EOF (set in CREATE_HEADER and EXTEND_INDEX) is passed around as an obvious limit to the header search.

3.2.3 Associating Locks With Buffers

The lock manager maintains two structures for a given lock. One is the resource block, which contains the resource name and the value block. The other is the lock block, which represents a specific lock on that resource. The resource and lock blocks are created when the first lock is taken out on a given resource name. The resource block disappears when the last lock is de-queued.
The locks used to serialize access to a given disk block are used to validate a cached copy of that disk block in a cluster. These are the F11B$v and F11B$s locks discussed earlier. However, the F11B$s locks are normally de-queued at the end of an operation. This means the resource block would be de-allocated and we would lose the value block. Therefore, if a buffer is to remain in the cache, we must keep a lock on it.

The buffers in the cache really belong to the system, not to any given process. Therefore, the concept of a "system owned" lock was invented. This allows a granted lock to be converted such that it is no longer associated with a given process. When a buffer is in the cache it must have a NL mode, system owned lock associated with it. The BFRL structure is used to keep track of those locks. Multiple buffers may have the same lock basis, and hence many BFRDs may point to the same BFRL. The BFRL contains a reference count of BFRDs so we know when the lock can be completely de-queued.

A problem with this is that the sequence number is transmitted from node to node via value blocks associated with the lock backing a set of buffers. Since many buffers may be associated with a given BFRL (for example, there might be 10 storage bitmap blocks for the same volume in the cache), it is necessary for them all to have their sequence numbers updated in sync. If one of them is modified by another operation on that node, the BFRD$L_SEQNUM field will be updated, as well as the appropriate value block field. However, a subsequent operation that references a different bitmap block in the cache will get a mismatch on the sequence numbers because it will be comparing the value block field from the last operation against the sequence number it got when they were all brought in the first time. It really is valid because no one has modified it, but we cannot tell that.
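The validation rule, and the false miss just described, can be illustrated with a minimal sketch. The names below (`LockBasis`, `Buffer`, `is_valid`, `modify`) are hypothetical and only model the comparison of a retained sequence number against the current value block field; they are not XQP routines, and the real structures carry much more state.

```python
from dataclasses import dataclass


@dataclass
class LockBasis:
    """Stands in for a BFRL: one lock whose value block carries a sequence number."""
    seqnum: int = 0


@dataclass
class Buffer:
    """Stands in for a BFRD: remembers the sequence number seen when last valid."""
    lbn: int
    basis: LockBasis
    seqnum: int = 0


def is_valid(buf: Buffer) -> bool:
    # A cached buffer is trusted only if its retained number matches the lock's.
    return buf.seqnum == buf.basis.seqnum


def modify(buf: Buffer) -> None:
    # Modifying a buffer bumps the shared number but refreshes this buffer only.
    buf.basis.seqnum += 1
    buf.seqnum = buf.basis.seqnum


bitmap_lock = LockBasis()
block_a = Buffer(lbn=100, basis=bitmap_lock)
block_b = Buffer(lbn=101, basis=bitmap_lock)

modify(block_a)
print(is_valid(block_a))  # the modified block still matches
print(is_valid(block_b))  # the untouched block now mismatches
```

In this model the untouched block fails validation even though its contents have not changed, which is why all buffers sharing a lock basis need their retained numbers updated together.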
Releasing the serialization locks and potentially converting them to system owned is done together by the RELEASE_SERIAL_LOCK and RELEASE_LOCKBASIS routines. RELEASE_SERIAL_LOCK calls RELEASE_LOCKBASIS, which then scans the in-process list of buffers searching for a given lock and associating a BFRL with them if one does not already exist. If a new BFRL has been created, RELEASE_LOCKBASIS returns with a status causing RELEASE_SERIAL_LOCK to convert the serialization lock to system owned and store the lock id in the BFRL; otherwise the serialization lock is simply de-queued.

The cache serialization interlock is held during this scan, and we cannot stall while doing so. For that reason, all modified buffers must have been written prior to calling RELEASE_SERIAL_LOCK. Explicit calls to WRITE_BLOCK for individual buffers, or to the WRITE_DIRTY routine to scan the lists, will accomplish that.

All of this stuff with locks is necessary for the cache to work in a cluster. For a non-cluster system, there are no locks associated with the buffers, as they are not necessary.

Since the allocation lock backs up storage bitmap and index file map blocks, they must be written out and released before the allocation lock can be released. ALLOCATION_UNLOCK performs this function.

Note that the DELETE_FILE routine, when purging the buffers for the extension headers, fabricates a serial lock on the extension header file ID as a basis for purging the buffers. This buffer purging is done here, instead of waiting for cleanup, to avoid someone picking up these file IDs as primary headers later and getting our buffers.

3.3 The Directory Index Cache Pool

The directory index cache is the fourth pool in the buffer cache. Its blocks are not buffers, though, but rather an index into a given directory file, constructed on the fly as a given directory is processed.
The directory index cache is managed by the buffer cache code because it essentially has the same cluster wide content validation problems that buffers do.

A directory index block has a small header area followed by about 30 cells of 15 bytes each. These represent the highest record found in the corresponding directory data block. This allows every block to have an entry for directory files smaller than 30 blocks, every other block for directories between 30 and 60, etc. The V4 implementation fixes the cell size at 15 characters and limits it to 1 page. 15 bytes was picked because MAIL$800.... files that are about a day apart in creation vary in the fourteenth or fifteenth character.

Instead of being located by hashing on LBN, a directory index block is pointed to by the directory FCB, the FCB$L_DIRINDX cell. BFRD$L_LBN points back to the directory FCB.

The routine MAKE_DIRINDX in RDBLOK is called from DIR_ACCESS in DIRACC to validate a directory FCB. If the FCB has no corresponding DIRINDX block, one is removed from the list for the directory index pool and linked to the FCB. Otherwise, the block is used. The block is validated from the LB_HDRSEQ and LB_DATASEQ values for the directory. Only one directory index block is allowed to be used by the process (since the XQP only works against a single parent directory in an operation).

KILL_DINDX breaks the linkage between the FCB and the directory index block. ERR_CLEANUP will call KILL_DINDX when it needs to delete a directory with a corresponding directory index block. Likewise, when MARK_DELETE goes to delete a file (reference count hits 0), it will call KILL_DINDX before purging the FCBs. Unhooking the buffer descriptor for a directory index block (done when it pops to the top of the LRU list and is being used for a new directory, or in KILL_BUFFERS or KILL_CACHE, or in KILL_DINDX) will also break the link between it and its FCB.
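The cell spacing rule described above (every block for small directories, every other block for the next size range, and so on) can be worked through numerically. This is a sketch under stated assumptions: it takes exactly 30 cells per index block where the text says "about 30", and `index_stride` and `cell_key` are hypothetical helper names, not XQP routines.

```python
import math

CELLS_PER_INDEX_BLOCK = 30  # assumption: the text only says "about 30"
CELL_SIZE = 15              # bytes per cell, as stated in the text


def index_stride(directory_blocks: int) -> int:
    """Number of directory data blocks represented by each index cell."""
    if directory_blocks <= 0:
        return 1
    return math.ceil(directory_blocks / CELLS_PER_INDEX_BLOCK)


def cell_key(highest_record_name: str) -> str:
    """Truncate the highest record name of a block to the fixed cell size."""
    return highest_record_name[:CELL_SIZE]


for blocks in (10, 45, 61):
    print(blocks, index_stride(blocks))
```

With these assumptions a 10 block directory gets one cell per block, a 45 block directory one cell per two blocks, and a 61 block directory one cell per three blocks.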
SET_DIRINDX (called from CLEANUP, and also CLOSE_FILE) tries to keep around FCBs for popular directories and keep the association of the FCB to the directory index block. If the caller of SET_DIRINDX finds that the reference count for the FCB goes to zero, they would normally delete the FCB chain. If, however, there is a directory index block lying around, FCB$V_DIR is set (at IPL$_SCHED). SEARCH_FCB will notice (at IPL$_SCHED) whether an FCB has been left lying around for this reason.

Unhooking a directory index block must check for this case, and de-allocate the FCB chain. UNHOOK_BFRD clears the FCB$L_DIRINDX value first, so that a SEARCH_FCB will not find the FCB lying around by virtue of a directory index block. We check to see if a SEARCH_FCB did find the FCB prior to clearing FCB$L_DIRINDX by virtue of the fact that SEARCH_FCB cleared the FCB$V_DIR bit (also at IPL$_SCHED).

DIR_ACCESS requests the creation of a directory index block. If it finds a valid one, it also knows it has valid FCBs (due to the checks in MAKE_DIRINDX). Otherwise, it reads the FCBs and calls MAKE_DIRINDX for real. If the directory is not really a directory, KILL_DINDX is called to get rid of the bogus directory index block. Likewise, turning off the directory bit (WRITE_ATTRIB) will also call KILL_DINDX.

The directory index block is built by UPDATE_INDX (called by ENTER and DIR_SCAN) as they walk down the directory. DIR_SCAN then uses the block to save work on subsequent scans.

The routine ZERO_IDX (CLENUP) is called by SHUFFLE_DIR when the directory's header is to be updated. This routine increments the FCB$W_DIRSEQ and also sets the corresponding INUSE value in the directory index block to zero, since the block layout is now different. (FCB$W_DIRSEQ is updated when a direct access of a directory occurs, when SHUFFLE_DIR must change the directory header, and after an access to a directory that does not locate a valid directory index block.)
3.4 Invalidation Of Cached Buffers By Users

Because all volume structures that the file system uses to maintain the volume are themselves files, it is possible for random users to access those files and read and write the blocks in them. For example, anyone with write access to a directory can open the directory file and write junk into its data blocks. Or, with a little privilege, open BITMAP.SYS and rewrite it. A disk rebuild does this, for example. The problem here is how to invalidate cached copies of those blocks that may be in the I/O buffer cache.

The solution is to trap all write virtual requests in the file system. This is done by setting a flag, WCB$V_WRITE_TURN, in the WCB of any write accessed directory file, INDEXF.SYS, or BITMAP.SYS. This is done when the file is accessed, by either the ACCESS routine for directories or the MAKE_ACCESS routines for the others. Whenever a write virtual function is performed on one of those files, the QIO FDT routine forces a window turn. In the READ_WRITEVB routine, the XQP will take out the appropriate serialization lock (if it doesn't already have it) that would validate that buffer and increment the appropriate field in the value block. This really only works correctly if the volume blocking lock is held by the process doing this. There are all sorts of race conditions because the user just is not synchronized in any real way with the file system. To do it right, you would have to introduce locking semantics on read virtual that the file system would respect. I'm pretty sure the QUOTA.SYS file is not even handled this well.

If the node is not in a cluster, the KILL_BUFFERS routine is called by READ_WRITEVB to scan the cache and invalidate the correct buffers.

When a process initially accesses one of these files for writing, the appropriate cache is flushed under the allocation lock. The cache write lock is taken out on the cache (refer to the chapter on caches).
This lock will be released in MAKE_DEACCESS.

CHAPTER 4 ACCESS ARBITRATION

Access arbitration is when you open a file saying "open for write, disallow writers" and when someone else comes along and tries to open it for write, they fail.

For a given node, access arbitration is handled with counters in the FCB. The FCB maintains counters for total accessors, total readers and total writers (on this node).

4.1 Access Locks

For a cluster, locks are used to control access. The routine ARBITRATE_ACCESS first arbitrates the desired access against any pre-existing accesses on that node (reflected in the FCB). If those checks fail, there is no point in checking further. If they succeed, and the system is in a cluster (and the device is cluster accessible), either the NEW_ACCESS_LOCK or the CONV_ACCLOCK routines (both in LOCKERS) will be called, depending on whether an access lock already existed or not.

The access lock is a root lock of the form F11B$a This is a system owned lock. The follows the rules for F11B$v locks, and file number is like the same part in F11B$s locks.

The various access and sharing combinations map into lock modes as follows:

o LCK$K_EXMODE - Read/write, disallow read/write

o LCK$K_PWMODE - Read/write, disallow write

o LCK$K_PRMODE - Read, disallow write

o LCK$K_CWMODE - Read/write, allow read/write

o LCK$K_CRMODE - Read, allow read/write

o LCK$K_NLMODE - ignore whatever anyone else says.

The current access lock mode is stored in FCB$B_ACCLKMODE with the lock id in FCB$L_ACCLKID. Whenever a process comes along whose access is compatible with accessors on that node but whose access requires a higher lock, ARBITRATE_ACCESS is called to raise that node's lock.

CONV_ACCLOCK will lower the lock back to the supplied (previous) value. If the FCB reference count is zero, CONV_ACCLOCK will de-queue the access lock outright, since no one is left on this node who needs it.
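The mapping from access and sharing requests to lock modes listed above can be written as a small lookup, paired with the usual lock manager mode compatibility table to show why incompatible opens on different nodes fail. This is a sketch only: `lock_mode` and `can_coexist` are hypothetical names, the tuple encoding of a request is an assumption made for illustration, and combinations the text does not list are deliberately left unmapped.

```python
# (wants_write, others_may_read, others_may_write) -> lock mode, per the list above.
ACCESS_TO_MODE = {
    (True, False, False): "EX",  # read/write, disallow read/write
    (True, True, False): "PW",   # read/write, disallow write
    (False, True, False): "PR",  # read, disallow write
    (True, True, True): "CW",    # read/write, allow read/write
    (False, True, True): "CR",   # read, allow read/write
}

# Modes that may be granted alongside each held mode.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def lock_mode(wants_write, others_may_read, others_may_write, ignore=False):
    """Pick the access lock mode for one request; NL means 'ignore interlocks'."""
    if ignore:
        return "NL"
    return ACCESS_TO_MODE[(wants_write, others_may_read, others_may_write)]


def can_coexist(held, requested):
    """True if `requested` can be granted while another node holds `held`."""
    return requested in COMPATIBLE[held]


writer = lock_mode(True, True, False)   # open for write, disallow writers
second = lock_mode(True, True, True)    # another writer that allows sharing
print(writer, second, can_coexist(writer, second))
```

Here the first opener holds PW, and the second writer's CW request is incompatible with it, matching the "open for write, disallow writers" example at the start of the chapter.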
The MAKE_DEACCESS routine in CLENUP converts the lock if the process is de-accessing the file. Using the updated reference counts in the FCB, a new ACCTL value is determined which LOCK_MODE can use to determine the new lock value. If this lock value is lower than the current node lock (or the node's FCB reference count goes to zero), CONV_ACCLOCK is called. DEACC_QFILE (QUOTAUTIL) performs a similar computation when de-accessing the (node wide) quota file.

NUKE_HEAD_FCB, called by CLEANUP, MARK_DELETE, CLOSE_FILE and UNHOOK_BFRD (when deleting a directory FCB lying around by virtue of its directory index block) will also request a CONV_ACCLOCK to NL mode for the purpose of possibly writing out the value block (discussed later). CONV_ACCLOCK will de-queue the lock given that the reference count for the FCB is zero.

The normal call to ARBITRATE_ACCESS is when performing an access function. CONN_QFILE (QUOTAUTIL) does this when starting up quota operations. DIRACC also uses it to determine if anyone has requested no write access to a directory (this would have to be from an explicit user open of a directory). In this case, ARBITRATE_ACCESS is called to see if we can write, but we return the access lock to its original mode (implying a null lock for us). The same technique is used when opening a file (except for explicit interlock ignore) in OPEN_FILE (FILUTL). The MODIFY functions of extend and truncate also do this. They are allowed to lower the access lock since they hold the serial lock on the file which will prevent any new accessors from coming in who might object to their intended operation.

The combination of the FCB reference counts and the LOCK_COUNT of the access lock indicates the other users of the file. Certain operations do not allow other collections of users, even though they are allowed by the normal access rules. The most obvious of these is truncation, described below.
Also, changing the security classification of a file (CHANGE_CLASS in RWATTR) allows no other accessors.

4.2 Deferred Truncation

A writer implicitly disallows truncation. The problem is that truncation depends on the ability to invalidate windows (WCBs) for all accessors. This is especially difficult because it would be necessary to revoke I/O that was in driver queues when the truncation was performed. The result is that truncation is only allowed by a single writer accessing the file. If there are readers when the truncation is performed, the actual truncation is deferred until the last reader de-accesses the file, in much the same way that a file is marked for delete and doesn't really go away until completely de-accessed.

The access lock is the mechanism used to determine when the last accessor goes away. This is because there will be exactly one access lock per node that has any accessors at all, and if a $GETLKI function that returns a count of locks on an access lock returns 1, then we know no one else is out there. The routine LOCK_COUNT performs this function.

When a truncation is being deferred, a flag is set in the value block of the access lock, as well as the VBN to truncate to. This is how that information is passed to another node when the truncate operation occurs on one node, and the last de-accessor is somewhere else. Since this information is also recorded in the FCB for the file, it is necessary to mark the FCBs stale cluster-wide.

If, however, after the writer requesting truncation de-accesses the file, another writer comes along, the delayed truncation is canceled (see ACCESS). This is done by forcing the access lock to at least PW mode, clearing the delayed truncation flags in the FCB (implying the lock value block) and doing a lock conversion to that same mode (which will write the value block). The first conversion to PW mode is always possible immediately.
(If the delayed truncation flag is on, it indicates that someone requested truncation while having the file accessed in PW mode (no other writers). Since we succeeded in locking the file for writing, that other exclusive writer must have de-accessed the file. Since the delayed truncation flag is on, it indicates that some readers are still present (who must have allowed writers for us to have succeeded in getting write access). Thus, the highest mode requested in the cluster must be CR mode, thereby allowing us to convert to PW.)

DEACCESS checks for a request to truncate. If we are the only accessor, this is done directly. Otherwise, the truncation validity checks are made and the values stored in the value block. (The lock is upgraded to at least PW mode. The lowering of the mode (in MAKE_DEACCESS in CLENUP) when we actually finish de-accessing will write the value block out.)

Also, DEACCESS notices if we were a reader and were the last one to de-access a file. If so, and delayed truncation was requested, the truncation is performed.

MODIFY itself only allows a truncation request if we are the only accessor, since it always does the truncation at the time of request. If we have the file accessed, we can simply check the reference counts in the FCB and the lock count for the access lock. If we don't have the file accessed, we must do an ARBITRATE_ACCESS specifying no readers and then check for our being the only accessor. (CONV_ACCLOCK restores the old lock.)

4.3 Marking FCBs Stale

FCBs are an in memory summary of a number of pieces of interesting information about a given file header. That is, whenever you do anything to a file, you first build an FCB from the header. For an accessed file, there is an FCB for each header in the file, and the FCBs stick around in memory.
If we have a file accessed on our node, it is quite reasonable for a process on another node to be changing the file header(s) for it, such as the protection, allocation (adding extension headers), or marking it for delete. However, if an FCB is present, the XQP will believe it without looking at the header. We'd like to know if someone else has been messing with our accessed file without reading the headers all the time so we can rebuild our FCB chain from the headers when we need to.

A system blocking routine associated with the access lock is used for this purpose. The XQP$FCBSTALE blocking routine (SYS module SYSACPFDT) is armed with the primary FCB address as its parameter. Whenever a piece of code messes with headers in a way that fouls up the FCB contents, it calls the routine MAKE_FCB_STALE. This routine queues for an EX mode lock on the access lock, triggering the blocking routine, which simply sets the FCB$V_STALE flag in the FCB$W_STATUS field.

The main call to MAKE_FCB_STALE is in CLEANUP, which keys off the CLF_MARKFCBSTALE flag, which is set by anyone who modifies a file header in a way that would require rebuilding the FCBs. This flag is set by EXTEND and RWATTR (protected attributes, UIC, class, file protection, ACL). In the case of TRUNCATE, this function is performed by its callers. (In the case of MODIFY, truncation is allowed only if there are no other accessors, so MAKE_FCB_STALE is not called. In DEACCESS, a delayed truncation implies that there will be no accessors, so this is not needed.)

Extending the quota file will explicitly call MAKE_FCB_STALE for the quota file. Performing a SHUFFLE_DIR will do likewise for the directory. An equivalent function is performed in MARKDEL_FCB when an FCB is marked for delete.

The stale flag will only be used for a cluster shareable device; that is, if no access lock is held by this node, stale will never be set and the FCB chain is good.
DELETE fires the blocking AST (by hand) to mark that it has marked the file as delete pending. DEACCESS and SEARCH_FCB will force the local FCB chain stale if the access lock was held in NL mode (all accessors on our node requested no lock) since the FCBs are always questionable (blocking ASTs are not delivered to NL modes). CREATE_HEADER also sets the index file FCB stale, under a similar assumption.

Various places in the XQP that look at the FCB test the stale flag and rebuild the FCB chain from the headers if it is set. These include ACCESS (file to be accessed), SEARCH_QUOTA (quota file; may also require serializing on the quota file), CREATE (file being accessed), DEACCESS (file being de-accessed), MARK_DELETE (file being deleted), OPEN_FILE (file being accessed), MODIFY (file being modified), and CONN_QFILE (quota file).

When the FCB chain is found to be stale, REBLD_PRIM_FCB is called. It will initialize a new FCB from the real file header, and rearm the blocking AST by converting the lock to the same mode. This is followed by BUILD_EXT_FCBS to pick up the extension FCBs which might also have changed.

CHAPTER 5 CACHES

Other than the buffer cache, described under I/O buffer management, the system maintains other caches to speed up file system operations. These are described below.

5.1 RMS Directory Cache

The RMS directory cache is a list of directory names and their file IDs that RMS has seen before. This is how it normally avoids calling the XQP every step of the way down a 6 level deep sub-directory tree, for example. However, it needs to know if anyone has been messing with the directory structure, like deleting or renaming one. If so, it needs to call the XQP to step down the tree. To make this test and keep the overhead low, real low, it picks up the sequence number UCB$W_DIRSEQ, and stores it with its cached directory entries.
Whenever the file system does something that changes the directory structure, it calls UPDATE_DIRSEQ (in CHKDMO) (done by ENTER when superseding a directory, REMOVE when removing a directory name, and RWATTR (when the directory bit is turned off)). The sequence number is also incremented when the volume is mounted, to avoid some races.

The volume lock is used for this purpose. RM$ARM_DIRCACHE (in RM0SETDID) converts the volume lock to CR mode specifying RM$DIRCACHE_BLKAST (in RMSRESET in SYS) as a blocking AST routine. When UPDATE_DIRSEQ does a QEX_N_CANCEL on the volume lock, this routine will bump the sequence number on the distant nodes.

The high order bit of the sequence number indicates that the blocking AST is armed. CHECK_DISMOUNT clears this bit when the lock (and therefore the blocking AST) is disarmed. The bit is also cleared when the blocking AST goes off. When the blocking AST is successfully armed, and the sequence number matches what we started with, the armed bit is set.

5.2 File ID Cache

The FID cache is effectively a pre-allocated section of the index file map. It is found from VCB$L_CACHE and VCA$L_FIDCACHE. The cache holds a list of known empty FIDs.

The FID cache is maintained by CREATE_HEADER and DELETE_FID. When going to allocate a FID, the FID cache is checked first. If the cache is empty, blocks will be read from the index file map until one is found with a free bit. The free bits will be added to the FID cache. The index file map block is force written if this is done. Of course, the file header for this FID must check out. If the FID cache is found not valid, we try to re-obtain the cache lock as part of making it valid.

Entries are also added to the FID cache by DELETE_FID. If the FID can be put into the cache successfully, fine. Otherwise, some entries must be removed.
Since they will be written to the index file map, and they will be read into (possibly) some other node's cache, we want to write out as many as possible that will fit in a given index file map block. The index map VBN of this block is recorded in the allocation lock value block. For a cache flush, all FIDs are returned. A cache flush also reduces the cache lock to NL mode and marks the cache invalid (refer to cache flushing below).

5.3 Extent Cache

The extent cache is effectively a pre-allocated section of the bitmap file. It is found from VCB$L_CACHE and VCA$L_EXTCACHE. The cache holds a list of known free extents (LBN and size).

The extent cache is maintained by routines in SMALOC. The idea is to maintain a certain fraction of the disk free space in the extent cache. When trying to allocate an extent, the extent cache is checked first. If this fails, allocation occurs directly from the bitmap. If that fails, the extent cache is flushed to hedge our bets on another try at the bitmap. After the allocation, a try is taken at (re)filling the extent cache from the bitmap block in memory, and then from the bitmap itself. The VBN of the block from which we succeed in finding free blocks is recorded in the storage map VBN field in the allocation lock value block.

Returning blocks likewise returns first to the extent cache. If this overflows, some extents are purged back to the storage map.

When operating on the extent cache, if it is marked invalid, an attempt is made to start keeping it valid by re-obtaining the cache lock.

When removing entries from the extent cache, an effort is made to remove those extents that would map into the same bitmap block on disk. This saves us writes and saves other nodes reads to find sufficient free space. For a complete flush, though, we just write out all extents. In this case, the cache lock is reduced to NL mode.
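The allocation order described for the extent cache (cache first, then the bitmap, then flush the cache and retry the bitmap) can be sketched as follows. This is an illustration under assumptions, not the SMALOC algorithm: the first-fit policy, the coalescing of adjacent extents on flush, and the names `take`, `coalesce` and `allocate` are all hypothetical, and the refill step after allocation is omitted.

```python
from typing import List, Optional, Tuple

Extent = Tuple[int, int]  # (starting LBN, size in blocks)


def take(extents: List[Extent], size: int) -> Optional[Extent]:
    """First-fit: carve `size` blocks from the first extent that is large enough."""
    for i, (lbn, length) in enumerate(extents):
        if length >= size:
            if length == size:
                del extents[i]
            else:
                extents[i] = (lbn + size, length - size)
            return (lbn, size)
    return None


def coalesce(extents: List[Extent]) -> List[Extent]:
    """Merge extents that are adjacent on disk."""
    merged: List[Extent] = []
    for lbn, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == lbn:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((lbn, length))
    return merged


def allocate(size: int, cache: List[Extent], bitmap: List[Extent]) -> Optional[Extent]:
    got = take(cache, size)                 # 1. extent cache first
    if got is None:
        got = take(bitmap, size)            # 2. then directly from the bitmap
    if got is None:
        bitmap[:] = coalesce(bitmap + cache)  # 3. flush the cache back and retry
        cache.clear()
        got = take(bitmap, size)
    return got


cache = [(0, 4)]
bitmap = [(4, 4)]
print(allocate(8, cache, bitmap))
```

In the example neither the cache nor the bitmap alone holds 8 contiguous blocks, but after the flush the two adjacent extents merge and the request succeeds, which is one way the "hedge our bets" retry can pay off.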
5.4 Cache Flushing The presence of the FID and extent caches implies that the index file and storage maps do not actually reflect the true amount of free space. When you really need to know everything that is really available, you've got to get everyone else to flush their cache back to the bitmap where you can find it. This is done with system blocking locks, with help from the CACHE_SERVER process (PID XQP$GL_FILESERVER, routine XQP$GL_FILSERV_ENTRY, which is set to CACHE_SERVER). Flushing back the extent cache can be triggered by various conditions generated when allocating storage in the SMALOC module. It is also triggered by write accessing the BITMAP.SYS file. Flushing the file number cache can be triggered by the CREATE_HEADER routine failing to find a free header, or by write accessing the INDEXF.SYS file. Basically, the idea is that when those caches are being used, they hold a lock with the blocking routine XQP$UNLOCK_CACHE (SYSACPFDT module, SYS facility), and an AST parameter that identifies the UCB with the cache type encoded in the low bits. This lock has the form F11B$c where is either the index file, bitmap file, or quota file. When you need to flush, you queue an incompatible lock for the appropriate F11B$c lock (routine CACHE_LOCK) and the blocking routine in turn queues an AST for the CACHE_SERVER process, with an AST parameter telling it what to do. The CACHE_SERVER, in turn, does a normal XQP call with a special ACPCONTROL function that flushes the correct cache. The cache flushing function needs a specific process since the blocking AST, being from a system lock, may go off in any random process (in particular, one without the XQP mapped (the null process, for instance)). In the normal case, each node will take out a PR mode lock on each cache of interest. 
If any node needs to do something special (write access one of the files or otherwise request a cache flush elsewhere), the node takes out a CW mode lock, or converts its existing lock to CW mode, firing the other nodes' blocking ASTs. The other nodes flush their caches and then convert their locks to NL mode, allowing the requesting node to get its lock.

Accessing for write one of the special files (INDEXF, BITMAP or QUOTA) causes a system lock in CW mode to be taken out on the cache. Also, if quota processing is turned on, and we notice that someone has the quota file opened for write (while quota processing was off), the CW mode cache lock is taken. We will wait for other nodes to release their PR locks.

If CREATE_HEADER wants other nodes to flush their FID cache, it takes out a process CW lock on the cache lock. No blocking AST is associated with this. The lock is de-queued as soon as we get it. Likewise, if ALLOC_BLOCKS fails to allocate, it will take out a process CW mode lock on the cache lock. We wait to get the lock, so that we know that all other nodes gave up their PR locks (that is, that they flushed their caches).

If a cache is found invalid (the starting state for a cache, also the state after a flush), a PR mode lock is requested on the cache lock. We do not wait for this. If we fail to take the lock, we simply flag the cache as invalid. The quota cache will also be marked as needing flushing (since the quota software will still read quota records into the cache and modify them there). We do not wait because some node may be holding a CW mode lock indefinitely, by virtue of a write access to the file associated with the cache lock.

A complete flush of a cache will reduce the cache lock to NL mode to allow the requesting nodes to continue with their lock requests.

CHECK_DISMOUNT will de-queue any special cache locks it finds. MAKE_DEACCESS will de-queue a cache write lock if held on the file being de-accessed. DEACC_QFILE will de-queue the cache lock if held.
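The choice of lock modes in this chapter follows from the lock manager's mode compatibility rules. The sketch below encodes the standard compatibility table for the six lock modes and uses it to show which holders would receive blocking ASTs for a given request. The function names and the node names are hypothetical, and the sketch models only compatibility, not queueing or conversion.

```python
# For each held mode, the set of requested modes that can be granted alongside it.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def compatible(held, requested):
    """True if `requested` can be granted while another lock is held in `held` mode."""
    return requested in COMPATIBLE[held]


def blocked_holders(holders, requested):
    """Return the (hypothetical) nodes whose held mode conflicts with the request.

    `holders` maps a node name to the mode it holds on the cache lock.
    These are the nodes whose blocking ASTs would fire.
    """
    return sorted(node for node, mode in holders.items()
                  if not compatible(mode, requested))


# Nodes using a cache hold PR; a node wanting a flush asks for CW.
holders = {"NODE_A": "PR", "NODE_B": "PR", "NODE_C": "NL"}
waiting_on = blocked_holders(holders, "CW")

# After flushing, the PR holders drop to NL and the CW request is no longer blocked.
after_flush = {node: "NL" for node in holders}
still_waiting = blocked_holders(after_flush, "CW")
```

PR and CW are mutually incompatible, which is why a CW request forces every PR holder to flush, while two nodes both holding CW (for example, both write accessing the same special file) do not block each other.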
CHAPTER 6 QUOTA FILE PROCESSING

As mentioned earlier, one of the last functions performed in an XQP request is to reflect new disk usage in the quota file. This is done unless FIB$V_NOCHARGE is set (which a user cannot do, since GET_FIB will clear it). The flag is set by EXTEND_INDEX so that the index file isn't charged. (EXTEND checks this flag.)

6.1 Quota File Operations

There are various ACP control functions that a user can request to operate on the quota file. These are implemented by QUOTA_FILE_OP.

A problem QUOTA_FILE_OP has to start with is that it must acquire the serial lock on the quota file before it can obtain the allocation lock for the volume. It must acquire the allocation lock to prevent quota figures from changing. However, the allocation lock protects the existence of the quota file FCB; that is, VCB$L_QUOTAFCB is not stable except under the allocation lock. So, QUOTA_FILE_OP loops, requesting the serial lock on the quota file (using whatever random value it gets by looking at VCB$L_QUOTAFCB), getting the allocation lock, checking that VCB$L_QUOTAFCB matches what it believes, and unlocking and re-trying until it does.

QUOTA_FILE_OP performs the protection checks needed. SEARCH_QUOTA locates the desired quota record. (Note that this search is done directly against the quota file, not the quota cache.)

A disable quota function flushes the quota cache (described below), force writes any quota file buffer blocks, and then performs the actual quota file de-access (DEACC_QFILE).

An examine quota function simply returns the quota file record.

A remove quota entry function returns the old entry. The entry is zeroed, under an exclusive lock on the quota entry.

A modify quota function patches the entry and writes it. Note that the process must hold the volume blocking lock to modify the usage figure.

An add quota function finds the next free record and writes the quota information.
An EXTEND_CONTIG will be done if necessary to grow the file.

To actually de-access the quota file, DEACC_QFILE starts by killing any quota file buffers (KILL_BUFFERS (1, -1)). The access lock on the quota file is downgraded to show our de-access. The quota cache is de-allocated. We also release our quota cache lock if we had it.

The inverse operation of connecting the quota file is done by CONN_QFILE. CONN_QFILE does a FIND to locate the quota file. Under the quota file serialization lock, the FCB is found or created, and the extension FCBs built. Write access is requested for the quota file. MAKE_QFCB allocates the quota cache, linking it to the VCB. The ACBs in the cache header for the various blocking routines (described below) are set up here. If the quota file is already write accessed, the quota cache lock is taken out.

6.2 Quota Cache

The quota cache has entries based on UIC to keep track of allowed usage, current usage, etc., without the need to read and write the QUOTA.SYS file itself all the time. It is allocated in paged pool by MAKE_DISK_MOUNT in the MOUNT routine MOUDK2 and de-allocated by CHECK_DISMOUNT.

The quota cache is found by chasing through VCB$L_QUOCACHE. This points to a VCA block. Each entry contains a UIC, the quota information, a lock status block used with the quota cache entry locks (described below), the quota file record number, and LRU indexes. The cache header contains an LRU counter. When a new entry is added, the value is put into the entry and this counter incremented.

FLUSH_QUO_CACHE returns each entry to disk. The corresponding record on disk is located and updated (CLEAN_QUO_CACHE). Any quota entry locks are released, including a conversion to NL of the quota cache lock itself.

SCAN_QUO_CACHE looks up an entry in the cache. If the cache is marked invalid, this routine tries to get the normal (PR) cache lock so that the cache can stay valid. If the entry is marked dirty, the record is updated from disk.
The quota entry lock is released for the entry. If the entry is not valid, the quota entry lock (PW) is obtained to make it valid.

CLEAN_QUO_CACHE updates the disk record from a cache entry. The disk buffer is marked dirty, the cache entry marked clean.

ENTER_QUO_CACHE copies a given record into the cache. The LRU index is updated if requested, the entry marked dirty if requested.

If SCAN_QUO_CACHE finds the quota cache invalid and cannot obtain the cache lock, it sets the CACHEFLUSH flag in the cache header. CLEANUP will check for this and flush the quota cache when we are done (to reflect our changes to the process holding the quota cache lock).

6.3 Quota File Manipulation

The main routine within quota file processing is SEARCH_QUOTA. This routine locates a quota record for a given UIC. It will scan the quota cache, updating the quota file from the cache entry if necessary. If the record can't be found, or wild card search was requested, the quota file must be scanned. The scan of the quota file is done before the quota cache in the wild card case to get the records in proper order.

If the record returned is in the cache, the returned address is that of DUMMY_REC within CHARGEQ. This value is special cased elsewhere within CHARGEQ. REAL_Q_REC is the address of the buffer containing the actual disk quota record, if there is one. WRITE_QUOTA will update the cache entry, and/or the disk record (mark the buffer dirty) depending on these variables.

CHARGE_QUOTA does the system processing of charging for quota, checking for overdrawn, etc. It will write out the new quota record if the quota charge is okay.

6.4 Dynamic Quota Cache Entry Lock Passing

For nodes in a cluster, for each entry in the quota cache, the node holds a lock of the form F11B$q in PW mode. All the relevant information of the quota entry is packed into the value block, so it can be shared cluster-wide.
(If the value block comes back as invalid from the lock manager, we simply mark the quota cache entry as invalid.) When a new entry is added to the quota cache (by SCAN_QUO_CACHE), the quota entry lock is obtained. When an entry is removed from the cache, either by explicit flush (FLUSH_QUO_CACHE) or by LRU replacement (SCAN_QUO_CACHE), the lock is de-queued.

The lock is normally held as a system owned lock, specifying the XQP$REL_QUOTA blocking routine in the SYSACPFDT module (SYS). A subsequent operation on that node hitting that cache entry will not do any lock conversions at all.

When another node queues for the F11B$q lock, the blocking routine is triggered and sends an AST to the swapper (routine XQP$UNLOCK_QUOTA). If the quota entry is valid, the cache lock is demoted to CR mode (compatible with the PW that the other node is requesting). This will write out the value block, which the other node will pick up when it succeeds in getting its PW mode lock. If the quota entry is not valid, the lock is de-queued entirely and the entry marked vacant.

When a quota entry is being removed from the quota file itself, the quota entry lock is requested in EX mode. This forces any node that holds any lock on it to perform a cache flush of the entry.

CHAPTER 7 DIRECTORY OPERATIONS

Operations upon directories, other than explicit accessing of directories, are handled by the modules DIRACC, DIRSCN, FIND, ENTER, REMOVE and SHFDIR.

FIND is called from the main processing routines ACCESS, DELETE, MODIFY and CONN_QFILE. ENTER is called from CREATE. REMOVE is called from FIND to perform a requested deletion. These are the interfaces into the directory routines.

FIND does the processing of taking the directory FID from the FIB and locating and accessing the directory, and finding the directory entry to get the file's FID. ENTER adds a new directory entry, possibly removing/superseding one. REMOVE removes a directory entry.
DIR_ACCESS accesses a directory. This is a less involved operation than accessing a file since the directory access only lasts for the duration of an XQP operation. DIR_SCAN does the walking down of a directory. SHUFFLE_DIR extends and contracts directories as requested by ENTER and REMOVE.

ERR_CLEANUP can perform various directory operations. It will remove an entry if need be, or restore an entry by calling RESTORE_DIR, DIR_SCAN and MAKE_ENTRY directly.

7.1 FIND

FIND looks up a directory entry. It takes as input the buffer descriptors and the FIB supplied by the user. It can optionally remove the entry (requested by DELETE) and set the version limit (sometimes requested by MODIFY). These sub-functions are provided because the calling routines do not wish to operate on the directory entry themselves. FIND will also return the resultant name string (requested by DELETE).

The real work of FIND is performed by DIR_ACCESS and DIR_SCAN. FIND handles a few cases of wild card searching. Note that the directory search is metered as a sub-operation.

7.2 DIR_ACCESS

DIR_ACCESS accesses a directory. It is called by FIND and ENTER. (REMOVE works against the directory entry described by the directory context area.) The arguments to DIR_ACCESS are the user's FIB and the necessary access type (read/write/execute). DIR_FCB is set as a result.

The serialization lock is obtained on the directory. Access locks are manipulated just to make sure that no node is preventing our access (due to explicit opening of the directory); the serialization lock obtained obviates the need to hold an access lock.

Note that DIR_ACCESS is basically a no-op if DIR_FCB is set. In the case of create-if, ACCESS would have called DIR_ACCESS (via FIND) to try to find the directory entry. This would have done an execute protection check. The write protection check that CREATE needs (normally done by DIR_ACCESS called from ENTER) will be skipped since DIR_FCB is still set.
So, CREATE performs the protection check in this case.

7.3 DIR_SCAN

DIR_SCAN locates a particular directory record. It takes as inputs a name descriptor block, a FID to locate, a starting block, record and version, a predecessor record, and a number of records to scan.

In the case where a normal name lookup is to be done (supersede cleanup, ENTER test for duplicate file name), only the name descriptor is passed. If we need to remove an entry, a DIR_SCAN is done given the new starting point, specifying a predecessor record and no record limit (find version set to -32768) to find the oldest version.

In the FIND case of supplied wild card context, FIND asks to find FID (-1, -1, -1) for the given starting block and record number from the wild card context (this is the case in which a record limit is provided). This will fail, but will position to the desired record. A full wild search is done from that starting point.

If a resultant string is supplied, search the indicated block for the given entry. If the search fails immediately (no records are traversed), search again from the start. If the version is not wild, we search for the oldest version so we are positioned at the start of the next name. Thus we are left positioned at the record, or where it used to be. When all of this processing is done, the actual desired entry is found.

The directory index block is used by DIR_SCAN to reduce search time. The directory index block gives the last file name present in every nth block of the directory (where n is normally 1). This block is used to select a starting and ending block for the search.

The directory is processed a block at a time. This is necessary, since the buffer manager can only guarantee reading a single block at a time. This is just as well, because this gives an easy handle to build the directory index block.

7.4 NEXT_REC

The DIR_SCAN routine NEXT_REC returns a pointer to the next record in the current block.
This routine validates the format of the entry.

7.5 UPDATE_INDX

The directory index block is maintained by UPDATE_INDX, called by DIR_SCAN, as it walks down directory blocks not described by the directory index block, and by ENTER when it changes a block. Updating the index is a simple string copy into the correct directory index block cell.

7.6 NEXT_DIR_REC

NEXT_DIR_REC will find the next directory entry, including reading the next block if necessary. The routine returns a value only if the entry matches the name of the previous entry supplied. This routine is used by FIND and ENTER.

7.7 ENTER

ENTER adds a new entry into a directory. It also includes a FIND operation. ENTER takes a buffer descriptor block and a FIB, as does FIND, as well as returning the resultant name string.

ENTER is called from CREATE (MAKE_ENTRY is called during ERR_CLEANUP). As such, ENTER makes its own DIR_ACCESS call.

In the case of superseding an entry (meaning to replace an existing entry that matches in name, type and version), this can be done in line. Otherwise, the work is done by MAKE_ENTRY.

7.8 MAKE_ENTRY

MAKE_ENTRY does the work of actually entering an entry. This is called in ENTER, and also in ERR_CLEANUP to undo a remove operation. The inputs are a file name block and the user's FIB. MAKE_ENTRY keys off the directory position set by the caller's DIR_SCAN. If the entry is added to the end of a block, the directory index block cell is updated.

MAKE_ENTRY handles the case needing a removal also. In such a case, the directory context denoting the insertion point is saved. A second DIR_SCAN locates the oldest entry to remove. After removing the entry, the directory context is restored.

7.9 RESTORE_DIR

RESTORE_DIR restores a saved directory context. The only interesting aspect of this is the possible re-reading of a directory block.
This is done during the MAKE_ENTRY search for an entry to remove, and during the ERR_CLEANUP restoration of directory context before undoing a remove.

7.10 REMOVE

A directory entry is removed by REMOVE. REMOVE is called by FIND, when requested to remove the found entry, by ENTER, when it is necessary to remove an entry due to version limitations, and by ERR_CLEANUP, to undo an enter operation. As an option, it will keep a name with no versions. (This is done by ENTER, when we are removing an entry and want to leave the name, in case there was only one version, for us to enter.)

Note that removing an entry does not change the directory index block, since the old name is just as good a pointer to this directory block as is the new last record in the block.

7.11 SHUFFLE_DIR

SHUFFLE_DIR expands/contracts a directory. This is done by ENTER and REMOVE. The argument tells the direction: expand (1) or contract (-1). The operation keys off the directory context generated by the caller. Performing a directory shuffle is a secondary context operation.

If the operation is an extend, the directory will be expanded into unused end blocks if present. Otherwise, a contiguous extend is needed. The directory is expanded by half its present size, with the old blocks copied into the newly allocated space. The copy is done so as to duplicate blocks, for safety in a crash. The current block is split into two parts, depending on the current position.

For a compression, a block is squished out. The blocks following the squish are copied downward.

The directory file header is updated to show the new blocks/EOF. The FCBs are rebuilt to map. Finally, the directory index block is cleared out, now that we have lost track of the names in the block squished/added.

CHAPTER 8 ACL OPERATIONS

The ACL for a file is an obvious parameter to CHECK_PROTECT. The ACL is stored as an in-memory list (paged pool) located from the ORB, which is in turn located from the primary FCB for a file.
The ACL is created (copied from the file header chain) by FILL_FCB, when the CLF_NOBUILD flag is off (that is, only during initial FCB creation). The ACL is threaded onto the ACL queue of the ORB, with ACL_INIT_QUEUE called. Correspondingly, REBLD_PRIM_FCB calls ACL_DELETEACL upon this ACL, and then calls INIT_FCB2 (which calls FILL_FCB) without CLF_NOBUILD so that the ACL is rebuilt.

NUKE_HEAD_FCB, called by CLEANUP, MARK_DELETE, CLOSE_FILE and UNHOOK_BFRD (when deleting a directory FCB lying around by virtue of its directory index block), is the main path to deleting an ACL (ACL_DELETEACL). The ACL will be deleted by CHECK_DISMOUNT when it de-allocates any FCBs associated with the device.

The ACL can be returned to the user via a read attributes function. It is set by WRITE_ATTRIB. GET_FIB initially sets FIB$L_ACL_STATUS to success. This field gets its real value in READ_ATTRIB/WRITE_ATTRIB. READ_ATTRIB calls ACL_DISPATCH for ACL operations other than add, delete and modify. WRITE_ATTRIB will pass on all operations to ACL_DISPATCH. Writing to the ACL causes the FCB to be marked stale cluster-wide (forces rebuilding the in-memory ACL). The file header ACL chain is rebuilt (ACL_BUILDACL).

The ACL is initially built in a file header chain by CREATE. For a file just being entered, PROPAGATE_ATTR will copy an ACL. For a real create, a WRITE_ATTRIB will copy the user's ACL. If the file is owned by other than the creator, an explicit ACL term for the creator is added (ACL_ADDENTRY). CREATE will explicitly call ACL_BUILDACL for good measure, in case it wasn't picked up in any ACL manipulation earlier.

For a new file, PROPAGATE_ATTR (COPY_INFO, actually) will copy (ACL_COPYACL) the default protection ACL from the DIR_FCB. For a new version of an existing file, the ACL is copied from the old file's FCB (obtained in secondary context).

8.1 ACL_BUILDACL

This routine copies the in-memory ACL into the file header chain.
During this copy, the BADACL bit is set in the primary header so that the presence of the corrupted ACL can be seen. The file's header is extended if we run out of extension headers.

8.2 ACL_COPYACL

ACL_COPYACL copies specified ACEs from one FCB to another. It is called to copy the entire ACL from a file to a new version thereof, or the default protection ACEs from a directory to a new file. The operation is a simple copy and thread operation.

8.3 ACL_DISPATCH

READ_ATTRIB/WRITE_ATTRIB call ACL_DISPATCH to return/affect the ACL for a file. ACL_DISPATCH simply calls the desired routine, given the operation code specified in the user attribute area. The caller must reflect any changes into the file headers. The actual operations are done by routines contained in ACLSUBR. The return value is stored in FIB$L_ACL_STATUS.

8.3.1 ACL_INIT_QUEUE

ACL_INIT_QUEUE is called before any explicit operations upon the ACL. An ACL exists as a threaded list from the ORB, manipulated under a mutex (at elevated IPL). The mutex in the ORB is initialized by this routine, the mutex locked, and the queue head set to be empty under the lock. The routine leaves with the mutex unlocked.

8.3.2 ACL_ADDENTRY

Adds an ACE to an ACL. Note that ACL segments are limited to 512 bytes, so adding an ACE may require splitting an ACL segment.

8.3.3 ACL_DELENTRY

Delete an ACE from an ACL. The segment containing the old entry is always deleted; the remaining ACEs are copied to a new segment.

8.3.4 ACL_MODENTRY

Modify an entry, just a delete followed by an add.

8.3.5 ACL_FINDENTRY

The basic ACE finder. Matches ACEs depending on context of use.

8.3.6 ACL_FINDTYPE

Locate an ACE based on type.

8.3.7 ACL_DELETEACL

Delete the entire ACL. Also called when deleting FCBs.

8.3.8 ACL_READACL

Return as much of the ACL as fits in the user area.

8.3.9 ACL_ACLLENGTH

Return the length of the total ACL.

8.3.10 ACL_READACE

Return a single ACE.

8.3.11 ACL_LOCATEACE

Locate ACE by context value.
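The segment behaviour noted under ACL_ADDENTRY, ACL_DELENTRY and ACL_MODENTRY can be pictured with a toy model in which an ACL is a list of segments and each segment is a list of ACEs. This is a sketch under stated assumptions, not the ACLSUBR code: the Python function names, the greedy split policy and the 255-byte ACE limit are all assumptions made for illustration; only the 512-byte segment limit, the "delete rebuilds the segment" rule and the "modify is delete then add" rule come from the text.

```python
SEGMENT_LIMIT = 512  # from the text: an ACL segment is limited to 512 bytes
MAX_ACE = 255        # assumption: an ACE's length fits in one byte


def seg_size(segment):
    """Total bytes occupied by the ACEs in one segment."""
    return sum(len(ace) for ace in segment)


def add_entry(acl, seg_index, pos, ace):
    """Insert `ace` at `pos` in segment `seg_index`, splitting on overflow."""
    if not 0 < len(ace) <= MAX_ACE:
        raise ValueError("ACE length out of range")
    segment = acl[seg_index][:pos] + [ace] + acl[seg_index][pos:]
    if seg_size(segment) <= SEGMENT_LIMIT:
        acl[seg_index:seg_index + 1] = [segment]
        return
    # Overflow: pack ACEs greedily into the first segment (assumed policy)
    # and spill the remainder into a second segment placed right after it.
    first, used = [], 0
    for entry in segment:
        if used + len(entry) > SEGMENT_LIMIT:
            break
        first.append(entry)
        used += len(entry)
    acl[seg_index:seg_index + 1] = [first, segment[len(first):]]


def delete_entry(acl, seg_index, pos):
    """Remove one ACE; the survivors go to a new segment and the old one is dropped."""
    old = acl[seg_index]
    removed = old[pos]
    survivors = old[:pos] + old[pos + 1:]
    acl[seg_index:seg_index + 1] = [survivors] if survivors else []
    return removed


def mod_entry(acl, seg_index, pos, new_ace):
    """Modify is a delete followed by an add at the same position."""
    was_only_ace = len(acl[seg_index]) == 1
    delete_entry(acl, seg_index, pos)
    if was_only_ace:
        acl.insert(seg_index, [])  # the emptied segment was dropped; recreate a slot
    add_entry(acl, seg_index, pos, new_ace)
```

With these assumed limits a split always produces two legal segments: the overflowing segment holds at most 767 bytes, and the greedy first half always keeps more than 257 of them.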
CHAPTER 9 USER BUFFER PROCESSING

The IRP sent to the XQP contains the address of an ACP I/O buffer packet in IRP$L_SVAPTE. AIB$L_DESCRIPT points to a blockvector of elements, each a buffer descriptor. Buffer descriptor here refers to a user area that is supplying, or is to be supplied with, information. Each element (ABD) contains an offset to the data, the size, and the user virtual address of the data. The offset plus one added to the address of the buffer descriptor gives the address of the buffer (the byte preceding that is the access mode taken from IRP$B_RMOD).

Each possible user buffer has a reserved index in the blockvector. The indexes are zero origin. The last element reserved corresponds to the read/write attribute user function. All buffers from then on correspond to read/write attribute buffers. IRP$L_BCNT contains the number of buffer descriptors present. (Note that for a window turn, IRP$V_COMPLX is off, and none of this applies.)

The actual user buffers are copied into the AIB buffers by FDT processing (SYS module SYSACPFDT). They are copied back to the user's area by I/O processing completion (BUFPOST in SYS module IOCIOPOST).

The first entry is for returning the window pointer. This is not a user supplied buffer. BUILDACPBUF (in FDT processing) sets the window pointer return address to CCB$L_WIND. GET_REQUEST zeros the window pointer return length (except for window turns) so that the value is not returned. MAKE_ACCESS restores the window pointer return length (to 4) and returns the window pointer. ZCHANNEL cleanup (which aborts a failed access attempt) returns a zero for the window pointer.

The user's FIB occupies one element of the buffer descriptor list. It is copied into LOCAL_FIB by GET_FIB. The updated FIB is copied back to the FIB buffer by IO_DONE.

The file name buffer is passed as the input to PARSE_NAME from ENTER and FIND to parse the user's file name into the internal name block.
COPY_NAME (called from CREATE and FIND (for a spooled device)) copies the file name buffer into the result string buffer. It also sets the result string length buffer value. IO_DONE clears the file name return length to inhibit write-back of it.

For quota file operations (QUOTA_FILE_OP), the file name buffer is used to pass a quota file transfer block (DQF). For operations on a spooled device, FDT processing placed the user name and account in the file name string to be placed in the file header.

RETURN_DIR, called from ENTER and FIND, returns the name from DIR_ENTRY and DIR_VERSION into the result string buffer. The result string length buffer is also set. The result string is itself passed to PARSE_NAME from FIND when processing a wild card search.

Quota file operations call RET_QENTRY (QUOTAUTIL) to return the quota record (DQF) into the result string buffer. The result string length is set here.

If a user attribute buffer exists, a read/write attributes function (READ_ATTRIB/WRITE_ATTRIB) is performed. ACCESS will perform an attribute read. CREATE, DEACCESS, MODIFY will perform an attribute write. IO_DONE sets IRP$L_BCNT during non-read operations so as to inhibit write-back of the attributes.

The attribute list sometimes contains placement data (processed for compatibility) when FIB$V_ALLOCATR is set. GET_LOC_ATTR, called from CREATE and MODIFY, will scan the user's attribute list for placement data, copying it into standard format in the FIB.

CHAPTER 10 SPOOL FILE OPERATIONS

Spool files are non-entered files that are flagged as spooled. Such files will be sent to the symbiont when de-accessed. The idea is to allow a process to pretend to be writing to a printer, when in fact it is not.

For spool files, IRP$L_UCB (which is loaded into CURRENT_UCB by GET_REQUEST) refers to the spool file. IRP$L_MEDIA is set to the spooled device UCB (a printer).
Spool file operations are recognized when a process does a create function to a printer specified as spooled. This will be translated into a creation of a spooled file.

Requests to operate on spool files are recognized in FDT processing (SYS module SYSACPFDT). For implicit spooling, the file name user buffer is replaced by the user name and account, to become the file name in the header.

GET_REQUEST notices that IRP$L_UCB is different from IRP$L_MEDIA, setting CLF_SPOOLFILE.

The SPOOL flag is set in the file header for spool files by CREATE. This flag causes FILL_FCB to set FCB$V_SPOOL. The SPOOL header flag is one of the characteristics that cannot be changed by WRITE_ATTRIB.

Directory operations are no-ops for a spool file.

DEACCESS will set CLF_DOSPOOL when de-accessing a spool file. The CLF_DOSPOOL cleanup flag causes the file to be sent to the symbiont.

SEND_SYMBIONT picks up the queue name for the request from VCB$B_QNAMECNT (an ASCIC string). An itemlist is built providing the name, FID, etc. of the file. This is sent to $SNDJBC. $SNDJBC notices that it is being called from kernel mode, and does not do the various access checks it would otherwise (which would cause us to hang). If the symbiont request fails, the file is deleted (by setting CLF_DELFILE), with the job controller error status being returned to the user (in USER_STATUS[1]).

CHAPTER 11 ACCESS OPERATION

The ACCESS function is invoked if an IO$_ACCESS function code is specified. The basic steps follow.

Find the directory entry if necessary.
Serialize processing on the file. (This allows stable FCBs to be found.)
Find the FCBs. This is the point at which we detect that we have serialized on the wrong lockbasis (the user is trying to access an extension header directly). In this case, we release the serial lock and serialize on the correct (primary header) lockbasis.
Create the primary FCB if not found.
Check for access conflicts; obtain the access lock.
Create a window to the file.
MAKE_ACCESS (thread the window, update access counts, return the window address in the user's attribute area).
Set V_WRITE_TURN in the window for directories or other interesting files.
See if the expiration date needs updating.
Build/check extension FCBs.
Check user access to the file.
If the file is interesting, flush caches of the file. Obtain the appropriate cache lock.
Check access for, and read attributes, if requested.
Check the need for cathedral windows.

CHAPTER 12 CREATE OPERATION

The CREATE function is invoked if an IO$_CREATE function code is specified, or if DISPATCHER detects SS$_NOSUCHFILE from ACCESS when IO$V_CREATE was specified. The basic steps are as follows.

Clean up from a previous access attempt, if this is a create-if operation. Perform the write access check on the parent directory, since the FIND within the access attempt only did an execute check.
Find a volume for the file. Check user's access to create on that volume.
Create/allocate a file header.
Create the primary FCB.
Enter the file into the supplied directory.
For a propagate operation, serialize on the file. Search/build/create its FCBs. Copy attributes to the file.
Check the back link/name. Update the header if no good.
Do write attribute processing; correct ACL to include creator.
Charge quota for the file.
Access the file if requested.
Perform extension, if requested.
Update the file headers with the ACL.
Re-map the file if extended (and cathedral windowed).
Delete any file superseded/removed.

CHAPTER 13 MODIFY OPERATION

The MODIFY function is invoked if an IO$_MODIFY function code is specified. The basic steps are as follows.

Locate the directory entry.
Serialize on the file.
Search/build/create the FCBs.
Check accessors; obtain access lock.
Check for access to file.
Perform write attributes processing.
Perform extension or truncation.
Update file header.

CHAPTER 14 DELETE OPERATION

The DELETE function is invoked if an IO$_DELETE function code is specified.
The basic steps are as follows.

Find the directory entry.
Serialize on the file.
Read its header.
Search/build/create FCBs.
Check access on file.
Make sure, if a directory, that it is empty.
Audit the deletion.
Check for other accessors.
Mark the header for delete.
Kill any cache buffers for the file (directories only).
Mark the FCB for deletion.
If we are the only accessor, delete the file.
Restore the access lock (manipulated when checking for other accessors).
Delete any FCBs.
For a directory entry removal (no file deletion), request entry removal by cleanup. Otherwise, directory entry removal is done by DELETE_FILE.

CHAPTER 15 DEACCESS OPERATION

The DEACCESS function is invoked if an IO$_DEACCESS function code is specified. The basic steps are as follows.

Serialize on the file.
Rebuild the FCBs if something must be done to the file.
Request cleanup deletion of the file if marked for deletion and we are the last accessor.
Update revision count, etc.
Update the file high water mark.
Clear the de-access lock flag if attributes are being written.
Write the attributes.
Perform any requested truncation. If we are not the only accessor, set up for delayed truncation. If we are the last accessor and delayed truncation was requested, do it.

DEACCESS returns with a zero as a condition value (error). This will invoke ERR_CLEANUP, which will de-access and possibly delete the file.

CHAPTER 16 WINDOW TURNING AND BAD BLOCK PROCESSING

If the FDT routines, when processing a virtual read or write, find that the existing WCBs do not map the desired VBN, they force a window turn. The request will be turned into an IO$_READPBLK or IO$_WRITEPBLK operation. Blocks declared as bad will likewise be mapped into these functions to be sent to the XQP.

DISPATCHER will forward these I/O function codes directly to READ_WRITEVB. MAP_VBN does the work of mapping the VBN. Bad block processing is started by marking the FCB as having bad blocks.
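The idea of a window that maps only part of a file, with a window turn when the wanted VBN falls outside it, can be illustrated with a small model. This is a hedged sketch and not the WCB layout or the TURN_WINDOW code: a window is represented here as a starting VBN plus a list of (count, LBN) retrieval pointers, the file map is assumed to start at VBN 1, and the function names and the window_size parameter are hypothetical. None of the truncation or extension cases handled by the real routine are modelled.

```python
def map_window(start_vbn, pointers, vbn):
    """Map `vbn` to an LBN using the window, or return None if it is not mapped.

    `pointers` is a list of (count, lbn) pairs covering consecutive VBNs
    beginning at `start_vbn`.
    """
    if vbn < start_vbn:
        return None
    offset = vbn - start_vbn
    for count, lbn in pointers:
        if offset < count:
            return lbn + offset
        offset -= count
    return None


def turn_window(file_map, vbn, window_size):
    """Build a new window whose first pointer contains `vbn`.

    `file_map` is the complete (count, lbn) list for the file, starting at
    VBN 1. Returns (start_vbn, pointers), or None if `vbn` is past the end
    of the mapped file.
    """
    start = 1
    for i, (count, _lbn) in enumerate(file_map):
        if vbn < start + count:
            return start, file_map[i:i + window_size]
        start += count
    return None


# A file of three extents; the current window holds only the first one.
file_map = [(10, 1000), (5, 2000), (20, 3000)]
window = (1, file_map[:1])

hit = map_window(window[0], window[1], 3)     # VBN 3 is inside the window
miss = map_window(window[0], window[1], 12)   # VBN 12 is not: a turn is needed
window = turn_window(file_map, 12, window_size=2)
retry = map_window(window[0], window[1], 12)  # mapped after the turn
```

In the model, a failed map_window corresponds to the case where the FDT routines cannot map the VBN and the request is handed to the XQP, and the retry corresponds to re-queueing the request once the window has been rebuilt.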
16.1 VBN Mapping

MAP_VBN makes the FCBs valid, if needed. (READ_WRITEVB obtains the serialization lock on the file.) For an incompletely mapped cathedral window, the file is simply re-mapped. We walk down the FCB chain to find the volume containing the desired blocks. Given fresh FCBs, we try once more to map given the current windows (MAP_WINDOW). If this fails, TURN_WINDOW is called.

TURN_WINDOW contains the gruesome code to update WCBs. The routine handles cases where the file was truncated or extended, where the WCBs describe VBNs before or after the desired area, etc. The new desired window pointers are built in a buffer within the routine. They are copied into the actual WCB at IPL$_SYNCH to synchronize with FDT routines trying to map other virtual requests.

READ_WRITEVB checks to see if this is an operation upon an interesting (storage system) file. If so, cached buffers are killed. For a cluster, the sequence numbers are incremented, invalidating our buffers. The appropriate lock (allocation or serial) is obtained so that the sequence number in the value block is also updated for other nodes to see. If this is not a cluster, the buffers are purged from the cache outright.

Once the block has been mapped, the IRP is re-queued to the driver (REQUEUE_REQ) for I/O.

16.2 Bad Block Processing

In the case of a bad block processing request, the FCB is marked bad (MARKBAD_FCB) and SCAN_BADLOG is called to add the bad blocks to the BADLOG file (in secondary context).

Setting FCB$V_BADBLK will cause DEACCESS to set the BADBLOCK flag in the file header. Likewise, INIT_FCB2 will set FCB$V_BADBLK if it finds FH2$V_BADBLOCK set. Setting FH2$V_BADBLOCK, in turn, causes DELETE_FILE to send the file to the bad block scanner for deletion.
The bad block scanner (SEND_BADSCAN) sends a message through the bad block scanner mailbox (ACP$BADBLOCK_MBX, created by INIT_FCP during SYSINIT) specifying the UCB and FID of the file to be deleted. If this succeeds, a request is made for a process (name BADBLOCK_SCAN, all privileges, UIC [1, 3]) to run BADBLOCK.EXE.

BADBLOCK.EXE is generated from the BADBLK facility. The main processing routine, MAIN_BAD (BADBLK module GETREQ), reads each message from the bad block mailbox. For each, it patches the UCB address in a CCB it holds for the purpose to that of the erring file. The routine SCAN (BADBLK module SCANFILE) scans down the file separating the bad from the good.

SCAN tests each block of the file (user mode I/O inhibiting retries). As it does, it truncates the trailing blocks from the file. If the block is found to be bad, SCAN uses the special MARKBAD truncate option. Either way, all blocks are truncated from the file. The empty file is deleted.

The truncate option FIB$V_MARKBAD causes the specified blocks (only the last cluster) to be sent to DEALLOCATE_BAD. (This requires SYSPRV.) DEALLOCATE_BAD, in secondary context, serializes on the BADBLOCK file. A map pointer is added to the last header to map the bad blocks. The EOF mark and highwater mark are set to include these blocks. SCAN_BADLOG is called to remove the BADLOG entry for these blocks, if one exists.

The bad block scanner will also check the BADLOG file for any references to the file when it is done.

CHAPTER 17 ACPCONTROL FUNCTIONS

All ACP control functions find their way into ACPCONTROL.

17.1 Quota Operations

User invoked quota operations are done by QUOTA_FILE_OP, described elsewhere. The only interesting point is that the block lock is requested for an add quota function. This is done because DISPATCHER does not request the block lock for ACP control functions, since some affect the block lock state. Enabling quota processing on a volume makes a call to CONN_QFILE.
Disabling quota processing is handled by QUOTA_FILE_OP, as are adding, examining, modifying and deleting quota file entries.

17.2 REMAP

The re-map function invokes REMAP_FILE under the file serialization lock.

17.3 LOCK Volume

Lock volume takes the block lock for the volume (TAKE_BLOCK_LOCK) and sets VCB$V_NOALLOC.

17.4 UNLOCK Volume

The unlock volume function clears VCB$V_NOALLOC, establishes the free space figure cluster-wide (refer to the free space allocation lock block field), and de-queues the block lock.

17.5 Force Mount Verification

Mount verification can be requested by a process with SYSPRV. The routine patches on PHY_IO to do an IO$_SHADMV.

17.6 Dismount

The dismount operation flushes all caches and marks the SCB as the volume being dismounted. The interesting aspect of this is that the SCB is possibly written to (asynchronously) by mount verification. The SCB I/O must be re-tried to make sure that a consistent SCB is read/written.

CHAPTER 18 EVENT NOTIFICATION

The file system outputs two sets of messages. A privileged user can request notification of interesting file system events. The system itself requests notification of security relevant events. These two sets of events are reported as follows.

18.1 NOTIFY_USER

The un-documented command WATCH allows a suitably privileged user to ask the file system to notify the user when significant events occur. The list of significant events is stored as bits in the array PIO$GW_DFPROT indexed by the XQP event index. Various places in the file system check their corresponding bit and call NOTIFY_USER to send the user a message.

To be able to capture these messages, it is necessary to perform a normal $PUT on them, which is not allowed in kernel mode. Since the file system can't jump out to exec mode, it is necessary to send these messages by declaring an AST in a higher mode. Since RMS can't be called from exec mode AST level, it is necessary to send them to supervisor mode.
So, NOTIFY_USER builds the message in an area allocated in the CLI data area (CTL$AG_CLIDATA), passing the address as an AST parameter to an AST routine in supervisor mode. This routine is copied into an allocated area in the CLI data area. The address of the routine is stored in XQP$_AST_ROUTINE. The routine itself is NOTIFY_AST, found in XQPMSG.

XQPMSG contains the descriptive messages for the various file system operations. Each is described by a descriptor. Unfortunately, descriptors are not position independent, so it is necessary to fix up the descriptors when the XQP is mapped. This is done by FIXUP_MESSAGES, called from INIT_FCP.

The NOTIFY_AST routine opens, if needed, a FAB/RAB to SYS$OUTPUT (variables XQP$_IFI and XQP$_ISI). The message is $PUT here. The space used by the message is de-allocated in the CLI data area.

18.2 PERFORM_AUDIT

When all file system activity for a request is done, PERFORM_AUDIT is called if necessary. During the course of the request, audit blocks were placed in AUDIT_ARGLIST (by CHECK_PROTECT). These requests are passed to NSA$EVENT_AUDIT one at a time.

The reason why they were deferred until the request was processed is that for each audit entry, a FID_TO_SPEC translation is done. FID_TO_SPEC disturbs other file system operations so much (it releases the primary serialization lock) that it is best to do this last. The exception is that a WRITE_AUDIT call appears in MARK_DELETE, since the file will not exist to be audited after the operation of that routine.

CHAPTER 19 ERROR HANDLING, STATUS AND CLEANUP

As a general rule, the file system modules do not clean up after themselves. An operation performed in secondary context must clean itself up before returning to primary context, but the primary context need not be cleaned up. This is because the dispatcher will invoke a routine that will clean up everything before considering the request finished.
If the dispatcher notices that the operation completed with error (USER_STATUS), ERR_CLEANUP is invoked. If this succeeds, CLEANUP is called, otherwise ERR_CLEANUP is called again. If CLEANUP fails, ERR_CLEANUP is tried again. This is repeated for a very large, but not infinite, number of times before we give up.

Errors can occur at various points in the processing of a request. Some routines return error statuses, which are handled by the calling routine. Some routines signal errors. A fatal error is signaled in such a way that the dispatcher knows to run ERR_CLEANUP. ERR_CLEANUP knows how to clean up secondary context.

19.1 Error Handling

A routine, when detecting an error, can do one of three things. It can return the error as a return status.

Secondly, it can store the error status in the user return status (USER_STATUS) by using ERR_STATUS. (ERR_STATUS only stores the status value if the existing value is success or informational.) This is done for errors detected by main line code that are not fatal but that the user should see. Since invoking ERR_STATUS writes USER_STATUS directly, calling routines can't intercept the error, so only errors the user truly must see should be reported this way.

The third way to report an error is to invoke ERR_EXIT. This macro basically signals the condition value. (It actually performs a CHMU of the argument, which translates into a signal in kernel mode. The macro will perform a return, given what is left in R0, if the handler returns.) DISPATCHER establishes a condition handler (MAIN_HANDLER) which copies the argument into USER_STATUS (again, only if USER_STATUS does not already indicate an error), places USER_STATUS into the value that will be restored into R0, and unwinds to the routine that established the handler. As such, the mainline call to an XQP processing routine will effectively return with the status value passed to ERR_EXIT. The processing routine will be aborted.
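The "first error wins" rule shared by ERR_STATUS and MAIN_HANDLER can be sketched as follows. This is a model under stated assumptions, not the XQP code: a Python exception stands in for the signal raised by ERR_EXIT, a try/except stands in for the handler and unwind, and the two error values are hypothetical. The only VMS convention relied on is that a condition value with the low bit set is success or informational.

```python
SS_NORMAL = 1          # SS$_NORMAL; odd values are success or informational
ERR_FIRST = 0x0802     # hypothetical error status (even, so a failure)
ERR_SECOND = 0x0904    # hypothetical error status (even, so a failure)


def is_success(status: int) -> bool:
    """True for success or informational condition values (low bit set)."""
    return (status & 1) == 1


class XqpSignal(Exception):
    """Stands in for the condition signaled by ERR_EXIT."""

    def __init__(self, status: int) -> None:
        super().__init__(f"status {status:#x}")
        self.status = status


class Request:
    """Holds the first longword of USER_STATUS for one request."""

    def __init__(self) -> None:
        self.user_status = SS_NORMAL

    def err_status(self, status: int) -> None:
        # Keep the first failure: store only over success or informational.
        if is_success(self.user_status):
            self.user_status = status

    def err_exit(self, status: int) -> None:
        # Abort the processing routine; the dispatcher's handler takes over.
        raise XqpSignal(status)


def dispatch(request: Request, routine) -> int:
    """Run a processing routine under a MAIN_HANDLER-like handler."""
    try:
        return routine(request)
    except XqpSignal as signal:
        request.err_status(signal.status)
        return request.user_status


def failing_routine(request: Request) -> int:
    request.err_status(ERR_FIRST)   # non-fatal error the user should see
    request.err_exit(ERR_SECOND)    # fatal error; nothing below runs
    return SS_NORMAL
```

In this model a routine that records ERR_FIRST and then exits with ERR_SECOND leaves ERR_FIRST in USER_STATUS, and the dispatcher sees that value as the routine's result, matching the description of R0 being loaded from USER_STATUS rather than from the signal argument.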
No XQP routines handle the unwind condition.

Various routines will establish their own handler to intercept errors that they feel should not be fatal.

ACL_BUILDACL has a handler that causes it to abort itself, returning the error status, but not to return the error status to USER_STATUS, and not to abort the entire XQP operation.

Both the CREATE routine PROPAGATE_ATTR and its subroutine COPY_INFO have a condition handler that aborts the called routine, returning zero. The actual error is ignored. This allows the secondary context operation to not abort the mainstream of CREATE.

READ_NEW_HEADER, called in CREATE_HEADER to read the new potential header, has a handler to trap disk errors. Surface errors cause the value returned from the READ_BLOCK call to be zero. Other errors cause a re-signalling of the error.

The processing of a delayed truncation by DEACCESS has a handler so that the de-accessing user doesn't have to see such errors. The handler simply aborts the truncate operation.

DELETE establishes a handler around the reading of the header so that it is possible to delete a bad header. The handler aborts the read, returning USER_STATUS to the value prior to the read. Note that the USER_STATUS before the read is saved in SAVE_STATUS, which is an XQP impure area variable. This is so since the handler can't access any local variables in DELETE, and the XQP cannot have any own variables.

Flushing the special caches (FID, quota and extent) is done under a handler that aborts an I/O (simply returning zero from the I/O) to ensure completion even against I/O errors. (Flushing these caches is not essential, but it might as well take its best shot.)

Contiguous file extension has a condition handler enabled around the truncation of the old blocks. Since a good new copy exists at this time, we don't want the new file to be lost by aborting the extension operation. The handler aborts the truncate, without an error.
SHUFFLE_DIR does the same thing when truncating the old directory during an extend.

OPEN_FILE has a handler to clean up the aborted open. The handler always re-signals the error. However, it will examine the state around it to see what it should clean up. If we did not hold a serialization lock on the file (STSFLGS [STS_KEEP_LOCK]), then we clean up the FCBs in the usual way. This is important since OPEN_FILE is always called in secondary context, and we need to clean up after ourselves.

READ_ATTRIB has a condition handler to ignore errors from MAKE_NAMEBLOCK.

19.2 USER_STATUS

As mentioned above, USER_STATUS is the repository for the status from the various routines. Actually, USER_STATUS is a two longword vector that is returned to the user (IRP$L_MEDIA). These two longwords will form the IOSB returned to the user. Various places set the second longword; it is saved and restored across certain operations.

EXTEND sets this second longword to the size extended. EXTEND_INDEX purposely zeros this field since it knows that EXTEND would have set it, and the extension of the index file is not of interest to the user.

For a contiguous extend (ALLOC_BITMAP), this value is the largest contiguous extent size found.

For a spool operation (SEND_SYMBIONT), this value is the job controller error status.

For a file truncation (TRUNCATE), this value is the number of blocks that had to be kept on the file (that is, the number less than what the user requested to truncate) such that the truncated file would have an integral number of clusters.

READ_WRITEVB sets this value to the second word of the I/O status block returned by the I/O.

19.3 Status Flags

The variable STSFLGS contains various status bits set and sensed throughout the XQP. They serve a similar function to the cleanup flags, which indicate certain operations that must be done at the end of an operation.
The status flags are global flags that allow special processing to be requested by a routine (admittedly a bad practice) without having to pass extra arguments to the routine.

19.3.1 STS_LEAVE_FILEHDR

This flag is set for READ_HEADER. When set, READ_HEADER will not set the value of the returned header into FILE_HEADER. SEARCH_QUOTA will set this if it needs to un-stale the quota file FCBs. DIR_ACCESS will also. They do this since these are secondary operations to start with, and the real file header is to be saved. BUILD_EXT_FCBS, when passed the optional primary FCB argument, will also set this flag. This is how its callers keep FILE_HEADER from being set in this way.

19.3.2 STS_DISK_READ

This flag is set by READ_BLOCK to indicate whether a request required a disk read or not. This flag is only used by READ_HEADER. If READ_HEADER requests a file header, and it was in the cache, then either it was validated on its way in (previous READ_HEADER call), or it was internally generated, and clearly validated and checksummed on its way out. The presence of this flag means that READ_HEADER need not validate the returned buffer.

19.3.3 STS_HAD_LOCK and STS_KEEP_LOCK

These flags are convenient for secondary context operations, where the secondary context file might be the same as the primary context file. In such a case, we might already hold a serialization lock when we go for it. SERIAL_FILE sets STS_HAD_LOCK when this was the case. OPEN_FILE will see this flag, and set STS_KEEP_LOCK, for use in CLOSE_FILE. CLOSE_FILE will see STS_KEEP_LOCK set and not release the serialization lock. (It does, however, clear PRIM_LCKINDX. This will keep cleanup from trying to release this lock when cleaning up the secondary operation. Restoring the primary context will restore PRIM_LCKINDX.) FID_TO_SPEC will key off this flag also, to decide when to release locks it finds on the way to the MFD.

19.4 Cleanup

The steps in normal cleanup are as follows.
Leave secondary context if in such. Secondary context is responsible for performing its own normal cleanup. (ERR_CLEANUP will clean up secondary context before leaving secondary context.) Flush the quota cache if so indicated. (CACHEFLUSH was set in the cache header if we failed to get the quota cache lock, which would have been caused by someone write locking the quota file.) For any volume marked as being dismounted or /NOCACHE, flush out the buffer caches. Either way, write out all dirty buffers. (This is done storage map buffers first, to hedge that the storage map is updated before the file headers. This needs to be made deterministic someday.) Invalidate windows if requested. Clean out the directory FCB. As usual, the FCB is saved if there is a directory index block associated with it. If the directory is write accessed, though, we kill any directory buffers and invalidate the directory index block. Generally forget about the directory. Mark the primary FCB stale cluster-wide, if requested. Purge the FCBs unless they should be saved (accessed or directory index block associated).

19.5 Error Cleanup

The purpose of error cleanup is to undo various things done within routines that were not undone because of errors occurring in the routines. The steps in ERR_CLEANUP are basically driven off various cleanup flags and variables set in the processing routines.

19.6 Cleanup Flags and Actions

The various cleanup flags and variables and their meanings and corresponding cleanup actions are listed below.

19.6.1 CLF_CLEANUP

Set by ERR_CLEANUP to indicate that cleanup is in progress. The copy of this bit saved in the context save area is set by SAVE_CONTEXT. Cleared by RESTORE_CONTEXT. Causes REMOVE to not flag a cleanup re-enter of the entry being removed.

19.6.2 CLF_CLOSEFILE

Set by OPEN_FILE once the file is opened. (Prior to this, a handler in OPEN_FILE will close the file.) Cleared by CLOSE_FILE.
Causes the internal file associated with CURRENT_WINDOW to be closed.

19.6.3 CLF_DEACCESS

Set by DEACCESS to cause the de-access to occur. Set by MAKE_ACCESS when the access is complete. Causes the header associated with PRIM_LCKINDX to be de-accessed (MAKE_DEACCESS).

19.6.4 CLF_DEACCQFILE

Set by MAKE_QFCB once CURRENT_VCB[VCB$L_QUOTAFCB] is set. Causes the quota file to be de-accessed.

19.6.5 CLF_DELFILE

Set in CREATE to indicate that a header was created. Set by DEACCESS to request the deletion of a file for which deferred deletion was requested and for which we are the last accessor. Set by SEND_SYMBIONT if the job controller call failed (thereby indicating that the job controller will not delete the file and so we must). Causes the file associated with CURRENT_FIB to be deleted. The directory index block is killed likewise.

19.6.6 CLF_DELWINDOW

Set by ACCESS and CREATE in case its access attempt fails. Set by DEACCESS to cause the de-access to occur. Causes the window blocks linked from CURRENT_WINDOW to be de-allocated.

19.6.7 CLF_DIRECTORY

Set by GET_FIB to indicate that a directory operation (lookup) is necessary. Has no associated cleanup.

19.6.8 CLF_DOSPOOL

Set by DEACCESS. Causes the header associated with PRIM_LCKINDX to be sent to the symbiont.

19.6.9 CLF_FIXFCB

Set by EXTEND_CONTIG when it is decided that the file must truly be extended. Set by EXTEND once an extend has started. Cleared by DEACCESS after performing a truncation requested at de-access. Set by WRITE_ATTRIB. Set by TRUNC. Causes the FCB chain to be re-built.

19.6.10 CLF_FIXLINK, PREV_LINK, PREV_INAME

Set in CREATE when the backlink/name in the header is changed (done when the previous link is zero). Set in MARK_DELETE when the corresponding directory entry is being removed, thereby zeroing the backlink. Causes the backlink and file name in the header associated with PRIM_LCKINDX to be reset.
19.6.11 CLF_FLUSHFID

Causes the FID cache to be flushed. (Currently not used.)

19.6.12 CLF_GRPOWNER

Set by GET_REQUEST. Indicates process has effective GROUP privilege to volume. Has no associated cleanup.

19.6.13 CLF_HDRNOTCHG

Set in CREATE once a header is created and recorded (FILE_HEADER). Cleared when the quota for it has been charged. It is also set by PROPAGATE_ATTR in CREATE to avoid bungling quotas when file owner is different from process UIC. Causes WRITE_ATTRIB, when changing the owner, to not include the blocks taken by the file headers when changing the charging of the blocks. Causes DELETE_FILE to not return the blocks taken by the headers to the quota account.

19.6.14 CLF_INCOMPLETE

Set by TURN_WINDOW in various cases. Causes (in REMAP_FILE) the CURRENT_WINDOW to be marked incomplete.

19.6.15 CLF_INVWINDOW

Set by TRUNC. Causes the windows associated with PRIMARY_FCB to be invalidated.

19.6.16 CLF_MARKFCBSTALE

Set by EXTEND. Set by WRITE_ATTRIB for the following attributes: protected, UIC, access class, file protection, ACL. Causes the FCBs associated with PRIMARY_FCB to be marked stale cluster wide.

19.6.17 CLF_NOBUILD

Set by UPDATE_FCB when it performs its FILL_FCB. Set by EXTEND_HEADER when it performs its INIT_FCB2. Set by SHUFFLE_DIR before generating new FCBs for the shuffled directory. Causes INIT_FCB2 to not build an ACL segment. Cleared when INIT_FCB2 completes the FCB update.

19.6.18 CLF_NOTCHARGED

Set by EXTEND once an extend has started but before the blocks are charged. Cleared once the blocks are charged. Causes DELETE_FILE to not return the blocks taken by a file to the quota account. Checked by TRUNC to see if the blocks should be uncharged.

19.6.19 CLF_PFCB_REF_UP

Set by FID_TO_SPEC to indicate that it has incremented the reference count for PRIMARY_FCB to hold the FCB while it chases back links. Causes the reference count to be decremented.
19.6.20 CLF_REENTER, PREV_NAME, PREV_VERSION

Cleared in MARK_DELETE once deletion is committed. Set in REMOVE if CLF_CLEANUP is not set. Causes a MAKE_ENTRY of the old name entry. With CLF_SUPERSEDE on, this causes the re-enter of SUPER_FID.

19.6.21 CLF_REMAP

Causes the file to be re-mapped. (Not currently used.)

19.6.22 CLF_REMOVE

Set by ENTER once the entry has been recorded. Causes the removal of the entry indicated by the directory context area.

19.6.23 CLF_SPOOLFILE

Set by GET_REQUEST. Indicates a spoolfile operation. Has no associated cleanup.

19.6.24 CLF_SUPERSEDE

Set by ENTER when the enter causes an entry to be superseded. CREATE will check this flag and delete a file removed during its enter operation. Causes the directory record to be fixed up with the SUPER_FID file ID superseded.

19.6.25 CLF_SYSPRV

Set by GET_REQUEST. Indicates that the user has effective SYSPRV privilege on the volume. Has no associated cleanup.

19.6.26 CLF_TRUNCATE

Set by EXTEND once an extend has started. Causes the CURRENT_FIB file to be truncated back.

19.6.27 CLF_VOLOWNER

Set by GET_REQUEST. Indicates that the process has effective ownership of the volume. Has no associated cleanup.

19.6.28 CLF_ZCHANNEL

Set by ACCESS and CREATE in case its access attempt fails. Set by DEACCESS to cause the de-access to occur. Causes the window pointer being returned to the user to be zeroed, and the user to be credited for closing a file (JIB$W_FILCNT).

19.6.29 NEW_FID, NEW_FID_RVN

Set by CREATE_HEADER to indicate a FID that was created. Set by DELETE_FILE to request deletion of the FID. NEW_FID is zeroed in CREATE once the FID is recorded (FILE_HEADER) and in DELETE_FILE once the FID is deleted. Cleared in EXTEND_HEADER when the header extension is complete. Causes the specified FID to be deleted.
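The flag-driven style of error cleanup described in 19.5 and in the entries above can be sketched with a small model. It is an illustration only: just four of the flags are modeled, the order in which the actions run is an assumption made for this sketch (the text does not give ERR_CLEANUP's order), and each action is reduced to a log entry paraphrasing the corresponding "Causes ..." sentence.

```python
from enum import Flag, auto
from typing import List


class CleanupFlag(Flag):
    """A few cleanup flags, modeled on the CLF_ bits described above."""
    TRUNCATE = auto()   # CLF_TRUNCATE
    REMOVE = auto()     # CLF_REMOVE
    DELFILE = auto()    # CLF_DELFILE
    ZCHANNEL = auto()   # CLF_ZCHANNEL


# (flag, action) pairs; the ordering here is assumed, not taken from the XQP.
ACTIONS = [
    (CleanupFlag.TRUNCATE, "truncate the CURRENT_FIB file back"),
    (CleanupFlag.REMOVE, "remove the directory entry"),
    (CleanupFlag.DELFILE, "delete the CURRENT_FIB file"),
    (CleanupFlag.ZCHANNEL, "zero the returned window pointer"),
]


def err_cleanup(flags: CleanupFlag, log: List[str]) -> CleanupFlag:
    """Perform the action for every flag that is set, clearing it as we go."""
    for flag, action in ACTIONS:
        if flag in flags:
            log.append(action)
            flags &= ~flag
    return flags


# An extend that had entered a directory record before failing.
log: List[str] = []
remaining = err_cleanup(CleanupFlag.TRUNCATE | CleanupFlag.REMOVE, log)
```

The appeal of the design is that a processing routine never needs its own undo path: it sets a flag as soon as it has done something that would need undoing, and a single routine at the end of the request does whatever the flags call for.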
19.6.30 UNREC_COUNT, UNREC_RVN, UNREC_LBN

Set by SELECT_VOLUME when asked to allocate a certain number of blocks on some volume. These blocks will be picked up by EXTEND. Set by EXTEND_CONTIG when a new extent into which to copy the file has been allocated but not recorded in the map pointers. Likewise set by SHUFFLE_DIR. Cleared once the copy has been done and the header updated. Set in EXTEND during a header extension to record blocks allocated that have yet to be added to the header since we must extend it. Cleared in EXTEND once the blocks are returned if we failed to add them, or when we succeed in adding them to the header. Causes the specified blocks to be returned to the storage map.

CHAPTER 20 XQP STORAGE AREA

The breakdown, and usage, of the XQP storage area follows.

20.1 IO_CCB (non-impure)

REF BBLOCK: CCB for IO_CHANNEL, created by INIT_FCP. IO_CCB$L_UCB is set to CURRENT_UCB by GET_REQUEST and to the new UCB by SWITCH_VOLUME. IO_CCB$L_UCB is used to refer to the desired UCB by WRITE_BLOCK (since buffer writes due to LRU replacement may be to other than CURRENT_UCB).

20.2 IO_CHANNEL (non-impure)

LONG: channel assigned by INIT_FCP. Used for forcing mount verification on shadow sets, issuing an unload/available function at volume dismount, erasing blocks of the index file when extending EOF, reading and writing random blocks (READ_BLOCK, WRITE_BLOCK), erasing blocks for highwater and erase on return processing.
20.3 BLOCK_LOCKID (non-impure)

LONG: lock id of activity blocking lock held by this process (refer to activity blocking)

20.4 USER_STATUS

VECTOR [2]: XQP status to be returned to user (refer to error processing)

20.5 IO_STATUS

VECTOR [2]: status block for XQP I/O

20.6 IO_PACKET

REF BBLOCK: address of current I/O request packet, set in DISPATCHER

20.7 CURRENT_UCB

REF BBLOCK: address of UCB of current request, set in GET_REQUEST and SWITCH_VOLUME

20.8 CURRENT_VCB

REF BBLOCK: address of VCB of current request, set in GET_REQUEST and SWITCH_VOLUME

20.9 CURRENT_RVT

REF BBLOCK: RVT of current volume set, or UCB, set in GET_REQUEST

20.10 CURRENT_RVN

LONG: RVN of current volume, set in GET_REQUEST and SWITCH_VOLUME. This value drives APPLY_RVN.

20.11 SAVE_VC_FLAGS

WORD: save volume context flags. These flag bits belong to the allocation lock value block. They contain the quota file buffer sequence number in bits 1 to 15.

20.12 STSFLGS

BITVECTOR [8]: various internal status flags (refer to status flags)

20.13 BLOCK_CHECK

BYTE: make operation blocking check (refer to basic request flow)

20.14 NEW_FID

LONG: file number of unrecorded file ID (refer to cleanup processing)

20.15 NEW_FID_RVN

LONG: RVN of NEW_FID (refer to cleanup processing)

20.16 HEADER_LBN

LONG: LBN of last file header read (CREATE_HEADER, READ_HEADER). This value is placed into FCB$L_HDLBN by FILL_FCB. The setting of this value by READ_HEADER is another reason why headers often need to be re-read after various operations, especially secondary operations such as bad block processing.

20.17 BITMAP_VBN

LONG: VBN of current storage map block. This value is used along with BITMAP_RVN to determine the validity of BITMAP_BUFFER. This value is cleared when the allocation lock is released, since we'd better not have a bitmap buffer active at this time. An INVALIDATE of the BITMAP_BUFFER will also clear this.
20.18 BITMAP_RVN

LONG: RVN of current storage map block (BITMAP_BUFFER).

20.19 BITMAP_BUFFER

REF BBLOCK: address of current storage map block. This value is used as an optimization in ALLOC_BLOCKS to decide if it needs to read a storage map block. The validity of BITMAP_BUFFER is decided by a non-zero value in BITMAP_VBN.

20.20 SAVE_STATUS

LONG: saved status during CREATE's attribute copy, READ_IDX_HEADER. In DELETE, it is used to restore the old USER_STATUS if the delete fails, so as to ignore the delete of a bad header.

20.21 PRIVS_USED

BBLOCK [4]: Privileges used to gain access. This bit array is maintained by CHECK_PROTECT. This value can be returned as a read attribute.

20.22 ACB_ADDR

REF BBLOCK: address of ACB for cross process ASTs, set in READ_BLOCK to the CDRP portion of the IO_PACKET.

20.23 BFR_LIST

BLOCKVECTOR [4,8,BYTE]: listheads for in-process buffers (refer to buffer management)

20.24 BFR_CREDITS

VECTOR [4,WORD]: buffers credited to this process (refer to buffer management)

20.25 BFRS_USED

VECTOR [4,WORD]: buffers actually in-process (refer to buffer management)

20.26 CACHE_HDR

REF BBLOCK: Address of buffer cache header, set by GET_REQD_BFR_CREDITS.

20.27 CLEANUP_FLAGS (save context area)

BITVECTOR [32]: cleanup action flags (refer to cleanup processing)

20.28 FILE_HEADER (save context area)

REF BBLOCK: address of current file header, set by CREATE, CREATE_HEADER. Mainly set by READ_HEADER, unless STS_NOUPDHDR is set. EXTEND_HEADER sets this to the new extension header. DELETE_FILE zeros FILE_HEADER when it writes out the deleted header.

20.29 PRIMARY_FCB (save context area)

REF BBLOCK: address of primary file FCB, set by GET_REQUEST. Also set by ACCESS, CREATE, MARK_DELETE, EXTEND_CONTIG, EXTEND_INDEX, OPEN_FILE (cleared by CLOSE_FILE), MODIFY, DEACC_QFILE, CONN_QFILE, SHUFFLE_DIR. Cleared by MARK_DELETE when we are done deleting the file.
Cleared by GET_FIB, ACCESS, MODIFY when the FID in the user's FIB does not match that of the FCB associated with the channel (okay for access if only a find was desired).

20.30 CURRENT_WINDOW (save context area)

REF BBLOCK: address of file window, set by GET_REQUEST. Also set by ACCESS, CREATE, EXTEND_INDEX, OPEN_FILE (cleared by CLOSE_FILE). Cleared by GET_FIB, ACCESS, DELETE, MODIFY when the FID in the user's FIB does not match that of the FCB associated with the channel.

20.31 CURRENT_FIB (save context area)

REF BBLOCK: pointer to FIB currently in use, set to LOCAL_FIB by GET_FIB and GET_REQUEST. Set to SECOND_FIB by SAVE_CONTEXT (LOCAL_FIB is not in the save context area).

20.32 CURR_LCKINDX (save context area)

LONG: Current file header lock index (refer to serialization of activity).

20.33 PRIM_LCKINDX (save context area)

LONG: Primary file lock basis index (refer to serialization of activity).

20.34 LOC_RVN (save context area)

LONG: RVN specified by placement data, set by GET_LOC.

20.35 LOC_LBN (save context area)

LONG: LBN specified by placement data, set by GET_LOC.

20.36 UNREC_LBN (save context area)

LONG: start LBN of unrecorded blocks (refer to cleanup processing).

20.37 UNREC_COUNT (save context area)

LONG: count of unrecorded blocks (refer to cleanup processing).

20.38 UNREC_RVN (save context area)

LONG: RVN containing unrecorded blocks (refer to cleanup processing).

20.39 PREV_LINK (save context area)

BBLOCK [FID$C_LENGTH]: old back link of file (refer to cleanup processing).

20.40 CONTEXT_SAVE

VECTOR [CONTEXT_SIZE, BYTE]: area to save primary context

20.41 LB_LOCKID

VECTOR [LB_NUM]: serial lock ids (refer to serialization of activity).

20.42 LB_BASIS

VECTOR [LB_NUM]: lock name bases (refer to serialization of activity).
20.43 LB_HDRSEQ

VECTOR [LB_NUM]: file header cache sequence numbers (refer to buffer management)

20.44 LB_DATASEQ

VECTOR [LB_NUM]: file data block cache sequence number (refer to buffer management)

20.45 LB_FILESIZE

VECTOR [LB_NUM]: value block file size (refer to buffer management)

20.46 DIR_FCB

REF BBLOCK: FCB of directory file, set in DIR_ACCESS. Cleared in DELETE if we are deleting the directory itself.

20.47 DIR_LCKINDX

LONG: Directory lock basis index (refer to serialization of activity)

20.48 DIR_RECORD

LONG: record number of found directory entry within the block. Maintained by DIR_SCAN and FIND. Zeroed before an ENTER operation. DIR_RECORD + 1 becomes the low order 6 bits of the wild card context (FIB$L_WCC) returned to the user.

20.49 DIR_CONTEXT

BBLOCK [DCX_LENGTH]: current directory context. The directory context is saved within ENTER when it is necessary to do another DIR_SCAN, to find the lowest entry to remove. Restored (by RESTORE_DIR) when a directory operation is to be done at cleanup time.

20.50 DIR_VBN (directory context)

LONG: VBN of DIR_BUFFER. DIR_VBN - 1 forms the high 10 bits of the wild card context (FIB$L_WCC) returned to the user.

20.51 DIR_BUFFER (directory context)

REF BBLOCK: pointer to current directory block.

20.52 DIR_ENTRY (directory context)

REF BBLOCK: pointer to current directory entry. A non-zero value indicates the presence of a directory entry/block/version.

20.53 DIR_VERSION (directory context)

REF BBLOCK: pointer to current directory version entry.
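The wild card context built from DIR_RECORD and DIR_VBN (20.48 and 20.50 above) can be illustrated with a short packing sketch. The exact bit positions within FIB$L_WCC are not stated in the text; this sketch assumes a 16-bit field in which the 10-bit block field sits directly above the 6-bit record field. Both that layout and the function names are assumptions made for illustration.

```python
RECORD_BITS = 6    # low order bits: DIR_RECORD + 1
VBN_BITS = 10      # bits above them: DIR_VBN - 1 (assumed placement)

RECORD_MASK = (1 << RECORD_BITS) - 1
VBN_MASK = (1 << VBN_BITS) - 1


def pack_wcc(dir_vbn: int, dir_record: int) -> int:
    """Build a wild card context from a directory VBN and record number."""
    record_field = (dir_record + 1) & RECORD_MASK
    vbn_field = (dir_vbn - 1) & VBN_MASK
    return (vbn_field << RECORD_BITS) | record_field


def unpack_wcc(wcc: int) -> tuple:
    """Recover (dir_vbn, dir_record) from a context built by pack_wcc."""
    record_field = wcc & RECORD_MASK
    vbn_field = (wcc >> RECORD_BITS) & VBN_MASK
    return vbn_field + 1, record_field - 1
```

Storing DIR_RECORD + 1 rather than DIR_RECORD means that a context of zero never describes a real entry, which would be consistent with a zero context meaning "start from the beginning"; that reading is an inference from the offsets, not something the text states.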
20.54 DIR_END (directory context)

REF BBLOCK: pointer to end of directory entries

20.55 DIR_PRED (directory context)

REF BBLOCK: pointer to record preceding record found

20.56 VERSION_LIMIT (directory context)

WORD: version limit of current entry

20.57 VERSION_COUNT (directory context)

WORD: number of versions found

20.58 LAST_ENTRY (directory context)

VECTOR [,BYTE]: name string of last record in previous block (counted string)

20.59 OLD_VERSION_FID

BBLOCK [FID$C_LENGTH]: Old version's FID, set by DIR_SCAN

20.60 PREV_VERSION

LONG: version number of previous directory entry, used for re-enter or un-supersede cleanup.

20.61 PREV_NAME

VECTOR [FILENAME_LENGTH+1, BYTE]: name of previous entry (counted string) used for re-enter cleanup.

20.62 PREV_INAME

VECTOR [FILENAME_LENGTH+6, BYTE]: previous internal file name (from file header) used for backlink/name cleanup (rename function).

20.63 SUPER_FID

BBLOCK [FID$C_LENGTH]: file ID of superseded file. Re-entered by ERR_CLEANUP if necessary.

20.64 LOCAL_FIB

BBLOCK [FIB$C_LENGTH]: primary FIB of this operation (see CURRENT_FIB)

20.65 SECOND_FIB

BBLOCK [FIB$C_LENGTH]: FIB for secondary file operation (see CURRENT_FIB)

20.66 LOCAL_ARB

BBLOCK [ARB$C_HEADER]: local copy of caller's ARB

20.67 QUOTA_RECORD

LONG: record number of quota file entry, returned as wild-card context to user.

20.68 FREE_QUOTA

LONG: record number of free quota file entry

20.69 REAL_Q_REC

REF BBLOCK: buffer address of quota record read

20.70 QUOTA_INDEX

LONG: cache index of cache entry found

20.71 DUMMY_REC

BBLOCK [DQF$C_LENGTH]: dummy quota record for cache contents. Special cased in WRITE_QUOTA to mean that the quota record pointer does not point into a cache buffer.
20.72 AUDIT_COUNT

LONG: number of argument lists in AUDIT_ARGLIST

20.73 MATCHING_ACE (non-impure)

BBLOCK [ATR$S_READACL]: Matching ACE storage, set by CHECK_PROTECT to the ACE upon which the access check matched, returnable via READ_ATTRIB.

20.74 FILE_SPEC_LEN (non-impure)

VECTOR [1, WORD]: current length of FULL_FILE_SPEC

20.75 FULL_FILE_SPEC (non-impure)

VECTOR [1022, BYTE]: storage area to hold output of FID_TO_SPEC, used by WRITE_AUDIT and READ_ATTRIB.

20.76 PMS Metering Cells (non-impure)

LONG: used to record total disk reads, total disk writes, total cache reads, number of reads/writes/cache reads/CPU/page faults for the function and sub-function

20.77 AUDIT_ARGLIST (non-impure)

BBLOCK [AUDIT_LENGTH*MAX_AUDIT_COUNT]: used to accumulate audit records

CHAPTER 21 ROUTINE LIST

ACCESS, facility F11X module ACCESS
  calling sequence:
  main driver for access function

ACL_ACLLENGTH, facility F11X module ACLSUBR
  calling sequence: (ACL_QUEUE_HEAD, ACL_CONTEXT, COUNT, LENGTH)
  determine total ACL length

ACL_ADDENTRY, facility F11X module ACLSUBR
  calling sequence: (ACL_QUEUE_HEAD, ACL_CONTEXT, LENGTH, ACE_BUFFER)
  add ACE to ACL

ACL_BUILDACL, facility F11X module ACLCNTRL
  calling sequence: (FIRST_FCB)
  copy the memory ACL to the file header(s)

ACL_COPYACL, facility F11X module ACLCNTRL
  calling sequence: (OLD_FILE_FCB, NEW_FILE_FCB, OPTION)
  copy ACEs from one file to another

ACL_DELENTRY, facility F11X module ACLSUBR
  calling sequence: (ACL_QUEUE_HEAD, ACL_CONTEXT, COUNT, ACE)
  delete ACE from ACL

ACL_DELETEACL, facility F11X module ACLSUBR
  calling sequence: (ACL_QUEUE_HEAD, ACL_CONTEXT)
  delete whole ACL

ACL_DISPATCH, facility F11X module ACLCNTRL
  calling sequence: (CODE, ADDRESS, COUNT, ACE)
  dispatch on ACL request to ACL utilities

ACL_FINDENTRY, facility F11X module ACLSUBR
  calling sequence: (ACL_QUEUE_HEAD, ACL_CONTEXT, COUNT, ACE, INTERNAL)
  find ACE

ACL_FINDTYPE, facility F11X module ACLSUBR
  calling sequence: (ACL_QUEUE_HEAD,
ACL_CONTEXT, COUNT, ACE, INTERNAL) find specific type of ACE
ACL_INIT_QUEUE, facility F11X module ACLSUBR calling sequence: (ORB_ADDRESS) initialize ACL as a mutex protected list
ACL_LOCATEACE, facility F11X module ACLSUBR calling sequence: (ACL_QUEUE_HEAD, ACE_INDEX, ACL_POINTER, ACL_SPLIT) locate ACE by context
ACL_MODENTRY, facility F11X module ACLSUBR calling sequence: (ACL_QUEUE_HEAD, ACL_CONTEXT, COUNT, ACE) modify ACE
ACL_READACE, facility F11X module ACLSUBR calling sequence: (ACL_QUEUE_HEAD, ACL_CONTEXT, COUNT, ACE) read ACE
ACL_READACL, facility F11X module ACLSUBR calling sequence: (ACL_QUEUE_HEAD, ACL_CONTEXT, LENGTH, ACE_BUFFER) read as much of ACL as possible
ACPCONTROL, facility F11X module ACPCNTRL calling sequence: main dispatch for ACP control functions
ALLOCATE, facility F11X module ALLOCB calling sequence: allocate unpaged memory
ALLOCATION_LOCK, facility F11X module LOCKERS calling sequence: NOVALUE acquire index file/storage map allocation lock
ALLOCATION_UNLOCK, facility F11X module LOCKERS calling sequence: NOVALUE release allocation lock
ALLOC_BLOCKS, facility F11X module SMALOC calling sequence: (FIB, BLOCKS_NEEDED, START_LBN, BLOCKS_ALLOC) allocate a contiguous set of blocks
ALLOC_PAGED, facility F11X module ALLOCB calling sequence: allocate paged memory
ARBITRATE_ACCESS, facility F11X module LOCKERS calling sequence: (ACCTL, FCB) check, obtain access lock
BLOCK_WAIT, facility F11X module LOCKERS calling sequence: NOVALUE wait for volume blocking lock to be released
BUILD_EXT_FCBS, facility F11X module EXTFCB calling sequence: (PRIMHDR, PFCB) : NOVALUE build the extension FCB chain
CACHE_LOCK, facility F11X module LOCKERS calling sequence: (LOCK_BASIS, LOCK_ID, MODE) acquire cache lock, possibly flushing caches on other nodes
CACHE_SERVER, facility F11X module FILESERV calling sequence: run in cache server process to call ACPCONTROL functions when requested
CHANGE_OWNER, facility F11X module RWATTR calling sequence:
(UIC, ORG_FCB, ORG_HEADER) change owner fields in header chain, change quota
CHARGE_QUOTA, facility F11X module CHARGEQ calling sequence: (UIC, BLOCK_COUNT, FLAGS) : NOVALUE charge/credit quota to user
CHECKSUM, facility F11X module CHKSUM calling sequence: checksum file header
CHECK_DISMOUNT, facility F11X module CHKDMO calling sequence: NOVALUE check for device now idle, dismount if requested and idle
CHECK_HEADER2, facility F11X module CHKHD2 calling sequence: (HEADER, FILE_ID, HEADER_STATUS) validate file header
CHECK_PROTECT, facility F11X module CHKPRO calling sequence: (ACCESS, HEADER, FCB, ACMODE, ALT_ACCESS, REQUIRED) check user's right to access a file
CLEANUP, facility F11X module CLENUP calling sequence: cleanup after an operation
CLEAN_QUO_CACHE, facility F11X module CHARGEQ calling sequence: (J, Q_RECORD) : NOVALUE mark quota record dirty, clean up cache entry
CLOSE_FILE, facility F11X module FILUTL calling sequence: (WINDOW) : NOVALUE close internally opened file
CONN_QFILE, facility F11X module QUOTAUTIL calling sequence: (ABD, FIB) : NOVALUE open and connect quota file
CONTINUE_THREAD, facility F11X module DISPATCH calling sequence: AST routine to continue operation upon event completion
CONV_ACCLOCK, facility F11X module LOCKERS calling sequence: (LCKMODEARG, FCBARG) convert the access lock for an FCB to exclude an accessor
COPY_NAME, facility F11X module CPYNAM calling sequence: (ABD) copy name from buffer descriptor to result string
CREATE, facility F11X module CREATE calling sequence: main driver for the create function
CREATE_BLOCK, facility F11X module RDBLOK calling sequence: (LBN, COUNT, TYPE) create a zeroed cache block (do not read the block)
CREATE_FCB, facility F11X module CREFCB calling sequence: (HEADER, PRIMFCB) allocate, initialize FCB, add to VCB
CREATE_HEADER, facility F11X module CREHDR calling sequence: (FILE_ID) find a new file id
CREATE_WINDOW, facility F11X module CREWIN calling sequence: (ACCTL, SIZE,
HEADER, PID, FCB) create a window
DALLOC_PAGED, facility F11X module ALLOCB calling sequence: de-allocate paged memory
DEACCESS, facility F11X module DEACCS calling sequence: main driver for the de-access function
DEACC_QFILE, facility F11X module QUOTAUTIL calling sequence: de-access quota file, remove cache
DEALLOCATE, facility F11X module ALLOCB calling sequence: de-allocate unpaged memory
DEALLOCATE_BAD, facility F11X module DELBAD calling sequence: (FIB, FILE_HDR, POINTER, LAST_COUNT) : NOVALUE remove blocks from file into BADBLOCK file
DELETE, facility F11X module DELETE calling sequence: main driver for delete function
DELETE_FID, facility F11X module DELFIL calling sequence: (FILENUM) : NOVALUE release the file id to FID cache/index file map, possibly flush cache
DELETE_FILE, facility F11X module DELFIL calling sequence: (FIB, FILEHEADER) : NOVALUE delete contents of file and header
DEL_EXTFCB, facility F11X module CLENUP calling sequence: (START_FCB) delete extension FCBs, decrement VCB transaction counts
DEQ_LOCK, facility F11X module LOCKERS calling sequence: (LOCK_ID) : NOVALUE de-queue random lock
DIR_ACCESS, facility F11X module DIRACC calling sequence: (FIB, WRITE) : NOVALUE access a directory
DIR_SCAN, facility F11X module DIRSCN calling sequence: (NAME_DESC, FILE_ID, START_BLOCK, START_REC, START_VER, START_PRED, REC_COUNT) look up a name/FID in a directory
DISPATCH, facility F11X module DISPATCH calling sequence: dispatch from XQP queue
DISPATCHER, facility F11X module DISPAT calling sequence: NOVALUE main request dispatching
ENTER, facility F11X module ENTER calling sequence: (ABD, FIB, RESULT_LENGTH, RESULT) : NOVALUE main driver for enter function
ERASE_BLOCKS, facility F11X module ERASE calling sequence: (START_LBN, BLOCK_COUNT, CHANNEL) do logical I/O to erase contiguous blocks of file
ERR_CLEANUP, facility F11X module CLENUP calling sequence: cleanup after an aborted operation
EXTEND, facility F11X module
EXTEND calling sequence: (USER_FIB, FILEHEADER) : NOVALUE extend a file, possibly also header
EXTEND_CONTIG, facility F11X module EXTCONTIG calling sequence: (FIB, FCB, SIZE) extend contiguous file (by copying blocks)
EXTEND_HEADER, facility F11X module EXTHDR calling sequence: (FIB, OLD_HEADER, FCB, NEW_VOLUME, BLOCKS_NEEDED) create an extension header
EXTEND_INDEX, facility F11X module EXTIDX calling sequence: NOVALUE extend the index file
FID_TO_SPEC, facility F11X module RWATTR calling sequence: (HEADER) : NOVALUE translate FID to filespec via backlinks
FILE_SIZE, facility F11X module FILESIZE calling sequence: (HEADER) find file size from a given header
FILL_FCB, facility F11X module INIFC2 calling sequence: (FCBARG, HDRARG, PRIMFCB) : NOVALUE fill in FCB from file header
FIND, facility F11X module FIND calling sequence: (ABD, FIB, FIND_MODE, RESLEN_ARG, RESULT_ARG) : NOVALUE locate, operate upon directory entry
FINISH_REQUEST, facility F11X module DISPATCH calling sequence: lower volume activity count, possibly release block lock
FIXUP_MESSAGES, facility F11X module XQPMSG calling sequence: fix up message descriptors when XQP mapped
FLUSH_QUO_CACHE, facility F11X module QUOTAUTIL calling sequence: NOVALUE flush dirty entries to quota file, release cache
FMG$MATCH_NAME, facility F11X module MATCHNAME calling sequence: wildcard matching
GET_FIB, facility F11X module GETFIB calling sequence: (ABD) copy user FIB, set default values
GET_LOC, facility F11X module GETLOC calling sequence: (FIB, LOCRVN, LOCLBN) : NOVALUE find desired LBN/RVN for file placement
GET_LOC_ATTR, facility F11X module GTLCAT calling sequence: (ABD, FIB) : NOVALUE convert compatibility mode placement data into FIB format
GET_MAP_POINTER, facility F11X module GETPTR calling sequence: NOVALUE decode file map pointer
GET_QUOTA_LOCK, facility F11X module CHARGEQ calling sequence: (J, MODE) : NOVALUE lock quota cache entry (cluster only)
GET_REQD_BFR_CREDITS, facility F11X
module RDBLOK calling sequence: NOVALUE reserve the minimum required cache buffers
GET_REQUEST, facility F11X module GETREQ calling sequence: get request from XQP queue
GET_TIME, facility F11X module GETTIM calling sequence: (BUFFER, TIME) : NOVALUE ODS-1 conversion
INITXQP, facility F11X module DISPATCH calling sequence: initialize XQP
INIT_FCB2, facility F11X module INIFC2 calling sequence: (FCBARG, HEADER, PRIMFCB) : NOVALUE fill in FCB and filesize
INIT_FCP, facility F11X module INIFCP calling sequence: create impure storage area
INIT_FID_CACHE, facility F11X module CREHDR calling sequence: (CACHE) : NOVALUE set FID cache valid
INVALIDATE, facility F11X module RDBLOK calling sequence: (BUFFER) : NOVALUE invalidate a cache buffer's contents
IOC$BUFPOST, facility SYS module IOCIOPOST calling sequence: post completion of buffered I/O
IOC$DALLOC_DMT, facility SYS module IOSUBPAGD calling sequence: de-allocate device on dismount
IOC$MAPVBLK, facility SYS module IOSUBRAMS calling sequence: map using window chain
IO_DONE, facility F11X module IODONE calling sequence: post QIO completion directly
KILL_BUFFERS, facility F11X module RDBLOK calling sequence: (POOL, LOCKBASIS) : NOVALUE kill buffers associated with CURRENT_UCB (and a lockbasis)
KILL_CACHE, facility F11X module RDBLOK calling sequence: (UCB) : NOVALUE toss out buffers associated with a UCB
KILL_DINDX, facility F11X module RDBLOK calling sequence: (FCB) : NOVALUE invalidate a directory index block
LOCK_COUNT, facility F11X module LOCKERS calling sequence: (LOCKID) return number of lockers for a LOCKID
LOCK_IODB, facility F11X module LOCKDB calling sequence: lock I/O data base mutex
LOCK_MODE, facility F11X module LOCKERS calling sequence: (ACCTL) compute access lock mode for request
MAKE_ACCESS, facility F11X module MAKACC calling sequence: (FCB, WINDOW, ABD) : NOVALUE hook up windows to FCB, update VCB fields for new file access
MAKE_DIRINDX, facility F11X module RDBLOK calling sequence:
(FCB) find/validate a directory index (cache) block
MAKE_ENTRY, facility F11X module ENTER calling sequence: (NAME_DESC, FIB) : NOVALUE actually make the directory entry
MAKE_FCB_STALE, facility F11X module LOCKERS calling sequence: (FCBARG) : NOVALUE make all nodes mark the given FCB stale
MAKE_NAMEBLOCK, facility F11X module MAKNMB calling sequence: (LENGTH, STRING, NAMEBLOCK) : NOVALUE ODS-1 conversion
MAKE_POINTER, facility F11X module MAKPTR calling sequence: (COUNT, LBN, FILE_HEADER, PLACEMENT_CODE) encode map pointer
MAKE_STRING, facility F11X module MAKSTR calling sequence: (NAMEBLOCK, STRING) ODS-1 conversion
MAP_IDX, facility F11X module CREHDR calling sequence: (VBN, COUNT) map block of index file
MAP_VBN, facility F11X module MAPVBN calling sequence: (VBN, WINDOW, BLOCK_COUNT, UNMAPPED_BLOCKS) map VBN to LBN
MAP_WINDOW, facility F11X module MPWIND calling sequence: caller for IOC$MAPVBLK
MARKDEL_FCB, facility F11X module DELETE calling sequence: (FCB) mark the FCB as delete pending (cluster wide)
MARK_COMPLETE, facility F11X module WITURN calling sequence: (WINDOW) : NOVALUE mark a window chain as totally mapped
MARK_DELETE, facility F11X module DELETE calling sequence: (FIB, DO_DELETE, RESULT_LENGTH, RESULT) : NOVALUE mark file for deletion, delete if possible
MARK_DIRTY, facility F11X module RDBLOK calling sequence: (BUFFER) : NOVALUE mark a cache buffer as modified
MARK_INCOMPLETE, facility F11X module WITURN calling sequence: (FIRST_BLOCK) : NOVALUE mark a window chain as not totally mapped
MODIFY, facility F11X module MODIFY calling sequence: main driver for modify (extend/truncate) function
MOUNT, facility F11X module MOUNT calling sequence: mark UCB as mounted
NEXT_DIR_REC, facility F11X module DIRSCN calling sequence: (OLD_REC, VBN) advance to next directory record (and block) for the same name
NEXT_HEADER, facility F11X module NXTHDR calling sequence: (HEADER, FCB, EXT_FID, SEGNUM) read the next extension
header
NEXT_REC, facility F11X module DIRSCN calling sequence: (ENTRY) find the next directory entry within the current block
NOTIFY_AST, facility F11X module XQPMSG calling sequence: supervisor mode routine to notify user
NOTIFY_USER, facility F11X module DISPAT calling sequence: (CONTROL_STRING, FAO_ARGS) : NOVALUE sends message to user (via supervisor mode AST routine)
NUKE_HEAD_FCB, facility F11X module CLENUP calling sequence: (FCB) : NOVALUE remove FCB from volume list, delete all appendages and locks
OPEN_FILE, facility F11X module FILUTL calling sequence: (FID, WRITE) open a file for internal use
PARSE_NAME, facility F11X module PARSNM calling sequence: (NAME_DESC, NAME_BUFFER, COUNT, STRING, FLAGS) : NOVALUE convert file name to name block
PMS_END, facility F11X module PMS calling sequence: NOVALUE end metering main function
PMS_END_SUB, facility F11X module PMS calling sequence: NOVALUE end metering sub-function
PMS_START, facility F11X module PMS calling sequence: NOVALUE start metering main function
PMS_START_SUB, facility F11X module PMS calling sequence: (INDEX) : NOVALUE start metering sub-function
PURGE_EXTENT, facility F11X module SMALOC calling sequence: (ENTRY_COUNT, CACHE_LIMIT) : NOVALUE purge entries from extent cache, return to BITMAP
QEX_N_CANCEL, facility F11X module LOCKERS calling sequence: (LOCKID) jiggle locks to fire blocking AST
QUOTA_FILE_OP, facility F11X module QUOTAUTIL calling sequence: (ABD, FIB) : NOVALUE basic dispatching for quota file operations
READ_ATTRIB, facility F11X module RWATTR calling sequence: (HEADER, ABD) read user requested attributes
READ_BLOCK, facility F11X module RDBLOK calling sequence: (LBN, COUNT, TYPE) read a block (and possibly some more)
READ_DATA, facility F11X module FILUTL calling sequence: (WINDOW, VBN, COUNT) read data from internal file
READ_HEADER, facility F11X module RDHEDR calling sequence: (FILE_ID, FCB, REALBASIS_A) read main or extension file header
READ_IDX_HEADER,
facility F11X module CREHDR calling sequence: read primary/alternate index header
READ_WRITEVB, facility F11X module RWVB calling sequence: read/write virtual block, including writing special files
REBLD_PRIM_FCB, facility F11X module CREFCB calling sequence: (PFCB, HEADER) delete old ACL and extension FCB chain, refill FCB
RELEASE_CACHE, facility F11X module RDBLOK calling sequence: NOVALUE release the buffer cache lock, wake others
RELEASE_LOCKBASIS, facility F11X module RDBLOK calling sequence: (LCKINDX) release buffers associated with a lockbasis value
RELEASE_SERIAL_LOCK, facility F11X module LOCKERS calling sequence: (LOCK_INDEX) : NOVALUE release serial lock, lock block, check caches
REL_QUOTA_LOCK, facility F11X module CHARGEQ calling sequence: (J) : NOVALUE release quota cache lock
REMAP_FILE, facility F11X module ACPCNTRL calling sequence: NOVALUE completely map file
REMOVE, facility F11X module REMOVE calling sequence: (KEEP_NAME) : NOVALUE remove a directory entry
REQUEUE_REQ, facility F11X module REQUEU calling sequence: re-queue request to driver (as a physical/logical request)
RESET_LBN, facility F11X module RDBLOK calling sequence: (BUFFER, LBN) : NOVALUE change the LBN associated with a buffer
RESTORE_CONTEXT, facility F11X module GETREQ calling sequence: NOVALUE restore context area after XQP sub-function
RESTORE_DIR, facility F11X module ENTER calling sequence: (CONTEXT) : NOVALUE reposition directory according to saved context, restore context
RETURN_BLOCKS, facility F11X module SMALOC calling sequence: (START_LBN, BLOCK_COUNT, ERASE_REQUESTED) : NOVALUE return a set of blocks to the storage map
RETURN_CREDITS, facility F11X module RDBLOK calling sequence: NOVALUE return the reserved cache buffers
RETURN_DIR, facility F11X module RETDIR calling sequence: (COUNT, STRING, ABD) : NOVALUE return result data from directory scan to user's result string
RM$ARM_DIRCACHE, facility RMS module RM0SETDID calling sequence: arm blocking AST
to notice UCB$W_DIRSEQ change
RM$DIRCACHE_BLKAST, facility SYS module RMSRESET calling sequence: increment DIRSEQ in UCB
SAVE_CONTEXT, facility F11X module GETREQ calling sequence: NOVALUE save context area for XQP sub-function
SCAN_BADLOG, facility F11X module BADSCN calling sequence: (FID, BASE_VBN, BASE_LBN, MODE, BLOCK_COUNT) : NOVALUE add/remove entry from BADLOG file
SEARCH_FCB, facility F11X module SCHFCB calling sequence: (FILE_ID) look for FCB off of VCB
SEARCH_QUOTA, facility F11X module CHARGEQ calling sequence: (UIC, FLAGS, START_REC, USE_CACHE) search quota file/cache for a UIC
SELECT_VOLUME, facility F11X module SELVOL calling sequence: (FIB, BLOCKS_NEEDED) : NOVALUE pick the best volume for an allocation
SEND_BADSCAN, facility F11X module SNDBAD calling sequence: (FID) : NOVALUE send message to, spawn badblock routine
SEND_ERRLOG, facility F11X module SNDERL calling sequence: (MODE, UCB) generate error log entry
SEND_SYMBIONT, facility F11X module SNDSMB calling sequence: (HEADER, FCB) : NOVALUE send spool request to job controller
SERIAL_CACHE, facility F11X module RDBLOK calling sequence: NOVALUE serialize (lock) the buffer cache
SERIAL_FILE, facility F11X module LOCKERS calling sequence: (FID_ADDR) acquire serial lock, return lock block
SET_DIRINDX, facility F11X module CLENUP calling sequence: (FCB) try to make an association between a directory and its index block
SET_EXPIRE, facility F11X module ACCESS calling sequence: marks the window as needing expiration recording when closed
SET_REVISION, facility F11X module DEACCS calling sequence: (HEADER, MODE) : NOVALUE update revision date in header
SHUFFLE_DIR, facility F11X module SHFDIR calling sequence: (DIRECTION) : NOVALUE extend/compress a directory
START_REQUEST, facility F11X module DISPATCH calling sequence: test for volume activity blocking and raise volume activity count
SWITCH_CHANNEL, facility F11X module SWITVL calling sequence: (UCB) : NOVALUE
switch XQP to new UCB for new volume
SWITCH_VOLUME, facility F11X module SWITVL calling sequence: (NEW_RVN) : NOVALUE switch volume context, switch allocation lock to new volume if held
TAKE_BLOCK_LOCK, facility F11X module LOCKERS calling sequence: NOVALUE acquire volume activity blocking lock for the process
TOSS_CACHE_DATA, facility F11X module RDBLOK calling sequence: (LCKINDX) : NOVALUE write, invalidate all buffers associated with a lockbasis
TRUNCATE, facility F11X module TRUNC calling sequence: (FIB, FILEHEADER, TRNVBN) : NOVALUE truncate blocks off end of file
TRUNCATE_HEADER, facility F11X module TRUNC calling sequence: (FIB, HEADER, POINTER, LAST_COUNT) : NOVALUE return truncated blocks to storage map, clear map pointers
TRUNC_CHECKS, facility F11X module TRUNC calling sequence: (FIB, HEADER) : NOVALUE perform validity checks on truncate request
TURN_WINDOW, facility F11X module WITURN calling sequence: (WINDOW, HEADER, DESIRED_VBN, START_VBN) create window block(s), turn windows to desired VBN
UNHOOK_BFRL, facility F11X module RDBLOK calling sequence: (BFRDARG) : NOVALUE unhook a buffer descriptor from a buffer lock
UNLOCK_IODB, facility F11X module LOCKDB calling sequence: unlock I/O data base mutex
UNLOCK_XQP, facility F11X module DISPAT calling sequence: NOVALUE release all locks
UPDATE_DIRSEQ, facility F11X module CHKDMO calling sequence: causes update of DIRSEQ in UCB for all nodes (invalidates RMS directory caches)
UPDATE_FCB, facility F11X module CREFCB calling sequence: (HEADER) : NOVALUE update (fill) primary FCB from given header
UPDATE_INDX, facility F11X module DIRSCN calling sequence: (BLOCK, STR_SIZE, STR_ADDR, DIRFCB) : NOVALUE record the directory entry name in the directory index cache cell
WAIT_FOR_AST, facility F11X module DISPATCH calling sequence: suspend XQP activity, return to user, wait for event
WRITE_ATTRIB, facility F11X module RWATTR calling sequence: (HEADER, ABD, CONTROL_ACCESS) : NOVALUE write user
attributes
WRITE_AUDIT, facility F11X module DISPAT calling sequence: (AUDIT_BLOCK) : NOVALUE generate, write audit record
WRITE_BLOCK, facility F11X module RDBLOK calling sequence: (BUFFER) : NOVALUE write a buffer back to disk
WRITE_DIRTY, facility F11X module RDBLOK calling sequence: (LOCKBASIS) : NOVALUE write buffers associated with a lockbasis back to disk
WRITE_HEADER, facility F11X module RDBLOK calling sequence: NOVALUE write a file header (checksum before write block)
WRITE_QUOTA, facility F11X module CHARGEQ calling sequence: (Q_RECORD) : NOVALUE mark quota record for writing
WRONG_LOCKBASIS, facility F11X module RDBLOK calling sequence: (HEADER) : NOVALUE return a buffer found to have the wrong lockbasis
XQP$BLOCK_ROUTINE, facility SYS module SYSACPFDT calling sequence: block volume activity (decrement activity), invoke XQP$DEQBLOCKER if idle
XQP$DEQBLOCKER, facility SYS module SYSACPFDT calling sequence: de-queue blocking lock (swapper)
XQP$FCBSTALE, facility SYS module SYSACPFDT calling sequence: blocking routine to mark FCB stale
XQP$REL_QUOTA, facility SYS module SYSACPFDT calling sequence: blocking AST to invoke XQP$UNLOCK_QUOTA
XQP$UNLOCK_CACHE, facility SYS module SYSACPFDT calling sequence: pass system blocking cache AST on to cache server process
XQP$UNLOCK_QUOTA, facility SYS module SYSACPFDT calling sequence: de-queue/demote quota cache entry lock
XQPMERGE, facility F11X module XQPMERGE calling sequence: force a new file system into P1 space
ZERO_IDX, facility F11X module CLENUP calling sequence: NOVALUE clear directory index block
ZERO_ON_ERROR, facility F11X module DISPAT calling sequence: (SIGNAL, MECHANISM) handler to return zero as routine value
ZERO_WINDOWS, facility F11X module CLENUP calling sequence: (FCB) de-allocate windows off the FCB
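The ZERO_ON_ERROR entry above describes a handler whose effect is that a routine which takes an error returns zero to its caller. On VMS this is done with a condition handler that receives the signal and mechanism arrays; the portable C below only imitates the visible effect with setjmp/longjmp. It is an analogy under stated assumptions, not the way the XQP is coded, and every name in it (guard, signal_error, call_zero_on_error, risky) is hypothetical.

```c
#include <setjmp.h>

/* Hypothetical single guard frame; this sketch is not re-entrant. */
static jmp_buf guard;

/* Hypothetical stand-in for signalling an error from inside an operation. */
static void signal_error(void)
{
    longjmp(guard, 1);
}

typedef int (*operation)(int);

/* Run op(arg). If the operation signals an error, control comes back
   here through the guard and the caller sees zero as the routine value. */
static int call_zero_on_error(operation op, int arg)
{
    if (setjmp(guard) != 0)
        return 0;            /* error path: zero is the routine value */
    return op(arg);          /* normal path: pass the result through */
}

/* Example operation: signals on negative input, otherwise doubles it. */
static int risky(int x)
{
    if (x < 0)
        signal_error();
    return x * 2;
}
```

A caller that treats zero as failure can then test the result of call_zero_on_error(risky, value) without knowing how the error was raised.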