0145: I/O PLUMBING - I/O Interception Without Pain

Glenn C. Everhart, General Cybernetic Engineering
(Consulting in systems, networks, and internals VMS, Unix, MSDOS)
Everhart@Arisia.GCE.Com
215 358 5875

Intercepting control during the I/O process is part of the craft of system programming that has numerous uses. Now, in general, you want to use well defined interfaces because:

* Control flow is well defined there, may even be documented and stable.
* All control narrows to flow through these interfaces, rather than going through many.
* Data structures have to contain all information here, rather than having control info encoded in misc. parts of machine state.

It is a fundamental concept of systems programming to find new uses for existing interfaces and to intercept control there. On real OSs, it is also fundamental to avoid assuming you have complete control of those interfaces. Manufacturers and other systems programmers use the same ones. It is important to design your applications so multiple users of an interface can exist cleanly. The PC world has had many problems because its interfaces started to be used by a large group who did not do this.

Now back to VMS control stealing. Consider the major gateways that I/O passes through.

I/O Flow:

(Leave out other hacks like stealing the CHMK or CHME vectors by bashing the SCB; I consider only I/O here.)

EVERY one of these points can be a place to intercept the operation.

Process -> QIO call

1 -> sysqioreq code in VMS kernel; sets up IRP and device independent fields. Validates that driver can do the operation (using 1st FDT mask, 64 bits) -> sets buffered bit if appropriate (generally so for XQP calls).

2 -> calls driver FDT routines inside a tight loop. (In the FDT routines the kernel stack has a JSB - pushed PC and one CALLG frame from the qio call.) Finds FDT table from DDT, pointed to by ucb$l_ddt. (NOTE: Step 2 Alpha drivers use ONE CALL depending on function.)
3 -> FDT routines do additional setup, finish getting IRP set up (P1 to P6 in arg list on VAX; in IRP already on Alpha). Exit via RSB (getting back to loop and to next FDT), or to error exit, or to code to queue IRP to driver start_io or to XQP. In these cases control eventually reaches exe$qioreturn or friends for the intermediate return (pending I/O done). Return pops stack back, sets IPL to 0, sets R0 to code (usually 1 where all's well).

4 (Note: you can patch the global XQP entry point also if you like: IRP is complete here and you get here via normal kernel AST.)

5 -> Driver start-io entry. Does actual hardware protocol setup. I won't go into this; hardware can be timing dependent and patching from past here thru interrupt code is not usually desirable.

6 -> Post processing queue handles status return to IOSB, sometimes buffer copy/interpret, etc.

7 You can similarly patch almost anywhere else in VMS, so long as you can figure out how not to break synchronization. (Yes, you can even steal the CHMK handler from the SCB!)

What One Does with these Patch Locations

1. Patch at the system call - use to monitor calls (per process...means you bash the table in P1 space) and record whatever user info you like. Difficulty: you don't usually own any process space for dedicated records or for replacing arguments, and clobbering user calls is risky or worse. FTS012 is an example program that patches here. Your code can lie to the process too... Note: patching the SCB to steal the CHMK or CHME vector is more complete and can catch calls via S0 entry points too.

2. You can steal the driver's FDT table by pointing the DDT at your own after you insert your FDT processing ahead of what's there. The table is a series of (function mask, address) pairs. Code at "address" gets control when (and ONLY when) function mask bits for the current I/O function are set.
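A rough user-space model in C may make the table mechanics concrete. This is only a sketch under stated assumptions, not VMS kernel code: the struct names, the return codes, the function codes, and the NULL-terminated table are all invented for illustration, and the real FDT layout and exit conventions differ. It shows just the two ideas described above: a routine runs only when its mask covers the current function, and an intercept is installed by building a new table with your entry first and repointing the dispatch structure at it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model only: none of these are real VMS structures or codes. */
enum fdt_result { FDT_CONTINUE, FDT_DONE, FDT_DENY };
enum { FUNC_READ = 1, FUNC_WRITE = 2 };      /* hypothetical function codes */

struct io_request {                          /* stand-in for an IRP */
    unsigned func;                           /* I/O function code, 0..63 */
    int      trace;                          /* bumped by each routine run */
};

typedef enum fdt_result (*fdt_routine)(struct io_request *);

struct fdt_entry {                           /* one (function mask, address) pair */
    uint64_t    mask;                        /* bit n set => handles function n */
    fdt_routine routine;                     /* NULL ends the table (model only) */
};

struct dispatch {                            /* stand-in for the DDT's FDT pointer */
    const struct fdt_entry *fdt;
};

/* The dispatch loop: call each routine whose mask covers the function,
   stopping at the first one that does not ask to continue. */
enum fdt_result run_fdt(const struct dispatch *d, struct io_request *r)
{
    assert(d != NULL && r != NULL);
    if (r->func >= 64)
        return FDT_DENY;                     /* no mask bit exists for it */
    for (const struct fdt_entry *e = d->fdt; e->routine != NULL; e++) {
        if (e->mask & (UINT64_C(1) << r->func)) {
            enum fdt_result rc = e->routine(r);
            if (rc != FDT_CONTINUE)
                return rc;
        }
    }
    return FDT_CONTINUE;                     /* nobody finished the request */
}

/* The "driver's" own routine: accepts reads and writes. */
enum fdt_result driver_rw(struct io_request *r)
{
    r->trace += 1;
    return FDT_DONE;
}

/* Hypothetical intercept: refuse writes, pass anything else along. */
enum fdt_result my_filter(struct io_request *r)
{
    r->trace += 10;
    return r->func == FUNC_WRITE ? FDT_DENY : FDT_CONTINUE;
}

const struct fdt_entry original_fdt[] = {
    { (UINT64_C(1) << FUNC_READ) | (UINT64_C(1) << FUNC_WRITE), driver_rw },
    { 0, NULL }
};

/* Single static buffer, so this model supports one intercept at a time
   and copies at most two of the original entries. */
struct fdt_entry stolen_fdt[4];

/* Build a table with our entry first, repoint the dispatch structure,
   and return the old table so the intercept can be removed later. */
const struct fdt_entry *steal_fdt(struct dispatch *d, uint64_t mask,
                                  fdt_routine fn)
{
    const struct fdt_entry *old = d->fdt;
    size_t n = 0;
    stolen_fdt[n].mask = mask;
    stolen_fdt[n].routine = fn;
    n++;
    for (const struct fdt_entry *e = old; e->routine != NULL && n < 3; e++)
        stolen_fdt[n++] = *e;
    stolen_fdt[n].mask = 0;
    stolen_fdt[n].routine = NULL;
    d->fdt = stolen_fdt;
    return old;
}
```

In this model, calling steal_fdt with a mask of only the write bit makes writes return FDT_DENY while reads still reach driver_rw untouched; assigning the returned pointer back to the fdt field removes the intercept.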
At FDT time you're in user process context at IPL 2 and can allow, disallow, or to some extent modify the I/O or its results. Here's how.

It is possible to interrupt FDT processing PROVIDED you save the full context (registers), don't disturb the IRP, and make DARN sure the user process can't clobber its inputs. Remember that at FDT time you do not yet busy the driver, so you must save the I/O per channel, per process when doing this. Finding a fast way to get to this info is important; a base pointer can be stored in a pseudodriver UCB, a lock value block, a logical name, or (if you can find one) some unused system cell. The last is usually ill advised. You can also use a constant if your driver has a single unit.

Once you save the I/O context you can notify a process about what you want; techniques such as using sch$postef to set a local event flag or exe$writembx to write to a MB: unit are examples. You then need to return to user context.

It is VITAL if intercepting here not to let user processing occur. Blocking ASTs is optional, but the mainline can't be allowed to run or args can get modified (RMS does this). You can inhibit ASTs via the PCB$B_ASTEN byte (set to 1 to allow only kernel mode ASTs, for example), and inhibit a process with no side effects by putting it in RWAST state (see the calls to EXE$RWAIT) momentarily. This does not disturb other wait states that the process might enter. You can use other waits if desired. Also be sure the CCB doesn't go idle.

Blocking AST processing while your daemon runs gives added safety. You can select which modes to block, too, so user mode code can be kept quiet while inner modes are undisturbed. Some care is needed due to possible locks; you must be sure you will fairly quickly and without fail reenable the process lest you cause deadlocks somewhere.
Finally, be sure not to allow the user wait to finish before finding out if its return should be error or not, since an error return will never go further and the thread stalls.

To get back into your processing, and undo a wait, a special kernel AST gets you to much the same state as AST processing. Your AST must restore registers, reenable ASTs to where they were, and reissue the user's I/O, then exit the AST. This works fine on AXP too. Reissuing is ALMOST straightforward...just duplicate the FDT call loop lower on the stack. NOTE though that you return at IPL 0, so you must maintain synch. "by hand" and return to IPL 2 as soon as you get back. This means the technique is not 100% general, but works fairly well. An advantage of special vs. regular kernel ASTs here is their simplicity of scheduling by VMS. If your intercept already has everything saved, you need really be careful only of process deletion, which can be inhibited temporarily.

Notice that FDT processing has the disadvantage that the IRP is not completely built yet. However, it's a well documented area and intercepts here don't care what ACP may come later. If you have an ODS-1, or a "new" file structure ACP or alternate XQP, intercepts at FDT time can affect what happens. Stealing the F11BXQP entry is only good for ODS2. Both have process context and so can do some interesting stuff including some system service calls (if you're careful!).

Uses: You can do whatever you like, in process code, to check what should happen on certain I/Os. You can move files, deny or allow access to them, return info to alter process priority once back in the process, change extend parameters, change where reads/writes go, monitor I/O file transactions, build extra responses in to different I/O patterns, play any trick you like. This is not a good place to try to do caching, though, due to lack of synchronization.
Also, remember that MSCP served disks act by sending IRPs to driver start_io, so in a cluster everyone needs to have this processing locally.

3. Stealing FDT entries themselves can be done. It's easier to add your own, though.

4. Stealing the XQP entry point. This is a very good place to monitor I/O. It is system wide and requires that you decode the IRP packet formats for XQP operations, and it has the same difficulty that 2 does of finding a place for its data. You get control in a kernel AST (for the XQP) with the IRP all built. Creating an I/O error here is less well defined than in FDTs, a disadvantage, but you can twiddle I/O operations similarly to 2 and need not block a process, as all its arguments are fully encapsulated in the IRP by this point. Adding your own call to a process here is possible, though care is needed to ensure the process' I/O that you call is not blocked. This is an issue in 2 also.

Uses: Applications that modify file extensions or monitor I/O are reasonable fits here. Other applications such as have been mentioned for 2 are generally possible, though changing I/O can be more involved here, as the XQP has its own methods for doing I/O which are not that similar to normal user ones. The synchronization is however cleaner in that you start in a kernel AST and can resume in one, and do not have hardcoded SETIPL #0 instructions to work around.

5. Stealing driver start_io entry. This location is often taken for purposes of implementing cache systems. You just change the pointer in the DDT (Driver Dispatch Table) and your code gains control before the regular driver's. At this point (or in the earlier ones) you can "steal" the IRP completion by filling in IRP$L_PID with the address of your completion routine. That routine then gains control at I/O postprocessing. A "normal" driver's start_io entry is controlled by driver busy, so only one set of cells is needed to hold data until I/O done.
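The pointer swap and completion hook can be pictured with a small user-space C model. It is a sketch under loud assumptions, not driver code: every type and name here is hypothetical, the "driver" completes synchronously, and a separate post_hook field stands in for the trick of putting a routine address into IRP$L_PID (which is why the model clobbers pid and restores it in the hook). It illustrates saving the old start_io pointer, gaining control ahead of the driver, parking the requester's id in one set of save cells, and handing the packet back for normal completion.

```c
#include <assert.h>
#include <stddef.h>

struct irp;
struct unit;
typedef void (*start_io_fn)(struct irp *, struct unit *);
typedef void (*post_fn)(struct irp *);

struct irp {                     /* model packet, not the real IRP layout */
    long    pid;                 /* requester id */
    post_fn post_hook;           /* models "PID field holds a routine address" */
    int     status;
};

struct dispatch_table { start_io_fn start_io; };   /* stand-in for the DDT */
struct unit { struct dispatch_table *ddt; int busy; };

/* One set of save cells suffices in this model because the unit is busy
   from start_io until completion, so requests never overlap. */
struct {
    start_io_fn orig_start_io;
    long        saved_pid;
    int         completions_seen;
} cells;

long last_delivered_pid;         /* what "normal" completion reported */
int  last_delivered_status;

/* Model of I/O postprocessing: a hooked packet goes to the hook first. */
void io_post(struct irp *p)
{
    if (p->post_hook != NULL) {
        post_fn hook = p->post_hook;
        p->post_hook = NULL;     /* one shot, so the hook can repost */
        hook(p);
        return;
    }
    last_delivered_pid = p->pid;
    last_delivered_status = p->status;
}

/* The "real" driver: pretends the transfer worked, then posts the packet. */
void real_start_io(struct irp *p, struct unit *u)
{
    u->busy = 1;
    p->status = 1;
    u->busy = 0;
    io_post(p);
}

/* Intercept's completion routine: put the requester back and repost. */
void my_post(struct irp *p)
{
    cells.completions_seen++;
    p->pid = cells.saved_pid;
    io_post(p);
}

/* Intercept's start_io: runs before the driver's and plants the hook. */
void my_start_io(struct irp *p, struct unit *u)
{
    cells.saved_pid = p->pid;
    p->pid = 0;                  /* field is "taken over" while hooked */
    p->post_hook = my_post;
    cells.orig_start_io(p, u);
}

void steal_start_io(struct unit *u)
{
    assert(u != NULL && u->ddt != NULL);
    cells.orig_start_io = u->ddt->start_io;
    u->ddt->start_io = my_start_io;
}

void restore_start_io(struct unit *u)
{
    assert(u != NULL && u->ddt != NULL);
    u->ddt->start_io = cells.orig_start_io;
}
```

After steal_start_io, a request issued through the table passes through my_start_io, the driver, my_post, and finally normal completion with the original requester id restored. A cache would make its hit/miss decision in my_start_io and could edit the result in my_post.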
Paul Sorenson and John Osudar suggested moving the DDT into a pseudodriver UCB, so that on a call, the UCB$L_DDT pointer of the intercepted driver points to a known offset in the intercepting driver's UCB. This makes access to data very fast. My driver ([VAX92B.GCE92B.NET92B]QDRIVERSKEL.MAR) on the Fall 1992 SIG tapes is an example of code that does this with some extra work to allow multiple applications to steal the same entries. Both FDT and DDT stealing are there in the code.

CDdriver is a good example of the use of stealing start_io; it implements a single CPU cache. (To do this across a cluster means taking out block or file locks across the cluster, in the cacher, so when anyone starts to write a block the other nodes can disable it. There are some very touchy timing issues that make getting this right a hard problem. I'm not expert in them. Clearly, though, if you rely on the lock manager, you must arrange that access be delayed at any node wanting to write if that node doesn't own the lock. This means either rolling your own delay mechanism or using some mechanism that VMS uses, like RMS file locks. The delay is needed to block a write to a block that's in cache on another CPU, so the other CPU can invalidate its cache. This can be done by techniques like those fddriver or ztdriver (remote virtual tape/disk drivers in the SIG tapes) use: saving the IRP aside until your communicating process that's dealing with the locks finishes its operation, and then continuing the operation either with a fake interrupt in the pseudodriver you've got the extra code in, or via some FDT level entry that forks and reissues the IRP along from fork level to get it back to its original track.)

Incidentally, if your driver is using altstart and keeping its own queues, this gets much more complex.

You can also point start-io to an intercept block somewhere and keep a queue in your intercept driver so that the intercept is never moved, even if the driver reloads.
It is handy when you might need to reload a driver out from under existing IRPs; the controller init routine must rebuild the jump addresses. (Note this is harder on AXP; the other technique works there; I've done it.)

6. Stealing the post processing queue. This technique involves some earlier access to use the IRP$L_PID hook, and in general you need to at least call COM$POST eventually to complete I/O on the packet. However, you can use this point to edit what the I/O returns. (You can steal the iopost (IPL 4) interrupt instead, but the IRP$L_PID hook is much easier to use.) I have heard reports of insertion of close at this point also.

Now, in general, you can do $QIO from kernel mode (leaving out issues of synchronization), provided that:

1. All arguments are r/w from KERNEL mode.
2. The QIO mode is kernel.
3. On VAX, previous mode needs to be Kernel, since PROBE instructions use previous mode for their checks.
4. Kernel AST delivery needs to be enabled if you wish to avoid hangs, for XQP processed operations like deaccess (=close).

The difficulty you have is dropping down from IPL 4 to do the I/O without breaking synchronization and lousing up the IPL 4 queue. The simplest thing to do is probably to send an AST to yourself and do the operations there; this is well documented. You can request another interrupt at IPL 4 after having requeued the packet (see fddriver sources for a sample fragment) so you can get back, and even get the system to complete the I/O for you.

7. Stealing generic VMS locations. ("blue sky")

Suppose we have an IPL 0 site within VMS, a process, an image, or the like and want to insert some process code. (If a site is handled at some higher IPL, the synchronization issues have to be handled also.) At the site, insert a $cmkrnl call to get into your processing in kernel mode with your own entry mask.
(Actual details of bashing a particular site are not the issue here; you need to be able to stash your patch into nonpaged pool and replicate whatever instructions you bash.) Your patch will look like this:

patch:
        duplicate instructions
        IF the process is not the service process THEN
            store off any desired info from the site
            movq    r0,-(sp)
            $cmkrnl_s routin=mypatch2
            movq    (sp)+,r0
        END IF
        return to original site

        .entry  mypatch2,^m
        allocate a message buffer
        Fill in with address of patchAST (below), and whatever other info is desired.
        fill in r3,r4,r5 to point at the buffer and the mailbox UCB of the mailbox set up by your service process
        call EXE$WRITEMBX to send the message buffer to the mailbox if at high IPL, or just use $qio if not.
        Free the buffer
        Set local semaphore for this process
        Disable some AST deliveries if appropriate
        Loop calling SCH$RWAIT until your semaphore is set.
            (This prevents your process from noticing a wait while the service process runs)
        Pick up any results from the patchAST and bash whatever is appropriate.
        Reenable AST deliveries if appropriate.
        movl    #ss$_normal,r0
        RET

; patchAST entered via kernel or special kernel AST fired off from the
; service process.
        .entry  patchAST,^M
        Pick off any arguments from the ACB so we can return them to the "mypatch2" procedure.
        Set the local semaphore so the SCH$RWAIT loop terminates
        RET

The service process' outline is:

        Establish the mailbox and stash its UCB address where the patch can get it fast.
forever:
        Read the mailbox
        Do whatever the process darn well pleases with the information there, having access to the entire machine/cluster/net...
        Send an AST to the address received in the mailbox message
        Loop

Again, if you're not starting at IPL 0, you need to handle forking issues to get to the desired synchronization. Remember that you can queue ASTs to get to IPL 2 or fork to other levels.

In passing, you can patch Alpha code too.
You follow the code pointer in a procedure descriptor, set the page writeable, and do the patch, then put things back. This must be done with an eye to the macro-64 being generated, though, if you don't plan to replace a whole routine. If using macro-32 compiled code, replace a whole routine where possible.

8. It's possible for drivers and processes to have varying degrees of intimacy in exchanging data; some intercepts just move ALL IRPs to a process for further filtering, then come back to the driver. Others switch only certain IRPs there, depending on the need. More than one process can be used in this way, this being a generalization of what happens in driver <-> ACP communication.

What can you do with this? Some example hacks:

* Steal RMS entries from a process and get some other process to do things first. (Want to duplicate transactions without buying RMS journalling??)

* Steal VMS entries and filter them with user mode code any way you please. This can be per process or per system (IF you're careful!)

* Stick a patch where DCL comes up with its common error message

      %DCL-W-IVVERB, unrecognized command verb - check validity and spelling

  which is ALWAYS the same and boring. Instead, why not have your system generate messages like:

      The way you type, we could be here all night.
      Things seem slow to you? I've got four SPACEWARs and an Ada compile running.
      I don't understand either. That command should have worked.
      You didn't really want the answer to that, did you?
      A puff of orange smoke appears and indicates...you screwed up again.

  A process can readily generate such things, blast them over, and then flag that the patch should skip around the code that comes up with the boring normal error message. (The process could also perhaps even try a second time to figure out what you might really want.)

* You can from this mode change things like process privs, prompts, priority, etc.
  Doing this randomly is not so great, but imagine the following actions:

  "This person is opening the Ada compiler. NOBODY should be running THAT interactively! We'll lower THAT boy's priority to zero and null all his privs..."

  "Here's someone running TECO. Since he's obviously a wizard, raise his priority to 15 and give him SETPRV."

  "This is the 500th I/O with this guy in MAIL. Let's teach him a lesson by changing his DCL prompt to "MAIL> "."

  "This guy is running SYSGEN; he's fair game. Let me scare the wits out of him by changing his prompt to ">>>" and responding to a few commands like a system console for a few lines."

  "Here's a person who is in MAIL and using a lot of swear words or scatological language. Tsk tsk. He shouldn't have such a foul mouth, so I'll tell my process (HISMOMMY) to take over his terminal and demand an apology, and I won't let him get real control back until he types I'M SORRY." (Alternatively it could rewrite the message so, for instance, "motherf**ker" becomes "sweetie"; then everyone can wonder why he suddenly sounds like my old Grandmother...)

  "Joe SystemProgrammer is running games. We don't want the boss to catch him, so make him invisible."

  "J. RandomUser is running games. Generate some message about an earthquake ruining Colossal Cave and kick him off." (who says we have to be fair??)

  "The guy in the next cubicle is kind of paranoid and always seems to sign his vaxmail with "top secret crypto nuclear". He just typed that in. So I'll break in on him with some message like

      %SECURMON-I-FWDNSA, message copy forwarded to approved monitor

  and then let him go on...oughta really freak him out."