0145:	I/O PLUMBING - I/O Interception Without Pain
	Glenn C. Everhart, General Cybernetic Engineering
		(Consulting in systems, networks, and internals
		VMS, Unix, MSDOS)
		Everhart@Arisia.GCE.Com
		215 358 5875


   Intercepting control during the I/O process is part of the craft
of system programming that has numerous uses. Now, in general, you want
to use well defined interfaces because:

	* Control flow is well defined there, may even be documented
		and stable.
	* All control narrows to flow through these interfaces, rather
		than going through many.
	* Data structures have to contain all information here, rather
		than having control info encoded in misc. parts of
		machine state.

   It is a fundamental concept of systems programming to find new uses
for existing interfaces and to intercept control there.

   On real OSs, it is also fundamental to avoid assuming you have
complete control of those interfaces. Manufacturers and other systems
programmers use the same ones. It is important to design your applications
so multiple users of an interface can exist cleanly. The PC world has had
many problems because its interfaces started to be used by a large group
who did not do this.

   Now back to VMS control stealing.

   Consider the major gateways that I/O passes through.

  I/O Flow:

 (Leave out other hacks like stealing the CHMK or CHME vectors by bashing
  the SCB; I consider only I/O here.)

  Process -> QIO call
  EVERY one of these points can be a place to intercept the operation.
1	-> sysqioreq code in VMS kernel; sets up IRP and device
		independent fields. Validates that driver can
		do the operation (using 1st FDT mask, 64 bits)
	-> sets buffered bit if appropriate (generally so for
		XQP calls).
2	-> calls driver FDT routines inside a tight loop. (In the
		FDT routines the kernel stack has a JSB - pushed PC
		and one CALLG frame from the qio call.) Finds FDT table
		from DDT, pointed to by ucb$l_ddt.
	(NOTE: In step 2, Alpha drivers use ONE CALL depending on function.)
3	-> FDT routines do additional setup, finish getting IRP
		set up (P1 to P6 in arg list on VAX; in IRP already
		on Alpha). Exit via RSB (getting back to loop and to
		next FDT), or to error exit, or to code to queue
		IRP to driver start_io or to XQP. In these cases
		control eventually reaches exe$qioreturn or friends
		for the intermediate return (I/O still pending).
		Return pops stack back, sets IPL to 0, sets R0 to
		code (usually 1 where all's well).
4	   (note: you can patch the global XQP entry point also if you
	          like: IRP is complete here and you get here via
		  normal knl AST.)
5	-> Driver start-io entry. Does actual hardware protocol setup.
		I won't go into this; hardware can be timing dependent
		and patching from past here thru interrupt code is not
		usually desirable.
6	-> Post processing queue handles status return to IOSB, sometimes
		buffer copy/interpret, etc.


7       You can similarly patch almost anywhere else in VMS, so long as
		you can figure out how not to break synchronization. (Yes,
		you can even steal the CHMK handler from the SCB!)

What One Does with these Patch Locations

1. Patch at the system call - use to monitor calls (per process...means you
	bash the table in P1 space) and record whatever user info you like.
	Difficulty: you don't usually own any process space for dedicated
	records or for replacing arguments, and clobbering user calls is
	risky or worse. FTS012 is an example program that patches here.
	Your code can lie to the process too... Note: patching the SCB to steal
	the CHMK or CHME vector is more complete and can catch calls via S0
	entry points too.

2. You can steal the driver's FDT table by pointing the DDT at your own
	after you insert your FDT processing ahead of what's there.
	Table is a series of (function mask, address) pairs. Code at
	"address" gets control when (and ONLY when) function mask bits
	for the current I/O function are set. You're in user process context
	at IPL 2 here and can allow, disallow, or to some extent modify the
	I/O or its results at this point.

	Here's how.

	It is possible to interrupt FDT processing PROVIDED you save the
	full context (registers), don't disturb the IRP, and make DARN
	sure the user process can't clobber its inputs. Remember that
	at FDT time you do not yet busy the driver, so you must save the
	I/O per channel, per process when doing this. Finding a fast
	way to get to this info is important; a base pointer can be stored
	in a pseudodriver UCB, a lock value block, a logical name, or
	(if you can find one) some unused system cell. The last is usually
	ill advised. You can also use a constant if your driver has a single
	unit.

	Once you save the I/O context you can notify a process about
	what you want; techniques such as using sch$postef to set
	a local event flag or exe$writembx to write to a MB: unit
	are examples. You then need to return to user context.

	It is VITAL if intercepting here not to let user processing
	occur. Blocking ASTs is optional, but the mainline can't be
	allowed to run or args can get modified (RMS does this).
	You can inhibit ASTs via the PCB$B_ASTEN byte (set to 1 to
	allow only kernel mode ASTs, for example), and inhibit a process
	with no side effects by putting it in RWAST state (see the calls
	to EXE$RWAIT) momentarily. This does not disturb other wait states
	that the process might enter. You can use other waits if desired.
	Also be sure the CCB doesn't go idle. Blocking AST processing
	while your daemon runs gives added safety. You can select which
	modes to block, too, so user mode code can be kept quiet
	while inner modes are undisturbed. Some care is needed due to
	possible locks; you must be sure you will fairly quickly
	and without fail reenable the process lest you cause deadlocks
	somewhere. Finally, be sure not to allow the user wait to finish
	before finding out if its return should be error or not, since
	an error return will never go further and the thread stalls.

	To get back into your processing and undo a wait, use a special
	kernel AST, which gets you to much the same state as AST processing.
	Your AST must restore registers, reenable ASTs to where they
	were, and reissue the user's I/O, then exit the AST. This works
	fine on AXP too.

	Reissuing is ALMOST straightforward...just duplicate the
	FDT call loop lower on the stack. NOTE though that you
	return at IPL 0, so must maintain synch. "by hand" and
	return to IPL 2 as soon as you get back. This means the
	technique is not 100% general, but works fairly well. An
	advantage of special vs. regular kernel ASTs here is their
	simplicity of scheduling by VMS. If your intercept already
	has everything saved, you need really be careful only of process
	deletion which can be inhibited temporarily.


	Notice that FDT processing has the disadvantage that the IRP
	is not completely built yet. However, it's a well documented
	area and intercepts here don't care what ACP may come later.
	If you have an ODS-1, or a "new" file structure ACP or alternate
	XQP, intercepts at FDT time can affect what happens. Stealing
	the F11BXQP entry is only good for ODS2. Both have process
	context and so can do some interesting stuff including some
	system service calls (if you're careful!).

	Uses:  You can do whatever you like, in process code, to check
		what should happen on certain I/Os. You can move files,
		deny or allow access to them, return info to alter
		process priority once back in the process, change
		extend parameters, change where reads/writes go,
		monitor I/O file transactions, build extra responses
		in to different I/O patterns, play any trick you like.

	       This is not a good place to try to do caching, though,
		due to lack of synchronization. Also, remember that MSCP
		served disks act by sending IRPs to driver start_io, so
		in a cluster everyone needs to have this processing locally.
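
	As a rough illustration of the table walk described in 2, here is
	a toy user-space model in Python. It is NOT VMS code and not how any
	real driver is built. The names Driver, qio, steal_fdt, the function
	codes and the CONTINUE/DONE return values are all made up for the
	sketch. The only things taken from the text are the legal-function
	mask check, the (function mask, address) pairs, and inserting your
	own entry ahead of what's there.

```python
# User-space model only; none of these names are real VMS symbols.
IO_READ, IO_WRITE, IO_ACCESS = 0, 1, 2    # hypothetical function codes
CONTINUE, DONE = "continue", "done"       # what a model FDT routine returns


def mask(*codes):
    """Build a function mask with one bit per I/O function code."""
    m = 0
    for code in codes:
        m |= 1 << code
    return m


class Driver:
    def __init__(self, legal_mask, fdt):
        self.legal_mask = legal_mask      # stands in for the first FDT mask
        self.fdt = list(fdt)              # (function mask, routine) pairs


def qio(driver, func, irp):
    """Walk the table; a routine runs ONLY when its mask bit is set."""
    bit = 1 << func
    if not driver.legal_mask & bit:
        return "illegal function"
    for fmask, routine in driver.fdt:
        if fmask & bit and routine(irp) == DONE:
            return irp["status"]          # routine finished or queued the I/O
    return "fell off end of FDT"


def steal_fdt(driver, fmask, routine):
    """Point the driver at a new table with our entry ahead of the old ones."""
    old = driver.fdt
    driver.fdt = [(fmask, routine)] + old
    return old                            # keep it so the steal can be undone


def read_fdt(irp):
    irp["status"] = "read queued"
    return DONE


def write_fdt(irp):
    irp["status"] = "write queued"
    return DONE


def make_write_guard(log, protected):
    """Interceptor: record every write, disallow writes to protected files."""
    def guard(irp):
        log.append(irp["file"])
        if irp["file"] in protected:
            irp["status"] = "denied"
            return DONE                   # error exit; later entries never run
        return CONTINUE                   # fall through to the driver's own FDT
    return guard
```

	In this model, old = steal_fdt(drv, mask(IO_WRITE), guard) installs
	the intercept for writes only, and putting old back into drv.fdt
	removes it. Reads never reach the guard because its mask bit is clear.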

3. Stealing FDT entries themselves can be done. It's easier to add your
	own, though.

4. Stealing the XQP entry point. This is a very good place to monitor
	I/O. It is system wide and requires you decode the IRP packet
	formats for XQP operations, and has the same difficulty 2 does
	of finding a place for its data. You get control in a kernel
	AST (for the XQP) with the IRP all built. Creating an I/O
	error here is less well defined than in FDTs, a disadvantage,
	but you can twiddle I/O operations similarly to 2 and need
	not block a process, as all its arguments are fully encapsulated
	in the IRP by this point. Adding your own call to a process here
	is possible, though care is needed to ensure the process' I/O
	that you call is not blocked. This is an issue in 2 also.

	Uses: Applications that modify file extensions or monitor I/O
		are reasonable fits here. Other applications such as
		those mentioned for 2 are generally possible, though
		changing I/O can be more involved here, as the XQP
		has its own methods for doing I/O which are not that
		similar to normal user ones. The synchronization is
		however cleaner in that you start in a kernel AST and
		can resume in one, and do not have hardcoded SETIPL #0
		instructions to work around.

5. Stealing driver start_io entry. This location is often taken for
	purposes of implementing cache systems. You just change the
	pointer in the DDT (Driver Dispatch Table) and your code
	gains control before the regular driver's. At this point
	(or in the earlier ones) you can "steal" the IRP completion
	by filling in IRP$L_PID with the address of your completion
	routine. This location gains control at I/O postprocessing
	then. A "normal" driver's start_io entry is controlled by
	driver busy, so only one set of cells is needed to hold data
	until I/O done.
	Paul Sorenson and John Osudar suggested moving the DDT into
	a pseudodriver UCB, so that on a call, the UCB$L_DDT pointer
	of the intercepted driver points to a known offset in the
	intercepting driver's UCB. This makes access to data very
	fast. My driver ([VAX92B.GCE92B.NET92B]QDRIVERSKEL.MAR) on the
	Fall 1992 SIG tapes is an example of code that does this with
	some extra work to allow multiple applications to steal the
	same entries. Both FDT and DDT stealing are there in the code.

	CDdriver is a good example of the use of stealing start_io; it
	implements a single CPU cache. (To do this across a cluster
	means taking out block or file locks across cluster, in the
	cacher, so when anyone starts to write a block the other nodes
	can disable it. There are some very touchy timing issues
	that make getting this right a hard problem. I'm not expert
	in them. Clearly, though, if you rely on the lock manager,
	you must arrange that access be delayed at any node wanting
	to write if that node doesn't own the lock. This means either
	rolling your own delay mechanism or using some mechanism that
	VMS uses, like RMS file locks. The delay is needed to block
	a write to a block that's in cache on another CPU, so the
	other CPU can invalidate its cache.
	  This can be done with the techniques that fddriver or ztdriver
	(remote virtual tape/disk drivers on the SIG tapes) use: set
	the IRP aside until the communicating process that's dealing
	with the locks finishes its operation, then continue
	the operation either with a fake interrupt in the pseudodriver
	you've got the extra code in, or via some FDT level entry
	that forks and reissues the IRP from fork level to
	get it back on its original track. Incidentally, if your driver
	is using altstart and keeping its own queues, this gets much
	more complex.)
		You can also point start-io to an intercept block somewhere
	and keep a queue in your intercept driver so that the intercept
	is never moved, even if the driver reloads. It is handy when
	you might need to reload a driver out from under existing IRPs;
	the controller init routine must rebuild the jump addresses. (Note
	this is harder on AXP; the other technique works there; I've done it.)
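
	The start_io swap and the IRP$L_PID completion hook from the start
	of 5 can be modeled in user space as below. This is a Python toy,
	not driver code: Disk, IRP, steal_start_io and add_cache are
	invented names, a Python callable stands in for "address of your
	completion routine", and disk.post stands in for I/O postprocessing.
	Each interceptor remembers whoever held the entry before it, which
	is one simple way (an assumption here, not necessarily what
	QDRIVERSKEL does) to let several applications steal the same entry.

```python
# User-space model only; class and function names are hypothetical.
class IRP:
    def __init__(self, block, pid):
        self.block = block
        self.pid = pid        # a process id, or a callable completion hook
        self.data = None


class Disk:
    def __init__(self, blocks):
        self.blocks = blocks
        self.reads = 0                    # how often the real driver ran
        self.completed = []               # (pid, data) as seen by the requester
        self.start_io = self._start_io    # stands in for the DDT start_io slot

    def _start_io(self, irp):
        self.reads += 1
        irp.data = self.blocks[irp.block]
        self.post(irp)

    def post(self, irp):
        """Postprocessing: a hook sitting in the pid field gets control first."""
        if callable(irp.pid):
            irp.pid(irp)
        else:
            self.completed.append((irp.pid, irp.data))

    def qio(self, block, pid):
        self.start_io(IRP(block, pid))


def steal_start_io(disk, before=None, after=None):
    """Insert an intercept ahead of whatever start_io is installed now."""
    previous = disk.start_io

    def intercept(irp):
        if before is not None and before(irp):
            disk.post(irp)                # satisfied here; driver never runs
            return
        if after is not None:
            saved_pid = irp.pid           # may itself be an earlier hook

            def hook(done):
                after(done)
                done.pid = saved_pid      # put back what we found
                disk.post(done)           # and let completion continue

            irp.pid = hook
        previous(irp)

    disk.start_io = intercept
    return previous


def add_cache(disk):
    """A single-node read cache built from the two hooks."""
    cache = {}

    def lookup(irp):
        if irp.block in cache:
            irp.data = cache[irp.block]
            return True
        return False

    def remember(irp):
        cache[irp.block] = irp.data

    steal_start_io(disk, before=lookup, after=remember)
    return cache
```

	A monitor installed first and a cache installed second both see
	their completions, since each hook restores the pid field it found
	before passing the packet on.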

6. Stealing the post processing queue. This technique involves some
	earlier access to use the IRP$L_PID hook, and in general you
	need to at least call COM$POST eventually to complete I/O
	on the packet. However, you can use this point to edit what
	the I/O returns. (You can steal the iopost (ipl4) interrupt
	instead, but the IRP$L_PID hook is much easier to use.)

	I have heard reports of insertion of close at this point also.
	Now, in general, you can do $QIO from kernel mode (leaving out
	issues of synchronization), provided that:
		1. All arguments are r/w from KERNEL mode
		2. The QIO mode is kernel
		3. On VAX, previous mode needs to be Kernel since
			PROBE instructions use previous mode for their
			checks.
		4. Kernel AST delivery needs to be enabled if you wish to avoid
			hangs, for XQP processed operations like deaccess
			(=close).

	The difficulty you have is dropping down from IPL4 to do the I/O
	without breaking synchronization and lousing up the IPL4 queue.
	The simplest thing to do is probably to send an AST to yourself
	and do the operations there; this is well documented. You can
	request another interrupt at IPL 4 after having requeued the
	packet (see fddriver sources for a sample fragment) so you
	can get back, and even get the system to complete the I/O for
	you.

7. Stealing generic VMS locations. ("blue sky")
	Suppose we have an IPL 0 site within VMS, a process, an image,
	or the like and want to insert some process code. (If a site is
	handled at some higher IPL, the synchronization
	issues have to be handled also.)

	At the site, insert a $cmkrnl call to get into your processing
	in kernel mode with your own entry mask. (Actual details of
	bashing a particular site are not the issue here; you need to
	be able to stash your patch into nonpaged pool and replicate
	whatever instructions you bash.)

	Your patch will look like this:

	patch:	duplicate instructions
		IF the process is not the service process THEN
		  store off any desired info from the site
		  movq r0,-(sp)
		  $cmkrnl_s routin=mypatch2
		  movq (sp)+,r0
		END IF
		return to original site

	.entry mypatch2,^m<r2,r3,r4,r5,r6,r7,r8,r9,r10,r11>

	allocate a message buffer
	Fill in with address of patchAST (below), and whatever
	other info is desired.
	fill in r3,r4,r5 to point at the buffer and the mailbox UCB
	of the mailbox set up by your service process

	call EXE$WRITEMBX to send the message buffer to the
		mailbox if at high IPL, or just use $qio
		if not.

	Free the buffer
	Clear the local semaphore for this process
	Disable some AST deliveries if appropriate
	Loop calling SCH$RWAIT until your semaphore
	is set. (This prevents your process from noticing a wait
	while the service process runs.)

	Pick up any results from the patchAST and bash whatever is
	appropriate. Reenable AST deliveries if appropriate.

	movl	#ss$_normal,r0
	RET

; patchAST entered via kernel or special kernel AST fired off from the
; service process.
	.entry patchAST,^M<regs>
	Pick off any arguments from the ACB so we can return 
	them to the "mypatch2" procedure.

	Set the local semaphore so the SCH$RWAIT loop terminates
	RET


The service process' outline is:

	Establish the mailbox and stash its UCB address where the
	patch can get it fast.

	forever:
	  Read the mailbox
	  Do whatever the process darn well pleases with the
		information there, having access to the entire
		machine/cluster/net...
	  Send an AST to the address received in the mailbox message
	  Loop

	Again, if you're not starting at IPL 0, you need to handle
forking issues to get to the desired synchronization. Remember that
you can queue ASTs to get to IPL 2 or fork to other levels.
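
	The handshake between the patched site and the service process can
	be tried out in user space. The sketch below is a Python model using
	threads, with queue.Queue standing in for the mailbox and
	threading.Event for the local semaphore. The function names, the
	message layout, and the None shutdown message are assumptions made
	for the model. One real difference to keep in mind: here the
	"AST" callback runs in the service thread, whereas a real AST runs
	in the context of the patched process.

```python
# User-space model only; threads stand in for VMS processes.
import queue
import threading


def service_process(mailbox, decide):
    """Read the mailbox forever and answer each message through its 'AST'."""
    while True:
        msg = mailbox.get()
        if msg is None:                   # model-only shutdown message
            return
        # Do whatever the service pleases with msg["info"], then fire the
        # callback whose "address" arrived in the message.
        msg["ast"](decide(msg["info"]))


def patched_site(mailbox, info, timeout=5.0):
    """What mypatch2 and patchAST do together in the intercepted thread."""
    done = threading.Event()              # local semaphore, initially clear
    result = {}

    def patch_ast(answer):                # stands in for patchAST
        result["answer"] = answer         # hand results back to the patch
        done.set()                        # lets the wait loop terminate

    mailbox.put({"ast": patch_ast, "info": info})
    if not done.wait(timeout):            # stands in for the SCH$RWAIT loop
        raise TimeoutError("service process never answered")
    return result["answer"]
```

	A service thread started with
	threading.Thread(target=service_process, args=(mbx, decide)) then
	answers each patched_site(mbx, info) call with decide(info). The
	timeout exists only so the model cannot hang forever if the
	service side is missing.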

	In passing, you can patch Alpha code too. You follow the code
	pointer in a procedure descriptor, set the page writeable,
	and do the patch, then put things back. This must be done
	with an eye to the macro-64 being generated, though, if
	you don't plan to replace a whole routine. If using macro-32
	compiled code, replace a whole routine where possible.


8. It's possible for drivers and processes to have varying degrees
	of intimacy in exchanging data; some intercepts just move ALL
	IRPs to a process for further filtering, then come back to
	the driver. Others switch only certain IRPs there, depending
	on the need. More than one process can be used in this way,
	this being a generalization of what happens in driver <-> ACP
	communication.


What can you do with this? Some example hacks:
	* Steal RMS entries from a process and get some other process
		to do things first. (Want to duplicate transactions
		without buying rms journalling??)
	* Steal VMS entries and filter them with user mode code
		any way you please. This can be per process or per
		system (IF you're careful!)
	* Stick a patch where DCL comes up with its common error message

  %DCL-W-IVVERB, unrecognized command verb - check validity and spelling

		which is ALWAYS the same and boring. Instead, why not
		have your system generate messages like:

	The way you type, we could be here all night.
	Things seem slow to you? I've got four SPACEWARs and an Ada compile
		running.
	I don't understand either. That command should have worked.
	You didn't really want the answer to that, did you?
	A puff of orange smoke appears and indicates...you screwed up again.


	A process can readily generate such things, blast them over, and
then flag that the patch should skip around the code that
comes up with the boring normal error message. (The process could
also perhaps even try a second time to figure out what you might
really want.)

	* You can from this mode change things like process privs, prompts,
		priority, etc. Doing this randomly is not so great, but
		imagine the following actions: <evil grin>

	"This person is opening the Ada compiler. NOBODY should be
	running THAT interactively! We'll lower THAT boy's
	priority to zero and null all his privs..."

	"Here's someone running TECO. Since he's obviously a wizard,
	raise his priority to 15 and give him SETPRV."

	"This is the 500th I/O with this guy in MAIL. Let's teach him
	a lesson by changing his DCL prompt to "MAIL> "."

	"This guy is running SYSGEN; he's fair game. Let me scare the
	wits out of him by changing his prompt to ">>>" and responding
	to a few commands like a system console for a few lines."

	"Here's a person who is in MAIL and using a lot of swear
	words or scatological language. Tsk tsk. He shouldn't have
	such a foul mouth, so I'll tell my process (HISMOMMY) to take
	over his terminal and demand an apology, and I won't let
	him get real control back until he types I'M SORRY." (Alternatively
	it could rewrite the message so, for instance, "motherf**ker"
	becomes "sweetie"; then everyone can wonder why he suddenly
	sounds like my old Grandmother...)

	"Joe SystemProgrammer is running games. We don't want the 
	boss to catch him, so make him invisible."

	"J. RandomUser is running games. Generate some message
	about an earthquake ruining Colossal Cave and kick him
	off."		(who says we have to be fair??)

	"The guy in the next cubicle is kind of paranoid and always
	seems to sign his vaxmail with "top secret crypto nuclear".
	He just typed that in. So I'll break in on him with some
	message like

%SECURMON-I-FWDNSA, message copy forwarded to approved monitor

	and then let him go on...oughta really freak him out."