From: Jamie Hanrahan [jeh@cmkrnl.com]
Sent: Friday, August 13, 1999 10:59 PM
To: Jan Bottorff; ntdev@atria.com
Subject: RE: [ntdev] User vs. Kernel mode (was re: DDK documentation 'hole')

There are two distinct topics here, so I'm splitting up the replies to Jan
Bottorff.

> From: owner-ntdev@atria.com [mailto:owner-ntdev@atria.com] On Behalf Of
> Jan Bottorff
> Sent: Wednesday, August 11, 1999 18:05
>
> [...]
> I'd actually not be so nice to Microsoft by suggesting some doc fixes
> will improve driver stability. I'd actually go so far as to say much of
> the instability goes to architectural decisions in NT.
>
> For example, the last two device drivers I've worked on had absolutely
> no need to run in kernel mode (a serial port smart card reader, and a
> PCMCIA security device). They are not high bandwidth devices and the
> drivers only indirectly talk to hardware through other drivers. NT's
> driver architecture required these to live in kernel mode.
>
> My view is you should try really hard to keep the amount of kernel code
> to a minimum, as it has the potential to bring down the system.

Anything that has direct access to I/O hardware has an unlimited ability
to corrupt the OS, bypass file security (all security, for that matter),
etc., etc. This is ok for kernel mode, which is considered trusted anyway.
It is not ok for user mode.

Even in the absence of code that is deliberately attempting to corrupt,
hack, bypass, subvert, etc., etc., the OS, long experience with previous
PC operating systems (DOS) firmly established that allowing user mode
access to I/O space was NOT good for system stability. This has been
reconfirmed since via Windows 9x.

I find it utterly ludicrous that you would propose allowing direct access
to hardware from user mode -- when one of the devices you're dealing with
is a "security" device!

You may remember that prior to NT4, NT ran its GDI code and GDI drivers in
user mode, in a separate process. Hence it couldn't corrupt the OS and it
couldn't corrupt any other process. This was done in the interests of
"stability", since it was expected that many of the GDI drivers would be
written by third parties (as indeed was and still is the case).

Alas, the overhead of the necessary IPCs to invoke that stuff was not 10%
or 20%, but more like 800%. NO ONE was willing to pay that price. Not even
on then-future CPUs. Try installing NT 3.51 on a modern CPU and compare
its graphics performance with NT4 on the same system and see for yourself.

You have to remember that IPC involves system service calls (user to
kernel transitions) and thread context switches -- and in this case,
process-to-process switches also. You can't say "well, the CPU power will
catch up and allow us to call into fully protected subsystems with
acceptable speed", because that stuff doesn't scale nearly linearly with
CPU speed.

Overall, basic GDI performance in NT4 was increased over 3.51 and earlier
by about 8:1 or 9:1 by moving the graphics drivers into kernel mode. Of
course there were some resulting stability problems when NT4 first came
out. But it wasn't because the code was in kernel mode; it was because the
kernel mode ports of GDI and the GDI drivers were new. These days, NT is
more stable than it ever was prior to NT4.

> Like most other pieces of code, drivers will ALWAYS have bugs, so the OS
> architecture had better be prepared to cope with them. A blue screen is
> not my idea of how an OS should cope.

You haven't thought this through.
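Before I explain why, it helps to be concrete about where the trust
boundary actually sits. Here is a minimal sketch of the checks a driver
makes on a METHOD_NEITHER ioctl in its device-control dispatch routine.
The names (WidgetDeviceControl, IOCTL_WIDGET_QUERY) are made up for
illustration, and a real driver that uses buffered or direct I/O gets most
of this from the I/O manager for free:

/*
 * Illustrative only: a device-control dispatch routine (it would be hooked
 * up to IRP_MJ_DEVICE_CONTROL in DriverEntry) for a made-up ioctl that
 * returns data through a METHOD_NEITHER output buffer.  The probe-and-copy
 * under __try happens at the user-to-kernel transition, which is the only
 * place the caller's buffer can be treated as untrusted.
 */
#include <ntddk.h>

#define IOCTL_WIDGET_QUERY \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_NEITHER, FILE_ANY_ACCESS)

NTSTATUS
WidgetDeviceControl(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp)
{
    PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(Irp);
    NTSTATUS status = STATUS_SUCCESS;
    PVOID outBuf = Irp->UserBuffer;        /* raw user-mode pointer */
    ULONG outLen = irpSp->Parameters.DeviceIoControl.OutputBufferLength;

    UNREFERENCED_PARAMETER(DeviceObject);

    switch (irpSp->Parameters.DeviceIoControl.IoControlCode) {

    case IOCTL_WIDGET_QUERY:
        if (Irp->RequestorMode != KernelMode) {
            /* User-mode caller: probe and touch the buffer only under an
               exception handler.  A bad pointer becomes an error return
               to the caller instead of a corrupted system. */
            __try {
                ProbeForWrite(outBuf, outLen, sizeof(UCHAR));
                RtlZeroMemory(outBuf, outLen);  /* stand-in for real data */
            } __except (EXCEPTION_EXECUTE_HANDLER) {
                status = GetExceptionCode();
            }
        } else {
            /* A kernel-mode caller is trusted by definition; if it hands
               us garbage, no check made here can save the system. */
            RtlZeroMemory(outBuf, outLen);
        }
        break;

    default:
        status = STATUS_INVALID_DEVICE_REQUEST;
        break;
    }

    Irp->IoStatus.Status = status;
    Irp->IoStatus.Information = NT_SUCCESS(status) ? outLen : 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}

With that picture in mind: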
I know this sounds like a contradiction -- but in the interests of *long
term* system stability, a blue screen is the ONLY reasonable thing to do.

There is no "apartment" model in kernel mode whereby a buggy piece of
kernel mode code can be trusted to only have corrupted its own data. (To
introduce such a model would require kernel mode components to live in
separate, protected address spaces, with IPC overhead between them -- and
again, we wouldn't be talking about 20% extra overhead, we'd be talking
about hundreds of percent. This isn't acceptable for what would otherwise
be a simple call.)

All kernel mode code in NT is trusted equally; all security checks, buffer
access checks, etc., are made upon the transition from user to kernel
mode. There is no such thing as "trusted user mode code", there is no such
thing as "untrusted kernel mode code", and there is no such thing as
"kernel mode code trusted only to modify certain data".

Given those assumptions, when kernel mode code incurs an unhandled
exception or otherwise violates a constraint that MUST apply to kernel
mode code, the only thing the system can reasonably do is to save its
state and freeze. To try to go on ignores the following facts: (a) you
have direct evidence that something in kernel mode that you just executed
has a bug in it; and (b) you have no evidence that that's the ONLY bug --
it could have corrupted many other things in memory (like pointers) that
just haven't been referenced yet.

The alternative is to keep running. The problem with that, and the reason
that this is not the right thing for stability in the long run, is that
the system *will* fall over eventually. But the routines that caused the
corruption from which you can't recover may be long gone... and this will
make debugging far more difficult.

Conclusion: yes, new code in kernel mode will always have bugs. (I find it
wildly pessimistic to claim that "all code will always have bugs", but
that's another topic.) But the bugs get fixed a lot faster because of the
"blue screen of death" philosophy.

> For most uses of NT, having an order of magnitude better stability (one
> crash per year vs one crash per month) is much more important than
> having 20% lower processor utilization (a horribly high allowance for
> interprocess communication overhead).

As I said, the proven results are that the overhead was not 20%, but more
like 800%.

And please, don't say "then they must not have done it right". The NT
kernel team were (and are) not inexperienced at OS design (Dave Cutler has
*four* shipping, commercially successful operating system products on his
resume), nor at performance optimization, and they tried VERY hard to make
this work, because running GDI as a protected subsystem in user mode was a
pet goal of DaveC's... with exactly your agenda: they wanted to isolate
the OS from bugs in GDI drivers. They implemented an optimized IPC path
just for the purpose of invoking GDI calls in the Win32 subsystem process;
they did a lot of client-side caching and batching... all to no avail.
They couldn't make it run fast enough.

---
Jamie Hanrahan, Kernel Mode Systems ( http://www.cmkrnl.com/ )

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[ To unsubscribe, send email to ntdev-request@atria.com with
  body UNSUBSCRIBE (the subject is ignored). ]