The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, Washington, October 28-31, 1996.

Microkernels Meet Recursive Virtual Machines

Bryan Ford  Mike Hibler  Jay Lepreau  Patrick Tullmann  Godmar Back  Stephen Clawson
Department of Computer Science, University of Utah
Salt Lake City, UT 84112
flux@cs.utah.edu    http://www.cs.utah.edu/projects/flux/

This research was supported in part by the Defense Advanced Research Projects Agency, monitored by the Department of the Army, under contract number DABT63-94-C-0058. The opinions and conclusions contained in this document are those of the authors and should not be interpreted as representing official views or policies of the U.S. Government.

Abstract

This paper describes a novel approach to providing modular and extensible operating system functionality and encapsulated environments based on a synthesis of microkernel and virtual machine concepts. We have developed a software-based virtualizable architecture called Fluke that allows recursive virtual machines (virtual machines running on other virtual machines) to be implemented efficiently by a microkernel running on generic hardware. A complete virtual machine interface is provided at each level; efficiency derives from needing to implement only new functionality at each level. This infrastructure allows common OS functionality, such as process management, demand paging, fault tolerance, and debugging support, to be provided by cleanly modularized, independent, stackable virtual machine monitors, implemented as user processes. It can also provide uncommon or unique OS features, including the above features specialized for particular applications' needs, virtual machines transparently distributed cross-node, or security monitors that allow arbitrary untrusted binaries to be executed safely. Our prototype implementation of this model indicates that it is practical to modularize operating systems this way. Some types of virtual machine layers impose almost no overhead at all, while others impose some overhead (typically 0--35%), but only on certain classes of applications.

1 Introduction

Increasing operating system modularity and extensibility without excessively hurting performance is a topic of much ongoing research [5, 9, 18, 36, 40]. Microkernels [4, 24] attempt to decompose operating systems ``horizontally'' by moving traditional kernel functionality into servers running in user mode. Recursive virtual machines [23], on the other hand, allow operating systems to be decomposed ``vertically'' by implementing OS functionality in stackable virtual machine monitors, each of which exports a virtual machine interface compatible with the machine interface on which it runs. Traditionally, virtual machines have been implemented on and export existing hardware architectures so they can support ``naive'' operating systems (see Figure 1). For example, the most well-known virtual machine system, VM/370 [28, 29], provides virtual memory and security between multiple concurrent virtual machines, all exporting the IBM S/370 hardware architecture. Furthermore, special virtualizable hardware architectures [22, 35] have been proposed, whose design goal is to allow virtual machines to be stacked much more efficiently.
Figure 1: Traditional virtual machines based on hardware architectures. Each shaded area is a separate virtual machine, and each virtual machine exports the same architecture as the base machine's architecture.

This paper presents a new approach to OS extensibility which combines both microkernel and virtual machine concepts in one system. We have designed a ``virtualizable architecture'' that does not attempt to emulate an actual hardware architecture closely, but is instead designed along the lines of a traditional process model and is intended to be implemented in software by a microkernel. The microkernel runs on the ``raw'' hardware platform and exports our software-based virtualizable architecture (see Figure 2), which we will refer to as a virtualizable process or nested process architecture to avoid confusion with traditional hardware-based architectures. The virtual machine monitors designed to run on this software architecture, which we call nesters, can efficiently create additional recursive virtual machines or nested processes in which arbitrary applications or other nesters can run.

Figure 2: Virtual machines based on an extended architecture implemented by a microkernel. The interface between the microkernel and the bare machine is a traditional hardware-based machine architecture, but the common interface between all the other layers in the system is a software-based nested process architecture. Each shaded area is a separate process.

Although the Fluke architecture does not closely follow a traditional virtual machine architecture, it is designed to preserve certain highly useful properties of recursive virtual machines. These properties are required to different degrees by different nesters that take advantage of the model. For example, demand paging and checkpointing nesters require access to and control over the program state contained in their children, grandchildren, and so on, whereas process management and security monitoring nesters primarily rely on being able to monitor and control IPC-based communication across the boundary surrounding the nested environment.

Our microkernel's API provides these properties efficiently in several ways. Address spaces are composed from other address spaces using hierarchical memory remapping primitives. For CPU resources, the kernel provides primitives that support hierarchical scheduling. To allow IPC-based communication to short-circuit the hierarchy safely, the kernel provides a global capability model that supports selective interposition on communication channels. On top of the microkernel API, well-defined IPC interfaces provide I/O and resource management functionality at a higher level than in traditional virtual machines. These higher-level interfaces are more suited to the needs of modern applications: e.g., they provide file handles instead of device I/O registers.

This nested process architecture can be used to apply existing algorithms and techniques in more flexible ways. Some examples we demonstrate in this paper include the following:

Decomposing the kernel: Some features of traditional operating systems are usually so tightly integrated into the kernel that it is difficult to eliminate them in situations in which they are not needed. A striking example is demand paging. Although it is often possible to disable it in particular situations on particular regions (e.g., using posix's mlock()), all of the paging support is still in the kernel, occupying memory and increasing system overhead. Even systems that support ``external pagers,'' such as Mach, contain considerable paging-related code in the kernel and most do not allow control over physical memory management, just backing store.
Similarly, multiuser security mechanisms are not always needed, since most personal computers are dedicated to the use of a single person, and even process management and job control features may not be needed in single-application systems such as the proverbial ``Internet appliance.'' Our system demonstrates decomposed paging and POSIX process management by implementing these traditional kernel functions as optional nesters which can be used only when needed, and only for the parts of a system for which they are desired.

Increasing the scope of existing mechanisms: There are algorithms and software packages available for common operating systems to provide features such as distributed shared memory (DSM) [10, 32], checkpointing [11], and security against untrusted applications [52]. However, these systems only cleanly support applications running in a single logical protection domain. In a nested process model, any process can create further nested subprocesses which are completely encapsulated within the parent. This design allows DSM, checkpointing, security, and other mechanisms to be applied just as easily to multi-process applications or even complete operating environments. Our system demonstrates this flexibility by providing a checkpointer, implemented as a nester, which can be transparently applied to arbitrary domains such as a single application, a multi-process user environment containing a process manager and multiple applications, or even the entire system.

Composing OS features: The mechanisms mentioned above are generally difficult or impossible to combine flexibly. One might be able to run an application and checkpoint it, or to run an untrusted application in a secure environment, but existing software mechanisms are insufficient to run a checkpointed, untrusted application without implementing a new, specialized program designed to provide both functions. A nested process architecture allows one to combine such features by layering the mechanisms, since the interface between each layer is the same. In Fluke, for example, a Unix-like environment can be built by running a process manager within a virtual memory manager, so that the process manager and all of the processes it controls are paged. Alternatively, the virtual memory manager can be run within the process manager to provide virtual memory to an individual process.

We used micro benchmarks to measure the system's performance in a variety of configurations. These measurements indicate a slowdown of about 0--35% per virtual machine layer, in contrast to conventional recursive virtual machines whose slowdown is 20--100% [7]. Some nesters, such as the process manager, do not need to interpose on performance-critical interfaces such as memory allocation or file I/O, and hence take better advantage of the short-circuit communication facilities provided by the microkernel architecture. These nesters cause almost no slowdown at all.
Other nesters, such as the memory manager and the checkpointer, must interfere more to perform their function, and therefore cause some slowdown. However, even this slowdown is fairly reasonable. Our results indicate that, at least for the applications we have tested, this combined virtual machine/microkernel model indeed provides a practical method of increasing operating system modularity, flexibility, and power.

The rest of this paper is organized as follows: In Section 2 we compare our architecture to related work. We describe the key principles upon which our work is based in Section 3, and our software-based virtualizable architecture derived from these principles in Section 4. Section 5 describes the implementation of the example applications and nesters we designed to take advantage of the nested process model. Section 6 describes the experiments and results using the example process nesters. Finally, we conclude with a short reflective summary.

2 Related Work

In this section, we first summarize virtual machine concepts and how our system incorporates them; second, we show how our design relates to conventional process models, and finally, we contrast our system with other microkernel-based systems. We describe related work concerning the details of our design later, in the appropriate sections.

2.1 Traditional Virtual Machines

A virtual machine simulator, such as the Java interpreter [25], is a program that runs on one hardware architecture and implements in software a virtual machine conforming to a completely different architecture. In contrast, a virtual machine monitor or hypervisor, such as VM/370 [28, 29], is a program that creates one or more virtual machines exporting the same hardware architecture as the machine it runs on. Hypervisors are typically much more efficient than simulators because most of the instructions in the virtual machine environment can be executed at full speed on the bare hardware, and only ``special'' instructions such as privileged instructions and accesses to I/O registers need to be emulated in software. Since the ``upper'' and ``lower'' interfaces of a hypervisor are the same, a sufficiently complete hypervisor can even run additional copies of itself, recursively.

Virtual machines have been used for a variety of purposes including security, fault tolerance, and operating system software development [23]. In their heyday, virtual machine systems were not driven by modularity issues at all, but instead were created to make better use of scarce, expensive hardware resources. For example, organizations often needed to run several applications requiring different operating systems concurrently, and possibly test new operating systems, all on only a single mainframe. Therefore, traditional virtual machine systems used (and needed) only shallow hierarchies, implementing all required functionality in a single hypervisor. As hardware became cheap and ubiquitous, virtual machines became less common, although they remain in use in specialized contexts such as fault tolerance [7], and safe environments for untrusted applications [25].

Figure 3: Exponential slowdown in traditional recursive virtual machines caused by emulation and reflection of privileged instructions, accesses to device registers, etc. For n hypervisor layers, at least 2^n - 1 hardware trap/return sequences are required to emulate any privileged instruction executed in the top layer.

In this paper we revive the idea of using virtual machine
concepts pervasively throughout a system, but for the purpose of enhancing OS modularity, flexibility, and extensibility, rather than merely for hardware multiplexing. For these purposes, virtual machines based on hardware architectures have several drawbacks. First, most processor architectures allow ``sensitive'' information such as the current privilege level to leak into user-accessible registers, making it impossible for a hypervisor to recreate the underlying architecture faithfully without extensive emulation of even unprivileged instructions. Second, a hypervisor's performance worsens exponentially with stacking depth because each layer must trap and emulate all privileged instructions executed in the next higher layer (see Figure 3). Third, since hardware architectures are oblivious to the notion of stacking, all communication must be strictly parent-child; there is no way to support ``short-circuit'' communication between siblings, grandparents and grandchildren, etc. 2.2 The Nested Process Model For the above reasons, Fluke does not conform exactly to an existing hardware architecture. Instead, we use a software-based architecture which allows us to solve these problems by defining the architecture to avoid them. Although some existing hypervisors export a virtual architecture that differs slightly from the architecture on which they are based, they generally try to minimize such divergence, changing the architecture only enough to address ``show-stopping'' problems such as double paging. In contrast, Fluke does not attempt to minimize divergence, since running existing operating systems unmodified in our virtual machines is not a goal. Instead, our design goal is to maintain maximum performance in the presence of deep virtual machine layering. The resulting architecture is similar to conventional process models such as Unix's, though with some important differences to make processes recursively virtualizable or ``nestable''; hence the term ``nested process architecture.'' The Cambridge CAP computer [54 ] implemented a similar high-level architecture in microcode supporting an arbitrarily deep process hierarchy in which parent processes virtualize memory, CPU, and trap handlers for child processes. However, the CAP computer strictly enforced the process hierarchy at all levels, and did not allow communication paths to ``short-circuit'' the layers as Fluke does. As noted in retrospect by the designers of the system, this weakness made it impractical for performance reasons to use more than two levels of process hierarchy (corresponding to the ``supervisor'' and ``user'' modes of other architectures), so the uses of nested processes were never actually explored or tested in this system. System call emulation, interposition, and stacking have been used in the past to virtualize the activity of a process by interposing special software modules between an application and the actual OS on which it is running. This form of interposition can be used, for example, to trace system calls or change the process's view of the file system [31 , 33 ], or to provide security against an untrusted application [52 ]. However, these mechanisms can only be applied easily to a single application process and generally cannot be used in combination (only one interposition module can be used on a given process). Furthermore, although file system access and other system call-based activity can be monitored and virtualized this way, it is difficult to virtualize other resources such as CPU and memory. 
2.3 Other Microkernel Architectures

Our nested process model shares many of the same goals as those driving microkernel-based systems: flexibility, modularity, extensibility, and the decomposition of traditional OS features into separate, unprivileged modules. In fact, our prototype is essentially a microkernel-based system revolving around a small privileged kernel that exports a minimal set of interfaces to higher-level software. The primary difference of interest in this paper is that our microkernel is designed to support a nested process model efficiently, and the higher-level services on top of the microkernel take advantage of this model to provide a strongly structured system instead of the traditional ``flat'' collection of client and server processes.

Our work shares many of the extensibility goals of other current kernel-related research [5, 18, 46, 50], but takes a different, often complementary, approach. For example, the Exokernel pushes the ``red line'' between the privileged kernel and unprivileged code as low as possible so that application code can gain more control over the hardware. In contrast, our work is not primarily concerned with the location of this boundary, but instead with certain fundamental properties the kernel interface must have in order for processes to virtualize each other effectively.

3 Properties of Virtual Machines

Before describing the Fluke architecture in detail, we first present the underlying principles on which it is based. All of these principles derive from a single goal: to preserve the useful properties of a ``pure'' recursive virtual machine system without incurring the same exponential performance cost. For our purposes, a ``pure'' virtual machine system is one in which each layer completely simulates the environment of the next higher layer, including all instructions, memory accesses, and device I/O [49]. Our work hinges on two primary properties of such systems: state encapsulation and border control. These properties are described briefly in the following sections, and a concrete analysis of how our architecture satisfies these properties is presented later in Section 4.

Figure 4: Process state in traditional versus nested processes. In traditional systems, the program state (code, data, heap, stack, threads, etc.) of a child process exists independently of the parent process. In a nested process architecture, the child's state is logically a part of the parent's even though the two run in separate address spaces.

3.1 State Encapsulation

The property of state encapsulation is the ability to encapsulate or ``nest'' one process inside another so that the entire state of a child process and all its descendants is logically a subset of the state of the parent process. This property allows the parent process to treat an entire child process hierarchy as merely a chunk of data that can be moved, copied, deleted, etc., while remaining oblivious to the implementation details of the child. Of primary interest are the following three specific aspects of this property: hierarchical resource management, state visibility, and reference relativity.

3.1.1 Hierarchical Resource Management

In popular systems such as Unix, Windows NT [45], and Mach [1], each process in the system exists independently of all other processes. A child process can outlive the parent process that created it, and retains all of its resources after its parent is destroyed (see Figure 4).
By contrast, in a virtual machine system, a virtual machine cannot exist independently of the simulator or hypervisor that created it because the virtual machine's entire state (including the state of any sub-virtual machines it may contain) is merely part of the simulator's program variables. Some operating systems, such as L3 [39], directly support a hierarchical virtual machine-like process model in which a parent process can destroy a child process and be assured that all of the child's descendants will also go away and their resources will be freed. Other systems support a hierarchical resource pool abstraction from which the resources used to create processes are allocated; destroying a pool destroys any sub-pools created from it as well as all processes created from these pools. For example, KeyKOS space banks [6, 27] serve this purpose, as do ledgers in OSF's MK++ [43].

This hierarchical resource management property is a critical prerequisite for many of the applications described in this paper; without it, a parent process cannot even identify all of the state representing a child process subtree, let alone manage it coherently. Note that a traditional ``flat'' process model can easily be emulated in a system supporting the hierarchical model, as described in Section 5.4, but it is difficult or impossible to emulate a hierarchical model given only a flat collection of processes.

3.1.2 State Visibility

The second aspect of the state encapsulation property is state visibility: the ability of a parent process to ``get at'' the state of a child process subtree rather than merely having control over its lifetime. State visibility is needed by any application that manages child process state, such as demand paging, debugging, checkpointing, process migration, replication, and DSM. Pure virtual machine simulators obviously satisfy this property since all of the state of a virtual machine is merely part of the simulator's variables. Same-architecture hypervisors that ``delegate'' to the base processor the job of executing unprivileged instructions depend on the processor to reveal the child's full register state through trap frames or shadow registers.

With notable exceptions such as the Cache Kernel [14] and Amoeba [47], most operating systems are not as good about making a child process's state visible to its parent. For example, although most kernels at least allow a process to manipulate a child's user-mode register state (e.g., Unix's ptrace() facility), other important state is often unavailable, such as the implicit kernel state representing outstanding long-running system calls the child may be engaged in. While these facilities are sufficient for many applications such as debugging, other applications such as process migration and checkpointing require access to all of the child's state.

3.1.3 Relativity of References

While full state visibility is a necessary condition to move or copy a child process, it is also necessary that its references are relative to its own scope and are not absolute references into larger scopes such as the machine or the world. A ``reference'' for this purpose is any piece of state in the child process that refers to some other piece of state or ``object.'' For example, virtual addresses, unique identifiers, and file descriptors are different forms of references.

Figure 5: References into, out of, and within nested processes.
An internal reference is a reference in the child to some other object in the same process; an external reference is a reference to an object in an outer scope (e.g., in an ancestor or sibling process). For example, in Figure 5, references 1 and 2 are internal references with respect to process A, while references 3 and 4 are external references. In a traditional virtual machine system, external references in a virtual machine are effectively just the addresses of I/O ports or memory-mapped devices simulated by the hypervisor, and internal references are simply pointers within that machine's address space. In both cases, the representation of these references is relative to the virtual machine itself; the contents of the virtual machine can be moved at any time to a different context and all of the references it contains will remain valid.

This relativity property is the exception rather than the rule in operating systems. Absolute pathnames in Unix serve as a simple example: if a process containing absolute pathnames is migrated from one machine to another, then the new machine must have exactly the same file system organization (at least in the parts accessed by the migrated program) or the external references from the program to the file system will become invalid. On a system such as L3 in which process-internal objects such as threads are addressed using global unique identifiers, even internal references in a migrated process will become invalid unless exactly the same global identifiers can be obtained on the target machine. Furthermore, there can be no hope of cloning or replicating a process within a single global identifier namespace, since the global identifiers representing all the internal objects can only refer to one object within that scope. Implementing fork() in a single-address-space system [12] is a well-known instance of this problem.

3.2 Border Control

Another primary property of virtual machines is the ability of a hypervisor to monitor and control all communication across the border surrounding a virtual machine without impeding communication within the virtual machine. Barring shared memory channels, the only way for virtual machines to communicate with each other is to trap into the hypervisor, which can handle the request however it chooses. The Clans & Chiefs mechanism in L3 [38] provides border control directly in a microkernel environment. In most microkernels that use a capability model, such as Mach [1] and KeyKOS [6, 27], border control can be achieved through interposition.

Whereas the state encapsulation property allows the parent to control state inside the boundary, border control allows the parent to control the child's view of the world outside the boundary. As with state encapsulation, the extent to which border control is needed depends on the application in question. For example, Unix's ability to redirect the console output of an entire `make' run to a file, including the output of all the subprocesses it forks off, reflects the limited border control provided by the capability-like file descriptor model. However, other applications, such as security monitors which allow untrusted applications to be run safely, require complete border control.

4 Nested Process Architecture

This section describes the nested process architecture used in our system, known as Fluke (``Flux µ-kernel Environment'').
The Fluke architecture is the common interface between each of the stackable layers in the system and consists of three components: the basic computational instruction set, the low-level system call API, and the IPC-based Common Protocols. As Figure 6 illustrates, the key distinction between these components is that the underlying processor provides the one and only implementation of the basic instruction set used throughout the system, and the microkernel provides the only implementation of the system call API, but the Common Protocols API is independently implemented or interposed on, at least in part, at each layer. The following sections describe the three components of the nested process architecture in detail and explain how this design efficiently supports stacking while preserving all of the critical properties described in the previous section.

Figure 6: Illustration of the three components of the nested process architecture presented to each layer. The basic instruction set (left) used by all layers is implemented by the underlying processor and never needs to be emulated. The low-level system call API (right) is similarly implemented by the microkernel for all layers, whereas the Common Protocols API (bottom) is initially implemented by the microkernel but may subsequently be interposed on or reimplemented by higher-level layers.

4.1 Basic Computational Instruction Set

The lowest level of our nested process architecture is a well-defined subset of the x86 architecture, allowing all processes to run with no instruction emulation on x86 machines. However, other processor architectures could be used just as easily, including purely emulation-based instruction sets such as Omniware [2] or Java bytecode [25]. The only restriction is that application processes must only use the subset of the instruction set that satisfies the properties described in Section 3; otherwise they may not function properly in virtualized environments. The instructions compilers produce generally satisfy these requirements automatically, because they refer only to per-thread register state or per-process memory. However, most architectures have a few ``sensitive'' yet unprivileged instructions that reveal global information and therefore cannot be virtualized. For example, the x86's CPUID instruction identifies the processor on which it is executed, making it impossible to safely migrate applications that rely on this instruction.

4.2 Low-level API

The second part of our nested process architecture is a set of low-level microkernel objects and system calls. These primitives are always implemented by the microkernel directly; therefore it is critical that they be designed to support all of the virtualization properties in Section 3.

  fluke_type_create(objp)
      Create a new object of type type at virtual address objp.
  fluke_type_destroy(objp)
      Destroy the object at virtual address objp.
  fluke_type_move(objp, new_objp)
      Move the object from virtual address objp to new_objp.
  fluke_type_reference(objp, ref_objp)
      Associate the reference object ref_objp with the object objp.
  fluke_type_get_state(objp, statep, refp, ...)
      Return the state of objp. State includes simple data passed back in the type-specific structure statep and zero or more capabilities returned in reference objects.
  fluke_type_set_state(objp, statep, refp, ...)
      Loads new state into objp. State arguments are identical to those of the get_state call.

Figure 7: Example of the Fluke low-level (``system call'') API. These are the operations that are common to most object types. Each object type has additional type-specific operations; e.g., IPC operations for ports and lock/unlock operations for mutexes.
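To make the calling pattern concrete, here is a short sketch of these common operations applied to a mutex object. The header name, exact prototypes, and the fluke_mutex_state_t layout are assumptions invented for illustration; only the naming pattern (a common operation set instantiated per object type, with objects addressed by their local virtual addresses) follows Figure 7.

    /*
     * Sketch only: the common object operations of Figure 7 applied to a
     * mutex.  The header name, exact prototypes, and fluke_mutex_state_t
     * are assumptions; only the calling pattern follows the figure.
     */
    #include <fluke/mutex.h>                /* hypothetical header */

    void mutex_demo(void *page)
    {
        /* The object lives in memory the process itself supplies. */
        fluke_mutex_t *m = (fluke_mutex_t *) page;

        fluke_mutex_create(m);              /* <type>_create at a virtual address */
        fluke_mutex_lock(m);                /* type-specific operations */
        fluke_mutex_unlock(m);

        fluke_mutex_state_t st;
        fluke_mutex_get_state(m, &st);      /* export the object's state as plain data */

        fluke_mutex_move(m, m + 1);         /* relocate it within our own space */
        fluke_mutex_destroy(m + 1);
    }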
The reward for carefully designing the low-level API in this way is that it is never necessary for a parent process to interpose on the kernel calls made by child processes; instead, it is always sufficient for the parent process merely to retain control over the resources to which the system calls refer. Figure 7 shows a sample of the Fluke microkernel API [20].

4.2.1 Address Spaces

The Fluke kernel supports an arbitrary number of address spaces, much like stripped-down Unix processes or Mach tasks; multiple threads can run in each address space. Unlike most operating systems, address spaces in Fluke are defined relative to other spaces. The low-level API does not include system calls to allocate or free memory; it only provides primitives to remap a given virtual address range from one space into another space, possibly at a different virtual address range or with reduced permissions. For example, a parent process can use this mechanism to ``donate'' access to some of its memory pages to a nested child process, for that child to use as its own private storage. This child can in turn donate some of this memory to its children, and so on, forming a hierarchy of memory remappings. The parent can change or revoke at any time the permissions the child has to this memory. If a thread in a descendant process attempts to access addresses in its space whose permissions have been revoked, the thread will take a page fault, which the kernel will deliver as an IPC message to a thread in the parent.

This memory remapping mechanism is similar to that of L4 [40], Grasshopper [41], and the ``f-maps'' in virtual machine systems [22, 23]. Our architecture uses this mechanism to provide the state containment and state visibility properties described in Section 3. Since a child process can only access memory given to it by its parent, the parent can locate and identify all the memory comprising the state of a child (including its descendants) simply by keeping track of which pages of memory it grants to the child. A parent can control the child's use of this memory by manipulating the mappings from its space into the child's.

4.2.2 Kernel Objects

All low-level API calls are operations on a few types of primitive kernel objects, such as address spaces and threads. All active kernel objects are logically associated with, or ``attached to,'' a small chunk of physical memory; this is reminiscent of tagged processor architectures such as System 38 [37] and the Intel i960XA [30]. A process can invoke kernel operations on kernel objects residing on pages mapped into its address space by specifying the virtual address of the object as a parameter to the appropriate system call. Since kernel objects are identified by local virtual addresses, this design satisfies the relativity property for user-mode references to kernel objects. In addition, there are system calls that a process can use to determine the location of all kernel objects within a given range of its own virtual address space, as well as system calls to examine and modify the state of these objects [51].
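As a concrete (and hypothetical) illustration of these calls, the sketch below shows how a parent such as the checkpointer of Section 5.6 might walk the kernel objects inside a memory region it granted to a child and extract their state. fluke_region_find_objects(), the FLUKE_THREAD constant, and record_thread() are invented names standing in for the object-location and get_state system calls described above; only the overall shape is taken from the text.

    /*
     * Sketch: a parent enumerating the kernel objects inside memory it
     * granted to a child and extracting their state.  The iterator, the
     * type constant, and record_thread() are hypothetical names.
     */
    #include <stddef.h>

    void save_child_kernel_state(void *child_mem, size_t len)
    {
        void *obj = NULL;           /* iteration cursor over the region */
        int type;

        /* Enumerate every kernel object attached to pages in the region. */
        while (fluke_region_find_objects(child_mem, len, &obj, &type) == 0) {
            if (type == FLUKE_THREAD) {
                fluke_thread_state_t st;
                /* Registers and other vital thread state come back as plain
                 * data (plus capabilities in reference objects, omitted here). */
                fluke_thread_get_state(obj, &st);
                record_thread(obj, &st);   /* hypothetical: append to the image */
            }
            /* ...similar get_state calls for mutexes, ports, spaces, etc... */
        }
    }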
This kernel object design, coupled with the address space remapping mechanism described above, provides the state containment and state visibility properties for kernel object state as well as ``plain data,'' allowing parent processes to identify and gain access to all of the vital state comprising a child process. This mechanism is similar to the kernel object caching mechanism in the Cache Kernel [14 ], except that our mechanism does not impose ordering restrictions on manipulating kernel objects. For example, to take a checkpoint, a checkpointer temporarily revokes its child's access to the memory to be checkpointed, makes a system call to locate the kernel objects within this memory, makes additional kernel calls to extract the vital state from these objects (which includes the state of the child's threads, subprocesses, etc.), saves all the state of both the plain memory and the kernel objects, and finally resumes the child process. This is a very simplified description; the details of the procedure are presented later in Section 5.6. However, this is the fundamental mechanism by which not only checkpointing but other similar applications such as process migration and replication can be implemented transparently in our system. 4.2.3 Thread Objects Multithreaded processes are directly supported in the low-level API through thread objects, which can be created and manipulated just like any other kernel object. Once a new thread object has been created and set up properly, it becomes an active, independent flow of control supervised directly by the kernel, executing instructions in its associated address space. Unlike in most systems, Fluke threads provide full state visibility: a parent process can stop and examine the state of the threads in a child process at any time, and be assured of promptly receiving all relevant child state; i.e., everything necessary to transplant the child nondestructively. One implication of this property, which affects the low-level API pervasively, is that all states in which a thread may have to wait for events caused by other processes must be explicitly representable in the saved-state structure the kernel provides to a process examining the thread. For example, our low-level API provides a ``send request and await reply'' IPC primitive as a single system call. Since the server may take arbitrarily long to service the request, and a parent process may try to snapshot the thread's state while the thread is waiting, the kernel must correctly reveal this intermediate state to the parent. In our system the kernel reveals this state by modifying the client thread's registers so that it appears to the parent that the client was calling the low-level API entrypoint that only waits for a reply without sending a message. This technique of adjusting user-level registers to reflect kernel state transitions is used throughout Fluke to provide full kernel state exportability without reducing efficiency by breaking up all system calls into minimal atomic units. The Mach IPC system [16 ] uses a similar technique, except that user code must explicitly check for interruption after every IPC operation and restart it manually if necessary. Also, various other Mach system calls are not cleanly interruptible at all and therefore do not support full state visibility. 4.2.4 Capabilities All references between low-level objects in Fluke are represented as kernel-mediated capabilities [37 ]. 
Each primitive object type contains a number of ``capability slots.'' For example, each thread object contains an address space slot, which refers to the address space in which the thread is to run. Processes can store and manipulate individual capabilities using reference objects, which are kernel objects that hold a single capability of any type. System calls are provided to produce capabilities pointing to existing kernel objects, copy capabilities between reference objects and the special-purpose slots in other kernel objects, pass capabilities through IPC, and compare them against each other. A process only refers to a capability by the virtual address of the kernel object in which it is contained; it never has direct access to the capability itself. Capabilities in our API provide the relativity property (Section 3.1.3) for cross-address-space references, such as references within a child process to objects implemented by its parent. Since only the kernel can access the actual contents of a capability, capabilities can be passed freely between arbitrary processes in our system, allowing communication to short-circuit the process hierarchy when appropriate. This contrasts with CAP [54 ], where capabilities are used only for communication within a process and all interprocess communication is strictly parent-child and based on traps. Kernel-mediated capabilities satisfy the relativity property because even though the contents of a capability are absolute in the view of the kernel (i.e., it is typically just a pointer in the kernel's address space), from the view of any user-level process, a capability appears relative since the process has no access to the actual pointer stored in the capability. A parent process can locate all of the capabilities contained in a child process, discover to which objects they point, and transparently make substitutions of these capabilities, analogous to ``pointer swizzling'' done in the persistence and language domains. Capabilities also provide border control (Section 3.2). Since a parent process determines what capabilities it passes to its child, it can interpose on any of these capabilities, as well as on capabilities subsequently passed into or out of the child through communication channels on which the parent has interposed. This way, the parent can completely monitor and control all communication into or out of a process by intercepting IPC messages. However, the capability model also allows the parent to interpose selectively on only a subset of the capabilities provided to a child. For example, a nester that interposes on file system-related operations does not also need to intercept messages to other external services. This contrasts with L3's Clans & Chiefs model [38 ], where interposition is ``all or nothing'': if a parent wants to intercept any communication transparently, it must intercept all communication. 4.2.5 Scheduling The final type of resource with which the Fluke kernel deals directly is CPU time. As with memory and communication, the kernel provides only minimal, completely relative scheduling facilities. Threads can act as schedulers for other threads, donating their CPU time to those threads according to some high-level scheduling policy; those threads can then further subdivide CPU time among still other threads, etc., forming a scheduling hierarchy. The scheduling hierarchy usually corresponds to the nested process hierarchy, but is not required to do so. 
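To give a feel for what this donation-based hierarchy looks like in code, the sketch below shows a user-level scheduler thread parceling out the CPU time that was donated to it. It is purely conceptual: fluke_sched_donate(), client_t, and the helper functions are invented names, and the actual primitives and policies are those of the scheduling work cited just below.

    /*
     * Conceptual sketch only: a thread acting as a scheduler for other
     * threads by donating its own CPU time to them.  fluke_sched_donate(),
     * client_t, and the helpers are hypothetical names.
     */
    void scheduler_thread(void)
    {
        for (;;) {
            client_t *c = pick_next_client();   /* any policy: FIFO, stride, ... */

            /* Donate our CPU to the client's thread; the kernel returns
             * control to us when the timeslice expires or the client blocks. */
            fluke_sched_donate(c->thread, c->timeslice);

            charge_usage(c);                    /* account for the time consumed */
        }
    }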
Our scheduling model, which has been prototyped and tested in user space but not yet as part of the Fluke kernel, is detailed in a companion paper [21]. However, only its relative, hierarchical nature is important to the nested process architecture. Other hierarchical schedulers should also work, such as the meter system in KeyKOS [27], lottery/stride scheduling [53], and SFQ scheduling [26].

4.3 High-level Protocols

While our low-level IPC mechanism provides primitive capability-based communication channels, a higher level protocol defines the conventions for communication over these channels. The key distinction between the high-level protocols and the low-level system call API is that any layer in the system can provide implementations of the objects defined by these interfaces, whereas only the microkernel can implement the low-level API. High-level objects automatically support the important properties in Section 3 because they are based on a capability model which allows interposition; the disadvantage is that interposition has a significant cost. We have designed our system around this trade-off by placing simple, performance-critical activities in the low-level API while leaving services that are invoked less often to be implemented by high-level objects.

Our high-level interfaces, defined in CORBA IDL, are known as the Common Protocols because they are common to each layer in the nesting hierarchy. The Fluke Common Protocols, modeled on existing multiserver systems [8, 33], are designed to support a posix-compatible environment, though many of the interfaces are generic enough to support other APIs such as Win32. Table 1 shows the currently defined interfaces and example methods for each.

Table 1: Fluke Common Protocols interfaces and example methods. The top-level Parent interface acts as a name service for obtaining references to process management, memory management and other nesters.

  Parent::              Basic parent-child interface
    get_process           Get the Process Management interface
    get_mem_pool          Get the memory pool for this process
  Process::             posix process management interface
    create_child          Create a new posix child process
    exec                  Execute a program
  MemPool::             Memory Management interface
    create_var_segment    Create a growable memory segment
    create_sub            Create a sub-pool of a pool
  FileSystem::          File system interface
    open                  Open a file
    mkdir                 Create an empty directory
  FileDescription::     Open file and segment interface
    read                  Read data from a file
    map                   Map a file or memory segment

The most basic interface, the Parent interface, is used for direct parent/child communication, and effectively acts as a first-level ``name service'' interface through which the child requests access to other services. This is the only interface that all nesters interpose on; nesters selectively interpose on other interfaces only as necessary to perform their function. The cost of interposition on the Parent interface is minimal because the child usually makes only a few requests on this interface, during its initialization phase, to find other interfaces of interest. The Parent interface currently provides methods to obtain initial file descriptors (e.g., stdin, stdout, stderr), to find file system, memory, and process managers, and to exit. The relevant details of these interfaces will be presented in the next section as we describe specific applications that use and interpose on them.
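To make the initialization sequence concrete, here is a sketch of a child's startup path against these interfaces. The C stub names (parent_get_process() and so on) and the port_ref_t handle type are hypothetical IDL bindings invented for illustration; the interface and method names themselves are the ones listed in Table 1.

    /*
     * Sketch of a child's startup path over the Common Protocols.  The C
     * stub names and the port_ref_t handle type are hypothetical IDL
     * bindings; only the interface and method names follow Table 1.
     */
    void child_startup(port_ref_t parent)
    {
        port_ref_t process_mgr, mem_pool, heap_seg;

        /* The Parent interface is the one channel every nester interposes
         * on; a child typically uses it only here, during initialization,
         * to find the other services it needs. */
        parent_get_process(parent, &process_mgr);   /* Parent::get_process */
        parent_get_mem_pool(parent, &mem_pool);     /* Parent::get_mem_pool */

        /* Later traffic goes directly to the servers behind these
         * capabilities, short-circuiting any nesters that chose not to
         * interpose on them. */
        mempool_create_var_segment(mem_pool,        /* MemPool::create_var_segment */
                                   4 * 1024 * 1024, &heap_seg);

        /* ...map the segment, set up the heap, run the application... */
    }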
5 System Implementation

In the following sections we overview the kernel, support libraries, and example nesters we have implemented to demonstrate the nested process model. These user-level applications take advantage of the model to provide traditional OS features, namely posix process management, demand paging, checkpointing, debugging, and tracing, in a more flexible and decomposed way. Table 2 lists the example nesters and the Common Protocols interfaces they interpose on.

Table 2: Fluke nesters and the interfaces they interpose on. FileSys is the file system interface; File is the file and memory segment interface.

  Nester        Parent   MemPool   Process   FileSys   File
  Debug/Trace     X                   X
  Process         X         X                            X
  Memory          X         X         X         X        X
  Checkpoint      X         X

5.1 The Microkernel

To provide the initial, base-level implementation of our nested process architecture, we developed a new microkernel that runs directly on ``raw'' x86 PC hardware. The initial implementation was designed primarily with portability, readability, and flexibility in mind rather than maximum performance; it is written mostly in machine-independent C and relies heavily on generic facilities provided by the Flux OS Toolkit [19]. The prototype supports kernel preemptibility and full multiprocessor locking, and can be configured to run under either an interrupt or process model. The kernel currently contains built-in console, serial, clock, disk, and network drivers, although it is designed to support out-of-kernel drivers in the future.

Besides implementing the low-level API used by all processes, the microkernel also implements a first-level Common Protocols interface defining the environment presented to the first process loaded (the ``root'' process). This initial interface is sufficient to allow various simple applications as well as nesters to be run directly on top of the microkernel: for example, it includes a basic (physical) memory allocation interface, and a minimal root file system interface which acts like a RAM disk.

Although it should be possible for a traditional monolithic kernel to implement a nested process architecture, we took a microkernel approach for the proof-of-concept, as it would be much more difficult to adapt an existing monolithic kernel because of the large source base and numerous changes that would be required. In addition, a monolithic kernel would benefit less from a nested process model because so much functionality is already hard-wired into the kernel. For example, while our checkpointer example would probably still apply, the decomposed process manager and virtual memory manager wouldn't. Because we chose the microkernel approach, our system takes the well-known ``microkernel performance hit'' [13] due to the additional decomposition and context switching overhead. This performance impact is made worse by the fact that our microkernel is new and entirely unoptimized.

5.2 The Libraries

In our system, traditional Unix system calls are implemented by the C library residing in the same address space as the application. These C library functions communicate with ancestor nesters and external servers as necessary to provide the required functionality.
For example, each process's file descriptor table is managed by its local C library and stored in the process itself, as IPC capabilities to objects managed by file servers.[1] Our system currently provides two different C libraries: a ``minimal'' library for nesters and simple applications, and a full-blown BSD-based C library to support ``real'' Unix applications. Similar libraries could be designed to provide compatibility with other APIs and ABIs, such as Linux or Win32.

[1] The actual files and ``open file descriptions,'' containing seek pointers and other per-open state, are maintained by the file servers; this greatly simplifies some of the traditionally hairy Unix multiserver issues.

The Nesting library, linked only into nesters and not ordinary applications, provides the ``parent-side'' complement to the C library. For example, it contains standard functions to spawn nested subprocesses and to handle or forward a child's Common Protocols requests. Use of this library is optional. Applications can always create nested processes manually in whatever way they desire; the library only provides a ``standard'' mechanism for creating children and providing Common Protocols-compatible interfaces to them.

5.3 The Debug and Trace Nesters

We have implemented, as a simple nester, a debugger that can be used to debug either ordinary applications or other nesters. The debugger works by handling exceptions that occur in the child process. When a thread in the child faults, the kernel sends an exception RPC to the debugger, which handles the fault by communicating with a remote host running GDB. The debugger restarts the thread by replying to the kernel's exception RPC. Although Mach provides a similar ability to interpose on an exception port, Mach allows a task to change its own exception port reference, potentially allowing a buggy or uncooperative task to escape the debugger's control.

We have also implemented a simple tracer, as a minor modification to the debugger, which can be used to trace the message activity of an arbitrary subprocess. The tracer starts by interposing on the application's Parent interface (as all nesters do). Then, for any request from the child that returns a new IPC capability, the tracer interposes on it. This interposition is completely generic in that the tracer needs to know nothing about the semantics of what it is interposing on; it will work for any IPC-based protocol. Although the tracer does nothing more than record messages that cross the subprocess's border, a security monitor for untrusted applications would interpose in basically the same way.

5.4 The Process Management Nester

Support for posix-style processes in our system is provided with the help of a process manager, which is the only nester in our prototype system that manages multiple immediate subprocesses at once. Each posix process, regardless of its location in the Unix-style process hierarchy, is a direct child of the process manager in the global nested process hierarchy, as illustrated in Figure 8. This allows a posix child process to outlive its posix parent, as is the proper behavior for posix-style processes.

Figure 8: Process manager managing a posix-style process hierarchy. Each posix process is an immediate child of the process manager in terms of the nested process hierarchy. Process IDs and the posix-level parent/child relationships are implemented within the process manager.
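To make this structure concrete, the sketch below shows one way a process manager of this shape might record each posix process as an immediate nested child, anticipating the process-ID and MemPool bookkeeping described next. Every type and helper name here is invented for illustration.

    /*
     * Illustrative only: per-posix-process bookkeeping for a process
     * manager shaped like the one in Figure 8.  Every posix process is an
     * immediate nested child of the manager; posix pids and parent/child
     * links exist only in this table.  All types and helpers are
     * hypothetical.
     */
    typedef struct pm_proc {
        int           pid;     /* posix process ID handed out by the manager */
        int           ppid;    /* posix parent: another entry in this table */
        mempool_ref_t pool;    /* sub-pool this process's memory comes from */
        child_ref_t   child;   /* handle on the underlying nested subprocess */
    } pm_proc_t;

    static mempool_ref_t manager_pool;  /* MemPool the manager got from its parent */
    static int           next_pid = 2;

    /* Handle a Process::create_child request (the target of fork() in the
     * child's C library). */
    pm_proc_t *pm_create_child(pm_proc_t *posix_parent)
    {
        pm_proc_t *p = new_table_entry();
        p->pid   = next_pid++;
        p->ppid  = posix_parent->pid;
        p->pool  = mempool_create_sub(manager_pool);   /* MemPool::create_sub */
        p->child = spawn_nested_child(p->pool);        /* via the nesting library */
        return p;
    }

    /* On exit or kill, destroying the sub-pool reclaims all of the process's
     * memory, kernel objects, and any nested subprocesses it created. */
    void pm_destroy(pm_proc_t *p)
    {
        mempool_destroy(p->pool);
        free_table_entry(p);
    }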
The Process Manager assigns a process ID to each of its child processes, and allows these processes to create other ``peer'' posix processes using fork() and exec(), which the C library converts into Common Protocols requests to the process manager. The process manager also handles interprocess signals and other posix process-related features. As with the other nesters, the Process Manager is an optional component: applications that don't fork(), send signals, etc., can be run without it. Furthermore, multiple process managers can be run side-by-side or even arbitrarily ``stacked'' on top of each other to provide multiple independent posix environments on a single machine; each of these environments has its own process ID namespace and posix process hierarchy. The process manager uses the Common Protocols' MemPool (``memory pool'') interface to manage the memory resources consumed by its children and ensure that all state is properly cleaned up when a posix process terminates. A MemPool is an object from which anonymous memory segments can be allocated and mapped by any process holding a capability for the pool. The MemPool interface also provides methods to create and destroy sub-pools; destroying a pool destroys all memory segments allocated from it and, recursively, all sub-pools derived from it. Thus, MemPools provide the hierarchical resource management property (Section 3.1.1) for memory. The process manager uses the MemPool capability from its parent to create a sub-pool for each posix process it supervises. When a process terminates or is killed, the process manager simply destroys the appropriate sub-pool; this automatically frees all memory allocated by that process and any nested subprocesses that may have been created. Furthermore, since the subprocess's threads and other kernel objects are merely part of its memory, they too get destroyed automatically when the MemPool is destroyed. 5.5 The Virtual Memory Management Nester We have implemented a user-level demand paged virtual memory manager which creates a nested environment whose anonymous memory is paged to a swap file. Arbitrary programs can be run in this paged environment, such as a single application, or a process manager supporting an entire paged posix environment similar to a traditional Unix system. The memory manager provides paged memory by interposing on the MemPool interface it passes to its child, re-implementing the pool and segment operations itself instead of merely creating a sub-pool and passing the sub-pool to the child as the process manager does. All memory segments allocated from the memory manager are backed by a swap file and cached in its own address space. The ``physical memory'' cache used by the memory manager is a memory segment allocated from the memory pool passed in by the memory manager's parent; the swap file is implemented by a Common Protocols file system server. When a client invokes the map operation on a segment implemented by the memory manager, the memory manager uses Fluke kernel address space manipulation primitives to remap the appropriate portions of its own memory into the address space of the client. These mappings correspond to ``resident pages'' in a conventional, kernel-based virtual memory system. The kernel notifies the memory manager of access violation and non-resident exceptions in the segments it supervises, allowing it to do demand allocation of memory and lazy copy optimization. 
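The interaction just described can be summarized as a single service loop; the sketch below is illustrative only. fluke_region_map(), the fault-message fields, and the cache and swap helpers are hypothetical stand-ins for the kernel's remapping primitive and fault-notification IPC, and the prototype's actual policies are described next.

    /*
     * Sketch of the memory manager's fault-service loop.  fluke_region_map(),
     * the fault-message fields, and the cache/swap helpers are hypothetical
     * stand-ins for the remapping primitive and fault-notification IPC
     * described in the text.
     */
    void memory_manager_loop(void)
    {
        for (;;) {
            fault_msg_t f;
            receive_fault(&f);                  /* kernel-delivered fault IPC */

            /* Find the page in our "physical memory" cache, bringing it in
             * from the swap file (evicting a victim first if the cache is
             * full). */
            page_t *pg = cache_lookup(f.segment, f.offset);
            if (pg == NULL)
                pg = page_in_from_swap(f.segment, f.offset);

            /* Remap the cached page from our own address space into the
             * client's space at the faulting address; to the client this is
             * now a "resident page". */
            fluke_region_map(f.client_space, f.fault_va,
                             pg->local_va, PAGE_SIZE, f.access);

            reply_fault(&f);                    /* let the faulting thread resume */
        }
    }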
In the prototype memory manager, the physical memory cache is divided into fixed-size (4k) pages. All allocations and I/O operations are performed on single pages; no prepaging or other clustering techniques are used. It implements a simple global FIFO page replacement algorithm and uses a single, fixed-size file as its backing store. The current implementation does not maintain ``dirty'' bits, so pages are always written to backing store. 5.6 The Checkpoint Nester We have implemented a user-level checkpointer that can operate over a single application or an arbitrary environment, transparently to the target. By loading a checkpointer as the ``root'' nester immediately on top of the microkernel, a whole-machine checkpointed system can be created. To our knowledge this is the first checkpointer that can operate over arbitrary domains in this way. Like the memory manager, the checkpointer interposes on the MemPool interface in order to maintain complete control over the state of its child processes. Since the kernel provides primitives to locate and manipulate the low-level objects within a memory region, the checkpointer effectively has direct access to all kernel object state in the child as well as to its raw data. The checkpointer currently uses a simplistic sequential checkpointing algorithm to take a checkpoint: it temporarily revokes all permissions on the memory segments it provides (which also effectively stops all threads in the child since they have no address space to execute in), saves the contents of the child's memory and kernel object state, and then re-enables the memory permissions to allow the child to continue execution. This algorithm, of course, will not scale well to large applications or distributed environments. However, more efficient checkpointers based on well-known single-process algorithms [15 , 17 ] could be implemented in our environment in the same way, and should also work automatically over multi-process domains. In order to checkpoint kernel objects that contain capabilities, the checkpointer discovers what object each capability points to and replaces it with a simple identifier that is unique within the saved checkpoint image. There are two classes of capabilities that the checkpointer must deal with, corresponding to the two types of references described in Section 3.1.3. To handle capabilities representing internal references (references to other objects within the checkpointed environment), the checkpointer builds a catalog of the objects in the checkpointed environment and uses kernel primitives to look up each capability in this catalog. Capabilities representing external references (references to objects outside the checkpointed environment) will not appear in the catalog, but since any external reference owned by the child environment must have been granted to it by the checkpointer at some point, the checkpointer can recognize it and take a reasonable course of action. For example, the capabilities representing the stdin, stdout, and stderr file handles are recognized by the checkpointer and, on restart, are reinitialized with the corresponding file handles in the new environment. Thus, all standard I/O file descriptors (including descriptors in nested subprocesses of the application) are transparently routed to the new environment. 
Similarly, when the child makes a Parent interface request for an external service such as process management, the checkpointer keeps track of the IPC capability returned, so that it can route it to the new environment. IPC capabilities that the checkpointer chose not to interpose on, such as open files other than the standard I/O handles, are replaced with null references. This has consequences similar to an NFS server going down and leaving stale file handles behind. Although our current implementation doesn't interpose on any file system accesses, it could easily recognize open calls and save file names, or even whole files, in order to provide a more consistent restart.

The checkpointer is our most comprehensive nester, taking advantage of all of the virtual machine properties described in Section 3 to provide the most complete encapsulation possible. The state encapsulation, visibility, and relativity properties allow the checkpointer to save and restore the state of the child's memory and kernel objects. Additionally, the border control provided by the capability model allows the checkpointer to interpose on whichever interfaces it needs to; our implementation interposes on only those things necessary for a minimal complete checkpoint, comparable to the functionality offered by other user-level checkpointers [42, 48].

There is a large body of work on checkpointers both in and out of the kernel. A few existing operating systems, such as KeyKOS [34] and L3 [39], have implemented checkpointing on a whole-machine basis in the kernel. Similarly, a hypervisor was recently used to provide fault tolerance (replication) on a whole-machine basis on PA-RISC machines [7]. While these features appear practical and useful in some situations, they are inflexibly tied to the machine boundary and cannot easily be used on smaller scopes, such as a process or a group of processes, or on larger scopes, such as networked clusters of machines. The nested process model allows checkpointing and other algorithms to be implemented over more flexible domains.

6 Experimental Results

In order to evaluate the performance effects of nesting processes in our Fluke implementation, we used a set of micro benchmarks designed to reveal operating system performance properties that directly affect real-world applications. Our primary interest in these tests is to show the performance effect of different nesters on various types of applications; thus, we are chiefly concerned with relative slowdown due to nesting rather than the absolute performance of the system. All tests were performed on 200MHz Pentium Pro PCs with 128MB of RAM. The micro benchmarks used are:

memtest: This is a simple memory tester which allocates as much memory as it can with sbrk and then makes four passes over it. Each pass writes and then reads back a distinct pattern looking for bit errors. We configured the heap to 4MB for this test. Memtest stresses memory management and is representative of programs that first allocate memory and then repeatedly operate on it.

appel2: This benchmark tests a combination of basic virtual memory primitives as identified by Appel and Li [3]. This test, known as ``trap+protN+unprot'' in the original paper, measures the time to protect 100 pages, randomly access each page, and, in the fault handler, unprotect the faulting page. Where memtest emphasizes higher-level Common Protocols memory management, appel2 stresses the microkernel's low-level memory remapping mechanism.
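For readers unfamiliar with the Appel and Li test, the sketch below shows the ``trap+protN+unprot'' pattern expressed with ordinary Unix primitives (mmap, mprotect, and a SIGSEGV handler). This is only an illustration of the access pattern, not the Fluke benchmark itself, which exercises the equivalent nested-process mechanisms; the timing harness and the random page-visit order are omitted.

/*
 * Sketch of Appel & Li's "trap + protN + unprot" pattern: protect N pages,
 * touch each one, and unprotect just the faulting page inside the handler.
 */
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 100

static long pagesize;

/* Fault handler: re-enable access to the single page that faulted. */
static void unprot_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesize - 1));
    mprotect(page, (size_t)pagesize, PROT_READ | PROT_WRITE);
}

int main(void)
{
    pagesize = sysconf(_SC_PAGESIZE);
    char *region = mmap(NULL, NPAGES * (size_t)pagesize,
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = { 0 };
    sa.sa_sigaction = unprot_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* protN: revoke access to all pages at once. */
    mprotect(region, NPAGES * (size_t)pagesize, PROT_NONE);

    /* trap + unprot: each touch faults once; the handler restores access.
       (The original benchmark visits the pages in random order.) */
    for (int i = 0; i < NPAGES; i++)
        region[i * pagesize] = 1;

    return 0;
}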
forktest: This program exercises the posix fork() and wait() operations of the Process Manager. Forktest creates a parent-child chain four levels deep; each level performs a simple computation and then returns the result in its exit status.

readtest: This test is similar to the lmbench [44] bw_file_read test. It reads an 8MB file in 8KB chunks accessing every word of the buffer after each read. It is intended to discover the best-case file caching capability of the operating system.

matconn: The final benchmark is a computationally intensive application. It uses Warshall's algorithm to compute connectivity in adjacency matrices, and generates successively larger matrices starting at 2x2 and ending with 128x128.

6.1 Absolute Performance

To provide a baseline for further evaluation, Table 3 presents absolute times for various primitive Fluke microkernel operations. Table 4 shows absolute times for the aforementioned benchmark programs running directly on top of the microkernel with no intervening processes. For reference, we also show micro benchmark performance results for FreeBSD (version 2.1.5). Fluke performs reasonably in most tests even though it is, for the most part, an unoptimized microkernel and FreeBSD is a mature, well-optimized monolithic kernel. There is no time listed for forktest in the Fluke column since the Fluke kernel's Common Protocols server does not implement the Process Management interface.

| Operation             | Time (µs) |
|-----------------------|-----------|
| Null system call      |       2.0 |
| Mutex lock            |       3.6 |
| Mutex unlock          |       4.0 |
| Context switch        |       7.5 |
| Null cross-domain RPC |      14.9 |

Table 3: Absolute performance of microkernel primitives.

| Test     |     Fluke |   FreeBSD |
|----------|-----------|-----------|
| memtest  | 929091 µs | 914918 µs |
| appel2   |   5372 µs |   3620 µs |
| forktest |       N/A |   2534 µs |
| readtest | 125844 µs | 153010 µs |
| matconn  | 102917 µs |  71568 µs |

Table 4: Absolute micro benchmark results for Fluke and FreeBSD. Fluke times reflect benchmarks running directly on top of the microkernel's minimal Common Protocols server. FreeBSD times were collected in single-user mode. All times were measured using the Pentium timestamp counter.

6.2 Overhead of Interposition

Figure 9 illustrates the overhead associated with interposition. Each benchmark was first run under the ``bare'' environment created by the microkernel to serve as a baseline. Then the benchmarks were run under one to four levels of the previously described trace nester which interposes on all IPC channels but simply passes data along. In this experiment, readtest is the only benchmark exhibiting measurable slowdown. This is because this test's running time is dominated by data copying since each interposition on the FileDescription read method results in an extra copy as data passes through the tracing nester's address space. These large IPC data transfers are unoptimized in the prototype kernel.

Figure 9: Worst-case overhead due to IPC interposition. Performance is measured relative to the appropriate base (no interposition) case for all tests.

6.3 Performance of Various Nester Hierarchies

Figure 10 illustrates the effect of increasing levels of nesting on the micro benchmarks. The nester stacks were chosen to be representative of ``real world'' situations. The no-nesters case reflects a real-time or embedded system where only minimal OS services are required. Addition of the process manager (P) and memory manager (M) shows increasing levels of functionality, providing multitasking and traditional virtual memory services.
A checkpointer (C) can be included to checkpoint or migrate the entire environment on demand. No checkpoints are actually taken during these tests, but the checkpointer performs all of the interposition and state management activities required to be able to take a checkpoint at any time. The final stack shows the insertion of the tracing nester (T) where it might be used to trace Common Protocols activity.

Figure 10: Slowdown due to increasing levels of process nesting. Nester stacks on the horizontal axis were chosen to be representative of actual usage. In each stack, `K' is the kernel's Common Protocols server, `P' is the process manager, `M' is the memory manager, `C' is the checkpointer, and `T' is the tracer.

The matconn and appel2 benchmarks are largely unaffected by nesting. The matconn benchmark is not impacted because it is largely computational. The appel2 result demonstrates the importance of the low-level kernel API directly supporting the virtualization properties described in Section 3. Since nesters do not interpose on the low-level interface, increasing the level of nesting has minimal impact on appel2, even though the nesters may be virtualizing memory. There is a modest (1--4% per level) kernel overhead associated with these nesters if they remap memory through themselves into their children. By mapping a child's memory into its own address space, a nester creates an additional level of mapping hierarchy in the kernel which may need to be traversed when performing virtual-to-physical address translations.

The readtest benchmark is not significantly affected by most nesters since they do not interpose on the file system interface. As in the complete interposition case, there is a significant impact (34%) when the tracing nester is added.

As expected, memtest is only impacted by memory interposition. An interesting observation is that memtest is affected to a much greater degree by the memory manager than by the checkpointer even though both interpose and remap the memory used. The difference is the way in which they remap the memory allocated for a child's segment. The current memory manager provides memory to a segment using single-page sized mappings to make page-level pagein and pageout easy, while the checkpointer uses one large multi-page mapping. The prototype kernel is not yet optimized to deal with the large number of kernel data structures that result from the memory manager's behavior.

The forktest benchmark shows a pattern similar to that of memtest, except that the performance degrades more severely. The greater impact is due to the large ratio of Common Protocols calls to other activity.

6.4 Status

Besides the simple tests and nesters above, our system is just starting to run some larger applications including GNU make, gawk, sed, bash, and gcc. We expect to make a full, self-hosting public release within the year.

7 Conclusion

In this paper we have presented a novel approach to providing modular and extensible operating system functionality based on a synthesis of microkernel and virtual machine concepts. We have demonstrated the design and implementation of a microkernel architecture that efficiently supports decomposition of traditional OS services such as process management, demand paging, fault tolerance, and debugging, into cleanly modularized, stackable layers. Our prototype implementation of this model indicates that it is practical to modularize operating systems this way. Initial micro benchmark results are encouraging, showing slowdowns of 0--30% per layer.
Acknowledgements

For their many thoughtful and detailed comments on earlier drafts we thank our patient shepherd Marc Shapiro, the many anonymous reviewers, and Fred Schneider. Jochen Liedtke provided critique and discussion which influenced our kernel design. We thank Steve Smalley and Jeff Turner for their courage in being early guinea pigs and their critical intellectual involvement. Discussions with Jonathan Shapiro also influenced the kernel, while Michael Bushnell and Miles Bader provided key insights on the design of higher-level services. Gary Koob gave us the freedom and trust to try something this new. Finally, we are grateful to all members of the Flux project team for substantial and varied help along the way, including Chris Alfeld, John Carter, Eric Eide, Shantanu Goel, Linus Kamb, Gary Lindstrom, John McCorquodale, Roland McGrath, Bart Robinson, Sai Susarla, Kevin Van Maren, and Teresa Whiting.

References

[1] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young. Mach: A New Kernel Foundation for UNIX Development. In Proc. of the Summer 1986 USENIX Conf., pages 93--112, June 1986.

[2] A. Adl-Tabatabai, G. Langdale, S. Lucco, and R. Wahbe. Efficient and Language-Independent Mobile Programs. In Proc. ACM SIGPLAN Symp. on Programming Language Design and Implementation, pages 127--136, May 1996.

[3] A. Appel and K. Li. Virtual Memory Primitives for User Programs. In Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 96--107, June 1991.

[4] N. Batlivala, B. Gleeson, J. Hamrick, S. Lurndal, D. Price, and J. Soddy. Experience With SVR4 Over Chorus. In Proc. of the USENIX Workshop on Micro-kernels and Other Kernel Architectures, pages 223--241, Seattle, WA, Apr. 1992.

[5] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, and S. Eggers. Extensibility, Safety, and Performance in the SPIN Operating System. In Proc. of the 15th ACM Symp. on Operating Systems Principles, pages 267--284, Copper Mountain, CO, Dec. 1995.

[6] A. C. Bomberger and N. Hardy. The KeyKOS Nanokernel Architecture. In Proc. of the USENIX Workshop on Micro-kernels and Other Kernel Architectures, pages 95--112, Seattle, WA, Apr. 1992.

[7] T. C. Bressoud and F. B. Schneider. Hypervisor-based Fault-tolerance. In Proc. of the 15th ACM Symp. on Operating Systems Principles, pages 1--11, Dec. 1995.

[8] M. I. Bushnell. Towards a New Strategy of OS Design. In GNU's Bulletin, Cambridge, MA, Jan. 1994. Also http://www.cs.pdx.edu/~trent/gnu/hurd-paper.html.

[9] P. Cao, E. W. Felten, and K. Li. Implementation and Performance of Application-Controlled File Caching. In Proc. of the First Symp. on Operating Systems Design and Implementation, pages 165--177, Monterey, CA, Nov. 1994. USENIX Assoc.

[10] J. Carter, D. Khandekar, and L. Kamb. Distributed Shared Memory: Where We Are and Where We Should Be Headed. In Proceedings of the Fifth Workshop on Hot Topics in Operating Systems, pages 119--122, May 1995.

[11] K. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1):63--75, Feb. 1985.
[12] J. S. Chase, H. M. Levy, M. J. Feeley, and E. D. Lazowska. Sharing and Protection in a Single Address Space Operating System. Technical Report UW-CSE-93-04-02, University of Washington Computer Science Department, Apr. 1993.

[13] J. B. Chen and B. N. Bershad. The Impact of Operating System Structure on Memory System Performance. In Proc. of the 14th ACM Symp. on Operating Systems Principles, pages 120--133, 1993.

[14] D. R. Cheriton and K. J. Duda. A Caching Model of Operating System Kernel Functionality. In Proc. of the First Symp. on Operating Systems Design and Implementation, pages 179--193. USENIX Assoc., Nov. 1994.

[15] G. Deconinck, J. Vounckx, R. Cuyvers, and R. Lauwereins. Survey of Checkpointing and Rollback Techniques. Technical Report O3.1.8 and O3.1.12, ESAT-ACCA Laboratory, Katholieke Universiteit Leuven, Belgium, June 1993.

[16] R. P. Draves. A Revised IPC Interface. In Proc. of the USENIX Mach Workshop, pages 101--121, Oct. 1990.

[17] E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. The Performance of Consistent Checkpointing. In 11th Symposium on Reliable Distributed Systems, pages 39--47, Oct. 1992.

[18] D. R. Engler, M. F. Kaashoek, and J. O'Toole Jr. Exokernel: An Operating System Architecture for Application-level Resource Management. In Proc. of the 15th ACM Symp. on Operating Systems Principles, pages 251--266, Copper Mountain, CO, Dec. 1995.

[19] B. Ford and Flux Project Members. The Flux Operating System Toolkit. University of Utah. Postscript and HTML available under http://www.cs.utah.edu/projects/flux/oskit/html/, 1996.

[20] B. Ford, M. Hibler, and Flux Project Members. Fluke: Flexible µ-kernel Environment (draft documents). University of Utah. Postscript and HTML available under http://www.cs.utah.edu/projects/flux/fluke/html/, 1996.

[21] B. Ford and S. Susarla. CPU Inheritance Scheduling. In Proc. of the Second Symp. on Operating Systems Design and Implementation, Seattle, WA, Oct. 1996. USENIX Assoc.

[22] R. P. Goldberg. Architecture of Virtual Machines. In AFIPS Conf. Proc., June 1973.

[23] R. P. Goldberg. Survey of Virtual Machine Research. IEEE Computer Magazine, pages 34--45, June 1974.

[24] D. Golub, R. Dean, A. Forin, and R. Rashid. Unix as an Application Program. In Proc. of the Summer 1990 USENIX Conf., pages 87--96, Anaheim, CA, June 1990.

[25] J. Gosling and H. McGilton. The Java Language Environment: A White Paper. Technical report, Sun Microsystems Computer Company, 1996. Available as http://java.sun.com/doc/language_environment/.

[26] P. Goyal, X. Guo, and H. M. Vin. A Hierarchical CPU Scheduler For Multimedia Operations. In Proc. of the Second Symp. on Operating Systems Design and Implementation, Seattle, WA, Oct. 1996. USENIX Assoc.

[27] N. Hardy. The KeyKos Architecture. Operating Systems Review, Sept. 1985.

[28] IBM Virtual Machine Facility /370 Planning Guide. Technical Report GC20-1801-0, IBM Corporation, 1972.

[29] IBM Virtual Machine Facility /370: Release 2 Planning Guide. Technical Report GC20-1814-0, IBM Corporation, 1973.

[30] Intel. i960 Extended Architecture Programmer's Reference Manual, 1994.

[31] M. Jones. Interposition Agents: Transparently Interposing User Code at the System Interface. In Proc. of the 14th ACM Symp. on Operating Systems Principles, pages 80--93, Dec. 1993.

[32] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proc. of the 1994 Winter USENIX Conf., pages 115--132, Jan. 1994.
[33] Y. A. Khalidi and M. N. Nelson. Extensible File Systems in Spring. In Proc. of the 14th ACM Symp. on Operating Systems Principles, pages 1--14, 1993.

[34] C. Landau. The Checkpoint Mechanism in KeyKOS. In Proc. Second International Workshop on Object Orientation in Operating Systems, Sept. 1992.

[35] H. C. Lauer and D. Wyeth. A Recursive Virtual Machine Architecture. In ACM SIGARCH-SIGOPS Workshop on Virtual Computer Systems, pages 113--116, Mar. 1973.

[36] C. H. Lee, M. C. Chen, and R. C. Chang. HiPEC: High Performance External Virtual Memory Caching. In Proc. of the First Symp. on Operating Systems Design and Implementation, pages 153--164, Monterey, CA, Nov. 1994. USENIX Assoc.

[37] H. M. Levy. Capability Based Computer Systems. Digital Press, 1984.

[38] J. Liedtke. Clans and Chiefs. In Proc. 12. GI/ITG-Fachtagung Architektur von Rechensystemen, 1992.

[39] J. Liedtke. A Persistent System in Real Use -- Experiences of the First 13 Years. In Proc. of the Third International Workshop on Object Orientation in Operating Systems, pages 2--11, Dec. 1993.

[40] J. Liedtke. On Micro-Kernel Construction. In Proc. of the 15th ACM Symp. on Operating Systems Principles, pages 237--250, Copper Mountain, CO, Dec. 1995.

[41] A. Lindstrom, J. Rosenberg, and A. Dearle. The Grand Unified Theory of Address Spaces. In Proc. of the Fifth Workshop on Hot Topics in Operating Systems, May 1995.

[42] M. Litzkow and M. Solomon. Supporting Checkpointing and Process Migration Outside the UNIX Kernel. In Proc. of the Winter 1992 USENIX Conf., 1992.

[43] K. Loepere et al. MK++ Kernel Executive Summary. Technical report, Open Software Foundation, 1995.

[44] L. McVoy and C. Staelin. lmbench: Portable Tools for Performance Analysis. In Proc. of the 1996 USENIX Conf., Jan. 1996.

[45] Microsoft Corporation. Win32 Programmer's Reference, 1993. 999 pp.

[46] A. B. Montz, D. Mosberger, S. W. O'Malley, L. L. Peterson, T. A. Proebsting, and J. H. Hartman. Scout: A Communications-oriented Operating System. Technical Report 94--20, University of Arizona, Dept. of Computer Science, June 1994.

[47] S. J. Mullender. Experiences with Distributed Systems, chapter Process Management in Distributed Operating Systems. Lecture Notes in Computer Science no. 309. Springer-Verlag, 1987.

[48] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent Checkpointing under Unix. In Proc. of the Winter 1995 USENIX Technical Conf., Jan. 1995.

[49] M. Rosenblum, S. Herrod, E. Witchel, and A. Gupta. Fast and Accurate Multiprocessor Simulation: The SimOS Approach. IEEE Parallel and Distributed Technology, 3(4), 1995.

[50] C. Small and M. Seltzer. VINO: An Integrated Platform for Operating System and Database Research. Technical Report TR-30-94, Harvard University, 1994.

[51] P. Tullmann, J. Lepreau, B. Ford, and M. Hibler. User-level Checkpointing Through Exportable Kernel State. In Proc. Fifth International Workshop on Object Orientation in Operating Systems, Seattle, WA, Oct. 1996. IEEE.

[52] D. Wagner, I. Goldberg, and R. Thomas. A Secure Environment for Untrusted Helper Applications. In Proc. of the 6th USENIX Unix Security Symp., 1996.

[53] C. A. Waldspurger. Lottery and Stride Scheduling: Flexible Proportional-Share Resource Management. PhD thesis, Massachusetts Institute of Technology, Sept. 1995.

[54] M. V. Wilkes and R. M. Needham. The Cambridge CAP Computer and its Operating System. North Holland, NY, 1979.