MINIX 3: A Case Study in More Reliable Operating Systems

Victor van der Veen
Vrije Universiteit Amsterdam
vvdveen@cs.vu.nl
May 2009

Abstract

In a search for reliable computer systems, this paper discusses the reliability features of MINIX 3. We start by stating that traditional operating systems can no longer be considered reliable. Their monolithic structure makes them very hard to control and makes it impossible to have a system that is free of bugs.

Next, we go into the details of the client-server model on which the MINIX 3 microkernel is based. We explain the MINIX 3 architecture and then discuss the reliability features of MINIX 3. We find that the architecture used results in a highly dependable operating system.

We conclude by stating that MINIX 3 is a step in the right direction towards reliable operating systems.

1 Introduction

Although the use of computers in daily activities has increased significantly over the past few decades, their reliability is still not satisfactory. Reboots have become a common first try to fix your broken computer, and re-installation of the entire system is a good second one.

Many computer problems are caused by errors in device drivers and result in the entire Operating System (OS) breaking down. To get rid of reboots and re-installations, a reliable OS is needed in which errors no longer affect the entire system, but at most only a small part of it.

By using a minimal kernel and moving code up to user space, MINIX 3 claims to be such a reliable OS [3]. This paper focuses on this claim by studying the different reliability features of MINIX 3.

1.1 What Makes Operating Systems Unreliable?

Each OS consists of a kernel that acts as a communicator between the machine's hardware and user programs. In a traditional OS like Windows XP or Linux, this kernel is responsible for almost every crucial service, like storing a file on hard disk and sending bytes over a network, as well as video handling and scheduling processes. Kernels like these have grown to millions of lines of code (LoC) during the past decade and are said to have a monolithic design.

Such a monolithic design has two characteristics that make it unreliable [6]. First, since all supporting OS procedures are inside the kernel, a monolithic kernel contains many LoC. For example, the Linux 2.6.27 kernel contains 6,399,191 LoC [8]. Software reliability studies have shown that executable code may contain up to 75 bugs per 1,000 LoC [1, 5]. Using a conservative estimate of five bugs per 1,000 LoC, the Linux kernel probably has over 30,000 bugs. Knowing that 50 percent of the Linux kernel concerns device driver code, it gets even worse when we consider that error rates in device drivers are up to seven times higher than in normal code [2].

Second, a monolithic OS has very poor fault isolation. Hundreds of procedures are linked together
and run in the same address space. This allows each line of code of the 6.4 million to (accidentally) overwrite key data structures of an unrelated component, resulting in a system crash in such a way that it is difficult to detect what went wrong.

1.2 Paper Outline

We have now seen what makes a traditional OS unreliable. In this paper, we outline how MINIX 3 uses a different approach and how it tries to overcome problems that monolithic systems suffer from.

In section 2 we provide a bit more background about monolithic kernels and explain the client-server model which MINIX 3 uses. Section 3 outlines the MINIX 3 architecture in more detail and discusses its reliability features. We end with section 4 to back up our results.

2 Related Work

Each OS has its own internal structure based on a specific design. A traditional OS uses a monolithic design while MINIX 3 uses a client-server model. Although we already saw in section 1.1 how monolithic kernels function, we provide some more information about them in section 2.1. In section 2.2, we outline how the client-server model functions.

Note that other designs exist as well. Among them are virtual machines, exokernels and layered systems [7].

2.1 Monolithic Kernels

A traditional OS like Windows XP or Linux uses a monolithic structure, which is by far the most common organization and has been described as "The Big Mess" [7]. In such a design, all OS components are linked into a single executable (the kernel) which runs in a single address space.

Section 1.1 describes the problems with monolithic kernels and what makes them unreliable. The reason that monolithic kernels are still widely used is simple: performance. When computers were slow, an OS had to be fast. By storing all functions in one giant program, the OS was faster, which was all that counted [3].

2.2 Client Server Model

MINIX 3 uses a client-server structure, which implies the use of a microkernel. The idea is that code is moved up into higher layers or is removed from the OS where possible, so that a minimal kernel (the microkernel) is obtained. Higher layers become separate server processes which run in user space. In this model, shown in figure 1, the kernel is only responsible for the interprocess communication (IPC) between clients and servers [7].

Figure 1: The client-server model.

Note that the client-server model eliminates the two characteristics found in section 1.1. First, since the microkernel is only responsible for communication primitives, it is minimalistic and contains only a small number of LoC. Second, by moving all non-communication code into user-space processes, each process relating to one facet of the system, the OS has better fault isolation.

The picture in figure 1 of a microkernel being only responsible
for the transport of messages is not completely realistic. Some OS functions cannot be performed in user space. This problem was solved in previous versions of MINIX by compiling some critical server processes (e.g., I/O device drivers) into the kernel. In MINIX 3 it was decided to let I/O device drivers run in user space as system processes. A special kernel process, named the System Task, allows communication between system processes and the kernel. This way, device drivers can still communicate with their hardware via the kernel, but no longer have direct access [7].
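The message flow described in this section can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not MINIX 3 code: the names `ToyKernel`, `make_disk_driver` and `make_file_server`, the message fields, the status strings and the fake port `0x10` are all invented for illustration, and the hop from a file server to a disk driver is an assumed example of one server using another. The sketch only shows the division of labour: the kernel passes messages, and a user-space driver reaches its hardware solely by sending a request to the System Task.

```python
class ToyKernel:
    """Toy microkernel: it only passes messages and guards the hardware."""

    def __init__(self):
        self._handlers = {}        # process name -> message handler
        self._ports = {0x10: 42}   # fake device register (hypothetical)
        # The System Task is the kernel-side endpoint for system processes.
        self.register("SYSTEM", self._system_task)

    def register(self, name, handler):
        """Make a process reachable under the given name."""
        self._handlers[name] = handler

    def send(self, dest, message):
        """Deliver a message to a process and return its reply."""
        handler = self._handlers.get(dest)
        if handler is None:
            return {"status": "NO_SUCH_PROCESS"}
        return handler(message)

    def _system_task(self, message):
        """Perform a privileged operation on behalf of a system process."""
        if message.get("op") == "port_read" and message.get("port") in self._ports:
            return {"status": "OK", "value": self._ports[message["port"]]}
        return {"status": "BAD_REQUEST"}


def make_disk_driver(kernel):
    """User-space driver: no direct hardware access, only kernel messages."""
    def handle(message):
        if message.get("op") != "read_block":
            return {"status": "BAD_REQUEST"}
        reply = kernel.send("SYSTEM", {"op": "port_read", "port": 0x10})
        return {"status": reply["status"], "data": reply.get("value")}
    return handle


def make_file_server(kernel):
    """User-space server: turns a client read into a driver request."""
    def handle(message):
        if message.get("op") != "read":
            return {"status": "BAD_REQUEST"}
        return kernel.send("disk", {"op": "read_block", "block": message["block"]})
    return handle


kernel = ToyKernel()
kernel.register("disk", make_disk_driver(kernel))
kernel.register("fs", make_file_server(kernel))

# A client request travels: client -> fs -> disk -> SYSTEM and back.
reply = kernel.send("fs", {"op": "read", "block": 7})
print(reply)
```

In this sketch the only code that touches `_ports` is the kernel's own System Task handler, which mirrors the statement that drivers still communicate with their hardware via the kernel but no longer have direct access.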

3 MINIX 3 vs. Traditional Designs


Figure 2: The MINIX 3 architecture and some IPC paths.

We have seen in section 1.1 that a monolithic OS suffers from two major drawbacks: poor fault isolation and a potentially large number of bugs. We found in section 2.2 that MINIX 3 tries to overcome these drawbacks by using a microkernel. In this section we discuss how MINIX 3 manages this and which reliability features MINIX 3 has over a traditional OS. We start by describing the MINIX 3 architecture in more detail in section 3.1. Using this background, we can explain in section 3.2 why MINIX 3 is more reliable than a traditional OS.
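The "potentially large number of bugs" drawback can be made concrete with the figures quoted in this paper. The following Python sketch applies the conservative estimate of five bugs per 1,000 LoC from section 1.1 to the Linux 2.6.27 kernel size and to the upper bound of 5,000 LoC that section 3.2 gives for the MINIX 3 microkernel. The function name `estimated_bugs` and the constant names are hypothetical, and the result is a rough estimate of kernel bugs only, not a measurement.

```python
def estimated_bugs(lines_of_code, bugs_per_kloc=5):
    """Estimate the number of bugs from a size and a bug density per 1,000 LoC."""
    return lines_of_code * bugs_per_kloc / 1000


LINUX_2_6_27_LOC = 6_399_191   # kernel size quoted in section 1.1
MINIX_3_KERNEL_LOC = 5_000     # upper bound quoted in section 3.2

linux_bugs = estimated_bugs(LINUX_2_6_27_LOC)
minix_bugs = estimated_bugs(MINIX_3_KERNEL_LOC)

print(f"Linux 2.6.27 kernel: about {linux_bugs:,.0f} estimated bugs")
print(f"MINIX 3 microkernel: about {minix_bugs:,.0f} estimated bugs")
```

The same density applied to both kernels differs by roughly three orders of magnitude, which is the arithmetic behind the claim that a smaller kernel contains fewer bugs that can halt the whole system.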

3.1 The MINIX 3 Architecture

To understand why MINIX 3 is more reliable, we need to have a better understanding of its architecture. Consider figure 2, which illustrates how components inside MINIX 3 are related to each other.

At the bottom of figure 2 we see the hardware layer. On top of that runs the MINIX 3 microkernel, which is the only component that has direct access to the hardware. As described in section 2.2, the microkernel is responsible for IPC primitives and provides an interface, named the System Task, to let system processes communicate with the kernel. A similar process called the Clock Task, which concerns the scheduling of processes, is placed in the microkernel as well.

Directly above the microkernel, we see different OS services running as user space processes, as outlined briefly in section 2.2. In the figure, we have only included the Process Manager (PM), File Server (FS), and Reincarnation Server (RS), but other services exist as well. Also in user space, we see I/O drivers running as separate processes (Audio Driver, Disk Driver, Network Driver, ...).

On top of these services, we see normal user processes running, like ls or cp. These processes communicate with the OS via the interface that is provided by the different services. To handle an OS operation, services can communicate with drivers and/or directly with the kernel by using the System Task.

3.2 Reliability of MINIX 3

We have seen in section 2.2 that by using a client-server structure, a minimal kernel can be obtained. In section 3.1, we have seen how MINIX 3 accomplishes this. With this background, we now discuss why MINIX 3 is more reliable than a traditional OS. For this, we consider five essential reliability features
of MINIX 3.

First, due to the client-server model, the MINIX 3 microkernel consists of at most 5,000 LoC, which is much less than the 6.4 million LoC of the Linux kernel. Although moving code up into higher layers does not necessarily reduce the total number of LoC of the entire OS, it does reduce the estimated number of kernel bugs. Since a bug in the kernel can bring the entire system to a halt, the small MINIX 3 microkernel makes the OS more reliable than a traditional OS.

Second, by dividing the OS into specific parts, it is possible to reduce the number of bugs per 1,000 LoC. Since each part of the OS is a separate component in MINIX 3, it is easier to understand the code, which can reduce the number of bugs in it. Consider a fruit company employee, responsible for removing rotten fruit before it is shipped. He will probably find a rotten apple more easily when he is inspecting only apples than when he is searching for rotten apples between pears, bananas, oranges, mandarins and apples at the same time. It is thus easier to find bugs when code is divided into separate parts, increasing the reliability of MINIX 3 over a monolithic OS.

Third, as we have already seen in section 2.2, the separation of distinguishable OS parts improves the operating system's fault isolation. As outlined in section 3.1, separate components run as separate user space processes. When, for example, the File Server crashes, it will not result in a complete system crash. Instead, only the server process exits, indicating a problem in the File Server and not in any other component.

Fourth, device drivers can no longer bring the entire system down, since these drivers now run in user space as separate processes. This means that bad pointers in a driver cannot corrupt memory outside the driver's address space. Also, the impact of buffer overrun vulnerabilities is limited.

Fifth, the OS has become self-repairing. The Reincarnation Server, which we saw briefly in section 3.1, is a new component that is designed to restart broken services [4]. The Reincarnation Server acts as a parent for each system process that is started during the boot sequence. Every new system process that is started later on automatically becomes a child of
the Reincarnation Server as well. When a system process crashes and becomes a zombie process, the Reincarnation Server picks it up and may restart a fresh copy of the process. It also periodically checks whether system processes are still alive, by sending them status messages and checking whether it receives the expected replies. An incorrect reply indicates that the system process must be stuck in some kind of infinite loop, and the Reincarnation Server may then restart the associated process. The Reincarnation Server can thus restart essential OS processes if they crash or appear to have stopped working.
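The restart behaviour described for the Reincarnation Server can be sketched as a small supervisor loop. This Python model is a simplification under stated assumptions and not the MINIX 3 implementation: `ToyService`, `ToyReincarnationServer`, the `"alive"` reply and the `crashed` and `hung` flags are hypothetical names, crashes and hangs are simulated with flags rather than real processes, and the sketch always restarts a failed child, whereas the text only says the Reincarnation Server may do so.

```python
class ToyService:
    """Stand-in for a system process such as a server or a driver."""

    def __init__(self, name):
        self.name = name
        self.crashed = False   # simulated exit of the process
        self.hung = False      # simulated infinite loop

    def status_reply(self):
        """Answer a status message; a failed service gives no valid reply."""
        if self.crashed or self.hung:
            return None
        return "alive"


class ToyReincarnationServer:
    """Parent of all system processes; replaces children that fail a check."""

    def __init__(self):
        self.children = {}
        self.restart_count = {}

    def adopt(self, name):
        """Start a system process as a child of this server."""
        self.children[name] = ToyService(name)
        self.restart_count.setdefault(name, 0)

    def check_children(self):
        """One periodic pass: restart every child without the expected reply."""
        restarted = []
        for name, service in list(self.children.items()):
            if service.status_reply() != "alive":
                self.children[name] = ToyService(name)   # fresh copy
                self.restart_count[name] += 1
                restarted.append(name)
        return restarted


rs = ToyReincarnationServer()
rs.adopt("fs")
rs.adopt("disk")

rs.children["fs"].crashed = True   # the File Server exits
rs.children["disk"].hung = True    # the disk driver stops answering

restarted = rs.check_children()
print(restarted)
```

After one pass both failed services have been replaced by fresh copies, and a second call to `check_children()` finds nothing left to restart.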

4 Conclusions

In search of an operating system that no longer needs to be re-installed every six months, we have taken a closer look at MINIX 3.

We have outlined why a traditional operating system like Windows XP or Linux is considered to be unreliable. The use of a monolithic kernel structure consisting of millions of LoC has evolved into a big mess. It is difficult to understand the complete kernel, which makes it hard to write bug-free extensions for such a system.

We outlined the MINIX 3 architecture, which is based on a client-server model. In MINIX 3, many operating system services are moved into user space and a microkernel is responsible for interprocess communication between these services and hardware.

The MINIX 3 design improves reliability in three important ways:

1. The number of LoC in the kernel is reduced to at most 5,000. Less kernel code means fewer fatal kernel bugs.

2. The impact of bugs is reduced as well. Operating system services that run as separate processes in user space can no longer interfere with each other, so a bug in service A can no longer lead to a crash in service B.

3. The Reincarnation Server is capable of restarting system processes when they crash.

Besides doing more research on MINIX 3 and porting drivers to the MINIX 3 platform, it may be an idea to research the possibility of using the MINIX 3 architecture to write more reliable applications. Big applications like office suites, games, or even Internet browsers may become more reliable when the MINIX 3 architecture is applied to them. It may be possible to provide each application with a tiny microkernel, with separate processes for different application components on top of it. This way, bugs in, for example, the Flash player plugin would no longer crash your entire browser.

To summarize, MINIX 3 seems to be a good step in the right direction towards stable computer systems.

References

[1] Victor R. Basili and Barry T. Perricone. Software errors and complexity: an empirical investigation. Commun. ACM, 27(1):42-52, 1984.

[2] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson R. Engler. An empirical study of operating system errors. In Proc. Symp. Operating Systems Principles, pages 73-88, 2001.

[3] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum. MINIX 3: A Highly Reliable, Self-Repairing Operating System. volume 40, pages 80-89. ACM SIGOPS, July 2006.

[4] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum. Reorganizing UNIX for Reliability. In Colin Egan and Chris Jesshope, editors, Proc. 11th Asia-Pacific Computer Systems Architecture Conference (ACSAC'06), pages 81-94, Shanghai, China, September 2006. Springer Berlin / Heidelberg.

[5] Thomas J. Ostrand and Elaine J. Weyuker. The distribution of faults in a large industrial software system. In ISSTA '02: Proceedings of the 2002 ACM SIGSOFT international symposium on Software testing and analysis, pages 55-64, New York, NY, USA, 2002. ACM.

[6] Andrew S. Tanenbaum, Jorrit N. Herder, and Herbert Bos. Can We Make Operating Systems Reliable and Secure? volume 39, pages 44-51. IEEE Computer Society, May 2006.

[7] Andrew S. Tanenbaum and Albert S. Woodhull. Operating Systems Design and Implementation (3rd Edition) (Prentice Hall Software Series), chapter 1.5 Operating System Structures, pages 42-51. Prentice Hall, January 2006.

[8] Dj Walker-Morgan. Kernel log: More than 10 million lines of Linux source files. http://www.h-online.com/open/Kernel-Log-More-than-10-million-lines-of-Linux-source-files--/news/111759, October 2008.
