
The Linux Kernel

Copyright 1996-1999

David A Rusling
david.rusling@digital.com
REVIEW, Version 0.8-3

January 25, 1999


This book is for Linux enthusiasts who want to know how the Linux kernel works. It is
not an internals manual. Rather, it describes the principles and mechanisms that Linux
uses; how and why the Linux kernel works the way that it does. Linux is a moving
target; this book is based upon the current, stable, 2.0.33 sources, as those are what
most individuals and companies are now using.
This book is freely distributable; you may copy and redistribute it under certain conditions.
Please refer to the copyright and distribution statement.

For Gill, Esther and Stephen

Legal Notice
UNIX is a trademark of Univel.
Linux is a trademark of Linus Torvalds, and has no connection to UNIX™ or
Univel.

Copyright © 1996, 1997, 1998, 1999 David A Rusling

3 Foxglove Close, Wokingham, Berkshire RG41 3NF, UK
david.rusling@arm.com

This book ("The Linux Kernel") may be reproduced and distributed in whole or
in part, without fee, subject to the following conditions:

- The copyright notice above and this permission notice must be preserved
  complete on all complete or partial copies.

- Any translation or derived work must be approved by the author in writing
  before distribution.

- If you distribute this work in part, instructions for obtaining the complete
  version of this manual must be included, and a means for obtaining a complete
  version provided.

- Small portions may be reproduced as illustrations for reviews or quotes in
  other works without this permission notice if proper citation is given.

Exceptions to these rules may be granted for academic purposes: Write to the
author and ask. These restrictions are here to protect us as authors, not to restrict
you as learners and educators.
All source code in this document is placed under the GNU General Public License,
available via anonymous FTP from prep.ai.mit.edu:/pub/gnu/COPYING. It is also
reproduced in appendix D.

Preface

Linux is a phenomenon of the Internet. Born out of the hobby project of a student, it
has grown to become more popular than any other freely available operating system.
To many, Linux is an enigma. How can something that is free be worthwhile? In
a world dominated by a handful of large software corporations, how can something
that has been written by a bunch of "hackers" (sic) hope to compete? How can
software contributed to by many different people in many different countries around
the world have a hope of being stable and effective? Yet stable and effective it is,
and compete it does. Many universities and research establishments use it for their
everyday computing needs. People are running it on their home PCs and I would
wager that most companies are using it somewhere, even if they do not always realize
that they do. Linux is used to browse the web, host web sites, write theses, send
electronic mail and, as always with computers, to play games. Linux is emphatically
not a toy; it is a fully developed and professionally written operating system used by
enthusiasts all over the world.
The roots of Linux can be traced back to the origins of Unix™. In 1969, Ken
Thompson of the Research Group at Bell Laboratories began experimenting on a
multi-user, multi-tasking operating system using an otherwise idle PDP-7. He was
soon joined by Dennis Ritchie and the two of them, along with other members of the
Research Group, produced the early versions of Unix™. Ritchie was strongly influenced
by an earlier project, MULTICS, and the name Unix™ is itself a pun on the name
MULTICS. Early versions were written in assembly code, but the third version was
rewritten in a new programming language, C. C was designed and written by Ritchie
expressly as a programming language for writing operating systems. This rewrite
allowed Unix™ to move onto the more powerful PDP-11/45 and 11/70 computers
then being produced by DIGITAL. The rest, as they say, is history. Unix™ moved
out of the laboratory and into mainstream computing and soon most major computer
manufacturers were producing their own versions.
Linux was the solution to a simple need. The only software that Linus Torvalds,
Linux's author and principal maintainer, was able to afford was Minix. Minix is a
simple, Unix™-like operating system widely used as a teaching aid. Linus was less
than impressed with its features; his solution was to write his own software. He took
Unix™ as his model as that was an operating system that he was familiar with in his
day to day student life. He started with an Intel 386 based PC and started to write.
Progress was rapid and, excited by this, Linus offered his efforts to other students
via the emerging world wide computer networks, then mainly used by the academic
community. Others saw the software and started contributing. Much of this new
software was itself the solution to a problem that one of the contributors had. Before
long, Linux had become an operating system. It is important to note that Linux
contains no Unix™ code; it is a rewrite based on published POSIX standards. Linux
is built with and uses a lot of the GNU (GNU's Not Unix™) software produced by
the Free Software Foundation in Cambridge, Massachusetts.
Most people use Linux as a simple tool, often just installing one of the many good
CD ROM-based distributions. A lot of Linux users use it to write applications or
to run applications written by others. Many Linux users read the HOWTOs¹ avidly
and feel both the thrill of success when some part of the system has been correctly
configured and the frustration of failure when it has not. A minority are bold enough
to write device drivers and offer kernel patches to Linus Torvalds, the creator and
maintainer of the Linux kernel. Linus accepts additions and modifications to the
kernel sources from anyone, anywhere. This might sound like a recipe for anarchy
but Linus exercises strict quality control and merges all new code into the kernel
himself. At any one time though, there are only a handful of people contributing
sources to the Linux kernel.
The majority of Linux users do not look at how the operating system works, how
it fits together. This is a shame because looking at Linux is a very good way to
learn more about how an operating system functions. Not only is it well written,
all the sources are freely available for you to look at. This is because although the
authors retain the copyrights to their software, they allow the sources to be freely
redistributable under the Free Software Foundation's GNU Public License. At first
glance though, the sources can be confusing; you will see directories called kernel,
mm and net, but what do they contain and how does that code work? What is needed
is a broader understanding of the overall structure and aims of Linux. This, in
short, is the aim of this book: to promote a clear understanding of how Linux, the
operating system, works; to provide a mental model that allows you to picture what
is happening within the system as you copy a file from one place to another or read
electronic mail. I well remember the excitement that I felt when I first realized just
how an operating system actually worked. It is that excitement that I want to pass
on to the readers of this book.
My involvement with Linux started late in 1994 when I visited Jim Paradis who was
working on a port of Linux to Alpha AXP processor based systems. I had worked
for Digital Equipment Co. Limited since 1984, mostly in networks and communications,
and in 1992 I started working for the newly formed Digital Semiconductor
division. This division's goal was to enter fully into the merchant chip vendor market
and sell chips, in particular the Alpha AXP range of microprocessors but also
Alpha AXP system boards, outside of Digital. When I first heard about Linux I
immediately saw an opportunity to have fun. Jim's enthusiasm was catching and I
started to help on the port. As I worked on this, I began more and more to appreciate
not only the operating system but also the community of engineers that produces it.
However, Alpha AXP is only one of the many hardware platforms that Linux runs
on. Most Linux kernels are running on Intel processor based systems but a growing
number of non-Intel Linux systems are becoming more commonly available. Amongst
these are Alpha AXP, ARM, MIPS, Sparc and PowerPC. I could have written this
book using any one of those platforms but my background and technical experiences
with Linux are with Linux on the Alpha AXP and, to a lesser extent, on the ARM.
This is why this book sometimes uses non-Intel hardware as an example to illustrate
some key point. It must be noted that around 95% of the Linux kernel sources are
common to all of the hardware platforms that it runs on. Likewise, around 95% of
this book is about the machine independent parts of the Linux kernel.

¹ A HOWTO is just what it sounds like, a document describing how to do something. Many have
been written for Linux and all are very useful.

Reader Profile

This book does not make any assumptions about the knowledge or experience of
the reader. I believe that interest in the subject matter will encourage a process of
self education where necessary. That said, a degree of familiarity with computers,
preferably the PC, will help the reader derive real benefit from the material, as will
some knowledge of the C programming language.

Organisation of this Book

This book is not intended to be used as an internals manual for Linux. Instead
it is an introduction to operating systems in general and to Linux in particular.
The chapters each follow my rule of "working from the general to the particular".
They first give an overview of the kernel subsystem that they are describing before
launching into its gory details.
I have deliberately not described the kernel's algorithms, its methods of doing things,
in terms of routine X() calls routine Y() which increments the foo field of the bar
data structure. You can read the code to find these things out. Whenever I need to
understand a piece of code or describe it to someone else I often start by drawing
its data structures on the white-board. So, I have described many of the relevant
kernel data structures and their interrelationships in a fair amount of detail.
Each chapter is fairly independent, like the Linux kernel subsystem that each
describes. Sometimes, though, there are linkages; for example you cannot describe a
process without understanding how virtual memory works.
The Hardware Basics chapter (Chapter 1) gives a brief introduction to the modern
PC. An operating system has to work closely with the hardware system that acts
as its foundations. The operating system needs certain services that can only be
provided by the hardware. In order to fully understand the Linux operating system,
you need to understand the basics of the underlying hardware.
The Software Basics chapter (Chapter 2) introduces basic software principles and
looks at assembly and C programming languages. It looks at the tools that are used
to build an operating system like Linux and it gives an overview of the aims and
functions of an operating system.
The Memory Management chapter (Chapter 3) describes the way that Linux handles
the physical and virtual memory in the system.
The Processes chapter (Chapter 4) describes what a process is and how the Linux
kernel creates, manages and deletes the processes in the system.
Processes communicate with each other and with the kernel to coordinate their activities.
Linux supports a number of Inter-Process Communication (IPC) mechanisms.
Signals and pipes are two of them but Linux also supports the System V IPC mechanisms,
named after the Unix™ release in which they first appeared. These interprocess
communication mechanisms are described in Chapter 5.

The Peripheral Component Interconnect (PCI) standard is now firmly established
as the low cost, high performance data bus for PCs. The PCI chapter (Chapter 6)
describes how the Linux kernel initializes and uses PCI buses and devices in the
system.
The Interrupts and Interrupt Handling chapter (Chapter 7) looks at how the Linux
kernel handles interrupts. Whilst the kernel has generic mechanisms and interfaces
for handling interrupts, some of the interrupt handling details are hardware and
architecture specific.
One of Linux's strengths is its support for the many available hardware devices for
the modern PC. The Device Drivers chapter (Chapter 8) describes how the Linux
kernel controls the physical devices in the system.
The File system chapter (Chapter 9) describes how the Linux kernel maintains the
files in the file systems that it supports. It describes the Virtual File System (VFS)
and how the Linux kernel's real file systems are supported.
Networking and Linux are terms that are almost synonymous. In a very real sense
Linux is a product of the Internet or World Wide Web (WWW). Its developers and
users use the web to exchange information, ideas and code, and Linux itself is often used
to support the networking needs of organizations. Chapter 10 describes how Linux
supports the network protocols known collectively as TCP/IP.
The Kernel Mechanisms chapter (Chapter 11) looks at some of the general tasks and
mechanisms that the Linux kernel needs to supply so that other parts of the kernel
work effectively together.
The Modules chapter (Chapter 12) describes how the Linux kernel can dynamically
load functions, for example file systems, only when they are needed.
The Processors chapter (Chapter 13) gives a brief description of some of the processors
that Linux has been ported to.
The Sources chapter (Chapter 14) describes where in the Linux kernel sources you
should start looking for particular kernel functions.

Conventions used in this Book

The following is a list of the typographical conventions used in this book.

serif font    identifies commands or other text that is to be typed
              literally by the user.
type font     refers to data structures or fields within data structures.

Throughout the text there are references to pieces of code within the Linux kernel
source tree (for example the boxed margin note adjacent to this text: "See foo() in
foo/bar.c"). These are given in case you wish to look at the source code itself and
all of the file references are relative to /usr/src/linux. Taking foo/bar.c as an
example, the full filename would be /usr/src/linux/foo/bar.c. If you are running
Linux (and you should), then looking at the code is a worthwhile experience and you
can use this book as an aid to understanding the code and as a guide to its many
data structures.

Trademarks

ARM is a trademark of ARM Holdings PLC.
Caldera, OpenLinux and the "C" logo are trademarks of Caldera, Inc.
Caldera OpenDOS © 1997 Caldera, Inc.
DEC is a trademark of Digital Equipment Corporation.
DIGITAL is a trademark of Digital Equipment Corporation.
Linux is a trademark of Linus Torvalds.
Motif is a trademark of The Open System Foundation, Inc.
MSDOS is a trademark of Microsoft Corporation.
Red Hat, glint and the Red Hat logo are trademarks of Red Hat Software, Inc.
UNIX is a registered trademark of X/Open.
XFree86 is a trademark of XFree86 Project, Inc.
X Window System is a trademark of the X Consortium and the Massachusetts Institute
of Technology.

The Author

I was born in 1957, a few weeks before Sputnik was launched, in the north of England.
I first met Unix at university, where a lecturer used it as an example when teaching
the notions of kernels, scheduling and other operating system goodies. I loved using
the newly delivered PDP-11 for my final year project. After graduating (in 1982 with
a First Class Honours degree in Computer Science) I worked for Prime Computers
(Primos) and then, after a couple of years, for Digital (VMS, Ultrix). At Digital I
worked on many things but for the last 5 years there, I worked for the semiconductor
group on Alpha and StrongARM evaluation boards. In 1998 I moved to ARM where
I have a small group of engineers writing low level firmware and porting operating
systems. My children (Esther and Stephen) describe me as a geek.
People often ask me about Linux at work and at home and I am only too happy
to oblige. The more that I use Linux in both my professional and personal life the
more that I become a Linux zealot. You may note that I use the term 'zealot' and
not 'bigot'; I define a Linux zealot to be an enthusiast that recognizes that there
are other operating systems but prefers not to use them. As my wife, Gill, who
uses Windows 95, once remarked, "I never realized that we would have his and her
operating systems". For me, as an engineer, Linux suits my needs perfectly. It is
a superb, flexible and adaptable engineering tool that I use at work and at home.
Most freely available software easily builds on Linux and I can often simply download
pre-built executable files or install them from a CD ROM. What else could I use to
learn to program in C++ or Perl, or learn about Java, for free?

Acknowledgements

I must thank the many people who have been kind enough to take the time to e-mail
me with comments about this book. I have attempted to incorporate those
comments in each new version that I have produced and I am more than happy to
receive comments, however please note my new e-mail address.
A number of lecturers have written to me asking if they can use some or parts of
this book in order to teach computing. My answer is an emphatic yes; this is one
use of the book that I particularly wanted. Who knows, there may be another Linus
Torvalds sitting in the class.
Special thanks must go to John Rigby and Michael Bauer who gave me full, detailed
review notes of the whole book. Not an easy task. Alan Cox and Stephen Tweedie
have patiently answered my questions - thanks. I used Larry Ewing's penguins to
brighten up the chapters a bit. Finally, thank you to Greg Hankins for accepting
this book into the Linux Documentation Project and onto their web site.

Contents

Preface

1 Hardware Basics
  1.1 The CPU
  1.2 Memory
  1.3 Buses
  1.4 Controllers and Peripherals
  1.5 Address Spaces
  1.6 Timers

2 Software Basics
  2.1 Computer Languages
    2.1.1 Assembly Languages
    2.1.2 The C Programming Language and Compiler
    2.1.3 Linkers
  2.2 What is an Operating System?
    2.2.1 Memory management
    2.2.2 Processes
    2.2.3 Device drivers
    2.2.4 The Filesystems
  2.3 Kernel Data Structures
    2.3.1 Linked Lists
    2.3.2 Hash Tables
    2.3.3 Abstract Interfaces

3 Memory Management
  3.1 An Abstract Model of Virtual Memory
    3.1.1 Demand Paging
    3.1.2 Swapping
    3.1.3 Shared Virtual Memory
    3.1.4 Physical and Virtual Addressing Modes
    3.1.5 Access Control
  3.2 Caches
  3.3 Linux Page Tables
  3.4 Page Allocation and Deallocation
    3.4.1 Page Allocation
    3.4.2 Page Deallocation
  3.5 Memory Mapping
  3.6 Demand Paging
  3.7 The Linux Page Cache
  3.8 Swapping Out and Discarding Pages
    3.8.1 Reducing the Size of the Page and Buffer Caches
    3.8.2 Swapping Out System V Shared Memory Pages
    3.8.3 Swapping Out and Discarding Pages
  3.9 The Swap Cache
  3.10 Swapping Pages In

4 Processes
  4.1 Linux Processes
  4.2 Identifiers
  4.3 Scheduling
    4.3.1 Scheduling in Multiprocessor Systems
  4.4 Files
  4.5 Virtual Memory
  4.6 Creating a Process
  4.7 Times and Timers
  4.8 Executing Programs
    4.8.1 ELF
    4.8.2 Script Files

5 Interprocess Communication Mechanisms
  5.1 Signals
  5.2 Pipes
  5.3 Sockets
    5.3.1 System V IPC Mechanisms
    5.3.2 Message Queues
    5.3.3 Semaphores
    5.3.4 Shared Memory

6 PCI
  6.1 PCI Address Spaces
  6.2 PCI Configuration Headers
  6.3 PCI I/O and PCI Memory Addresses
  6.4 PCI-ISA Bridges
  6.5 PCI-PCI Bridges
    6.5.1 PCI-PCI Bridges: PCI I/O and PCI Memory Windows
    6.5.2 PCI-PCI Bridges: PCI Configuration Cycles and PCI Bus Numbering
  6.6 Linux PCI Initialization
    6.6.1 The Linux Kernel PCI Data Structures
    6.6.2 The PCI Device Driver
    6.6.3 PCI BIOS Functions
    6.6.4 PCI Fixup

7 Interrupts and Interrupt Handling
  7.1 Programmable Interrupt Controllers
  7.2 Initializing the Interrupt Handling Data Structures
  7.3 Interrupt Handling

8 Device Drivers
  8.1 Polling and Interrupts
  8.2 Direct Memory Access (DMA)
  8.3 Memory
  8.4 Interfacing Device Drivers with the Kernel
    8.4.1 Character Devices
    8.4.2 Block Devices
  8.5 Hard Disks
    8.5.1 IDE Disks
    8.5.2 Initializing the IDE Subsystem
    8.5.3 SCSI Disks
  8.6 Network Devices
    8.6.1 Initializing Network Devices

9 The File system
  9.1 The Second Extended File system (EXT2)
    9.1.1 The EXT2 Inode
    9.1.2 The EXT2 Superblock
    9.1.3 The EXT2 Group Descriptor
    9.1.4 EXT2 Directories
    9.1.5 Finding a File in an EXT2 File System
    9.1.6 Changing the Size of a File in an EXT2 File System
  9.2 The Virtual File System (VFS)
    9.2.1 The VFS Superblock
    9.2.2 The VFS Inode
    9.2.3 Registering the File Systems
    9.2.4 Mounting a File System
    9.2.5 Finding a File in the Virtual File System
    9.2.6 Creating a File in the Virtual File System
    9.2.7 Unmounting a File System
    9.2.8 The VFS Inode Cache
    9.2.9 The Directory Cache
  9.3 The Buffer Cache
    9.3.1 The bdflush Kernel Daemon
    9.3.2 The update Process
  9.4 The /proc File System
  9.5 Device Special Files

10 Networks
  10.1 An Overview of TCP/IP Networking
  10.2 The Linux TCP/IP Networking Layers
  10.3 The BSD Socket Interface
  10.4 The INET Socket Layer
    10.4.1 Creating a BSD Socket
    10.4.2 Binding an Address to an INET BSD Socket
    10.4.3 Making a Connection on an INET BSD Socket
    10.4.4 Listening on an INET BSD Socket
    10.4.5 Accepting Connection Requests
  10.5 The IP Layer
    10.5.1 Socket Buffers
    10.5.2 Receiving IP Packets
    10.5.3 Sending IP Packets
    10.5.4 Data Fragmentation
  10.6 The Address Resolution Protocol (ARP)
  10.7 IP Routing
    10.7.1 The Route Cache
    10.7.2 The Forwarding Information Database

11 Kernel Mechanisms
  11.1 Bottom Half Handling
  11.2 Task Queues
  11.3 Timers
  11.4 Wait Queues
  11.5 Buzz Locks
  11.6 Semaphores

12 Modules
  12.1 Loading a Module
  12.2 Unloading a Module

13 Processors
  13.1 X86
  13.2 ARM
  13.3 Alpha AXP Processor

14 The Linux Kernel Sources

A Linux Data Structures

B Useful Web and FTP Sites

C Linux Documentation Project Manifesto
  C.1 Overview
  C.2 Getting Involved
  C.3 Current Projects
  C.4 FTP sites for LDP works
  C.5 Documentation Conventions
  C.6 Copyright and License
  C.7 Publishing LDP Manuals

D The GNU General Public License
  D.1 Preamble
  D.2 Terms and Conditions
  D.3 How to Apply These Terms

Glossary

Bibliography

List of Figures

1.1 A typical PC motherboard

3.1 Abstract model of Virtual to Physical address mapping
3.2 Alpha AXP Page Table Entry
3.3 Three Level Page Tables
3.4 The free_area data structure
3.5 Areas of Virtual Memory
3.6 The Linux Page Cache

4.1 A Process's Files
4.2 A Process's Virtual Memory
4.3 Registered Binary Formats
4.4 ELF Executable File Format

5.1 Pipes
5.2 System V IPC Message Queues
5.3 System V IPC Semaphores
5.4 System V IPC Shared Memory

6.1 Example PCI Based System
6.2 The PCI Configuration Header
6.3 Type 0 PCI Configuration Cycle
6.4 Type 1 PCI Configuration Cycle
6.5 Linux Kernel PCI Data Structures
6.6 Configuring a PCI System: Part 1
6.7 Configuring a PCI System: Part 2
6.8 Configuring a PCI System: Part 3
6.9 Configuring a PCI System: Part 4
6.10 PCI Configuration Header: Base Address Registers

7.1 A Logical Diagram of Interrupt Routing
7.2 Linux Interrupt Handling Data Structures

8.1 Character Devices
8.2 Buffer Cache Block Device Requests
8.3 Linked list of disks
8.4 SCSI Data Structures

9.1 Physical Layout of the EXT2 File system
9.2 EXT2 Inode
9.3 EXT2 Directory
9.4 A Logical Diagram of the Virtual File System
9.5 Registered File Systems
9.6 A Mounted File System
9.7 The Buffer Cache

10.1 TCP/IP Protocol Layers
10.2 Linux Networking Layers
10.3 Linux BSD Socket Data Structures
10.4 The Socket Buffer (sk_buff)
10.5 The Forwarding Information Database

11.1 Bottom Half Handling Data Structures
11.2 A Task Queue
11.3 System Timers
11.4 Wait Queue

12.1 The List of Kernel Modules

Chapter 1

Hardware Basics

An operating system has to work closely with the hardware system that
acts as its foundations. The operating system needs certain services that
can only be provided by the hardware. In order to fully understand
the Linux operating system, you need to understand the basics of the
underlying hardware. This chapter gives a brief introduction to that
hardware: the modern PC.
When the "Popular Electronics" magazine for January 1975 was printed with an
illustration of the Altair 8080 on its front cover, a revolution started. The Altair
8080, named after the destination of an early Star Trek episode, could be assembled
by home electronics enthusiasts for a mere $397. With its Intel 8080 processor and
256 bytes of memory but no screen or keyboard it was puny by today's standards.
Its inventor, Ed Roberts, coined the term "personal computer" to describe his new
invention, but the term PC is now used to refer to almost any computer that you
can pick up without needing help. By this definition, even some of the very powerful
Alpha AXP systems are PCs.
Enthusiastic hackers saw the Altair's potential and started to write software and
build hardware for it. To these early pioneers it represented freedom; the freedom
from huge batch processing mainframe systems run and guarded by an elite priesthood.
Overnight fortunes were made by college dropouts fascinated by this new
phenomenon, a computer that you could have at home on your kitchen table. A lot
of hardware appeared, all different to some degree, and software hackers were happy
to write software for these new machines. Paradoxically it was IBM who firmly cast
the mould of the modern PC by announcing the IBM PC in 1981 and shipping it to
customers early in 1982. With its Intel 8088 processor, 64K of memory (expandable
to 256K), two floppy disks and an 80 character by 25 line Colour Graphics Adapter
(CGA) it was not very powerful by today's standards but it sold well. It was followed,
in 1983, by the IBM PC-XT which had the luxury of a 10Mbyte hard drive.
It was not long before IBM PC clones were being produced by a host of companies
such as Compaq and the architecture of the PC became a de facto standard.

Figure 1.1: A typical PC motherboard (showing the power connectors, parallel port,
COM1 and COM2 serial ports, CPU, memory SIMM slots, PCI slots and ISA slots).


de-fa to standard helped a multitude of hardware ompanies to ompete together in
a growing market whi h, happily for onsumers, kept pri es low. Many of the system
ar hite tural features of these early PCs have arried over into the modern PC. For
example, even the most powerful Intel Pentium Pro based system starts running in
the Intel 8086's addressing mode. When Linus Torvalds started writing what was
to be ome Linux, he pi ked the most plentiful and reasonably pri ed hardware, an
Intel 80386 PC.
Looking at a PC from the outside, the most obvious omponents are a system box,
a keyboard, a mouse and a video monitor. On the front of the system box are some
buttons, a little display showing some numbers and a oppy drive. Most systems
these days have a CD ROM and if you feel that you have to prote t your data, then
there will also be a tape drive for ba kups. These devi es are olle tively known as
the peripherals.
Although the CPU is in overall ontrol of the system, it is not the only intelligent
devi e. All of the peripheral ontrollers, for example the IDE ontroller, have some
level of intelligen e. Inside the PC (Figure 1.1) you will see a motherboard ontaining
the CPU or mi ropro essor, the memory and a number of slots for the ISA or PCI
peripheral ontrollers. Some of the ontrollers, for example the IDE disk ontroller
may be built dire tly onto the system board.

1.1 The CPU

The CPU, or rather microprocessor, is the heart of any computer system. The
microprocessor calculates, performs logical operations and manages data flows by reading
instructions from memory and then executing them. In the early days of computing
the functional components of the microprocessor were separate (and physically
large) units. This is when the term Central Processing Unit was coined. The modern
microprocessor combines these components onto an integrated circuit etched onto
a very small piece of silicon. The terms CPU, microprocessor and processor are all
used interchangeably in this book.
Microprocessors operate on binary data; that is, data composed of ones and zeros.
These ones and zeros correspond to electrical switches being either on or off. Just
as 42 is a decimal number meaning "4 10s and 2 units", a binary number is a series
of binary digits each one representing a power of 2. In this context, a power means
the number of times that a number is multiplied by itself. 10 to the power 1 (10¹)
is 10, 10 to the power 2 (10²) is 10x10, 10³ is 10x10x10 and so on. Binary 0001 is
decimal 1, binary 0010 is decimal 2, binary 0011 is 3, binary 0100 is 4 and so on. So,
42 decimal is 101010 binary or (2 + 8 + 32, or 2¹ + 2³ + 2⁵). Rather than using binary
to represent numbers in computer programs, another base, hexadecimal, is usually
used. In this base, each digit represents a power of 16. As decimal numbers only
go from 0 to 9, the numbers 10 to 15 are represented as a single digit by the letters
A, B, C, D, E and F. For example, hexadecimal E is decimal 14 and hexadecimal 2A
is decimal 42 (two 16s + 10). Using the C programming language notation (as I do
throughout this book) hexadecimal numbers are prefaced by "0x"; hexadecimal 2A
is written as 0x2A.
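
To make the three notations concrete, here is a small, self-contained C program (an
illustrative sketch, not kernel code) that prints the same value, 42, in decimal,
hexadecimal and binary:

    #include <stdio.h>

    int main(void)
    {
        unsigned int value = 42;                 /* decimal 42  */

        printf("decimal:     %u\n", value);      /* prints 42   */
        printf("hexadecimal: 0x%X\n", value);    /* prints 0x2A */

        /* Print the binary digits by testing each power of 2 in turn. */
        printf("binary:      ");
        for (int bit = 7; bit >= 0; bit--)
            printf("%u", (value >> bit) & 1);    /* prints 00101010 */
        printf("\n");

        return 0;
    }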
Microprocessors can perform arithmetic operations such as add, multiply and divide
and logical operations such as "is X greater than Y?".
The processor's execution is governed by an external clock. This clock, the system
clock, generates regular clock pulses to the processor and, at each clock pulse, the
processor does some work. For example, a processor could execute an instruction
every clock pulse. A processor's speed is described in terms of the rate of the system
clock ticks. A 100MHz processor will receive 100,000,000 clock ticks every second. It
is misleading to describe the power of a CPU by its clock rate as different processors
perform different amounts of work per clock tick. However, all things being equal, a
faster clock speed means a more powerful processor. The instructions executed by the
processor are very simple; for example "read the contents of memory at location X
into register Y". Registers are the microprocessor's internal storage, used for storing
data and performing operations on it. The operations performed may cause the
processor to stop what it is doing and jump to another instruction somewhere else in
memory. These tiny building blocks give the modern microprocessor almost limitless
power as it can execute millions or even billions of instructions a second.
The instructions have to be fetched from memory as they are executed. Instructions
may themselves reference data within memory and that data must be fetched from
memory and saved there when appropriate.
The size, number and type of register within a microprocessor is entirely dependent
on its type. An Intel 80486 processor has a different register set to an Alpha AXP
processor; for a start, the Intel's are 32 bits wide and the Alpha AXP's are 64 bits
wide. In general, though, any given processor will have a number of general purpose
registers and a smaller number of dedicated registers. Most processors have the
following special purpose, dedicated, registers:

Program Counter (PC) This register contains the address of the next instruction
to be executed. The contents of the PC are automatically incremented each
time an instruction is fetched.

Stack Pointer (SP) Processors have to have access to large amounts of external
read/write random access memory (RAM) which facilitates temporary storage
of data. The stack is a way of easily saving and restoring temporary values in
external memory. Usually, processors have special instructions which allow you
to push values onto the stack and to pop them off again later. The stack works
on a last in, first out (LIFO) basis. In other words, if you push two values, x
and y, onto a stack and then pop a value off of the stack then you will get back
the value of y (a short C sketch of this behaviour follows this list).
Some processors' stacks grow upwards towards the top of memory whilst others
grow downwards towards the bottom, or base, of memory. Some processors
support both types, for example ARM.

Processor Status (PS) Instructions may yield results; for example "is the content
of register X greater than the content of register Y?" will yield true or false as
a result. The PS register holds this and other information about the current
state of the processor. For example, most processors have at least two modes
of operation, kernel (or supervisor) and user. The PS register would hold
information identifying the current mode.
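
The following small C program models the last in, first out behaviour described for
the Stack Pointer. It is only an illustrative sketch that uses an array in ordinary
memory as the stack; it is not how a real processor implements its hardware stack:

    #include <stdio.h>

    static unsigned long stack[16];    /* a tiny, illustrative stack  */
    static int sp = 0;                 /* a software "stack pointer"  */

    static void push(unsigned long value) { stack[sp++] = value; }
    static unsigned long pop(void)        { return stack[--sp]; }

    int main(void)
    {
        push(1);                  /* push x                              */
        push(2);                  /* push y                              */
        printf("%lu\n", pop());   /* prints 2: the last value pushed (y) */
        printf("%lu\n", pop());   /* prints 1 (x)                        */
        return 0;
    }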

1.2 Memory

All systems have a memory hierarchy with memory at different speeds and sizes at
different points in the hierarchy. The fastest memory is known as cache memory and
is what it sounds like - memory that is used to temporarily hold, or cache, contents
of the main memory. This sort of memory is very fast but expensive, therefore most
processors have a small amount of on-chip cache memory and more system based
(on-board) cache memory. Some processors have one cache to contain both instructions
and data, but others have two, one for instructions and the other for data. The
Alpha AXP processor has two internal memory caches; one for data (the D-Cache)
and one for instructions (the I-Cache). The external cache (or B-Cache) mixes the
two together. Finally there is the main memory which, relative to the external cache
memory, is very slow. Relative to the on-CPU cache, main memory is positively
crawling.
The cache and main memories must be kept in step (coherent). In other words, if
a word of main memory is held in one or more locations in cache, then the system
must make sure that the contents of cache and memory are the same. The job of
cache coherency is done partially by the hardware and partially by the operating
system. This is also true for a number of major system tasks where the hardware
and software must cooperate closely to achieve their aims.

1.3 Buses

The individual components of the system board are interconnected by multiple
connection systems known as buses. The system bus is divided into three logical
functions; the address bus, the data bus and the control bus. The address bus specifies
the memory locations (addresses) for the data transfers. The data bus holds the data
transferred. The data bus is bidirectional; it allows data to be read into the CPU and
written from the CPU. The control bus contains various lines used to route timing
and control signals throughout the system. Many flavours of bus exist, for example
ISA and PCI buses are popular ways of connecting peripherals to the system.

1.4 Controllers and Peripherals

Peripherals are real devices, such as graphics cards or disks, controlled by controller
chips on the system board or on cards plugged into it. The IDE disks are controlled
by the IDE controller chip and the SCSI disks by the SCSI disk controller chips and
so on. These controllers are connected to the CPU and to each other by a variety
of buses. Most systems built now use PCI and ISA buses to connect together the
main system components. The controllers are processors like the CPU itself; they
can be viewed as intelligent helpers to the CPU. The CPU is in overall control of the
system.
All controllers are different, but they usually have registers which control them.
Software running on the CPU must be able to read and write those controlling
registers. One register might contain status describing an error. Another might be
used for control purposes; changing the mode of the controller. Each controller on
a bus can be individually addressed by the CPU; this is so that the software device
driver can write to its registers and thus control it. The IDE ribbon is a good example,
as it gives you the ability to access each drive on the bus separately. Another good
example is the PCI bus which allows each device (for example a graphics card) to be
accessed independently.

1.5 Address Spaces

The system bus connects the CPU with the main memory and is separate from the
buses connecting the CPU with the system's hardware peripherals. Collectively the
memory space that the hardware peripherals exist in is known as I/O space. I/O
space may itself be further subdivided, but we will not worry too much about that
for the moment. The CPU can access both the system space memory and the I/O
space memory, whereas the controllers themselves can only access system memory
indirectly and then only with the help of the CPU. From the point of view of the
device, say the floppy disk controller, it will see only the address space that its
control registers are in (ISA), and not the system memory. Typically a CPU will
have separate instructions for accessing the memory and I/O space. For example,
there might be an instruction that means "read a byte from I/O address 0x3f0 into
register X". This is exactly how the CPU controls the system's hardware peripherals,
by reading and writing to their registers in I/O space. Where in I/O space the
common peripherals (IDE controller, serial port, floppy disk controller and so on)
have their registers has been set by convention over the years as the PC architecture
has developed. The I/O space address 0x3f0 just happens to be the address of one
of the serial port's (COM1) control registers.
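
On Linux, this sort of I/O space access is normally performed only by kernel device
drivers, but on an Intel PC a suitably privileged user space program can try it too.
The fragment below is a hedged sketch only; it assumes an x86 machine, root
privilege and the glibc <sys/io.h> helpers ioperm() and inb():

    #include <stdio.h>
    #include <sys/io.h>

    int main(void)
    {
        unsigned short port = 0x3f0;     /* the example I/O address used in the text */

        if (ioperm(port, 1, 1) != 0) {   /* ask the kernel for access to one port    */
            perror("ioperm");
            return 1;
        }

        unsigned char value = inb(port); /* "read a byte from I/O address 0x3f0"     */
        printf("I/O port 0x%x reads 0x%02x\n", port, value);

        ioperm(port, 1, 0);              /* give the access back                     */
        return 0;
    }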
There are times when controllers need to read or write large amounts of data directly
to or from system memory; for example when user data is being written to the
hard disk. In this case, Direct Memory Access (DMA) controllers are used to allow
hardware peripherals to directly access system memory, but this access is under the
strict control and supervision of the CPU.

1.6 Timers

All operating systems need to know the time and so the modern PC includes a special
peripheral called the Real Time Clock (RTC). This provides two things: a reliable
time of day and an accurate timing interval. The RTC has its own battery so that
it continues to run even when the PC is not powered on; this is how your PC always
"knows" the correct date and time. The interval timer allows the operating system
to accurately schedule essential work.
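
The operating system turns this hardware into a time of day service for programs.
As a small, hedged illustration, using the standard C library from user space rather
than talking to the RTC directly:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t now = time(NULL);    /* seconds since 00:00:00, 1 January 1970 (UTC) */

        /* ctime() formats the time as a readable date string ending in '\n'. */
        printf("The system thinks it is %s", ctime(&now));
        return 0;
    }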

Chapter 2

Software Basics

A program is a set of computer instructions that perform a particular task.
That program can be written in assembler, a very low level computer
language, or in a high level, machine independent language such as the
C programming language. An operating system is a special program
which allows the user to run applications such as spreadsheets and word
processors. This chapter introduces basic programming principles and
gives an overview of the aims and functions of an operating system.

2.1 Computer Languages

2.1.1 Assembly Languages

The instructions that a CPU fetches from memory and executes are not at all
understandable to human beings. They are machine codes which tell the computer
precisely what to do. The hexadecimal number 0x89E5 is an Intel 80486 instruction
which copies the contents of the ESP register to the EBP register. One of the first
software tools invented for the earliest computers was an assembler, a program which
takes a human readable source file and assembles it into machine code. Assembly
languages explicitly handle registers and operations on data and they are specific to
a particular microprocessor. The assembly language for an Intel X86 microprocessor
is very different to the assembly language for an Alpha AXP microprocessor. The
following Alpha AXP assembly code shows the sort of operations that a program can
perform:
    ldr r16, (r15)     ; Line 1
    ldr r17, 4(r15)    ; Line 2
    beq r16,r17,100    ; Line 3
    str r17, (r15)     ; Line 4
100:                   ; Line 5

The first statement (on line 1) loads register 16 from the address held in register
15. The next instruction loads register 17 from the next location in memory. Line 3
compares the contents of register 16 with that of register 17 and, if they are equal,
branches to label 100. If the registers do not contain the same value then the program
continues to line 4 where the contents of r17 are saved into memory. If the registers
do contain the same value then no data needs to be saved. Assembly level programs
are tedious and tricky to write and prone to errors. Very little of the Linux kernel is
written in assembly language and those parts that are are written only for efficiency
and they are specific to particular microprocessors.

2.1.2 The C Programming Language and Compiler

Writing large programs in assembly language is a difficult and time consuming task.
It is prone to error and the resulting program is not portable, being tied to one
particular processor family. It is far better to use a machine independent language
like C [7, The C Programming Language]. C allows you to describe programs in terms
of their logical algorithms and the data that they operate on. Special programs called
compilers read the C program and translate it into assembly language, generating
machine specific code from it. A good compiler can generate assembly instructions
that are very nearly as efficient as those written by a good assembly programmer.
Most of the Linux kernel is written in the C language. The following C fragment:
if (x != y)
x = y ;

performs exactly the same operations as the previous example assembly code. If the
contents of the variable x are not the same as the contents of variable y then the
contents of y will be copied to x. C code is organized into routines, each of which
performs a task. Routines may return any value or data type supported by C. Large
programs like the Linux kernel comprise many separate C source modules, each with
its own routines and data structures. These C source code modules group together
logical functions such as filesystem handling code.
C supports many types of variables; a variable is a location in memory which can be
referenced by a symbolic name. In the above C fragment x and y refer to locations
in memory. The programmer does not care where in memory the variables are put;
it is the linker (see below) that has to worry about that. Some variables contain
different sorts of data, integer and floating point, and others are pointers.
Pointers are variables that contain the address, the location in memory, of other
data. Consider a variable called x; it might live in memory at address 0x80010000.
You could have a pointer, called px, which points at x. px might live at address
0x80010030. The value of px would be 0x80010000: the address of the variable x.
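
The short C program below illustrates the idea. It is only a sketch; the addresses
actually printed are chosen by the compiler and the system, not the fixed example
values used above:

    #include <stdio.h>

    int main(void)
    {
        int  x  = 42;     /* an ordinary variable, living somewhere in memory */
        int *px = &x;     /* px holds the address of x, not its value         */

        printf("x lives at %p and holds %d\n", (void *) px, x);
        printf("*px, the value px points at, is %d\n", *px);

        *px = 99;         /* writing through the pointer changes x itself */
        printf("x is now %d\n", x);
        return 0;
    }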
C allows you to bundle together related variables into data structures. For example,

    struct {
        int  i ;
        char b ;
    } my_struct ;

is a data structure called my_struct which contains two elements, an integer (32 bits
of data storage) called i and a character (8 bits of data) called b.
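
Continuing the example, fields of a data structure are referred to by name, and a
pointer to a structure works in just the same way. The fragment below is illustrative
only and rewrites the structure above as a named type so that it can be passed around:

    #include <stdio.h>

    struct my_struct {
        int  i ;    /* an integer (32 bits of data storage) */
        char b ;    /* a character (8 bits of data)         */
    } ;

    int main(void)
    {
        struct my_struct s  = { 42, 'q' };   /* fill in both fields        */
        struct my_struct *p = &s;            /* a pointer to the structure */

        printf("s.i = %d, s.b = %c\n", s.i, s.b);
        printf("p->i = %d, p->b = %c\n", p->i, p->b);   /* access via the pointer */
        return 0;
    }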

2.1.3 Linkers

Linkers are programs that link together several object modules and libraries to form
a single, coherent, program. Object modules are the machine code output from an
assembler or compiler and contain executable machine code and data, together with
information that allows the linker to combine the modules together to form a program.
For example, one module might contain all of a program's database functions
and another module its command line argument handling functions. Linkers fix up
references between these object modules, where a routine or data structure referenced
in one module actually exists in another module. The Linux kernel is a single,
large program linked together from its many constituent object modules.
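
As a hedged illustration of what the linker fixes up, consider two object modules
built from two C source files (the file names and routines here are invented for the
example). main.c references a routine and a variable that exist only in util.c; the
compiler leaves those references unresolved in main.o and it is the linker that joins
the two object files into one program:

    /* util.c - one object module, providing a routine and a data item */
    int call_count = 0;

    int add_numbers(int a, int b)
    {
        call_count++;              /* data defined in this module        */
        return a + b;
    }

    /* main.c - another object module, referencing symbols in util.c    */
    #include <stdio.h>

    extern int call_count;                   /* resolved by the linker */
    extern int add_numbers(int a, int b);    /* resolved by the linker */

    int main(void)
    {
        printf("%d (calls: %d)\n", add_numbers(2, 40), call_count);
        return 0;
    }

    /* Built and linked with, for example:
     *    cc -c util.c          (produces util.o)
     *    cc -c main.c          (produces main.o)
     *    cc -o program main.o util.o
     */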

2.2 What is an Operating System?

Without software a computer is just a pile of electronics that gives off heat. If the
hardware is the heart of a computer then the software is its soul. An operating system
is a collection of system programs which allow the user to run application software.
The operating system abstracts the real hardware of the system and presents the
system's users and its applications with a virtual machine. In a very real sense
the software provides the character of the system. Most PCs can run one or more
operating systems and each one can have a very different look and feel. Linux is
made up of a number of functionally separate pieces that, together, comprise the
operating system. One obvious part of Linux is the kernel itself; but even that would
be useless without libraries or shells.
In order to start understanding what an operating system is, consider what happens
when you type an apparently simple command:

    $ ls
    Mail    docs    images    perl    tcl
    $
The $ is a prompt put out by a login shell (in this case bash). This means that it
is waiting for you, the user, to type some command. Typing ls causes the keyboard
driver to recognize that characters have been typed. The keyboard driver passes
them to the shell which processes that command by looking for an executable image
of the same name. It finds that image, in /bin/ls. Kernel services are called to pull
the ls executable image into virtual memory and start executing it. The ls image
makes calls to the file subsystem of the kernel to find out what files are available.
The filesystem might make use of cached filesystem information or use the disk
device driver to read this information from the disk. It might even cause a network
driver to exchange information with a remote machine to find out details of remote
files that this system has access to (filesystems can be remotely mounted via the
Networked File System or NFS). Whichever way the information is located, ls writes
that information out and the video driver displays it on the screen.
All of the above seems rather complicated but it shows that even the most simple
commands reveal that an operating system is in fact a co-operating set of functions
that together give you, the user, a coherent view of the system.

2.2.1 Memory management

With infinite resources, for example memory, many of the things that an operating
system has to do would be redundant. One of the basic tricks of any operating
system is the ability to make a small amount of physical memory behave like rather
more memory. This apparently large memory is known as virtual memory. The idea
is that the software running in the system is fooled into believing that it is running
in a lot of memory. The system divides the memory into easily handled pages and
swaps these pages onto a hard disk as the system runs. The software does not notice
because of another trick, multi-processing.

2.2.2 Processes

A process could be thought of as a program in action; each process is a separate
entity that is running a particular program. If you look at the processes on your
Linux system, you will see that there are rather a lot. For example, typing ps shows
the following processes on my system:
$ ps
  PID TTY STAT  TIME COMMAND
  158 pRe 1     0:00 -bash
  174 pRe 1     0:00 sh /usr/X11R6/bin/startx
  175 pRe 1     0:00 xinit /usr/X11R6/lib/X11/xinit/xinitrc
  178 pRe 1 N   0:00 bowman
  182 pRe 1 N   0:01 rxvt -geometry 120x35 -fg white -bg black
  184 pRe 1 <   0:00 xclock -bg grey -geometry -1500-1500 -padding 0
  185 pRe 1 <   0:00 xload -bg grey -geometry -0-0 -label xload
  187 pp6 1     9:26 /bin/bash
  202 pRe 1 N   0:00 rxvt -geometry 120x35 -fg white -bg black
  203 ppc 2     0:00 /bin/bash
 1796 pRe 1 N   0:00 rxvt -geometry 120x35 -fg white -bg black
 1797 v06 1     0:00 /bin/bash
 3056 pp6 3 <   0:02 emacs intro/introduction.tex
 3270 pp6 3     0:00 ps
$

If my system had many CPUs then each process could (theoretically at least) run on a different CPU. Unfortunately, there is only one so again the operating system resorts to trickery by running each process in turn for a short period. This period of time is known as a time-slice. This trick is known as multi-processing or scheduling and it fools each process into thinking that it is the only process. Processes are protected from one another so that if one process crashes or malfunctions then it will not affect any others. The operating system achieves this by giving each process a separate address space which only it has access to.

2.2.3 Device drivers


Device drivers make up the major part of the Linux kernel. Like other parts of the operating system, they operate in a highly privileged environment and can cause disaster if they get things wrong. Device drivers control the interaction between the operating system and the hardware device that they are controlling. For example, the filesystem makes use of a general block device interface when writing blocks to an IDE disk. The driver takes care of the details and makes device specific things happen. Device drivers are specific to the controller chip that they are driving which is why, for example, you need the NCR810 SCSI driver if your system has an NCR810 SCSI controller.

2.2.4 The Filesystems


In Linux, as it is for UnixTM, the separate filesystems that the system may use are not accessed by device identifiers (such as a drive number or a drive name) but instead they are combined into a single hierarchical tree structure that represents the filesystem as a single entity. Linux adds each new filesystem into this single filesystem tree as it is mounted onto a mount directory, for example /mnt/cdrom. One of the most important features of Linux is its support for many different filesystems. This makes it very flexible and well able to coexist with other operating systems. The most popular filesystem for Linux is the EXT2 filesystem and this is the filesystem supported by most of the Linux distributions.
A filesystem gives the user a sensible view of files and directories held on the hard disks of the system regardless of the filesystem type or the characteristics of the underlying physical device. Linux transparently supports many different filesystems (for example MS-DOS and EXT2) and presents all of the mounted files and filesystems as one integrated virtual filesystem. So, in general, users and processes do not need to know what sort of filesystem any file is part of, they just use them.
The block device drivers hide the differences between the physical block device types (for example, IDE and SCSI) and, so far as each filesystem is concerned, the physical devices are just linear collections of blocks of data. The block sizes may vary between devices, for example 512 bytes is common for floppy devices whereas 1024 bytes is common for IDE devices and, again, this is hidden from the users of the system. An EXT2 filesystem looks the same no matter what device holds it.

2.3 Kernel Data Structures


The operating system must keep a lot of information about the current state of the system. As things happen within the system these data structures must be changed to reflect the current reality. For example, a new process might be created when a user logs onto the system. The kernel must create a data structure representing the new process and link it with the data structures representing all of the other processes in the system.
Mostly these data structures exist in physical memory and are accessible only by the kernel and its subsystems. Data structures contain data and pointers; addresses of other data structures or the addresses of routines. Taken all together, the data structures used by the Linux kernel can look very confusing. Every data structure has a purpose and although some are used by several kernel subsystems, they are more simple than they appear at first sight.
Understanding the Linux kernel hinges on understanding its data structures and the use that the various functions within the Linux kernel make of them. This book bases its description of the Linux kernel on its data structures. It talks about each kernel subsystem in terms of its algorithms, its methods of getting things done, and their usage of the kernel's data structures.

2.3.1 Linked Lists


Linux uses a number of software engineering techniques to link together its data structures. On a lot of occasions it uses linked or chained data structures. If each data structure describes a single instance or occurrence of something, for example a process or a network device, the kernel must be able to find all of the instances. In a linked list a root pointer contains the address of the first data structure, or element, in the list and each data structure contains a pointer to the next element in the list. The last element's next pointer would be 0 or NULL to show that it is the end of the list. In a doubly linked list each element contains both a pointer to the next element in the list and a pointer to the previous element in the list. Using doubly linked lists makes it easier to add or remove elements from the middle of the list although you do need more memory accesses. This is a typical operating system trade off: memory accesses versus CPU cycles.
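
As an illustration only (the structure and field names below are invented for this example and are not taken from the kernel sources), a doubly linked list of data structures might look like this in C:

#include <stddef.h>

/* Hypothetical example: a doubly linked list of network device descriptors. */
struct device {
    char name[16];
    struct device *next;   /* next element in the list, NULL at the end  */
    struct device *prev;   /* previous element, NULL at the head         */
};

static struct device *device_list = NULL;   /* the root pointer */

/* Add a device at the head of the list. */
static void add_device(struct device *dev)
{
    dev->prev = NULL;
    dev->next = device_list;
    if (device_list != NULL)
        device_list->prev = dev;
    device_list = dev;
}

/* Remove a device from anywhere in the list; the backward pointer means
 * no search is needed to find the previous element. */
static void remove_device(struct device *dev)
{
    if (dev->prev != NULL)
        dev->prev->next = dev->next;
    else
        device_list = dev->next;
    if (dev->next != NULL)
        dev->next->prev = dev->prev;
}

The extra prev pointer is exactly the memory-versus-CPU-cycles trade off mentioned above: each element costs one more pointer, but removal no longer needs to walk the list.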

2.3.2 Hash Tables


Linked lists are handy ways of tying data structures together but navigating linked lists can be inefficient. If you were searching for a particular element, you might easily have to look at the whole list before you find the one that you need. Linux uses another technique, hashing, to get around this restriction. A hash table is an array or vector of pointers. An array, or vector, is simply a set of things coming one after another in memory. A bookshelf could be said to be an array of books. Arrays are accessed by an index; the index is an offset into the array. Taking the bookshelf analogy a little further, you could describe each book by its position on the shelf; you might ask for the 5th book.
A hash table is an array of pointers to data structures and its index is derived from information in those data structures. If you had data structures describing the population of a village then you could use a person's age as an index. To find a particular person's data you could use their age as an index into the population hash table and then follow the pointer to the data structure containing the person's details. Unfortunately many people in the village are likely to have the same age and so the hash table pointer becomes a pointer to a chain or list of data structures each describing people of the same age. However, searching these shorter chains is still faster than searching all of the data structures.
As a hash table speeds up access to commonly used data structures, Linux often uses hash tables to implement caches. Caches are handy information that needs to be accessed quickly and are usually a subset of the full set of information available. Data structures are put into a cache and kept there because the kernel often accesses them. There is a drawback to caches in that they are more complex to use and maintain than simple linked lists or hash tables. If the data structure can be found in the cache (this is known as a cache hit), then all well and good. If it cannot then all of the relevant data structures must be searched and, if the data structure exists at all, it must be added into the cache. In adding new data structures into the cache an old cache entry may need discarding. Linux must decide which one to discard, the danger being that the discarded data structure may be the next one that Linux needs.
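
Continuing the village analogy, a minimal sketch of such a hash table of chained data structures might look like the following (the names here are invented for the example and do not come from the kernel sources):

#include <stddef.h>
#include <string.h>

#define HASH_SIZE 128   /* number of buckets in the hash table */

/* Hypothetical example: people hashed by age. */
struct person {
    int age;
    const char *name;
    struct person *next_hash;   /* chain of entries that hash to the same bucket */
};

static struct person *hash_table[HASH_SIZE];

/* Derive the table index from information held in the data structure itself. */
static unsigned int hash_age(int age)
{
    return (unsigned int)age % HASH_SIZE;
}

static void hash_insert(struct person *p)
{
    unsigned int index = hash_age(p->age);

    p->next_hash = hash_table[index];   /* push onto the head of the chain */
    hash_table[index] = p;
}

/* A lookup only searches the (short) chain for one bucket,
 * not every person in the village. */
static struct person *hash_lookup(int age, const char *name)
{
    struct person *p;

    for (p = hash_table[hash_age(age)]; p != NULL; p = p->next_hash)
        if (p->age == age && strcmp(p->name, name) == 0)
            return p;
    return NULL;   /* not found; in cache terms, a cache miss */
}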

2.3.3 Abstract Interfaces


The Linux kernel often abstracts its interfaces. An interface is a collection of routines and data structures which operate in a particular way. For example all network device drivers have to provide certain routines in which particular data structures are operated on. This way there can be generic layers of code using the services (interfaces) of lower layers of specific code. The network layer is generic and it is supported by device specific code that conforms to a standard interface.
Often these lower layers register themselves with the upper layer at boot time. This registration usually involves adding a data structure to a linked list. For example each filesystem built into the kernel registers itself with the kernel at boot time or, if you are using modules, when the filesystem is first used. You can see which filesystems have registered themselves by looking at the file /proc/filesystems. The registration data structure often includes pointers to functions. These are the addresses of software functions that perform particular tasks. Again, using filesystem registration as an example, the data structure that each filesystem passes to the Linux kernel as it registers includes the address of a filesystem specific routine which must be called whenever that filesystem is mounted.
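
The shape of such a registration interface can be sketched as follows; this is an illustrative fragment with invented names, not the kernel's actual file_system_type definition or registration call:

#include <stddef.h>
#include <string.h>

/* Illustrative sketch of an abstract interface: each filesystem hands the
 * kernel a descriptor containing its name and a pointer to the routine to
 * call when that filesystem is mounted. */
struct filesystem_type {
    const char *name;                              /* e.g. "ext2"                  */
    int (*mount)(const char *device, void *data);  /* filesystem specific routine  */
    struct filesystem_type *next;                  /* linked list of registrants   */
};

static struct filesystem_type *registered_filesystems = NULL;

/* Called by each filesystem at boot time (or when its module is loaded). */
static void register_filesystem_type(struct filesystem_type *fs)
{
    fs->next = registered_filesystems;
    registered_filesystems = fs;
}

/* The generic layer knows nothing about EXT2 or MS-DOS internals; it simply
 * finds the registered descriptor and calls through the function pointer. */
static int mount_by_name(const char *name, const char *device, void *data)
{
    struct filesystem_type *fs;

    for (fs = registered_filesystems; fs != NULL; fs = fs->next)
        if (strcmp(fs->name, name) == 0)
            return fs->mount(device, data);
    return -1;   /* no such filesystem registered */
}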

Chapter 3

Memory Management

The memory management subsystem is one of the most important parts of the operating system. Since the early days of computing, there has been a need for more memory than exists physically in a system. Strategies have been developed to overcome this limitation and the most successful of these is virtual memory. Virtual memory makes the system appear to have more memory than it actually has by sharing it between competing processes as they need it.
Virtual memory does more than just make your computer's memory go further. The memory management subsystem provides:

Large Address Spaces The operating system makes the system appear as if it has a larger amount of memory than it actually has. The virtual memory can be many times larger than the physical memory in the system,

Protection Each process in the system has its own virtual address space. These virtual address spaces are completely separate from each other and so a process running one application cannot affect another. Also, the hardware virtual memory mechanisms allow areas of memory to be protected against writing. This protects code and data from being overwritten by rogue applications.

Memory Mapping Memory mapping is used to map image and data files into a process's address space. In memory mapping, the contents of a file are linked directly into the virtual address space of a process.

Fair Physical Memory Allocation The memory management subsystem allows each running process in the system a fair share of the physical memory of the system,

Shared Virtual Memory Although virtual memory allows processes to have separate (virtual) address spaces, there are times when you need processes to share memory. For example there could be several processes in the system running the bash command shell. Rather than have several copies of bash, one in each process's virtual address space, it is better to have only one copy in physical memory and all of the processes running bash share it. Dynamic libraries are another common example of executing code shared between several processes. Shared memory can also be used as an Inter Process Communication (IPC) mechanism, with two or more processes exchanging information via memory common to all of them. Linux supports the UnixTM System V shared memory IPC.

Figure 3.1: Abstract model of Virtual to Physical address mapping (process X's and process Y's page tables map their virtual page frame numbers, VPFN 0 to 7, onto page frame numbers PFN 0 to 4 in physical memory)

3.1 An Abstract Model of Virtual Memory


Before considering the methods that Linux uses to support virtual memory it is useful to consider an abstract model that is not cluttered by too much detail.
As the processor executes a program it reads an instruction from memory and decodes it. In decoding the instruction it may need to fetch or store the contents of a location in memory. The processor then executes the instruction and moves onto the next instruction in the program. In this way the processor is always accessing memory either to fetch instructions or to fetch and store data.
In a virtual memory system all of these addresses are virtual addresses and not physical addresses. These virtual addresses are converted into physical addresses by the processor based on information held in a set of tables maintained by the operating system.
To make this translation easier, virtual and physical memory are divided into handy sized chunks called pages. These pages are all the same size; they need not be, but if they were not, the system would be very hard to administer. Linux on Alpha AXP systems uses 8 Kbyte pages and on Intel x86 systems it uses 4 Kbyte pages. Each of these pages is given a unique number; the page frame number (PFN). In this paged model, a virtual address is composed of two parts; an offset and a virtual page frame number. If the page size is 4 Kbytes, bits 11:0 of the virtual address contain the offset and bits 12 and above are the virtual page frame number. Each time the processor encounters a virtual address it must extract the offset and the virtual page frame number. The processor must translate the virtual page frame number into a physical one and then access the location at the correct offset into that physical page. To do this the processor uses page tables.
Figure 3.1 shows the virtual address spaces of two processes, process X and process Y, each with their own page tables. These page tables map each process's virtual pages into physical pages in memory. This shows that process X's virtual page frame number 0 is mapped into memory in physical page frame number 1 and that process Y's virtual page frame number 1 is mapped into physical page frame number 4. Each entry in the theoretical page table contains the following information:

 Valid flag. This indicates if this page table entry is valid,
 The physical page frame number that this entry is describing,
 Access control information. This describes how the page may be used. Can it be written to? Does it contain executable code?

The page table is accessed using the virtual page frame number as an offset. Virtual page frame 5 would be the 6th element of the table (0 is the first element).
To translate a virtual address into a physical one, the processor must first work out the virtual address's page frame number and the offset within that virtual page. By making the page size a power of 2 this can be easily done by masking and shifting. Looking again at Figure 3.1 and assuming a page size of 0x2000 bytes (which is decimal 8192) and an address of 0x2194 in process Y's virtual address space then the processor would translate that address into offset 0x194 within virtual page frame number 1.
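
For instance, a small, self-contained sketch of this masking and shifting, using the 8 Kbyte page size from the example (the macro and variable names are illustrative only):

#include <stdio.h>

#define PAGE_SHIFT 13                        /* 0x2000 byte (8 Kbyte) pages       */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)       /* 0x2000                            */
#define PAGE_MASK  (PAGE_SIZE - 1)           /* 0x1FFF, masks out the offset bits */

int main(void)
{
    unsigned long virtual_address = 0x2194;  /* address in process Y's space */

    unsigned long offset = virtual_address & PAGE_MASK;   /* 0x194                */
    unsigned long vpfn   = virtual_address >> PAGE_SHIFT; /* virtual page frame 1 */

    /* Pretend the page table says virtual page frame 1 maps to physical
     * page frame 4, as in Figure 3.1. */
    unsigned long pfn = 4;
    unsigned long physical_address = (pfn << PAGE_SHIFT) | offset;  /* 0x8194 */

    printf("offset=0x%lx vpfn=%lu physical=0x%lx\n", offset, vpfn, physical_address);
    return 0;
}
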
The processor uses the virtual page frame number as an index into the process's page table to retrieve its page table entry. If the page table entry at that offset is valid, the processor takes the physical page frame number from this entry. If the entry is invalid, the process has accessed a non-existent area of its virtual memory. In this case, the processor cannot resolve the address and must pass control to the operating system so that it can fix things up.
Just how the processor notifies the operating system that the current process has attempted to access a virtual address for which there is no valid translation is specific to the processor. However the processor delivers it, this is known as a page fault and the operating system is notified of the faulting virtual address and the reason for the page fault.
Assuming that this is a valid page table entry, the processor takes that physical page frame number and multiplies it by the page size to get the address of the base of the page in physical memory. Finally, the processor adds in the offset to the instruction or data that it needs.
Using the above example again, process Y's virtual page frame number 1 is mapped to physical page frame number 4 which starts at 0x8000 (4 x 0x2000). Adding in the 0x194 byte offset gives us a final physical address of 0x8194.
By mapping virtual to physical addresses this way, the virtual memory can be mapped into the system's physical pages in any order. For example, in Figure 3.1 process X's virtual page frame number 0 is mapped to physical page frame number 1 whereas virtual page frame number 7 is mapped to physical page frame number 0 even though it is higher in virtual memory than virtual page frame number 0. This demonstrates an interesting byproduct of virtual memory; the pages of virtual memory do not have to be present in physical memory in any particular order.

3.1.1 Demand Paging


As there is much less physical memory than virtual memory the operating system must be careful that it does not use the physical memory inefficiently. One way to save physical memory is to only load virtual pages that are currently being used by the executing program. For example, a database program may be run to query a database. In this case not all of the database needs to be loaded into memory, just those data records that are being examined. If the database query is a search query then it does not make sense to load the code from the database program that deals with adding new records. This technique of only loading virtual pages into memory as they are accessed is known as demand paging.
When a process attempts to access a virtual address that is not currently in memory the processor cannot find a page table entry for the virtual page referenced. For example, in Figure 3.1 there is no entry in process X's page table for virtual page frame number 2 and so if process X attempts to read from an address within virtual page frame number 2 the processor cannot translate the address into a physical one. At this point the processor notifies the operating system that a page fault has occurred.
If the faulting virtual address is invalid this means that the process has attempted to access a virtual address that it should not have. Maybe the application has gone wrong in some way, for example writing to random addresses in memory. In this case the operating system will terminate it, protecting the other processes in the system from this rogue process.
If the faulting virtual address was valid but the page that it refers to is not currently in memory, the operating system must bring the appropriate page into memory from the image on disk. Disk access takes a long time, relatively speaking, and so the process must wait quite a while until the page has been fetched. If there are other processes that could run then the operating system will select one of them to run. The fetched page is written into a free physical page frame and an entry for the virtual page frame number is added to the process's page table. The process is then restarted at the machine instruction where the memory fault occurred. This time the virtual memory access is made, the processor can make the virtual to physical address translation and so the process continues to run.
Linux uses demand paging to load executable images into a process's virtual memory. Whenever a command is executed, the file containing it is opened and its contents are mapped into the process's virtual memory. This is done by modifying the data structures describing this process's memory map and is known as memory mapping. However, only the first part of the image is actually brought into physical memory. The rest of the image is left on disk. As the image executes, it generates page faults and Linux uses the process's memory map in order to determine which parts of the image to bring into memory for execution.

3.1.2 Swapping
If a process needs to bring a virtual page into physical memory and there are no free physical pages available, the operating system must make room for this page by discarding another page from physical memory.
If the page to be discarded from physical memory came from an image or data file and has not been written to then the page does not need to be saved. Instead it can be discarded and if the process needs that page again it can be brought back into memory from the image or data file.
However, if the page has been modified, the operating system must preserve the contents of that page so that it can be accessed at a later time. This type of page is known as a dirty page and when it is removed from memory it is saved in a special sort of file called the swap file. Accesses to the swap file are very long relative to the speed of the processor and physical memory and the operating system must juggle the need to write pages to disk with the need to retain them in memory to be used again.
If the algorithm used to decide which pages to discard or swap (the swap algorithm) is not efficient then a condition known as thrashing occurs. In this case, pages are constantly being written to disk and then being read back and the operating system is too busy to allow much real work to be performed. If, for example, physical page frame number 1 in Figure 3.1 is being regularly accessed then it is not a good candidate for swapping to hard disk. The set of pages that a process is currently using is called the working set. An efficient swap scheme would make sure that all processes have their working set in physical memory.
Linux uses a Least Recently Used (LRU) page aging technique to fairly choose pages which might be removed from the system. This scheme involves every page in the system having an age which changes as the page is accessed. The more that a page is accessed, the younger it is; the less that it is accessed the older and more stale it becomes. Old pages are good candidates for swapping.

3.1.3 Shared Virtual Memory


Virtual memory makes it easy for several processes to share memory. All memory accesses are made via page tables and each process has its own separate page table. For two processes sharing a physical page of memory, its physical page frame number must appear in a page table entry in both of their page tables.
Figure 3.1 shows two processes that each share physical page frame number 4. For process X this is virtual page frame number 4 whereas for process Y this is virtual page frame number 6. This illustrates an interesting point about sharing pages: the shared physical page does not have to exist at the same place in virtual memory for any or all of the processes sharing it.

3.1.4 Physical and Virtual Addressing Modes


It does not make much sense for the operating system itself to run in virtual memory. This would be a nightmare situation where the operating system must maintain page tables for itself. Most multi-purpose processors support the notion of a physical address mode as well as a virtual address mode. Physical addressing mode requires no page tables and the processor does not attempt to perform any address translations in this mode. The Linux kernel is linked to run in physical address space.

Figure 3.2: Alpha AXP Page Table Entry (the low bits hold the V, FOR, FOW, FOE, ASM, GH, KRE, URE, KWE and UWE control bits together with the PAGE_DIRTY and PAGE_ACCESSED bits used by Linux; the upper 32 bits hold the PFN)


The Alpha AXP processor does not have a special physical addressing mode. Instead, it divides up the memory space into several areas and designates two of them as physically mapped addresses. This kernel address space is known as the KSEG address space and it encompasses all addresses upwards from 0xfffffc0000000000. In order to execute from code linked in KSEG (by definition, kernel code) or access data there, the code must be executing in kernel mode. The Linux kernel on Alpha is linked to execute from address 0xfffffc0000310000.

3.1.5 Access Control


The page table entries also contain access control information. As the processor is already using the page table entry to map a process's virtual address to a physical one, it can easily use the access control information to check that the process is not accessing memory in a way that it should not.
There are many reasons why you would want to restrict access to areas of memory. Some memory, such as that containing executable code, is naturally read only memory; the operating system should not allow a process to write data over its executable code. By contrast, pages containing data can be written to but attempts to execute that memory as instructions should fail. Most processors have at least two modes of execution: kernel and user. You would not want kernel code executed by a user or kernel data structures to be accessible except when the processor is running in kernel mode.
The access control information is held in the PTE and is processor specific; Figure 3.2 shows the PTE for Alpha AXP. The bit fields have the following meanings:

V Valid, if set this PTE is valid,

FOE "Fault on Execute". Whenever an attempt to execute instructions in this page occurs, the processor reports a page fault and passes control to the operating system,

FOW "Fault on Write", as above but page fault on an attempt to write to this page,

FOR "Fault on Read", as above but page fault on an attempt to read from this page,

ASM Address Space Match. This is used when the operating system wishes to clear only some of the entries from the Translation Buffer,

KRE Code running in kernel mode can read this page,

URE Code running in user mode can read this page,

GH Granularity hint used when mapping an entire block with a single Translation Buffer entry rather than many,

KWE Code running in kernel mode can write to this page,

UWE Code running in user mode can write to this page,

page frame number For PTEs with the V bit set, this field contains the physical Page Frame Number (page frame number) for this PTE. For invalid PTEs, if this field is not zero, it contains information about where the page is in the swap file.

The following two bits are defined and used by Linux:

PAGE_DIRTY if set, the page needs to be written out to the swap file,
PAGE_ACCESSED Used by Linux to mark a page as having been accessed.

3.2 Caches
If you were to implement a system using the above theoretical model then it would work, but not particularly efficiently. Both operating system and processor designers try hard to extract more performance from the system. Apart from making the processors, memory and so on faster the best approach is to maintain caches of useful information and data that make some operations faster. Linux uses a number of memory management related caches:

Buffer Cache The buffer cache contains data buffers that are used by the block device drivers. These buffers are of fixed sizes (for example 512 bytes) and contain blocks of information that have either been read from a block device or are being written to it. A block device is one that can only be accessed by reading and writing fixed sized blocks of data. All hard disks are block devices. The buffer cache is indexed via the device identifier and the desired block number and is used to quickly find a block of data. Block devices are only ever accessed via the buffer cache. If data can be found in the buffer cache then it does not need to be read from the physical block device, for example a hard disk, and access to it is much faster.

Page Cache This is used to speed up access to images and data on disk. It is used to cache the logical contents of a file a page at a time and is accessed via the file and offset within the file. As pages are read into memory from disk, they are cached in the page cache.

See fs/buffer.c, mm/filemap.c, swap.h, mm/swap_state.c and mm/swapfile.c

Swap Cache Only modified (or dirty) pages are saved in the swap file. So long as these pages are not modified after they have been written to the swap file then the next time the page is swapped out there is no need to write it to the swap file as the page is already in the swap file. Instead the page can simply be discarded. In a heavily swapping system this saves many unnecessary and costly disk operations.

Hardware Caches One commonly implemented hardware cache is in the processor; a cache of Page Table Entries. In this case, the processor does not always read the page table directly but instead caches translations for pages as it needs them. These are the Translation Look-aside Buffers and they contain cached copies of the page table entries from one or more processes in the system.

When the reference to the virtual address is made, the processor will attempt to find a matching TLB entry. If it finds one, it can directly translate the virtual address into a physical one and perform the correct operation on the data. If the processor cannot find a matching TLB entry then it must get the operating system to help. It does this by signalling the operating system that a TLB miss has occurred. A system specific mechanism is used to deliver that exception to the operating system code that can fix things up. The operating system generates a new TLB entry for the address mapping. When the exception has been cleared, the processor will make another attempt to translate the virtual address. This time it will work because there is now a valid entry in the TLB for that address.
The drawback of using caches, hardware or otherwise, is that in order to save effort Linux must use more time and space maintaining these caches and, if the caches become corrupted, the system will crash.

3.3 Linux Page Tables


Linux assumes that there are three levels of page tables. Each Page Table accessed contains the page frame number of the next level of Page Table. Figure 3.3 shows how a virtual address can be broken into a number of fields; each field providing an offset into a particular Page Table. To translate a virtual address into a physical one, the processor must take the contents of each level field, convert it into an offset into the physical page containing the Page Table and read the page frame number of the next level of Page Table. This is repeated three times until the page frame number of the physical page containing the virtual address is found. Now the final field in the virtual address, the byte offset, is used to find the data inside the page.
See include/asm/pgtable.h

Each platform that Linux runs on must provide translation macros that allow the kernel to traverse the page tables for a particular process. This way, the kernel does not need to know the format of the page table entries or how they are arranged. This is so successful that Linux uses the same page table manipulation code for the Alpha processor, which has three levels of page tables, and for Intel x86 processors, which have two levels of page tables.
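
As a rough, self-contained illustration of what such a three level walk does (a toy simulation only; in the kernel the same walk is expressed through the translation macros in include/asm/pgtable.h, such as pgd_offset(), pmd_offset() and pte_offset(), and the table sizes and bit positions below are invented to keep the example small):

#include <stdio.h>
#include <stddef.h>

#define ENTRIES      4                    /* 4 entries per table keeps the example tiny */
#define LEVEL_BITS   2
#define PAGE_SHIFT   13                   /* 8 Kbyte pages, as on Alpha AXP */
#define PAGE_SIZE    (1UL << PAGE_SHIFT)

/* A toy three level page table: the first two levels hold pointers to the
 * next level's table and the third level holds physical page frame numbers
 * (with 0 standing in for "no valid translation"). */
struct level3 { unsigned long pfn[ENTRIES]; };
struct level2 { struct level3 *next[ENTRIES]; };
struct level1 { struct level2 *next[ENTRIES]; };

/* Each field of the virtual address is an offset into one table; the final
 * table yields the page frame number, to which the byte offset is added. */
static unsigned long translate(struct level1 *pgd, unsigned long va)
{
    unsigned long i1 = (va >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & (ENTRIES - 1);
    unsigned long i2 = (va >> (PAGE_SHIFT + LEVEL_BITS)) & (ENTRIES - 1);
    unsigned long i3 = (va >> PAGE_SHIFT) & (ENTRIES - 1);
    struct level2 *pmd;
    struct level3 *pte;

    if ((pmd = pgd->next[i1]) == NULL || (pte = pmd->next[i2]) == NULL
        || pte->pfn[i3] == 0)
        return (unsigned long)-1;          /* no valid translation: a page fault */

    return (pte->pfn[i3] << PAGE_SHIFT) | (va & (PAGE_SIZE - 1));
}

int main(void)
{
    struct level1 pgd = {{NULL}};
    struct level2 pmd = {{NULL}};
    struct level3 pte = {{0}};

    pgd.next[0] = &pmd;
    pmd.next[0] = &pte;
    pte.pfn[1]  = 4;     /* virtual page frame 1 maps to physical page frame 4 */

    printf("0x2194 translates to 0x%lx\n", translate(&pgd, 0x2194));  /* 0x8194 */
    return 0;
}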

Figure 3.3: Three Level Page Tables (the Level 1, Level 2 and Level 3 fields of the virtual address each provide an offset into a Page Table, starting from the PGD; each table yields the PFN of the next, and the byte-within-page field selects the data within the final physical page)

3.4 Page Allocation and Deallocation


There are many demands on the physical pages in the system. For example, when an image is loaded into memory the operating system needs to allocate pages. These will be freed when the image has finished executing and is unloaded. Another use for physical pages is to hold kernel specific data structures such as the page tables themselves. The mechanisms and data structures used for page allocation and deallocation are perhaps the most critical in maintaining the efficiency of the virtual memory subsystem.
All of the physical pages in the system are described by the mem_map data structure which is a list of mem_map_t structures (confusingly, the structure is also known as the page structure) which is initialized at boot time. Each mem_map_t describes a single physical page in the system. Important fields (so far as memory management is concerned) are:

count This is a count of the number of users of this page. The count is greater than one when the page is shared between many processes,

age This field describes the age of the page and is used to decide if the page is a good candidate for discarding or swapping,

map_nr This is the physical page frame number that this mem_map_t describes.
The free_area vector is used by the page allocation code to find and free pages. The whole buffer management scheme is supported by this mechanism and so far as the code is concerned, the size of the page and the physical paging mechanisms used by the processor are irrelevant.
See include/linux/mm.h
Each element of free_area contains information about blocks of pages. The first element in the array describes single pages, the next blocks of 2 pages, the next blocks of 4 pages and so on upwards in powers of two. The list element is used as a queue head and has pointers to the page data structures in the mem_map array. Free blocks of pages are queued here. map is a pointer to a bitmap which keeps track of allocated groups of pages of this size. Bit N of the bitmap is set if the Nth block of pages is free.
Figure 3.4 shows the free_area structure. Element 0 has one free page (page frame number 0) and element 2 has 2 free blocks of 4 pages, the first starting at page frame number 4 and the second at page frame number 56.

3.4.1 Page Allocation


See get_free_pages() in mm/page_alloc.c

Linux uses the Buddy algorithm(2) to effectively allocate and deallocate blocks of pages. The page allocation code attempts to allocate a block of one or more physical pages. Pages are allocated in blocks which are powers of 2 in size. That means that it can allocate a block of 1 page, 2 pages, 4 pages and so on. So long as there are enough free pages in the system to grant this request (nr_free_pages > min_free_pages) the allocation code will search the free_area for a block of pages of the size requested. Each element of the free_area has a map of the allocated and free blocks of pages for that sized block. For example, element 2 of the array has a memory map that describes free and allocated blocks each of 4 pages long.
The allocation algorithm first searches for blocks of pages of the size requested. It follows the chain of free pages that is queued on the list element of the free_area data structure. If no blocks of pages of the requested size are free, blocks of the next size (which is twice that of the size requested) are looked for. This process continues until all of the free_area has been searched or until a block of pages has been found. If the block of pages found is larger than that requested it must be broken down until there is a block of the right size. Because the blocks are each a power of 2 pages big this breaking down process is easy as you simply break the blocks in half. The free blocks are queued on the appropriate queue and the allocated block of pages is returned to the caller.
For example, in Figure 3.4 if a block of 2 pages was requested, the first block of 4 pages (starting at page frame number 4) would be broken into two 2 page blocks. The first, starting at page frame number 4 would be returned to the caller as the allocated pages and the second block, starting at page frame number 6 would be queued as a free block of 2 pages onto element 1 of the free_area array.
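
A simplified sketch of that splitting loop, written as ordinary C over an invented free_area-like array (the real code in mm/page_alloc.c works on the mem_map list and bitmaps and differs in detail):

#include <stdio.h>

#define NR_ORDERS   6      /* block sizes of 1, 2, 4, ..., 32 pages             */
#define MAX_FREE    64     /* at most this many free blocks per order here      */

/* Invented, simplified free_area: for each order, a stack of the page
 * frame numbers at which free blocks of 2^order pages start. */
static unsigned long free_pfn[NR_ORDERS][MAX_FREE];
static int           nr_free[NR_ORDERS];

static void add_free_block(int order, unsigned long pfn)
{
    free_pfn[order][nr_free[order]++] = pfn;
}

/* Allocate 2^wanted pages: find a free block of at least that size and split
 * it in half until it is exactly the right size, queuing the unused halves
 * back onto the smaller free lists. Returns the starting PFN, or
 * (unsigned long)-1 if nothing large enough is free. */
static unsigned long alloc_block(int wanted)
{
    int order;
    unsigned long pfn;

    for (order = wanted; order < NR_ORDERS; order++) {
        if (nr_free[order] == 0)
            continue;                       /* nothing free at this size */
        pfn = free_pfn[order][--nr_free[order]];
        while (order > wanted) {
            order--;
            /* upper half of the split block becomes free at the next size down */
            add_free_block(order, pfn + (1UL << order));
        }
        return pfn;
    }
    return (unsigned long)-1;
}

int main(void)
{
    add_free_block(0, 0);    /* one free page at PFN 0, as in Figure 3.4   */
    add_free_block(2, 56);   /* a free 4 page block starting at PFN 56     */
    add_free_block(2, 4);    /* and another starting at PFN 4              */

    printf("2 page block allocated at PFN %lu\n", alloc_block(1));      /* PFN 4 */
    printf("free 2 page block now starts at PFN %lu\n", free_pfn[1][0]); /* PFN 6 */
    return 0;
}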

3.4.2 Page Deallocation


See free_pages() in mm/page_alloc.c

Allocating blocks of pages tends to fragment memory with larger blocks of free pages being broken down into smaller ones. The page deallocation code recombines pages into larger blocks of free pages whenever it can. In fact the page block size is important as it allows for easy combination of blocks into larger blocks.
Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free. If it is, then it is combined with the newly freed block of pages to form a new free block of pages for the next size block of pages. Each time two blocks of pages are recombined into a bigger block of free pages the page deallocation code attempts to recombine that block into a yet larger one. In this way the blocks of free pages are as large as memory usage will allow.
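
A useful consequence of the power-of-two block sizes is that the buddy of a block can be found by flipping a single bit of its page frame number. A tiny illustrative helper (not the kernel's actual code, which works on the free_area bitmaps):

#include <stdio.h>

/* For a block of 2^order pages starting at page frame number 'pfn', its buddy
 * block of the same size differs only in bit 'order' of the PFN. */
static unsigned long buddy_of(unsigned long pfn, int order)
{
    return pfn ^ (1UL << order);
}

int main(void)
{
    /* Page frame number 1 is freed as a single page (order 0); its buddy is
     * page frame number 0, which Figure 3.4 shows as already free, so the two
     * can be combined into one free 2 page block starting at PFN 0. */
    printf("buddy of PFN 1 at order 0 is PFN %lu\n", buddy_of(1, 0));

    /* A free 4 page block (order 2) starting at PFN 4 has its buddy at PFN 0. */
    printf("buddy of PFN 4 at order 2 is PFN %lu\n", buddy_of(4, 2));
    return 0;
}
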
(2) Bibliography reference here

Figure 3.4: The free_area data structure (each element queues the mem_map_t structures for free blocks of that size and points to a map bitmap; element 0 holds the free single page at PFN 0 and element 2 holds the free 4 page blocks starting at PFNs 4 and 56 in physical memory)


For example, in Figure 3.4, if page frame number 1 were to be freed, then that would
be ombined with the already free page frame number 0 and queued onto element 1
of the free area as a free blo k of size 2 pages.

3.5 Memory Mapping


When an image is executed, the contents of the executable image must be brought into the process's virtual address space. The same is also true of any shared libraries that the executable image has been linked to use. The executable file is not actually brought into physical memory, instead it is merely linked into the process's virtual memory. Then, as the parts of the program are referenced by the running application, the image is brought into memory from the executable image. This linking of an image into a process's virtual address space is known as memory mapping.
Every process's virtual memory is represented by an mm_struct data structure. This contains information about the image that it is currently executing (for example bash) and also has pointers to a number of vm_area_struct data structures. Each vm_area_struct data structure describes the start and end of the area of virtual memory, the process's access rights to that memory and a set of operations for that memory. These operations are a set of routines that Linux must use when manipulating this area of virtual memory. For example, one of the virtual memory operations performs the correct actions when the process has attempted to access this virtual memory but finds (via a page fault) that the memory is not actually in physical memory. This operation is the nopage operation. The nopage operation is used when Linux demand pages the pages of an executable image into memory.
When an executable image is mapped into a process's virtual address space a set of vm_area_struct data structures is generated. Each vm_area_struct data structure represents a part of the executable image; the executable code, initialized data (variables), uninitialized data and so on. Linux supports a number of standard virtual memory operations and as the vm_area_struct data structures are created, the correct set of virtual memory operations are associated with them.

Figure 3.5: Areas of Virtual Memory (each vm_area_struct records vm_start, vm_end, vm_flags, vm_inode, vm_ops and vm_next, and its vm_ops pointer refers to a set of virtual memory operations: open, close, unmap, protect, sync, advise, nopage, wppage, swapout and swapin)
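
A skeletal C rendering of what Figure 3.5 depicts might look like this; it is a simplified sketch based on the field and operation names shown in the figure, not the full kernel definitions in include/linux/mm.h, and the operation signatures are illustrative:

/* Simplified sketch of the structures shown in Figure 3.5. */
struct inode;            /* the mapped file's VFS inode (opaque here)   */
struct vm_area_struct;

struct vm_operations_struct {
    void (*open)(struct vm_area_struct *area);
    void (*close)(struct vm_area_struct *area);
    /* Called on a page fault when the page is not in physical memory;
     * for memory mapped files this reads the page through the page cache. */
    unsigned long (*nopage)(struct vm_area_struct *area,
                            unsigned long address, int write_access);
    /* swapout()/swapin() let an area (for example System V shared memory)
     * override how its pages are written to and read from the swap file. */
    int (*swapout)(struct vm_area_struct *area, unsigned long address);
    void (*swapin)(struct vm_area_struct *area, unsigned long address);
};

struct vm_area_struct {
    unsigned long vm_start;                  /* first virtual address of the area    */
    unsigned long vm_end;                    /* first virtual address past the area  */
    unsigned short vm_flags;                 /* access rights and other flags        */
    struct inode *vm_inode;                  /* file backing this area, if any       */
    struct vm_operations_struct *vm_ops;     /* may be NULL: use the default actions */
    struct vm_area_struct *vm_next;          /* next area of this process's memory   */
};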

3.6 Demand Paging

See handle_mm_fault() in mm/memory.c

Once an executable image has been memory mapped into a process's virtual memory it can start to execute. As only the very start of the image is physically pulled into memory it will soon access an area of virtual memory that is not yet in physical memory. When a process accesses a virtual address that does not have a valid page table entry, the processor will report a page fault to Linux. The page fault describes the virtual address where the page fault occurred and the type of memory access that caused it.
Linux must find the vm_area_struct that represents the area of memory that the page fault occurred in. As searching through the vm_area_struct data structures is critical to the efficient handling of page faults, these are linked together in an AVL (Adelson-Velskii and Landis) tree structure. If there is no vm_area_struct data structure for this faulting virtual address, this process has accessed an illegal virtual address. Linux will signal the process, sending a SIGSEGV signal, and if the process does not have a handler for that signal it will be terminated.
Linux next checks the type of page fault that occurred against the types of accesses allowed for this area of virtual memory. If the process is accessing the memory in an illegal way, say writing to an area that it is only allowed to read from, it is also signalled with a memory error.

Figure 3.6: The Linux Page Cache (the page_hash_table is a vector of pointers to chains of mem_map_t structures; each entry is identified by an inode and offset and linked to the others in its chain via next_hash and prev_hash pointers)


Now that Linux has determined that the page fault is legal, it must deal with it. Linux must differentiate between pages that are in the swap file and those that are part of an executable image on a disk somewhere. It does this by using the page table entry for this faulting virtual address.
See do_no_page() in mm/memory.c
If the page's page table entry is invalid but not empty, the page fault is for a page currently being held in the swap file. For Alpha AXP page table entries, these are entries which do not have their valid bit set but which have a non-zero value in their PFN field. In this case the PFN field holds information about where in the swap (and which swap file) the page is being held. How pages in the swap file are handled is described later in this chapter.
Not all vm_area_struct data structures have a set of virtual memory operations and even those that do may not have a nopage operation. This is because by default Linux will fix up the access by allocating a new physical page and creating a valid page table entry for it. If there is a nopage operation for this area of virtual memory, Linux will use it.
See filemap_nopage() in mm/filemap.c
The generic Linux nopage operation is used for memory mapped executable images and it uses the page cache to bring the required image page into physical memory. However the required page is brought into physical memory, the process's page tables are updated. It may be necessary for hardware specific actions to update those entries, particularly if the processor uses translation look aside buffers. Now that the page fault has been handled it can be dismissed and the process is restarted at the instruction that made the faulting virtual memory access.

3.7 The Linux Page Cache


The role of the Linux page cache is to speed up access to files on disk. Memory mapped files are read a page at a time and these pages are stored in the page cache. Figure 3.6 shows that the page cache consists of the page_hash_table, a vector of pointers to mem_map_t data structures.
See include/linux/pagemap.h
Each file in Linux is identified by a VFS inode data structure (described in Chapter 9) and each VFS inode is unique and fully describes one and only one file. The index into the page hash table is derived from the file's VFS inode and the offset into the file.
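
A simple way to picture this indexing is shown below; the hash function here is an invented illustration, not the one actually defined in include/linux/pagemap.h:

#include <stdio.h>

#define PAGE_SHIFT      13                      /* 8 Kbyte pages on Alpha AXP     */
#define PAGE_HASH_SIZE  2048                    /* number of buckets, chosen here */

/* Illustrative only: mix the inode's identity and the page-aligned file
 * offset into a bucket index of the page hash table. */
static unsigned long page_cache_hash(const void *inode, unsigned long offset)
{
    return (((unsigned long)inode / sizeof(void *)) ^ (offset >> PAGE_SHIFT))
           % PAGE_HASH_SIZE;
}

int main(void)
{
    int dummy_inode;   /* stands in for a VFS inode */

    /* Pages at offsets 0x2000 and 0x8000 of the same file land in different buckets. */
    printf("%lu\n", page_cache_hash(&dummy_inode, 0x2000));
    printf("%lu\n", page_cache_hash(&dummy_inode, 0x8000));
    return 0;
}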

Whenever a page is read from a memory mapped file, for example when it needs to be brought back into memory during demand paging, the page is read through the page cache. If the page is present in the cache, a pointer to the mem_map_t data structure representing it is returned to the page fault handling code. Otherwise the page must be brought into memory from the file system that holds the image. Linux allocates a physical page and reads the page from the file on disk.
If it is possible, Linux will initiate a read of the next page in the file. This single page read ahead means that if the process is accessing the pages in the file serially, the next page will be waiting in memory for the process.
Over time the page cache grows as images are read and executed. Pages will be removed from the cache as they are no longer needed, say as an image is no longer being used by any process. As Linux uses memory it can start to run low on physical pages. In this case Linux will reduce the size of the page cache.

3.8 Swapping Out and Discarding Pages


When physical memory becomes scarce the Linux memory management subsystem must attempt to free physical pages. This task falls to the kernel swap daemon (kswapd). The kernel swap daemon is a special type of process, a kernel thread. Kernel threads are processes that have no virtual memory; instead they run in kernel mode in the physical address space. The kernel swap daemon is slightly misnamed in that it does more than merely swap pages out to the system's swap files. Its role is to make sure that there are enough free pages in the system to keep the memory management system operating efficiently.
See kswapd() in mm/vmscan.c

The kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire. Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. It uses two variables, free_pages_high and free_pages_low, to decide if it should free some pages. So long as the number of free pages in the system remains above free_pages_high, the kernel swap daemon does nothing; it sleeps again until its timer next expires. For the purposes of this check the kernel swap daemon takes into account the number of pages currently being written out to the swap file. It keeps a count of these in nr_async_pages; this is incremented each time a page is queued waiting to be written out to the swap file and decremented when the write to the swap device has completed. free_pages_low and free_pages_high are set at system startup time and are related to the number of physical pages in the system. If the number of free pages in the system has fallen below free_pages_high or, worse still, free_pages_low, the kernel swap daemon will try three ways to reduce the number of physical pages being used by the system:

 Reducing the size of the buffer and page caches,
 Swapping out System V shared memory pages,
 Swapping out and discarding pages.

If the number of free pages in the system has fallen below free_pages_low, the kernel swap daemon will try to free 6 pages before it next runs. Otherwise it will try to free 3 pages. Each of the above methods is tried in turn until enough pages have been freed. The kernel swap daemon remembers which method it was using the last time that it attempted to free physical pages. Each time it runs it will start trying to free pages using this last successful method.
After it has freed sufficient pages, the swap daemon sleeps again until its timer expires. If the reason that the kernel swap daemon freed pages was that the number of free pages in the system had fallen below free_pages_low, it only sleeps for half its usual time. Once the number of free pages is more than free_pages_low the kernel swap daemon goes back to sleeping longer between checks.
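
The shape of that decision can be sketched in a few lines of C; this is an illustrative paraphrase of the behaviour described above, with names mirroring the text rather than the actual code in mm/vmscan.c, and try_to_free_a_page() standing in for the three methods listed:

/* Illustrative paraphrase of the kswapd decision described in the text. */
extern int nr_free_pages;      /* current number of free physical pages     */
extern int nr_async_pages;     /* pages queued for writing to the swap file */
extern int free_pages_high;    /* thresholds set at system startup          */
extern int free_pages_low;

extern int try_to_free_a_page(int priority);  /* one of the three methods */

static void swap_daemon_tick(void)
{
    int pages_to_free;

    /* Pages already on their way out to the swap file count towards the total. */
    if (nr_free_pages + nr_async_pages >= free_pages_high)
        return;                                /* nothing to do, sleep again */

    /* Below free_pages_low the daemon works harder: free 6 pages, not 3. */
    pages_to_free = (nr_free_pages + nr_async_pages < free_pages_low) ? 6 : 3;

    while (pages_to_free > 0 && try_to_free_a_page(0))
        pages_to_free--;
}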

3.8.1 Reducing the Size of the Page and Buffer Caches


The pages held in the page and buffer caches are good candidates for being freed into the free_area vector. The Page Cache, which contains pages of memory mapped files, may contain unnecessary pages that are filling up the system's memory. Likewise the Buffer Cache, which contains buffers read from or being written to physical devices, may also contain unneeded buffers. When the physical pages in the system start to run out, discarding pages from these caches is relatively easy as it requires no writing to physical devices (unlike swapping pages out of memory). Discarding these pages does not have too many harmful side effects other than making access to physical devices and memory mapped files slower. However, if the discarding of pages from these caches is done fairly, all processes will suffer equally.
Every time the kernel swap daemon tries to shrink these caches it examines a block of pages in the mem_map page vector to see if any can be discarded from physical memory. The size of the block of pages examined is higher if the kernel swap daemon is intensively swapping; that is, if the number of free pages in the system has fallen dangerously low. The blocks of pages are examined in a cyclical manner; a different block of pages is examined each time an attempt is made to shrink the memory map. This is known as the clock algorithm as, rather like the minute hand of a clock, the whole mem_map page vector is examined a few pages at a time.
See shrink_mmap() in mm/filemap.c

Each page being examined is checked to see if it is cached in either the page cache or the buffer cache. You should note that shared pages are not considered for discarding at this time and that a page cannot be in both caches at the same time. If the page is not in either cache then the next page in the mem_map page vector is examined.
Pages are cached in the buffer cache (or rather the buffers within the pages are cached) to make buffer allocation and deallocation more efficient. The memory map shrinking code tries to free the buffers that are contained within the page being examined. If all the buffers are freed, then the pages that contain them are also freed. If the examined page is in the Linux page cache, it is removed from the page cache and freed.
See try_to_free_buffer() in fs/buffer.c
When enough pages have been freed on this attempt then the kernel swap daemon will wait until the next time it is periodically woken. As none of the freed pages were part of any process's virtual memory (they were cached pages), no page tables need updating. If there were not enough cached pages discarded then the swap daemon will try to swap out some shared pages.

3.8.2 Swapping Out System V Shared Memory Pages

See shm_swap() in ipc/shm.c

System V shared memory is an inter-process communication mechanism which allows two or more processes to share virtual memory in order to pass information amongst themselves. How processes share memory in this way is described in more detail in Chapter 5. For now it is enough to say that each area of System V shared memory is described by a shmid_ds data structure. This contains a pointer to a list of vm_area_struct data structures, one for each process sharing this area of virtual memory. The vm_area_struct data structures describe where in each process's virtual memory this area of System V shared memory goes. Each vm_area_struct data structure for this System V shared memory is linked together using the vm_next_shared and vm_prev_shared pointers. Each shmid_ds data structure also contains a list of page table entries each of which describes the physical page that a shared virtual page maps to.
The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory pages. Each time it runs it remembers which page of which shared virtual memory area it last swapped out. It does this by keeping two indices, the first an index into the set of shmid_ds data structures, the second into the list of page table entries for this area of System V shared memory. This makes sure that it fairly victimizes the areas of System V shared memory.
As the physical page frame number for a given virtual page of System V shared memory is contained in the page tables of all of the processes sharing this area of virtual memory, the kernel swap daemon must modify all of these page tables to show that the page is no longer in memory but is now held in the swap file. For each shared page it is swapping out, the kernel swap daemon finds the page table entry in each of the sharing processes' page tables (by following a pointer from each vm_area_struct data structure). If this process's page table entry for this page of System V shared memory is valid, it converts it into an invalid but swapped out page table entry and reduces this (shared) page's count of users by one. The format of a swapped out System V shared page table entry contains an index into the set of shmid_ds data structures and an index into the page table entries for this area of System V shared memory.
If the page's count is zero after the page tables of the sharing processes have all been modified, the shared page can be written out to the swap file. The page table entry in the list pointed at by the shmid_ds data structure for this area of System V shared memory is replaced by a swapped out page table entry. A swapped out page table entry is invalid but contains an index into the set of open swap files and the offset in that file where the swapped out page can be found. This information will be used when the page has to be brought back into physical memory.

3.8.3 Swapping Out and Discarding Pages


See swap_out() in mm/vmscan.c

The swap daemon looks at each process in the system in turn to see if it is a good candidate for swapping. Good candidates are processes that can be swapped (some cannot) and that have one or more pages which can be swapped or discarded from memory. Pages are swapped out of physical memory into the system's swap files only if the data in them cannot be retrieved another way.
A lot of the contents of an executable image come from the image's file and can easily be re-read from that file. For example, the executable instructions of an image will never be modified by the image and so will never be written to the swap file. These pages can simply be discarded; when they are again referenced by the process, they will be brought back into memory from the executable image.
Once the process to swap has been located, the swap daemon looks through all of its virtual memory regions looking for areas which are not shared or locked. (To do this it follows the vm_next pointer along the list of vm_area_struct structures queued on the mm_struct for the process.) Linux does not swap out all of the swappable pages of the process that it has selected; instead it removes only a small number of pages. Pages cannot be swapped or discarded if they are locked in memory.

The Linux swap algorithm uses page aging. Each page has a counter (held in the mem_map_t data structure) that gives the kernel swap daemon some idea whether or not a page is worth swapping. Pages age when they are unused and rejuvenate on access; the swap daemon only swaps out old pages. The default action when a page is first allocated is to give it an initial age of 3. Each time it is touched, its age is increased by 3 to a maximum of 20. Every time the kernel swap daemon runs it ages pages, decrementing their age by 1. These default actions can be changed and for this reason they (and other swap related information) are stored in the swap_control data structure.
See swap_out_vma() in mm/vmscan.c
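
Expressed as a tiny sketch (the constants 3, 1 and 20 come from the text above; the macro and function names are invented for the example rather than taken from swap_control):

#define PAGE_INITIAL_AGE  3    /* age given to a newly allocated page        */
#define PAGE_ADVANCE      3    /* added each time the page is touched        */
#define PAGE_DECLINE      1    /* subtracted each time the swap daemon runs  */
#define PAGE_MAX_AGE     20    /* ages are clamped to this maximum           */

/* Illustrative page aging as described in the text. */
static int touch_page(int age)
{
    age += PAGE_ADVANCE;
    return age > PAGE_MAX_AGE ? PAGE_MAX_AGE : age;
}

static int age_page(int age)
{
    return age > PAGE_DECLINE ? age - PAGE_DECLINE : 0;
}

/* A page whose age has reached 0 is a candidate for swapping or discarding. */
static int page_is_old(int age)
{
    return age == 0;
}
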
If the page is old (age = 0), the swap daemon will process it further. Dirty pages are pages which can be swapped out. Linux uses an architecture specific bit in the PTE to describe pages this way (see Figure 3.2). However, not all dirty pages are necessarily written to the swap file. Every virtual memory region of a process may have its own swap operation (pointed at by the vm_ops pointer in the vm_area_struct) and, if so, that method is used. Otherwise, the swap daemon will allocate a page in the swap file and write the page out to that device.
The page's page table entry is replaced by one which is marked as invalid but which contains information about where the page is in the swap file. This is an offset into the swap file where the page is held and an indication of which swap file is being used. Whatever the swap method used, the original physical page is made free by putting it back into the free_area. Clean (or rather not dirty) pages can be discarded and put back into the free_area for re-use.
If enough of the swappable process's pages have been swapped out or discarded, the swap daemon will again sleep. The next time it wakes it will consider the next process in the system. In this way, the swap daemon nibbles away at each process's physical pages until the system is again in balance. This is much fairer than swapping out whole processes.

3.9 The Swap Cache


When swapping pages out to the swap files, Linux avoids writing pages if it does not have to. There are times when a page is both in a swap file and in physical memory. This happens when a page that was swapped out of memory was then brought back into memory when it was again accessed by a process. So long as the page in memory is not written to, the copy in the swap file remains valid.
Linux uses the swap cache to track these pages. The swap cache is a list of page table entries, one per physical page in the system. This is a page table entry for a swapped out page and describes which swap file the page is being held in together with its location in the swap file. If a swap cache entry is non-zero, it represents a page which is being held in a swap file and that has not been modified. If the page is subsequently modified (by being written to), its entry is removed from the swap cache.
When Linux needs to swap a physical page out to a swap file it consults the swap cache and, if there is a valid entry for this page, it does not need to write the page out to the swap file. This is because the page in memory has not been modified since it was last read from the swap file.
The entries in the swap cache are page table entries for swapped out pages. They are marked as invalid but contain information which allows Linux to find the right swap file and the right page within that swap file.
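A minimal sketch of that decision, with the swap cache modelled as a plain array indexed by physical page number; the entry format and the helper names are illustrative assumptions, not the kernel's actual code.

#include <stdio.h>

#define NR_PAGES 8

/* One entry per physical page: zero means "not in the swap cache",
 * non-zero describes where a still-valid copy lives in a swap file. */
static unsigned long swap_cache[NR_PAGES];

static unsigned long write_page_to_swap_file(int page)
{
    /* Stand-in for the real I/O: pretend the page went to slot page+1. */
    printf("writing page %d to the swap file\n", page);
    return (unsigned long)(page + 1);
}

/* Swap a page out.  If the swap cache already holds a valid entry the
 * copy in the swap file is up to date and the write can be skipped. */
static unsigned long swap_out(int page)
{
    if (swap_cache[page] != 0)
        return swap_cache[page];
    swap_cache[page] = write_page_to_swap_file(page);
    return swap_cache[page];
}

/* A process writes to the page: the swap file copy is now stale, so the
 * entry is removed from the swap cache. */
static void page_written_to(int page)
{
    swap_cache[page] = 0;
}

int main(void)
{
    swap_out(3);            /* first time: really written           */
    swap_out(3);            /* unchanged since then: write skipped   */
    page_written_to(3);     /* page modified, cache entry removed    */
    swap_out(3);            /* must be written out again             */
    return 0;
}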

3.10 Swapping Pages In

See do_page_fault() in arch/i386/mm/fault.c
See do_no_page() and do_swap_page() in mm/memory.c
See shm_swap_in() in ipc/shm.c

The dirty pages saved in the swap files may be needed again, for example when an application writes to an area of virtual memory whose contents are held in a swapped out physical page. Accessing a page of virtual memory that is not held in physical memory causes a page fault to occur. The page fault is the processor signalling the operating system that it cannot translate a virtual address into a physical one. In this case this is because the page table entry describing this page of virtual memory was marked as invalid when the page was swapped out. The processor cannot handle the virtual to physical address translation and so hands control back to the operating system, describing as it does so the virtual address that faulted and the reason for the fault. The format of this information and how the processor passes control to the operating system is processor specific. The processor specific page fault handling code must locate the vm_area_struct data structure that describes the area of virtual memory that contains the faulting virtual address. It does this by searching the vm_area_struct data structures for this process until it finds the one containing the faulting virtual address. This is very time critical code and a process's vm_area_struct data structures are arranged so as to make this search take as little time as possible.
Having carried out the appropriate processor specific actions and found that the faulting virtual address is for a valid area of virtual memory, the page fault processing becomes generic and applicable to all processors that Linux runs on. The generic page fault handling code looks for the page table entry for the faulting virtual address. If the page table entry it finds is for a swapped out page, Linux must swap the page back into physical memory. The format of the page table entry for a swapped out page is processor specific but all processors mark these pages as invalid and put the information necessary to locate the page within the swap file into the page table entry. Linux needs this information in order to bring the page back into physical memory.
At this point, Linux knows the faulting virtual address and has a page table entry containing information about where this page has been swapped to. The vm_area_struct data structure may contain a pointer to a routine which will swap any page of the area of virtual memory that it describes back into physical memory. This is its swapin operation. If there is a swapin operation for this area of virtual memory then Linux will use it. This is, in fact, how swapped out System V shared memory pages are handled; they require special handling because the format of a swapped out System V shared page is a little different from that of an ordinary swapped out page. If there is no swapin operation, Linux assumes that this is an ordinary page that does not need to be specially handled. It allocates a free physical page and reads the swapped out page back from the swap file. Information telling it where in the swap file (and which swap file) is taken from the invalid page table entry.
If the access that caused the page fault was not a write access then the page is left in the swap cache and its page table entry is not marked as writable. If the page is subsequently written to, another page fault will occur and, at that point, the page is marked as dirty and its entry is removed from the swap cache. If the page is not written to and it needs to be swapped out again, Linux can avoid the write of the page to its swap file because the page is already in the swap file.
If the access that caused the page to be brought in from the swap file was a write operation, this page is removed from the swap cache and its page table entry is marked as both dirty and writable.
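The following stand-alone sketch summarises the swap-in decision just described. The structures are heavily reduced and the names, apart from the swapin operation itself, are illustrative rather than the kernel's.

#include <stdio.h>

/* Very reduced models of the structures discussed in this section. */
struct vm_operations {
    void (*swapin)(unsigned long swap_entry);   /* optional method */
};

struct vm_area {
    struct vm_operations *vm_ops;
};

static void read_page_from_swap_file(unsigned long swap_entry)
{
    printf("reading swap entry %#lx into a newly allocated page\n",
           swap_entry);
}

/* Handle a fault on a page table entry describing a swapped out page. */
static void handle_swapped_out_page(struct vm_area *vma,
                                    unsigned long swap_entry,
                                    int write_access)
{
    if (vma->vm_ops && vma->vm_ops->swapin) {
        /* e.g. System V shared memory supplies its own swapin operation */
        vma->vm_ops->swapin(swap_entry);
    } else {
        /* ordinary page: allocate a free page and read it back */
        read_page_from_swap_file(swap_entry);
    }

    if (write_access)
        printf("mark page dirty and writable, drop its swap cache entry\n");
    else
        printf("leave page in the swap cache, page table entry read only\n");
}

int main(void)
{
    struct vm_area plain = { 0 };
    handle_swapped_out_page(&plain, 0x42, 0);   /* read fault  */
    handle_swapped_out_page(&plain, 0x42, 1);   /* write fault */
    return 0;
}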

See swap_in() in mm/page_alloc.c

Chapter 4

Processes

This chapter describes what a process is and how the Linux kernel creates, manages and deletes the processes in the system.
Processes carry out tasks within the operating system. A program is a set of machine code instructions and data stored in an executable image on disk and is, as such, a passive entity; a process can be thought of as a computer program in action. It is a dynamic entity, constantly changing as the machine code instructions are executed by the processor. As well as the program's instructions and data, the process also includes the program counter and all of the CPU's registers as well as the process stacks containing temporary data such as routine parameters, return addresses and saved variables. The current executing program, or process, includes all of the current activity in the microprocessor. Linux is a multiprocessing operating system. Processes are separate tasks, each with their own rights and responsibilities. If one process crashes it will not cause another process in the system to crash. Each individual process runs in its own virtual address space and is not capable of interacting with another process except through secure, kernel managed mechanisms.
During the lifetime of a process it will use many system resources. It will use the CPUs in the system to run its instructions and the system's physical memory to hold it and its data. It will open and use files within the filesystems and may directly or indirectly use the physical devices in the system. Linux must keep track of the process itself and of the system resources that it has so that it can manage it and the other processes in the system fairly. It would not be fair to the other processes in the system if one process monopolized most of the system's physical memory or its CPUs.
The most precious resource in the system is the CPU, usually there is only one. Linux is a multiprocessing operating system; its objective is to have a process running on each CPU in the system at all times, to maximize CPU utilization. If there are more processes than CPUs (and there usually are), the rest of the processes must wait before a CPU becomes free until they can be run. Multiprocessing is a simple idea; a

process is executed until it must wait, usually for some system resource; when it has this resource, it may run again. In a uniprocessing system, for example DOS, the CPU would simply sit idle and the waiting time would be wasted. In a multiprocessing system many processes are kept in memory at the same time. Whenever a process has to wait the operating system takes the CPU away from that process and gives it to another, more deserving process. It is the scheduler which chooses which is the most appropriate process to run next and Linux uses a number of scheduling strategies to ensure fairness.
Linux supports a number of different executable file formats, ELF is one, Java is another, and these must be managed transparently, as must the process's use of the system's shared libraries.

4.1 Linux Processes

See include/linux/sched.h

So that Linux can manage the processes in the system, each process is represented by a task_struct data structure (task and process are terms that Linux uses interchangeably). The task vector is an array of pointers to every task_struct data structure in the system. This means that the maximum number of processes in the system is limited by the size of the task vector; by default it has 512 entries. As processes are created, a new task_struct is allocated from system memory and added into the task vector. To make it easy to find, the current, running, process is pointed to by the current pointer.
As well as the normal type of process, Linux supports real time processes. These processes have to react very quickly to external events (hence the term "real time") and they are treated differently from normal user processes by the scheduler. Although the task_struct data structure is quite large and complex, its fields can be divided into a number of functional areas:

State As a process executes it changes state according to its circumstances. Linux processes have the following states:

Running The process is either running (it is the current process in the system) or it is ready to run (it is waiting to be assigned to one of the system's CPUs).

Waiting The process is waiting for an event or for a resource. Linux differentiates between two types of waiting process; interruptible and uninterruptible. Interruptible waiting processes can be interrupted by signals whereas uninterruptible waiting processes are waiting directly on hardware conditions and cannot be interrupted under any circumstances.

Stopped The process has been stopped, usually by receiving a signal. A process that is being debugged can be in a stopped state.

Zombie This is a halted process which, for some reason, still has a task_struct data structure in the task vector. It is what it sounds like, a dead process.

Scheduling Information The scheduler needs this information in order to fairly decide which process in the system most deserves to run,

1 REVIEW NOTE: I left out SWAPPING because it does not appear to be used.

Identifiers Every process in the system has a process identifier. The process identifier is not an index into the task vector, it is simply a number. Each process also has user and group identifiers; these are used to control this process's access to the files and devices in the system,

Inter-Process Communication Linux supports the classic UnixTM IPC mechanisms of signals, pipes and semaphores and also the System V IPC mechanisms of shared memory, semaphores and message queues. The IPC mechanisms supported by Linux are described in Chapter 5.

Links In a Linux system no process is independent of any other process. Every process in the system, except the initial process, has a parent process. New processes are not created, they are copied, or rather cloned, from previous processes. Every task_struct representing a process keeps pointers to its parent process and to its siblings (those processes with the same parent process) as well as to its own child processes. You can see the family relationship between the running processes in a Linux system using the pstree command:

init(1)-+-crond(98)
        |-emacs(387)
        |-gpm(146)
        |-inetd(110)
        |-kerneld(18)
        |-kflushd(2)
        |-klogd(87)
        |-kswapd(3)
        |-login(160)---bash(192)---emacs(225)
        |-lpd(121)
        |-mingetty(161)
        |-mingetty(162)
        |-mingetty(163)
        |-mingetty(164)
        |-login(403)---bash(404)---pstree(594)
        |-sendmail(134)
        |-syslogd(78)
        `-update(166)

Additionally all of the processes in the system are held in a doubly linked list whose root is the init process's task_struct data structure. This list allows the Linux kernel to look at every process in the system. It needs to do this to provide support for commands such as ps or kill.

Times and Timers The kernel keeps track of a process's creation time as well as the CPU time that it consumes during its lifetime. Each clock tick, the kernel updates the amount of time in jiffies that the current process has spent in system and in user mode. Linux also supports process specific interval timers; processes can use system calls to set up timers to send signals to themselves when the timers expire. These timers can be single-shot or periodic timers.

File system Processes can open and close files as they wish and the process's task_struct contains pointers to descriptors for each open file as well as pointers to two VFS inodes. The first is to the root of the process (its home directory) and the second is to its current or pwd directory; pwd is derived from the UnixTM command pwd, print working directory. Each VFS inode uniquely describes a file or directory within a file system and also provides a uniform interface to the underlying file systems. How file systems are supported under Linux is described in Chapter 9. These two VFS inodes have their count fields incremented to show that one or more processes are referencing them. This is why you cannot delete the directory that a process has as its pwd directory, or, for that matter, one of its sub-directories.

Virtual memory Most processes have some virtual memory (kernel threads and daemons do not) and the Linux kernel must track how that virtual memory is mapped onto the system's physical memory.

Processor Specific Context A process could be thought of as the sum total of the system's current state. Whenever a process is running it is using the processor's registers, stacks and so on. This is the process's context and, when a process is suspended, all of that CPU specific context must be saved in the task_struct for the process. When a process is restarted by the scheduler its context is restored from here.

4.2 Identifiers


Linux, like all UnixTM, uses user and group identifiers to check for access rights to files and images in the system. All of the files in a Linux system have ownerships and permissions; these permissions describe what access the system's users have to that file or directory. Basic permissions are read, write and execute and are assigned to three classes of user; the owner of the file, processes belonging to a particular group and all of the processes in the system. Each class of user can have different permissions, for example a file could have permissions which allow its owner to read and write it, the file's group to read it and for all other processes in the system to have no access at all.
REVIEW NOTE: Expand and give the bit assignments (777).

Groups are Linux's way of assigning privileges to files and directories for a group of users rather than to a single user or to all processes in the system. You might, for example, create a group for all of the users in a software project and arrange it so that only they could read and write the source code for the project. A process can belong to several groups (a maximum of 32 is the default) and these are held in the groups vector in the task_struct for each process. So long as a file has access rights for one of the groups that a process belongs to then that process will have appropriate group access rights to that file.
There are four pairs of process and group identifiers held in a process's task_struct:

uid, gid The user identifier and group identifier of the user that the process is running on behalf of,

effective uid and gid There are some programs which change the uid and gid from that of the executing process into their own (held as attributes in the VFS inode describing the executable image). These programs are known as setuid programs and they are useful because it is a way of restricting access to services, particularly those that run on behalf of someone else, for example a network daemon. The effective uid and gid are those from the setuid program and the uid and gid remain as they were. The kernel checks the effective uid and gid whenever it checks for privilege rights.

file system uid and gid These are normally the same as the effective uid and gid and are used when checking file system access rights. They are needed for NFS mounted filesystems where the user mode NFS server needs to access files as if it were a particular process. In this case only the file system uid and gid are changed (not the effective uid and gid). This avoids a situation where malicious users could send a kill signal to the NFS server. Kill signals are delivered to processes with a particular effective uid and gid.

saved uid and gid These are mandated by the POSIX standard and are used by programs which change the process's uid and gid via system calls. They are used to save the real uid and gid during the time that the original uid and gid have been changed.
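Condensed into code, the four pairs might be pictured like this; the field names echo the text but this is an illustration, not the real task_struct layout, and the example values for a setuid program are likewise only for the sake of the example.

#include <stdio.h>

/* Condensed illustration of the per-process identifiers described above;
 * the real fields live in task_struct (include/linux/sched.h) alongside
 * many others. */
struct process_ids {
    unsigned short uid,   gid;      /* who the process runs on behalf of     */
    unsigned short euid,  egid;     /* checked for privileges (setuid)        */
    unsigned short fsuid, fsgid;    /* checked for file system access (NFS)   */
    unsigned short suid,  sgid;     /* saved copies, required by POSIX        */
};

int main(void)
{
    /* A setuid-root program started by user 500: the effective and file
     * system identifiers become 0 while the real ones stay as they were. */
    struct process_ids p = { 500, 500, 0, 0, 0, 0, 0, 0 };
    printf("real uid %d, effective uid %d\n", p.uid, p.euid);
    return 0;
}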

4.3 Scheduling

All processes run partially in user mode and partially in system mode. How these modes are supported by the underlying hardware differs but generally there is a secure mechanism for getting from user mode into system mode and back again. User mode has far fewer privileges than system mode. Each time a process makes a system call it swaps from user mode to system mode and continues executing. At this point the kernel is executing on behalf of the process. In Linux, processes do not preempt the current, running process, they cannot stop it from running so that they can run. Each process decides to relinquish the CPU that it is running on when it has to wait for some system event. For example, a process may have to wait for a character to be read from a file. This waiting happens within the system call, in system mode; the process used a library function to open and read the file and it, in turn, made system calls to read bytes from the open file. In this case the waiting process will be suspended and another, more deserving process will be chosen to run.
Processes are always making system calls and so may often need to wait. Even so, if a process executes until it waits then it still might use a disproportionate amount of CPU time and so Linux uses pre-emptive scheduling. In this scheme, each process is allowed to run for a small amount of time, 200ms, and, when this time has expired another process is selected to run and the original process is made to wait for a little while until it can run again. This small amount of time is known as a time-slice.
See schedule() in kernel/sched.c

It is the scheduler that must select the most deserving process to run out of all of the runnable processes in the system. A runnable process is one which is waiting only for a CPU to run on. Linux uses a reasonably simple priority based scheduling algorithm to choose between the current processes in the system. When it has chosen a new process to run it saves the state of the current process, the processor specific registers and other context being saved in the process's task_struct data structure. It then restores the state of the new process (again this is processor specific) to run and gives control of the system to that process. For the scheduler to fairly allocate CPU time between the runnable processes in the system it keeps information in the task_struct for each process:

policy This is the scheduling policy that will be applied to this process. There are two types of Linux process, normal and real time. Real time processes have a higher priority than all of the other processes. If there is a real time process ready to run, it will always run first. Real time processes may have two types of policy, round robin and first in first out. In round robin scheduling, each runnable real time process is run in turn and in first in, first out scheduling each runnable process is run in the order that it is in on the run queue and that order is never changed.

priority This is the priority that the scheduler will give to this process. It is also the amount of time (in jiffies) that this process will run for when it is allowed to run. You can alter the priority of a process by means of system calls and the renice command.

rt_priority Linux supports real time processes and these are scheduled to have a higher priority than all of the other non-real time processes in the system. This field allows the scheduler to give each real time process a relative priority. The priority of a real time process can be altered using system calls.

counter This is the amount of time (in jiffies) that this process is allowed to run for. It is set to priority when the process is first run and is decremented each clock tick.

See schedule() in kernel/sched.c

The scheduler is run from several places within the kernel. It is run after putting the current process onto a wait queue and it may also be run at the end of a system call, just before a process is returned to user mode from system mode. One reason that it might need to run is because the system timer has just set the current process's counter to zero. Each time the scheduler is run it does the following:

kernel work The scheduler runs the bottom half handlers and processes the scheduler task queue. These lightweight kernel threads are described in detail in Chapter 11.

Current process The current process must be processed before another process can be selected to run.

If the scheduling policy of the current process is round robin then it is put onto the back of the run queue.
If the task is INTERRUPTIBLE and it has received a signal since the last time it was scheduled then its state becomes RUNNING.
If the current process has timed out, then its state becomes RUNNING.
If the current process is RUNNING then it will remain in that state.
Processes that were neither RUNNING nor INTERRUPTIBLE are removed from the run queue. This means that they will not be considered for running when the scheduler looks for the most deserving process to run.

Process selection The scheduler looks through the processes on the run queue looking for the most deserving process to run. If there are any real time processes (those with a real time scheduling policy) then those will get a higher weighting than ordinary processes. The weight for a normal process is its counter but for a real time process it is counter plus 1000. This means that if there are any runnable real time processes in the system then these will always be run before any normal runnable processes. The current process, which has consumed some of its time-slice (its counter has been decremented), is at a disadvantage if there are other processes with equal priority in the system; that is as it should be. If several processes have the same priority, the one nearest the front of the run queue is chosen. The current process will get put onto the back of the run queue. In a balanced system with many processes of the same priority, each one will run in turn. This is known as Round Robin scheduling. However, as processes wait for resources, their run order tends to get moved around. (A small sketch of this weighting appears after this list.)

Swap processes If the most deserving process to run is not the current process, then the current process must be suspended and the new one made to run. When a process is running it is using the registers and physical memory of the CPU and of the system. Each time it calls a routine it passes its arguments in registers and may stack saved values such as the address to return to in the calling routine. So, when the scheduler is running it is running in the context of the current process. It will be in a privileged mode, kernel mode, but it is still the current process that is running. When that process comes to be suspended, all of its machine state, including the program counter (PC) and all of the processor's registers, must be saved in the process's task_struct data structure. Then, all of the machine state for the new process must be loaded. This is a system dependent operation, no CPUs do this in quite the same way, but there is usually some hardware assistance for this act.

This swapping of process context takes place at the end of the scheduler. The saved context for the previous process is, therefore, a snapshot of the hardware context of the system as it was for this process at the end of the scheduler. Equally, when the context of the new process is loaded, it too will be a snapshot of the way things were at the end of the scheduler, including this process's program counter and register contents.
If the previous process or the new current process uses virtual memory then the system's page table entries may need to be updated. Again, this action is architecture specific. Processors like the Alpha AXP, which use Translation Look-aside Tables or cached Page Table Entries, must flush those cached table entries that belonged to the previous process.
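The weighting rule described under Process selection above can be sketched as follows; this is a simplified model of the idea, not the kernel's schedule() routine, and the task fields shown are pared down to the ones that matter here.

#include <stddef.h>
#include <stdio.h>

/* Reduced model of a run queue entry; names follow the text. */
struct task {
    const char *name;
    int counter;        /* remaining time-slice, in jiffies            */
    int real_time;      /* non-zero for a real time scheduling policy  */
};

/* Weight of a candidate: its counter for a normal process, counter plus
 * 1000 for a real time process, so any runnable real time process always
 * beats any normal one. */
static int goodness(const struct task *t)
{
    return t->real_time ? 1000 + t->counter : t->counter;
}

static const struct task *pick_next(const struct task *run_queue, int n)
{
    const struct task *best = NULL;
    int best_weight = -1;

    for (int i = 0; i < n; i++) {
        int w = goodness(&run_queue[i]);
        if (w > best_weight) {          /* ties keep the earlier entry */
            best_weight = w;
            best = &run_queue[i];
        }
    }
    return best;
}

int main(void)
{
    struct task rq[] = {
        { "editor",  7, 0 },
        { "compile", 2, 0 },
        { "audio",   3, 1 },     /* real time: always chosen first */
    };
    printf("next to run: %s\n", pick_next(rq, 3)->name);
    return 0;
}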

4.3.1 Scheduling in Multiprocessor Systems


Systems with multiple CPUs are reasonably rare in the Linux world but a lot of
work has already gone into making Linux an SMP (Symmetri Multi-Pro essing)
operating system. That is, one that is apable of evenly balan ing work between the
CPUs in the system. Nowhere is this balan ing of work more apparent than in the
s heduler.
In a multipro essor system, hopefully, all of the pro essors are busily running pro esses. Ea h will run the s heduler separately as its urrent pro ess exhausts its
time-sli e or has to wait for a system resour e. The rst thing to noti e about an
SMP system is that there is not just one idle pro ess in the system. In a single

Figure 4.1: A Process's Files (the task_struct points via fs to an fs_struct holding count, umask, *root and *pwd, and via files to a files_struct whose fd[0..255] array points to file structures with fields f_mode, f_pos, f_flags, f_count, f_owner, f_inode, f_op and f_version, the last of these pointing at the file operation routines)


In a single processor system the idle process is the first task in the task vector; in an SMP system there is one idle process per CPU, and you could have more than one idle CPU. Additionally there is one current process per CPU, so SMP systems must keep track of the current and idle processes for each processor.
In an SMP system each process's task_struct contains the number of the processor that it is currently running on (processor) and the processor number of the last processor that it ran on (last_processor). There is no reason why a process should not run on a different CPU each time it is selected to run but Linux can restrict a process to one or more processors in the system using the processor_mask. If bit N is set, then this process can run on processor N. When the scheduler is choosing a new process to run it will not consider one that does not have the appropriate bit set for the current processor's number in its processor_mask. The scheduler also gives a slight advantage to a process that last ran on the current processor because there is often a performance overhead when moving a process to a different processor.

4.4 Files
See include/linux/sched.h

Figure 4.1 shows that there are two data structures that describe file system specific information for each process in the system. The first, the fs_struct, contains pointers to this process's VFS inodes and its umask. The umask is the default mode that new files will be created in, and it can be changed via system calls.
The second data structure, the files_struct, contains information about all of the files that this process is currently using. Programs read from standard input and write to standard output. Any error messages should go to standard error. These may be files, terminal input/output or a real device but so far as the program is concerned they are all treated as files. Every file has its own descriptor and the files_struct contains pointers to up to 256 file data structures, each one describing a file being used by this process. The f_mode field describes what mode the file has been created in; read only, read and write or write only. f_pos holds the position in the file where the next read or write operation will occur. f_inode points at the VFS inode describing the file and f_op is a pointer to a vector of routine addresses; one for each function that you might wish to perform on a file. There is, for example, a write data function. This abstraction of the interface is very powerful and allows Linux to support a wide variety of file types. In Linux, pipes are implemented using this mechanism as we shall see later.
Every time a file is opened, one of the free file pointers in the files_struct is used to point to the new file structure. Linux processes expect three file descriptors to be open when they start. These are known as standard input, standard output and standard error and they are usually inherited from the creating parent process. All accesses to files are via standard system calls which pass or return file descriptors. These descriptors are indices into the process's fd vector, so standard input, standard output and standard error have file descriptors 0, 1 and 2. Each access to the file uses the file data structure's file operation routines together with the VFS inode to achieve its needs.
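As a rough illustration of these two structures, here is a cut-down model in C; the real definitions in include/linux/sched.h carry many more fields, and the types and values here are only for the example.

#include <stdio.h>

#define NR_OPEN 256

/* Cut-down models of the structures in Figure 4.1. */
struct file_model {
    unsigned short f_mode;          /* read only, write only, read/write     */
    long           f_pos;           /* where the next read or write happens  */
    /* f_inode, f_op and the other fields are omitted in this sketch */
};

struct files_struct_model {
    int                count;       /* processes sharing this table          */
    struct file_model *fd[NR_OPEN]; /* indexed by file descriptor            */
};

int main(void)
{
    static struct file_model tty = { 3 /* read and write, in this model */, 0 };
    struct files_struct_model files = { 1, { NULL } };

    /* Descriptors 0, 1 and 2 (stdin, stdout, stderr) usually point at
     * file structures inherited from the creating parent process. */
    files.fd[0] = files.fd[1] = files.fd[2] = &tty;

    printf("stdout position: %ld\n", files.fd[1]->f_pos);
    return 0;
}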

4.5 Virtual Memory


A process's virtual memory contains executable code and data from many sources. First there is the program image that is loaded; for example a command like ls. This command, like all executable images, is composed of both executable code and data. The image file contains all of the information necessary to load the executable code and associated program data into the virtual memory of the process. Secondly, processes can allocate (virtual) memory to use during their processing, say to hold the contents of files that they are reading. This newly allocated, virtual, memory needs to be linked into the process's existing virtual memory so that it can be used. Thirdly, Linux processes use libraries of commonly useful code, for example file handling routines. It does not make sense for each process to have its own copy of the library, so Linux uses shared libraries that can be used by several running processes at the same time. The code and the data from these shared libraries must be linked into this process's virtual address space and also into the virtual address space of the other processes sharing the library.
In any given time period a process will not have used all of the code and data contained within its virtual memory. It could contain code that is only used during certain situations, such as during initialization or to process a particular event. It may only have used some of the routines from its shared libraries. It would be wasteful to load all of this code and data into physical memory where it would lie unused. Multiply this wastage by the number of processes in the system and the system would run very inefficiently. Instead, Linux uses a technique called demand paging where the virtual memory of a process is brought into physical memory only when a process attempts to use it.

Figure 4.2: A Process's Virtual Memory (the task_struct's mm pointer leads to an mm_struct holding count, pgd, mmap, mmap_avl and mmap_sem, which points to a list of vm_area_struct structures; in this example one describes the code at 0x8048000 and another the data at 0x8059BB8, each with vm_end, vm_start, vm_flags, vm_inode, vm_ops and vm_next fields)


So, instead of loading the code and data into physical memory straight away, the Linux kernel alters the process's page table, marking the virtual areas as existing but not in memory. When the process attempts to access the code or data the system hardware will generate a page fault and hand control to the Linux kernel to fix things up. Therefore, for every area of virtual memory in the process's address space Linux needs to know where that virtual memory comes from and how to get it into memory so that it can fix up these page faults.
The Linux kernel needs to manage all of these areas of virtual memory and the contents of each process's virtual memory are described by an mm_struct data structure pointed at from its task_struct. The process's mm_struct data structure also contains information about the loaded executable image and a pointer to the process's page tables. It contains pointers to a list of vm_area_struct data structures, each representing an area of virtual memory within this process.
This linked list is in ascending virtual memory order; figure 4.2 shows the layout in virtual memory of a simple process together with the kernel data structures managing it. As those areas of virtual memory are from several sources, Linux abstracts the interface by having the vm_area_struct point to a set of virtual memory handling routines (via vm_ops). This way all of the process's virtual memory can be handled in a consistent way no matter how the underlying services managing that memory differ. For example there is a routine that will be called when the process attempts to access the memory and it does not exist; this is how page faults are handled.
The process's set of vm_area_struct data structures is accessed repeatedly by the Linux kernel as it creates new areas of virtual memory for the process and as it fixes up references to virtual memory not in the system's physical memory. This makes the time that it takes to find the correct vm_area_struct critical to the performance of the system. To speed up this access, Linux also arranges the vm_area_struct data structures into an AVL (Adelson-Velskii and Landis) tree. This tree is arranged so that each vm_area_struct (or node) has a left and a right pointer to its neighbouring vm_area_struct structure. The left pointer points to a node with a lower starting virtual address and the right pointer points to a node with a higher starting virtual address. To find the correct node, Linux goes to the root of the tree and follows each node's left and right pointers until it finds the right vm_area_struct. Of course, nothing is for free and inserting a new vm_area_struct into this tree takes additional processing time.
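The list search can be sketched in a few lines; the structure below keeps only the fields needed for the lookup and the function name find_vma_list is invented for this example (the AVL tree search follows the same idea but halves the candidates at each step).

#include <stddef.h>
#include <stdio.h>

/* Just enough of vm_area_struct to show the lookup. */
struct vm_area {
    unsigned long   vm_start;   /* first address of the region       */
    unsigned long   vm_end;     /* first address beyond the region   */
    struct vm_area *vm_next;    /* next region, ascending addresses  */
};

/* Walk the ordered list until a region containing the address is found. */
static struct vm_area *find_vma_list(struct vm_area *mmap, unsigned long addr)
{
    for (struct vm_area *vma = mmap; vma != NULL; vma = vma->vm_next)
        if (addr >= vma->vm_start && addr < vma->vm_end)
            return vma;
    return NULL;                /* fault outside any region */
}

int main(void)
{
    struct vm_area data = { 0x8059000, 0x805b000, NULL  };
    struct vm_area code = { 0x8048000, 0x8059000, &data };

    struct vm_area *hit = find_vma_list(&code, 0x8059BB8);
    printf(hit ? "address is in a valid region\n"
               : "address is not mapped\n");
    return 0;
}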
When a process allocates virtual memory, Linux does not actually reserve physical memory for the process. Instead, it describes the virtual memory by creating a new vm_area_struct data structure. This is linked into the process's list of virtual memory. When the process attempts to write to a virtual address within that new virtual memory region then the system will page fault. The processor will attempt to decode the virtual address, but as there are no Page Table Entries for any of this memory, it will give up and raise a page fault exception, leaving the Linux kernel to fix things up. Linux looks to see if the virtual address referenced is in the current process's virtual address space. If it is, Linux creates the appropriate PTEs and allocates a physical page of memory for this process. The code or data may need to be brought into that physical page from the filesystem or from the swap disk. The process can then be restarted at the instruction that caused the page fault and, this time as the memory physically exists, it may continue.

4.6 Creating a Process


When the system starts up it is running in kernel mode and there is, in a sense, only one process, the initial process. Like all processes, the initial process has a machine state represented by stacks, registers and so on. These will be saved in the initial process's task_struct data structure when other processes in the system are created and run. At the end of system initialization, the initial process starts up a kernel thread (called init) and then sits in an idle loop doing nothing. Whenever there is nothing else to do the scheduler will run this, idle, process. The idle process's task_struct is the only one that is not dynamically allocated; it is statically defined at kernel build time and is, rather confusingly, called init_task.
The init kernel thread or process has a process identifier of 1 as it is the system's first real process. It does some initial setting up of the system (such as opening the system console and mounting the root file system) and then executes the system initialization program. This is one of /etc/init, /bin/init or /sbin/init depending on your system. The init program uses /etc/inittab as a script file to create new processes within the system. These new processes may themselves go on to create new processes. For example the getty process may create a login process when a user attempts to login. All of the processes in the system are descended from the init kernel thread.
New processes are created by cloning old processes, or rather by cloning the current process. A new task is created by a system call (fork or clone) and the cloning happens within the kernel in kernel mode. At the end of the system call there is a new process waiting to run once the scheduler chooses it. A new task_struct data structure is allocated from the system's physical memory with one or more physical

See do_fork() in kernel/fork.c

pages for the cloned process's stacks (user and kernel). A new process identifier may be created, one that is unique within the set of process identifiers in the system. However, it is perfectly reasonable for the cloned process to keep its parent's process identifier. The new task_struct is entered into the task vector and the contents of the old (current) process's task_struct are copied into the cloned task_struct.
When cloning processes Linux allows the two processes to share resources rather than have two separate copies. This applies to the process's files, signal handlers and virtual memory. When the resources are to be shared their respective count fields are incremented so that Linux will not deallocate these resources until both processes have finished using them. So, for example, if the cloned process is to share virtual memory, its task_struct will contain a pointer to the mm_struct of the original process and that mm_struct has its count field incremented to show the number of current processes sharing it.
Cloning a process's virtual memory is rather tricky. A new set of vm_area_struct data structures must be generated together with their owning mm_struct data structure and the cloned process's page tables. None of the process's virtual memory is copied at this point. That would be a rather difficult and lengthy task for some of that virtual memory would be in physical memory, some in the executable image that the process is currently executing and possibly some would be in the swap file. Instead Linux uses a technique called "copy on write" which means that virtual memory will only be copied when one of the two processes tries to write to it. Any virtual memory that is not written to, even if it can be, will be shared between the two processes without any harm occurring. The read only memory, for example the executable code, will always be shared. For "copy on write" to work, the writeable areas have their page table entries marked as read only and the vm_area_struct data structures describing them are marked as "copy on write". When one of the processes attempts to write to this virtual memory a page fault will occur. It is at this point that Linux will make a copy of the memory and fix up the two processes' page tables and virtual memory data structures.
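A toy model of "copy on write" may make the sequence clearer; everything here (the structures, the reference counting, the cow_fault() helper) is a user level illustration of the idea rather than the kernel's implementation.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Both processes start out mapping the same read-only page; the first
 * write fault gives the writer its own private copy. */
struct page {
    int  users;                    /* how many processes map this page */
    char data[16];
};

struct mapping {
    struct page *page;
    int          write_protected;  /* PTE marked read only for COW */
};

static void cow_fault(struct mapping *m)
{
    if (m->page->users > 1) {
        struct page *copy = malloc(sizeof(*copy));
        memcpy(copy->data, m->page->data, sizeof(copy->data));
        copy->users = 1;
        m->page->users--;          /* original page keeps its other user */
        m->page = copy;
    }
    m->write_protected = 0;        /* this process may now write freely  */
}

int main(void)
{
    struct page shared = { 2, "hello" };
    struct mapping parent = { &shared, 1 };
    struct mapping child  = { &shared, 1 };

    /* The child writes: it faults, gets its own copy, and modifies that. */
    cow_fault(&child);
    strcpy(child.page->data, "world");

    printf("parent sees '%s', child sees '%s'\n",
           parent.page->data, child.page->data);
    return 0;
}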

4.7 Times and Timers

See kernel/itimer.c

The kernel keeps track of a process's creation time as well as the CPU time that it consumes during its lifetime. Each clock tick, the kernel updates the amount of time in jiffies that the current process has spent in system and in user mode.
In addition to these accounting timers, Linux supports process specific interval timers. A process can use these timers to send itself various signals each time that they expire. Three sorts of interval timers are supported:

Real the timer ticks in real time, and when the timer has expired, the process is sent a SIGALRM signal.

Virtual This timer only ticks when the process is running and when it expires it sends a SIGVTALRM signal.

Profile This timer ticks both when the process is running and when the system is executing on behalf of the process itself. SIGPROF is signalled when it expires.

One or all of the interval timers may be running and Linux keeps all of the necessary information in the process's task_struct data structure. System calls can be made to set up these interval timers and to start them, stop them and read their current values. The virtual and profile timers are handled the same way. Every clock tick the current process's interval timers are decremented and, if they have expired, the appropriate signal is sent.

See do_it_virtual() and do_it_prof() in kernel/sched.c

Real time interval timers are a little different and for these Linux uses the timer mechanism described in Chapter 11. Each process has its own timer_list data structure and, when the real interval timer is running, this is queued on the system timer list. When the timer expires the timer bottom half handler removes it from the queue and calls the interval timer handler. This generates the SIGALRM signal and restarts the interval timer, adding it back into the system timer queue.

See it_real_fn() in kernel/itimer.c

Figure 4.3: Registered Binary Formats (a singly linked list of linux_binfmt structures, each with next, use_count, *load_binary(), *load_shlib() and *core_dump() pointers; referred to in Section 4.8)

4.8 Executing Programs


In Linux, as in UnixTM, programs and commands are normally executed by a command interpreter. A command interpreter is a user process like any other process and is called a shell2. There are many shells in Linux, some of the most popular are sh, bash and tcsh. With the exception of a few built in commands, such as cd and pwd, a command is an executable binary file. For each command entered, the shell searches the directories in the process's search path, held in the PATH environment variable, for an executable image with a matching name. If the file is found it is loaded and executed. The shell clones itself using the fork mechanism described above and then the new child process replaces the binary image that it was executing, the shell, with the contents of the executable image file just found. Normally the shell waits for the command to complete, or rather for the child process to exit. You can cause the shell to run again by pushing the child process to the background by typing control-Z, which causes a SIGSTOP signal to be sent to the child process, stopping it. You then use the shell command bg to push it into the background; the shell sends it a SIGCONT signal to restart it, where it will stay until either it ends or it needs to do terminal input or output.
An executable file can have many formats or even be a script file. Script files have to be recognized and the appropriate interpreter run to handle them; for example /bin/sh interprets shell scripts. Executable object files contain executable code and data together with enough information to allow the operating system to load them into memory and execute them. The most commonly used object file format used by Linux is ELF but, in theory, Linux is flexible enough to handle almost any object file format.

2 Think of a nut: the kernel is the edible bit in the middle and the shell goes around it, providing an interface.


ELF Executable Image

  Header:              e_ident "ELF", e_entry 0x8048090, e_phoff 52,
                       e_phentsize 32, e_phnum 2

  Physical header 1    (Code): p_type PT_LOAD, p_offset 0,
                       p_vaddr 0x8048000, p_filesz 68532, p_memsz 68532,
                       p_flags PF_R, PF_X

  Physical header 2    (Data): p_type PT_LOAD, p_offset 68536,
                       p_vaddr 0x8059BB8, p_filesz 2200, p_memsz 4248,
                       p_flags PF_R, PF_W

Figure 4.4: ELF Executable File Format

See do_execve() in fs/exec.c

As with file systems, the binary formats supported by Linux are either built into the kernel at kernel build time or available to be loaded as modules. The kernel keeps a list of supported binary formats (see figure 4.3) and when an attempt is made to execute a file, each binary format is tried in turn until one works. Commonly supported Linux binary formats are a.out and ELF. Executable files do not have to be read completely into memory; a technique known as demand loading is used. As each part of the executable image is used by a process it is brought into memory. Unused parts of the image may be discarded from memory.

4.8.1 ELF

See include/linux/elf.h

The ELF (Executable and Linkable Format) object file format, designed by the Unix System Laboratories, is now firmly established as the most commonly used format in Linux. Whilst there is a slight performance overhead when compared with other object file formats such as ECOFF and a.out, ELF is felt to be more flexible. ELF executable files contain executable code, sometimes referred to as text, and data. Tables within the executable image describe how the program should be placed into the process's virtual memory. Statically linked images are built by the linker (ld), or link editor, into one single image containing all of the code and data needed to run this image. The image also specifies the layout in memory of this image and the address in the image of the first code to execute.
Figure 4.4 shows the layout of a statically linked ELF executable image. It is a simple C program that prints "hello world" and then exits. The header describes it as an ELF image with two physical headers (e_phnum is 2) starting 52 bytes (e_phoff) from the start of the image file. The first physical header describes the executable code in the image. It goes at virtual address 0x8048000 and there is 65532 bytes of it. This is because it is a statically linked image which contains all of the library code for the printf() call to output "hello world". The entry point for the image, the first instruction for the program, is not at the start of the image but at virtual address 0x8048090 (e_entry). The code starts immediately after the second physical header. This physical header describes the data for the program and is to be loaded into virtual memory at address 0x8059BB8. This data is both readable and writeable. You will notice that the size of the data in the file is 2200 bytes (p_filesz) whereas its size in memory is 4248 bytes. This is because the first 2200 bytes contain pre-initialized data and the next 2048 bytes contain data that will be initialized by the executing code.
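The same information can be read back with a few lines of C using the declarations in elf.h; this little dumper is only an illustration (it assumes a 32-bit ELF executable and error handling is all but omitted) but it prints the fields shown in Figure 4.4.

#include <elf.h>
#include <stdio.h>

/* Print the program headers of a 32-bit ELF executable, much as the
 * loader must examine them before building the virtual memory layout. */
int main(int argc, char *argv[])
{
    if (argc != 2)
        return 1;

    FILE *f = fopen(argv[1], "rb");
    if (!f)
        return 1;

    Elf32_Ehdr eh;
    fread(&eh, sizeof(eh), 1, f);
    printf("entry point %#lx, %lu program headers at offset %lu\n",
           (unsigned long)eh.e_entry, (unsigned long)eh.e_phnum,
           (unsigned long)eh.e_phoff);

    fseek(f, (long)eh.e_phoff, SEEK_SET);
    for (int i = 0; i < eh.e_phnum; i++) {
        Elf32_Phdr ph;
        fread(&ph, sizeof(ph), 1, f);
        if (ph.p_type == PT_LOAD)
            printf("  load %lu bytes from offset %lu to %#lx (%lu in memory)\n",
                   (unsigned long)ph.p_filesz, (unsigned long)ph.p_offset,
                   (unsigned long)ph.p_vaddr, (unsigned long)ph.p_memsz);
    }
    fclose(f);
    return 0;
}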
When Linux loads an ELF executable image into the process's virtual address space, it does not actually load the image. It sets up the virtual memory data structures, the process's vm_area_struct tree and its page tables. When the program is executed page faults will cause the program's code and data to be fetched into physical memory. Unused portions of the program will never be loaded into memory. Once the ELF binary format loader is satisfied that the image is a valid ELF executable image it flushes the process's current executable image from its virtual memory. As this process is a cloned image (all processes are) this, old, image is the program that the parent process was executing, for example the command interpreter shell such as bash. This flushing of the old executable image discards the old virtual memory data structures and resets the process's page tables. It also clears away any signal handlers that were set up and closes any files that are open. At the end of the flush the process is ready for the new executable image. No matter what format the executable image is, the same information gets set up in the process's mm_struct. There are pointers to the start and end of the image's code and data. These values are found as the ELF executable image's physical headers are read and the sections of the program that they describe are mapped into the process's virtual address space. That is also when the vm_area_struct data structures are set up and the process's page tables are modified. The mm_struct data structure also contains pointers to the parameters to be passed to the program and to this process's environment variables.

ELF Shared Libraries


A dynamically linked image, on the other hand, does not contain all of the code and data required to run. Some of it is held in shared libraries that are linked into the image at run time. The ELF shared library's tables are also used by the dynamic linker when the shared library is linked into the image at run time. Linux uses several dynamic linkers, ld.so.1, libc.so.1 and ld-linux.so.1, all to be found in /lib. The libraries contain commonly used code such as language subroutines. Without dynamic linking, all programs would need their own copy of these libraries and would need far more disk space and virtual memory. In dynamic linking, information is included in the ELF image's tables for every library routine referenced. The information indicates to the dynamic linker how to locate the library routine and link it into the program's address space.
REVIEW NOTE: Do I need more detail here, worked example?

See do_load_elf_binary() in fs/binfmt_elf.c

4.8.2 Script Files


See do_load_script() in fs/binfmt_script.c

Script files are executables that need an interpreter to run them. There are a wide variety of interpreters available for Linux; for example wish, perl and command shells such as tcsh. Linux uses the standard UnixTM convention of having the first line of a script file contain the name of the interpreter. So, a typical script file would start:

#!/usr/bin/wish

The script binary loader tries to find the interpreter for the script. It does this by attempting to open the executable file that is named in the first line of the script. If it can open it, it has a pointer to its VFS inode and it can go ahead and have it interpret the script file. The name of the script file becomes argument zero (the first argument) and all of the other arguments move up one place (the original first argument becomes the new second argument and so on). Loading the interpreter is done in the same way as Linux loads all of its executable files. Linux tries each binary format in turn until one works. This means that you could in theory stack several interpreters and binary formats, making the Linux binary format handler a very flexible piece of software.

Chapter 5

Interprocess Communication Mechanisms

Processes communicate with each other and with the kernel to coordinate their activities. Linux supports a number of Inter-Process Communication (IPC) mechanisms. Signals and pipes are two of them but Linux also supports the System V IPC mechanisms, named after the UnixTM release in which they first appeared.

5.1 Signals
Signals are one of the oldest inter-process communication methods used by UnixTM systems. They are used to signal asynchronous events to one or more processes. A signal could be generated by a keyboard interrupt or an error condition such as the process attempting to access a non-existent location in its virtual memory. Signals are also used by the shells to signal job control commands to their child processes.
There are a set of defined signals that the kernel can generate or that can be generated by other processes in the system, provided that they have the correct privileges. You can list a system's set of signals using the kill command (kill -l); on my Intel Linux box this gives:
 1) SIGHUP      2) SIGINT      3) SIGQUIT     4) SIGILL
 5) SIGTRAP     6) SIGIOT      7) SIGBUS      8) SIGFPE
 9) SIGKILL    10) SIGUSR1    11) SIGSEGV    12) SIGUSR2
13) SIGPIPE    14) SIGALRM    15) SIGTERM    17) SIGCHLD
18) SIGCONT    19) SIGSTOP    20) SIGTSTP    21) SIGTTIN
22) SIGTTOU    23) SIGURG     24) SIGXCPU    25) SIGXFSZ
26) SIGVTALRM  27) SIGPROF    28) SIGWINCH   29) SIGIO
30) SIGPWR


The numbers are different for an Alpha AXP Linux box. Processes can choose to ignore most of the signals that are generated, with two notable exceptions: neither the SIGSTOP signal which causes a process to halt its execution nor the SIGKILL signal which causes a process to exit can be ignored. Otherwise though, a process can choose just how it wants to handle the various signals. Processes can block the signals and, if they do not block them, they can either choose to handle them themselves or allow the kernel to handle them. If the kernel handles the signals, it will do the default actions required for this signal. For example, the default action when a process receives the SIGFPE (floating point exception) signal is to core dump and then exit. Signals have no inherent relative priorities. If two signals are generated for a process at the same time then they may be presented to the process or handled in any order. Also there is no mechanism for handling multiple signals of the same kind. There is no way that a process can tell if it received 1 or 42 SIGCONT signals.
Linux implements signals using information stored in the task_struct for the process. The number of supported signals is limited to the word size of the processor. Processors with a word size of 32 bits can have 32 signals whereas 64 bit processors like the Alpha AXP may have up to 64 signals. The currently pending signals are kept in the signal field with a mask of blocked signals held in blocked. With the exception of SIGSTOP and SIGKILL, all signals can be blocked. If a blocked signal is generated, it remains pending until it is unblocked. Linux also holds information about how each process handles every possible signal and this is held in an array of sigaction data structures pointed at by the task_struct for each process. Amongst other things each contains either the address of a routine that will handle the signal or a flag which tells Linux that the process either wishes to ignore this signal or let the kernel handle the signal for it. The process modifies the default signal handling by making system calls and these calls alter the sigaction for the appropriate signal as well as the blocked mask.
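From user space the same machinery is reached through the sigaction() system call; the short example below installs a handler and asks for an additional signal to be blocked while it runs, and is ordinary portable C rather than anything kernel specific.

#include <signal.h>
#include <unistd.h>

static void on_int(int sig)
{
    (void)sig;
    /* write() is safe to call from a signal handler */
    write(STDOUT_FILENO, "caught SIGINT\n", 14);
}

int main(void)
{
    struct sigaction sa;

    sa.sa_handler = on_int;                /* routine to call             */
    sigemptyset(&sa.sa_mask);              /* extra signals to block ...  */
    sigaddset(&sa.sa_mask, SIGQUIT);       /* ... while the handler runs  */
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);

    /* SIGSTOP and SIGKILL can be neither caught nor blocked. */
    pause();                               /* wait for a signal           */
    return 0;
}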
Not every process in the system can send signals to every other process; the kernel can and super users can. Normal processes can only send signals to processes with the same uid and gid or to processes in the same process group1. Signals are generated by setting the appropriate bit in the task_struct's signal field. If the process has not blocked the signal and is waiting but interruptible (in state Interruptible) then it is woken up by changing its state to Running and making sure that it is in the run queue. That way the scheduler will consider it a candidate for running when the system next schedules. If the default handling is needed, then Linux can optimize the handling of the signal. For example if the signal is SIGWINCH (the X window changed focus) and the default handler is being used then there is nothing to be done.
Signals are not presented to the process immediately after they are generated; they must wait until the process is running again. Every time a process exits from a system call its signal and blocked fields are checked and, if there are any unblocked signals, they can now be delivered. This might seem a very unreliable method but every process in the system is making system calls, for example to write a character to the terminal, all of the time. Processes can elect to wait for signals if they wish; they are suspended in state Interruptible until a signal is presented. The Linux signal processing code looks at the sigaction structure for each of the current unblocked signals.

1 REVIEW NOTE: Explain process groups.

If a signal's handler is set to the default action then the kernel will handle it. The

SIGSTOP signal's default handler will change the current process's state to Stopped and then run the scheduler to select a new process to run. The default action for the SIGFPE signal will core dump the process and then cause it to exit. Alternatively, the process may have specified its own signal handler. This is a routine which will be called whenever the signal is generated and the sigaction structure holds the address of this routine. The kernel must call the process's signal handling routine and how this happens is processor specific, but all CPUs must cope with the fact that the current process is running in kernel mode and is just about to return to the process that called the kernel or system routine in user mode. The problem is solved by manipulating the stack and registers of the process. The process's program counter is set to the address of its signal handling routine and the parameters to the routine are added to the call frame or passed in registers. When the process resumes operation it appears as if the signal handling routine were called normally.
Linux is POSIX compatible and so the process can specify which signals are blocked when a particular signal handling routine is called. This means changing the blocked mask during the call to the process's signal handler. The blocked mask must be returned to its original value when the signal handling routine has finished. Therefore Linux adds a call to a tidy up routine, which will restore the original blocked mask, onto the call stack of the signalled process. Linux also optimizes the case where several signal handling routines need to be called by stacking them so that each time one handling routine exits, the next one is called until the tidy up routine is called.

5.2 Pipes
The common Linux shells all allow redirection. For example
$ ls | pr | lpr

pipes the output from the ls command listing the directory's files into the standard input of the pr command which paginates them. Finally the standard output from the pr command is piped into the standard input of the lpr command which prints the results on the default printer. Pipes then are unidirectional byte streams which connect the standard output from one process into the standard input of another process. Neither process is aware of this redirection and behaves just as it would normally. It is the shell which sets up these temporary pipes between the processes.
In Linux, a pipe is implemented using two file data structures which both point at the same temporary VFS inode which itself points at a physical page within memory. Figure 5.1 shows that each file data structure contains pointers to different file operation routine vectors; one for writing to the pipe, the other for reading from the pipe. This hides the underlying differences from the generic system calls which read and write to ordinary files. As the writing process writes to the pipe, bytes are copied into the shared data page and when the reading process reads from the pipe, bytes are copied from the shared data page. Linux must synchronize access to the pipe. It must make sure that the reader and the writer of the pipe are in step and to do this it uses locks, wait queues and signals.

See

in lude/linux/inode fs i.h

Process 1

Process 2

file

file

f_mode

f_mode

f_pos

f_pos

f_flags

f_flags

f_count

f_count

f_owner

f_owner

f_inode

f_inode

f_op

inode

f_version

f_op
f_version

Data Page

Pipe
Write
Operations

Pipe
Read
Operations

Figure 5.1: Pipes

See

pipe write() in
fs/pipe.

See pipe read()


in fs/pipe.

data stru tures, ea h one representing an open le or, as in this ase, an open pipe.
The Linux system all uses the write routine pointed at by the file data stru ture
des ribing this pipe. That write routine uses information held in the VFS inode
representing the pipe to manage the write request. If there is enough room to write
all of the bytes into the pipe and, so long as the pipe is not lo ked by its reader,
Linux lo ks it for the writer and opies the bytes to be written from the pro ess's
address spa e into the shared data page. If the pipe is lo ked by the reader or if
there is not enough room for the data then the urrent pro ess is made to sleep on
the pipe inode's wait queue and the s heduler is alled so that another pro ess an
run. It is interruptible, so it an re eive signals and it will be woken by the reader
when there is enough room for the write data or when the pipe is unlo ked. When
the data has been written, the pipe's VFS inode is unlo ked and any waiting readers
sleeping on the inode's wait queue will themselves be woken up.
Reading data from the pipe is a very similar process to writing to it. Processes are
allowed to do non-blocking reads (it depends on the mode in which they opened the
file or pipe) and, in this case, if there is no data to be read or if the pipe is
locked, an error will be returned. This means that the process can continue to run.
The alternative is to wait on the pipe inode's wait queue until the write process has
finished. When both processes have finished with the pipe, the pipe inode is discarded
along with the shared data page.
Linux also supports named pipes, also known as FIFOs because pipes operate on a
First In, First Out principle. The first data written into the pipe is the first data
read from the pipe. Unlike pipes, FIFOs are not temporary objects; they are entities
in the file system and can be created using the mkfifo command. Processes are free to
use a FIFO so long as they have appropriate access rights to it. The way that FIFOs
are opened is a little different from pipes. A pipe (its two file data structures, its
VFS inode and the shared data page) is created in one go whereas a FIFO already
exists and is opened and closed by its users. Linux must handle readers opening
the FIFO before writers open it as well as readers reading before any writers have
written to it. That aside, FIFOs are handled almost exactly the same way as pipes
and they use the same data structures and operations.
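A minimal user-space sketch may make the mechanism concrete: the shell's plumbing
reduces to the pipe() and fork() system calls. The message text and buffer size below
are arbitrary and most error handling is omitted for brevity.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];                 /* fds[0] is the read end, fds[1] the write end */
    char buf[32];

    if (pipe(fds) < 0)
        return 1;

    if (fork() == 0) {          /* child: the reading process */
        close(fds[1]);
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("child read: %s\n", buf);
        }
        _exit(0);
    }

    close(fds[0]);              /* parent: the writing process */
    write(fds[1], "hello", 5);
    close(fds[1]);
    wait(NULL);
    return 0;
}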

5.3 Sockets
REVIEW NOTE: Add when networking chapter written.

5.3.1 System V IPC Mechanisms

Linux supports three types of interprocess communication mechanisms that first
appeared in UnixTM System V (1983). These are message queues, semaphores and
shared memory. These System V IPC mechanisms all share common authentication
methods. Processes may access these resources only by passing a unique reference
identifier to the kernel via system calls. Access to these System V IPC objects is
checked using access permissions, much like accesses to files are checked. The access
rights to the System V IPC object are set by the creator of the object via system
calls. The object's reference identifier is used by each mechanism as an index into a
table of resources. It is not a straightforward index but requires some manipulation
to generate the index.
All Linux data structures representing System V IPC objects in the system include
an ipc_perm structure which contains the owner and creator process's user and group
identifiers, the access mode for this object (owner, group and other) and the IPC
object's key. The key is used as a way of locating the System V IPC object's reference
identifier. Two sets of keys are supported: public and private. If the key is public
then any process in the system, subject to rights checking, can find the reference
identifier for the System V IPC object. System V IPC objects can never be referenced
with a key, only by their reference identifier.

See include/linux/ipc.h

5.3.2 Message Queues

Message queues allow one or more processes to write messages which will be read by
one or more reading processes. Linux maintains a list of message queues, the msgque
vector; each element of which points to a msqid_ds data structure that fully describes
the message queue. When message queues are created, a new msqid_ds data structure
is allocated from system memory and inserted into the vector.
Each msqid_ds data structure contains an ipc_perm data structure and pointers to
the messages entered onto this queue. In addition, Linux keeps queue modification
times such as the last time that this queue was written to and so on. The msqid_ds
also contains two wait queues; one for the writers to the queue and one for the
readers of the message queue.

See include/linux/msg.h

[Figure 5.2: System V IPC Message Queues - a msqid_ds data structure (ipc, times,
*wwait, *rwait, msg_qnum, *msg_first, *msg_last) pointing at a chain of msg data
structures, each holding msg_next, msg_type, *msg_spot, msg_stime, msg_ts and the
message text itself.]

Each time a process attempts to write a message to the write queue its effective user
and group identifiers are compared with the mode in this queue's ipc_perm data
structure. If the process can write to the queue then the message may be copied from
the process's address space into a msg data structure and put at the end of this
message queue. Each message is tagged with an application specific type, agreed
between the cooperating processes. However, there may be no room for the message
as Linux restricts the number and length of messages that can be written. In this
case the process will be added to this message queue's write wait queue and the
scheduler will be called to select a new process to run. It will be woken up when one
or more messages have been read from this message queue.
Reading from the queue is a similar process. Again, the process's access rights to
the write queue are checked. A reading process may choose to either get the first
message in the queue regardless of its type or select messages with particular types.
If no messages match these criteria the reading process will be added to the message
queue's read wait queue and the scheduler run. When a new message is written to
the queue this process will be woken up and run again.
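From user space the mechanism above is driven by the msgget(), msgsnd() and msgrcv()
system calls. The sketch below is illustrative only; the message type value, text and
permissions are arbitrary and error handling is omitted.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct my_msg {
    long mtype;                 /* application specific message type */
    char mtext[64];
};

int main(void)
{
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    struct my_msg m;

    m.mtype = 1;
    strcpy(m.mtext, "hello queue");
    msgsnd(qid, &m, sizeof(m.mtext), 0);        /* the writer side          */

    msgrcv(qid, &m, sizeof(m.mtext), 1, 0);     /* read a message of type 1 */
    printf("received: %s\n", m.mtext);

    msgctl(qid, IPC_RMID, NULL);                /* remove the queue again   */
    return 0;
}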

5.3.3 Semaphores

In its simplest form a semaphore is a location in memory whose value can be tested
and set by more than one process. The test and set operation is, so far as each process
is concerned, uninterruptible or atomic; once started nothing can stop it. The result
of the test and set operation is the addition of the current value of the semaphore
and the set value, which can be positive or negative. Depending on the result of the
test and set operation one process may have to sleep until the semaphore's value is
changed by another process. Semaphores can be used to implement critical regions,
areas of critical code that only one process at a time should be executing.
Say you had many cooperating processes reading records from and writing records
to a single data file. You would want that file access to be strictly coordinated. You
could use a semaphore with an initial value of 1 and, around the file operating code,
put two semaphore operations, the first to test and decrement the semaphore's value
and the second to test and increment it. The first process to access the file would
try to decrement the semaphore's value and it would succeed, the semaphore's value
now being 0. This process can now go ahead and use the data file but if another
process wishing to use it now tries to decrement the semaphore's value it would fail
as the result would be -1. That process will be suspended until the first process has
finished with the data file. When the first process has finished with the data file it
will increment the semaphore's value, making it 1 again. Now the waiting process
can be woken and this time its attempt to increment the semaphore will succeed.

[Figure 5.3: System V IPC Semaphores - a semid_ds data structure (ipc, times,
sem_base, sem_pending, sem_pending_last, undo, sem_nsems) pointing at an array of
semaphores, a list of sem_undo data structures (proc_next, id_next, semid, semadj)
and a queue of sem_queue data structures (next, prev, sleeper, undo, pid, status,
sma, sops, nsops).]

See include/linux/sem.h

System V IPC semaphore objects each describe a semaphore array and Linux uses the
semid_ds data structure to represent this. All of the semid_ds data structures in the
system are pointed at by semary, a vector of pointers. There are sem_nsems in each
semaphore array, each one described by a sem data structure pointed at by sem_base.
All of the processes that are allowed to manipulate the semaphore array of a System V
IPC semaphore object may make system calls that perform operations on them. The
system call can specify many operations and each operation is described by three
inputs: the semaphore index, the operation value and a set of flags. The semaphore
index is an index into the semaphore array and the operation value is a numerical
value that will be added to the current value of the semaphore. First Linux tests
whether or not all of the operations would succeed. An operation will succeed if the
operation value added to the semaphore's current value would be greater than zero
or if both the operation value and the semaphore's current value are zero. If any of
the semaphore operations would fail Linux may suspend the process, but only if the
operation flags have not requested that the system call is non-blocking. If the process
is to be suspended then Linux must save the state of the semaphore operations to be
performed and put the current process onto a wait queue. It does this by building
a sem_queue data structure on the stack and filling it out. The new sem_queue data
structure is put at the end of this semaphore object's wait queue (using the
sem_pending and sem_pending_last pointers). The current process is put on the wait
queue in the sem_queue data structure (sleeper) and the scheduler is called to choose
another process to run.
If all of the semaphore operations would have succeeded and the current process
does not need to be suspended, Linux goes ahead and applies the operations to the
appropriate members of the semaphore array. Now Linux must check that any waiting,
suspended, processes may now apply their semaphore operations. It looks at each
member of the operations pending queue (sem_pending) in turn, testing to see if the
semaphore operations will succeed this time. If they will, then it removes the
sem_queue data structure from the operations pending list and applies the semaphore
operations to the semaphore array. It wakes up the sleeping process, making it
available to be restarted the next time the scheduler runs. Linux keeps looking
through the pending list from the start until there is a pass where no semaphore
operations can be applied and so no more processes can be woken.
There is a problem with semaphores: deadlocks. These occur when one process has
altered the semaphore's value as it enters a critical region but then fails to leave
the critical region because it crashed or was killed. Linux protects against this by
maintaining lists of adjustments to the semaphore arrays. The idea is that when these
adjustments are applied, the semaphores will be put back to the state that they were
in before a process's set of semaphore operations were applied. These adjustments
are kept in sem_undo data structures queued both on the semid_ds data structure and
on the task_struct data structure for the processes using these semaphore arrays.
Each individual semaphore operation may request that an adjustment be maintained.
Linux will maintain at most one sem_undo data structure per process for each
semaphore array. If the requesting process does not have one, then one is created
when it is needed. The new sem_undo data structure is queued both onto this process's
task_struct data structure and onto the semaphore array's semid_ds data structure.
As operations are applied to the semaphores in the semaphore array, the negation of
the operation value is added to this semaphore's entry in the adjustment array of
this process's sem_undo data structure. So, if the operation value is 2, then -2 is
added to the adjustment entry for this semaphore.
When processes are deleted, as they exit Linux works through their set of sem_undo
data structures applying the adjustments to the semaphore arrays. If a semaphore set
is deleted, the sem_undo data structures are left queued on the process's task_struct
but the semaphore array identifier is made invalid. In this case the semaphore clean
up code simply discards the sem_undo data structure.
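Seen from user space, the critical-region pattern described earlier in this section
uses semget(), semctl() and semop(). The sketch below is a minimal illustration; the
SEM_UNDO flag asks the kernel to keep the adjustment record discussed above, and error
handling is omitted.

#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    struct sembuf enter = { 0, -1, SEM_UNDO };  /* test and decrement */
    struct sembuf leave = { 0, +1, SEM_UNDO };  /* test and increment */
    union semun arg;

    arg.val = 1;                    /* initial value 1: the region is free */
    semctl(semid, 0, SETVAL, arg);

    semop(semid, &enter, 1);        /* enter the critical region */
    /* ... read from and write to the shared data file here ... */
    semop(semid, &leave, 1);        /* leave the critical region */

    semctl(semid, 0, IPC_RMID);     /* remove the semaphore array */
    return 0;
}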

5.3.4 Shared Memory

Shared memory allows one or more processes to communicate via memory that appears
in all of their virtual address spaces. The pages of this virtual memory are referenced
by page table entries in each of the sharing processes' page tables. The memory does
not have to be at the same address in all of the processes' virtual address spaces. As
with all System V IPC objects, access to shared memory areas is controlled via keys
and access rights checking. Once the memory is being shared, there are no checks on
how the processes are using it. They must rely on other mechanisms, for example
System V semaphores, to synchronize access to the memory.
See include/linux/shm.h

Each newly created shared memory area is represented by a shmid_ds data structure.
These are kept in the shm_segs vector. The shmid_ds data structure describes how
big the area of shared memory is, how many processes are using it and information
about how that shared memory is mapped into their address spaces. It is the creator
of the shared memory that controls the access permissions to that memory and
whether its key is public or private. If it has enough access rights it may also lock
the shared memory into physical memory.

[Figure 5.4: System V IPC Shared Memory - a shmid_ds data structure (ipc, shm_segsz,
times, shm_npages, shm_pages) holding a table of page table entries and the list of
attaching vm_area_struct data structures linked by vm_next_shared.]


Ea h pro ess that wishes to share the memory must atta h to that virtual memory
via a system all. This reates a new vm area stru t data stru ture des ribing the
shared memory for this pro ess. The pro ess an hoose where in its virtual address
spa e the shared memory goes or it an let Linux hoose a free area large enough.
The new vm area stru t stru ture is put into the list of vm area stru t pointed
at by the shmid ds. The vm next shared and vm prev shared pointers are used to
link them together. The virtual memory is not a tually reated during the atta h;
it happens when the rst pro ess attempts to a ess it.
The rst time that a pro ess a esses one of the pages of the shared virtual memory, a page fault will o ur. When Linux xes up that page fault it nds the
vm area stru t data stru ture des ribing it. This ontains pointers to handler routines for this type of shared virtual memory. The shared memory page fault handling
ode looks in the list of page table entries for this shmid ds to see if one exists for
this page of the shared virtual memory. If it does not exist, it will allo ate a physi al
page and reate a page table entry for it. As well as going into the urrent pro ess's
page tables, this entry is saved in the shmid ds. This means that when the next
pro ess that attempts to a ess this memory gets a page fault, the shared memory
fault handling ode will use this newly reated physi al page for that pro ess too. So,
the rst pro ess that a esses a page of the shared memory auses it to be reated
and thereafter a ess by the other pro esses ause that page to be added into their
virtual address spa es.
When pro esses no longer wish to share the virtual memory, they deta h from it.
So long as other pro esses are still using the memory the deta h only a e ts the
urrent pro ess. Its vm area stru t is removed from the shmid ds data stru ture
and deallo ated. The urrent pro ess's page tables are updated to invalidate the area
of virtual memory that it used to share. When the last pro ess sharing the memory
deta hes from it, the pages of the shared memory urrent in physi al memory are
freed, as is the shmid ds data stru ture for this shared memory.
Further ompli ations arise when shared virtual memory is not lo ked into physi al
memory. In this ase the pages of the shared memory may be swapped out to
the system's swap disk during periods of high memory usage. How shared memory

memory is swapped into and out of physi al memory is des ribed in Chapter 3.
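From user space, the attach and detach operations described above correspond to the
shmget(), shmat(), shmdt() and shmctl() system calls. The sketch below is a minimal
illustration; the segment size is arbitrary and error handling is omitted. As the text
notes, a real application would also use a semaphore to synchronize access.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    char *mem = shmat(shmid, NULL, 0);      /* let Linux pick the address       */

    strcpy(mem, "shared hello");            /* first touch causes the page fault */
    printf("%s\n", mem);

    shmdt(mem);                             /* detach from this process          */
    shmctl(shmid, IPC_RMID, NULL);          /* mark the segment for deletion     */
    return 0;
}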

Chapter 6

PCI

Peripheral Component Interconnect (PCI), as its name implies, is a standard that
describes how to connect the peripheral components of a system together in a
structured and controlled way. The standard [3, PCI Local Bus Specification]
describes the way that the system components are electrically connected and the
way that they should behave. This chapter looks at how the Linux kernel initializes
the system's PCI buses and devices.
Figure 6.1 is a logical diagram of an example PCI based system. The PCI buses
and PCI-PCI bridges are the glue connecting the system components together; the
CPU is connected to PCI bus 0, the primary PCI bus, as is the video device. A
special PCI device, a PCI-PCI bridge, connects the primary bus to the secondary
PCI bus, PCI bus 1. In the jargon of the PCI specification, PCI bus 1 is described
as being downstream of the PCI-PCI bridge and PCI bus 0 is upstream of the
bridge. Connected to the secondary PCI bus are the SCSI and ethernet devices for
the system. Physically the bridge, secondary PCI bus and two devices would all be
contained on the same combination PCI card. The PCI-ISA bridge in the system
supports older, legacy ISA devices and the diagram shows a super I/O controller
chip, which controls the keyboard, mouse and floppy.1

1 For example?

6.1 PCI Address Spaces

The CPU and the PCI devices need to access memory that is shared between them.
This memory is used by device drivers to control the PCI devices and to pass
information between them. Typically the shared memory contains control and status
registers for the device. These registers are used to control the device and to read
its status. For example, the PCI SCSI device driver would read its status register to
find out if the SCSI device was ready to write a block of information to the SCSI
disk. Or it might write to the control register to start the device running after it
has been turned on.

[Figure 6.1: Example PCI Based System - the CPU and video device sit on PCI bus 0
(upstream); a PCI-PCI bridge connects it to PCI bus 1 (downstream) carrying the SCSI
and ethernet devices; a PCI-ISA bridge connects PCI bus 0 to the ISA bus and its
super I/O controller.]

The CPU's system memory could be used for this shared memory but if it were, then
every time a PCI device accessed memory, the CPU would have to stall, waiting for
the PCI device to finish. Access to memory is generally limited to one system
component at a time. This would slow the system down. It is also not a good idea to
allow the system's peripheral devices to access main memory in an uncontrolled way.
This would be very dangerous; a rogue device could make the system very unstable.
Peripheral devices have their own memory spaces. The CPU can access these spaces
but access by the devices into the system's memory is very strictly controlled using
DMA (Direct Memory Access) channels. ISA devices have access to two address
spaces, ISA I/O (Input/Output) and ISA memory. PCI has three: PCI I/O, PCI Memory
and PCI Configuration space. All of these address spaces are also accessible by the
CPU, with the PCI I/O and PCI Memory address spaces being used by the device
drivers and the PCI Configuration space being used by the PCI initialization code
within the Linux kernel.
The Alpha AXP processor does not have natural access to address spaces other
than the system address space. It uses support chipsets to access other address
spaces such as PCI Configuration space. It uses a sparse address mapping scheme
which steals part of the large virtual address space and maps it to the PCI address
spaces.

6.2 PCI Configuration Headers

Every PCI device in the system, including the PCI-PCI bridges, has a configuration
data structure that is somewhere in the PCI configuration address space. The PCI
Configuration header allows the system to identify and control the device. Exactly
where the header is in the PCI Configuration address space depends on where in
the PCI topology that device is. For example, a PCI video card plugged into one
PCI slot on the PC motherboard will have its configuration header at one location
and if it is plugged into another PCI slot then its header will appear in another
location in PCI Configuration memory. This does not matter, for wherever the PCI
devices and bridges are, the system will find and configure them using the status
and configuration registers in their configuration headers.
Typically, systems are designed so that every PCI slot has its PCI Configuration
header at an offset that is related to its slot on the board. So, for example, the
first slot on the board might have its PCI Configuration at offset 0 and the second
slot at offset 256 (all headers are the same length, 256 bytes) and so on. A system
specific hardware mechanism is defined so that the PCI configuration code can
attempt to examine all possible PCI Configuration headers for a given PCI bus and
know which devices are present and which devices are absent simply by trying to
read one of the fields in the header (usually the Vendor Identification field) and
getting some sort of error. The [3, PCI Local Bus Specification] describes one
possible error message as returning 0xFFFFFFFF when attempting to read the Vendor
Identification and Device Identification fields for an empty PCI slot.

[Figure 6.2: The PCI Configuration Header - Vendor Id and Device Id at offset 00h,
Command and Status at 04h, Class Code at 08h, the Base Address Registers from 10h to
24h and the Interrupt Line and Interrupt Pin fields at 3Ch.]

Figure 6.2 shows the layout of the 256 byte PCI configuration header. It contains
the following fields:

Vendor Identification A unique number describing the originator of the PCI
device. Digital's PCI Vendor Identification is 0x1011 and Intel's is 0x8086.

Device Identification A unique number describing the device itself. For example,
Digital's 21141 fast ethernet device has a device identification of 0x0009.

Status This field gives the status of the device, with the meaning of the bits of
this field set by the standard [3, PCI Local Bus Specification].

See include/linux/pci.h

Command By writing to this field the system controls the device, for example
allowing the device to access PCI I/O memory.

Class Code This identifies the type of device that this is. There are standard
classes for every sort of device; video, SCSI and so on. The class code for SCSI
is 0x0100.

Base Address Registers These registers are used to determine and allocate the
type, amount and location of PCI I/O and PCI memory space that the device can use.

Interrupt Pin Four of the physical pins on the PCI card carry interrupts from
the card to the PCI bus. The standard labels these as A, B, C and D. The Interrupt
Pin field describes which of these pins this PCI device uses. Generally it is
hardwired for a particular device. That is, every time the system boots, the device
uses the same interrupt pin. This information allows the interrupt handling
subsystem to manage interrupts from this device.

Interrupt Line The Interrupt Line field of the device's PCI Configuration header
is used to pass an interrupt handle between the PCI initialisation code, the
device's driver and Linux's interrupt handling subsystem. The number written there
is meaningless to the device driver but it allows the interrupt handler to correctly
route an interrupt from the PCI device to the correct device driver's interrupt
handling code within the Linux operating system. See Chapter 7 on page 75 for
details on how Linux handles interrupts.
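As a rough illustration of how a slot can be probed, the sketch below uses the
pcibios_* configuration access routines (the interface provided by the kernel's PCI
BIOS layer in the 2.0 sources); treat the exact headers and return-value checking as
assumptions rather than a definitive recipe.

#include <linux/bios32.h>       /* pcibios_read_config_word(), PCIBIOS_SUCCESSFUL */
#include <linux/pci.h>          /* PCI_VENDOR_ID                                  */

/* Return non-zero if a device answers configuration reads in this slot. */
static int slot_is_occupied(unsigned char bus, unsigned char dev_fn)
{
    unsigned short vendor;

    if (pcibios_read_config_word(bus, dev_fn, PCI_VENDOR_ID, &vendor)
        != PCIBIOS_SUCCESSFUL)
        return 0;

    /* An empty slot typically answers the read with all ones. */
    return vendor != 0xffff;
}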

6.3 PCI I/O and PCI Memory Addresses

These two address spaces are used by the devices to communicate with their device
drivers running in the Linux kernel on the CPU. For example, the DECchip 21141
fast ethernet device maps its internal registers into PCI I/O space. Its Linux device
driver then reads and writes those registers to control the device. Video drivers
typically use large amounts of PCI memory space to contain video information.
Until the PCI system has been set up and the device's access to these address spaces
has been turned on using the Command field in the PCI Configuration header, nothing
can access them. It should be noted that only the PCI configuration code reads and
writes PCI configuration addresses; the Linux device drivers only read and write
PCI I/O and PCI memory addresses.

6.4 PCI-ISA Bridges

These bridges support legacy ISA devices by translating PCI I/O and PCI Memory
space accesses into ISA I/O and ISA Memory accesses. A lot of systems now sold
contain several ISA bus slots and several PCI bus slots. Over time the need for this
backwards compatibility will dwindle and PCI only systems will be sold. Where in
the ISA address spaces (I/O and Memory) the ISA devices of the system have their
registers was fixed in the dim mists of time by the early Intel 8080 based PCs. Even
a $5000 Alpha AXP based computer system will have its ISA floppy controller at the
same place in ISA I/O space as the first IBM PC. The PCI specification copes with
this by reserving the lower regions of the PCI I/O and PCI Memory address spaces
for use by the ISA peripherals in the system and using a single PCI-ISA bridge to
translate any PCI memory accesses to those regions into ISA accesses.

[Figure 6.3: Type 0 PCI Configuration Cycle - bits 31:11 Device Select, bits 10:8
Function, bits 7:2 Register, bits 1:0 set to 00.]

[Figure 6.4: Type 1 PCI Configuration Cycle - bits 31:24 Reserved, bits 23:16 Bus,
bits 15:11 Device, bits 10:8 Function, bits 7:2 Register, bits 1:0 set to 01.]

6.5 PCI-PCI Bridges

PCI-PCI bridges are special PCI devices that glue the PCI buses of the system
together. Simple systems have a single PCI bus but there is an electrical limit on
the number of PCI devices that a single PCI bus can support. Using PCI-PCI bridges
to add more PCI buses allows the system to support many more PCI devices. This is
particularly important for a high performance server. Of course, Linux fully supports
the use of PCI-PCI bridges.

6.5.1 PCI-PCI Bridges: PCI I/O and PCI Memory Windows

PCI-PCI bridges only pass a subset of PCI I/O and PCI memory read and write
requests downstream. For example, in Figure 6.1 on page 62, the PCI-PCI bridge
will only pass read and write addresses from PCI bus 0 to PCI bus 1 if they are
for PCI I/O or PCI memory addresses owned by either the SCSI or ethernet device;
all other PCI I/O and memory addresses are ignored. This filtering stops addresses
propagating needlessly throughout the system. To do this, the PCI-PCI bridges must
be programmed with a base and limit for the PCI I/O and PCI Memory space accesses
that they have to pass from their primary bus onto their secondary bus. Once the
PCI-PCI bridges in a system have been configured then, so long as the Linux device
drivers only access PCI I/O and PCI Memory space via these windows, the PCI-PCI
bridges are invisible. This is an important feature that makes life easier for Linux
PCI device driver writers. However, it also makes PCI-PCI bridges somewhat tricky
for Linux to configure as we shall see later on.

6.5.2 PCI-PCI Bridges: PCI Configuration Cycles and PCI Bus Numbering

So that the CPU's PCI initialization code can address devices that are not on the
main PCI bus, there has to be a mechanism that allows bridges to decide whether or
not to pass Configuration cycles from their primary interface to their secondary
interface. A cycle is just an address as it appears on the PCI bus. The PCI
specification defines two formats for the PCI Configuration addresses: Type 0 and
Type 1; these are shown in Figure 6.3 and Figure 6.4 respectively. Type 0 PCI
Configuration cycles do not contain a bus number and these are interpreted by all
devices as being for PCI configuration addresses on this PCI bus. Bits 31:11 of the
Type 0 configuration cycles are treated as the device select field. One way to design
a system is to have each bit select a different device. In this case bit 11 would
select the PCI device in slot 0, bit 12 would select the PCI device in slot 1 and so
on. Another way is to write the device's slot number directly into bits 31:11. Which
mechanism is used in a system depends on the system's PCI memory controller.
Type 1 PCI Configuration cycles contain a PCI bus number and this type of
configuration cycle is ignored by all PCI devices except the PCI-PCI bridges. All of
the PCI-PCI bridges seeing Type 1 configuration cycles may choose to pass them to
the PCI buses downstream of themselves. Whether the PCI-PCI bridge ignores the
Type 1 configuration cycle or passes it onto the downstream PCI bus depends on how
the PCI-PCI bridge has been configured. Every PCI-PCI bridge has a primary bus
interface number and a secondary bus interface number. The primary bus interface is
the one nearest the CPU and the secondary bus interface is the one furthest away.
Each PCI-PCI bridge also has a subordinate bus number and this is the maximum bus
number of all the PCI buses that are bridged beyond the secondary bus interface. Or
to put it another way, the subordinate bus number is the highest numbered PCI bus
downstream of the PCI-PCI bridge. When the PCI-PCI bridge sees a Type 1 PCI
configuration cycle it does one of the following things:

• Ignore it if the bus number specified is not in between the bridge's secondary
bus number and subordinate bus number (inclusive),

• Convert it to a Type 0 configuration command if the bus number specified
matches the secondary bus number of the bridge,

• Pass it onto the secondary bus interface unchanged if the bus number specified
is greater than the secondary bus number and less than or equal to the
subordinate bus number.

So, if we want to address Device 1 on bus 3 of the topology in Figure 6.9 on page 71
we must generate a Type 1 Configuration command from the CPU. Bridge1 passes
this unchanged onto Bus 1. Bridge2 ignores it but Bridge3 converts it into a Type
0 Configuration command and sends it out on Bus 3 where Device 1 responds to it.
A sketch of this decision logic appears below.
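The decision a bridge makes can be summed up in a few lines of C. The sketch below
is purely conceptual (it models the hardware's behaviour, it is not kernel code); the
function and enumeration names are invented for the illustration.

enum bridge_action { IGNORE, CONVERT_TO_TYPE0, PASS_UNCHANGED };

/* Decide what a PCI-PCI bridge does with a Type 1 configuration cycle,
 * given the bus numbers programmed into its configuration header. */
static enum bridge_action route_type1_cycle(int cycle_bus,
                                            int secondary_bus,
                                            int subordinate_bus)
{
    if (cycle_bus < secondary_bus || cycle_bus > subordinate_bus)
        return IGNORE;              /* not for any bus behind this bridge */
    if (cycle_bus == secondary_bus)
        return CONVERT_TO_TYPE0;    /* for the bus directly downstream    */
    return PASS_UNCHANGED;          /* for a bus further downstream       */
}

With the final numbering of Figure 6.9, route_type1_cycle(3, 1, 4) for Bridge1 gives
PASS_UNCHANGED, route_type1_cycle(3, 2, 2) for Bridge2 gives IGNORE and
route_type1_cycle(3, 3, 4) for Bridge3 gives CONVERT_TO_TYPE0, matching the worked
example above.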
It is up to each individual operating system to allocate bus numbers during PCI
configuration but whatever the numbering scheme used, the following statement must
be true for all of the PCI-PCI bridges in the system:

"All PCI buses located behind a PCI-PCI bridge must reside between the secondary
bus number and the subordinate bus number (inclusive)."

If this rule is broken then the PCI-PCI bridges will not pass and translate Type 1
PCI configuration cycles correctly and the system will fail to find and initialise the
PCI devices in the system. To achieve this numbering scheme, Linux configures these
special devices in a particular order. Section 6.6.2 on page 68 describes Linux's PCI
bridge and bus numbering scheme in detail together with a worked example.

6.6 Linux PCI Initialization

The PCI initialisation code in Linux is broken into three logical parts:

PCI Device Driver This pseudo-device driver searches the PCI system starting at
Bus 0 and locates all PCI devices and bridges in the system. It builds a linked
list of data structures describing the topology of the system. Additionally, it
numbers all of the bridges that it finds.

PCI BIOS This software layer provides the services described in [4, PCI BIOS ROM
specification]. Even though Alpha AXP does not have BIOS services, there is
equivalent code in the Linux kernel providing the same functions.

PCI Fixup System specific fixup code tidies up the system specific loose ends of
PCI initialization.

[Figure 6.5: Linux Kernel PCI Data Structures - pci_root points at the pci_bus data
structure for bus 0 (parent, children, next, self, devices); its devices list holds
the pci_dev data structures for the PCI-ISA bridge, the video device and the PCI-PCI
bridge, and a child pci_bus data structure for bus 1 holds the pci_dev data
structures for the SCSI and ethernet devices.]

6.6.1 The Linux Kernel PCI Data Structures

See drivers/pci/pci.c and include/linux/pci.h
See arch/*/kernel/bios32.c

As the Linux kernel initialises the PCI system it builds data structures mirroring
the real PCI topology of the system. Figure 6.5 shows the relationships of the data
structures that it would build for the example PCI system in Figure 6.1 on page 62.
Each PCI device (including the PCI-PCI bridges) is described by a pci_dev data
structure. Each PCI bus is described by a pci_bus data structure. The result is a
tree structure of PCI buses, each of which has a number of child PCI devices attached
to it. As a PCI bus can only be reached using a PCI-PCI bridge (except the primary
PCI bus, bus 0), each pci_bus contains a pointer to the PCI device (the PCI-PCI
bridge) that it is accessed through. That PCI device is a child of the PCI bus's
parent PCI bus.
Not shown in Figure 6.5 is a pointer to all of the PCI devices in the system,
pci_devices. All of the PCI devices in the system have their pci_dev data structures
queued onto this queue. This queue is used by the Linux kernel to quickly find all
of the PCI devices in the system.

6.6.2 The PCI Device Driver

See Scan_bus() in drivers/pci/pci.c

The PCI device driver is not really a device driver at all but a function of the
operating system called at system initialisation time. The PCI initialisation code
must scan all of the PCI buses in the system looking for all PCI devices in the
system (including PCI-PCI bridge devices). It uses the PCI BIOS code to find out if
every possible slot in the current PCI bus that it is scanning is occupied. If the
PCI slot is occupied, it builds a pci_dev data structure describing the device and
links it into the list of known PCI devices (pointed at by pci_devices).
The PCI initialisation code starts by scanning PCI Bus 0. It tries to read the
Vendor Identification and Device Identification fields for every possible PCI device
in every possible PCI slot. When it finds an occupied slot it builds a pci_dev data
structure describing the device. All of the pci_dev data structures built by the PCI
initialisation code (including all of the PCI-PCI bridges) are linked into a singly
linked list, pci_devices.
If the PCI device that was found was a PCI-PCI bridge then a pci_bus data structure
is built and linked into the tree of pci_bus and pci_dev data structures pointed at
by pci_root. The PCI initialisation code can tell if the PCI device is a PCI-PCI
bridge because it has a class code of 0x060400. The Linux kernel then configures the
PCI bus on the other (downstream) side of the PCI-PCI bridge that it has just found.
If more PCI-PCI bridges are found then these are also configured. This process is
known as a depthwise algorithm; the system's PCI topology is fully mapped depthwise
before searching breadthwise. Looking at Figure 6.1 on page 62, Linux would
configure PCI Bus 1, with its Ethernet and SCSI devices, before it configured the
video device on PCI Bus 0.
As Linux searches for downstream PCI buses it must also configure the intervening
PCI-PCI bridges' secondary and subordinate bus numbers. This is described in detail
in Section 6.6.2 below.

Configuring PCI-PCI Bridges - Assigning PCI Bus Numbers

For PCI-PCI bridges to pass PCI I/O, PCI Memory or PCI Configuration address space
reads and writes across them, they need to know the following:

Primary Bus Number The bus number immediately upstream of the PCI-PCI bridge,

Secondary Bus Number The bus number immediately downstream of the PCI-PCI bridge,

Subordinate Bus Number The highest bus number of all of the buses that can be
reached downstream of the bridge,

PCI I/O and PCI Memory Windows The window base and size for PCI I/O address
space and PCI Memory address space for all addresses downstream of the PCI-PCI
bridge.

The problem is that at the time when you wish to configure any given PCI-PCI bridge
you do not know the subordinate bus number for that bridge. You do not know if
there are further PCI-PCI bridges downstream and if you did, you do not know what
numbers will be assigned to them. The answer is to use a depthwise recursive
algorithm and scan each bus for any PCI-PCI bridges, assigning them numbers as they
are found. As each PCI-PCI bridge is found and its secondary bus numbered, assign
it a temporary subordinate number of 0xFF and scan and assign numbers to all
PCI-PCI bridges downstream of it. This all seems complicated but the worked example
below makes this process clearer.

[Figure 6.6: Configuring a PCI System: Part 1 - Bridge1 below the CPU's bus 0 has
been given Primary Bus 0, Secondary Bus 1 and a temporary Subordinate of 0xFF; the
buses below Bridge2, Bridge3 and Bridge4 are not yet numbered.]

PCI-PCI Bridge Numbering: Step 1 Taking the topology in Figure 6.6, the first
bridge the scan would find is Bridge1. The PCI bus downstream of Bridge1 would be
numbered as 1 and Bridge1 assigned a secondary bus number of 1 and a temporary
subordinate bus number of 0xFF. This means that all Type 1 PCI Configuration
addresses specifying a PCI bus number of 1 or higher would be passed across Bridge1
and onto PCI Bus 1. They would be translated into Type 0 Configuration cycles if
they have a bus number of 1 but left untranslated for all other bus numbers. This
is exactly what the Linux PCI initialisation code needs to do in order to go and
scan PCI Bus 1.

PCI-PCI Bridge Numbering: Step 2 Linux uses a depthwise algorithm and so the
initialisation code goes on to scan PCI Bus 1. Here it finds PCI-PCI Bridge2. There
are no further PCI-PCI bridges beyond PCI-PCI Bridge2, so it is assigned a
subordinate bus number of 2 which matches the number assigned to its secondary
interface. Figure 6.7 shows how the buses and PCI-PCI bridges are numbered at this
point.

[Figure 6.7: Configuring a PCI System: Part 2 - Bridge2 now has Primary Bus 1,
Secondary Bus 2 and Subordinate 2; Bridge1 still has a temporary Subordinate of
0xFF and the buses below Bridge3 and Bridge4 are not yet numbered.]

PCI-PCI Bridge Numbering: Step 3 The PCI initialisation code returns to scanning
PCI Bus 1 and finds another PCI-PCI bridge, Bridge3. It is assigned 1 as its primary
bus interface number, 3 as its secondary bus interface number and 0xFF as its
subordinate bus number. Figure 6.8 on page 71 shows how the system is configured
now. Type 1 PCI configuration cycles with a bus number of 1, 2 or 3 will be
correctly delivered to the appropriate PCI buses.

PCI-PCI Bridge Numbering: Step 4 Linux starts scanning PCI Bus 3, downstream of
PCI-PCI Bridge3. PCI Bus 3 has another PCI-PCI bridge (Bridge4) on it; it is
assigned 3 as its primary bus number and 4 as its secondary bus number. It is the
last bridge on this branch and so it is assigned a subordinate bus interface number
of 4. The initialisation code returns to PCI-PCI Bridge3 and assigns it a subordinate
bus number of 4. Finally, the PCI initialisation code can assign 4 as the subordinate
bus number for PCI-PCI Bridge1. Figure 6.9 on page 71 shows the final bus numbers.
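A much simplified sketch of this depthwise numbering pass is shown below. It is not
the kernel's code; scan_slots_for_bridges() is a hypothetical helper standing in for
the real slot probing, and error handling is omitted.

struct bridge {
    int primary;
    int secondary;
    int subordinate;
};

/* Hypothetical helper: fill '*bridges' with the bridges found on 'bus'
 * and return how many there were. */
extern int scan_slots_for_bridges(int bus, struct bridge **bridges);

/* Number the buses below 'bus', handing out numbers from 'next_free'
 * upwards, and return the next unused bus number. */
static int assign_bus_numbers(int bus, int next_free)
{
    struct bridge *b;
    int i, n = scan_slots_for_bridges(bus, &b);

    for (i = 0; i < n; i++) {
        b[i].primary = bus;
        b[i].secondary = next_free++;
        b[i].subordinate = 0xFF;                    /* temporary maximum       */
        next_free = assign_bus_numbers(b[i].secondary, next_free);
        b[i].subordinate = next_free - 1;           /* highest bus found below */
    }
    return next_free;
}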

6.6.3 PCI BIOS Functions

See arch/*/kernel/bios32.c

The PCI BIOS functions are a series of standard routines which are common across
all platforms. For example, they are the same for both Intel and Alpha AXP based
systems. They allow the CPU controlled access to all of the PCI address spaces.
Only Linux kernel code and device drivers may use them.

[Figure 6.8: Configuring a PCI System: Part 3 - Bridge3 has now been given Primary
Bus 1, Secondary Bus 3 and a temporary Subordinate of 0xFF; Bridge2 keeps Primary
Bus 1, Secondary Bus 2, Subordinate 2 and the bus below Bridge4 is still unnumbered.]

[Figure 6.9: Configuring a PCI System: Part 4 - the final numbering: Bridge1
(Primary 0, Secondary 1, Subordinate 4), Bridge2 (Primary 1, Secondary 2,
Subordinate 2), Bridge3 (Primary 1, Secondary 3, Subordinate 4) and Bridge4
(Primary 3, Secondary 4, Subordinate 4).]

[Figure 6.10: PCI Configuration Header: Base Address Registers - for PCI Memory
space, bits 31:4 hold the base address, bit 3 the prefetchable flag, bits 2:1 the type
and bit 0 is 0; for PCI I/O space, bits 31:2 hold the base address, bit 1 is reserved
and bit 0 is 1.]

6.6.4 PCI Fixup

See arch/*/kernel/bios32.c

The PCI fixup code for Alpha AXP does rather more than that for Intel (which
basically does nothing). For Intel based systems the system BIOS, which ran at boot
time, has already fully configured the PCI system. This leaves Linux with little to
do other than map that configuration. For non-Intel based systems further
configuration needs to happen to:

• Allocate PCI I/O and PCI Memory space to each device,

• Configure the PCI I/O and PCI Memory address windows for each PCI-PCI bridge
in the system,

• Generate Interrupt Line values for the devices; these control interrupt handling
for the device.

The next subsections describe how that code works.

Finding Out How Much PCI I/O and PCI Memory Space a Device Needs

Each PCI device found is queried to find out how much PCI I/O and PCI Memory
address space it requires. To do this, each Base Address Register has all 1's written
to it and is then read back. The device will return 0's in the don't-care address
bits, effectively specifying the address space required.
There are two basic types of Base Address Register: the first indicates within which
address space the device's registers must reside, either PCI I/O or PCI Memory space.
This is indicated by Bit 0 of the register. Figure 6.10 shows the two forms of the
Base Address Register, for PCI Memory and for PCI I/O.
To find out just how much of each address space a given Base Address Register is
requesting, you write all 1s into the register and then read it back. The device will
specify zeros in the don't-care address bits, effectively specifying the address space
required. This design implies that all address spaces used are a power of two and
are naturally aligned.

For example, when you initialize the DECchip 21142 PCI Fast Ethernet device, it
tells you that it needs 0x100 bytes of space of either PCI I/O or PCI Memory. The
initialization code allocates it space. The moment that it allocates space, the 21142's
control and status registers can be seen at those addresses.
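A sketch of this sizing trick, written against the pcibios_* configuration access
routines of the 2.0 sources, might look like the following. The flag-bit masks follow
the register layout in Figure 6.10; a real implementation would also restore the
register's original value afterwards.

#include <linux/bios32.h>       /* pcibios_read/write_config_dword() */

/* Return the amount of PCI I/O or PCI Memory space that one Base Address
 * Register is asking for.  'reg' is the register's offset in the header. */
static unsigned int bar_size(unsigned char bus, unsigned char dev_fn,
                             unsigned char reg)
{
    unsigned int probe, mask;

    pcibios_write_config_dword(bus, dev_fn, reg, 0xffffffff);
    pcibios_read_config_dword(bus, dev_fn, reg, &probe);

    if (probe & 1)                      /* bit 0 set: a PCI I/O space BAR */
        mask = probe & ~0x3;            /* bits 1:0 are flag bits         */
    else                                /* bit 0 clear: a PCI Memory BAR  */
        mask = probe & ~0xf;            /* bits 3:0 are flag bits         */

    return ~mask + 1;                   /* power of two, naturally aligned */
}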

Allocating PCI I/O and PCI Memory to PCI-PCI Bridges and Devices

Like all memory, the PCI I/O and PCI Memory spaces are finite, and to some extent
scarce. The PCI Fixup code for non-Intel systems (and the BIOS code for Intel
systems) has to allocate each device the amount of memory that it is requesting in
an efficient manner. Both PCI I/O and PCI Memory must be allocated to a device
in a naturally aligned way. For example, if a device asks for 0xB0 of PCI I/O space
then it must be aligned on an address that is a multiple of 0xB0. In addition to this,
the PCI I/O and PCI Memory bases for any given bridge must be aligned on 4K and
on 1Mbyte boundaries respectively. Given that the address spaces for downstream
devices must lie within all of the upstream PCI-PCI bridges' memory ranges for any
given device, it is a somewhat difficult problem to allocate space efficiently.
The algorithm that Linux uses relies on each device described by the bus/device tree
built by the PCI Device Driver being allocated address space in ascending PCI I/O
memory order. Again a recursive algorithm is used to walk the pci_bus and pci_dev
data structures built by the PCI initialisation code. Starting at the root PCI bus
(pointed at by pci_root) the BIOS fixup code:

• Aligns the current global PCI I/O and Memory bases on 4K and 1 Mbyte
boundaries respectively,

• For every device on the current bus (in ascending PCI I/O memory needs),
  - allocates it space in PCI I/O and/or PCI Memory,
  - moves on the global PCI I/O and Memory bases by the appropriate amounts,
  - enables the device's use of PCI I/O and PCI Memory,

• Allocates space recursively to all of the buses downstream of the current bus.
Note that this will change the global PCI I/O and Memory bases,

• Aligns the current global PCI I/O and Memory bases on 4K and 1 Mbyte
boundaries respectively and in doing so figures out the size and base of the PCI I/O
and PCI Memory windows required by the current PCI-PCI bridge,

• Programs the PCI-PCI bridge that links to this bus with its PCI I/O and PCI
Memory bases and limits,

• Turns on bridging of PCI I/O and PCI Memory accesses in the PCI-PCI bridge.
This means that any PCI I/O or PCI Memory addresses seen on the bridge's primary
PCI bus that are within its PCI I/O and PCI Memory address windows will be bridged
onto its secondary PCI bus.

Taking the PCI system in Figure 6.1 on page 62 as our example, the PCI Fixup code
would set up the system in the following way:

Align the PCI bases PCI I/O is 0x4000 and PCI Memory is 0x100000. This allows the
PCI-ISA bridges to translate all addresses below these into ISA address cycles,

The Video Device This is asking for 0x200000 of PCI Memory and so we allocate it
that amount starting at 0x200000, the current PCI Memory base of 0x100000 rounded
up so that the allocation is naturally aligned to the size requested. The PCI Memory
base is moved to 0x400000 and the PCI I/O base remains at 0x4000.

The PCI-PCI Bridge We now cross the PCI-PCI bridge and allocate PCI memory
there; note that we do not need to align the bases as they are already correctly
aligned:

The Ethernet Device This is asking for 0xB0 bytes of both PCI I/O and PCI Memory
space. It gets allocated PCI I/O at 0x4000 and PCI Memory at 0x400000. The PCI
Memory base is moved to 0x4000B0 and the PCI I/O base to 0x40B0.

The SCSI Device This is asking for 0x1000 of PCI Memory and so it is allocated it
at 0x401000 after it has been naturally aligned. The PCI I/O base is still 0x40B0
and the PCI Memory base has been moved to 0x402000.

The PCI-PCI Bridge's PCI I/O and Memory Windows We now return to the bridge
and set its PCI I/O window at between 0x4000 and 0x40B0 and its PCI Memory window
at between 0x400000 and 0x402000. This means that the PCI-PCI bridge will ignore
the PCI Memory accesses for the video device and pass them on if they are for the
ethernet or SCSI devices.
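The natural alignment rule used throughout this example can be captured in a few
lines. The sketch below is illustrative only; allocate_aligned() is an invented
helper, and sizes are assumed to be powers of two.

/* Round '*base' up so that an allocation of 'size' bytes is naturally
 * aligned, hand out the aligned start and advance the base past it. */
static unsigned int allocate_aligned(unsigned int *base, unsigned int size)
{
    unsigned int start = (*base + size - 1) & ~(size - 1);

    *base = start + size;       /* move the global base past the allocation */
    return start;
}

With the figures above, calling allocate_aligned(&mem_base, 0x200000) while the PCI
Memory base is 0x100000 returns 0x200000 and leaves the base at 0x400000, matching
the allocation made for the video device.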

Chapter 7

Interrupts and Interrupt Handling

This chapter looks at how interrupts are handled by the Linux kernel. Whilst the
kernel has generic mechanisms and interfaces for handling interrupts, most of the
interrupt handling details are architecture specific.
Linux uses a lot of different pieces of hardware to perform many different tasks.
The video device drives the monitor, the IDE device drives the disks and so on.
You could drive these devices synchronously, that is you could send a request for
some operation (say writing a block of memory out to disk) and then wait for the
operation to complete. That method, although it would work, is very inefficient and
the operating system would spend a lot of time "busy doing nothing" as it waited
for each operation to complete. A better, more efficient, way is to make the request
and then do other, more useful work and later be interrupted by the device when it
has finished the request. With this scheme, there may be many outstanding requests
to the devices in the system all happening at the same time.
There has to be some hardware support for the devices to interrupt whatever the
CPU is doing. Most, if not all, general purpose processors such as the Alpha AXP
use a similar method. Some of the physical pins of the CPU are wired such that
changing the voltage (for example changing it from +5v to -5v) causes the CPU to
stop what it is doing and to start executing special code to handle the interruption;
the interrupt handling code. One of these pins might be connected to an interval
timer and receive an interrupt every 1000th of a second, others may be connected to
the other devices in the system, such as the SCSI controller.
Systems often use an interrupt controller to group the device interrupts together
before passing on the signal to a single interrupt pin on the CPU. This saves
interrupt pins on the CPU and also gives flexibility when designing systems. The
interrupt controller has mask and status registers that control the interrupts.
Setting the bits in the mask register enables and disables interrupts and the status
register returns the currently active interrupts in the system.

[Figure 7.1: A Logical Diagram of Interrupt Routing - two chained programmable
interrupt controllers, PIC1 and PIC2, route interrupts from devices such as the real
time clock, keyboard, serial ports, sound, floppy, SCSI and IDE controllers to the
CPU.]
Some of the interrupts in the system may be hard-wired; for example, the real time
clock's interval timer may be permanently connected to pin 3 on the interrupt
controller. However, what some of the pins are connected to may be determined by
what controller card is plugged into a particular ISA or PCI slot. For example, pin
4 on the interrupt controller may be connected to PCI slot number 0 which might
one day have an ethernet card in it but the next have a SCSI controller in it. The
bottom line is that each system has its own interrupt routing mechanisms and the
operating system must be flexible enough to cope.
Most modern general purpose microprocessors handle the interrupts the same way.
When a hardware interrupt occurs the CPU stops executing the instructions that it
was executing and jumps to a location in memory that either contains the interrupt
handling code or an instruction branching to the interrupt handling code. This code
usually operates in a special mode for the CPU, interrupt mode, and, normally, no
other interrupts can happen in this mode. There are exceptions though; some CPUs
rank the interrupts in priority and higher level interrupts may happen. This means
that the first level interrupt handling code must be very carefully written and it
often has its own stack, which it uses to store the CPU's execution state (all of the
CPU's normal registers and context) before it goes off and handles the interrupt.
Some CPUs have a special set of registers that only exist in interrupt mode, and the
interrupt code can use these registers to do most of the context saving it needs to
do.
When the interrupt has been handled, the CPU's state is restored and the interrupt
is dismissed. The CPU will then continue doing whatever it was doing before being
interrupted. It is important that the interrupt processing code is as efficient as
possible and that the operating system does not block interrupts too often or for
too long.

7.1 Programmable Interrupt Controllers

Systems designers are free to use whatever interrupt architecture they wish but IBM
PCs use the Intel 82C59A-2 CMOS Programmable Interrupt Controller [6, Intel
Peripheral Components] or its derivatives. This controller has been around since
the dawn of the PC and it is programmable, with its registers being at well known
locations in the ISA address space. Even very modern support logic chip sets keep
equivalent registers in the same place in ISA memory. Non-Intel based systems such
as Alpha AXP based PCs are free from these architectural constraints and so often
use different interrupt controllers.
Figure 7.1 shows that there are two 8 bit controllers chained together, each having
a mask and an interrupt status register, PIC1 and PIC2. The mask registers are at
addresses 0x21 and 0xA1 and the status registers are at 0x20 and 0xA0. Setting a
particular bit of the mask register masks (disables) that interrupt; clearing it
enables the interrupt again. So, setting bit 3 would disable interrupt 3 and clearing
it would enable it. Unfortunately (and irritatingly), the interrupt mask registers
are write only; you cannot read back the value that you wrote. This means that Linux
must keep a local copy of what it has set the mask registers to. It modifies these
saved masks in the interrupt enable and disable routines and writes the full masks
to the registers every time.
When an interrupt is signalled, the interrupt handling code reads the two interrupt
status registers (ISRs). It treats the ISR at 0x20 as the bottom eight bits of a
sixteen bit interrupt register and the ISR at 0xA0 as the top eight bits. So, an
interrupt on bit 1 of the ISR at 0xA0 would be treated as system interrupt 9. Bit 2
of PIC1 is not available as this is used to chain interrupts from PIC2; any interrupt
on PIC2 results in bit 2 of PIC1 being set.
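A sketch of the cached-mask technique might look like the following. The variable
and routine names are invented for the illustration (the real arch/*/kernel/irq.c
code is organised differently), but the idea is the same: keep a software copy and
rewrite the whole mask register on every change.

#include <asm/io.h>             /* outb() */

static unsigned int cached_irq_mask = 0xffff;   /* start with all 16 inputs masked */

static void write_8259_masks(void)
{
    outb(cached_irq_mask & 0xff, 0x21);         /* PIC1 mask register */
    outb((cached_irq_mask >> 8) & 0xff, 0xA1);  /* PIC2 mask register */
}

static void disable_8259_irq(unsigned int irq)
{
    cached_irq_mask |= 1 << irq;                /* set the bit: interrupt masked   */
    write_8259_masks();
}

static void enable_8259_irq(unsigned int irq)
{
    cached_irq_mask &= ~(1 << irq);             /* clear the bit: interrupt enabled */
    write_8259_masks();
}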

7.2 Initializing the Interrupt Handling Data Structures

See request_irq(), enable_irq() and disable_irq() in arch/*/kernel/irq.c
See irq_probe_*() in arch/*/kernel/irq.c

The kernel's interrupt handling data structures are set up by the device drivers as
they request control of the system's interrupts. To do this the device driver uses a
set of Linux kernel services that are used to request an interrupt, enable it and to
disable it. The individual device drivers call these routines to register their
interrupt handling routine addresses.
Some interrupts are fixed by convention for the PC architecture and so the driver
simply requests its interrupt when it is initialized. This is what the floppy disk
device driver does; it always requests IRQ 6. There may be occasions when a device
driver does not know which interrupt the device will use. This is not a problem for
PCI device drivers as they always know what their interrupt number is. Unfortunately
there is no easy way for ISA device drivers to find their interrupt number. Linux
solves this problem by allowing device drivers to probe for their interrupts.
First, the device driver does something to the device that causes it to interrupt.
Then all of the unassigned interrupts in the system are enabled. This means that
the device's pending interrupt will now be delivered via the programmable interrupt
controller. Linux reads the interrupt status register and returns its contents to the
device driver. A non-zero result means that one or more interrupts occurred during
the probe. The driver now turns probing off and the unassigned interrupts are all
disabled. If the ISA device driver has successfully found its IRQ number then it can
now request control of it as normal.

See arch/alpha/kernel/bios32.c
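The probing sequence described above is wrapped up by the kernel's probe_irq_on()
and probe_irq_off() helpers. The sketch below shows how a driver might use them;
my_device_trigger_interrupt() is a hypothetical routine standing in for whatever
makes the device raise its interrupt, and the header location is assumed from later
kernel trees.

#include <linux/interrupt.h>    /* probe_irq_on(), probe_irq_off() */

extern void my_device_trigger_interrupt(void);  /* hypothetical helper */

static int probe_my_device_irq(void)
{
    unsigned long mask;
    int irq;

    mask = probe_irq_on();              /* enable the unassigned interrupts    */
    my_device_trigger_interrupt();      /* make the device raise its interrupt */
    /* a real driver would wait briefly here for the interrupt to arrive */
    irq = probe_irq_off(mask);          /* >0: the IRQ found; 0 or <0: failure */

    return irq;
}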
PCI based systems are much more dynamic than ISA based systems. The interrupt
pin that an ISA device uses is often set using jumpers on the hardware device and
fixed in the device driver. On the other hand, PCI devices have their interrupts
allocated by the PCI BIOS or the PCI subsystem as PCI is initialized when the
system boots. Each PCI device may use one of four interrupt pins, A, B, C or D.
This was fixed when the device was built and most devices default to interrupt on
pin A. The PCI interrupt lines A, B, C and D for each PCI slot are routed to the
interrupt controller. So, Pin A from PCI slot 4 might be routed to pin 6 of the
interrupt controller, pin B of PCI slot 4 to pin 7 of the interrupt controller and
so on.
How the PCI interrupts are routed is entirely system specific and there must be
some set up code which understands this PCI interrupt routing topology. On Intel
based PCs this is the system BIOS code that runs at boot time but for systems
without BIOS (for example Alpha AXP based systems) the Linux kernel does this
setup. The interrupt pin that a device uses is fixed when the device is built and is
kept in a field in the PCI configuration header for this device. The PCI set up code
determines the interrupt controller pin (or IRQ) number for each device using its
knowledge of the PCI interrupt routing topology together with the device's PCI slot
number and which PCI interrupt pin it is using. It then writes this number into the
Interrupt Line field of the device's PCI configuration header, which is reserved for
this purpose. When the device driver runs, it reads this information and uses it to
request control of the interrupt from the Linux kernel.
There may be many PCI interrupt sources in the system, for example when PCI-PCI
bridges are used. The number of interrupt sources may exceed the number of pins on
the system's programmable interrupt controllers. In this case, PCI devices may share
interrupts, one pin on the interrupt controller taking interrupts from more than one
PCI device. Linux supports this by allowing the first requestor of an interrupt
source to declare whether it may be shared. Sharing interrupts results in several
irqaction data structures being pointed at by one entry in the irq_action vector.
When a shared interrupt happens, Linux will call all of the interrupt handlers for
that source. Any device driver that can share interrupts (which should be all PCI
device drivers) must be prepared to have its interrupt handler called when there is
no interrupt to be serviced.

7.3 Interrupt Handling

One of the principal tasks of Linux's interrupt handling subsystem is to route the interrupts to the right pieces of interrupt handling code. This code must understand the interrupt topology of the system. If, for example, the floppy controller interrupts on pin 6(1) of the interrupt controller then it must recognize the interrupt as coming from the floppy and route it to the floppy device driver's interrupt handling code. Linux uses a set of pointers to data structures containing the addresses of the routines that handle the system's interrupts. These routines belong to the device drivers for the devices in the system and it is the responsibility of each device driver to request the interrupt that it wants when the driver is initialized.

(1) Actually, the floppy controller is one of the fixed interrupts in a PC system as, by convention, the floppy controller is always wired to interrupt 6.

Figure 7.2: Linux Interrupt Handling Data Structures
(The irq_action vector of pointers, each entry pointing at a chain of irqaction data structures; each irqaction holds the handler address, flags, name and a next pointer to any further handlers sharing that interrupt.)


Figure 7.2 shows that irq_action is a vector of pointers to irqaction data structures. Each irqaction data structure contains information about the handler for this interrupt, including the address of the interrupt handling routine. As the number of interrupts and how they are handled varies between architectures and, sometimes, between systems, the Linux interrupt handling code is architecture specific. This means that the size of the irq_action vector varies depending on the number of interrupt sources that there are.

When an interrupt happens, Linux must first determine its source by reading the interrupt status register of the system's programmable interrupt controllers. It then translates that source into an offset into the irq_action vector. So, for example, an interrupt on pin 6 of the interrupt controller from the floppy controller would be translated into the seventh pointer in the vector of interrupt handlers. If there is no interrupt handler for the interrupt that occurred then the Linux kernel will log an error, otherwise it will call into the interrupt handling routines for all of the irqaction data structures for this interrupt source.
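For reference, the irqaction data structure that these chains are built from looks roughly like this in the 2.0 sources (a sketch only; the exact field list varies a little between kernel versions):

    struct irqaction {
        void (*handler)(int, void *, struct pt_regs *); /* the driver's interrupt routine      */
        unsigned long flags;                            /* e.g. SA_INTERRUPT, SA_SHIRQ         */
        unsigned long mask;
        const char *name;                               /* the name shown in /proc/interrupts  */
        void *dev_id;                                   /* distinguishes sharers of one IRQ    */
        struct irqaction *next;                         /* next handler sharing this interrupt */
    };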
When the device driver's interrupt handling routine is called by the Linux kernel it must efficiently work out why it was interrupted and respond. To find the cause of the interrupt the device driver would read the status register of the device that interrupted. The device may be reporting an error or that a requested operation has completed. For example, the floppy controller may be reporting that it has completed the positioning of the floppy's read head over the correct sector on the floppy disk. Once the reason for the interrupt has been determined, the device driver may need to do more work. If it does, the Linux kernel has mechanisms that allow it to postpone that work until later. This avoids the CPU spending too much time in interrupt mode. See the Device Driver chapter (Chapter 8) for more details.
REVIEW NOTE: Fast and slow interrupts, are these an Intel thing?

Chapter 8

Device Drivers

One of the purposes of an operating system is to hide the peculiarities of the system's hardware devices from its users. For example, the Virtual File System presents a uniform view of the mounted filesystems irrespective of the underlying physical devices. This chapter describes how the Linux kernel manages the physical devices in the system.
The CPU is not the only intelligent device in the system; every physical device has its own hardware controller. The keyboard, mouse and serial ports are controlled by a SuperIO chip, the IDE disks by an IDE controller, SCSI disks by a SCSI controller and so on. Each hardware controller has its own control and status registers (CSRs) and these differ between devices. The CSRs for an Adaptec 2940 SCSI controller are completely different from those of an NCR 810 SCSI controller. The CSRs are used to start and stop the device, to initialize it and to diagnose any problems with it. Instead of putting code to manage the hardware controllers in the system into every application, the code is kept in the Linux kernel. The software that handles or manages a hardware controller is known as a device driver. The Linux kernel device drivers are, essentially, a shared library of privileged, memory resident, low level hardware handling routines. It is Linux's device drivers that handle the peculiarities of the devices they are managing.

One of the basic features of Unix is that it abstracts the handling of devices. All hardware devices look like regular files; they can be opened, closed, read and written using the same, standard, system calls that are used to manipulate files. Every device in the system is represented by a device special file, for example the first IDE disk in the system is represented by /dev/hda. For block (disk) and character devices, these device special files are created by the mknod command and they describe the device using major and minor device numbers. Network devices are also represented by device special files but they are created by Linux as it finds and initializes the network controllers in the system. All devices controlled by the same device driver have a common major device number. The minor device numbers are used to distinguish between different devices and their controllers; for example, each partition on the primary IDE disk has a different minor device number.

See fs/devices.c.

So, /dev/hda2, the second partition of the primary IDE disk, has a major number of 3 and a minor number of 2. Linux maps the device special file passed in system calls (say to mount a file system on a block device) to the device's device driver using the major device number and a number of system tables, for example the character device table, chrdevs.
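The split of a device identifier into its major and minor parts is done with simple macros. A small, illustrative sketch (the kdev_t helpers live in include/linux/kdev_t.h in kernels of this vintage):

    #include <linux/kernel.h>
    #include <linux/kdev_t.h>

    /* Illustrative only: split a device identifier such as /dev/hda2, which is (3,2). */
    static void show_dev(kdev_t dev)
    {
        int major = MAJOR(dev);    /* 3 - indexes tables such as chrdevs and blkdevs */
        int minor = MINOR(dev);    /* 2 - identifies the partition within the disk   */

        printk("major %d, minor %d\n", major, minor);
    }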
Linux supports three types of hardware device: character, block and network. Character devices are read and written directly without buffering, for example the system's serial ports /dev/cua0 and /dev/cua1. Block devices can only be written to and read from in multiples of the block size, typically 512 or 1024 bytes. Block devices are accessed via the buffer cache and may be randomly accessed, that is to say, any block can be read or written no matter where it is on the device. Block devices can be accessed via their device special file but more commonly they are accessed via the file system. Only a block device can support a mounted file system. Network devices are accessed via the BSD socket interface and the networking subsystems described in the Networking chapter (Chapter 10).

There are many different device drivers in the Linux kernel (that is one of Linux's strengths) but they all share some common attributes:

kernel code  Device drivers are part of the kernel and, like other code within the kernel, if they go wrong they can seriously damage the system. A badly written driver may even crash the system, possibly corrupting file systems and losing data,

Kernel interfaces  Device drivers must provide a standard interface to the Linux kernel or to the subsystem that they are part of. For example, the terminal driver provides a file I/O interface to the Linux kernel and a SCSI device driver provides a SCSI device interface to the SCSI subsystem which, in turn, provides both file I/O and buffer cache interfaces to the kernel.

Kernel mechanisms and services  Device drivers make use of standard kernel services such as memory allocation, interrupt delivery and wait queues to operate,

Loadable  Most of the Linux device drivers can be loaded on demand as kernel modules when they are needed and unloaded when they are no longer being used. This makes the kernel very adaptable and efficient with the system's resources,

Configurable  Linux device drivers can be built into the kernel. Which devices are built is configurable when the kernel is compiled,

Dynamic  As the system boots and each device driver is initialized it looks for the hardware devices that it is controlling. It does not matter if the device being controlled by a particular device driver does not exist. In this case the device driver is simply redundant and causes no harm apart from occupying a little of the system's memory.

8.1 Polling and Interrupts

Each time the device is given a command, for example "move the read head to sector 42 of the floppy disk", the device driver has a choice as to how it finds out that the command has completed. The device drivers can either poll the device or they can use interrupts.

Polling the device usually means reading its status register every so often until the device's status changes to indicate that it has completed the request. As a device driver is part of the kernel it would be disastrous if a driver were to poll, as nothing else in the kernel would run until the device had completed the request. Instead, polling device drivers use system timers to have the kernel call a routine within the device driver at some later time. This timer routine would check the status of the command, and this is exactly how Linux's floppy driver works. Polling by means of timers is at best approximate; a much more efficient method is to use interrupts.
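A polling driver typically arms a kernel timer rather than spinning. A minimal sketch of that idiom, assuming a hypothetical check_status() routine that re-reads the device's status register:

    #include <linux/timer.h>
    #include <linux/sched.h>                  /* jiffies and HZ */

    extern int check_status(void);            /* hypothetical: has the command completed? */

    static struct timer_list poll_timer;

    /* Run by the kernel when the timer expires; re-arms itself until the command is done. */
    static void poll_device(unsigned long data)
    {
        if (!check_status()) {
            poll_timer.expires = jiffies + HZ / 10;   /* look again in a tenth of a second */
            add_timer(&poll_timer);
        }
    }

    static void start_polling(void)
    {
        init_timer(&poll_timer);
        poll_timer.function = poll_device;
        poll_timer.expires = jiffies + HZ / 10;
        add_timer(&poll_timer);
    }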
An interrupt driven device driver is one where the hardware device being controlled will raise a hardware interrupt whenever it needs to be serviced. For example, an ethernet device driver would interrupt whenever it receives an ethernet packet from the network. The Linux kernel needs to be able to deliver the interrupt from the hardware device to the correct device driver. This is achieved by the device driver registering its usage of the interrupt with the kernel. It registers the address of an interrupt handling routine and the interrupt number that it wishes to own. You can see which interrupts are being used by the device drivers, as well as how many of each type of interrupt there have been, by looking at /proc/interrupts:
      0:   727432   timer
      1:    20534   keyboard
      2:        0   cascade
      3:    79691   serial
      4:    28258   serial
      5:        1   sound blaster
     11:    20868   aic7xxx
     13:        1   math error
     14:      247   ide0
     15:      170   ide1
This requesting of interrupt resources is done at driver initialization time. Some of the interrupts in the system are fixed; this is a legacy of the IBM PC's architecture. So, for example, the floppy disk controller always uses interrupt 6. Other interrupts, for example the interrupts from PCI devices, are dynamically allocated at boot time. In this case the device driver must first discover the interrupt number (IRQ) of the device that it is controlling before it requests ownership of that interrupt. For PCI interrupts Linux supports standard PCI BIOS callbacks to determine information about the devices in the system, including their IRQ numbers.

How an interrupt is delivered to the CPU itself is architecture dependent but on most architectures the interrupt is delivered in a special mode that stops other interrupts from happening in the system. A device driver should do as little as possible in its interrupt handling routine so that the Linux kernel can dismiss the interrupt and return to what it was doing before it was interrupted. Device drivers that need to do a lot of work as a result of receiving an interrupt can use the kernel's bottom half handlers or task queues to queue routines to be called later on.
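Registering and releasing an interrupt is done with request_irq() and free_irq(). A sketch, assuming a hypothetical my_interrupt() handler; the handler is passed the IRQ number, the dev_id given at registration time and the saved processor registers:

    #include <linux/sched.h>
    #include <linux/interrupt.h>

    static int my_dev;                       /* stands in for the driver's device structure */

    static void my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* acknowledge the device and do as little as possible here; longer
           work is deferred to a bottom half handler or a task queue */
    }

    static int attach_interrupt(int irq)
    {
        /* SA_SHIRQ marks the handler as willing to share the line, as a well
           behaved PCI driver should; dev_id tells the sharers apart */
        if (request_irq(irq, my_interrupt, SA_SHIRQ, "mydev", &my_dev) != 0)
            return -1;                       /* the interrupt was not available */
        return 0;                            /* released later with free_irq(irq, &my_dev) */
    }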

8.2 Direct Memory Access (DMA)

Using interrupt driven device drivers to transfer data to or from hardware devices works well when the amount of data is reasonably low. For example, a 9600 baud modem can transfer approximately one character every millisecond (1/1000th of a second). If the interrupt latency, the amount of time that it takes between the hardware device raising the interrupt and the device driver's interrupt handling routine being called, is low (say 2 milliseconds) then the overall system impact of the data transfer is very low. The 9600 baud modem data transfer would only take 0.002% of the CPU's processing time. For high speed devices, such as hard disk controllers or ethernet devices, the data transfer rate is a lot higher. A SCSI device can transfer up to 40 Mbytes of information per second.

Direct Memory Access, or DMA, was invented to solve this problem. A DMA controller allows devices to transfer data to or from the system's memory without the intervention of the processor. A PC's ISA DMA controller has 8 DMA channels of which 7 are available for use by the device drivers. Each DMA channel has associated with it a 16 bit address register and a 16 bit count register. To initiate a data transfer the device driver sets up the DMA channel's address and count registers together with the direction of the data transfer, read or write. It then tells the device that it may start the DMA when it wishes. When the transfer is complete the device interrupts the PC. Whilst the transfer is taking place the CPU is free to do other things.
Device drivers have to be careful when using DMA. First of all, the DMA controller knows nothing of virtual memory; it only has access to the physical memory in the system. Therefore the memory that is being DMA'd to or from must be a contiguous block of physical memory. This means that you cannot DMA directly into the virtual address space of a process. You can however lock the process's physical pages into memory, preventing them from being swapped out to the swap device during a DMA operation. Secondly, the DMA controller cannot access the whole of physical memory. The DMA channel's address register represents the first 16 bits of the DMA address; the next 8 bits come from the page register. This means that DMA requests are limited to the bottom 16 Mbytes of memory.

DMA channels are scarce resources, there are only 7 of them, and they cannot be shared between device drivers. Just like interrupts, the device driver must be able to work out which DMA channel it should use. Like interrupts, some devices have a fixed DMA channel. The floppy device, for example, always uses DMA channel 2. Sometimes the DMA channel for a device can be set by jumpers; a number of ethernet devices use this technique. The more flexible devices can be told (via their CSRs) which DMA channels to use and, in this case, the device driver can simply pick a free DMA channel to use.

Linux tracks the usage of the DMA channels using a vector of dma_chan data structures (one per DMA channel). The dma_chan data structure contains just two fields, a pointer to a string describing the owner of the DMA channel and a flag indicating if the DMA channel is allocated or not. It is this vector of dma_chan data structures that is printed when you cat /proc/dma.
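The matching kernel interface is request_dma() and its companion routines in asm/dma.h. A rough sketch of claiming and programming an ISA DMA channel (the channel number and buffer are illustrative, and the buffer must be physically contiguous, DMA-able memory below 16 Mbytes):

    #include <asm/dma.h>
    #include <asm/system.h>                    /* save_flags(), cli(), restore_flags() */

    static int setup_dma_read(char *buf, int count)
    {
        unsigned long flags;

        if (request_dma(3, "mydev") != 0)      /* channel 3 is already owned */
            return -1;

        save_flags(flags);
        cli();                                 /* the channel must not be half programmed */
        disable_dma(3);
        clear_dma_ff(3);
        set_dma_mode(3, DMA_MODE_READ);        /* transfer from the device to memory       */
        set_dma_addr(3, (unsigned int) buf);   /* bus address; identity mapped on 2.0 x86  */
        set_dma_count(3, count);
        enable_dma(3);
        restore_flags(flags);

        return 0;                              /* released later with free_dma(3) */
    }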

8.3 Memory

Device drivers have to be careful when using memory. As they are part of the Linux kernel they cannot use virtual memory. Each time a device driver runs, maybe as an interrupt is received or as a bottom half or task queue handler is scheduled, the current process may change. The device driver cannot rely on a particular process running even if it is doing work on its behalf. Like the rest of the kernel, device drivers use data structures to keep track of the devices that they are controlling. These data structures can be statically allocated, part of the device driver's code, but that would be wasteful as it makes the kernel larger than it need be. Most device drivers allocate kernel, non-paged, memory to hold their data.

Linux provides kernel memory allocation and deallocation routines and it is these that the device drivers use. Kernel memory is allocated in chunks that are powers of 2, for example 128 or 512 bytes, even if the device driver asks for less. The number of bytes that the device driver requests is rounded up to the next block size boundary. This makes kernel memory deallocation easier as the smaller free blocks can be recombined into bigger blocks.

It may be that Linux needs to do quite a lot of extra work when the kernel memory is requested. If the amount of free memory is low, physical pages may need to be discarded or written to the swap device. Normally, Linux would suspend the requestor, putting the process onto a wait queue until there is enough physical memory. Not all device drivers (or indeed Linux kernel code) may want this to happen and so the kernel memory allocation routines can be requested to fail if they cannot immediately allocate memory. If the device driver wishes to DMA to or from the allocated memory it can also specify that the memory is DMA'able. This way it is the Linux kernel that needs to understand what constitutes DMA'able memory for this system, and not the device driver.
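The allocation routines referred to here are kmalloc() and kfree(). A brief sketch of the three common ways a driver asks for memory (the sizes are illustrative):

    #include <linux/malloc.h>    /* kmalloc() and kfree(); later kernels use linux/slab.h */
    #include <linux/mm.h>        /* GFP_KERNEL, GFP_ATOMIC and GFP_DMA                    */

    static void allocation_examples(void)
    {
        void *p = kmalloc(1600, GFP_KERNEL);             /* may sleep until memory is available */
        void *q = kmalloc(1600, GFP_ATOMIC);             /* must not sleep, so it may just fail */
        void *r = kmalloc(4096, GFP_KERNEL | GFP_DMA);   /* memory that is usable for ISA DMA   */

        if (p) kfree(p);
        if (q) kfree(q);
        if (r) kfree(r);
    }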

8.4 Interfacing Device Drivers with the Kernel

The Linux kernel must be able to interact with device drivers in standard ways. Each class of device driver, character, block and network, provides common interfaces that the kernel uses when requesting services from them. These common interfaces mean that the kernel can treat often very different devices and their device drivers absolutely the same. For example, SCSI and IDE disks behave very differently but the Linux kernel uses the same interface to both of them.

Linux is very dynamic; every time a Linux kernel boots it may encounter different physical devices and thus need different device drivers. Linux allows you to include device drivers at kernel build time via its configuration scripts. When these drivers are initialized at boot time they may not discover any hardware to control. Other drivers can be loaded as kernel modules when they are needed. To cope with this dynamic nature of device drivers, device drivers register themselves with the kernel as they are initialized. Linux maintains tables of registered device drivers as part of its interfaces with them. These tables include pointers to routines and information that support the interface with that class of devices.

Figure 8.1: Character Devices
(The chrdevs vector: each entry holds a pointer to the registered driver's name and to its block of file operations - lseek, read, write, readdir, select, ioctl, mmap, open, release, fsync, fasync, check_media_change and revalidate.)

8.4.1 Character Devices

See include/linux/major.h.
See ext2_read_inode() in fs/ext2/inode.c.
See def_chr_fops in fs/devices.c.
See chrdev_open() in fs/devices.c.

Character devices, the simplest of Linux's devices, are accessed as files; applications use standard system calls to open them, read from them, write to them and close them exactly as if the device were a file. This is true even if the device is a modem being used by the PPP daemon to connect a Linux system onto a network. As a character device is initialized its device driver registers itself with the Linux kernel by adding an entry into the chrdevs vector of device_struct data structures. The device's major device identifier (for example 4 for the tty device) is used as an index into this vector. The major device identifier for a device is fixed. Each entry in the chrdevs vector, a device_struct data structure, contains two elements: a pointer to the name of the registered device driver and a pointer to a block of file operations. This block of file operations is itself the addresses of routines within the character device driver, each of which handles specific file operations such as open, read, write and close. The contents of /proc/devices for character devices is taken from the chrdevs vector.

When a character special file representing a character device (for example /dev/cua0) is opened the kernel must set things up so that the correct character device driver's file operation routines will be called. Just like an ordinary file or directory, each device special file is represented by a VFS inode. The VFS inode for a character special file, indeed for all device special files, contains both the major and minor identifiers for the device. This VFS inode was created by the underlying filesystem, for example EXT2, from information in the real filesystem when the device special file's name was looked up.

Each VFS inode has associated with it a set of file operations and these are different depending on the filesystem object that the inode represents. Whenever a VFS inode representing a character special file is created, its file operations are set to the default character device operations. This has only one file operation, the open file operation. When the character special file is opened by an application the generic open file operation uses the device's major identifier as an index into the chrdevs vector to retrieve the file operations block for this particular device. It also sets up the file data structure describing this character special file, making its file operations pointer point to those of the device driver. Thereafter all of the application's file operations will be mapped to calls to the character device's set of file operations.
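Registration itself is a single call. A sketch of how a character device driver might fill in a file_operations block and add itself to the chrdevs vector; the open routine and the major number 42 are purely illustrative, and the file operations that are not provided are simply left as NULL:

    #include <linux/fs.h>
    #include <linux/errno.h>

    static int mydev_open(struct inode *inode, struct file *file)
    {
        return 0;                                   /* nothing to do in this sketch */
    }

    static struct file_operations mydev_fops;       /* unset operations stay NULL */

    static int mydev_init(void)
    {
        mydev_fops.open = mydev_open;               /* only open is supplied here */

        if (register_chrdev(42, "mydev", &mydev_fops) < 0)
            return -EIO;                            /* the major number was taken */
        return 0;
    }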

Figure 8.2: Buffer Cache Block Device Requests
(The blk_dev vector of blk_dev_struct entries, each holding the address of a request routine, request_fn(), and a current_request list of request data structures; each request points to a chain of buffer_head structures describing the blocks to be read or written.)

8.4.2 Block Devices

Block devices also support being accessed like files. The mechanisms used to provide the correct set of file operations for the opened block special file are very much the same as for character devices. Linux maintains the set of registered block devices as the blkdevs vector. It, like the chrdevs vector, is indexed using the device's major device number. Its entries are also device_struct data structures. Unlike character devices, there are classes of block devices. SCSI devices are one such class and IDE devices are another. It is the class that registers itself with the Linux kernel and provides file operations to the kernel. The device drivers for a class of block device provide class specific interfaces to the class. So, for example, a SCSI device driver has to provide interfaces to the SCSI subsystem which the SCSI subsystem uses to provide file operations for this device to the kernel.

Every block device driver must provide an interface to the buffer cache as well as the normal file operations interface. Each block device driver fills in its entry in the blk_dev vector of blk_dev_struct data structures. The index into this vector is, again, the device's major number. The blk_dev_struct data structure consists of the address of a request routine and a pointer to a list of request data structures, each one representing a request from the buffer cache for the driver to read or write a block of data.

Each time the buffer cache wishes to read or write a block of data to or from a registered device it adds a request data structure onto its blk_dev_struct. Figure 8.2 shows that each request has a pointer to one or more buffer_head data structures, each one a request to read or write a block of data. The buffer_head structures are locked (by the buffer cache) and there may be a process waiting on the block operation to this buffer to complete. Each request structure is allocated from a static list, the all_requests list. If the request is being added to an empty request list, the driver's request function is called to start processing the request queue. Otherwise the driver will simply process every request on the request list.

Once the device driver has completed a request it must remove each of the buffer_head structures from the request structure, mark them as up to date and unlock them.

See fs/devices.c.
See drivers/block/ll_rw_blk.c.
See include/linux/blkdev.h.

This unlocking of the buffer_head will wake up any process that has been sleeping waiting for the block operation to complete. An example of this would be where a file name is being resolved and the EXT2 filesystem must read the block of data that contains the next EXT2 directory entry from the block device that holds the filesystem. The process sleeps on the buffer_head that will contain the directory entry until the device driver wakes it up. The request data structure is marked as free so that it can be used in another block request.
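A block device driver therefore registers twice at initialization time: once with the VFS via register_blkdev() and once with the buffer cache by filling in its blk_dev entry. A rough sketch (the request routine and the major number 42 are illustrative; real drivers normally use the helper macros in drivers/block/blk.h):

    #include <linux/fs.h>
    #include <linux/blkdev.h>
    #include <linux/errno.h>

    static struct file_operations mybd_fops;        /* the file I/O interface, left empty here */

    /* Called by the buffer cache when requests are queued on blk_dev[42]; a real
       driver walks the current_request list, transfers the data and marks each
       request's buffer_heads up to date before unlocking them. */
    static void mybd_request(void)
    {
    }

    static int mybd_init(void)
    {
        if (register_blkdev(42, "mybd", &mybd_fops) < 0)
            return -EIO;

        blk_dev[42].request_fn = mybd_request;      /* the buffer cache interface */
        return 0;
    }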

8.5 Hard Disks

Disk drives provide a more permanent method for storing data, keeping it on spinning disk platters. To write data, a tiny head magnetizes minute particles on the platter's surface. The data is read by a head, which can detect whether a particular minute particle is magnetized.

A disk drive consists of one or more platters, each made of finely polished glass or ceramic composites and coated with a fine layer of iron oxide. The platters are attached to a central spindle and spin at a constant speed that can vary between 3000 and 10,000 RPM depending on the model. Compare this to a floppy disk which only spins at 360 RPM. The disk's read/write heads are responsible for reading and writing data and there is a pair for each platter, one head for each surface. The read/write heads do not physically touch the surface of the platters; instead they float on a very thin (10 millionths of an inch) cushion of air. The read/write heads are moved across the surface of the platters by an actuator. All of the read/write heads are attached together; they all move across the surfaces of the platters together.

Each surface of the platter is divided into narrow, concentric circles called tracks. Track 0 is the outermost track and the highest numbered track is the track closest to the central spindle. A cylinder is the set of all tracks with the same number. So all of the 5th tracks from each side of every platter in the disk is known as cylinder 5. As the number of cylinders is the same as the number of tracks, you often see disk geometries described in terms of cylinders. Each track is divided into sectors. A sector is the smallest unit of data that can be written to or read from a hard disk and it is also the disk's block size. A common sector size is 512 bytes and the sector size was set when the disk was formatted, usually when the disk is manufactured.
A disk is usually described by its geometry: the number of cylinders, heads and sectors. For example, at boot time Linux describes one of my IDE disks as:

hdb: Conner Peripherals 540MB - CFS540A, 516MB w/64kB Cache, CHS=1050/16/63

This means that it has 1050 cylinders (tracks), 16 heads (8 platters) and 63 sectors per track. With a sector, or block, size of 512 bytes this gives the disk a storage capacity of 529,200 Kbytes. This does not exactly match the disk's stated capacity of 516 Mbytes as some of the sectors are used for disk partitioning information. Some disks automatically find bad sectors and re-index the disk to work around them.
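For reference, the arithmetic behind that figure is: 1050 cylinders x 16 heads x 63 sectors = 1,058,400 sectors and, at 512 bytes per sector, 541,900,800 bytes, which is the 529,200 Kbytes quoted above.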
Hard disks can be further subdivided into partitions. A partition is a large group of sectors allocated for a particular purpose. Partitioning a disk allows the disk to be used by several operating systems or for several purposes. A lot of Linux systems have a single disk with three partitions: one containing a DOS filesystem, another an EXT2 filesystem and a third for the swap partition. The partitions of a hard disk are described by a partition table; each entry describes where the partition starts and ends in terms of heads, sectors and cylinder numbers.

Figure 8.3: Linked list of disks
(gendisk_head points at a chain of gendisk data structures - major, major_name, minor_shift, max_p, max_nr, init(), part, sizes, nr_real, real_devices, next - here one for the SCSI disk subsystem (major 8, "sd") and one for the primary IDE controller (major 3, "ide0"); each part field points at an hd_struct[] array of start_sect/nr_sects partition entries.)


For DOS formatted disks, those formatted by fdisk, there are four primary disk partitions. Not all four entries in the partition table have to be used. There are three types of partition supported by fdisk: primary, extended and logical. Extended partitions are not real partitions at all; they contain any number of logical partitions. Extended and logical partitions were invented as a way around the limit of four primary partitions. The following is the output from fdisk for a disk containing two primary partitions:
Disk /dev/sda: 64 heads, 32 sectors, 510 cylinders
Units = cylinders of 2048 * 512 bytes

   Device Boot  Begin  Start  End  Blocks  Id  System
/dev/sda1           1      1  478  489456  83  Linux native
/dev/sda2         479    479  510   32768  82  Linux swap

Expert command (m for help): p

Disk /dev/sda: 64 heads, 32 sectors, 510 cylinders

Nr  AF  Hd Sec Cyl  Hd Sec Cyl   Start    Size  ID
 1  00   1   1   0  63  32 477      32  978912  83
 2  00   0   1 478  63  32 509  978944   65536  82
 3  00   0   0   0   0   0   0       0       0  00
 4  00   0   0   0   0   0   0       0       0  00

This shows that the first partition starts at cylinder or track 0, head 1 and sector 1 and extends to include cylinder 477, sector 32 and head 63. As there are 32 sectors in a track and 64 read/write heads, this partition is a whole number of cylinders in size. fdisk aligns partitions on cylinder boundaries by default. It starts at the outermost cylinder (0) and extends inwards, towards the spindle, for 478 cylinders. The second partition, the swap partition, starts at the next cylinder (478) and extends to the innermost cylinder of the disk.
During initialization Linux maps the topology of the hard disks in the system. It finds out how many hard disks there are and of what type. Additionally, Linux discovers how the individual disks have been partitioned. This is all represented by a list of gendisk data structures pointed at by the gendisk_head list pointer. As each disk subsystem, for example IDE, is initialized it generates gendisk data structures representing the disks that it finds. It does this at the same time as it registers its file operations and adds its entry into the blk_dev data structure. Each gendisk data structure has a unique major device number and these match the major numbers of the block special devices. For example, the SCSI disk subsystem creates a single gendisk entry ("sd") with a major number of 8, the major number of all SCSI disk devices. Figure 8.3 shows two gendisk entries, the first one for the SCSI disk subsystem and the second for an IDE disk controller. This is ide0, the primary IDE controller.

Although the disk subsystems build the gendisk entries during their initialization, they are only used by Linux during partition checking. Instead, each disk subsystem maintains its own data structures which allow it to map device special major and minor device numbers to partitions within physical disks. Whenever a block device is read from or written to, either via the buffer cache or file operations, the kernel directs the operation to the appropriate device using the major device number found in its block special device file (for example /dev/sda2). It is the individual device driver or subsystem that maps the minor device number to the real physical device.

8.5.1 IDE Disks

The most common disks used in Linux systems today are Integrated Drive Electronics or IDE disks. IDE is a disk interface rather than an I/O bus like SCSI. Each IDE controller can support up to two disks, one the master disk and the other the slave disk. The master and slave functions are usually set by jumpers on the disk. The first IDE controller in the system is known as the primary IDE controller, the next the secondary controller and so on. IDE can manage about 3.3 Mbytes per second of data transfer to or from the disk and the maximum IDE disk size is 538 Mbytes. Extended IDE, or EIDE, has raised the disk size to a maximum of 8.6 Gbytes and the data transfer rate up to 16.6 Mbytes per second. IDE and EIDE disks are cheaper than SCSI disks and most modern PCs contain one or more on board IDE controllers.

Linux names IDE disks in the order in which it finds their controllers. The master disk on the primary controller is /dev/hda and the slave disk is /dev/hdb. /dev/hdc is the master disk on the secondary IDE controller. The IDE subsystem registers IDE controllers and not disks with the Linux kernel. The major identifier for the primary IDE controller is 3 and is 22 for the secondary IDE controller. This means that if a system has two IDE controllers there will be entries for the IDE subsystem at indices 3 and 22 in the blk_dev and blkdevs vectors. The block special files for IDE disks reflect this numbering: disks /dev/hda and /dev/hdb, both connected to the primary IDE controller, have a major identifier of 3. Any file or buffer cache operations on these block special files will be directed to the IDE subsystem as the kernel uses the major identifier as an index. When the request is made, it is up to the IDE subsystem to work out which IDE disk the request is for. To do this the IDE subsystem uses the minor device number from the device special identifier; this contains information that allows it to direct the request to the correct partition of the correct disk. The device identifier for /dev/hdb, the slave IDE drive on the primary IDE controller, is (3,64). The device identifier for the first partition of that disk (/dev/hdb1) is (3,65).
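The encoding implied by those identifiers can be sketched as follows; this illustrates the numbering rather than reproducing the IDE driver's own code. Each IDE drive is given 64 minor numbers, with minor 0 of the range meaning the whole disk:

    #include <linux/kernel.h>

    /* Decode an IDE minor number: /dev/hdb1 is device (3,65). */
    static void decode_ide_minor(int minor)
    {
        int unit      = minor >> 6;      /* 65 >> 6 = 1: the slave drive, hdb  */
        int partition = minor & 0x3f;    /* 65 & 63 = 1: first partition, hdb1 */

        printk("drive %d, partition %d\n", unit, partition);
    }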

8.5.2 Initializing the IDE Subsystem

IDE disks have been around for much of the IBM PC's history. Throughout this time the interface to these devices has changed. This makes the initialization of the IDE subsystem more complex than it might at first appear.

The maximum number of IDE controllers that Linux can support is 4. Each controller is represented by an ide_hwif_t data structure in the ide_hwifs vector. Each ide_hwif_t data structure contains two ide_drive_t data structures, one per possible supported master and slave IDE drive. During the initialization of the IDE subsystem, Linux first looks to see if there is information about the disks present in the system's CMOS memory. This is battery backed memory that does not lose its contents when the PC is powered off. This CMOS memory is actually in the system's real time clock device which always runs no matter if your PC is on or off. The CMOS memory locations are set up by the system's BIOS and tell Linux what IDE controllers and drives have been found. Linux retrieves the found disk's geometry from BIOS and uses the information to set up the ide_hwif_t data structure for this drive. More modern PCs use PCI chipsets such as Intel's 82430 VX chipset which includes a PCI EIDE controller. The IDE subsystem uses PCI BIOS callbacks to locate the PCI (E)IDE controllers in the system. It then calls PCI specific interrogation routines for those chipsets that are present.

Once each IDE interface or controller has been discovered, its ide_hwif_t is set up to reflect the controllers and attached disks. During operation the IDE driver writes commands to IDE command registers that exist in the I/O memory space. The default I/O address for the primary IDE controller's control and status registers is 0x1F0 - 0x1F7. These addresses were set by convention in the early days of the IBM PC. The IDE driver registers each controller with the Linux block buffer cache and VFS, adding it to the blk_dev and blkdevs vectors respectively. The IDE driver will also request control of the appropriate interrupt. Again these interrupts are set by convention to be 14 for the primary IDE controller and 15 for the secondary IDE controller. However they, like all IDE details, can be overridden by command line options to the kernel. The IDE driver also adds a gendisk entry into the list of gendisks discovered during boot for each IDE controller found. This list will later be used to discover the partition tables of all of the hard disks found at boot time. The partition checking code understands that IDE controllers may each control two IDE disks.

8.5.3 SCSI Disks

The SCSI (Small Computer System Interface) bus is an efficient peer-to-peer data bus that supports up to eight devices per bus, including one or more hosts. Each device has to have a unique identifier and this is usually set by jumpers on the disks. Data can be transferred synchronously or asynchronously between any two devices on the bus and with 32 bit wide data transfers up to 40 Mbytes per second are possible. The SCSI bus transfers both data and state information between devices, and a single transaction between an initiator and a target can involve up to eight distinct phases. You can tell the current phase of a SCSI bus from five signals from the bus. The eight phases are:

BUS FREE  No device has control of the bus and there are no transactions currently happening,

ARBITRATION  A SCSI device has attempted to get control of the SCSI bus; it does this by asserting its SCSI identifier onto the address pins. The highest numbered SCSI identifier wins.

SELECTION  When a device has succeeded in getting control of the SCSI bus through arbitration it must now signal the target of this SCSI request that it wants to send a command to it. It does this by asserting the SCSI identifier of the target on the address pins.

RESELECTION  SCSI devices may disconnect during the processing of a request. The target may then reselect the initiator. Not all SCSI devices support this phase.

COMMAND  6, 10 or 12 bytes of command can be transferred from the initiator to the target,

DATA IN, DATA OUT  During these phases data is transferred between the initiator and the target,

STATUS  This phase is entered after completion of all commands and allows the target to send a status byte indicating success or failure to the initiator,

MESSAGE IN, MESSAGE OUT  Additional information is transferred between the initiator and the target.

The Linux SCSI subsystem is made up of two basic elements, each of which is represented by data structures:

host  A SCSI host is a physical piece of hardware, a SCSI controller. The NCR810 PCI SCSI controller is an example of a SCSI host. If a Linux system has more than one SCSI controller of the same type, each instance will be represented by a separate SCSI host. This means that a SCSI device driver may control more than one instance of its controller. SCSI hosts are almost always the initiators of SCSI commands.

Device  The most common type of SCSI device is a SCSI disk but the SCSI standard supports several more types: tape, CD-ROM and also a generic SCSI device. SCSI devices are almost always the targets of SCSI commands. These devices must be treated differently; for example, with removable media such as CD-ROMs or tapes, Linux needs to detect if the media was removed. The different disk types have different major device numbers, allowing Linux to direct block device requests to the appropriate SCSI type.

Initializing the SCSI Subsystem

Initializing the SCSI subsystem is quite complex, reflecting the dynamic nature of SCSI buses and their devices. Linux initializes the SCSI subsystem at boot time; it finds the SCSI controllers (known as SCSI hosts) in the system and then probes each of their SCSI buses, finding all of their devices. It then initializes those devices and makes them available to the rest of the Linux kernel via the normal file and buffer cache block device operations. This initialization is done in four phases:

First, Linux finds out which of the SCSI host adapters, or controllers, that were built into the kernel at kernel build time have hardware to control.
Figure 8.4: SCSI Data Structures
(The scsi_hosts list of Scsi_Host_Template entries (one per device driver, for example "Buslogic"), the scsi_hostlist list of Scsi_Host structures (next, this_id, max_id, hostt) each pointing back at its template's device driver routines, and the scsi_devices list of Scsi_Device structures (next, id, type, host), each pointing at the Scsi_Host that owns it.)


Each built in SCSI host has a Scsi_Host_Template entry in the builtin_scsi_hosts vector. The Scsi_Host_Template data structure contains pointers to routines that carry out SCSI host specific actions such as detecting what SCSI devices are attached to this SCSI host. These routines are called by the SCSI subsystem as it configures itself and they are part of the SCSI device driver supporting this host type. Each detected SCSI host, those for which there are real SCSI devices attached, has its Scsi_Host_Template data structure added to the scsi_hosts list of active SCSI hosts. Each instance of a detected host type is represented by a Scsi_Host data structure held in the scsi_hostlist list. For example, a system with two NCR810 PCI SCSI controllers would have two Scsi_Host entries in the list, one per controller. Each Scsi_Host points at the Scsi_Host_Template representing its device driver.

Now that every SCSI host has been discovered, the SCSI subsystem must find out what SCSI devices are attached to each host's bus. SCSI devices are numbered between 0 and 7 inclusively, each device's number or SCSI identifier being unique on the SCSI bus to which it is attached. SCSI identifiers are usually set by jumpers on the device. The SCSI initialization code finds each SCSI device on a SCSI bus by sending it a TEST UNIT READY command. When a device responds, its identification is read by sending it an INQUIRY command. This gives Linux the vendor's name and the device's model and revision names. SCSI commands are represented by a Scsi_Cmnd data structure and these are passed to the device driver for this SCSI host by calling the device driver routines within its Scsi_Host_Template data structure. Every SCSI device that is found is represented by a Scsi_Device data structure, each of which points to its parent Scsi_Host. All of the Scsi_Device data structures are added to the scsi_devices list. Figure 8.4 shows how the main data structures relate to one another.
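To give a feel for the template, here is an abbreviated, illustrative sketch of the kind of fields a Scsi_Host_Template carries; only a handful are shown, the types are simplified, and the real definition lives in the SCSI subsystem headers:

    struct scsi_cmnd;    /* stands in for the kernel's Scsi_Cmnd */

    struct scsi_host_template_sketch {
        const char *name;                                      /* e.g. "Buslogic"                */
        int (*detect)(struct scsi_host_template_sketch *);     /* probe for the hardware         */
        int (*queuecommand)(struct scsi_cmnd *cmd,
                            void (*done)(struct scsi_cmnd *)); /* start a SCSI command           */
        int (*abort)(struct scsi_cmnd *cmd);                   /* abandon a running command      */
        int (*reset)(struct scsi_cmnd *cmd);                   /* reset the device or the bus    */
        int can_queue;                                         /* commands it can have in flight */
        int this_id;                                           /* the host adapter's own SCSI id */
    };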
There are four SCSI device types: disk, tape, CD and generic. Each of these SCSI types is individually registered with the kernel as a different major block device type. However, they will only register themselves if one or more of a given SCSI device type has been found. Each SCSI type, for example SCSI disk, maintains its own tables of devices. It uses these tables to direct kernel block operations (file or buffer cache) to the correct device driver or SCSI host. Each SCSI type is represented by a Scsi_Device_Template data structure. This contains information about this type of SCSI device and the addresses of routines to perform various tasks. The SCSI subsystem uses these templates to call the SCSI type routines for each type of SCSI device. In other words, if the SCSI subsystem wishes to attach a SCSI disk device it will call the SCSI disk type attach routine. The Scsi_Device_Template data structures are added to the scsi_devicelist list if one or more SCSI devices of that type have been detected.

The final phase of the SCSI subsystem initialization is to call the finish functions for each registered Scsi_Device_Template. For the SCSI disk type this spins up all of the SCSI disks that were found and then records their disk geometry. It also adds the gendisk data structure representing all SCSI disks to the linked list of disks shown in Figure 8.3.

Delivering Block Device Requests

Once Linux has initialized the SCSI subsystem, the SCSI devices may be used. Each active SCSI device type registers itself with the kernel so that Linux can direct block device requests to it. There can be buffer cache requests via blk_dev or file operations via blkdevs. Taking a SCSI disk driver that has one or more EXT2 filesystem partitions as an example, how do kernel buffer requests get directed to the right SCSI disk when one of its EXT2 partitions is mounted?

Each request to read or write a block of data to or from a SCSI disk partition results in a new request structure being added to the SCSI disk's current_request list in the blk_dev vector. If the request list is being processed, the buffer cache need not do anything else; otherwise it must nudge the SCSI disk subsystem to go and process its request queue. Each SCSI disk in the system is represented by a Scsi_Disk data structure. These are kept in the rscsi_disks vector that is indexed using part of the SCSI disk partition's minor device number. For example, /dev/sdb1 has a major number of 8 and a minor number of 17; this generates an index of 1. Each Scsi_Disk data structure contains a pointer to the Scsi_Device data structure representing this device. That in turn points at the Scsi_Host data structure which "owns" it. The request data structures from the buffer cache are translated into Scsi_Cmnd structures describing the SCSI command that needs to be sent to the SCSI device and these are queued onto the Scsi_Host structure representing this device. They will be processed by the individual SCSI device driver once the appropriate data blocks have been read or written.

8.6 Network Devices

A network device is, so far as Linux's network subsystem is concerned, an entity that sends and receives packets of data. This is normally a physical device such as an ethernet card. Some network devices though are software only, such as the loopback device which is used for sending data to yourself. Each network device is represented by a device data structure. Network device drivers register the devices that they control with Linux during network initialization at kernel boot time. The device data structure contains information about the device and the addresses of functions that allow the various supported network protocols to use the device's services. These functions are mostly concerned with transmitting data using the network device. The device uses standard networking support mechanisms to pass received data up to the appropriate protocol layer. All network data (packets) transmitted and received are represented by sk_buff data structures; these are flexible data structures that allow network protocol headers to be easily added and removed. How the network protocol layers use the network devices, and how they pass data back and forth using sk_buff data structures, is described in detail in the Networks chapter (Chapter 10). This chapter concentrates on the device data structure and on how network devices are discovered and initialized.
The device data structure contains information about the network device:

Name  Unlike block and character devices, which have their device special files created using the mknod command, network device special files appear spontaneously as the system's network devices are discovered and initialized. Their names are standard, each name representing the type of device that it is. Multiple devices of the same type are numbered upwards from 0. Thus the ethernet devices are known as /dev/eth0, /dev/eth1, /dev/eth2 and so on. Some common network devices are:

    /dev/ethN    Ethernet devices
    /dev/slN     SLIP devices
    /dev/pppN    PPP devices
    /dev/lo      Loopback devices

Bus Information  This is information that the device driver needs in order to control the device. The irq number is the interrupt that this device is using. The base address is the address of any of the device's control and status registers in I/O memory. The DMA channel is the DMA channel number that this network device is using. All of this information is set at boot time as the device is initialized.

Interface Flags  These describe the characteristics and abilities of the network device (see include/linux/netdevice.h):

    IFF_UP            Interface is up and running,
    IFF_BROADCAST     Broadcast address in device is valid
    IFF_DEBUG         Device debugging turned on
    IFF_LOOPBACK      This is a loopback device
    IFF_POINTTOPOINT  This is a point to point link (SLIP and PPP)
    IFF_NOTRAILERS    No network trailers
    IFF_RUNNING       Resources allocated
    IFF_NOARP         Does not support ARP protocol
    IFF_PROMISC       Device in promiscuous receive mode, it will receive
                      all packets no matter who they are addressed to
    IFF_ALLMULTI      Receive all IP multicast frames
    IFF_MULTICAST     Can receive IP multicast frames

Protocol Information  Each device describes how it may be used by the network protocol layers:

    mtu  The size of the largest packet that this network can transmit, not including any link layer headers that it needs to add. This maximum is used by the protocol layers, for example IP, to select suitable packet sizes to send.

    Family  The family indicates the protocol family that the device can support. The family for all Linux network devices is AF_INET, the Internet address family.

    Type  The hardware interface type describes the media that this network device is attached to. There are many different types of media that Linux network devices support. These include Ethernet, X.25, Token Ring, SLIP, PPP and Apple Localtalk.

    Addresses  The device data structure holds a number of addresses that are relevant to this network device, including its IP addresses.

Packet Queue  This is the queue of sk_buff packets queued waiting to be transmitted on this network device,

Support Functions  Each device provides a standard set of routines that protocol layers call as part of their interface to this device's link layer. These include setup and frame transmit routines as well as routines to add standard frame headers and collect statistics. These statistics can be seen using the ifconfig command.
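Put together, a driver's part of the device data structure might be filled in along the following lines. This is an illustrative fragment only: the open, stop and transmit routines named here are hypothetical and most fields are omitted:

    #include <linux/netdevice.h>

    extern int mynet_open(struct device *dev);                      /* hypothetical */
    extern int mynet_stop(struct device *dev);                      /* hypothetical */
    extern int mynet_xmit(struct sk_buff *skb, struct device *dev); /* hypothetical */

    /* Fill in the fields a probe routine typically sets once it has found its hardware. */
    static void mynet_setup(struct device *dev)
    {
        dev->base_addr       = 0x300;         /* the card's control and status registers */
        dev->irq             = 10;            /* the interrupt it will use               */
        dev->dma             = 0;             /* no DMA channel                          */
        dev->open            = mynet_open;    /* called when the interface is brought up */
        dev->stop            = mynet_stop;
        dev->hard_start_xmit = mynet_xmit;    /* transmit one sk_buff                    */
    }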

8.6.1 Initializing Network Devices

Network device drivers can, like other Linux device drivers, be built into the Linux kernel. Each potential network device is represented by a device data structure within the network device list pointed at by the dev_base list pointer. The network layers call one of a number of network device service routines whose addresses are held in the device data structure if they need device specific work performing. Initially though, each device data structure holds only the address of an initialization or probe routine.

There are two problems to be solved for network device drivers. Firstly, not all of the network device drivers built into the Linux kernel will have devices to control. Secondly, the ethernet devices in the system are always called /dev/eth0, /dev/eth1 and so on, no matter what their underlying device drivers are. The problem of "missing" network devices is easily solved. As the initialization routine for each network device is called, it returns a status indicating whether or not it located an instance of the controller that it is driving. If the driver could not find any devices, its entry in the device list pointed at by dev_base is removed. If the driver could find a device, it fills out the rest of the device data structure with information about the device and the addresses of the support functions within the network device driver.

The second problem, that of dynamically assigning ethernet devices to the standard /dev/ethN device special files, is solved more elegantly. There are eight standard entries in the devices list; one for eth0, eth1 and so on to eth7. The initialization routine is the same for all of them; it tries each ethernet device driver built into the kernel in turn until one finds a device. When the driver finds its ethernet device it fills out the ethN device data structure, which it now owns. It is also at this time that the network device driver initializes the physical hardware that it is controlling and works out which IRQ it is using, which DMA channel (if any) and so on. A driver may find several instances of the network device that it is controlling and, in this case, it will take over several of the /dev/ethN device data structures. Once all eight standard /dev/ethN have been allocated, no more ethernet devices will be probed for.

Chapter 9

The File system

This chapter describes how the Linux kernel maintains the files in the file systems that it supports. It describes the Virtual File System (VFS) and explains how the Linux kernel's real file systems are supported.

One of the most important features of Linux is its support for many different file systems. This makes it very flexible and well able to coexist with many other operating systems. At the time of writing, Linux supports 15 file systems: ext, ext2, xia, minix, umsdos, msdos, vfat, proc, smb, ncp, iso9660, sysv, hpfs, affs and ufs, and no doubt, over time more will be added.

In Linux, as it is for UnixTM, the separate file systems the system may use are not accessed by device identifiers (such as a drive number or a drive name) but instead they are combined into a single hierarchical tree structure that represents the file system as one whole single entity. Linux adds each new file system into this single file system tree as it is mounted. All file systems, of whatever type, are mounted onto a directory and the files of the mounted file system cover up the existing contents of that directory. This directory is known as the mount directory or mount point. When the file system is unmounted, the mount directory's own files are once again revealed.

When disks are initialized (using fdisk, say) they have a partition structure imposed on them that divides the physical disk into a number of logical partitions. Each partition may hold a single file system, for example an EXT2 file system. File systems organize files into logical hierarchical structures with directories, soft links and so on held in blocks on physical devices. Devices that can contain file systems are known as block devices. The IDE disk partition /dev/hda1, the first partition of the first IDE disk drive in the system, is a block device. The Linux file systems regard these block devices as simply linear collections of blocks; they do not know or care about the underlying physical disk's geometry. It is the task of each block device driver to map a request to read a particular block of its device into terms meaningful to its device: the particular track, sector and cylinder of its hard disk where the block is kept. A file system has to look, feel and operate in the same way no matter what device is holding it. Moreover, using Linux's file systems, it does not matter (at least to the system user) that these different file systems are on different physical media controlled by different hardware controllers. The file system might not even be on the local system, it could just as well be a disk remotely mounted over a network link. Consider the following example where a Linux system has its root file system on a SCSI disk:
    A       C       D       E       F
    bin     boot    cdrom   dev     etc
    fd      home    lib     mnt     opt
    proc    root    sbin    tmp     usr
    var     lost+found
Neither the users nor the programs that operate on the files themselves need know that /C is in fact a mounted VFAT file system that is on the first IDE disk in the system. In the example (which is actually my home Linux system), /E is the master IDE disk on the second IDE controller. It does not matter either that the first IDE controller is a PCI controller and that the second is an ISA controller which also controls the IDE CDROM. I can dial into the network where I work using a modem and the PPP network protocol and in this case I can remotely mount my Alpha AXP Linux system's file systems on /mnt/remote.

The files in a file system are collections of data; the file holding the sources to this chapter is an ASCII file called filesystems.tex. A file system not only holds the data that is contained within the files of the file system but also the structure of the file system. It holds all of the information that Linux users and processes see as files, directories, soft links, file protection information and so on. Moreover it must hold that information safely and securely; the basic integrity of the operating system depends on its file systems. Nobody would use an operating system that randomly lost data and files(1).

Minix, the first file system that Linux had, is rather restrictive and lacking in performance. Its filenames cannot be longer than 14 characters (which is still better than 8.3 filenames) and the maximum file size is 64 Mbytes. 64 Mbytes might at first glance seem large enough but large file sizes are necessary to hold even modest databases. The first file system designed specifically for Linux, the Extended File system, or EXT, was introduced in April 1992 and cured a lot of the problems but it was still felt to lack performance. So, in 1993, the Second Extended File system, or EXT2, was added. It is this file system that is described in detail later on in this chapter.

An important development took place when the EXT file system was added into Linux. The real file systems were separated from the operating system and system services by an interface layer known as the Virtual File system, or VFS. VFS allows Linux to support many, often very different, file systems, each presenting a common software interface to the VFS. All of the details of the Linux file systems are translated by software so that all file systems appear identical to the rest of the Linux kernel and to programs running in the system. Linux's Virtual File system layer allows you to transparently mount the many different file systems at the same time.

The Linux Virtual File system is implemented so that access to its files is as fast and efficient as possible. It must also make sure that the files and their data are kept correctly.

(1) Well, not knowingly, although I have been bitten by operating systems with more lawyers than Linux has developers.


Figure 9.1: Physical Layout of the EXT2 File system
(The partition is divided into Block Groups 0 to N; each Block Group holds a copy of the Super Block and the Group Descriptors, followed by the Block Bitmap, the Inode Bitmap, the Inode Table and the Data Blocks.)


These two requirements can be at odds with each other. The Linux VFS caches information in memory from each file system as it is mounted and used. A lot of care must be taken to update the file system correctly as data within these caches is modified as files and directories are created, written to and deleted. If you could see the file system's data structures within the running kernel, you would be able to see data blocks being read and written by the file system. Data structures, describing the files and directories being accessed, would be created and destroyed and all the time the device drivers would be working away, fetching and saving data. The most important of these caches is the Buffer Cache, which is integrated into the way that the individual file systems access their underlying block devices. As blocks are accessed they are put into the Buffer Cache and kept on various queues depending on their states. The Buffer Cache not only caches data buffers, it also helps manage the asynchronous interface with the block device drivers.

9.1 The Second Extended File system (EXT2)


The Second Extended File system was devised (by Remy Card) as an extensible and powerful file system for Linux. It is also the most successful file system so far in the Linux community and is the basis for all of the currently shipping Linux distributions.

The EXT2 file system, like a lot of file systems, is built on the premise that the data held in files is kept in data blocks. These data blocks are all of the same length and, although that length can vary between different EXT2 file systems, the block size of a particular EXT2 file system is set when it is created (using mke2fs). Every file's size is rounded up to an integral number of blocks. If the block size is 1024 bytes, then a file of 1025 bytes will occupy two 1024 byte blocks. Unfortunately this means that on average you waste half a block per file. Usually in computing you trade off CPU usage for memory and disk space utilisation. In this case Linux, along with most operating systems, trades off a relatively inefficient disk usage in order to reduce the workload on the CPU. Not all of the blocks in the file system hold data; some must be used to contain the information that describes the structure of the file system. EXT2 defines the file system topology by describing each file in the system with an inode data structure. An inode describes which blocks the data within a file occupies as well as the access rights of the file, the file's modification times and the type of the file. Every file in the EXT2 file system is described by a single inode and each inode has a single unique number identifying it. The inodes for the file system are all kept together in inode tables.

See fs/ext2/*

[Figure 9.2: EXT2 Inode - mode, owner info, size and timestamps, followed by twelve direct block pointers and the indirect, double indirect and triple indirect block pointers leading to the data blocks]


EXT2 directories are simply special files (themselves described by inodes) which contain pointers to the inodes of their directory entries.

Figure 9.1 shows the layout of the EXT2 file system as occupying a series of blocks in a block structured device. So far as each file system is concerned, block devices are just a series of blocks that can be read and written. A file system does not need to concern itself with where on the physical media a block should be put; that is the job of the device's driver. Whenever a file system needs to read information or data from the block device containing it, it requests that its supporting device driver reads an integral number of blocks. The EXT2 file system divides the logical partition that it occupies into Block Groups. Each group duplicates information critical to the integrity of the file system as well as holding real files and directories as blocks of information and data. This duplication is necessary should a disaster occur and the file system need recovering. The following subsections describe in more detail the contents of each Block Group.

9.1.1 The EXT2 Inode

See include/linux/ext2_fs_i.h

In the EXT2 file system, the inode is the basic building block; every file and directory in the file system is described by one and only one inode. The EXT2 inodes for each Block Group are kept in the inode table together with a bitmap that allows the system to keep track of allocated and unallocated inodes. Figure 9.2 shows the format of an EXT2 inode; amongst other information, it contains the following fields:

mode This holds two pieces of information; what this inode describes and the permissions that users have to it. For EXT2, an inode can describe one of file, directory, symbolic link, block device, character device or FIFO,

Owner Information The user and group identifiers of the owners of this file or directory. This allows the file system to correctly allow the right sort of accesses,

Size The size of the file in bytes,

Timestamps The time that the inode was created and the last time that it was modified,

Datablocks Pointers to the blocks that contain the data that this inode is describing. The first twelve are pointers to the physical blocks containing the data described by this inode and the last three pointers contain more and more levels of indirection. For example, the double indirect blocks pointer points at a block of pointers to blocks of pointers to data blocks. This means that files less than or equal to twelve data blocks in length are more quickly accessed than larger files.

You should note that EXT2 inodes can describe special device files. These are not real files but handles that programs can use to access devices. All of the device files in /dev are there to allow programs to access Linux's devices. For example the mount program takes as an argument the device file that it wishes to mount.
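To make the shape of this information concrete, here is a minimal C sketch of an EXT2-style inode with twelve direct block pointers and three levels of indirection. It is an illustration only; the kernel's real EXT2 inode declaration has more fields and different names.

#include <stdint.h>

#define EXT2_NDIR_BLOCKS 12   /* direct block pointers              */
#define EXT2_N_BLOCKS    15   /* plus indirect, double and triple   */

/* A simplified on-disk inode, modelled on the fields listed above. */
struct ext2_inode_sketch {
    uint16_t mode;                      /* file type and permissions */
    uint16_t uid, gid;                  /* owner information         */
    uint32_t size;                      /* size of the file in bytes */
    uint32_t atime, ctime, mtime;       /* timestamps                */
    uint32_t block[EXT2_N_BLOCKS];      /* block[0..11] are direct,
                                           block[12] indirect,
                                           block[13] double indirect,
                                           block[14] triple indirect */
};

/* Return the block holding byte offset 'pos' of the file, but only
 * for offsets covered by the twelve direct pointers; deeper offsets
 * need the indirection chains.                                      */
static uint32_t direct_block_for(const struct ext2_inode_sketch *inode,
                                 uint32_t pos, uint32_t block_size)
{
    uint32_t index = pos / block_size;
    return (index < EXT2_NDIR_BLOCKS) ? inode->block[index] : 0;
}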

9.1.2 The EXT2 Superblock


The Superblock contains a description of the basic size and shape of this file system. The information within it allows the file system manager to use and maintain the file system. Usually only the Superblock in Block Group 0 is read when the file system is mounted but each Block Group contains a duplicate copy in case of file system corruption. Amongst other information it holds the:

See include/linux/ext2_fs_sb.h

Magic Number This allows the mounting software to check that this is indeed the Superblock for an EXT2 file system. For the current version of EXT2 this is 0xEF53.

Revision Level The major and minor revision levels allow the mounting code to determine whether or not this file system supports features that are only available in particular revisions of the file system. There are also feature compatibility fields which help the mounting code to determine which new features can safely be used on this file system,

Mount Count and Maximum Mount Count Together these allow the system to determine if the file system should be fully checked. The mount count is incremented each time the file system is mounted and when it equals the maximum mount count the warning message "maximal mount count reached, running e2fsck is recommended" is displayed,

Block Group Number The Block Group number that holds this copy of the Superblock,

Block Size The size of the block for this file system in bytes, for example 1024 bytes,

Blocks per Group The number of blocks in a group. Like the block size this is fixed when the file system is created,

Free Blocks The number of free blocks in the file system,

Free Inodes The number of free Inodes in the file system,

First Inode This is the inode number of the first inode in the file system. The first inode in an EXT2 root file system would be the directory entry for the '/' directory.
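The same fields can be pictured as a small structure. This is a cut-down sketch for illustration, not the kernel's actual EXT2 superblock declaration, and the field names are mine; the mount-time check at the end is the "maximal mount count" test described above in miniature.

#include <stdint.h>

#define EXT2_SUPER_MAGIC 0xEF53          /* the magic number quoted above */

struct ext2_super_sketch {
    uint16_t magic;                      /* must be EXT2_SUPER_MAGIC      */
    uint32_t rev_level;                  /* revision level                */
    uint16_t mount_count;                /* bumped on every mount         */
    uint16_t max_mount_count;            /* force a check when reached    */
    uint16_t block_group_nr;             /* group holding this copy       */
    uint32_t block_size;                 /* block size in bytes, e.g. 1024 */
    uint32_t blocks_per_group;           /* fixed when the fs is created  */
    uint32_t free_blocks_count;          /* free blocks in the file system */
    uint32_t free_inodes_count;          /* free inodes in the file system */
    uint32_t first_inode;                /* first usable inode number     */
};

/* Does this superblock describe an EXT2 file system that is due a
 * full check (e2fsck) the next time it is mounted?                 */
static int superblock_wants_check(const struct ext2_super_sketch *sb)
{
    return sb->magic == EXT2_SUPER_MAGIC &&
           sb->mount_count >= sb->max_mount_count;
}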

9.1.3 The EXT2 Group Descriptor


See ext2_group_desc in include/linux/ext2_fs.h

Each Block Group has a data structure describing it. Like the Superblock, all the group descriptors for all of the Block Groups are duplicated in each Block Group in case of file system corruption. Each Group Descriptor contains the following information:

Blocks Bitmap The block number of the block allocation bitmap for this Block Group. This is used during block allocation and deallocation,

Inode Bitmap The block number of the inode allocation bitmap for this Block Group. This is used during inode allocation and deallocation,

Inode Table The block number of the starting block for the inode table for this Block Group. Each inode is represented by the EXT2 inode data structure described above,

Free blocks count, Free Inodes count, Used directory count

The group descriptors are placed one after another and together they make the group descriptor table. Each Block Group contains the entire table of group descriptors after its copy of the Superblock. Only the first copy (in Block Group 0) is actually used by the EXT2 file system. The other copies are there, like the copies of the Superblock, in case the main copy is corrupted.
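A sketch of one such descriptor, together with the arithmetic that turns an inode number into a Block Group and a slot in that group's inode table, may help. The field names are illustrative and the real ext2_group_desc differs in detail; the inodes_per_group value would come from the superblock.

#include <stdint.h>

/* One descriptor per Block Group, modelled on the fields listed above. */
struct ext2_group_desc_sketch {
    uint32_t block_bitmap;       /* block number of the block bitmap     */
    uint32_t inode_bitmap;       /* block number of the inode bitmap     */
    uint32_t inode_table;        /* first block of the inode table       */
    uint16_t free_blocks_count;  /* free blocks in this group            */
    uint16_t free_inodes_count;  /* free inodes in this group            */
    uint16_t used_dirs_count;    /* directories held in this group       */
};

/* EXT2 inode numbers start at 1; work out which Block Group holds
 * inode 'ino' and where it sits within that group's inode table.       */
static void locate_inode(uint32_t ino, uint32_t inodes_per_group,
                         uint32_t *group, uint32_t *index)
{
    *group = (ino - 1) / inodes_per_group;
    *index = (ino - 1) % inodes_per_group;
}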

9.1.4 EXT2 Directories


See ext2_dir_entry in include/linux/ext2_fs.h

In the EXT2 file system, directories are special files that are used to create and hold access paths to the files in the file system. Figure 9.3 shows the layout of a directory entry in memory. A directory file is a list of directory entries, each one containing the following information:

inode The inode for this directory entry. This is an index into the array of inodes held in the Inode Table of the Block Group. In figure 9.3, the directory entry for the file called file has a reference to inode number i1,

name length The length of this directory entry in bytes,

name The name of this directory entry.

The first two entries for every directory are always the standard "." and ".." entries meaning "this directory" and "the parent directory" respectively.

[Figure 9.3: EXT2 Directory - directory entries for "file" and "very_long_name", each holding an inode number, an entry length and a name length, referencing inodes i1 and i2 in the inode table]

9.1.5 Finding a File in an EXT2 File System


A Linux filename has the same format as all UnixTM filenames have. It is a series of directory names separated by forward slashes ("/") and ending in the file's name. One example filename would be /home/rusling/.cshrc where /home and /rusling are directory names and the file's name is .cshrc. Like all other UnixTM systems, Linux does not care about the format of the filename itself; it can be any length and consist of any of the printable characters. To find the inode representing this file within an EXT2 file system the system must parse the filename a directory at a time until it gets to the file itself.

The first inode we need is the inode for the root of the file system and we find its number in the file system's superblock. To read an EXT2 inode we must look for it in the inode table of the appropriate Block Group. If, for example, the root inode number is 42, then we need the 42nd inode from the inode table of Block Group 0. The root inode is for an EXT2 directory; in other words the mode of the root inode describes it as a directory and its data blocks contain EXT2 directory entries. home is just one of the many directory entries and this directory entry gives us the number of the inode describing the /home directory. We have to read this directory (by first reading its inode and then reading the directory entries from the data blocks described by its inode) to find the rusling entry which gives us the number of the inode describing the /home/rusling directory. Finally we read the directory entries pointed at by the inode describing the /home/rusling directory to find the inode number of the .cshrc file and from this we get the data blocks containing the information in the file.
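The walk just described can be sketched as a loop. The toy program below is not EXT2 code at all; it models directories as small in-memory tables and invents its own inode numbers, but the control flow (look the name up in the current directory, move to the inode it names, repeat) is the same.

#include <stdio.h>
#include <string.h>

/* A toy, purely in-memory model: inode 2 is "/", which contains "home";
 * "home" contains "rusling"; and that directory contains ".cshrc".     */
struct entry { const char *name; unsigned ino; };
struct inode { int is_dir; struct entry entries[3]; };

static struct inode inodes[8] = {
    [2] = { 1, { { "home",    3 } } },       /* the root directory      */
    [3] = { 1, { { "rusling", 4 } } },       /* /home                   */
    [4] = { 1, { { ".cshrc",  5 } } },       /* /home/rusling           */
    [5] = { 0, { { 0, 0 } } },               /* the file itself         */
};

/* Scan one directory for a name, much as EXT2 scans a directory's
 * data blocks, and return the inode number (0 means not found).        */
static unsigned find_entry(const struct inode *dir, const char *name)
{
    for (int i = 0; i < 3 && dir->entries[i].name; i++)
        if (strcmp(dir->entries[i].name, name) == 0)
            return dir->entries[i].ino;
    return 0;
}

int main(void)
{
    char path[] = "home/rusling/.cshrc";     /* relative to "/"         */
    unsigned ino = 2;                        /* root inode in this toy  */

    for (char *part = strtok(path, "/"); part; part = strtok(NULL, "/")) {
        ino = find_entry(&inodes[ino], part);
        if (ino == 0) {
            puts("not found");
            return 1;
        }
    }
    printf("found inode %u\n", ino);
    return 0;
}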

9.1.6 Changing the Size of a File in an EXT2 File System


One common problem with a file system is its tendency to fragment. The blocks that hold the file's data get spread all over the file system and this makes sequentially accessing the data blocks of a file more and more inefficient the further apart the data blocks are. The EXT2 file system tries to overcome this by allocating the new blocks for a file physically close to its current data blocks or at least in the same Block Group as its current data blocks. Only when this fails does it allocate data blocks in another Block Group.

Whenever a process attempts to write data into a file the Linux file system checks to see if the data has gone off the end of the file's last allocated block. If it has, then it must allocate a new data block for this file. Until the allocation is complete, the process cannot run; it must wait for the file system to allocate a new data block and write the rest of the data to it before it can continue. The first thing that the EXT2 block allocation routines do is to lock the EXT2 Superblock for this file system. Allocating and deallocating changes fields within the superblock, and the Linux file system cannot allow more than one process to do this at the same time. If another process needs to allocate more data blocks, it will have to wait until this process has finished. Processes waiting for the superblock are suspended, unable to run, until control of the superblock is relinquished by its current user. Access to the superblock is granted on a first come, first served basis and once a process has control of the superblock, it keeps control until it has finished. Having locked the superblock, the process checks that there are enough free blocks left in this file system. If there are not enough free blocks, then this attempt to allocate more will fail and the process will relinquish control of this file system's superblock.
See ext2_new_block() in fs/ext2/balloc.c
If there are enough free blocks in the file system, the process tries to allocate one. If the EXT2 file system has been built to preallocate data blocks then we may be able to take one of those. The preallocated blocks do not actually exist, they are just reserved within the allocated block bitmap. The VFS inode representing the file that we are trying to allocate a new data block for has two EXT2 specific fields, prealloc_block and prealloc_count, which are the block number of the first preallocated data block and how many of them there are, respectively. If there were no preallocated blocks or block preallocation is not enabled, the EXT2 file system must allocate a new block. The EXT2 file system first looks to see if the data block after the last data block in the file is free. Logically, this is the most efficient block to allocate as it makes sequential accesses much quicker. If this block is not free, then the search widens and it looks for a data block within 64 blocks of the ideal block. This block, although not ideal, is at least fairly close and within the same Block Group as the other data blocks belonging to this file.

If even that block is not free, the process starts looking in all of the other Block Groups in turn until it finds some free blocks. The block allocation code looks for a cluster of eight free data blocks somewhere in one of the Block Groups. If it cannot find eight together, it will settle for less. If block preallocation is wanted and enabled it will update prealloc_block and prealloc_count accordingly.

Wherever it finds the free block, the block allocation code updates the Block Group's block bitmap and allocates a data buffer in the buffer cache. That data buffer is uniquely identified by the file system's supporting device identifier and the block number of the allocated block. The data in the buffer is zeroed and the buffer is marked as "dirty" to show that its contents have not been written to the physical disk.

[Figure 9.4: A Logical Diagram of the Virtual File System - the VFS, with its Inode Cache and Directory Cache, sits above the real file systems (MINIX, EXT2), which in turn use the Buffer Cache and the disk drivers]


Finally, the superblock itself is marked as "dirty" to show that it has been changed and it is unlocked. If there were any processes waiting for the superblock, the first one in the queue is allowed to run again and will gain exclusive control of the superblock for its file operations. The process's data is written to the new data block and, if that data block is filled, the entire process is repeated and another data block is allocated.
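The search order described above (the goal block first, then a small window beyond it, then the other Block Groups) can be sketched against a group's block bitmap. This is only an outline of the idea; the real ext2_new_block() also deals with preallocation, superblock locking and the cluster-of-eight search.

#include <stdint.h>

#define WINDOW 64    /* how far past the goal the narrow search looks    */

/* Is block 'n' free in this group's block allocation bitmap?
 * One bit per block, a set bit meaning "in use".                        */
static int block_is_free(const uint8_t *bitmap, uint32_t n)
{
    return !(bitmap[n / 8] & (1u << (n % 8)));
}

/* Try the goal block (normally the block just after the file's last
 * block), then a small window beyond it.  Returns a block number or -1,
 * in which case the caller falls back to scanning the other groups.     */
static long find_block_near(const uint8_t *bitmap,
                            uint32_t blocks_in_group, uint32_t goal)
{
    if (goal < blocks_in_group && block_is_free(bitmap, goal))
        return goal;

    for (uint32_t b = goal + 1;
         b < blocks_in_group && b <= goal + WINDOW; b++)
        if (block_is_free(bitmap, b))
            return b;

    return -1;
}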

9.2 The Virtual File System (VFS)


Figure 9.4 shows the relationship between the Linux kernel's Virtual File System and its real file systems. The virtual file system must manage all of the different file systems that are mounted at any given time. To do this it maintains data structures that describe the whole (virtual) file system and the real, mounted, file systems. Rather confusingly, the VFS describes the system's files in terms of superblocks and inodes in much the same way as the EXT2 file system uses superblocks and inodes. Like the EXT2 inodes, the VFS inodes describe files and directories within the system; the contents and topology of the Virtual File System. From now on, to avoid confusion, I will write about VFS inodes and VFS superblocks to distinguish them from EXT2 inodes and superblocks.
See fs/*

As each file system is initialised, it registers itself with the VFS. This happens as the operating system initialises itself at system boot time. The real file systems are either built into the kernel itself or are built as loadable modules. File system modules are loaded as the system needs them, so, for example, if the VFAT file system is implemented as a kernel module, then it is only loaded when a VFAT file system is mounted. When a block device based file system is mounted, and this includes the root file system, the VFS must read its superblock. Each file system type's superblock read routine must work out the file system's topology and map that information onto a VFS superblock data structure. The VFS keeps a list of the mounted file systems in the system together with their VFS superblocks. Each VFS superblock contains information and pointers to routines that perform particular functions. So, for example, the superblock representing a mounted EXT2 file system contains a pointer to the EXT2 specific inode reading routine. This EXT2 inode read routine, like all of the file system specific inode read routines, fills out the fields in a VFS inode. Each VFS superblock contains a pointer to the first VFS inode on the file system. For the root file system, this is the inode that represents the "/" directory. This mapping of information is very efficient for the EXT2 file system but moderately less so for other file systems.
See fs/inode.c, fs/buffer.c and fs/dcache.c

As the system's processes access directories and files, system routines are called that traverse the VFS inodes in the system. For example, typing ls for a directory or cat for a file causes the Virtual File System to search through the VFS inodes that represent the file system. As every file and directory on the system is represented by a VFS inode, a number of inodes will be repeatedly accessed. These inodes are kept in the inode cache which makes access to them quicker. If an inode is not in the inode cache, then a file system specific routine must be called in order to read the appropriate inode. The action of reading the inode causes it to be put into the inode cache and further accesses to the inode keep it in the cache. The less used VFS inodes get removed from the cache.
All of the Linux file systems use a common buffer cache to cache data buffers from the underlying devices to help speed up access by all of the file systems to the physical devices holding the file systems. This buffer cache is independent of the file systems and is integrated into the mechanisms that the Linux kernel uses to allocate and read and write data buffers. It has the distinct advantage of making the Linux file systems independent from the underlying media and from the device drivers that support them. All block structured devices register themselves with the Linux kernel and present a uniform, block based, usually asynchronous interface. Even relatively complex block devices such as SCSI devices do this. As the real file systems read data from the underlying physical disks, this results in requests to the block device drivers to read physical blocks from the device that they control. Integrated into this block device interface is the buffer cache. As blocks are read by the file systems they are saved in the global buffer cache shared by all of the file systems and the Linux kernel. Buffers within it are identified by their block number and a unique identifier for the device that read it. So, if the same data is needed often, it will be retrieved from the buffer cache rather than read from the disk, which would take somewhat longer. Some devices support read ahead where data blocks are speculatively read just in case they are needed.
The VFS also keeps a cache of directory lookups so that the inodes for frequently used directories can be quickly found. As an experiment, try listing a directory that you have not listed recently. The first time you list it, you may notice a slight pause but the second time you list its contents the result is immediate. The directory cache does not store the inodes for the directories itself; these should be in the inode cache, the directory cache simply stores the mapping between the full directory names and their inode numbers.

9.2.1 The VFS Superblock


Every mounted file system is represented by a VFS superblock; amongst other information, the VFS superblock contains the:

See include/linux/fs.h

Device This is the device identifier for the block device that this file system is contained in. For example, /dev/hda1, the first IDE hard disk in the system, has a device identifier of 0x301,

Inode pointers The mounted inode pointer points at the first inode in this file system. The covered inode pointer points at the inode representing the directory that this file system is mounted on. The root file system's VFS superblock does not have a covered pointer,

Blocksize The block size in bytes of this file system, for example 1024 bytes,

Superblock operations A pointer to a set of superblock routines for this file system. Amongst other things, these routines are used by the VFS to read and write inodes and superblocks,

File System type A pointer to the mounted file system's file_system_type data structure,

File System specific A pointer to information needed by this file system,
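The "superblock operations" entry is simply a table of function pointers that each file system fills in with its own routines, so the VFS can call through them without knowing which file system it is talking to. The sketch below uses illustrative field names rather than the kernel's exact super_operations layout; s_dev, s_blocksize, s_mounted and s_covered are the fields shown in figure 9.6.

/* A sketch of a table of superblock routines and of the VFS superblock
 * that points at it.  Field names are illustrative.                    */
struct vfs_inode;
struct vfs_superblock;

struct super_ops_sketch {
    void (*read_inode)(struct vfs_inode *inode);
    void (*write_inode)(struct vfs_inode *inode);
    void (*put_inode)(struct vfs_inode *inode);
    void (*write_super)(struct vfs_superblock *sb);
    void (*put_super)(struct vfs_superblock *sb);
};

struct vfs_superblock {
    unsigned short                 s_dev;       /* e.g. 0x0301           */
    unsigned long                  s_blocksize; /* e.g. 1024             */
    const struct super_ops_sketch *s_op;        /* file system routines  */
    struct vfs_inode              *s_mounted;   /* root inode            */
    struct vfs_inode              *s_covered;   /* mount point inode     */
};

/* The VFS reads an inode by calling whatever routine the mounted file
 * system registered, for example EXT2's inode reading routine.         */
static void vfs_read_inode(struct vfs_superblock *sb,
                           struct vfs_inode *inode)
{
    if (sb->s_op && sb->s_op->read_inode)
        sb->s_op->read_inode(inode);
}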

9.2.2 The VFS Inode


Like the EXT2 file system, every file, directory and so on in the VFS is represented by one and only one VFS inode. The information in each VFS inode is built from information in the underlying file system by file system specific routines. VFS inodes exist only in the kernel's memory and are kept in the VFS inode cache as long as they are useful to the system. Amongst other information, VFS inodes contain the following fields:

See include/linux/fs.h

device This is the device identifier of the device holding the file (or whatever else this VFS inode represents),

inode number This is the number of the inode and is unique within this file system. The combination of device and inode number is unique within the Virtual File System,

mode Like EXT2, this field describes what this VFS inode represents as well as access rights to it,

user ids The owner identifiers,

times The creation, modification and write times,

block size The size of a block for this file in bytes, for example 1024 bytes,

inode operations A pointer to a block of routine addresses. These routines are specific to the file system and they perform operations for this inode, for example, truncate the file that is represented by this inode,

count The number of system components currently using this VFS inode. A count of zero means that the inode is free to be discarded or reused,
lock This field is used to lock the VFS inode, for example, when it is being read from the file system,

dirty Indicates whether this VFS inode has been written to; if so, the underlying file system will need modifying,

file system specific information

[Figure 9.5: Registered File Systems - the file_systems pointer heads a list of file_system_type data structures (for example "ext2", "proc" and "iso9660"), each holding a *read_super() routine, a name and a requires_dev flag, linked by next pointers]

9.2.3 Registering the File Systems


See sys_setup() in fs/filesystems.c

See file_system_type in include/linux/fs.h

When you build the Linux kernel you are asked if you want each of the supported file systems. When the kernel is built, the file system startup code contains calls to the initialisation routines of all of the built in file systems. Linux file systems may also be built as modules and, in this case, they may be demand loaded as they are needed or loaded by hand using insmod. Whenever a file system module is loaded it registers itself with the kernel and unregisters itself when it is unloaded. Each file system's initialisation routine registers itself with the Virtual File System and is represented by a file_system_type data structure which contains the name of the file system and a pointer to its VFS superblock read routine. Figure 9.5 shows that the file_system_type data structures are put into a list pointed at by the file_systems pointer. Each file_system_type data structure contains the following information:

Superblock read routine This routine is called by the VFS when an instance of the file system is mounted,

File System name The name of this file system, for example ext2,

Device needed Does this file system need a device to support it? Not all file systems need a device to hold them. The /proc file system, for example, does not require a block device.
You can see which file systems are registered by looking at /proc/filesystems. For example:

      ext2
nodev proc
      iso9660
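Registration therefore amounts to linking a small descriptor onto the file_systems list. The sketch below models that list and prints it in roughly the /proc/filesystems format shown above; register_fs() is a stand-in for the kernel's own registration call, and the structure is modelled on figure 9.5 rather than copied from the kernel headers.

#include <stdio.h>

struct super_block;                         /* opaque in this sketch    */

/* Modelled on the file_system_type entries shown in Figure 9.5.        */
struct fs_type_sketch {
    struct super_block *(*read_super)(struct super_block *, void *, int);
    const char *name;                       /* "ext2", "proc", ...      */
    int requires_dev;                       /* needs a block device?    */
    struct fs_type_sketch *next;            /* the singly linked list   */
};

static struct fs_type_sketch *file_systems; /* head of the list         */

/* Registering a file system just pushes its descriptor onto the list.  */
static void register_fs(struct fs_type_sketch *fs)
{
    fs->next = file_systems;
    file_systems = fs;
}

/* Roughly what reading /proc/filesystems prints.                       */
static void print_filesystems(void)
{
    for (struct fs_type_sketch *fs = file_systems; fs; fs = fs->next)
        printf("%s%s\n", fs->requires_dev ? "      " : "nodev ", fs->name);
}

int main(void)
{
    static struct fs_type_sketch iso9660 = { 0, "iso9660", 1, 0 };
    static struct fs_type_sketch proc    = { 0, "proc",    0, 0 };
    static struct fs_type_sketch ext2    = { 0, "ext2",    1, 0 };

    register_fs(&iso9660);
    register_fs(&proc);
    register_fs(&ext2);
    print_filesystems();
    return 0;
}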

9.2.4 Mounting a File System


When the superuser attempts to mount a file system, the Linux kernel must first validate the arguments passed in the system call. Although mount does some basic checking, it does not know which file systems this kernel has been built to support or that the proposed mount point actually exists. Consider the following mount command:

$ mount -t iso9660 -o ro /dev/cdrom /mnt/cdrom

This mount command will pass the kernel three pieces of information; the name of the file system, the physical block device that contains the file system and, thirdly, where in the existing file system topology the new file system is to be mounted.

See do_mount() and get_fs_type() in fs/super.c

The first thing that the Virtual File System must do is to find the file system. To do this it searches through the list of known file systems by looking at each file_system_type data structure in the list pointed at by file_systems. If it finds a matching name it now knows that this file system type is supported by this kernel and it has the address of the file system specific routine for reading this file system's superblock. If it cannot find a matching file system name then all is not lost if the kernel is built to demand load kernel modules (see Chapter 12). In this case the kernel will request that the kernel daemon loads the appropriate file system module before continuing as before.
Next, if the physical device passed by mount is not already mounted, it must find the VFS inode of the directory that is to be the new file system's mount point. This VFS inode may be in the inode cache or it might have to be read from the block device supporting the file system of the mount point. Once the inode has been found it is checked to see that it is a directory and that there is not already some other file system mounted there. The same directory cannot be used as a mount point for more than one file system.

At this point the VFS mount code must allocate a VFS superblock and pass it, together with the mount information, to the superblock read routine for this file system. All of the system's VFS superblocks are kept in the super_blocks vector of super_block data structures and one must be allocated for this mount. The superblock read routine must fill out the VFS superblock fields based on information that it reads from the physical device. For the EXT2 file system this mapping or translation of information is quite easy, it simply reads the EXT2 superblock and fills out the VFS superblock from there. For other file systems, such as the MS-DOS file system, it is not quite such an easy task. Whatever the file system, filling out the VFS superblock means that the file system must read whatever describes it from the block device that supports it. If the block device cannot be read from or if it does not contain this type of file system then the mount command will fail.

Each mounted file system is described by a vfsmount data structure; see figure 9.6. These are queued on a list pointed at by vfsmntlist. Another pointer, vfsmnttail, points at the last entry in the list and the mru_vfsmnt pointer points at the most recently used file system. Each vfsmount structure contains the device number of the block device holding the file system, the directory where this file system is mounted and a pointer to the VFS superblock allocated when this file system was mounted. In turn the VFS superblock points at the file_system_type data structure for this sort of file system and to the root inode for this file system. This inode is kept resident in the VFS inode cache all of the time that this file system is loaded.

See add_vfsmnt() in fs/super.c
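A simplified picture of the vfsmount list, using the field names shown in figure 9.6, is sketched below; the lookup shows roughly how the kernel can tell whether a block device is already mounted. This is an illustration, not the kernel's declaration.

#include <stddef.h>

struct super_block;                          /* opaque here             */

/* A simplified vfsmount, with the fields shown in Figure 9.6.          */
struct vfsmount_sketch {
    unsigned short          mnt_dev;         /* e.g. 0x0301             */
    const char             *mnt_devname;     /* e.g. "/dev/hda1"        */
    const char             *mnt_dirname;     /* e.g. "/"                */
    unsigned int            mnt_flags;       /* mount flags             */
    struct super_block     *mnt_sb;          /* the VFS superblock      */
    struct vfsmount_sketch *next;
};

static struct vfsmount_sketch *vfsmntlist;   /* all mounted systems     */

/* Find the mount describing a given block device; roughly what a walk
 * over vfsmntlist does when checking whether a device is mounted.      */
static struct vfsmount_sketch *find_mount(unsigned short dev)
{
    for (struct vfsmount_sketch *m = vfsmntlist; m; m = m->next)
        if (m->mnt_dev == dev)
            return m;
    return NULL;
}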

[Figure 9.6: A Mounted File System - a vfsmount structure on vfsmntlist (mnt_dev 0x0301, mnt_devname /dev/hda1, mnt_dirname /, mnt_flags, mnt_sb) points at a VFS super_block (s_dev 0x0301, s_blocksize 1024, s_flags, s_covered, s_mounted) whose s_type points at the "ext2" file_system_type (*read_super(), name, requires_dev) and whose mounted inode is the VFS inode with i_dev 0x0301 and i_ino 42]

9.2.5 Finding a File in the Virtual File System


To find the VFS inode of a file in the Virtual File System, the VFS must resolve the name a directory at a time, looking up the VFS inode representing each of the intermediate directories in the name. Each directory lookup involves calling the file system specific lookup routine whose address is held in the VFS inode representing the parent directory. This works because we always have the VFS inode of the root of each file system available and pointed at by the VFS superblock for that system. Each time an inode is looked up by the real file system it checks the directory cache for the directory. If there is no entry in the directory cache, the real file system gets the VFS inode either from the underlying file system or from the inode cache.

9.2.6 Creating a File in the Virtual File System


9.2.7 Unmounting a File System
See do_umount() and remove_vfsmnt() in fs/super.c

The workshop manual for my MG usually describes assembly as the reverse of disassembly and the reverse is more or less true for unmounting a file system. A file system cannot be unmounted if something in the system is using one of its files. So, for example, you cannot umount /mnt/cdrom if a process is using that directory or any of its children. If anything is using the file system to be unmounted there may be VFS inodes from it in the VFS inode cache, and the code checks for this by looking through the list of inodes looking for inodes owned by the device that this file system occupies. If the VFS superblock for the mounted file system is dirty, that is it has been modified, then it must be written back to the file system on disk. Once it has been written to disk, the memory occupied by the VFS superblock is returned to the kernel's free pool of memory. Finally the vfsmount data structure for this mount is unlinked from vfsmntlist and freed.

9.2.8 The VFS Inode Cache


As the mounted file systems are navigated, their VFS inodes are being continually read and, in some cases, written. The Virtual File System maintains an inode cache to speed up accesses to all of the mounted file systems. Every time a VFS inode is read from the inode cache the system saves an access to a physical device.

See fs/inode.c

The VFS inode cache is implemented as a hash table whose entries are pointers to lists of VFS inodes that have the same hash value. The hash value of an inode is calculated from its inode number and from the device identifier for the underlying physical device containing the file system. Whenever the Virtual File System needs to access an inode, it first looks in the VFS inode cache. To find an inode in the cache, the system first calculates its hash value and then uses it as an index into the inode hash table. This gives it a pointer to a list of inodes with the same hash value. It then reads each inode in turn until it finds one with both the same inode number and the same device identifier as the one that it is searching for.
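In outline, that lookup reads like the sketch below. The hash function here is only illustrative, but the key points are that both the device identifier and the inode number must match and that a hit bumps the inode's usage count.

#include <stddef.h>

#define NR_IHASH 128                         /* number of hash chains    */

/* Just enough of a VFS inode for the cache lookup described above.      */
struct vfs_inode_sketch {
    unsigned short           i_dev;
    unsigned long            i_ino;
    unsigned int             i_count;        /* current users            */
    struct vfs_inode_sketch *i_hash_next;
};

static struct vfs_inode_sketch *inode_hash[NR_IHASH];

/* Fold the device identifier and inode number into a chain index.       */
static unsigned int ihash(unsigned short dev, unsigned long ino)
{
    return (unsigned int)((dev ^ ino) % NR_IHASH);
}

/* Walk the chain until both the device and the inode number match; on a
 * miss the caller must find a free inode and have the file system read
 * it in.                                                                 */
static struct vfs_inode_sketch *iget_cached(unsigned short dev,
                                            unsigned long ino)
{
    for (struct vfs_inode_sketch *i = inode_hash[ihash(dev, ino)];
         i; i = i->i_hash_next)
        if (i->i_dev == dev && i->i_ino == ino) {
            i->i_count++;
            return i;
        }
    return NULL;
}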
If it can find the inode in the cache, its count is incremented to show that it has another user and the file system access continues. Otherwise a free VFS inode must be found so that the file system can read the inode into memory. VFS has a number of choices about how to get a free inode. If the system may allocate more VFS inodes then this is what it does; it allocates kernel pages and breaks them up into new, free inodes and puts them into the inode list. All of the system's VFS inodes are in a list pointed at by first_inode as well as in the inode hash table. If the system already has all of the inodes that it is allowed to have, it must find an inode that is a good candidate to be reused. Good candidates are inodes with a usage count of zero; this indicates that the system is not currently using them. Really important VFS inodes, for example the root inodes of file systems, always have a usage count greater than zero and so are never candidates for reuse. Once a candidate for reuse has been located it is cleaned up. The VFS inode might be dirty and in this case it needs to be written back to the file system or it might be locked and in this case the system must wait for it to be unlocked before continuing. The candidate VFS inode must be cleaned up before it can be reused.

However the new VFS inode is found, a file system specific routine must be called to fill it out from information read from the underlying real file system. Whilst it is being filled out, the new VFS inode has a usage count of one and is locked so that nothing else accesses it until it contains valid information.

To get the VFS inode that is actually needed, the file system may need to access several other inodes. This happens when you read a directory; only the inode for the final directory is needed but the inodes for the intermediate directories must also be read. As the VFS inode cache is used and filled up, the less used inodes will be discarded and the more used inodes will remain in the cache.

9.2.9 The Directory Cache


See fs/dcache.c

To speed up accesses to commonly used directories, the VFS maintains a cache of directory entries. As directories are looked up by the real file systems their details are added into the directory cache. The next time the same directory is looked up, for example to list it or open a file within it, then it will be found in the directory cache. Only short directory entries (up to 15 characters long) are cached, but this is reasonable as the shorter directory names are the most commonly used ones. For example, /usr/X11R6/bin is very commonly accessed when the X server is running.

[Figure 9.7: The Buffer Cache - the hash_table is a vector of pointers to chains of buffer_head structures (fields b_dev, b_blocknr, b_state, b_count, b_size, b_next, b_prev, b_data); the example chains hold buffers for device 0x0301 blocks 42 and 39 of size 1024 and for device 0x0801 block 17 of size 2048]


The directory cache consists of a hash table, each entry of which points at a list of directory cache entries that have the same hash value. The hash function uses the device number of the device holding the file system and the directory's name to calculate the offset, or index, into the hash table. It allows cached directory entries to be quickly found. It is no use having a cache when lookups within the cache take too long to find entries, or even not to find them.
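A sketch of the ingredients of that lookup follows. The hash function is illustrative rather than the kernel's own, but it shows the device number and the directory name being folded together, and why names longer than the cached limit can be rejected immediately.

#include <string.h>

#define DCACHE_HASH_SIZE 64
#define DCACHE_NAME_MAX  15             /* only short names are cached  */

/* An illustrative hash over the device number and the directory name.  */
static unsigned int dcache_hash(unsigned short dev, const char *name)
{
    unsigned int h = dev;

    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % DCACHE_HASH_SIZE;
}

/* Long names are never cached, so a lookup can give up straight away
 * and fall back to the real file system.                               */
static int dcache_cacheable(const char *name)
{
    return strlen(name) <= DCACHE_NAME_MAX;
}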
In an effort to keep the caches valid and up to date the VFS keeps lists of Least Recently Used (LRU) directory cache entries. When a directory entry is first put into the cache, which is when it is first looked up, it is added onto the end of the first level LRU list. In a full cache this will displace an existing entry from the front of the LRU list. As the directory entry is accessed again it is promoted to the back of the second LRU cache list. Again, this may displace a cached level two directory entry at the front of the level two LRU cache list. This displacing of entries at the front of the level one and level two LRU lists is fine. The only reason that entries are at the front of the lists is that they have not been recently accessed. If they had, they would be nearer the back of the lists. The entries in the second level LRU cache list are safer than entries in the level one LRU cache list. This is the intention as these entries have not only been looked up but also they have been repeatedly referenced.

REVIEW NOTE: Do we need a diagram for this?

9.3 The Buffer Cache

As the mounted file systems are used they generate a lot of requests to the block devices to read and write data blocks. All block data read and write requests are given to the device drivers in the form of buffer_head data structures via standard kernel routine calls. These give all of the information that the block device drivers need; the device identifier uniquely identifies the device and the block number tells the driver which block to read. All block devices are viewed as linear collections of blocks of the same size. To speed up access to the physical block devices, Linux maintains a cache of block buffers. All of the block buffers in the system are kept somewhere in this buffer cache, even the new, unused buffers. This cache is shared between all of the physical block devices; at any one time there are many block buffers in the cache, belonging to any one of the system's block devices and often in many different states. If valid data is available from the buffer cache this saves the system an access to a physical device. Any block buffer that has been used to read data from a block device or to write data to it goes into the buffer cache. Over time it may be removed from the cache to make way for a more deserving buffer or it may remain in the cache as it is frequently accessed.
Block buffers within the cache are uniquely identified by the owning device identifier and the block number of the buffer. The buffer cache is composed of two functional parts. The first part is the lists of free block buffers. There is one list per supported buffer size and the system's free block buffers are queued onto these lists when they are first created or when they have been discarded. The currently supported buffer sizes are 512, 1024, 2048, 4096 and 8192 bytes. The second functional part is the cache itself. This is a hash table which is a vector of pointers to chains of buffers that have the same hash index. The hash index is generated from the owning device identifier and the block number of the data block. Figure 9.7 shows the hash table together with a few entries. Block buffers are either in one of the free lists or they are in the buffer cache. When they are in the buffer cache they are also queued onto Least Recently Used (LRU) lists. There is an LRU list for each buffer type and these are used by the system to perform work on buffers of a type, for example, writing buffers with new data in them out to disk. The buffer's type reflects its state and Linux currently supports the following types:

clean Unused, new buffers,

locked Buffers that are locked, waiting to be written,

dirty Dirty buffers. These contain new, valid data, and will be written but so far have not been scheduled to be written,

shared Shared buffers,

unshared Buffers that were once shared but which are now not shared.

Whenever a file system needs to read a buffer from its underlying physical device, it tries to get a block from the buffer cache. If it cannot get a buffer from the buffer cache, then it will get a clean one from the appropriately sized free list and this new buffer will go into the buffer cache. If the buffer that it needed is in the buffer cache, then it may or may not be up to date. If it is not up to date or if it is a new block buffer, the file system must request that the device driver read the appropriate block of data from the disk.

Like all caches, the buffer cache must be maintained so that it runs efficiently and fairly allocates cache entries between the block devices using the buffer cache. Linux uses the bdflush kernel daemon to perform a lot of housekeeping duties on the cache but some happen automatically as a result of the cache being used.
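The cache side of that sequence can be sketched as a lookup keyed on (device, block number, size); the hash function below is illustrative. On a miss, the caller takes a clean buffer of the right size from a free list, hashes it in and asks the driver to read it.

#include <stddef.h>

#define NR_BHASH 613                        /* illustrative table size   */

/* The identifying fields of a cached block buffer (see Figure 9.7).     */
struct bh_sketch {
    unsigned short    b_dev;                /* owning device, e.g. 0x0301 */
    unsigned long     b_blocknr;            /* block number on the device */
    unsigned long     b_size;               /* buffer size in bytes       */
    unsigned int      b_count;              /* current users              */
    struct bh_sketch *b_next;               /* next buffer on this chain  */
};

static struct bh_sketch *buffer_hash[NR_BHASH];

static unsigned int bhash(unsigned short dev, unsigned long block)
{
    return (unsigned int)((dev ^ block) % NR_BHASH);
}

/* Reuse a buffer that is already hashed for (device, block, size); a
 * NULL return means the caller must set one up from a free list.        */
static struct bh_sketch *find_buffer(unsigned short dev,
                                     unsigned long block,
                                     unsigned long size)
{
    for (struct bh_sketch *bh = buffer_hash[bhash(dev, block)];
         bh; bh = bh->b_next)
        if (bh->b_dev == dev && bh->b_blocknr == block &&
            bh->b_size == size) {
            bh->b_count++;
            return bh;
        }
    return NULL;
}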

9.3.1 The bdflush Kernel Daemon


See bdflush() in fs/buffer.c

The bdflush kernel daemon is a simple kernel daemon that provides a dynamic response to the system having too many dirty buffers; buffers that contain data that must be written out to disk at some time. It is started as a kernel thread at system startup time and, rather confusingly, it calls itself "kflushd" and that is the name that you will see if you use the ps command to show the processes in the system. Mostly this daemon sleeps waiting for the number of dirty buffers in the system to grow too large. As buffers are allocated and discarded the number of dirty buffers in the system is checked. If there are too many as a percentage of the total number of buffers in the system then bdflush is woken up. The default threshold is 60% but, if the system is desperate for buffers, bdflush will be woken up anyway. This value can be seen and changed using the update command:

# update -d
bdflush version 1.4
0:    60 Max fraction of LRU list to examine for dirty blocks
1:   500 Max number of dirty blocks to write each time bdflush activated
2:    64 Num of clean buffers to be loaded onto free list by refill_freelist
3:   256 Dirty block threshold for activating bdflush in refill_freelist
4:    15 Percentage of cache to scan for free clusters
5:  3000 Time for data buffers to age before flushing
6:   500 Time for non-data (dir, bitmap, etc) buffers to age before flushing
7:  1884 Time buffer cache load average constant
8:     2 LAV ratio (used to determine threshold for buffer fratricide).

All of the dirty buffers are linked into the BUF_DIRTY LRU list whenever they are made dirty by having data written to them and bdflush tries to write a reasonable number of them out to their owning disks. Again this number can be seen and controlled by the update command and the default is 500 (see above).

9.3.2 The update Process


See sys_bdflush() in fs/buffer.c

The update command is more than just a command; it is also a daemon. When run as superuser (during system initialisation) it will periodically flush all of the older dirty buffers out to disk. It does this by calling a system service routine that does more or less the same thing as bdflush. Whenever a dirty buffer is finished with, it is tagged with the system time that it should be written out to its owning disk. Every time that update runs it looks at all of the dirty buffers in the system looking for ones with an expired flush time. Every expired buffer is written out to disk.

9.4 The /proc File System


The /proc file system really shows the power of the Linux Virtual File System. It does not really exist (yet another of Linux's conjuring tricks); neither the /proc directory nor its subdirectories and its files actually exist. So how can you cat /proc/devices? The /proc file system, like a real file system, registers itself with the Virtual File System. However, when the VFS makes calls to it requesting inodes as its files and directories are opened, the /proc file system creates those files and directories from information within the kernel. For example, the kernel's /proc/devices file is generated from the kernel's data structures describing its devices.

The /proc file system presents a user readable window into the kernel's inner workings. Several Linux subsystems, such as the Linux kernel modules described in chapter 12, create entries in the /proc file system.

9.5 Device Special Files


Linux, like all versions of UnixTM, presents its hardware devices as special files. So, for example, /dev/null is the null device. A device file does not use any data space in the file system; it is only an access point to the device driver. The EXT2 file system and the Linux VFS both implement device files as special types of inode. There are two types of device file; character and block special files. Within the kernel itself, the device drivers implement file semantics: you can open them, close them and so on. Character devices allow I/O operations in character mode and block devices require that all I/O is via the buffer cache. When an I/O request is made to a device file, it is forwarded to the appropriate device driver within the system. Often this is not a real device driver but a pseudo-device driver for some subsystem such as the SCSI device driver layer. Device files are referenced by a major number, which identifies the device type, and a minor number, which identifies the unit, or instance, of that major type. For example, the IDE disks on the first IDE controller in the system have a major number of 3 and the first partition of an IDE disk would have a minor number of 1. So, ls -l of /dev/hda1 gives:

brw-rw----   1 root     disk       3,   1 Nov 24 15:09 /dev/hda1

Within the kernel, every device is uniquely described by a kdev_t data type; this is two bytes long, the first byte containing the minor device number and the second byte holding the major device number. The IDE device above is held within the kernel as 0x0301. An EXT2 inode that represents a block or character device keeps the device's major and minor numbers in its first direct block pointer. When it is read by the VFS, the VFS inode data structure representing it has its i_rdev field set to the correct device identifier.

See include/linux/major.h for all of Linux's major device numbers, and include/linux/kdev_t.h
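Unpacking such a device number is simple shifting and masking. The macros below are a stand-alone, user-space illustration modelled on the kernel's MAJOR() and MINOR() macros, not a copy of its header.

#include <stdio.h>

/* The low byte is the minor number, the high byte the major number. */
typedef unsigned short kdev_sketch_t;

#define SKETCH_MAJOR(dev)          ((unsigned)((dev) >> 8))
#define SKETCH_MINOR(dev)          ((unsigned)((dev) & 0xff))
#define SKETCH_MKDEV(major, minor) \
    ((kdev_sketch_t)(((major) << 8) | ((minor) & 0xff)))

int main(void)
{
    kdev_sketch_t hda1 = 0x0301;                   /* /dev/hda1         */

    printf("major %u, minor %u\n",
           SKETCH_MAJOR(hda1), SKETCH_MINOR(hda1)); /* major 3, minor 1 */
    printf("mkdev(3,1) = 0x%04x\n", (unsigned)SKETCH_MKDEV(3, 1));
    return 0;
}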

Chapter 10

Networks

Networking and Linux are terms that are almost synonymous. In a very real sense Linux is a product of the Internet or World Wide Web (WWW). Its developers and users use the web to exchange information, ideas and code, and Linux itself is often used to support the networking needs of organizations. This chapter describes how Linux supports the network protocols known collectively as TCP/IP.

The TCP/IP protocols were designed to support communications between computers connected to the ARPANET, an American research network funded by the US government. The ARPANET pioneered networking concepts such as packet switching and protocol layering where one protocol uses the services of another. ARPANET was retired in 1988 but its successors (NSFNET1 and the Internet) have grown even larger. What is now known as the World Wide Web grew from the ARPANET and is itself supported by the TCP/IP protocols. UnixTM was extensively used on the ARPANET and the first released networking version of UnixTM was 4.3 BSD. Linux's networking implementation is modeled on 4.3 BSD in that it supports BSD sockets (with some extensions) and the full range of TCP/IP networking. This programming interface was chosen because of its popularity and to help applications be portable between Linux and other UnixTM platforms.

1 National Science Foundation

10.1 An Overview of TCP/IP Networking


This section gives an overview of the main principles of TCP/IP networking. It is not meant to be an exhaustive description; for that I suggest that you read [10, Comer].

In an IP network every machine is assigned an IP address; this is a 32 bit number that uniquely identifies the machine. The WWW is a very large, and growing, IP network and every machine that is connected to it has to have a unique IP address assigned to it. IP addresses are represented by four numbers separated by dots, for example, 16.42.0.9. This IP address is actually in two parts, the network address and the host address. The sizes of these parts may vary (there are several classes of IP addresses) but using 16.42.0.9 as an example, the network address would be 16.42 and the host address 0.9. The host address is further subdivided into a subnetwork and a host address. Again, using 16.42.0.9 as an example, the subnetwork address would be 16.42.0 and the host address 16.42.0.9. This subdivision of the IP address allows organizations to subdivide their networks. For example, 16.42 could be the network address of the ACME Computer Company; 16.42.0 would be subnet 0 and 16.42.1 would be subnet 1. These subnets might be in separate buildings, perhaps connected by leased telephone lines or even microwave links. IP addresses are assigned by the network administrator and having IP subnetworks is a good way of distributing the administration of the network. IP subnet administrators are free to allocate IP addresses within their IP subnetworks.
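Splitting an address into its network and host parts is just a masking operation. The sketch below applies an assumed 255.255.255.0 mask to the 16.42.0.9 example; the actual mask depends on the address class and on how the site has chosen to subnet.

#include <stdio.h>
#include <stdint.h>

/* Pack four dotted-quad components into a 32 bit address. */
static uint32_t make_addr(int a, int b, int c, int d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8)  | (uint32_t)d;
}

static void print_addr(const char *label, uint32_t addr)
{
    printf("%s %u.%u.%u.%u\n", label,
           (addr >> 24) & 0xff, (addr >> 16) & 0xff,
           (addr >> 8) & 0xff, addr & 0xff);
}

int main(void)
{
    uint32_t addr = make_addr(16, 42, 0, 9);
    uint32_t mask = make_addr(255, 255, 255, 0);   /* assumed subnet mask */

    print_addr("subnetwork:", addr & mask);        /* 16.42.0.0 */
    print_addr("host part: ", addr & ~mask);       /* 0.0.0.9   */
    return 0;
}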
Generally though, IP addresses are somewhat hard to remember. Names are much easier. linux.acme.com is much easier to remember than 16.42.0.9 but there must be some mechanism to convert the network names into an IP address. These names can be statically specified in the /etc/hosts file or Linux can ask a Domain Name Server (DNS server) to resolve the name for it. In this case the local host must know the IP address of one or more DNS servers and these are specified in /etc/resolv.conf.
Whenever you connect to another machine, say when reading a web page, its IP address is used to exchange data with that machine. This data is contained in IP packets each of which have an IP header containing the source and destination machines' IP addresses, a checksum and other useful information. The checksum is derived from the data in the IP packet and allows the receiver of IP packets to tell if the IP packet was corrupted during transmission, perhaps by a noisy telephone line. The data transmitted by an application may have been broken down into smaller packets which are easier to handle. The size of the IP data packets varies depending on the connection media; ethernet packets are generally bigger than PPP packets. The destination host must reassemble the data packets before giving the data to the receiving application. You can see this fragmentation and reassembly of data graphically if you access a web page containing a lot of graphical images via a moderately slow serial link.
Hosts connected to the same IP subnet can send IP packets directly to each other; all other IP packets will be sent to a special host, a gateway. Gateways (or routers) are connected to more than one IP subnet and they will resend IP packets received on one subnet, but destined for another, onwards. For example, if subnets 16.42.1.0 and 16.42.0.0 are connected together by a gateway then any packets sent from subnet 0 to subnet 1 would have to be directed to the gateway so that it could route them. The local host builds up routing tables which allow it to route IP packets to the correct machine. For every IP destination there is an entry in the routing tables which tells Linux which host to send IP packets to in order that they reach their destination. These routing tables are dynamic and change over time as applications use the network and as the network topology changes.
The IP protocol is a transport layer that is used by other protocols to carry their data. The Transmission Control Protocol (TCP) is a reliable end to end protocol that uses IP to transmit and receive its own packets. Just as IP packets have their own header, TCP has its own header.
[Figure 10.1: TCP/IP Protocol Layers - an ETHERNET FRAME (destination ethernet address, source ethernet address, protocol, data, checksum) carries an IP PACKET (length, protocol, checksum, source IP address, destination IP address, data) which in turn carries a TCP PACKET (source TCP address, destination TCP address, SEQ, ACK, data)]


TCP is a connection based protocol where two networking applications are connected by a single, virtual connection even though there may be many subnetworks, gateways and routers between them. TCP reliably transmits and receives data between the two applications and guarantees that there will be no lost or duplicated data. When TCP transmits its packet using IP, the data contained within the IP packet is the TCP packet itself. The IP layer on each communicating host is responsible for transmitting and receiving IP packets. The User Datagram Protocol (UDP) also uses the IP layer to transport its packets; unlike TCP, UDP is not a reliable protocol but offers a datagram service. This use of IP by other protocols means that when IP packets are received the receiving IP layer must know which upper protocol layer to give the data contained in this IP packet to. To facilitate this every IP packet header has a byte containing a protocol identifier. When TCP asks the IP layer to transmit an IP packet, that IP packet's header states that it contains a TCP packet. The receiving IP layer uses that protocol identifier to decide which layer to pass the received data up to, in this case the TCP layer. When applications communicate via TCP/IP they must specify not only the target's IP address but also the port address of the application. A port address uniquely identifies an application and standard network applications use standard port addresses; for example, web servers use port 80. These registered port addresses can be seen in /etc/services.
This layering of protocols does not stop with TCP, UDP and IP. The IP protocol layer itself uses many different physical media to transport IP packets to other IP hosts. These media may themselves add their own protocol headers. One such example is the ethernet layer, but PPP and SLIP are others. An ethernet network allows many hosts to be simultaneously connected to a single physical cable. Every transmitted ethernet frame can be seen by all connected hosts and so every ethernet device has a unique address. Any ethernet frame transmitted to that address will be received by the addressed host but ignored by all the other hosts connected to the network. These unique addresses are built into each ethernet device when they are manufactured and it is usually kept in an SROM2 on the ethernet card. Ethernet addresses are 6 bytes long; an example would be 08-00-2b-00-49-A4. Some ethernet addresses are reserved for multicast purposes and ethernet frames sent with these destination addresses will be received by all hosts on the network. As ethernet frames can carry many different protocols (as data) they, like IP packets, contain a protocol identifier in their headers. This allows the ethernet layer to correctly receive IP packets and to pass them onto the IP layer.

2 Synchronous Read Only Memory
In order to send an IP packet via a multi-connection protocol such as ethernet, the IP layer must find the ethernet address of the IP host. This is because IP addresses are simply an addressing concept; the ethernet devices themselves have their own physical addresses. IP addresses, on the other hand, can be assigned and reassigned by network administrators at will but the network hardware responds only to ethernet frames with its own physical address or to special multicast addresses which all machines must receive. Linux uses the Address Resolution Protocol (or ARP) to allow machines to translate IP addresses into real hardware addresses such as ethernet addresses. A host wishing to know the hardware address associated with an IP address sends an ARP request packet containing the IP address that it wishes translating to all nodes on the network by sending it to a multicast address. The target host that owns the IP address responds with an ARP reply that contains its physical hardware address. ARP is not just restricted to ethernet devices; it can resolve IP addresses for other physical media, for example FDDI. Those network devices that cannot ARP are marked so that Linux does not attempt to ARP. There is also the reverse function, Reverse ARP or RARP, which translates physical network addresses into IP addresses. This is used by gateways, which respond to ARP requests on behalf of IP addresses that are in the remote network.

10.2 The Linux TCP/IP Networking Layers


Just like the network protocols themselves, Figure 10.2 shows that Linux implements the internet protocol address family as a series of connected layers of software. BSD sockets are supported by a generic socket management software concerned only with BSD sockets. Supporting this is the INET socket layer; this manages the communication end points for the IP based protocols TCP and UDP. UDP (User Datagram Protocol) is a connectionless protocol whereas TCP (Transmission Control Protocol) is a reliable end to end protocol. When UDP packets are transmitted, Linux neither knows nor cares if they arrive safely at their destination. TCP packets are numbered and both ends of the TCP connection make sure that transmitted data is received correctly. The IP layer contains code implementing the Internet Protocol. This code prepends IP headers to transmitted data and understands how to route incoming IP packets to either the TCP or UDP layers. Underneath the IP layer, supporting all of Linux's networking, are the network devices, for example PPP and ethernet. Network devices do not always represent physical devices; some, like the loopback device, are purely software devices. Unlike standard Linux devices that are created via the mknod command, network devices appear only if the underlying software has found and initialized them. You will only see /dev/eth0 when you have built a kernel with the appropriate ethernet device driver in it. The ARP protocol sits between the IP layer and the protocols that support ARPing for addresses.

[Figure 10.2: Linux Networking Layers - user-level Network Applications sit on the kernel's BSD Sockets and socket interface, which rest on the INET sockets; below these are the protocol layers (TCP, UDP, IP and ARP) and, at the bottom, the network devices (PPP, SLIP, Ethernet)]

10.3 The BSD Socket Interface


This is a general interfa e whi h not only supports various forms of networking but
is also an inter-pro ess ommuni ations me hanism. A so ket des ribes one end
of a ommuni ations link, two ommuni ating pro esses would ea h have a so ket
des ribing their end of the ommuni ation link between them. So kets ould be
thought of as a spe ial ase of pipes but, unlike pipes, so kets have no limit on the
amount of data that they an ontain. Linux supports several lasses of so ket and
these are known as address families. This is be ause ea h lass has its own method of
addressing its ommuni ations. Linux supports the following so ket address families
or domains:
UNIX         Unix domain sockets,
INET         The Internet address family supports communications via TCP/IP protocols,
AX25         Amateur radio X25,
IPX          Novell IPX,
APPLETALK    Appletalk DDP,
X25          X25
There are several socket types and these represent the type of service that supports
the connection. Not all address families support all types of service. Linux BSD
sockets support a number of socket types:

Stream These sockets provide reliable two way sequenced data streams with a
guarantee that data cannot be lost, corrupted or duplicated in transit. Stream
sockets are supported by the TCP protocol of the Internet (INET) address family.

Datagram These sockets also provide two way data transfer but, unlike stream
sockets, there is no guarantee that the messages will arrive. Even if they do arrive
there is no guarantee that they will arrive in order or even not be duplicated or
corrupted. This type of socket is supported by the UDP protocol of the Internet
address family.

Raw This allows processes direct (hence "raw") access to the underlying protocols. It
is, for example, possible to open a raw socket to an ethernet device and see raw IP
data traffic.

Reliable Delivered Messages These are very like datagram sockets but the data is
guaranteed to arrive.

Sequenced Packets These are like stream sockets except that the data packet sizes
are fixed.

Packet This is not a standard BSD socket type; it is a Linux specific extension that
allows processes to access packets directly at the device level.

Processes that communicate using sockets use a client server model. A server provides
a service and clients make use of that service. One example would be a web server,
which provides web pages, and a web client, or browser, which reads those pages. A
server using sockets first creates a socket and then binds a name to it. The format of
this name is dependent on the socket's address family and it is, in effect, the local
address of the server. The socket's name or address is specified using the sockaddr
data structure. An INET socket would have an IP port address bound to it. The
registered port numbers can be seen in /etc/services; for example, the port number for
a web server is 80. Having bound an address to the socket, the server then listens for
incoming connection requests specifying the bound address. The originator of the
request, the client, creates a socket and makes a connection request on it, specifying
the target address of the server. For an INET socket the address of the server is its IP
address and its port number. These incoming requests must find their way up through
the various protocol layers and then wait on the server's listening socket. Once the
server has received the incoming request it either accepts or rejects it. If the incoming
request is to be accepted, the server must create a new socket to accept it on. Once a
socket has been used for listening for incoming connection requests it cannot be used
to support a connection. With the connection established both ends are free to send
and receive data. Finally, when the connection is no longer needed it can be shut
down. Care is taken to ensure that data packets in transit are correctly dealt with.
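From an application's point of view, the sequence of BSD socket calls just described
looks roughly like the sketch below. This is user-level illustration, not kernel code; the
port number and the lack of error handling are purely for the example.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <unistd.h>

    /* Server side: create a socket, bind a local address to it, listen for
       connection requests and accept one on a newly created socket.        */
    int simple_server(void)
    {
        struct sockaddr_in addr;
        int listener, connection;

        listener = socket(AF_INET, SOCK_STREAM, 0);   /* INET stream (TCP) socket */

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;            /* any local IP address     */
        addr.sin_port = htons(80);                    /* e.g. the web server port */
        bind(listener, (struct sockaddr *) &addr, sizeof(addr));

        listen(listener, 5);                          /* allow pending requests   */
        connection = accept(listener, NULL, NULL);    /* new socket per connection */

        /* read() and write() on 'connection', then shut the sockets down */
        close(connection);
        close(listener);
        return 0;
    }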
The exact meaning of operations on a BSD socket depends on its underlying address
family. Setting up TCP/IP connections is very different from setting up an amateur
radio X.25 connection. Like the virtual filesystem, Linux abstracts the socket interface,
with the BSD socket layer being concerned with the BSD socket interface to the
application programs, which is in turn supported by independent address family
specific software. At kernel initialization time, the address families built into the
kernel register themselves with the BSD socket interface. Later on, as applications
create and use BSD sockets, an association is made between the BSD socket and its
supporting address family. This association is made via cross-linking data structures
and tables of address family specific support routines. For example there is an address
family specific socket creation routine which the BSD socket interface uses when an
application creates a new socket.
When the kernel is configured, a number of address families and protocols are built
into the protocols vector. Each is represented by its name, for example "INET", and
the address of its initialization routine. When the socket interface is initialized at boot
time each protocol's initialization routine is called. For the socket address families this
results in them registering a set of protocol operations. This is a set of routines, each
of which performs a particular operation specific to that address family. The
registered protocol operations are kept in the pops vector, a vector of pointers to
proto_ops data structures. The proto_ops data structure consists of the address family
type and a set of pointers to socket operation routines specific to a particular address
family. The pops vector is indexed by the address family identifier, for example the
Internet address family identifier (AF_INET is 2).
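The shape of this registration can be sketched roughly as follows. The structure layout
and the registration routine shown here are simplified illustrations of the mechanism
just described, not the exact declarations from the kernel sources.

    struct socket;                                /* incomplete types: only used */
    struct sockaddr;                              /* through pointers here       */

    #define NPROTO 16                             /* illustrative size only      */

    struct proto_ops {
        int family;                               /* address family, e.g. AF_INET */
        int (*create)(struct socket *sock, int protocol);
        int (*bind)(struct socket *sock, struct sockaddr *addr, int len);
        int (*connect)(struct socket *sock, struct sockaddr *addr, int len, int flags);
        /* ... listen, accept, sendmsg, recvmsg and so on ...                     */
    };

    static struct proto_ops *pops[NPROTO];        /* indexed by address family    */

    /* Called by each address family when the socket interface is initialized. */
    int register_family_ops(int family, struct proto_ops *ops)
    {
        if (family < 0 || family >= NPROTO)
            return -1;
        pops[family] = ops;                       /* e.g. pops[AF_INET] = &inet_ops */
        return 0;
    }

    /* Later, creating a socket for a family simply indexes the vector:
           ops = pops[AF_INET];  ops->create(sock, protocol);              */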

10.4 The INET Socket Layer

The INET socket layer supports the internet address family which contains the TCP/IP
protocols. As discussed above, these protocols are layered, one protocol using the
services of another. Linux's TCP/IP code and data structures reflect this layering. Its
interface with the BSD socket layer is through the set of Internet address family socket
operations which it registers with the BSD socket layer during network initialization.
These are kept in the pops vector along with the other registered address families.

See include/linux/net.h

[Figure 10.3: Linux BSD Socket Data Structures. A files_struct fd[] slot points to a file whose f_op field points to the BSD socket file operations (lseek, read, write, select, ioctl, close, fasync); the file's inode holds the socket data structure, whose ops field points to the address family socket operations and whose data field points to the SOCK_STREAM sock data structure, which in turn points back to the socket.]

The BSD socket layer calls the INET layer socket support routines from the registered
INET proto_ops data structure to perform work for it. For example a BSD socket create
request that gives the address family as INET will use the underlying INET socket
create function. The BSD socket layer passes the socket data structure representing
the BSD socket to the INET layer in each of these operations. Rather than clutter the
BSD socket with TCP/IP specific information, the INET socket layer uses its own data
structure, the sock, which it links to the BSD socket data structure. This linkage can
be seen in Figure 10.3. It links the sock data structure to the BSD socket data structure
using the data pointer in the BSD socket. This means that subsequent INET socket
calls can easily retrieve the sock data structure. The sock data structure's protocol
operations pointer is also set up at creation time and it depends on the protocol
requested. If TCP is requested, then the sock data structure's protocol operations
pointer will point to the set of TCP protocol operations needed for a TCP connection.

See include/net/sock.h

10.4.1 Creating a BSD Socket

The system call to create a new socket passes identifiers for its address family, socket
type and protocol. Firstly the requested address family is used to search the pops
vector for a matching address family. It may be that a particular address family is
implemented as a kernel module and, in this case, the kerneld daemon must load the
module before we can continue. A new socket data structure is allocated to represent
the BSD socket. Actually the socket data structure is physically part of the VFS inode
data structure and allocating a socket really means allocating a VFS inode. This may
seem strange unless you consider that sockets can be operated on in just the same way
that ordinary files can. As all files are represented by a VFS inode data structure, then
in order to support file operations, BSD sockets must also be represented by a VFS
inode data structure.
The newly created BSD socket data structure contains a pointer to the address family
specific socket routines and this is set to the proto_ops data structure retrieved from
the pops vector. Its type is set to the socket type requested; one of SOCK_STREAM,
SOCK_DGRAM and so on. The address family specific creation routine is called using
the address kept in the proto_ops data structure.
A free file descriptor is allocated from the current process's fd vector and the file data
structure that it points at is initialized. This includes setting the file operations
pointer to point to the set of BSD socket file operations supported by the BSD socket
interface. Any future operations will be directed to the socket interface and it will in
turn pass them to the supporting address family by calling its address family
operation routines.

See sys_socket() in net/socket.c

10.4.2 Binding an Address to an INET BSD Socket

In order to be able to listen for incoming internet connection requests, each server
must create an INET BSD socket and bind its address to it. The bind operation is
mostly handled within the INET socket layer with some support from the underlying
TCP and UDP protocol layers. The socket having an address bound to it cannot be
being used for any other communication. This means that the socket's state must be
TCP_CLOSE. The sockaddr passed to the bind operation contains the IP address to be
bound to and, optionally, a port number. Normally the IP address bound to would be
one that has been assigned to a network device that supports the INET address family
and whose interface is up and able to be used. You can see which network interfaces
are currently active in the system by using the ifconfig command. The IP address may
also be the IP broadcast address of either all 1's or all 0's. These are special addresses
that mean "send to everybody". The IP address could also be specified as any IP
address if the machine is acting as a transparent proxy or firewall, but only processes
with superuser privileges can bind to any IP address. The IP address bound to is saved
in the sock data structure in the rcv_addr and saddr fields. These are used in hash
lookups and as the sending IP address respectively. The port number is optional and
if it is not specified the supporting network is asked for a free one. By convention,
port numbers less than 1024 cannot be used by processes without superuser privileges.
If the underlying network does allocate a port number it always allocates ones greater
than 1024.
As packets are being received by the underlying network devices they must be routed
to the correct INET and BSD sockets so that they can be processed. For this reason
UDP and TCP maintain hash tables which are used to look up the addresses within
incoming IP messages and direct them to the correct socket/sock pair. TCP is a
connection oriented protocol and so there is more information involved in processing
TCP packets than there is in processing UDP packets.
UDP maintains a hash table of allocated UDP ports, the udp_hash table. This consists
of pointers to sock data structures indexed by a hash function based on the port
number. As the UDP hash table is much smaller than the number of permissible port
numbers (udp_hash is only 128 or UDP_HTABLE_SIZE entries long) some entries in the
table point to a chain of sock data structures linked together using each sock's next
pointer.
TCP is much more complex as it maintains several hash tables. However, TCP does
not actually add the binding sock data structure into its hash tables during the bind
operation, it merely checks that the port number requested is not currently being
used. The sock data structure is added to TCP's hash tables during the listen
operation.
REVIEW NOTE: What about the route entered?
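The UDP port hash just described can be pictured with the following sketch. The
structure is drastically simplified and the hash function shown is only a plausible
example of folding a port number into a small table, not necessarily the exact one the
kernel uses.

    #define UDP_HTABLE_SIZE 128                 /* as described above           */

    struct sock {                               /* heavily simplified           */
        unsigned short num;                     /* local port number            */
        struct sock *next;                      /* chain of socks in this slot  */
    };

    static struct sock *udp_hash[UDP_HTABLE_SIZE];

    /* Fold the port number into the table size, then walk the chain of sock
       data structures sharing that slot until the right port is found.        */
    static struct sock *udp_port_lookup(unsigned short port)
    {
        struct sock *sk;

        for (sk = udp_hash[port & (UDP_HTABLE_SIZE - 1)]; sk != NULL; sk = sk->next)
            if (sk->num == port)
                return sk;
        return NULL;
    }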
10.4.3 Making a Connection on an INET BSD Socket

Once a socket has been created and, provided it has not been used to listen for
inbound connection requests, it can be used to make outbound connection requests.
For connectionless protocols like UDP this socket operation does not do a whole lot
but for connection oriented protocols like TCP it involves building a virtual circuit
between two applications.
An outbound connection can only be made on an INET BSD socket that is in the right
state; that is to say one that does not already have a connection established and one
that is not being used for listening for inbound connections. This means that the BSD
socket data structure must be in state SS_UNCONNECTED. The UDP protocol does not
establish virtual connections between applications; any messages sent are datagrams,
one off messages that may or may not reach their destinations. It does, however,
support the connect BSD socket operation. A connect operation on a UDP INET BSD
socket simply sets up the addresses of the remote application: its IP address and its
IP port number. Additionally it sets up a cache of the routing table entry so that UDP
packets sent on this BSD socket do not need to check the routing database again
(unless this route becomes invalid). The cached routing information is pointed at from
the ip_route_cache pointer in the INET sock data structure. If no addressing
information is given, this cached routing and IP addressing information will
automatically be used for messages sent using this BSD socket. UDP moves the sock's
state to TCP_ESTABLISHED.
For a connect operation on a TCP BSD socket, TCP must build a TCP message
containing the connection information and send it to the IP destination given. The
TCP message contains information about the connection: a unique starting message
sequence number, the maximum sized message that can be managed by the initiating
host, the transmit and receive window size and so on. Within TCP all messages are
numbered and the initial sequence number is used as the first message number. Linux
chooses a reasonably random value to avoid malicious protocol attacks. Every message
transmitted by one end of the TCP connection and successfully received by the other
is acknowledged to say that it arrived successfully and uncorrupted. Unacknowledged
messages will be retransmitted. The transmit and receive window size is the number
of outstanding messages that there can be without an acknowledgement being sent.
The maximum message size is based on the network device that is being used at the
initiating end of the request. If the receiving end's network device supports smaller
maximum message sizes then the connection will use the minimum of the two. The
application making the outbound TCP connection request must now wait for a
response from the target application to accept or reject the connection request. As the
TCP sock is now expecting incoming messages, it is added to the tcp_listening_hash
so that incoming TCP messages can be directed to this sock data structure. TCP also
starts timers so that the outbound connection request can be timed out if the target
application does not respond to the request.
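From an application's point of view the difference between the two cases looks like
this; a user-level sketch only, with illustrative addresses and no error handling beyond
the connect() return value.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int connect_to(const char *ip, unsigned short port, int type)
    {
        struct sockaddr_in addr;
        int sock = socket(AF_INET, type, 0);  /* SOCK_STREAM (TCP) or SOCK_DGRAM (UDP) */

        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        addr.sin_addr.s_addr = inet_addr(ip);

        /* For SOCK_STREAM this builds a TCP virtual circuit; for SOCK_DGRAM it
           merely records the remote address and caches the route for later sends. */
        if (connect(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0)
            return -1;
        return sock;
    }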

10.4.4 Listening on an INET BSD Socket

Once a socket has had an address bound to it, it may listen for incoming connection
requests specifying the bound addresses. A network application can listen on a socket
without first binding an address to it; in this case the INET socket layer finds an
unused port number (for this protocol) and automatically binds it to the socket. The
listen socket function moves the socket into state TCP_LISTEN and does any network
specific work needed to allow incoming connections.
For UDP sockets, changing the socket's state is enough but TCP now adds the socket's
sock data structure into two hash tables as it is now active. These are the
tcp_bound_hash table and the tcp_listening_hash. Both are indexed via a hash
function based on the IP port number.
Whenever an incoming TCP connection request is received for an active listening
socket, TCP builds a new sock data structure to represent it. This sock data structure
will become the bottom half of the TCP connection when it is eventually accepted. It
also clones the incoming sk_buff containing the connection request and queues it onto
the receive queue for the listening sock data structure. The clone sk_buff contains a
pointer to the newly created sock data structure.

10.4.5 Accepting Connection Requests

UDP does not support the concept of connections; accepting INET socket connection
requests only applies to the TCP protocol, as an accept operation on a listening socket
causes a new socket data structure to be cloned from the original listening socket. The
accept operation is then passed to the supporting protocol layer, in this case INET, to
accept any incoming connection requests. The INET protocol layer will fail the accept
operation if the underlying protocol, say UDP, does not support connections.
Otherwise the accept operation is passed through to the real protocol, in this case
TCP. The accept operation can be either blocking or non-blocking. In the non-blocking
case if there are no incoming connections to accept, the accept operation will fail and
the newly created socket data structure will be thrown away. In the blocking case the
network application performing the accept operation will be added to a wait queue
and then suspended until a TCP connection request is received. Once a connection
request has been received the sk_buff containing the request is discarded and the sock
data structure is returned to the INET socket layer where it is linked to the new
socket data structure created earlier. The file descriptor (fd) number of the new socket
is returned to the network application, and the application can then use that file
descriptor in socket operations on the newly created INET BSD socket.

10.5 The IP Layer

10.5.1 Socket Buffers

See include/linux/skbuff.h

One of the problems of having many layers of network protocols, each one using the
services of another, is that each protocol needs to add protocol headers and tails to
data as it is transmitted and to remove them as it processes received data. This makes
passing data buffers between the protocols difficult as each layer needs to find where
its particular protocol headers and tails are. One solution is to copy buffers at each
layer but that would be inefficient. Instead, Linux uses socket buffers or sk_buffs to
pass data between the protocol layers and the network device drivers. sk_buffs contain
pointer and length fields that allow each protocol layer to manipulate the application
data via standard functions or "methods".
Figure 10.4 shows the sk_buff data structure; each sk_buff has a block of data
associated with it. The sk_buff has four data pointers, which are used to manipulate
and manage the socket buffer's data:

head points to the start of the data area in memory. This is fixed when the sk_buff
and its associated data block is allocated,

data points at the current start of the protocol data. This pointer varies depending on
the protocol layer that currently owns the sk_buff,

tail points at the current end of the protocol data. Again, this pointer varies
depending on the owning protocol layer,

end points at the end of the data area in memory. This is fixed when the sk_buff is
allocated.

[Figure 10.4: The Socket Buffer (sk_buff). The sk_buff contains next and prev list pointers, a dev pointer, the four data pointers head, data, tail and end, and the truesize and len length fields; head and end delimit the whole data area while data and tail delimit the packet to be transmitted.]


There are two length fields, len and truesize, which describe the length of the current
protocol packet and the total size of the data buffer respectively. The sk_buff handling
code provides standard mechanisms for adding and removing protocol headers and
tails to the application data. These safely manipulate the data, tail and len fields in
the sk_buff:

push This moves the data pointer towards the start of the data area and increments
the len field. This is used when adding data or protocol headers to the start of the
data to be transmitted. See skb_push() in include/linux/skbuff.h.

pull This moves the data pointer away from the start, towards the end of the data
area, and decrements the len field. This is used when removing data or protocol
headers from the start of the data that has been received. See skb_pull() in
include/linux/skbuff.h.

put This moves the tail pointer towards the end of the data area and increments the
len field. This is used when adding data or protocol information to the end of the
data to be transmitted. See skb_put() in include/linux/skbuff.h.

trim This moves the tail pointer towards the start of the data area and decrements the
len field. This is used when removing data or protocol tails from the received
packet. See skb_trim() in include/linux/skbuff.h.

The sk_buff data structure also contains pointers that are used as it is stored in doubly
linked circular lists of sk_buffs during processing. There are generic sk_buff routines
for adding sk_buffs to the front and back of these lists and for removing them.
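A protocol layer building a packet for transmission might use these operations roughly
as follows. This is a hedged sketch of the idiom rather than code lifted from the
sources; the header sizes and the helper name are arbitrary.

    #include <linux/skbuff.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    /* Illustrative only: reserve room for lower-layer headers, copy in the
       application data with skb_put(), then let each lower layer prepend its
       own header with skb_push() as the sk_buff passes down the stack.       */
    struct sk_buff *build_packet(const void *payload, unsigned int len,
                                 unsigned int header_room)
    {
        struct sk_buff *skb = alloc_skb(len + header_room, GFP_ATOMIC);

        if (skb == NULL)
            return NULL;

        skb_reserve(skb, header_room);           /* move data and tail past headroom */
        memcpy(skb_put(skb, len), payload, len); /* append payload, advancing tail/len */

        /* Later, e.g. in the IP layer:
           struct iphdr *iph = (struct iphdr *) skb_push(skb, sizeof(struct iphdr)); */
        return skb;
    }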

10.5.2 Receiving IP Packets

Chapter 8 described how Linux's network drivers are built into the kernel and
initialized. This results in a series of device data structures linked together in the
dev_base list. Each device data structure describes its device and provides a set of
callback routines that the network protocol layers call when they need the network
driver to perform work. These functions are mostly concerned with transmitting data
and with the network device's addresses. When a network device receives packets
from its network it must convert the received data into sk_buff data structures. These
received sk_buffs are added onto the backlog queue by the network drivers as they are
received. If the backlog queue grows too large, then the received sk_buffs are
discarded. The network bottom half is flagged as ready to run as there is work to do.
See netif_rx() in net/core/dev.c.
When the network bottom half handler is run by the scheduler it processes any
network packets waiting to be transmitted before processing the backlog queue of
sk_buffs, determining which protocol layer to pass the received packets to. As the
Linux networking layers were initialized, each protocol registered itself by adding a
packet_type data structure onto either the ptype_all list or into the ptype_base hash
table. The packet_type data structure contains the protocol type, a pointer to a
network device, a pointer to the protocol's receive data processing routine and,
finally, a pointer to the next packet_type data structure in the list or hash chain. The
ptype_all chain is used to snoop all packets being received from any network device
and is not normally used. The ptype_base hash table is hashed by protocol identifier
and is used to decide which protocol should receive the incoming network packet. The
network bottom half matches the protocol types of incoming sk_buffs against one or
more of the packet_type entries in either table. The protocol may match more than
one entry, for example when snooping all network traffic, and in this case the sk_buff
will be cloned. The sk_buff is passed to the matching protocol's handling routine.
See net_bh() in net/core/dev.c and ip_rcv() in net/ipv4/ip_input.c.

10.5.3 Sending IP Packets

See include/net/route.h

Packets are transmitted by applications exchanging data or else they are generated by
the network protocols as they support established connections or connections being
established. Whichever way the data is generated, an sk_buff is built to contain the
data and various headers are added by the protocol layers as it passes through them.
The sk_buff needs to be passed to a network device to be transmitted. First though
the protocol, for example IP, needs to decide which network device to use. This
depends on the best route for the packet. For computers connected by modem to a
single network, say via the PPP protocol, the routing choice is easy. The packet should
either be sent to the local host via the loopback device or to the gateway at the end of
the PPP modem connection. For computers connected to an ethernet the choices are
harder as there are many computers connected to the network.
For every IP packet transmitted, IP uses the routing tables to resolve the route for the
destination IP address. Each IP destination successfully looked up in the routing
tables returns a rtable data structure describing the route to use. This includes the
source IP address to use, the address of the network device data structure and,
sometimes, a prebuilt hardware header. This hardware header is network device
specific and contains the source and destination physical addresses and other media
specific information. If the network device is an ethernet device, the hardware header
would be as shown in Figure 10.1 and the source and destination addresses would be
physical ethernet addresses. The hardware header is cached with the route because it
must be appended to each IP packet transmitted on this route and constructing it
takes time. The hardware header may contain physical addresses that have to be
resolved using the ARP protocol. In this case the outgoing packet is stalled until the
address has been resolved. Once it has been resolved and the hardware header built,
the hardware header is cached so that future IP packets sent using this interface do
not have to ARP.

10.5.4 Data Fragmentation

Every network device has a maximum packet size and it cannot transmit or receive a
data packet bigger than this. The IP protocol allows for this and will fragment data
into smaller units to fit into the packet size that the network device can handle. The
IP protocol header includes a fragment field which contains a flag and the fragment
offset.
When an IP packet is ready to be transmitted, IP finds the network device to send the
IP packet out on. This device is found from the IP routing tables. Each device has a
field describing its maximum transfer unit (in bytes); this is the mtu field. If the
device's mtu is smaller than the packet size of the IP packet that is waiting to be
transmitted, then the IP packet must be broken down into smaller (mtu sized)
fragments. Each fragment is represented by an sk_buff; its IP header is marked to
show that it is a fragment and what offset into the data this IP packet contains. The
last packet is marked as being the last IP fragment. If, during the fragmentation, IP
cannot allocate an sk_buff, the transmit will fail.
Receiving IP fragments is a little more difficult than sending them because the IP
fragments can be received in any order and they must all be received before they can
be reassembled. Each time an IP packet is received it is checked to see if it is an IP
fragment. The first time that a fragment of a message is received, IP creates a new ipq
data structure, and this is linked into the ipqueue list of IP fragments awaiting
recombination. As more IP fragments are received, the correct ipq data structure is
found and a new ipfrag data structure is created to describe this fragment. Each ipq
data structure uniquely describes a fragmented IP receive frame with its source and
destination IP addresses, the upper layer protocol identifier and the identifier for this
IP frame. When all of the fragments have been received, they are combined into a
single sk_buff and passed up to the next protocol level to be processed. Each ipq
contains a timer that is restarted each time a valid fragment is received. If this timer
expires, the ipq data structure and its ipfrags are dismantled and the message is
presumed to have been lost in transit. It is then up to the higher level protocols to
retransmit the message.

See ip_build_xmit() in net/ipv4/ip_output.c and ip_rcv() in net/ipv4/ip_input.c.
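For example, the number of fragments needed can be worked out directly from the
device's mtu. The calculation below is a simplified sketch: it assumes a basic 20 byte
IP header and ignores the detail that fragment data lengths must be a multiple of
eight bytes.

    #include <stdio.h>

    /* Roughly how many fragments an IP packet of 'len' data bytes needs on a
       device with the given mtu, allowing 20 bytes for a basic IP header.     */
    unsigned int fragments_needed(unsigned int len, unsigned int mtu)
    {
        unsigned int per_fragment = mtu - 20;   /* data bytes carried per fragment */

        return (len + per_fragment - 1) / per_fragment;
    }

    int main(void)
    {
        /* 4000 data bytes over an ethernet device with an mtu of 1500 bytes
           needs 3 fragments: 1480 + 1480 + 1040.                              */
        printf("%u\n", fragments_needed(4000, 1500));
        return 0;
    }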

10.6 The Address Resolution Protocol (ARP)

The Address Resolution Protocol's role is to provide translations of IP addresses into
physical hardware addresses such as ethernet addresses. IP needs this translation just
before it passes the data (in the form of an sk_buff) to the device driver for
transmission. It performs various checks to see if this device needs a hardware header
and, if it does, if the hardware header for the packet needs to be rebuilt. Linux caches
hardware headers to avoid frequent rebuilding of them. If the hardware header needs
rebuilding, it calls the device specific hardware header rebuilding routine. All ethernet
devices use the same generic header rebuilding routine which in turn uses the ARP
services to translate the destination IP address into a physical address.
See ip_build_xmit() in net/ipv4/ip_output.c and eth_rebuild_header() in
net/ethernet/eth.c.
The ARP protocol itself is very simple and consists of two message types, an ARP
request and an ARP reply. The ARP request contains the IP address that needs
translating and the reply (hopefully) contains the translated IP address, the hardware
address. The ARP request is broadcast to all hosts connected to the network, so, for
an ethernet network, all of the machines connected to the ethernet will see the ARP
request. The machine that owns the IP address in the request will respond to the ARP
request with an ARP reply containing its own physical address.
The ARP protocol layer in Linux is built around a table of arp_table data structures
which each describe an IP to physical address translation. These entries are created
as IP addresses need to be translated and removed as they become stale over time.
Each arp_table data structure has the following fields:
Ea h arp table data stru ture has the following elds:
last used
last updated
ags
IP address
hardware address
hardware header
timer

the time that this ARP entry was last used,


the time that this ARP entry was last updated,
these des ribe this entry's state, if it is omplete and so on,
The IP address that this entry des ribes
The translated hardware address
This is a pointer to a a hed hardware header,
This is a timer list entry used to time out ARP requests
that do not get a response,
retries
The number of times that this ARP request has been
retried,
List of sk buff entries waiting for this IP address
sk buff queue
to be resolved
The ARP table consists of a table of pointers (the arp_tables vector) to chains of
arp_table entries. The entries are cached to speed up access to them; each entry is
found by taking the last two bytes of its IP address to generate an index into the table
and then following the chain of entries until the correct one is found. Linux also
caches prebuilt hardware headers off the arp_table entries in the form of hh_cache
data structures.
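The lookup just described might be sketched as below; the table size, the field names
and the exact way the address bytes are folded into an index are illustrative only, not
the kernel's actual declarations.

    #define ARP_TABLE_SIZE 16                  /* illustrative size only          */

    struct arp_table {                         /* drastically simplified          */
        struct arp_table *next;                /* chain of entries in this slot   */
        unsigned long ip;                      /* IP address this entry describes */
        unsigned char ha[6];                   /* translated hardware address     */
    };

    static struct arp_table *arp_tables[ARP_TABLE_SIZE];

    /* Fold part of the IP address into a chain index, then walk the chain
       looking for the entry that describes this address.                         */
    static struct arp_table *arp_lookup(unsigned long ip)
    {
        struct arp_table *entry;
        unsigned int hash = ip % ARP_TABLE_SIZE;

        for (entry = arp_tables[hash]; entry != NULL; entry = entry->next)
            if (entry->ip == ip)
                return entry;
        return NULL;
    }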
When an IP address translation is requested and there is no corresponding arp_table
entry, ARP must send an ARP request message. It creates a new arp_table entry in
the table and queues the sk_buff containing the network packet that needs the
address translation on the sk_buff queue of the new entry. It sends out an ARP
request and sets the ARP expiry timer running. If there is no response then ARP will
retry the request a number of times and if there is still no response ARP will remove
the arp_table entry. Any sk_buff data structures queued waiting for the IP address to
be translated will be notified and it is up to the protocol layer that is transmitting
them to cope with this failure. UDP does not care about lost packets but TCP will
attempt to retransmit on an established TCP link. If the owner of the IP address
responds with its hardware address, the arp_table entry is marked as complete and
any queued sk_buffs will be removed from the queue and will go on to be transmitted.
The hardware address is written into the hardware header of each sk_buff.
The ARP protocol layer must also respond to ARP requests that specify its IP address.
It registers its protocol type (ETH_P_ARP), generating a packet_type data structure.
This means that it will be passed all ARP packets that are received by the network
devices. As well as ARP replies, this includes ARP requests. It generates an ARP
reply using the hardware address kept in the receiving device's device data structure.
Network topologies can change over time and IP addresses can be reassigned to
different hardware addresses. For example, some dial up services assign an IP address
as each connection is established. In order that the ARP table contains up to date
entries, ARP runs a periodic timer which looks through all of the arp_table entries to
see which have timed out. It is very careful not to remove entries that contain one or
more cached hardware headers. Removing these entries is dangerous as other data
structures rely on them. Some arp_table entries are permanent and these are marked
so that they will not be deallocated. The ARP table cannot be allowed to grow too
large; each arp_table entry consumes some kernel memory. Whenever a new entry
needs to be allocated and the ARP table has reached its maximum size the table is
pruned by searching out the oldest entries and removing them.

10.7 IP Routing

The IP routing function determines where to send IP packets destined for a particular
IP address. There are many choices to be made when transmitting IP packets. Can the
destination be reached at all? If it can be reached, which network device should be
used to transmit it? If there is more than one network device that could be used to
reach the destination, which is the better one? The IP routing database maintains
information that gives answers to these questions. There are two databases, the most
important being the Forwarding Information Database. This is an exhaustive list of
known IP destinations and their best routes. A smaller and much faster database, the
route cache, is used for quick lookups of routes for IP destinations. Like all caches, it
must contain only the frequently accessed routes; its contents are derived from the
Forwarding Information Database.
Routes are added and deleted via IOCTL requests to the BSD socket interface. These
are passed onto the protocol to process. The INET protocol layer only allows
processes with superuser privileges to add and delete IP routes. These routes can be
fixed or they can be dynamic and change over time. Most systems use fixed routes
unless they themselves are routers. Routers run routing protocols which constantly
check on the availability of routes to all known IP destinations. Systems that are not
routers are known as end systems. The routing protocols are implemented as
daemons, for example GATED, and they also add and delete routes via the IOCTL
BSD socket interface.

10.7.1 The Route Cache

Whenever an IP route is looked up, the route cache is first checked for a matching
route. If there is no matching route in the route cache the Forwarding Information
Database is searched for a route. If no route can be found there, the IP packet will fail
to be sent and the application notified. If a route is in the Forwarding Information
Database and not in the route cache, then a new entry is generated and added into the
route cache for this route. The route cache is a table (ip_rt_hash_table) that contains
pointers to chains of rtable data structures. The index into the route table is a hash
function based on the least significant two bytes of the IP address. These are the two
bytes most likely to be different between destinations and provide the best spread of
hash values.

[Figure 10.5: The Forwarding Information Database. The fib_zones hash table points to fib_zone data structures (fz_next, fz_hash_table, fz_list, fz_nent, fz_logmask, fz_mask); each fib_zone queues pairs of fib_node (fib_next, fib_dst, fib_use, fib_info, fib_metric, fib_tos) and fib_info (fib_next, fib_prev, fib_gateway, fib_dev, fib_refcnt, fib_window, fib_flags, fib_mtu, fib_irtt) data structures.]

See ip_rt_check_expire() in net/ipv4/route.c

Each rtable entry contains information about the route: the destination IP address, the
network device to use to reach that IP address, the maximum size of message that can
be used and so on. It also has a reference count, a usage count and a timestamp of the
last time that it was used (in jiffies). The reference count is incremented each time the
route is used, to show the number of network connections using this route. It is
decremented as applications stop using the route. The usage count is incremented
each time the route is looked up and is used to order the rtable entry in its chain of
hash entries. The last used timestamp for all of the entries in the route cache is
periodically checked to see if the rtable is too old. If the route has not been recently
used, it is discarded from the route cache. If routes are kept in the route cache they
are ordered so that the most used entries are at the front of the hash chains. This
means that finding them will be quicker when routes are looked up.
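The route cache lookup can be sketched as follows. As with the earlier sketches, the
structure is heavily simplified and the hash calculation is only illustrative of "hash on
the least significant bytes of the address".

    #define RT_HASH_DIVISOR 256                  /* illustrative table size */

    struct rtable {                              /* simplified               */
        struct rtable *rt_next;                  /* chain within a hash slot */
        unsigned long  rt_dst;                   /* destination IP address   */
        unsigned long  rt_use;                   /* usage count              */
        unsigned long  rt_lastuse;               /* last used, in jiffies    */
    };

    static struct rtable *ip_rt_hash_table[RT_HASH_DIVISOR];

    /* Hash on the low bytes of the destination address, then walk the chain;
       a real lookup would also bump rt_use and rt_lastuse and reorder the
       chain so that busy routes stay near the front.                          */
    static struct rtable *rt_cache_lookup(unsigned long dst)
    {
        struct rtable *rt;
        unsigned int hash = (dst ^ (dst >> 8)) % RT_HASH_DIVISOR;

        for (rt = ip_rt_hash_table[hash]; rt != NULL; rt = rt->rt_next)
            if (rt->rt_dst == dst)
                return rt;
        return NULL;
    }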

10.7.2 The Forwarding Information Database

The forwarding information database (shown in Figure 10.5) contains IP's view of the
routes available to this system at this time. It is quite a complicated data structure
and, although it is reasonably efficiently arranged, it is not a quick database to
consult. In particular it would be very slow to look up destinations in this database for
every IP packet transmitted. This is the reason that the route cache exists: to speed
up IP packet transmission using known good routes. The route cache is derived from
the forwarding database and represents its commonly used entries.
Each IP subnet is represented by a fib_zone data structure. All of these are pointed at
from the fib_zones hash table. The hash index is derived from the IP subnet mask. All
routes to the same subnet are described by pairs of fib_node and fib_info data
structures queued onto the fz_list of each fib_zone data structure. If the number of
routes in this subnet grows large, a hash table is generated to make finding the
fib_node data structures easier.
Several routes may exist to the same IP subnet and these routes can go through one of
several gateways. The IP routing layer does not allow more than one route to a subnet
using the same gateway. In other words, if there are several routes to a subnet, then
each route is guaranteed to use a different gateway. Associated with each route is its
metric. This is a measure of how advantageous this route is. A route's metric is,
essentially, the number of IP subnets that it must hop across before it reaches the
destination subnet. The higher the metric, the worse the route.

Chapter 11

Kernel Mechanisms

This chapter describes some of the general tasks and mechanisms that the Linux
kernel needs to supply so that other parts of the kernel work effectively together.

11.1 Bottom Half Handling

There are often times in a kernel when you do not want to do work at this moment. A
good example of this is during interrupt processing. When the interrupt was asserted,
the processor stopped what it was doing and the operating system delivered the
interrupt to the appropriate device driver. Device drivers should not spend too much
time handling interrupts as, during this time, nothing else in the system can run.
There is often some work that could just as well be done later on. Linux's bottom half
handlers were invented so that device drivers and other parts of the Linux kernel
could queue work to be done later on. Figure 11.1 shows the kernel data structures
associated with bottom half handling. There can be up to 32 different
[Figure 11.1: Bottom Half Handling Data Structures. The 32-bit bh_active and bh_mask bitmasks (bits 0 to 31) select entries in the bh_base vector of bottom half handlers, for example the timers handler.]

See include/linux/interrupt.h

[Figure 11.2: A Task Queue. A task queue is a singly linked list of tq_struct data structures, each containing next and sync fields, a pointer to a routine and a pointer to some data.]


bottom half handlers; bh_base is a vector of pointers to each of the kernel's bottom
half handling routines. bh_active and bh_mask have their bits set according to what
handlers have been installed and are active. If bit N of bh_mask is set then the Nth
element of bh_base contains the address of a bottom half routine. If bit N of bh_active
is set then the Nth bottom half handler routine should be called as soon as the
scheduler deems reasonable. These indices are statically defined; the timer bottom half
handler is the highest priority (index 0), the console bottom half handler is next in
priority (index 1) and so on. Typically the bottom half handling routines have lists of
tasks associated with them. For example, the immediate bottom half handler works its
way through the immediate tasks queue (tq_immediate) which contains tasks that
need to be performed immediately.
Some of the kernel's bottom half handlers are device specific, but others are more
generic:

TIMER This handler is marked as active each time the system's periodic timer
interrupts and is used to drive the kernel's timer queue mechanisms,

CONSOLE This handler is used to process console messages,

TQUEUE This handler is used to process tty messages,

NET This handler handles general network processing,

IMMEDIATE This is a generic handler used by several device drivers to queue work to
be done later.

See do_bottom_half() in kernel/softirq.c

Whenever a device driver, or some other part of the kernel, needs to schedule work to
be done later, it adds work to the appropriate system queue, for example the timer
queue, and then signals the kernel that some bottom half handling needs to be done.
It does this by setting the appropriate bit in bh_active. Bit 8 is set if the driver has
queued something on the immediate queue and wishes the immediate bottom half
handler to run and process it. The bh_active bitmask is checked at the end of each
system call, just before control is returned to the calling process. If it has any bits set,
the bottom half handler routines that are active are called. Bit 0 is checked first, then
1 and so on until bit 31. The bit in bh_active is cleared as each bottom half handling
routine is called. bh_active is transient; it only has meaning between calls to the
scheduler and is a way of not calling bottom half handling routines when there is no
work for them to do.
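A driver that wants a bottom half run later might do something like the sketch below.
The choice of IMMEDIATE_BH as the slot is purely for illustration, and the interface
shown (init_bh() and mark_bh()) is a sketch of the mechanism described above rather
than a verbatim copy of the kernel's declarations.

    #include <linux/sched.h>
    #include <linux/interrupt.h>

    /* The deferred work: runs later, when the scheduler processes bottom halves. */
    static void my_bottom_half(void)
    {
        /* drain a private queue of work items, restart the device, etc. */
    }

    int my_driver_init(void)
    {
        /* Install the handler in bh_base and enable its bit in bh_mask. */
        init_bh(IMMEDIATE_BH, my_bottom_half);
        return 0;
    }

    void my_interrupt_handler(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* Do the minimum work here, then flag the bottom half as active;
           it will be run when the scheduler next deems it reasonable.     */
        mark_bh(IMMEDIATE_BH);
    }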

11.2 Task Queues

See include/linux/tqueue.h

Task queues are the kernel's way of deferring work until later. Linux has a generic
mechanism for queuing work on queues and for processing them later. Task queues
are often used in conjunction with bottom half handlers; the timer task queue is
processed when the timer queue bottom half handler runs. A task queue is a simple
data structure, see Figure 11.2, which consists of a singly linked list of tq_struct data
structures each of which contains the address of a routine and a pointer to some data.
The routine will be called when the element on the task queue is processed and it will
be passed a pointer to the data.
Anything in the kernel, for example a device driver, can create and use task queues
but there are three task queues created and managed by the kernel:

timer This queue is used to queue work that will be done as soon after the next
system clock tick as is possible. Each clock tick, this queue is checked to see if it
contains any entries and, if it does, the timer queue bottom half handler is made
active. The timer queue bottom half handler is processed, along with all the other
bottom half handlers, when the scheduler next runs. This queue should not be
confused with system timers, which are a much more sophisticated mechanism.

immediate This queue is also processed when the scheduler processes the active
bottom half handlers. The immediate bottom half handler is not as high in priority
as the timer queue bottom half handler and so these tasks will be run later.

scheduler This task queue is processed directly by the scheduler. It is used to support
other task queues in the system and, in this case, the task to be run will be a
routine that processes a task queue, say for a device driver.

When task queues are processed, the pointer to the first element in the queue is
removed from the queue and replaced with a null pointer. In fact, this removal is an
atomic operation, one that cannot be interrupted. Then each element in the queue has
its handling routine called in turn. The elements in the queue are often statically
allocated data. However there is no inherent mechanism for discarding allocated
memory. The task queue processing routine simply moves onto the next element in the
list. It is the job of the task itself to ensure that it properly cleans up any allocated
kernel memory.
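Queuing deferred work for the immediate bottom half handler looks roughly like this.
The tq_struct fields follow the layout shown in Figure 11.2; the driver routine itself is
invented for the example.

    #include <linux/tqueue.h>
    #include <linux/interrupt.h>

    static void my_deferred_work(void *data)
    {
        /* runs later, when the immediate bottom half handler walks tq_immediate */
    }

    static struct tq_struct my_task = {
        NULL,                 /* next: filled in when the task is queued */
        0,                    /* sync: guards against queuing it twice   */
        my_deferred_work,     /* routine to call                         */
        NULL                  /* data passed to the routine              */
    };

    void my_interrupt_handler(void)
    {
        queue_task(&my_task, &tq_immediate);  /* add work to the immediate queue */
        mark_bh(IMMEDIATE_BH);                /* and ask for the bottom half run */
    }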

11.3 Timers

An operating system needs to be able to schedule an activity sometime in the future.
A mechanism is needed whereby activities can be scheduled to run at some relatively
precise time. Any microprocessor that wishes to support an operating system must
have a programmable interval timer that periodically interrupts the processor. This
periodic interrupt is known as a system clock tick and it acts like a metronome,
orchestrating the system's activities. Linux has a very simple view of what time it is;
it measures time in clock ticks since the system booted. All system times are based on
this measurement, which is known as jiffies after the globally available variable of the
same name.
Linux has two types of system timers, both queue routines to be called at some system
time but they are slightly different in their implementations. Figure 11.3 shows both
mechanisms. The first, the old timer mechanism, has a static array of 32 pointers to
timer_struct data structures and a mask of active timers, timer_active.

See include/linux/timer.h

[Figure 11.3: System Timers. The old mechanism: a 32 entry timer_table of timer_struct data structures (expires, *fn()) with the timer_active bitmask selecting active entries. The new mechanism: a timer_head anchoring a doubly linked list of timer_list data structures (next, prev, expires, data, *function()).]


Where the timers go in the timer_table is statically defined (rather like the bottom
half handler table bh_base). Entries are added into this table mostly at system
initialization time. The second, newer, mechanism uses a linked list of timer_list data
structures held in ascending expiry time order.

See timer_bh(), run_old_timers() and run_timer_list() in kernel/sched.c

Both methods use the time in jiffies as an expiry time, so a timer that wished to run
in 5s would have to convert 5s to units of jiffies and add that to the current system
time to get the system time in jiffies when the timer should expire. Every system
clock tick the timer bottom half handler is marked as active so that when the
scheduler next runs, the timer queues will be processed. The timer bottom half
handler processes both types of system timer. For the old system timers the
timer_active bit mask is checked for bits that are set. If the expiry time for an active
timer has expired (expiry time is less than the current system jiffies), its timer routine
is called and its active bit is cleared. For new system timers, the entries in the linked
list of timer_list data structures are checked. Every expired timer is removed from the
list and its routine is called. The new timer mechanism has the advantage of being
able to pass an argument to the timer routine.
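Using the newer mechanism to run a routine five seconds from now might look like
this; the callback and its argument are invented for the example.

    #include <linux/timer.h>
    #include <linux/sched.h>      /* for jiffies and HZ */

    static void my_timeout(unsigned long data)
    {
        /* called in about five seconds; 'data' is whatever was stored below */
    }

    static struct timer_list my_timer;

    void start_five_second_timer(void)
    {
        init_timer(&my_timer);
        my_timer.expires  = jiffies + 5 * HZ;  /* convert 5s into clock ticks     */
        my_timer.data     = 0;                 /* argument passed to the routine  */
        my_timer.function = my_timeout;
        add_timer(&my_timer);                  /* insert into the timer_list chain */
    }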


11.4 Wait Queues

See include/linux/wait.h

There are many times when a process must wait for a system resource. For example a
process may need the VFS inode describing a directory in the file system and that
inode may not be in the buffer cache. In this case the process must wait for that inode
to be fetched from the physical media containing the file system before it can carry
on.
The Linux kernel uses a simple data structure, a wait queue (see Figure 11.4), which
consists of a pointer to the process's task_struct and a pointer to the next element in
the wait queue.

[Figure 11.4: Wait Queue. A wait queue element contains a *task pointer and a *next pointer.]
When processes are added to the end of a wait queue they can either be interruptible
or uninterruptible. Interruptible processes may be interrupted by events such as
timers expiring or signals being delivered whilst they are waiting on a wait queue. The
waiting process's state will reflect this and either be INTERRUPTIBLE or
UNINTERRUPTIBLE. As this process can not now continue to run, the scheduler is run
and, when it selects a new process to run, the waiting process will be suspended.1
When the wait queue is processed, the state of every process in the wait queue is set
to RUNNING. If the process has been removed from the run queue, it is put back onto
the run queue. The next time the scheduler runs, the processes that are on the wait
queue are now candidates to be run as they are now no longer waiting. When a
process on the wait queue is scheduled the first thing that it will do is remove itself
from the wait queue. Wait queues can be used to synchronize access to system
resources and they are used by Linux in its implementation of semaphores (see
below).

1 REVIEW NOTE: What is to stop a task in state INTERRUPTIBLE being made to run
the next time the scheduler runs? Processes in a wait queue should never run until
they are woken up.
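Kernel code waiting for a resource typically uses the helpers built on this structure.
The sketch below shows the common sleep_on()/wake_up() idiom; the flag being
waited for is invented for the example.

    #include <linux/sched.h>
    #include <linux/wait.h>

    static struct wait_queue *buffer_wait = NULL;  /* an empty wait queue    */
    static int buffer_ready = 0;                   /* illustrative condition */

    void wait_for_buffer(void)
    {
        while (!buffer_ready)
            interruptible_sleep_on(&buffer_wait);  /* add ourselves and suspend */
    }

    void buffer_arrived(void)
    {
        buffer_ready = 1;
        wake_up_interruptible(&buffer_wait);       /* set waiters back to RUNNING */
    }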

11.5 Buzz Locks

These are better known as spin locks and they are a primitive way of protecting a
data structure or piece of code. They only allow one process at a time to be within a
critical region of code. They are used in Linux to restrict access to fields in data
structures, using a single integer field as a lock. Each process wishing to enter the
region attempts to change the lock's initial value from 0 to 1. If its current value is 1,
the process tries again, spinning in a tight loop of code. The access to the memory
location holding the lock must be atomic; the action of reading its value, checking
that it is 0 and then changing it to 1 cannot be interrupted by any other process. Most
CPU architectures provide support for this via special instructions but you can also
implement buzz locks using uncached main memory.
When the owning process leaves the critical region of code it decrements the buzz
lock, returning its value to 0. Any processes spinning on the lock will now read it as 0;
the first one to do this will increment it to 1 and enter the critical region.
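In C the idea can be sketched with a compiler-provided atomic exchange standing in
for the special instruction mentioned above; the kernel itself uses architecture specific
assembler rather than this portable approximation.

    /* A minimal buzz (spin) lock sketch using a compiler builtin as a
       stand-in for an atomic test-and-set instruction.                 */
    typedef volatile int buzz_lock_t;

    static void buzz_lock(buzz_lock_t *lock)
    {
        /* Atomically set the lock to 1 and get its previous value; if it
           was already 1, somebody else owns the critical region, so spin. */
        while (__sync_lock_test_and_set(lock, 1) != 0)
            ;                                   /* spin in a tight loop */
    }

    static void buzz_unlock(buzz_lock_t *lock)
    {
        __sync_lock_release(lock);              /* return the lock's value to 0 */
    }

    /* Usage:
           static buzz_lock_t lock = 0;
           buzz_lock(&lock);
           ... touch the protected fields ...
           buzz_unlock(&lock);                                            */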

11.6 Semaphores

Semaphores are used to protect critical regions of code or data structures. Remember
that each access of a critical piece of data such as a VFS inode describing a directory
is made by kernel code running on behalf of a process. It would be very dangerous to
allow one process to alter a critical data structure that is being used by another
process. One way to achieve this would be to use a buzz lock around the critical piece
of data being accessed, but this is a simplistic approach that would not give very good
system performance. Instead Linux uses semaphores to allow just one process at a
time to access critical regions of code and data; all other processes wishing to access
this resource will be made to wait until it becomes free. The waiting processes are
suspended; other processes in the system can continue to run as normal.

See include/asm/semaphore.h

A Linux semaphore data structure contains the following information:

count This field keeps track of the count of processes wishing to use this resource. A
positive value means that the resource is available. A negative or zero value means
that processes are waiting for it. An initial value of 1 means that one and only one
process at a time can use this resource. When processes want this resource they
decrement the count and when they have finished with this resource they
increment the count,

waking This is the count of processes waiting for this resource which is also the
number of processes waiting to be woken up when this resource becomes free,

wait queue When processes are waiting for this resource they are put onto this wait
queue,

lock A buzz lock used when accessing the waking field.


Suppose the initial count for a semaphore is 1; the first process to come along will see
that the count is positive and decrement it by 1, making it 0. The process now "owns"
the critical piece of code or resource that is being protected by the semaphore. When
the process leaves the critical region it increments the semaphore's count. The most
optimal case is where there are no other processes contending for ownership of the
critical region. Linux has implemented semaphores to work efficiently for this, the
most common, case.
If another process wishes to enter the critical region whilst it is owned by a process it
too will decrement the count. As the count is now negative (-1) the process cannot
enter the critical region. Instead it must wait until the owning process exits it. Linux
makes the waiting process sleep until the owning process wakes it on exiting the
critical region. The waiting process adds itself to the semaphore's wait queue and sits
in a loop checking the value of the waking field and calling the scheduler until waking
is non-zero.
The owner of the critical region increments the semaphore's count and if it is less than
or equal to zero then there are processes sleeping, waiting for this resource. In the
optimal case the semaphore's count would have been returned to its initial value of 1
and no further work would be necessary. The owning process increments the waking
counter and wakes up the process sleeping on the semaphore's wait queue. When the
waiting process wakes up, the waking counter is now 1 and it knows that it may now
enter the critical region. It decrements the waking counter, returning it to a value of
zero, and continues. All accesses to the waking field of the semaphore are protected by
a buzz lock using the semaphore's lock.
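From the point of view of kernel code using it, the semaphore interface is just a pair
of down and up operations; the resource being protected in this sketch is invented for
the example.

    #include <asm/semaphore.h>

    static struct semaphore table_sem = MUTEX;   /* count initialized to 1 */

    void update_protected_table(void)
    {
        down(&table_sem);   /* decrement count; sleep here if another process owns it */

        /* ... modify the critical data structure ... */

        up(&table_sem);     /* increment count and wake any sleeping waiter */
    }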

Chapter 12

Modules

This chapter describes how the Linux kernel can dynamically load functions, for
example filesystems, only when they are needed.
Linux is a monolithic kernel; that is, it is one, single, large program where all the
functional components of the kernel have access to all of its internal data structures
and routines. The alternative is to have a micro-kernel structure where the functional
pieces of the kernel are broken out into separate units with strict communication
mechanisms between them. This makes adding new components into the kernel via
the configuration process rather time consuming. Say you wanted to use a SCSI driver
for an NCR 810 SCSI and you had not built it into the kernel. You would have to
configure and then build a new kernel before you could use the NCR 810. There is an
alternative: Linux allows you to dynamically load and unload components of the
operating system as you need them. Linux modules are lumps of code that can be
dynamically linked into the kernel at any point after the system has booted. They can
be unlinked from the kernel and removed when they are no longer needed. Mostly
Linux kernel modules are device drivers, pseudo-device drivers such as network
drivers, or file-systems.
You can either load and unload Linux kernel modules explicitly using the insmod and
rmmod commands or the kernel itself can demand that the kernel daemon (kerneld)
loads and unloads the modules as they are needed. Dynamically loading code as it is
needed is attractive as it keeps the kernel size to a minimum and makes the kernel
very flexible. My current Intel kernel uses modules extensively and is only 406 Kbytes
long. I only occasionally use VFAT file systems and so I build my Linux kernel to
automatically load the VFAT file system module as I mount a VFAT partition. When I
have unmounted the VFAT partition the system detects that I no longer need the
VFAT file system module and removes it from the system. Modules can also be useful
for trying out new kernel code without having to rebuild and reboot the kernel every
time you try it out. Nothing, though, is for free and there is a slight performance and
memory penalty associated with kernel modules. There is a little more code that a
loadable module must provide and this and the extra data structures take a little
more memory. There is also a level of indirection introduced that makes accesses of
kernel resources slightly less efficient for modules.
Once a Linux module has been loaded it is as much a part of the kernel as any normal
kernel code. It has the same rights and responsibilities as any kernel code; in other
words, Linux kernel modules can crash the kernel just like all kernel code or device
drivers can.
So that modules can use the kernel resources that they need, they must be able to find
them. Say a module needs to call kmalloc(), the kernel memory allocation routine. At
the time that it is built, a module does not know where in memory kmalloc() is, so
when the module is loaded, the kernel must fix up all of the module's references to
kmalloc() before the module can work. The kernel keeps a list of all of the kernel's
resources in the kernel symbol table so that it can resolve references to those
resources from the modules as they are loaded. Linux allows module stacking; this is
where one module requires the services of another module. For example, the VFAT
file system module requires the services of the FAT file system module as the VFAT
file system is more or less a set of extensions to the FAT file system. One module
requiring services or resources from another module is very similar to the situation
where a module requires services and resources from the kernel itself. Only here the
required services are in another, previously loaded module. As each module is loaded,
the kernel modifies the kernel symbol table, adding to it all of the resources or
symbols exported by the newly loaded module. This means that, when the next
module is loaded, it has access to the services of the already loaded modules.
When an attempt is made to unload a module, the kernel needs to know that the
module is unused and it needs some way of notifying the module that it is about to be
unloaded. That way the module will be able to free up any system resources that it
has allocated, for example kernel memory or interrupts, before it is removed from the
kernel. When the module is unloaded, the kernel removes any symbols that that
module exported into the kernel symbol table.
Apart from the ability of a loaded module to crash the operating system by being
badly written, it presents another danger. What happens if you load a module built for
an earlier or later kernel than the one that you are now running? This may cause a
problem if, say, the module makes a call to a kernel routine and supplies the wrong
arguments. The kernel can optionally protect against this by making rigorous version
checks on the module as it is loaded.

12.1 Loading a Module

There are two ways that a kernel module can be loaded. The first way is to use the insmod command to manually insert it into the kernel. The second, and much more clever way, is to load the module as it is needed; this is known as demand loading. When the kernel discovers the need for a module, for example when the user mounts a file system that is not in the kernel, the kernel will request that the kernel daemon (kerneld) attempts to load the appropriate module. (kerneld is in the modules package along with insmod, lsmod and rmmod. See include/linux/kerneld.h.)

The kernel daemon is a normal user process albeit with super user privileges. When it is started up, usually at system boot time, it opens up an Inter-Process Communication (IPC) channel to the kernel. This link is used by the kernel to send messages to the kerneld asking for various tasks to be performed. Kerneld's major function is to load and unload kernel modules, but it is also capable of other tasks such as starting up the PPP link over a serial line when it is needed and closing it down when it is not. Kerneld does not perform these tasks itself; it runs the necessary programs such as insmod to do the work. Kerneld is just an agent of the kernel, scheduling work on its behalf.

Figure 12.1: The List of Kernel Modules. (The figure shows the module_list pointer chaining together module data structures, here "fat" and "vfat", each with next, ref, symtab, name, size, addr, state and *cleanup() fields, and each symtab pointing at a symbol_table holding size, n_symbols, n_refs, symbols and references.)
The insmod utility must find the requested kernel module that it is to load. Demand loaded kernel modules are normally kept in /lib/modules/kernel-version. The kernel modules are linked object files just like other programs in the system except that they are linked as relocatable images. That is, images that are not linked to run from a particular address. They can be either a.out or ELF format object files. insmod makes a privileged system call to find the kernel's exported symbols. These are kept in pairs containing the symbol's name and its value, for example its address. The kernel's exported symbol table is held in the first module data structure in the list of modules maintained by the kernel and pointed at by the module_list pointer. Only specifically entered symbols are added into the table, which is built when the kernel is compiled and linked; not every symbol in the kernel is exported to its modules. An example symbol is request_irq, which is the kernel routine that must be called when a driver wishes to take control of a particular system interrupt. In my current kernel, this has a value of 0x0010cd30. You can easily see the exported kernel symbols and their values by looking at /proc/ksyms or by using the ksyms utility. The ksyms utility can either show you all of the exported kernel symbols or only those symbols exported by loaded modules. insmod reads the module into its virtual memory and fixes up its unresolved references to kernel routines and resources using the exported symbols from the kernel. This fixing up takes the form of patching the module image in memory. insmod physically writes the address of the symbol into the appropriate place in the module.
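For example, you could check how a symbol such as request_irq has been exported by searching /proc/ksyms (the address shown is the one quoted above; the exact output format varies a little between kernel versions):

$ grep request_irq /proc/ksyms
0010cd30 request_irq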

See sys_get_kernel_syms(), sys_create_module() and sys_init_module() in kernel/module.c, and include/linux/module.h.

When insmod has fixed up the module's references to exported kernel symbols, it asks the kernel for enough space to hold the new module, again using a privileged system call. The kernel allocates a new module data structure and enough kernel memory to hold the new module and puts it at the end of the kernel modules list. The new module is marked as UNINITIALIZED. Figure 12.1 shows the list of kernel modules after two modules, FAT and VFAT, have been loaded into the kernel. Not shown in the diagram is the first module on the list, which is a pseudo-module that is only there to hold the kernel's exported symbol table. You can use the command lsmod to list all of the loaded kernel modules and their interdependencies. lsmod simply reformats /proc/modules which is built from the list of kernel module data structures. The memory that the kernel allocates for it is mapped into the insmod process's address space so that it can access it. insmod copies the module into the allocated space and relocates it so that it will run from the kernel address that it has been allocated. This must happen as the module cannot expect to be loaded at the same address twice let alone into the same address in two different Linux systems. Again, this relocation involves patching the module image with the appropriate addresses.

The new module also exports symbols to the kernel and insmod builds a table of these exported symbols. Every kernel module must contain module initialization and module cleanup routines and these symbols are deliberately not exported but insmod must know the addresses of them so that it can pass them to the kernel. All being well, insmod is now ready to initialize the module and it makes a privileged system call passing the kernel the addresses of the module's initialization and cleanup routines.

When a new module is added into the kernel, it must update the kernel's set of symbols and modify the modules that are being used by the new module. Modules that have other modules dependent on them must maintain a list of references at the end of their symbol table and pointed at by their module data structure. Figure 12.1 shows that the VFAT file system module is dependent on the FAT file system module. So, the FAT module contains a reference to the VFAT module; the reference was added when the VFAT module was loaded. The kernel calls the module's initialization routine and, if it is successful, it carries on installing the module. The module's cleanup routine address is stored in its module data structure and it will be called by the kernel when that module is unloaded. Finally, the module's state is set to RUNNING.
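To make the initialization and cleanup entry points concrete, here is a minimal sketch of a 2.0-era module; the names and messages are purely illustrative and it assumes the file is compiled with the usual -D__KERNEL__ -DMODULE flags:

/* skeleton.c - a minimal loadable module sketch */
#include <linux/module.h>
#include <linux/kernel.h>

/* Called by the kernel, via insmod, when the module is loaded.
 * Returning non-zero makes the load fail and the module is removed. */
int init_module(void)
{
    printk("skeleton: loaded\n");
    return 0;
}

/* Called by the kernel just before the module is unloaded so that it
 * can free any resources (kernel memory, interrupts) it allocated. */
void cleanup_module(void)
{
    printk("skeleton: unloaded\n");
}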

12.2 Unloading a Module

Modules can be removed using the rmmod command but demand loaded modules are automatically removed from the system by kerneld when they are no longer being used. Every time its idle timer expires, kerneld makes a system call requesting that all unused demand loaded modules are removed from the system. The timer's value is set when you start kerneld; my kerneld checks every 180 seconds. So, for example, if you mount an iso9660 CD ROM and your iso9660 filesystem is a loadable module, then shortly after the CD ROM is unmounted, the iso9660 module will be removed from the kernel.

A module cannot be unloaded so long as other components of the kernel are depending on it. For example, you cannot unload the VFAT module if you have one or more VFAT file systems mounted. If you look at the output of lsmod, you will see that each module has a count associated with it. For example:

Module:      #pages:   Used by:
msdos           5         1
vfat            4         1  (autoclean)
fat             6         [vfat msdos]  2  (autoclean)

The count is the number of kernel entities that are dependent on this module. In the above example, the vfat and msdos modules are both dependent on the fat module and so it has a count of 2. Both the vfat and msdos modules have 1 dependent, which is a mounted file system. If I were to mount another VFAT file system then the vfat module's count would become 2. A module's count is held in the first longword of its image.
This field is slightly overloaded as it also holds the AUTOCLEAN and VISITED flags. Both of these flags are used for demand loaded modules. These modules are marked as AUTOCLEAN so that the system can recognize which ones it may automatically unload. The VISITED flag marks the module as in use by one or more other system components; it is set whenever another component makes use of the module. Each time the system is asked by kerneld to remove unused demand loaded modules it looks through all of the modules in the system for likely candidates. It only looks at modules marked as AUTOCLEAN and in the state RUNNING. If the candidate has its VISITED flag cleared then it will remove the module, otherwise it will clear the VISITED flag and go on to look at the next module in the system.

Assuming that a module can be unloaded, its cleanup routine is called to allow it to free up the kernel resources that it has allocated. (See sys_delete_module() in kernel/module.c.) The module data structure is marked as DELETED and it is unlinked from the list of kernel modules. Any other modules that it is dependent on have their reference lists modified so that they no longer have it as a dependent. All of the kernel memory that the module needed is deallocated.

Chapter 13

Processors

Linux runs on a number of processors; this chapter gives a brief outline of each of them.

13.1 X86

TBD

13.2 ARM

The ARM processor implements a low power, high performance 32 bit RISC architecture. It is being widely used in embedded devices such as mobile phones and PDAs (Personal Data Assistants). It has 31 32 bit registers with 16 visible in any mode. Its instructions are simple load and store instructions (load a value from memory, perform an operation and store the result back into memory). One interesting feature it has is that every instruction is conditional. For example, you can test the value of a register and, until you next test for the same condition, you can conditionally execute instructions as and when you like. Another interesting feature is that you can perform arithmetic and shift operations on values as you load them. It operates in several modes, including a system mode that can be entered from user mode via a SWI (software interrupt).

It is a synthesisable core and ARM (the company) does not itself manufacture processors. Instead the ARM partners (companies such as Intel or LSI for example) implement the ARM architecture in silicon. It allows other processors to be tightly coupled via a co-processor interface and it has several memory management unit variations. These range from simple memory protection schemes to complex page hierarchies.

13.3 Alpha AXP Processor

The Alpha AXP architecture is a 64-bit load/store RISC architecture designed with speed in mind. All registers are 64 bits in length; there are 32 integer registers and 32 floating point registers. Integer register 31 and floating point register 31 are used for null operations. A read from them generates a zero value and a write to them has no effect. All instructions are 32 bits long and memory operations are either reads or writes. The architecture allows different implementations so long as the implementations follow the architecture.

There are no instructions that operate directly on values stored in memory; all data manipulation is done between registers. So, if you want to increment a counter in memory, you first read it into a register, then modify it and write it out. The instructions only interact with each other by one instruction writing to a register or memory location and another instruction reading that register or memory location. One interesting feature of Alpha AXP is that there are instructions that can generate flags, such as testing if two registers are equal; the result is not stored in a processor status register, but is instead stored in a third register. This may seem strange at first, but removing this dependency from a status register means that it is much easier to build a CPU which can issue multiple instructions every cycle. Instructions on unrelated registers do not have to wait for each other to execute as they would if there were a single status register. The lack of direct operations on memory and the large number of registers also help issue multiple instructions.

The Alpha AXP architecture uses a set of subroutines, called privileged architecture library code (PALcode). PALcode is specific to the operating system, the CPU implementation of the Alpha AXP architecture and to the system hardware. These subroutines provide operating system primitives for context switching, interrupts, exceptions and memory management. These subroutines can be invoked by hardware or by CALL_PAL instructions. PALcode is written in standard Alpha AXP assembler with some implementation specific extensions to provide direct access to low level hardware functions, for example internal processor registers. PALcode is executed in PALmode, a privileged mode that stops some system events happening and allows the PALcode complete control of the physical system hardware.

Chapter 14

The Linux Kernel Sources

This chapter describes where in the Linux kernel sources you should start looking for particular kernel functions.

This book does not depend on a knowledge of the 'C' programming language or require that you have the Linux kernel sources available in order to understand how the Linux kernel works. That said, it is a fruitful exercise to look at the kernel sources to get an in-depth understanding of the Linux operating system. This chapter gives an overview of the kernel sources; how they are arranged and where you might start to look for particular code.

Where to Get The Linux Kernel Sources

All of the major Linux distributions (Craftworks, Debian, Slackware, Red Hat etcetera) include the kernel sources in them. Usually the Linux kernel that got installed on your Linux system was built from those sources. By their very nature these sources tend to be a little out of date so you may want to get the latest sources from one of the sites mentioned in appendix B. They are kept on ftp://ftp.cs.helsinki.fi and all of the other sites shadow them. This makes the Helsinki site the most up to date, but sites like MIT and Sunsite are never very far behind.

If you do not have access to the web, there are many CD ROM vendors who offer snapshots of the world's major web sites at a very reasonable cost. Some even offer a subscription service with quarterly or even monthly updates. Your local Linux User Group is also a good source of sources.

The Linux kernel sources have a very simple numbering system. Any even number kernel (for example 2.0.30) is a stable, released, kernel and any odd numbered kernel (for example 2.1.42) is a development kernel. This book is based on the stable 2.0.30 source tree. Development kernels have all of the latest features and support all of the latest devices. Although they can be unstable, which may not be exactly what you want, it is important that the Linux community tries the latest kernels. That way they are tested for the whole community. Remember that it is always worth backing up your system thoroughly if you do try out non-production kernels.

Changes to the kernel sources are distributed as patch files. The patch utility is used to apply a series of edits to a set of source files. So, for example, if you have the 2.0.29 kernel source tree and you wanted to move to the 2.0.30 source tree, you would obtain the 2.0.30 patch file and apply the patches (edits) to that source tree:

$ cd /usr/src/linux
$ patch -p1 < patch-2.0.30

This saves copying whole source trees, perhaps over slow serial connections. A good source of kernel patches (official and unofficial) is the http://www.linuxhq.com web site.

How The Kernel Sources Are Arranged

At the very top level of the source tree /usr/src/linux you will see a number of directories:

arch The arch subdirectory contains all of the architecture specific kernel code. It has further subdirectories, one per supported architecture, for example i386 and alpha.

include The include subdirectory contains most of the include files needed to build the kernel code. It too has further subdirectories including one for every architecture supported. The include/asm subdirectory is a soft link to the real include directory needed for this architecture, for example include/asm-i386. To change architectures you need to edit the kernel makefile and rerun the Linux kernel configuration program.

init This directory contains the initialization code for the kernel and it is a very good place to start looking at how the kernel works.

mm This directory contains all of the memory management code. The architecture specific memory management code lives down in arch/*/mm/, for example arch/i386/mm/fault.c.

drivers All of the system's device drivers live in this directory. They are further sub-divided into classes of device driver, for example block.

ipc This directory contains the kernel's inter-process communications code.

modules This is simply a directory used to hold built modules.

fs All of the file system code. This is further sub-divided into directories, one per supported file system, for example vfat and ext2.

kernel The main kernel code. Again, the architecture specific kernel code is in arch/*/kernel.

net The kernel's networking code.

lib This directory contains the kernel's library code. The architecture specific library code can be found in arch/*/lib/.

scripts This directory contains the scripts (for example awk and tk scripts) that are used when the kernel is configured.

Where to Start Looking

A large complex program like the Linux kernel can be rather daunting to look at. It is rather like a large ball of string with no end showing. Looking at one part of the kernel often leads to looking at several other related files and before long you have forgotten what you were looking for. The next subsections give you a hint as to where in the source tree the best place to look is for a given subject.

System Startup and Initialization

On an Intel based system, the kernel starts when either loadlin.exe or LILO has loaded the kernel into memory and passed control to it. Look in arch/i386/kernel/head.S for this part. Head.S does some architecture specific setup and then jumps to the main() routine in init/main.c.

Memory Management

This code is mostly in mm but the architecture specific code is in arch/*/mm. The page fault handling code is in mm/memory.c and the memory mapping and page cache code is in mm/filemap.c. The buffer cache is implemented in mm/buffer.c and the swap cache in mm/swap_state.c and mm/swapfile.c.

Kernel

Most of the relevant generic code is in kernel with the architecture specific code in arch/*/kernel. The scheduler is in kernel/sched.c and the fork code is in kernel/fork.c. The bottom half handling code is in include/linux/interrupt.h. The task_struct data structure can be found in include/linux/sched.h.

PCI

The PCI pseudo driver is in drivers/pci/pci.c with the system wide definitions in include/linux/pci.h. Each architecture has some specific PCI BIOS code; Alpha AXP's is in arch/alpha/kernel/bios32.c.

Interprocess Communication

This is all in ipc. All System V IPC objects include an ipc_perm data structure and this can be found in include/linux/ipc.h. System V messages are implemented in ipc/msg.c, shared memory in ipc/shm.c and semaphores in ipc/sem.c. Pipes are implemented in fs/pipe.c.

Interrupt Handling

The kernel's interrupt handling code is almost all microprocessor (and often platform) specific. The Intel interrupt handling code is in arch/i386/kernel/irq.c and its definitions in include/asm-i386/irq.h.

Device Drivers

Most of the lines of the Linux kernel's source code are in its device drivers. All of Linux's device driver sources are held in drivers but these are further broken out by type:

/block Block device drivers such as ide (in ide.c). If you want to look at how all of the devices that could possibly contain file systems are initialized then you should look at device_setup() in drivers/block/genhd.c. It not only initializes the hard disks but also the network as you need a network to mount nfs file systems. Block devices include both IDE and SCSI based devices.

/char This is the place to look for character based devices such as ttys, serial ports and mice.

/cdrom All of the CDROM code for Linux. It is here that the special CDROM devices (such as Soundblaster CDROM) can be found. Note that the ide CD driver is ide-cd.c in drivers/block and that the SCSI CD driver is in scsi.c in drivers/scsi.

/pci These are the sources for the PCI pseudo-driver. A good place to look at how the PCI subsystem is mapped and initialized. The Alpha AXP PCI fixup code is also worth looking at in arch/alpha/kernel/bios32.c.

/scsi This is where to find all of the SCSI code as well as all of the drivers for the scsi devices supported by Linux.

/net This is where to look to find the network device drivers such as the DECChip 21040 PCI ethernet driver which is in tulip.c.

/sound This is where all of the sound card drivers are.

File Systems

The sources for the EXT2 file system are all in the fs/ext2/ directory with data structure definitions in include/linux/ext2_fs.h, ext2_fs_i.h and ext2_fs_sb.h. The Virtual File System data structures are described in include/linux/fs.h and the code is in fs/*. The buffer cache is implemented in fs/buffer.c along with the update kernel daemon.

Network

The networking code is kept in net with most of the include files in include/net. The BSD socket code is in net/socket.c and the IP version 4 INET socket code is in net/ipv4/af_inet.c. The generic protocol support code (including the sk_buff handling routines) is in net/core with the TCP/IP networking code in net/ipv4. The network device drivers are in drivers/net.

Modules

The kernel module code is partially in the kernel and partially in the modules package. The kernel code is all in kernel/module.c with the data structures and kernel daemon kerneld messages in include/linux/module.h and include/linux/kerneld.h respectively. You may want to look at the structure of an ELF object file in include/linux/elf.h.

Appendix A

Linux Data Structures

This appendix lists the major data structures that Linux uses and which are described in this book. They have been edited slightly to fit the paper.

block_dev_struct

block_dev_struct data structures are used to register block devices as available for use by the buffer cache. They are held together in the blk_dev vector.

See include/linux/blkdev.h

struct blk_dev_struct {
    void (*request_fn)(void);
    struct request * current_request;
    struct request   plug;
    struct tq_struct plug_tq;
};

buffer_head

The buffer_head data structure holds information about a block buffer in the buffer cache.

See include/linux/fs.h

/* bh state bits */
#define BH_Uptodate   0  /* 1 if the buffer contains valid data      */
#define BH_Dirty      1  /* 1 if the buffer is dirty                 */
#define BH_Lock       2  /* 1 if the buffer is locked                */
#define BH_Req        3  /* 0 if the buffer has been invalidated     */
#define BH_Touched    4  /* 1 if the buffer has been touched (aging) */
#define BH_Has_aged   5  /* 1 if the buffer has been aged (aging)    */
#define BH_Protected  6  /* 1 if the buffer is protected             */
#define BH_FreeOnIO   7  /* 1 to discard the buffer_head after IO    */

struct buffer_head {
    /* First cache line: */
    unsigned long       b_blocknr;     /* block number                  */
    kdev_t              b_dev;         /* device (B_FREE = free)        */
    kdev_t              b_rdev;        /* Real device                   */
    unsigned long       b_rsector;     /* Real buffer location on disk  */
    struct buffer_head  *b_next;       /* Hash queue list               */
    struct buffer_head  *b_this_page;  /* circular list of buffers in
                                          one page                      */

    /* Second cache line: */
    unsigned long       b_state;       /* buffer state bitmap (above)   */
    struct buffer_head  *b_next_free;
    unsigned int        b_count;       /* users using this block        */
    unsigned long       b_size;        /* block size                    */

    /* Non-performance-critical data follows. */
    char                *b_data;       /* pointer to data block         */
    unsigned int        b_list;        /* List that this buffer appears */
    unsigned long       b_flushtime;   /* Time when this (dirty) buffer
                                        * should be written             */
    unsigned long       b_lru_time;    /* Time when this buffer was
                                        * last used.                    */
    struct wait_queue   *b_wait;
    struct buffer_head  *b_prev;       /* doubly linked hash list       */
    struct buffer_head  *b_prev_free;  /* doubly linked list of buffers */
    struct buffer_head  *b_reqnext;    /* request queue                 */
};

device

Every network device in the system is represented by a device data structure.

See include/linux/netdevice.h

struct device
{
    /*
     * This is the first field of the "visible" part of this structure
     * (i.e. as seen by users in the "Space.c" file). It is the name
     * of the interface.
     */
    char                  *name;

    /* I/O specific fields                                        */
    unsigned long         rmem_end;        /* shmem "recv" end    */
    unsigned long         rmem_start;      /* shmem "recv" start  */
    unsigned long         mem_end;         /* shared mem end      */
    unsigned long         mem_start;       /* shared mem start    */
    unsigned long         base_addr;       /* device I/O address  */
    unsigned char         irq;             /* device IRQ number   */

    /* Low-level status flags. */
    volatile unsigned char start,          /* start an operation  */
                           interrupt;      /* interrupt arrived   */
    unsigned long          tbusy;          /* transmitter busy    */
    struct device          *next;

    /* The device initialization function. Called only once.      */
    int                   (*init)(struct device *dev);

    /* Some hardware also needs these fields, but they are not part of
       the usual set specified in Space.c. */
    unsigned char         if_port;         /* Selectable AUI,TP,  */
    unsigned char         dma;             /* DMA channel         */
    struct enet_statistics* (*get_stats)(struct device *dev);

    /*
     * This marks the end of the "visible" part of the structure. All
     * fields hereafter are internal to the system, and may change at
     * will (read: may be cleaned up at will).
     */

    /* These may be needed for future network-power-down code.    */
    unsigned long         trans_start;     /* Time (jiffies) of
                                              last transmit       */
    unsigned long         last_rx;         /* Time of last Rx     */
    unsigned short        flags;           /* interface flags (BSD)*/
    unsigned short        family;          /* address family ID   */
    unsigned short        metric;          /* routing metric      */
    unsigned short        mtu;             /* MTU value           */
    unsigned short        type;            /* hardware type       */
    unsigned short        hard_header_len; /* hardware hdr len    */
    void                  *priv;           /* private data        */

    /* Interface address info. */
    unsigned char         broadcast[MAX_ADDR_LEN];
    unsigned char         pad;
    unsigned char         dev_addr[MAX_ADDR_LEN];
    unsigned char         addr_len;        /* hardware addr len   */
    unsigned long         pa_addr;         /* protocol address    */
    unsigned long         pa_brdaddr;      /* protocol broadcast addr*/
    unsigned long         pa_dstaddr;      /* protocol P-P other addr*/
    unsigned long         pa_mask;         /* protocol netmask    */
    unsigned short        pa_alen;         /* protocol address len */

    struct dev_mc_list    *mc_list;        /* M'cast mac addrs    */
    int                   mc_count;        /* No installed mcasts */

    struct ip_mc_list     *ip_mc_list;     /* IP m'cast filter chain */
    __u32                 tx_queue_len;    /* Max frames per queue */

    /* For load balancing driver pair support */
    unsigned long         pkt_queue;       /* Packets queued      */
    struct device         *slave;          /* Slave device        */
    struct net_alias_info *alias_info;     /* main dev alias info */
    struct net_alias      *my_alias;       /* alias devs          */

    /* Pointer to the interface buffers. */
    struct sk_buff_head   buffs[DEV_NUMBUFFS];

    /* Pointers to interface service routines. */
    int                   (*open)(struct device *dev);
    int                   (*stop)(struct device *dev);
    int                   (*hard_start_xmit) (struct sk_buff *skb,
                                              struct device *dev);
    int                   (*hard_header) (struct sk_buff *skb,
                                          struct device *dev,
                                          unsigned short type,
                                          void *daddr,
                                          void *saddr,
                                          unsigned len);
    int                   (*rebuild_header)(void *eth,
                                            struct device *dev,
                                            unsigned long raddr,
                                            struct sk_buff *skb);
    void                  (*set_multicast_list)(struct device *dev);
    int                   (*set_mac_address)(struct device *dev,
                                             void *addr);
    int                   (*do_ioctl)(struct device *dev,
                                      struct ifreq *ifr,
                                      int cmd);
    int                   (*set_config)(struct device *dev,
                                        struct ifmap *map);
    void                  (*header_cache_bind)(struct hh_cache **hhp,
                                               struct device *dev,
                                               unsigned short htype,
                                               __u32 daddr);
    void                  (*header_cache_update)(struct hh_cache *hh,
                                                 struct device *dev,
                                                 unsigned char *haddr);
    int                   (*change_mtu)(struct device *dev,
                                        int new_mtu);
    struct iw_statistics* (*get_wireless_stats)(struct device *dev);
};

device_struct

device_struct data structures are used to register character and block devices (they hold the device's name and the set of file operations that can be used for this device). Each valid member of the chrdevs and blkdevs vectors represents a character or block device respectively.

See fs/devices.c

struct device_struct {
    const char * name;
    struct file_operations * fops;
};
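As a sketch of how an entry in the chrdevs vector gets filled in (the major number and names here are invented for illustration), a character device driver registers its name and file operations like this:

#include <linux/fs.h>
#include <linux/errno.h>

extern struct file_operations example_fops;   /* defined by the driver */

int example_init(void)
{
    /* Fills in the chrdevs[] slot for major number 60 with the
     * driver's name and its file operations. */
    if (register_chrdev(60, "example", &example_fops) < 0)
        return -EBUSY;                         /* major number in use */
    return 0;
}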

file

Each open file, socket etcetera is represented by a file data structure.

See include/linux/fs.h

struct file {
    mode_t f_mode;
    loff_t f_pos;
    unsigned short f_flags;
    unsigned short f_count;
    unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin;
    struct file *f_next, *f_prev;
    int f_owner;         /* pid or -pgrp where SIGIO should be sent */
    struct inode * f_inode;
    struct file_operations * f_op;
    unsigned long f_version;
    void *private_data;  /* needed for tty driver, and maybe others */
};

files_struct

The files_struct data structure describes the files that a process has open.

See include/linux/sched.h

struct files_struct {
    int count;
    fd_set close_on_exec;
    fd_set open_fds;
    struct file * fd[NR_OPEN];
};

fs_struct

The fs_struct data structure holds a process's root and current directories together with its umask.

See include/linux/sched.h

struct fs_struct {
    int count;
    unsigned short umask;
    struct inode * root, * pwd;
};

gendisk

The gendisk data structure holds information about a hard disk. They are used during initialization when the disks are found and then probed for partitions.

See include/linux/genhd.h

struct hd_struct {
    long start_sect;
    long nr_sects;
};

struct gendisk {
    int major;               /* major number of driver */
    const char *major_name;  /* name of major driver */
    int minor_shift;         /* number of times minor is shifted to
                                get real minor */
    int max_p;               /* maximum partitions per device */
    int max_nr;              /* maximum number of real devices */

    void (*init)(struct gendisk *);
                             /* Initialization called before we
                                do our thing */
    struct hd_struct *part;  /* partition table */
    int *sizes;              /* device size in blocks, copied to
                                blk_size[] */
    int nr_real;             /* number of real devices */

    void *real_devices;      /* internal use */
    struct gendisk *next;
};

inode

The VFS inode data structure holds information about a file or directory on disk.

See include/linux/fs.h

struct inode {
    kdev_t                       i_dev;
    unsigned long                i_ino;
    umode_t                      i_mode;
    nlink_t                      i_nlink;
    uid_t                        i_uid;
    gid_t                        i_gid;
    kdev_t                       i_rdev;
    off_t                        i_size;
    time_t                       i_atime;
    time_t                       i_mtime;
    time_t                       i_ctime;
    unsigned long                i_blksize;
    unsigned long                i_blocks;
    unsigned long                i_version;
    unsigned long                i_nrpages;
    struct semaphore             i_sem;
    struct inode_operations      *i_op;
    struct super_block           *i_sb;
    struct wait_queue            *i_wait;
    struct file_lock             *i_flock;
    struct vm_area_struct        *i_mmap;
    struct page                  *i_pages;
    struct dquot                 *i_dquot[MAXQUOTAS];
    struct inode                 *i_next, *i_prev;
    struct inode                 *i_hash_next, *i_hash_prev;
    struct inode                 *i_bound_to, *i_bound_by;
    struct inode                 *i_mount;
    unsigned short               i_count;
    unsigned short               i_flags;
    unsigned char                i_lock;
    unsigned char                i_dirt;
    unsigned char                i_pipe;
    unsigned char                i_sock;
    unsigned char                i_seek;
    unsigned char                i_update;
    unsigned short               i_writecount;
    union {
        struct pipe_inode_info   pipe_i;
        struct minix_inode_info  minix_i;
        struct ext_inode_info    ext_i;
        struct ext2_inode_info   ext2_i;
        struct hpfs_inode_info   hpfs_i;
        struct msdos_inode_info  msdos_i;
        struct umsdos_inode_info umsdos_i;
        struct iso_inode_info    isofs_i;
        struct nfs_inode_info    nfs_i;
        struct xiafs_inode_info  xiafs_i;
        struct sysv_inode_info   sysv_i;
        struct affs_inode_info   affs_i;
        struct ufs_inode_info    ufs_i;
        struct socket            socket_i;
        void                     *generic_ip;
    } u;
};

ipc_perm

The ipc_perm data structure describes the access permissions of a System V IPC object.

See include/linux/ipc.h

struct ipc_perm
{
    key_t  key;
    ushort uid;   /* owner euid and egid   */
    ushort gid;
    ushort cuid;  /* creator euid and egid */
    ushort cgid;
    ushort mode;  /* access modes see mode flags below */
    ushort seq;   /* sequence number */
};

irqaction

The irqaction data structure is used to describe the system's interrupt handlers.

See include/linux/interrupt.h

struct irqaction {
    void (*handler)(int, void *, struct pt_regs *);
    unsigned long flags;
    unsigned long mask;
    const char *name;
    void *dev_id;
    struct irqaction *next;
};
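A driver builds one of these indirectly by calling request_irq(); the handler prototype matches the handler field above. A minimal sketch (the IRQ number and names are illustrative):

#include <linux/sched.h>
#include <linux/errno.h>

static void example_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    /* acknowledge and service the device here */
}

int example_claim_irq(void)
{
    /* "example" is the name that appears in /proc/interrupts and
     * dev_id identifies this driver instance (NULL here). */
    if (request_irq(9, example_interrupt, 0, "example", NULL))
        return -EBUSY;
    return 0;   /* release it later with free_irq(9, NULL) */
}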

linux_binfmt

Each binary file format that Linux understands is represented by a linux_binfmt data structure.

See include/linux/binfmts.h

struct linux_binfmt {
    struct linux_binfmt * next;
    long *use_count;
    int (*load_binary)(struct linux_binprm *, struct pt_regs * regs);
    int (*load_shlib)(int fd);
    int (*core_dump)(long signr, struct pt_regs * regs);
};

mem_map_t

The mem_map_t data structure (also known as page) is used to hold information about each page of physical memory.

See include/linux/mm.h

typedef struct page {
    /* these must be first (free area handling) */
    struct page        *next;
    struct page        *prev;
    struct inode       *inode;
    unsigned long      offset;
    struct page        *next_hash;
    atomic_t           count;
    unsigned           flags;     /* atomic flags, some possibly
                                     updated asynchronously */
    unsigned           dirty:16,
                       age:8;
    struct wait_queue  *wait;
    struct page        *prev_hash;
    struct buffer_head *buffers;
    unsigned long      swap_unlock_entry;
    unsigned long      map_nr;    /* page->map_nr == page - mem_map */
} mem_map_t;

mm_struct

The mm_struct data structure is used to describe the virtual memory of a task or process.

See include/linux/sched.h

struct mm_struct {
    int count;
    pgd_t * pgd;
    unsigned long context;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack, start_mmap;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long rss, total_vm, locked_vm;
    unsigned long def_flags;
    struct vm_area_struct * mmap;
    struct vm_area_struct * mmap_avl;
    struct semaphore mmap_sem;
};

pci_bus

Every PCI bus in the system is represented by a pci_bus data structure.

See include/linux/pci.h

struct pci_bus {
    struct pci_bus  *parent;     /* parent bus this bridge is on */
    struct pci_bus  *children;   /* chain of P2P bridges on this bus */
    struct pci_bus  *next;       /* chain of all PCI buses */

    struct pci_dev  *self;       /* bridge device as seen by parent */
    struct pci_dev  *devices;    /* devices behind this bridge */

    void            *sysdata;    /* hook for sys-specific extension */

    unsigned char   number;      /* bus number */
    unsigned char   primary;     /* number of primary bridge */
    unsigned char   secondary;   /* number of secondary bridge */
    unsigned char   subordinate; /* max number of subordinate buses */
};

pci_dev

Every PCI device in the system, including PCI-PCI and PCI-ISA bridge devices, is represented by a pci_dev data structure.

See include/linux/pci.h

/*
 * There is one pci_dev structure for each slot-number/function-number
 * combination:
 */
struct pci_dev {
    struct pci_bus *bus;       /* bus this device is on */
    struct pci_dev *sibling;   /* next device on this bus */
    struct pci_dev *next;      /* chain of all devices */

    void           *sysdata;   /* hook for sys-specific extension */

    unsigned int   devfn;      /* encoded device & function index */
    unsigned short vendor;
    unsigned short device;
    unsigned int   class;      /* 3 bytes: (base,sub,prog-if) */
    unsigned int   master : 1; /* set if device is master capable */
    /*
     * In theory, the irq level can be read from configuration
     * space and all would be fine. However, old PCI chips don't
     * support these registers and return 0 instead. For example,
     * the Vision864-P rev 0 chip can use INTA, but returns 0 in
     * the interrupt line and pin registers. pci_init()
     * initializes this field with the value at PCI_INTERRUPT_LINE
     * and it is the job of pcibios_fixup() to change it if
     * necessary. The field must not be 0 unless the device
     * cannot generate interrupts at all.
     */
    unsigned char  irq;        /* irq generated by this device */
};

request

request data structures are used to make requests to the block devices in the system. The requests are always to read or write blocks of data to or from the buffer cache.

See include/linux/blkdev.h

struct request {
    volatile int rq_status;
#define RQ_INACTIVE            (-1)
#define RQ_ACTIVE              1
#define RQ_SCSI_BUSY           0xffff
#define RQ_SCSI_DONE           0xfffe
#define RQ_SCSI_DISCONNECTING  0xffe0
    kdev_t rq_dev;
    int cmd;        /* READ or WRITE */
    int errors;
    unsigned long sector;
    unsigned long nr_sectors;
    unsigned long current_nr_sectors;
    char * buffer;
    struct semaphore * sem;
    struct buffer_head * bh;
    struct buffer_head * bhtail;
    struct request * next;
};

rtable

Each rtable data structure holds information about the route to take in order to send packets to an IP host. rtable data structures are used within the IP route cache.

See include/net/route.h

struct rtable
{
    struct rtable    *rt_next;
    __u32            rt_dst;
    __u32            rt_src;
    __u32            rt_gateway;
    atomic_t         rt_refcnt;
    atomic_t         rt_use;
    unsigned long    rt_window;
    atomic_t         rt_lastuse;
    struct hh_cache  *rt_hh;
    struct device    *rt_dev;
    unsigned short   rt_flags;
    unsigned short   rt_mtu;
    unsigned short   rt_irtt;
    unsigned char    rt_tos;
};

See include/asm/semaphore.h
semaphore
Semaphores are used to protect critical data structures and regions of code.

struct semaphore {
    int count;
    int waking;
    int lock;                  /* to make waking testing atomic */
    struct wait_queue *wait;
};
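A sketch of how a semaphore is typically used to protect a critical region of code (assuming the MUTEX initializer provided by asm/semaphore.h, which sets the initial count to one):

#include <asm/semaphore.h>

static struct semaphore example_sem = MUTEX;   /* initial count of 1 */

void example_critical(void)
{
    down(&example_sem);    /* sleeps until the count can be decremented */
    /* ... manipulate the protected data structures ... */
    up(&example_sem);      /* increment the count, waking any sleeper */
}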

sk_buff

The sk_buff data structure is used to describe network data as it moves between the layers of protocol.

See include/linux/skbuff.h

struct sk_buff
{
    struct sk_buff      *next;       /* Next buffer in list                      */
    struct sk_buff      *prev;       /* Previous buffer in list                  */
    struct sk_buff_head *list;       /* List we are on                           */
    int                 magic_debug_cookie;
    struct sk_buff      *link3;      /* Link for IP protocol level buffer chains */
    struct sock         *sk;         /* Socket we are owned by                   */
    unsigned long       when;        /* used to compute rtt's                    */
    struct timeval      stamp;       /* Time we arrived                          */
    struct device       *dev;        /* Device we arrived on/are leaving by      */
    union
    {
        struct tcphdr   *th;
        struct ethhdr   *eth;
        struct iphdr    *iph;
        struct udphdr   *uh;
        unsigned char   *raw;
        /* for passing file handles in a unix domain socket */
        void            *filp;
    } h;
    union
    {
        /* As yet incomplete physical layer views */
        unsigned char   *raw;
        struct ethhdr   *ethernet;
    } mac;
    struct iphdr        *ip_hdr;     /* For IPPROTO_RAW               */
    unsigned long       len;         /* Length of actual data         */
    unsigned long       csum;        /* Checksum                      */
    __u32               saddr;       /* IP source address             */
    __u32               daddr;       /* IP target address             */
    __u32               raddr;       /* IP next hop address           */
    __u32               seq;         /* TCP sequence number           */
    __u32               end_seq;     /* seq [+ fin] [+ syn] + datalen */
    __u32               ack_seq;     /* TCP ack sequence number       */
    unsigned char       proto_priv[16];
    volatile char       acked,       /* Are we acked ?                */
                        used,        /* Are we in use ?               */
                        free,        /* How to free this buffer       */
                        arp;         /* Has IP/ARP resolution finished */
    unsigned char       tries,       /* Times tried                   */
                        lock,        /* Are we locked ?               */
                        localroute,  /* Local routing asserted for this frame */
                        pkt_type,    /* Packet class                  */
                        pkt_bridged, /* Tracker for bridging          */
                        ip_summed;   /* Driver fed us an IP checksum  */
#define PACKET_HOST       0          /* To us                         */
#define PACKET_BROADCAST  1          /* To all                        */
#define PACKET_MULTICAST  2          /* To group                      */
#define PACKET_OTHERHOST  3          /* To someone else               */
    unsigned short      users;       /* User count - see datagram.c,tcp.c */
    unsigned short      protocol;    /* Packet protocol from driver.  */
    unsigned int        truesize;    /* Buffer size                   */
    atomic_t            count;       /* reference count               */
    struct sk_buff      *data_skb;   /* Link to the actual data skb   */
    unsigned char       *head;       /* Head of buffer                */
    unsigned char       *data;       /* Data head pointer             */
    unsigned char       *tail;       /* Tail pointer                  */
    unsigned char       *end;        /* End pointer                   */
    void                (*destructor)(struct sk_buff *); /* Destruct function */
    __u16               redirport;   /* Redirect port                 */
};

sock

Each sock data structure holds protocol specific information about a BSD socket. For example, for an INET (Internet Address Domain) socket this data structure would hold all of the TCP/IP and UDP/IP specific information.

See include/linux/net.h

struct sock
{
    /* This must be first. */
    struct sock             *sklist_next;
    struct sock             *sklist_prev;
    struct options          *opt;
    atomic_t                wmem_alloc;
    atomic_t                rmem_alloc;
    unsigned long           allocation;       /* Allocation mode */
    __u32                   write_seq;
    __u32                   sent_seq;
    __u32                   acked_seq;
    __u32                   copied_seq;
    __u32                   rcv_ack_seq;
    unsigned short          rcv_ack_cnt;      /* count of same ack */
    __u32                   window_seq;
    __u32                   fin_seq;
    __u32                   urg_seq;
    __u32                   urg_data;
    __u32                   syn_seq;
    int                     users;            /* user count */
    /*
     * Not all are volatile, but some are, so we
     * might as well say they all are.
     */
    volatile char           dead,
                            urginline,
                            intr,
                            blog,
                            done,
                            reuse,
                            keepopen,
                            linger,
                            delay_acks,
                            destroy,
                            ack_timed,
                            no_check,
                            zapped,
                            broadcast,
                            nonagle,
                            bsdism;
    unsigned long           lingertime;
    int                     proc;

    struct sock             *next;
    struct sock             **pprev;
    struct sock             *bind_next;
    struct sock             **bind_pprev;
    struct sock             *pair;
    int                     hashent;
    struct sock             *prev;
    struct sk_buff          *volatile send_head;
    struct sk_buff          *volatile send_next;
    struct sk_buff          *volatile send_tail;
    struct sk_buff_head     back_log;
    struct sk_buff          *partial;
    struct timer_list       partial_timer;
    long                    retransmits;
    struct sk_buff_head     write_queue,
                            receive_queue;
    struct proto            *prot;
    struct wait_queue       **sleep;
    __u32                   daddr;
    __u32                   saddr;            /* Sending source */
    __u32                   rcv_saddr;        /* Bound address */
    unsigned short          max_unacked;
    unsigned short          window;
    __u32                   lastwin_seq;      /* sequence number when we last
                                                 updated the window we offer */
    __u32                   high_seq;         /* sequence number when we did
                                                 current fast retransmit */
    volatile unsigned long  ato;              /* ack timeout */
    volatile unsigned long  lrcvtime;         /* jiffies at last data rcv */
    volatile unsigned long  idletime;         /* jiffies at last rcv */
    unsigned int            bytes_rcv;
    /*
     * mss is min(mtu, max_window)
     */
    unsigned short          mtu;              /* mss negotiated in the syn's */
    volatile unsigned short mss;              /* current eff. mss - can change */
    volatile unsigned short user_mss;         /* mss requested by user in ioctl */
    volatile unsigned short max_window;
    unsigned long           window_clamp;
    unsigned int            ssthresh;
    unsigned short          num;
    volatile unsigned short cong_window;
    volatile unsigned short cong_count;
    volatile unsigned short packets_out;
    volatile unsigned short shutdown;
    volatile unsigned long  rtt;
    volatile unsigned long  mdev;
    volatile unsigned long  rto;

    volatile unsigned short backoff;
    int                     err, err_soft;    /* Soft holds errors that don't
                                                 cause failure but are the cause
                                                 of a persistent failure not
                                                 just 'timed out' */
    unsigned char           protocol;
    volatile unsigned char  state;
    unsigned char           ack_backlog;
    unsigned char           max_ack_backlog;
    unsigned char           priority;
    unsigned char           debug;
    int                     rcvbuf;
    int                     sndbuf;
    unsigned short          type;
    unsigned char           localroute;       /* Route locally only */
    /*
     * This is where all the private (optional) areas that don't
     * overlap will eventually live.
     */
    union
    {
        struct unix_opt        af_unix;
#if defined(CONFIG_ATALK) || defined(CONFIG_ATALK_MODULE)
        struct atalk_sock      af_at;
#endif
#if defined(CONFIG_IPX) || defined(CONFIG_IPX_MODULE)
        struct ipx_opt         af_ipx;
#endif
#ifdef CONFIG_INET
        struct inet_packet_opt af_packet;
#ifdef CONFIG_NUTCP
        struct tcp_opt         af_tcp;
#endif
#endif
    } protinfo;
    /*
     * IP 'private area'
     */
    int                     ip_ttl;           /* TTL setting */
    int                     ip_tos;           /* TOS */
    struct tcphdr           dummy_th;
    struct timer_list       keepalive_timer;  /* TCP keepalive hack */
    struct timer_list       retransmit_timer; /* TCP retransmit timer */
    struct timer_list       delack_timer;     /* TCP delayed ack timer */
    int                     ip_xmit_timeout;  /* Why the timeout is running */
    struct rtable           *ip_route_cache;  /* Cached output route */
    unsigned char           ip_hdrincl;       /* Include headers ? */
#ifdef CONFIG_IP_MULTICAST
    int                     ip_mc_ttl;        /* Multicasting TTL */
    int                     ip_mc_loop;       /* Loopback */
    char                    ip_mc_name[MAX_ADDR_LEN]; /* Multicast device name */
    struct ip_mc_socklist   *ip_mc_list;      /* Group array */
#endif
    /*
     * This part is used for the timeout functions (timer.c).
     */
    int                     timeout;          /* What are we waiting for? */
    struct timer_list       timer;            /* This is the TIME_WAIT/receive
                                               * timer when we are doing IP
                                               */
    struct timeval          stamp;
    /*
     * Identd
     */
    struct socket           *socket;
    /*
     * Callbacks
     */
    void                    (*state_change)(struct sock *sk);
    void                    (*data_ready)(struct sock *sk,int bytes);
    void                    (*write_space)(struct sock *sk);
    void                    (*error_report)(struct sock *sk);
};

socket

Each socket data structure holds information about a BSD socket. It does not exist independently; it is, instead, part of the VFS inode data structure.

See include/linux/net.h

struct socket {
    short                type;         /* SOCK_STREAM, ...             */
    socket_state         state;
    long                 flags;
    struct proto_ops     *ops;         /* protocols do most everything */
    void                 *data;        /* protocol data                */
    struct socket        *conn;        /* server socket connected to   */
    struct socket        *iconn;       /* incomplete client conn.s     */
    struct socket        *next;
    struct wait_queue    **wait;       /* ptr to place to wait on      */
    struct inode         *inode;
    struct fasync_struct *fasync_list; /* Asynchronous wake up list    */
    struct file          *file;        /* File back pointer for gc     */
};

See include/linux/sched.h

task_struct

Each task_struct data structure describes a process or task in the system.

struct task_struct {
/* these are hardcoded - don't touch */
    volatile long        state;          /* -1 unrunnable, 0 runnable, >0 stopped */
    long                 counter;
    long                 priority;
    unsigned long        signal;
    unsigned long        blocked;        /* bitmap of masked signals */
    unsigned long        flags;          /* per process flags, defined below */
    int errno;
    long                 debugreg[8];    /* Hardware debugging registers */
    struct exec_domain   *exec_domain;
/* various fields */
    struct linux_binfmt  *binfmt;
    struct task_struct   *next_task, *prev_task;
    struct task_struct   *next_run,  *prev_run;
    unsigned long        saved_kernel_stack;
    unsigned long        kernel_stack_page;
    int                  exit_code, exit_signal;
    /* ??? */
    unsigned long        personality;
    int                  dumpable:1;
    int                  did_exec:1;
    int                  pid;
    int                  pgrp;
    int                  tty_old_pgrp;
    int                  session;
/* boolean value for session group leader */
    int                  leader;
    int                  groups[NGROUPS];
/*
 * pointers to (original) parent process, youngest child, younger sibling,
 * older sibling, respectively. (p->father can be replaced with
 * p->p_pptr->pid)
 */
    struct task_struct   *p_opptr, *p_pptr, *p_cptr,
                         *p_ysptr, *p_osptr;
    struct wait_queue    *wait_chldexit;
    unsigned short       uid,euid,suid,fsuid;
    unsigned short       gid,egid,sgid,fsgid;
    unsigned long        timeout, policy, rt_priority;
    unsigned long        it_real_value, it_prof_value, it_virt_value;
    unsigned long        it_real_incr, it_prof_incr, it_virt_incr;
    struct timer_list    real_timer;
    long                 utime, stime, cutime, cstime, start_time;
/* mm fault and swap info: this can arguably be seen as either
   mm-specific or thread-specific */
    unsigned long        min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
    int swappable:1;
    unsigned long        swap_address;
    unsigned long        old_maj_flt;    /* old value of maj_flt */
    unsigned long        dec_flt;        /* page fault count of the last time */
    unsigned long        swap_cnt;       /* number of pages to swap on next pass */
/* limits */
    struct rlimit        rlim[RLIM_NLIMITS];
    unsigned short       used_math;
    char                 comm[16];
/* file system info */
    int                  link_count;
    struct tty_struct    *tty;           /* NULL if no tty */
/* ipc stuff */
    struct sem_undo      *semundo;
    struct sem_queue     *semsleeping;
/* ldt for this task - used by Wine. If NULL, default_ldt is used */
    struct desc_struct   *ldt;
/* tss for this task */
    struct thread_struct tss;
/* filesystem information */
    struct fs_struct     *fs;
/* open file information */
    struct files_struct  *files;
/* memory management info */
    struct mm_struct     *mm;
/* signal handlers */
    struct signal_struct *sig;
#ifdef __SMP__
    int                  processor;
    int                  last_processor;
    int                  lock_depth;     /* Lock depth.
                                            We can context switch in and out
                                            of holding a syscall kernel lock... */
#endif
};

timer_list

timer_list data structures are used to implement real time timers for processes.

See include/linux/timer.h

struct timer_list {
    struct timer_list *next;
    struct timer_list *prev;
    unsigned long expires;
    unsigned long data;
    void (*function)(unsigned long);
};
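A sketch of how a driver might use one (the two second interval and the names are illustrative; expires holds an absolute time in jiffies):

#include <linux/timer.h>
#include <linux/sched.h>         /* jiffies and HZ */

static struct timer_list example_timer;

static void example_timeout(unsigned long data)
{
    /* runs roughly two seconds after add_timer() was called */
}

void example_start_timer(void)
{
    init_timer(&example_timer);                /* clear the list pointers */
    example_timer.expires  = jiffies + 2 * HZ; /* two seconds from now */
    example_timer.data     = 0;
    example_timer.function = example_timeout;
    add_timer(&example_timer);                 /* hand it to the kernel */
}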

tq_struct

Each task queue (tq_struct) data structure holds information about work that has been queued. This is usually a task needed by a device driver but which does not have to be done immediately.

See include/linux/tqueue.h

struct tq_struct {
    struct tq_struct *next;   /* linked list of active bh's  */
    int sync;                 /* must be initialized to zero */
    void (*routine)(void *);  /* function to call            */
    void *data;               /* argument to function        */
};
vm_area_struct

Each vm_area_struct data structure describes an area of virtual memory for a process.

See include/linux/mm.h

struct vm_area_struct {
    struct mm_struct * vm_mm;  /* VM area parameters */
    unsigned long vm_start;
    unsigned long vm_end;
    pgprot_t vm_page_prot;
    unsigned short vm_flags;
/* AVL tree of VM areas per task, sorted by address */
    short vm_avl_height;
    struct vm_area_struct * vm_avl_left;
    struct vm_area_struct * vm_avl_right;
/* linked list of VM areas per task, sorted by address */
    struct vm_area_struct * vm_next;
/* for areas with inode, the circular list inode->i_mmap */
/* for shm areas, the circular list of attaches */
/* otherwise unused */
    struct vm_area_struct * vm_next_share;
    struct vm_area_struct * vm_prev_share;
/* more */
    struct vm_operations_struct * vm_ops;
    unsigned long vm_offset;
    struct inode * vm_inode;
    unsigned long vm_pte;      /* shared mem */
};

Appendix B

Useful Web and FTP Sites

The following World Wide Web and ftp sites are useful:

http://www.azstarnet.com/~axplinux This is David Mosberger-Tang's Alpha AXP Linux web site and it is the place to go for all of the Alpha AXP HOWTOs. It also has a large number of pointers to Linux and Alpha AXP specific information such as CPU data sheets.

http://www.redhat.com/ Red Hat's web site. This has a lot of useful pointers.

ftp://sunsite.unc.edu This is the major site for a lot of free software. The Linux specific software is held in pub/Linux.

http://www.intel.com Intel's web site and a good place to look for Intel chip information.

http://www.ssc.com/lj/index.html The Linux Journal is a very good Linux magazine and well worth the yearly subscription for its excellent articles.

http://www.blackdown.org/java-linux.html This is the primary site for information on Java on Linux.

ftp://tsx-11.mit.edu/~ftp/pub/linux MIT's Linux ftp site.

ftp://ftp.cs.helsinki.fi/pub/Software/Linux/Kernel Linus's kernel sources.

http://www.linux.org.uk The UK Linux User Group.

http://sunsite.unc.edu/mdw/linux.html Home page for the Linux Documentation Project.

http://www.digital.com Digital Equipment Corporation's main web page.

http://altavista.digital.com DIGITAL's Altavista search engine. A very good place to search for information within the web and news groups.

http://www.linuxhq.com The Linux HQ web site holds up to date official and unofficial patches as well as advice and web pointers that help you get the best set of kernel sources possible for your system.

http://www.amd.com The AMD web site.

http://www.cyrix.com Cyrix's web site.

http://www.arm.com ARM's web site.

Appendix C

Linux Documentation Project Manifesto

This is the Linux Documentation Project "Manifesto".

Last Revision 21 September 1998, by Michael K. Johnson

This file describes the goals and current status of the Linux Documentation Project, including names of projects, volunteers, FTP sites, and so on.

C.1 Overview

The Linux Documentation Project is working on developing good, reliable docs for the Linux operating system. The overall goal of the LDP is to collaborate in taking care of all of the issues of Linux documentation, ranging from online docs (man pages, texinfo docs, and so on) to printed manuals covering topics such as installing, using, and running Linux. The LDP is essentially a loose team of volunteers with little central organization; anyone who is interested in helping is welcome to join in the effort. We feel that working together and agreeing on the direction and scope of Linux documentation is the best way to go, to reduce problems with conflicting efforts: two people writing two books on the same aspect of Linux wastes someone's time along the way.

The LDP is set out to produce the canonical set of Linux online and printed documentation. Because our docs will be freely available (like software licensed under the terms of the GNU GPL) and distributed on the net, we are able to easily update the documentation to stay on top of the many changes in the Linux world. If you are interested in publishing any of the LDP works, see the section "Publishing LDP Manuals", below.

C.2 Getting Involved

Send mail to linux-howto@metalab.unc.edu.

Of course, you'll also need to get in touch with the coordinator of whatever LDP projects you're interested in working on; see the next section.

C.3 Current Projects

For a list of current projects, see the LDP Homepage at http://sunsite.unc.edu/LDP/ldp.html. The best way to get involved with one of these projects is to pick up the current version of the manual and send revisions, editions, or suggestions to the coordinator. You probably want to coordinate with the author before sending revisions so that you know you are working together.

C.4 FTP sites for LDP works

LDP works can be found on sunsite.unc.edu in the directory /pub/Linux/docs. LDP manuals are found in /pub/Linux/docs/LDP, HOWTOs and other documentation found in /pub/Linux/docs/HOWTO.

C.5 Documentation Conventions

Here are the conventions that are currently used by LDP manuals. If you are interested in writing another manual using different conventions, please let us know of your plans first.

The man pages (the Unix standard for online manuals) are created with the Unix standard nroff man (or BSD mdoc) macros.

The guides (full books produced by the LDP) have historically been done in LaTeX, as their primary goal has been printed documentation. However, guide authors have been moving towards SGML with the DocBook DTD, because it allows them to create more different kinds of output, both printed and on-line. If you use LaTeX, we have a style file you can use to keep your printed look consistent with other LDP documents, and we suggest that you use it.

The HOWTO documents are all required to be in SGML format. Currently, they use the linuxdoc DTD, which is quite simple. There is a move afoot to switch to the DocBook DTD over time.

LDP documents must be freely redistributable without fees paid to the authors. It is not required that the text be modifiable, but it is encouraged. You can come up with your own license terms that satisfy this constraint, or you can use a previously prepared license. The LDP provides a boilerplate license that you can use, some people like to use the GPL, and others write their own.

The copyright for each manual should be in the name of the head writer or coordinator for the project. "The Linux Documentation Project" isn't a formal entity and shouldn't be used to copyright the docs.

C.6 Copyright and License

Here is a "boilerplate" license you may apply to your work. It has not been reviewed by a lawyer; feel free to have your own lawyer review it (or your modification of it) for its applicability to your own desires. Remember that in order for your document to be part of the LDP, you must allow unlimited reproduction and distribution without fee.

This manual may be reproduced and distributed in whole or in part, without fee, subject to the following conditions:

 - The copyright notice above and this permission notice must be preserved complete on all complete or partial copies.

 - Any translation or derived work must be approved by the author in writing before distribution.

 - If you distribute this work in part, instructions for obtaining the complete version of this manual must be included, and a means for obtaining a complete version provided.

 - Small portions may be reproduced as illustrations for reviews or quotes in other works without this permission notice if proper citation is given.

Exceptions to these rules may be granted for academic purposes: Write to the author and ask. These restrictions are here to protect us as authors, not to restrict you as learners and educators.

All source code in this document is placed under the GNU General Public License, available via anonymous FTP from prep.ai.mit.edu:/pub/gnu/COPYING.

C.7 Publishing LDP Manuals

If you're a publishing company interested in distributing any of the LDP manuals, read on.

By the license requirements given previously, anyone is allowed to publish and distribute verbatim copies of the Linux Documentation Project manuals. You don't need our explicit permission for this. However, if you would like to distribute a translation or derivative work based on any of the LDP manuals, you may need to obtain permission from the author, in writing, before doing so, if the license requires that.

You may, of course, sell the LDP manuals for profit. We encourage you to do so. Keep in mind, however, that because the LDP manuals are freely distributable, anyone may photocopy or distribute printed copies free of charge, if they wish to do so.

We do not require to be paid royalties for any profit earned from selling LDP manuals. However, we would like to suggest that if you do sell LDP manuals for profit, that you either offer the author royalties, or donate a portion of your earnings to the author, the LDP as a whole, or to the Linux development community. You may also wish to send one or more free copies of the LDP manuals that you are distributing to the authors. Your show of support for the LDP and the Linux community will be very much appreciated.

We would like to be informed of any plans to publish or distribute LDP manuals, just so we know how they're becoming available. If you are publishing or planning to publish any LDP manuals, please send mail to ldp-l@linux.org.au. It's nice to know who's doing what.

We encourage Linux software distributors to distribute the LDP manuals (such as the Installation and Getting Started Guide) with their software. The LDP manuals are intended to be used as the "official" Linux documentation, and we are glad to see mail-order distributors bundling the LDP manuals with the software. As the LDP manuals mature, hopefully they will fulfill this goal more and more adequately.

Appendix D

The GNU General Public License

Printed below is the GNU General Public License (the GPL or copyleft), under
which Linux is licensed. It is reproduced here to clear up some of the confusion
about Linux's copyright status: Linux is not shareware, and it is not in the public
domain. The bulk of the Linux kernel is copyright 1993 by Linus Torvalds, and
other software and parts of the kernel are copyrighted by their authors. Thus, Linux
is copyrighted; however, you may redistribute it under the terms of the GPL printed
below.

GNU GENERAL PUBLIC LICENSE

Version 2, June 1991


Copyright (C) 1989, 1991 Free Software Foundation, Inc. 675 Mass Ave, Cambridge,
MA 02139, USA. Everyone is permitted to copy and distribute verbatim copies of
this license document, but changing it is not allowed.

D.1 Preamble
The licenses for most software are designed to take away your freedom to share and
change it. By contrast, the GNU General Public License is intended to guarantee
your freedom to share and change free software--to make sure the software is free
for all its users. This General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to using
it. (Some other Free Software Foundation software is covered by the GNU Library
General Public License instead.) You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our General
Public Licenses are designed to make sure that you have the freedom to distribute
copies of free software (and charge for this service if you wish), that you receive
source code or can get it if you want it, that you can change the software or use
pieces of it in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid anyone to deny
you these rights or to ask you to surrender the rights. These restrictions translate
to certain responsibilities for you if you distribute copies of the software, or if you
modify it.
For example, if you distribute copies of such a program, whether gratis or for a fee,
you must give the recipients all the rights that you have. You must make sure that
they, too, receive or can get the source code. And you must show them these terms
so they know their rights.
We protect your rights with two steps: (1) copyright the software, and (2) offer you
this license which gives you legal permission to copy, distribute and/or modify the
software.
Also, for each author's protection and ours, we want to make certain that everyone
understands that there is no warranty for this free software. If the software is modified
by someone else and passed on, we want its recipients to know that what they have
is not the original, so that any problems introduced by others will not reflect on the
original authors' reputations.
Finally, any free program is threatened constantly by software patents. We wish to
avoid the danger that redistributors of a free program will individually obtain patent
licenses, in effect making the program proprietary. To prevent this, we have made it
clear that any patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and modification follow.

D.2 Terms and Conditions for Copying, Distribution, and Modification
0. This License applies to any program or other work which contains a notice
   placed by the copyright holder saying it may be distributed under the terms of
   this General Public License. The "Program", below, refers to any such program
   or work, and a "work based on the Program" means either the Program or
   any derivative work under copyright law: that is to say, a work containing
   the Program or a portion of it, either verbatim or with modifications and/or
   translated into another language. (Hereinafter, translation is included without
   limitation in the term "modification".) Each licensee is addressed as "you".
   Activities other than copying, distribution and modification are not covered
   by this License; they are outside its scope. The act of running the Program is
   not restricted, and the output from the Program is covered only if its contents
   constitute a work based on the Program (independent of having been made by
   running the Program). Whether that is true depends on what the Program
   does.
1. You may copy and distribute verbatim copies of the Program's source code
   as you receive it, in any medium, provided that you conspicuously and
   appropriately publish on each copy an appropriate copyright notice and disclaimer
   of warranty; keep intact all the notices that refer to this License and to the
   absence of any warranty; and give any other recipients of the Program a copy
   of this License along with the Program.
   You may charge a fee for the physical act of transferring a copy, and you may
   at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion of it,
   thus forming a work based on the Program, and copy and distribute such
   modifications or work under the terms of Section 1 above, provided that you
   also meet all of these conditions:
   a. You must cause the modified files to carry prominent notices stating that
      you changed the files and the date of any change.
   b. You must cause any work that you distribute or publish, that in whole or
      in part contains or is derived from the Program or any part thereof, to
      be licensed as a whole at no charge to all third parties under the terms of
      this License.
   c. If the modified program normally reads commands interactively when run,
      you must cause it, when started running for such interactive use in the
      most ordinary way, to print or display an announcement including an
      appropriate copyright notice and a notice that there is no warranty (or
      else, saying that you provide a warranty) and that users may redistribute
      the program under these conditions, and telling the user how to view a
      copy of this License. (Exception: if the Program itself is interactive but
      does not normally print such an announcement, your work based on the
      Program is not required to print an announcement.)
   These requirements apply to the modified work as a whole. If identifiable
   sections of that work are not derived from the Program, and can be reasonably
   considered independent and separate works in themselves, then this License,
   and its terms, do not apply to those sections when you distribute them as
   separate works. But when you distribute the same sections as part of a whole
   which is a work based on the Program, the distribution of the whole must be
   on the terms of this License, whose permissions for other licensees extend to
   the entire whole, and thus to each and every part regardless of who wrote it.
   Thus, it is not the intent of this section to claim rights or contest your rights
   to work written entirely by you; rather, the intent is to exercise the right to
   control the distribution of derivative or collective works based on the Program.
   In addition, mere aggregation of another work not based on the Program with
   the Program (or with a work based on the Program) on a volume of a storage
   or distribution medium does not bring the other work under the scope of this
   License.
3. You may copy and distribute the Program (or a work based on it, under Section
   2) in object code or executable form under the terms of Sections 1 and 2 above
   provided that you also do one of the following:
   a. Accompany it with the complete corresponding machine-readable source
      code, which must be distributed under the terms of Sections 1 and 2 above
      on a medium customarily used for software interchange; or,
   b. Accompany it with a written offer, valid for at least three years, to give any
      third party, for a charge no more than your cost of physically performing
      source distribution, a complete machine-readable copy of the corresponding
      source code, to be distributed under the terms of Sections 1 and 2
      above on a medium customarily used for software interchange; or,
   c. Accompany it with the information you received as to the offer to distribute
      corresponding source code. (This alternative is allowed only for
      noncommercial distribution and only if you received the program in object
      code or executable form with such an offer, in accord with Subsection b
      above.)
   The source code for a work means the preferred form of the work for making
   modifications to it. For an executable work, complete source code means all
   the source code for all modules it contains, plus any associated interface
   definition files, plus the scripts used to control compilation and installation of the
   executable. However, as a special exception, the source code distributed need
   not include anything that is normally distributed (in either source or binary
   form) with the major components (compiler, kernel, and so on) of the operating
   system on which the executable runs, unless that component itself accompanies
   the executable.
   If distribution of executable or object code is made by offering access to copy
   from a designated place, then offering equivalent access to copy the source code
   from the same place counts as distribution of the source code, even though third
   parties are not compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program except as
   expressly provided under this License. Any attempt otherwise to copy, modify,
   sublicense or distribute the Program is void, and will automatically terminate
   your rights under this License. However, parties who have received copies, or
   rights, from you under this License will not have their licenses terminated so
   long as such parties remain in full compliance.
5. You are not required to accept this License, since you have not signed it. However,
   nothing else grants you permission to modify or distribute the Program
   or its derivative works. These actions are prohibited by law if you do not accept
   this License. Therefore, by modifying or distributing the Program (or any
   work based on the Program), you indicate your acceptance of this License to
   do so, and all its terms and conditions for copying, distributing or modifying
   the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the Program),
   the recipient automatically receives a license from the original licensor to copy,
   distribute or modify the Program subject to these terms and conditions. You
   may not impose any further restrictions on the recipients' exercise of the rights
   granted herein. You are not responsible for enforcing compliance by third
   parties to this License.
7. If, as a consequence of a court judgment or allegation of patent infringement
   or for any other reason (not limited to patent issues), conditions are imposed
   on you (whether by court order, agreement or otherwise) that contradict the
   conditions of this License, they do not excuse you from the conditions of this
   License. If you cannot distribute so as to satisfy simultaneously your obligations
   under this License and any other pertinent obligations, then as a consequence
   you may not distribute the Program at all. For example, if a patent license
   would not permit royalty-free redistribution of the Program by all those who
   receive copies directly or indirectly through you, then the only way you could
   satisfy both it and this License would be to refrain entirely from distribution
   of the Program.
   If any portion of this section is held invalid or unenforceable under any particular
   circumstance, the balance of the section is intended to apply and the
   section as a whole is intended to apply in other circumstances.
   It is not the purpose of this section to induce you to infringe any patents or
   other property right claims or to contest validity of any such claims; this section
   has the sole purpose of protecting the integrity of the free software distribution
   system, which is implemented by public license practices. Many people have
   made generous contributions to the wide range of software distributed through
   that system in reliance on consistent application of that system; it is up to the
   author/donor to decide if he or she is willing to distribute software through
   any other system and a licensee cannot impose that choice.
   This section is intended to make thoroughly clear what is believed to be a
   consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in certain countries
   either by patents or by copyrighted interfaces, the original copyright holder
   who places the Program under this License may add an explicit geographical
   distribution limitation excluding those countries, so that distribution is permitted
   only in or among countries not thus excluded. In such case, this License
   incorporates the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions of the
   General Public License from time to time. Such new versions will be similar in
   spirit to the present version, but may differ in detail to address new problems
   or concerns.
   Each version is given a distinguishing version number. If the Program specifies
   a version number of this License which applies to it and "any later version", you
   have the option of following the terms and conditions either of that version or
   of any later version published by the Free Software Foundation. If the Program
   does not specify a version number of this License, you may choose any version
   ever published by the Free Software Foundation.
10. If you wish to incorporate parts of the Program into other free programs whose
    distribution conditions are different, write to the author to ask for permission.
    For software which is copyrighted by the Free Software Foundation, write to the
    Free Software Foundation; we sometimes make exceptions for this. Our decision
    will be guided by the two goals of preserving the free status of all derivatives of
    our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE
    IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED
    BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED
    IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
    PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY
    KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
    FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO
    THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH
    YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
    THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED
    TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER
    PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM
    AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
    INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL
    DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE
    THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA
    OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED
    BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO
    OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER
    OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
    SUCH DAMAGES.

END OF TERMS AND CONDITIONS

D.3 Appendix: How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest possible use to
the public, the best way to achieve this is to make it free software which everyone
can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest to attach them
to the start of each source file to most effectively convey the exclusion of warranty;
and each file should have at least the "copyright" line and a pointer to where the full
notice is found.

    one line to give the program's name and a brief idea of what it does.
    Copyright 19yy name of author

    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation; either version 2 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
    the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    675 Mass Ave, Cambridge, MA 02139, USA.

Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this when it starts
in an interactive mode:

    Gnomovision version 69, Copyright (C) 19yy name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it under
    certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may be called
something other than `show w' and `show c'; they could even be mouse-clicks or
menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your school, if
any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample;
alter the names:

    Yoyodyne, Inc., hereby disclaims all copyright interest in the program
    `Gnomovision' (which makes passes at compilers) written by James Hacker.

    signature of Ty Coon, 1 April 1989
    Ty Coon, President of Vice

This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may consider it more
useful to permit linking proprietary applications with the library. If this is what you
want to do, use the GNU Library General Public License instead of this License.

Glossary

Argument  Functions and routines are passed arguments to process.

ARP  Address Resolution Protocol. Used to translate IP addresses into physical hardware addresses.

Ascii  American Standard Code for Information Interchange. Each letter of the alphabet is represented by an 8 bit code. Ascii is most often used to store written characters.

Bit  A single bit of data that represents either 1 or 0 (on or off).

Bottom Half Handler  Handlers for work queued within the kernel.

Byte  8 bits of data.

C  A high level programming language. Most of the Linux kernel is written in C.

CISC  Complex Instruction Set Computer. The opposite of RISC; a processor which supports a large number of often complex assembly instructions. The X86 architecture is a CISC architecture.

CPU  Central Processing Unit. The main engine of the computer; see also microprocessor and processor.

Data Structure  A set of data in memory comprised of fields.

Device Driver  The software controlling a particular device; for example, the NCR 810 device driver controls the NCR 810 SCSI device.

DMA  Direct Memory Access.

ELF  Executable and Linkable Format. This object file format, designed by the Unix System Laboratories, is now firmly established as the most commonly used format in Linux.

EIDE  Extended IDE.

Executable image  A structured file containing machine instructions and data. This file can be loaded into a process's virtual memory and executed. See also program.

Function  A piece of software that performs an action, for example returning the bigger of two numbers.

IDE  Integrated Disk Electronics.

Image  See executable image.

IP  Internet Protocol.

IPC  Interprocess Communication.

Interface  A standard way of calling routines and passing data structures. For example, the interface between two layers of code might be expressed in terms of routines that pass and return a particular data structure. Linux's VFS is a good example of an interface.

IRQ  Interrupt Request Queue.

ISA  Industry Standard Architecture. This is a standard, although now rather dated, data bus interface for system components such as floppy disk drives.

Kernel Module  A dynamically loaded kernel function such as a filesystem or a device driver.

Kilobyte  A thousand bytes of data, often written as Kbyte.

Megabyte  A million bytes of data, often written as Mbyte.

Microprocessor  A highly integrated CPU. Most modern CPUs are microprocessors.

Module  A file containing CPU instructions in the form of either assembly language instructions or a high level language like C.

Object file  A file containing machine code and data that has not yet been linked with other object files or libraries to become an executable image.

Page  Physical memory is divided up into equal sized pages.

Pointer  A location in memory that contains the address of another location in memory.

Process  An entity which can execute programs. A process could be thought of as a program in action.

Processor  Short for microprocessor; equivalent to CPU.

PCI  Peripheral Component Interconnect. A standard describing how the peripheral components of a computer system may be connected together.

Peripheral  An intelligent processor that does work on behalf of the system's CPU, for example an IDE controller chip.

Program  A coherent set of CPU instructions that performs a task, such as printing "hello world". See also executable image.

Protocol  A networking language used to transfer application data between two cooperating processes or network layers.

Register  A location within a chip, used to store information or instructions.

Register File  The set of registers in a processor.

RISC  Reduced Instruction Set Computer. The opposite of CISC; a processor with a small number of assembly instructions, each of which performs a simple operation. The ARM and Alpha processors are both RISC architectures.

Routine  Similar to a function except that, strictly speaking, routines do not return values.

SCSI  Small Computer Systems Interface.

Shell  A program which acts as an interface between the operating system and a human user. Also called a command shell; the most commonly used shell in Linux is the bash shell.

SMP  Symmetrical multiprocessing. Systems with more than one processor which fairly share the work amongst those processors.

Socket  A socket represents one end of a network connection; Linux supports the BSD Socket interface.

Software  CPU instructions (both assembler and high level languages like C) and data. Mostly interchangeable with Program.

System V  A variant of Unix produced in 1983, which included, amongst other things, System V IPC mechanisms.

TCP  Transmission Control Protocol.

Task Queue  A mechanism for deferring work in the Linux kernel.

UDP  User Datagram Protocol.

Virtual memory  A hardware and software mechanism for making the physical memory in a system appear larger than it actually is.

Bibliography
[1] Richard L. Sites. Alpha Architecture Reference Manual. Digital Press.
[2] Matt Welsh and Lar Kaufman. Running Linux. O'Reilly & Associates, Inc. ISBN 1-56592-100-3.
[3] PCI Special Interest Group. PCI Local Bus Specification.
[4] PCI Special Interest Group. PCI BIOS ROM Specification.
[5] PCI Special Interest Group. PCI to PCI Bridge Architecture Specification.
[6] Intel. Peripheral Components. Intel 296467. ISBN 1-55512-207-8.
[7] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language. Prentice Hall. ISBN 0-13-110362-8.
[8] Steven Levy. Hackers. Penguin. ISBN 0-14-023269-9.
[9] Intel. Intel486 Processor Family: Programmer's Reference Manual. Intel.
[10] D. E. Comer. Internetworking with TCP/IP, Volume 1: Principles, Protocols and Architecture. Prentice Hall International Inc.
[11] David Jagger. ARM Architectural Reference Manual. Prentice Hall. ISBN 0-13-736299-4.


Index
/proc file system, 116
Aging, Pages, 31
all_requests, list of request data structures, 87
Alpha AXP Processor, 152
Alpha AXP PTE, 20
Alpha AXP, architecture, 152
Altair 8080, 1
ARM Processor, 151
arp_table data structure, 134, 135
arp_tables vector, 134
Assembly languages, 7
awk command, 155
backlog queue, 132
bash command, 16, 25, 193
bdflush, kernel daemon, 115, 116
bg command, 47
bh_active bitmask, 140
bh_base vector, 140
bh_mask bitmask, 140
Binary, 3
Bind, 127
blk_dev vector, 87, 90, 94, 159
blk_dev_struct data structure, 87
blkdevs vector, 87, 90, 94, 162
block_dev_struct data structure, 159
blocked data structure, 52, 53
Bottom Half Handling, 139
BSD Sockets, 124
BSD Sockets, creating, 127
Buffer caches, 114
buffer_head data structure, 87, 88, 114, 159
builtin_scsi_hosts vector, 93
Buzz locks, 143
C programming language, 8
Cache, page, 21, 27
Cache, swap, 22, 31
Caches, buffer, 114
Caches, directory, 113
Caches, VFS inode, 113
cat command, 84, 108, 116
cd command, 47
character devices, 86
chrdevs vector, 86, 87, 162
Command shells, 47
CPU, 2
Creating a file, 112
current data structure, 36, 46
current process, 85
Daemon, Kernel Swap, 28
Data fragmentation, 133
Data structures, EXT2 Directory, 104
Data structures, EXT2 inode, 102
Data structures, free_area, 24
Demand loading, 48
Demand Paging, 18, 26
dev_base data structure, 132
dev_base list pointer, 96, 97
device data structure, 95-97, 131-133, 135, 136, 160
Device Drivers, 11
Device Drivers, polling, 83
Device Special Files, 117
device_struct data structure, 86, 87, 162
DIGITAL, iii
Direct Memory Access (DMA), 84
Directory cache, 113
Disks, IDE, 90
Disks, SCSI, 91
DMA, Direct Memory Access, 84
dma_chan data structure, 84
ELF, 48
ELF shared libraries, 49
Executing Programs, 47
EXT, 100
EXT2, 100, 101
EXT2 Block Groups, 102
EXT2 Directories, 104
EXT2 Group Descriptor, 104
EXT2 Inode, 102
EXT2 Superblock, 103
Extended File system, 100
fd data structure, 43
fd vector, 127
fdisk command, 89, 99
fib_info data structure, 137
fib_node data structure, 137
fib_zone data structure, 136, 137
fib_zones vector, 136
file data structure, 43, 53, 54, 86, 127, 162
File system, 99
File System, mounting, 110
File System, registering, 110
File System, unmounting, 112
file_system_type data structure, 109-111
file_systems data structure, 110, 111
Files, 42, 53
Files, creating, 112
Files, finding, 112
files_struct data structure, 42, 43, 163
Filesystems, 11
Finding a File, 112
first_inode data structure, 113
Free Software Foundation, iv
free_area data structure, 24
free_area vector, 23-25, 29, 31
fs_struct data structure, 42
GATED daemon, 135
gendisk data structure, 90, 91, 94, 163
GNU, iv
groups vector, 38
Hard disks, 88
Hexadecimal, 3
hh_cache data structure, 134
IDE disks, 90
ide_drive_t data structure, 91
ide_hwif_t data structure, 91
ide_hwifs vector, 91
Identifiers, 38
ifconfig command, 96, 128
INET socket layer, 125
init_task data structure, 45
Initializing Network devices, 96
Initializing the IDE subsystem, 91
Inode cache, 113
inode data structure, 28, 127, 164, 173
inode, VFS, 86, 109
insmod command, 110, 145-148
Interprocess Communication Mechanism (IPC), 51
IP routing, 135
ip_rt_hash_table table, 135
IPC, Interprocess Communication Mechanism, 51
ipc_perm data structure, 55, 155, 165
ipfrag data structure, 133
ipq data structure, 133
ipqueue list, 133
irq_action vector, 79
irqaction data structure, 78, 79, 165
jiffies, 37, 40, 46, 141, 142
kdev_t data type, 117
Kernel Address, Alpha AXP, 20
Kernel daemon, the, 146
Kernel daemons, bdflush, 115, 116
Kernel Swap Daemon, 28
Kernel, monolithic, 145
kerneld, 145
kerneld, the kernel daemon, 146
kill command, 37, 51
ksyms command, 147
last_processor data structure, 42
ld command, 48
Linkers, 9
linux_binfmt data structure, 165
Locks, buzz, 143
lpr command, 53
ls command, 9, 10, 43, 53, 108, 117
lsmod command, 146, 148
Mapping, memory, 25
mem_map data structure, 23
mem_map page vector, 29
mem_map_t data structure, 23, 27, 28, 31, 166
Memory Management, 10
Memory Map, reducing its size, 29
Memory mapping, 25
Memory, shared, 58
Microprocessor, 2
Minix, iii, 100
mke2fs command, 101
mkfifo command, 54
mknod command, 81, 95, 122
mm_struct data structure, 25, 31, 44, 46, 49, 166
module data structure, 147-149
module_list data structure, 147
Modules, 145
Modules, demand loading, 146
Modules, loading, 146
Modules, unloading, 148
mount command, 103, 110, 111
Mounting a File System, 110
mru_vfsmnt pointer, 111
msg data structure, 55
msgque vector, 55
msqid_ds data structure, 55
Multiprocessing, 10
Network devices, 95
Network devices, initializing, 96
Operating System, 9
packet_type data structure, 132, 134
Page Access Control, 20
Page Aging, 31
Page Allocation, 24
Page Cache, 21, 27
page data structure, 23, 166
Page Deallocation, 24
Page Frame Number, 16, 21
Page Tables, 22
page_hash_table, 27
PAGE_ACCESSED, bit in Alpha AXP PTE, 21
PAGE_DIRTY, bit in Alpha AXP PTE, 21
PALcode, 152
Paradis, Jim, iv
patch command, 154
PC, 1
PCI, 61
PCI, Address Spaces, 61
PCI, Base Address Registers, 72
PCI, BIOS Functions, 70
PCI, Configuration Headers, 62
PCI, Fixup Code, 72
PCI, I/O, 64
PCI, Interrupt Routing, 64
PCI, Linux Initialisation of, 66
PCI, Linux PCI Pseudo Device Driver, 68
PCI, Memory, 64
PCI, Overview, 61
PCI-PCI Bridges, 65
PCI-PCI Bridges, Bus Numbering, 65
PCI-PCI Bridges, Bus Numbering Example, 68
PCI-PCI Bridges: The Bus Numbering Rule, 66
PCI: Type 0 and Type 1 Configuration Cycles, 65
pci_bus data structure, 67, 68, 73, 166
pci_dev data structure, 67, 68, 73, 167
pci_devices data structure, 68
PDP-11/45, iii
PDP-11/70, iii
PDP-7, iii
perl command, 50
PFN (Page Frame Number), 16
policy data structure, 40
Polling, Device Drivers, 83
pops vector, 125
pr command, 53
priority data structure, 40
Process Creation, 45
Process's Virtual Memory, 43
Processes, 10, 35, 36
Processor, 2
processor data structure, 42
Processor, ARM, 151
Processor, X86, 151, 152
processor_mask data structure, 42
Programmable Interrupt Controllers, 77
proto_ops data structure, 125, 127
protocols vector, 125
ps command, 10, 37, 116
pstree command, 37
PTE, Alpha AXP, 20
ptype_all list, 132
ptype_base hash table, 132
pwd command, 38, 47
Registering a file system, 110
renice command, 40
request data structure, 87, 88, 94, 168
Rights identifiers, 38
Ritchie, Dennis, iii
rmmod command, 145, 146, 148
Routing, IP, 135
rscsi_disks vector, 94
rtable data structure, 132, 135, 136, 168
scheduler, 39
Scheduling, 39
Scheduling in multiprocessor systems, 41
Script Files, 50
SCSI disks, 91
SCSI, initializing, 92
Scsi_Cmd data structure, 94
Scsi_Cmnd data structure, 93
Scsi_Device data structure, 93, 94
Scsi_Device_Template data structure, 94
scsi_devicelist list, 94
scsi_devices list, 94
Scsi_Disk data structure, 94
Scsi_Host data structure, 93, 94
Scsi_Host_Template data structure, 93
scsi_hostlist list, 93
scsi_hosts list, 93
Scsi_Type_Template data structure, 94
Second Extended File system, 100
sem data structure, 57
sem_queue data structure, 57
sem_undo data structure, 58
semaphore data structure, 144
Semaphores, 56, 143
Semaphores, System V, 56
semary data structure, 57
semid_ds data structure, 57, 58
Shared libraries, ELF, 49
Shared memory, 58
Sharing virtual memory, 19
Shells, 47
shm_segs data structure, 58
shmid_ds data structure, 30, 58, 59
sigaction data structure, 52, 53
signal data structure, 52
Signals, 51
sk_buff data structure, 95, 96, 129-134, 156, 169
sk_buffs data structure, 130, 131
SMP scheduling, 41
sock data structure, 127-130, 170
sockaddr data structure, 124, 127
socket data structure, 127, 128, 130, 173
Socket layer, INET, 125
Sockets, BSD, 124
Spin locks, see Buzz locks, 143
Star Trek, 1
super_block data structure, 111
super_blocks data structure, 111
Superblock, VFS, 109
Swap Cache, 22, 31
swap_control data structure, 31
Swapping, 19, 28
Swapping out and discarding pages, 30
Swapping, Pages In, 26
Swapping, Pages Out, 30
system clock, 141
System V Semaphores, 56
task data structure, 36, 37, 42, 46
Task Queues, 140
task_struct data structure, 36-42, 44-46, 52, 58, 142, 155, 174
tcp_bound_hash table, 129
tcp_listening_hash table, 129
tcsh command, 50
Thompson, Ken, iii
time-slice, 39
timer_active bit mask, 142
timer_list data structure, 47, 134, 142, 175
timer_struct data structure, 141, 142
Timers, 141
tk command, 155
TLB, translation lookaside buffer, 22
Torvalds, Linus, iii
tq_immediate task queue, 140
tq_struct data structure, 141, 175
Translation lookaside buffer, 22
udp_hash table, 128
umask data structure, 42
Unmounting a File System, 112
update command, 116
update process, 116
VFS, 100, 107
VFS inode, 86, 109
VFS superblock, 109
vfsmntlist data structure, 111, 112
vfsmnttail data structure, 111
vfsmount data structure, 111, 112
Virtual File system, 100
Virtual File System (VFS), 107
Virtual Memory, Processes, 43
Virtual memory, shared, 19
Virtual Memory, Theoretical Model, 16
vm_area_struct data structure, 25-27, 30-32, 44-46, 49, 59, 176
vm_next data structure, 31
vm_ops data structure, 44
wish command, 50
X86 Processor, 151
