Copyright 1996-1999 David A Rusling
david.rusling@digital.com
REVIEW, Version 0.8-3
Legal Notice

UNIX is a trademark of Univel. Linux is a trademark of Linus Torvalds, and has
no connection to UNIX(TM) or Univel.

The copyright notice above and this permission notice must be preserved
complete on all complete or partial copies. If you distribute this work in
part, instructions for obtaining the complete version of this manual must be
included, and a means for obtaining a complete version provided. Exceptions to
these rules may be granted for academic purposes: Write to the author and ask.
These restrictions are here to protect us as authors, not to restrict you as
learners and educators.

All source code in this document is placed under the GNU General Public
License, available via anonymous FTP from prep.ai.mit.edu:/pub/gnu/COPYING. It
is also reproduced in appendix D.
Preface
Linux is a phenomenon of the Internet. Born out of the hobby project of a
student it has grown to become more popular than any other freely available
operating system. To many Linux is an enigma. How can something that is free
be worthwhile? In a world dominated by a handful of large software
corporations, how can something that has been written by a bunch of "hackers"
(sic) hope to compete? How can software contributed to by many different
people in many different countries around the world have a hope of being
stable and effective? Yet stable and effective it is and compete it does. Many
universities and research establishments use it for their everyday computing
needs. People are running it on their home PCs and I would wager that most
companies are using it somewhere even if they do not always realize that they
do. Linux is used to browse the web, host web sites, write theses, send
electronic mail and, as always with computers, to play games. Linux is
emphatically not a toy; it is a fully developed and professionally written
operating system used by enthusiasts all over the world.
The roots of Linux can be traced back to the origins of Unix(TM). In 1969, Ken
Thompson of the Research Group at Bell Laboratories began experimenting on a
multi-user, multi-tasking operating system using an otherwise idle PDP-7. He
was soon joined by Dennis Ritchie and the two of them, along with other
members of the Research Group, produced the early versions of Unix(TM).
Ritchie was strongly influenced by an earlier project, MULTICS, and the name
Unix(TM) is itself a pun on the name MULTICS. Early versions were written in
assembly code, but the third version was rewritten in a new programming
language, C. C was designed and written by Ritchie expressly as a programming
language for writing operating systems. This rewrite allowed Unix(TM) to move
onto the more powerful PDP-11/45 and 11/70 computers then being produced by
DIGITAL. The rest, as they say, is history. Unix(TM) moved out of the
laboratory and into mainstream computing and soon most major computer
manufacturers were producing their own versions.
Linux was the solution to a simple need. The only software that Linus
Torvalds, Linux's author and principal maintainer, was able to afford was
Minix. Minix is a simple, Unix(TM)-like operating system widely used as a
teaching aid. Linus was less than impressed with its features; his solution
was to write his own software. He took Unix(TM) as his model as that was an
operating system that he was familiar with in his day to day student life. He
started with an Intel 386-based PC and started to write. Progress was rapid
and, excited by this, Linus offered his efforts to other students via the
emerging world wide computer networks, then mainly used by the academic
community. Others saw the software and started contributing. Much of this new
software was itself the solution to a problem that one of the contributors
had. Before long, Linux had become an operating system. It is important to
note that Linux
A HOWTO is just what it sounds like, a document describing how to do
something. Many have been written for Linux and all are very useful.
some key point. It must be noted that around 95% of the Linux kernel sources
are common to all of the hardware platforms that it runs on. Likewise, around
95% of this book is about the machine independent parts of the Linux kernel.
Reader Profile

This book does not make any assumptions about the knowledge or experience of
the reader. I believe that interest in the subject matter will encourage a
process of self-education where necessary. That said, a degree of familiarity
with computers, preferably the PC, will help the reader derive real benefit
from the material, as will some knowledge of the C programming language.
The Peripheral Component Interconnect (PCI) standard is now firmly established
as the low-cost, high-performance data bus for PCs. The PCI chapter
(Chapter 6) describes how the Linux kernel initializes and uses PCI buses and
devices in the system.

The Interrupts and Interrupt Handling chapter (Chapter 7) looks at how the
Linux kernel handles interrupts. Whilst the kernel has generic mechanisms and
interfaces for handling interrupts, some of the interrupt handling details are
hardware and architecture specific.

One of Linux's strengths is its support for the many available hardware
devices for the modern PC. The Device Drivers chapter (Chapter 8) describes
how the Linux kernel controls the physical devices in the system.

The File system chapter (Chapter 9) describes how the Linux kernel maintains
the files in the file systems that it supports. It describes the Virtual File
System (VFS) and how the Linux kernel's real file systems are supported.

Networking and Linux are terms that are almost synonymous. In a very real
sense Linux is a product of the Internet or World Wide Web (WWW). Its
developers and users use the web to exchange information, ideas and code, and
Linux itself is often used to support the networking needs of organizations.
Chapter 10 describes how Linux supports the network protocols known
collectively as TCP/IP.

The Kernel Mechanisms chapter (Chapter 11) looks at some of the general tasks
and mechanisms that the Linux kernel needs to supply so that other parts of
the kernel work effectively together.

The Modules chapter (Chapter 12) describes how the Linux kernel can
dynamically load functions, for example file systems, only when they are
needed.

The Processors chapter (Chapter 13) gives a brief description of some of the
processors that Linux has been ported to.

The Sources chapter (Chapter 14) describes where in the Linux kernel sources
you should start looking for particular kernel functions.
See foo() in
foo/bar.c

Throughout the text there are references to pieces of code within the Linux
kernel source tree (for example the boxed margin note adjacent to this text).
These are given in case you wish to look at the source code itself and all of
the file references are relative to /usr/src/linux. Taking foo/bar.c as an
example, the full filename would be /usr/src/linux/foo/bar.c. If you are
running Linux (and you should), then looking at the code is a worthwhile
experience and you can use this book as an aid to understanding the code and
as a guide to its many data structures.
Trademarks

ARM is a trademark of ARM Holdings PLC.
Caldera, OpenLinux and the "C" logo are trademarks of Caldera, Inc.
Caldera OpenDOS (C) 1997 Caldera, Inc.
DEC is a trademark of Digital Equipment Corporation.
DIGITAL is a trademark of Digital Equipment Corporation.
Linux is a trademark of Linus Torvalds.
Motif is a trademark of The Open System Foundation, Inc.
MSDOS is a trademark of Microsoft Corporation.
Red Hat, glint and the Red Hat logo are trademarks of Red Hat Software, Inc.
UNIX is a registered trademark of X/Open.
XFree86 is a trademark of XFree86 Project, Inc.
X Window System is a trademark of the X Consortium and the Massachusetts
Institute of Technology.
The Author

I was born in 1957, a few weeks before Sputnik was launched, in the north of
England. I first met Unix at university, where a lecturer used it as an
example when teaching the notions of kernels, scheduling and other operating
systems goodies. I loved using the newly delivered PDP-11 for my final year
project. After graduating (in 1982 with a First Class Honours degree in
Computer Science) I worked for Prime Computers (Primos) and then, after a
couple of years, for Digital (VMS, Ultrix). At Digital I worked on many things
but for the last 5 years there, I worked for the semiconductor group on Alpha
and StrongARM evaluation boards. In 1998 I moved to ARM where I have a small
group of engineers writing low level firmware and porting operating systems.
My children (Esther and Stephen) describe me as a geek.

People often ask me about Linux at work and at home and I am only too happy to
oblige. The more that I use Linux in both my professional and personal life
the more that I become a Linux zealot. You may note that I use the term
'zealot' and not 'bigot'; I define a Linux zealot to be an enthusiast that
recognizes that there are other operating systems but prefers not to use them.
As my wife, Gill, who uses Windows 95, once remarked, "I never realized that
we would have his and her operating systems". For me, as an engineer, Linux
suits my needs perfectly. It is a superb, flexible and adaptable engineering
tool that I use at work and at home. Most freely available software easily
builds on Linux and I can often simply download pre-built executable files or
install them from a CD ROM. What else could I use to learn to program in C++,
Perl or learn about Java for free?
Acknowledgements

I must thank the many people who have been kind enough to take the time to
email me with comments about this book. I have attempted to incorporate those
comments in each new version that I have produced and I am more than happy to
receive comments, however please note my new e-mail address.

A number of lecturers have written to me asking if they can use some or parts
of this book in order to teach computing. My answer is an emphatic yes; this
is one use of the book that I particularly wanted. Who knows, there may be
another Linus Torvalds sat in the class.

Special thanks must go to John Rigby and Michael Bauer who gave me full,
detailed review notes of the whole book. Not an easy task. Alan Cox and
Stephen Tweedie have patiently answered my questions - thanks. I used Larry
Ewing's penguins to brighten up the chapters a bit. Finally, thank you to Greg
Hankins for accepting this book into the Linux Documentation Project and onto
their web site.
Contents

Preface

1 Hardware Basics
  1.2 Memory
  1.3 Buses
  1.4 Controllers and Peripherals

2 Software Basics
  2.1.3 Linkers
  2.2 What is an Operating System?
    2.2.1 Memory management
    2.2.2 Processes
    2.2.3 Device drivers
    2.2.4 The Filesystems
  2.3 Kernel Data Structures

3 Memory Management
  3.2 Caches
  3.3 Linux Page Tables
  3.4 Page Allocation and Deallocation
    3.4.1 Page Allocation
    3.4.2 Page Deallocation
  3.5 Memory Mapping
  3.6 Demand Paging
  3.7 The Linux Page Cache
  3.8 Swapping Out and Discarding Pages
    3.8.1 Reducing the Size of the Page and Buffer Caches
    3.8.2 Swapping Out System V Shared Memory Pages
    3.8.3 Swapping Out and Discarding Pages
  3.9 The Swap Cache

4 Processes

5 Interprocess Communication Mechanisms
  5.1 Signals
  5.2 Pipes
  5.3 Sockets
    5.3.1 System V IPC Mechanisms
    5.3.2 Message Queues
    5.3.3 Semaphores

6 PCI

7 Interrupts and Interrupt Handling

8 Device Drivers

9 The File system

10 Networks

11 Kernel Mechanisms

12 Modules

13 Processors

14 The Sources

C The Linux Documentation Project
  C.1 Overview
  C.2 Getting Involved
  C.3 Current Projects
  C.4 FTP sites for LDP works
  C.5 Documentation Conventions
  C.6 Copyright and License
  C.7 Publishing LDP Manuals

D The GNU General Public License

Glossary

Bibliography

List of Figures

1.1 A typical PC motherboard
5.1 Pipes
5.2 System V IPC Message Queues
5.3 System V IPC Semaphores
5.4 System V IPC Shared Memory
Chapter 1

Hardware Basics

An operating system has to work closely with the hardware system that acts as
its foundations. The operating system needs certain services that can only be
provided by the hardware. In order to fully understand the Linux operating
system, you need to understand the basics of the underlying hardware. This
chapter gives a brief introduction to that hardware: the modern PC.
When the "Popular Electronics" magazine for January 1975 was printed with an
illustration of the Altair 8080 on its front cover, a revolution started. The
Altair 8080, named after the destination of an early Star Trek episode, could
be assembled by home electronics enthusiasts for a mere $397. With its Intel
8080 processor and 256 bytes of memory but no screen or keyboard it was puny
by today's standards. Its inventor, Ed Roberts, coined the term "personal
computer" to describe his new invention, but the term PC is now used to refer
to almost any computer that you can pick up without needing help. By this
definition, even some of the very powerful Alpha AXP systems are PCs.

Enthusiastic hackers saw the Altair's potential and started to write software
and build hardware for it. To these early pioneers it represented freedom; the
freedom from huge batch processing mainframe systems run and guarded by an
elite priesthood. Overnight fortunes were made by college dropouts fascinated
by this new phenomenon, a computer that you could have at home on your kitchen
table. A lot of hardware appeared, all different to some degree, and software
hackers were happy to write software for these new machines. Paradoxically it
was IBM who firmly cast the mould of the modern PC by announcing the IBM PC in
1981 and shipping it to customers early in 1982. With its Intel 8088
processor, 64K of memory (expandable to 256K), two floppy disks and an 80
character by 25 lines Colour Graphics Adapter (CGA) it was not very powerful
by today's standards but it sold well. It was followed, in 1983, by the IBM
PC-XT which had the luxury of a 10Mbyte hard drive. It was not long before IBM
PC clones were being produced by a host of companies such as Compaq and the
architecture of the PC became a de facto standard. This
[Figure 1.1: A typical PC motherboard, showing the CPU, power connectors,
parallel port, COM1 and COM2 serial ports, and the PCI and ISA slots]
ing the functional components of the microprocessor were separate (and
physically large) units. This is when the term Central Processing Unit was
coined. The modern microprocessor combines these components onto an integrated
circuit etched onto a very small piece of silicon. The terms CPU,
microprocessor and processor are all used interchangeably in this book.
Microprocessors operate on binary data; that is data composed of ones and
zeros. These ones and zeros correspond to electrical switches being either on
or off. Just as 42 is a decimal number meaning "4 10s and 2 units", a binary
number is a series of binary digits each one representing a power of 2. In
this context, a power means the number of times that a number is multiplied by
itself. 10 to the power 1 (10^1) is 10, 10 to the power 2 (10^2) is 10x10,
10^3 is 10x10x10 and so on. Binary 0001 is decimal 1, binary 0010 is decimal
2, binary 0011 is 3, binary 0100 is 4 and so on. So, 42 decimal is 101010
binary or (2 + 8 + 32, that is 2^1 + 2^3 + 2^5). Rather than using binary to
represent numbers in computer programs, another base, hexadecimal, is usually
used. In this base, each digit represents a power of 16. As decimal numbers
only go from 0 to 9 the numbers 10 to 15 are represented as a single digit by
the letters A, B, C, D, E and F. For example, hexadecimal E is decimal 14 and
hexadecimal 2A is decimal 42 (two 16s (32) plus 10). Using the C programming
language notation (as I do throughout this book) hexadecimal numbers are
prefaced by "0x"; hexadecimal 2A is written as 0x2A.
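The arithmetic above can be checked with a few lines of C. This is a small
illustrative program of my own, not code from the kernel sources:

        #include <assert.h>
        #include <stdio.h>

        int main(void)
        {
                /* 42 decimal is 101010 binary: bits 1, 3 and 5 are set,
                   so 42 = 2 + 8 + 32.  Hexadecimal constants are prefaced
                   by "0x", so 0x2A is (2 x 16) + 10 = 42. */
                assert(0x2A == 42);
                assert((2 + 8 + 32) == 42);
                assert(((1 << 1) | (1 << 3) | (1 << 5)) == 42);

                printf("0x2A is decimal %d\n", 0x2A);
                return 0;
        }

Each shift (1 << n) builds the binary digit for 2^n, so the third assertion is
exactly the 101010 pattern described above.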
Microprocessors can perform arithmetic operations such as add, multiply and
divide and logical operations such as "is X greater than Y?".
The processor's execution is governed by an external clock. This clock, the
system clock, generates regular clock pulses to the processor and, at each
clock pulse, the processor does some work. For example, a processor could
execute an instruction every clock pulse. A processor's speed is described in
terms of the rate of the system clock ticks. A 100MHz processor will receive
100,000,000 clock ticks every second. It is misleading to describe the power
of a CPU by its clock rate as different processors perform different amounts
of work per clock tick. However, all things being equal, a faster clock speed
means a more powerful processor. The instructions executed by the processor
are very simple; for example "read the contents of memory at location X into
register Y". Registers are the microprocessor's internal storage, used for
storing data and performing operations on it. The operations performed may
cause the processor to stop what it is doing and jump to another instruction
somewhere else in memory. These tiny building blocks give the modern
microprocessor almost limitless power as it can execute millions or even
billions of instructions a second.
The instructions have to be fetched from memory as they are executed.
Instructions may themselves reference data within memory and that data must be
fetched from memory and saved there when appropriate.
The size, number and type of register within a microprocessor is entirely
dependent on its type. An Intel x86 processor has a different register set to
an Alpha AXP processor; for a start, the Intel's are 32 bits wide and the
Alpha AXP's are 64 bits wide. In general, though, any given processor will
have a number of general purpose registers and a smaller number of dedicated
registers. Most processors have the following special purpose, dedicated,
registers:
Program Counter (PC) This register contains the address of the next
instruction to be executed.

Stack Pointer (SP) Processors have to have access to large amounts of external
read/write memory; the stack pointer addresses the processor's stack in that
memory.

Processor Status (PS) Instructions may yield results; for example "is the
content of register X greater than the content of register Y?" will yield true
or false as a result. The PS register holds this and other information about
the current state of the processor. For example, most processors have at least
two modes of operation, kernel (or supervisor) and user. The PS register would
hold information identifying the current mode.
1.2 Memory

All systems have a memory hierarchy with memory at different speeds and sizes
at different points in the hierarchy. The fastest memory is known as cache
memory and is what it sounds like - memory that is used to temporarily hold,
or cache, contents of the main memory. This sort of memory is very fast but
expensive, therefore most processors have a small amount of on-chip cache
memory and more system based (on-board) cache memory. Some processors have one
cache to contain both instructions and data, but others have two, one for
instructions and the other for data. The Alpha AXP processor has two internal
memory caches; one for data (the D-Cache) and one for instructions (the
I-Cache). The external cache (or B-Cache) mixes the two together. Finally
there is the main memory which, relative to the external cache memory, is very
slow. Relative to the on-CPU cache, main memory is positively crawling.

The cache and main memories must be kept in step (coherent). In other words,
if a word of main memory is held in one or more locations in cache, then the
system must make sure that the contents of cache and memory are the same. The
job of cache coherency is done partially by the hardware and partially by the
operating system. This is also true for a number of major system tasks where
the hardware and software must cooperate closely to achieve their aims.
1.3 Buses

The individual components of the system board are interconnected by multiple
connection systems known as buses. The system bus is divided into three
logical functions; the address bus, the data bus and the control bus. The
address bus specifies the memory locations (addresses) for the data transfers.
The data bus holds the data transferred. The data bus is bidirectional; it
allows data to be read into the CPU and written from the CPU. The control bus
contains various lines used to route timing and control signals throughout the
system. Many flavours of bus exist, for example ISA and PCI buses are popular
ways of connecting peripherals to the system.
1.6 Timers

All operating systems need to know the time and so the modern PC includes a
special peripheral called the Real Time Clock (RTC). This provides two things:
a reliable time of day and an accurate timing interval. The RTC has its own
battery so that it continues to run even when the PC is not powered on; this
is how your PC always "knows" the correct date and time. The interval timer
allows the operating system to accurately schedule essential work.
Chapter 2

Software Basics

A program is a set of computer instructions that perform a particular task.
That program can be written in assembler, a very low level computer language,
or in a high level, machine independent language such as the C programming
language. An operating system is a special program which allows the user to
run applications such as spreadsheets and word processors. This chapter
introduces basic programming principles and gives an overview of the aims and
functions of an operating system.
        ldr  r16, (r15)      ; Line 1
        ldr  r17, 4(r15)     ; Line 2
        beq  r16, r17, 100   ; Line 3
        str  r17, (r15)      ; Line 4
100:                         ; Line 5
The first statement (on line 1) loads register 16 from the address held in
register 15. The next instruction loads register 17 from the next location in
memory. Line 3 compares the contents of register 16 with that of register 17
and, if they are equal, branches to label 100. If the registers do not contain
the same value then the program continues to line 4 where the contents of r17
are saved into memory. If the registers do contain the same value then no data
needs to be saved. Assembly level programs are tedious and tricky to write and
prone to errors. Very little of the Linux kernel is written in assembly
language and those parts that are are written only for efficiency and they are
specific to particular microprocessors.
The following fragment of C code

        if (x != y)
                x = y ;

performs exactly the same operations as the previous example assembly code. If
the contents of the variable x are not the same as the contents of variable y
then the contents of y will be copied to x. C code is organized into routines,
each of which performs a task. Routines may return any value or data type
supported by C. Large programs like the Linux kernel comprise many separate C
source modules, each with its own routines and data structures. These C source
code modules group together logical functions such as filesystem handling
code.
C supports many types of variables; a variable is a location in memory which
can be referenced by a symbolic name. In the above C fragment x and y refer to
locations in memory. The programmer does not care where in memory the
variables are put; it is the linker (see below) that has to worry about that.
Some variables contain different sorts of data, integer and floating point,
and others are pointers.

Pointers are variables that contain the address, the location in memory, of
other data. Consider a variable called x; it might live in memory at address
0x80010000. You could have a pointer, called px, which points at x. px might
live at address 0x80010030. The value of px would be 0x80010000: the address
of the variable x.
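In C this looks as follows. The addresses in the comments are only
illustrative; a real program's variables land wherever the linker and loader
put them:

        #include <assert.h>
        #include <stdio.h>

        int main(void)
        {
                int x = 42;      /* an ordinary variable, somewhere in
                                    memory, e.g. at 0x80010000          */
                int *px = &x;    /* px holds the address of x           */

                /* Dereferencing px (*px) fetches whatever is stored at
                   the address it contains: the current value of x.    */
                assert(*px == 42);

                *px = 100;       /* writing through the pointer changes
                                    x itself                           */
                assert(x == 100);

                printf("x lives at %p and holds %d\n", (void *)px, x);
                return 0;
        }

The & operator yields a variable's address and * follows a pointer to the
data it addresses; the two assertions confirm that px and x name the same
memory location.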
C allows you to bundle together related variables into data structures. For
example,

        struct {
                int  i ;
                char b ;
        } my_struct ;

is a data structure called my_struct which contains two elements, an integer
(32 bits of data storage) called i and a character (8 bits of data) called b.
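A short fragment showing how the elements of such a structure are read and
written; the values assigned here are, of course, just for illustration:

        #include <assert.h>
        #include <stdio.h>

        struct {
                int  i ;    /* an integer, typically 32 bits of storage */
                char b ;    /* a character, 8 bits of data              */
        } my_struct ;

        int main(void)
        {
                /* The "." operator selects an element of the structure. */
                my_struct.i = 42;
                my_struct.b = 'x';

                assert(my_struct.i == 42);
                assert(my_struct.b == 'x');

                printf("i = %d, b = %c\n", my_struct.i, my_struct.b);
                return 0;
        }

The kernel's own data structures are declared in just this way, though usually
with a struct tag so that pointers to them can be passed between routines.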
2.1.3 Linkers

Linkers are programs that link together several object modules and libraries
to form a single, coherent, program. Object modules are the machine code
output from an assembler or compiler and contain executable machine code and
data together with information that allows the linker to combine the modules
together to form a program. For example one module might contain all of a
program's database functions and another module its command line argument
handling functions. Linkers fix up references between these object modules,
where a routine or data structure referenced in one module actually exists in
another module. The Linux kernel is a single, large program linked together
from its many constituent object modules.
        $ ls
        tcl    images    perl
The $ is a prompt put out by a login shell (in this case bash). This means
that it is waiting for you, the user, to type some command. Typing ls causes
the keyboard driver to recognize that characters have been typed. The keyboard
driver passes them to the shell which processes that command by looking for an
executable image of the same name. It finds that image, in /bin/ls. Kernel
services are called to pull the ls executable image into virtual memory and
start executing it. The ls image makes calls to the file subsystem of the
kernel to find out what files are available. The filesystem might make use of
cached filesystem information or use the disk device driver to read this
information from the disk. It might even cause a network driver to exchange
information with a remote machine to find out details of remote files that
this system has access to (filesystems can be remotely mounted via the
Networked File System or NFS). Whichever way the information is located, ls
writes that information out and the video driver displays it on the screen.

All of the above seems rather complicated but it shows that even the simplest
commands reveal that an operating system is in fact a co-operating set of
functions that together give you, the user, a coherent view of the system.
        TTY  STAT   TIME COMMAND
        pRe  1      0:00 -bash
        pRe  1      0:00 sh /usr/X11R6/bin/startx
        pRe  1      0:00 xinit /usr/X11R6/lib/X11/xinit/xinitrc
        pRe  1 N    0:00 -bowman
        pRe  1 N    0:01 rxvt -geometry 120x35 -fg white -bg black
        pRe  1 <    0:00 xclock -bg grey -geometry -1500-1500 -padding 0
        pRe  1 <    0:00 xload -bg grey -geometry -0-0 -label xload
        pp6  1      9:26 /bin/bash
        pRe  1 N    0:00 rxvt -geometry 120x35 -fg white -bg black
        ppc  2      0:00 /bin/bash
        pRe  1 N    0:00 rxvt -geometry 120x35 -fg white -bg black
        v06  1      0:00 /bin/bash
        pp6  3 <    0:02 emacs intro/introduction.tex
        pp6  3      0:00 ps
If my system had many CPUs then each process could (theoretically at least)
run on a different CPU. Unfortunately, there is only one, so again the
operating system resorts to trickery by running each process in turn for a
short period. This period of time is known as a time-slice. This trick is
known as multi-processing or scheduling and it fools each process into
thinking that it is the only process. Processes are protected from one another
so that if one process crashes or malfunctions then it will not affect any
others. The operating system achieves this by giving each process a separate
address space which only it has access to.
has a purpose and although some are used by several kernel subsystems, they
are simpler than they appear at first sight.

Understanding the Linux kernel hinges on understanding its data structures and
the use that the various functions within the Linux kernel make of them. This
book bases its description of the Linux kernel on its data structures. It
talks about each kernel subsystem in terms of its algorithms, its methods of
getting things done, and their usage of the kernel's data structures.
maintain than simple linked lists or hash tables. If the data structure can be
found in the cache (this is known as a cache hit), then all well and good. If
it cannot then all of the relevant data structures must be searched and, if
the data structure exists at all, it must be added into the cache. In adding
new data structures into the cache an old cache entry may need discarding.
Linux must decide which one to discard, the danger being that the discarded
data structure may be the next one that Linux needs.
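The idea can be sketched as a tiny direct-mapped cache in C. None of this is
real kernel code; the names, the slow_lookup() stand-in for an expensive
search, and the crude "discard whatever hashes to the same slot" policy are
purely illustrative:

        #include <assert.h>
        #include <stdio.h>

        #define CACHE_SLOTS 8

        struct entry {
                int key;    /* which data structure this slot caches */
                int value;  /* the cached information itself         */
                int valid;  /* is this slot in use?                  */
        };

        static struct entry cache[CACHE_SLOTS];

        /* Stands in for an expensive search of the real structures. */
        static int slow_lookup(int key) { return key * 10; }

        static int cached_lookup(int key)
        {
                struct entry *e = &cache[key % CACHE_SLOTS];

                if (e->valid && e->key == key)  /* a cache hit */
                        return e->value;

                /* A miss: do the expensive search, then add the result
                   to the cache.  If the slot was in use, its old entry
                   is discarded - with the risk that it is the very one
                   needed next. */
                e->key   = key;
                e->value = slow_lookup(key);
                e->valid = 1;
                return e->value;
        }

        int main(void)
        {
                assert(cached_lookup(3)  == 30);   /* miss, then cached  */
                assert(cached_lookup(3)  == 30);   /* hit                */
                assert(cached_lookup(11) == 110);  /* discards key 3     */
                printf("cache behaves as expected\n");
                return 0;
        }

Key 11 hashes to the same slot as key 3 (11 mod 8 = 3), so caching it throws
key 3's entry away: exactly the discard risk described above.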
Chapter 3
Memory Management
Large Address Spa
es The operating system makes the system appear as if it has
a larger amount of memory than it a
tually has. The virtual memory
an be
many times larger than the physi
al memory in the system,
Prote tion Ea h pro ess in the system has its own virtual address spa e. These
virtual address spa
es are
ompletely separate from ea
h other and so a pro
ess
running one appli
ation
annot ae
t another. Also, the hardware virtual
memory me
hanisms allow areas of memory to be prote
ted against writing.
This prote
ts
ode and data from being overwritten by rogue appli
ations.
Memory Mapping Memory mapping is used to map image and data files into a process's address space. In memory mapping, the contents of a file are linked directly into the virtual address space of a process.
Fair Physical Memory Allocation The memory management subsystem allows each running process in the system a fair share of the physical memory of the system,
Shared Virtual Memory Although virtual memory allows processes to have separate (virtual) address spaces, there are times when you need processes to share memory. For example there could be several processes in the system running
Figure 3.1: Abstract model of Virtual to Physical address mapping (processes X and Y each have their own page tables, mapping their virtual page frame numbers, VPFN 0 to 7, onto the system's physical page frame numbers, PFN 0 to 4)
the offset and bits 12 and above are the virtual page frame number. Each time the processor encounters a virtual address it must extract the offset and the virtual page frame number. The processor must translate the virtual page frame number into a physical one and then access the location at the correct offset into that physical page. To do this the processor uses page tables.
Figure 3.1 shows the virtual address spaces of two processes, process X and process Y, each with their own page tables. These page tables map each process's virtual pages into physical pages in memory. This shows that process X's virtual page frame number 0 is mapped into memory in physical page frame number 1 and that process Y's virtual page frame number 1 is mapped into physical page frame number 4. Each entry in the theoretical page table contains the following information:
Valid flag. This indicates if this page table entry is valid,
The physical page frame number that this entry is describing,
Access control information. This describes how the page may be used. Can it be written to? Does it contain executable code?
The page table is accessed using the virtual page frame number as an offset. Virtual page frame 5 would be the 6th element of the table (0 is the first element).
To translate a virtual address into a physical one, the processor must first work out the virtual address's page frame number and the offset within that virtual page. By making the page size a power of 2 this can be easily done by masking and shifting. Looking again at Figure 3.1 and assuming a page size of 0x2000 bytes (which is decimal 8192) and an address of 0x2194 in process Y's virtual address space then the processor would translate that address into offset 0x194 into virtual page frame number 1.
The processor uses the virtual page frame number as an index into the process's page table to retrieve its page table entry. If the page table entry at that offset is valid, the processor takes the physical page frame number from this entry. If the entry is invalid, the process has accessed a non-existent area of its virtual memory. In this case, the processor cannot resolve the address and must pass control to the operating system so that it can fix things up.
Just how the processor notifies the operating system that the current process has attempted to access a virtual address for which there is no valid translation is specific to the processor. However the processor delivers it, this is known as a page fault and the operating system is notified of the faulting virtual address and the reason for the page fault.
Assuming that this is a valid page table entry, the processor takes that physical page frame number and multiplies it by the page size to get the address of the base of the page in physical memory. Finally, the processor adds in the offset to the instruction or data that it needs.
Using the above example again, process Y's virtual page frame number 1 is mapped to physical page frame number 4 which starts at 0x8000 (4 x 0x2000). Adding in the 0x194 byte offset gives us a final physical address of 0x8194.
By mapping virtual to physical addresses this way, the virtual memory can be mapped into the system's physical pages in any order. For example, in Figure 3.1 process X's virtual page frame number 0 is mapped to physical page frame number 1 whereas virtual page frame number 7 is mapped to physical page frame number 0 even though it is higher in virtual memory than virtual page frame number 0. This demonstrates an interesting byproduct of virtual memory; the pages of virtual memory do not have to be present in physical memory in any particular order.
3.1.2 Swapping
If a process needs to bring a virtual page into physical memory and there are no free physical pages available, the operating system must make room for this page by discarding another page from physical memory.
If the page to be discarded from physical memory came from an image or data file and has not been written to then the page does not need to be saved. Instead it can be discarded and if the process needs that page again it can be brought back into memory from the image or data file.
However, if the page has been modified, the operating system must preserve the contents of that page so that it can be accessed at a later time. This type of page is known as a dirty page and when it is removed from memory it is saved in a special sort of file called the swap file. Accesses to the swap file are very long relative to the speed of the processor and physical memory and the operating system must juggle the need to write pages to disk with the need to retain them in memory to be used again.
If the algorithm used to decide which pages to discard or swap (the swap algorithm) is not efficient then a condition known as thrashing occurs. In this case, pages are constantly being written to disk and then being read back and the operating system is too busy to allow much real work to be performed. If, for example, physical page frame number 1 in Figure 3.1 is being regularly accessed then it is not a good candidate for swapping to hard disk. The set of pages that a process is currently using is called the working set. An efficient swap scheme would make sure that all processes have their working set in physical memory.
Linux uses a Least Recently Used (LRU) page aging technique to fairly choose pages which might be removed from the system. This scheme involves every page in the system having an age which changes as the page is accessed. The more that a page is accessed, the younger it is; the less that it is accessed the older and more stale it becomes. Old pages are good candidates for swapping.
Figure 3.2: Alpha AXP Page Table Entry (bits 32 to 63 hold the PFN; the low bits include the V, FOR, FOW, ASM, GH, KRE, URE, KWE and UWE fields together with the Linux specific __PAGE_DIRTY and __PAGE_ACCESSED bits)
occurs, the processor reports a page fault and passes control to the operating system,
FOW "Fault on Write", as above but page fault on an attempt to write to this page,
FOR "Fault on Read", as above but page fault on an attempt to read from this page,
ASM Address Space Match. This is used when the operating system wishes to clear only some of the entries from the Translation Buffer,
__PAGE_DIRTY if set, the page needs to be written out to the swap file,
__PAGE_ACCESSED Used by Linux to mark a page as having been accessed.
3.2 Caches
If you were to implement a system using the above theoretical model then it would work, but not particularly efficiently. Both operating system and processor designers try hard to extract more performance from the system. Apart from making the processors, memory and so on faster the best approach is to maintain caches of useful information and data that make some operations faster. Linux uses a number of memory management related caches:
device drivers. These buffers are of fixed sizes (for example 512 bytes) and contain blocks of information that have either been read from a block device or are being written to it. A block device is one that can only be accessed by reading and writing fixed sized blocks of data. All hard disks are block devices. The buffer cache is indexed via the device identifier and the desired block number and is used to quickly find a block of data. Block devices are only ever accessed via the buffer cache. If data can be found in the buffer cache then it does not need to be read from the physical block device, for example a hard disk, and access to it is much faster.
Page Cache This is used to speed up access to images and data on disk. It is used to cache the logical contents of a file a page at a time and is accessed via the file and offset within the file. As pages are read into memory from disk, they are cached in the page cache.
See fs/buffer.c
See mm/filemap.c
See swap.h, mm/swap_state.c, mm/swapfile.c
Swap Cache Only modified (or dirty) pages are saved in the swap file. So long as these pages are not modified after they have been written to the swap file then the next time the page is swapped out there is no need to write it to the swap file as the page is already in the swap file. Instead the page can simply be discarded. In a heavily swapping system this saves many unnecessary and costly disk operations.
When the reference to the virtual address is made, the processor will attempt to find a matching TLB entry. If it finds one, it can directly translate the virtual address into a physical one and perform the correct operation on the data. If the processor cannot find a matching TLB entry then it must get the operating system to help. It does this by signalling the operating system that a TLB miss has occurred. A system specific mechanism is used to deliver that exception to the operating system code that can fix things up. The operating system generates a new TLB entry for the address mapping. When the exception has been cleared, the processor will make another attempt to translate the virtual address. This time it will work because there is now a valid entry in the TLB for that address.
The drawback of using caches, hardware or otherwise, is that in order to save effort Linux must use more time and space maintaining these caches and, if the caches become corrupted, the system will crash.
Each platform that Linux runs on must provide translation macros that allow the kernel to traverse the page tables for a particular process. This way, the kernel does not need to know the format of the page table entries or how they are arranged. This is so successful that Linux uses the same page table manipulation code for the Alpha processor, which has three levels of page tables, and for Intel x86 processors, which have two levels of page tables.
Figure 3.3: Three Level Page Tables (the virtual address is split into Level 1, Level 2 and Level 3 fields; each field indexes a page table whose entry holds the PFN of the next level's table, the final level yielding the PFN of the physical page; the PGD points at the top level table)
count This is a count of the number of users of this page. The count is greater than one when the page is shared between many processes,
age This field describes the age of the page and is used to decide if the page is a good candidate for discarding or swapping,
map_nr This is the physical page frame number that this mem_map_t describes.
The free_area vector is used by the page allocation code to find and free pages. The whole buffer management scheme is supported by this mechanism and so far as the code is concerned, the size of the page and physical paging mechanisms used by the processor are irrelevant.
Each element of free_area contains information about blocks of pages. The first element in the array describes single pages, the next blocks of 2 pages, the next blocks of 4 pages and so on upwards in powers of two. The list element is used as a queue head and has pointers to the page data structures in the mem_map array. Free blocks of pages are queued here. map is a pointer to a bitmap which keeps track of allocated groups of pages of this size. Bit N of the bitmap is set if the Nth block of pages is free.

1 Confusingly the structure is also known as the page structure.

See include/linux/mm.h
Figure 3.4 shows the free_area structure. Element 0 has one free page (page frame number 0) and element 2 has 2 free blocks of 4 pages, the first starting at page frame number 4 and the second at page frame number 56.

See mm/page_alloc.c
Linux uses the Buddy algorithm 2 to effectively allocate and deallocate blocks of pages. The page allocation code attempts to allocate a block of one or more physical pages. Pages are allocated in blocks which are powers of 2 in size. That means that it can allocate a block of 1 page, 2 pages, 4 pages and so on. So long as there are enough free pages in the system to grant this request (nr_free_pages > min_free_pages) the allocation code will search the free_area for a block of pages of the size requested.
Each element of the free_area has a map of the allocated and free blocks of pages for that sized block. For example, element 2 of the array has a memory map that describes free and allocated blocks each of 4 pages long.
The allocation algorithm first searches for blocks of pages of the size requested. It follows the chain of free pages that is queued on the list element of the free_area data structure. If no blocks of pages of the requested size are free, blocks of the next size (which is twice that of the size requested) are looked for. This process continues until all of the free_area has been searched or until a block of pages has been found.
If the block of pages found is larger than that requested it must be broken down until there is a block of the right size. Because the blocks are each a power of 2 pages big then this breaking down process is easy as you simply break the blocks in half. The free blocks are queued on the appropriate queue and the allocated block of pages is returned to the caller.
For example, in Figure 3.4 if a block of 2 pages was requested, the first block of 4 pages (starting at page frame number 4) would be broken into two 2 page blocks. The first, starting at page frame number 4, would be returned to the caller as the allocated pages and the second block, starting at page frame number 6, would be queued as a free block of 2 pages onto element 1 of the free_area array.
See free_pages() in mm/page_alloc.c
Allocating blocks of pages tends to fragment memory with larger blocks of free pages being broken down into smaller ones. The page deallocation code recombines pages into larger blocks of free pages whenever it can. In fact the page block size is important as it allows for easy combination of blocks into larger blocks.
Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free. If it is, then it is combined with the newly freed block of pages to form a new free block of pages of the next size up. Each time two blocks of pages are recombined into a bigger block of free pages the page deallocation code attempts to recombine that block into a yet larger one. In this way the blocks of free pages are as large as memory usage will allow.
2 Bibliography reference here
Figure 3.4: The free_area data structure (element 0's queue holds one free page, page frame number 0; element 2's queue holds two free 4 page blocks, at page frame numbers 4 and 56; each element's map bitmap tracks which blocks of that size in physical memory are free)
Figure 3.5: An Area of Virtual Memory (a vm_area_struct describing vm_start, vm_end, vm_flags, vm_inode, vm_ops and vm_next, together with its virtual memory operations: open(), close(), unmap(), protect(), sync(), advise(), nopage(), wppage(), swapout() and swapin())
See handle_mm_fault() in mm/memory.c
Once an executable image has been memory mapped into a process's virtual memory it can start to execute. As only the very start of the image is physically pulled into memory it will soon access an area of virtual memory that is not yet in physical memory. When a process accesses a virtual address that does not have a valid page table entry, the processor will report a page fault to Linux. The page fault describes the virtual address where the page fault occurred and the type of memory access that caused it.
Linux must find the vm_area_struct that represents the area of memory that the page fault occurred in. As searching through the vm_area_struct data structures is critical to the efficient handling of page faults, these are linked together in an AVL (Adelson-Velskii and Landis) tree structure. If there is no vm_area_struct data structure for this faulting virtual address, this process has accessed an illegal virtual address. Linux will signal the process, sending a SIGSEGV signal, and if the process does not have a handler for that signal it will be terminated.
Linux next checks the type of page fault that occurred against the types of accesses allowed for this area of virtual memory. If the process is accessing the memory in an illegal way, say writing to an area that it is only allowed to read from, it is also signalled with a memory error.
Figure 3.6: The Linux Page Cache (the page_hash_table points at chains of mem_map_t data structures linked by next_hash and prev_hash, each holding the inode and offset of the cached page)
See include/linux/pagemap.h

inode data structure (described in Chapter 9) and each VFS inode is unique and fully describes one and only one file. The index into the page hash table is derived from the file's VFS inode and the offset into the file.
Whenever a page is read from a memory mapped file, for example when it needs to be brought back into memory during demand paging, the page is read through the page cache. If the page is present in the cache, a pointer to the mem_map_t data structure representing it is returned to the page fault handling code. Otherwise the page must be brought into memory from the file system that holds the image. Linux allocates a physical page and reads the page from the file on disk.
If it is possible, Linux will initiate a read of the next page in the file. This single page read ahead means that if the process is accessing the pages in the file serially, the next page will be waiting in memory for the process.
Over time the page cache grows as images are read and executed. Pages will be removed from the cache as they are no longer needed, say as an image is no longer being used by any process. As Linux uses memory it can start to run low on physical pages. In this case Linux will reduce the size of the page cache.
See mm/vmscan.c
The Kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire. Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. It uses two variables, free_pages_high and free_pages_low, to decide if it should free some pages. So long as the number of free pages in the system remains above free_pages_high, the kernel swap daemon does nothing; it sleeps again until its timer next expires. For the purposes of this check the kernel swap daemon takes into account the number of pages currently being written out to the swap file. It keeps a count of these in nr_async_pages; this is incremented each time a page is queued waiting to be written out to the swap file and decremented when the write to the swap device has completed. free_pages_low and free_pages_high are set at system startup time and are related to the number of physical pages in the system. If the number of free pages in the system has fallen below free_pages_high or, worse still, free_pages_low, the kernel swap daemon will try three ways to reduce the number of physical pages being used by the system:
Reducing the size of the buffer and page caches,
Swapping out System V shared memory pages,
Swapping out and discarding pages.
If the number of free pages in the system has fallen below free_pages_low, the kernel swap daemon will try to free 6 pages before it next runs. Otherwise it will try to free 3 pages. Each of the above methods is tried in turn until enough pages have been freed. The kernel swap daemon remembers which method it was using the last time that it attempted to free physical pages. Each time it runs it will start trying to free pages using this last successful method.
After it has freed sufficient pages, the swap daemon sleeps again until its timer expires. If the reason that the kernel swap daemon freed pages was that the number of free pages in the system had fallen below free_pages_low, it only sleeps for half its usual time. Once the number of free pages is more than free_pages_low the kernel swap daemon goes back to sleeping longer between checks.
See shrink_mmap() in mm/filemap.c
See try_to_free_buffer() in fs/buffer.c
System V shared memory is an inter-process communication mechanism which allows two or more processes to share virtual memory in order to pass information amongst themselves. How processes share memory in this way is described in more detail in Chapter 5. For now it is enough to say that each area of System V shared memory is described by a shmid_ds data structure. This contains a pointer to a list of vm_area_struct data structures, one for each process sharing this area of virtual memory. The vm_area_struct data structures describe where in each process's virtual memory this area of System V shared memory goes. Each vm_area_struct data structure for this System V shared memory is linked together using the vm_next_shared and vm_prev_shared pointers. Each shmid_ds data structure also contains a list of page table entries each of which describes the physical page that a shared virtual page maps to.
The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory pages. Each time it runs it remembers which page of which shared virtual memory area it last swapped out. It does this by keeping two indices, the first an index into the set of shmid_ds data structures, the second into the list of page table entries for this area of System V shared memory. This makes sure that it fairly victimizes the areas of System V shared memory.
As the physical page frame number for a given virtual page of System V shared memory is contained in the page tables of all of the processes sharing this area of virtual memory, the kernel swap daemon must modify all of these page tables to show that the page is no longer in memory but is now held in the swap file. For each shared page it is swapping out, the kernel swap daemon finds the page table entry in each of the sharing processes' page tables (by following a pointer from each vm_area_struct data structure). If this process's page table entry for this page of System V shared memory is valid, it converts it into an invalid but swapped out page table entry and reduces this (shared) page's count of users by one. The format of a swapped out System V shared page table entry contains an index into the set of shmid_ds data structures and an index into the page table entries for this area of System V shared memory.
If the page's count is zero after the page tables of the sharing processes have all been modified, the shared page can be written out to the swap file. The page table entry in the list pointed at by the shmid_ds data structure for this area of System V shared memory is replaced by a swapped out page table entry. A swapped out page table entry is invalid but contains an index into the set of open swap files and the offset in that file where the swapped out page can be found. This information will be used when the page has to be brought back into physical memory.
The swap daemon looks at each process in the system in turn to see if it is a good candidate for swapping. Good candidates are processes that can be swapped (some cannot) and that have one or more pages which can be swapped or discarded from memory. Pages are swapped out of physical memory into the system's swap files only if the data in them cannot be retrieved another way.
A lot of the contents of an executable image come from the image's file and can easily be re-read from that file. For example, the executable instructions of an image will never be modified by the image and so will never be written to the swap file. These pages can simply be discarded; when they are again referenced by the process, they will be brought back into memory from the executable image.
Once the process to swap has been located, the swap daemon looks through all of its virtual memory regions looking for areas which are not shared or locked. Linux does not swap out all of the swappable pages of the process that it has selected; instead it removes only a small number of pages. Pages cannot be swapped or discarded if they are locked in memory.
To do this it follows the vm_next pointer along the list of vm_area_struct structures queued on the mm_struct for the process.
The Linux swap algorithm uses page aging. Each page has a counter (held in the mem_map_t data structure) that gives the Kernel swap daemon some idea whether or not a page is worth swapping. Pages age when they are unused and rejuvenate on access; the swap daemon only swaps out old pages. The default action when a page is first allocated is to give it an initial age of 3. Each time it is touched, its age is increased by 3 to a maximum of 20. Every time the Kernel swap daemon runs it ages pages, decrementing their age by 1. These default actions can be changed and for this reason they (and other swap related information) are stored in the swap_control data structure.

See swap_out_vma() in mm/vmscan.c
If the page is old (age = 0), the swap daemon will process it further. Dirty pages are pages which can be swapped out. Linux uses an architecture specific bit in the PTE to describe pages this way (see Figure 3.2). However, not all dirty pages are necessarily written to the swap file. Every virtual memory region of a process may have its own swap operation (pointed at by the vm_ops pointer in the vm_area_struct) and, if so, that method is used. Otherwise, the swap daemon will allocate a page in the swap file and write the page out to that device.
The page's page table entry is replaced by one which is marked as invalid but which contains information about where the page is in the swap file. This is an offset into the swap file where the page is held and an indication of which swap file is being used.
Whatever the swap method used, the original physical page is made free by putting it back into the free_area. Clean (or rather not dirty) pages can be discarded and put back into the free_area for re-use.
If enough of the swappable process's pages have been swapped out or discarded, the swap daemon will again sleep. The next time it wakes it will consider the next process in the system. In this way, the swap daemon nibbles away at each process's physical pages until the system is again in balance. This is much fairer than swapping out whole processes.
a page which is being held in a swap file that has not been modified. If the page is subsequently modified (by being written to), its entry is removed from the swap cache.
When Linux needs to swap a physical page out to a swap file it consults the swap cache and, if there is a valid entry for this page, it does not need to write the page out to the swap file. This is because the page in memory has not been modified since it was last read from the swap file.
The entries in the swap cache are page table entries for swapped out pages. They are marked as invalid but contain information which allows Linux to find the right swap file and the right page within that swap file.
See do_page_fault() in arch/i386/mm/fault.c
See do_no_page() in mm/memory.c
See do_swap_page() in mm/memory.c
The dirty pages saved in the swap files may be needed again, for example when an application writes to an area of virtual memory whose contents are held in a swapped out physical page. Accessing a page of virtual memory that is not held in physical memory causes a page fault to occur. The page fault is the processor signalling the operating system that it cannot translate a virtual address into a physical one. In this case this is because the page table entry describing this page of virtual memory was marked as invalid when the page was swapped out. The processor cannot handle the virtual to physical address translation and so hands control back to the operating system, describing as it does so the virtual address that faulted and the reason for the fault. The format of this information and how the processor passes control to the operating system is processor specific. The processor specific page fault handling code must locate the vm_area_struct data structure that describes the area of virtual memory that contains the faulting virtual address. It does this by searching the vm_area_struct data structures for this process until it finds the one containing the faulting virtual address. This is very time critical code and a process's vm_area_struct data structures are so arranged as to make this search take as little time as possible.
Having carried out the appropriate processor specific actions and found that the faulting virtual address is for a valid area of virtual memory, the page fault processing becomes generic and applicable to all processors that Linux runs on. The generic page fault handling code looks for the page table entry for the faulting virtual address. If the page table entry it finds is for a swapped out page, Linux must swap the page back into physical memory. The format of the page table entry for a swapped out page is processor specific but all processors mark these pages as invalid and put the information necessary to locate the page within the swap file into the page table entry. Linux needs this information in order to bring the page back into physical memory.
At this point, Linux knows the faulting virtual address and has a page table entry containing information about where this page has been swapped to. The vm_area_struct data structure may contain a pointer to a routine which will swap any page of the area of virtual memory that it describes back into physical memory. This is its swapin operation. If there is a swapin operation for this area of virtual memory then Linux will use it. This is, in fact, how swapped out System V shared memory pages are handled: they require special handling because the format of a swapped out System V shared page is a little different from that of an ordinary swapped out page. There may not be a swapin operation, in which case Linux will assume that this is an ordinary page that does not need to be specially handled. It allocates a free physical page and reads the swapped out page back from the swap file. Information telling it where in the swap file (and which swap file) is taken from the invalid page table entry.
If the access that caused the page fault was not a write access then the page is left in the swap cache and its page table entry is not marked as writable. If the page is subsequently written to, another page fault will occur and, at that point, the page is marked as dirty and its entry is removed from the swap cache. If the page is not written to and it needs to be swapped out again, Linux can avoid the write of the page to its swap file because the page is already in the swap file.
If the access that caused the page to be brought in from the swap file was a write operation, this page is removed from the swap cache and its page table entry is marked as both dirty and writable.
Chapter 4
Processes
This chapter describes what a process is and how the Linux kernel creates, manages and deletes the processes in the system.
Processes carry out tasks within the operating system. A program is a set of machine code instructions and data stored in an executable image on disk and is, as such, a passive entity; a process can be thought of as a computer program in action. It is a dynamic entity, constantly changing as the machine code instructions are executed by the processor. As well as the program's instructions and data, the process also includes the program counter and all of the CPU's registers as well as the process stacks containing temporary data such as routine parameters, return addresses and saved variables. The current executing program, or process, includes all of the current activity in the microprocessor. Linux is a multiprocessing operating system.
Processes are separate tasks each with their own rights and responsibilities. If one process crashes it will not cause another process in the system to crash. Each individual process runs in its own virtual address space and is not capable of interacting with another process except through secure, kernel managed mechanisms.
During the lifetime of a process it will use many system resources. It will use the CPUs in the system to run its instructions and the system's physical memory to hold it and its data. It will open and use files within the filesystems and may directly or indirectly use the physical devices in the system. Linux must keep track of the process itself and of the system resources that it has so that it can manage it and the other processes in the system fairly. It would not be fair to the other processes in the system if one process monopolized most of the system's physical memory or its CPUs.
The most precious resource in the system is the CPU, usually there is only one. Linux is a multiprocessing operating system, its objective is to have a process running on each CPU in the system at all times, to maximize CPU utilization. If there are more processes than CPUs (and there usually are), the rest of the processes must wait until a CPU becomes free before they can be run. Multiprocessing is a simple idea; a process is executed until it must wait, usually for some system resource; when it has this resource, it may run again. In a uniprocessing system, for example DOS, the CPU would simply sit idle and the waiting time would be wasted. In a multiprocessing system many processes are kept in memory at the same time. Whenever a process has to wait the operating system takes the CPU away from that process and gives it to another, more deserving process. It is the scheduler which chooses which is the most appropriate process to run next and Linux uses a number of scheduling strategies to ensure fairness.
Linux supports a number of different executable file formats, ELF is one, Java is another, and these must be managed transparently, as must the processes' use of the system's shared libraries.
So that Linux can manage the processes in the system, each process is represented by a task_struct data structure (task and process are terms that Linux uses interchangeably). The task vector is an array of pointers to every task_struct data structure in the system. This means that the maximum number of processes in the system is limited by the size of the task vector; by default it has 512 entries. As processes are created, a new task_struct is allocated from system memory and added into the task vector. To make it easy to find, the current, running, process is pointed to by the current pointer.
As well as the normal type of process, Linux supports real time processes. These processes have to react very quickly to external events (hence the term "real time") and they are treated differently from normal user processes by the scheduler. Although the task_struct data structure is quite large and complex, its fields can be divided into a number of functional areas:
State As a process executes it changes state according to its circumstances. Linux processes have the following states:
Running The process is either running (it is the current process in the system) or it is ready to run (it is waiting to be assigned to one of the system's CPUs).
Waiting The process is waiting for an event or for a resource. Linux differentiates between two types of waiting process; interruptible and uninterruptible. Interruptible waiting processes can be interrupted by signals whereas uninterruptible waiting processes are waiting directly on hardware conditions and cannot be interrupted under any circumstances.
Stopped The process has been stopped, usually by receiving a signal. A process that is being debugged can be in a stopped state.
Zombie This is a halted process which, for some reason, still has a task_struct data structure in the task vector. It is what it sounds like, a dead process.
Identifiers Every process in the system has a process identifier. The process identifier is not an index into the task vector, it is simply a number. Each process also has user and group identifiers; these are used to control this process's access to the files and devices in the system.
Inter-Process Communication Linux supports the classic UnixTM IPC mechanisms of signals, pipes and semaphores and also the System V IPC mechanisms of shared memory, semaphores and message queues. The IPC mechanisms supported by Linux are described in Chapter 5.
Links In a Linux system no process is independent of any other process. Every process in the system, except the initial process, has a parent process. New processes are not created, they are copied, or rather cloned, from previous processes. Every task_struct representing a process keeps pointers to its parent process and to its siblings (those processes with the same parent process) as well as to its own child processes. You can see the family relationship between the running processes in a Linux system using the pstree command:
init(1)-+-crond(98)
        |-emacs(387)
        |-gpm(146)
        |-inetd(110)
        |-kerneld(18)
        |-kflushd(2)
        |-klogd(87)
        |-kswapd(3)
        |-login(160)---bash(192)---emacs(225)
        |-lpd(121)
        |-mingetty(161)
        |-mingetty(162)
        |-mingetty(163)
        |-mingetty(164)
        |-login(403)---bash(404)---pstree(594)
        |-sendmail(134)
        |-syslogd(78)
        `-update(166)
Additionally all of the processes in the system are held in a doubly linked list whose root is the init process's task_struct data structure. This list allows the Linux kernel to look at every process in the system. It needs to do this to provide support for commands such as ps or kill.
Times and Timers The kernel keeps track of a process's creation time as well as the CPU time that it consumes during its lifetime. Each clock tick, the kernel updates the amount of time in jiffies that the current process has spent in system and in user mode. Linux also supports process specific interval timers; processes can use system calls to set up timers to send signals to themselves when the timers expire. These timers can be single-shot or periodic.
File system Processes can open and close files as they wish and the process's task_struct contains pointers to descriptors for each open file as well as pointers to two VFS inodes. The first is to the root of the process (its home directory) and the second is to its current or pwd directory. pwd is derived from the UnixTM command pwd, print working directory. Each VFS inode uniquely describes a file or directory within a file system and also provides a uniform interface to the underlying file systems. How file systems are supported under Linux is described in Chapter 9. These two VFS inodes have their count fields incremented to show that one or more processes are referencing them. This is why you cannot delete the directory that a process has as its pwd directory, or for that matter one of its sub-directories.
Virtual memory Most processes have some virtual memory (kernel threads and daemons do not) and the Linux kernel must track how that virtual memory is mapped onto the system's physical memory.
Processor Specific Context A process could be thought of as the sum total of the system's current state. Whenever a process is running it is using the processor's registers, stacks and so on. This is the process's context and, when a process is suspended, all of that CPU specific context must be saved in the task_struct for the process. When a process is restarted by the scheduler its context is restored from here.
Groups are Linux's way of assigning privileges to files and directories for a group of users rather than to a single user or to all processes in the system. You might, for example, create a group for all of the users in a software project and arrange it so that only they could read and write the source code for the project. A process can belong to several groups (a maximum of 32 is the default) and these are held in the groups vector in the task_struct for each process. So long as a file has access rights for one of the groups that a process belongs to then that process will have appropriate group access rights to that file.
There are four pairs of process and group identifiers held in a process's task_struct:
uid, gid The user identifier and group identifier of the user that the process is running on behalf of,
effective uid and gid There are some programs which change the uid and gid from that of the executing process into their own (held as attributes in the VFS inode describing the executable image). These programs are known as setuid programs and they are useful because it is a way of restricting accesses to services, particularly those that run on behalf of someone else, for example a network daemon. The effective uid and gid are those from the setuid program and the uid and gid remain as they were. The kernel checks the effective uid and gid whenever it checks for privilege rights.
file system uid and gid These are normally the same as the effective uid and gid and are used when checking file system access rights. They are needed for NFS mounted filesystems where the user mode NFS server needs to access files as if it were a particular process. In this case only the file system uid and gid are changed (not the effective uid and gid). This avoids a situation where malicious users could send a kill signal to the NFS server. Kill signals are delivered to processes with a particular effective uid and gid.
saved uid and gid These are mandated by the POSIX standard and are used by programs which change the process's uid and gid via system calls. They are used to save the real uid and gid during the time that the original uid and gid have been changed.
4.3 Scheduling
All processes run partially in user mode and partially in system mode. How these modes are supported by the underlying hardware differs but generally there is a secure mechanism for getting from user mode into system mode and back again. User mode has far fewer privileges than system mode. Each time a process makes a system call it swaps from user mode to system mode and continues executing. At this point the kernel is executing on behalf of the process. In Linux, processes do not preempt the current, running process, they cannot stop it from running so that they can run. Each process decides to relinquish the CPU that it is running on when it has to wait for some system event. For example, a process may have to wait for a character to be read from a file. This waiting happens within the system call, in system mode; the process used a library function to open and read the file and it, in turn, made system calls to read bytes from the open file. In this case the waiting process will be suspended and another, more deserving process will be chosen to run.
Processes are always making system calls and so may often need to wait. Even so, if a process executes until it waits then it still might use a disproportionate amount of CPU time and so Linux uses pre-emptive scheduling. In this scheme, each process is allowed to run for a small amount of time, 200ms, and, when this time has expired another process is selected to run and the original process is made to wait for a little while until it can run again. This small amount of time is known as a time-slice.
See schedule() in kernel/sched.c
It is the scheduler that must select the most deserving process to run out of all of the runnable processes in the system. A runnable process is one which is waiting only for a CPU to run on. Linux uses a reasonably simple priority based scheduling algorithm to choose between the current processes in the system. When it has chosen a new process to run it saves the state of the current process, the processor specific registers and other context being saved in the process's task_struct data structure. It then restores the state of the new process (again this is processor specific) to run and gives control of the system to that process. For the scheduler to fairly allocate CPU time between the runnable processes in the system it keeps information in the task_struct for each process:
policy This is the scheduling policy that will be applied to this process. There are two types of Linux process, normal and real time. Real time processes have a higher priority than all of the other processes. If there is a real time process ready to run, it will always run first. Real time processes may have two types of policy, round robin and first in first out. In round robin scheduling, each runnable real time process is run in turn and in first in, first out scheduling each runnable process is run in the order that it is in on the run queue and that order is never changed.
priority This is the priority that the scheduler will give to this process. It is also the amount of time (in jiffies) that this process will run for when it is allowed to run. You can alter the priority of a process by means of system calls and the renice command.
rt_priority Linux supports real time processes and these are scheduled to have a higher priority than all of the other non-real time processes in the system. This field allows the scheduler to give each real time process a relative priority. The priority of a real time process can be altered using system calls.
counter This is the amount of time (in jiffies) that this process is allowed to run for. It is set to priority when the process is first run and is decremented each clock tick.
See schedule() in kernel/sched.c
The scheduler is run from several places within the kernel. It is run after putting the current process onto a wait queue and it may also be run at the end of a system call, just before a process is returned to process mode from system mode. One reason that it might need to run is because the system timer has just set the current process's counter to zero. Each time the scheduler is run it does the following:
kernel work The scheduler runs the bottom half handlers and processes the scheduler task queue. These lightweight kernel threads are described in detail in chapter 11.
Current process The current process must be processed before another process can be selected to run.

If the scheduling policy of the current process is round robin then it is put onto the back of the run queue.

If the task is INTERRUPTIBLE and it has received a signal since the last time it was scheduled then its state becomes RUNNING.

If the current process has timed out, then its state becomes RUNNING.

If the current process is RUNNING then it will remain in that state.

Processes that were neither RUNNING nor INTERRUPTIBLE are removed from the run queue. This means that they will not be considered for running when the scheduler looks for the most deserving process to run.
Process selection The scheduler looks through the processes on the run queue looking for the most deserving process to run. If there are any real time processes (those with a real time scheduling policy) then those will get a higher weighting than ordinary processes. The weight for a normal process is its counter but for a real time process it is counter plus 1000. This means that if there are any runnable real time processes in the system then these will always be run before any normal runnable processes. The current process, which has consumed some of its time-slice (its counter has been decremented), is at a disadvantage if there are other processes with equal priority in the system; that is as it should be. If several processes have the same priority, the one nearest the front of the run queue is chosen. The current process will get put onto the back of the run queue. In a balanced system with many processes of the same priority, each one will run in turn. This is known as Round Robin scheduling. However, as processes wait for resources, their run order tends to get moved around.
Swap processes If the most deserving process to run is not the current process, then the current process must be suspended and the new one made to run. When a process is running it is using the registers and physical memory of the CPU and of the system. Each time it calls a routine it passes its arguments in registers and may stack saved values such as the address to return to in the calling routine. So, when the scheduler is running it is running in the context of the current process. It will be in a privileged mode, kernel mode, but it is still the current process that is running. When that process comes to be suspended, all of its machine state, including the program counter (PC) and all of the processor's registers, must be saved in the process's task_struct data structure. Then, all of the machine state for the new process must be loaded. This is a system dependent operation, no CPUs do this in quite the same way but there is usually some hardware assistance for this act.
This swapping of process context takes place at the end of the scheduler. The saved context for the previous process is, therefore, a snapshot of the hardware context of the system as it was for this process at the end of the scheduler. Equally, when the context of the new process is loaded, it too will be a snapshot of the way things were at the end of the scheduler, including this process's program counter and register contents.
If the previous process or the new current process uses virtual memory then the system's page table entries may need to be updated. Again, this action is architecture specific. Processors like the Alpha AXP, which use Translation Look-aside Tables or cached Page Table Entries, must flush those cached table entries that belonged to the previous process.
[Figure 4.1: A process's files — the task_struct points to an fs_struct (count, umask 0x022, *root, *pwd) and a files_struct (count, close_on_exec, open_fs, fd[0]..fd[255]); each fd entry points to a file structure (f_mode, f_pos, f_flags, f_count, f_owner, f_inode, f_op, f_version) whose f_inode points to a VFS inode and whose f_op points to the file operation routines]
4.4 Files
See include/linux/sched.h
Figure 4.1 shows that there are two data structures that describe file system specific information for each process in the system. The first, the fs_struct, contains pointers to this process's VFS inodes and its umask. The umask is the default mode that new files will be created in, and it can be changed via system calls.
The second data structure, the files_struct, contains information about all of the files that this process is currently using. Programs read from standard input and write to standard output. Any error messages should go to standard error. These may be files, terminal input/output or a real device but so far as the program is concerned they are all treated as files. Every file has its own descriptor and the files_struct contains pointers to up to 256 file data structures, each one describing a file being used by this process. The f_mode field describes what mode the file has been created in; read only, read and write or write only. f_pos holds the position in the file where the next read or write operation will occur. f_inode points at the VFS inode describing the file and f_ops is a pointer to a vector of routine addresses; one for each function that you might wish to perform on a file. There is, for example, a write data function. This abstraction of the interface is very powerful and allows Linux to support a wide variety of file types. In Linux, pipes are implemented using this mechanism as we shall see later.
Every time a file is opened, one of the free file pointers in the files_struct is used to point to the new file structure. Linux processes expect three file descriptors to be open when they start. These are known as standard input, standard output and standard error and they are usually inherited from the creating parent process. All accesses to files are via standard system calls which pass or return file descriptors. These descriptors are indices into the process's fd vector, so standard input, standard output and standard error have file descriptors 0, 1 and 2. Each access to the file uses the file data structure's file operation routines together with the VFS inode to achieve its needs.
[Figure 4.2: A process's virtual memory — the task_struct's mm pointer leads to an mm_struct (count, pgd, mmap, mmap_avl, mmap_sem) whose vm_area_struct list describes the data area at 0x8059BB8 and the code area at 0x8048000, each with vm_start, vm_end, vm_flags, vm_inode, vm_ops and vm_next]
of the system. To speed up this access, Linux also arranges the vm_area_struct data structures into an AVL (Adelson-Velskii and Landis) tree. This tree is arranged so that each vm_area_struct (or node) has a left and a right pointer to its neighbouring vm_area_struct structure. The left pointer points to a node with a lower starting virtual address and the right pointer points to a node with a higher starting virtual address. To find the correct node, Linux goes to the root of the tree and follows each node's left and right pointers until it finds the right vm_area_struct. Of course, nothing is for free and inserting a new vm_area_struct into this tree takes additional processing time.
When a process allocates virtual memory, Linux does not actually reserve physical memory for the process. Instead, it describes the virtual memory by creating a new vm_area_struct data structure. This is linked into the process's list of virtual memory. When the process attempts to write to a virtual address within that new virtual memory region then the system will page fault. The processor will attempt to decode the virtual address, but as there are no Page Table Entries for any of this memory, it will give up and raise a page fault exception, leaving the Linux kernel to fix things up. Linux looks to see if the virtual address referenced is in the current process's virtual address space. If it is, Linux creates the appropriate PTEs and allocates a physical page of memory for this process. The code or data may need to be brought into that physical page from the filesystem or from the swap disk. The process can then be restarted at the instruction that caused the page fault and, this time as the memory physically exists, it may continue.
See do_fork() in kernel/fork.c
pages for the cloned process's stacks (user and kernel). A new process identifier may be created, one that is unique within the set of process identifiers in the system. However, it is perfectly reasonable for the cloned process to keep its parent's process identifier. The new task_struct is entered into the task vector and the contents of the old (current) process's task_struct are copied into the cloned task_struct.
When cloning processes Linux allows the two processes to share resources rather than have two separate copies. This applies to the process's files, signal handlers and virtual memory. When the resources are to be shared their respective count fields are incremented so that Linux will not deallocate these resources until both processes have finished using them. So, for example, if the cloned process is to share virtual memory, its task_struct will contain a pointer to the mm_struct of the original process and that mm_struct has its count field incremented to show the number of current processes sharing it.
Cloning a process's virtual memory is rather tricky. A new set of vm_area_struct data structures must be generated together with their owning mm_struct data structure and the cloned process's page tables. None of the process's virtual memory is copied at this point. That would be a rather difficult and lengthy task for some of that virtual memory would be in physical memory, some in the executable image that the process is currently executing and possibly some would be in the swap file. Instead Linux uses a technique called "copy on write" which means that virtual memory will only be copied when one of the two processes tries to write to it. Any virtual memory that is not written to, even if it can be, will be shared between the two processes without any harm occurring. The read only memory, for example the executable code, will always be shared. For "copy on write" to work, the writeable areas have their page table entries marked as read only and the vm_area_struct data structures describing them are marked as "copy on write". When one of the processes attempts to write to this virtual memory a page fault will occur. It is at this point that Linux will make a copy of the memory and fix up the two processes' page tables and virtual memory data structures.
See kernel/itimer.c
The kernel keeps track of a process's creation time as well as the CPU time that it consumes during its lifetime. Each clock tick, the kernel updates the amount of time in jiffies that the current process has spent in system and in user mode.
In addition to these accounting timers, Linux supports process specific interval timers. A process can use these timers to send itself various signals each time that they expire. Three sorts of interval timers are supported:

Real the timer ticks in real time, and when the timer has expired, the process is sent a SIGALRM signal.

Virtual This timer only ticks when the process is running and when it expires it sends a SIGVTALRM signal.

Profile This timer ticks both when the process is running and when the system is executing on behalf of the process itself. SIGPROF is signalled when it expires.
One or all of the interval timers may be running and Linux keeps all of the necessary information in the process's task_struct data structure. System calls can be made
[Figure 4.3: Registered binary formats — the formats list links linux_binfmt structures, each with next, use_count and pointers to load_binary(), load_shlib() and core_dump() routines]
of a nut, the kernel is the edible bit in the middle and the shell goes around it, providing an interface.
See do_it_prof() in kernel/sched.c
See it_real_fn() in kernel/itimer.c
[Figure 4.4: ELF executable file format — the header (e_ident "ELF", e_entry 0x8048090, e_phoff 52, e_phentsize 32, e_phnum 2) is followed by two physical headers: code (p_type PT_LOAD, p_offset 0, p_vaddr 0x8048000, p_filesz 68532, p_memsz 68532, p_flags PF_R, PF_X) and data (p_type PT_LOAD, p_offset 68536, p_vaddr 0x8059BB8, p_filesz 2200, p_memsz 4248, p_flags PF_R, PF_W)]
As with file systems, the binary formats supported by Linux are either built into the kernel at kernel build time or available to be loaded as modules. The kernel keeps a list of supported binary formats (see figure 4.3) and when an attempt is made to execute a file, each binary format is tried in turn until one works. Commonly supported Linux binary formats are a.out and ELF. Executable files do not have to be read completely into memory, a technique known as demand loading is used. As each part of the executable image is used by a process it is brought into memory. Unused parts of the image may be discarded from memory.
4.8.1 ELF
See include/linux/elf.h
The ELF (Executable and Linkable Format) object file format, designed by the Unix System Laboratories, is now firmly established as the most commonly used format in Linux. Whilst there is a slight performance overhead when compared with other object file formats such as ECOFF and a.out, ELF is felt to be more flexible. ELF executable files contain executable code, sometimes referred to as text, and data. Tables within the executable image describe how the program should be placed into the process's virtual memory. Statically linked images are built by the linker (ld), or link editor, into one single image containing all of the code and data needed to run this image. The image also specifies the layout in memory of this image and the address in the image of the first code to execute.
Figure 4.4 shows the layout of a statically linked ELF executable image. It is a simple C program that prints "hello world" and then exits. The header describes it as an ELF image with two physical headers (e_phnum is 2) starting 52 bytes (e_phoff) from the start of the image file. The first physical header describes the executable code in the image. It goes at virtual address 0x8048000 and there is 68532 bytes of it. This is because it is a statically linked image which contains all of the library code for the printf() call to output "hello world". The entry point for the image, the first instruction for the program, is not at the start of the image but at virtual address 0x8048090 (e_entry). The code starts immediately after the second physical header. This physical header describes the data for the program and is to be loaded into virtual memory at address 0x8059BB8. This data is both readable and writeable. You will notice that the size of the data in the file is 2200 bytes (p_filesz) whereas its size in memory is 4248 bytes. This is because the first 2200 bytes contain pre-initialized data and the next 2048 bytes contain data that will be initialized by the executing code.
When Linux loads an ELF executable image into the process's virtual address space, it does not actually load the image. It sets up the virtual memory data structures, the process's vm_area_struct tree and its page tables. When the program is executed page faults will cause the program's code and data to be fetched into physical memory. Unused portions of the program will never be loaded into memory. Once the ELF binary format loader is satisfied that the image is a valid ELF executable image it flushes the process's current executable image from its virtual memory. As this process is a cloned image (all processes are) this, old, image is the program that the parent process was executing, for example the command interpreter shell such as bash. This flushing of the old executable image discards the old virtual memory data structures and resets the process's page tables. It also clears away any signal handlers that were set up and closes any files that are open. At the end of the flush the process is ready for the new executable image. No matter what format the executable image is, the same information gets set up in the process's mm_struct. There are pointers to the start and end of the image's code and data. These values are found as the ELF executable image's physical headers are read and the sections of the program that they describe are mapped into the process's virtual address space. That is also when the vm_area_struct data structures are set up and the process's page tables are modified. The mm_struct data structure also contains pointers to the parameters to be passed to the program and to this process's environment variables.
See do_load_elf_binary() in fs/binfmt_elf.c
See do_load_script() in fs/binfmt_script.c
#!/usr/bin/wish
The script binary loader tries to find the interpreter for the script. It does this by attempting to open the executable file that is named in the first line of the script. If it can open it, it has a pointer to its VFS inode and it can go ahead and have it interpret the script file. The name of the script file becomes argument zero (the first argument) and all of the other arguments move up one place (the original first argument becomes the new second argument and so on). Loading the interpreter is done in the same way as Linux loads all of its executable files. Linux tries each binary format in turn until one works. This means that you could in theory stack several interpreters and binary formats making the Linux binary format handler a very flexible piece of software.
Chapter 5
Processes communicate with each other and with the kernel to coordinate their activities. Linux supports a number of Inter-Process Communication (IPC) mechanisms. Signals and pipes are two of them but Linux also supports the System V IPC mechanisms named after the UnixTM release in which they first appeared.
5.1 Signals
Signals are one of the oldest inter-process communication methods used by UnixTM systems. They are used to signal asynchronous events to one or more processes. A signal could be generated by a keyboard interrupt or an error condition such as the process attempting to access a non-existent location in its virtual memory. Signals are also used by the shells to signal job control commands to their child processes.

There are a set of defined signals that the kernel can generate or that can be generated by other processes in the system, provided that they have the correct privileges. You can list a system's set of signals using the kill command (kill -l); on my Intel Linux box this gives:
 1) SIGHUP      2) SIGINT      3) SIGQUIT     4) SIGILL
 5) SIGTRAP     6) SIGIOT      7) SIGBUS      8) SIGFPE
 9) SIGKILL    10) SIGUSR1    11) SIGSEGV    12) SIGUSR2
13) SIGPIPE    14) SIGALRM    15) SIGTERM    17) SIGCHLD
18) SIGCONT    19) SIGSTOP    20) SIGTSTP    21) SIGTTIN
22) SIGTTOU    23) SIGURG     24) SIGXCPU    25) SIGXFSZ
26) SIGVTALRM  27) SIGPROF    28) SIGWINCH   29) SIGIO
30) SIGPWR
The numbers are different for an Alpha AXP Linux box. Processes can choose to ignore most of the signals that are generated, with two notable exceptions: neither the SIGSTOP signal which causes a process to halt its execution nor the SIGKILL signal which causes a process to exit can be ignored. Otherwise though, a process can choose just how it wants to handle the various signals. Processes can block the signals and, if they do not block them, they can either choose to handle them themselves or allow the kernel to handle them. If the kernel handles the signals, it will do the default actions required for this signal. For example, the default action when a process receives the SIGFPE (floating point exception) signal is to core dump and then exit. Signals have no inherent relative priorities. If two signals are generated for a process at the same time then they may be presented to the process or handled in any order. Also there is no mechanism for handling multiple signals of the same kind. There is no way that a process can tell if it received 1 or 42 SIGCONT signals.
Linux implements signals using information stored in the task_struct for the process. The number of supported signals is limited to the word size of the processor. Processors with a word size of 32 bits can have 32 signals whereas 64 bit processors like the Alpha AXP may have up to 64 signals. The currently pending signals are kept in the signal field with a mask of blocked signals held in blocked. With the exception of SIGSTOP and SIGKILL, all signals can be blocked. If a blocked signal is generated, it remains pending until it is unblocked. Linux also holds information about how each process handles every possible signal and this is held in an array of sigaction data structures pointed at by the task_struct for each process. Amongst other things each contains either the address of a routine that will handle the signal or a flag which tells Linux that the process either wishes to ignore this signal or let the kernel handle the signal for it. The process modifies the default signal handling by making system calls and these calls alter the sigaction for the appropriate signal as well as the blocked mask.
Not every process in the system can send signals to every other process; the kernel can and super users can. Normal processes can only send signals to processes with the same uid and gid or to processes in the same process group. Signals are generated by setting the appropriate bit in the task_struct's signal field. If the process has not blocked the signal and is waiting but interruptible (in state Interruptible) then it is woken up by changing its state to Running and making sure that it is in the run queue. That way the scheduler will consider it a candidate for running when the system next schedules. If the default handling is needed, then Linux can optimize the handling of the signal. For example if the signal SIGWINCH (the X window changed focus) is generated and the default handler is being used then there is nothing to be done.

Signals are not presented to the process immediately they are generated; they must wait until the process is running again. Every time a process exits from a system call its signal and blocked fields are checked and, if there are any unblocked signals, they can now be delivered. This might seem a very unreliable method but every process in the system is making system calls, for example to write a character to the terminal, all of the time. Processes can elect to wait for signals if they wish; they are suspended in state Interruptible until a signal is presented. The Linux signal processing code looks at the sigaction structure for each of the current unblocked signals.

If a signal's handler is set to the default action then the kernel will handle it. The
SIGSTOP signal's default handler will change the current process's state to Stopped and then run the scheduler to select a new process to run. The default action for the SIGFPE signal will core dump the process and then cause it to exit. Alternatively, the process may have specified its own signal handler. This is a routine which will be called whenever the signal is generated and the sigaction structure holds the address of this routine. The kernel must call the process's signal handling routine and how this happens is processor specific, but all CPUs must cope with the fact that the current process is running in kernel mode and is just about to return to the process that called the kernel or system routine in user mode. The problem is solved by manipulating the stack and registers of the process. The process's program counter is set to the address of its signal handling routine and the parameters to the routine are added to the call frame or passed in registers. When the process resumes operation it appears as if the signal handling routine were called normally.
Linux is POSIX compatible and so the process can specify which signals are blocked when a particular signal handling routine is called. This means changing the blocked mask during the call to the process's signal handler. The blocked mask must be returned to its original value when the signal handling routine has finished. Therefore Linux adds a call to a tidy up routine which will restore the original blocked mask onto the call stack of the signalled process. Linux also optimizes the case where several signal handling routines need to be called by stacking them so that each time one handling routine exits, the next one is called until the tidy up routine is called.
5.2 Pipes
The common Linux shells all allow redirection. For example

$ ls | pr | lpr

pipes the output from the ls command listing the directory's files into the standard input of the pr command which paginates them. Finally the standard output from the pr command is piped into the standard input of the lpr command which prints the results on the default printer. Pipes then are unidirectional byte streams which connect the standard output from one process into the standard input of another process. Neither process is aware of this redirection and behaves just as it would normally. It is the shell which sets up these temporary pipes between the processes.
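The same plumbing the shell uses is available directly through the pipe(2) system call; a minimal sketch (the helper name is made up):

```c
#include <string.h>
#include <unistd.h>

/* Create a pipe, push a message in at the write end and read it back
 * from the read end; returns the number of bytes read, or -1. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outlen)
{
    int fd[2];                   /* fd[0]: read end, fd[1]: write end */

    if (pipe(fd) < 0)
        return -1;
    write(fd[1], msg, strlen(msg));       /* bytes go into the pipe */
    ssize_t n = read(fd[0], out, outlen); /* and come out the other end */
    close(fd[0]);
    close(fd[1]);
    return n;
}
```

A shell builds a pipeline by creating such a pipe, then making the writer's standard output and the reader's standard input refer to its two file descriptors before exec-ing the commands.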
In Linux, a pipe is implemented using two file data structures which both point at the same temporary VFS inode which itself points at a physical page within memory. Figure 5.1 shows that each file data structure contains pointers to different file operation routine vectors; one for writing to the pipe, the other for reading from the pipe. This hides the underlying differences from the generic system calls which read and write to ordinary files. As the writing process writes to the pipe, bytes are copied into the shared data page and when the reading process reads from the pipe, bytes are copied from the shared data page. Linux must synchronize access to the pipe. It must make sure that the reader and the writer of the pipe are in step and to do this it uses locks, wait queues and signals.
When the writer wants to write to the pipe it uses the standard write library functions. These all pass file descriptors that are indices into the process's set of file

See include/linux/inode_fs_i.h
Figure 5.1: Pipes (two file data structures, each with its own pipe read or pipe write operation vector, both pointing at the same inode and shared data page)
See pipe_write() in fs/pipe.c
data structures, each one representing an open file or, as in this case, an open pipe. The Linux system call uses the write routine pointed at by the file data structure describing this pipe. That write routine uses information held in the VFS inode representing the pipe to manage the write request. If there is enough room to write all of the bytes into the pipe and, so long as the pipe is not locked by its reader, Linux locks it for the writer and copies the bytes to be written from the process's address space into the shared data page. If the pipe is locked by the reader or if there is not enough room for the data then the current process is made to sleep on the pipe inode's wait queue and the scheduler is called so that another process can run. It is interruptible, so it can receive signals and it will be woken by the reader when there is enough room for the write data or when the pipe is unlocked. When the data has been written, the pipe's VFS inode is unlocked and any waiting readers sleeping on the inode's wait queue will themselves be woken up.
Reading data from the pipe is a very similar process to writing to it. Processes are allowed to do non-blocking reads (it depends on the mode in which they opened the file or pipe) and, in this case, if there is no data to be read or if the pipe is locked, an error will be returned. This means that the process can continue to run. The alternative is to wait on the pipe inode's wait queue until the write process has finished. When both processes have finished with the pipe, the pipe inode is discarded along with the shared data page.
Linux also supports named pipes, also known as FIFOs because pipes operate on a First In, First Out principle. The first data written into the pipe is the first data read from the pipe. Unlike pipes, FIFOs are not temporary objects; they are entities in the file system and can be created using the mkfifo command. Processes are free to use a FIFO so long as they have appropriate access rights to it. The way that FIFOs are opened is a little different from pipes. A pipe (its two file data structures, its VFS inode and the shared data page) is created in one go whereas a FIFO already exists and is opened and closed by its users. Linux must handle readers opening the FIFO before writers open it as well as readers reading before any writers have written to it. That aside, FIFOs are handled almost exactly the same way as pipes and they use the same data structures and operations.
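A FIFO's life cycle can be seen from user space with mkfifo(3); a sketch (the path and function name are arbitrary, and opening with O_RDWR so one process holds both ends is a Linux convenience rather than portable POSIX behaviour):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a named pipe, write a byte through it, read it back, then
 * remove the file system entity again; returns the byte read. */
int fifo_roundtrip(const char *path)
{
    char c = 0;

    mkfifo(path, 0600);           /* the FIFO now exists in the file system */
    int fd = open(path, O_RDWR);  /* both ends at once, so open won't block */
    write(fd, "x", 1);
    read(fd, &c, 1);
    close(fd);
    unlink(path);                 /* unlike a pipe, it must be deleted */
    return c;
}
```

In the usual case a reader open()s the FIFO and blocks until a writer opens the other end, which is exactly the open-ordering problem the kernel has to handle.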
5.3 Sockets

REVIEW NOTE: Add when networking chapter written.

See include/linux/ipc.h
See include/linux/msg.h
Figure 5.2: System V IPC Message Queues (a msqid_ds data structure with its read and write wait queues and its linked list of msg messages, each holding a type, size and pointer to the message data)
5.3.3 Semaphores
In its simplest form a semaphore is a location in memory whose value can be tested and set by more than one process. The test and set operation is, so far as each process is concerned, uninterruptible or atomic; once started nothing can stop it. The result of the test and set operation is the addition of the current value of the semaphore and the set value, which can be positive or negative. Depending on the result of the test and set operation one process may have to sleep until the semaphore's value is changed by another process. Semaphores can be used to implement critical regions, areas of critical code that only one process at a time should be executing.
Say you had many cooperating processes reading records from and writing records to a single data file. You would want that file access to be strictly coordinated. You could use a semaphore with an initial value of 1 and, around the file operating code, put two semaphore operations, the first to test and decrement the semaphore's value and the second to test and increment it. The first process to access the file would try to decrement the semaphore's value and it would succeed, the semaphore's value now being 0. This process can now go ahead and use the data file but if another process wishing to use it now tries to decrement the semaphore's value it would fail as the result would be -1. That process will be suspended until the first process has finished with the data file.
Figure 5.3: System V IPC Semaphores (a semid_ds data structure with its array of semaphores, its sem_pending queue of sem_queue data structures and its list of sem_undo adjustments)
See include/linux/sem.h
each member of the operations pending queue (sem_pending) in turn, testing to see if the semaphore operations will succeed this time. If they will then it removes the sem_queue data structure from the operations pending list and applies the semaphore operations to the semaphore array. It wakes up the sleeping process, making it available to be restarted the next time the scheduler runs. Linux keeps looking through the pending list from the start until there is a pass where no semaphore operations can be applied and so no more processes can be woken.
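The user-space face of these operations is the System V semget(2)/semop(2) interface; a sketch of the classic decrement/increment pair around a critical region (the helper names are made up):

```c
#include <sys/ipc.h>
#include <sys/sem.h>

/* glibc leaves this union for the caller of semctl() to define. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

/* "Test and decrement": sleeps on the pending queue if the semaphore's
 * value would go negative. */
void sem_enter(int id)
{
    struct sembuf op = { 0, -1, 0 };
    semop(id, &op, 1);
}

/* "Test and increment": may wake a process sleeping on the queue. */
void sem_leave(int id)
{
    struct sembuf op = { 0, +1, 0 };
    semop(id, &op, 1);
}
```

With the semaphore initialised to 1 via semctl(id, 0, SETVAL, ...), sem_enter() and sem_leave() bracket a critical region exactly as in the data file example above.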
There is a problem with semaphores: deadlocks. These occur when one process has altered the semaphore's value as it enters a critical region but then fails to leave the critical region because it crashed or was killed. Linux protects against this by maintaining lists of adjustments to the semaphore arrays. The idea is that when these adjustments are applied, the semaphores will be put back to the state that they were in before the process's set of semaphore operations were applied. These adjustments are kept in sem_undo data structures queued both on the semid_ds data structure and on the task_struct data structure for the processes using these semaphore arrays.
Each individual semaphore operation may request that an adjustment be maintained. Linux will maintain at most one sem_undo data structure per process for each semaphore array. If the requesting process does not have one, then one is created when it is needed. The new sem_undo data structure is queued both onto this process's task_struct data structure and onto the semaphore array's semid_ds data structure. As operations are applied to the semaphores in the semaphore array the negation of the operation value is added to this semaphore's entry in the adjustment array of this process's sem_undo data structure. So, if the operation value is 2, then -2 is added to the adjustment entry for this semaphore.
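This adjustment machinery is what the SEM_UNDO flag to semop(2) requests. The sketch below (function name made up) "crashes" a child inside the critical region and relies on the kernel applying the recorded adjustment when the child exits:

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

union semun_arg { int val; struct semid_ds *buf; unsigned short *array; };

/* Fork a child that decrements semaphore 0 with SEM_UNDO and then exits
 * without ever incrementing it again; as the child exits the kernel
 * applies the recorded +1 adjustment from its sem_undo structure.
 * Returns the semaphore's value afterwards. */
int value_after_crash(int id)
{
    if (fork() == 0) {
        struct sembuf op = { 0, -1, SEM_UNDO };
        semop(id, &op, 1);     /* adjustment entry becomes +1 */
        _exit(0);              /* "crash" while holding the semaphore */
    }
    wait(NULL);
    union semun_arg unused = { 0 };
    return semctl(id, 0, GETVAL, unused);
}
```

Without SEM_UNDO the semaphore would be left at 0 and every later sem_enter() would deadlock.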
When processes are deleted, as they exit Linux works through their set of sem_undo data structures applying the adjustments to the semaphore arrays. If a semaphore set is deleted, the sem_undo data structures are left queued on the process's task_struct but the semaphore array identifier is made invalid. In this case the semaphore clean up code simply discards the sem_undo data structure.

See include/linux/sem.h
5.3.4 Shared Memory

Each newly created shared memory area is represented by a shmid_ds data structure. These are kept in the shm_segs vector. The shmid_ds data structure describes how big the area of shared memory is, how many processes are using it and information about how that shared memory is mapped into their address spaces. It is the creator of the shared memory that controls the access permissions to that memory and whether its key is public or private. If it has enough access rights it may also lock the shared memory into physical memory.
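From user space this corresponds to the shmget(2) and shmat(2) system calls; a sketch (the helper name is made up):

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a private shared memory area, attach it, copy a string into
 * the attached address, read it back out, then detach and mark the
 * area for deletion.  Returns 0 on success. */
int shm_roundtrip(const char *msg, char *out, size_t outlen)
{
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id < 0)
        return -1;
    char *addr = shmat(id, NULL, 0);   /* map into this address space */
    strncpy(addr, msg, outlen);
    strncpy(out, addr, outlen);
    shmdt(addr);                       /* detach the area */
    shmctl(id, IPC_RMID, NULL);        /* destroyed when last user detaches */
    return 0;
}
```

Two cooperating processes would pass a common key to shmget() instead of IPC_PRIVATE; each shmat() adds another vm_area_struct to the area's list of attachers.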
Figure 5.4: System V IPC Shared Memory (a shmid_ds data structure with its size, page count and page table entries, and the vm_next_shared list of vm_area_struct attachments)
How that shared memory is swapped into and out of physical memory is described in Chapter 3.
Chapter 6

PCI

Peripheral Component Interconnect (PCI), as its name implies, is a standard that describes how to connect the peripheral components of a system together in a structured and controlled way. The standard [3, PCI Local Bus Specification] describes the way that the system components are electrically connected and the way that they should behave. This chapter looks at how the Linux kernel initializes the system's PCI buses and devices.
Figure 6.1 is a logical diagram of an example PCI based system. The PCI buses and PCI-PCI bridges are the glue connecting the system components together; the CPU is connected to PCI bus 0, the primary PCI bus, as is the video device. A special PCI device, a PCI-PCI bridge, connects the primary bus to the secondary PCI bus, PCI bus 1. In the jargon of the PCI specification, PCI bus 1 is described as being downstream of the PCI-PCI bridge and PCI bus 0 is upstream of the bridge. Connected to the secondary PCI bus are the SCSI and ethernet devices for the system. Physically the bridge, secondary PCI bus and two devices would all be contained on the same combination PCI card. The PCI-ISA bridge in the system supports older, legacy ISA devices and the diagram shows a super I/O controller chip, which controls the keyboard, mouse and floppy.
Figure 6.1: Example PCI Based System (the CPU and video device on PCI bus 0; a PCI-PCI bridge leading downstream to PCI bus 1 with the SCSI and ethernet devices; a PCI-ISA bridge leading to the ISA bus)
Figure 6.2: The PCI Configuration Header (Vendor Id and Device Id at 00h, Command and Status at 04h, Class Code at 08h, Base Address Registers from 10h to 24h, Interrupt Pin and Interrupt Line at 3Ch)
Vendor Identification A unique number describing the originator of the PCI device. Digital's PCI Vendor Identification is 0x1011 and Intel's is 0x8086.

Device Identification A unique number describing the device itself. For example, Digital's 21141 fast ethernet device has a device identification of 0x0009.

Status This field gives the status of the device, with the meaning of the bits of this field set by the standard [3, PCI Local Bus Specification].

Command By writing to this field the system controls the device, for example allowing the device to access PCI I/O memory.

Class Code This identifies the type of device that this is. There are standard classes for every sort of device; video, SCSI and so on. The class code for SCSI is 0x0100.

Base Address Registers These registers are used to determine and allocate the type, amount and location of PCI I/O and PCI memory space that the device can use.

Interrupt Pin Four of the physical pins on the PCI card carry interrupts from the card to the PCI bus. The standard labels these as A, B, C and D. The Interrupt Pin field describes which of these pins this PCI device uses. Generally it is hardwired for a particular device. That is, every time the system boots, the device uses the same interrupt pin. This information allows the interrupt handling subsystem to manage interrupts from this device.

Interrupt Line The Interrupt Line field of the device's PCI Configuration header is used to pass an interrupt handle between the PCI initialisation code, the device's driver and Linux's interrupt handling subsystem. The number written there is meaningless to the device driver but it allows the interrupt handler to correctly route an interrupt from the PCI device to the correct device driver's interrupt handling code within the Linux operating system. See Chapter 7 on page 75 for details on how Linux handles interrupts.
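The layout of Figure 6.2 can be sketched as a C structure. The field names below are illustrative, not the ones used in the Linux sources:

```c
#include <stddef.h>
#include <stdint.h>

/* The first part of a Type 0 PCI Configuration Header, laid out to
 * match the byte offsets shown in Figure 6.2. */
struct pci_config_header {
    uint16_t vendor_id;          /* 00h: e.g. 0x1011 (Digital), 0x8086 (Intel) */
    uint16_t device_id;          /* 02h: e.g. 0x0009 (Digital 21141) */
    uint16_t command;            /* 04h: written to control the device */
    uint16_t status;             /* 06h: status bits defined by the standard */
    uint32_t class_rev;          /* 08h: class code in the upper bytes */
    uint32_t misc;               /* 0Ch: cache line size, latency, header type */
    uint32_t base_address[6];    /* 10h-27h: the Base Address Registers */
    uint8_t  other[0x3C - 0x28]; /* 28h-3Bh: fields not discussed here */
    uint8_t  interrupt_line;     /* 3Ch: routing handle written at init time */
    uint8_t  interrupt_pin;      /* 3Dh: which of pins A..D the card uses */
};
```

Every field is naturally aligned, so the compiler inserts no padding and the offsets match the figure.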
Figure 6.3: Type 0 PCI Configuration Cycle (bits 31:11 Device Select, 10:8 Func, 7:2 Register, 1:0 = 00)

Figure 6.4: Type 1 PCI Configuration Cycle (bits 31:24 Reserved, 23:16 Bus, 15:11 Device, 10:8 Func, 7:2 Register, 1:0 = 01)
these are shown in Figure 6.3 and Figure 6.4 respectively. Type 0 PCI Configuration cycles do not contain a bus number and these are interpreted by all devices as being PCI configuration addresses on this PCI bus. Bits 31:11 of the Type 0 configuration cycles are treated as the device select field. One way to design a system is to have each bit select a different device. In this case bit 11 would select the PCI device in slot 0, bit 12 would select the PCI device in slot 1 and so on. Another way is to write the device's slot number directly into bits 31:11. Which mechanism is used in a system depends on the system's PCI memory controller.
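The two address formats of Figures 6.3 and 6.4 are easy to build by hand; a sketch, assuming the one-bit-per-slot device select scheme described above (function names are made up):

```c
#include <stdint.h>

/* Type 0 cycle: bits 31:11 device select (here one bit per slot, so
 * slot n drives bit 11+n), 10:8 function, 7:2 register, 1:0 = 00. */
uint32_t pci_type0_address(unsigned slot, unsigned func, unsigned reg)
{
    return (1u << (11 + slot)) | (func << 8) | (reg & ~3u);
}

/* Type 1 cycle: bits 23:16 bus, 15:11 device, 10:8 function,
 * 7:2 register, 1:0 = 01. */
uint32_t pci_type1_address(unsigned bus, unsigned dev, unsigned func,
                           unsigned reg)
{
    return (bus << 16) | (dev << 11) | (func << 8) | (reg & ~3u) | 1u;
}
```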
Type 1 PCI Configuration cycles contain a PCI bus number and this type of configuration cycle is ignored by all PCI devices except the PCI-PCI bridges. All of the PCI-PCI Bridges seeing Type 1 configuration cycles may choose to pass them to the PCI buses downstream of themselves. Whether the PCI-PCI Bridge ignores the Type 1 configuration cycle or passes it onto the downstream PCI bus depends on how the PCI-PCI Bridge has been configured. Every PCI-PCI bridge has a primary bus interface number and a secondary bus interface number. The primary bus interface is the one nearest the CPU and the secondary bus interface is the one furthest away. Each PCI-PCI Bridge also has a subordinate bus number and this is the maximum bus number of all the PCI buses that are bridged beyond the secondary bus interface. Or to put it another way, the subordinate bus number is the highest numbered PCI bus downstream of the PCI-PCI bridge. When the PCI-PCI bridge sees a Type 1 PCI configuration cycle it does one of the following things:
Ignore it if the bus number specified is not in between the bridge's secondary bus number and subordinate bus number (inclusive),

Convert it to a Type 0 configuration cycle if the bus number specified matches the bridge's secondary bus number,

Pass it onto the secondary bus interface unchanged if the bus number specified is greater than the secondary bus number and less than or equal to the subordinate bus number.
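These rules amount to a small decision function; a sketch (in real hardware this decoding is done by the bridge itself, and the names here are invented):

```c
enum bridge_action {
    BRIDGE_IGNORE,          /* bus is not behind this bridge */
    BRIDGE_CONVERT_TYPE0,   /* device is on the secondary bus itself */
    BRIDGE_PASS_UNCHANGED   /* bus is further downstream */
};

/* Decide what a PCI-PCI bridge does with a Type 1 configuration cycle
 * addressed to `bus`, given the bridge's secondary and subordinate
 * bus numbers. */
enum bridge_action bridge_decide(unsigned bus, unsigned secondary,
                                 unsigned subordinate)
{
    if (bus < secondary || bus > subordinate)
        return BRIDGE_IGNORE;
    if (bus == secondary)
        return BRIDGE_CONVERT_TYPE0;
    return BRIDGE_PASS_UNCHANGED;
}
```

With the final numbering of Figure 6.9, a cycle for bus 3 is passed unchanged by Bridge1 (secondary 1, subordinate 4), ignored by Bridge2 (secondary 2, subordinate 2) and converted to Type 0 by Bridge3 (secondary 3, subordinate 4).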
So, if we want to address Device 1 on bus 3 of the topology of Figure 6.9 on page 71 we must generate a Type 1 Configuration command from the CPU. Bridge1 passes this unchanged onto Bus 1. Bridge2 ignores it but Bridge3 converts it into a Type 0 Configuration command and sends it out on Bus 3 where Device 1 responds to it.

It is up to each individual operating system to allocate bus numbers during PCI configuration but whatever the numbering scheme used, the following statement must be true for all of the PCI-PCI bridges in the system:

"All PCI buses located behind a PCI-PCI bridge must reside between the secondary bus number and the subordinate bus number (inclusive)."
If this rule is broken then the PCI-PCI Bridges will not pass and translate Type 1 PCI configuration cycles correctly and the system will fail to find and initialise the PCI devices in the system. To achieve this numbering scheme, Linux configures these special devices in a particular order. Section 6.6.2 on page 68 describes Linux's PCI bridge and bus numbering scheme in detail together with a worked example.
Figure 6.5: Linux Kernel PCI Data Structures (pci_root points at the pci_bus for bus 0, whose devices list holds pci_dev data structures for the PCI-ISA bridge, video device and PCI-PCI bridge; a child pci_bus for bus 1 holds pci_dev data structures for the SCSI and ethernet devices)
PCI Device Driver This pseudo-device driver searches the PCI system starting at Bus 0 and locates all PCI devices and bridges in the system. It builds a linked list of data structures describing the topology of the system. Additionally, it numbers all of the bridges that it finds.

PCI BIOS This software layer provides the services described in [4, PCI BIOS ROM specification]. Even though Alpha AXP does not have BIOS services, there is equivalent code in the Linux kernel providing the same functions.

PCI Fixup System specific fixup code tidies up the system specific loose ends of PCI initialization.
The result is a tree structure of PCI buses each of which has a number of child PCI devices attached to it. As a PCI bus can only be reached using a PCI-PCI Bridge (except the primary PCI bus, bus 0), each pci_bus contains a pointer to the PCI device (the PCI-PCI Bridge) that it is accessed through. That PCI device is a child of the PCI Bus's parent PCI bus.

Not shown in Figure 6.5 is a pointer to all of the PCI devices in the system, pci_devices. All of the PCI devices in the system have their pci_dev data structures queued onto this queue. This queue is used by the Linux kernel to quickly find all of the PCI devices in the system.
The PCI device driver is not really a device driver at all but a function of the operating system called at system initialisation time. The PCI initialisation code must scan all of the PCI buses in the system looking for all PCI devices in the system (including PCI-PCI bridge devices). It uses the PCI BIOS code to find out if every possible slot in the current PCI bus that it is scanning is occupied. If the PCI slot is occupied, it builds a pci_dev data structure describing the device and links it into the list of known PCI devices (pointed at by pci_devices).
The PCI initialisation code starts by scanning PCI Bus 0. It tries to read the Vendor Identification and Device Identification fields for every possible PCI device in every possible PCI slot. When it finds an occupied slot it builds a pci_dev data structure describing the device. All of the pci_dev data structures built by the PCI initialisation code (including all of the PCI-PCI Bridges) are linked into a singly linked list; pci_devices.
If the PCI device that was found was a PCI-PCI bridge then a pci_bus data structure is built and linked into the tree of pci_bus and pci_dev data structures pointed at by pci_root. The PCI initialisation code can tell if the PCI device is a PCI-PCI Bridge because it has a class code of 0x060400. The Linux kernel then configures the PCI bus on the other (downstream) side of the PCI-PCI Bridge that it has just found. If more PCI-PCI Bridges are found then these are also configured. This process is known as a depthwise algorithm; the system's PCI topology is fully mapped depthwise before searching breadthwise. Looking at Figure 6.1 on page 62, Linux would configure PCI Bus 1 with its Ethernet and SCSI devices before it configured the video device on PCI Bus 0.
As Linux searches for downstream PCI buses it must also configure the intervening PCI-PCI bridges' secondary and subordinate bus numbers. This is described in detail in Section 6.6.2 below.

Primary Bus Number The bus number immediately upstream of the PCI-PCI Bridge,

Secondary Bus Number The bus number immediately downstream of the PCI-PCI Bridge,
Figure 6.6: Configuring a PCI System: Part 1 (Bridge1 numbered: Primary Bus = 0, Secondary Bus = 1, Subordinate = 0xFF; the buses behind Bridge2, Bridge3 and Bridge4 not yet numbered)
Subordinate Bus Number The highest bus number of all of the buses that can be reached downstream of the bridge,

PCI I/O and PCI Memory Windows The window base and size for PCI I/O address space and PCI Memory address space for all addresses downstream of the PCI-PCI Bridge.
The problem is that at the time when you wish to configure any given PCI-PCI bridge you do not know the subordinate bus number for that bridge. You do not know if there are further PCI-PCI bridges downstream and if you did, you do not know what numbers will be assigned to them. The answer is to use a depthwise recursive algorithm and scan each bus for any PCI-PCI bridges, assigning them numbers as they are found. As each PCI-PCI bridge is found and its secondary bus numbered, assign it a temporary subordinate number of 0xFF and scan and assign numbers to all PCI-PCI bridges downstream of it. This all seems complicated but the worked example below makes this process clearer.
PCI-PCI Bridge Numbering: Step 1 Taking the topology in Figure 6.6, the first bridge the scan would find is Bridge1. The PCI bus downstream of Bridge1 would be numbered as 1 and Bridge1 assigned a secondary bus number of 1 and a temporary subordinate bus number of 0xFF. This means that all Type 1 PCI Configuration addresses specifying a PCI bus number of 1 or higher would be passed across Bridge1 and onto PCI Bus 1. They would be translated into Type 0 Configuration cycles if they have a bus number of 1 but left untranslated for all other bus numbers. This is exactly what the Linux PCI initialisation code needs to do in order to go and scan PCI Bus 1.
Figure 6.7: Configuring a PCI System: Part 2 (Bridge2 numbered: Primary Bus = 1, Secondary Bus = 2, Subordinate = 2)

PCI-PCI Bridge Numbering: Step 2 Linux uses a depthwise algorithm and so the initialisation code goes on to scan PCI Bus 1, where it finds PCI-PCI Bridge2. There are no further PCI-PCI bridges beyond Bridge2, so it is assigned a subordinate bus number of 2 which matches the number assigned to its secondary interface. Figure 6.7 shows how the buses and PCI-PCI bridges are numbered at this point.
PCI-PCI Bridge Numbering: Step 3 The PCI initialisation code returns to scanning PCI Bus 1 and finds another PCI-PCI bridge, Bridge3. It is assigned 1 as its primary bus interface number, 3 as its secondary bus interface number and 0xFF as its subordinate bus number. Figure 6.8 on page 71 shows how the system is configured now. Type 1 PCI configuration cycles with a bus number of 1, 2 or 3 will be correctly delivered to the appropriate PCI buses.
PCI-PCI Bridge Numbering: Step 4 Linux starts scanning PCI Bus 3, downstream of PCI-PCI Bridge3. PCI Bus 3 has another PCI-PCI bridge (Bridge4) on it; it is assigned 3 as its primary bus number and 4 as its secondary bus number. It is the last bridge on this branch and so it is assigned a subordinate bus interface number of 4. The initialisation code returns to PCI-PCI Bridge3 and assigns it a subordinate bus number of 4. Finally, the PCI initialisation code can assign 4 as the subordinate bus number for PCI-PCI Bridge1. Figure 6.9 on page 71 shows the final bus numbers.
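The four steps are an instance of a depthwise recursive numbering pass; a sketch over a toy bridge tree (the data structure is invented for illustration, not taken from the kernel):

```c
#define MAX_CHILD 4

/* A toy PCI-PCI bridge: the child bridges sit on its secondary bus. */
struct bridge {
    struct bridge *child[MAX_CHILD];
    int nchildren;
    int primary, secondary, subordinate;
};

/* Number the bus behind bridge `b`; `next` is the next unused bus
 * number.  Returns the updated next free bus number. */
int assign_bus_numbers(struct bridge *b, int primary, int next)
{
    b->primary = primary;
    b->secondary = next++;      /* the bus immediately downstream */
    b->subordinate = 0xFF;      /* temporary value, as in Step 1 */
    for (int i = 0; i < b->nchildren; i++)
        next = assign_bus_numbers(b->child[i], b->secondary, next);
    b->subordinate = next - 1;  /* highest bus reached downstream */
    return next;
}
```

Run over the topology of Figure 6.6 (Bridge1 leading to Bridge2 and Bridge3, with Bridge4 behind Bridge3), this reproduces the final numbers of Figure 6.9.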
See arch/*/kernel/bios32.c
The PCI BIOS functions are a series of standard routines which are common across all platforms. For example, they are the same for both Intel and Alpha AXP based systems. They allow the CPU controlled access to all of the PCI address spaces. Only Linux kernel code and device drivers may use them.
Figure 6.8: Configuring a PCI System: Part 3 (Bridge3 numbered: Primary Bus = 1, Secondary Bus = 3, Subordinate = 0xFF)
Figure 6.9: Configuring a PCI System: Part 4 (final numbering: Bridge1 Subordinate = 4, Bridge2 Subordinate = 2, Bridge3 Subordinate = 4, Bridge4 Subordinate = 4)
Figure 6.10: PCI Configuration Header: Base Address Registers (one format for PCI Memory space, with prefetchable and Type bits in the low bits, and one for PCI I/O space)
See arch/*/kernel/bios32.c
The PCI fixup code for Alpha AXP does rather more than that for Intel (which basically does nothing). For Intel based systems the system BIOS, which ran at boot time, has already fully configured the PCI system. This leaves Linux with little to do other than map that configuration. For non-Intel based systems further configuration needs to happen to:

Allocate PCI I/O and PCI Memory space to each device,

Configure the PCI I/O and PCI Memory address windows for each PCI-PCI bridge in the system,

Generate Interrupt Line values for the devices; these control interrupt handling for the device.

The next subsections describe how that code works.
Finding Out How Much PCI I/O and PCI Memory Space a Device Needs

Each PCI device found is queried to find out how much PCI I/O and PCI Memory address space it requires. To do this, each Base Address Register has all 1's written to it and then read. The device will return 0's in the don't-care address bits, effectively specifying the address space required.
There are two basic types of Base Address Register; the first indicates within which address space the device's registers must reside, either PCI I/O or PCI Memory space. This is indicated by Bit 0 of the register. Figure 6.10 shows the two forms of the Base Address Register for PCI Memory and for PCI I/O.

To find out just how much of each address space a given Base Address Register is requesting, you write all 1s into the register and then read it back. The device will specify zeros in the don't care address bits, effectively specifying the address space required. This design implies that all address spaces used are a power of two and are naturally aligned.
For example when you initialize the DECChip 21142 PCI Fast Ethernet device, it tells you that it needs 0x100 bytes of space of either PCI I/O or PCI Memory. The initialization code allocates it space. The moment that it allocates space, the 21142's control and status registers can be seen at those addresses.
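Decoding the value read back is a short computation; a sketch for 32-bit Base Address Registers (the function name is made up):

```c
#include <stdint.h>

/* Given the value read back from a Base Address Register after all 1s
 * were written to it, return the size of the address space requested.
 * Bit 0 distinguishes PCI I/O (1) from PCI Memory (0); the low type
 * bits are not part of the address. */
uint32_t bar_size(uint32_t readback)
{
    uint32_t mask = (readback & 1) ? ~3u : ~15u;
    return ~(readback & mask) + 1;   /* don't-care bits came back as 0 */
}
```

For the 21142 above, a Memory BAR reading back 0xFFFFFF00 decodes to a request for 0x100 bytes.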
Allocating PCI I/O and PCI Memory to PCI-PCI Bridges and Devices

Like all memory the PCI I/O and PCI memory spaces are finite, and to some extent scarce. The PCI Fixup code for non-Intel systems (and the BIOS code for Intel systems) has to allocate each device the amount of memory that it is requesting in an efficient manner. Both PCI I/O and PCI Memory must be allocated to a device in a naturally aligned way. For example, if a device asks for 0xB0 of PCI I/O space then it must be aligned on an address that is a multiple of 0xB0. In addition to this, the PCI I/O and PCI Memory bases for any given bridge must be aligned on 4K and on 1Mbyte boundaries respectively. Given that the address spaces for downstream devices must lie within all of the upstream PCI-PCI Bridge's memory ranges for any given device, it is a somewhat difficult problem to allocate space efficiently.
The algorithm that Linux uses relies on each device described by the bus/device tree built by the PCI Device Driver being allocated address space in ascending PCI I/O memory order. Again a recursive algorithm is used to walk the pci_bus and pci_dev data structures built by the PCI initialisation code. Starting at the root PCI bus (pointed at by pci_root) the BIOS fixup code:

Aligns the current global PCI I/O and Memory bases on 4K and 1 Mbyte boundaries respectively,

For every device on the current bus (in ascending PCI I/O memory needs),

- allocates it space in PCI I/O and/or PCI Memory,
- moves on the global PCI I/O and Memory bases by the appropriate amounts,
- enables the device's use of PCI I/O and PCI Memory,

Allocates space recursively to all of the buses downstream of the current bus. Note that this will change the global PCI I/O and Memory bases,

Aligns the current global PCI I/O and Memory bases on 4K and 1 Mbyte boundaries respectively and in doing so figures out the size and base of the PCI I/O and PCI Memory windows required by the current PCI-PCI bridge,

Programs the PCI-PCI bridge that links to this bus with its PCI I/O and PCI Memory bases and limits,

Turns on bridging of PCI I/O and PCI Memory accesses in the PCI-PCI Bridge. This means that any PCI I/O or PCI Memory addresses seen on the Bridge's primary PCI bus that are within its PCI I/O and PCI Memory address windows will be bridged onto its secondary PCI bus.
Taking the PCI system in Figure 6.1 on page 62 as our example, the PCI fixup code would set up the system in the following way:

Align the PCI bases PCI I/O is 0x4000 and PCI Memory is 0x100000. This allows the PCI-ISA bridges to translate all addresses below these into ISA address cycles,

The Video Device This is asking for 0x200000 of PCI Memory and so we allocate it that amount starting at the current PCI Memory base of 0x200000 as it has to be naturally aligned to the size requested. The PCI Memory base is moved to 0x400000 and the PCI I/O base remains at 0x4000.

The PCI-PCI Bridge We now cross the PCI-PCI bridge and allocate PCI memory there. Note that we do not need to align the bases as they are already correctly aligned:

The Ethernet Device This is asking for 0xB0 bytes of both PCI I/O and PCI Memory space. It gets allocated PCI I/O at 0x4000 and PCI Memory at 0x400000. The PCI Memory base is moved to 0x4000B0 and the PCI I/O base to 0x40B0.

The SCSI Device This is asking for 0x1000 PCI Memory and so it is allocated it at 0x401000 after it has been naturally aligned. The PCI I/O base is still 0x40B0 and the PCI Memory base has been moved to 0x402000.

The PCI-PCI Bridge's PCI I/O and Memory Windows We now return to the bridge and set its PCI I/O window at between 0x4000 and 0x40B0 and its PCI Memory window at between 0x400000 and 0x402000. This means that the PCI-PCI bridge will ignore the PCI Memory accesses for the video device and pass them on if they are for the ethernet or SCSI devices.
Chapter 7

Interrupts and Interrupt Handling

This chapter looks at how interrupts are handled by the Linux kernel. Whilst the kernel has generic mechanisms and interfaces for handling interrupts, most of the interrupt handling details are architecture specific.
Linux uses a lot of different pieces of hardware to perform many different tasks. The video device drives the monitor, the IDE device drives the disks and so on. You could drive these devices synchronously, that is you could send a request for some operation (say writing a block of memory out to disk) and then wait for the operation to complete. That method, although it would work, is very inefficient and the operating system would spend a lot of time "busy doing nothing" as it waited for each operation to complete. A better, more efficient, way is to make the request and then do other, more useful work and later be interrupted by the device when it has finished the request. With this scheme, there may be many outstanding requests to the devices in the system all happening at the same time.

There has to be some hardware support for the devices to interrupt whatever the CPU is doing. Most, if not all, general purpose processors such as the Alpha AXP use a similar method. Some of the physical pins of the CPU are wired such that changing the voltage (for example changing it from +5v to -5v) causes the CPU to stop what it is doing and to start executing special code to handle the interruption; the interrupt handling code. One of these pins might be connected to an interval timer and receive an interrupt every 1000th of a second, others may be connected to the other devices in the system, such as the SCSI controller.

Systems often use an interrupt controller to group the device interrupts together before passing on the signal to a single interrupt pin on the CPU. This saves interrupt pins on the CPU and also gives flexibility when designing systems. The interrupt controller has mask and status registers that control the interrupts. Setting the bits in the mask register enables and disables interrupts and the status register returns the currently active interrupts in the system.
[Figure 7.1: An example of interrupt routing — device interrupts (keyboard, serial, sound, floppy, SCSI, ide0, ide1) are routed through two cascaded programmable interrupt controllers (PIC1 and PIC2) to the CPU]
See request_irq(), enable_irq() and disable_irq() in arch/*/kernel/irq.c

See arch/alpha/kernel/bios32.c
the probe. The driver now turns probing off and the unassigned interrupts are all disabled. If the ISA device driver has successfully found its IRQ number then it can now request control of it as normal.
PCI based systems are much more dynamic than ISA based systems. The interrupt pin that an ISA device uses is often set using jumpers on the hardware device and fixed in the device driver. On the other hand, PCI devices have their interrupts allocated by the PCI BIOS or the PCI subsystem as PCI is initialized when the system boots. Each PCI device may use one of four interrupt pins, A, B, C or D. This was fixed when the device was built and most devices default to interrupt on pin A. The PCI interrupt lines A, B, C and D for each PCI slot are routed to the interrupt controller. So, pin A from PCI slot 4 might be routed to pin 6 of the interrupt controller, pin B of PCI slot 4 to pin 7 of the interrupt controller and so on.
How the PCI interrupts are routed is entirely system specific and there must be some set up code which understands this PCI interrupt routing topology. On Intel based PCs this is the system BIOS code that runs at boot time but for systems without BIOS (for example Alpha AXP based systems) the Linux kernel does this setup. The PCI set up code writes the pin number of the interrupt controller into the PCI configuration header for each device. It determines the interrupt pin (or IRQ) number using its knowledge of the PCI interrupt routing topology together with the device's PCI slot number and which PCI interrupt pin it is using. The interrupt pin that a device uses is fixed and is kept in a field in the PCI configuration header for this device. It writes this information into the interrupt line field that is reserved for this purpose. When the device driver runs, it reads this information and uses it to request control of the interrupt from the Linux kernel.
There may be many PCI interrupt sources in the system, for example when PCI-PCI bridges are used. The number of interrupt sources may exceed the number of pins on the system's programmable interrupt controllers. In this case, PCI devices may share interrupts, one pin on the interrupt controller taking interrupts from more than one PCI device. Linux supports this by allowing the first requestor of an interrupt source to declare whether it may be shared. Sharing interrupts results in several irqaction data structures being pointed at by one entry in the irq_action vector.
When a shared interrupt happens, Linux will call all of the interrupt handlers for that source. Any device driver that can share interrupts (which should be all PCI device drivers) must be prepared to have its interrupt handler called when there is no interrupt to be serviced.
The floppy controller is one of the fixed interrupts in a PC system as, by convention, the floppy controller is always wired to interrupt 6.
[Figure 7.2: irq_action — an entry in the irq_action vector points to a chain of irqaction data structures (handler, flags, name, next), each holding the interrupt handling routine for one device]
Chapter 8

Device Drivers

See fs/devices.c
primary IDE disk has a different minor device number. So, /dev/hda2, the second partition of the primary IDE disk, has a major number of 3 and a minor number of 2. Linux maps the device special file passed in system calls (say to mount a file system on a block device) to the device's device driver using the major device number and a number of system tables, for example the character device table, chrdevs.
Linux supports three types of hardware device: character, block and network. Character devices are read and written directly without buffering, for example the system's serial ports /dev/cua0 and /dev/cua1. Block devices can only be written to and read from in multiples of the block size, typically 512 or 1024 bytes. Block devices are accessed via the buffer cache and may be randomly accessed, that is to say, any block can be read or written no matter where it is on the device. Block devices can be accessed via their device special file but more commonly they are accessed via the file system. Only a block device can support a mounted file system. Network devices are accessed via the BSD socket interface and the networking subsystems described in the Networking chapter (Chapter 10).
There are many different device drivers in the Linux kernel (that is one of Linux's strengths) but they all share some common attributes:

Kernel code Device drivers are part of the kernel and, like other code within the kernel, if they go wrong they can seriously damage the system. A badly written driver may even crash the system, possibly corrupting file systems and losing data,

Kernel interfaces Device drivers must provide a standard interface to the Linux kernel or to the subsystem that they are part of. For example, the terminal driver provides a file I/O interface to the Linux kernel and a SCSI device driver provides a SCSI device interface to the SCSI subsystem which, in turn, provides both file I/O and buffer cache interfaces to the kernel.

Kernel mechanisms and services Device drivers make use of standard kernel services such as memory allocation, interrupt delivery and wait queues to operate,

Configurable Linux device drivers can be built into the kernel. Which devices are built in is configurable when the kernel is compiled,

Dynamic As the system boots and each device driver is initialized it looks for the hardware devices that it is controlling. It does not matter if the device being controlled by a particular device driver does not exist. In this case the device driver is simply redundant and causes no harm apart from occupying a little of the system's memory.
command has completed. The device drivers can either poll the device or they can use interrupts.

Polling the device usually means reading its status register every so often until the device's status changes to indicate that it has completed the request. As a device driver is part of the kernel it would be disastrous if a driver were to poll as nothing else in the kernel would run until the device had completed the request. Instead, polling device drivers use system timers to have the kernel call a routine within the device driver at some later time. This timer routine would check the status of the command, and this is exactly how Linux's floppy driver works. Polling by means of timers is at best approximate; a much more efficient method is to use interrupts.
An interrupt driven device driver is one where the hardware device being controlled will raise a hardware interrupt whenever it needs to be serviced. For example, an ethernet device driver would interrupt whenever it receives an ethernet packet from the network. The Linux kernel needs to be able to deliver the interrupt from the hardware device to the correct device driver. This is achieved by the device driver registering its usage of the interrupt with the kernel. It registers the address of an interrupt handling routine and the interrupt number that it wishes to own. You can see which interrupts are being used by the device drivers, as well as how many of each type of interrupt there have been, by looking at /proc/interrupts:
 0:     727432   timer
 1:      20534   keyboard
 2:          0   cascade
 3:      79691   serial
 4:      28258   serial
 5:          1   sound blaster
11:      20868   aic7xxx
13:          1   math error
14:        247   ide0
15:        170   ide1
8.3 Memory

Device drivers have to be careful when using memory. As they are part of the Linux kernel they cannot use virtual memory. Each time a device driver runs, maybe as an interrupt is received or as a bottom half or task queue handler is scheduled, the current process may change. The device driver cannot rely on a particular process running even if it is doing work on its behalf. Like the rest of the kernel, device drivers use data structures to keep track of the devices that they are controlling. These data structures can be statically allocated, part of the device driver's code, but that would be wasteful as it makes the kernel larger than it need be. Most device drivers allocate kernel, non-paged, memory to hold their data.
Linux provides kernel memory allocation and deallocation routines and it is these that the device drivers use. Kernel memory is allocated in chunks that are powers of 2, for example 128 or 512 bytes, even if the device driver asks for less. The number of bytes that the device driver requests is rounded up to the next block size boundary. This makes kernel memory deallocation easier as the smaller free blocks can be recombined into bigger blocks.
It may be that Linux needs to do quite a lot of extra work when the kernel memory is requested. If the amount of free memory is low, physical pages may need to be discarded or written to the swap device. Normally, Linux would suspend the requestor, putting the process onto a wait queue until there is enough physical memory. Not all device drivers (or indeed Linux kernel code) may want this to happen and so the kernel memory allocation routines can be requested to fail if they cannot immediately allocate memory. If the device driver wishes to DMA to or from the allocated memory it can also specify that the memory is DMA'able. This way it is the Linux kernel that needs to understand what constitutes DMA'able memory for this system, and not the device driver.
[Figure 8.1: Character Devices — the chrdevs vector of device_struct entries; each entry holds the registered driver's name and a pointer to its file_operations block (lseek, read, write, readdir, select, ioctl, mmap, open, release, fsync, fasync, check_media_change, revalidate)]

See include/linux/major.h

See fs/ext2/inode.c

See def_chr_fops and chrdev_open() in fs/devices.c
Character devices, the simplest of Linux's devices, are accessed as files; applications use standard system calls to open them, read from them, write to them and close them exactly as if the device were a file. This is true even if the device is a modem being used by the PPP daemon to connect a Linux system onto a network. As a character device is initialized its device driver registers itself with the Linux kernel by adding an entry into the chrdevs vector of device_struct data structures. The device's major device identifier (for example 4 for the tty device) is used as an index into this vector. The major device identifier for a device is fixed. Each entry in the chrdevs vector, a device_struct data structure, contains two elements: a pointer to the name of the registered device driver and a pointer to a block of file operations. This block of file operations is itself the addresses of routines within the character device driver, each of which handles specific file operations such as open, read, write and close. The contents of /proc/devices for character devices is taken from the chrdevs vector.
When a character special file representing a character device (for example /dev/cua0) is opened, the kernel must set things up so that the correct character device driver's file operation routines will be called. Just like an ordinary file or directory, each device special file is represented by a VFS inode. The VFS inode for a character special file, indeed for all device special files, contains both the major and minor identifiers for the device. This VFS inode was created by the underlying filesystem, for example EXT2, from information in the real filesystem when the device special file's name was looked up.

Each VFS inode has associated with it a set of file operations and these are different depending on the filesystem object that the inode represents. Whenever a VFS inode representing a character special file is created, its file operations are set to the default character device operations. This has only one file operation, the open file operation. When the character special file is opened by an application the generic open file operation uses the device's major identifier as an index into the chrdevs vector to retrieve the file operations block for this particular device. It also sets up the file data structure describing this character special file, making its file operations pointer point to those of the device driver. Thereafter all of the application's file operations will be mapped to calls to the character device's set of file operations.
[Figure 8.2: Buffer Cache Block Device Requests — each blk_dev_struct entry in the blk_dev vector holds a request_fn() and a current_request list of request data structures (rq_status, rq_dev, mcd, sem, bh, tail, next); each request points at a chain of buffer_head data structures (b_dev 0x0301, b_blocknr 39, b_state, b_count, b_size 1024, b_next, b_prev, b_data)]

See fs/devices.c

See include/linux/blkdev.h
This unlocking of the buffer_head will wake up any process that has been sleeping waiting for the block operation to complete. An example of this would be where a file name is being resolved and the EXT2 filesystem must read the block of data that contains the next EXT2 directory entry from the block device that holds the filesystem. The process sleeps on the buffer_head that will contain the directory entry until the device driver wakes it up. The request data structure is marked as free so that it can be used in another block request.
This means that it has 1050 cylinders (tracks), 16 heads (8 platters) and 63 sectors per track. With a sector, or block, size of 512 bytes this gives the disk a storage capacity of 529200 Kbytes. This does not match the disk's stated capacity of 516 Mbytes as some of the sectors are used for disk partitioning information. Some disks automatically find bad sectors and re-index the disk to work around them.
Hard disks can be further subdivided into partitions. A partition is a large group of sectors allocated for a particular purpose. Partitioning a disk allows the disk to be used by several operating systems or for several purposes. A lot of Linux systems have a single disk with three partitions: one containing a DOS filesystem, another an EXT2 filesystem and a third for the swap partition. The partitions of a hard disk
[Figure 8.3: Linked list of disks — gendisk_head points to a list of gendisk data structures (major, major_name, minor_shift, max_p, max_nr, init(), part, sizes, nr_real, real_devices, next); the first entry has major 8 and major_name "sd", the second major 3 and major_name "ide0"; each part field points at an hd_struct[] array of start_sect/nr_sects partition entries]
  Begin  Start  End  Blocks  Id  System
      1      1  478  489456  83  Linux native
    479    479  510   32768  82  Linux swap
This shows that the first partition starts at cylinder or track 0, head 1 and sector 1 and extends to include cylinder 477, sector 32 and head 63. As there are 32 sectors in a track and 64 read/write heads, this partition is a whole number of cylinders in size. fdisk aligns partitions on cylinder boundaries by default. It starts at the outermost cylinder (0) and extends inwards, towards the spindle, for 478 cylinders. The second partition, the swap partition, starts at the next cylinder (478) and extends to the innermost cylinder of the disk.
During initialization Linux maps the topology of the hard disks in the system. It finds out how many hard disks there are and of what type. Additionally, Linux discovers how the individual disks have been partitioned. This is all represented by a list of gendisk data structures pointed at by the gendisk_head list pointer. As each disk subsystem, for example IDE, is initialized it generates gendisk data structures representing the disks that it finds. It does this at the same time as it registers its file operations and adds its entry into the blk_dev data structure. Each gendisk data structure has a unique major device number and these match the major numbers of the block special devices. For example, the SCSI disk subsystem creates a single gendisk entry ("sd") with a major number of 8, the major number of all SCSI disk devices. Figure 8.3 shows two gendisk entries, the first one for the SCSI disk subsystem and the second for an IDE disk controller. This is ide0, the primary IDE controller.
Although the disk subsystems build the gendisk entries during their initialization, they are only used by Linux during partition checking. Instead, each disk subsystem maintains its own data structures which allow it to map device special major and minor device numbers to partitions within physical disks. Whenever a block device is read from or written to, either via the buffer cache or file operations, the kernel directs the operation to the appropriate device using the major device number found in its block special device file (for example /dev/sda2). It is the individual device driver or subsystem that maps the minor device number to the real physical device.
BUS FREE No device has control of the bus and there are no transactions currently happening,

ARBITRATION A SCSI device has attempted to get control of the SCSI bus; it does this by asserting its SCSI identifier onto the address pins. The highest numbered SCSI identifier wins.

SELECTION When a device has succeeded in getting control of the SCSI bus through arbitration it must now signal the target of this SCSI request that it wants to send a command to it. It does this by asserting the SCSI identifier of the target on the address pins.

RESELECTION SCSI devices may disconnect during the processing of a request. The target may then reselect the initiator. Not all SCSI devices support this phase.

DATA IN, DATA OUT During these phases data is transferred between the initiator and the target,

STATUS This phase is entered after completion of all commands and allows the target to send a status byte indicating success or failure to the initiator,
The Linux SCSI subsystem is made up of two basic elements, each of which is represented by data structures:

Host A SCSI host is a physical piece of hardware, a SCSI controller. The NCR810 PCI SCSI controller is an example of a SCSI host. If a Linux system has more than one SCSI controller of the same type, each instance will be represented by a separate SCSI host. This means that a SCSI device driver may control more than one instance of its controller. SCSI hosts are almost always the initiators of SCSI commands.

Device The most common type of SCSI device is a SCSI disk but the SCSI standard supports several more types: tape, CD-ROM and also a generic SCSI device. SCSI devices are almost always the targets of SCSI commands. These devices must be treated differently, for example with removable media such as CD-ROMs or tapes, Linux needs to detect if the media was removed. The different disk types have different major device numbers, allowing Linux to direct block device requests to the appropriate SCSI type.
[Figure 8.4: SCSI Data Structures — scsi_hosts points at a list of Scsi_Host_Template structures (next, name "Buslogic", device driver routines); scsi_hostlist points at Scsi_Host structures (next, this_id, max_id, hostt); scsi_devices points at Scsi_Device structures (next, id, type, host)]
structure, each of which points to its parent Scsi_Host. All of the Scsi_Device data structures are added to the scsi_devices list. Figure 8.4 shows how the main data structures relate to one another.
There are four SCSI device types: disk, tape, CD and generic. Each of these SCSI types is individually registered with the kernel as a different major block device type. However, they will only register themselves if one or more of a given SCSI device type has been found. Each SCSI type, for example SCSI disk, maintains its own tables of devices. It uses these tables to direct kernel block operations (file or buffer cache) to the correct device driver or SCSI host. Each SCSI type is represented by a Scsi_Device_Template data structure. This contains information about this type of SCSI device and the addresses of routines to perform various tasks. The SCSI subsystem uses these templates to call the SCSI type routines for each type of SCSI device. In other words, if the SCSI subsystem wishes to attach a SCSI disk device it will call the SCSI disk type attach routine. The Scsi_Type_Template data structures are added to the scsi_devicelist list if one or more SCSI devices of that type have been detected.
The final phase of the SCSI subsystem initialization is to call the finish functions for each registered Scsi_Device_Template. For the SCSI disk type this spins up all of the SCSI disks that were found and then records their disk geometry. It also adds the gendisk data structure representing all SCSI disks to the linked list of disks shown in Figure 8.3.
Name Unlike block and character devices, which have their device special files created using the mknod command, network device special files appear spontaneously as the system's network devices are discovered and initialized. Their names are standard, each name representing the type of device that it is. Multiple devices of the same type are numbered upwards from 0. Thus the ethernet devices are known as /dev/eth0, /dev/eth1, /dev/eth2 and so on. Some common network devices are:

/dev/ethN   Ethernet devices
/dev/slN    SLIP devices
/dev/pppN   PPP devices
/dev/lo     Loopback devices
Bus Information This is information that the device driver needs in order to control the device. The irq number is the interrupt that this device is using. The base address is the address of any of the device's control and status registers in I/O memory. The DMA channel is the DMA channel number that this network device is using. All of this information is set at boot time as the device is initialized.
Interface Flags These describe the characteristics and abilities of the network device:

See include/linux/netdevice.h

IFF_UP
IFF_BROADCAST
IFF_DEBUG
IFF_LOOPBACK
IFF_POINTTOPOINT
IFF_NOTRAILERS
IFF_RUNNING
IFF_NOARP
IFF_PROMISC
IFF_ALLMULTI
IFF_MULTICAST
Protocol Information Each device describes how it may be used by the network protocol layers:

mtu The size of the largest packet that this network can transmit, not including any link layer headers that it needs to add. This maximum is used by the protocol layers, for example IP, to select suitable packet sizes to send.

Family The family indicates the protocol family that the device can support. The family for all Linux network devices is AF_INET, the Internet address family.

Type The hardware interface type describes the media that this network device is attached to. There are many different types of media that Linux network devices support. These include Ethernet, X.25, Token Ring, Slip, PPP and Apple Localtalk.

Addresses The device data structure holds a number of addresses that are relevant to this network device, including its IP addresses.

Support Functions Each device provides a standard set of routines that protocol layers call as part of their interface to this device's link layer. These include setup and frame transmit routines as well as routines to add standard frame headers and collect statistics. These statistics can be seen using the ifconfig command.
and so on, no matter what their underlying device drivers are. The problem of "missing" network devices is easily solved. As the initialization routine for each network device is called, it returns a status indicating whether or not it located an instance of the controller that it is driving. If the driver could not find any devices, its entry in the device list pointed at by dev_base is removed. If the driver could find a device it fills out the rest of the device data structure with information about the device and the addresses of the support functions within the network device driver.

The second problem, that of dynamically assigning ethernet devices to the standard /dev/ethN device special files, is solved more elegantly. There are eight standard entries in the devices list; one for eth0, eth1 and so on to eth7. The initialization routine is the same for all of them, it tries each ethernet device driver built into the kernel in turn until one finds a device. When the driver finds its ethernet device it fills out the ethN device data structure, which it now owns. It is also at this time that the network device driver initializes the physical hardware that it is controlling and works out which IRQ it is using, which DMA channel (if any) and so on. A driver may find several instances of the network device that it is controlling and, in this case, it will take over several of the /dev/ethN device data structures. Once all eight standard /dev/ethN have been allocated, no more ethernet devices will be probed for.
Chapter 9

The File system

This chapter describes how the Linux kernel maintains the files in the file systems that it supports. It describes the Virtual File System (VFS) and explains how the Linux kernel's real file systems are supported.

One of the most important features of Linux is its support for many different file systems. This makes it very flexible and well able to coexist with many other operating systems. At the time of writing, Linux supports 15 file systems: ext, ext2, xia, minix, umsdos, msdos, vfat, proc, smb, ncp, iso9660, sysv, hpfs, affs and ufs, and no doubt, over time more will be added.
In Linux, as it is for UnixTM, the separate file systems the system may use are not accessed by device identifiers (such as a drive number or a drive name) but instead they are combined into a single hierarchical tree structure that represents the file system as one whole single entity. Linux adds each new file system into this single file system tree as it is mounted. All file systems, of whatever type, are mounted onto a directory and the files of the mounted file system cover up the existing contents of that directory. This directory is known as the mount directory or mount point. When the file system is unmounted, the mount directory's own files are once again revealed.
When disks are initialized (using fdisk, say) they have a partition structure imposed on them that divides the physical disk into a number of logical partitions. Each partition may hold a single file system, for example an EXT2 file system. File systems organize files into logical hierarchical structures with directories, soft links and so on held in blocks on physical devices. Devices that can contain file systems are known as block devices. The IDE disk partition /dev/hda1, the first partition of the first IDE disk drive in the system, is a block device. The Linux file systems regard these block devices as simply linear collections of blocks, they do not know or care about the underlying physical disk's geometry. It is the task of each block device driver to map a request to read a particular block of its device into terms meaningful to its device; the particular track, sector and cylinder of its hard disk where the block is kept. A file system has to look, feel and operate in the same way no matter what device is holding it. Moreover, using Linux's file systems, it does not matter (at least to the system user) that these different file systems are on different physical media controlled by different hardware controllers. The file system might not even be on the local system, it could just as well be a disk remotely mounted over a network link. Consider the following example where a Linux system has its root file system on a SCSI disk:
A  C  D  E  F  bin  boot  cdrom  dev  etc  fd  home
lib  proc  mnt  opt  tmp  root  var  lost+found  usr  sbin
Neither the users nor the programs that operate on the files themselves need know that /C is in fact a mounted VFAT file system that is on the first IDE disk in the system. In the example (which is actually my home Linux system), /E is the master IDE disk on the second IDE controller. It does not matter either that the first IDE controller is a PCI controller and that the second is an ISA controller which also controls the IDE CDROM. I can dial into the network where I work using a modem and the PPP network protocol and in this case I can remotely mount my Alpha AXP Linux system's file systems on /mnt/remote.

The files in a file system are collections of data; the file holding the sources to this chapter is an ASCII file called filesystems.tex. A file system not only holds the data that is contained within the files of the file system but also the structure of the file system. It holds all of the information that Linux users and processes see as files, directories, soft links, file protection information and so on. Moreover it must hold that information safely and securely, the basic integrity of the operating system depends on its file systems. Nobody would use an operating system that randomly lost data and files1.
Minix, the first file system that Linux had, is rather restrictive and lacking in performance. Its filenames cannot be longer than 14 characters (which is still better than 8.3 filenames) and the maximum file size is 64Mbytes. 64Mbytes might at first glance seem large enough but large file sizes are necessary to hold even modest databases. The first file system designed specifically for Linux, the Extended File system, or EXT, was introduced in April 1992 and cured a lot of the problems but it was still felt to lack performance. So, in 1993, the Second Extended File system, or EXT2, was added. It is this file system that is described in detail later on in this chapter.
An important development took place when the EXT file system was added into Linux. The real file systems were separated from the operating system and system services by an interface layer known as the Virtual File system, or VFS. VFS allows Linux to support many, often very different, file systems, each presenting a common software interface to the VFS. All of the details of the Linux file systems are translated by software so that all file systems appear identical to the rest of the Linux kernel and to programs running in the system. Linux's Virtual File system layer allows you to transparently mount the many different file systems at the same time.
The Linux Virtual File system is implemented so that access to its files is as fast and efficient as possible. It must also make sure that the files and their data are kept
¹Well, not knowingly, although I have been bitten by operating systems with more lawyers than Linux has developers.
[Figure: Physical layout of the EXT2 file system — the device is divided into Block Groups 0 to N, each holding a copy of the Super Block and the Group Descriptors followed by its Block Bitmap, Inode Bitmap, Inode Table and Data Blocks]
See fs/ext2/*
[Figure 9.2: The EXT2 inode — an ext2_inode holds the mode, owner info, size and timestamps, twelve direct block pointers to data blocks, and indirect, double indirect and triple indirect pointers to blocks of further block pointers]
In the EXT2 file system, the inode is the basic building block; every file and directory in the file system is described by one and only one inode. The EXT2 inodes for each Block Group are kept in the inode table together with a bitmap that allows the system to keep track of allocated and unallocated inodes. Figure 9.2 shows the format of an EXT2 inode; amongst other information, it contains the following fields:
See include/linux/ext2_fs_i.h
mode This holds two pieces of information; what this inode describes and the permissions that users have to it. For EXT2, an inode can describe one of file, directory, symbolic link, block device, character device or FIFO,
Owner Information The user and group identifiers of the owners of this file or directory. This allows the file system to correctly allow the right sort of accesses,
Datablocks Pointers to the blocks that contain the data that this inode is describing. The first twelve are pointers to the physical blocks containing the data described by this inode and the last three pointers contain more and more levels of indirection. For example, the double indirect blocks pointer points at a block of pointers to blocks of pointers to data blocks. This means that files less than or equal to twelve data blocks in length are more quickly accessed than larger files.
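The reach of this three-level scheme follows directly from the block size: each indirect block holds block_size/4 pointers, since EXT2 block pointers are 4 bytes wide. A small user-space sketch (the function name is invented, this is not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch (not kernel code): how many data blocks an EXT2-style
 * inode can address through its twelve direct pointers plus the single,
 * double and triple indirect pointers, assuming 4-byte block pointers. */
static uint64_t max_addressable_blocks(uint64_t block_size)
{
    uint64_t ptrs = block_size / 4;   /* pointers held by one indirect block */
    return 12                         /* direct blocks */
         + ptrs                       /* single indirect */
         + ptrs * ptrs                /* double indirect */
         + ptrs * ptrs * ptrs;        /* triple indirect */
}
```

With 1024 byte blocks this gives 12 + 256 + 65,536 + 16,777,216 = 16,843,020 addressable blocks per file.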
You should note that EXT2 inodes can describe special device files. These are not real files but handles that programs can use to access devices. All of the device files in /dev are there to allow programs to access Linux's devices. For example the mount program takes as an argument the device file that it wishes to mount.
Magic Number This allows the mounting software to check that this is indeed the Superblock for an EXT2 file system. For the current version of EXT2 this is 0xEF53,
Revision Level The major and minor revision levels allow the mounting code to determine whether or not this file system supports features that are only available in particular revisions of the file system. There are also feature compatibility fields which help the mounting code to determine which new features can safely be used on this file system,
Mount Count and Maximum Mount Count Together these allow the system to decide whether the file system should be fully checked; the mount count is incremented each time the file system is mounted and, when it reaches the maximum mount count, a full check is recommended,
Block Group Number The Block Group number that holds this copy of the Superblock,
Block Size The size of the block for this file system in bytes, for example 1024 bytes,
Blocks per Group The number of blocks in a group. Like the block size this is fixed when the file system is created,
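A mount-time sanity check over these fields might look like the sketch below. It is illustrative only (the helper names are invented; the real kernel works on struct ext2_super_block):

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_SUPER_MAGIC 0xEF53  /* current EXT2 magic number, per the text */

/* Does this look like an EXT2 superblock at all? */
static int is_ext2_magic(uint16_t magic)
{
    return magic == EXT2_SUPER_MAGIC;
}

/* Has the file system been mounted often enough that a full
 * consistency check is recommended? */
static int needs_full_check(uint16_t mount_count, uint16_t max_mount_count)
{
    return mount_count >= max_mount_count;
}
```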
See include/linux/ext2_fs_sb.h
first inode in an EXT2 root file system would be the directory entry for the '/' directory.
See include/linux/ext2_fs.h
Each Block Group has a data structure describing it. Like the Superblock, all the group descriptors for all of the Block Groups are duplicated in each Block Group in case of file system corruption. Each Group Descriptor contains the following information:
Blocks Bitmap The block number of the block allocation bitmap for this Block Group. This is used during block allocation and deallocation,
Inode Bitmap The block number of the inode allocation bitmap for this Block Group. This is used during inode allocation and deallocation,
Inode Table The block number of the starting block for the inode table for this Block Group. Each inode is represented by the EXT2 inode data structure described below.
Free blocks count, Free Inodes count, Used directory count
The group descriptors are placed one after another and together they make the group descriptor table. Each Block Group contains the entire table of group descriptors after its copy of the Superblock. Only the first copy (in Block Group 0) is actually used by the EXT2 file system. The other copies are there, like the copies of the Superblock, in case the main copy is corrupted.
See include/linux/ext2_fs.h
In the EXT2 file system, directories are special files that are used to create and hold access paths to the files in the file system. Figure 9.3 shows the layout of a directory entry in memory. A directory file is a list of directory entries, each one containing the following information:
inode The inode for this directory entry. This is an index into the array of inodes held in the Inode Table of the Block Group. In figure 9.3, the directory entry for the file called file has a reference to inode number i1,
[Figure 9.3: EXT2 directory entries — each variable-length entry holds an inode number, entry length, name length and name; for example "file" referencing inode i1 and "very_long_name" referencing inode i2 of the inode table]
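A minimal user-space sketch of walking such a directory block is shown below. The struct mirrors the fields just described but is simplified (it is not the kernel's struct ext2_dir_entry, and dir_add/dir_lookup are invented helpers):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified on-disk directory entry, after figure 9.3. The shapes are
 * illustrative: each entry is variable length and rec_len gives the
 * offset of the next entry. */
struct dirent_sk {
    uint32_t inode;     /* inode number for this name */
    uint16_t rec_len;   /* total length of this entry in bytes */
    uint8_t  name_len;  /* length of the name that follows */
    uint8_t  pad;
    char     name[];    /* the name, not NUL terminated on disk */
};

/* Append an entry at `offset` in a directory block; returns the next offset. */
static uint16_t dir_add(uint8_t *block, uint16_t offset,
                        uint32_t inode, const char *name)
{
    struct dirent_sk *de = (struct dirent_sk *)(block + offset);
    uint8_t len = (uint8_t)strlen(name);
    de->inode = inode;
    de->name_len = len;
    de->pad = 0;
    de->rec_len = (uint16_t)((8 + len + 3) & ~3u);  /* keep entries 4-byte aligned */
    memcpy(de->name, name, len);
    return (uint16_t)(offset + de->rec_len);
}

/* Scan one directory block for a name; returns its inode number or 0. */
static uint32_t dir_lookup(const uint8_t *block, uint16_t block_size,
                           const char *name)
{
    size_t want = strlen(name);
    for (uint16_t off = 0; off + 8 <= block_size; ) {
        const struct dirent_sk *de = (const struct dirent_sk *)(block + off);
        if (de->rec_len == 0)
            break;                        /* end of used entries */
        if (de->inode != 0 && de->name_len == want &&
            memcmp(de->name, name, want) == 0)
            return de->inode;
        off = (uint16_t)(off + de->rec_len);
    }
    return 0;
}
```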
If there are enough free blocks in the file system, the process tries to allocate one. If the EXT2 file system has been built to preallocate data blocks then we may be able to take one of those. The preallocated blocks do not actually exist, they are just reserved within the allocated block bitmap. The VFS inode representing the file that we are trying to allocate a new data block for has two EXT2 specific fields, prealloc_block and prealloc_count, which are the block number of the first preallocated data block and how many of them there are, respectively. If there were no preallocated blocks or block preallocation is not enabled, the EXT2 file system must allocate a new block. The EXT2 file system first looks to see if the data block after the last data block in the file is free. Logically, this is the most efficient block to allocate as it makes sequential accesses much quicker. If this block is not free, then the search widens and it looks for a data block within 64 blocks of the ideal block. This block, although not ideal, is at least fairly close and within the same Block Group as the other data blocks belonging to this file.
If even that block is not free, the process starts looking in all of the other Block Groups in turn until it finds some free blocks. The block allocation code looks for a cluster of eight free data blocks somewhere in one of the Block Groups. If it cannot find eight together, it will settle for less. If block preallocation is wanted and enabled it will update prealloc_block and prealloc_count accordingly.
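The search order described above (ideal block first, then anything within 64 blocks of it) can be sketched over a single Block Group's bitmap as follows. Names and the flat bitmap are illustrative; the real ext2 allocator is more involved:

```c
#include <assert.h>
#include <stdint.h>

/* In the bitmap, a set bit means the block is in use. */
static int block_is_free(const uint8_t *bitmap, uint32_t n)
{
    return !(bitmap[n / 8] & (1u << (n % 8)));
}

/* Try the ideal block (the one after the file's last block) first, then any
 * free block within the next 64. Returns the chosen block number, or -1 if
 * the window holds no free block (the caller would then go on to search the
 * other Block Groups). */
static int32_t find_block_near(const uint8_t *bitmap, uint32_t nblocks,
                               uint32_t goal)
{
    if (goal < nblocks && block_is_free(bitmap, goal))
        return (int32_t)goal;             /* keeps sequential access fast */
    for (uint32_t b = goal + 1; b < goal + 64 && b < nblocks; b++)
        if (block_is_free(bitmap, b))
            return (int32_t)b;            /* close by, in the same group */
    return -1;
}
```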
Wherever it finds the free block, the block allocation code updates the Block Group's block bitmap and allocates a data buffer in the buffer cache. That data buffer is uniquely identified by the file system's supporting device identifier and the block number of the allocated block. The data in the buffer is zeroed and the buffer is marked as "dirty" to show that its contents have not been written to the physical disk.
[Figure: A logical diagram of the Virtual File System — the VFS sits between the system's processes and the real file systems (for example MINIX and EXT2) and makes use of the Inode Cache, Directory Cache and Buffer Cache, which sit above the disk drivers]
See fs/*
the root file system, the VFS must read its superblock. Each file system type's superblock read routine must work out the file system's topology and map that information onto a VFS superblock data structure. The VFS keeps a list of the mounted file systems in the system together with their VFS superblocks. Each VFS superblock contains information and pointers to routines that perform particular functions. So, for example, the superblock representing a mounted EXT2 file system contains a pointer to the EXT2 specific inode reading routine. This EXT2 inode read routine, like all of the file system specific inode read routines, fills out the fields in a VFS inode. Each VFS superblock contains a pointer to the first VFS inode on the file system. For the root file system, this is the inode that represents the "/" directory. This mapping of information is very efficient for the EXT2 file system but moderately less so for other file systems.
See fs/inode.c
See fs/buffer.c
As the system's processes access directories and files, system routines are called that traverse the VFS inodes in the system. For example, typing ls for a directory or cat for a file causes the Virtual File System to search through the VFS inodes that represent the file system. As every file and directory on the system is represented by a VFS inode, a number of inodes will be repeatedly accessed. These inodes are kept in the inode cache, which makes access to them quicker. If an inode is not in the inode cache, then a file system specific routine must be called in order to read the appropriate inode. The action of reading the inode causes it to be put into the inode cache and further accesses to the inode keep it in the cache. The less used VFS inodes get removed from the cache.
All of the Linux file systems use a common buffer cache to cache data buffers from the underlying devices, to help speed up access by all of the file systems to the physical devices holding the file systems. This buffer cache is independent of the file systems and is integrated into the mechanisms that the Linux kernel uses to allocate and read and write data buffers. It has the distinct advantage of making the Linux file systems independent from the underlying media and from the device drivers that support them. All block structured devices register themselves with the Linux kernel and present a uniform, block based, usually asynchronous interface. Even relatively complex block devices such as SCSI devices do this. As the real file systems read data from the underlying physical disks, this results in requests to the block device drivers to read physical blocks from the device that they control. Integrated into this block device interface is the buffer cache. As blocks are read by the file systems they are saved in the global buffer cache shared by all of the file systems and the Linux kernel. Buffers within it are identified by their block number and a unique identifier for the device that read them. So, if the same data is needed often, it will be retrieved from the buffer cache rather than read from the disk, which would take somewhat longer. Some devices support read ahead, where data blocks are speculatively read just in case they are needed.
The VFS also keeps a cache of directory lookups so that the inodes for frequently used directories can be quickly found. As an experiment, try listing a directory that you have not listed recently. The first time you list it, you may notice a slight pause, but the second time you list its contents the result is immediate. The directory cache does not store the inodes for the directories itself; these should be in the inode cache. The directory cache simply stores the mapping between the full directory names and their inode numbers.
See include/linux/fs.h
Device This is the device identifier for the block device that this file system is contained in. For example, /dev/hda1, the first IDE hard disk in the system, has a device identifier of 0x301,
Inode pointers The mounted inode pointer points at the first inode in this file system. The covered inode pointer points at the inode representing the directory that this file system is mounted on. The root file system's VFS superblock does not have a covered pointer,
Blocksize The block size in bytes of this file system, for example 1024 bytes,
Superblock operations A pointer to a set of superblock routines for this file system. Amongst other things, these routines are used by the VFS to read and write inodes and superblocks,
File System type A pointer to the mounted file system's file_system_type data structure,
device This is the device identifier of the device holding the file (or whatever else) that this VFS inode represents,
inode number This is the number of the inode and is unique within this file system. The combination of device and inode number is unique within the Virtual File System,
mode Like EXT2, this field describes what this VFS inode represents as well as access rights to it,
count The number of system components currently using this VFS inode. A count of zero means that the inode is free to be discarded or reused,
See include/linux/fs.h
[Figure 9.5: Registered file systems — the file_systems pointer heads a linked list of file_system_type data structures, for example "ext2", "proc" and "iso9660", each holding a *read_super() routine, a name, a requires_dev flag and a next pointer]
lock This field is used to lock the VFS inode, for example, when it is being read from the file system,
dirty Indicates whether this VFS inode has been written to; if so the underlying file system will need modifying,
When you build the Linux kernel you are asked if you want each of the supported file systems. When the kernel is built, the file system startup code contains calls to the initialisation routines of all of the built-in file systems. Linux file systems may also be built as modules and, in this case, they may be demand loaded as they are needed or loaded by hand using insmod. Whenever a file system module is loaded it registers itself with the kernel and unregisters itself when it is unloaded. Each file system's initialisation routine registers itself with the Virtual File System and is represented by a file_system_type data structure which contains the name of the file system and a pointer to its VFS superblock read routine. Figure 9.5 shows that the file_system_type data structures are put into a list pointed at by the file_systems pointer. Each file_system_type data structure contains the following information:
Superblock read routine This routine is called by the VFS when an instance of the file system is mounted,
File System name The name of this file system, for example ext2,
Device needed Does this file system need a device to support it? Not all file systems need a device to hold them. The /proc file system, for example, does not require a block device,
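A minimal sketch of this registration list is shown below. The struct mirrors the fields just described but is otherwise simplified, and the helper names are invented (they are not the kernel's):

```c
#include <assert.h>
#include <string.h>

struct fs_type_sk {
    const char *name;              /* e.g. "ext2" */
    int requires_dev;              /* does it need a block device? */
    void (*read_super)(void);      /* placeholder for the real routine */
    struct fs_type_sk *next;       /* next registered file system */
};

static struct fs_type_sk *file_systems_sk;   /* head of the list */

static void register_fs_sk(struct fs_type_sk *fs)
{
    fs->next = file_systems_sk;    /* push onto the singly linked list */
    file_systems_sk = fs;
}

/* In the spirit of get_fs_type(): walk the list comparing names. */
static struct fs_type_sk *get_fs_type_sk(const char *name)
{
    struct fs_type_sk *fs;
    for (fs = file_systems_sk; fs; fs = fs->next)
        if (strcmp(fs->name, name) == 0)
            return fs;
    return 0;
}
```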
checking; it does not know which file systems this kernel has been built to support or that the proposed mount point actually exists. Consider the following mount command:
$ mount -t iso9660 -o ro /dev/cdrom /mnt/cdrom
This mount command will pass the kernel three pieces of information: the name of the file system, the physical block device that contains the file system and, thirdly, where in the existing file system topology the new file system is to be mounted.
The first thing that the Virtual File System must do is to find the file system. To do this it searches through the list of known file systems by looking at each file_system_type data structure in the list pointed at by file_systems. If it finds a matching name it now knows that this file system type is supported by this kernel and it has the address of the file system specific routine for reading this file system's superblock. If it cannot find a matching file system name then all is not lost if the kernel is built to demand load kernel modules (see Chapter 12). In this case the kernel will request that the kernel daemon loads the appropriate file system module before continuing as before.
See do_mount() in fs/super.c
See get_fs_type() in fs/super.c
Next, if the physical device passed by mount is not already mounted, it must find the VFS inode of the directory that is to be the new file system's mount point. This VFS inode may be in the inode cache or it might have to be read from the block device supporting the file system of the mount point. Once the inode has been found it is checked to see that it is a directory and that there is not already some other file system mounted there. The same directory cannot be used as a mount point for more than one file system.
At this point the VFS mount code must allocate a VFS superblock and pass the mount information to the superblock read routine for this file system. All of the system's VFS superblocks are kept in the super_blocks vector of super_block data structures and one must be allocated for this mount. The superblock read routine must fill out the VFS superblock fields based on information that it reads from the physical device. For the EXT2 file system this mapping or translation of information is quite easy; it simply reads the EXT2 superblock and fills out the VFS superblock from there. For other file systems, such as the MS-DOS file system, it is not quite such an easy task. Whatever the file system, filling out the VFS superblock means that the file system must read whatever describes it from the block device that supports it. If the block device cannot be read from or if it does not contain this type of file system then the mount command will fail.
Each mounted file system is described by a vfsmount data structure; see figure 9.6. These are queued on a list pointed at by vfsmntlist. Another pointer, vfsmnttail, points at the last entry in the list and the mru_vfsmnt pointer points at the most recently used file system. Each vfsmount structure contains the device number of the block device holding the file system, the directory where this file system is mounted and a pointer to the VFS superblock allocated when this file system was mounted. In turn the VFS superblock points at the file_system_type data structure for this sort of file system and to the root inode for this file system. This inode is kept resident in the VFS inode cache all of the time that this file system is loaded.
See add_vfsmnt() in fs/super.c
[Figure 9.6: A mounted file system — a vfsmount structure on the vfsmntlist (mnt_dev 0x0301, mnt_devname /dev/hda1, mnt_dirname /, mnt_flags, mnt_sb, next) points at the VFS super_block (s_dev 0x0301, s_blocksize 1024, s_flags, s_type pointing at the "ext2" file_system_type with its *read_super(), name and requires_dev fields, s_covered, s_mounted), which in turn points at the root VFS inode (i_dev 0x0301, i_ino 42)]
See remove_vfsmnt() in fs/super.c
The workshop manual for my MG usually describes assembly as the reverse of disassembly, and the reverse is more or less true for unmounting a file system. A file system cannot be unmounted if something in the system is using one of its files. So, for example, you cannot umount /mnt/cdrom if a process is using that directory or any of its children. If anything is using the file system to be unmounted there may be VFS inodes from it in the VFS inode cache, and the code checks for this by looking through the list of inodes for inodes owned by the device that this file system occupies. If the VFS superblock for the mounted file system is dirty, that is it has been modified, then it must be written back to the file system on disk. Once it has been written to disk, the memory occupied by the VFS superblock is returned to the kernel's free pool of memory. Finally the vfsmount data structure for this mount is unlinked from vfsmntlist and freed.
See fs/inode.c
The VFS inode cache is implemented as a hash table whose entries are pointers to lists of VFS inodes that have the same hash value. The hash value of an inode is calculated from its inode number and from the device identifier for the underlying physical device containing the file system. Whenever the Virtual File System needs to access an inode, it first looks in the VFS inode cache. To find an inode in the cache, the system first calculates its hash value and then uses it as an index into the inode hash table. This gives it a pointer to a list of inodes with the same hash value. It then reads each inode in turn until it finds one with both the same inode number and the same device identifier as the one that it is searching for.
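The lookup just described can be sketched as follows. The table size, hash mixing and struct shape are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define IHASH_SIZE 128

struct inode_sk {
    uint16_t i_dev;
    uint32_t i_ino;
    unsigned count;                /* current users of this cached inode */
    struct inode_sk *next;         /* chain of inodes with the same hash */
};

static struct inode_sk *ihash_sk[IHASH_SIZE];

/* Hash on the (device, inode number) pair. */
static unsigned ihash_fn(uint16_t dev, uint32_t ino)
{
    return (dev ^ ino) % IHASH_SIZE;
}

static void icache_insert(struct inode_sk *inode)
{
    unsigned h = ihash_fn(inode->i_dev, inode->i_ino);
    inode->next = ihash_sk[h];
    ihash_sk[h] = inode;
}

/* Returns the cached inode with its count raised, or 0 on a miss (when the
 * caller would have to read the inode in from the real file system). */
static struct inode_sk *iget_cached(uint16_t dev, uint32_t ino)
{
    struct inode_sk *i;
    for (i = ihash_sk[ihash_fn(dev, ino)]; i; i = i->next)
        if (i->i_dev == dev && i->i_ino == ino) {
            i->count++;
            return i;
        }
    return 0;
}
```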
If it can find the inode in the cache, its count is incremented to show that it has another user and the file system access continues. Otherwise a free VFS inode must be found so that the file system can read the inode into memory. VFS has a number of choices about how to get a free inode. If the system may allocate more VFS inodes then this is what it does; it allocates kernel pages and breaks them up into new, free inodes and puts them into the inode list. All of the system's VFS inodes are in a list pointed at by first_inode as well as in the inode hash table. If the system already has all of the inodes that it is allowed to have, it must find an inode that is a good candidate to be reused. Good candidates are inodes with a usage count of zero; this indicates that the system is not currently using them. Really important VFS inodes, for example the root inodes of file systems, always have a usage count greater than zero and so are never candidates for reuse. Once a candidate for reuse has been located it is cleaned up. The VFS inode might be dirty, in which case it needs to be written back to the file system, or it might be locked, in which case the system must wait for it to be unlocked before continuing. The candidate VFS inode must be cleaned up before it can be reused.
However the new VFS inode is found, a file system specific routine must be called to fill it out from information read from the underlying real file system. Whilst it is being filled out, the new VFS inode has a usage count of one and is locked so that nothing else accesses it until it contains valid information.
To get the VFS inode that is actually needed, the file system may need to access several other inodes. This happens when you read a directory; only the inode for the final directory is needed but the inodes for the intermediate directories must also be read. As the VFS inode cache is used and filled up, the less used inodes will be discarded and the more used inodes will remain in the cache.
[Figure 9.7: The buffer cache — a hash_table of pointers to chains of buffer_head data structures, each holding b_dev, b_blocknr, b_state, b_count, b_size, b_next, b_prev and b_data fields; for example device 0x0301 block 42 of size 1024, device 0x0801 block 17 of size 2048 and device 0x0301 block 39 of size 1024]
9.3 The Buffer Cache
As the mounted file systems are used they generate a lot of requests to the block devices to read and write data blocks. All block data read and write requests are given to the device drivers in the form of buffer_head data structures via standard kernel routine calls. These give all of the information that the block device drivers need; the device identifier uniquely identifies the device and the block number tells the driver which block to read. All block devices are viewed as linear collections of blocks of the same size. To speed up access to the physical block devices, Linux maintains a cache of block buffers. All of the block buffers in the system are kept somewhere in this buffer cache, even the new, unused buffers. This cache is shared between all of the physical block devices; at any one time there are many block buffers in the cache, belonging to any one of the system's block devices and often in many different states. If valid data is available from the buffer cache this saves the system an access to a physical device. Any block buffer that has been used to read data from a block device or to write data to it goes into the buffer cache. Over time it may be removed from the cache to make way for a more deserving buffer or it may remain in the cache as it is frequently accessed.
Block buffers within the cache are uniquely identified by the owning device identifier and the block number of the buffer. The buffer cache is composed of two functional parts. The first part is the lists of free block buffers. There is one list per supported buffer size and the system's free block buffers are queued onto these lists when they are first created or when they have been discarded. The currently supported buffer sizes are 512, 1024, 2048, 4096 and 8192 bytes. The second functional part is the cache itself. This is a hash table which is a vector of pointers to chains of buffers that have the same hash index. The hash index is generated from the owning device identifier and the block number of the data block. Figure 9.7 shows the hash table together with a few entries. Block buffers are either in one of the free lists or they are in the buffer cache. When they are in the buffer cache they are also queued onto Least Recently Used (LRU) lists. There is an LRU list for each buffer type and these are used by the system to perform work on buffers of a type, for example, writing buffers with new data in them out to disk. The buffer's type reflects its state and Linux currently supports the following types:
See bdflush() in fs/buffer.c
# update -d
bdflush version 1.4
0:   60 Max fraction of LRU list to examine for dirty blocks
1:  500 Max number of dirty blocks to write each time bdflush activated
2:   64 Num of clean buffers to be loaded onto free list by refill_freelist
3:  256 Dirty block threshold for activating bdflush in refill_freelist
4:   15 Percentage of cache to scan for free clusters
5: 3000 Time for data buffers to age before flushing
6:  500 Time for non-data (dir, bitmap, etc) buffers to age before flushing
7: 1884 Time buffer cache load average constant
8:    2 LAV ratio (used to determine threshold for buffer fratricide).
All of the dirty buffers are linked into the BUF_DIRTY LRU list whenever they are made dirty by having data written to them, and bdflush tries to write a reasonable number of them out to their owning disks. Again this number can be seen and controlled by the update command and the default is 500 (see above).
See sys_bdflush() in fs/buffer.c
The update command is more than just a command; it is also a daemon. When run as superuser (during system initialisation) it will periodically flush all of the older dirty buffers out to disk. It does this by calling a system service routine that does more or less the same thing as bdflush. Whenever a dirty buffer is finished with, it is tagged with the system time at which it should be written out to its owning disk. Every time that update runs it looks at all of the dirty buffers in the system, looking for ones with an expired flush time. Every expired buffer is written out to disk.
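The expiry scan described above can be sketched as follows. The struct and names are illustrative, not the kernel's buffer_head handling:

```c
#include <assert.h>

struct buf_sk {
    int  dirty;
    long flushtime;    /* system time at which to write this buffer out */
    int  written;      /* stands in for the real write to the owning disk */
};

/* Scan the buffers and write out every dirty one whose flush time has
 * passed; returns how many were flushed at time `now`. */
static int flush_expired(struct buf_sk *bufs, int n, long now)
{
    int flushed = 0;
    for (int i = 0; i < n; i++)
        if (bufs[i].dirty && bufs[i].flushtime <= now) {
            bufs[i].written = 1;   /* write it back */
            bufs[i].dirty = 0;     /* it is clean again */
            flushed++;
        }
    return flushed;
}
```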
/proc/devices? The /proc file system, like a real file system, registers itself with the Virtual File System. However, when the VFS makes calls to it requesting inodes as its files and directories are opened, the /proc file system creates those files and directories from information within the kernel. For example, the kernel's /proc/devices file is generated from the kernel's data structures describing its devices.
The /proc file system presents a user readable window into the kernel's inner workings. Several Linux subsystems, such as the Linux kernel modules described in chapter 12, create entries in the /proc file system.
[ls -l output for /dev/hda1: a block device owned by root, group disk, with major device number 3 and minor device number 1]
Within the kernel, every device is uniquely described by a kdev_t data type; this is two bytes long, the first byte containing the minor device number and the second byte holding the major device number. The IDE device above is held within the kernel as 0x0301. An EXT2 inode that represents a block or character device keeps the device's major and minor numbers in its first direct block pointer. When it is read by the VFS, the VFS inode data structure representing it has its i_rdev field set to the correct device identifier.
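The packing just described is simple bit arithmetic. The macro names below echo the kernel's MAJOR/MINOR convention but are local sketches:

```c
#include <assert.h>
#include <stdint.h>

/* Two-byte device identifier: major number in the top byte, minor in the
 * bottom byte, so /dev/hda1 (major 3, minor 1) becomes 0x0301. */
typedef uint16_t kdev_sk_t;

#define MKDEV_SK(major, minor) ((kdev_sk_t)(((major) << 8) | (minor)))
#define MAJOR_SK(dev)          ((unsigned)(dev) >> 8)
#define MINOR_SK(dev)          ((dev) & 0xFF)
```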
See include/linux/major.h for all of Linux's major device numbers.
Chapter 10
Networks
Networking and Linux are terms that are almost synonymous. In a very real sense Linux is a product of the Internet or World Wide Web (WWW). Its developers and users use the web to exchange information, ideas and code, and Linux itself is often used to support the networking needs of organizations. This chapter describes how Linux supports the network protocols known collectively as TCP/IP.
The TCP/IP protocols were designed to support communications between computers connected to the ARPANET, an American research network funded by the US government. The ARPANET pioneered networking concepts such as packet switching and protocol layering, where one protocol uses the services of another. ARPANET was retired in 1988 but its successors (NSF¹ NET and the Internet) have grown even larger. What is now known as the World Wide Web grew from the ARPANET and is itself supported by the TCP/IP protocols. Unix was extensively used on the ARPANET and the first released networking version of Unix was 4.3 BSD. Linux's networking implementation is modeled on 4.3 BSD in that it supports BSD sockets (with some extensions) and the full range of TCP/IP networking. This programming interface was chosen because of its popularity and to help applications be portable between Linux and other Unix platforms.
¹National Science Foundation
example, 16.42.0.9. This IP address is actually in two parts, the network address and the host address. The sizes of these parts may vary (there are several classes of IP addresses) but using 16.42.0.9 as an example, the network address would be 16.42 and the host address 0.9. The host address is further subdivided into a subnetwork address and a host address. Again, using 16.42.0.9 as an example, the subnetwork address would be 16.42.0 and the host address 9. This subdivision of the IP address allows organizations to subdivide their networks. For example, 16.42 could be the network address of the ACME Computer Company; 16.42.0 would be subnet 0 and 16.42.1 would be subnet 1. These subnets might be in separate buildings, perhaps connected by leased telephone lines or even microwave links. IP addresses are assigned by the network administrator and having IP subnetworks is a good way of distributing the administration of the network. IP subnet administrators are free to allocate IP addresses within their IP subnetworks.
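The split used in the 16.42.0.9 example is just byte extraction: the first two bytes are the network (16.42), the third the subnet (0) and the last the host (9). Real IP uses address classes and subnet masks; these helpers only mirror the worked example and their names are invented:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t ip_from_bytes(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8)  | (uint32_t)d;
}

static uint16_t network_part(uint32_t ip) { return (uint16_t)(ip >> 16); } /* 16.42 */
static uint8_t  subnet_part(uint32_t ip)  { return (uint8_t)(ip >> 8);  } /* .0    */
static uint8_t  host_part(uint32_t ip)    { return (uint8_t)ip;         } /* .9    */
```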
Generally, though, IP addresses are somewhat hard to remember; names are much easier. linux.acme.com is much easier to remember than 16.42.0.9, but there must be some mechanism to convert the network names into an IP address. These names can be statically specified in the /etc/hosts file, or Linux can ask a Domain Name Server (DNS server) to resolve the name for it. In this case the local host must know the IP address of one or more DNS servers and these are specified in /etc/resolv.conf.
Whenever you connect to another machine, say when reading a web page, its IP address is used to exchange data with that machine. This data is contained in IP packets, each of which has an IP header containing the source and destination machines' IP addresses, a checksum and other useful information. The checksum is derived from the IP packet's header and allows the receiver of IP packets to tell if the IP packet was corrupted during transmission, perhaps by a noisy telephone line. The data transmitted by an application may have been broken down into smaller packets which are easier to handle. The size of the IP data packets varies depending on the connection media; ethernet packets are generally bigger than PPP packets. The destination host must reassemble the data packets before giving the data to the receiving application. You can see this fragmentation and reassembly of data graphically if you access a web page containing a lot of graphical images via a moderately slow serial link.
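The style of checksum IP uses is the Internet checksum of RFC 1071: a folded one's-complement sum of 16-bit words, complemented. A receiver that sums a header including a correct checksum field gets zero back. This is a generic user-space sketch, not the kernel's implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];  /* big-endian 16-bit words */
        data += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)data[0] << 8;              /* odd trailing byte */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);         /* fold the carries back in */
    return (uint16_t)~sum;
}
```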
Hosts connected to the same IP subnet can send IP packets directly to each other; all other IP packets will be sent to a special host, a gateway. Gateways (or routers) are connected to more than one IP subnet and they will resend IP packets received on one subnet, but destined for another, onwards. For example, if subnets 16.42.1.0 and 16.42.0.0 are connected together by a gateway then any packets sent from subnet 0 to subnet 1 would have to be directed to the gateway so that it could route them. The local host builds up routing tables which allow it to route IP packets to the correct machine. For every IP destination there is an entry in the routing tables which tells Linux which host to send IP packets to in order that they reach their destination. These routing tables are dynamic and change over time as applications use the network and as the network topology changes.
The IP protocol is a transport layer that is used by other protocols to carry their data. The Transmission Control Protocol (TCP) is a reliable end to end protocol that uses IP to transmit and receive its own packets. Just as IP packets have their own header, TCP has its own header. TCP is a connection based protocol where two networking
[Figure 10.1: An ethernet frame (destination ethernet address, source ethernet address, protocol, data, checksum) carrying an IP packet (length, protocol, checksum, source IP address, destination IP address, data) carrying a TCP packet (source TCP address, destination TCP address, SEQ, ACK, data).]
addresses are reserved for multicast purposes and ethernet frames sent with these destination addresses will be received by all hosts on the network. As ethernet frames can carry many different protocols (as data) they, like IP packets, contain a protocol identifier in their headers. This allows the ethernet layer to correctly receive IP packets and to pass them onto the IP layer.
In order to send an IP packet via a multi-connection protocol such as ethernet, the IP layer must find the ethernet address of the IP host. This is because IP addresses are simply an addressing concept; the ethernet devices themselves have their own physical addresses. IP addresses, on the other hand, can be assigned and reassigned by network administrators at will but the network hardware responds only to ethernet frames with its own physical address or to special multicast addresses which all machines must receive. Linux uses the Address Resolution Protocol (or ARP) to allow machines to translate IP addresses into real hardware addresses such as ethernet addresses. A host wishing to know the hardware address associated with an IP address sends an ARP request packet containing the IP address that it wishes translated to all nodes on the network by sending it to a multicast address. The target host that owns the IP address responds with an ARP reply that contains its physical hardware address. ARP is not just restricted to ethernet devices; it can resolve IP addresses for other physical media, for example FDDI. Those network devices that cannot ARP are marked so that Linux does not attempt to ARP. There is also the reverse function, Reverse ARP or RARP, which translates physical network addresses into IP addresses. This is used by gateways, which respond to ARP requests on behalf of IP addresses that are in the remote network.
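The ARP request and reply described above share one packet layout. A sketch of that layout for IP over ethernet, following RFC 826 (the struct and field names are this example's own; the kernel's definition lives in its networking headers, and the packed attribute is a GCC/clang extension used so the struct matches the on-the-wire layout):

```c
#include <stdint.h>

/* On-the-wire ARP packet for IPv4 over ethernet (RFC 826).
   A request fills in the sender fields and the target IP address;
   the reply comes back with the target hardware address filled in. */
struct arp_eth_ipv4 {
    uint16_t htype;   /* hardware type: 1 for ethernet */
    uint16_t ptype;   /* protocol type: 0x0800 for IP */
    uint8_t  hlen;    /* hardware address length: 6 */
    uint8_t  plen;    /* protocol address length: 4 */
    uint16_t oper;    /* 1 = ARP request, 2 = ARP reply */
    uint8_t  sha[6];  /* sender hardware (ethernet) address */
    uint8_t  spa[4];  /* sender IP address */
    uint8_t  tha[6];  /* target hardware address (unknown in a request) */
    uint8_t  tpa[4];  /* target IP address being translated */
} __attribute__((packed));
```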
[Figure: Linux Networking Layers — network applications in user space use the socket interface into the kernel's BSD sockets layer; below that sit the INET sockets layer, the TCP and UDP protocol layers, IP and ARP, and the network devices (PPP, SLIP, Ethernet).]
Stream These sockets provide reliable two way sequenced data streams with a guarantee that data cannot be lost, corrupted or duplicated in transit. Stream sockets are supported by the TCP protocol of the Internet (INET) address family.

Datagram These sockets also provide two way data transfer but, unlike stream sockets, there is no guarantee that the messages will arrive. Even if they do arrive there is no guarantee that they will arrive in order or even not be duplicated or corrupted. This type of socket is supported by the UDP protocol of the Internet address family.

Raw This allows processes direct (hence "raw") access to the underlying protocols. It is, for example, possible to open a raw socket to an ethernet device and see raw IP data traffic.

Reliable Delivered Messages These are very like datagram sockets but the data is guaranteed to arrive.

Sequenced Packets These are like stream sockets except that the data packet sizes are fixed.

Packet This is not a standard BSD socket type, it is a Linux specific extension that allows processes to access packets directly at the device level.
Processes that communicate using sockets use a client server model. A server provides a service and clients make use of that service. One example would be a Web Server, which provides web pages, and a web client, or browser, which reads those pages. A server using sockets first creates a socket and then binds a name to it. The format of this name is dependent on the socket's address family and it is, in effect, the local address of the server. The socket's name or address is specified using the sockaddr data structure. An INET socket would have an IP port address bound to it. The registered port numbers can be seen in /etc/services; for example, the port number for a web server is 80. Having bound an address to the socket, the server then listens for incoming connection requests specifying the bound address. The originator of the request, the client, creates a socket and makes a connection request on it, specifying the target address of the server. For an INET socket the address of the server is its IP address and its port number. These incoming requests must find their way up through the various protocol layers and then wait on the server's listening socket. Once the server has received the incoming request it either accepts or rejects it. If the incoming request is to be accepted, the server must create a new socket to accept it on. Once a socket has been used for listening for incoming connection requests it cannot be used to support a connection. With the connection established both ends are free to send and receive data. Finally, when the connection is no longer needed it can be shutdown. Care is taken to ensure that data packets in transit are correctly dealt with.
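The create, bind, listen sequence described above can be sketched from user space with the BSD socket calls. This is a minimal sketch, not the kernel's internals: the helper name is this example's own, it binds to the loopback address, and it uses port 0 so the kernel picks a free port, as the text notes happens for unprivileged processes:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create an INET stream socket, bind it to a kernel-chosen loopback
   port and start listening. The bound address is written to *addr so
   a client knows where to connect. Returns the listening fd or -1. */
int make_listener(struct sockaddr_in *addr)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);     /* BSD socket create */
    if (fd < 0)
        return -1;
    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0;                           /* ask for a free port */
    socklen_t len = sizeof *addr;
    if (bind(fd, (struct sockaddr *)addr, sizeof *addr) < 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) < 0 ||
        listen(fd, 1) < 0) {                      /* accept requests */
        close(fd);
        return -1;
    }
    return fd;
}
```

A client would then create its own socket and connect to the address returned in *addr; the server accepts the request on a new socket, leaving the listening socket free for further connection requests.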
The exact meaning of operations on a BSD socket depends on its underlying address family. Setting up TCP/IP connections is very different from setting up an amateur radio X.25 connection. Like the virtual filesystem, Linux abstracts the socket interface, with the BSD socket layer being concerned with the BSD socket interface to the application programs which is in turn supported by independent address family specific software. At kernel initialization time, the address families built into the kernel register themselves with the BSD socket interface. Later on, as applications create and use BSD sockets, an association is made between the BSD socket and its supporting address family. This association is made via cross-linking data structures and tables of address family specific support routines. For example there is an address family specific socket creation routine which the BSD socket interface uses when an application creates a new socket.
When the kernel is configured, a number of address families and protocols are built into the protocols vector. Each is represented by its name, for example "INET", and the address of its initialization routine. When the socket interface is initialized at boot time each protocol's initialization routine is called. For the socket address families this results in them registering a set of protocol operations. This is a set of routines, each of which performs a particular operation specific to that address family. The registered protocol operations are kept in the pops vector, a vector of pointers to proto_ops data structures. The proto_ops data structure consists of the address family type and a set of pointers to socket operation routines specific to a particular address family. The pops vector is indexed by the address family identifier, for example the Internet address family identifier (AF_INET is 2).
See include/linux/net.h
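The registration and lookup scheme just described can be sketched as follows. This is a simplified model, not the kernel's include/linux/net.h: the "_m" names, the single create pointer and the table size are this example's own:

```c
#include <stddef.h>

#define AF_INET_M 2          /* Internet address family identifier */
#define NPROTO_M  16         /* illustrative size of the pops vector */

/* Model of a proto_ops data structure: the address family type plus
   pointers to family-specific socket operation routines. */
struct proto_ops_m {
    int family;                   /* address family type */
    int (*create)(int type);      /* family-specific socket create */
};

static struct proto_ops_m *pops_m[NPROTO_M];  /* the pops vector */

/* Called by an address family at initialization time to register
   its operations in the pops vector. */
int sock_register_m(struct proto_ops_m *ops)
{
    if (ops->family < 0 || ops->family >= NPROTO_M)
        return -1;
    pops_m[ops->family] = ops;
    return 0;
}

/* The BSD socket layer indexes pops by address family identifier
   and calls the registered create routine. */
int sock_create_m(int family, int type)
{
    struct proto_ops_m *ops =
        (family >= 0 && family < NPROTO_M) ? pops_m[family] : NULL;
    return ops ? ops->create(type) : -1;
}
```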
[Figure 10.3: Linux BSD Socket Data Structures — a files_struct (count, close_on_exec, open_fs, fd[0..255]) whose fd entry points at a file data structure (f_mode, f_pos, f_flags, f_count, f_owner, f_op, f_inode, f_version) using the BSD Socket File Operations (lseek, read, write, select, ioctl, close, fasync); its inode points at a socket data structure (type SOCK_STREAM, address family socket operations ops, data) linked to a sock data structure (type SOCK_STREAM, protocol, socket).]
routines from the registered INET proto_ops data structure to perform work for it. For example a BSD socket create request that gives the address family as INET will use the underlying INET socket create function. The BSD socket layer passes the socket data structure representing the BSD socket to the INET layer in each of these operations. Rather than clutter the BSD socket with TCP/IP specific information, the INET socket layer uses its own data structure, the sock, which it links to the BSD socket data structure. This linkage can be seen in Figure 10.3. It links the sock data structure to the BSD socket data structure using the data pointer in the BSD socket. This means that subsequent INET socket calls can easily retrieve the sock data structure. The sock data structure's protocol operations pointer is also set up at creation time and it depends on the protocol requested. If TCP is requested, then the sock data structure's protocol operations pointer will point to the set of TCP protocol operations needed for a TCP connection.
See sys_socket() in net/socket.c
address family and whose interface is up and able to be used. You can see which network interfaces are currently active in the system by using the ifconfig command. The IP address may also be the IP broadcast address of either all 1's or all 0's. These are special addresses that mean "send to everybody"3. The IP address could also be specified as any IP address if the machine is acting as a transparent proxy or firewall, but only processes with superuser privileges can bind to any IP address. The IP address bound to is saved in the sock data structure in the recv_addr and saddr fields. These are used in hash lookups and as the sending IP address respectively. The port number is optional and if it is not specified the supporting network is asked for a free one. By convention, port numbers less than 1024 cannot be used by processes without superuser privileges. If the underlying network does allocate a port number it always allocates ones greater than 1024.
As packets are being received by the underlying network devices they must be routed to the correct INET and BSD sockets so that they can be processed. For this reason UDP and TCP maintain hash tables which are used to lookup the addresses within incoming IP messages and direct them to the correct socket/sock pair. TCP is a connection oriented protocol and so there is more information involved in processing TCP packets than there is in processing UDP packets.
UDP maintains a hash table of allocated UDP ports, the udp_hash table. This consists of pointers to sock data structures indexed by a hash function based on the port number. As the UDP hash table is much smaller than the number of permissible port numbers (udp_hash is only 128, or UDP_HTABLE_SIZE, entries long) some entries in the table point to a chain of sock data structures linked together using each sock's next pointer.
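The hash-and-chain scheme just described can be sketched like this. The "_m" struct and a simple modulo hash are this example's assumptions; the kernel's actual hash function and sock structure are richer:

```c
#include <stddef.h>

#define UDP_HTABLE_SIZE 128   /* table size noted in the text */

/* Model of a sock chained through its next pointer. */
struct sock_m {
    unsigned short port;
    struct sock_m *next;
};

static struct sock_m *udp_hash[UDP_HTABLE_SIZE];

/* Insert a sock at the head of its hash chain. */
void udp_hash_insert(struct sock_m *sk)
{
    unsigned idx = sk->port % UDP_HTABLE_SIZE;  /* hash on port number */
    sk->next = udp_hash[idx];
    udp_hash[idx] = sk;
}

/* Look up the sock for an incoming message's destination port,
   walking the collision chain when ports share a table entry. */
struct sock_m *udp_hash_lookup(unsigned short port)
{
    struct sock_m *sk = udp_hash[port % UDP_HTABLE_SIZE];
    while (sk && sk->port != port)
        sk = sk->next;
    return sk;
}
```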
TCP is much more complex as it maintains several hash tables. However, TCP does not actually add the binding sock data structure into its hash tables during the bind operation, it merely checks that the port number requested is not currently being used. The sock data structure is added to TCP's hash tables during the listen operation.
REVIEW NOTE: What about the route entered?
entry so that UDP packets sent on this BSD socket do not need to check the routing database again (unless this route becomes invalid). The cached routing information is pointed at from the ip_route_cache pointer in the INET sock data structure. If no addressing information is given, this cached routing and IP addressing information will automatically be used for messages sent using this BSD socket. UDP moves the sock's state to TCP_ESTABLISHED.
For a connect operation on a TCP BSD socket, TCP must build a TCP message containing the connection information and send it to the IP destination given. The TCP message contains information about the connection, a unique starting message sequence number, the maximum sized message that can be managed by the initiating host, the transmit and receive window size and so on. Within TCP all messages are numbered and the initial sequence number is used as the first message number. Linux chooses a reasonably random value to avoid malicious protocol attacks. Every message transmitted by one end of the TCP connection and successfully received by the other is acknowledged to say that it arrived successfully and uncorrupted. Unacknowledged messages will be retransmitted. The transmit and receive window size is the number of outstanding messages that there can be without an acknowledgement being sent. The maximum message size is based on the network device that is being used at the initiating end of the request. If the receiving end's network device supports smaller maximum message sizes then the connection will use the minimum of the two. The application making the outbound TCP connection request must now wait for a response from the target application to accept or reject the connection request. As the TCP sock is now expecting incoming messages, it is added to the tcp_listening_hash so that incoming TCP messages can be directed to this sock data structure. TCP also starts timers so that the outbound connection request can be timed out if the target application does not respond to the request.
See include/linux/skbuff.h
One of the problems of having many layers of network protocols, each one using the services of another, is that each protocol needs to add protocol headers and tails to data as it is transmitted and to remove them as it processes received data. This makes passing data buffers between the protocols difficult as each layer needs to find where its particular protocol headers and tails are. One solution is to copy buffers at each layer but that would be inefficient. Instead, Linux uses socket buffers or sk_buffs to pass data between the protocol layers and the network device drivers. sk_buffs contain pointer and length fields that allow each protocol layer to manipulate the application data via standard functions or "methods".
Figure 10.4 shows the sk_buff data structure; each sk_buff has a block of data associated with it. The sk_buff has four data pointers, which are used to manipulate and manage the socket buffer's data:
head points to the start of the data area in memory. This is fixed when the sk_buff and its associated data block is allocated,

data points at the current start of the protocol data. This pointer varies depending on the protocol layer that currently owns the sk_buff,

tail points at the current end of the protocol data. Again, this pointer varies depending on the owning protocol layer,

end points at the end of the data area in memory. This is fixed when the sk_buff is allocated.
[Figure 10.4: The sk_buff data structure (next, prev, dev, head, data, tail, end, truesize, len) and its associated data block holding a packet to be transmitted.]
push This moves the data pointer towards the start of the data area and increments the len field. This is used when adding data or protocol headers to the start of the data to be transmitted,
See skb_push() in include/linux/skbuff.h

pull This moves the data pointer away from the start, towards the end of the data area and decrements the len field. This is used when removing data or protocol headers from the start of the data that has been received,
See skb_pull() in include/linux/skbuff.h

put This moves the tail pointer towards the end of the data area and increments the len field. This is used when adding data or protocol information to the end of the data to be transmitted,
See skb_put() in include/linux/skbuff.h

trim This moves the tail pointer towards the start of the data area and decrements the len field. This is used when removing data or protocol tails from the received packet.
See skb_trim() in include/linux/skbuff.h
The sk_buff data structure also contains pointers that are used as it is stored in doubly linked circular lists of sk_buffs during processing. There are generic sk_buff routines for adding sk_buffs to the front and back of these lists and for removing them.
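The four pointer operations above amount to simple pointer arithmetic over one shared buffer. A userspace sketch (the "_m" names mark this as a model, not the kernel structure; the real routines also check for over- and under-runs):

```c
#include <stddef.h>

/* Model of the four sk_buff data pointers described in the text. */
struct sk_buff_model {
    unsigned char *head;   /* fixed start of the data area */
    unsigned char *data;   /* current start of protocol data */
    unsigned char *tail;   /* current end of protocol data */
    unsigned char *end;    /* fixed end of the data area */
    size_t len;            /* bytes between data and tail */
};

/* push: make room for a protocol header at the front. */
unsigned char *skb_push_m(struct sk_buff_model *skb, size_t n)
{
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* pull: strip a processed header from the front of received data. */
unsigned char *skb_pull_m(struct sk_buff_model *skb, size_t n)
{
    skb->data += n;
    skb->len -= n;
    return skb->data;
}

/* put: append data or protocol information at the tail. */
unsigned char *skb_put_m(struct sk_buff_model *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* trim: drop a protocol tail, leaving n bytes of data. */
void skb_trim_m(struct sk_buff_model *skb, size_t n)
{
    if (skb->len > n) {
        skb->len = n;
        skb->tail = skb->data + n;
    }
}
```

Because every layer adjusts the same data and tail pointers, a packet can descend the protocol stack gaining headers, and climb it on receive losing them, without ever being copied.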
dev_base list. Each device data structure describes its device and provides a set of callback routines that the network protocol layers call when they need the network driver to perform work. These functions are mostly concerned with transmitting data and with the network device's addresses. When a network device receives packets from its network it must convert the received data into sk_buff data structures. These received sk_buffs are added onto the backlog queue by the network drivers as they are received. If the backlog queue grows too large, then the received sk_buffs are discarded. The network bottom half is flagged as ready to run as there is work to do.
See netif_rx() in net/core/dev.c
When the network bottom half handler is run by the scheduler it processes any network packets waiting to be transmitted before processing the backlog queue of sk_buffs, determining which protocol layer to pass the received packets to. As the Linux networking layers were initialized, each protocol registered itself by adding a packet_type data structure onto either the ptype_all list or into the ptype_base hash table. The packet_type data structure contains the protocol type, a pointer to a network device, a pointer to the protocol's receive data processing routine and, finally, a pointer to the next packet_type data structure in the list or hash chain. The ptype_all chain is used to snoop all packets being received from any network device and is not normally used. The ptype_base hash table is hashed by protocol identifier and is used to decide which protocol should receive the incoming network packet. The network bottom half matches the protocol types of incoming sk_buffs against one or more of the packet_type entries in either table. The protocol may match more than one entry, for example when snooping all network traffic, and in this case the sk_buff will be cloned. The sk_buff is passed to the matching protocol's handling routine.
See net_bh() in net/core/dev.c
See ip_recv() in net/ipv4/ip_input.c
See include/net/route.h
Packets are transmitted by applications exchanging data or else they are generated by the network protocols as they support established connections or connections being established. Whichever way the data is generated, an sk_buff is built to contain the data and various headers are added by the protocol layers as it passes through them. The sk_buff needs to be passed to a network device to be transmitted. First though the protocol, for example IP, needs to decide which network device to use. This depends on the best route for the packet. For computers connected by modem to a single network, say via the PPP protocol, the routing choice is easy. The packet should either be sent to the local host via the loopback device or to the gateway at the end of the PPP modem connection. For computers connected to an ethernet the choices are harder as there are many computers connected to the network.
For every IP packet transmitted, IP uses the routing tables to resolve the route for the destination IP address. Each IP destination successfully looked up in the routing tables returns a rtable data structure describing the route to use. This includes the source IP address to use, the address of the network device data structure and, sometimes, a prebuilt hardware header. This hardware header is network device specific and contains the source and destination physical addresses and other media specific information. If the network device is an ethernet device, the hardware header would be as shown in Figure 10.1 and the source and destination addresses would be physical ethernet addresses. The hardware header is cached with the route because it must be appended to each IP packet transmitted on this route and constructing it takes time. The hardware header may contain physical addresses that have to be resolved using the ARP protocol. In this case the outgoing packet is stalled until the address has been resolved. Once it has been resolved and the hardware header built, the hardware header is cached so that future IP packets sent using this interface do not have to ARP.
See ip_build_xmit() in net/ipv4/ip_output.c
See ip_rcv() in net/ipv4/ip_input.c
the ARP services to translate the destination IP address into a physical address. The ARP protocol itself is very simple and consists of two message types, an ARP request and an ARP reply. The ARP request contains the IP address that needs translating and the reply (hopefully) contains the translated IP address, the hardware address. The ARP request is broadcast to all hosts connected to the network, so, for an ethernet network, all of the machines connected to the ethernet will see the ARP request. The machine that owns the IP address in the request will respond to the ARP request with an ARP reply containing its own physical address.
The ARP protocol layer in Linux is built around a table of arp_table data structures which each describe an IP to physical address translation. These entries are created as IP addresses need to be translated and removed as they become stale over time. Each arp_table data structure has the following fields:

last used
last updated
flags
IP address
hardware address
hardware header
timer
ates an ARP reply using the hardware address kept in the receiving device's device data structure.
Network topologies can change over time and IP addresses can be reassigned to different hardware addresses. For example, some dial up services assign an IP address as each connection is established. In order that the ARP table contains up to date entries, ARP runs a periodic timer which looks through all of the arp_table entries to see which have timed out. It is very careful not to remove entries that contain one or more cached hardware headers. Removing these entries is dangerous as other data structures rely on them. Some arp_table entries are permanent and these are marked so that they will not be deallocated. The ARP table cannot be allowed to grow too large; each arp_table entry consumes some kernel memory. Whenever a new entry needs to be allocated and the ARP table has reached its maximum size the table is pruned by searching out the oldest entries and removing them.
10.7 IP Routing
The IP routing function determines where to send IP packets destined for a particular IP address. There are many choices to be made when transmitting IP packets. Can the destination be reached at all? If it can be reached, which network device should be used to transmit it? If there is more than one network device that could be used to reach the destination, which is the better one? The IP routing database maintains information that gives answers to these questions. There are two databases, the most important being the Forwarding Information Database. This is an exhaustive list of known IP destinations and their best routes. A smaller and much faster database, the route cache, is used for quick lookups of routes for IP destinations. Like all caches, it must contain only the frequently accessed routes; its contents are derived from the Forwarding Information Database.
Routes are added and deleted via IOCTL requests to the BSD socket interface. These are passed onto the protocol to process. The INET protocol layer only allows processes with superuser privileges to add and delete IP routes. These routes can be fixed or they can be dynamic and change over time. Most systems use fixed routes unless they themselves are routers. Routers run routing protocols which constantly check on the availability of routes to all known IP destinations. Systems that are not routers are known as end systems. The routing protocols are implemented as daemons, for example GATED, and they also add and delete routes via the IOCTL BSD socket interface.
[Figure: The Forwarding Information Database — fib_zones points at fib_zone data structures (fz_next, fz_hash_table, fz_list, fz_nent, fz_logmask, fz_mask), each queuing fib_node data structures (fib_next, fib_dst, fib_use, fib_info, fib_metric, fib_tos) whose fib_info data structures hold the route details (fib_next, fib_prev, fib_gateway, fib_dev, fib_refcnt, fib_window, fib_flags, fib_mtu, fib_irtt).]
See ip_rt_check_expire() in net/ipv4/route.c
best spread of hash values. Each rtable entry contains information about the route; the destination IP address, the network device to use to reach that IP address, the maximum size of message that can be used and so on. It also has a reference count, a usage count and a timestamp of the last time that they were used (in jiffies). The reference count is incremented each time the route is used to show the number of network connections using this route. It is decremented as applications stop using the route. The usage count is incremented each time the route is looked up and is used to order the rtable entry in its chain of hash entries. The last used timestamp for all of the entries in the route cache is periodically checked to see if the rtable is too old. If the route has not been recently used, it is discarded from the route cache. If routes are kept in the route cache they are ordered so that the most used entries are at the front of the hash chains. This means that finding them will be quicker when routes are looked up.
mask. All routes to the same subnet are described by pairs of fib_node and fib_info data structures queued onto the fz_list of each fib_zone data structure. If the number of routes in this subnet grows large, a hash table is generated to make finding the fib_node data structures easier.
Several routes may exist to the same IP subnet and these routes can go through one of several gateways. The IP routing layer does not allow more than one route to a subnet using the same gateway. In other words, if there are several routes to a subnet, then each route is guaranteed to use a different gateway. Associated with each route is its metric. This is a measure of how advantageous this route is. A route's metric is, essentially, the number of IP subnets that it must hop across before it reaches the destination subnet. The higher the metric, the worse the route.
Chapter 11
Kernel Mechanisms

This chapter describes some of the general tasks and mechanisms that the Linux kernel needs to supply so that other parts of the kernel work effectively together.
[Figure: Bottom Half Handling data structures — the 32-bit bh_active and bh_mask bitmasks and the bh_base vector of bottom half handler addresses.]
See include/linux/interrupt.h
[Figure 11.2: A task queue — a singly linked list of tq_struct data structures, each with next, sync, *routine() and *data fields.]
TIMER This handler is marked as active each time the system's periodic timer interrupts and is used to drive the kernel's timer queue mechanisms,
See do_bottom_half() in kernel/softirq.c
Whenever a device driver, or some other part of the kernel, needs to schedule work to be done later, it adds work to the appropriate system queue, for example the timer queue, and then signals the kernel that some bottom half handling needs to be done. It does this by setting the appropriate bit in bh_active. Bit 8 is set if the driver has queued something on the immediate queue and wishes the immediate bottom half handler to run and process it. The bh_active bitmask is checked at the end of each system call, just before control is returned to the calling process. If it has any bits set, the bottom half handler routines that are active are called. Bit 0 is checked first, then 1 and so on until bit 31. The bit in bh_active is cleared as each bottom half handling routine is called. bh_active is transient; it only has meaning between calls to the scheduler and is a way of not calling bottom half handling routines when there is no work for them to do.
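The mark-then-dispatch cycle above can be sketched as a scan of the bitmask from bit 0 upwards, clearing each bit as its handler runs. This is a userspace model of the idea, not the kernel's do_bottom_half(); the "_m" names mark it as such:

```c
#include <stdint.h>

#define NR_BH 32

static uint32_t bh_active;            /* bits marked ready to run */
static uint32_t bh_mask;              /* bits with installed handlers */
static void (*bh_base[NR_BH])(void);  /* the handler routines */

/* A driver marks its bottom half as needing to run. */
void mark_bh_m(int nr)
{
    bh_active |= 1u << nr;
}

/* Check bit 0 first, then 1 and so on to bit 31, clearing each
   active bit as its handler routine is called. */
void do_bottom_half_m(void)
{
    uint32_t pending = bh_active & bh_mask;
    for (int i = 0; i < NR_BH && pending; i++, pending >>= 1) {
        if (pending & 1) {
            bh_active &= ~(1u << i);   /* clear before calling */
            bh_base[i]();
        }
    }
}
```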
See include/linux/tqueue.h
Task queues are the kernel's way of deferring work until later. Linux has a generic mechanism for queuing work on queues and for processing them later. Task queues are often used in conjunction with bottom half handlers; the timer task queue is processed when the timer queue bottom half handler runs. A task queue is a simple data structure, see Figure 11.2, which consists of a singly linked list of tq_struct data structures each of which contains the address of a routine and a pointer to some data. The routine will be called when the element on the task queue is processed and it will be passed a pointer to the data.
Anything in the kernel, for example a device driver, can create and use task queues but there are three task queues created and managed by the kernel:

timer This queue is used to queue work that will be done as soon after the next system clock tick as is possible.

immediate This queue is also processed when the scheduler processes the active bottom half handlers. The immediate bottom half handler is not as high in priority as the timer queue bottom half handler and so these tasks will be run later.

scheduler This task queue is processed directly by the scheduler. It is used to support other task queues in the system and, in this case, the task to be run will be a routine that processes a task queue, say for a device driver.
When task queues are processed, the pointer to the first element in the queue is removed from the queue and replaced with a null pointer. In fact, this removal is an atomic operation, one that cannot be interrupted. Then each element in the queue has its handling routine called in turn. The elements in the queue are often statically allocated data. However there is no inherent mechanism for discarding allocated memory. The task queue processing routine simply moves onto the next element in the list. It is the job of the task itself to ensure that it properly cleans up any allocated kernel memory.
11.3 Timers
An operating system needs to be able to schedule an activity sometime in the future. A mechanism is needed whereby activities can be scheduled to run at some relatively precise time. Any microprocessor that wishes to support an operating system must have a programmable interval timer that periodically interrupts the processor. This periodic interrupt is known as a system clock tick and it acts like a metronome, orchestrating the system's activities. Linux has a very simple view of what time it is; it measures time in clock ticks since the system booted. All system times are based on this measurement, which is known as jiffies after the globally available variable of the same name.
Linux has two types of system timers, both queue routines to be called at some system time but they are slightly different in their implementations. Figure 11.3 shows both mechanisms. The first, the old timer mechanism, has a static array of 32 pointers to timer_struct data structures and a mask of active timers, timer_active.
See include/linux/timer.h
[Figure 11.3: System Timers — the timer_table array of timer_struct data structures (expires, *fn()) with the 32-bit timer_active mask, and timer_head, a doubly linked list of timer_list data structures (next, prev, expires, data, *function()).]
See kernel/sched.c
Both methods use the time in jiffies as an expiry time so that a timer that wished to run in 5s would have to convert 5s to units of jiffies and add that to the current system time to get the system time in jiffies when the timer should expire. Every system clock tick the timer bottom half handler is marked as active so that when the scheduler next runs, the timer queues will be processed. The timer bottom half handler processes both types of system timer. For the old system timers the timer_active bit mask is checked for bits that are set. If the expiry time for an active timer has expired (expiry time is less than the current system jiffies), its timer routine is called and its active bit is cleared. For new system timers, the entries in the linked list of timer_list data structures are checked. Every expired timer is removed from the list and its routine is called. The new timer mechanism has the advantage of being able to pass an argument to the timer routine.
See timer_bh() in kernel/sched.c
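The new-style expiry check above can be sketched over a singly linked model of the timer list (the kernel's timer_list is doubly linked; the "_m" names and the list-walk are this example's simplification):

```c
#include <stddef.h>

/* Model of a timer_list entry: an absolute expiry time in jiffies,
   an argument for the routine, and the routine itself. */
struct timer_list_m {
    struct timer_list_m *next;
    unsigned long expires;            /* absolute time in jiffies */
    unsigned long data;               /* argument passed to function */
    void (*function)(unsigned long);
};

/* Remove and call every timer whose expiry time has passed, as the
   timer bottom half does each time it runs. */
void run_timers_m(struct timer_list_m **list, unsigned long jiffies_now)
{
    struct timer_list_m **p = list;
    while (*p) {
        struct timer_list_m *t = *p;
        if (t->expires <= jiffies_now) {
            *p = t->next;             /* unlink the expired timer */
            t->function(t->data);     /* call it with its argument */
        } else {
            p = &t->next;
        }
    }
}
```

A timer wanting to run 5s from now would set expires to the current jiffies value plus 5s converted to jiffies, then link itself onto the list.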
See include/linux/wait.h
[Figure 11.4: Wait Queue — a wait queue entry holds a *task pointer and a *next pointer.]
ruptible or uninterruptible. Interruptible processes may be interrupted by events such as timers expiring or signals being delivered whilst they are waiting on a wait queue. The waiting process's state will reflect this and either be INTERRUPTIBLE or UNINTERRUPTIBLE. As this process can not now continue to run, the scheduler is run and, when it selects a new process to run, the waiting process will be suspended.1
When the wait queue is processed, the state of every process in the wait queue is set to RUNNING. If the process has been removed from the run queue, it is put back onto the run queue. The next time the scheduler runs, the processes that are on the wait queue are now candidates to be run as they are now no longer waiting. When a process on the wait queue is scheduled the first thing that it will do is remove itself from the wait queue. Wait queues can be used to synchronize access to system resources and they are used by Linux in its implementation of semaphores (see below).
11.5 Buzz Locks

These are better known as spin locks and they are a primitive way of protecting a data structure or piece of code. They only allow one process at a time to be within a critical region of code. They are used in Linux to restrict access to fields in data structures, using a single integer field as a lock. Each process wishing to enter the region attempts to change the lock's initial value from 0 to 1. If its current value is 1, the process tries again, spinning in a tight loop of code. The access to the memory location holding the lock must be atomic; the action of reading its value, checking that it is 0 and then changing it to 1 cannot be interrupted by any other process. Most CPU architectures provide support for this via special instructions but you can also implement buzz locks using uncached main memory.

When the owning process leaves the critical region of code it decrements the buzz lock, returning its value to 0. Any processes spinning on the lock will now read it as 0; the first one to do this will increment it to 1 and enter the critical region.
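A user-space sketch of a buzz lock using C11 atomics; the kernel relies on special CPU instructions rather than this library, but the principle is the same: the read-check-change of the lock word must happen as one uninterruptible step.

```c
#include <stdatomic.h>

typedef atomic_int buzz_lock_t;        /* the single integer lock field */

static void buzz_lock(buzz_lock_t *lock)
{
    /* atomic_exchange writes 1 and returns the previous value in one
     * atomic step; if the previous value was 1 another process holds
     * the lock, so spin in a tight loop and retry. */
    while (atomic_exchange(lock, 1) == 1)
        ;                              /* spin */
}

static void buzz_unlock(buzz_lock_t *lock)
{
    atomic_store(lock, 0);             /* return the lock's value to 0 */
}
```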
11.6 Semaphores

Semaphores are used to protect critical regions of code or data structures. Remember that each access of a critical piece of data such as a VFS inode describing a directory is made by kernel code running on behalf of a process. It would be very dangerous to allow one process to alter a critical data structure that is being used by another process. One way to achieve this would be to use a buzz lock around the critical piece of data while it is being accessed but this is a simplistic approach that would not give very good system performance. Instead Linux uses semaphores to allow just one process

1 REVIEW: INTERRUPTIBLE being set before the scheduler runs? Processes in a wait queue should never run until they are woken up.

See include/asm/semaphore.h
at a time to access critical regions of code and data; all other processes wishing to access this resource will be made to wait until it becomes free. The waiting processes are suspended; other processes in the system can continue to run as normal.

A Linux semaphore data structure contains the following information:

count This field keeps track of the count of processes wishing to use this resource. A positive value means that the resource is available. A negative or zero value means that processes are waiting for it. An initial value of 1 means that one and only one process at a time can use this resource. When processes want this resource they decrement the count and when they have finished with this resource they increment the count,

waking This is the count of processes waiting for this resource which is also the number of processes waiting to be woken up when this resource becomes free,

wait queue When processes are waiting for this resource they are put onto this wait queue,
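The bookkeeping these fields imply can be sketched as follows. This is an illustration only: the real down() and up() operations in include/asm/semaphore.h are atomic and actually suspend and wake processes, which this sketch does not do.

```c
/* A sketch of the semaphore bookkeeping; not the kernel's code. */
struct semaphore {
    int count;    /* > 0: resource free, <= 0: processes are waiting */
    int waking;   /* waiters due to be woken when the resource frees */
    /* struct wait_queue *wait;  waiting processes sleep here */
};

/* A process wanting the resource decrements count; if the result is
 * negative it must sleep on the wait queue. */
static int sem_try_down(struct semaphore *sem)
{
    sem->count--;
    return sem->count >= 0;    /* 1: got the resource, 0: must wait */
}

/* A process finished with the resource increments count; if anyone
 * is waiting, one of them is marked to be woken. */
static void sem_up(struct semaphore *sem)
{
    sem->count++;
    if (sem->count <= 0)
        sem->waking++;
}
```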
Chapter 12

Modules

This chapter describes how the Linux kernel can dynamically load functions, for example filesystems, only when they are needed.

Linux is a monolithic kernel; that is, it is one, single, large program where all the functional components of the kernel have access to all of its internal data structures and routines. The alternative is to have a micro-kernel structure where the functional pieces of the kernel are broken out into separate units with strict communication mechanisms between them. This makes adding new components into the kernel via the configuration process rather time consuming. Say you wanted to use a SCSI driver for an NCR 810 SCSI and you had not built it into the kernel. You would have to configure and then build a new kernel before you could use the NCR 810. There is an alternative: Linux allows you to dynamically load and unload components of the operating system as you need them. Linux modules are lumps of code that can be dynamically linked into the kernel at any point after the system has booted. They can be unlinked from the kernel and removed when they are no longer needed. Mostly Linux kernel modules are device drivers, pseudo-device drivers such as network drivers, or file-systems.

You can either load and unload Linux kernel modules explicitly using the insmod and rmmod commands or the kernel itself can demand that the kernel daemon (kerneld) loads and unloads the modules as they are needed. Dynamically loading code as it is needed is attractive as it keeps the kernel size to a minimum and makes the kernel very flexible. My current Intel kernel uses modules extensively and is only 406Kbytes long. I only occasionally use VFAT file systems and so I build my Linux kernel to automatically load the VFAT file system module as I mount a VFAT partition. When I have unmounted the VFAT partition the system detects that I no longer need the VFAT file system module and removes it from the system. Modules can also be useful for trying out new kernel code without having to rebuild and reboot the kernel every time you try it out. Nothing, though, is for free and there is a slight performance and memory penalty associated with kernel modules. There is a little more code that a loadable module must provide and this and the extra data structures take a little more memory. There is also a level of indirection introduced that makes accesses of kernel resources slightly less efficient for modules.

Once a Linux module has been loaded it is as much a part of the kernel as any normal kernel code. It has the same rights and responsibilities as any kernel code; in other words, Linux kernel modules can crash the kernel just like all kernel code or device drivers can.
So that modules can use the kernel resources that they need, they must be able to find them. Say a module needs to call kmalloc(), the kernel memory allocation routine. At the time that it is built, a module does not know where in memory kmalloc() is, so when the module is loaded, the kernel must fix up all of the module's references to kmalloc() before the module can work. The kernel keeps a list of all of the kernel's resources in the kernel symbol table so that it can resolve references to those resources from the modules as they are loaded. Linux allows module stacking; this is where one module requires the services of another module. For example, the VFAT file system module requires the services of the FAT file system module as the VFAT file system is more or less a set of extensions to the FAT file system. One module requiring services or resources from another module is very similar to the situation where a module requires services and resources from the kernel itself. Only here the required services are in another, previously loaded module. As each module is loaded, the kernel modifies the kernel symbol table, adding to it all of the resources or symbols exported by the newly loaded module. This means that, when the next module is loaded, it has access to the services of the already loaded modules.

When an attempt is made to unload a module, the kernel needs to know that the module is unused and it needs some way of notifying the module that it is about to be unloaded. That way the module will be able to free up any system resources that it has allocated, for example kernel memory or interrupts, before it is removed from the kernel. When the module is unloaded, the kernel removes any symbols that that module exported into the kernel symbol table.

Apart from the ability of a loaded module to crash the operating system by being badly written, it presents another danger. What happens if you load a module built for an earlier or later kernel than the one that you are now running? This may cause a problem if, say, the module makes a call to a kernel routine and supplies the wrong arguments. The kernel can optionally protect against this by making rigorous version checks on the module as it is loaded.
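The idea behind the kernel symbol table can be sketched as a simple name-to-address lookup; the names and layout here are illustrative, not the kernel's own structures:

```c
#include <string.h>

struct symbol {
    const char *name;   /* exported symbol name, e.g. "kmalloc" */
    void       *addr;   /* where that resource lives in memory  */
};

/* Search the table for a symbol; the module loader patches the
 * returned address into the module's unresolved references. */
static void *resolve(const struct symbol *table, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].addr;
    return 0;   /* unresolved: the module cannot be used */
}
```

Module stacking falls out naturally: each newly loaded module's exported symbols are appended to the table, so later modules can resolve against them too.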
insmod command to manually insert it into the kernel. The second, and much more clever, way is to load the module as it is needed; this is known as demand loading. When the kernel discovers the need for a module, for example when the user mounts a file system that is not in the kernel, the kernel will request that the kernel daemon (kerneld) attempts to load the appropriate module.

kerneld is in the modules package along with insmod, lsmod and rmmod.

See include/linux/kerneld.h
The kernel daemon is a normal user process albeit with super user privileges. When it is started up, usually at system boot time, it opens up an Inter-Process Communication (IPC) channel to the kernel. This link is used by the kernel to send messages to the kerneld asking for various tasks to be performed. Kerneld's major function is to load and unload kernel modules.
Figure 12.1: The List of Kernel Modules
When insmod has fixed up the module's references to exported kernel symbols, it asks the kernel for enough space to hold the new module, again using a privileged system call. The kernel allocates a new module data structure and enough kernel memory to hold the new module and puts it at the end of the kernel modules list. The new module is marked as UNINITIALIZED. Figure 12.1 shows the list of kernel modules after two modules, FAT and VFAT, have been loaded into the kernel. Not shown in the diagram is the first module on the list, which is a pseudo-module that is only there to hold the kernel's exported symbol table. You can use the command lsmod to list all of the loaded kernel modules and their interdependencies. lsmod simply reformats /proc/modules which is built from the list of kernel module data structures. The memory that the kernel allocates for the module is mapped into the insmod process's address space so that it can access it. insmod copies the module into the allocated space and relocates it so that it will run from the kernel address that it has been allocated. This must happen as the module cannot expect to be loaded at the same address twice let alone into the same address in two different Linux systems. Again, this relocation involves patching the module image with the appropriate addresses.

The new module also exports symbols to the kernel and insmod builds a table of these exported symbols. Every kernel module must contain module initialization and module cleanup routines and these symbols are deliberately not exported but insmod must know the addresses of them so that it can pass them to the kernel. All being well, insmod is now ready to initialize the module and it makes a privileged system call passing the kernel the addresses of the module's initialization and cleanup routines.
When a new module is added into the kernel, it must update the kernel's set of symbols and modify the modules that are being used by the new module. Modules that have other modules dependent on them must maintain a list of references at the end of their symbol table and pointed at by their module data structure. Figure 12.1 shows that the VFAT file system module is dependent on the FAT file system module. So, the FAT module contains a reference to the VFAT module; the reference was added when the VFAT module was loaded. The kernel calls the module's initialization routine and, if it is successful, it carries on installing the module. The module's cleanup routine address is stored in its module data structure and it will be called by the kernel when that module is unloaded. Finally, the module's state is set to RUNNING.
Module:
    msdos
    vfat
    fat

The count is the number of kernel entities that are dependent on this module. In the above example, the vfat and msdos modules are both dependent on the fat module and so it has a count of 2. Both the vfat and msdos modules have 1 dependent, which is a mounted file system. If I were to load another VFAT file system then the vfat module's count would become 2. A module's count is held in the first longword of its image.
This field is slightly overloaded as it also holds the AUTOCLEAN and VISITED flags. Both of these flags are used for demand loaded modules. These modules are marked as AUTOCLEAN so that the system can recognize which ones it may automatically unload. The VISITED flag marks the module as in use by one or more other system components; it is set whenever another component makes use of the module. Each time the system is asked by kerneld to remove unused demand loaded modules it looks through all of the modules in the system for likely candidates. It only looks at modules marked as AUTOCLEAN and in the state RUNNING. If the candidate has its VISITED flag cleared then it will remove the module, otherwise it will clear the VISITED flag and go on to look at the next module in the system.

Assuming that a module can be unloaded, its cleanup routine is called to allow it to free up the kernel resources that it has allocated. The module data structure is marked as DELETED and it is unlinked from the list of kernel modules. Any other modules that it is dependent on have their reference lists modified so that they no longer have it as a dependent. All of the kernel memory that the module needed is deallocated.

See sys_delete_module() in kernel/module.c
Chapter 13

Processors

Linux runs on a number of processors; this chapter gives a brief outline of each of them.

13.1 X86

TBD

13.2 ARM

The ARM processor implements a low power, high performance 32 bit RISC architecture. It is being widely used in embedded devices such as mobile phones and PDAs (Personal Data Assistants). It has 31 32 bit registers with 16 visible in any mode. Its instructions are simple load and store instructions (load a value from memory, perform an operation and store the result back into memory). One interesting feature it has is that every instruction is conditional. For example, you can test the value of a register and, until you next test for the same condition, you can conditionally execute instructions as and when you like. Another interesting feature is that you can perform arithmetic and shift operations on values as you load them. It operates in several modes, including a system mode that can be entered from user mode via a SWI (software interrupt).

It is a synthesisable core and ARM (the company) does not itself manufacture processors. Instead the ARM partners (companies such as Intel or LSI for example) implement the ARM architecture in silicon. It allows other processors to be tightly coupled via a co-processor interface and it has several memory management unit variations. These range from simple memory protection schemes to complex page hierarchies.
Chapter 14

The Linux Kernel Sources

This chapter describes where in the Linux kernel sources you should start looking for particular kernel functions.

This book does not depend on a knowledge of the 'C' programming language or require that you have the Linux kernel sources available in order to understand how the Linux kernel works. That said, it is a fruitful exercise to look at the kernel sources to get an in-depth understanding of the Linux operating system. This chapter gives an overview of the kernel sources; how they are arranged and where you might start to look for particular code.
kernels. That way they are tested for the whole community. Remember that it is always worth backing up your system thoroughly if you do try out non-production kernels.

Changes to the kernel sources are distributed as patch files. The patch utility is used to apply a series of edits to a set of source files. So, for example, if you have the 2.0.29 kernel source tree and you wanted to move to the 2.0.30 source tree, you would obtain the 2.0.30 patch file and apply the patches (edits) to that source tree:

$ cd /usr/src/linux
$ patch -p1 < patch-2.0.30

This saves copying whole source trees, perhaps over slow serial connections. A good source of kernel patches (official and unofficial) is the http://www.linuxhq.com web site.
arch The arch subdirectory contains all of the architecture specific kernel code. It has further subdirectories, one per supported architecture, for example i386 and alpha.

include The include subdirectory contains most of the include files needed to build the kernel code. It too has further subdirectories including one for every architecture supported. The include/asm subdirectory is a soft link to the real include directory needed for this architecture, for example include/asm-i386. To change architectures you need to edit the kernel makefile and rerun the Linux kernel configuration program.

init This directory contains the initialization code for the kernel and it is a very good place to start looking at how the kernel works.

mm This directory contains all of the memory management code. The architecture specific memory management code lives down in arch/*/mm/, for example arch/i386/mm/fault.c.

drivers All of the system's device drivers live in this directory. They are further sub-divided into classes of device driver, for example block.

ipc This directory contains the kernel's inter-process communications code.

modules This is simply a directory used to hold built modules.

fs All of the file system code. This is further sub-divided into directories, one per supported file system, for example vfat and ext2.

kernel The main kernel code. Again, the architecture specific kernel code is in arch/*/kernel.

lib This directory contains the kernel's library code. The architecture specific library code can be found in arch/*/lib/.

scripts This directory contains the scripts (for example awk and tk scripts) that are used when the kernel is configured.
Memory Management

This code is mostly in mm but the architecture specific code is in arch/*/mm. The page fault handling code is in mm/memory.c and the memory mapping and page cache code is in mm/filemap.c. The buffer cache is implemented in mm/buffer.c and the swap cache in mm/swap_state.c and mm/swapfile.c.

Kernel

Most of the relevant generic code is in kernel with the architecture specific code in arch/*/kernel. The scheduler is in kernel/sched.c and the fork code is in kernel/fork.c. The bottom half handling code is in include/linux/interrupt.h. The task_struct data structure can be found in include/linux/sched.h.
PCI

The PCI pseudo driver is in drivers/pci/pci.c with the system wide definitions in include/linux/pci.h. Each architecture has some specific PCI BIOS code; Alpha AXP's is in arch/alpha/kernel/bios32.c.

Interrupt Handling

The kernel's interrupt handling code is almost all microprocessor (and often platform) specific. The Intel interrupt handling code is in arch/i386/kernel/irq.c and its definitions in include/asm-i386/irq.h.
Device Drivers

Most of the lines of the Linux kernel's source code are in its device drivers. All of Linux's device driver sources are held in drivers but these are further broken out by type:

/char This is the place to look for character based devices such as ttys, serial ports and mice.

/cdrom All of the CDROM code for Linux. It is here that the special CDROM code can be found.

/pci These are the sources for the PCI pseudo-driver. A good place to look at how the PCI subsystem is mapped and initialized. The Alpha AXP PCI fixup code is also worth looking at in arch/alpha/kernel/bios32.c.

/scsi This is where to find all of the SCSI code as well as all of the drivers for the scsi devices supported by Linux.

/net This is where to look to find the network device drivers such as the DECChip 21040 PCI ethernet driver which is in tulip.c.
File Systems

The sources for the EXT2 file system are all in the fs/ext2/ directory with data structure definitions in include/linux/ext2_fs.h, ext2_fs_i.h and ext2_fs_sb.h. The Virtual File System data structures are described in include/linux/fs.h and the code is in fs/*. The buffer cache is implemented in fs/buffer.c along with the update kernel daemon.
Network

The networking code is kept in net with most of the include files in include/net. The BSD socket code is in net/socket.c and the IP version 4 INET socket code is in net/ipv4/af_inet.c. The generic protocol support code (including the sk_buff handling routines) is in net/core with the TCP/IP networking code in net/ipv4. The network device drivers are in drivers/net.
Modules

The kernel module code is partially in the kernel and partially in the modules package. The kernel code is all in kernel/modules.c with the data structures and kernel daemon kerneld messages in include/linux/module.h and include/linux/kerneld.h respectively. You may want to look at the structure of an ELF object file in include/linux/elf.h.
Appendix A

Linux Data Structures

This appendix lists the major data structures that Linux uses and which are described in this book. They have been edited slightly to fit the paper.
blk_dev_struct

See include/linux/blkdev.h

struct blk_dev_struct {
    void (*request_fn)(void);
    struct request * current_request;
    struct request   plug;
    struct tq_struct plug_tq;
};
buffer_head

The buffer_head data structure holds information about a block buffer in the buffer cache.

See include/linux/fs.h

/* bh state bits */
#define BH_Uptodate  0  /* 1 if the buffer contains valid data      */
#define BH_Dirty     1  /* 1 if the buffer is dirty                 */
#define BH_Lock      2  /* 1 if the buffer is locked                */
#define BH_Req       3  /* 0 if the buffer has been invalidated     */
#define BH_Touched   4  /* 1 if the buffer has been touched (aging) */
#define BH_Has_aged  5  /* 1 if the buffer has been aged (aging)    */
#define BH_Protected 6  /* 1 if the buffer is protected             */
#define BH_FreeOnIO  7  /* 1 to discard the buffer_head after IO    */

struct buffer_head {
  /* First cache line: */
  unsigned long       b_blocknr;    /* block number                  */
  kdev_t              b_dev;        /* device (B_FREE = free)        */
  kdev_t              b_rdev;       /* Real device                   */
  unsigned long       b_rsector;    /* Real buffer location on disk  */
  struct buffer_head  *b_next;      /* Hash queue list               */
  struct buffer_head  *b_this_page; /* circular list of buffers in
                                       one page                      */

  /* Second cache line: */
  unsigned long       b_state;      /* buffer state bitmap (above)   */
  struct buffer_head  *b_next_free;
  unsigned int        b_count;      /* users using this block        */
  unsigned long       b_size;       /* block size                    */

  /* Non-performance-critical data follows. */
  char                *b_data;      /* pointer to data block         */
  unsigned int        b_list;       /* List that this buffer appears */
  unsigned long       b_flushtime;  /* Time when this (dirty) buffer
                                     * should be written             */
  unsigned long       b_lru_time;   /* Time when this buffer was
                                     * last used.                    */
  struct wait_queue   *b_wait;
  struct buffer_head  *b_prev;      /* doubly linked hash list       */
  struct buffer_head  *b_prev_free; /* doubly linked list of buffers */
  struct buffer_head  *b_reqnext;   /* request queue                 */
};
device

Every network device in the system is represented by a device data structure.

See include/linux/netdevice.h

struct device
{
  /*
   * This is the first field of the "visible" part of this structure
   * (i.e. as seen by users in the "Space.c" file). It is the name
   * the interface.
   */
  char                    *name;

  /* I/O specific fields */
  unsigned long           rmem_end;
  unsigned long           rmem_start;
  unsigned long           mem_end;
  unsigned long           mem_start;
  unsigned long           base_addr;
  unsigned char           irq;

  /* Low-level status flags. */
  volatile unsigned char  start,      /* start an operation   */
                          interrupt,  /* interrupt arrived    */
                          tbusy;      /* transmitter busy     */

  struct device           *next;

  /* Some hardware also needs these fields, but they are not part of
     the usual set specified in Space.c. */
  unsigned char           if_port;    /* Selectable AUI,TP,   */
  unsigned char           dma;        /* DMA channel          */

  struct enet_statistics* (*get_stats)(struct device *dev);

  /*
   * This marks the end of the "visible" part of the structure. All
   * fields hereafter are internal to the system, and may change at
   * will (read: may be cleaned up at will).
   */

  /* These may be needed for future network-power-down code. */
  unsigned long           trans_start; /* Time (jiffies) of
                                          last transmit         */
  unsigned long           last_rx;     /* Time of last Rx       */

  unsigned short          flags;       /* interface flags (BSD) */
  unsigned short          family;      /* address family ID     */
  unsigned short          metric;      /* routing metric        */
  unsigned short          mtu;         /* MTU value             */
  unsigned short          type;        /* hardware type         */
  unsigned short          hard_header_len; /* hardware hdr len  */
  void                    *priv;       /* private data          */

  /* Interface address info. */
  unsigned char           broadcast[MAX_ADDR_LEN];
  unsigned char           pad;
  unsigned char           dev_addr[MAX_ADDR_LEN];
  unsigned char           addr_len;    /* hardware addr len     */
  unsigned long           pa_addr;     /* protocol address      */
  unsigned long           pa_brdaddr;  /* protocol broadcast addr */
  unsigned long           pa_dstaddr;  /* protocol P-P other addr */
  unsigned long           pa_mask;     /* protocol netmask      */
  unsigned short          pa_alen;     /* protocol address len  */

  struct dev_mc_list      *mc_list;
  int                     mc_count;
  struct ip_mc_list       *ip_mc_list;
  __u32                   tx_queue_len;

  struct sk_buff_head     buffs[DEV_NUMBUFFS];
};
device_struct

device_struct data structures are used to register character and block devices (they hold its name and the set of file operations that can be used for this device). Each valid member of the chrdevs and blkdevs vectors represents a character or block device respectively.

See fs/devices.c

struct device_struct {
    const char * name;
    struct file_operations * fops;
};
file

See include/linux/fs.h

struct file {
    mode_t f_mode;
    loff_t f_pos;
    unsigned short f_flags;
    unsigned short f_count;
    unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin;
    struct file *f_next, *f_prev;
    int f_owner;         /* pid or -pgrp where SIGIO should be sent */
    struct inode * f_inode;
    struct file_operations * f_op;
    unsigned long f_version;
    void *private_data;  /* needed for tty driver, and maybe others */
};
files_struct

The files_struct data structure describes the files that a process has open.

See include/linux/sched.h

struct files_struct {
    int count;
    fd_set close_on_exec;
    fd_set open_fds;
    struct file * fd[NR_OPEN];
};
fs_struct

See include/linux/sched.h

struct fs_struct {
    int count;
    unsigned short umask;
    struct inode * root, * pwd;
};
gendisk

The gendisk data structure holds information about a hard disk. They are used during initialization when the disks are found and then probed for partitions.

See include/linux/genhd.h

struct hd_struct {
    long start_sect;
    long nr_sects;
};

struct gendisk {
    int major;               /* major number of driver */
    const char *major_name;  /* name of major driver */
    int minor_shift;         /* number of times minor is shifted to
                                get real minor */
    int max_p;               /* maximum partitions per device */
    int max_nr;              /* maximum number of real devices */

    void (*init)(struct gendisk *);
                             /* Initialization called before we
                                do our thing */
    struct hd_struct *part;  /* partition table */
    int *sizes;              /* device size in blocks, copied to
                                blk_size[] */
    int nr_real;             /* number of real devices */

    void *real_devices;      /* internal use */
    struct gendisk *next;
};
inode

The VFS inode data structure holds information about a file or directory on disk.

See include/linux/fs.h

struct inode {
    kdev_t                       i_dev;
    unsigned long                i_ino;
    umode_t                      i_mode;
    nlink_t                      i_nlink;
    uid_t                        i_uid;
    gid_t                        i_gid;
    kdev_t                       i_rdev;
    off_t                        i_size;
    time_t                       i_atime;
    time_t                       i_mtime;
    time_t                       i_ctime;
    unsigned long                i_blksize;
    unsigned long                i_blocks;
    unsigned long                i_version;
    unsigned long                i_nrpages;
    struct semaphore             i_sem;
    struct inode_operations      *i_op;
    struct super_block           *i_sb;
    struct wait_queue            *i_wait;
    struct file_lock             *i_flock;
    struct vm_area_struct        *i_mmap;
    struct page                  *i_pages;
    struct dquot                 *i_dquot[MAXQUOTAS];
    struct inode                 *i_next, *i_prev;
    struct inode                 *i_hash_next, *i_hash_prev;
    struct inode                 *i_bound_to, *i_bound_by;
    struct inode                 *i_mount;
    unsigned short               i_count;
    unsigned short               i_flags;
    unsigned char                i_lock;
    unsigned char                i_dirt;
    unsigned char                i_pipe;
    unsigned char                i_sock;
    unsigned char                i_seek;
    unsigned char                i_update;
    unsigned short               i_writecount;
    union {
        struct pipe_inode_info   pipe_i;
        struct minix_inode_info  minix_i;
        struct ext_inode_info    ext_i;
        struct ext2_inode_info   ext2_i;
        struct hpfs_inode_info   hpfs_i;
        struct msdos_inode_info  msdos_i;
        struct umsdos_inode_info umsdos_i;
        struct iso_inode_info    isofs_i;
        struct nfs_inode_info    nfs_i;
        struct xiafs_inode_info  xiafs_i;
        struct sysv_inode_info   sysv_i;
        struct affs_inode_info   affs_i;
        struct ufs_inode_info    ufs_i;
        struct socket            socket_i;
        void                     *generic_ip;
    } u;
};
ipc_perm

The ipc_perm data structure describes the access permissions of a System V IPC object.

See include/linux/ipc.h

struct ipc_perm
{
    key_t  key;
    ushort uid;
    ushort gid;
    ushort cuid;
    ushort cgid;
    ushort mode;
    ushort seq;
};
irqaction

The irqaction data structure is used to describe the system's interrupt handlers.

See include/linux/interrupt.h
linux_binfmt

Each binary file format that Linux understands is represented by a linux_binfmt data structure.

See include/linux/binfmts.h

struct linux_binfmt {
    struct linux_binfmt * next;
    long *use_count;
    int (*load_binary)(struct linux_binprm *, struct pt_regs * regs);
    int (*load_shlib)(int fd);
    int (*core_dump)(long signr, struct pt_regs * regs);
};
mem_map_t

The mem_map_t data structure (also known as page) is used to hold information about each page of physical memory.

See include/linux/mm.h

typedef struct page {
    /* these must be first (free area handling) */
    struct page        *next;
    struct page        *prev;
    struct inode       *inode;
    unsigned long      offset;
    struct page        *next_hash;
    atomic_t           count;
    unsigned           flags;    /* atomic flags, some possibly
                                    updated asynchronously */
    unsigned           dirty:16,
                       age:8;
    struct wait_queue  *wait;
    struct page        *prev_hash;
    struct buffer_head *buffers;
    unsigned long      swap_unlock_entry;
    unsigned long      map_nr;   /* page->map_nr == page - mem_map */
} mem_map_t;
mm_struct

The mm_struct data structure is used to describe the virtual memory of a task or process.

See include/linux/sched.h

struct mm_struct {
    int count;
    pgd_t * pgd;
    unsigned long context;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack, start_mmap;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long rss, total_vm, locked_vm;
    unsigned long def_flags;
    struct vm_area_struct * mmap;
    struct vm_area_struct * mmap_avl;
    struct semaphore mmap_sem;
};
pci_bus

Every PCI bus in the system is represented by a pci_bus data structure.

See include/linux/pci.h

struct pci_bus {
    struct pci_bus  *parent;     /* parent bus this bridge is on */
    struct pci_bus  *children;   /* chain of P2P bridges on this bus */
    struct pci_bus  *next;       /* chain of all PCI buses */

    struct pci_dev  *self;       /* bridge device as seen by parent */
    struct pci_dev  *devices;    /* devices behind this bridge */

    void            *sysdata;    /* hook for sys-specific extension */

    unsigned char   number;      /* bus number */
    unsigned char   primary;     /* number of primary bridge */
    unsigned char   secondary;   /* number of secondary bridge */
    unsigned char   subordinate; /* max number of subordinate buses */
};
pci_dev

Every PCI device in the system, including PCI-PCI and PCI-ISA bridge devices, is represented by a pci_dev data structure.

See include/linux/pci.h

/*
 * There is one pci_dev structure for each slot-number/function-number
 * combination:
 */
struct pci_dev {
    struct pci_bus  *bus;      /* bus this device is on */
    struct pci_dev  *sibling;  /* next device on this bus */
    struct pci_dev  *next;     /* chain of all devices */

    void            *sysdata;
};
request

request data structures are used to make requests to the block devices in the system. The requests are always to read or write blocks of data to or from the buffer cache.

See include/linux/blkdev.h

struct request {
    volatile int rq_status;
#define RQ_INACTIVE            (-1)
#define RQ_ACTIVE              1
#define RQ_SCSI_BUSY           0xffff
#define RQ_SCSI_DONE           0xfffe
#define RQ_SCSI_DISCONNECTING  0xffe0
    kdev_t rq_dev;
    int cmd;              /* READ or WRITE */
    int errors;
    unsigned long sector;
    unsigned long nr_sectors;
    unsigned long current_nr_sectors;
    char * buffer;
    struct semaphore * sem;
    struct buffer_head * bh;
    struct buffer_head * bhtail;
    struct request * next;
};
rtable

Each rtable data structure holds information about the route to take in order to send packets to an IP host. rtable data structures are used within the IP route cache.

See include/net/route.h

struct rtable
{
    struct rtable    *rt_next;
    __u32            rt_dst;
    __u32            rt_src;
    __u32            rt_gateway;
    atomic_t         rt_refcnt;
    atomic_t         rt_use;
    unsigned long    rt_window;
    atomic_t         rt_lastuse;
    struct hh_cache  *rt_hh;
    struct device    *rt_dev;
    unsigned short   rt_flags;
    unsigned short   rt_mtu;
    unsigned short   rt_irtt;
    unsigned char    rt_tos;
};

See include/asm/semaphore.h
semaphore

Semaphores are used to protect critical data structures and regions of code.

struct semaphore {
    int count;
    int waking;
    int lock;
    struct wait_queue *wait;
};
sk_buff

The sk_buff data structure is used to describe network data as it moves between the layers of protocol.

See include/linux/skbuff.h

struct sk_buff
{
    struct sk_buff      *next;       /* Next buffer in list         */
    struct sk_buff      *prev;       /* Previous buffer in list     */
    struct sk_buff_head *list;       /* List we are on              */
    int                 magic_debug_cookie;
    struct sk_buff      *link3;      /* Link for IP protocol level
                                        buffer chains               */
    struct sock         *sk;         /* Socket we are owned by      */
    unsigned long       when;        /* used to compute rtt's       */
    struct timeval      stamp;       /* Time we arrived             */
    struct device       *dev;        /* Device we arrived on/are
                                        leaving by                  */
    union
    {
        struct tcphdr   *th;
        struct ethhdr   *eth;
        struct iphdr    *iph;
        struct udphdr   *uh;
        unsigned char   *raw;
        /* for passing file handles in a unix domain socket */
        void            *filp;
    } h;

    union
    {
        /* As yet incomplete physical layer views */
        unsigned char   *raw;
        struct ethhdr   *ethernet;
    } mac;

    struct iphdr        *ip_hdr;     /* For IPPROTO_RAW             */
    unsigned long       len;         /* Length of actual data       */
    unsigned long       csum;        /* Checksum                    */
    __u32               saddr;       /* IP source address           */
    __u32               daddr;       /* IP target address           */
    __u32               raddr;       /* IP next hop address         */
    __u32               seq;         /* TCP sequence number         */
    __u32               end_seq;     /* seq [+ fin] [+ syn] + datalen */
    __u32               ack_seq;     /* TCP ack sequence number     */
    unsigned char       proto_priv[16];
    volatile char       acked,       /* Are we acked ?              */
                        used,        /* Are we in use ?             */
                        free,        /* How to free this buffer     */
                        arp;         /* Has IP/ARP resolution finished */
    unsigned char       tries,       /* Times tried                 */
                        lock,        /* Are we locked ?             */
                        localroute,  /* Local routing asserted for
                                        this frame                  */
                        pkt_type,    /* Packet class                */
                        pkt_bridged, /* Tracker for bridging        */
                        ip_summed;   /* Driver fed us an IP checksum */
#define PACKET_HOST         0        /* To us                       */
#define PACKET_BROADCAST    1        /* To all                      */
#define PACKET_MULTICAST    2        /* To group                    */
#define PACKET_OTHERHOST    3        /* To someone else             */
    unsigned short      users;       /* User count - see
                                        datagram.c,tcp.c            */
    unsigned short      protocol;    /* Packet protocol from driver */
    unsigned int        truesize;    /* Buffer size                 */
    atomic_t            count;       /* reference count             */
    struct sk_buff      *data_skb;   /* Link to the actual data skb */
    unsigned char       *head;       /* Head of buffer              */
    unsigned char       *data;       /* Data head pointer           */
    unsigned char       *tail;       /* Tail pointer                */
    unsigned char       *end;        /* End pointer                 */
    void (*destructor)(struct sk_buff *); /* Destruct function      */
    __u16               redirport;   /* Redirect port               */
};
sock

See include/net/sock.h

Each sock data structure holds protocol specific information about a BSD socket. For example, for an INET (Internet Address Domain) socket this data structure would hold all of the TCP/IP and UDP/IP specific information.

struct sock
{
    /* This must be first. */
    struct sock             *sklist_next;
    struct sock             *sklist_prev;
    struct options          *opt;
    atomic_t                wmem_alloc;
    atomic_t                rmem_alloc;
    unsigned long           allocation;       /* Allocation mode */
    __u32                   write_seq;
    __u32                   sent_seq;
    __u32                   acked_seq;
    __u32                   copied_seq;
    __u32                   rcv_ack_seq;
    unsigned short          rcv_ack_cnt;      /* count of same ack */
    __u32                   window_seq;
    __u32                   fin_seq;
    __u32                   urg_seq;
    __u32                   urg_data;
    __u32                   syn_seq;
    int                     users;            /* user count */
    /*
     *  Not all are volatile, but some are, so we
     *  might as well say they all are.
     */
    volatile char           dead,
                            urginline,
                            intr,
                            blog,
                            done,
                            reuse,
                            keepopen,
                            linger,
                            delay_acks,
                            destroy,
                            ack_timed,
                            no_check,
                            zapped,
                            broadcast,
                            nonagle,
                            bsdism;
    unsigned long           lingertime;
    int                     proc;
    struct sock             *next;
    struct sock             **pprev;
    struct sock             *bind_next;
    struct sock             **bind_pprev;
    struct sock             *pair;
    int                     hashent;
    struct sock             *prev;
    struct sk_buff          *volatile send_head;
    struct sk_buff          *volatile send_next;
    struct sk_buff          *volatile send_tail;
    struct sk_buff_head     back_log;
    struct sk_buff          *partial;
    struct timer_list       partial_timer;
    long                    retransmits;
    struct sk_buff_head     write_queue,
                            receive_queue;
    struct proto            *prot;
    struct wait_queue       **sleep;
    __u32                   daddr;
    __u32                   saddr;            /* Sending source */
    __u32                   rcv_saddr;        /* Bound address */
    unsigned short          max_unacked;
    unsigned short          window;
    __u32                   lastwin_seq;      /* sequence number when we last
                                                 updated the window we offer */
    __u32                   high_seq;         /* sequence number when we did
                                                 current fast retransmit */
    volatile unsigned long  ato;              /* ack timeout */
    volatile unsigned long  lrcvtime;         /* jiffies at last data rcv */
    volatile unsigned long  idletime;         /* jiffies at last rcv */
    unsigned int            bytes_rcv;
    unsigned short          mtu;
    volatile unsigned short mss;
    volatile unsigned short user_mss;
    volatile unsigned short max_window;
    unsigned long           window_clamp;
    unsigned int            ssthresh;
    unsigned short          num;
    volatile unsigned short cong_window;
    volatile unsigned short cong_count;
    volatile unsigned short packets_out;
    volatile unsigned short shutdown;
    volatile unsigned long  rtt;
    volatile unsigned long  mdev;
    volatile unsigned long  rto;
    unsigned char           protocol;
    volatile unsigned char  state;
    unsigned char           ack_backlog;
    unsigned char           max_ack_backlog;
    unsigned char           priority;
    unsigned char           debug;
    int                     rcvbuf;
    int                     sndbuf;
    unsigned short          type;
    unsigned char           localroute;       /* Route locally only */
    union
    {
        struct unix_opt     af_unix;
#if defined(CONFIG_ATALK) || defined(CONFIG_ATALK_MODULE)
        struct atalk_sock   af_at;
#endif
#if defined(CONFIG_IPX) || defined(CONFIG_IPX_MODULE)
        struct ipx_opt      af_ipx;
#endif
#ifdef CONFIG_INET
        struct inet_packet_opt af_packet;
#ifdef CONFIG_NUTCP
        struct tcp_opt      af_tcp;
#endif
#endif
    } protinfo;
    /*
     *  IP 'private area'
     */
    int                     ip_ttl;           /* TTL setting */
    int                     ip_tos;           /* TOS */
    struct tcphdr           dummy_th;
    struct timer_list       keepalive_timer;  /* TCP keepalive hack */
    struct timer_list       retransmit_timer; /* TCP retransmit timer */
    struct timer_list       delack_timer;     /* TCP delayed ack timer */
    int                     ip_xmit_timeout;  /* Why the timeout is running */
    struct rtable           *ip_route_cache;  /* Cached output route */
    unsigned char           ip_hdrincl;       /* Include headers ? */
#ifdef CONFIG_IP_MULTICAST
    int                     ip_mc_ttl;        /* Multicasting TTL */
    int                     ip_mc_loop;       /* Loopback */
    char                    ip_mc_name[MAX_ADDR_LEN]; /* Multicast device name */
    struct ip_mc_socklist   *ip_mc_list;      /* Group array */
#endif
    int                     timeout;
    struct timer_list       timer;
    struct timeval          stamp;
    /* Identd */
    struct socket           *socket;
    /*
     *  Callbacks
     */
    void                    (*state_change)(struct sock *sk);
    void                    (*data_ready)(struct sock *sk, int bytes);
    void                    (*write_space)(struct sock *sk);
    void                    (*error_report)(struct sock *sk);
};
socket

See include/linux/net.h

Each socket data structure holds information about a BSD socket. It does not exist independently; it is, instead, part of the VFS inode data structure.

struct socket {
    short               type;       /* SOCK_STREAM, ...             */
    socket_state        state;
    long                flags;
    struct proto_ops    *ops;       /* protocols do most everything */
    void                *data;      /* protocol data                */
    struct socket       *conn;      /* server socket connected to   */
    struct socket       *iconn;     /* incomplete client conn.s     */
    struct socket       *next;
    struct wait_queue   **wait;
    struct inode        *inode;
};
task struct

See include/linux/sched.h

Each task struct data structure describes a process or task in the system.

struct task_struct {
/* these are hardcoded - don't touch */
    volatile long        state;          /* -1 unrunnable, 0 runnable, >0 stopped */
    long                 counter;
    long                 priority;
    unsigned long        signal;
    unsigned long        blocked;        /* bitmap of masked signals */
    unsigned long        flags;          /* per process flags, defined below */
    int                  errno;
    long                 debugreg[8];    /* Hardware debugging registers */
    struct exec_domain   *exec_domain;
/* various fields */
    struct linux_binfmt  *binfmt;
    struct task_struct   *next_task, *prev_task;
    struct task_struct   *next_run,  *prev_run;
    unsigned long        saved_kernel_stack;
    unsigned long        kernel_stack_page;
    int                  exit_code, exit_signal;
    /* ??? */
    unsigned long        personality;
    int                  dumpable:1;
    int                  did_exec:1;
    int                  pid;
    int                  pgrp;
    int                  tty_old_pgrp;
    int                  session;
    /* boolean value for session group leader */
    int                  leader;
    int                  groups[NGROUPS];
    /*
     * pointers to (original) parent process, youngest child, younger sibling,
     * older sibling, respectively.  (p->father can be replaced with
     * p->p_pptr->pid)
     */
    struct task_struct   *p_opptr, *p_pptr, *p_cptr,
                         *p_ysptr, *p_osptr;
    struct wait_queue    *wait_chldexit;
    unsigned short       uid,euid,suid,fsuid;
    unsigned short       gid,egid,sgid,fsgid;
    unsigned long        timeout, policy, rt_priority;
    unsigned long        it_real_value, it_prof_value, it_virt_value;
    unsigned long        it_real_incr, it_prof_incr, it_virt_incr;
    struct timer_list    real_timer;
    long                 utime, stime, cutime, cstime, start_time;
/* mm fault and swap info: this can arguably be seen as either
   mm-specific or thread-specific */
    unsigned long        min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
    int swappable:1;
    unsigned long        swap_address;
    unsigned long        old_maj_flt;    /* old value of maj_flt */
    unsigned long        dec_flt;        /* page fault count of the last time */
    unsigned long        swap_cnt;       /* number of pages to swap on next pass */
/* limits */
    struct rlimit        rlim[RLIM_NLIMITS];
    unsigned short       used_math;
    char                 comm[16];
/* file system info */
    int                  link_count;
    struct tty_struct    *tty;           /* NULL if no tty */
/* ipc stuff */
    struct sem_undo      *semundo;
    struct sem_queue     *semsleeping;
/* ldt for this task - used by Wine.  If NULL, default_ldt is used */
    struct desc_struct   *ldt;
/* tss for this task */
    struct thread_struct tss;
/* filesystem information */
    struct fs_struct     *fs;
/* open file information */
    struct files_struct  *files;
/* memory management info */
    struct mm_struct     *mm;
/* signal handlers */
    struct signal_struct *sig;
#ifdef __SMP__
    int                  processor;
    int                  last_processor;
    int                  lock_depth;     /* Lock depth. We can context switch in and out
                                            of holding a syscall kernel lock... */
#endif
};
timer list

See include/linux/timer.h

timer list data structures are used to implement real time timers for processes.

struct timer_list {
    struct timer_list *next;
    struct timer_list *prev;
    unsigned long expires;
    unsigned long data;
    void (*function)(unsigned long);
};

tq struct

See include/linux/tqueue.h

Each task queue (tq struct) data structure holds information about work that has been queued. This is usually a task needed by a device driver but which does not have to be done immediately.

struct tq_struct {
    struct tq_struct *next;     /* linked list of active bh's  */
    int sync;                   /* must be initialized to zero */
    void (*routine)(void *);    /* function to call            */
    void *data;                 /* argument to function        */
};
vm area struct

See include/linux/mm.h

Each vm area struct data structure describes an area of virtual memory for a process.

struct vm_area_struct {
    struct mm_struct * vm_mm;   /* VM area parameters */
    unsigned long vm_start;
    unsigned long vm_end;
    pgprot_t vm_page_prot;
    unsigned short vm_flags;
    /* AVL tree of VM areas per task, sorted by address */
    short vm_avl_height;
    struct vm_area_struct * vm_avl_left;
    struct vm_area_struct * vm_avl_right;
    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct * vm_next;
    /* for areas with inode, the circular list inode->i_mmap */
    /* for shm areas, the circular list of attaches */
    /* otherwise unused */
    struct vm_area_struct * vm_next_share;
    struct vm_area_struct * vm_prev_share;
    /* more */
    struct vm_operations_struct * vm_ops;
    unsigned long vm_offset;
    struct inode * vm_inode;
    unsigned long vm_pte;       /* shared mem */
};
Appendix B

The following World Wide Web and ftp sites are useful:

The Alpha AXP Linux web site is the place to go for all of the Alpha AXP HOWTOs. It also has a large number of pointers to Linux and Alpha AXP specific information such as CPU data sheets.

http://www.redhat.com/ Red Hat's web site. This has a lot of useful pointers.

ftp://sunsite.unc.edu This is the major site for a lot of free software. The Linux specific software is held in pub/Linux.

http://www.intel.com Intel's web site and a good place to look for Intel chip information.

http://www.blackdown.org/java-linux.html This is the primary site for information on Java on Linux.
Appendix C

C.1 Overview

The Linux Documentation Project is working on developing good, reliable docs for the Linux operating system. The overall goal of the LDP is to collaborate in taking care of all of the issues of Linux documentation, ranging from online docs (man pages, texinfo docs, and so on) to printed manuals covering topics such as installing, using, and running Linux. The LDP is essentially a loose team of volunteers with little central organization; anyone who is interested in helping is welcome to join in the effort. We feel that working together and agreeing on the direction and scope of Linux documentation is the best way to go, to reduce problems with conflicting efforts -- two people writing two books on the same aspect of Linux wastes someone's time along the way.

The LDP has set out to produce the canonical set of Linux online and printed documentation. Because our docs will be freely available (like software licensed under the terms of the GNU GPL) and distributed on the net, we are able to easily update the documentation to stay on top of the many changes in the Linux world. If you are interested in publishing any of the LDP works, see the section "Publishing LDP Manuals", below.
The copyright notice above and this permission notice must be preserved complete on all complete or partial copies.

If you distribute this work in part, instructions for obtaining the complete version of this manual must be included, and a means for obtaining a complete version provided.

Exceptions to these rules may be granted for academic purposes: Write to the author and ask. These restrictions are here to protect us as authors, not to restrict you as learners and educators.

All source code in this document is placed under the GNU General Public License, available via anonymous FTP from prep.ai.mit.edu:/pub/gnu/COPYING.
Appendix D

Printed below is the GNU General Public License (the GPL or copyleft), under which Linux is licensed. It is reproduced here to clear up some of the confusion about Linux's copyright status -- Linux is not shareware, and it is not in the public domain. The bulk of the Linux kernel is copyright (C) 1993 by Linus Torvalds, and other software and parts of the kernel are copyrighted by their authors. Thus, Linux is copyrighted; however, you may redistribute it under the terms of the GPL printed below.
D.1 Preamble

The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and modification follow.
D.2 Terms and Conditions for Copying, Distribution, and Modification

0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you".

   Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

   You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:

   a. You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.

   b. You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.

   c. If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.

In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.

3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:

   a. Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

   b. Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

   c. Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.

If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it.

6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License.

7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
one line to give the program's name and a brief idea of what it does.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this when it starts in an interactive mode:

    Gnomovision version 69, Copyright (C) 19yy name of author Gnomovision
    comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This
    is free software, and you are welcome to redistribute it under certain
    conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names:

    Yoyodyne, Inc., hereby disclaims all copyright interest in the program
    `Gnomovision' (which makes passes at compilers) written by James Hacker.

    signature of Ty Coon

This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License.
Glossary

Argument Functions and routines are passed arguments to process.

ARP Address Resolution Protocol. Used to translate IP addresses into physical hardware addresses.

CPU Central Processing Unit. The main engine of the computer, see also microprocessor and processor.

Executable image A structured file containing machine instructions and data. This file can be loaded into a process's virtual memory and executed. See also program.

Function A piece of software that performs an action. For example, returning the bigger of two numbers.

Interface A standard way of calling routines and passing data structures. For example, the interface between two layers of code might be expressed in terms of routines that pass and return a particular data structure. Linux's VFS is a good example of an interface.

Object file A file containing machine code and data that has not yet been linked with other object files or libraries to become an executable image.

Process This is an entity which can execute programs. A process could be thought of as a program in action.

Peripheral An intelligent processor that does work on behalf of the system's CPU. For example, an IDE controller chip.

Program A coherent set of CPU instructions that performs a task, such as printing "hello world". See also executable image.

Routine Similar to a function except that, strictly speaking, routines do not return values.

Shell A program which acts as an interface between the operating system and a human user. Also called a command shell, the most commonly used shell in Linux is the bash shell.

SMP Symmetrical multiprocessing. Systems with more than one processor which fairly share the work amongst those processors.

Socket A socket represents one end of a network connection; Linux supports the BSD Socket interface.

Software CPU instructions (both assembler and high level languages like C) and data. Mostly interchangeable with Program.
Bibliography

[1] Richard L. Sites. Alpha Architecture Reference Manual, Digital Press

[2] Matt Welsh and Lar Kaufman. Running Linux, O'Reilly & Associates, Inc, ISBN 1-56592-100-3

[3] PCI Special Interest Group. PCI Local Bus Specification

[4] PCI Special Interest Group. PCI BIOS ROM Specification

[5] PCI Special Interest Group. PCI to PCI Bridge Architecture Specification

[6] Intel. Peripheral Components, Intel 296467, ISBN 1-55512-207-8

[7] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language, Prentice Hall, ISBN 0-13-110362-8

[8] Steven Levy. Hackers, Penguin, ISBN 0-14-023269-9

[9] Intel. Intel486 Processor Family: Programmer's Reference Manual, Intel

[10] Comer D. E. Internetworking with TCP/IP, Volume 1 - Principles, Protocols and Architecture, Prentice Hall International Inc

[11] David Jagger. ARM Architectural Reference Manual, Prentice Hall, ISBN 0-13-736299-4
Index

/proc file system, 116
PAGE ACCESSED, bit in Alpha AXP PTE, 21
PAGE DIRTY, bit in Alpha AXP PTE, 21
Aging, Pages, 31
all requests, list of request data structures, 87
Alpha AXP Processor, 152
Alpha AXP PTE, 20
Alpha AXP, architecture, 152
Altair 8080, 1
ARM Processor, 151
arp table data structure, 134, 135
arp tables vector, 134
Assembly languages, 7
awk command, 155
C programming language, 8
ELF, 48
ELF shared libraries, 49
Executing Programs, 47
EXT, 100
EXT2, 100, 101
EXT2 Block Groups, 102
EXT2 Directories, 104
EXT2 Group Descriptor, 104
EXT2 Inode, 102
EXT2 Superblock, 103
Extended File system, 100
fd data structure, 43
fd vector, 127
fdisk command, 89, 99
fib info data structure, 137
fib node data structure, 137
fib zone data structure, 136, 137
fib zones vector, 136
file data structure, 43, 53, 54, 86, 127, 162
File system, 99
File System, mounting, 110
File System, registering, 110
File System, unmounting, 112
file system type data structure, 109-111
file systems data structure, 110, 111
Files, 42, 53
Files, creating, 112
Files, finding, 112
files struct data structure, 42, 43, 163
Filesystems, 11
Finding a File, 112
first inode data structure, 113
Free Software Foundation, iv
free area data structure, 24
free area vector, 23-25, 29, 31
fs struct data structure, 42
GATED daemon, 135
gendisk data structure, 90, 91, 94, 163
GNU, iv
groups vector, 38
Hard disks, 88
Hexadecimal, 3
hh cache data structure, 134
IDE disks, 90
ide drive t data structure, 91
ide hwif t data structure, 91
ide hwifs vector, 91
Identifiers, 38
renice command, 40
request data structure, 87, 88, 94, 168
Ritchie, Dennis, iii
Rights identifiers, 38
rmmod command, 145, 146, 148
Routing, IP, 135
rscsi disks vector, 94
rtable data structure, 132, 135, 136, 168
scheduler, 39
Scheduling, 39
Scheduling in multiprocessor systems, 41
Script Files, 50
SCSI disks, 91
SCSI, initializing, 92
Scsi Cmd data structure, 94
Scsi Cmnd data structure, 93
Scsi Device data structure, 93, 94
Scsi Device Template data structure, 94
scsi devicelist list, 94
scsi devices list, 94
Scsi Disk data structure, 94
Scsi Host data structure, 93, 94
Scsi Host Template data structure, 93
scsi hostlist list, 93
scsi hosts list, 93
Scsi Type Template data structure, 94
Second Extended File system, 100
sem data structure, 57
sem queue data structure, 57
sem undo data structure, 58
semaphore data structure, 144
Semaphores, 56, 143
Semaphores, System V, 56
semary data structure, 57
semid ds data structure, 57, 58
semid ds, data structure, 57
Shared libraries, ELF, 49
Shared memory, 58
Sharing virtual memory, 19
Shells, 47
shm segs data structure, 58
shmid ds data structure, 30, 58, 59
shmid ds, data structure, 58
sigaction data structure, 52, 53
signal data structure, 52
Signals, 51