
Locking in OS Kernels for SMP Systems

From the seminar


Hot Topics in Operating Systems
TU Berlin, March 2006
Arwed Starke
Abstract: When designing an operating system kernel for a shared memory
symmetric multiprocessor system, shared data has to be protected from
concurrent access. Critical issues in this area are the increasing code
complexity as well as the performance and scalability of an SMP kernel. An
introduction to SMP-safe locking primitives and to how locking can be applied
to SMP kernels will be given, and we will focus on how to increase
scalability by reducing lock contention, and on the growing negative impact
of caches and memory barriers on locking performance. New, performance-aware
approaches for mutual exclusion in SMP systems will be presented
that made it into today's Linux 2.6 kernel: the SeqLock and the read-copy-
update (RCU) mechanism.
1 Introduction
1.1 Introduction to SMP systems
As Moore's law is about to fail, since clock speeds can no longer be doubled
every year as they used to be in the "good old times", most of the gains in computing
power are now achieved by increasing the number of processors or processing units
working in parallel. The triumph of SMP systems is inevitable.
The abbreviation SMP stands for tightly coupled, shared memory symmetric multiprocessor
system. A set of equal CPUs accesses a common physical memory (and I/O ports) via a
shared front side bus. Thus, the FSB becomes a contended resource. A bus master manages
all read/write accesses to the bus. A read or write operation is guaranteed to complete
atomically, which means before any other read or write operation is carried out on the bus.
If two CPUs access the bus within the same clock cycle, the bus master
nondeterministically (from the programmer's view) selects one of them to be first to access
the bus. If a CPU accesses the bus while it is still occupied, the operation is delayed. This
can be seen as a hardware measure of synchronisation.
1.2 Introduction to Locking
4" more than one process can access data at the same time, as is the case in preempti*e
m#ltitasing systems and SM( systems, m#t#al e'cl#sion m#st $e introd#ced to protect this
shared data%
!e can di*ide m#t#al e'cl#sion into three classes. Short)term m#t#al e'cl#sion, short)term
m#t#al e'cl#sion with interr#pts, and long)term m#t#al e'cl#sion 6Sch789% -et #s tae a
loo at the typical #niprocessor 0U(2 ernel sol#tions "or these pro$lem classes, and why
they do not wor "or SM( systems%
Short-term mutual exclusion re"ers to pre*enting race conditions in short critical sections%
They occ#r, when two processes access the same data str#ct#re in memory 3at the same
time3, th#s ca#sing inconsistent states o" data% On U( systems, this co#ld only occ#r i" one
process is preempted $y the other% To protect critical sections, they are g#arded with some
sort o" preempt_disable5 preempt_enable call to disa$le preemption, so a process
can "inish the critical section witho#t $eing interr#pted $y another process% 4n a non)
preempti*e ernel, no meas#res ha*e to $e taen at all% Un"ort#nately, this does not wor
"or SM( systems, $eca#se processes do not ha*e to $e preempted to r#n 3parallel3, there
can $e two processes e'ec#ting the e'act same line o" code at the e'act same time% +o
disa$ling o" preemption will pre*ent that%
Short-term mutual exclusion with interrupts involves interrupt handlers that access shared
data. To prevent interrupt handler code from interrupting a process in a critical section, it is
sufficient to guard a critical section in process context with some sort of cli/sti
(disable/enable all interrupts) call. Unfortunately, this approach does not work on SMP
systems either, because interrupts on all other CPUs are still active and can execute the
interrupt handler code at any time.
Long-term mutual exclusion refers to processes being held up accessing a shared resource
for a longer time. For example, once a write system call to a regular file begins, it is
guaranteed by the operating system that any other read or write system calls to the same file
will be held until the current one completes. A write system call may require one or more
disk I/O operations in order to complete the system call. Disk I/O operations are relatively
long operations when compared to the amount of work that the CPU can accomplish during
that time. It would therefore be highly undesirable to inhibit preemption for such long
operations, because the CPU would sit idle waiting for the I/O to complete. To avoid this,
the process executing the write system call needs to allow itself to be preempted so other
processes can run. As you probably already know, semaphores are used to solve this
problem. This also holds true for SMP systems.
2 The basic SMP locking primitives
When we talk about mutual exclusion, we mean that we want changes to appear as if they
were an atomic operation. If we cannot update data with an atomic operation, we need to
make an update uninterruptible and sequentialize it with all other processes that could
access the data. But sometimes, we can.
2.1 Atomic O"erations
Most SMP architectures possess some operations that read and change data within a single,
uninterruptible step, called atomic operations. Common atomic operations are test and set
(TSL), which returns the current value of a memory location and replaces it with a given
new value; compare and swap (CAS), which compares the content of a memory location
with a given value and, if they are equal, replaces it with a given new value; or the load link/
store conditional instruction pair (LL/SC). Many SMP systems also feature atomic
arithmetical operations: addition of a given value, subtraction of a given value, atomic
increment and decrement, among others.
The table below is an example of how the line counter++ might appear in assembler
code ([Sch94]). If this line is executed at the same time by two CPUs, the result is wrong,
because the operation is not atomic.
 Time | CPU 1: Instruction Executed | Register R0 | Value of Counter | CPU 2: Instruction Executed | Register R0
 -----+-----------------------------+-------------+------------------+-----------------------------+------------
   1  | load  R0, counter           |      0      |        0         | load  R0, counter           |      0
   2  | add   R0, 1                 |      1      |        0         | add   R0, 1                 |      1
   3  | store R0, counter           |      1      |        1         | store R0, counter           |      1
To solve such problems without extra locking, one can use an atomic increment operation
as shown in Listing 1. (In Linux, the atomic operations are defined in atomic.h. Operations
not supported by the hardware are emulated with critical sections.)
The shared data is still there, but the critical section could be eliminated. Atomic updates
atomic_t counter = ATOMIC_INIT(0);
atomic_inc(&counter);
Listing 1: Atomic increment in Linux.
can be done on several common occasions, for example the replacement of a linked list
element (Listing 2). Not even a special atomic operation is necessary to do that.
Non-blocking synchronisation algorithms rely solely on atomic operations.
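As an illustration of such lock-free updates (this sketch is not from the original text), a compare-and-swap loop can emulate the atomic increment; the GCC builtin __sync_val_compare_and_swap stands in here for the hardware CAS instruction described above:

#include <stdio.h>

/* Sketch: a lock-free increment built on compare-and-swap. The GCC
 * builtin is only a stand-in for the hardware CAS instruction. */
static void nonblocking_inc(volatile int *counter)
{
    int old, new;
    do {
        old = *counter;            /* read the current value          */
        new = old + 1;             /* compute the desired new value   */
        /* retry if another CPU changed the counter in the meantime   */
    } while (__sync_val_compare_and_swap(counter, old, new) != old);
}

int main(void)
{
    volatile int counter = 0;
    nonblocking_inc(&counter);
    printf("%d\n", counter);       /* prints 1 */
    return 0;
}

The loop either succeeds immediately or repeats with a fresh value, so no process ever blocks; this is the pattern non-blocking synchronisation algorithms are built from.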
2.2 S"in Locks
Spin locks are based on some atomic operation, for example test and set. The principle is
simple: a flag variable indicates whether a process is currently in the critical section (lock_var =
1) or no process is in the critical section (lock_var = 0). A process spins (busy waits) until
the lock is reset, then sets the lock. Testing and setting of the lock status flag must be done
in one step, with an atomic operation. To release the lock, a process resets the lock variable.
Listing 3 shows a possible implementation for lock and unlock.
Note that a spin lock cannot be acquired recursively – it would deadlock on the second call
to lock. This has two consequences: a process holding a spin lock may not be preempted,
or else a deadlock situation could occur; and spin locks cannot be used within interrupt
handlers, because if an interrupt handler tries to acquire a lock that is already held by the
process it interrupted, it deadlocks.
2.2.1 IRQ-Safe Spin Locks
The Linux kernel features several spin lock variants that are safe for use with interrupts.
A critical section in process context is guarded by spin_lock_irq and
spin_unlock_irq, while critical sections in interrupt handlers are guarded by the
normal spin_lock and spin_unlock. The only difference between these functions is
that the irq-safe versions of spin_lock disable all interrupts on the local CPU for the
critical section. The possibility of a deadlock is thereby eliminated.
Figure 1 shows how two CPUs interact when trying to acquire the same irq-safe spin lock.
"" set up ne# element
ne#$%data = some_data;
ne#$%ne&t = old$%ne&t;
"" replace old element #it' it
pre($%ne&t = ne#;
Listing 2: Atomical update of single linked list replace
!old! with !new!
void lock(volatile int *lock_var_p) {
    while (test_and_set_bit(0, lock_var_p) == 1);
}

void unlock(volatile int *lock_var_p) {
    *lock_var_p = 0;
}

Listing 3: Spin lock [Sch94]
While CPU 1 (in process context) is holding the lock, any incoming interrupt requests on
CPU 1 are stalled until the lock is released. An interrupt on CPU 2 busy waits on the lock,
but it does not deadlock. After CPU 1 releases the lock, CPU 2 (in interrupt context) obtains
it, and CPU 1 (now executing the interrupt handler) waits for CPU 2 to release it.

Figure 1: IRQ-safe spin locks
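A minimal sketch of this usage pattern, assuming a Linux 2.6-style API (my_lock, shared_count and the two functions are invented names; spin_lock_irqsave is the variant used instead when the previous interrupt state must be preserved):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);
static int shared_count;

/* Called from process context: the _irq variant additionally disables
 * interrupts on the local CPU, so the interrupt-side code below can
 * never deadlock against us on this CPU. */
void process_side_update(void)
{
        spin_lock_irq(&my_lock);
        shared_count++;
        spin_unlock_irq(&my_lock);
}

/* Called from an interrupt handler: interrupts need not be disabled
 * again, the plain spin lock only excludes the other CPUs. */
void interrupt_side_update(void)
{
        spin_lock(&my_lock);
        shared_count++;
        spin_unlock(&my_lock);
}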
2.2.2 Enhancements of the simple Spin Lock
Sometimes it is desirable to allow spin locks to be nested. To do so, the spin lock is extended
by a nesting counter and a variable indicating which CPU holds the lock. If the lock is held,
the lock function checks whether the current CPU is the one holding the lock. In this case, the
nesting counter is incremented and it exits from the spin loop. The unlock function
decrements the nesting counter. The lock is released when the unlock function has been
called the same number of times the lock function was called before. This kind of spin lock
is called a recursive lock.
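A simplified sketch of such a recursive lock (this is not the actual BKL implementation; test_and_set and this_cpu_id are assumed primitives, and preemption is assumed to be disabled while the lock is held):

/* Sketch of a recursive (nested) spin lock. */
struct rec_lock {
        volatile int lock_var;   /* 0 = free, 1 = taken              */
        int owner;               /* CPU currently holding the lock   */
        int nesting;             /* how often the owner has locked   */
};

void rec_lock_acquire(struct rec_lock *l)
{
        if (l->lock_var && l->owner == this_cpu_id()) {
                l->nesting++;            /* already ours: just nest  */
                return;
        }
        while (test_and_set(&l->lock_var) == 1)
                ;                        /* busy wait                */
        l->owner = this_cpu_id();
        l->nesting = 1;
}

void rec_lock_release(struct rec_lock *l)
{
        if (--l->nesting == 0) {
                l->owner = -1;
                l->lock_var = 0;         /* really release the lock  */
        }
}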
Locks can also be modified to allow blocking. Linux's and FreeBSD's big kernel lock is
dropped if the process holding it sleeps (blocks), and reacquired when it wakes up.
Solaris 2.x provides a type of locking known as adaptive locks. When one thread attempts
to acquire one of these that is held by another thread, it checks to see if the second thread is
active on a processor. If it is, the first thread spins. If the second thread is blocked, the first
thread blocks as well.
2.( Sema"!ores )mute*+
Aside "rom the classical #se o" semaphores e'plained in section <%<, semaphores 0initiali;ed
with a co#nter *al#e o" <2 can also $e #sed "or protecting critical sections% For per"ormance
reasons, semaphores #sed "or this ind o" wor are o"ten reali;ed as a separate primiti*e,
called m#te', that replaces the co#nter with a simple loc stat#s "lag%
Using m#te'es instead o" spin locs is prod#cti*e, i" the critical section taes longer than a
conte't switch% Alse, the o*erhead o" $locing compared to $#sy waiting "or a loc to $e
released is worse% On the pro side, m#te'es imply a ind o" "airness, while processes co#ld
B
(igure 1: )*+-safe spin locks
star*e on hea*ily contended spin locs% M#te'es can not $e #sed in interr#pts, $eca#se it is
generally not allowed to $loc in interr#pt conte't%
A semaphore is a comple' shared data str#ct#re itsel", and m#st there"ore $e protected $y
an own spin loc%
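A minimal sketch of this usage with the Linux 2.6-era semaphore API (my_sem and update_shared_state are invented names; newer kernels provide a dedicated struct mutex with mutex_lock/mutex_unlock instead):

#include <linux/errno.h>
#include <linux/semaphore.h>    /* <asm/semaphore.h> on older 2.6 kernels */

static DECLARE_MUTEX(my_sem);   /* a semaphore initialised with count 1 */

int update_shared_state(void)
{
        if (down_interruptible(&my_sem))   /* may sleep: never use in interrupts */
                return -EINTR;
        /* ... long critical section, e.g. waiting for disk I/O ... */
        up(&my_sem);
        return 0;
}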
2.4 Reader-Writer Locks
As reading a data structure does not affect the integrity of the data, it is not necessary to
mutually exclude two processes from reading the same data at the same time. If a data
structure is read often, allowing readers to operate in parallel is a great advantage for SMP
software.
An rwlock keeps count of the readers currently holding a read-only lock and has a queue
for both waiting writers and waiting readers. If the writer queue is empty, new readers may
grab the lock. If a writer enters the scene, it has to wait for all readers to complete, then it
gets an exclusive lock. Meanwhile, arriving writers or readers are queued until the write
lock is dropped. Then all readers waiting in the queue are allowed to enter, and the game
starts anew (waiting writers are put on hold after a writer completes, to prevent starvation of
readers). Figure 2 shows a typical sequence.

Figure 2: Reader/writer lock

The rwlock involves a marginal overhead, but should yield almost linear scalability for
read-mostly data structures (we will see about this later).
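As an illustration of this interface (not taken from the original text), the Linux rwlock_t could be used as sketched below; the protected variable and the function names are invented:

#include <linux/spinlock.h>

static DEFINE_RWLOCK(my_rwlock);
static int config_value;

int read_config(void)
{
        int v;
        read_lock(&my_rwlock);   /* many readers may hold this in parallel */
        v = config_value;
        read_unlock(&my_rwlock);
        return v;
}

void write_config(int v)
{
        write_lock(&my_rwlock);  /* waits for all readers, then excludes everyone */
        config_value = v;
        write_unlock(&my_rwlock);
}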
3 Locking Granularity in SMP Kernels
3.1 Giant Locking
The designers o" the -in#' operating systems did not ha*e to worry m#ch a$o#t m#t#al
e'cl#sion in their #niprocessor ernels, $eca#se they made the whole ernel non)
preempti*e 0see section <%<2% The "irst -in#' SM( ernel 0*ersion 2%02 #sed the most simple
approach to mae the traditional U( ernel code wor on m#ltiple &(Us. 4t protected the
6
(igure 2: reader,writer lock
whole ernel with a single loc, the $ig ernel loc 0BC-2% The BC- is a spin loc that
co#ld $e nested and is $locing)sa"e 0see section 2%2%22%
There co#ld $e no two &(Us in the ernel at the same time% The only ad*antage o" this was
that the rest o" the ernel co#ld $e le"t #nchanged%
3.2 Coarse-grained Locking
In Linux 2.2, the BKL was removed from the kernel entry points, and each subsystem was
protected by a lock of its own. Now a file system call would not have to wait for a sound driver
routine or a network subsystem call to finish. Still, it was not data that was protected by the
locks, but rather concurrent function calls that were sequentialized. Also, the BKL could not
be removed from all modules, because it was often unclear which data it protected. And
data protected by the BKL could be accessed anywhere in the kernel.
3.3 Fine-grained Locking
Fine-grained locking means that individual data structures, not whole subsystems or modules,
are protected by their own locks. The degree of granularity can be increased from locks
protecting big data structures (like, for example, a whole file system or the whole process
table) to locks protecting individual data structures (for example, a single file or a process
control block) or even single elements of a data structure. Fine-grained locking was
introduced in the Linux 2.4 kernel series, and has been carried further in the 2.6 series.
Fine-grained locking has also been introduced into the FreeBSD operating system by the
SMPng team, into the Solaris kernel, and into the Windows NT kernels as well.
Unfortunately, the BKL is still not dead. Changes to locking code had to be implemented
very cautiously, so as not to bring in hard-to-track-down deadlock failures. So, every time the
BKL was considered useless for a piece of code, it was moved into the functions this code
called, because it was not always obvious whether these functions relied on the BKL. Thus, the
occurrences of the BKL increased even more, and module maintainers did not always react to
calls to remove the BKL from their code.

Figure 3: Visual description of locking granularity in OS kernels [Kag05]
4 Performance Considerations
The Linux 2.0.40 kernel contains a total of 17 BKL calls, while the Linux 2.4.30 kernel contains a
total of 226 BKL, 329 spin lock and 121 rwlock calls. The Linux 2.6.11.7 kernel contains
101 BKL, 1717 spin lock and 349 rwlock calls, as well as 56 seq lock and 14 RCU calls (more
on these synchronisation mechanisms later) (numbers taken from [Kag05]).
The reason why the Linux programmers took so much work upon themselves is that coarse-
grained kernels scale poorly on more than 3-4 CPUs. The optimal performance for an n-CPU
SMP system is n times the performance of a 1-CPU system of the same type. But this
optimal performance can only be achieved if all CPUs are doing productive work all the
time. Busy waiting on a lock wastes time, and the more contended a lock is, the more
processors will likely busy wait to get it.
Hence, kernel developers run special lock contention benchmarks to detect which locks
have to be split up to distribute the invocations on them. The lock variables are extended by
a lock-information structure that contains a counter for hits (successful attempts to grab a
lock), misses (unsuccessful attempts to grab a lock), and spins (total number of waiting loops)
[Cam93]. The ratio of spins to misses shows how contended a lock is. If this number is
high, processes waste a lot of time waiting for the lock.
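A sketch of what such an instrumented lock might look like (the structure and the update rules are illustrative, following the description above and the column names of Figure 4; test_and_set is an assumed primitive, and the counter updates are deliberately left unsynchronised for brevity):

/* Illustrative instrumented spin lock; not a real kernel interface. */
struct lock_info {
        volatile int lock_var;    /* 0 = free, 1 = taken               */
        unsigned long hits;       /* successful attempts to grab lock  */
        unsigned long misses;     /* unsuccessful attempts             */
        unsigned long spins;      /* total number of waiting loops     */
};

void instrumented_lock(struct lock_info *l)
{
        while (test_and_set(&l->lock_var) == 1) {
                l->misses++;                 /* this attempt failed    */
                while (l->lock_var == 1)
                        l->spins++;          /* one busy-wait loop     */
        }
        l->hits++;                           /* finally acquired       */
}

With counters like these, spins divided by misses gives the average waiting time per failed attempt, which is the contention measure shown in Figure 4.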
Measuring lock contention is a common practice to look for bottlenecks. Bryant and
Hawkes wrote a specialized tool to measure lock contention in the kernel, which they used
to analyze filesystem performance [Bry02]. Others [Kra01] focused on contention in the
2.4.x scheduler, which has since been completely rewritten. Today, the Linux scheduler
mostly operates on per-CPU ready queues and scales fine up to 512 CPUs.
Lock contention is most pronounced with applications that access shared resources, such as the
virtual filesystem (VFS) and network, and applications that spawn many processes. Etsion
et al. used several benchmarks that stress these subsystems as an example of how the
KLogger kernel logging and analysis tool can be used for measuring lock contention, using
varying degrees of parallelization: they measured the percentage of time spent spinning on
locks during a parallel make compilation of the Linux kernel, Netperf (a network
performance evaluation tool), and an Apache web server with Perl CGI being stress-tested
[Eti05] (see Figure 5).
Measurements like these help to spot and eliminate locking bottlenecks.
Figure 4: Extract from a lock-contention benchmark on Unix SVR4/MP [Cam93]

                            hits      misses   spins    spins/miss
  PageTableLockInfo         9,656     80       9,571    120
  DispatcherQueueLockInfo   49,979    382      30,508   80
  SleepHashQueueLockInfo    25,549    708      56,192   79
4.1 Scalability
Simon Kågström made similar benchmarks to compare the scalability of the Linux kernel
on 1-8 CPUs from version 2.0 to 2.6. He measured the relative speedup with the Postmark
benchmark, relative to the (giant locked) 2.0.40 UP kernel (Figure 6).
The result of this benchmark is not surprising. As we can see, the more we increase locking
granularity (Linux 2.6), the better the system scales. But how far can we increase locking
granularity?

Figure 5: Percentage of cycles spent spinning on locks for each of the test applications [Eti05]
Figure 6: Postmark benchmark of several Linux kernels [Kag05]
O" co#rse, we cannot ignore the increasing comple'ity ind#ced $y "iner locing gran#larity%
As more locs ha*e to $e held to per"orm a speci"ic operation, the ris o" deadloc
increases% There is an ongoing disc#ssion in the -in#' comm#nity a$o#t how m#ch locing
hierarchy is too m#ch% !ith more locing comes more need "or doc#mentation o" locing
order, or need "or tools lie deadloc analy;ers% :eadloc "a#lts are among the most
di""ic#lt to come $y%
The o*erhead o" locing operations matters as well% The &(U does not only spend time in a
critical section, it also taes some time to ac/#ire and to release a loc% &ompare the graph
o" the 2%6 ernel with the 2%8 ernel "or less than "o#r &(Us. The ernel with more locs is
the slower one% The e""iciency o" e'ec#ting a critical section can $e mathematically
e'pressed as. Time within critical section 5 0Time within critical section L Time to ac/#ire
loc L Time to release loc2% 4" yo# split a critical section into two, the time to ac/#ire and
release a loc can $e ro#ghly m#ltiplied $y two%
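As an illustration of this formula (the numbers are invented, not measured): if acquiring and releasing the lock each cost 100 ns and the critical section itself runs for 800 ns, the efficiency is 800 / (800 + 100 + 100) = 80%. Splitting that section into two critical sections of 400 ns each drops the efficiency to 400 / (400 + 100 + 100) ≈ 67%, although the useful work inside the critical sections is unchanged.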
Surprisingly, even a single lock acquisition is generally one too many, and the
performance penalty of a simple lock acquisition, even if successful at the first attempt,
is becoming worse and worse. To understand why, we have to forget the basic model of a
simple scalar processor without caches and look at today's reality.
4.2 Performance Penalty of Lock Operations
Image 1 shows typical instruction costs of several operations on an 8-CPU 1.45 GHz PPC
system (1). The gap between normal instructions, cache-hitting memory accesses (not listed
here; they are generally 3-4 times faster than an atomic increment operation) and a lock
operation becomes obvious.

(1) If you wonder why an instruction takes less time than a CPU cycle, remember that we
are looking at an 8-CPU SMP system, and view these numbers as "typical instruction cost".

Image 1: Instruction costs on an 8-CPU 1.45 GHz PPC system [McK05]

Let us look at the architecture of today's SMP systems and its impact on our spin lock.
4.2.1 Caches
While CPU power has increased roughly by a factor of 2 each year, memory speeds have not kept
pace, increasing by only 10 to 15% each year. Thus, memory operations impose a big
performance penalty on today's computers.
As a consequence of this development, small SRAM caches were introduced, which are
much faster than main memory. Due to temporal and spatial locality of reference in
programs (see [Sch94] for an explanation), even a comparatively small cache achieves hit
ratios of 90% and higher. On SMP systems, each processor has its own cache. This has the
big advantage that cache hits cause no load on the common memory bus, but it introduces
the problem of cache consistency.
When a memory word is accessed by a CPU, it is first looked up in the CPU's local cache.
If it is not found there, the whole cache line (2) containing the memory word is copied into the
cache. This is called a cache miss (to increase the number of cache hits, it is thus very
advisable to align data along cache lines in physical memory and operate on data structures
that fit within a single cache line). Subsequent read accesses to that memory address will
cause a cache hit. But what happens on a write access to a memory word that lies in the
cache? This depends on the "write policy".

(2) To find data in the cache, each line of the cache (think of the cache as a spreadsheet) has
a tag containing its address. If one cache line consisted of only one memory word, a lot of
lines and thus a lot of address tags would be needed. To reduce this overhead, cache lines
usually contain about 32-128 bytes, accessed by the same tag, and the least significant bits
of the address serve as the byte offset within the cache line.

"Write through" means that after every write access, the cache line is written back to main
memory. This ensures consistency between the cache and memory (and between all caches
of an SMP system, if the other caches snoop the bus for write accesses), but it is also the
slowest method, because a memory access is needed on every write access to a cache line.
The "write back" policy is much more common. On a write access, data is not written back
to memory immediately, but the cache line gets a "modified" tag. If a cache line with a
modified tag is finally replaced by another line, its content is written back to memory.
Subsequent write operations hit in the cache as long as the line is not removed.
On SMP systems, the same piece of physical memory could lie in more than one cache.
The SMP architecture needs a protocol to ensure consistency between all caches. If two
CPUs want to read the same memory word from their caches, everything goes well. In
addition, both read operations can execute at the same time. But if two CPUs wanted to
write to the same memory word in their caches at the same time, there would be a modified
version of this cache line in both caches afterwards and thus, two versions of the same
cache line would exist. To prevent this, a CPU trying to modify a cache line has to get the
"exclusive" right on it. With that, the cache line is marked invalid in all other caches.
Another CPU trying to modify the cache line has to wait until the first CPU drops the exclusive
right, and has to re-read the cache line from that CPU's cache.
-et #s loo at the e""ects o" the simple spin loc code "rom -isting =, i" a loc is held $y
&(U <, and &(U 2 and = wait "or it.
The loc *aria$le is set $y the test>and>set operation on e*ery spinning cycle% !hile &(U
< is in the critical section, &(U 2 and = constantly read and write to the cache line
containing the loc *aria$le% The line is constantly trans"erred "rom one cache to the other,
$eca#se $oth &(Us m#st ac/#ire an e'cl#si*e copy o" the line when they test)and)set the
loc *aria$le again% This is called 3cache line $o#ncing3, and it imposes a $ig load on the
memory $#s% The impact on per"ormance wo#ld $e e*en worse i" the data protected $y the
loc was also lying in the same cache line%
!e can howe*er modi"y the implementation o" the spin loc to "it the "#nctionality o" a
2 To "ind data in the cache, each line o" the cache 0thin o" the cache as a spreadsheet2 has
a tag containing it,s address% 4" one cache line wo#ld consist o" only one memory word,
a lot o" lines and th#s, a lot o" address tags, wo#ld $e needed% To red#ce this o*erhead,
cache lines contain #s#ally a$o#t =2)<2G $ytes, accessed $y the same tag, and the least
signi"icant $its o" the address ser*e as the $yte o""set within the cache line%
<<
cache%
The atomic read-modify-write operation cannot possibly acquire the lock while it is held by
another processor. It is therefore unnecessary to use such an operation until the lock is
freed. Instead, other processors trying to acquire a lock that is in use can simply read the
current state of the lock and only use the atomic operation once the lock has been freed.
Listing 4 gives an alternate implementation of the lock function using this technique.
Here, one attempt is made to acquire the lock before entering the inner loop, which then
waits until the lock is freed. If the lock has already been taken again at the test_and_set operation,
the CPU spins again in the inner loop. CPUs spinning in the inner loop only work on a
shared cache line and do not request the cache line exclusively. They work cache-local and
do not waste bus bandwidth. When CPU 1 releases the lock, it marks the cache line
exclusive and sets the lock variable to zero. The other CPUs re-read the cache line and try
to acquire the lock again.
Nevertheless, spin lock operations are still very time-consuming, because they usually
involve at least one cache line transfer between caches or from memory.
4.2.2 Memory Barriers
With the superscalar architecture, parallelism was introduced into the CPU cores. In a
superscalar CPU, there are several functional units of the same type, along with additional
circuitry to dispatch instructions to the units. For instance, most superscalar designs include
more than one arithmetic-logical unit. The dispatcher reads instructions from memory and
decides which ones can be run in parallel, dispatching them to the units. The
performance of the dispatcher is key to the overall performance of a superscalar design:
the units' pipelines should be as full as possible. A superscalar CPU's dispatcher hardware
therefore reorders instructions for optimal throughput. This holds true for load/store
operations as well.
For example, imagine a program that adds two integers from main memory. The first
argument that is fetched is not in the cache and must be fetched from main memory.
Meanwhile, the second argument is fetched from the cache. The second load operation is
likely to complete earlier. Meanwhile, a third load operation can be issued. The dispatcher
uses interlocks to prevent the add operation from being issued before the load operations it
depends on have finished.
Also, most modern CPUs sport a small register set called store buffers, where several store
operations are gathered to be executed at once at a later time. They can be buffered in order
(which is called total store ordering) or – as is common with superscalar CPUs – out of order
(partial store ordering). In short: as long as a load or store operation does not access the
same memory word as a prior store operation (or vice versa), they can be executed in any
possible order by the CPU.
void lock(volatile lock_t *lock_status)
{
    while (test_and_set(lock_status) == 1)
        while (*lock_status == 1); // spin
}

Listing 4: Spin lock implementation avoiding excessive cache line bouncing [Sch94]
This requires further measures to ensure the correctness of SMP code. The simple atomic list
insert code from section 2.1 could be executed as shown in Figure 7.
The method requires the new node's next pointer (and all its data) to be initialized before
the new element is inserted at the list's head. If these instructions are executed out of order, the list
will be in an inconsistent state until the second instruction completes. Meanwhile, another
CPU could traverse the list, the thread could be preempted, etc.
It is not necessary that both operations are executed right after each other, but it is
important that the first one is executed before the second. To force completion of all read or
write operations in the instruction pipeline before the next operation is fetched, superscalar
CPUs have so-called memory barrier instructions. We distinguish read memory barriers
(wait until all pending read operations have completed), write memory barriers, and full
memory barriers (wait until all pending memory operations, read and write, have
completed). Correct code would read:
Instruction reordering can also cause operations in a critical section to "bleed out" (Figure
8). The line that claims to be in a critical section obviously is not, because the operation
releasing the lock variable was executed earlier. Another CPU could long ago have altered a
part of the data, with unpredictable results.
new->next = i->next;
smp_wmb(); // write memory barrier!
i->next = new;

Listing 5: Correct code for atomic list insertion on machines without a sequential memory model
Figure 7: Impact of a non-sequential memory model on the atomic list insertion algorithm.
Code in memory: new->next = i->next; i->next = new; – possible execution order: i->next = new; new->next = i->next;
To prevent this, we have to alter our locking operations again (note that it is not necessary
to prevent load/store operations prior to the critical section from "bleeding" into it; and of course,
dispatcher units do not override the logical instruction flow, so every operation in the
critical section will be executed after the CPU exits the while loop):
A memory barrier flushes the store buffer and stalls the pipelines (to carry out all pending
read/write operations before new ones are executed), so it negatively impacts
performance in proportion to the number of pipeline stages and the number of functional units.
This is why the memory barrier operations take so much time in the chart presented earlier.
Atomic operations take so long because they also flush the store buffer in order to be
carried out immediately.
4.2.3 Hash-table Benchmark
Below are the results of a benchmark that performed search operations on a hash table with
a dense array of buckets, doubly linked hash chains, and one element per hash chain. Among
the locking designs tested for this hash table were: global spin lock, global reader/writer
lock, per-bucket spin lock and rwlock, and Linux's big reader lock.
void lock(volatile lock_t *lock_var)
{
    while (test_and_set(lock_var) == 1)
        while (*lock_var == 1); // spin
}

void unlock(volatile lock_t *lock_var)
{
    mb(); // read/write memory barrier
    *lock_var = 0;
}

Listing 6: Spin lock with memory barrier
Figure 8: Impact of weak store ordering on critical sections.
Code in memory: data.foo = y; /* in critical section */ data.next = &bar; *lock_var = 0; /* unlock */ – possible execution order: *lock_var = 0; /* unlock */ data.foo = y; data.next = &bar;
Especially the results for the allegedly parallel reader/writer locks may seem surprising, but
they only support what was said in the last two sections: the locking instructions' overhead
thwarts any parallelism. The explanation for this is rather simple (Figure 9): the acquisition
and release of rwlocks take so much time (remember the cache line bouncing etc.) that the
actual critical section is no longer executed in parallel.

Figure 9: Effects of cache line bouncing and memory synchronisation delay on rwlock's efficiency [McK03]

Image 2: Performance of several locking strategies for a hash table [McK05]
5 Lock-Avoiding Synchronisation Primitives
M#ch e""ort has $een p#t in de*eloping synchronisation primiti*es that a*oid locs% -oc)
"ree and wait)"ree synchronisation also plays a maOor part in real time operating systems,
where time g#arantees m#st $e gi*en% Another way to red#ce loc contention is $y #sing
per)&(U data%
Two new synchronisation mechanisms that get $y totally witho#t locing or atomic
operations on the reader side were introd#ced into the -in#' 2%6 ernel to address the a$o*e
iss#es. The Se/-oc and the 1ead)&opy)Update mechanism% !e will disc#ss them in
partic#lar%
5.1 Seq Locks
Seq locks (short for sequence locks) are a variant of the reader-writer lock, based on spin
locks. They are intended for short critical sections and tuned for fast data access and low
latency.
In contrast to the rwlock, a sequence lock grants writer access immediately, regardless of
any readers. Writers have to acquire a writer spin lock, which provides mutual exclusion
among multiple writers. They can then alter the data without paying regard to possible readers.
Therefore, readers do not need to acquire any lock to synchronise with possible
writers either. Thus, a read access generally does not acquire a lock, but readers are in charge of
checking whether they have read valid data: if a write access took place while the data was read,
the data is invalid and has to be read again. The identification of write accesses is realized with a
counter (see Figure 10). Every writer increments this zero-initialized counter once before
changing any data, and again after all changes are done. The reader reads the counter
value before reading the data, then compares it to the current counter value after reading
the data. If the counter value has increased, the read was disturbed by one or more
concurrent writers and the data has to be read again. Also, if the counter value was odd
at the beginning of the read-side critical section, a writer was already in progress while the data was
read and it has to be discarded. So, strictly speaking, the data is only valid if
((count_pre == count_post) && (count_pre % 2 == 0)); otherwise the reader must retry.
In the worst case, the readers would have to loop infinitely if there were a never-ending chain
of writers. But under normal conditions, the readers read the data successfully within a few
tries, if not the first one. By minimizing the time spent in the read-side critical section, the
probability of being interrupted by a writer can be reduced greatly. Therefore, it is part of
the method to only copy the shared data in the critical section and work on it later.
Listing 7 shows how to read shared data protected by the seq lock functions of the Linux
kernel. time_lock is the seq lock variable, but read_seqbegin and read_
seqretry only read the seq lock's counter and do not access the lock variable.
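For completeness, here is a sketch (not from the original text) of the corresponding writer side using the Linux seq lock API; time_lock is meant to be the seqlock_t read in Listing 7, while the SEQLOCK_UNLOCKED initializer and the protected variable are assumptions that may differ between kernel versions:

#include <linux/seqlock.h>

static seqlock_t time_lock = SEQLOCK_UNLOCKED;
static unsigned long time;

void update_time(unsigned long new_time)
{
        write_seqlock(&time_lock);   /* takes the writer spin lock and
                                        increments the counter (now odd) */
        time = new_time;
        write_sequnlock(&time_lock); /* increments the counter again
                                        (now even) and releases the lock */
}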
5.2 The Read-Copy-Update Mechanism
As you can see, synchronisation is a mechanism and a coding convention. The coding
convention for mutual exclusion with a spin lock is that you have to hold the lock before you
access the data protected by it, and that you have to release it after you are done. The
coding convention for non-blocking synchronisation is that every data manipulation needs
only a single atomic operation (e.g. CAS, CAS2, or our atomic list update example).
unsigned long seq;
do {
    seq = read_seqbegin(&time_lock);
    now = time;
} while( read_seqretry(&time_lock, seq) );
// value in 'now' can now be used

Listing 7: Seq Lock: read-side critical section
Figure 10: Seq Lock schematics (figure based upon [Qua04])
The RCU mechanism is based on something called quiescent states. A quiescent state is a
point in time where a process that has been reading shared data does not hold any
references to this data any more. With the RCU mechanism, processes can enter a read-
side critical section at any time and can assume that the data they read is consistent as long as
they work on it. But after a process leaves its read-side critical section, it must not hold any
references to the data any longer. The process enters the quiescent state.
This imposes some constraints on how to update shared data structures. As readers do not
check whether the data they read is consistent (as in the SeqLock mechanism), writers have to
apply all their changes with one atomic operation. If a reader reads the data before the
update, it sees the old state; if the reader reads the data after the update, it sees the new
state. Of course, readers should read data once and then work with it, and not read the same
data several times and then fail if they read differing versions. Consider a linked list
protected by RCU. To update an element of the list (see Figure 11), the writer has to read
the old element's contents, make a copy of them (1), update this copy (2), and then
exchange the old element with the new one with an atomic operation (writing the new
element's address to the previous element's next pointer) (3).
As you can see, readers could still read stale data even after the writer has finished
updating, if they entered the read-side critical section before the writer finished (4).
Therefore, the writer cannot immediately delete the old element. It has to defer the
destruction until all processes that were in a read-side critical section at the time the writer
finished have dropped their references to the stale data (5), or, in other words, have entered the
quiescent state. This time span is called the grace period. After that time, there can be
readers holding references to the data, but none of them could possibly reference the old
data, because they started at a time when the old data was not visible any more. The old
element can be deleted (6).
Figure 11: The six steps of the RCU mechanism (diagram showing the prev, cur/old, new and next list elements through steps 1-6)
The RCU mechanism requires data that is stored within some sort of container that is
referenced by a pointer. The update step consists of changing that pointer. Thus, linked lists
are the most common type of data protected by RCU. Insertion and deletion of elements is
done as presented in section 2.1. Of course, we need memory barrier operations on
machines with weak ordering. More complex updates, like sorting a list, need some other
kind of synchronisation mechanism. If we assume that readers traverse a list in search of an
element once, and not several times back and forth (as we assumed anyway), we can also
use doubly linked lists (see Listing 8).
The RCU mechanism is optimal for read-mostly data structures where readers can tolerate
stale data (it is for example used in the Linux routing table implementation). While readers
generally do not have to worry about much, things get more complex on the writer's side.
First of all, writers have to acquire a lock, just as with the seqlock mechanism. If they
did not, two writers could obtain a copy of a data element, perform their changes, and
then replace it. The data structure would still be intact, but one update would be lost. Second,
writers have to defer the destruction of an old version of the data until some later time. When
exactly is it safe to delete old data?
After all readers that were in a read-side critical section at the time of the update have left
their critical section, or, in other words, have entered a quiescent state (Figure 12). A simple
approach would be to take a counter that indicates how many processes are within a read-side
critical section, and defer destruction of all stale versions of data elements until that counter
reaches zero. But as you can see, later readers are not taken into account, so this approach fails.
We could also include a reference counter in every data element, if the architecture features an
atomic increment operation. Every reader would increment this reference counter as it gets a
reference to the data element, and decrement it when the read-side critical section completes.
static inline void __list_add_rcu(struct list_head *new,
    struct list_head *prev, struct list_head *next)
{
    new->next = next;
    new->prev = prev;
    smp_wmb();
    next->prev = new;
    prev->next = new;
}

Listing 8: Extract from the Linux 2.6.14 kernel's list.h
Figure 12: RCU grace period: after all processes that entered a read-side critical section (gray) before the writer (red) finished have entered a quiescent state, it is safe to delete an old element
If an old data element has a refcount of zero, it can be deleted [McK01]. While this solves
the problem, it reintroduces the performance issues of atomic operations that we wanted to avoid.
Let us assume that a process in an RCU read-side critical section does not yield the CPU.
This means: preemption is disabled while in the critical section and no functions that might
block may be used [McK04]. Then no references can be held across a context switch, and
we can fairly assume a CPU that has gone through a context switch to be in a quiescent
state. The earliest time when we can be absolutely sure that no process on any other CPU is
still holding a reference to stale data is after every other CPU has gone through a context
switch at least once after the writer finished. The writer has to defer destruction of stale data
until then, either by waiting or by registering a callback function that frees the space
occupied by the stale data. This callback function is called after the grace period is over.
The Linux kernel features both variants.
A simple mechanism to detect when all CPUs have gone through a context switch is to start
a high priority thread on CPU 1 that repeatedly reschedules itself on the next CPU until it
reaches the last CPU. This thread then executes the callback functions or wakes up any
processes waiting for the grace period to end. There are more effective algorithms out there
to detect the end of a grace period, but these are out of the scope of this document.
Listing 9 presents the Linux RCU API (without the Linux RCU list API). Note that, while it
is necessary to guard read-side critical sections with rcu_read_lock and
rcu_read_unlock, the only thing these functions do (except for visually highlighting a
critical section) is disabling preemption for the critical section. If the Linux kernel is
compiled with preemption disabled (PREEMPT_ENABLE=no), they do nothing. Write-side critical sections are
protected by spin_lock() and spin_unlock(), and either wait for the grace period afterwards with
synchronize_kernel() or register a callback function to destroy the old element
with call_rcu().
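To illustrate how these pieces fit together, the following sketch (not from the original paper) combines the API from Listing 9 with the RCU list macros; struct item, item_list, items_lock and free_item are hypothetical names, and newer kernels use a two-argument call_rcu() instead of the three-argument form shown in Listing 9.

#include <linux/rcupdate.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Hypothetical RCU-protected list element. */
struct item {
        struct list_head list;
        struct rcu_head rcu;
        int key;
        int value;
};

static LIST_HEAD(item_list);
static DEFINE_SPINLOCK(items_lock);     /* serialises writers only */

/* Reader: no lock and no atomic operation, only a bounded read-side
 * critical section (preemption disabled by rcu_read_lock). */
int lookup_value(int key)
{
        struct item *p;
        int value = -1;

        rcu_read_lock();
        list_for_each_entry_rcu(p, &item_list, list) {
                if (p->key == key) {
                        value = p->value;
                        break;
                }
        }
        rcu_read_unlock();
        return value;
}

/* Callback invoked after the grace period; frees the stale element. */
static void free_item(void *arg)
{
        kfree(arg);
}

/* Writer: unlink under the writer lock, then defer the actual kfree
 * until every CPU has passed through a quiescent state. */
void remove_item(struct item *p)
{
        spin_lock(&items_lock);
        list_del_rcu(&p->list);
        spin_unlock(&items_lock);
        call_rcu(&p->rcu, free_item, p); /* three-argument form as in Listing 9 */
}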
Figure 13: Simple detection of a grace period: a thread runs once on every CPU [McK05]
The RCU mechanism is widely believed to have been developed at Sequent Computer
Systems, which was later bought by IBM, who holds several patents on this technique. The
patent holders have given permission to use this mechanism under the GPL. Therefore,
Linux is currently the only major OS using it. RCU is also part of the SCO claims in the
SCO vs. IBM lawsuit.
6 Conclusion
The introd#ction o" SM( systems has greatly increased the comple'ity o" locing in OS
ernels% 4n order to stri*e "or optimal per"ormance on all plat"orms, the operating system
designers ha*e to meet con"licti*e goals. 4ncrease gran#larity to increase scala$ility o" their
ernels, and red#ce #sing o" locs to increase the e""iciency o" critical sections, and th#s the
per"ormance o" their code% !hile a spin loc can always $e #sed, it is not always the right
tool "or the Oo$% +on)$locing synchronisation, Se/ -ocs or the 1&U mechanism o""er
$etter per"ormance than spin locs or rwlocs% B#t these synchronisation methods re/#ire
more e""ort than a simple replacement% 1&U re/#ires a complete rethining and rewriting
o" the data str#ct#res it is #sed "or, and the code it is #sed in%
4t too hal" a decade "or -in#' "rom it,s "irst giant loced SM( ernel implementation to a
reasona$ly "ine gran#lar% This time co#ld ha*e $een greatly red#ced, i" the -in#' ernel had
$een written as a preempti*e ernel with "ine gran#lar locing "rom the $eginning% !hen it
comes to m#t#al e'cl#sion, it is always a good thing to thin the whole thing thro#gh "rom
the $eginning% Starting with an approach that is #gly $#t wors, and t#ning it to a well
r#nning sol#tion later, o"ten lea*es yo# coding the same thing twice, and e'periencing
greater pro$lems than i" yo# tried to do it nicely "rom the start%
As m#ltiprocessor systems and modern architect#res lie s#perscalar, s#per pipelined, and
hyperthreaded &(Us as well as m#lti)le*el caches $ecome normalcy, simple code that loos
"ine at "irst glance can ha*e se*ere impact on per"ormance% Th#s, programmers need to
ha*e a thoro#gh #nderstanding o" the hardware they write code "or%
Further Reading
This paper detailed the performance aspects of locking in SMP kernels.
void synchronize_kernel(void);

void call_rcu(struct rcu_head *head,
              void (*func)(void *arg),
              void *arg);

struct rcu_head {
    struct list_head list;
    void (*func)(void *obj);
    void *arg;
};

void rcu_read_lock(void);
void rcu_read_unlock(void);

Listing 9: The Linux 2.6 RCU API functions
If you are interested in the implementation complexity of fine-grained SMP kernels or in experiences
from performing a multiprocessor port, please refer to [Kag05]. For a more in-depth
introduction to SMP architecture and caching, read Curt Schimmel's book ([Sch94]).
4" yo# want to gain deeper nowledge o" the 1&U mechanism, yo# can start at (a#l A%
McCenney,s 1&U we$site 0http.55www%rdrop%com5#sers5pa#lmc51&U2%
1e"erences
6Bry029 1% Bryant, 1% Forester, P% Hawes. Filesystem per"ormance and scala$ility in
-in#' 2%8%<D% 5roceedings of the 3senix Annual Dechnical 6onference, 2002
6&am7=9 Mar :% &amp$ell, 1#ss -% Holt. -oc)Fran#larity Analysis Tools in
SE185M(% )222 Software, March <77=
6Ati0B9 Qoa* Atison, :an Tsa"rir et% al%. Fine Frained Cernel -ogging with C-ogger.
A'perience and 4nsights% =ebrew 3ni'ersity, 200B
6Cag0B9 Simon CIgstrJm. (er"ormance and 4mplementation &omple'ity in
M#ltiprocessor Operating System Cernels% Elekinge )nstitute of Dechnology,
200B
6Cra0<9 M% Cra*et;, H% Frane. Anhancing the -in#' sched#ler% 5roceedings of the
.ttawa Linux Symposium, 200<
6McC0<9 (a#l A% McCenney. 1ead)&opy Update% 5roceedings of the .ttawa Linux
Symposium, 2002
6McC0=9 (a#l A% McCenney. Cernel Corner ) Using 1&U in the -in#' 2%B Cernel%
Linux 4aga>ine, 200=
6McC089 (a#l A% McCenney. 1&U *s% -ocing (er"ormance on :i""erent Types o" &(Us%
http:,,www.rdrop.com,users,paulmck,*63,L6A200%.02.1"a.pdf, 2008
6McC0B9 (a#l McCenney. A$straction, 1eality &hecs, and 1&U%
http:,,www.rdrop.com,users,paulmck,*63,*63intro.2001.0@.27bt.pdf, 200B
6H#a089 PRrgen H#ade, A*a)Catharina C#nst. -in#' Trei$er entwiceln% Fpunkt.'erlag,
2008
6Sch789 &#rt Schimmel. U+4S Systems "or Modern Architect#res. Symmetric
M#ltiprocessing and &aching "or Cernel (rogrammers% Addison Gesley, <778