Cache Coherence: Write-Invalidate Snooping Protocol For Write-Back


When a block is first loaded in the cache it is marked "valid".


On a read miss in the local cache, the read request is broadcast on the bus. If another cache has that address in the "dirty" state, it changes the state to "valid" and sends the copy to the requesting node. The "valid" state means that the cache line is current.
When writing a block in state "valid" its state is changed
to "dirty" and a broadcast is sent out to all cache
controllers to invalidate their copies.
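Purely as an illustration of the protocol just described (not code from the presentation), a minimal C sketch of the per-line state machine might look like the following; the three-state model (invalid/valid/dirty) and the bus helper names mentioned in the comments are assumptions made for the example:

    /* Hypothetical sketch of a write-invalidate, write-back snooping
     * protocol; the names are illustrative, not a real API. */
    enum line_state { LINE_INVALID, LINE_VALID, LINE_DIRTY };

    struct cache_line {
        enum line_state state;
        unsigned long   tag;
    };

    /* Local CPU reads and misses: the read request is broadcast on the bus. */
    void on_local_read_miss(struct cache_line *line, unsigned long tag)
    {
        /* bus_broadcast_read(tag) would be snooped by every other controller;
         * whoever holds the line dirty supplies the data (see below). */
        line->tag   = tag;
        line->state = LINE_VALID;       /* the filled line is current */
    }

    /* Another controller snoops the read request for a line it holds dirty. */
    void on_snooped_read(struct cache_line *line)
    {
        if (line->state == LINE_DIRTY) {
            /* ...send the copy to the requesting node... */
            line->state = LINE_VALID;   /* dirty -> valid after sharing */
        }
    }

    /* Local CPU writes a line it holds in state "valid". */
    void on_local_write(struct cache_line *line)
    {
        if (line->state == LINE_VALID) {
            /* bus_broadcast_invalidate(line->tag) tells all other controllers */
            line->state = LINE_DIRTY;
        }
    }

    /* Another controller snoops an invalidate for a line it caches. */
    void on_snooped_invalidate(struct cache_line *line)
    {
        line->state = LINE_INVALID;
    }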

Presentation on theme: "Cache & SpinLocks Udi & Haim. Agenda Caching background Why do we need caching? Caching in modern desktop. Cache writing. Cache coherence. Cache." Presentation transcript:

1

Cache & SpinLocks Udi & Haim

2

Agenda Caching background Why do we need caching? Caching in modern desktop. Cache writing. Cache coherence.

Cache & Spinlocks Agenda Concurrent Systems Synchronization Types Spinlock Semaphore Mutex Seqlocks RCU Spinlock in linux kernel Caching and locking

Cache Why caching? Accessing the main memory is expensive, and is becoming the PC performance bottleneck.

Caching in modern desktop What is caching? "A computer memory with very short access time used for storage of frequently used instructions or data" (webster.com). A modern desktop has at least three caches: TLB (translation lookaside buffer), I-Cache (instruction cache), D-Cache (data cache).

Caching in modern desktop Locality: temporal locality, spatial locality. Cache coloring. Replacement policies: LRU, MRU, direct-mapped cache. Cache performance = the proportion of accesses that result in a cache hit.

8

Cache writing There are two basic approaches:
Write-through: Write is done synchronously both to the cache and to the backing store.
Write-back (or Write-behind): Initially, writing is done only to the cache. The write to the backing store is postponed until the cache blocks containing the data are about to be modified/replaced by new content.
We think you have liked this presentation. If you wish to
Cache writing Two approaches for situations of write-misses:
No-write allocate (aka Write around): The missed-write location is not loaded to cache, and is written directly to the backing store. In this approach, only reads are being cached.
Write allocate (aka Fetch on write): The missed-write location is loaded to cache, followed by a write-hit operation. In this approach, write misses are similar to read-misses.
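Only as an illustration of the two policy pairs above (not from the slides), a toy single-line cache in C might handle a store like this; backing_store, the line layout, and the function names are invented for the sketch:

    #include <stdbool.h>
    #include <string.h>

    #define LINE_SIZE 64

    static unsigned char backing_store[1 << 20];   /* pretend main memory */

    struct toy_line {
        bool          valid;
        bool          dirty;                       /* used only by write-back */
        unsigned long tag;
        unsigned char data[LINE_SIZE];
    };

    /* Write-through + no-write-allocate: on a miss, bypass the cache. */
    void store_write_through(struct toy_line *l, unsigned long addr, unsigned char v)
    {
        unsigned long tag = addr / LINE_SIZE;

        backing_store[addr] = v;                   /* always goes to memory    */
        if (l->valid && l->tag == tag)             /* update cache only on hit */
            l->data[addr % LINE_SIZE] = v;
    }

    /* Write-back + write-allocate: on a miss, fetch the line, then write-hit. */
    void store_write_back(struct toy_line *l, unsigned long addr, unsigned char v)
    {
        unsigned long tag = addr / LINE_SIZE;

        if (!l->valid || l->tag != tag) {
            if (l->valid && l->dirty)              /* evict: the postponed write */
                memcpy(&backing_store[l->tag * LINE_SIZE], l->data, LINE_SIZE);
            memcpy(l->data, &backing_store[tag * LINE_SIZE], LINE_SIZE);
            l->tag = tag;
            l->valid = true;
            l->dirty = false;
        }
        l->data[addr % LINE_SIZE] = v;             /* the write-hit            */
        l->dirty = true;                           /* memory updated later     */
    }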

10

Cache coherence Coherence defines the behavior of reads and writes to the same memory location.

11

Cache coherence The coherence of caches is obtained if the following conditions are met:
In a read made by a processor P to a location X that follows a write by the same processor P to X, with no writes of X by another processor occurring between the write and the read instructions made by P, X must always return the value written by P. This condition is related to program order preservation, and this must be achieved even in monoprocessed architectures.
A read made by a processor P1 to location X that follows a write by another processor P2 to X must return the written value made by P2 if no other writes to X made by any processor occur between the two accesses. This condition defines the concept of a coherent view of memory. If processors can read the same old value after the write made by P2, we can say that the memory is incoherent.
Writes to the same location must be sequenced. In other words, if location X received two different values A and B, in this order, from any two processors, the processors can never read location X as B and then read it as A. The location X must be seen with values A and B in that order.

12

Cache coherence Cache coherence mechanisms: Directory-based, Snooping (BUS-based), and many more.

13

Cache coherence Directory-based In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed, the directory either updates or invalidates the other caches with that entry.

14

Cache coherence Snooping (BUS-based) Snooping is the process where the individual caches monitor address lines for accesses to memory locations that they have cached. It is called a write invalidate protocol when a write operation is observed to a location that a cache has a copy of. There are two implementations for the invalidate protocol:
Write-update: When a local cache block is updated, the new data block is broadcast to all caches containing a copy of the block for updating them.
Write-invalidate: Invalidate all remote copies of the cache block when a local cache block is updated.

15

Cache coherence Coherence protocol example: Write-invalidate Snooping Protocol For Write-through. Writes invalidate all other caches.

16

Cache coherence Write-invalidate Snooping Protocol For Write-back When a block is first loaded in the cache it is marked "valid". On a read miss to the local cache, the read request is broadcast on the bus. If another cache has that address in the state "dirty", it changes the state to "valid" and sends the copy to the requesting node. The "valid" state means that the cache line is current. When writing a block in state "valid", its state is changed to "dirty" and a broadcast is sent out to all cache controllers to invalidate their copies.

17

Cache coherence - MESI MESI: Modified, Exclusive, Shared, Invalid

18

Cache coherence - MESI Every cache line is marked with one of the four following states:
Modified - The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
Exclusive - The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
Shared - Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid - Indicates that this cache line is invalid (unused).
To summarize, MESI is an extension of the MSI algorithm. MESI adds a distinction between modifying a cache line that exists only in my cache and modifying a cache line that also exists in other caches.
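As a hedged illustration of these four states (not taken from the slides), the C sketch below encodes two representative transitions, the local write and the snooped read; the bus helper names in the comments are invented, and real MESI controllers handle more events than this:

    /* Illustrative MESI sketch; event handling is simplified. */
    enum mesi_state { MESI_MODIFIED, MESI_EXCLUSIVE, MESI_SHARED, MESI_INVALID };

    /* Local CPU writes the line: E goes to M silently, S must invalidate the
     * other copies first, I must first fetch the line for ownership. */
    enum mesi_state mesi_on_local_write(enum mesi_state s)
    {
        switch (s) {
        case MESI_MODIFIED:
            return MESI_MODIFIED;       /* already exclusive and dirty */
        case MESI_EXCLUSIVE:
            return MESI_MODIFIED;       /* no bus transaction needed   */
        case MESI_SHARED:
            /* bus_invalidate(line);       invalidate all other copies */
            return MESI_MODIFIED;
        case MESI_INVALID:
        default:
            /* bus_read_for_ownership(line); fetch + invalidate others */
            return MESI_MODIFIED;
        }
    }

    /* Another CPU announces a read: M supplies the dirty data and drops to S,
     * E drops to S, S and I are unchanged. */
    enum mesi_state mesi_on_snooped_read(enum mesi_state s)
    {
        if (s == MESI_MODIFIED) {
            /* write back / supply the dirty data to the requester */
            return MESI_SHARED;
        }
        if (s == MESI_EXCLUSIVE)
            return MESI_SHARED;
        return s;
    }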

19

Cache coherence - MESI For any given pair of caches, the permitted states of a given cache line are as follows: The Exclusive state is an opportunistic optimization: if the CPU wants to modify a cache line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction.
20

Cache coherence

21

Cache What is done by the OS and what is done by the hardware? In the Intel X86 series, caching is implemented in hardware; all you need, and can do, is change the configuration through a register interface called Control registers. The control registers are set into 7 groups: CR0, CR1, CR2, CR3, CR4, and another 2 groups called EFER and CR8 (added to support the X64 series). Our main interest in the presentation revolves around caching, but bear in mind that this interface contains every parameter you can set on the Intel architecture. CR0 CD (bit 30) globally enables/disables the memory cache. CR0 NW (bit 29) globally enables/disables write-back caching (or write-through). Flushing of TLB entries can be done in Linux using an API called vpid_sync_context. The implementation is done by using vpid_sync_vcpu_single or vpid_sync_vcpu_global for a single or all CPUs.
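For illustration only (not from the slides), reading CR0 and testing the CD/NW bits could look like the hypothetical kernel-mode sketch below; it assumes x86-64, GCC inline assembly, and ring 0, since CR0 cannot be read from user space:

    /* Hypothetical ring-0 sketch: inspect the CR0.CD and CR0.NW bits. */
    #include <linux/printk.h>

    static inline unsigned long read_cr0_raw(void)
    {
        unsigned long cr0;

        /* "mov %cr0, reg" is a privileged instruction. */
        asm volatile("mov %%cr0, %0" : "=r"(cr0));
        return cr0;
    }

    static void report_cache_bits(void)
    {
        unsigned long cr0 = read_cr0_raw();
        int cd = (cr0 >> 30) & 1;   /* 1 = memory cache globally disabled */
        int nw = (cr0 >> 29) & 1;   /* 1 = write-back caching disabled    */

        printk(KERN_INFO "CR0.CD=%d CR0.NW=%d\n", cd, nw);
    }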

22

Caching & Spinlock

23

Caching and spin lock

    spin_lock:
        mov eax, 1
        xchg eax, [locked]
        test eax, eax
        jnz spin_lock
        ret
    spin_unlock:
        mov eax, 0
        xchg eax, [locked]
        ret
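The slides express the lock in x86 assembly; the following C rendering is offered only as an illustrative equivalent (the use of C11 atomics is my substitution, not part of the presentation):

    #include <stdatomic.h>

    static atomic_int locked = 0;   /* 0 = free, 1 = held */

    static void spin_lock_c(void)
    {
        /* atomic exchange plays the role of "xchg eax, [locked]" */
        while (atomic_exchange_explicit(&locked, 1, memory_order_acquire) == 1)
            ;                        /* spin until the old value was 0 */
    }

    static void spin_unlock_c(void)
    {
        atomic_store_explicit(&locked, 0, memory_order_release);
    }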


30

Caching and spin lock

    spin_lock:
        mov eax, 1
        xchg eax, [locked]
        test eax, eax
        jnz spin_lock
        ret
    spin_unlock:
        mov eax, 0
        xchg eax, [locked]
        ret

The other CPU action


37

Caching and spin lock

    spin_lock:
        mov eax, [locked]
        test eax, eax
        jnz spin_lock
        mov eax, 1
        xchg eax, [locked]
        test eax, eax
        jnz spin_lock
        ret
    spin_unlock:
        mov eax, 0
        xchg eax, [locked]
        ret

38

Caching and ticket lock

    struct spinlock_t { int current_ticket; int next_ticket; };

    void spin_lock(spinlock_t *lock) {
        int t = atomic_inc(lock->next_ticket);  /* fetch-and-increment */
        while (t != lock->current_ticket)
            ;                                   /* spin until my turn  */
    }

    void spin_unlock(spinlock_t *lock) {
        lock->current_ticket++;
    }
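For a self-contained version of the same idea, one could write the slide's pseudocode with C11 atomics; this is only a sketch under that assumption, not kernel code, and pause/relax hints are omitted:

    #include <stdatomic.h>

    struct ticket_lock {
        atomic_uint next_ticket;      /* queue ticket   */
        atomic_uint current_ticket;   /* dequeue ticket */
    };

    static void ticket_lock_acquire(struct ticket_lock *l)
    {
        /* take a ticket: atomic fetch-and-increment */
        unsigned int t = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                                   memory_order_relaxed);
        /* wait until the lock holder hands the turn to us */
        while (atomic_load_explicit(&l->current_ticket,
                                    memory_order_acquire) != t)
            ;
    }

    static void ticket_lock_release(struct ticket_lock *l)
    {
        atomic_fetch_add_explicit(&l->current_ticket, 1, memory_order_release);
    }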


60

Interrupt

61

An interrupt is simply a signal that the hardware can send when it wants the processor's attention. A driver need only register a handler for its device's interrupts, and handle them properly when they arrive.

62

Interrupt Cont Register API:

    int request_irq(unsigned int irq,
                    irqreturn_t (*handler)(int, void *, struct pt_regs *),
                    unsigned long flags, const char *dev_name, void *dev_id);

Unregister API:

    void free_irq(unsigned int irq, void *dev_id);
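A hedged usage sketch with the (older, pt_regs-era) handler prototype shown above; the IRQ number, device name, and handler body are invented for illustration:

    /* Illustrative only: registering and releasing an interrupt line
     * with the legacy three-argument handler prototype. */
    #include <linux/interrupt.h>
    #include <linux/module.h>

    #define MY_IRQ 11                       /* hypothetical IRQ number */

    static irqreturn_t my_handler(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* acknowledge the device, read its status registers, etc. */
        return IRQ_HANDLED;
    }

    static int __init my_init(void)
    {
        /* SA_SHIRQ marks the line shareable; dev_id must then be unique. */
        return request_irq(MY_IRQ, my_handler, SA_SHIRQ, "my_dev", &my_handler);
    }

    static void __exit my_exit(void)
    {
        free_irq(MY_IRQ, &my_handler);      /* same dev_id used to register */
    }

    module_init(my_init);
    module_exit(my_exit);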

63

Interrupt Cont unsigned int irq The interrupt number being requested. irqreturn_t (*handler)(int, void *, struct pt_regs *) The pointer to the handling function being installed.

64

Interrupt Cont unsigned long flags A bit mask of options related to the interrupt. SA_INTERRUPT - when set, this indicates a fast interrupt handler; fast handlers are executed with interrupts disabled on the current processor. SA_SHIRQ - this bit signals that the interrupt can be shared between devices. const char *dev_name The string passed to request_irq is used in /proc/interrupts.

65

Interrupt Cont void *dev_id Pointer used for shared interrupt lines. It is a unique identifier that is used when the interrupt line is freed and that may also be used by the driver to point to its own private data area.

66

Top & Bottom Half How to perform lengthy tasks within a handler? By splitting the interrupt handler into two halves. The so-called top half is the routine that actually responds to the interrupt. The bottom half is a routine that is scheduled by the top half to be executed later, at a safer time. All interrupts are enabled during execution of the bottom half.

67

Top & Bottom Half Cont Two different mechanisms may be used to implement bottom-half processing: Tasklet - fast and must be atomic (SW interrupt); Workqueue - higher latency but allowed to sleep.

68

Tasklet Fast & atomic (cannot sleep). Guaranteed to run on the same CPU as the function that first schedules them. The interrupt handler can be sure that a tasklet does not begin executing before the handler has completed.

69

Tasklet Cont Another interrupt can certainly be delivered while the tasklet is running, so locking between the tasklet and the interrupt handler may still be required. They may be scheduled to run multiple times, but tasklet scheduling is not cumulative; the tasklet runs only once, even if it is requested repeatedly before it is launched.

70

Tasklet Cont No tasklet ever runs in parallel with itself, since they run only once, but tasklets can run in parallel with other tasklets on SMP systems, so locking between tasklets is required.

71

Tasklet Example

    void short_do_tasklet(unsigned long);
    DECLARE_TASKLET(short_tasklet, short_do_tasklet, 0);

    irqreturn_t short_tl_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* Handle fast path IRQ */
        tasklet_schedule(&short_tasklet);   /* Schedule tasklet */
        return IRQ_HANDLED;
    }

72

Workqueue Higher latency but allowed to sleep. Invoke a function at some future time in the context of a special worker process. A workqueue function runs in process context, so it can sleep if need be. You cannot, however, copy data into user space from a workqueue process. A workqueue process does not have access to any other process's address space.

73

Workqueue Example

    static struct work_struct short_wq;

    /* this line is in short_init() */
    INIT_WORK(&short_wq, (void (*)(void *)) short_do_tasklet, NULL);

    irqreturn_t short_wq_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* Handle fast path IRQ */
        schedule_work(&short_wq);   /* Schedule workqueue */
        return IRQ_HANDLED;
    }

74

Locks

75

Concurrent Systems Concurrency - what happens when the system tries to do more than one thing at once! In computer science, concurrency is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other. The computations may be executing on: 1. Multiple cores in the same chip. 2. Preemptively time-shared threads on the same processor. 3. Physically separated processors. (wiki)

76

Concurrent Systems Management The management of

concurrency is one of the core problems in operating systems


programming and the resulting outcome can be indeterminate

77

Concurrent Systems Management Concurrency faults can lead to: Race condition - uncontrolled access to shared data. Starvation - where a process is perpetually denied necessary resources; without those resources, the program can never finish its task. Deadlock - a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does.

78

Race Condition Example for race condition:

    Lock;
    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos])
            goto __cleanup;
    }
    UnLock;

Leads to memory leak !!!

79

Deadlock Example for deadlock:

    Lock;
    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos])
            return -1;
    }
    UnLock;

Leads to deadlock!!!
goto __cleanup;

80

Solution Avoid shared resources whenever possible. In many situations it is possible to design data structures that do not require locking, e.g. by using per-thread or per-CPU data and disabling interrupts. Problem - such sharing is often required; hardware resources are, by their nature, shared, and software resources also must often be available to more than one thread.

81

Solution Cont Mutual exclusion - making sure that only one thread of execution can manipulate a shared resource at any time. Not all critical sections are the same, so the kernel provides different primitives for different needs: process context can sleep, interrupt context cannot sleep.

82

Spinlock A spinlock is a mutual exclusion device that can have only two values: locked and unlocked. If the kernel control path finds the spin lock unlocked, it acquires the lock and continues its execution.

83

Spinlock Cont If the kernel control path finds the lock locked, it spins around, repeatedly executing a tight instruction loop, until the lock is released. Spinlocks may be used in code that cannot sleep, such as interrupt handlers.

84

Spinlock Scenario #1 1. Driver acquires a spinlock 2. Driver loses the processor because: the driver calls a function which puts the process to sleep (e.g. copy_from_user), or kernel preemption kicks in - a higher priority process pushes the driver code aside.

85

Spinlock Scenario #1 Result: Driver holds a spinlock which will not be freed in the near future. In the best case, if another thread tries to acquire the lock it will spin for a long time. In the worst case a deadlock can occur.

86

Spinlock Scenario #1 Conclusion Conclusion: Code holding a spinlock must be atomic and cannot go to sleep (and sometimes cannot even handle interrupts). Preemption is disabled on a processor which holds a spinlock.

87

Spinlock Scenario #2 1. Driver acquires a spinlock 2. Async - the device issues an interrupt. The interrupt handler of the device tries to acquire the spinlock.

88

Spinlock Scenario #2 Result: What happens if the interrupt routine executes on the same processor as the code that took out the lock originally? Deadlock !!!

89

Spinlock Scenario #2 Conclusion Conclusion: In this case acquiring the spinlock must disable interrupts. Spinlock critical sections must be as short as possible (the longer you hold a lock, the longer another processor may have to spin waiting for you to release it).

90

Spinlock Scenario #2 Conclusion Cont Long lock-hold times also keep the current processor from scheduling, meaning that a higher priority process, which really should be able to get the CPU, may have to wait.

91

Spinlock API Initialize APIs:

    spinlock_t lock = SPIN_LOCK_UNLOCKED;
    spin_lock_init(spinlock_t *lock)

92

Spinlock API Lock APIs and, possibly, disabling interrupts:

    void spin_lock(spinlock_t *lock)
    void spin_lock_irqsave(spinlock_t *lock, unsigned long flags)

Interrupts can execute in nested fashion; the previous interrupt state is stored in flags (safe).

93

Spinlock API

    void spin_lock_irq(spinlock_t *lock)

If you are absolutely sure nothing else might have already disabled interrupts on your processor, and you are sure that you should enable interrupts when you release your spinlock.

    void spin_lock_bh(spinlock_t *lock)

Disables software interrupts before taking the lock.

94

Spinlock API Cont Try lock APIs:

    int spin_trylock(spinlock_t *lock)
    int spin_trylock_bh(spinlock_t *lock)

Non-spinning versions of the above functions.

95

Spinlock API Cont Unlock APIs:

    void spin_unlock(spinlock_t *lock)
    void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
    void spin_unlock_irq(spinlock_t *lock)
    void spin_unlock_bh(spinlock_t *lock)
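A hedged usage sketch of the irqsave variants listed above; the data structure and critical section are invented for the example, and DEFINE_SPINLOCK is the newer initializer rather than the SPIN_LOCK_UNLOCKED form shown earlier:

    /* Illustrative only: protecting a driver-private list with a spinlock
     * that may also be taken from an interrupt handler. */
    #include <linux/spinlock.h>
    #include <linux/list.h>

    static LIST_HEAD(my_queue);
    static DEFINE_SPINLOCK(my_lock);

    void enqueue_item(struct list_head *item)
    {
        unsigned long flags;

        /* Save and disable interrupts so the IRQ handler cannot
         * deadlock against us on the same CPU. */
        spin_lock_irqsave(&my_lock, flags);
        list_add_tail(item, &my_queue);
        spin_unlock_irqrestore(&my_lock, flags);
    }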

96

Semaphore A single integer value combined with a pair of functions: one which acquires the semaphore, one which releases the semaphore. If the value of the semaphore is greater than zero: the value is decremented by one and the process continues.

97

Semaphore Cont Otherwise the process goes to sleep until another process releases the semaphore (incrementing the semaphore value by one) and, if necessary, wakes up processes that are waiting.

98

Semaphore Struct The semaphore struct can be found at include/linux/semaphore.h:

    struct semaphore {
        raw_spinlock_t   lock;
        unsigned int     count;
        struct list_head wait_list;
    };

99

Semaphore Struct Cont The ->lock (spinlock) controls access to the other members of the semaphore. The ->count variable represents how many more tasks can acquire this semaphore; if it's zero, there may be tasks waiting on the wait_list. The ->wait_list is a list of tasks waiting for the semaphore (FIFO).

100

Semaphore Lock Procedure Semaphore Lock:
1. acquire spinlock
2. if (count > 0): i. count--
3. else: i. insert calling task at the tail of wait_list; ii. set wakeup flag to 0; iii. repeat: release spinlock, put task to sleep, acquire spinlock, if (wakeup flag == 1) exit repeat section
4. release spinlock

101

Semaphore Unlock Procedure Semaphore Unlock:
1. acquire spinlock
2. if (wait_list is empty): i. count++
3. else: i. node = get wait_list head; ii. remove node from wait_list; iii. set wakeup flag to 1; iv. wake up process
4. release spinlock

102

Semaphore APIs Create a semaphore with an initial counter value of val:

    void sema_init(struct semaphore *sem, int val)
    #define DEFINE_SEMAPHORE(name)

Lock a semaphore:

    void down(struct semaphore *sem)
103

Semaphore APIs Interruptible lock (process can be interrupted by a signal):

    int down_interruptible(struct semaphore *sem)

Try lock (never sleeps):

    int down_trylock(struct semaphore *sem)

Unlock semaphore:

    void up(struct semaphore *sem)
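A hedged usage sketch of the API above; the protected resource and the error handling are invented for illustration:

    /* Illustrative only: serializing access to a device with a semaphore. */
    #include <linux/semaphore.h>
    #include <linux/errno.h>

    static struct semaphore my_sem;

    void my_setup(void)
    {
        sema_init(&my_sem, 1);              /* binary semaphore (mutex-like) */
    }

    int my_write_path(void)
    {
        if (down_interruptible(&my_sem))    /* sleep until available */
            return -ERESTARTSYS;            /* interrupted by a signal */

        /* ... touch the shared device state ... */

        up(&my_sem);                        /* release, waking a waiter */
        return 0;
    }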

104

RW Semaphore Semaphores perform mutual exclusion for

all callers Many tasks break down into two distinct types of work:
Readers Writers

105

RW Semaphore Allow multiple concurrent readers. Optimize performance. An RW semaphore allows either one writer or an unlimited number of readers to hold the semaphore.

106

RW Semaphore Cont Since multiple readers may hold the lock at once, a writer may keep waiting for the lock while new reader threads are able to acquire it: write starvation.

107

RW Semaphore API RW Semaphore struct:

    struct rw_semaphore {
        long             count;
        raw_spinlock_t   wait_lock;
        struct list_head wait_list;
    };

108

RW Semaphore API Initialize RW semaphore:

    init_rwsem(struct rw_semaphore *sem)

Obtaining and releasing read access to a reader/writer semaphore:

    void down_read(struct rw_semaphore *sem)
    int down_read_trylock(struct rw_semaphore *sem)
    void up_read(struct rw_semaphore *sem)

109

RW Semaphore API Cont Obtaining and releasing write access to a reader/writer semaphore:

    void down_write(struct rw_semaphore *sem)
    int down_write_trylock(struct rw_semaphore *sem)
    void up_write(struct rw_semaphore *sem)
    void downgrade_write(struct rw_semaphore *sem)
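A hedged usage sketch of these calls; the protected configuration value is invented for the example:

    /* Illustrative only: many concurrent readers, one exclusive writer. */
    #include <linux/rwsem.h>

    static DECLARE_RWSEM(cfg_sem);
    static int cfg_value;

    int read_cfg(void)
    {
        int v;

        down_read(&cfg_sem);        /* shared: other readers may enter  */
        v = cfg_value;
        up_read(&cfg_sem);
        return v;
    }

    void update_cfg(int v)
    {
        down_write(&cfg_sem);       /* exclusive: waits for all readers */
        cfg_value = v;
        up_write(&cfg_sem);
    }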


111

Mutex Similar to semaphore. Mutex struct:

    struct mutex {
        /* 1: unlocked, 0: locked, negative: locked, possible waiters */
        atomic_t           count;
        spinlock_t         wait_lock;
        struct list_head   wait_list;
        struct task_struct *owner;
    };

112

Mutex Vs Semaphore Only one task can hold the mutex at a time (binary semaphore). Only the owner of the mutex can unlock the mutex. Recursive locks. Improvement: try to spin for acquisition when we find that there are no pending waiters and the lock owner is currently running on a (different) CPU (it is likely to release the lock soon).

113

Seqlocks In read-write locks: readers must wait until the writer has finished; the writer must wait until all readers have finished. Seqlocks give a much higher priority to writers: a writer is allowed to proceed even when readers are active.

114

Seqlocks Cont The seqlock struct can be found at include/linux/seqlock.h:

    typedef struct {
        seqcount_t seqcount;
        spinlock_t lock;
    } seqlock_t;

115

Seqlocks Read Access Read access works by: obtaining an (unsigned) integer sequence value on entry into the critical section; doing the reading operations; comparing the current sequence # with the one obtained. If there is a mismatch, the read access must be retried.

116

Seqlocks Write Access The write lock is implemented with a spinlock, so all the usual constraints apply. Gives a much higher priority to writers: a writer is allowed to proceed even when readers are active. Increment the sequence #.

117

Seqlocks Read Example A typical code example will look like:

    unsigned int seq;
    do {
        seq = read_seqbegin(&the_lock);
        /* Do what you need to do */
    } while (read_seqretry(&the_lock, seq));
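For symmetry, here is a hedged writer-side sketch using the write API listed on the later slides; the_lock and the protected fields are assumptions made for the example:

    /* Illustrative only: the writer takes the underlying spinlock and
     * bumps the sequence number around the update. */
    #include <linux/seqlock.h>

    static seqlock_t the_lock;      /* assume seqlock_init(&the_lock) ran */
    static unsigned long a, b;      /* data protected by the_lock         */

    void update_pair(unsigned long new_a, unsigned long new_b)
    {
        write_seqlock(&the_lock);   /* readers overlapping this will retry */
        a = new_a;
        b = new_b;
        write_sequnlock(&the_lock);
    }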

118

Seqlocks Summary Pros: the writer never waits (unless another writer is active); free access for readers. Cons: a reader may sometimes be forced to read the same data several times until it gets a valid copy; seqlocks generally cannot be used to protect data structures involving pointers, because the reader may be following a pointer that is invalid while the writer is changing the data structure.

119

Seqlocks API Initialize seqlocks:

    seqlock_t lock = SEQLOCK_UNLOCKED;
    seqlock_init(seqlock_t *lock);

Obtaining read access:

    unsigned int read_seqbegin(seqlock_t *lock);
    unsigned int read_seqbegin_irqsave(seqlock_t *lock, unsigned long flags);

120

Seqlocks API Cont

    int read_seqretry(seqlock_t *lock, unsigned int seq);
    int read_seqretry_irqrestore(seqlock_t *lock, unsigned int seq, unsigned long flags);

121

Seqlocks API Cont Obtaining write access:

    void write_seqlock(seqlock_t *lock);
    void write_seqlock_irqsave(seqlock_t *lock, unsigned long flags);
    void write_seqlock_irq(seqlock_t *lock);
    void write_seqlock_bh(seqlock_t *lock);
    int write_tryseqlock(seqlock_t *lock);

122

Seqlocks API Cont Releasing write access:

    void write_sequnlock(seqlock_t *lock);
    void write_sequnlock_irqrestore(seqlock_t *lock, unsigned long flags);
    void write_sequnlock_irq(seqlock_t *lock);
    void write_sequnlock_bh(seqlock_t *lock);

123

RCU Read Copy Update An improvement over seqlocks. RCU allows many readers and many writers to proceed concurrently (an improvement over seqlocks, which allow only one writer to proceed). Optimized for situations where reads are common and writes are rare.

124

RCU Constraints Constraints: resources being protected should be accessed via pointers; all references to those resources must be held only by atomic code (a process cannot sleep inside a critical region protected by RCU).

125

RCU How? On the reader side, code using an RCU-protected data structure should disable/enable preemption:

    struct my_stuff *stuff;

    rcu_read_lock();                  /* disable preemption */
    stuff = find_the_stuff(args...);
    /* Do what you need to do */
    rcu_read_unlock();                /* enable preemption  */

126

RCU How Cont? On the writer side: allocate a new structure; copy data from the old one; replace the pointer that is seen by the read code. At this point, from the reader's perspective, the change is complete: any code entering the critical section sees the new version of the data.

127

RCU Cleanup The only problem is when to free the old pointer (a reader might still hold a reference to it). Since all code holding references to this data structure must (by the rules) be atomic, we know that once every processor on the system has been scheduled at least once, all references must be gone. RCU sets aside a callback that waits until all processors have scheduled; that callback is then run to perform the cleanup work.
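As a hedged sketch of the writer-side sequence just described (allocate, copy, publish, then defer the free), the code below uses the standard kernel RCU helpers; the structure and globals are invented, and synchronize_rcu() stands in for the "wait until every CPU has scheduled" step:

    /* Illustrative only: replace an RCU-protected structure. */
    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    struct my_stuff {
        int a, b;
    };

    static struct my_stuff __rcu *global_stuff;

    int update_stuff(int a, int b)
    {
        struct my_stuff *new, *old;

        new = kmalloc(sizeof(*new), GFP_KERNEL);    /* 1. allocate          */
        if (!new)
            return -ENOMEM;

        old = rcu_dereference_protected(global_stuff, 1);
        if (old)
            *new = *old;                            /* 2. copy the old data */
        new->a = a;
        new->b = b;

        rcu_assign_pointer(global_stuff, new);      /* 3. publish           */

        synchronize_rcu();                          /* wait for all readers */
        kfree(old);                                 /* 4. now safe to free  */
        return 0;
    }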

128

Simple Spinlock Using a test-and-set atomic function. A test-and-set instruction is an instruction used to write to a memory location and return its old value as a single atomic operation:

    int test_and_set(int *lock)

129

Simple Spinlock

    #define LOCKED 1

    int test_and_set(int *lockPtr)
    {
        int oldValue;

        oldValue = SwapAtomic(lockPtr, LOCKED);
        return (oldValue == LOCKED);
    }

130

Test and set mutex Implementation Lock:

    int lock(int *lockPtr)
    {
        while (test_and_set(lockPtr) == LOCKED)
            ;   /* wait a bit */
    }

Unlock:

    int un_lock(int *lockPtr) { *lockPtr = 0; }

131

Problems Grants requests in unpredictable order. Accelerates inter-CPU bus traffic (cache).

132

Ticket Spinlocks A ticket lock works as follows: two integer values which initialize to 0, a queue ticket and a dequeue ticket.

133

Ticket Spinlocks Cont Acquire lock procedure: Obtain & increment the queue ticket. Compare the ticket's value (before the increment) with the dequeue ticket's value. If they are the same, the thread is permitted to enter the critical section; else, another thread must already be in the critical section and this thread must busy-wait or yield.

134

Ticket Spinlocks Cont Release lock procedure: Increment the dequeue ticket. This permits the next waiting thread to enter the critical section.

135

Ticket Spinlock Summary Grants requests in FIFO order. Problems: accelerates inter-CPU bus traffic (cache).

136

Linux Scalability What is scalability? Application does N

times as much work on N cores as it could on 1 core. Scalability may


be limited by Amdahl's Law: Locks, shared data structures,... Shared
hardware (DRAM, NIC,...)

137

Linux Scalability

138

Linux Scalability Cont

142

Test-and-Set Lock Repeatedly test-and-set a Boolean flag indicating whether the lock is held. Problem: contention for the flag (read-modify-write instructions are expensive). Causes lots of network traffic, especially on cache-coherent architectures (because of cache invalidations). Variation: test-and-test-and-set - less traffic.

143

Ticket Lock 2 counters (nr_requests and nr_releases). Lock acquire: fetch-and-increment on the nr_requests counter, then wait until the ticket is equal to the value of the nr_releases counter. Lock release: increment the nr_releases counter.

144

Ticket Lock Cont Advantage over T&S: polls with read operations only. BUT - still generates lots of traffic and contention: all threads spin on the same shared location, causing cache-coherence traffic on every successful lock access.

145

The Problem Busy-waiting techniques are heavily used in synchronization on shared memory. Busy-waiting synchronization constructs tend to: have significant impact on network traffic due to cache invalidations; contention leads to poor scalability.

146

The Problem Cont Significant impact on network traffic due to cache invalidations: Even in the case of two CPUs repeatedly acquiring a spinlock, the memory location representing that lock will bounce back and forth between those CPUs' caches. Even if neither CPU ever has to wait for the lock, the process of moving it between caches will slow things down considerably.

147

The Problem Cont Contention leads to poor scalability: The simple act of spinning for a lock clearly is not going to be good for performance. Cache contention would appear to be less of an issue (a CPU spinning on a lock will cache its contents in a shared mode). No cache bouncing should occur until the CPU owning the lock releases it (releasing the lock, and its acquisition by another CPU, requires writing to the lock, and that requires exclusive cache access).

148

The Problem Cont Contention leads to poor scalability: Case 2: the lock is contended; there will be one or more other CPUs constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly.

149

The Problem Cont Contention leads to poor scalability: Kernel code will acquire a lock to work with (and, usually, modify) a structure's contents. Often, changing a field within the protected structure will require access to the same cache line that holds the structure's spinlock. Case 1: the lock is uncontended; that access is not a problem, the CPU owning the lock probably owns the cache line as well. Case 2: the lock is contended; there will be one or more other CPUs constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly.

150

The Source of the Problem Spinning on remote variables.

151

The Proposed Solution Insert delay (backoff). Minimize access to remote variables - spin on local variables instead.

152

Spin Lock With Backoff Rather than spinning tightly and querying a contended lock's status, a waiting CPU should wait a bit more patiently, only querying the lock occasionally. Cause a waiting CPU to loop a number of times doing nothing at all before it gets impatient and checks the lock again.

153

Spin Lock With Backoff Pros Pros: While a CPU is looping without querying the lock it cannot be bouncing cache lines around, so the lock holder should be able to make faster progress. Calculate proportional backoff using the value of the ticket minus the number of the ticket which is currently served, multiplied by the static backoff loop. Cons: too much looping will cause the lock to sit idle before the owner of the next ticket notices that its turn has come; that, too, will hurt performance. All threads spin on the same shared location, causing cache-coherence traffic on every successful lock access.

154

Spin Lock With Backoff Cons Cons: too much looping will cause the lock to sit idle before the owner of the next ticket notices that its turn has come; that, too, will hurt performance. All threads spin on the same shared location, causing cache-coherence traffic on every successful lock access.
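A hedged C sketch of the proportional backoff idea described above, layered on the earlier C11 ticket-lock rendering; BACKOFF_BASE is an invented tuning constant standing in for the "static backoff loop":

    #include <stdatomic.h>

    #define BACKOFF_BASE 100        /* illustrative static backoff loop count */

    struct ticket_lock {
        atomic_uint next_ticket;
        atomic_uint current_ticket;
    };

    static void ticket_lock_backoff(struct ticket_lock *l)
    {
        unsigned int t = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                                   memory_order_relaxed);
        for (;;) {
            unsigned int cur = atomic_load_explicit(&l->current_ticket,
                                                    memory_order_acquire);
            if (cur == t)
                return;             /* our turn */

            /* proportional backoff: (my ticket - ticket being served)
             * times the static loop count, without touching the lock. */
            for (volatile unsigned long i = 0;
                 i < (unsigned long)(t - cur) * BACKOFF_BASE; i++)
                ;
        }
    }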

155

Array Lock

    #define NUM_OF_PROC 100
    #define HAS_LOCK    1
    #define MUST_WAIT   0

    struct arrLock {
        int slot[NUM_OF_PROC];
        int next_slot;
    };

156

Array Lock Cont

    #define INIT_ARR_LOCK(name) \
        struct arrLock name;    \
        name.slot = [0 = HAS_LOCK, 1 .. NUM_OF_PROC-1 = MUST_WAIT]; \
        name.next_slot = 0;

157

Array Lock Acquire

    int arr_lock_lock(struct arrLock *arr_lock_p, int *my_slot)
    {
        *my_slot = fetch_and_increment(arr_lock_p->next_slot); /* returns old value */
        *my_slot %= NUM_OF_PROC;                 /* get the slot inside the array */
        while (arr_lock_p->slot[*my_slot] == MUST_WAIT)
            ;                                    /* spin */
        arr_lock_p->slot[*my_slot] = MUST_WAIT;  /* init for next time */
        return 0;
    }

158

Array Lock Release

    int arr_lock_unlock(struct arrLock *arr_lock_p, int my_slot)
    {
        arr_lock_p->slot[(my_slot + 1) % NUM_OF_PROC] = HAS_LOCK;
        return 0;
    }

Each CPU clears the lock for its successor (sets it from must-wait to has-lock). Lock-acquire: while (slot[my_place] == MUST_WAIT); Lock-release: slot[(my_place + 1) % NUM_OF_PROC] = HAS_LOCK;

159

Array Lock Cons Adjacent data items share a single cache line. A write to one item invalidates that item's cache line.

160

Array Lock Cons How to solve it: Pad array elements so that distinct elements are mapped to distinct cache lines.

161

Array Lock Cons The ALock is not space-efficient. What if we don't know the NUM_OF_PROC value?

162

Array Lock Pros Spin on local variables, no cache jumps.

163

Ticket Lock Improvements 3 counters (nr_requests and 2 nr_releases); each counter is in a different cache line (padding with zeroes). Counter init values: nr_requests = 1; array of nr_releases: nr_releases[0] = 1, nr_releases[1] = 0.

164

Ticket Lock Improvements Algo The algorithm:

    Lock:
        ticket = fetch_and_increment(nr_requests)       // get ticket
        while (ticket != nr_releases[(ticket+1) % 2])   // wait for my turn
            ;
    Lock release:
        nr_releases[ticket % 2] += 2                    // increment by 2

165

Ticket Lock Improvements Summary Advantage: divides cache misses by half (linear in the array size of nr_releases); can be generalized for n release counters. Disadvantage: the lock is not space-efficient - each counter is in a distinct cache line.

166

MCS Lock List Based Queue Lock Goals: reduce bus traffic on cc machines (by spinning on local variables); space efficient. Requires atomic instructions available on some CPUs: ATOMIC_COMPARE_AND_SWAP: CAS(mem, old, new) - if *mem == old, then set *mem = new and return true.

167

MCS Lock List Based Queue Lock

    typedef struct qnode {
        struct qnode *next;
        bool locked;
    } mcs_lock_qnode;

A lock is just a pointer to a qnode:

    typedef mcs_lock_qnode *mcs_lock;

168

MCS Lock Acquire

    acquire(mcs_lock *L, mcs_lock_qnode *I) {
        I->next = NULL;
        qnode *predecessor = I;
        ATOMIC_SWAP(predecessor, *L);   /* predecessor <- old *L, *L <- I */
        if (predecessor != NULL) {      /* queue behind the predecessor   */
            I->locked = true;
            predecessor->next = I;
            while (I->locked) ;         /* spin on our own node           */
        }
    }

169

MCS Lock Acquire If unlocked, *L is NULL. If locked with no waiters, *L is the owner's qnode. If there are waiters, *L is the tail of the waiter list.

170

MCS Lock Release

    release(mcs_lock *L, mcs_lock_qnode *I) {
        if (!I->next) {
            if (ATOMIC_COMPARE_AND_SWAP(*L, I, NULL))
                return;                 /* no waiters: lock is now free */
        }
        while (!I->next) ;              /* a waiter is mid-acquire      */
        I->next->locked = false;        /* hand the lock to the next    */
    }

171

MCS Lock Release If I->next is NULL and *L == I: no one else is waiting for the lock, OK to set *L = NULL. If I->next is NULL and *L != I: another thread is in the middle of acquire; just wait for I->next to become non-NULL. If I->next is non-NULL: I->next is the oldest waiter; wake it up with I->next->locked = false.

172

Exclusive Cache Line The technique given below is used to force alignment of data structures on cache boundaries. Dynamic:

    #define ALIGN 64

    void *aligned_malloc(int size)
    {
        void *mem = kmalloc(size + ALIGN + sizeof(void *), GFP_KERNEL);
        void **ptr = (void **)((long)(mem + ALIGN + sizeof(void *)) & ~(ALIGN - 1));
        ptr[-1] = mem;
        return ptr;
    }

173

Exclusive Cache Line

    void aligned_free(void *ptr)
    {
        kfree(((void **)ptr)[-1]);
    }

Static:

    int __attribute__((aligned(64))) lock;

174

Memory pool Lookaside Caches Allocating many objects of the same size, over and over, in the kernel. API:

    kmem_cache_t *kmem_cache_create(const char *name, size_t size, size_t offset,
                                    unsigned long flags,
                                    void (*constructor)(void *, kmem_cache_t *, unsigned long flags),
                                    void (*destructor)(void *, kmem_cache_t *, unsigned long flags));
    void *kmem_cache_alloc(kmem_cache_t *cache, int flags);
    void kmem_cache_free(kmem_cache_t *cache, const void *obj);
    int kmem_cache_destroy(kmem_cache_t *cache);

flags = SLAB_HWCACHE_ALIGN This flag requires each data object to be aligned to a cache line.

175

Memory pool Memory Pools There are places in the kernel where memory allocations cannot be allowed to fail. A memory pool is really just a form of a lookaside cache that tries to always keep a list of free memory around for use in emergencies. API:

    mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
                              mempool_free_t *free_fn, void *pool_data);
    typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
    typedef void (mempool_free_t)(void *element, void *pool_data);
    void *mempool_alloc(mempool_t *pool, int gfp_mask);
    void mempool_free(void *element, mempool_t *pool);
    int mempool_resize(mempool_t *pool, int new_min_nr, int gfp_mask);
    void mempool_destroy(mempool_t *pool);

176

Memory pool Code example

    #define ALIGN 64
    const char *cachName = "slots";

    union slot {
        char spaceKeeper[ALIGN];
        int  val;
    };

    cache = kmem_cache_create(cachName, sizeof(union slot), 0,
                              SLAB_HWCACHE_ALIGN, ...);
    pool = mempool_create(MY_POOL_MINIMUM, mempool_alloc_slab,
                          mempool_free_slab, cache);
    union slot *obj = (union slot *)mempool_alloc(pool, ...);

177

Testing

178

Testing Application The test application is divided into 2 parts: User mode - a performance test application; Kernel mode - a char device.

179

Testing Application - kernel Create a char device driver. Via ioctl, control the following: creation of a spinlock (ticket lock, Array lock, MCS lock); acquire the spinlock (the one which was created); release the spinlock (the one which was created).

180

Testing Application - User The test application will generate on each run a different type of spinlock. The test application will run several forks/threads (one for each CPU core). Each thread will run on a separate (unique) core (sched_setaffinity).

181

Testing Application User Cont Pseudo Code:

    fopen spinlock device
    Create a spinlock type
    for i = 0; i < MAX_OF_CORES; i++
        sched_setaffinity(i)
Buttons:
i++ sched_setanity(i)

182

Testing Application User Cont Each thread will do the

following in a loop of x iterators: acquire a lock suspend himself


(sched_yield) and let other threads to run release the lock Measure
the time that all threads nished

183

Cancel
Testing Application User Cont Pseudo Code:
fopen Download

spinlock device create a spinlock type start_tick= Get Tick for i = 0;


i < MAX_OF_CORES; i++ Run thread (i) Make sure all threads nished
working Time = Get Tick start_tick

184

Testing

Application

User

Cont

Inside

thread

sched_setanity(i) Loop: lock acquire suspend (sched_yield) lock


release

185

Thank you

Download "Cache & SpinLocks Udi & Haim. Agenda Caching


background Why do we need caching? Caching in modern desktop.
Cache writing. Cache coherence. Cache."
