Professional Documents
Culture Documents
Chapter 07
Chapter 07
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Protocols
Elegant Important But nave
Art of Multiprocessor Programming 2
Protocols
Elegant (in their fashion) Important (why else would we pay attention) And realistic (your mileage may vary)
Art of Multiprocessor Programming 3
Kinds of Architectures
SISD (Uniprocessor)
Single instruction stream Single data stream
SIMD (Vector)
Single instruction Multiple data
MIMD (Multiprocessors)
Multiple instruction Multiple data.
Art of Multiprocessor Programming 4
Kinds of Architectures
SISD (Uniprocessor)
Single instruction stream Single data stream
SIMD (Vector)
Single instruction Multiple data
Our space
MIMD (Multiprocessors)
Multiple instruction Multiple data.
(1) Art of Multiprocessor Programming 5
MIMD Architectures
memory
Shared Bus
Distributed
8 (1)
our focus
Art of Multiprocessor Programming 9
Basic Spin-Lock
. .
10
Basic Spin-Lock
lock introduces sequential bottleneck
. .
11
Basic Spin-Lock
lock suffers from contention
. .
12
Basic Spin-Lock
lock suffers from contention
. .
Basic Spin-Lock
lock suffers from contention
. .
Basic Spin-Lock
lock suffers from contention
. .
Contention ???
Art of Multiprocessor Programming 15
Review: Test-and-Set
Boolean value Test-and-set (TAS)
Swap true with current value Return value tells if prior value was true or false
Review: Test-and-Set
public class AtomicBoolean { boolean value;
public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; }
17 (5)
Review: Test-and-Set
public class AtomicBoolean { boolean value;
public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } Package
java.util.concurrent.atomic
18
Review: Test-and-Set
public class AtomicBoolean { boolean value;
public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; }
Review: Test-and-Set
AtomicBoolean lock = new AtomicBoolean(false) boolean prior = lock.getAndSet(true)
20
Review: Test-and-Set
AtomicBoolean lock = new AtomicBoolean(false) boolean prior = lock.getAndSet(true)
21 (5)
Test-and-Set Locks
Locking
Lock is free: value is false Lock is taken: value is true
Test-and-set Lock
class TASlock { AtomicBoolean state = new AtomicBoolean(false);
void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}
Art of Multiprocessor Programming 23
Test-and-set Lock
class TASlock { AtomicBoolean state = new AtomicBoolean(false);
void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); Lock state }}
is AtomicBoolean
24
Test-and-set Lock
class TASlock { AtomicBoolean state = new AtomicBoolean(false);
void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); Keep trying until }}
lock acquired
25
Test-and-set Lock
class TASlock { Release lock AtomicBoolean state = by resetting new AtomicBoolean(false); state to false
void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}
Art of Multiprocessor Programming 26
Space Complexity
TAS spin-lock has small footprint N thread spin-lock uses O(1) space As opposed to O(n) Peterson/Bakery How did we overcome the W(n) lower bound? We used a RMW operation
Art of Multiprocessor Programming 27
Performance
Experiment
n threads Increment shared counter 1 million times
28
Graph
time
no speedup because of sequential bottleneck
ideal
threads
Art of Multiprocessor Programming 29
Mystery #1
TAS lock
time
Ideal
threads
Art of Multiprocessor Programming
Test-and-Test-and-Set Locks
Lurking stage
Wait until lock looks free Spin while read returns true (lock taken) As soon as lock looks available Read returns false (lock free) Call TAS to acquire lock If TAS loses, back to lurking
Art of Multiprocessor Programming 31
Pouncing state
Test-and-test-and-set Lock
class TTASlock { AtomicBoolean state = new AtomicBoolean(false);
void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } }
Art of Multiprocessor Programming 32
Test-and-test-and-set Lock
class TTASlock { AtomicBoolean state = new AtomicBoolean(false);
void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } }
Test-and-test-and-set Lock
class TTASlock { AtomicBoolean state = new AtomicBoolean(false);
34
Mystery #2
TAS lock
time
threads
Art of Multiprocessor Programming 35
Mystery
Both
TAS and TTAS Do the same thing (in our model)
Except that
TTAS performs much better than TAS Neither approaches ideal
36
Opinion
Our memory abstraction is broken TAS & TTAS methods
Are provably the same (in our model) Except they arent (in field tests)
37
Bus-Based Architectures
cache
cache
Bus
cache
memory
Art of Multiprocessor Programming 38
Bus-Based Architectures
Random access memory (10s of cycles)
cache cache
Bus
cache
memory
Art of Multiprocessor Programming 39
Shared Bus Broadcast medium One broadcaster at a time Processors and memory all snoop
cache cache
Bus
Bus-Based Architectures
cache
memory
Art of Multiprocessor Programming 40
Per-Processor Caches Bus-Based Architectures Small Fast: 1 or 2 cycles Address & state information
cache
cache
Bus
cache
memory
Art of Multiprocessor Programming 41
Jargon Watch
Cache hit
I found what I wanted in my cache Good Thing
42
Jargon Watch
Cache hit
I found what I wanted in my cache Good Thing
Cache miss
I had to shlep all the way to memory for that data Bad Thing
Art of Multiprocessor Programming 43
Cave Canem
This model is still a simplification
But not in any essential way Illustrates basic principles
44
cache
cache
Bus
cache
memory
Art of Multiprocessor Programming
data
45
cache
cache
Bus
cache
memory
Art of Multiprocessor Programming
data
46
Memory Responds
cache
Got your data right here
cache
Bus
cache
Bus
memory
Art of Multiprocessor Programming
data
47
data
cache
Bus
cache
memory
Art of Multiprocessor Programming
data
48
data
cache
Bus
cache
memory
Art of Multiprocessor Programming
data
49
data
cache
Bus
cache
memory
Art of Multiprocessor Programming
data
50
I got data
data
cache
Bus
cache
Bus
memory
Art of Multiprocessor Programming
data
51
data
cache
Bus
cache
Bus
memory
Art of Multiprocessor Programming
data
52
data
data
Bus
cache
memory
Art of Multiprocessor Programming
data
53 (1)
data
data data
Bus
cache
memory
Art of Multiprocessor Programming
data
54 (1)
data
data
Bus
cache
memory
Art of Multiprocessor Programming
data
55
data
data
Bus
cache
data
56
Cache Coherence
We have lots of copies of data
Original copy in memory Cached copies at processors
57
Write-Back Caches
Accumulate changes in cache Write back when needed
Need the cache for something else Another processor wants it
On first modification
Invalidate other entries Requires non-trivial protocol
Art of Multiprocessor Programming 58
Write-Back Caches
Cache entry has three states
Invalid: contains raw seething bits Valid: I can read but I cant write Dirty: Data has been modified
Intercept other load requests Write back to memory before using cache
59
Invalidate
data
data
Bus
cache
memory
Art of Multiprocessor Programming
data
60
Invalidate
data
data
Bus Bus
cache
memory
Art of Multiprocessor Programming
data
61
Invalidate
Uh,oh
data cache
data
Bus Bus
cache
memory
Art of Multiprocessor Programming
data
62
Invalidate
Other caches lose read permission
cache
data
Bus
cache
memory
Art of Multiprocessor Programming
data
63
Invalidate
Other caches lose read permission
cache
data
Bus
cache
memory
data
64
Invalidate
Memory provides data only if not present in any cache, so no need to change it now (expensive) data cache cache
Bus
memory
Art of Multiprocessor Programming
data
65 (2)
cache
data
Bus Bus
cache
memory
Art of Multiprocessor Programming
data
66 (2)
Owner Responds
Here it is!
cache
data
Bus Bus
cache
memory
Art of Multiprocessor Programming
data
67 (2)
data
data
Bus
cache
Mutual Exclusion
What do we want to optimize?
Bus bandwidth used by spinning threads Release/Acquire latency Acquire latency for idle lock
69
Simple TASLock
TAS invalidates cache lines Spinners
Miss in cache Go to bus
70
Test-and-test-and-set
Wait until lock looks free
Spin on local cache No bus use while lock busy
71
busy
busy
Bus
busy
memory
Art of Multiprocessor Programming
busy
72
On Release
invalid
invalid
Bus
free
memory
Art of Multiprocessor Programming
free
73
On Release
invalid miss
invalid miss
Bus
free
memory
Art of Multiprocessor Programming
free
74 (1)
On Release
TAS() invalid
TAS() invalid
Bus
free
memory
Art of Multiprocessor Programming
free
75 (1)
Problems
Everyone misses
Reads satisfied sequentially
In critical section, run ops X then ops Y. As long as Quiescence time is less than X, no drop in performance.
Quiescence Time
Increses linearly with the number of processors for bus architecture
time
threads
Art of Multiprocessor Programming
78
Mystery Explained
TAS lock
time
time
r2d r1d d
spin lock
80
time
4d 2d d
spin lock
If I fail to get lock wait random duration before retry Each subsequent failure doubles expected wait
Art of Multiprocessor Programming 81
Spin-Waiting Overhead
TTAS Lock
time
Backoff lock
threads
Art of Multiprocessor Programming 88
Bad
Must choose parameters carefully Not portable across platforms
89
Idea
Avoid useless invalidations
By keeping a queue of threads
Each thread
Notifies next in line Without bothering the others
90
idle
flags
91
acquiring getAndIncrement
flags
92
acquiring getAndIncrement
flags
93
acquired
Mine!
flags
94
acquired
acquiring
flags
95
acquired
acquiring
flags
getAndIncrement
F F F F F F F
96
acquired
acquiring
flags
getAndIncrement
F F F F F F F
97
acquired
acquiring
flags
98
released
acquired
flags
99
released
acquired
flags
Yow!
100
101
102
Thread-local variable
Art of Multiprocessor Programming 104
105
slot
106
to go
107
re-use
108
go
109
Local Spinning
next
released
acquired
Spin on my bit
flags
False Sharing
next
Result: contention flags
released
acquired
released
acquired
Spin on my line
flags
112
Performance
TTAS
Shorter handover than backoff queue Curve is practically flat Scalable performance
113
114
115
CLH Lock
FIFO order Small, constant-size overhead per thread
116
Initially
idle
tail
false
Art of Multiprocessor Programming 117
Initially
idle
tail
false
Art of Multiprocessor Programming
Queue tail
118
Initially
idle
Lock is free
tail
false
Art of Multiprocessor Programming 119
Initially
idle
tail
false
Art of Multiprocessor Programming 120
tail
false
Art of Multiprocessor Programming 121
tail
false
true
122
Swap
tail
false
true
123
tail
false
true
124
tail
false
true
true
125
Swap
tail
false
true
true
126
tail
false
true
true
127
tail
false
true
true
128
true
true
129
tail
false
true
true
130
true
tail
false
true
true
131
Purple Releases
release acquiring
false
Bingo!
true
132
tail
false
false
Purple Releases
released acquired
tail
true
Art of Multiprocessor Programming 133
Space Usage
Let
L = number of locks N = number of threads
ALock
O(LN)
CLH lock
O(L+N)
Art of Multiprocessor Programming 134
135
136
Queue tail
138
Thread-local Qnode
139
142
Notify successor
143
CLH Lock
Good
Lock release affects predecessor only Small, constant-sized space
Bad
Doesnt work for uncached NUMA architectures
146
NUMA Architecturs
Acronym:
Non-Uniform Memory Architecture
Illusion:
Flat shared memory
Truth:
No caches (sometimes) Some memory regions faster than others
Art of Multiprocessor Programming 147
NUMA Machines
NUMA Machines
CLH Lock
Each thread spins on predecessors memory Could be far away
150
MCS Lock
FIFO order Spin on local memory only Small, Constant-size overhead
151
Initially
idle
tail
false
Art of Multiprocessor Programming 152
Acquiring
acquiring
(allocate Qnode)
true
tail
false
Art of Multiprocessor Programming 153
Acquiring
acquired
swap
tail
false
true
154
Acquiring
acquired
true
tail
false
Art of Multiprocessor Programming 155
Acquired
acquired
true
tail
false
Art of Multiprocessor Programming 156
Acquiring
acquired acquiring
false
tail
swap
Art of Multiprocessor Programming
true
157
Acquiring
acquired acquiring
false
tail
true
Art of Multiprocessor Programming 158
Acquiring
acquired acquiring
false
tail
true
Art of Multiprocessor Programming 159
Acquiring
acquired acquiring
false
tail
true
Art of Multiprocessor Programming 160
Acquiring
acquired acquiring
true
tail
true false
Art of Multiprocessor Programming 161
Acquiring
acquired acquiring
true
Yes!
false true
tail
162
163
Purple Release
releasing swap
false false
Art of Multiprocessor Programming 174
Purple Release
false false
Art of Multiprocessor Programming 175
Purple Release
Purple Release
releasing prepare to spin
true false
Art of Multiprocessor Programming 177
Purple Release
releasing spinning
true false
Art of Multiprocessor Programming 178
Purple Release
releasing spinning
Purple Release
releasing Acquired lock
Abortable Locks
What if you want to give up waiting for a lock? For example
Timeout Database transaction aborted by user
181
Back-off Lock
Aborting is trivial
Just return from lock() call
Extra benefit:
No cleaning up Wait-free Immediate return
182
Queue Locks
Cant just quit
Thread in line behind will starve
183
Queue Locks
spinning spinning spinning
true
true
true
184
Queue Locks
locked spinning spinning
false
true
true
185
Queue Locks
locked spinning
false
true
186
Queue Locks
locked
false
187
Queue Locks
spinning spinning spinning
true
true
true
188
Queue Locks
spinning spinning
true
true
true
189
Queue Locks
locked spinning
false
true
true
190
Queue Locks
spinning
false
true
191
Queue Locks
pwned
false
true
192
Idea:
let successor deal with it.
193
Initially
idle Pointer to predecessor (or null)
tail
A
Art of Multiprocessor Programming 194
Initially
idle Distinguished available node means lock is free
tail
A
Art of Multiprocessor Programming 195
Acquiring
acquiring
tail
A
Art of Multiprocessor Programming 196
A
Art of Multiprocessor Programming 197
Acquiring
acquiring
Swap
A
Art of Multiprocessor Programming 198
Acquiring
acquiring
A
Art of Multiprocessor Programming 199
Acquired
locked
Pointer to AVAILABLE means lock is free.
A
Art of Multiprocessor Programming 200
Normal Case
locked spinning spinning
202
Successor Notices
locked Timed out spinning
204
205
Time-out Lock
public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode;
207
Time-out Lock
public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode;
Time-out Lock
public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode;
Time-out Lock
public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> myNode;
Remember my node
Art of Multiprocessor Programming 210
Time-out Lock
public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred== null || myPred.prev == AVAILABLE) { return true; }
211
Time-out Lock
public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; }
Time-out Lock
public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; }
Time-out Lock
public boolean lock(long timeout) { Qnode qnode = new Qnode(); myNode.set(qnode); qnode.prev = null; Qnode myPred = tail.getAndSet(qnode); if (myPred == null || myPred.prev == AVAILABLE) { return true; } ...
Time-out Lock
locked
spinning
spinning
long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } }
Art of Multiprocessor Programming 215
Time-out Lock
long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Keep trying for a }
Art of Multiprocessor Programming
while
216
Time-out Lock
long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } } prev field
Art of Multiprocessor Programming
Spin on predecessors
217
Time-out Lock
long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } } Predecessor released
Art of Multiprocessor Programming
lock
218
Time-out Lock
long start = now(); while (now()- start < timeout) { Qnode predPred = myPred.prev; if (predPred == AVAILABLE) { return true; } else if (predPred != null) { myPred = predPred; } Predecessor aborted, } advance one
Art of Multiprocessor Programming 219
Time-out Lock
if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } }
220
Time-out Lock
if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } }
Time-out Lock
if (!tail.compareAndSet(qnode, myPred)) qnode.prev = myPred; return false; } }
Time-Out Unlock
public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; }
223
Time-out Unlock
public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; }
Timing-out Lock
public void unlock() { Qnode qnode = myNode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; }
227