The Performance of Spin Lock Alternatives For Shared-Memory Multiprocessors


THOMAS E. ANDERSON
Presented by Daesung Park

Introduction

In shared-memory multiprocessors, each processor can directly access memory. To keep a shared data structure consistent, we need a method to serialize the operations performed on it. Shared-memory multiprocessors provide some form of hardware support for mutual exclusion: atomic instructions.

Why is a lock needed?

If the operations on critical sections are simple enough

Encapsulate these operations in a single atomic instruction. Mutual exclusion is directly guaranteed: each processor attempting to access the shared data waits its turn without returning control to software.

If the operations are not simple

A LOCK is needed. If the lock is busy, waiting is done in software. There are two choices: block or spin.

The topics of this paper

Are there efficient algorithms for spinning in software on a busy lock?

Five software alternatives are presented

Is more complex hardware support needed for good performance?

Hardware solutions for multistage interconnection network multiprocessors and single-bus multiprocessors are presented

Multiprocessor Architectures

How are processors connected to memory?

Multistage interconnection network or bus

Whether or not each processor has a coherent private cache

Yes or no

What is the coherence protocol?

Invalidation-based or distributed-write

For good performance

Minimize the communication bandwidth consumed by spinning processors
Minimize the delay between a lock being released and reacquired
Minimize lock latency when there is no contention, by using a simple algorithm

The problem of spinning

Spin on Test-and-Set

The performance of spinning on test-and-set degrades as the number of spinning processors increases: the lock holder must contend with the spinning processors for access to the lock location, and for the other locations it needs for normal operation.

The problem of spinning Spin on TAS

[Figure: processors P1-P4 contend over the bus for the lock location in MEMORY]

lock := CLEAR;                      (init)
while (TestAndSet(lock) = BUSY) ;   (acquire: spin on test-and-set)
lock := CLEAR;                      (release)

BUS, Write-Through, Invalidation-based, Spin on Read
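The spin-on-test-and-set loop above can be sketched with C11 atomics. This is a minimal illustration, not the paper's code: the `tas_*` names and the four-thread counter demo are invented for this sketch.

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* CLEAR */
static long counter = 0;

static void tas_acquire(void) {
    /* while (TestAndSet(lock) = BUSY) ; -- every attempt goes to the bus */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

static void tas_release(void) {
    /* lock := CLEAR */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        tas_acquire();
        counter++;                  /* critical section */
        tas_release();
    }
    return 0;
}

/* Run four spinners; with a correct lock the final count is exactly 4 * 10000. */
long run_demo(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], 0, worker, 0);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], 0);
    return counter;
}
```

Note that every iteration of the acquire loop executes a test-and-set, so each spinner generates memory traffic even while the lock is held; this is exactly the contention the slide describes.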

The problem of spinning

Spin on Read (Test-and-Test-and-Set)

Use the cache to reduce the cost of spinning. When the lock is released, each cache is updated or invalidated; the waiting processor sees the change and performs a test-and-set. When the critical section is small, this is as poor as spin on test-and-set. The effect is most pronounced for systems with invalidation-based cache coherence, but also occurs with distributed-write.

The problem of spinning Spin on read

[Figure: processors P1-P4 each spin on a locally cached copy of the lock (valid/invalid per cache); MEMORY holds the lock]

while (lock = BUSY or TestAndSet(lock) = BUSY) ;

BUS, Write-Through, Invalidation-based
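A minimal C11 sketch of the spin-on-read (test-and-test-and-set) loop above; the `ttas_*` names and the CLEAR/BUSY encoding are illustrative choices:

```c
#include <stdatomic.h>

#define CLEAR 0
#define BUSY  1

static atomic_int ttas_lock = CLEAR;

void ttas_acquire(atomic_int *l) {
    for (;;) {
        /* spin on the locally cached copy; no lock traffic while BUSY */
        while (atomic_load_explicit(l, memory_order_relaxed) == BUSY)
            ;
        /* lock looks free: make one test-and-set attempt */
        if (atomic_exchange_explicit(l, BUSY, memory_order_acquire) == CLEAR)
            return;
    }
}

void ttas_release(atomic_int *l) {
    atomic_store_explicit(l, CLEAR, memory_order_release);
}
```

The inner load spins on the cached copy of the lock; only when the lock appears free does the processor issue a bus-visible test-and-set (modeled here by `atomic_exchange`).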

Reasons for the poor performance of spin on read

There is a separation between detecting that the lock has been released and attempting to acquire it with a test-and-set instruction

More than one test-and-set can occur

Caches are invalidated by a test-and-set even if the value is not changed
Invalidation-based cache coherence requires O(P) bus or network cycles to broadcast each invalidation

Problem of spinning
Measurement Result 1

Problem of spinning
Measurement Result 2

Software solutions
Delay Alternatives

Insert a delay into the spinning loop

Where to insert the delay

After the lock has been released
After every separate access to the lock

The length of the delay

Static or dynamic

Lock latency is not affected, because processors attempt to get the lock once before delaying

Delay Alternatives

Delay after a Spinning Processor Notices the Lock Has Been Released

Reduces the number of test-and-sets when spinning on read. Each processor can be statically assigned a separate slot, i.e., an amount of time to delay; the spinning processor with the smallest delay gets the lock, and the others may resume spinning without executing a test-and-set. When there are few spinning processors, using fewer slots is better; when there are many, fewer slots result in many test-and-set attempts.

Delay Alternatives

Vary spinning behavior based on the number of waiting processors. The number of collisions approximates the number of waiting processors. Initially assume that there are no other waiting processors: try a test-and-set; a failure is a collision, so double the delay, up to some limit.
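The doubling-delay behavior can be sketched as an exponential-backoff lock in C11. The function names and the MIN_DELAY/MAX_DELAY bounds are illustrative, not values from the paper:

```c
#include <stdatomic.h>

enum { MIN_DELAY = 1, MAX_DELAY = 1024 };   /* illustrative bounds */

static atomic_int bo_lock = 0;              /* 0 = CLEAR, 1 = BUSY */

static void pause_for(int iters) {
    for (volatile int i = 0; i < iters; i++)
        ;                                   /* burn time off the bus */
}

void backoff_acquire(atomic_int *l) {
    int delay = MIN_DELAY;                  /* assume no other waiters */
    while (atomic_exchange_explicit(l, 1, memory_order_acquire) != 0) {
        pause_for(delay);                   /* failed test-and-set = collision */
        if (delay < MAX_DELAY)
            delay *= 2;                     /* double the delay up to some limit */
    }
}

void backoff_release(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}
```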

Delay Alternatives

Delay Between Each Memory Reference

Can be used on architectures without caches or with invalidation-based caches. Reduces the bandwidth consumed by spinning processors. The mean delay can be set statically or dynamically. More frequent polling improves performance when there are few spinning processors.

Software Solutions
Queuing in Shared Memory

Each processor inserts itself into a queue, then spins on a separate memory location (a flag). When a processor finishes the critical section, it sets the flag of the next processor in the queue. Only one cache read miss occurs per lock hand-off. Maintaining the queue is expensive, so this is much worse for small critical sections.

Queuing

Init
  flags[0] := HAS_LOCK;
  flags[1..P-1] := MUST_WAIT;
  queueLast := 0;

Lock
  myPlace := ReadAndIncrement(queueLast);
  while (flags[myPlace mod P] = MUST_WAIT) ;
  CRITICAL SECTION;

Unlock
  flags[myPlace mod P] := MUST_WAIT;
  flags[(myPlace + 1) mod P] := HAS_LOCK;
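The Init/Lock/Unlock pseudocode above maps closely onto C11 atomics. This is a minimal sketch (P = 4 and the `queue_*` names are illustrative), with ReadAndIncrement modeled by `atomic_fetch_add`:

```c
#include <stdatomic.h>

#define P 4                                  /* number of processors (illustrative) */
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

static atomic_int flags[P] = { HAS_LOCK };   /* flags[1..P-1] start as MUST_WAIT */
static atomic_int queueLast = 0;

/* Lock: take a place in line, then spin on our own flag slot. */
int queue_lock(void) {
    int myPlace = atomic_fetch_add(&queueLast, 1);   /* ReadAndIncrement */
    while (atomic_load_explicit(&flags[myPlace % P], memory_order_acquire)
           == MUST_WAIT)
        ;                                            /* spin on own location */
    return myPlace;                                  /* caller passes this to unlock */
}

/* Unlock: reset our slot, then hand the lock to the next place in line. */
void queue_unlock(int myPlace) {
    atomic_store_explicit(&flags[myPlace % P], MUST_WAIT, memory_order_relaxed);
    atomic_store_explicit(&flags[(myPlace + 1) % P], HAS_LOCK,
                          memory_order_release);
}
```

Each waiter spins on its own `flags[]` slot, so a release touches only two locations instead of invalidating every spinner's copy of one shared lock word.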

Queuing
Implementations among architectures

Distributed-write cache coherence

All processors share a counter. To release the lock, a processor writes its sequence number into the shared counter; every cache is updated, directly notifying the next processor to get the lock.

Invalidation-based cache coherence

Each processor should wait on a flag in a separate cache block. One cache is invalidated and one read miss occurs per hand-off.

Multistage network without coherence

Each processor should wait on a flag in a separate memory location; processors have to poll to learn when it is their turn.
Queuing
Implementations

Bus without coherence

Processors must poll to find out if it is their turn, and this can swamp the bus. A delay can be inserted between each poll, according to the processor's position in the queue and the execution time of the critical section.

Without an atomic read-and-increment instruction

A lock is needed around the increment of the counter. One of the delay alternatives above may be helpful for contention.

Problem: increased lock latency

Releasing requires setting the releasing processor's own location to MUST_WAIT and setting another location to HAS_LOCK. If there is no contention, this extra latency is a pure loss of performance.
Measurement Results of Software Alternatives 1

Measurement Results of Software Alternatives 2

Measurement Results of Software Alternatives 3

Hardware Solutions
Multistage Interconnection Network Multiprocessors

Combining networks

For spin on test-and-set: only one of the test-and-set requests is forwarded to memory, and all other requests are returned with the value already set. Lock latency may increase.

Hardware queuing at the memory module

Eliminates polling across the network without coherence. Processors issue enter and exit instructions to the memory module. Lock latency is likely to be better than with software queuing.

Caches to hold queue links

Stores the name of the next processor in the queue directly in each processor's cache.

Hardware Solutions
Single Bus Multiprocessors

Read broadcast

Eliminates duplicate read-miss requests. If a read occurs on the bus for data that is invalid in some processor's cache, that cache takes the data and makes itself valid. Thus the invalid copies in other caches can be validated by one processor's read.

Special handling of test-and-set requests in the cache

A processor can spin on test-and-set, acquiring the lock quickly when it is free without consuming bus bandwidth while it is busy. If the test-and-set would fail, it is not committed to the bus.
Conclusion

Simple methods of spin-waiting degrade performance as the number of spinning processors increases. Software queuing and backoff have good performance even for large numbers of spinning processors: backoff has better performance when there is no contention, while queuing performs best when there is contention. Special hardware support can improve performance as well.
