Mehmet Senvar - Cache Coherence Protocols
Mehmet Senvar
Cache Coherence Protocols 1
Outline
Caches allow greater performance by storing frequently used data in faster memory. Since all processors share the same address space, more than one processor can cache an address (or data item) at a time. If one processor updates the data item without informing the other processors, inconsistencies may result and cause incorrect execution.
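The inconsistency described above can be reproduced with a toy model. This is a minimal sketch, not a model of real hardware: the `Cache` class, the shared `memory` dictionary, and the write-through behavior are all assumptions made for illustration.

```python
# Hypothetical toy model: one shared memory, two private caches, no coherence.
memory = {"x": 0}

class Cache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                      # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]              # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value             # update the local copy
        self.memory[addr] = value            # write through to shared memory

p0, p1 = Cache(memory), Cache(memory)
p0.read("x")
p1.read("x")                 # both processors now cache x = 0
p0.write("x", 42)            # P0 updates x without informing P1
stale = p1.read("x")         # P1 hits in its own cache and sees the old value
print(stale, memory["x"])    # prints: 0 42
```

P1 keeps reading 0 even though memory holds 42, which is the incorrect execution a coherence protocol has to prevent.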
For correct execution, coherence must be enforced between the caches. Two major factors are performance and implementation cost.
Protocols also differ in their coherence detection strategy, coherence enforcement strategy, precision of block-sharing information, and cache block size.
The performance of write-update (WU) and write-invalidate (WI) protocols varies depending on the application and the number of writes. Hybrid protocols switch between WU and WI based on the number of writes to a block.
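To see why neither policy wins in general, the sketch below counts coherence messages for accesses to a single block under each policy. The cost model is an assumption chosen for simplicity (one message per miss, one message per remote copy updated or invalidated), and the function name `coherence_messages` is hypothetical.

```python
def coherence_messages(trace, policy):
    """Count messages for accesses to ONE block.

    trace:  list of (processor, 'r' or 'w')
    policy: 'WU' (write-update) or 'WI' (write-invalidate)
    Assumed cost model: 1 message per miss, 1 per remote copy touched by a write.
    """
    holders = set()                      # processors that currently cache the block
    messages = 0
    for proc, op in trace:
        if proc not in holders:
            messages += 1                # miss: fetch the block
            holders.add(proc)
        if op == "w":
            others = holders - {proc}
            messages += len(others)      # one update or invalidation per other copy
            if policy == "WI":
                holders = {proc}         # the other copies are dropped
    return messages

# One processor writes repeatedly while another holds an unused copy.
write_run = [("P1", "r")] + [("P0", "w")] * 10
# Producer-consumer: every write is followed by a read on the other processor.
ping_pong = [("P0", "w"), ("P1", "r")] * 5

wu_run, wi_run = coherence_messages(write_run, "WU"), coherence_messages(write_run, "WI")
wu_pp, wi_pp = coherence_messages(ping_pong, "WU"), coherence_messages(ping_pong, "WI")
print(wu_run, wi_run)   # prints: 12 3
print(wu_pp, wi_pp)     # prints: 6 10
```

Under this model WI is cheaper for the long write run and WU is cheaper for the producer-consumer pattern, which is the kind of difference a hybrid protocol tries to exploit by counting writes per block.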
Consistency Models
A consistency model defines how the consistency of data values is maintained. Some consistency models are sequential consistency, weak consistency, and release consistency.
Weak consistency models are more efficient to implement and require fewer coherence messages
Make shared data non-cacheable. This is one of the simplest software solutions. It can also be done in hardware by making the cache locations unreachable.
Every cache write request is sent to all other caches. Each cache must first check whether it holds the data. Other copies are then either updated or invalidated. Significant additional memory transactions occur.
Hardware Protocols
Snooping protocols rely on a shared bus between the processors for coherence
On a processor write, the write is passed through the cache to main memory on the bus. Any processor caching the address may update or invalidate its cache entry as appropriate.
Snooping protocols do not scale well beyond 32 processors because of the shared bus. The choice between write-update (WU), write-invalidate (WI), and competitive update (CU) is especially important to reduce communication.
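The write-through snooping behavior described above can be sketched as follows. This is an illustrative model under stated assumptions: the `Bus` and `SnoopingCache` classes are hypothetical, the bus is a simple loop over all attached caches, and uninitialized memory reads as 0.

```python
class Bus:
    """Shared bus: every write is visible to every attached cache."""
    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast_write(self, writer, addr, value):
        self.memory[addr] = value                  # write passes through to memory
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(addr, value)     # other caches observe the write

class SnoopingCache:
    def __init__(self, bus, policy="invalidate"):
        self.bus, self.policy, self.lines = bus, policy, {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                 # miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.broadcast_write(self, addr, value)

    def snoop_write(self, addr, value):
        if addr not in self.lines:
            return                                 # not cached here: nothing to do
        if self.policy == "update":
            self.lines[addr] = value               # write-update
        else:
            del self.lines[addr]                   # write-invalidate

bus = Bus()
c0 = SnoopingCache(bus)
c1 = SnoopingCache(bus, policy="update")
c2 = SnoopingCache(bus)
c0.write("x", 1)
first = (c1.read("x"), c2.read("x"))   # both caches fetch x = 1
c0.write("x", 2)                       # c1 is updated, c2 is invalidated
second = (c1.read("x"), c2.read("x"))  # c2 misses and refetches
```

Because every write visits every cache on the bus, the loop in `broadcast_write` also hints at why the shared bus becomes the bottleneck as processors are added.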
MESI Example
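As a textual stand-in for a MESI walk-through, the sketch below encodes the usual per-line state transitions for the four states Modified, Exclusive, Shared, and Invalid. The event names (`PrRd`, `PrWr`, `BusRd`, `BusRdX`) follow common textbook usage and the function `next_state` is hypothetical; data transfer and write-back are not modeled, and bus upgrades are folded into `BusRdX` as a simplifying assumption.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def next_state(state, event, others_have_copy=False):
    """Next MESI state of one cache line in one cache.

    PrRd / PrWr: read or write by the local processor.
    BusRd / BusRdX: read or read-for-ownership by another cache, seen by snooping.
    """
    if event == "PrRd":
        if state is State.INVALID:       # read miss
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state                     # read hit: no change
    if event == "PrWr":
        return State.MODIFIED            # from S or I, other copies get invalidated
    if event == "BusRd":
        if state in (State.MODIFIED, State.EXCLUSIVE):
            return State.SHARED          # M would supply or write back its data first
        return state
    if event == "BusRdX":
        return State.INVALID             # another cache wants to write
    raise ValueError(f"unknown event: {event}")

# Two caches, A and B, tracking the same line.
a = b = State.INVALID
a = next_state(a, "PrRd")                          # A reads alone:        I -> E
b = next_state(b, "PrRd", others_have_copy=True)   # B reads:              I -> S
a = next_state(a, "BusRd")                         # A snoops B's read:    E -> S
a = next_state(a, "PrWr")                          # A writes:             S -> M
b = next_state(b, "BusRdX")                        # B snoops A's write:   S -> I
print(a.value, b.value)                            # prints: M I
```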
Directory-Based Protocols
Directory-based protocols do not rely on a shared bus to exchange coherence information (use point-to-point connections)
These protocols are more scalable (hundreds of processors are possible), let each processor have its own memory, and can implement weak consistency for efficiency.
Each node maintains a directory storing cache information and memory information. A processor communicates with the directory to access memory.
If a processor requests a non-local memory page, the directory uses its information to find the page. It then uses messages to retrieve the page and ensure that all other processors have consistent information. Since the directory tracks which processors are caching the page, it only needs to send messages to those processors.
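The key property, that messages go only to the processors recorded as caching the page, can be sketched as below. The `Directory` class, its message log, and the invalidate-on-write policy are assumptions for illustration; a real protocol would also handle acknowledgements, ownership transfer, and races.

```python
class Directory:
    """Hypothetical home directory for a set of pages."""
    def __init__(self):
        self.memory = {}       # page -> current value
        self.sharers = {}      # page -> set of node ids caching the page
        self.messages = []     # log of (kind, node, page) point-to-point messages

    def read(self, node, page):
        self.sharers.setdefault(page, set()).add(node)   # record the new sharer
        self.messages.append(("data", node, page))
        return self.memory.get(page, 0)

    def write(self, node, page, value):
        # Only nodes recorded as sharers are contacted, not every node.
        for other in sorted(self.sharers.get(page, set()) - {node}):
            self.messages.append(("invalidate", other, page))
        self.sharers[page] = {node}
        self.memory[page] = value

d = Directory()
d.memory["A"] = 5
d.read(1, "A")
d.read(2, "A")                 # nodes 1 and 2 cache page A; node 3 never touches it
d.write(0, "A", 9)             # node 0 writes page A
invalidated = [n for kind, n, _ in d.messages if kind == "invalidate"]
print(invalidated)             # prints: [1, 2]
```

Node 3 receives no message at all, whereas a bus-based scheme would have exposed the write to every cache.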
Cache block granularity is the size of the cache and the size of a cache line
CC-NUMA machines have a cache that is separate from, and smaller than, main memory. COMA machines use a node's entire memory as a cache for remote pages. Block size affects performance (false sharing).
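The link between block size and false sharing comes down to address arithmetic: two unrelated variables interfere whenever their addresses map to the same coherence block. A minimal sketch, with hypothetical helper names and example addresses chosen only for illustration:

```python
def block_of(addr, block_size):
    """Index of the coherence block that contains a byte address."""
    return addr // block_size

def falsely_shared(addr_a, addr_b, block_size):
    """True if two distinct addresses fall into the same block."""
    return addr_a != addr_b and block_of(addr_a, block_size) == block_of(addr_b, block_size)

# Two 8-byte counters placed next to each other, each used by a different processor.
counter_a, counter_b = 0x1000, 0x1008

with_64_byte_blocks = falsely_shared(counter_a, counter_b, 64)   # same block
with_8_byte_blocks = falsely_shared(counter_a, counter_b, 8)     # separate blocks
print(with_64_byte_blocks, with_8_byte_blocks)                   # prints: True False
```

With 64-byte blocks every write to one counter triggers coherence traffic for the other, even though the processors never touch each other's data.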
The cache controller is the hardware that maintains the directory and processes memory requests. It can be custom hardware or a programmable protocol processor.
The directory structure is how the cache and memory information is organized. Options include a (p+1)-bit full directory, linked-list directories, and tagged directories.
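A full directory entry can be sketched as a bit vector. The sketch assumes that the p+1 bits are one presence bit per processor plus a single dirty bit, which is the usual reading but is not spelled out in the text; the class name `FullDirectoryEntry` is hypothetical.

```python
class FullDirectoryEntry:
    """Directory entry for one memory block in a system with p processors."""
    def __init__(self, p):
        self.p = p
        self.presence = 0          # bit i set -> processor i caches the block
        self.dirty = False         # set when exactly one cache holds a modified copy

    def add_sharer(self, i):
        self.presence |= 1 << i

    def set_exclusive(self, i):
        self.presence = 1 << i     # all other presence bits are cleared
        self.dirty = True

    def sharers(self):
        return [i for i in range(self.p) if (self.presence >> i) & 1]

    def bits(self):
        return self.p + 1          # p presence bits + 1 dirty bit

def directory_bits(p, n_blocks):
    """Total directory storage in bits for n_blocks memory blocks."""
    return n_blocks * (p + 1)

entry = FullDirectoryEntry(8)
entry.add_sharer(1)
entry.add_sharer(5)
readers = entry.sharers()          # [1, 5]
entry.set_exclusive(3)
owner = entry.sharers()            # [3]
```

Because the per-block cost grows linearly with p, the storage returned by `directory_bits` grows with both memory size and processor count, which is the motivation for limited and chained directories.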
Directory Models
Full Directory
Limited Directory
Chained (linked) Directory
Lock-Based Protocols
Lock-based protocols are newer work that promises to be more scalable than directory protocols. They implement scope consistency, which is similar to lazy release consistency. Coherence information is exchanged by reading and writing notices on the lock that protects the shared memory. They are currently implemented in software, similar to DSM, but may move to hardware if performance gains can be realized.
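The idea of exchanging coherence information through the lock can be sketched roughly as follows. This is a heavily simplified illustration, not the published protocol: the `NoticeLock` and `Node` classes are hypothetical, mutual exclusion is not modeled, and a write notice is reduced to the address that was written.

```python
class NoticeLock:
    """Hypothetical lock that carries write notices between holders."""
    def __init__(self):
        self.notices = set()           # addresses written under this lock

class Node:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}
        self.written = set()           # addresses written since the last acquire

    def acquire(self, lock):
        for addr in lock.notices:      # read the notices left by earlier holders
            self.cache.pop(addr, None) # drop copies that may be stale
        self.written = set()

    def read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value       # stays local until release
        self.written.add(addr)

    def release(self, lock):
        for addr in self.written:
            self.memory[addr] = self.cache[addr]   # make the updates available
        lock.notices |= self.written               # leave write notices on the lock

memory = {"x": 0}
lock = NoticeLock()
a, b = Node(memory), Node(memory)

b.read("x")                  # B caches x = 0
a.acquire(lock)
a.write("x", 7)
a.release(lock)
before = b.read("x")         # B has not acquired the lock: it may still see 0
b.acquire(lock)
after = b.read("x")          # after acquiring, B sees the update
print(before, after)         # prints: 0 7
```

No invalidations are sent when A writes; B only pays for coherence when it enters the same scope by acquiring the lock.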
Software Protocols
Software protocols enforce consistency with limited hardware support by relying either on the compiler or on specialized software handlers. They are similar to distributed shared memory (DSM) systems but operate at a lower level:
sharing usually in blocks, not pages
needs to be more efficient for better performance
architecture support for sharing
dynamism - compile-time or run-time analysis
selectivity - level of coherence actions
restrictiveness - conservative or as-needed consistency enforcement
adaptivity - whether the protocol can adapt to access patterns
granularity - size and structure of coherence data
blocking - program block on which coherence is enforced
positioning - position of coherence instructions
updating - how memory is updated after a write
checking - how incoherence is detected
The compiler must generate consistent code, as no hardware coherence is provided. The hardware maintains time tags, which are updated on every write. On a read, the compiler generates coherence reads, which check the time tags to ensure the data is consistent. This relies on the compiler to detect reads that may be inconsistent, and on the hardware to maintain the time tags. Using tags, it is also possible to perform dynamic self-invalidation of blocks. Many techniques are based on these time tags.
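The difference between an ordinary read and a compiler-generated coherence read can be sketched as below. The model is an assumption made for illustration: each address carries a counter that is bumped on every write, and a coherence read compares the tag stored with the cached copy against the current one. The class and method names are hypothetical, and real time-tag schemes differ in where tags live and how they are compared.

```python
class TaggedMemory:
    def __init__(self):
        self.values = {}
        self.tags = {}                       # address -> time tag

    def write(self, addr, value):
        self.values[addr] = value
        self.tags[addr] = self.tags.get(addr, 0) + 1   # tag updated on every write

class TaggedCache:
    def __init__(self, mem):
        self.mem = mem
        self.lines = {}                      # address -> (value, tag at fetch time)

    def _fetch(self, addr):
        self.lines[addr] = (self.mem.values[addr], self.mem.tags[addr])

    def plain_read(self, addr):
        """Ordinary read: trusts whatever is cached."""
        if addr not in self.lines:
            self._fetch(addr)
        return self.lines[addr][0]

    def coherence_read(self, addr):
        """Read emitted where the compiler cannot prove the data is fresh."""
        line = self.lines.get(addr)
        if line is None or line[1] != self.mem.tags[addr]:
            self._fetch(addr)                # tag mismatch: cached copy is stale
        return self.lines[addr][0]

mem = TaggedMemory()
cache = TaggedCache(mem)
mem.write("x", 1)
first = cache.plain_read("x")        # 1, now cached with tag 1
mem.write("x", 2)                    # another processor writes; tag becomes 2
stale = cache.plain_read("x")        # still 1: no check performed
fresh = cache.coherence_read("x")    # tag mismatch detected, refetches 2
print(first, stale, fresh)           # prints: 1 1 2
```

The compiler's job is to emit `coherence_read` only where staleness is possible, since every such read pays for the tag check.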
For hardware without time tags, Petersen and Li developed an algorithm that uses only page translation hardware and page status tables. Sharing information is maintained by a software handler at the page level. On a page access or fault, the software handler checks the sharing information, updates the page tables, and performs coherence actions. This is slower than hardware because software handlers involve the OS and are on the critical memory access path.
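The general shape of a page-level software handler can be sketched as follows. This is a generic page-protection sketch, not Petersen and Li's algorithm: the `PageCoherence` class, the three access rights, and the single-writer policy are all assumptions for illustration.

```python
class PageCoherence:
    """Hypothetical page status table with a software fault handler."""
    def __init__(self, n_procs):
        self.n = n_procs
        self.rights = {}       # page -> per-processor right: "none", "read", "write"

    def _entry(self, page):
        return self.rights.setdefault(page, ["none"] * self.n)

    def access(self, proc, page, op):
        """op is "read" or "write"; returns "hit" or "fault"."""
        entry = self._entry(page)
        allowed = ("read", "write") if op == "read" else ("write",)
        if entry[proc] in allowed:
            return "hit"                       # translation hardware permits it
        self._handle_fault(proc, page, op)     # otherwise trap to the handler
        return "fault"

    def _handle_fault(self, proc, page, op):
        entry = self._entry(page)
        for other in range(self.n):
            if other == proc:
                continue
            if op == "write":
                entry[other] = "none"          # invalidate all other mappings
            elif entry[other] == "write":
                entry[other] = "read"          # downgrade the current writer
        entry[proc] = op                       # grant the requested right

pc = PageCoherence(2)
r1 = pc.access(0, 7, "write")      # first touch faults, P0 becomes the writer
r2 = pc.access(0, 7, "write")      # now permitted
r3 = pc.access(1, 7, "read")       # P1 faults; P0 is downgraded to read
shared = list(pc.rights[7])        # ["read", "read"]
r4 = pc.access(0, 7, "write")      # P0 faults again and P1 is invalidated
```

Every state change goes through `_handle_fault`, which is why this approach puts software on the critical memory access path.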
Compilers can also guarantee coherence by structuring the language to limit parallelism. This approach makes coherence easier to enforce, limits the programmer and the potential parallelism, simplifies compiler design, and can achieve good performance with no hardware support.
Optimizing Compilers
Optimizing compilers are designed to maintain coherence with limited hardware support without overly restricting the programmer. They:
rely on detecting data dependencies
may use synchronization variables (locks, barriers)
can provide the hardware with hints
can detect when coherence is not needed
may have problems with dynamic sharing
offer good performance, but are hard to design
Future Work
Hardware protocols are well defined, and the directory structure is near optimal. Cost improvements can be obtained by mass-producing cache controller chips. Software protocols are a good area for future research because they are also applicable at higher levels of sharing (DSM, databases, ...). Optimizing compilers need to be improved to detect data dependencies and optimize code for the parallel environment.
Conclusions
Hardware protocols offer the best performance but at a high hardware cost. Software protocols can be used when there is no hardware support, with a slight performance penalty. Optimizing compilers can enforce coherence or provide hints to the hardware. A combination of hardware and compiler optimizations is the best approach.