
Chip Multiprocessors (CMP)

CMP is the mantra of today’s microprocessor industry


-Intel’s dual-core Pentium 4: each core is still hyper-threaded (it just uses
existing cores)

-Intel’s quad-core Whitefield is coming up in a year or so

-For the server market Intel has announced a dual-core Itanium 2 (code-named Montecito); again each core is 2-way threaded

-AMD released its dual-core Opteron in 2005

-IBM released their first dual-core processor POWER4 circa 2001; next-
generation POWER5 also uses two cores but each core is also 2-way
threaded

-Sun’s UltraSPARC IV (released in early 2004) is a dual-core processor and integrates two UltraSPARC III cores.

Why CMP?

Today microprocessor designers can afford to have a lot of transistors on the die
-Ever-shrinking feature size leads to dense packing.
-What would you do with so many transistors?
-Some can be invested in caches, but beyond a certain point that doesn't help.
-The natural choice was to think about a greater level of integration.
-A few chip designers decided to bring the memory and coherence controllers, along with the router, onto the die.
-The next obvious choice was to replicate the entire core; it is fairly simple: just use the existing cores and connect them through a coherent interconnect.

Moore’s Law

Moore's Law describes a long-term trend in the history of computing hardware, in which the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years.[1] Rather than being a naturally occurring "law" that cannot be controlled, however, Moore's Law is effectively a business practice in which the advancement of transistor counts occurs at a fixed rate.[2]
The law is named for Intel co-founder Gordon E. Moore, who introduced the concept in a 1965 paper. It has since been used in the semiconductor industry to guide long-term planning and to set targets for research and development.

[Figure: plot of CPU transistor counts against dates of introduction; the curve shows counts doubling approximately every two years.]

Consequences and Limitations

Transistor count versus computing performance


The exponential processor transistor growth predicted by Moore does not always translate into exponentially greater practical computing performance. For example, the higher transistor density in multi-core CPUs does not greatly increase speed on many consumer applications that are not parallelized.
Wire Delay
Wires don’t scale with transistor technology: wire delay becomes the bottleneck
-Wire delay doesn’t decrease with transistor size.
-Short wires are good: this dictates localized logic design.
-But superscalar processors exercise “centralized” control, requiring long wires (or pipelined long wires).
-However, to utilize the transistors well, we need to overcome the memory wall problem.
-To hide memory latency we need to extract more independent instructions, i.e., more ILP.
Extracting more ILP directly requires more available in-flight instructions
-But for that we need a bigger ROB, which in turn requires a bigger register file.
-We also need bigger issue queues to be able to find more parallelism.
-None of these structures scales well: the main problem is wiring.
-So the best solution for utilizing these transistors effectively at low cost must not require long wires and must be able to leverage existing technology: CMP satisfies these goals exactly (use existing processors and invest transistors in having more of them on-chip instead of trying to scale the existing processor for more ILP).

Shared L2 Vs Tiled CMP

A chip multiprocessor (CMP) system having several processor cores may utilize a tiled architecture, with each tile containing a processor core, a private first-level cache (L1), a second private or shared cache (L2), and a directory to track copies of cached lines. Historically, these tiled architectures use one of two styles of L2 organization (e.g., Intel Pentium D, dual-core Opteron, Intel Montecito, Sun UltraSPARC IV, and IBM Cell).

Due to constructive data sharing between threads, CMP systems running multi-threaded workloads may use a shared L2 cache. A shared L2 cache maximizes effective L2 capacity because no data is duplicated, but it also increases the average hit latency compared to a private L2 cache. These designs may treat the L2 cache and directory as one structure (e.g., Intel Woodcrest, Intel Conroe, Sun Niagara, IBM POWER4, and IBM POWER5).
CMP systems running scalar and latency-sensitive workloads may prefer a private L2 cache organization, optimizing latency at the expense of a potential reduction in effective cache capacity due to data replication. A private L2 cache offers cache isolation, yet disallows cache borrowing: cache-intensive applications on some cores cannot borrow cache from inactive cores or from cores running applications with small data footprints.
Some generic CMP systems may have three levels of caches. The L1 and L2 caches form two private levels, and a third cache (L3) may be shared across all cores.
Differences (on the Basis of Performance)
Shared caches are often very large in CMPs.
-They are banked to avoid worst-case wire delay.
-The banks are usually distributed across the floor of the chip on an interconnect.
-In a shared cache, getting a block from a remote bank takes time proportional to the physical distance between the requester and the bank.
# Non-uniform cache architecture (NUCA).
-The same holds for private caches if the data resides in a remote cache.
-A shared cache may therefore have a higher average hit latency than a private cache.
# Hopefully most hits in the latter will be local.
-Shared caches are likely to have fewer misses than private caches.
# The latter waste space due to replication.
The sketch below puts rough numbers on this latency trade-off.
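A minimal sketch of the NUCA effect, assuming made-up latency parameters (BANK_ACCESS, HOP_LATENCY) and a requester sitting at one end of a line of banks; real designs differ:

```cpp
#include <cstdio>

// Hypothetical latencies (cycles); real numbers depend on the design.
const int BANK_ACCESS = 6;  // time to access one L2 bank
const int HOP_LATENCY = 2;  // per-hop wire/router delay on the interconnect

// Average L2 hit latency for a requester at bank 0 of a shared NUCA cache
// with `banks` banks laid out in a line, assuming hits spread uniformly.
double sharedNucaHitLatency(int banks) {
    double total = 0;
    for (int b = 0; b < banks; ++b)
        total += BANK_ACCESS + HOP_LATENCY * b;  // distance-proportional delay
    return total / banks;
}

int main() {
    // A private L2 hit is always local: one bank, zero hops.
    printf("private L2 hit:      %d cycles\n", BANK_ACCESS);
    printf("shared 16-bank NUCA: %.1f cycles on average\n",
           sharedNucaHitLatency(16));
}
```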

Snoopy Coherence
Snoopy Protocols
Cache coherence protocols implemented in bus-based machines are called
snoopy protocols.
-The processors snoop or monitor the bus and take appropriate protocol actions based on snoop results.
-The cache controller now receives requests from both the processor and the bus.
-Since cache state is maintained on a per-line basis, that also dictates the coherence granularity.
-One cannot normally take a coherence action on parts of a cache line.
-The coherence protocol is implemented as a finite state machine on a per-cache-line basis.
-The snoop logic in each processor grabs the address from the bus and decides if any action should be taken on the cache line containing that address (only if the line is in the cache).

Write Through Caches

There are only two cache line states:
-Invalid (I): not in the cache
-Valid (V): present in the cache; may be present in other caches as well

A read access to a cache line in I state generates a BusRd request on the bus.
-The memory controller responds to the request and, after reading from memory, launches the line on the bus.
-The requester matches the address, picks up the line from the bus, and fills the cache in V state.
-A store to a line always generates a BusWr transaction on the bus (since the cache is write-through); other sharers either invalidate the line in their caches or update it with the new value.

State Transition

The finite state machine for each cache line:

On a write miss, no line is allocated.
-The state remains I: this is called write-through write-no-allocate.
-A/B means: A is generated by the processor, B is the resulting bus transaction (if any).
-What changes for write-through write-allocate? (The no-allocate machine is sketched below.)
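A minimal sketch of this FSM, assuming the invalidation variant for sharers (a remote BusWr invalidates the local copy); the state and event names follow the A/B notation above:

```cpp
#include <cstdio>

// States and events of a write-through, write-no-allocate snoopy protocol.
enum class State { I, V };
enum class Event { PrRd, PrWr, BusRd, BusWr };

// Returns the next state and prints any bus transaction generated,
// in the A/B notation (processor action / resulting bus transaction).
State next(State s, Event e) {
    switch (e) {
        case Event::PrRd:
            if (s == State::I) puts("PrRd/BusRd");  // read miss: fetch the line
            return State::V;                        // line is now (or stays) valid
        case Event::PrWr:
            puts("PrWr/BusWr");                     // every store goes on the bus
            return s;                               // no allocation on a write miss
        case Event::BusWr:
            return State::I;                        // remote write: invalidate copy
        case Event::BusRd:
            return s;                               // remote reads are harmless
    }
    return s;
}

int main() {
    State s = State::I;
    s = next(s, Event::PrRd);   // I --PrRd/BusRd--> V
    s = next(s, Event::PrWr);   // V --PrWr/BusWr--> V (write-through hit)
    s = next(s, Event::BusWr);  // V --> I (another processor wrote the line)
}
```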

Ordering Memory Op

Assume that the bus is atomic
-It takes up the next transaction only after finishing the previous one.

Read misses and writes appear on the bus and hence are visible to
all processors

What about read hits?
-They take place transparently in the cache.
-But they are correct as long as they are correctly ordered with respect to
writes.
-And all writes appear on the bus and hence are visible immediately in the
presence of an atomic bus.

In general, between writes, reads can happen in any order without violating coherence.
-Writes establish a partial order.

Back To Snoopy Protocols

No change to the processor or cache is needed
-Just extend the cache controller with snoop logic and exploit the bus.

We will focus on writeback caches only.
-Possible states of a cache line: Invalid (I), Shared (S), Modified or dirty (M), Clean exclusive (E), Owned (O); not every processor supports all five states.
-The E state is equivalent to M in the sense that the line has permission to be written, but in E state the line is not yet modified and the copy in memory is the same as in the cache; if someone else requests the line, memory will provide it.
-In the O state, as in M, memory is not responsible for servicing requests to the line; the owner must supply the line. Unlike E, though, an owned line may also be present in other caches in S state.
-Stores really read the memory (as opposed to writing it): a store miss first fetches the line before modifying it.

Stores
Let us look at stores a little more closely.
-There are three situations at the time a store issues: the line is not in the cache, the line is in the cache in S or O state, or the line is in the cache in one of the M and E states.
-If the line is in I state, the store generates a read-exclusive request on the bus and gets the line in M state.
-If the line is in S or O state, the processor only has read permission for the line; the store generates an upgrade request on the bus, and the upgrade acknowledgment grants it write permission (this is a data-less transaction).
-If the line is in M or E state, no bus transaction is generated; the cache already has write permission for the line (this is the case of a write hit; the previous two are write misses).
This case analysis is sketched below.
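A minimal sketch of the case analysis; busActionForStore is an illustrative name, not taken from any real protocol specification:

```cpp
#include <cstdio>

// The bus action a store generates, as a function of the current state
// of the line in the writer's cache (the three cases described above).
enum class State { I, S, O, M, E };

const char *busActionForStore(State s) {
    switch (s) {
        case State::I:  return "BusRdX (read-exclusive; write miss)";
        case State::S:
        case State::O:  return "BusUpgr (data-less upgrade; write miss)";
        case State::M:
        case State::E:  return "none (write hit)";
    }
    return "none";
}

int main() {
    printf("store to S-state line: %s\n", busActionForStore(State::S));
    printf("store to E-state line: %s\n", busActionForStore(State::E));
}
```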

Invalidation vs. Update


Two main classes of protocols:
-Invalidation-based and update-based.
-The class dictates what action is taken on a write.
-Invalidation-based protocols invalidate sharers when a write miss (upgrade or readX) appears on the bus.
-Update-based protocols update the sharer caches with the new value on a write: this requires write transactions (carrying just the modified bytes) on the bus even on write hits (not very attractive with writeback caches).
-Advantage of update-based protocols: sharers continue to hit in the cache, while in invalidation-based protocols sharers will miss the next time they try to access the line.
-Advantage of invalidation-based protocols: only write misses go on the bus (suited for writeback caches), and subsequent stores to the same line are cache hits.

Which One Is Better?

Difficult to answer.
It depends on program behaviour and hardware cost.

When is an update-based protocol good?
-For what sharing pattern? (Large-scale producer/consumer.)
-Otherwise it would just waste bus bandwidth doing useless updates.

When is an invalidation-based protocol good?
-For a sequence of multiple writes to a cache line.
-It saves the intermediate write transactions.

Also think about the overhead of initiating small updates for every write in update protocols.
-Invalidation-based protocols are much more popular.
-Some systems support both, or a hybrid based on the dynamic sharing pattern of a cache line.
The accounting sketch below makes the trade-off concrete.
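A back-of-the-envelope sketch with simplified transaction accounting; compare(writes, readers) and its counting rules are made up for illustration, not taken from a real protocol:

```cpp
#include <cstdio>

// Counts bus transactions for a simple sharing pattern: one producer writes
// a line `writes` times, then `readers` consumers each read it once.
void compare(int writes, int readers) {
    // Invalidation: the first write needs one upgrade/readX transaction;
    // later writes hit in M. Each consumer then misses once (BusRd).
    int invalidation = 1 + readers;
    // Update: every write broadcasts an update; consumers keep hitting.
    int update = writes;
    printf("writes=%d readers=%d  invalidation=%d  update=%d\n",
           writes, readers, invalidation, update);
}

int main() {
    compare(10, 1);   // many writes, one reader: invalidation wins
    compare(1, 10);   // one write, many readers: update wins
}
```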

MSI PROTOCOL
The MSI protocol is a basic cache coherence protocol that is used in
multiprocessor systems. As with other cache coherency protocols, the letters
of the protocol name identify the possible states in which a cache line can
be. So, for MSI, each block contained inside a cache can have one of three
possible states:
• Modified: The block has been modified in the cache. The data in the
cache is then inconsistent with the backing store (e.g. memory). A
cache with a block in the "M" state has the responsibility to write the
block to the backing store when it is evicted.
• Shared: This block is unmodified and exists in at least one cache. The
cache can evict the data without writing it to the backing store.
• Invalid: This block is invalid, and must be fetched from memory or
another cache if the block is to be stored in this cache.
• These coherency states are maintained through communication
between the caches and the backing store. The caches have different
responsibilities when blocks are read or written, or when they learn of
other caches issuing reads or writes for a block.
• When a read request arrives at a cache for a block in the "M" or "S"
states, the cache supplies the data. If the block is not in the cache (in
the "I" state), it must verify that the line is not in the "M" state in any
other cache. Different caching architectures handle this differently. For
example, bus architectures often perform snooping, where the read
request is broadcast to all of the caches.
If another cache has the block in the "M" state, it must write back the data to
the backing store and go to the "S" or "I" states. Once any "M" line is written
back, the cache obtains the block from either the backing store, or another
cache with the data in the "S" state. The cache can then supply the data to
the requestor. After supplying the data, the cache block is in the "S" state.

When a write request arrives at a cache for a block in the "M" state, the
cache modifies the data locally. If the block is in the "S" state, the cache
must notify any other caches that might contain the block in the "S" state
that they must evict the block. This notification may be via bus snooping or a
directory, as described above. Then the data may be locally modified. If the
block is in the "I" state, the cache must notify any other caches that might
contain the block in the "S" or "M" states that they must evict the block. If
the block is in another cache in the "M" state, that cache must either write
the data to the backing store or supply it to the requesting cache. If at this
point the cache does not yet have the block locally, the block is read from
the backing store before being modified in the cache. After the data is
modified, the cache block is in the "M" state.
For any given pair of caches, the permitted states of a given cache line are
as follows:

(√ = the two caches may hold the line in these states simultaneously, x = not permitted)

        M    S    I
   M    x    x    √
   S    x    √    √
   I    √    √    √

•Processor requests to cache: PrRd, PrWr
•Bus transactions: BusRd, BusRdX, BusUpgr, BusWB
These names are used in the sketch below.
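A minimal sketch of the MSI machine for one line in one cache, split into processor-side and snoop-side transitions; this sketches the protocol as described above, not any particular implementation:

```cpp
#include <cstdio>

// Result pairs the next state with the bus transaction (if any) issued.
enum class State { M, S, I };
enum class Bus   { None, BusRd, BusRdX, BusUpgr, BusWB };

struct Result { State next; Bus bus; };

// Processor-side requests.
Result onPrRd(State s) {
    return s == State::I ? Result{State::S, Bus::BusRd}   // read miss
                         : Result{s, Bus::None};          // read hit
}
Result onPrWr(State s) {
    if (s == State::I) return {State::M, Bus::BusRdX};    // write miss
    if (s == State::S) return {State::M, Bus::BusUpgr};   // need permission
    return {State::M, Bus::None};                         // write hit in M
}

// Snoop-side: reaction to a transaction issued by another cache.
Result onSnoop(State s, Bus b) {
    if (s == State::M && b == Bus::BusRd)                 // supply + downgrade
        return {State::S, Bus::BusWB};
    if (b == Bus::BusRdX || b == Bus::BusUpgr)            // remote write intent
        return {State::I, s == State::M ? Bus::BusWB : Bus::None};
    return {s, Bus::None};
}

int main() {
    Result r = onPrWr(State::S);
    printf("S + PrWr -> %s issuing %s\n",
           r.next == State::M ? "M" : "?",
           r.bus == Bus::BusUpgr ? "BusUpgr" : "?");
}
```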

MESI protocol
The MESI protocol (known also as Illinois protocol due to its development at
the University of Illinois at Urbana-Champaign) is a widely used cache
coherency and memory coherence protocol. It is the most common protocol that supports write-back caches. Its use in personal computers became
widespread with the introduction of Intel's Pentium processor to "support the
more efficient write-back cache in addition to the write-through cache
previously used by the Intel 486 processor"[1].
States
Every cache line is marked with one of the four following states (coded in
two additional bits):
Modified
The cache line is present only in the current cache, and is dirty; it has
been modified from the value in main memory. The cache is required
to write the data back to main memory at some time in the future,
before permitting any other read of the (no longer valid) main memory
state. The write-back changes the line to the Exclusive state.
Exclusive
The cache line is present only in the current cache, but is clean; it
matches main memory. It may be changed to the Shared state at any
time, in response to a read request. Alternatively, it may be changed
to the Modified state when writing to it.
Shared
Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid
Indicates that this cache line is invalid.
For any given pair of caches, the permitted states of a given cache line are as follows:

        M    E    S    I
   M    x    x    x    √
   E    x    x    x    √
   S    x    x    √    √
   I    √    √    √    √

Operation
In a typical system, several caches share a common bus to main memory.
Each also has an attached CPU which issues read and write requests. The
caches' collective goal is to minimize the use of the shared main memory.
A cache may satisfy a read from any state except Invalid. An Invalid line
must be fetched (to the Shared or Exclusive states) to satisfy a read.
A write may only be performed if the cache line is in the Modified or
Exclusive state. If it is in the Shared state, all other cached copies must be
invalidated first. This is typically done by a broadcast operation known as
Read For Ownership (RFO).
A cache may discard a non-Modified line at any time, changing to the Invalid
state. A Modified line must be written back first.
A cache that holds a line in the Modified state must snoop (intercept) all
attempted reads (from all of the other caches in the system) of the
corresponding main memory location and insert the data that it holds. This is
typically done by forcing the read to back off (i.e. retry later), then writing
the data to main memory and changing the cache line to the Shared state.
A cache that holds a line in the Shared state must listen for invalidate or
read-for-ownership broadcasts from other caches, and discard the line (by
moving it into Invalid state) on a match.
A cache that holds a line in the Exclusive state must also snoop all read
transactions from all other caches, and move the line to Shared state on a
match.
The Modified and Exclusive states are always precise: i.e. they match the
true cache line ownership situation in the system. The Shared state may be
imprecise: if another cache discards a Shared line, this cache may become
the sole owner of that cache line, but it will not be promoted to Exclusive
state. Other caches do not broadcast notices when they discard cache lines,
and this cache could not use such notifications without maintaining a count
of the number of shared copies.
In that sense the Exclusive state is an opportunistic optimization: if the CPU wants to modify a cache line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction, as the sketch below illustrates.
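A minimal sketch of the processor-side MESI transitions behind this optimization; the function names and the otherCachesHaveLine flag are illustrative:

```cpp
#include <cstdio>

// Processor-side MESI transitions, highlighting the E-state optimization:
// a write to an E-state line needs no bus transaction.
enum class State { M, E, S, I };

State write(State s, bool &busTransaction) {
    // S needs an invalidation (RFO); I needs a full read-for-ownership.
    busTransaction = (s == State::S || s == State::I);
    return State::M;                        // the writer always ends in M
}

State read(State s, bool otherCachesHaveLine) {
    if (s != State::I) return s;            // read hit: state unchanged
    // Read miss: fill as E if no other cache holds the line, else as S.
    return otherCachesHaveLine ? State::S : State::E;
}

int main() {
    bool bus;
    State s = read(State::I, /*otherCachesHaveLine=*/false);  // miss -> E
    s = write(s, bus);                                        // E -> M, silently
    printf("write from E needed a bus transaction? %s\n", bus ? "yes" : "no");
}
```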
MOESI PROTOCOL
This is a full cache coherency protocol that encompasses all of the possible
states commonly used in other protocols. In addition to the four common
MESI protocol states, there is a fifth "Owned" state representing data that is
both modified and shared. This avoids the need to write modified data back
to main memory before sharing it. While the data must still be written back
eventually, the write-back may be deferred.
Each cache line is in one of five states:
Modified
A cache line in the modified state holds the most recent, correct copy
of the data. The copy in main memory is stale (incorrect), and no other
processor holds a copy. The cached data may be modified at will. The
cache line may be changed to the Exclusive state by writing the
modifications back to main memory.
Owned
A cache line in the owned state holds the most recent, correct copy of
the data. The owned state is similar to the shared state in that other
processors can hold a copy of the most recent, correct data. Unlike the
shared state, however, the copy in main memory can be stale
(incorrect). Only one processor can hold the data in the owned state—
all other processors must hold the data in the shared state. The cache
line may be changed to the Modified state after invalidating all shared
copies, or changed to the Shared state by writing the modifications
back to main memory.
Exclusive
A cache line in the exclusive state holds the most recent, correct copy
of the data. The copy in main memory is also the most recent, correct
copy of the data. No other processor holds a copy of the data. The
cache line may be changed to the Modified state at any time in order
to modify the data. It may also be discarded (changed to the Invalid
state) at any time.
Shared
A cache line in the shared state holds the most recent, correct copy of
the data. Other processors in the system may hold copies of the data
in the shared state, as well. The copy in main memory is also the most
recent, correct copy of the data, if no other processor holds it in owned
state. The cache line may not be written, but may be changed to the
Exclusive state after invalidating all shared copies. It may also be
discarded (changed to the Invalid state) at any time.
Invalid
A cache line in the invalid state does not hold a valid copy of the data. Valid copies of the data might be either in main memory or in another processor's cache.
For any given pair of caches, the permitted states of a given cache line are
as follows:

        M    O    E    S    I
   M    x    x    x    x    √
   O    x    x    x    √    √
   E    x    x    x    x    √
   S    x    √    x    √    √
   I    √    √    √    √    √

This protocol, a more elaborate version of the simpler MESI protocol, avoids the need to write modifications back to main memory when another processor tries to read the data. Instead, the Owned state allows a processor to supply the modified data directly to the other processor. This is beneficial when the communication latency and bandwidth between two CPUs are significantly better than to main memory. An example would be multi-core CPUs with per-core L2 caches. The sketch below shows who supplies the data on a read miss.
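A minimal sketch of this sourcing decision, assuming (as in the text above) that memory responds unless some cache holds the line in M or O; the names here are illustrative:

```cpp
#include <cstdio>

// On a read miss, which agent supplies the data in MOESI?
// A cache in M or O must supply the line (memory may be stale);
// otherwise memory responds.
enum class State { M, O, E, S, I };

const char *supplier(State peerState) {
    switch (peerState) {
        case State::M:  // will supply and downgrade (the owner keeps serving)
        case State::O:  return "peer cache (cache-to-cache transfer)";
        default:        return "main memory";
    }
}

int main() {
    printf("peer in O: data from %s\n", supplier(State::O));
    printf("peer in S: data from %s\n", supplier(State::S));
}
```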
If a processor wishes to write to an Owned cache line, it must notify the
other processors that are sharing that cache line. Depending on the
implementation it may simply tell them to invalidate their copies (moving its
own copy to the Modified state), or it may tell them to update their copies
with the new contents (leaving its own copy in the Owned state).
Usages
This protocol was used in the SGI 4D machine.
The MESI protocol adds an "Exclusive" state to reduce the traffic caused by
writes of blocks that only exist in one cache. The MOSI protocol adds an
"Owned" state to reduce the traffic caused by write-backs of blocks that are
read by other caches. The MOESI protocol does both of these things.

MOSI protocol
The MOSI protocol is an extension of the basic MSI cache coherency
protocol. It adds the Owned state, which indicates that the current processor
owns this block, and will service requests from other processors for the
block.
For any given pair of caches, the permitted states of a given cache line are as follows:

        M    O    S    I
   M    x    x    x    √
   O    x    x    √    √
   S    x    √    √    √
   I    √    √    √    √
Sequential and Weak Consistency Models
Atomicity and Event Ordering (Ref. Hwang, p. 248)
The problem of memory inconsistency arises when the memory access order differs from the program execution order. For example, a uniprocessor system maps an SISD sequence into a similar execution sequence; thus memory accesses (for instructions and data) are consistent with the program execution order. This property is called sequential consistency.

In shared-memory multiprocessors there are multiple instruction sequences (i.e., MIMD).

Memory Consistency Issues

The behavior of a shared-memory system as observed by the processors is called the memory model. Primitive memory operations for a multiprocessor include load (read), store (write), and swap (atomic load and store).

Event orderings
The order in which shared-memory operations are performed by one process may be observed by other processes. Memory events correspond to shared-memory accesses. Consistency models specify the order in which the events from one process should be observed by other processes in the machine.

The event ordering can be used to declare whether a memory event is legal or illegal when several processes are accessing a common set of memory locations. A program order is the order in which memory accesses occur in the execution of a single process.

Dubois et al. (1986) defined three primitive memory operations for the purpose of specifying memory consistency models:

1. A load by processor Pi is considered performed with respect to processor Pk at a point of time when the issuing of a store to the same location by Pk cannot affect the value returned by the load.
2. A store by Pi is considered performed with respect to Pk at a point of time when an issued load to the same address by Pk returns the value defined by this store.
3. A load is globally performed if it is performed with respect to all processors and if the store that is the source of the returned value has been performed with respect to all processors.

Atomicity
There are two classes of shared-memory multiprocessors:

1. Atomic memory access
2. Non-atomic memory access

A shared-memory access is atomic if the memory updates are known to all processors. Thus a store is atomic if the value stored becomes readable to all processors at the same time.

A system can be non-atomic if an invalidation signal does not reach all processors at the same time. With non-atomic memory, a multiprocessor cannot be strongly ordered.

Sequential Consistency Model

[Figure: processors P1, P2, ..., Pn connected through a shared switch to a single-port system memory.]
In this model, the loads, stores and swaps of all processors appear to
execute serially in a single global memory order.

Lamport’s Definition
Lamport defined sequential consistency as follows: a multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

Dubois, Scheurich and Briggs (1986) have provided the following two sufficient conditions:

1. Before a load is allowed to perform with respect to any other processor, all previous load accesses must be globally performed and all previous store accesses must be performed with respect to all processors.
2. Before a store is allowed to perform with respect to any other processor, all previous load accesses must be globally performed and all previous store accesses must be performed with respect to all processors.

Lamport’s Definition sets the basic spirit of sequential consistency.
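The classic Dekker-style litmus test makes the definition concrete. C++ sequentially consistent atomics implement exactly this model, so the outcome r1 == 0 and r2 == 0 is forbidden (a sketch, not part of the original text):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Dekker-style litmus test. Under sequential consistency, some interleaving
// of the four operations must order at least one store before the other
// thread's load, so (r1, r2) == (0, 0) is impossible.
std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] { x.store(1, std::memory_order_seq_cst);
                        r1 = y.load(std::memory_order_seq_cst); });
    std::thread t2([] { y.store(1, std::memory_order_seq_cst);
                        r2 = x.load(std::memory_order_seq_cst); });
    t1.join(); t2.join();
    // Under SC, (r1, r2) is (0,1), (1,0) or (1,1) -- never (0,0).
    printf("r1=%d r2=%d\n", r1, r2);
}
```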

Implementation Considerations

The figure shows that the shared memory has a single port that can service exactly one operation at a time, and a switch that connects this memory to one of the processors for the duration of each memory operation. The order in which the switch is thrown from one processor to another determines the global order of memory-access operations.

This model implies total ordering of stores/loads at the instruction level.

Strong ordering of all shared-memory accesses in the sequential consistency model preserves the program order in all processors.

A processor cannot issue another access until its most recent shared writable memory access has been globally performed. The switch model is sketched below.
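A minimal sketch of the single-port/switch model, with a mutex standing in for the switch (illustrative only; real hardware serializes at the memory port, not with software locks):

```cpp
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// The mutex plays the role of the switch: exactly one memory operation is
// serviced at a time, and the order in which threads grab it defines the
// global memory order.
std::mutex port;                 // the "switch"
std::vector<int> mem(16, 0);     // the single-ported shared memory

void storeOp(int addr, int v) { std::lock_guard<std::mutex> g(port); mem[addr] = v; }
int  loadOp(int addr)         { std::lock_guard<std::mutex> g(port); return mem[addr]; }

int main() {
    std::thread t1([] { storeOp(0, 1); });
    std::thread t2([] { printf("mem[0] = %d\n", loadOp(0)); });
    t1.join(); t2.join();
}
```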

Drawbacks

When the system becomes very large, this model limits the scalability of a multiprocessor system (poor memory performance).

Weak Consistency Model

The DSB Model (by Dubois, Scheurich and Briggs, 1986)
1. All previous synchronization accesses must be performed, before a
load or a store access is allowed to perform with respect to any other
processor.
2. All previous load and store accesses must be performed before a
synchronization access is allowed to perform with respect to any other
processor.
3. Synchronization accesses are sequentially consistent with respect to
one another.

These conditions provide a weak ordering of memory-access events in a multiprocessor. The dependence conditions on shared variables are weaker in such a system because they are limited to hardware-recognized synchronizing variables.

TSO (Total Store Order) Weak Consistency Model

[Figure: processors P1 ... Pn each place their stores and swaps into a dedicated FIFO store buffer; a switch connects the processors and buffers to a single-port shared memory.]

The TSO model was developed by Sun Microsystems' SPARC architecture group. Sindhu et al. described that the stores and swaps issued by a processor are placed in a dedicated store buffer for that processor, which operates first-in-first-out: the order in the buffer is the same as the order in which the processor issued the stores.

Description

A load by a processor first checks its store buffer to see if it contains a store to the same location. If it does, the load returns the value of the most recent such store; otherwise the load goes directly to memory. Since loads go to memory immediately while stores may still be waiting in the buffer, operations in general do not appear in memory in program order. A processor is logically blocked from issuing further operations until the load returns a value. A swap behaves like both a load and a store: it is placed in the store buffer like a store, and it blocks the processor like a load. In other words, the swap blocks until the store buffer is empty and then proceeds to memory. This mechanism is sketched below.
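A minimal single-processor sketch of the FIFO store buffer with load forwarding; memory and storeBuf are illustrative stand-ins for the shared memory and the hardware buffer:

```cpp
#include <cstdio>
#include <deque>
#include <map>

std::map<int, int> memory;                 // address -> value
std::deque<std::pair<int, int>> storeBuf;  // FIFO of pending (addr, value)

void store(int addr, int value) { storeBuf.push_back({addr, value}); }

int load(int addr) {
    // A load first checks the store buffer for the most recent store
    // to the same address; only if none exists does it go to memory.
    for (auto it = storeBuf.rbegin(); it != storeBuf.rend(); ++it)
        if (it->first == addr) return it->second;
    return memory.count(addr) ? memory[addr] : 0;
}

void drainOneStore() {               // memory retires stores in FIFO order
    memory[storeBuf.front().first] = storeBuf.front().second;
    storeBuf.pop_front();
}

int main() {
    store(0x10, 7);                  // buffered, not yet in memory
    printf("load(0x10) = %d (forwarded from store buffer)\n", load(0x10));
    drainOneStore();
    printf("load(0x10) = %d (now from memory)\n", load(0x10));
}
```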

PSO (Partial Store Order)

PSO relaxes TSO further: stores by a processor to different locations may be performed out of program order unless they are separated by a store barrier.

Relaxed Memory Consistency

It has been introduced for building scalable multiprocessors with distributed shared memory.

Processor Consistency (PC)

Goodman (1989) introduced the Processor Consistency (PC) model, in which writes issued by each individual processor are always in program order. However, the order of writes from two different processors can be out of program order. In other words, consistency of writes is observed within each processor, but the order of reads from each processor is not restricted as long as they do not involve other processors.

The PC model is a relaxation of the SC model.

Two conditions related to other processors are required for ensuring processor consistency:

1. Before a read is allowed to perform with respect to any other processor, all previous read accesses must be performed.
2. Before a write is allowed to perform with respect to any other processor, all previous read or write accesses must be performed.
Release Consistency (RC)

One of the most relaxed models, introduced by Gharachorloo et al. (1990).

It requires that synchronization accesses in the program be identified and classified as either acquires (e.g., locks) or releases (e.g., unlocks).

An acquire is a read operation; a release is a write operation.

The advantage of a relaxed model is the potential for increased performance by hiding as much write latency as possible. The disadvantage is increased hardware complexity and a more complex programming model.

Three conditions ensure release consistency:

1. Before an ordinary read or write access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed.
2. Before a release access is allowed to perform with respect to any other processor, all previous ordinary read and store accesses must be performed.
3. The ordering restrictions imposed by weak consistency are not present in release consistency; instead, RC only requires the synchronization accesses to be processor consistent (PC), not sequentially consistent (SC).
An acquire/release example is sketched below.
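A minimal sketch of acquire/release synchronization using C++ atomics; the spin lock and names are illustrative, though C++ acquire/release semantics closely mirror the RC classification above:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// A release (unlock) makes all ordinary writes before it visible to
// whoever later performs the matching acquire (lock).
std::atomic<bool> locked{false};
int shared_data = 0;                 // ordinary (non-synchronizing) access

void acquire() {                     // acquire: a read-like sync access
    while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
}
void release() {                     // release: a write-like sync access
    locked.store(false, std::memory_order_release);
}

int main() {
    std::thread producer([] { acquire(); shared_data = 42; release(); });
    std::thread consumer([] {
        acquire();
        // Prints 42 if the producer's critical section ran first, else 0;
        // both outcomes are consistent, and the read is never torn.
        printf("shared_data = %d\n", shared_data);
        release();
    });
    producer.join(); consumer.join();
}
```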

Summary of Consistency Models

Strong model:
-Sequential Consistency: the result of any execution appears as some interleaving of the operations of the individual processors when executed on a multithreaded sequential machine.

Relaxed models:
-Processor Consistency: writes issued by each individual processor are never seen out of order, but the order of writes from two different processors can be observed differently.
-Weak Consistency: the programmer enforces consistency using synchronization operators guaranteed to be sequentially consistent.
-Release Consistency: synchronization operators are divided into 1. acquire and 2. release; each type of operator is guaranteed to be processor consistent.
