
Advanced Computer Architecture

Project Report

CSCI 5593
Spring Semester-2017

Simulating and Evaluating Shared Cache Replacement Algorithms for


Multi-Core Processors

Group Members

Alanoud Alsalman
alanoud.alsalman@ucdenver.edu

Arwa Almalki
arwa.almalki@ucdenver.edu

Samaher Alghamdi
samaher.alghamdi@ucdenver.edu

Norah Almaayouf
norah.almaayouf@ucdenver.edu
Table of Contents
I. INTRODUCTION ......................................................................................................................................... 4

II. DESIGN ........................................................................................................................................................ 5

1. Evict-Write Eviction Strategy: .................................................................................................................. 5

a. EW LRU: ............................................................................................................................................... 6

b. EW SRRIP: ............................................................................................................................................ 7

2. MRU-Tour Cache Replacement Algorithm: ........................................................................................... 10

III. IMPLEMENTATION ............................................................................................................................. 13

1. Understanding The Sniper Multi-Core Simulator: .................................................................................. 13

a. Classes: ................................................................................................................................................ 13

b. Methods: .............................................................................................................................................. 14

2. Implementing And Integrating Our New Algorithms With Sniper: ........................................................ 16

3. Separating The New Algorithms' Files:................................................................................................... 18

IV. TESTING ENVIRONMENT .................................................................................................................. 19

V. EXPERIMENT ............................................................................................................................... 22

VI. RESULTS AND ANALYSIS ................................................................................................................. 23

I. First Experiment: ..................................................................................................................................... 24

1- Evict Write (EW) algorithm .................................................................................................................... 24

2- MRU-T algorithm ................................................................................................................................... 30

II. Second Experiment .................................................................................................................................. 33

1- First scenario: .......................................................................................................................................... 34

2- Second scenario:...................................................................................................................................... 35

VII. CONCLUSION ....................................................................................................................................... 36

VIII. FUTURE WORK ....................................................................................................................... 37

Table of Tables
Table 1: Selected workloads list .......................................................................................................................... 21

Table 2: Some of Gainestown and Hydra's cache parameters for simulation ..................................................... 22

Table of Figures
Figure 1: EW LRU pseudo code: evictEW procedure........................................................................................... 6

Figure 2: EW LRU: updateEW procedure ............................................................................................................ 7

Figure 3: EW SRRIP: evictEW procedure. ........................................................................................................... 8

Figure 4: EW flowchart ......................................................................................................................................... 9

Figure 5: MRU-T Pseudocode............................................................................................................................. 11

Figure 6: MRUT flowchart .................................................................................................................................. 12

Figure 7: One-to-many relationship between the CacheBase and CacheSet classes ............................................ 15

Figure 8: CacheSetEWLRU class diagram ....................................................................................................... 16

Figure 9: CacheSetEWSRRIP class diagram .................................................................................................... 17

Figure 10: CacheSetMRUT class diagram ........................................................................................................ 18

Figure 11: Two CPI stacks generated using Sniper simulator. ............................................................................ 20

Figure 12: Baseline average miss rates in shared LLC in Gainestown setup ...................................................... 21

Figure 13: Summary of tested experiments ......................................................................................................... 23

Figure 14: Average miss rate for LRU and EW LRU in shared LLC ................................................................. 25

Figure 15: Baseline average miss rate for LRU in shared LLC .......................................................................... 25

Figure 16: Execution time for LRU and EW LRU in shared LLC ............................................................................ 26

Figure 17: Average miss rate for LRU, SRRIP, and EW SRRIP in shared LLC ................................................ 27

Figure 18: Execution time for LRU, SRRIP, and EW SRRIP in shared LLC..................................................... 27

Figure 19: Average miss rate for LRU and EW LRU in Hydra's shared LLC ................................................... 28

Figure 20: Execution time for LRU and EW LRU on Hydra's shared LLC ....................................................... 29

Figure 21: Average miss rate for SRRIP and EW SRRIP in Hydra's shared LLC ............................................. 29

Figure 22: Execution time for SRRIP and EW SRRIP on Hydra ........................................................................ 30

Figure 23: Average miss rate for MRU and MRU-T in shared LLC .................................................................. 31

Figure 24: Execution time for MRU and MRU-T ............................................................................................... 32

Figure 25: Implemented replacement policies speedup over LRU...................................................................... 33

Figure 26: Average miss rate of LRU and EW LRU with 2 workloads (each is allocated 2 cores) running
concurrently ................................................................................................................................................. 34

Figure 27: Execution Time of LRU and EW LRU with 2 workloads (each is allocated 2 cores) running
concurrently ................................................................................................................................................. 35

Figure 28: Average miss rate of LRU and EW LRU with 4 workloads (each is allocated 1 core) running
concurrently ................................................................................................................................................. 35

Figure 29: Execution Time of LRU and EW LRU with 4 workloads (each is allocated 1 core) running
concurrently. ................................................................................................................................................ 36

I. INTRODUCTION

Programs and applications today depend on the efficiency of the underlying computer
architecture. A great deal of research has been conducted to bridge the gap between CPU and memory
speeds. One direction moves toward parallelized architectures with multi-core processors in the hope of
reducing capacity and conflict misses, and thus the penalty of accessing main memory. These misses
occur in the cache when handling large working sets. To avoid the penalty of accessing main memory,
modern computer architectures are usually built with two or three levels of caches: on a cache miss,
instead of fetching the required block from main memory, the block is first sought in the lower cache
levels. Furthermore, since multi-core architectures rely on running programs concurrently using multiple
threads and cores, a shared level of cache is necessary to synchronize data between the physical cores;
hence shared last-level caches (LLCs) came into production. However, shared LLCs are huge compared
to the higher levels of the cache hierarchy, which makes it difficult to apply a cache replacement
algorithm that exploits temporal and/or spatial locality well enough to sufficiently reduce the number of
misses in this type of cache. The most commonly preferred cache replacement algorithm is the Least
Recently Used (LRU) policy. While LRU is the favored replacement policy in the higher cache levels, it
results in high miss rates within the shared LLC due to the LLC's size, and high miss rates in shared
LLCs lower the performance of parallel applications.

In order to address this problem, we implemented two newly introduced cache replacement
policies on the shared last level cache using the Sniper Multi-Core Simulator[6] as a testing tool to
emulate the cache configurations of the Gainestown and the Istanbul processors. These new
algorithms are the Evict Write (EW) eviction policy[1] and the Most Recently Used Tour (MRU-
Tour) cache replacement algorithm[2]. The EW strategy was implemented on the LRU and the Static
Re-reference Interval Prediction (SRRIP)[7] policies. Our goal is to reduce the execution time and the
shared LLC miss rates over the LRU policy as our baseline algorithm in order to achieve better
performance on several multi-threaded workloads.

II. DESIGN

Any cache replacement policy consists of three components:

Insertion policy: After evicting a victim block (capacity miss), or when filling an empty line
in the cache (compulsory miss), the new block will be inserted in the cache by some chosen
policy.
Eviction decision: When there are no empty lines in the cache, then a victim block would be
chosen by some eviction decision (capacity miss).
Promotion rules: When a referenced block is found in the cache (cache hit), that block's
position in the cache is updated by certain rules.
Based on these facts, we will describe our algorithms using pseudocode and
flowcharts. Then we will briefly explain how we employed the Sniper Multi-Core Simulator
infrastructure for our project implementation.

1. Evict-Write Eviction Strategy:

EW was introduced in 2016 to tackle the misses that the LLC experiences when a block
that has been read gets evicted. A study measured the number of LLC misses caused by evicting
load blocks and found that load misses are three times more frequent than write misses. Therefore,
the EW assumption is that blocks that have been read at least once will continue to be read in the
future, so they are better kept in the cache to receive hits [1]. Moreover, this strategy has light
hardware overhead: only two bits are required, one for read and the other for write operations.
Another feature of this algorithm is portability; it can be implemented within any other replacement
algorithm. For all of the above, our team chose to implement EW to study how it improves
execution time by reducing load misses.

The Evict-Write strategy focuses on the eviction decision only, prioritizing read blocks
over written blocks [1]. That means a block that was only written is the most likely victim for
eviction; if no written-only block is found, then a block that was both read and written is the next
eviction candidate; otherwise, the strategy falls back to the LRU eviction decision for the read blocks.
Once the candidate has been chosen, it is evicted, replaced by the new block, and the new block is
moved to the MRU position.

Evict-Write is integrated with two well-known algorithms, LRU and SRRIP. Thus, in the
case of EW-LRU, the eviction decision does not simply choose any block that was previously written
only; it must also be the least recently used written block. For SRRIP, there are two
options, which will be discussed later.

a. EW LRU:
The LRU insertion policy inserts the block in cache in the MRU position. As for the eviction
decision, it evicts the block located in the LRU position. The LRU promotion rule updates the block
position after a hit by moving it to the MRU position. In order to convert the old LRU algorithm into
the new EW LRU, all we need to do is change the eviction decision from evicting a block at the Least
Recently Used position without any regard to the type of memory access operation, to the Evict-Write
decision that takes the access type into account. In addition, we need to update the block access
types every time there is an insertion, eviction, or promotion.

We will show the Evict-Write LRU algorithm through the pseudocode in figure (1).

Figure 1: EW LRU pseudo code: evictEW procedure.

The algorithm uses three variables: one to hold the LRU written (W) block index, another for
the LRU read-and-written (RW) block index, and a last one for the LRU read (R) block index. The
algorithm loops through all the blocks in the cache set to pick the least recently used block of each of
these types and store its index in the corresponding variable. After the iteration, control moves to the
if statements at the bottom of the pseudocode. These statements first check whether a written-only
block index was saved in the W variable; if so, the W block index is returned. If not, the next condition
checks whether a read-and-written block was found; if yes, the RW block index is returned, and
otherwise the R block index is returned.
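
The following is a minimal, self-contained C++ sketch of the evictEW decision described above. The
struct and variable names (AccessFlags, lru_bits) are simplified stand-ins for the report's
m_stored_EW_type and m_lru_bits structures, not the simulator's actual code.

    // Sketch of the evictEW procedure (figure 1): pick the LRU candidate of each
    // access type, then prefer W, then RW, then fall back to the LRU read block.
    #include <cstdint>
    #include <vector>

    struct AccessFlags {
        bool was_read = false;     // block was referenced by a load at least once
        bool was_written = false;  // block was referenced by a store at least once
    };

    // lru_bits[i] == 0 means block i is in the MRU position; larger values are older.
    int evictEW(const std::vector<AccessFlags>& flags, const std::vector<uint8_t>& lru_bits)
    {
        const int assoc = static_cast<int>(flags.size());
        int w_index = -1, rw_index = -1, r_index = -1;   // LRU candidate of each type

        for (int i = 0; i < assoc; ++i) {
            // Keep the least recently used block of each access type.
            if (flags[i].was_written && !flags[i].was_read) {
                if (w_index < 0 || lru_bits[i] > lru_bits[w_index]) w_index = i;
            } else if (flags[i].was_written && flags[i].was_read) {
                if (rw_index < 0 || lru_bits[i] > lru_bits[rw_index]) rw_index = i;
            } else {
                if (r_index < 0 || lru_bits[i] > lru_bits[r_index]) r_index = i;
            }
        }

        if (w_index >= 0)  return w_index;   // prefer written-only blocks
        if (rw_index >= 0) return rw_index;  // then read-and-written blocks
        return r_index;                      // otherwise fall back to plain LRU
    }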

The update procedure of the EW strategy is shown in the pseudocode in figure (2).

Figure 2: EW LRU: updateEW procedure

This procedure has only one purpose: to update the access flags of the incoming block as follows. If
the incoming access is a read, the block is marked as having been previously read; otherwise, it is
marked as having been written.
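
The corresponding update step can be sketched in the same style. The AccessFlags struct repeats the
simplified per-block flags from the previous sketch, and the MemOp enum is an assumption standing in
for Sniper's mem_op_type value, not the simulator's real type.

    // Sketch of the updateEW procedure (figure 2): record the access type of the block.
    struct AccessFlags { bool was_read = false; bool was_written = false; };
    enum class MemOp { READ, WRITE };

    void updateEW(AccessFlags& block, MemOp op)
    {
        if (op == MemOp::READ)
            block.was_read = true;      // the incoming access is a load
        else
            block.was_written = true;   // the incoming access is a store
    }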

b. EW SRRIP:

The SRRIP algorithm uses M-bit Re-Reference Prediction Values (RRPV), giving 2^M
possible intermediate re-reference interval predictions. The SRRIP eviction decision finds the block
that has the maximum RRPV, then decrements its RRPV by one, and the insertion policy places the
new block in the position of the evicted block. For the promotion rule, two
options can be adopted. One is Hit Priority (HP), which sets the RRPV bits to zero when a block gets
a cache hit. The other is Frequency Priority (FP), which is the one used in this implementation;
FP decrements the RRPV bits when the block receives a hit. The EW integration affects the
algorithm's eviction decision only.

Figure (3) below shows the implementation of EW-SRRIP, which uses the same variables as
EW-LRU:

Figure 3: EW SRRIP: evictEW procedure.

The EW-SRRIP eviction decision finds the block that has the maximum RRPV among the blocks used
only by write operations. If there are no write-only blocks, it chooses the block that was both read and
written. When no block with both operations is found, the strategy falls back to SRRIP eviction over
the read blocks, evicting the block with the maximum RRPV. Once the victim block is found, its RRPV
is decremented by one.
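
A hedged sketch of this victim selection, in the same simplified style as the EW-LRU sketch, is shown
below. The rrpv field stands in for the report's m_rrip_bits, the flag fields for m_block_op, and the
decrement of the victim's RRPV after eviction is left to the caller; none of this is Sniper's actual code.

    // Sketch of the EW-SRRIP evictEW decision (figure 3): same W > RW > R priority
    // as EW-LRU, but the candidate of each type is the block with the maximum RRPV.
    #include <cstdint>
    #include <vector>

    struct EWSRRIPBlock {
        bool was_read = false;
        bool was_written = false;
        uint8_t rrpv = 0;          // M-bit re-reference prediction value
    };

    int evictEW_SRRIP(const std::vector<EWSRRIPBlock>& set)
    {
        int w = -1, rw = -1, r = -1;
        for (int i = 0; i < static_cast<int>(set.size()); ++i) {
            if (set[i].was_written && !set[i].was_read) {
                if (w < 0 || set[i].rrpv > set[w].rrpv) w = i;      // written-only candidate
            } else if (set[i].was_written && set[i].was_read) {
                if (rw < 0 || set[i].rrpv > set[rw].rrpv) rw = i;   // read-and-written candidate
            } else {
                if (r < 0 || set[i].rrpv > set[r].rrpv) r = i;      // read-only candidate
            }
        }
        if (w >= 0)  return w;   // written-only block with the largest RRPV
        if (rw >= 0) return rw;  // read-and-written block with the largest RRPV
        return r;                // plain SRRIP choice among the read blocks
    }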

The flowchart in figure (4) illustrates the design of the EW-LRU cache replacement algorithm
in the Sniper Simulator. It can also be read as the design of EW-SRRIP: instead of using the LRU bits
to move the inserted block index to the Most Recently Used position, EW-SRRIP uses the SRRIP bits,
which reorder the indexes based on the Re-Reference Prediction Value.

Figure 4: EW flowchart

2. MRU-Tour Cache Replacement Algorithm:

Studying a second algorithm was another objective of our team. MRU-Tour was introduced in
2011 [2]. The algorithm's assumption is that most blocks are not referenced again once they leave
the MRU position, and that the chance of referencing a block does not depend on its location in
the LRU stack. This algorithm has low overhead, requiring only one bit of storage per cache block to
record a hit. The study conducted in [2] showed that, on average, misses
per instruction can be reduced by 15% over LRU. For these reasons, our team
implemented the baseline MRUT algorithm.

The MRU-Tour algorithm's basic idea is to track the number of times a block occupies
the MRU position while it is stored in the cache [2]. When the block is fetched, its first tour begins.
If the block is never referenced again, it is a candidate for eviction. Otherwise, when the block is
referenced, a second tour begins, meaning the block has run multiple MRU-Tours.

To track the number of MRU-Tours, an MRUT-bit is used. Our implementation utilizes
m_mru_bits to store the hits and m_lru_bits to move the block to the MRU
position. The MRUT-bit is either 1, when the block has received a hit, or 0, when the block was only
fetched and has never been referenced again during its lifetime in the cache. To summarize the algorithm,
the MRUT eviction decision evicts a block that was only fetched, i.e., one whose MRUT-bit is 0.
The insertion policy moves the line to the MRU position and sets the MRUT-bit to 0. For the
promotion rule, the algorithm updates the block by moving it to the MRU position and setting the
MRUT-bit to 1.
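
A small illustrative sketch of these three rules is given below. The vector of hit bits stands in for
m_mru_bits; in the actual implementation the block is also moved to the MRU position through
m_lru_bits, which is omitted here, and the fallback when every block has been re-referenced is a
simplification rather than Sniper's code.

    // Sketch of the MRU-Tour rules: evict a block whose bit is 0 (never re-referenced),
    // clear the bit on insertion, and set it on a hit (promotion).
    #include <vector>

    struct MRUTSet {
        std::vector<bool> mrut_bit;  // 1 = block was hit after insertion, 0 = fetched only

        // Eviction: prefer a block that never completed more than one MRU tour.
        int getReplacementIndex() const {
            for (int i = 0; i < static_cast<int>(mrut_bit.size()); ++i)
                if (!mrut_bit[i]) return i;
            return 0;  // all blocks were re-referenced; fall back to another policy (e.g. LRU)
        }

        // Insertion: the new block starts its first tour with the bit cleared.
        void insertAt(int index) { mrut_bit[index] = false; }

        // Promotion (cache hit): the block begins another tour.
        void promote(int index) { mrut_bit[index] = true; }
    };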

Figure (5) below, shows the implementation of MRUT:

Figure 5: MRU-T Pseudocode

The flowchart in figure (6) illustrates the design of the MRU-T cache replacement algorithm
in the Sniper Simulator. It shows how the getReplacementIndex() method chooses a block based on
whether its m_mru_bits value is 0 or 1. In addition, the chart illustrates that in the case of a
cache hit, updateReplacementIndex() updates m_mru_bits to 1 and moves the block to the MRU
position.

Figure 6: MRUT flowchart

III. IMPLEMENTATION

In this section, we will give a detailed understanding of our implementation approach. We will
start by explaining the classes, methods, and files of the Sniper simulator that we have used and
modified for our project. Then we will describe the implementation and integration of the three
proposed cache replacement policies with the Sniper Simulator classes. Finally, the approach for
separating the new algorithms' files from the original algorithms' files will be clarified.

1. Understanding The Sniper Multi-Core Simulator:

In the Sniper Multi-Core Simulator, the code files are well defined and organized, but they are
poorly documented. Understanding the system's structure and implementing the two newly
introduced replacement algorithms, Evict Write (EW) and MRU-Tour, was quite challenging. Most
of the code for implementing a replacement algorithm is fairly localized and can be found under the
cache folder. Each replacement algorithm is implemented in a class whose name corresponds to the
algorithm name, with associated header and source files (for example, CacheSetMRUT is the MRU-
Tour class name, and cache_set_MRUT.h and cache_set_MRUT.c are its header and source
files, respectively).

The implementation of these three replacement algorithms requires a deep understanding of
how the Sniper Multi-Core Simulator classes are related, for example through dependency and inheritance.
The Sniper classes involved and modified in our implementation are the following: CacheCntlr,
Cache, CacheBase, CacheSet, CacheSetEWLRU, CacheSetEWSRRIP, and
CacheSetMRUT, which are briefly described next:

a. Classes:

CacheCntlr is the main class for managing cache-related information. In this class, the
private caches and the shared cache are implemented. To obtain the important information
about the type of operation, and to determine whether the block gets a cache hit, a sequence
of methods in this class is invoked as follows:
The method CacheCntlr::processMemOpFromCore() is the first to be called; it
checks whether the first private cache (L1) has the block required by the CPU. To get this
information along with the permission to access or insert,
CacheCntlr::processMemOpFromCore() calls
CacheCntlr::operationPermissibleinCache(), which collects the block information from
CacheCntlr::getCacheBlockInfo(). With these calls, the address of the block is passed
along.
In the CacheCntlr::getCacheBlockInfo() method, a method from class Cache is
invoked, Cache::peekSingleLine(), which splits the address passed through these
methods so that the set index can be resolved, as shown in the flowchart, and then checks the
block's availability in the cache using a method from class CacheSet, CacheSet::find().
In the CacheSet class, the cache is divided into sets, and the number of blocks per set is
determined by the associativity; therefore, CacheSet::find()
checks m_cache_block_info_array[index], an array that represents
one set, for a matching tag to see whether the block is valid.

When all the necessary information has been collected, control returns to
CacheCntlr::operationPermissibleinCache(). The method then checks whether the
returned block information allows the operation; the mem_op_type variable tells us the operation
type, which is either a read or a write.
Two CacheSet methods, CacheSet::readable() and CacheSet::writable(), are
used; if the appropriate one returns true, the block is available for this operation, which is a hit
in L1.

However, when a cache miss happens in the L1 cache,
CacheCntlr::processShmemReqFromPrevCache() is invoked to determine at which
level a cache hit occurs instead. The same methods are invoked again, starting with
CacheCntlr::operationPermissibleinCache(). If it is an L2 miss, the L3 cache
blocks are checked; if the block also misses in L3, it is fetched from main memory using
CacheCntlr::accessDRAM().

b. Methods:
For this project, we traced and edited only the methods related to shared caches when a
block gets either a hit or a miss.
On a hit in the LLC, the return call from
CacheCntlr::processShmemReqFromPrevCache() to
CacheCntlr::processMemOpFromCore() returns the type of the operation at a
given cache level, which in our case is the shared LLC.
To access this block and perform the operation, CacheCntlr::accessCache() is called
with the following parameters: the cache level, the address, a Boolean update_replacement
value indicating that an update for the block in the LLC is required, and finally
mem_op_type, the operation type that will be performed on the block.
At the same time, write-through and write-back are handled
using the CacheCntlr::writeCacheBlock() and
CacheCntlr::updateCacheBlock() methods.
In the Cache class, an array of CacheSet instances is created, and within each set an array
m_cache_block_info_array of CacheBlockInfo objects is initialized with one entry per way
(m_associativity entries), allowing the set to use the class methods, as shown in figure 7.

Figure 7: One-to-many relationship between the CacheBase and CacheSet classes

Therefore, if the operation is Cache::LOAD, the Cache::accessSingleLine() method
calls CacheSet::read_line(); if the operation is Cache::STORE,
Cache::accessSingleLine() calls CacheSet::write_line().
In the case of a miss in the LLC, or an update for the previous level, the sequence of method calls is
the same, but insert methods are used instead of access methods. Thus, the
CacheCntlr::insertCacheBlock() method is invoked, which in turn calls the
Cache::insertSingleLine() method.
Cache::insertSingleLine() passes all the information to
CacheSet::insert(). Inside CacheSet::insert(), the evicted block chosen by the
replacement algorithm is obtained with the help of
CacheSet::getReplacementIndex(), which is overridden by all the replacement policy classes.
The replacement policy itself is selected by the CacheSet::createCacheSet() method.
2. Implementing And Integrating Our New Algorithms With Sniper:

The actual implementation of the three replacement policies starts when execution reaches the
CacheSet::read_line() and CacheSet::write_line() methods. In
both methods, we added the m_coming_EW_type variable and the m_block_op array, which capture
whether the operation is a READ or a WRITE. To know at which cache level the methods were
called, we checked whether the cache_type variable in the CacheSet constructor equals
CacheBase::SHARED_CACHE. However, we later eliminated the need for this conditional
statement when we separated our new replacement files, as discussed later.

A. If the policy is EW-LRU, the utilized class is CacheSetEWLRU:

Figure 8: CacheSetEWLRU class diagram

a. On a cache hit, the block is accessed and the
CacheSetEWLRU::updateReplacementIndex() method is invoked
from either CacheSet::read_line() or CacheSet::write_line().
b. The CacheSetEWLRU::updateReplacementIndex() method applies the
promotion rule of the EW-LRU policy by calling two methods. The first
is the CacheSetEWLRU::moveToMRU() method, which simply reorders the
m_lru_bits values after giving the accessed index the value 0, indicating that it is
now in the MRU position (a short sketch of this reordering is shown after this list). The
second is the CacheSetEWLRU::updateEW()
method mentioned in the design section, which updates the access type in the
m_stored_EW_type array of each block by setting the was_read variable
to True if the access type was a READ, or setting was_written to True if it
was a WRITE.

c. If CacheSet::insert() is invoked, a compulsory or capacity miss has
happened, and a block needs to be evicted. The
CacheSetEWLRU::getReplacementIndex() method returns the victim block
index.

d. The CacheSetEWLRU::getReplacementIndex() method first checks whether an
empty cache line is available. If so, that index is returned to
CacheSet::insert() after updating the m_lru_bits using the moveToMRU()
method and the was_read and was_written variables using the updateEW()
method. Otherwise, a victim must be selected and evicted.

e. Victim selection happens by calling the CacheSetEWLRU::evictEW()
method. This method, as described in the design section, performs the EW eviction
policy and returns the selected victim block index. The returned index
is then used for the update process via the moveToMRU() and updateEW() methods.
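
The reordering performed by moveToMRU() in step (b) can be sketched as follows. Here m_lru_bits
is passed in as a plain vector, whereas in the real class it is a member, so this is only an illustration
of the update, not Sniper's code.

    // Sketch of the moveToMRU() reordering: blocks more recent than the accessed one
    // age by one, and the accessed block receives LRU value 0 (the MRU position).
    #include <cstdint>
    #include <vector>

    void moveToMRU(std::vector<uint8_t>& m_lru_bits, uint32_t accessed)
    {
        for (uint32_t i = 0; i < m_lru_bits.size(); ++i)
            if (m_lru_bits[i] < m_lru_bits[accessed])
                ++m_lru_bits[i];          // blocks younger than the hit one age by one
        m_lru_bits[accessed] = 0;         // the accessed block is now most recently used
    }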

B. If the policy is EW-SRRIP, the utilized class is CacheSetEWSRRIP:

Figure 9: CacheSetEWSRRIP class diagram

a. On block access, CacheSetEWSRRIP::updateReplacementIndex() is
invoked from either CacheSet::read_line() or CacheSet::write_line().
This method applies the promotion rule for SRRIP, as mentioned above, utilizing the
m_rrip_bits array to decrement the block's RRPV bits.
b. If CacheSet::insert() is invoked, it is a miss and a block needs to be
evicted. CacheSetEWSRRIP::getReplacementIndex() returns the victim
block. Using the m_rrip_bits and m_block_op arrays, the method finds the block
that has the maximum RRPV bits among the write-only, read-and-written, or read blocks, in
that order of priority.

C. If the policy is MRUT, the utilized class is CacheSetMRUT:

Figure 10: CacheSetMRUT class diagram

a. For this policy, it is important to check whether the Boolean variable update_replacement
in CacheSet::read_line() or CacheSet::write_line() is true, which again
means it is a hit for a block that is already in the cache; in that case
CacheSetMRUT::updateReplacementIndex() is invoked, which sets the
block's bit in the m_mru_bits array to 1.
b. If CacheSet::insert() is invoked, it is a miss and a block needs to be evicted.
CacheSetMRUT::getReplacementIndex() returns the block chosen for
eviction, using the m_mru_bits array to check whether each block's bit value is 1 or 0. When
the block is chosen, its m_mru_bits entry is set to 0.

3. Separating The New Algorithms' Files:

In order to integrate the new algorithms' classes and files, some of the previously mentioned classes
were altered, namely CacheSet and CacheBase.

a. In class CacheBase, the three policies were named "EW_LRU", "EW_SRRIP", and
"MRUT", respectively, and added to the enum type ReplacementPolicy.

b. In the class Cache, the method CacheSet::createCacheSet() is called, which divides
the cache into sets depending on the associativity. Each index in the m_sets array represents a
CacheSet object, and each set in the array calls CacheSet::createCacheSet() to construct
the chosen replacement policy. CacheSet::parsePolicyType() is used to parse the name
of the cache replacement policy coming from either the terminal options or the configuration
file; therefore, the new policies' names were also added to the parsePolicyType() method.

c. The CacheSet::createCacheSet() method itself was also altered. Three switch cases were added
that correspond to the three replacement algorithms' names in the CacheBase class, with some
initialization values essential for creating the new cache set; a minimal sketch of this factory
extension is shown below.
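
The sketch below illustrates the kind of extension described in points (a)-(c). The enum values mirror the
names added to CacheBase, but the configuration strings accepted by parsePolicyType and the
simplified class hierarchy are assumptions for illustration only, not Sniper's actual declarations.

    // Self-contained sketch: three new enum values, their parsing, and the matching
    // factory switch cases. In Sniper these roles belong to CacheBase and CacheSet.
    #include <memory>
    #include <stdexcept>
    #include <string>

    enum class ReplacementPolicy { LRU, SRRIP, EW_LRU, EW_SRRIP, MRUT };

    // Stand-in base class; in Sniper this role is played by CacheSet.
    struct CacheSetSketch { virtual ~CacheSetSketch() = default; };
    struct CacheSetEWLRUSketch   : CacheSetSketch {};
    struct CacheSetEWSRRIPSketch : CacheSetSketch {};
    struct CacheSetMRUTSketch    : CacheSetSketch {};

    // Mirrors CacheSet::parsePolicyType(): map the config string to the enum.
    // The exact strings used in our configuration are an assumption here.
    ReplacementPolicy parsePolicyType(const std::string& name)
    {
        if (name == "ew_lru")   return ReplacementPolicy::EW_LRU;
        if (name == "ew_srrip") return ReplacementPolicy::EW_SRRIP;
        if (name == "mrut")     return ReplacementPolicy::MRUT;
        if (name == "lru")      return ReplacementPolicy::LRU;
        if (name == "srrip")    return ReplacementPolicy::SRRIP;
        throw std::invalid_argument("unknown replacement policy: " + name);
    }

    // Mirrors CacheSet::createCacheSet(): the three added switch cases.
    std::unique_ptr<CacheSetSketch> createCacheSet(ReplacementPolicy policy)
    {
        switch (policy) {
            case ReplacementPolicy::EW_LRU:   return std::make_unique<CacheSetEWLRUSketch>();
            case ReplacementPolicy::EW_SRRIP: return std::make_unique<CacheSetEWSRRIPSketch>();
            case ReplacementPolicy::MRUT:     return std::make_unique<CacheSetMRUTSketch>();
            default: // LRU, SRRIP, ... handled as in the original Sniper code
                return std::make_unique<CacheSetSketch>();
        }
    }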

IV. TESTING ENVIRONMENT

Our implementation was carried out using the Sniper simulator source code. For our testing
phase, we used Sniper along with parallel benchmark suites to test and evaluate our implementation of
replacement policies. As we are focused on enhancing the performance of shared LLC replacement
policies in chip-multiprocessors (CMP), we used the two most common benchmarks used for CMP
studies: SPLASH-2 [3] and PARSEC [4].

The Stanford ParalleL Applications for SHared memory (SPLASH-2) suite was introduced in 1996
as a scalability-oriented update of SPLASH; both target shared-memory processors. SPLASH-2
consists of a mixture of applications and kernels representing a variety of computations in scientific,
engineering, and graphics computing. However, SPLASH-2 is skewed towards High Performance
Computing (HPC) applications, which would bias any general judgment of our
implementation's performance on multi-core systems if we based our evaluation on it alone.

The Princeton Application Repository for Shared-Memory Computers (PARSEC) benchmark suite
was developed and introduced in 2008 by Princeton University and Intel Corporation. Its objective is to
provide a large and diverse collection of applications that is sufficiently representative for scientific
studies of CMPs. PARSEC applications are considered state of the art in their areas; the algorithms
these programs implement are useful, but their computational demands are very high on current
platforms.

Although PARSEC is more diverse than SPLASH-2, the two suites' workloads differ in
important features such as data locality, number of shared cache lines, and working set sizes [5].
Therefore, to make our evaluation appropriate and accurate, it is important to take all these differences
into consideration, as they influence cache performance in CMPs. As a result, our evaluation selected
workloads from both the PARSEC and SPLASH-2 benchmarks.

Since PARSEC has 12 workloads and SPLASH-2 has 11, we needed to choose among them
for our tests and analysis. We chose 11 workloads, 4 from PARSEC and 7 from
SPLASH-2, based on two criteria:

1. We gave priority to the workloads that make intensive use of the shared LLC. To
determine those, we generated CPI (Cycles Per Instruction) stacks for our baseline
configuration (Gainestown) with the LRU replacement policy applied to the shared LLC, using
the Sniper simulator. CPI stacks show how many processor cycles, normalized by the number of
instructions or by execution time, were spent in each component, such as the cache levels,
DRAM, the instruction fetcher, and synchronization [1]. Using the CPI stacks, we were
able to determine which workloads are memory intensive, i.e., spend a large amount of time on
memory instructions that reach the shared LLC. Figure (11) shows the CPI stacks for two
workloads: on the left, the stack for PARSEC-Canneal shows significant time spent in the
shared L3, while on the right, the stack for PARSEC-Fluidanimate does not
show any significant time spent at the shared cache.

Figure 11: Two CPI stacks generated using Sniper simulator.

2. The other criterion used to determine the suitable workloads is the average miss rate. We ran all 22
workloads using the base LRU replacement policy and then chose those with a high
average miss rate. Figure (12) shows the baseline average miss rates in the shared LLC for all
workloads in the PARSEC and SPLASH-2 benchmarks.

Figure 12: Baseline average miss rates in shared LLC in Gainestown setup

These criteria led us to choose from the PARSEC benchmark the workloads Bodytrack,
Streamcluster, Dedup, and Canneal, and from the SPLASH-2 benchmark the workloads FFT, FMM,
Lu.cont, Lu.ncont, Water.nsq, Ocean, and Cholesky. Since each workload
has multiple input data set sizes, we restricted the input size in all experiments to
simsmall for the PARSEC benchmark and small for the SPLASH-2 benchmark. This restriction
was based on [1], and its purpose was to achieve a reasonable trade-off between
accuracy, number of experiments, and simulation speed. Table (1) lists all selected workloads.

Suite      Workload Name   Type     Domain              Input size
PARSEC     Bodytrack       App.     Computer Vision     simsmall
PARSEC     Canneal         Kernel   Engineering         simsmall
PARSEC     Dedup           Kernel   Enterprise Storage  simsmall
PARSEC     Streamcluster   Kernel   Data Mining         simsmall
SPLASH-2   Cholesky        Kernel   HPC                 small
SPLASH-2   FFT             Kernel   Signal Processing   small
SPLASH-2   FMM             App.     HPC                 small
SPLASH-2   Lu.cont         Kernel   HPC                 small
SPLASH-2   Lu.ncont        Kernel   HPC                 small
SPLASH-2   Ocean.cont      App.     HPC                 small
SPLASH-2   Water.nsq       App.     HPC                 small
Table 1: Selected workloads list

V. EXPERIMENT

To test the performance of our algorithms, we used two main metrics: execution time and
average miss rate. The execution time is produced by the Sniper simulator as an output and is measured
in nanoseconds. The average miss rate, based on [1], is calculated as the ratio of the
total number of misses from all cores to the total number of accesses from all cores, as expressed
below. The base hardware setup for all experiments was the default configuration of the Sniper
simulator, which models an Intel Nehalem Xeon processor in the quad-core Gainestown configuration.
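
Written out, the metric is a single ratio over all cores sharing the LLC; the numbers in the example
below are purely hypothetical and only illustrate the arithmetic.

    average miss rate = (sum over cores of shared LLC misses) / (sum over cores of shared LLC accesses)

For instance, if four cores record 10,000, 12,000, 9,000, and 11,000 LLC misses out of 100,000 LLC
accesses each, the average miss rate is 42,000 / 400,000 = 10.5%.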

In order to show the scalability of the EW algorithm, we used another hardware setup. Since we
wanted to test our algorithm on a machine that is known and commonly used, we chose Hydra from the
PDS lab machines. To obtain Hydra's hardware specifications, we used the UC Denver PDS lab web
site [8], which provided the main architecture, and [9] to get more insight into the Istanbul processor used
in Hydra. However, due to some Sniper limitations, we had to change some of the Istanbul processor's
configuration: the L1 data TLB has 48 entries, but Sniper only accepts power-of-two
sizes, so we set it to 32 entries. Another change was the number of cores. Some tested workloads
require a power-of-two number of threads in Sniper, and to keep the testing environment unified for all
workloads, we set the number of cores to 8. Table (2) shows the main cache
parameters of the Gainestown multi-core processor and the Hydra multi-core machine.

Parameter            Gainestown      Hydra (Istanbul)
Num. cores           4               2 x 6
L1-D size            32 KB           2 x 6 x 64 KB
L1-D associativity   8               2
L1-I size            32 KB           2 x 6 x 64 KB
L1-I associativity   4               2
L2 size (per core)   4 x 256 KB      2 x 6 x 512 KB
L2 associativity     8               16
L3 total size        8 MB            6 MB
L3 associativity     16              48
Table 2: Some of Gainestown's and Hydra's cache parameters for simulation

In terms of test types, we considered two types of experiments. The first type runs
a single parallel workload on multiple cores. With the Gainestown configuration, we ran the
single workload on four cores, meaning the work is divided between the four cores; similarly,
with the Hydra configuration, each benchmark uses eight cores to complete its work.

The second type of experiment runs multiple workloads in parallel using the available cores. In
this case, the number of cores is divided equally between the workloads. We ran this test using the
Gainestown configuration only, with two testing scenarios. The first scenario runs
parallel workloads concurrently. The second scenario runs workloads sequentially (each
workload uses one core) but concurrently with each other. The goal of this test is to check the efficiency
of multi-core cache replacement algorithms when the cache is shared between several workloads, each
running independently on one core, and not only shared between the cores that run one workload.
Figure (13) shows a summary of the experiments along with the algorithms and
configurations used.

Figure 13: Summary of tested experiments

VI. RESULTS AND ANALYSIS

Running the simulator generates several output files. The one we benefit from most is the
sim.out file, which contains detailed information about the simulation such as the number of
instructions, the execution time, the cores' idle time, and important metrics for each cache level such
as the number of accesses, the miss rate, and the number of misses.

To simplify the process of collecting the results, given the large number of tests (over 140),
we adopted a fixed directory naming format that holds all the necessary information about a run.
With each simulation run, we set the output directory option to run-
results/simulted_policy/bencmark_name-workload_name-on-simulated_configration-simulted_policy
(e.g., run-results/ewlru/parsec-canneal-on-gainestown-ewlru). For multiple-workload simulations, we
also added the core distribution (i.e., how many cores are assigned to each workload) to the output
folder name and preserved the order in which the workloads were run. In other words, we used the
directory names as a simple reference to the simulated runs.

Then, an IPython notebook was built to walk through each run's output folder, scan the sim.out
file in the collected results directory, and read and record all the cache metrics we need from it. Using
the same notebook, we calculated any additional metrics we wanted to examine, classified our
simulations, and generated our analysis charts. Since we test two different experiments, two
.ipynb files were built, one for each.

Here we present our results and analysis. As the goal of the implementation is to enhance
the performance of the shared LLC and reduce cache misses, the main metrics we evaluate are the
shared LLC miss rate, execution time, and speedup over LRU. We also evaluate the
scalability of the EW algorithm by examining its results on the Hydra machine configuration.

I. First Experiment:

1- Evict Write (EW) algorithm

For the EW LRU algorithm, the average miss rates of LRU compared to EW LRU are shown in
figure (14). The average miss rate was reduced in three benchmarks: about 1.3% in Dedup, 1.7% in
Cholesky, and 2.5% in FFT. This reduction in these three benchmarks can be explained by looking at
figure (15), which shows the baseline average miss rates for the selected workloads: all workloads that
benefited from the EW algorithm are the ones with the highest average miss rates. The remaining
workloads, which did not show a reduction in average miss rate, lead us to the fact that the overall
performance enhancement depends not only on the total number of misses but also on the type of
generated misses, i.e., read versus write misses. Since the EW algorithm reduces the number of read
misses, we do not expect any performance enhancement when the number of write misses is larger
than the number of read misses.

Based on [4], Canneal and Streamcluster show trivial amounts of sharing, which could explain
why we did not get any noticeable improvement in them. Canneal has a very large working set of
56 MB and more (classified by [4] as unbounded), and its need for cache capacity grows as the amount
of data it processes grows; this need for a large data set is caused by the algorithm it runs. In Canneal,
most of the working set is shared among all threads; however, due to its unbounded size, only a tiny
fraction of it can fit in the cache, and the probability of a line being accessed by a different thread
before eviction is small.

Figure 14: Average miss rate for LRU and EW LRU in shared LLC

Figure 15: Baseline average miss rate for LRU in shared LLC

The execution time of LRU compared with EW LRU is shown in figure (16). Clearly, the
execution time of both algorithms is similar, except for the Dedup workload, which had about a 1.3%
reduction in execution time. Since Dedup also had a reduction in average miss rate with the EW
algorithm, the reduction in execution time is reasonable.

Figure 16: Execution time for LRU and EW LRU in shared LLC

For the EW SRRIP algorithm, figure (17) shows a comparison between LRU, SRRIP, and EW
SRRIP. As the figure shows, EW SRRIP reduced the average miss rate only in the FFT
workload, by 12.9%.

In the remaining workloads, when comparing the original SRRIP to LRU, we can see that
LRU actually performs better. This could be due to the nature of SRRIP, which inserts new blocks
with a long re-reference prediction, or to the applied promotion rule FP, which decrements the RRPV
bits on a block hit and does not give a strong indication of whether the block will be reused soon.
Since the miss rate has not improved, the execution time is not expected to improve either. Figure (18)
shows the execution time when applying EW SRRIP.

Figure 17: Average miss rate for LRU, SRRIP, and EW SRRIP in shared LLC

Figure 18: Execution time for LRU, SRRIP, and EW SRRIP in shared LLC

Next, we examine the performance of the EW algorithms on Hydra to test their scalability. The
workloads were run in the same manner as before, but with Hydra's configuration, which uses twice
the number of cores and a smaller shared LLC. Table (2) compares the most important cache
specifications.

The average miss rate in the shared LLC when using EW LRU on Hydra is shown in figure (19).
We note that Bodytrack, FFT, and Ocean show an average reduction in miss rate of 1.3%. This could
be due to the type of sharing these workloads have; for example, in Bodytrack the threads process the
same data (the input data), so it has a substantial amount of sharing, and when the number of cores
increases, the amount of sharing needed increases as well. On the other hand, some of the workloads
show an average degradation of 1.2%. This can be explained as follows: most of the workloads that
experienced a degradation have very large working sets whose growth rate is proportional to the
number of cores. Hydra uses more cores and a smaller shared LLC than Gainestown; as a result, the
working set has grown, and only a small fraction of it can fit in the cache, which reflects badly on
performance. Figure (20) shows the execution time for EW LRU on Hydra, which mirrors the miss
rates.

Figure 19: Average miss rate for LRU and EW LRU in Hydra's shared LLC
Figure 20: Execution time for LRU and EW LRU on Hydra's shared LLC

Figure (21) shows the average miss rate in the shared LLC when using EW SRRIP on Hydra. The
results are very close to those on Gainestown. The execution time can be seen in figure (22).

Figure 21: Average miss rate for SRRIP and EW SRRIP in Hydra's shared LLC
Figure 22: Execution time for SRRIP and EW SRRIP on Hydra

2- MRU-T algorithm

The MRU-T algorithm was tested on the Gainestown configuration; a comparison of the average miss
rates of MRU and MRU-T is shown in figure (23). It is clear that all workloads saw a reduction in
average miss rate using MRU-T except FFT, which increased by only 0.3%. Excluding FFT,
MRU-T reduced the miss rate by an average of 33%.

Figure 23: Average miss rate for MRU and MRU-T in shared LLC

The execution time of MRU compared with MRU-T is shown in figure (24). It is clear that
the reduction in average miss rates led to a reduction in execution time, which was
reduced in almost all benchmarks: Canneal had about a 47% reduction in execution time, and Dedup
had about a 22% reduction.

Figure 24: Execution time for MRU and MRU-T

Since this project's motivation was to evaluate different shared LLC replacement policies, we
examine their performance relative to LRU. For a general evaluation of the three
implemented replacement policies, figure (25) gives their speedup over LRU.

Figure 25: Implemented replacement policies speedup over LRU

We can note that, on average, MRU-T performed best of the three algorithms, with an average
speedup of 1.02 over LRU.

II. Second Experiment

For the last part of the analysis, we examine the performance of the second experiment, in
which we tested two scenarios:

1. Running two workloads concurrently, where each runs on half of the
cores and they share the LLC.
In this test, we randomly selected pairs of workloads and ran them in parallel,
with each workload getting 2 cores out of 4.

2. Running four workloads concurrently, where each runs sequentially
(using one core) and independently of the others, but all share the LLC.
In this test, we randomly selected combinations of four workloads. We assigned
one core to each workload to force it to run sequentially, but the workloads run concurrently
with each other.

The goal of these scenarios is to explore EW LRU performance with different degrees of LLC
sharing. All previous tests focused on sharing the LLC between cores running the same
program. Now, we look at how the performance of EW LRU is influenced by decreasing
both the amount of sharing between cores (of the same program) and the LLC space allocated.

1- First scenario:

In figure (26), we can see that for most of the workload pairs, EW LRU performance is close to
regular LRU. We can also notice that for each pair, the performance reflects the dominant
workload, i.e., the one generating the highest miss rate. For instance, the pairs containing the Canneal
workload show a high degradation. This ties in with our previous results and analysis for Canneal on
both Gainestown and Hydra: in the single-workload test on Gainestown, Canneal kept its
performance, and when Canneal was run on Hydra, with a higher number of assigned cores, it showed
a small enhancement. In this scenario, Canneal is not only sharing the LLC with another
program but is also assigned fewer cores. Figure (27) shows the corresponding execution time.

Figure 26: Average miss rate of LRU and EW LRU with 2 workloads (each is allocated 2 cores) running
concurrently

Figure 27: Execution Time of LRU and EW LRU with 2 workloads (each is allocated 2 cores) running concurrently

2- Second scenario:

In figure (28), we can see that the runs containing Dedup as one of the four workloads show a
reduction in shared LLC miss rate even when the workloads are running sequentially. Most of the
workloads, however, showed degradation when EW LRU was managing cache lines used by a single
core. This behavior was expected, as the EW LRU design targets multi-core sharing. Figure (29) shows
the corresponding execution time.

Figure 28: Average miss rate of LRU and EW LRU with 4 workloads (each is allocated 1 core) running
concurrently

Figure 29: Execution Time of LRU and EW LRU with 4 workloads (each is allocated 1 core) running concurrently.

VII. CONCLUSION

Cache replacement algorithms such as LRU and MRU are currently preferred in multi-core
architectures as the level of sharing in their memory hierarchies increases. However,
there is room for improvement so that they achieve a higher reduction in miss rates. Based on that
fact, our project centered on improving and enhancing the shared last-level cache (LLC),
which is a crucial component of multiprocessor performance. We did so by implementing two newly
introduced replacement algorithms, the Evict Write strategy and the MRU-Tour algorithm. Using the
Sniper Multi-Core Simulator as our testing tool, we emulated the cache specifications of the
Gainestown processor and of the Istanbul AMD processor used in the Hydra cluster,
one of the PDS Lab clusters of the University of Colorado Denver, Department of Computer Science.

After running the algorithms on more than 10 benchmark applications, our results show an
average improvement in shared LLC miss rate of 1.5% for the EW LRU algorithm over LRU and
30% for the MRU-Tour algorithm over MRU, while EW SRRIP showed an average degradation of
41% over SRRIP.

VIII. FUTURE WORK

Since no single replacement algorithm works for all workloads, our future work idea is to
implement the Set Dueling cache replacement mechanism [10]. This idea divides the cache sets into
three parts: two parts are dedicated to the competing replacement algorithms to be chosen dynamically,
one running one of the EW algorithms and the other running MRUT, while the last part, the follower
sets, follows whichever of the first two parts is winning. Set Dueling works efficiently with the
Dynamic Insertion Policy (DIP), as DIP provides low hardware overhead, low
complexity, and high performance. Thus, we would like to test the performance when we implement
DIP with Set Dueling.

In addition, we would like to update our current code to allow tracing the number of read and
write misses in the shared LLC, to generate an accurate measurement of the performance of the EW
algorithms. Moreover, this would help in checking the behavior of the tested workloads in order to
select those that are expected to show enhancement.

ACKNOWLEDGMENT

This work would not have been possible without the great insight and experience of Team
Two, which greatly assisted the research and the project. We, Alanoud Alsalman, Arwa Almalki,
Samaher Alghamdi, and Norah Almaayouf, have had the pleasure of working together during this project.
We would also like to show our gratitude to God for his guidance. We are especially indebted
to Professor Gita Alaghband, University of Colorado Denver, for sharing her pearls of wisdom with
us during the course, which gave the project coherence in both its results and this manuscript. We
would also like to thank Huynh Manh for his support and patience with us.

References
[1] M. Geanta, L. Ghica and N. Tapus, "Leverage Cache Replacement Policy in Multicore
Processors," 2016 IEEE 12th International Conference on Intelligent Computer
Communication and Processing (ICCP), Cluj-Napoca, 2016, pp. 417-424.
doi: 10.1109/ICCP.2016.7737182

[2] A. Valero et al., "MRU-Tour-based Replacement Algorithms for Last-Level Caches," in
Proceedings of the 23rd International Symposium on Computer Architecture and
High Performance Computing, October 2011, pp. 112-119. doi: 10.1109/SBAC-PAD.2011.13

[3] PARSEC Group, "A Memo on Exploration of SPLASH-2 Input Sets", Princeton University,
2011.

[4] C. Bienia, S. Kumar, J. P. Singh and K. Li, "The PARSEC Benchmark Suite: Characterization
and Architectural Implications", Princeton University, Intel Labs, 2008.

[5] C. Bienia, S. Kumar, and Kai Li, "PARSEC vs. SPLASH-2: A quantitative comparison of two
multithreaded benchmark suites on Chip-Multiprocessors," in Workload Characterization,
2008 (IISWC 2008), IEEE International Symposium on, pages 47-56, Sept 2008.

[6] The Sniper Multi-Core Simulator [Online].


Available: http://snipersim.org/w/The_Sniper_Multi-Core_Simulator

[7] A. Jaleel, K. Theobald, S. Steely and J. Emer, "High performance cache replacement using
re-reference interval prediction (RRIP)," ACM SIGARCH Comput. Architecture News, vol. 38,
no. 3, pp. 60, 2010.

[8] University of Colorado at Denver. Parallel Distributed Systems Lab - PDS Lab [Online].
Available: http://pds.ucdenver.edu/webclass/index.html

[9] CPU-World (2017, Feb). AMD Opteron 2427 specifications [Online]. Available:
http://www.cpu-world.com/CPUs/K10/AMD-Six-Core%20Opteron%202427%20-
%20OS2427WJS6DGN%20%28OS2427WJS6DGNWOF%29.html

[10] M. Qureshi, A. Jaleel, Y. Patt, S. Steely Jr. and J. Emer, "Set-Dueling-Controlled Adaptive
Insertion for High-Performance Caching", IEEE Micro, vol. 28, no. 1, pp. 91-98, 2008.

