
A Coordinated Multi-Agent Reinforcement Learning Approach to Multi-Level Cache Co-partitioning

Rahul Jain, Sreenivas Subramoney, Preeti Ranjan Panda

Presented by Preeti Ranjan Panda
Department of Computer Science and Engineering
Indian Institute of Technology Delhi, India

Introduction
• Dynamic Cache Way Partitioning
• Cache Set-Ways assigned to individual Cores
• Dynamic: requirement changes during program execution
• Isolation: Cores do not evict other Cores' data (sketched after this list)
• Simultaneous Multi-Threaded (SMT) Processor
• Parallel execution of multiple threads on the same physical core
• Hyper-Threading on Intel
• Improves Functional Unit utilization
• Private Caches (L1 and L2) are shared among the SMT threads on a core
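
A minimal sketch (not the authors' hardware mechanism; all names are hypothetical) of what way partitioning means: each core owns a subset of the ways in every set, and a core's fills may only evict lines from the ways it owns, which provides the isolation noted above.

    # Hypothetical sketch of way partitioning in one set of a set-associative cache.
    from typing import List, Optional

    class PartitionedSet:
        def __init__(self, num_ways: int, way_owner: List[int]):
            # way_owner[w] = id of the core that owns way w in this set.
            assert len(way_owner) == num_ways
            self.tags: List[Optional[int]] = [None] * num_ways
            self.way_owner = way_owner

        def fill(self, core: int, tag: int) -> None:
            """Insert a line for `core`, evicting only from ways that core owns."""
            owned = [w for w, c in enumerate(self.way_owner) if c == core]
            for w in owned:
                if self.tags[w] is None:       # prefer an empty owned way
                    self.tags[w] = tag
                    return
            self.tags[owned[0]] = tag          # else evict an owned way (a real cache would use LRU)

    # Example: 8 ways, core 0 owns 5 ways and core 1 owns 3; core 1 can only
    # displace data held in its own 3 ways, so it never evicts core 0's data.
    s = PartitionedSet(8, way_owner=[0, 0, 0, 0, 0, 1, 1, 1])
    s.fill(1, tag=0xBEEF)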

DCCP Problem
• Dynamic Cache Co-partitioning
• Simultaneously perform cache partitioning at multiple levels
• Current Work: L2+L3 DCCP
• Optimize for System Throughput (STP)
• First Attempt at this problem

[Figure: App1 with 100 KB WSS and App2 with 1000 KB WSS sharing a 128 KB L2 and a 1024 KB L3; with demand-based allocation, data for both Apps fits in the cache. WSS: Working Set Size]

Application Sensitivity to Cache Levels
[Plot: latency normalized to the base case vs. available cache ways, per application and cache level]
App      L1  L2  L3
bzip2    L   L   H
gobmk    H   H   H
povray   H   H   L
astar    L   L   L
(H: high sensitivity to that cache level, L: low sensitivity)

Cache Co-partitioning Motivation
•Static Cache Partitioning (SCP) based on App Cache Sensitivity
• bzip2 (LLH), povray(HHL), gobmk(HHH), astar(LLL)
• SCP-L2 (SCP-L3) : SCP only at L2 (L3)
• SCP-L2L3: SCP at both L2 and L3

[Chart: STP of SCP-L2, SCP-L3, and SCP-L2L3, normalized to the no-opt baseline system. STP: System Throughput]

Cache Partitioning
• Cache Partitioning Technique
• Step 1: Cache Allocation
• Based on Utility Computation
• Step 2: Partitioning Enforcement (both steps are sketched after this list)
• Cache Allocation Utility depends on:
• Working Set Size (WSS)
• Reuse Distance of Data
• Application Sensitivity to Latency
• A high-ILP app would be less sensitive
• Application MLP
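
A minimal sketch of the two steps, not the authors' mechanism: the allocation step below uses a simple greedy marginal-utility rule over per-way hit counts (in the spirit of the UCP work cited on the next slide), and enforcement is just handing each core its computed number of ways. All function and variable names are illustrative.

    # Hypothetical sketch: Step 1 (allocation) grants ways greedily by the
    # marginal hits an extra way would give each core; Step 2 (enforcement)
    # would program the resulting per-core way counts into the cache.
    from typing import List

    def allocate_ways(hits_per_way: List[List[int]], total_ways: int) -> List[int]:
        """hits_per_way[c][w] = expected hits for core c if it owns w+1 ways."""
        num_cores = len(hits_per_way)
        alloc = [0] * num_cores
        for _ in range(total_ways):
            def gain(c: int) -> int:
                # Marginal hits from giving core c one more way.
                cur = hits_per_way[c][alloc[c] - 1] if alloc[c] > 0 else 0
                return hits_per_way[c][alloc[c]] - cur
            alloc[max(range(num_cores), key=gain)] += 1
        return alloc

    # Example: core 0 saturates after one way, core 1 keeps benefiting.
    print(allocate_ways([[50, 60, 61, 61], [30, 55, 75, 90]], total_ways=4))   # -> [1, 3]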

Related Work
• [UCP-LA] M. K. Qureshi et al., MICRO-2006
• Used by most state-of-the-art DCP schemes for Cache Allocation (Step 1)
• Utility Monitors (UMON) track cache hits per way
• Look-ahead algorithm for partitioning
• Cache miss utility
• O(W·C²); W: cache ways, C: number of cores sharing the cache
• [UCP-LU] I. Guney et al., ISC-2015
• Reduces partitioning algorithm overhead but still requires ATD profilers
• Requires offline training per application
• [MCFQ] D. Kaseridis et al., IEEE TC-2014
• Utility model based on:
• MLP
• Cache friendliness (WSS and reuse distance)
• ATD and partitioning algorithm overheads
• [MLM] R. Jain et al., DATE-2016
• RL-based DCP of the LLC
• A single RL agent scales poorly

DCCP using State-of-the-art DCP
• Extending Single-level Cache Techniques to Multi-level Caches
• Apply single-level techniques simultaneously at L2 and L3
• No interaction between the L2 and L3 partition controllers
• UCP-based Co-partitioning
• 1 UMON per App per Cache Level
• 4-SMT requires 8 UMON instances
• Requests reaching L3 depend on the L2 allocation
• L2 and L3 allocators do not interact
• L2 (L3) ATDs result in significant power (latency) overheads
• Varying performance sensitivity to cache misses
• Our Proposal: Machine Learned Caches
• Use model-free Reinforcement Learning
• No special hardware profilers
• Learn Cache Utility online

Reinforcement Learning (RL)
• Learn by interaction with the Environment (Architecture)
• Cache Allocation Utility Model NOT REQUIRED
• No Models for WSS, Reuse distance, Latency Sensitivity, MLP, etc. and their
complex interactions
• Markov Decision Process (MDP)
• Framework to implement RL

MDP Agent Model
State: Agent's View of the Architecture
Actions: How to interact with the Architecture (Cache Reconfiguration), expecting a state change
Reward: Measure of the utility of the performed cache reconfiguration; helps identify good actions from a state
Q-Learning Algorithm
• One of the most popular RL algorithms
• Model-Free RL Technique
• Does not require an Environment (Architecture) Model
• Finds a good action-selection policy for an MDP
• Learns an Action-Value Function (Cache Allocation Utility)
• Expected Utility of an Action (Cache Reconfiguration) in a State
• One-Step Q-Learning:

  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

• s_t, a_t: current state and action
• \alpha: Learning rate (importance of new data)
• \gamma: Reward Discount (importance of long-term reward)
• s_{t+1}: new state after performing a_t in s_t
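
As a concrete illustration, a minimal tabular version of this update (generic one-step Q-learning, with illustrative values for alpha and gamma rather than the ones used in this work) could look like:

    # One-step tabular Q-learning update, following the equation above.
    from collections import defaultdict

    ALPHA = 0.1   # learning rate (illustrative value, not from the paper)
    GAMMA = 0.9   # reward discount (illustrative value, not from the paper)

    Q = defaultdict(float)          # Q[(state, action)] -> expected utility

    def q_update(state, action, reward, next_state, actions):
        """Q(s_t,a_t) += alpha * (r_{t+1} + gamma * max_a Q(s_{t+1},a) - Q(s_t,a_t))."""
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

    # Here the reward would be the IPC measured over the last interval.
    q_update(state=2, action=(0, +1), reward=1.7, next_state=3,
             actions=[(0, 0), (0, +1), (+1, 0)])
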
Coordinated Learning

• Uncoordinated Agents
• Agents' actions are independent
• Selfish Agent Actions: what is best for me is best for the system
• Maximum Utility to the Agent
• Can the agents work together?
• Pick actions jointly for better global optimization
• Maximum Utility to the System

Coordinated Joint Action
• The Q-Table represents the Utility of Actions (Cache Reconfigurations)
• A Central Controller searches for the Joint Action with Maximum Utility to the System
• Search only feasible joint actions
• Total allocation and deallocation requests must match
• Hill Climbing Search: low overhead (see the sketch below)
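
A sketch of how such a low-overhead search could look (all names here are hypothetical; the feasibility test only checks that requested way allocations and deallocations cancel out at each level, as stated above):

    # Hypothetical sketch: hill climbing over joint actions.  A joint action is
    # one (L2Request, L3Request) pair per agent; it is feasible only if the
    # requested way changes sum to zero at each cache level.
    import random
    from collections import defaultdict
    from typing import List, Tuple

    Action = Tuple[int, int]                      # (L2Request, L3Request)

    def feasible(joint: List[Action]) -> bool:
        return sum(a[0] for a in joint) == 0 and sum(a[1] for a in joint) == 0

    def joint_utility(joint: List[Action], q_tables, states) -> float:
        # System utility = sum of each agent's Q-value for its own action.
        return sum(q[(s, a)] for q, s, a in zip(q_tables, states, joint))

    def hill_climb(actions: List[Action], q_tables, states, steps: int = 50) -> List[Action]:
        n = len(q_tables)
        best = [(0, 0)] * n                       # "do nothing" is always feasible
        best_u = joint_utility(best, q_tables, states)
        for _ in range(steps):
            cand = list(best)
            i, j = random.sample(range(n), 2)     # perturb two agents so requests can still cancel
            cand[i], cand[j] = random.choice(actions), random.choice(actions)
            if feasible(cand):
                u = joint_utility(cand, q_tables, states)
                if u > best_u:
                    best, best_u = cand, u
        return best

    # Example with two agents whose Q-tables default to zero utility.
    qs = [defaultdict(float), defaultdict(float)]
    qs[0][(1, (+1, 0))] = 0.4                     # agent 0 benefits from one more L2 way
    qs[1][(2, (-1, 0))] = 0.1                     # agent 1 loses little by giving one up
    print(hill_climb([(-1, 0), (0, 0), (+1, 0)], qs, states=[1, 2]))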

Machine Learned Caches (MLC)
• 1 MDP Agent per SMT Core
• Hill Climbing Search for the Best Joint Action
• Agent Action Space
• Resize only one cache level at a time
• Request an increase (+) / decrease (-) in cache ways
• L2Request = [-2, -1, 0, +1, +2]
• L3Request = [-4, -2, -1, 0, +1, +2, +4]
• < L2Request, L3Request > with at most one nonzero request: 11 Actions

MDP Agent Model (sketched in code below)
State: Quantized IPC values
Actions: Resize Cache: < L2Request, L3Request >
Reward: Instructions Per Cycle (IPC)
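
The action space and state above can be written down directly. The sketch below (hypothetical names; the IPC quantization step is an illustrative choice, not from the paper) enumerates the 11 actions, i.e. every single-level resize plus the no-op:

    # Hypothetical sketch of one MLC agent's MDP model.
    L2_REQUESTS = [-2, -1, 0, +1, +2]
    L3_REQUESTS = [-4, -2, -1, 0, +1, +2, +4]

    # Only one cache level is resized at a time, so an action is either
    # <l2, 0>, <0, l3>, or the no-op <0, 0>.
    ACTIONS = sorted({(l2, 0) for l2 in L2_REQUESTS} | {(0, l3) for l3 in L3_REQUESTS})
    assert len(ACTIONS) == 11

    IPC_BUCKET = 0.25                  # illustrative quantization step

    def state_from_ipc(ipc: float) -> int:
        """State = quantized IPC of the SMT core this agent manages."""
        return int(ipc / IPC_BUCKET)

    def reward_from_ipc(ipc: float) -> float:
        """Reward = IPC measured over the last interval."""
        return ipc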

Smart Updates
• Exploit the IPC vs. cache size relation
• IPC cannot decrease with increasing cache size
• Infer Multiple Learnings from a single observation
• Better adaptation to Application Phase Changes
[Figure: grid of actions, L2 resize requests (+2, +1, -1, -2) vs. L3 resize requests (-4, -2, -1, 0, +1, +2, +4). The Q-Update for Action <0, -4> also yields Smart Updates: a lower bound on IPC utility for actions with the same L3 size and for actions with more L3 size.]
Smart Updates (contd.)
[Figure: for the Q-Update of Action <0, 0>, the observed IPC also yields Smart Updates: a lower bound on IPC utility for actions with more cache (L2/L3) and an upper bound on IPC utility for actions with less cache. A code sketch follows.]
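
A sketch of the idea (hypothetical helper names; treating the observed IPC as a soft bound is one reasonable reading of the slides, not necessarily the exact rule used): after the normal Q-update for the taken action, the same observation is pushed as a lower bound into actions that end up with at least as much cache, and as an upper bound into actions that end up with less.

    # Hypothetical sketch of Smart Updates: reuse one observed IPC to bound the
    # utility of related actions, since IPC cannot drop when the cache grows.
    from collections import defaultdict

    def more_cache(a, b):
        """True if action a requests at least as much cache as b at both levels
        (requests are comparable because both start from the current allocation)."""
        return a[0] >= b[0] and a[1] >= b[1]

    def smart_update(Q, state, taken, reward, actions, alpha=0.1):
        for a in actions:
            if a == taken:
                continue
            if more_cache(a, taken) and Q[(state, a)] < reward:
                # Observed IPC is a lower bound for actions with more cache.
                Q[(state, a)] += alpha * (reward - Q[(state, a)])
            elif more_cache(taken, a) and Q[(state, a)] > reward:
                # Observed IPC is an upper bound for actions with less cache.
                Q[(state, a)] += alpha * (reward - Q[(state, a)])

    # Example: the observation for the no-op <0, 0> bounds its neighbours.
    Q = defaultdict(float)
    smart_update(Q, state=3, taken=(0, 0), reward=1.5,
                 actions=[(0, -1), (0, 0), (0, +1)])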

Multi-Agent Coordinated Working
[Flow: repeating control cycle]
Agents get their State
→ Perform Hill Climbing Search to find the Best Joint Action
→ Each Agent executes its action of Cache Reconfiguration
→ System performs the Cache Reconfigurations and executes
→ Agents receive the reward (IPC) for their last Action and learn
→ Agents perform Smart Updates
→ (repeat)


Experimental Setup and Results
Simulator: SniperSim + McPAT
Architecture: Intel x86 Nehalem
Cores: 3.2 GHz, 8-SMT cores (2-SMT per physical core)
L1 Cache: split, private, 3-cycle access, 32 KB L1I + 32 KB L1D
L2 Cache: private, 256 KB, 8-way, 10-cycle access
L3 Cache: 8 MB, 32-way, 30-cycle access, shared by the 8 SMT cores
Uncore: 3.2 GHz (NoC + LLC)
DRAM: 60 ns (~200 cycles) access
Workloads: 15 8-benchmark workloads using SPEC CPU2006

DCCP        STP     EDP
coMCFQ      4.9%    6.8%
coMLM       4.6%    7.14%
coUCP-LU    7.7%    9.4%
MLC         9.35%   13.5%

Conclusion
• Dynamic Cache Co-partitioning (DCCP)
• New problem proposed
• Perform cache allocation across cache levels
• DCCP outperforms DCP
• Extending state-of-the-art DCP techniques not efficient
• Multi-Agent Reinforcement Learning
• No Cache Utility Model required, learns the model online
• No special data profilers required
• Low implementation overhead
• Machine Learned Caches
• Coordinated Multi-Agent RL model
• Smart Updates for faster adaptability to phase changes
• Avg 9.35% (13.5%) STP (EDP) improvement evaluated on 8-SMT system

Thank You
• This work was partially supported by SERB and CII, Govt. of
India under the Prime Minister’s Doctoral Fellowship with Intel
as the industry partner
• References
• [UCP-LA] M. K. Qureshi et al., "Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches," in MICRO, 2006.
• [UCP-LU] I. Guney et al., "A machine learning approach for a scalable, energy-efficient utility-based cache partitioning," in ISC, 2015.
• [MCFQ] D. Kaseridis et al., "Cache friendliness-aware management of shared last-level caches for high performance multicore systems," IEEE TC, 2014.
• [MLM] R. Jain et al., "Machine Learned Machines: Adaptive co-optimization of caches, cores, and on-chip network," in DATE, 2016.

Reinforcement Learning
• Learning takes place as a result of interaction between an
agent and the world
• Adaptiveness of RL can handle the complexity of applying
multiple optimizations simultaneously
• RL is useful for Sequential Decision Making Problems
• Agent interacts with Architecture by taking actions and
reaching next state
• Agent learns good/bad states and optimal actions based on
the received rewards
Markov Decision Process
• Markov Decision Process (MDP)
• Framework to implement RL
• Set of States
• Set of Actions
• Reward Function
• Computational Model
• Agent interacts with the Environment
• Learns Optimal Actions from a State by Trial and Error
• s_0 -> a_0 -> r_1 -> s_1 -> a_1 -> ...

DCCP Results
