
Predictive warp scheduling for efficient execution in GPGPU

Abhinish Anand

MTP Phase-1
Guide: Prof. Virendra Singh

Department of Electrical Engineering
IIT Bombay

October 23, 2019



Outline

Introduction
GPU Architecture
Bottlenecks in GPGPU
Literature review
Observation & Motivation
Proposed approach
Experimental results
Future work



Introduction

Graphics Processing Units (GPUs) are gaining momentum for general-purpose workloads such as scientific applications, signal processing and neural networks.
Programming models such as CUDA and OpenCL have made programming GPGPUs simpler.
Because GPUs use simpler cores, they are easy to design and offer high yield and low cost per core.
The massive parallelism in a GPU also provides an effective way to hide memory latency and thus improves performance.



GPU Architecture

The GPU consists of Streaming Multiprocessors (SMs), high-bandwidth DRAM channels and an on-chip L2 cache.
The number of SMs and cores per SM varies with the price and target market of the GPU.
Examples:
Nvidia Tesla K40 - 15 SMs
Nvidia Tesla P100 - 56 SMs
Nvidia GeForce RTX 2080 Ti - 68 SMs



GPU Architecture

Figure: GPGPU



Streaming Multiprocessor

Each SM features in-order Streaming Processors (SPs).
Each SP has a fully pipelined integer ALU and FPU.
Each SM also has load/store units (LSUs), SFUs, 64 KB of shared memory/L1 cache, a constant cache and a texture cache.
DRAM is off-chip, and both DRAM and the L2 cache are shared among all SMs.

Figure: Streaming Multiprocessor



Software Model

The programmer decides the number of CTAs and the number of threads per CTA in the GPU kernel code.
A CTA (cooperative thread array, i.e., thread block) consists of multiple threads executing the same code.
A CTA is further sub-divided into groups of threads called warps.
Scheduling inside an SM happens at the granularity of warps.
All threads in a warp execute together using a common program counter.

Figure: Software model
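As a concrete illustration, a minimal CUDA sketch of how the programmer fixes these counts at kernel launch; the kernel, data and sizes below are hypothetical, not taken from the evaluated workloads.

    // Hypothetical element-wise kernel: each thread handles one element.
    __global__ void scale(float *a, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
        if (i < n) a[i] *= s;
    }

    // Launch: 256 threads per CTA (8 warps of 32 threads) and enough CTAs
    // to cover n elements. The hardware CTA scheduler then distributes
    // these CTAs across the SMs.
    void launch(float *d_a, float s, int n) {
        int threadsPerCTA = 256;
        int numCTAs = (n + threadsPerCTA - 1) / threadsPerCTA;
        scale<<<numCTAs, threadsPerCTA>>>(d_a, s, n);
    }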



CTA distribution

The GPU compiler estimates the maximum number of concurrent CTAs that can be assigned to an SM using resource-usage information [2].
The CTA scheduler then assigns one CTA at a time to each SM in round-robin fashion until every SM holds its maximum number of concurrent CTAs (a sketch follows the citation below).
Later on, CTA assignment is completely demand driven.

Figure: CTA distribution

[2]. Onur Kayıran, Adwait Jog, Mahmut T. Kandemir and Chita R. Das. ”Neither More Nor Less: Optimizing Thread-level
Parallelism for GPGPUs”. Proceedings of the 22nd International Conference on Parallel Architectures and Compilation
Techniques, 2013
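Below is a simulator-style sketch of this distribution policy under our own assumptions (it is not GPGPU-Sim's actual CTA-scheduler code); max_ctas_per_sm stands in for the compiler's resource estimate.

    #include <queue>
    #include <vector>

    // Hypothetical CTA-scheduler state: a queue of CTAs not yet assigned
    // and a per-SM count of resident CTAs.
    struct CtaScheduler {
        std::queue<int> pending;       // CTA ids waiting to be assigned
        std::vector<int> resident;     // number of resident CTAs per SM
        int max_ctas_per_sm;           // from the compiler's resource estimate

        // Initial distribution: round-robin over the SMs until every SM
        // holds its maximum number of concurrent CTAs or no CTAs remain.
        void initial_distribution() {
            bool assigned = true;
            while (assigned && !pending.empty()) {
                assigned = false;
                for (int sm = 0; sm < (int)resident.size() && !pending.empty(); ++sm) {
                    if (resident[sm] < max_ctas_per_sm) {
                        pending.pop();         // issue the next CTA to this SM
                        ++resident[sm];
                        assigned = true;
                    }
                }
            }
        }

        // Demand-driven phase: when a CTA finishes on SM `sm`, a new CTA
        // is issued to that same SM if any are still pending.
        void on_cta_finish(int sm) {
            --resident[sm];
            if (!pending.empty()) { pending.pop(); ++resident[sm]; }
        }
    };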
Bottlenecks in GPGPU

Limited on-chip memory
  If the per-CTA memory requirement is high, the number of CTAs that can be scheduled simultaneously will be small.
  This leads to lower core utilization.
High control-flow divergence
  GPUs handle branch divergence by serializing the divergent execution paths (see the kernel sketch after this list).
  This reduces SIMD utilization and IPC in general-purpose computing.
Inefficient scheduling mechanisms
  Most of the warps arrive at long-latency memory operations at roughly the same time.
  The SM becomes inactive because no warp may be left that is not stalled on a memory operation.
  This reduces the SM's ability to hide long memory latencies.
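A small hypothetical CUDA kernel that exhibits the divergence described above: within one 32-thread warp, even and odd lanes take different branches, so the hardware executes the two paths one after the other with part of the warp masked off each time.

    // Hypothetical kernel with intra-warp divergence: threads of the same
    // 32-thread warp take different branches, so the two paths are
    // serialized and SIMD lanes sit idle on each path.
    __global__ void divergent(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (threadIdx.x % 2 == 0)
            out[i] = in[i] * 2.0f;   // executed first, odd lanes masked
        else
            out[i] = in[i] + 1.0f;   // executed next, even lanes masked
    }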



CTA distribution issue

In the baseline architecture, the round-robin scheduling policy schedules the maximum number of CTAs per SM, which is not always the optimal choice from a performance perspective.
A high number of threads ⇒ more memory requests ⇒ contention in the caches, network and memory ⇒ long stalls at the core.
Different techniques to counter this issue:
CPU-assisted prefetching (fused architecture)
Two-level warp scheduling
Equalizer
Neither More Nor Less



Literature review
CPU-Assisted Prefetching
After the GPU kernel launch, the CPU runs a pre-execution program to prefetch data into the shared L3 cache.
The pre-execution program contains the memory-access instructions of the GPU kernel for multiple thread blocks and thus increases the cache hit rate.
Two-level scheduling
Problem: all warps arrive at a single long-latency memory operation at the same time, so all warps get stalled and idle FU cycles increase.
Solution: the policy groups all concurrently executing warps into fixed-size fetch groups. The groups, and the warps inside them, have priorities and are scheduled accordingly (see the sketch after the citation below).
Prioritizing fetch groups prevents all warps from stalling together.

[1] Yi Yang, Ping Xiang, Mike Mantor, Huiyang Zhou. "CPU-Assisted GPGPU on Fused CPU-GPU Architectures". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.
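A simplified sketch, under our own assumptions, of the two-level selection idea from [5]: only warps in the active fetch group are considered each cycle, and the scheduler moves on to the next group when no warp in the active group can issue (e.g., all are stalled on long-latency memory operations).

    #include <vector>

    // Hypothetical warp state for a two-level scheduler sketch.
    struct Warp { bool ready; bool stalled_on_mem; };

    // Pick the next warp to issue. Warps are statically split into
    // fixed-size fetch groups; group `active` has priority, and we only
    // fall through to later groups when the active group cannot issue.
    int pick_warp_two_level(const std::vector<Warp> &warps,
                            int group_size, int &active) {
        int num_groups = ((int)warps.size() + group_size - 1) / group_size;
        for (int g = 0; g < num_groups; ++g) {
            int grp = (active + g) % num_groups;
            for (int w = grp * group_size;
                 w < (grp + 1) * group_size && w < (int)warps.size(); ++w) {
                if (warps[w].ready && !warps[w].stalled_on_mem) {
                    active = grp;     // keep issuing from this group
                    return w;         // warp id to issue this cycle
                }
            }
        }
        return -1;                    // nothing ready: SM idles this cycle
    }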
Literature review
Two Level Scheduling

Figure: Baseline warp scheduling vs two level warp scheduling

[5] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, Yale N. Patt. "Improving GPU Performance via Large Warps and Two-level Warp Scheduling". 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011.
Literature review

Equalizer
Problem: as threads wait to access the bottleneck resource, other resources end up being under-utilized, leading to inefficient execution.
Solution: it saves energy by lowering the frequency of under-utilized resources (the memory system or the SMs) with minimal performance loss.
It increases the frequency of highly utilized resources to gain performance and modulates the number of threads for efficient execution (a rough illustration follows the citation below).

[8] Ankit Sethia, Scott Mahlke. "Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution". 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
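A rough, hypothetical decision rule in the spirit of Equalizer (not the paper's actual heuristic): each epoch, the bottleneck resource gets a frequency boost while the under-utilized one is slowed down to save energy.

    // Hypothetical illustration of Equalizer-style tuning (not the exact
    // algorithm from [8]): classify the kernel each epoch and shift
    // frequency between the SM and the memory system accordingly.
    enum class Bound { COMPUTE, MEMORY, BALANCED };

    struct FreqPlan { double sm_freq; double mem_freq; };

    FreqPlan tune(Bound b, double sm_freq, double mem_freq, double step) {
        switch (b) {
        case Bound::MEMORY:   // memory is the bottleneck
            return {sm_freq - step, mem_freq + step};  // slow SM, boost memory
        case Bound::COMPUTE:  // SM is the bottleneck
            return {sm_freq + step, mem_freq - step};  // boost SM, slow memory
        default:
            return {sm_freq, mem_freq};                // balanced: leave as-is
        }
    }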
Literature review
Neither More Nor Less
For memory-intensive applications, the core spends most of its cycles fetching data from memory [2].
The large number of memory requests creates contention in the caches, network and memory, leading to long stalls at the cores.
The best choice is to execute the optimal number of CTAs for each application.
The optimal number of CTAs per SM could be found by checking every possible number of CTAs that can be assigned to an SM for each application.
This requires exhaustive analysis for each application and is thus inapplicable in practice.
Idea: dynamically modulate the number of CTAs on each core using the CTA scheduler.

[2]. Onur Kayıran, Adwait Jog, Mahmut T. Kandemir and Chita R. Das. "Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013
Literature review

Neither More Nor Less


Assign N/2 CTAs to each core instead of N CTAs per core [2].
Distribute the CTAs to the cores in round-robin fashion, then periodically check each SM's stall cycles and idle cycles and modulate its CTA count accordingly (sketched after the citation below).

Figure: Dynamic CTA scheduling mechanism

[2]. Onur Kayıran, Adwait Jog, Mahmut T. Kandemir and Chita R. Das. "Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013
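A rough sketch of this DYNCTA-style modulation (our own simplification of [2], not the authors' code): every epoch, each SM compares its idle and memory-stall cycle counts against thresholds and nudges its active CTA count up or down.

    // Simplified per-SM monitor for dynamic CTA modulation.
    struct SmMonitor {
        unsigned idle_cycles = 0;      // cycles with no warp ready to issue
        unsigned mem_stall_cycles = 0; // cycles stalled waiting on memory
        int active_ctas;               // CTAs currently allowed to run
        int max_ctas;                  // compiler-estimated per-SM maximum
    };

    // Called once per epoch.
    void modulate_ctas(SmMonitor &sm, unsigned t_idle,
                       unsigned t_mem_low, unsigned t_mem_high) {
        if (sm.mem_stall_cycles > t_mem_high) {
            // Heavy memory congestion: pause one CTA to reduce requests.
            if (sm.active_ctas > 1) sm.active_ctas--;
        } else if (sm.idle_cycles > t_idle && sm.mem_stall_cycles < t_mem_low) {
            // Core is starved but memory is not congested: add a CTA.
            if (sm.active_ctas < sm.max_ctas) sm.active_ctas++;
        }
        sm.idle_cycles = 0;            // reset counters for the next epoch
        sm.mem_stall_cycles = 0;
    }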
Observation

Pausing only the most recently assigned CTA when memory stalls grow large cannot always reduce memory congestion.
The paused CTA may already have issued its requests, while the other, unpaused CTAs can keep generating new load requests.
Pausing all the warps of a CTA is also inefficient: some of its warps may be ready to execute, and pausing them reduces TLP.
Moreover, when DRAM eventually serves the requests of the most recently assigned CTA, that CTA cannot execute because it is still in the paused state.



Observation

Figure: Count of load operations issued by each CTA of each SM in the vectorAdd application



Observation

Figure: Number of cycles in which the most recently assigned CTA has already issued its load requests while other, unpaused CTAs are still going to issue load requests in the near future
Motivation-1

Goal: decrease stall cycles and increase the utilization of the SM.
We experimentally found that memory-intensive applications suffer high memory stalls.
These stalls can be reduced by lowering congestion: pause the bad warps and allow the ready warps to execute.



Motivation

Figure: Fraction of total cycles in which all warps are waiting for their data to
come back



Motivation-2

Goal: decrease the total number of misses in the L1D cache.
This can be achieved by exploiting the data locality across warps within a CTA.
It is implemented by introducing a predictor in each SM that keeps track of the hit/miss status of warps from each CTA.



Motivation-2

Figure: Normalized IPC improvement when ideal L1 cache is used



Proposed Approach

Monitor the congestion of each SM using its stall cycles, i.e., the cycles in which the core is stalled because all warps are waiting for their data to come back.
When memory congestion rises above a threshold, the SM enters the paused state:
  Pause the warps that are about to create memory requests.
  Pause only those warps whose requests are predicted to miss in the L1 cache.
  Predict the hit or miss of a warp using the hit/miss status of previous warps at the same PC.
While the SM is in the paused state, keep track of its pending DRAM requests; when this count falls below a threshold, change the state of the SM back to unpaused.
In the unpaused state, all warps of the SM are scheduled without any blocking (a state-machine sketch follows).
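A minimal sketch of the proposed per-SM pause/unpause control; the threshold names STALL_HIGH and PENDING_LOW are illustrative placeholders, not tuned values from the evaluation.

    enum class SmState { UNPAUSED, PAUSED };

    struct SmCongestionCtrl {
        SmState state = SmState::UNPAUSED;
        unsigned stall_cycles = 0;       // cycles with every warp waiting on data
        unsigned pending_dram_reqs = 0;  // outstanding DRAM requests from this SM

        // Called periodically with the two thresholds.
        void update(unsigned STALL_HIGH, unsigned PENDING_LOW) {
            if (state == SmState::UNPAUSED && stall_cycles > STALL_HIGH)
                state = SmState::PAUSED;     // congestion too high: start pausing
            else if (state == SmState::PAUSED && pending_dram_reqs < PENDING_LOW)
                state = SmState::UNPAUSED;   // congestion drained: resume all warps
        }

        // Called by the warp scheduler before issuing a memory instruction.
        // Only predicted-miss warps are blocked, and only while paused.
        bool may_issue_mem(bool predicted_miss) const {
            return state == SmState::UNPAUSED || !predicted_miss;
        }
    };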



Proposed Approach

Figure: Flowchart for the proposed approach



Proposed Approach

Predictor Table

  Field          Width
  PC             8 bits
  CTA id         3 bits
  Miss counter   6 bits
  Access bit     1 bit

Total size = 4608 bits (256 entries of 18 bits each)

In the unpaused state, when any warp executes a memory instruction, a new entry is made in the predictor table with the corresponding CTA id and the low 8 bits of that instruction's PC.
The miss counter is initialized to 100000 (binary, i.e., 32, the mid-point of the 6-bit range); it is incremented by 1 when a warp misses in the L1 cache and decremented by 1 on a hit.
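A C++ sketch of this predictor table; the field widths and counter initialization come from the slide, while the lookup and replacement details are our own assumptions rather than the exact hardware design.

    #include <cstdint>
    #include <vector>

    struct PredictorEntry {
        uint8_t pc_low8  = 0;   // low 8 bits of the memory instruction's PC
        uint8_t cta_id   = 0;   // 3-bit CTA id (stored in a full byte here)
        uint8_t miss_ctr = 32;  // 6-bit counter, initialized to 100000b = 32
        uint8_t access   = 0;   // access bit, set when the row is touched
        bool    valid    = false;
    };

    struct PredictorTable {
        std::vector<PredictorEntry> rows = std::vector<PredictorEntry>(256);

        // Update on an L1D access outcome (called only while the SM is unpaused).
        void update(uint8_t pc_low8, uint8_t cta_id, bool was_miss) {
            PredictorEntry *e = find_or_allocate(pc_low8, cta_id);
            if (!e) return;                              // table full: drop update
            if (was_miss  && e->miss_ctr < 63) ++e->miss_ctr;
            if (!was_miss && e->miss_ctr > 0)  --e->miss_ctr;
            e->access = 1;
        }

        // Predict a miss when the counter is above its mid-point.
        bool predict_miss(uint8_t pc_low8, uint8_t cta_id) const {
            for (const PredictorEntry &e : rows)
                if (e.valid && e.pc_low8 == pc_low8 && e.cta_id == cta_id)
                    return e.miss_ctr > 32;
            return false;                                // unknown PC: assume hit
        }

        PredictorEntry *find_or_allocate(uint8_t pc_low8, uint8_t cta_id) {
            PredictorEntry *free_row = nullptr;
            for (PredictorEntry &e : rows) {
                if (e.valid && e.pc_low8 == pc_low8 && e.cta_id == cta_id) return &e;
                if (!e.valid && !free_row) free_row = &e;
            }
            if (free_row) {
                free_row->pc_low8  = pc_low8;
                free_row->cta_id   = cta_id;
                free_row->miss_ctr = 32;                 // 100000 in binary
                free_row->access   = 0;
                free_row->valid    = true;
            }
            return free_row;
        }
    };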



Proposed Approach

Predictor Table
The access bit of a row is set whenever any warp updates the table for that row's PC and CTA id.
The access bit is reset after every epoch. A row whose access bit is still clear at the end of an epoch was not used by any warp during that epoch, indicating that all warps have already executed that PC.
Such a row can therefore be cleared to make space for newer entries.
When any CTA exits after completing its execution, all the rows belonging to that CTA are cleared.
The predictor table is not updated while the SM is in the paused state.
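Continuing the sketch from the previous slide, the epoch and CTA-exit maintenance might look as follows (again illustrative only, using the PredictorTable type sketched above).

    // At the end of every epoch: rows that were not accessed during the
    // epoch are freed for newer entries; surviving rows are rearmed.
    void end_of_epoch(PredictorTable &t) {
        for (PredictorEntry &e : t.rows) {
            if (!e.valid) continue;
            if (e.access == 0) e.valid = false;   // unused last epoch: reclaim row
            else               e.access = 0;      // keep row, rearm for next epoch
        }
    }

    // When a CTA finishes execution, all rows belonging to it are cleared.
    void on_cta_exit(PredictorTable &t, uint8_t cta_id) {
        for (PredictorEntry &e : t.rows)
            if (e.valid && e.cta_id == cta_id) e.valid = false;
    }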



Experimental results

Simulator: GPGPU-Sim v3.2.2

SM configuration
  No. of SMs      15 clusters, 1 SM per cluster
  SM resources    1.4 GHz, 32 SIMT width, 48 KB shared memory,
                  max. 1536 threads (48 warps/SM, 32 threads/warp),
                  32768 registers/SM
  Scheduler       2 warp schedulers per SM, LRR policy
  L1 data cache   32 sets, 128 B block size, 4-way set associative
  LLC             64 sets, 128 B block size, 6-way set associative

DRAM configuration
  DRAM scheduler  FR-FCFS
  DRAM capacity   6 memory channels/memory controllers (MC),
                  16 banks/MC, 4 KB row size/bank, 32 columns/row



Experimental results

Figure: Normalized IPC of dynamic CTA scheduling w.r.t. two-level scheduling
(dyncta400: epoch = 400, t_idle = 50, t_mem_l = 2800, t_mem_h = 3200)
(dyncta1000: epoch = 1000, t_idle = 50, t_mem_l = 4000, t_mem_h = 5000)



Conclusion

Memory-intensive applications are unable to exploit their high TLP because of increased congestion in memory and NoC bandwidth.
Performance can be increased by reducing this congestion, pausing only those warps that are going to flood the DRAM or NoC bandwidth.
So, by predicting these bad warps with the predictor before they issue and pausing their execution, the bandwidth can be used efficiently to increase the usable TLP and thus performance.



Future work

Simulate the proposed approach with the L1-cache predictor in the GPGPU-Sim simulator and analyze the performance and utilization of the GPU cores for different workloads.
Extend the implementation to scheduling of multiple kernels in GPGPU and analyze fairness for different combinations of workloads.



References

[1] Yi Yang, Ping Xiang, Mike Mantor, Huiyang Zhou. "CPU-Assisted GPGPU on Fused CPU-GPU Architectures". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.

[2] Onur Kayıran, Adwait Jog, Mahmut T. Kandemir, Chita R. Das. "Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.

[3] Gunjae Koo, Hyeran Jeon, Zhenhong Liu, Nam Sung Kim, Murali Annavaram. "CTA-Aware Prefetching and Scheduling for GPU". IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.

[4] Wilson W. L. Fung, Ivan Sham, George Yuan, Tor M. Aamodt. "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow". 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007.

[5] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, Yale N. Patt. "Improving GPU Performance via Large Warps and Two-level Warp Scheduling". 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011.

[6] Farzad Khorasani, Hodjat Asghari Esfeden, Amin Farmahini-Farahani, Nuwan Jayasena, Vivek Sarkar. "RegMutex: Inter-Warp GPU Register Time-Sharing". ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018.

[7] Gunjae Koo, Hyeran Jeon, Zhenhong Liu, Nam Sung Kim, Murali Annavaram. "CTA-Aware Prefetching and Scheduling for GPU". IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.



References

[8] Ankit Sethia, Scott Mahlke. "Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution". 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.

[9] Jacob T. Adriaens, Katherine Compton, Nam Sung Kim, Michael J. Schulte. "The Case for GPGPU Spatial Multitasking". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.

[10] Adwait Jog, Onur Kayiran, Tuba Kesten, Ashutosh Pattnaik, Evgeny Bolotin, Niladrish Chatterjee, Stephen W. Keckler, Mahmut T. Kandemir, Chita R. Das. "Anatomy of GPU Memory System for Multi-Application Execution". Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS), 2015.

[11] Adwait Jog, Evgeny Bolotin, Zvika Guz, Stephen W. Keckler, Mahmut T. Kandemir, Mike Parker, Chita R. Das. "Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications". Proceedings of the Workshop on General Purpose Processing Using GPUs (GPGPU), 2014.

[12] Zhen Lin, Hongwen Dai, Michael Mantor, Huiyang Zhou. "Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution". ACM Transactions on Architecture and Code Optimization (TACO), 2019.

[13] Lingyuan Wang, Miaoqing Huang, Tarek El-Ghazawi. "Exploiting Concurrent Kernel Execution on Graphic Processing Units". International Conference on High Performance Computing & Simulation (HPCS), 2011.



The End

Thank You

