Static WCET Analysis of GPUs With Predictable - 2017
Abstract—The capability of GPUs to accelerate general-purpose applications that can be parallelized into a massive number of threads makes it promising to apply GPUs to real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, the worst-case execution time (WCET) analysis methods and techniques designed for CPUs cannot be used directly to estimate the WCET of GPUs. In this work, based on the analysis of the architecture and dynamic behavior of GPUs, we propose a WCET timing model and analyzer based on a predictable GPU warp scheduling policy to enable WCET estimation on GPUs.

I. INTRODUCTION

The massive number of processing cores on chip and the Single-Instruction Multiple-Thread (SIMT) execution model allow GPUs to execute thousands of threads simultaneously. Therefore, GPUs have become the ideal accelerators for applications that are compute- and/or data-intensive but can be parallelized into a large number of threads with very little dependency among each other. With the increasing need for computing power in all kinds of devices, the applications in embedded systems have become more compute- and data-intensive as well. As a result, more and more GPUs and GPU platforms for embedded applications have come to the market, e.g., the NVIDIA Tegra[1] and the DRIVE PX[2].

GPUs can also benefit real-time applications, such as human pose recognition[3] and traffic sign classification[4], where high throughput and/or computation power are needed. However, to exploit the potential of GPUs in real-time applications, the predictability issue of the GPU architecture must be addressed first. To achieve high average-case performance and throughput, modern GPUs maintain a massive number of active threads at the same time and use the large number of on-chip cores to schedule and execute these threads. The scheduling of this massive number of active threads is a dynamic behavior, which is very hard to analyze statically. In addition, the memory accesses from the SMs go to the memory partitions through interconnection networks, and the dynamic behavior of the cores in competing for the memory resources is also hard to predict statically.

Therefore, before applying GPUs to real-time applications, the time predictability of the GPU architecture needs to be improved and made analyzable. In this work, we propose to employ a predictable greedy then round-robin scheduling policy, based on which we build a timing model for GPGPU programs. With the proposed timing model, we build a static analyzer that can analyze the assembly code of GPGPU programs and give Worst-Case Execution Time (WCET) estimations. The evaluation results show that the proposed timing model and static analyzer can provide safe and fairly tight WCET estimations for GPGPU applications.

The rest of the paper is organized as follows. Section II introduces the background about GPU architecture and the programming model for GPGPU applications. Section III talks about the GPU architectural simulator GPGPU-Sim[9] used in this work. The proposed WCET analyzer is discussed in Section IV, following which the evaluation methodology and experimental results are given in Section V. The related works are reviewed in Section VI, and the conclusion and future work are in Section VII.

II. GPU ARCHITECTURE AND PROGRAMMING MODEL

A. GPU Architecture

Fig. 1 shows the basic architecture of a Nvidia GPU, which has a certain number of Streaming Multiprocessors (SMs), e.g., 16 SMs in the Fermi architecture[5]. All the SMs share the L2 cache, through which they access the DRAM global memory. Other parts, like the interface to host CPUs, are not included in Fig. 1.
Fig. 2: SM Architecture[5]
…together if it cannot decide whether or not there would be a branch divergence.

E. Resource Limitation

Due to the limitation of different resources in an SM, the maximal number of threads that can be active at the same time on an SM is limited. The limitations include the total number of registers, the total available shared memory, and the maximal numbers of concurrent kernel blocks, warps, and threads. All these limitations constrain the number of active warps.

IV. GPU WCET ANALYZER

A. Greedy Then Round-Robin Scheduler Timing Model

We propose the greedy then round-robin (GTRR) warp scheduling policy so that a timing model can be built for the execution of the warps in a GPU kernel. Based on the dependencies between instructions, the PTX[11] code of a GPU kernel can be divided into segments, each of which has one or more instructions and is called a Code Segment in this work. The dependencies between these code segments lead to the fact that the instructions in one code segment cannot be issued until the instructions in the previous code segments have finished their executions and written back the results.

LI_{ij} = SCodeSeg_{ij} + \sum_{n=0}^{K-1} LIinst_n        (7)

LE_{ij} = MAX(LEinst_0, ..., LEinst_{K-1})

where K is the number of instructions in the code segment.

Fig. 5 shows the scheduling of N warps with the greedy then round-robin scheduling policy. T_{ij} represents the time point when the GPU can start to issue code segment j of warp i. LI_{ij} is the latency of issuing code segment j of warp i, while LE_{ij} represents the latency of executing the same code segment. After initializing the starting issuing time point of each warp by Equation 1, the rest of the time points in the scheduling can be calculated using Equation 2, which basically means that the time point when one code segment in a warp can start to issue depends on the maximum between the latency of executing the previous code segment in the same warp and the latency of issuing the segments in the other warps before the scheduler gets back to this warp. Based on this model, the estimated WCET is the time point when all the warps finish their execution, as shown in Equation 3.
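The recurrence described above can be sketched in executable form. Equations 1-3 are not reproduced in this excerpt, so the initialization and the exact issue-order bookkeeping below are assumptions of this sketch, and `estimate_wcet` is a hypothetical name; `LI[i][j]` and `LE[i][j]` stand for the issuing and execution latencies of code segment j of warp i.

```python
def estimate_wcet(LI, LE):
    """Sketch of the GTRR timing model. LI, LE: N x S latency tables for
    N warps with S code segments each; returns the estimated WCET."""
    N, S = len(LI), len(LI[0])
    T = [[0] * S for _ in range(N)]
    # Assumed initialization (Equation 1 analog): warp i's first segment can
    # issue once segment 0 of the warps scheduled before it has been issued.
    for i in range(N):
        T[i][0] = sum(LI[k][0] for k in range(i))
    for j in range(1, S):
        for i in range(N):
            # Equation 2 analog: segment j of warp i starts to issue after the
            # maximum of (a) the execution of its previous segment in the same
            # warp and (b) the issuing of the other warps' segments before the
            # scheduler gets back to this warp (index arithmetic assumed).
            dep_same_warp = T[i][j - 1] + LE[i][j - 1]
            dep_other_warps = (T[i][j - 1]
                               + sum(LI[k][j] for k in range(i))
                               + sum(LI[k][j - 1] for k in range(i + 1, N)))
            T[i][j] = max(dep_same_warp, dep_other_warps)
    # Equation 3 analog: the estimated WCET is the time point when every warp
    # has issued and executed its last code segment.
    return max(T[i][S - 1] + LI[i][S - 1] + LE[i][S - 1] for i in range(N))
```

For a single warp, the model reduces to the chain of issue and execution latencies along that warp; with several warps, the round-robin issuing of the other warps can dominate the execution dependency.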
b) Number of Coalesced Memory Accesses: In a global memory instruction, different threads in the warp can access different memory addresses, which are coalesced together so that addresses belonging to the same 128-Byte memory space are merged. Therefore, there can be as many as 32 memory requests with different addresses from one warp-level memory instruction. Since these memory requests need to be sent out by the LD/ST unit one by one at each clock cycle, the number of coalesced memory requests affects not only the issuing latency but also the execution latency of the instruction, as shown in Equations 5 and 6.
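The merging rule described above can be sketched directly: per-thread addresses are grouped by their 128-byte space, and the number of distinct spaces is the number of memory requests. The function name here is illustrative, not taken from the paper's analyzer.

```python
def num_coalesced_accesses(addresses, line_size=128):
    """Number of memory requests one warp-level load/store generates, given
    the byte address accessed by each active thread of the warp."""
    return len({addr // line_size for addr in addresses})

# 32 threads reading consecutive 4-byte words share one 128-byte space:
assert num_coalesced_accesses([tid * 4 for tid in range(32)]) == 1
# A 128-byte stride puts every thread in its own space (worst case, 32 requests):
assert num_coalesced_accesses([tid * 128 for tid in range(32)]) == 32
```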
c) Number of Competing SMs: Different SMs may compete to access the same memory partition in the memory system. In the simulated architecture, the requests from different SMs are served in a round-robin order. Therefore, if there are M SMs trying to access the same memory partition, the interval for two consecutive requests from the same SM to be served is M-1 cycles. This latency can happen at every coalesced memory request, as shown in Equations 5 and 6.
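A rough sketch of how this contention latency scales, assuming, as an illustration only (Equations 5 and 6 are not reproduced in this excerpt), that every coalesced request of a warp instruction can wait for the competing SMs' requests:

```python
def memory_contention_cycles(n_coalesced, n_competing_sms):
    """Worst-case extra cycles for one warp-level memory instruction when each
    of its coalesced requests waits for the other competing SMs' requests to
    be served in round-robin order (hypothetical helper, not from the paper).
    """
    return n_coalesced * n_competing_sms

# Worst case on a 15-SM GPU: 32 coalesced requests, 14 other SMs competing,
# so up to 32 * 14 = 448 extra cycles for a single warp instruction.
```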
Equation 5 calculates the possible stall latency of issuing…
a) Number of Active Warps: The pipelines can act as buffers for different types of instructions. In other words, as long as the pipeline is not full, there will be no extra stalls in issuing. For arithmetic instructions, the configuration of the number of operand collectors and the length of the initiation buffer in the function units decides how many instructions the pipeline can hold before a stall happens, while the configuration of the initiation latency determines how long the stall is. The kernel analyzer checks whether the number of active warps is larger than the capacity of the pipeline and adds the stall latencies to the code segment issuing period according to the instruction types, as shown in Equations 4 and 5.

The static kernel analyzer parses the PTX code of a GPU kernel to get the estimated values of the metrics in the equations in Section IV-B, as well as the scheduling order of each warp, which is used to generate the code segments in the timing model. The analyzer also needs the kernel inputs and the hierarchy configuration of the kernel as inputs for the analysis. Fig. 6 shows the components in the analyzer.

1) Warp Scheduling Order: Algorithm 1 shows how the scheduling order of a warp is generated. The analyzer starts with the first instruction of the first basic block and parses each instruction in the current basic block. The register values are updated with the arithmetic instructions.
If the last instruction of a basic block is a branch instruction and there is branch divergence, then the analyzer finds the immediate post-dominator[10] basic block and pushes it, together with the not-taken and then the taken basic blocks, to the reconvergence stack. If there is no branch divergence, then either the taken or the not-taken basic block is pushed to the stack. If this last instruction of the current basic block is not a branch instruction, the analyzer pops the top of the reconvergence stack as the new current basic block. The analyzer appends every new current basic block to the warp scheduling order and returns this trace when it reaches the end of the kernel.

Fig. 6: GPU Kernel Analyzer

Algorithm 1 Warp Execution Trace Generation
1: procedure WarpExeTraceAna(Inputs, CFG, Block, Warp)
2:   WarpExecutionTrace = []
3:   ReconvergenceStack = []
4:   NumCoalAccessList = []
5:   AddrCoalAccessList = []
6:   CurrentBB = FirstBB(CFG)
7:   WarpExecutionTrace.append(CurrentBB)
8:   INST = FirstInstruction(CurrentBB)
9:   while INST is not Exit do
10:    if INST is arithmetic instruction then
11:      UpdateRegisterValue(INST, Inputs, Block, Warp)
12:    end if
13:    if INST is global load/store then
14:      CoalList = CoalescedAddrListGen(INST, Warp)
15:      AddrCoalAccessList.append(CoalList)
16:      N = SizeOf(CoalList)
17:      NumCoalAccessList.append(N)
18:    end if
19:    if INST is last of CurrentBB then
20:      if INST is branch then
21:        if Has Divergence then
22:          IPD = FindImmediatePostdominator(CFG, CurrentBB)
23:          ReconvergenceStack.push(IPD)
24:          ReconvergenceStack.push(NotTakenBB)
25:          ReconvergenceStack.push(TakenBB)
26:        else
27:          if Taken then
28:            ReconvergenceStack.push(TakenBB)
29:          else
30:            ReconvergenceStack.push(NotTakenBB)
31:          end if
32:        end if
33:      end if
34:      CurrentBB = ReconvergenceStack.pop()
35:      WarpExecutionTrace.append(CurrentBB)
36:      INST = FirstInstruction(CurrentBB)
37:    else
38:      INST = NextInstCurBB()
39:    end if
40:  end while
41:  Return WarpExecutionTrace, NumCoalAccessList, AddrCoalAccessList
42: end procedure

2) Number of Coalesced Memory Accesses: The analyzer also collects the information of the memory addresses used by each global memory instruction. All the memory addresses used by the threads in a warp are coalesced together using Algorithm 2. The list of coalesced memory addresses is appended to the result list AddrCoalAccessList, which contains the lists of coalesced memory addresses of each global memory instruction in the warp. Then the analyzer gets the number (N) of coalesced memory addresses for this instruction and appends it to the result list NumCoalAccessList of this warp. The analyzer returns the warp execution trace, the list of numbers that represent the numbers of coalesced memory accesses, and the list of address lists of each global memory instruction in this warp. The same process is done for every warp, and all the results are collected together to calculate the maximal and average numbers of coalesced accesses of the GPU kernel.

Algorithm 2 Coalesced Addresses Generation
1: procedure CoalescedAddrListGen(I, W)
2:   CoalAddrList = []
3:   for Each Thread Ti ∈ W do
4:     if CheckActive(Ti) then
5:       CurAddr = GetAddr(Ti, I)
6:       Coalesced = False
7:       for Each Address Aj ∈ CoalAddrList do
8:         if Coalesce(CurAddr, Aj) then
9:           Coalesced = True
10:          Break
11:        end if
12:      end for
13:      if Not Coalesced then
14:        CoalAddrList.append(CurAddr)
15:      end if
16:    end if
17:  end for
18:  Return CoalAddrList
19: end procedure

3) Number of Competing SMs: Algorithm 3 shows how the analyzer estimates the possible number of competing SMs that may access the same memory partition at the same time. Based on the memory addresses each warp instruction uses, the analyzer builds a vector for every global memory instruction in an SM. This vector represents the distribution of the memory addresses among the memory partitions from a certain instruction on a certain SM. For instance, if there are 3 memory partitions and, from one instruction I on SM s, there are 5 memory addresses used, among which 2 addresses go to partition 0 and 3 addresses go to partition 2, then the distribution vector is [2,0,3]. As shown in the algorithm, there is one such vector for every global memory instruction in every SM, i.e., MemPtnAccVector is a 2D array of such vectors. Two metrics are calculated using this vector.

The first metric represents the unevenness of the distribution. The Distance2Center function calculates the Euclidean distance between the vector of the address distribution and the vector that represents an even distribution (called center in the algorithm). This distance indicates how uneven the distribution to the different partitions is. The larger the distance, the more uneven the distribution, and thus the more possibly SMs compete for the same partition.

Another metric is the Euclidean distance between the distribution vector of one instruction on one SM and the distribution vector of the same instruction on other SMs, named D2OtherSM in the algorithm. The smaller the value of D2OtherSM, the more similar the address distributions from two SMs (s and s') are.
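The two metrics can be sketched as plain Euclidean distances over the per-partition distribution vectors. The function names mirror those in Algorithm 3, but the code itself is only an illustration:

```python
import math

def distance_to_center(vec):
    """Distance2Center: distance from a distribution vector to the perfectly
    even distribution over the memory partitions."""
    center = sum(vec) / len(vec)
    return math.sqrt(sum((v - center) ** 2 for v in vec))

def distance_to_vector(u, v):
    """Distance2Vector: distance between the distribution vectors of the same
    instruction on two SMs (D2OtherSM in Algorithm 3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# The example vector [2, 0, 3] from the text is moderately uneven, while a
# fully concentrated pattern like [5, 0, 0] is the most uneven one.
assert distance_to_center([2, 2, 2]) == 0.0
assert distance_to_center([5, 0, 0]) > distance_to_center([2, 0, 3])
```

Identical distribution vectors on two SMs give D2OtherSM = 0, the case with the highest chance that the two SMs compete for the same partitions.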
If the distance is 0, the two SMs have the same distribution, and the number of possibly competing SMs is increased by 1, as shown on line 9 of Algorithm 3, where MaxDistance means the maximal distance of two vectors whose distributions both focus on single but different partitions, e.g., [5, 0, 0] and [0, 0, 4]. This is a constant value according to the total number of SMs; if there are M SMs, MaxDistance is (M − 1)√2.

Then the number of possible competing SMs (CompetingSM) and the distance to the center (D2Center) are compared to heuristic thresholds, i.e., T_CompetingSM and T_D2Center, to decide whether the number of possible competing SMs of the current instruction counts toward the final result (line 11). After all the instructions are analyzed for all the SMs, an average value of the number of competing SMs is returned and used in the calculations in Equations 5 and 6. The maximal value of the number of competing SMs is the number of active SMs minus one. The reason that heuristic thresholds are used is that the behaviors of different SMs are basically independent of each other and, therefore, their interactions are very hard to predict statically. So, we use these heuristic threshold values to estimate the average degree of competition among SMs. The heuristic values used in this work are 13 for T_CompetingSM and 0.5 for T_D2Center, for the architecture configuration with 15 SMs and 12 memory partitions. It should be noted that we do not claim the WCET estimation with the average degree of competing SMs to be a safe upper bound, while the WCET estimation with the maximal number of possible competing SMs can be considered a safe upper bound.

Algorithm 3 Average Number of Competing SMs
1: NumCompetingSM = []
2: for Each I in all load/store instructions do
3:   for Each s in all SMs do
4:     D2Center = Distance2Center(MemPtnAccVector[I][s])
5:     CompetingSM = 0
6:     for Each s' in all the rest SMs do
7:       D2OtherSM = Distance2Vector(MemPtnAccVector[I][s],
8:                                   MemPtnAccVector[I][s'])
9:       CompetingSM += (MaxDistance - D2OtherSM)/MaxDistance
10:    end for
11:    if CompetingSM > T_CompetingSM or D2Center > T_D2Center then
12:      NumCompetingSM.append(CompetingSM)
13:    else
14:      NumCompetingSM.append(0)
15:    end if
16:  end for
17: end for
18: Return average(NumCompetingSM)

TABLE I: GPGPU-Sim Configuration

Number of SMs: 15
Number of Memory Partitions: 12
Number of 32-bit registers per SM: 32768
Size of shared memory per SM: 48KB
L1 data cache: None
L2 cache: None
Max Number of Active Kernel Blocks: 8
Max Number of Active Warps: 48
Number of SP Operand Collectors: 6
Number of SFU Operand Collectors: 8
Number of load/store Operand Collectors: 2
Warp Scheduling Policy: GTRR

TABLE II: GPGPU-Sim Function Unit Latency Configuration

Initiation   ADD  MAX  MUL  MAD  DIV
Integer      1    2    2    1    8
Float        1    2    1    1    4
Double       8    16   8    8    130

Execution    ADD  MAX  MUL  MAD  DIV
Integer      4    13   4    5    145
Float        4    13   4    5    39
Double       8    19   8    8    330

V. EVALUATION METHODOLOGY AND EXPERIMENTAL RESULTS

A. Evaluation Methodology

We used the GPGPU-Sim[9] simulator in this work as the analysis target of the GPU architecture. We also implemented the greedy then round-robin warp scheduling policy in the simulator. The general configuration of the simulator is shown in Table I, and the latency configuration of the function units is in Table II, where the numbers represent the number of cycles.

In the evaluation, 5 benchmarks, including Gaussian Elimination (gsn), Needleman-Wunsch (nw), CFD Solver (cfd), LU Decomposition (lud), and Speckle Reducing Anisotropic Diffusion (srad), are chosen from the Rodinia benchmark suite[12]. There are in total 9 GPU kernels in these benchmarks, 2 from each, except lud. Table III shows the configuration of the grid and block sizes of each kernel, as well as the numbers of active SMs and warps in execution. As shown in Table III, two configurations of kernel sizes are used in the srad benchmark, where the block sizes are the same but the grid sizes are different, leading to different numbers of active warps on an SM. These benchmarks are selected based on the criterion that the loop bounds are known statically.

TABLE III: GPU Benchmark Kernels

Benchmark    Grid Size  Block Size  # Act SMs  # Act Warps
gsn k1       1x1x1      512x1x1     1          16
gsn k2       32x32x1    8x8x1       15         16
nw k1        32x1x1     16x1x1      15         2
nw k2        31x1x1     16x1x1      15         2
cfd k1       127x1x1    192x1x1     15         48
cfd k2       127x1x1    192x1x1     15         48
lud k1       15x15x1    16x16x1     15         48
srad128 k1   8x8x1      16x16x1     15         40
srad128 k2   8x8x1      16x16x1     15         40
srad512 k1   32x32x1    16x16x1     15         48
srad512 k2   32x32x1    16x16x1     15         48

B. Experimental Results

Fig. 7 shows the normalized estimated WCET of the simulated GPU architecture with and without the perfect memory configuration. The estimated WCET results with the perfect memory configuration are normalized to the measured simulation performance results with the same configuration. Perfect memory means every memory request takes just one cycle after it has arrived at the LD/ST unit and does not go to the memory partitions through the interconnection network. The normalized estimated WCET results with normal memory in Fig. 7 are the estimated WCET results when the simulator and the WCET analyzer use a normal memory system model.
TABLE IV: Estimated Average and Maximal Number of Coalesced Accesses and Competing SMs

Benchmark               gsn k1  gsn k2  nw k1  nw k2  cfd k1  cfd k2  lud k1  srad128 k1  srad128 k2  srad512 k1  srad512 k2
Avg. Coalesced Access   22      7       3      2      1       1       2       2           2           2           2
Max. Coalesced Access   32      8       16     16     1       1       2       2           2           2           2
Avg. Competing SMs      0       10      7      5      7       7       10      9           9           13          13
Max. Competing SMs      0       14      14     14     14      14      14      14          14          14          14

These estimated WCET results are normalized to the measured simulation performance results with the normal memory system configuration. The results show that, generally, with a perfect memory model the estimator gives tighter estimations than with a normal memory model. This is because, when no interference from other SMs needs to be considered, the predictability within an SM is better than in the case when the interconnection network and the interferences from other SMs need to be considered. It should be noted that the average values of the number of coalesced memory accesses and the average number of competing SMs are used in getting the estimated results in Fig. 7, and the estimated results are normalized to the measured performance with and without perfect memory respectively. Therefore, the overestimation in the estimated results with normal memory can be smaller than the overestimation in the estimated results with perfect memory, e.g., in benchmarks gsn k1 and cfd k2.

When the difference between the average and the maximal values is small, or when they are the same, the increase in the overestimation is small. But, when the difference grows, the overestimation increases. For example, in gsn k2 and nw, both the number of coalesced accesses and the number of competing SMs differ in their average and maximal values and, as a result, the overestimation is huge when the maximal value is used. For the two kernel hierarchy configurations in the srad benchmark, srad128 has a smaller estimated average number of competing SMs than srad512, since there are fewer active warps per SM in srad128. Therefore, when the maximal values are used, the overestimation in srad512 is less than in srad128, since the estimated average value is closer to the maximal one. For hard real-time applications, the maximal estimated values of these two metrics should be used, while for soft real-time applications, the average values can be used.

Fig. 7: Normalized Estimated WCET

Fig. 8: Normalized Estimated WCET (series: Both Average, Max Number of Coalesced Memory Accesses, Max Number of Competing SMs, Both Max)
Fig. 9: Normalized Average-Case Performance Results (series: Loose Round-Robin, Greedy Then Round-Robin)

VI. RELATED WORK

…real-time scheduling algorithms[15][16]. Although these studies try to employ GPUs in real-time applications, they assume the WCETs of the real-time applications and tasks are known to the scheduling algorithms, which emphasizes the importance of having reliable WCET estimations for GPU kernels.

Studies on the performance analysis of GPU architectures and GPGPU applications[17][18] focus on building performance models. These studies mainly concentrate on models of average-case performance and/or using the models to identify performance bottlenecks, while the performance model in this work focuses on WCET estimation.

There are also studies on GPU warp scheduling policies[19][20] that improve the efficiency in utilizing the computational resources and access the memory in a more friendly way, so that performance is improved. However, the scheduling policy proposed in this work focuses on improving the predictability of the GPU architecture. The memory access reordering method proposed in [21] regulates the order of memory accesses to the GPU L1 data cache to improve the time predictability of the GPU L1 data cache, while the proposed scheduling policy and analyzer in this work focus on the timing model of the whole GPU system.

The studies on GPU WCET analysis[22][23] use measurement-based methods, while the WCET analysis method proposed in this work is based on static timing analysis and can give safe WCET estimations for GPU kernels.

VII. CONCLUSION AND FUTURE WORK

The parallel computing capability of GPUs can potentially benefit real-time applications with better performance and energy efficiency. However, the time predictability issues of the GPU architecture and GPGPU applications must be addressed. In this work, we analyze the GPU architecture of a detailed GPU simulator, based on which we propose a predictable warp scheduling policy that enables us to build a worst-case timing model for GPU kernels. The experimental results show that our WCET analyzer can effectively provide WCET estimations for both soft and hard real-time application purposes.

The proposed timing model and the WCET analysis method developed for the GPU can be further enhanced in our future work. We plan to incorporate static timing models for cache memories and shared memory in the future to estimate the GPU execution time more accurately. Also, we will explore other time-predictable warp scheduling methods to reduce the impact on the average-case performance.

ACKNOWLEDGMENT

This work was funded in part by the NSF grant CNS-1421577.

REFERENCES

[1] Nvidia. Nvidia Tegra Mobile Processors. http://www.nvidia.com/object/tegra.html.
[2] Nvidia. Nvidia DRIVE PX 2. http://www.nvidia.com/object/drive-px.html.
[3] J. Shotton, et al. Real-time human pose recognition in parts from single depth images. Commun. ACM 56, 1 (January 2013), 116-124.
[4] D. Ciresan, et al. Multi-column deep neural network for traffic sign classification. Neural Networks, Volume 32, August 2012, Pages 333-338.
[5] NVIDIA. Next Generation CUDA Compute Architecture: Fermi. www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[6] Nvidia CUDA. CUDA Toolkit Documentation v7.0.
[7] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. In Computing in Science and Engineering, 2010.
[8] M. Schoeberl. Time-Predictable Computer Architecture. EURASIP Journal on Embedded Systems, 2009.
[9] A. Bakhoda, et al. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS, Apr 2009.
[10] W. W. L. Fung, et al. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM Transactions on Architecture and Code Optimization, Volume 6, Issue 2, June 2009.
[11] Nvidia CUDA. Parallel Thread Execution ISA Version 4.2.
[12] S. Che, et al. Rodinia: A benchmark suite for heterogeneous computing. In Proc. of the IEEE Int. Symp. on Workload Characterization, 2009.
[13] A. Eklund, et al. Medical image processing on the GPU - Past, present and future. Medical Image Analysis, Volume 17, Issue 8, December 2013, Pages 1073-1094.
[14] D. Merrill, et al. Scalable GPU graph traversal. In Proc. of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012.
[15] G. Elliott and J. Anderson. Globally scheduled real-time multiprocessor systems with GPUs. In Real-Time Systems, vol. 48, 2012.
[16] G. A. Elliott, et al. GPUSync: Architecture-Aware Management of GPUs for Predictable Multi-GPU Real-Time Systems. In Proc. of RTSS, 2013.
[17] J. Sim, et al. A performance analysis framework for identifying potential benefits in GPGPU applications. In Proc. of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012.
[18] Z. Cui, et al. An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization. In Proc. of the Parallel and Distributed Processing Symposium (IPDPS), 2012.
[19] A. Jog, et al. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In Proc. of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, 2013.
[20] V. Narasiman, et al. Improving GPU performance via large warps and two-level warp scheduling. In Proc. of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011.
[21] Y. Huangfu, et al. Warp-Based Load/Store Reordering to Improve GPU Data Cache Time Predictability and Performance. In Proc. of the 19th International Symposium on Real-Time Distributed Computing, 2016.
[22] A. Betts and A. Donaldson. Estimating the WCET of GPU-Accelerated Applications Using Hybrid Analysis. In Proc. of the 25th Euromicro Conference on Real-Time Systems, 2013.
[23] K. Berezovskyi, et al. WCET Measurement-based and Extreme Value Theory Characterisation of CUDA Kernels. In Proc. of RTNS, 2014.