Analysis of GPU


California State University, Fresno

Department of Electrical and Computer Engineering


Course: ECE 274 High performance Computer Architecture
Instructor: Dr. Nan Wang

Report on: Analysis of GPU Power and Performance Models

Main Reference Papers:


1 Power and Performance Characterization and Modeling of GPU-Accelerated Systems.
2 An Adaptive GPU Performance and Power Model.

Prepared by,
1.Abhishek Gubbi Basavaraj
2.Hemanth Kumar
Semester: Spring 2017
Date: 02/28/2016


Table of Contents
I. INTRODUCTION
II. CUDA PROGRAMMING MODEL AND NVIDIA FERMI GPU ARCHITECTURE
III. THE PROPOSED TWO MODELS
IV. COMPARISON BETWEEN THE ADAPTIVE MODEL AND THE STATISTICAL MODEL
V. CONCLUSION
VI. REFERENCES

Analysis of GPU Power and Performance Models

I. INTRODUCTION

In the world of parallel computing, GPUs are widely used for non-graphics workloads, a practice known as general-purpose
parallel computing on GPUs. However, GPUs draw a large amount of power; the GPU alone is said to consume nearly half of the
power supplied to the system. This, in turn, affects the performance of GPU applications.

This report discusses two models that predict the power consumption and performance of a GPU system.

The statistical model is built around Dynamic Voltage and Frequency Scaling (DVFS), whereas the adaptive model uses the
random forest algorithm. The power and performance characterization of these two models on the Nvidia GTX 480 architecture
is studied in detail. The two models are also applied to other Nvidia GPU architectures. The power and performance
prediction errors of the two models are calculated and compared across the GPU architectures, and the better of the two
models is suggested.

The adaptive performance and power model is based on the random forest algorithm [3]. Its input metrics include GPU
architecture performance counters covering the memory access pattern, the multiprocessors, and bandwidth. The adaptive model
focuses on the Nvidia Fermi architecture [4] and also uses a G80 architecture [4] in order to validate the model's
universality. The Fermi architecture is used as the example to illustrate the primitives of the GPU architecture.

II. CUDA PROGRAMMING MODEL AND NVIDIA FERMI GPU ARCHITECTURE

The Nvidia Fermi GPU contains 16 Streaming Multiprocessors (SMs). Each SM includes 32 scalar processors (SPs), 4 special
function units (SFUs) and 16 load/store (LD/ST) units. Each SP has a floating point unit (FPU) and a pipelined ALU.
The 16 LD/ST units allow source and destination addresses to be calculated for sixteen threads per clock. Supporting units
load and store the data at each address to cache or DRAM. The SFUs execute special instructions such as sine, cosine,
reciprocal and square root.
A warp is a group of 32 parallel threads that execute in a single instruction multiple data (SIMD) fashion, and the SM
schedulers operate on warps. Each SM has two warp schedulers and two instruction dispatch units, which allow two warps to be
issued and executed concurrently. Fig. 1 illustrates the overview of the SM.

Fig 1. Fermi Stream Multiprocessor [4]

Fermi has 64 KB of configurable on-chip memory per SM, which is split between an L1 data cache and a shared memory, and also
an L2 cache that is shared by all SMs. Each SM includes an instruction cache and a large register file. In addition, a
large-capacity off-chip GDDR5 DRAM, which serves as global memory, is divided into six partitions. These are in turn
connected to the SMs through a complex interconnect.

Fermi's 16 SMs are placed around a common L2 cache. Fig. 2 illustrates the overview of the Fermi architecture. In the
figure, each SM is a vertical rectangular strip: the orange portion contains the scheduler and dispatch units, the green
portion contains the execution units, and the light blue portions are the L1 cache and register file.

Fig 2. Fermi Architecture Overview [4]

CUDA [5] is the hardware and software architecture that executes programs written in C, C++, Fortran and other languages on
Nvidia GPUs. A CUDA program calls parallel kernels. When a CUDA application invokes a kernel grid from the host, the blocks
of the grid are enumerated and distributed to the multiprocessors that have available execution capacity. The threads of a
thread block are partitioned into warps and executed concurrently on one multiprocessor. The number of thread blocks that
run concurrently on a multiprocessor greatly affects the power consumption and performance.
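
The partitioning of a thread block into warps can be illustrated with a short calculation. The following is a minimal sketch that assumes only the 32-thread warp size stated above; the function names and the example grid (64 blocks of 200 threads) are hypothetical and chosen for illustration.

```python
import math

WARP_SIZE = 32  # threads per warp, as stated in the text


def warps_per_block(threads_per_block: int) -> int:
    """Warps needed to hold one thread block (the last warp may be partly empty)."""
    return math.ceil(threads_per_block / WARP_SIZE)


def grid_summary(num_blocks: int, threads_per_block: int) -> dict:
    """Warp counts for a whole kernel grid (hypothetical helper)."""
    warps = warps_per_block(threads_per_block)
    return {
        "warps_per_block": warps,
        "total_warps": warps * num_blocks,
        # Lanes in the last warp of each block that have no thread assigned.
        "idle_lanes_per_block": warps * WARP_SIZE - threads_per_block,
    }


summary = grid_summary(num_blocks=64, threads_per_block=200)
print(summary)
```

A block size that is not a multiple of 32 leaves some lanes of the last warp unused, which is one reason block sizes are usually chosen as multiples of the warp size.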

III. THE PROPOSED TWO MODELS

A. ADAPTIVE POWER AND PERFORMANCE MODEL

The adaptive power and performance model is based on a random forest based analysis model, which is described in two
sections.

I. Random Forest Algorithm

Random forest is a set of tree-structured predictors {T1(X), T2(X), ..., Tn(X)}, where X is the training set. The number of
regression trees n and the number of variables m are the two most important factors required to build the random forest algorithm.

The advantages of using the random forest algorithm:


1) It is adaptive and flexible compared to a fixed regression model because it is learned from data.
2) Since each regression tree is independent, the trees can be trained separately, which keeps training fast.
3) It has high accuracy and is easy to use.

Working of the Random Forest Algorithm


1) The bootstrap sampling method is used to draw n training sets from the input data set S:
X = {xi}, i = 1, ..., n
2) A regression tree Ti is built from each training set xi. At each internal node, m variables are randomly sampled as
candidate predictors, and the best split is chosen among these predictors. The prediction error of the model is measured
on the data left out of the bootstrap sample rather than on the bootstrap sample itself.
3) The response for a new data set is predicted by averaging the predictions of the n trees.
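
The three steps above can be sketched in a few lines of Python. This is a simplified illustration and not the implementation used in [3]: the tree depth limit, the squared-error split criterion, the function names and the synthetic data are all assumptions made for the example.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared errors of the values around their mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)


def best_split(rows, targets, features):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_cost = None, float("inf")
    for f in features:
        # The largest value cannot separate the samples, so it is skipped.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            cost = sse(left) + sse(right)
            if cost < best_cost:
                best, best_cost = (f, t), cost
    return best


def build_tree(rows, targets, m, rng, depth=0, max_depth=4):
    """Step 2: grow one regression tree; a leaf is the mean target of its samples."""
    if depth >= max_depth or len(set(targets)) == 1:
        return mean(targets)
    features = rng.sample(range(len(rows[0])), m)  # m random candidate variables
    split = best_split(rows, targets, features)
    if split is None:
        return mean(targets)
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in left], [targets[i] for i in left],
                       m, rng, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [targets[i] for i in right],
                       m, rng, depth + 1, max_depth))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def build_forest(rows, targets, n_trees, m, seed=0):
    """Step 1: draw one bootstrap sample (with replacement) per tree."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Step 3: average the predictions of all trees."""
    return mean(predict_tree(tree, row) for tree in forest)


# Synthetic data (hypothetical): three metrics per sample; the response grows with metric 0.
rows = [(float(i), float(i % 3), float((i * 7) % 5)) for i in range(40)]
targets = [10.0 + 2.0 * i for i in range(40)]
forest = build_forest(rows, targets, n_trees=25, m=2)
low = predict_forest(forest, (2.0, 2.0, 4.0))
high = predict_forest(forest, (37.0, 1.0, 4.0))
print(low, high)
```

Here n corresponds to `n_trees` and m to the number of candidate variables sampled at each node. The out-of-bag error estimate is omitted to keep the sketch short.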

II. Power and Performance Analysis Model

The analysis model has two stages: prediction model construction and prediction model evaluation.

Prediction Model Construction

Fig 3. PREDICTION MODEL CONSTRUCTION [3]

In model construction, as shown in Fig. 3, the GPGPU simulator measures the response together with a set of independent
metrics. The response is either the power consumption or the performance. Tuples of the form
(Yi, X1,i, X2,i, ... , Xj,i) are sampled periodically, where Yi is the response of sample i and Xj,i is the jth metric of
sample i. These samples are used to build the prediction model.

Fig. 4 PREDICTION MODEL EVALUATION AND VARIABLE IMPORTANCE RANK [3]

As shown in Fig. 4, the prediction model accepts data sets that contain the values of all the metrics but not the response.
The prediction model then predicts the values of the response and evaluates the importance of each metric.

A large number of runtime metrics covering the whole GPU system are measured in order to achieve high accuracy. To reduce the
model construction time, the metrics that are less relevant to the response are identified with the random forest algorithm
and removed. Table I lists the key metrics in this model.

TABLE I
GPGPU-SIM RUNTIME METRICS DESCRIPTION

The adaptive power and performance model uses GPGPU-Sim v3.2 as the simulator, together with the GPUWattch power module [6].
GPUWattch is a power model for GPGPUs that supports power analysis and optimization.
The random forest analysis is performed for two GPGPUs with different architectures in order to demonstrate its
compatibility. Table II shows the specifications of the two GPGPUs.

TABLE II
GPU ARCHITECTURES [3]

A set of real benchmarks is collected from three benchmark suites, NVIDIA SDK 4.0 [7], Parboil [8] and Rodinia [9], for
the power consumption and performance analysis.

Fig. 5 shows the power and performance prediction error rates for twenty benchmarks running on the NVIDIA GTX480 GPU.
Fig. 5(a) shows the error rates for the individual benchmarks. The average prediction error rate is 3.2% for power
consumption and 2.1% for performance.
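
An average prediction error rate of this kind can be computed as the mean absolute relative error over the benchmarks. The report does not state the exact error definition used in [3], so the formula below is an assumption, and the measured and predicted values are hypothetical.

```python
def average_error_rate(measured, predicted):
    """Mean absolute relative error in percent (assumed error definition)."""
    errors = [abs(p - m) / abs(m) for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical power values in watts for three benchmarks.
measured_power = [100.0, 120.0, 80.0]
predicted_power = [103.0, 117.0, 82.0]

error_rate = average_error_rate(measured_power, predicted_power)
print(f"{error_rate:.2f}%")
```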

Fig. 5(a). Prediction error of power and IPC (performance) for GTX 480 [3]

Fig. 5(b). Scatter plot of measured power and performance [3]

Fig. 5. Power consumption and performance prediction error for GTX 480

From Fig. 5(b), we can observe that the predicted power is almost equal to the measured power.

The same benchmarks are validated on the NVIDIA Quadro FX5600 GPU in order to demonstrate the adaptability of the random
forest based analysis model. Fig. 6(a) shows the power and performance prediction error rates for twenty benchmarks running
on the NVIDIA Quadro FX5600 GPU. The average prediction error rate for both power consumption and performance is less than
5%. Fig. 6(b) illustrates that the predicted power is almost equal to the measured power.

This model also ranks the metrics that affect the power consumption and IPC (performance) of the GTX480. Fig. 7(a) and
Fig. 7(b) show the fifteen most important metrics for the power consumption and the performance of the GTX480,
respectively. Fig. 7(a) shows that the global memory related metrics have higher importance than the other metrics. This
implies that off-chip memory affects the performance and power consumption of the GPU. Fig. 7(b), in contrast, shows that
other metrics have higher importance than the global memory related metrics.

Fig. 6(a). Prediction error of power and IPC (performance) for FX5600 [3]

Fig. 6(b). Scatter plot of measured power and performance [3]

Fig. 6. Power consumption and performance prediction error for FX5600

Fig. 7(a). Variable importance for the power consumption model, GTX 480 [3]

Fig. 7(b). Variable importance for the performance model, GTX 480 [3]

B. STATISTICAL MODEL
The statistical model is built around DVFS [2]. The impact of DVFS [2] on the GPU is studied in detail in the power and
performance characterization. The statistical model is studied for the Nvidia GTX 480 and the Nvidia GTX 680 in order to
show its universality and compatibility with different GPU architectures.
Power and performance characterization is the most important part of the statistical model. The GPU power and performance
are characterized in order to study the behavior of the system and to understand the importance of characterization for
predicting power and performance.
For each benchmark, the core-memory frequency pair that consumes the least power while giving better performance is derived.
The core frequency and the memory frequency can each be set to High, Medium or Low, denoted Core/Mem H, M and L. The
benchmarks used are SDK 4.0 [7], Parboil [8] and Rodinia [9].
Table II
Best core-memory frequency pair for optimal power consumption
(each entry is core frequency - memory frequency)

Benchmark          GTX 480    GTX 680
Rodinia
  Backprop         H-L        M-L
  BFS              H-H        M-H
  CFD              H-H        M-M
  LUD              H-M        L-H
  NN               H-L        H-L
  NW               H-M        L-H
  SRAD             H-H        L-H
  STREAMCLUSTER    H-H        M-H
Parboil
  cutcp            H-M        H-H
  lbm              M-H        M-H
  tpacf            H-M        H-M
  mri-q            H-L        M-H

Table II lists the best core-memory frequency pair for optimal power consumption on both the GTX 480 and the GTX 680. The
default configuration is usually high core and memory frequency, i.e. Core-Mem H-H. Using the best core-memory frequency
pair improves power efficiency: the overall improvement is about 12.1% on the GTX 480 and 24.4% on the GTX 680. Data
obtained from [2].
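
The selection of a best frequency pair can be sketched as a search over measured configurations. The selection criterion used here, minimum energy (average power multiplied by execution time), is an assumption, since the report does not give the exact criterion used in [2]. The measurements are hypothetical and are not taken from [2].

```python
def energy(sample):
    """Energy in joules from (average power in watts, runtime in seconds)."""
    power_watts, runtime_seconds = sample
    return power_watts * runtime_seconds


def best_pair(measurements):
    """Return the (core, mem) level pair with the lowest energy."""
    return min(measurements, key=lambda pair: energy(measurements[pair]))


# Hypothetical measurements for one benchmark: (core, mem) -> (watts, seconds).
measurements = {
    ("H", "H"): (200.0, 10.0),  # default configuration
    ("H", "M"): (180.0, 10.5),
    ("H", "L"): (160.0, 14.0),
    ("M", "H"): (170.0, 12.0),
}

best = best_pair(measurements)
default_energy = energy(measurements[("H", "H")])
improvement = 100.0 * (default_energy - energy(measurements[best])) / default_energy
print(best, f"{improvement:.1f}% less energy than H-H")
```

Lowering a frequency reduces power but can lengthen the runtime, so the lowest-power pair is not necessarily the most efficient one.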
The statistical model is divided into two sections.
1. Model construction
The statistical model uses a multiple linear regression approach in which the dependent variables are power and performance,
whereas the independent variables are the statistics obtained from the performance counters. The number of performance
counters depends on the GPU architecture: the Nvidia GTX 480 has 74 performance counters and the Nvidia GTX 680 has 108.
The CUDA profiler is used to analyze the different benchmark programs, and its output is used as the modeling samples. The
coefficient of determination R2 is obtained from these samples. It is observed that the value of R2 does not improve beyond
10 independent variables.
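
For reference, the coefficient of determination compares the residual error of a model with the variance of the observed data. The sketch below uses the standard definition R2 = 1 - SS_res / SS_tot; the sample values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(observed, predicted)
print(round(r2, 4))
```

A value close to 1 means that the model explains most of the variation in the samples it was fitted to, which does not by itself guarantee a low prediction error on other programs.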
The model construction involves power modeling and performance modeling.
I. Power Modeling
The performance counters are divided into two groups, core events and memory events, according to whether the power
consumption they reflect is related to the core frequency or to the memory frequency of the GPU. The power consumption of
the GPU is predicted as shown in the equation.

[2]
where ci and mj are the performance counter values and xi, yj and z are the model coefficients.

II. Performance Modeling

The performance counters are divided according to whether the performance they reflect is related to the core frequency or
to the memory frequency of the GPU. The performance prediction is calculated by using the equation

[2]

where ci and mj are the performance counter values and xi, yj and z are the model coefficients.
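
As a rough illustration of how a model with these symbols might be evaluated, the sketch below assumes a plain linear form, sum of xi*ci over the core counters plus sum of yj*mj over the memory counters plus the constant z. This form is inferred only from the symbol definitions above; the actual equations in [2] may contain further terms (for example, frequency scaling), and the counter values and coefficients shown are hypothetical.

```python
def linear_counter_model(core_counters, mem_counters, x, y, z):
    """Evaluate sum(x_i * c_i) + sum(y_j * m_j) + z (assumed linear form)."""
    if len(core_counters) != len(x) or len(mem_counters) != len(y):
        raise ValueError("each counter needs exactly one coefficient")
    core_term = sum(xi * ci for xi, ci in zip(x, core_counters))
    mem_term = sum(yj * mj for yj, mj in zip(y, mem_counters))
    return core_term + mem_term + z


# Hypothetical counter values and fitted coefficients.
core_counters = [1000.0, 500.0]  # c_i
mem_counters = [200.0]           # m_j
x = [0.01, 0.02]                 # coefficients of the core counters
y = [0.05]                       # coefficients of the memory counters
z = 50.0                         # constant term

estimate = linear_counter_model(core_counters, mem_counters, x, y, z)
print(estimate)
```

In a regression setting the coefficients would be fitted to the profiler samples by least squares; only the evaluation step is shown here.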

2. Model Evaluation

The model is evaluated on the basis of its R2 values and its power and performance prediction errors. The R2 values of the
power model are shown in Table III, and the R2 values of the performance model are shown in Table IV.

Table III
R2 value of the Power Model [2]
GTX 480 GTX 680
0.70 0.18

Table IV
R2 value of the Performance Model [2]
GTX 480 GTX 680
0.94 0.91
From Table III, it is observed that the R2 value of the power model varies widely between the two GPUs. From Table IV, it
is observed that the R2 value of the performance model varies only slightly. The performance model achieves a better R2
value than the power model because the performance model has limited multiple regression trees.

Fig 8(a) represents the prediction error of the power model and fig. 8(b) represents the prediction error of the performance model
for GPU GTX 480 and GTX 680.

Fig. 8(a) Prediction error of power for different benchmarks in GTX 480 and GTX 680 [2]

Fig. 8(b) Prediction error of performance for different benchmarks in GTX 480 and GTX 680 [2]
From Figs. 8(a) and 8(b), the average prediction errors of power and performance can be calculated. The average prediction
error of power is shown in Table III.

Table III
Average prediction error of power [2]
GTX 480 GTX 680
Error[%] 18.2 23.5

The average prediction error value of performance is as shown in table IV.

Table IV
Average prediction error of performance [2]
GTX 480 GTX 680
Error[%] 39.3 33.5

It is observed that the prediction error of performance is higher than the prediction error of power in spite of the higher
R2 values. It is also observed that the prediction error of performance decreases as the GPU architecture progresses. The
statistical model has average power and performance prediction errors between 18% and 20%. All the test programs are
compiled using the Nvidia CUDA Compiler [10], and a Yokogawa electric power meter [11] is used for measuring the power.

The statistical model, which consists of the power and performance models, is evaluated using 20 explanatory variables, as
shown in Fig. 9(a) and Fig. 9(b).

Fig. 9(a)
Impact of explanatory variables on the power model [2]

Fig. 9(b)
Impact of explanatory variables on the performance model [2]

The x-axis in Fig. 9(a) and Fig. 9(b) represents the explanatory variables. From both figures it is observed that using 10
explanatory variables gives the best prediction accuracy for both the power and the performance model. It is also observed
that the power model gives its best prediction error when it is tied to a single architecture, whereas the performance
model's prediction error improves as the GPU architecture progresses.

IV. COMPARISON BETWEEN THE ADAPTIVE MODEL AND THE STATISTICAL MODEL

The adaptive and the statistical power and performance models are discussed in detail in the sections above. The adaptive
model is based on the random forest algorithm, which relies on machine learning instead of a predefined formula.

The main advantage of the adaptive power and performance model is its adaptability to different GPUs. Each regression tree
in the random forest model is independent, which makes the training of these trees fast. The adaptive power and performance
model is highly accurate in predicting the performance and power consumption of the GPU. It also provides a clear
understanding of the important factors that affect the GPU's power and performance.

Apart from these advantages, the adaptive model also has a disadvantage: warp divergence, which occurs when the threads in
a warp take different execution paths at divergent branches. Warp divergence and branch divergence affect the power
consumption and performance of the GPU.

The statistical power and performance model is built around DVFS [2]. It uses multiple linear regression, in which the
dependent variables are the power and performance and the independent variables are the statistics obtained from the
performance counters.

The important advantage of the statistical model is that the core and memory frequencies are divided into three levels, i.e.
Core-Mem H, M and L (High, Medium and Low). The best core-memory frequency pair for each benchmark is derived on the basis
of power consumption and performance. Configuring each benchmark with its best core-memory frequency pair results in lower
power consumption and better performance of the GPU. The statistical model can be used to predict the power consumption and
performance of multiple GPU architectures.
The disadvantage of the statistical model is that the prediction error of power increases as the GPU architecture
progresses, whereas the performance prediction error remains almost the same. Also, the prediction errors for power and
performance are high.

Table V
Comparison between the Adaptive and the Statistical Models

Adaptive Model                      GTX 480    FX5600
Prediction error for power          3.2%       ~5%
Prediction error for performance    2.1%       ~5%

Statistical Model                   GTX 480    GTX 680
Prediction error for power          18.2%      23.5%
Prediction error for performance    39.3%      33.5%

Table V compares the power and performance prediction errors of the adaptive model and the statistical model. On the GTX
480, the adaptive model has a prediction error of 3.2% for power consumption and 2.1% for performance, whereas the
statistical model has a prediction error of 18.2% for power consumption and 39.3% for performance. The prediction errors of
the adaptive model on its second GPU are also lower than those of the statistical model on its second GPU.

Overall, the adaptive model has lower prediction error for power and performance than the statistical model.

V. CONCLUSION

By comparing the power and performance prediction errors of the two models, we conclude that the adaptive model has the
lower prediction error. The statistical model, as described above, uses DVFS to derive ideal core-memory frequency pairs,
which in turn help to reduce power consumption. Thus, if DVFS could be combined with the random forest algorithm in the
adaptive model, which is the better of the two models addressed above, the problem of power consumption could be addressed.
Future work could therefore improve the adaptive model in this respect in order to obtain the lowest possible prediction
error.

VI. REFERENCES
[1] Abe, Y., Sasaki, H., Kato, S., Inoue, K., Edahiro, M., & Peres, M. (2013, September). Power and
performance of GPU-accelerated systems: A closer look. In Workload Characterization (IISWC), 2013 IEEE
International Symposium on (pp. 109-110). IEEE.

[2] Abe, Y., Sasaki, H., Kato, S., Inoue, K., Edahiro, M., & Peres, M. (2014, May). Power and performance
characterization and modeling of gpu-accelerated systems. In Parallel and Distributed Processing Symposium, 2014
IEEE 28th International (pp. 113-122). IEEE.

[3] Li, X., Wu, J., Yu, Z., Xu, C., & Chen, K. (2014, April). An adaptive GPU performance and power model.
In Information Science and Technology (ICIST), 2014 4th IEEE International Conference on (pp. 665-669). IEEE.

[4] NVIDIA, Fermi official whitepaper, 2010.

[5] Nvidia, C. U. D. A. (2011). Nvidia cuda c programming guide. Nvidia Corporation, 120(18), 8.

[6] J. Leng, et al. GPUWattch: enabling energy optimizations in GPGPUs, Proc. of ISCA, vol. 40, 2013.

[7] NVIDIA GPU Computing SDK, https://developer.nvidia.com/gpucomputing-sdk, 2011.

[8] Parboil Benchmark Suite, available: http://impact.crhc.illinois.edu/Parboil/, 2011.

[9] S. Che, M. Boyer, et al. Rodinia: A benchmark suite for heterogeneous computing, in IISWC, pp. 86-97,
Oct 2009.

[10] NVIDIA, CUDA TOOLKIT 4.2, 2012, http://developer.nvidia.com/cuda/cuda-downloads.

[11] Y. E. Corporation, WT1600 digital power meter, 2012,
http://tmi.yokogawa.com/discontinued-products/digital-power-analyzers/digital-power-analyzers/wt1600-digital-power-meter/.
