Pick the Right Edge Device: Towards Power and Performance Estimation of CUDA-based CNNs on GPGPUs

Christopher Alexander Metz et al.
Abstract—The emergence of Machine Learning (ML) as a powerful technique has been helping nearly all fields of business to increase operational efficiency or to develop new value propositions. Besides the challenges of deploying and maintaining ML models, picking the right edge device (e.g., a GPGPU) to run these models (e.g., computationally massive CNNs) is one of the most pressing challenges faced by organizations today. As the cost of renting (in the Cloud) or purchasing an edge device is directly connected to the cost of the final products or services, choosing the most efficient device is essential. However, this decision making requires deep knowledge about the performance and power consumption of the ML models running on edge devices, which must be identified at an early stage of the ML workflow. In this paper, we present a novel ML-based approach that provides ML engineers with early estimates of both the power consumption and the performance of CUDA-based CNNs on GPGPUs. The proposed approach empowers ML engineers to pick the most efficient GPGPU for a given CNN model at an early stage of development.

Index Terms—power consumption, performance estimation, GPGPU, CUDA, machine learning, neural networks

I. INTRODUCTION

Nowadays, Machine Learning (ML) has become very important for all types of industries, ranging from manufacturing to scientific, health-, and security-related applications [1]–[3]. This increasing deployment of ML is not only driven by big companies such as Google, Amazon, or Facebook but also by small and medium-sized enterprises (SMEs). For example, about 35% of the SMEs in Germany have already used ML for their applications [4].

One of the most important use cases of ML (e.g., in German industry) is automation and smart sensors [5]. Among the existing ML algorithms, Convolutional Neural Networks (CNNs) are the most popular choice for image recognition in automated manufacturing process control [6]. The convolutional layers, made up of 4-dimensional convolutions, are responsible for over 90% of the computation and require processing massive amounts of data with potentially trillions of computations per second [7]. Due to this massive computation, ML engineers take advantage of edge devices such as GPGPUs to gain performance and meet time-to-market constraints. However, edge devices such as GPGPUs come in a wide variety of series, some of which are very expensive or consume too much power. Therefore, for ML engineers or companies that want to run ML algorithms such as CNNs on a GPGPU, the selection of the underlying edge device is essential.

According to [4], 69% of small to medium businesses face initial issues when adopting ML for their applications. The statistics show that a lack of knowledge about finding the underlying edge device that best fits a given ML algorithm is the main challenge for such businesses.

For example, consider a scenario in which company AA rents a GPGPU from company BB to run its CNN algorithm. To achieve maximum performance, AA chooses BB's most expensive GPGPU. This selection directly affects the cost of AA's final product in which the CNN is used. However, there could be an alternative that AA could choose to obtain the same performance at a lower cost.

When using a CNN, the fundamental concern to overcome first is finding the most power- and performance-efficient edge device for it. This is a critical step: under-approximation of the proper device causes performance loss and threatens time-to-market constraints, while over-approximation of the required edge devices means paying more to purchase them or to rent them in the Cloud. Moreover, for mobile applications such as smartphones, over-approximation in GPGPU selection can reduce battery lifetime (as the power consumption increases) and increase the area overhead.

In this paper, we focus on this scenario. We propose a novel approach to estimate the power consumption and performance of a given CUDA-based CNN model on GPGPUs. Our approach statically analyzes the CNN instructions and the GPGPU architectural information and takes advantage of machine learning techniques for its estimation process. With this, ML engineers are able to choose the most efficient GPGPU for their CNN model at an early stage of development.

The structure of this paper is as follows: Section II presents the related work in this domain. The proposed methodology is described in Section III. The benchmarks and evaluation methods are specified in Section IV. The paper is concluded in Section V.

This work was supported in part by the German Federal Ministry of Education and Research (BMBF) within the project SATiSFy under contract no. 16KIS0821K, by the Data Science Center of the University of Bremen (DSC@UB), and by the University of Bremen's graduate school SyDe, funded by the German Excellence Initiative.
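At a high level, the device-selection decision described above can be sketched as a small ranking procedure. The names below (`pick_gpgpu`, the `power_model`/`perf_model` callables, and the feature keys) are illustrative assumptions, not the paper's actual interface; the two models stand in for the predictive models proposed in this work.

```python
def pick_gpgpu(ptx_features, gpgpu_specs, power_model, perf_model):
    """Rank candidate GPGPUs by predicted energy (runtime x power)
    and return the name of the most efficient one."""
    ranked = []
    for name, specs in gpgpu_specs.items():
        x = {**ptx_features, **specs}           # combined feature vector
        watts = power_model(x)                  # predicted power draw [W]
        seconds = perf_model(x)                 # predicted runtime [s]
        ranked.append((watts * seconds, name))  # predicted energy [J]
    return min(ranked)[1]
```

With trained regressors in place of dummy models, the same loop would let an engineer compare candidate cards before renting or purchasing either.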
    ld.param.u64 %rd1, [copy5_param_0];
    ld.param.u64 %rd2, [copy5_param_1];
    cvta.to.global.u64 %rd3, %rd2;
    cvta.to.global.u64 %rd4, %rd1;
    mov.u32 %r1, %ctaid.x;
    mov.u32 %r2, %tid.x;
    shl.b32 %r3, %r1, 8;
    or.b32 %r4, %r3, %r2;
    shl.b32 %r5, %r4, 2;
    or.b32 %r6, %r5, 1;
    or.b32 %r7, %r5, 2;
    or.b32 %r8, %r5, 3;
    cvt.u16.u32 %rs1, %r4;
    mul.wide.u16 %r9, %rs1, -7281;
    shr.u32 %r10, %r9, 22;
    mul.wide.u32 %rd5, %r8, 954437177;
    shr.u64 %rd6, %rd5, 33;
    cvt.u32.u64 %r11, %rd6;
    and.b32 %r12, %r11, 31;
    mul.wide.u32 %rd7, %r8, -1431655765;

[Fig. 3. Neural network architecture for power and performance prediction. Inputs: PTX instruction classes (e.g., floating-point instructions, data movement & conversion) and GPGPU components (e.g., FP32 cores, Tensor cores, global memory); outputs: power consumption and performance.]
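A static pass over such PTX text is enough to produce the instruction-class counts that serve as inputs in Fig. 3. The sketch below is a minimal illustration; the exact class taxonomy and opcode grouping used in the paper are not reproduced in this excerpt, so this mapping is an assumption.

```python
import re
from collections import Counter

# Hypothetical grouping of PTX base opcodes into the instruction
# classes shown in Fig. 3 (illustrative, not NVIDIA's taxonomy).
MOVEMENT = {"ld", "st", "mov", "cvt", "cvta"}

def classify(opcode: str) -> str:
    """Map one full PTX opcode (e.g. 'mul.wide.u16') to a class name."""
    base = opcode.split(".")[0]
    if base in MOVEMENT:
        return "data_movement_conversion"
    if ".f16" in opcode or ".f32" in opcode or ".f64" in opcode:
        return "floating_point"
    return "integer_logic"

def count_classes(ptx_text: str) -> Counter:
    """Statically count instruction classes in a PTX listing."""
    counts = Counter()
    for line in ptx_text.splitlines():
        m = re.match(r"([a-z]+(?:\.[a-z0-9]+)*)\s", line.strip())
        if m:
            counts[classify(m.group(1))] += 1
    return counts
```

Run on the excerpt above, such a pass would report mostly data-movement/conversion and integer instructions, which is the kind of per-class profile the predictive models consume.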
TABLE I
Example of training dataset structure used to create the predictive models
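The table body is not reproduced in this excerpt. Based on the surrounding description, each training sample pairs static PTX instruction-class counts and GPGPU architectural parameters (features) with measured runtime and average power (labels). A hedged sketch of assembling one such sample, in which the column names and the nvidia-smi CSV layout are illustrative assumptions:

```python
import csv
import io

def average_power(nvidia_smi_csv: str, column: str = "power.draw [W]") -> float:
    """Average the per-second power samples logged by nvidia-smi
    (e.g. `nvidia-smi --query-gpu=power.draw --format=csv -l 1`)."""
    reader = csv.DictReader(io.StringIO(nvidia_smi_csv))
    samples = [float(row[column].split()[0]) for row in reader]  # "35.20 W" -> 35.20
    return sum(samples) / len(samples)

def training_row(instr_counts, gpgpu_specs, runtime_s, avg_power_w):
    """One (features, labels) sample in the style of Table I."""
    features = {**instr_counts, **gpgpu_specs}
    labels = {"runtime_s": runtime_s, "power_w": avg_power_w}
    return features, labels
```

Averaging the one-second samples over the whole run matches the measurement procedure described in Section IV.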
IV. BENCHMARKS AND METHODOLOGY EVALUATION

In order to evaluate the proposed methodology, a set of benchmarks is considered, including three different types of GPGPUs as well as several well-known pre-trained CNN algorithms. The GPGPUs considered in our benchmark are based on different architectures. This enables us to make the proposed methodology more generic in predicting the performance and power consumption of applications running on GPGPUs. The evaluation of the proposed methodology is performed by estimating the performance and power consumption of a given CNN (which is not in the training dataset) using the predictive models. The CNN is run on the GPGPU while the nvidia-smi tool is used to measure the power consumption of the GPGPU. Nvidia-smi reports the power consumption in a CSV file. For this, a measurement interval of one second is considered (i.e., every second a measurement is written to the CSV file). The power consumption is averaged over the entire run and compared to the ML model's estimate.

V. CONCLUSION AND FUTURE WORK

We propose a novel ML-based approach to predict the power consumption and performance of CUDA-based CNNs on GPGPUs. This enables ML engineers to pick the GPGPU that best suits their CNN in terms of performance and power consumption. The results of this analysis help ML engineers build their systems more efficiently and avoid high production costs.

The proposed approach is based on analyzing the PTX instructions obtained from various standard CNN models together with the GPGPUs' architectural information. We take advantage of this information to build a set of training data and use it to generate a predictive model that estimates the performance and power consumption of a given CNN. The proposed methodology is under development and in an early evaluation phase. We believe this new line of research can truly help designers to make better decisions at the early stage of selecting edge devices and, moreover, to optimize their ML algorithms to fulfill the constraints of a given edge device.

As future work, we plan to consider more benchmarks (i.e., more GPGPU series and CNN algorithms) to make our predictive models more accurate. Another direction is to expand our method beyond CNNs to other ML methods such as general Neural Networks (NNs). Moreover, it is also interesting to consider other edge devices such as Tensor Processing Units (TPUs) to provide ML engineers with more options for running their ML algorithms.

REFERENCES

[1] M. Goli and R. Drechsler, PREASC: Automatic portion resilience evaluation for approximating SystemC-based designs using regression analysis techniques. ACM Transactions on Design Automation of Electronic Systems (TODAES), 25(5), pp. 1-28, 2020.
[2] M. Goli, J. Stoppe, and R. Drechsler, Resilience evaluation for approximating SystemC designs using machine learning techniques. In International Symposium on Rapid System Prototyping (RSP), pp. 97-103. IEEE, 2018.
[3] M. Goli and R. Drechsler, Application III: Design Space Exploration. In Automated Analysis of Virtual Prototypes at the Electronic System Level. Springer International Publishing, 2020.
[4] Bundesverband mittelständische Wirtschaft, Umfrage: Anwendung von Künstlicher Intelligenz in KMU, 2020, available: https://gemeinsam-digital.de/app/uploads/2020/07/ki-umfrage_bvwm_gd.pdf
[5] Wissenschaftliches Institut für Infrastruktur und Kommunikationsdienste, Künstliche Intelligenz im Mittelstand - Relevanz, Anwendung, Transfer, available: mittelstand-digital.de/MD/Redaktion/DE/Publikationen/kuenstliche-intelligenz-im-mittelstand.pdf?__blob=publicationFile&v=5
[6] M. Kadar and D. Onita, A deep CNN for image analytics in automated manufacturing process control. In International Conference on Electronics, Computers and Artificial Intelligence (ECAI), Pitesti, Romania, 2019, pp. 1-5, doi: 10.1109/ECAI46879.2019.9042159
[7] M. Fingeroff, Machine learning at the edge: using HLS to optimize power and performance, Mentor, a Siemens Business, available: https://go.mentor.com/5 3ik
[8] S. Song, C. Su, B. Rountree, and K. W. Cameron, A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In International Symposium on Parallel & Distributed Processing, Boston, MA, pp. 673-686, 2013.
[9] J. Guerreiro, A. Ilic, N. Roma, and P. Tomás, Modeling and decoupling the GPU power consumption for cross-domain DVFS. In IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 11, pp. 2494-2506, Nov. 2019, doi: 10.1109/TPDS.2019.2917181
[10] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka, Statistical power modeling of GPU kernels using performance counters. In International Conference on Green Computing, Chicago, IL, pp. 115-122, 2010, doi: 10.1109/GREENCOMP.2010.5598315
[11] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou, GPGPU performance and power estimation using machine learning. In IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, 2015.
[12] J. Chen, B. Li, Y. Zhang, L. Peng, and J. Peir, Statistical GPU power analysis using tree-based methods. In International Green Computing Conference and Workshops, Orlando, FL, 2011, pp. 1-6, doi: 10.1109/IGCC.2011.6008582
[13] Q. Wang and X. Chu, GPGPU performance estimation with core and memory frequency scaling. In IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 12, pp. 2865-2881, 2020.
[14] N. Ardalani, C. Lestourgeon, K. Sankaralingam, and X. Zhu, Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance. In Proceedings of the 48th International Symposium on Microarchitecture, ACM, 2015, pp. 725-737.
[15] I. Baldini, S. J. Fink, and E. Altman, Predicting GPU performance from CPU runs using machine learning. IEEE, 2014, pp. 254-261.
[16] Nvidia, Nvidia Tesla V100 GPU architecture - The world's most advanced data center GPU, August 2017, available: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
[17] Nvidia, Nvidia Turing GPU architecture, available: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
[18] TensorFlow, TensorFlow GPU support, available: https://www.tensorflow.org/install/gpu
[19] Nvidia, Parallel Thread Execution ISA application guide, October 2020, available: https://docs.nvidia.com/cuda/pdf/ptx_isa_7.1.pdf
[20] Nvidia, nvidia-smi documentation, available: https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf