Offloaded Execution of Deep Learning Inference at Edge: Challenges and Insights
Abstract—Efforts to leverage the benefits of Deep Learning (DL) models for performing inference in resource-constrained embedded devices are very popular nowadays. Researchers worldwide are trying to come up with software and hardware accelerators that make pre-trained DL models suitable for running the inference phase on these devices. Apart from software / hardware accelerators, DL model partitioning and offloading to Cloud or Edge network servers is becoming more and more practicable with the increasing importance of Edge Computing. DL inference workflow partitioning and offloading is complementary to software / hardware accelerators and can improve the latency and energy efficiency of an already accelerated DL model. In this work we implement a DL inference offloading system using a Raspberry Pi 3 based robot vehicle with a hardware accelerator from Intel (NCS). We report detailed experimental results, the improvements achieved, and several insights and challenges that we inferred from the experiments, including the role of network bandwidth and edge device dynamic system load.

I. INTRODUCTION

Deep Learning [20] (DL) has acted as an accelerator in our journey towards a data-driven world by minimizing the role of domain-specific and tedious feature engineering and by allowing high-performing pre-trained models to be used for newer datasets with relatively little re-training. However, the DL inference phase, which is used for classification / prediction, requires a fair amount of system resources (compute, memory, cache, etc.) to run in real time. Efforts to leverage the benefits of DL models for performing inference in resource-constrained embedded devices have been in focus for the past few years. Researchers worldwide are trying to come up with software and hardware accelerators that make pre-trained DL models suitable for running the inference phase on these devices. Software transformations include model compression [21]–[23], quantization [24], [25], approximation, network pruning, and new, smaller network architectures [26] with comparable efficacy. Hardware-friendly optimizations such as matrix multiplication factorization, data path optimization and parallel operations (e.g. convolutions) are some of the areas that have attracted attention [27], paving the way for commercially available hardware accelerators for DL inference. DL model partitioning and offloading to Cloud or Edge network servers is a complementary approach to these accelerations and can be applied over and above these transforms. Such partitioned DL inference is becoming more and more practicable with the increasing importance of Edge Computing and may be required in the following scenarios:
• a large pre-trained model may not fit in the hardware accelerator memory,
• the required inference rate is not achievable even using accelerators, and
• the embedded device has other tasks to perform along with DL inference.

In this work we focus specifically on DL inference via offloading. We discuss the DL inference offloading process, the state of the art (SoA) and its technology enablers in detail. We implement a DL inference offloading system using a Raspberry Pi 3 [11] (RPi) based robot vehicle and a USB-connected Intel Movidius Neural Compute Stick [13] (NCS), a hardware accelerator for DL inference. We report detailed experimental results, the improvements achieved and several insights and challenges that we inferred from our experiments. Throughout this article we use the term DNN to refer to any deep neural network such as a CNN [19] or RNN, edge device to refer to network-end embedded systems such as robots, autonomous vehicles and wearable devices, and edge server to refer to the untapped resources at the network edge that can act as offloading servers, e.g. smartphones, gateways and MEC servers [28]. The rest of this paper is organized as follows: in Section II, DNN execution offloading and its enablers are discussed in detail along with the SoA. In Section III we present the experimental setup, the experiments performed and the detailed inferences. Section IV concludes the paper.

II. DNN EXECUTION OFFLOADING

In this section we discuss the major technical enablers for accelerating DNN inference by partitioning it between constrained embedded edge devices and edge / Cloud servers.

A. Execution Profiling

Profiling DNN execution time and power requirements is critical for designing effective offloading strategies, as it enables informed decisions about whether or not to offload part of a DNN inference pipeline. The de facto standard in this regard is layer-wise profiling and offloading of DNNs. Profiling the inference process layer by layer for each and every DNN is impracticable, and hence researchers try to build generalized prediction models for execution time, power, etc. for different types of layers on different classes of hardware, as depicted in Fig. 1. Linear-regression-based models are popular in this context as they are lightweight in terms of system resource consumption and can easily be built from the structure of the network and its different configuration parameters.

Fig. 1. Layer-wise execution time prediction: the layer structure (input, convolutional, pooling / non-linearity, fully connected, output), model parameters, runtime memory, deployment target and hardware details serve as features for execution time prediction.

... mobile phones; however, they do not build any prediction models of execution time and power consumption.
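To illustrate this approach, the following is a minimal sketch of a linear-regression-based layer execution time predictor. It is not the profiling model used in this work; the feature set (input spatial size, input channels, kernel size, filter count, stride) and all numbers are illustrative assumptions, and scikit-learn is used only for convenience.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical layer-wise profiling data for one device class.
# Assumed feature columns: input spatial size, input channels,
# kernel size, number of filters, stride.
X = np.array([
    [224,   3, 11,  64, 4],
    [ 55,  64,  5, 192, 1],
    [ 27, 192,  3, 384, 1],
    [ 13, 384,  3, 256, 1],
])
# Measured execution times in milliseconds (made-up values).
y = np.array([18.0, 41.0, 23.0, 9.5])

model = LinearRegression().fit(X, y)

# Predict the latency of a layer that was never profiled,
# directly from its configuration parameters.
unseen_layer = np.array([[13, 256, 3, 256, 1]])
print("predicted latency (ms):", model.predict(unseen_layer)[0])

A separate model of this kind can be fitted per layer type and per hardware class, which is what lets the approach generalize without profiling every new network end to end.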
B. Partitioning Strategy for Offloading

DNN partitioning is an important enabler for offloaded execution. Partition points in a DNN data flow graph can be determined with the target of energy and/or latency minimization. As shown in Fig. 2, partitioning can be targeted at the Edge-Cloud hierarchy.
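As a concrete, deliberately simplified formulation of latency-driven partition point selection, the sketch below runs the first k layers on the edge device, ships the partition layer's output over the network, runs the remaining layers on the server, and keeps the k with the lowest estimated end-to-end latency. It is an illustrative assumption, not the partitioning algorithm evaluated in this paper; the per-layer timings, output sizes and bandwidth are made-up inputs that would in practice come from profiling models such as the one sketched in Section II-A.

# Illustrative layer-wise partition point selection by total-latency minimization.
def best_partition(edge_ms, server_ms, out_bytes, input_bytes, bandwidth_bps):
    """edge_ms[i] / server_ms[i]: predicted time of layer i on device / server (ms);
    out_bytes[i]: output size of layer i in bytes; input_bytes: raw input size."""
    n = len(edge_ms)
    best_k, best_total = 0, float("inf")
    for k in range(n + 1):  # first k layers run on the edge device
        # Payload sent uplink: the raw input if nothing runs locally,
        # otherwise the output of the last locally executed layer.
        payload = input_bytes if k == 0 else out_bytes[k - 1]
        transfer_ms = payload * 8 * 1000.0 / bandwidth_bps
        total = sum(edge_ms[:k]) + transfer_ms + sum(server_ms[k:])
        if total < best_total:
            best_k, best_total = k, total
    return best_k, best_total

# Example with made-up numbers: a 4-layer network over a 100 Mbps link.
k, latency = best_partition(
    edge_ms=[20.0, 35.0, 30.0, 15.0],
    server_ms=[2.0, 4.0, 3.0, 2.0],
    out_bytes=[300_000, 150_000, 60_000, 4_000],
    input_bytes=600_000,
    bandwidth_bps=100_000_000,
)
print(f"best partition point k={k}, estimated latency {latency:.1f} ms")

The same search can be run with per-layer energy estimates in place of, or in addition to, execution times when energy minimization is the target.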
[Figure: Alexnet results. (a) Configuration-I partition point selection; (b) Configuration-I inference rate; (c) Configuration-II partition point selection; (d) Configuration-II inference rate. Axes: layer-wise partition point / inference rate vs. bandwidth in Mbps and server-to-edge capacity ratio.]
... edge device along with the predicted DNN execution latency while choosing the partition points, which none of the earlier systems do.

IV. CONCLUSION

In this paper we implemented an execution offloading system for the Deep Learning inference phase. We experimented with an embedded System on Chip along with a DL inference accelerator and demonstrated that intelligent offloading can be useful both for resource-constrained embedded systems and for more capable edge devices. We discussed the State of the Art in this research area and outlined several challenges and insights that may result in better system designs for DNN inference offloading. In future we would like to design an offloading system that takes care of the dynamic system load of Internet of Things devices that follow a sense-analyze-actuate cycle.

REFERENCES

[1] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang, "Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge," SIGARCH Comput. Archit. News 45, 1 (April 2017), 615-629. DOI: https://doi.org/10.1145/3093337.3037698, 2017.
[2] Shuochao Yao, Yiran Zhao, Huajie Shao, ShengZhong Liu, Dongxin Liu, Lu Su, and Tarek Abdelzaher, "FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time on Mobile and Embedded Devices," in Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems (SenSys '18), ACM, New York, NY, USA, 278-291. DOI: https://doi.org/10.1145/3274783.3274840, 2018.
[3] Qingqing Cao, Niranjan Balasubramanian, and Aruna Balasubramanian, "MobiRNN: Efficient Recurrent Neural Network Execution on Mobile GPU," in Proceedings of the 1st International Workshop on Deep Learning for Mobile Systems and Applications (EMDL '17), ACM, New York, NY, USA, 1-6. DOI: https://doi.org/10.1145/3089801.3089804, 2017.
[4] En Li, Zhi Zhou, and Xu Chen, "Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy," in Proceedings of the 2018 Workshop on Mobile Edge Communications (MECOMM '18), ACM, New York, NY, USA, 31-36. DOI: https://doi.org/10.1145/3229556.3229562, 2018.
[5] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong, "Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks," in Proceedings of the 35th International Conference on Computer-Aided Design (ICCAD '16), ACM, New York, NY, USA, Article 12, 8 pages. DOI: https://doi.org/10.1145/2966986.2967011, 2016.
[6] Nicholas D. Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar, "An Early Resource Characterization of Deep Learning on Wearables, Smartphones and Internet-of-Things Devices," in Proceedings of the 2015 International Workshop on Internet of Things towards Applications (IoT-App '15), ACM, New York, NY, USA, 7-12. DOI: http://dx.doi.org/10.1145/2820975.2820980.
[Figure: SqueezeNet results. Configuration-I and Configuration-II partition point selection and inference rate vs. bandwidth in Mbps and server-to-edge capacity ratio.]
[7] Mengwei Xu, Feng Qian, and Saumay Pushp, "Enabling Cooperative Inference of Deep Learning on Wearables and Smartphones," arXiv preprint arXiv:1712.03073, 2017.
[8] J. H. Ko, T. Na, M. F. Amir and S. Mukhopadhyay, "Edge-host partitioning of deep neural networks with feature space encoding for resource-constrained internet-of-things platforms," arXiv preprint arXiv:1802.03835, 2018.
[9] E. Li, Z. Zhou and X. Chen, "Edge intelligence: On-demand deep learning model co-inference with device-edge synergy," in Proceedings of the 2018 Workshop on Mobile Edge Communications, MECOMM@SIGCOMM 2018, pp. 31-36, Budapest, Hungary, August 20, 2018.
[10] Surat Teerapittayanon, Bradley McDanel, and H. T. Kung, "BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks," arXiv preprint arXiv:1709.01686, 2017.
[11] Raspberry Pi 3: Specs, benchmarks & testing, https://www.raspberrypi.org/magpi/raspberry-pi-3-specs-benchmarks.
[12] Jetson TX2, https://elinux.org/Jetson_TX2.
[13] Intel Movidius Neural Compute Stick, https://movidius.github.io/ncsdk/ncs.html.
[14] Jack Dongarra, "The LINPACK Benchmark: An Explanation," in Proceedings of the 1st International Conference on Supercomputing, Elias N. Houstis, Theodore S. Papatheodorou, and Constantine D. Polychronopoulos (Eds.), Springer-Verlag, London, UK, 456-474, 1987.
[15] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size," arXiv preprint arXiv:1602.07360, 2016.
[16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision," arXiv preprint arXiv:1512.00567, 2015. arxiv.org/abs/1512.00567.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS '12), F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), Vol. 1, Curran Associates Inc., USA, 1097-1105, 2012.
[18] "China Unicom Edge Computing Technology White Paper", https://builders.intel.com/docs/networkbuilders/china-unicom-edge-computing-technology-white-paper.pdf.
[19] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '14), IEEE Computer Society, Washington, DC, USA, 512-519, 2014. DOI: http://dx.doi.org/10.1109/CVPRW.2014.131.
[20] Y. Bengio, "Learning Deep Architectures for AI," Now Publishers Inc., Hanover, MA, USA, 2009.
[21] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[22] S. Bhattacharya and N. D. Lane, "Sparsification and separation of deep learning layers for constrained resource inference on wearables," in Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM, pages 176-189, ACM, 2016.
[23] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, "MCDNN: an approximation-based execution framework for deep stream processing under resource constraints," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys '16), 2016.
[Figure: Inception V3 results. (a) Configuration-I partition point selection; (b) Configuration-I inference rate; (c) Configuration-II partition point selection; (d) Configuration-II inference rate. Axes: layer-wise partition point / inference rate vs. bandwidth in Mbps and server-to-edge capacity ratio.]

[Figure: Inference rate (FPS) vs. bandwidth in Mbps (upload time).]