
Offloaded Execution of Deep Learning Inference at Edge: Challenges and Insights

Swarnava Dey, Jayeeta Mondal, Arijit Mukherjee
Embedded Systems and Robotics, TCS Research & Innovation, Kolkata, India
Email: swarnava.dey@tcs.com, jayeeta.m@tcs.com, mukherjee.arijit@tcs.com

Abstract—Efforts to leverage the benefits of Deep Learning (DL) models for performing inference on resource-constrained embedded devices are very popular nowadays. Researchers worldwide are trying to come up with software and hardware accelerators that make pre-trained DL models suitable for running the inference phase on these devices. Apart from software/hardware accelerators, DL model partitioning and offloading to Cloud or Edge network servers is becoming more and more practicable with the increasing importance of Edge Computing. DL inference workflow partitioning and offloading is complementary to software/hardware accelerators and can improve the latency and energy efficiency of an already accelerated DL model. In this work we implement a DL inference offloading system using a Raspberry Pi 3 based robot vehicle with a hardware accelerator from Intel (NCS). We report detailed experimental results, the improvements achieved, and several insights and challenges that we inferred from the experiments, including the role of network bandwidth, edge device dynamic system load, etc.

Index Terms—component, formatting, style, styling, insert

I. INTRODUCTION

Deep Learning [20] (DL) has acted as an accelerator in our journey towards a data-driven world by minimizing the role of domain-specific and tedious feature engineering and allowing high-performing pre-trained models to be used on newer datasets with relatively little re-training. However, the DL inference phase, which is used for classification/prediction, requires a fair amount of system resources (compute, memory, cache, etc.) to run in real time. Efforts to leverage the benefits of DL models for performing inference on resource-constrained embedded devices have been in focus for the past few years. Researchers worldwide are trying to come up with software and hardware accelerators that make pre-trained DL models suitable for running the inference phase on these devices. Software transformations such as model compression [21]-[23], quantization [24], [25], approximation, network pruning and new, smaller network architectures [26] with comparable efficacy, together with hardware-friendly optimizations such as matrix multiplication factorization, data path optimization and parallel operations (e.g., convolutions), are some of the areas that have attracted attention [27], paving the way for commercially available hardware accelerators for DL inference. DL model partitioning and offloading to Cloud or Edge network servers is a complementary approach to these accelerations and can be applied over and above these transforms. Such partitioned DL inference is becoming more and more practicable with the increasing importance of Edge Computing and may be required in the following scenarios:

• while using a large pre-trained model that may not fit in the hardware accelerator memory,
• when the required inference rate is not achievable even with accelerators, and
• when the embedded device has to perform other tasks along with DL inference.

In this work we focus specifically on DL inference via offloading. We discuss the DL inference offloading process, the state of the art (SoA) and its technology enablers in detail. We implement a DL inference offloading system using a Raspberry Pi 3 [11] (RPi) based robot vehicle and a USB-connected Intel Movidius Neural Compute Stick [13] (NCS), a hardware accelerator for DL inference. We report detailed experimental results, the improvements achieved, and several insights and challenges that we inferred from our experiments. Throughout this article we use the term DNN to refer to any deep neural network such as a CNN [19] or RNN; edge device to refer to network-end embedded systems such as robots, autonomous vehicles and wearable devices; and edge server to refer to the untapped resources at the network edge that can act as offloading servers, e.g. smartphones, gateways and MEC servers [28]. The rest of this paper is organized as follows: Section II discusses DNN execution offloading and its enablers in detail along with the SoA; Section III presents the experimental setup, the experiments performed and the detailed inferences; Section IV concludes the paper.

II. DNN EXECUTION OFFLOADING

In this section we discuss the major technical enablers for accelerating DNN inference by partitioning it between constrained embedded edge devices and edge/Cloud servers.

A. Execution Profiling

Profiling DNN execution time and power requirements is critical for designing effective offloading strategies, as it allows informed decisions about whether or not to offload part of a DNN inference pipeline. The de facto standard in this regard is layer-wise profiling and offloading of DNNs. Profiling the inference process layer by layer for each and every DNN is impracticable, and hence researchers try to build generalized prediction models of execution time, power, etc. for different types of layers on different classes of hardware, as depicted in Fig. 1. Linear regression based models are popular in this context as they are lightweight in terms of system resource consumption and can easily be built from the structure of the network and its configuration parameters.
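To make this standard practice concrete, the following sketch fits such a lightweight per-layer-type latency model using ordinary least squares. It is not code from any of the cited systems; the feature set and the profiled sample values are illustrative assumptions.

```python
# A minimal sketch of layer-wise latency prediction: fit a linear model over
# profiled (features, latency) samples of one layer type on one target device.
# Feature names and values below are illustrative assumptions.
import numpy as np

# Each row: [GFLOPs, input_size_MB, output_size_MB, parameter_count_M]
features = np.array([
    [0.2, 0.6, 0.6, 0.01],
    [1.1, 1.2, 0.9, 0.30],
    [2.3, 2.4, 1.8, 0.60],
    [4.7, 3.1, 2.2, 1.20],
])
latency_ms = np.array([3.1, 11.8, 24.5, 49.0])  # measured on the edge device

# Ordinary least squares with a bias term.
X = np.hstack([features, np.ones((features.shape[0], 1))])
coeffs, *_ = np.linalg.lstsq(X, latency_ms, rcond=None)

def predict_latency_ms(gflops, in_mb, out_mb, params_m):
    """Predict execution time of an unseen layer of the same type."""
    x = np.array([gflops, in_mb, out_mb, params_m, 1.0])
    return float(x @ coeffs)

print(predict_latency_ms(3.0, 2.5, 2.0, 0.8))
```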
Fig. 1: Execution profile and prediction - standard practice. (The figure depicts a DNN with input, convolutional, pooling/non-linearity and fully connected layers; its model parameters, runtime memory, features and target hardware/CPU details are used to run profiling and generate execution time prediction models.)

Neurosurgeon [1] uses configuration parameters such as input and output size, number of neurons and, for convolutional layers, the number of convolution operations across all filters, and represents the layer-wise workload of the different layer types in GFLOPs (giga floating point operations). It uses these models to predict the execution time of any layer of a given type at DNN inference time.
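The following sketch illustrates the kind of workload feature such profiling relies on: the GFLOPs of a single convolutional layer computed from its configuration parameters. The layer dimensions in the example are assumptions for illustration, not values taken from [1].

```python
# Multiply-accumulate workload of one convolutional layer expressed in GFLOPs,
# counting one multiply and one add per weight application across all filters
# and output positions. Dimensions below are illustrative assumptions.
def conv_gflops(out_h, out_w, out_channels, in_channels, k_h, k_w):
    macs = out_h * out_w * out_channels * in_channels * k_h * k_w
    return 2.0 * macs / 1e9

# Example: a 3x3 convolution producing a 56x56x128 output from 64 input channels.
print(conv_gflops(56, 56, 128, 64, 3, 3))  # ~0.46 GFLOPs
```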
In FastDeepIoT [2] the authors show that the relationship between network structure/configuration and execution time is non-linear, and that trivial linear regression models over the entire model are unable to predict execution time accurately. The sources of these non-linear relationships are code-level optimizations in DNN inference libraries (data alignment, parallel operations, loop unrolling, etc.) and memory/cache access patterns. To solve this problem, the authors use a tree-structured linear regression model that divides the network into discrete regions, where within each region standard linear regression can be used to predict DNN layer execution time. The independent variables are formed from FLOPs; input, output and intermediate memory; and the number of model parameters, for fully connected, convolution and recurrent layer types.
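As a simplified, depth-one illustration of this tree-structured idea (not the actual FastDeepIoT implementation), the sketch below splits profiled layers into two regions by an assumed memory threshold and fits an independent linear model in each region. The split feature, threshold and sample data are illustrative assumptions.

```python
# Depth-one "tree-structured" regression: split by a feature threshold, then
# fit a separate ordinary-least-squares model inside each region.
import numpy as np

def fit_linear(X, y):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coeffs

def predict_linear(coeffs, x):
    return float(np.append(x, 1.0) @ coeffs)

# Profiled samples: [GFLOPs, intermediate_memory_MB] -> latency_ms (illustrative)
X = np.array([[0.1, 0.5], [0.4, 0.8], [0.9, 1.5], [2.5, 6.0], [4.0, 9.5], [6.0, 14.0]])
y = np.array([2.0, 5.5, 10.0, 48.0, 80.0, 125.0])

MEM_SPLIT_MB = 4.0  # assumed threshold where cache behaviour changes
small = X[:, 1] < MEM_SPLIT_MB
models = {True: fit_linear(X[small], y[small]),
          False: fit_linear(X[~small], y[~small])}

def predict_latency_ms(gflops, mem_mb):
    region = mem_mb < MEM_SPLIT_MB
    return predict_linear(models[region], np.array([gflops, mem_mb]))

print(predict_latency_ms(1.5, 2.0), predict_latency_ms(5.0, 12.0))
```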
In [4] the model size is added to the standard explanatory variables in this context, and regression equations for layer latency prediction are provided.

In [7] computation and Bluetooth communication latency and energy are profiled for wearable devices and smartphones. The prediction models are built using decision trees and linear regression. In that work the smartphone plays the role of the offloading server.

In [6] the authors present extensive results and insights on running CNN and fully connected networks on different SoCs used in mobile phones; however, they do not build any prediction models of execution time and power consumption.

B. Partitioning Strategy for Offloading

DNN partitioning is an important enabler for offloaded execution. Partition points in a DNN data flow graph can be determined with the target of energy and/or latency minimization. As shown in Fig. 2, partitioning can be targeted for the Edge-Cloud hierarchy.

Fig. 2: Partitioning points targeted for Edge-Cloud hierarchy.

The standard approach for layer-wise partitioning is to predict the execution latency, data communication latency and energy for each layer of a DNN and to find the optimum point that minimizes them. As this decision is made at the data source, i.e., the constrained edge device, lightweight optimization algorithms are required.

In [8] the authors propose a design guideline of partitioning a CNN at the end of the last convolutional layer, where the data communication requirement is low, so that the fully connected layers, which require high run-time memory, are executed on the server side.

In [9] models trained using BranchyNet [10] are used for implementing the partitioning. To implement such partitioning in real systems, the computation graphs of common run-time libraries like TensorFlow and Caffe need to be partitioned in any case.

The standard partitioning strategy used in [1], [7], [9] is to compute the cumulative latency/energy for each candidate partition at each layer level and perform a linear search to pick the minimum among them, as sketched below.
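A minimal sketch of this linear-search strategy follows; the per-layer latencies, activation sizes and uplink bandwidth are illustrative assumptions, not profiled values from our setup.

```python
# Linear search over candidate split points: for a split after layer i, sum the
# device-side latency of the first i layers, the time to upload the activation
# produced at the split, and the server-side latency of the remaining layers.
def best_partition(device_ms, server_ms, act_bytes, uplink_bps):
    """Return (split_index, total_ms); split 0 = full offload,
    split len(device_ms) = run everything on the edge device."""
    n = len(device_ms)
    best = (None, float("inf"))
    for split in range(n + 1):
        edge = sum(device_ms[:split])
        server = sum(server_ms[split:])
        if split == n:
            upload = 0.0                       # nothing is offloaded
        else:
            # activation fed into layer `split` (the raw frame when split == 0)
            upload = act_bytes[split] * 8.0 / uplink_bps * 1000.0
        total = edge + upload + server
        if total < best[1]:
            best = (split, total)
    return best

device_ms = [40.0, 35.0, 60.0, 120.0]           # per-layer latency on the edge device
server_ms = [4.0, 3.5, 6.0, 12.0]               # per-layer latency on the edge server
act_bytes = [600_000, 150_000, 80_000, 40_000]  # activation size at each split point
print(best_partition(device_ms, server_ms, act_bytes, uplink_bps=8_000_000))
```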
III. EXPERIMENTS ON DNN EXECUTION OFFLOADING

A. Experimental Setup

For our experiments we use two types of configurations. In the first configuration (I), a Raspberry Pi 3 [11] is used as the edge device, and smartphones, laptops and an Nvidia Jetson TX2 [12] are used as edge servers (Table I). In the second configuration (II), a Raspberry Pi 3 [11] with an Intel Movidius Neural Compute Stick [13], a hardware accelerator for DNNs, is used as the edge device, and an Nvidia Jetson TX2 and Intel Xeon based server class machines are used as edge servers (Table II). FLOPs benchmarking is done using LINPACK [14] and VFP Bench (iOS). We have observed a close correlation between the FLOPs benchmark and the DNN execution profile in most cases. However, for fine-grained profiling we use the techniques mentioned in Section II-A.

TABLE I: Edge Device and Edge Server configurations-I

Edge Device | GFLOPs | Edge Server | GFLOPs
Rpi-3b | 2 | Rpi-3b | 4
Rpi-3b | 2 | iPhone-6 | 10
Rpi-3b | 2 | Xeon-E3-1246 PC | 20
Rpi-3b | 2 | Jetson Tx2 | 50

TABLE II: Edge Device and Edge Server configurations-II

Edge Device | GFLOPs | Edge Server | GFLOPs
Rpi-3b + NCS | 36 | Intel i7-7500U PC | 44
Rpi-3b + NCS | 36 | Xeon-E5-2695 PC | 130
Rpi-3b + NCS | 36 | Xeon-E5-2696 PC | 330
We selected three different pre-trained models for our experiments, listed in Table III along with the edge-device-only execution time in the two configurations. Of these three models, Squeezenet [15] is the fastest and is suitable for running inference on constrained edge devices (1.24M params); Inception V3 [16] is the most sophisticated, a 311-layer model (23.83M params) requiring huge system resources; and AlexNet [17] is a mid-sized CNN with 25 layers.

TABLE III: DNNs used for Benchmarking

Net Name | No. of Layers | Exec. Time (ms), Rpi 3b | Exec. Time (ms), Rpi 3b + NCS
Squeezenet | 69 | 2032 | 48
AlexNet | 25 | 3200 | 91
Inception-v3 | 311 | 7000 | 326
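The sketch below shows one framework-agnostic way edge-device-only timings of the kind reported in Table III can be collected: wrap whatever inference call the runtime exposes in a warm-up-aware timing loop. The run_inference stub is a placeholder assumption, not the actual accelerator or framework API used in the experiments.

```python
# Average per-frame inference time and FPS over a batch of frames, discarding a
# few warm-up runs (caches, lazy allocation). run_inference is a stand-in stub.
import time

def run_inference(frame):
    time.sleep(0.05)            # placeholder for a real forward pass (~50 ms)
    return "label"

def profile_model(frames, warmup=3):
    for f in frames[:warmup]:
        run_inference(f)
    start = time.perf_counter()
    for f in frames[warmup:]:
        run_inference(f)
    elapsed = time.perf_counter() - start
    per_frame_ms = 1000.0 * elapsed / max(1, len(frames) - warmup)
    return per_frame_ms, 1000.0 / per_frame_ms

frames = [object()] * 20        # dummy frames; real use would pass camera images
print("%.1f ms/frame, %.1f FPS" % profile_model(frames))
```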
B. Benchmark Results and Discussion

We implement a linear search algorithm to find the optimum partition point based on the summation of the edge device execution latency, the edge server execution latency and the data communication latency, using the SoA discussed in Section II-A. We predict the partition points, use them to make the offloading decision for each of the CNNs, and measure the resulting inference rate. In our experiments we emulate different interconnection network speeds on top of a WiFi network.
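One possible way to emulate different interconnection speeds on top of a WiFi link is Linux traffic shaping with tc; the snippet below is an assumption about tooling rather than a description of the exact emulation method used here, and the interface name and rates are illustrative. The commands require root privileges.

```python
# Throttle the egress rate of a wireless interface with a token-bucket filter,
# so offloading benchmarks can be repeated at several emulated uplink speeds.
import subprocess

IFACE = "wlan0"  # assumed wireless interface name on the Raspberry Pi

def limit_uplink(rate="1mbit"):
    """Install a token-bucket filter capping the egress rate of IFACE."""
    subprocess.run(["tc", "qdisc", "add", "dev", IFACE, "root",
                    "tbf", "rate", rate, "burst", "32kbit", "latency", "400ms"],
                   check=True)

def clear_limit():
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)

if __name__ == "__main__":
    limit_uplink("512kbit")   # e.g. repeat the benchmark at 512 Kbps
    try:
        pass                  # run the offloaded-inference benchmark here
    finally:
        clear_limit()
```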
The AlexNet CNN is always partitioned at the first pooling layer after the first convolution in both configurations, i.e., with the slow RPi as well as with the much faster RPi with an NCS connected via USB (refer to Fig. 3). The base inference rate for AlexNet on the RPi is 0.3 fps in configuration-I, which increases to around 3 fps with offloading. Though the edge device inference rate in configuration-II is around 11 fps, it also gets a six-fold improvement due to offloading. The inference rates shown in the results are not frame processing rates, which are much slower because of other activities in a video object detection setup, such as getting image frames from the RPi camera, collecting the output, and generating bounding boxes and displaying the frames.

The trend for Squeezenet partitioning (refer to Fig. 4) is also towards the first pooling layer, with the only exception occurring when the feature map upload time is very high, in which case the partition point is chosen at the fire5 module. The base inference rate for Squeezenet on the RPi is 0.5 fps in configuration-I, which gets a 4x improvement due to offloading. The RPi+NCS configuration, representing a high-end edge device, gets a 3x improvement.

The Inception V3 CNN (refer to Fig. 5) in configuration-I shows a trend of full offloading if the bandwidth and the processing power of the edge server are high. In all other cases the partitions happen after Mixed_5b in the RPi3-only setup and after Conv2d_2a in the RPi3 + NCS setup. In configuration-II this model has a base inference rate of 3 fps and gets around an 8x improvement. From this we conclude that the larger the model, the more scope there is for overall latency reduction through offloading.

From our experiments we infer that if the edge server is fast (two to ten times faster, which is often the case) and the network bandwidth is more than 16 Kbps, offloading will always improve performance. We observe that the effective bandwidth for uploading the feature maps from the RPi was quite slow, ranging from 8 Kbps to at most 1 Mbps. This is in sharp contrast to the standard uplink speeds of 4G and WiFi connections. The connection setup time and the power management of the wireless transceivers seem to be the reasons for this low effective speed. From this we can draw the following conclusions:

• If the effective upload speeds are high, as envisaged in the 5G application scenarios [18] - Enhanced Mobile Broadband (eMBB), Massive Machine Type Communication (mMTC) and Ultra-reliable and Low Latency Communications (uRLLC) - DNN partitioning will possibly be required only in case of network outage or unavailability; in all other cases DNN execution will be offloaded fully to a one-hop connected edge server.
• There is a need to transform the DNN inference pipeline into stream processing to gain the high uplink bandwidth advantage. In its current form, DNN inference acts on a single frame grabbed from the input video camera.

In our experimental setup, shown in Fig. 6, we perform an object recognition test on a Raspberry Pi 3 + NCS based robot vehicle developed by us. This vehicle navigates and simultaneously recognizes objects. We use the partition point based offloading scheme on this device but achieve only around 5 fps (refer to Fig. 7) instead of the expected 15 fps inference rate observed in our earlier experiments. On checking the system load details we observe that the navigation algorithms take up CPU and memory, increasing the latency of the partial DNN inference pipeline. This is common in most embedded devices, which follow a sense-analyze-respond cycle. Based on this, we recommend a design guideline of taking into account the dynamic system load of the edge device, along with the predicted DNN execution latency, while choosing the partition points, which none of the earlier systems do.
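A minimal sketch of this guideline is shown below: the predicted idle-time per-layer device latencies are inflated by the currently observed CPU load before the partition search is run. The scaling model and the use of the psutil package are illustrative assumptions, not a calibrated method from this work.

```python
# Scale idle-time per-layer latency predictions by current CPU pressure before
# re-running the partition search (e.g. the linear-search routine sketched at
# the end of Section II). Requires the third-party psutil package.
import psutil

def load_scaled_device_ms(device_ms, sample_interval=0.5):
    """Inflate per-layer device latency predictions under system load."""
    cpu_busy = psutil.cpu_percent(interval=sample_interval) / 100.0
    # Assumed model: the DNN pipeline needs roughly the remaining CPU headroom.
    slowdown = 1.0 / max(0.05, 1.0 - cpu_busy)
    return [t * slowdown for t in device_ms]

device_ms = [40.0, 35.0, 60.0, 120.0]   # idle-time predictions (illustrative)
print(load_scaled_device_ms(device_ms))
```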
Fig. 3: Alexnet - Partition point selection and corresponding inference rate. (Four panels: (a) Configuration-I partition point selection, (b) Configuration-I inference rate, (c) Configuration-II partition point selection, (d) Configuration-II inference rate; plotted against bandwidth in Mbps and server-to-edge capacity ratio.)

IV. CONCLUSION

In this paper we implemented an execution offloading system for the Deep Learning inference phase. We experimented with an embedded System on Chip along with a DL inference accelerator and demonstrated that intelligent offloading can be useful both for resource-constrained embedded systems and for more capable edge devices. We discussed the state of the art in this research area and outlined several challenges and insights that may result in better system designs for DNN inference offloading. In future we would like to design an offloading system that takes care of the dynamic system load of Internet of Things devices that follow a sense-analyze-actuate cycle.
Fig. 4: Squeezenet - Partition point selection and corresponding inference rate. (Four panels showing partition point selection and inference rate for Configuration-I and Configuration-II, plotted against bandwidth in Mbps and server-to-edge capacity ratio.)

Fig. 5: Inception V3 - Partition point selection and corresponding inference rate. (Four panels: (a) Configuration-I partition point selection, (b) Configuration-I inference rate, (c) Configuration-II partition point selection, (d) Configuration-II inference rate; plotted against bandwidth in Mbps and server-to-edge capacity ratio.)

Fig. 6: Raspberry Pi3 with NCS based robot vehicle that navigates and recognises objects.

Fig. 7: Inference rate for object recognition by the navigating robot vehicle (inference rate in FPS vs. bandwidth in Mbps / upload time).

REFERENCES

[1] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang, "Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge," SIGARCH Comput. Archit. News 45, 1 (April 2017), 615-629. DOI: https://doi.org/10.1145/3093337.3037698, 2017.
[2] Shuochao Yao, Yiran Zhao, Huajie Shao, ShengZhong Liu, Dongxin Liu, Lu Su, and Tarek Abdelzaher, "FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time on Mobile and Embedded Devices," in Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems (SenSys '18), Gowri Sankar Ramachandran and Bhaskar Krishnamachari (Eds.). ACM, New York, NY, USA, 278-291. DOI: https://doi.org/10.1145/3274783.3274840, 2018.
[3] Qingqing Cao, Niranjan Balasubramanian, and Aruna Balasubramanian, "MobiRNN: Efficient Recurrent Neural Network Execution on Mobile GPU," in Proceedings of the 1st International Workshop on Deep Learning for Mobile Systems and Applications (EMDL '17). ACM, New York, NY, USA, 1-6. DOI: https://doi.org/10.1145/3089801.3089804, 2017.
[4] En Li, Zhi Zhou, and Xu Chen, "Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy," in Proceedings of the 2018 Workshop on Mobile Edge Communications (MECOMM'18). ACM, New York, NY, USA, 31-36. DOI: https://doi.org/10.1145/3229556.3229562, 2018.
[5] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong, "Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks," in Proceedings of the 35th International Conference on Computer-Aided Design (ICCAD '16). ACM, New York, NY, USA, Article 12, 8 pages. DOI: https://doi.org/10.1145/2966986.2967011, 2016.
[6] Nicholas D. Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar, "An Early Resource Characterization of Deep Learning on Wearables, Smartphones and Internet-of-Things Devices," in Proceedings of the 2015 International Workshop on Internet of Things towards Applications (IoT-App '15). ACM, New York, NY, USA, 7-12. DOI: http://dx.doi.org/10.1145/2820975.2820980, 2015.
[7] Mengwei Xu, Feng Qian, and Saumay Pushp, "Enabling Cooperative Inference of Deep Learning on Wearables and Smartphones," arXiv preprint arXiv:1712.03073, 2017.
[8] J. H. Ko, T. Na, M. F. Amir, and S. Mukhopadhyay, "Edge-host partitioning of deep neural networks with feature space encoding for resource-constrained internet-of-things platforms," arXiv preprint arXiv:1802.03835, 2018.
[9] E. Li, Z. Zhou, and X. Chen, "Edge intelligence: On-demand deep learning model co-inference with device-edge synergy," in Proceedings of the 2018 Workshop on Mobile Edge Communications, MECOMM@SIGCOMM 2018, Budapest, Hungary, pp. 31-36, August 20, 2018.
[10] Surat Teerapittayanon, Bradley McDanel, and H. T. Kung, "BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks," arXiv preprint arXiv:1709.01686, 2017.
[11] Raspberry Pi 3: Specs, benchmarks & testing, https://www.raspberrypi.org/magpi/raspberry-pi-3-specs-benchmarks
[12] Jetson TX2, https://elinux.org/Jetson_TX2.
[13] Intel Movidius Neural Compute Stick, https://movidius.github.io/ncsdk/ncs.html.
[14] Jack Dongarra, "The LINPACK Benchmark: An Explanation," in Proceedings of the 1st International Conference on Supercomputing, Elias N. Houstis, Theodore S. Papatheodorou, and Constantine D. Polychronopoulos (Eds.), Springer-Verlag, London, UK, 456-474, 1987.
[15] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size," arXiv preprint arXiv:1602.07360, 2016.
[16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision," arXiv preprint arXiv:1512.00567, 2015.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'12), F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), Vol. 1. Curran Associates Inc., USA, 1097-1105, 2012.
[18] "China Unicom Edge Computing Technology White Paper," https://builders.intel.com/docs/networkbuilders/china-unicom-edge-computing-technology-white-paper.pdf.
[19] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '14). IEEE Computer Society, Washington, DC, USA, 512-519. DOI: http://dx.doi.org/10.1109/CVPRW.2014.131, 2014.
[20] Y. Bengio, "Learning Deep Architectures for AI," Now Publishers Inc., Hanover, MA, USA, 2009.
[21] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[22] S. Bhattacharya and N. D. Lane, "Sparsification and separation of deep learning layers for constrained resource inference on wearables," in Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM, pages 176-189. ACM, 2016.
[23] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, "MCDNN: an approximation-based execution framework for deep stream processing under resource constraints," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys '16), pages 123-136, 2016.
[24] R. Krishnamoorthi, "Quantizing deep convolutional networks for efficient inference: A whitepaper," arXiv preprint arXiv:1806.08342, 2018.
[25] Antonio Polino, Razvan Pascanu, and Dan Alistarh, "Model compression via distillation and quantization," arXiv preprint arXiv:1802.05668, 2018.
[26] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[27] Kamel Abdelouahab, Maxime Pelcat, Jocelyn Serot, and François Berry, "Accelerating CNN inference on FPGAs: A Survey," arXiv preprint arXiv:1806.01683, 2018.
[28] "Multi-access Edge Computing," ETSI, http://www.etsi.org/technologies-clusters/technologies/multi-access-edge-computing.