
Predictive Modeling of Run Time for Model and Data Distributed Inferencing Using Gradient Boosting Regression
Scott Brown∗† , David Harman∗† , Cleon Anderson∗† , Matthew Dwyer∗ ,
∗ DEVCOM Army Research Laboratory, Army Research Directorate, MD USA
† Parsons Corporation, Aberdeen Proving Ground, MD USA

{david.c.harman10, scott.e.brown96, cleon.r.anderson}.ctr@army.mil,


{matthew.r.dwyer7,

Abstract—In the realm of Internet of Things (IoT) systems, accurately forecasting the runtime of inferencing models on heterogeneous devices is paramount for optimizing resource allocation, particularly in the contexts of Model Distributed Inferencing (MDI) and Data Distributed Inferencing (DDI). This paper delves into the application of Gradient Boosting Regression (GBR) as a predictive modeling technique for estimating runtime in both MDI and DDI scenarios. GBR presents an equitable trade-off between interpretability, robustness against noise, and suitability for moderately sized datasets. The study reviews previous research on IoT inference optimization. It accentuates the multifaceted intricacies of device diversity and the significance of model interpretability within MDI and DDI setups. The primary contribution of this research is the novel application of the GBR model to predict machine learning inferencing runtime in MDI and DDI contexts. This approach is invaluable when empirical data is limited and characterizing the behavior of newly introduced devices is imperative. The paper elaborates on the GBR algorithm’s utilization, hyperparameters, and custom loss functions tailored explicitly for MDI and DDI. The results section exemplifies GBR’s performance across various computational regimes, including MDI and DDI. It offers insights into the model’s balance between accuracy and complexity. A performance comparison with prior models underscores GBR’s efficacy in predicting runtime in MDI and DDI. This work contributes to the ongoing discourse on IoT optimization and predictive modeling.

I. Introduction

A. Background and Motivation

Many Internet of Things (IoT) devices face resource constraints due to their small size. With the growing prevalence of these devices, innovative strategies for conducting inferencing on resource-constrained IoT devices are becoming increasingly imperative. While cloud computing often serves as a means to offload resource-intensive computations, there are situations where this approach may not be feasible, particularly in high-stakes scenarios. In addition to providing latency and scalability, edge computing must accommodate resource-intensive demands such as DNN and coordination among edge devices in heterogeneous processing environments and dynamic network conditions [1]. In response to this challenge, innovative solutions have emerged, notably Model Distributed Inferencing (MDI) and Data Distributed Inferencing (DDI), which distribute the computational workload across multiple devices at the network’s edge. The effectiveness of these approaches hinges on two critical factors: the accuracy of the inferencing results and the time required to execute these tasks.

Both MDI and DDI processes necessitate the exchange of data between devices. Given the diversity in resources among IoT devices, it becomes paramount to comprehensively understand the inferencing run-time of each device under varying bandwidth constraints. Additionally, gaining insights into the resource cost associated with data offloading between nodes is essential for assessing the trade-offs in deploying MDI or DDI. It is important to acknowledge that modeling this process for all possible operational scenarios is impractical. Therefore, there is a pressing need to develop a method capable of estimating trade-offs beyond benchmarking data.

One of the challenges in creating such models lies in the inherent noise and intricate nonlinearity present in the data derived from benchmarking exercises. Conventional linear models, such as multiple regression, are ill-suited for handling these complexities, as they struggle to generalize effectively. One viable approach to mitigating this issue is to employ Ridge Regression, wherein the regression coefficients are penalized. This regularization technique significantly enhances the model’s predictive capabilities. However, it comes at the cost of interpretability, a
critical element in comprehending how different sub-variants interact within the system. In the context of this research study, the attainment of requisite granularity for neural network processing necessitates the execution of multiple benchmarking iterations, rendering this approach operationally cumbersome.

To address this challenge, we turn to boosting. Boosting, a prominent machine learning technique, combines the predictions generated by multiple individual models, typically characterized as weak learners, into a prediction model of heightened accuracy and robustness. The underlying principle is the sequential training of a series of models, each directed towards correcting the errors of its predecessor. This sequential ensemble of models collectively produces the final predictions. In the present study, Gradient Boosting Regression (GBR) is used to predict run-time duration. Several intrinsic attributes of GBR contribute to its selection as the predictive model, including interpretability, resilience to noise, ability to capture non-linear relationships within data, and suitability for handling modestly sized datasets. This choice balances predictive precision against model complexity. It is particularly advantageous when data availability is constrained and when understanding how a prediction was reached matters.

B. Research Objectives
C. Contribution of the Study

II. Related Work

A. Review of Previous Research on IoT Inference Optimization

This section examines methods for predicting run times on IoT devices and explores DDI/MDI predictions. Effective task scheduling in heterogeneous device environments necessitates thoroughly considering resource disparities among such devices, as highlighted by [2]. It is imperative to possess in-depth knowledge of these diverse devices’ architectures, capabilities, and, in some cases, energy efficiency profiles to optimize their performance.

[3], for example, predict both power consumption and performance for applications that run on heterogeneous computing systems. They employ a multifaceted approach to achieve this, including identifying machine-independent application phases via offline benchmark analysis, using neural network models to analyze intricate cross-platform relationships, and integrating performance counter measurements during run-time to increase prediction accuracy.

Interpretability is a crucial consideration when it comes to predictive modeling of IoT devices. Gradient-boosting models offer easily understandable results, particularly those based on gradient-boosted decision trees. This interpretability is valuable for gaining insights into the factors influencing predictions, such as run-time in the work of [3]. Their use of neural networks can pose challenges in interpretability [4], making it less straightforward to discern the critical determinants of run-time. In addition, gradient-boosting models inherently produce feature importance scores [5], which aid in identifying influential factors within applications and architectures. These models exhibit robustness [6] to outliers within datasets, gracefully handling extreme cases without necessitating extensive data preprocessing. This resilience is especially relevant when dealing with smaller datasets [7], as gradient boosting models often yield reliable results without the need for intricate hyperparameter tuning, simplifying the implementation process compared to the more complex and data-demanding nature of neural networks.

[8] measured the performance of specific hardware devices when running machine learning models. They considered various metrics, including power consumption, inference time, and accuracy, providing insights into how different devices perform under various conditions and workloads. This approach can be valuable for selecting the most appropriate hardware for specific AI tasks. In contrast, we use a gradient-boosting regressor to estimate run-time on heterogeneous devices, creating a predictive model based on historical data. Our approach relies on statistical and machine learning techniques to make run-time predictions without directly measuring the devices’ performance. This approach is practical when empirical data is limited and new devices entering an ecosystem require characterizing.

B. Existing Predictive Modeling Approaches
C. Gap in the Literature

III. Methodology

In this section, we will discuss the methodology for predicting run time using the Gradient Boosting Regressor algorithm.
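As orientation for the subsections that follow, the overall workflow (column selection, one-hot encoding of 'dist_type', the bandwidth filter, a scikit-learn GBR configured with the Table II settings, and five-fold cross-validation reporting MAE and R²) might be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: the data are synthetic, the run-time formula and its coefficients are invented for illustration, and the quantile level `alpha=0.5` is assumed because Table II lists a quantile loss without stating the level. Note that Section III-C describes MSE as the objective while Table II lists a quantile loss; the sketch follows Table II.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the benchmark records (assumption: not the paper's data).
rng = np.random.default_rng(0)
n_rows = 400
df = pd.DataFrame({
    "bandwidth": rng.uniform(1, 100, n_rows),         # Mbps
    "nodes": rng.integers(2, 5, n_rows),              # 2, 3, or 4 compute nodes
    "dist_type": rng.choice(["MDI", "DDI"], n_rows),
})
# Hypothetical run-time curve: transfer cost falls with bandwidth, plus noise.
df["run_time_per_sec"] = (
    50.0 * df["nodes"] / df["bandwidth"]
    + np.where(df["dist_type"] == "MDI", 5.0, 2.0)
    + rng.normal(0.0, 0.5, n_rows)
)

# Data filtering: keep only records with bandwidth <= 40.
df = df[df["bandwidth"] <= 40]

# Feature selection and one-hot encoding of the categorical 'dist_type' column.
X = pd.get_dummies(
    df[["bandwidth", "nodes", "dist_type"]], columns=["dist_type"], dtype=float
)
y = df["run_time_per_sec"]

# Hyperparameters from Table II; alpha=0.5 (the median) is an assumption.
model = GradientBoostingRegressor(
    loss="quantile", alpha=0.5,
    learning_rate=0.01, n_estimators=100, subsample=0.5,
    criterion="friedman_mse", min_samples_split=2, min_samples_leaf=1,
    max_depth=2, random_state=0,
)

# Five-fold cross-validation reporting MAE and R-squared.
scores = cross_validate(
    model, X, y, cv=5, scoring=("neg_mean_absolute_error", "r2")
)
mae = -scores["test_neg_mean_absolute_error"].mean()
r2 = scores["test_r2"].mean()
print(f"MAE: {mae:.2f}  R^2: {r2:.2f}")
```

The scores printed by this sketch say nothing about the testbed results reported later; they only show how the pieces fit together. With a learning rate of 0.01 and 100 estimators the ensemble moves slowly toward the target, so on other data these settings may underfit.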
A. Testbed and Experimental Setup

1) Testbed Structure: The testbed encompasses devices with varying computational capabilities, ranging from high-performance nodes equipped with NVIDIA V100 GPUs and 8 CPU cores (representing server offload scenarios) to CPU-only devices with 2 to 6 CPU cores and memory capacities ranging from 2GB to 8GB at the bottom of the network hierarchy. Table I provides a detailed description of these device classes.

TABLE I: Description of Node Classes in the Testbed

Node Class   Cores   Memory (GB)
T3           6       8
T2           4       4
T1           2       2

2) Network Topology: Devices within the testbed are interconnected using a software-defined network (SDN). This SDN enables precise configuration of the network, including control over bandwidth between devices. This flexibility allows us to investigate distributed performance under various bandwidth constraints.

3) Bandwidth Manipulation: To evaluate the impact of varying bandwidth on inference performance, we perform bandwidth manipulation. Data throttling is implemented through the SDN by specifying ports for throttling. Bandwidths ranging from 1 Mbps (representing severely limited networks) to 100 Mbps (considered unconstrained for testing) are targeted.

4) Experiments: Our experiments are categorized into three main device types: CPU-only, GPU-only, and mixed CPU/GPU devices. Each category is further divided into subcategories based on bandwidth constraints. This experimental setup extends the original work [5] by introducing heterogeneous devices and comparing multiple distributed inferencing techniques.

B. Data Preprocessing

Data preprocessing involves cleaning and transforming the dataset for modeling. We perform the following steps:

1) Feature Selection: We begin by selecting the pertinent columns for our analysis, which include 'bandwidth', 'nodes', 'run_time_per_sec', and 'dist_type' (MDI/DDI).
2) Categorical Encoding: To facilitate our analysis, we employ one-hot encoding to transform the 'dist_type' column, converting categorical data into a numerical format.
3) Data Filtering: To focus our study on relevant cases, we filter the dataset, retaining only those records where the bandwidth values are less than or equal to 40.

C. Predictive Modeling with Gradient Boosting Regressor

In this study, we employ a Gradient Boosting Regressor (GBR) [9] to predict the run-time ($RT$) of specific computational tasks. The predictive model is constructed as an ensemble of decision trees, where each tree contributes to the final prediction in a weighted manner. The goal is to minimize the mean squared error (MSE) between the predicted and actual run-time values.

Let $n$ denote the number of observations in our dataset, each characterized by a tuple $(BW_i, Node_i, DeviceType_i, RT_i)$, where $BW_i$ represents bandwidth, $Node_i$ is the node identifier, $DeviceType_i$ signifies the device type, and $RT_i$ is the actual run-time.

The GBR model is expressed as:

$$RT(x) = \sum \left[ \alpha \cdot f(x, \theta) \right]$$

Here:
$RT(x)$ denotes the predicted run-time for an input vector $x$, comprising bandwidth, node, and device type.
$\alpha$ represents the weight assigned to each decision tree in the ensemble.
$f(x, \theta)$ corresponds to an individual decision tree parameterized by $\theta$.

The primary objective is to minimize the MSE, defined as:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left[ (RT_i - RT(x_i))^2 \right]$$

This expression averages the squared discrepancies between the actual run-time values ($RT_i$) and the model’s predictions ($RT(x_i)$) over all observations.

IV. Training Process

The GBR model is trained iteratively. In each iteration, a new decision tree is fitted to the negative gradient of the MSE loss function with respect to the current ensemble’s predictions. This iterative process adjusts the model to minimize errors by incorporating decision trees that correct the residuals of the previous predictions.

The final GBR model is characterized by:

$$RT(x) = \alpha_1 \cdot f_1(x, \theta_1) + \alpha_2 \cdot f_2(x, \theta_2) + \ldots + \alpha_k \cdot f_k(x, \theta_k)$$
Here:
$\alpha_1, \alpha_2, \ldots, \alpha_k$ signify the weights associated with each decision tree.
$f_1(x, \theta_1), f_2(x, \theta_2), \ldots, f_k(x, \theta_k)$ are individual decision trees parameterized by $\theta$.

Our GBR algorithm iteratively optimizes both the weights and parameters of these decision trees during training, leading to an ensemble model that minimizes the MSE and offers accurate run-time predictions based on the predictor variables ($BW$, $Node$, $DeviceType$).

In our experiments, we employed the scikit-learn library [10] for gradient boosting regression. The hyperparameters chosen for our GBR model are detailed in Table II. These settings were selected to balance model complexity and predictive performance for our specific regression task.

TABLE II: Gradient Boost Hyperparameters

Hyperparameter         Value                         Description
Learning Rate          0.01                          Step size for gradient boosting
Number of Estimators   100                           Number of boosting stages
Loss Function          Quantile                      Optimization loss function
Subsample              0.5                           Fraction of data used for training
Criterion              Friedman Mean Squared Error   Split quality criterion
Min Samples Split      2                             Min samples required for a split
Min Samples Leaf       1                             Min samples required for a leaf node
Max Depth              2                             Max depth of individual trees

The learning rate (0.01) determines the step size for gradient boosting, and 100 estimators were used to construct a robust ensemble. We employed the Quantile loss function to cater to our specific quantile regression task. A subsample fraction of 0.5 was used to introduce randomness and enhance model generalization.

Furthermore, the criterion for assessing the quality of splits during tree construction was the Friedman Mean Squared Error, and we specified a minimum of 2 samples required for a node to be split further, with a minimum of 1 sample required for leaf nodes. To control model complexity, individual trees in our ensemble had a maximum depth of 2 [9].

1) Algorithm Overview: The Gradient Boosting Regressor is an ensemble learning method that builds an additive model in a forward stage-wise manner. It combines the predictions of multiple base estimators (decision trees) to improve predictive accuracy.

2) Hyperparameter Tuning: We perform hyperparameter tuning using a randomized search approach. The following hyperparameters are explored during the search:

TABLE III: GBR Hyperparameters Search

Hyperparameter          Value Range
Loss Function           Custom
Learning Rate           Uniform distribution between 0.001 and 0.299
Number of Estimators    Random integer between 100 and 999
Subsample               Uniform distribution between 0.5 and 0.999
Criterion               Friedman Mean Squared Error, Squared Error
Minimum Samples Split   Random integer between 1 and 9
Minimum Samples Leaf    Random integer between 1 and 3
Maximum Depth           Random integer between 1 and 2

The randomized search is performed with cross-validation, and the best hyperparameters are selected. We created custom loss functions to assess the GBR model performance based on specific criteria, enhancing the depth of evaluation. These custom loss functions include:

TABLE IV: Custom Loss Functions

Loss Function                                   Description
Custom Percentage Error Loss (custom_pe_loss)   Quantifies the percentage error between predicted and actual values, providing insights into error magnitudes.
Custom Huber Loss (custom_huber_loss)           Combines quadratic and linear loss components to effectively handle outliers in the data.
Custom Weighted Loss (custom_weighted_loss)     Allows for the assignment of distinct weights to errors based on their significance, accommodating scenarios where specific errors hold more weight.
Custom Outlier Loss (custom_outlier_loss)       Designed to identify errors that exceed a specified threshold as outliers, proving valuable when handling exceptionally deviant observations.
Custom Quantile Loss (custom_quantile_loss)     Tailored for quantile regression analysis, this function evaluates quantile-specific errors, making it suitable for various quantile-related investigations.

The custom loss functions introduced herein serve as integral components in the process of model evaluation and selection. The custom percentage error loss function quantifies percentage errors, enabling a nuanced understanding of error magnitudes within the model assessments. In contrast, the custom Huber loss function combines quadratic and linear loss components, bolstering the model’s resilience to outliers, a critical attribute when confronted with extreme data points. Furthermore, the custom weighted loss function allows the assignment of varying error weights, accommodating scenarios wherein specific errors bear more substantial significance than others. The custom outlier loss plays a crucial role in the identification and mitigation of errors surpassing predetermined thresholds, particularly advantageous when addressing outliers. Lastly,
the custom quantile loss function is tailored for quantile regression, evaluating quantile-specific errors and adapting to a range of quantile-related investigations. Collectively, these functions provide a holistic evaluation of gradient boosting models, embracing diverse loss criteria and ensuring their adaptability across a range of data characteristics. This systematic approach improves the prospects of selecting the gradient boosting model best suited to specific datasets and research objectives. Furthermore, the careful documentation of results and custom loss functions adds transparency and supports the reproducibility of the research.

A. Evaluation Metrics

In the evaluation of our model’s performance, we employ the following metrics:

Mean Absolute Error (MAE): This metric quantifies the average absolute difference between the predicted and actual run times, providing a measure of the model’s accuracy.
R-squared (R²) Score: The R² score is utilized to gauge the proportion of variance in the run times that can be predicted by our model, elucidating its predictive capability.

Cross-Validation Analysis

To rigorously assess our model’s generalization performance, we adopt a k-fold cross-validation methodology. Specifically, we employ a five-fold cross-validation approach (k=5) to scrutinize the model’s robustness and performance across distinct data subsets.

V. Experimental Results

In this section, we present the experimental results obtained from the Gradient Boosting Regressor model.

A. Description of the Dataset

The dataset used for experimentation consists of records with columns 'bandwidth', 'nodes', 'run_time_per_sec', and 'dist_type'. It was preprocessed as described in the Data Preprocessing section.

B. Predictive Modeling Results

In Table V, we summarize the results of training a gradient-boosting regressor model on our dataset:

TABLE V: Performance Metrics of Gradient Boosting Regressor

Compute Nodes   Mean Absolute Error (MAE)   R-Squared
2               8.76                        0.99
3               118.78                      0.95
4               125.83                      0.98

The table presents the performance metrics for a gradient-boosting regressor applied to an MDI and DDI dataset. The regression model utilized bandwidth, compute nodes, and bandwidth reserved as predictor variables to estimate the response variable, the runtime of inferencing on IoT devices.

Compute Nodes: This column represents the number of compute nodes employed in the computational tasks. As a critical factor in parallel computing, compute nodes profoundly impact runtime; consequently, we segment the data based on the number of compute nodes.
Mean Absolute Error (MAE): The MAE measures the absolute differences between the predicted and actual values. It quantifies the average magnitude of errors in predicting the runtime. Smaller MAE values indicate better model accuracy.
R-Squared (R²): R-squared is a measure of the goodness of fit of the regression model. It typically ranges from 0 to 1, with higher values indicating a better fit. In this context, R² reflects the proportion of the variance in runtime explained by the predictor variables.

The results in Table V reveal insightful information. Notably, the GBR achieved a low Mean Absolute Error (MAE) of 8.76 when two (2) compute nodes were present. This value indicates that the model’s predictions were, on average, very close to the actual runtime values. Moreover, the R-squared value of 0.99 indicates that the chosen predictors explained approximately 99% of the variance in runtime.

When the number of compute nodes increased to three (3), the MAE rose substantially to 118.78, indicating larger prediction errors. However, the model still exhibited a reasonably high R-squared value of 0.95, indicating that it could explain a significant portion of the variance in runtime. With the addition of the compute nodes, the model remains informative despite its increased complexity.

Similarly, with four (4) compute nodes, the MAE increased to 125.83, although it remained within a reasonable range. The R-squared value of 0.98 highlights the model’s ability to explain most of the variance in runtime, even with the higher number of compute nodes.

The table illustrates the performance of the Gradient Boosting Regressor across different numbers of compute nodes. It demonstrates the trade-off between model accuracy (as indicated by MAE) and model explanatory power (as indicated by R-squared) as the computational environment becomes more complex. These findings offer valuable insights for optimizing computational tasks in a parallel computing environment.

1) Run Time Prediction: We generated synthetic data and applied a Gradient Boosting Regressor (GBR) model to predict runtime based on the number of compute nodes, bandwidth, and whether we are conducting Model Distributed Inferencing (MDI) or Data Distributed Inferencing (DDI). In the figures, the blue line represents the original data, while the red line represents the predicted values.

In the plot for 2 compute nodes, we observed that the model initially exhibited oscillations when predicting runtime for bandwidths up to 2, but it eventually stabilized and closely followed the original data with minimal deviation from the plotted line.

Fig. 1: Compute Nodes - 2

With 3 compute nodes, the model demonstrated a high degree of predictability, closely tracking the original data. However, it is noteworthy that the model predicted nearly constant values for bandwidths up to 2.5.

Fig. 2: Compute Nodes - 3

For 4 compute nodes, the model provided a reasonably accurate prediction of the new data. Notably, some noise was observed in the predictions, particularly in the bandwidth range of 0 to 2 Mbps.

Fig. 3: Compute Nodes - 4

These findings highlight the model’s performance and its ability to predict runtime under varying conditions. The observed oscillations and deviations in certain scenarios warrant further investigation to enhance the model’s accuracy and robustness.

C. Performance Comparison with Previous Models

We compare the performance of the Gradient Boosting Regressor model with the results obtained from our previous models [11].

D. Discussion of Results

We discuss the implications and significance of the predictive modeling results, highlighting any insights gained from the analysis.

VI. Future Work

In the contemporary landscape of data-intensive edge computing, the effective orchestration of Data Distributed Inferencing (DDI) tasks remains a substantial challenge. This complex task necessitates real-time adaptability to the dynamic variations in network conditions and available compute resources. The present study represents a pioneering effort to provide a robust solution to these challenges, with Gradient Boosting Regression (GBR) as the linchpin of our predictive modeling methodology.

Looking forward, we anticipate noteworthy advancements in the field of DDI optimization. Our vision encompasses a future where the core aim of estimating inferencing time, factoring in critical variables such as bandwidth, compute device types, and the quantity of compute nodes, transitions from a theoretical concept to an essential practical tool for facilitating dynamic task allocation and optimization within DDI environments.

Our model, operating within the framework of online learning, ushers in a new era. It continually updates itself in real time as incoming data streams provide fresh insights. This agile adaptation enables the model to rapidly respond to the constantly evolving conditions inherent to edge computing, consistently providing precise predictions of inferencing times. This transformative aspect positions our research as an invaluable asset in the repertoire of DDI orchestration.

With a forward-looking perspective, the ramifications of our research extend across a multitude of domains. We envision a substantial enhancement in
the efficiency and performance of DDI systems, delivering tangible implications for practical applications. The profound impact of our work is poised to resonate through real-world contexts, spanning from the Internet of Things (IoT) and edge computing to the sphere of distributed machine learning.
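The continual-update behavior described in this section could be approximated in scikit-learn along the following lines. This is a hedged sketch, not the authors' implementation: to our knowledge `GradientBoostingRegressor` offers no incremental `partial_fit`, so the example instead uses `warm_start=True`, which keeps the already-fitted trees and fits only the newly added ones on whatever data is passed to the next `fit` call. The `make_batch` helper, its run-time formula, the batch sizes, and the choice of 25 extra trees per update are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)


def make_batch(n_rows):
    """Hypothetical stream batch: features are bandwidth (Mbps) and node count."""
    bandwidth = rng.uniform(1, 40, n_rows)
    nodes = rng.integers(2, 5, n_rows)
    run_time = 50.0 * nodes / bandwidth + rng.normal(0.0, 0.5, n_rows)
    return np.column_stack([bandwidth, nodes]), run_time


# warm_start=True keeps the fitted trees when fit() is called again.
model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=2,
    warm_start=True, random_state=0,
)
X0, y0 = make_batch(300)
model.fit(X0, y0)

# For each new batch, grow the ensemble by 25 trees fitted on that batch only.
for _ in range(3):
    X_new, y_new = make_batch(100)
    model.set_params(n_estimators=model.n_estimators + 25)
    model.fit(X_new, y_new)

print(model.estimators_.shape[0])  # 175 trees after three updates
```

Because earlier trees are never revised, this only approximates online learning: the ensemble grows with every update, and older trees continue to reflect the conditions under which they were fitted. A periodic full refit on a sliding window of recent data is one alternative with the opposite trade-off.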

References
[1] J. Chen and X. Ran, “Deep Learning with Edge Computing: A Review,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, Aug 2019. [Online]. Available: https://doi.org/10.1109/jproc.2019.2921977
[2] C. Gregg, M. Boyer, K. Hazelwood, and K. Skadron, “Dynamic Heterogeneous Scheduling Decisions Using Historical Runtime Data,” Workshop on Applications for Multi-and Many-Core Processors (A4MMC), pp. 1–12, 1 2011. [Online]. Available: https://www.cs.virginia.edu/ skadron/Papers/ gregga 4mmc11.pdf
[3] Y. Kim, P. Mercati, A. More, E. Shriver, and T. Rosing, “P4: Phase-based power/performance prediction of heterogeneous systems via neural networks,” 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 11 2017. [Online]. Available: https://doi.org/10.1109/iccad.2017.8203843
[4] Z. C. Lipton, “The mythos of model interpretability,” ACM Queue, vol. 16, no. 3, pp. 31–57, 6 2018. [Online]. Available: https://doi.org/10.1145/3236386.3241340
[5] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, no. 5, 10 2001. [Online]. Available: https://doi.org/10.1214/aos/1013203451
[6] J. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2 2002. [Online]. Available: https://doi.org/10.1016/s0167-9473(01)00065-2
[7] T. Chen and C. Guestrin, “XGBoost,” arxiv, 8 2016. [Online]. Available: https://doi.org/10.1145/2939672.2939785
[8] S. P. Baller, A. Jindal, M. Chadha, and M. Gerndt, “DeepEdgeBench: Benchmarking deep neural networks on edge devices,” arXiv (Cornell University), 8 2021. [Online]. Available: https://arxiv.org/pdf/2108.09457.pdf
[9] T. Hastie, R. Tibshirani, and J. H. Friedman, The elements of statistical learning, 1 2009. [Online]. Available: https://doi.org/10.1007/978-0-387-84858-7
[10] “Scikit-Learn Ensemble: GradientBoostingRegressor,” https://scikit-learn.org/stable/modules/generated/
[11] C. Anderson, M. Dwyer, and K. S. Chan, “Optimizing machine learning inference performance on iot devices: trade-offs and insights from statistical learning,” SPIE Proceedings, 2023.
