Combining global regression and local approximation in server power modeling
X. Du · C. Li
https://doi.org/10.1007/s00450-018-0391-x
Abstract
To evaluate energy use in green clusters, power models take the resource utilization data as the input to predict server power
consumption. We propose a novel method in power modeling combining a global linear model and a local approximation
model. The new model enjoys high accuracy by compensating the global linear model with local approximation and exhibits
robustness with the generalization capability of the global regression model. Empirical evaluation demonstrates that the new
approach outperforms the two existing approaches to server power modeling, the linear model and the k-nearest neighbor
regression model.
Keywords Server power modeling · Global regression · Local approximation · Linear model · Spatial interpolation
3.1 The model

We propose a novel method in power modeling. The new method combines global regression and local approximation as

    P = v \cdot u + b + \sum_{i=1}^{n} w_i \, \| u_i - u \| .    (2)

The global linear term expands over the utilization dimensions as

    v \cdot u + b = v_{CPU} u_{CPU} + v_{Memory} u_{Memory} + v_{IO} u_{IO} + b .    (3)

The model parameters, v, b, and w, are trained by minimizing the joint objective of the model error on the training data and the model complexity as

    \min_{v, b, w} \; e(v, b, w) + r(v, b, w) .    (5)
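In code, the combined model of Eq. (2) is a direct sum of the global linear term and the distance-weighted local terms. A minimal sketch (Python with numpy; the parameter values below are illustrative assumptions, not fitted values):

```python
import numpy as np

def predict_power(u, v, b, w, centers):
    """Evaluate the combined model of Eq. (2):
    P(u) = v . u + b + sum_i w_i * ||u_i - u||,
    where `centers` holds the training utilization vectors u_i."""
    u = np.asarray(u, dtype=float)
    distances = np.linalg.norm(centers - u, axis=1)  # ||u_i - u|| for each i
    return float(v @ u + b + w @ distances)

# Hypothetical parameters for a 3-dimensional utilization input
# (CPU, memory, I/O); chosen only for illustration.
v = np.array([120.0, 30.0, 15.0])    # global linear weights
b = 80.0                             # idle power offset (watts)
centers = np.array([[0.1, 0.2, 0.0],
                    [0.8, 0.6, 0.3]])
w = np.array([5.0, -3.0])            # local approximation weights

p = predict_power([0.5, 0.4, 0.1], v, b, w, centers)
```

Each training input u_i acts as the center of a local basis function ‖u_i − u‖, so the local terms bend the globally linear surface near the training samples.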
The first term,

    e(v, b, w) = \sum_{i=1}^{n} \Big( v \cdot u_i + b + \sum_{j=1}^{n} w_j \, \| u_j - u_i \| - P_i \Big)^2 ,    (6)

describes the squared error of the model on the training data. As the number of model parameters grows with the number of training samples, minimizing the squared error alone will lead to overfitting the training data by arbitrarily tweaking the parameters, especially w. Here we introduce a second term of standard L2 regularization,

    r(v, b, w) = \beta \, \| (v, b) \|^2 + \gamma \, \| w \|^2 ,    (7)

to penalize model complexity (cf. [7]), in which β and γ are two constants that control the trade-off between the model error and the model complexity. With the L2 regularization, the optimization prefers a solution with small parameters. It can be interpreted as an instance of Bayesian estimation in which the priors of the parameters follow normal distributions with zero means. Our preliminary experimental results show that the model accuracy is not sensitive to different choices of the two constants, and we therefore assign two ad hoc values for β and γ.

The complete optimization problem to solve the model parameters is given as

    \min_{v, b, w} \; \sum_{i=1}^{n} \Big( v \cdot u_i + b + \sum_{j=1}^{n} w_j \, \| u_j - u_i \| - P_i \Big)^2 + \beta \, \| (v, b) \|^2 + \gamma \, \| w \|^2 .    (8)

The minimization problem is solved with standard quadratic programming. Taking the partial derivatives of the objective function with respect to the model parameters and equating the derivatives to zero, we have

    \sum_{i=1}^{n} \Big( (v \cdot u_i) \, u_i + b \, u_i + \sum_{j=1}^{n} w_j \, \| u_j - u_i \| \, u_i - P_i \, u_i \Big) + \beta v = 0 ,    (9)

    \sum_{i=1}^{n} \Big( v \cdot u_i + b + \sum_{j=1}^{n} w_j \, \| u_j - u_i \| - P_i \Big) + \beta b = 0 ,    (10)

and

    \sum_{i=1}^{n} \| u_i - u_k \| \Big( v \cdot u_i + b + \sum_{j=1}^{n} w_j \, \| u_j - u_i \| - P_i \Big) + \gamma w_k = 0    (11)

for k = 1, ..., n. Solving the linear Eqs. (9), (10), and (11) gives the parameters of the power model.

3.3 Model characteristics

3.3.1 Impact of close neighbors

One of the attractive characteristics of k-nearest neighbor regression is that, given an input utilization u close to one of the training samples u_i, the predicted power P of the input will be close to the close neighbor's power P_i. This is because the large proximity between the input and the close neighbor contributes an extremely large weight to the weighted average in the regression formula (1). We now show that the new model enjoys the same characteristic as well.

Let

    e_i = v \cdot u_i + b + \sum_{j=1}^{n} w_j \, \| u_j - u_i \| - P_i    (12)

denote the training error of the model on the ith training sample. The difference between the power predicted by the new model and the power of the ith training sample is

    \Big| v \cdot u + b + \sum_{j=1}^{n} w_j \, \| u_j - u \| - P_i \Big|
      = \Big| v \cdot u + b + \sum_{j=1}^{n} w_j \, \| u_j - u \| - \Big( v \cdot u_i + b + \sum_{j=1}^{n} w_j \, \| u_j - u_i \| - e_i \Big) \Big|
      = \Big| v \cdot (u - u_i) + \sum_{j=1}^{n} w_j \big( \| u_j - u \| - \| u_j - u_i \| \big) + e_i \Big|
      \le \| v \| \, \| u - u_i \| + \sum_{j=1}^{n} | w_j | \, \| u_i - u \| + | e_i |
      \le \| u_i - u \| \Big( \| v \| + \sum_{j=1}^{n} | w_j | \Big) + | e_i | ,    (13)

where the first inequality uses the triangle inequality, | \| u_j - u \| - \| u_j - u_i \| | \le \| u - u_i \|. As the norms of the model parameters v and w and the training error e_i are minimized in (8) during the model training stage, the right-hand side of the inequality (13) is small given an input utilization u close to the training sample u_i. Therefore the difference between the power predicted by the new model and the power of the close neighbor is bounded, and the predicted power P of the input will be close to the neighbor's power P_i.
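The paper solves (8) with standard quadratic programming. Since (8) is an L2-regularized least-squares objective, an equivalent sketch is to assemble the design matrix [u_i, 1, ‖u_1 − u_i‖, …, ‖u_n − u_i‖] and solve the linear system (9)-(11) directly (numpy; the synthetic target and the β, γ values here are assumptions for illustration only):

```python
import numpy as np

def fit_gr_la(U, P, beta=1e-3, gamma=1e-3):
    """Fit the combined model by solving the normal equations (9)-(11).

    U: (n, d) training utilization vectors; P: (n,) measured power.
    beta/gamma are the ad hoc regularization constants of Eq. (7).
    Returns (v, b, w)."""
    n, d = U.shape
    # Pairwise distances ||u_j - u_i||; column j is the j-th local basis.
    D = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)
    # Design matrix: [u_i | 1 | ||u_1 - u_i|| ... ||u_n - u_i||]
    Phi = np.hstack([U, np.ones((n, 1)), D])
    # L2 penalty: beta on (v, b), gamma on w, as in Eq. (7).
    R = np.diag([beta] * (d + 1) + [gamma] * n)
    theta = np.linalg.solve(Phi.T @ Phi + R, Phi.T @ P)
    return theta[:d], theta[d], theta[d + 1:]

rng = np.random.default_rng(0)
U = rng.random((40, 3))        # synthetic utilization samples in [0, 1]^3
# Toy target: mostly linear power curve with a small nonlinearity.
P = 200 * U[:, 0] + 30 * U[:, 1] + 80 + 2 * np.sin(6 * U[:, 0])
v, b, w = fit_gr_la(U, P)

# Close-neighbor check (Sect. 3.3.1): the prediction at a training
# input stays near that sample's measured power.
pred0 = v @ U[0] + b + w @ np.linalg.norm(U - U[0], axis=1)
```

The last lines illustrate the close-neighbor property: because the training errors and the parameter norms are jointly minimized, the bound (13) keeps predictions near a close training sample's power.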
3.3.2 Unbalanced training data

The approach of k-nearest neighbor regression suffers in cases where the training data is distributed in an unbalanced manner in the utilization input space. Assume that we have a balanced training set T = {(u_1, P_1), ..., (u_n, P_n)} with n samples. We artificially create a second training set with n + m samples, T' = T ∪ {(u_{n+1} = u_n, P_{n+1} = P_n), ..., (u_{n+m} = u_n, P_{n+m} = P_n)}, by duplicating the last training sample m times. The augmented samples result in an unbalanced distribution of the training samples around u_n. If the training sample u_n is within the k nearest neighbors of a new input u, but is not its nearest neighbor, the power predicted with k-nearest-neighbor regression will be biased towards P_n. This is because multiple instances of (u_n, P_n) are involved in the regression formula (1). Such a bias is far from the ideal scenario. In the ideal scenario of prediction using the original balanced training set T, only one instance of (u_n, P_n) is involved in regression. We now show how the new model alleviates the problem.

Given the balanced training set T, we rewrite the optimization problem (8) using the term of training error defined in (12) as

    \min_{v, b, w} \; \sum_{i=1}^{n-1} e_i^2 + e_n^2 + \beta \, \| (v, b) \|^2 + \gamma \, \| (w_1, \ldots, w_{n-1}, w_n) \|^2 .    (14)

Now for the unbalanced training set T', we rewrite the optimization problem as

    \min_{v', b', w'} \; \sum_{i=1}^{n-1} e_i'^2 + (m + 1) \, e_n'^2 + \beta \, \| (v', b') \|^2 + \gamma \, \| (w_1', \ldots, w_{n-1}', w_n', \ldots, w_{n+m}') \|^2 ,    (15)

in which we define the training errors for the specific training set T' as

    e_i' = v' \cdot u_i + b' + \sum_{j=1}^{n-1} w_j' \, \| u_i - u_j \| + \sum_{j=n}^{n+m} w_j' \, \| u_i - u_n \| - P_i .    (16)

Let v*, b*, and w* = (w_1*, ..., w_n*) denote the parameters which form the optimal solution to the optimization problem (14) in parameter estimation. We also assume that the training error on the nth sample, e_n, is close to 0 as the result of minimizing the training errors. Comparing (12) with (16), we see that if we set v' = v*, b' = b*, and w' = (w_1*, ..., w_{n-1}*, w_n*/(m+1), ..., w_n*/(m+1)), we have e_i' = e_i for i = 1, ..., n. Comparing the optimization problem (14) with the optimization problem (15), we then reach a near-optimum solution of the optimization problem (15) given this set of parameters. This is because a very small e_n' contributes little to (15) in the form of m e_n'^2, which is the only difference in the squared error term. Then the power model learned from the unbalanced training set becomes

    v' \cdot u + b' + \sum_{j=1}^{n+m} w_j' \, \| u - u_j \|
      = v^* \cdot u + b^* + \sum_{j=1}^{n-1} w_j^* \, \| u - u_j \| + \sum_{j=n}^{n+m} \frac{w_n^*}{m+1} \, \| u - u_n \|
      = v^* \cdot u + b^* + \sum_{j=1}^{n} w_j^* \, \| u - u_j \| ,    (17)

which is identical to the model learned from the balanced training set. The example shows that the new model is not sensitive to an unbalanced distribution of the training data in the utilization input space.

4 Experimental results

4.1 Server power modeling on benchmarks

We perform experiments to compare the accuracy of our model with that of the existing approaches on seven data sets obtained from running benchmarks on an Intel Romley generation server (SC2600CP). The server has 2 Intel CPUs, each with 8 physical cores running at 2.7 GHz and with hyper-threading enabled (that is, 32 threads in total), 32 GB memory, and a 500 GB hard drive (WDC WD5003ABYX-01WERA1). In almost all the cases, the operating system is Windows Server 2008 R2 Enterprise 64-bit. There is one exception, which is described later.

To collect the resource utilization data, we use Windows Performance Monitor to collect CPU utilization, memory activities (last level cache misses) and disk I/O (disk I/O transfers). In order to monitor the memory activities, we use the plug-in of Intel Performance Counter Monitor (see https://software.intel.com/en-us/articles/intel-performance-counter-monitor for details) integrated into Windows Performance Monitor. For the dimensions of memory activities and disk I/O, the resource utilization values are normalized to the range of [0, 1]. With the normalization, the scale of the memory or I/O utilization is similar
to that of the CPU utilization, which is important to distance or proximity calculation.

We collect power data through Intel Node Manager (see http://www.intel.com/content/www/us/en/data-center/data-center-management/node-manager-general.html for details), a firmware component residing in the Intel server chipset. Intel Node Manager examines power readings from the instrumented platform power sensor. It is configured to aggregate the input power to the power supply within a moving time window. The size of the window is defined by management software.

Four benchmarks, Prime95 (with varying intensity; available at http://www.mersenne.org/download/), Linpack (available at http://www.netlib.org/linpack/), SPECpower (available at https://www.spec.org/power_ssj2008/) and IOMeter (available at http://www.iometer.org/), are run on the server. Every 10 s, we collect the 10-s average values of the resource utilization and power data. By running the four individual workloads, we create four data sets using the utilization and power data collected. We create a fifth data set with mixed workloads as follows. In addition to the four individual workloads described above, we add some data points for idle power and some data points for peak power. To obtain the power and utilization data when peak power is reached, we run FIRESTARTER (available at http://tu-dresden.de/zih/firestarter/), a tool designed specifically to reach the maximum power consumption levels of Intel processors. As the tool is only available in a Linux version, the operating system is temporarily switched to Ubuntu Server 14.03. To monitor CPU utilization, memory activities, and disk I/O in Linux, we use the command 'mpstat', Intel Performance Counter Monitor, and the command 'iostat'. We mix all the data collected as the fifth data set.

In each of the five data sets described above, the data is randomly split into two mutually exclusive parts for training and testing. We also create two unbalanced data sets. The first one takes a chronological split of the SPECpower data in which the workload intensity increases from low to high: in the training set the utilization inputs have relatively low values, while in the test set the utilization inputs have relatively high values. The second one is a cross workload data set that uses the SPECpower data for training and the Linpack data for testing. The data sets are used to evaluate our approach and the two existing approaches, the linear power model and k-nearest neighbor regression.

Table 1 Error rates of the approaches on the seven data sets of server power modeling

    Data Set                    Linear (%)   k-NN (%)   GR&LA (%)
    Prime95                     4.33         4.69       1.70
    Linpack                     0.68         0.81       0.72
    SPECpower                   8.13         1.09       1.02
    IOMeter                     1.13         0.13       0.11
    Mix of workloads            25.51        1.89       1.62
    Workload from low to high   23.81        44.08      3.86
    Cross workload              15.14        50.83      7.32

Table 1 shows the error rates of the approaches. As we can see from the experimental results, the linear model works reasonably well on the individual workloads, achieving error rates lower than 10%. However, when we mix the workloads together (denoted as 'Mix of workloads'), its error rate grows drastically to around 25%. While the approach of k-nearest neighbor regression (denoted as 'k-NN') works accurately on the mix of the workloads, reducing the error rate to 1.89%, it performs even worse than the linear model (error rates around 40-50%) when the training data is unbalanced in the last two data sets. Our new model (denoted as 'GR&LA') is accurate on the individual workloads as well as their mix, and is also robust to unbalanced training data, outperforming the existing methods.

4.2 Server power modeling on real world workloads

We next perform experiments on modeling server power when real world workloads are running. To do so, we place the same server with the same operating system in a datacenter for engineering computing at Intel. Multiple users are able to access the server to run the different workloads described as follows.

1. The server acts as the demo environment for Intel Datacenter Manager, server management software which manages power and cooling in a datacenter (see http://www.intel.com/content/www/us/en/software/intel-dcm-product-detail.html for details). In the demo setting, the software connects to more than 500 stub server simulators located in the same collector process on the physical server, stores their power and thermal data, and performs basic analytics, including data aggregation and histogram analysis, on a hosted web UI. Though the workload runs consistently in the background, including a web server, two Windows services, and a back-end database, its utilization of resources is low in general.

2. Some users occasionally run machine learning workloads on the server. Most of the workloads use only one thread and access a small data set. The running time, however, varies depending on the learning algorithm.

To create the data set for the experiment, we choose a time period of more than half an hour which involves a relatively large range of utilization. In the specific time
period, not only is the background workload, Intel Datacenter Manager, persistent, but there are multiple machine learning workloads running as well. Those machine learning workloads target a categorization problem which classifies a set of tasks in individual cloud computing jobs into fast tasks and slow tasks. During the period, a comparative study is performed on different machine learning algorithms with different parameter settings. The learning algorithms vary, including decision stumps [8], logistic regression [7], support vector machines [9], random forests [10], etc. The implementations of the algorithms include publicly available tools (LibSVM, available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/, for support vector machines; Weka, available at http://www.cs.waikato.ac.nz/ml/weka, for logistic regression and random forests) and self-developed applications. The application types include Java-based managed code and native applications compiled from code written in C++. In many cases, short-running jobs are submitted first and long-lasting jobs are submitted later.

Similar to the previous experiment of server power modeling with benchmarks, we monitor the CPU utilization, memory activities and disk I/O using Windows Performance Monitor. Power data is collected through Intel Node Manager. The granularity for both the power data and the utilization data is 10 s.

Figure 4 plots the CPU utilization, last level cache misses, and power consumption of the real world workload during the period. As we can see from the figure, the workload intensity ramps up in general with respect to time, but with small bursts from time to time.

We split the data into a training set and a test set in two different ways. The first split is a random one, so that the training samples and the test samples are drawn from the same probability distribution. Alternatively, we split the data chronologically, using the first half for training and the second half for testing. In the chronological split, the data is unbalanced: the utilization for training is relatively lower while that for testing is relatively higher.

4.3 Analysis on modeling synthetic functions

To further explore the effectiveness of the new model, we validate the method combining global regression and local approximation on synthetic functions with different levels of noise.

We create four one-dimensional piecewise functions defined on [0, 1]. The four synthetic functions are:

    f(x) = \begin{cases} 160x + 80 & 0 \le x \le 0.25 \\ 105x^2 + 130x + 80 & 0.25 \le x \le 0.75 \\ 110x^2 + 130x + 80 & 0.75 \le x \le 1 \end{cases}    (18)

    g(x) = \begin{cases} 360e^x - 280 & 0 \le x \le 0.25 \\ -210x^2 + 450x + 80 & 0.25 \le x \le 0.75 \\ 80x + 240 & 0.75 \le x \le 1 \end{cases}    (19)

    h(x) = \begin{cases} 360e^x - 280 & 0 \le x \le 0.25 \\ 120x + 150 & 0.25 \le x \le 0.75 \\ 110x^2 + 130x + 80 & 0.75 \le x \le 1 \end{cases}    (20)

    s(x) = \begin{cases} 160x + 80 & 0 \le x \le 0.25 \\ 210x^2 + 100x + 80 & 0.25 \le x \le 0.75 \\ 160x + 160 & 0.75 \le x \le 1 \end{cases}    (21)

If we regard x as the utilization input with 0 ≤ x ≤ 1, the four functions simulate four hypothetical one-dimensional power models where the server power peaks at 320 W and idles at 80 W. All the functions are close to linear in each of their piecewise segments, but incorporate different levels of nonlinearity in different regions. For example, the curve of the first function, f(x), rises gently at the beginning and then the rate of increase becomes somewhat steeper, while the curve of the third function, h(x), rises steeply both at the beginning and approaching the end, but climbs relatively slower in the mid-range.

1000 random samples are used as the training data. To simulate the noisy nature of the power behavior, we inject
Table 3 Error rates of modeling synthetic piecewise functions in the context of different noise levels in training

                 Noise: N(0, 0.05^2)                Noise: N(0, 0.1^2)                 Noise: N(0, 0.2^2)
    Method       f(x)%  g(x)%  h(x)%  s(x)%         f(x)%  g(x)%  h(x)%  s(x)%         f(x)%  g(x)%  h(x)%  s(x)%
    Linear       5.12   7.84   6.23   5.86          5.35   8.12   6.13   5.80          4.99   7.74   6.36   5.60
    k-NN         4.01   4.04   4.29   4.03          8.12   8.19   8.29   7.69          16.10  15.65  15.79  16.29
    GR&LA        0.91   0.76   0.77   0.88          1.30   1.57   1.51   1.61          2.97   2.79   3.36   2.84
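The synthetic setup of Sect. 4.3 can be sketched as follows for the first function, f(x) of Eq. (18). The paper describes the injected noise only as ratios N(0, σ²); applying it multiplicatively to the function value is an assumption of this sketch:

```python
import numpy as np

def f(x):
    """Synthetic one-dimensional power model of Eq. (18):
    idles at 80 W (x = 0), peaks at 320 W (x = 1)."""
    return np.where(x <= 0.25, 160 * x + 80,
           np.where(x <= 0.75, 105 * x**2 + 130 * x + 80,
                               110 * x**2 + 130 * x + 80))

rng = np.random.default_rng(2)
x_train = rng.random(1000)      # 1000 random training samples on [0, 1]
sigma = 0.1                     # one of the three noise levels in Table 3
# Relative (multiplicative) N(0, sigma^2) noise -- an assumption here;
# the paper only specifies the noise distribution as a ratio.
y_train = f(x_train) * (1 + rng.normal(0.0, sigma, x_train.shape))
x_test = rng.random(10000)      # 10,000 noise-free test samples
y_test = f(x_test)
```

Fitting the three models on (x_train, y_train) and scoring on the noise-free (x_test, y_test) then measures how well each method recovers the ground truth despite the noise.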
different ratios of noise, N(0, σ^2) with σ varied, into the training samples. Here we use N(μ, σ^2) to denote a normal distribution with mean μ and variance σ^2. We use another 10,000 random samples without noise in testing, in order to see how well the different methods are able to discover the ground truth in a noisy environment.

Table 3 shows the error rates of the linear model, k-nearest neighbor regression, and the model combining global regression and local approximation in modeling the four synthetic piecewise functions in the context of different noise levels. As we can see from the experimental results, our method consistently outperforms the two baselines. The k-nearest neighbor regression method does not perform well when a higher level of noise is involved. This is because it learns the noise and then overfits to it. The linear model is robust against the noise, but fails to capture the different extents of nonlinearity. By contrast, our method is able to achieve a low error rate when the noise level is low, and is also able to perform robustly given an increasing noise level.

As system power consumption is typically composed of close-to-linear functions in different intervals with a certain extent of nonlinearity (see, e.g., [4]), the results demonstrate that our method is well suited to the task of server power modeling.

5 Conclusions

This paper has presented a new approach to modeling server system power using global regression and local approximation which overcomes the limitations of the traditional linear power model and k-nearest neighbor regression. The proposed approach compensates the global regression model with local approximation to capture the subtleties of the nonlinear utilization-power mapping, and retains robustness in cases with an unbalanced training data distribution through global regression. Empirical evaluation shows that the proposed model is accurate and robust, outperforming the existing approaches.

There are several issues remaining for future work. First, it is possible to involve more special performance counters on resource utilization and build a hierarchical system power model, composed of sub-models predicting the power of certain sub-systems given the relevant counters. Even if the power consumption of a sub-system cannot be monitored, the hierarchical system power model and the sub-models can be built with the system power measurements only. Second, it is possible to extend the system power model from the server domain to mobile devices, in which power consumption is an extremely important and sensitive topic. Finally, the server power model can be incorporated in cluster infrastructure management systems for power capacity planning and in cluster orchestration software for power-aware scheduling.

Acknowledgements We thank Rahul Khanna, Honesty Young, and Shilin Wang for their comments on an early draft of the paper.

References

1. Gurumurthi S, Sivasubramaniam A, Irwin MJ, Vijaykrishnan N, Kandemir M, Li T, John LK (2002) Using complete machine simulation for software power estimation: the SoftWatt approach. In: Proceedings of the eighth international symposium on high-performance computer architecture (HPCA-2002), Washington, pp 141
2. Economou D, Rivoire S, Kozyrakis C, Ranganathan P (2006) Full-system power analysis and modeling for server environments. In: Proceedings of the workshop on modeling, benchmarking and simulation (MoBS-2006), Boston
3. Dalton D, Vadher A, Laoghaire D, McCarthy A, Steger C (2012) Power profiling and auditing consumption systems and methods. United States Patent Application Publication, Pub. No.: US 2012/0011378
4. Fan X, Weber W, Barroso L (2007) Power provisioning for a warehouse-sized computer. In: Proceedings of the thirty-fourth international symposium on computer architecture (ISCA-2007), San Diego, pp 13-23
5. Kansal A, Zhao F, Liu J, Kothari N, Bhattacharya A (2010) Virtual machine power metering and provisioning. In: Proceedings of the first ACM symposium on cloud computing (SoCC-2010), Indianapolis, pp 39-50
6. Mitchell T (1997) Machine learning. McGraw-Hill Inc, New York
7. Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the twenty-first international conference on machine learning (ICML-2004), Banff
8. Iba W, Langley P (1992) Induction of one-level decision trees. In: Proceedings of the ninth international conference on machine learning (ICML-1992), San Francisco, pp 233-240
9. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273-297
10. Breiman L (2001) Random forests. Mach Learn 45(1):5-32