
Computer Science - Research and Development (2019) 34:35–43

https://doi.org/10.1007/s00450-018-0391-x

SPECIAL ISSUE PAPER

Combining global regression and local approximation in server power modeling

Xiaoming Du · Cong Li

Published online: 2 May 2018


© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract
To evaluate energy use in green clusters, power models take the resource utilization data as the input to predict server power
consumption. We propose a novel method in power modeling combining a global linear model and a local approximation
model. The new model enjoys high accuracy by compensating the global linear model with local approximation and exhibits
robustness with the generalization capability of the global regression model. Empirical evaluation demonstrates that the new
approach outperforms the two existing approaches to server power modeling, the linear model and the k-nearest neighbor
regression model.

Keywords Server power modeling · Global regression · Local approximation · Linear model · Spatial interpolation

1 Introduction

Evaluating energy use is crucial in green clusters with the growing demand of managing the cost of power delivery and cooling. It therefore becomes important to understand the power consumption of a server at the system level. One category of approaches to evaluating server power consumption is full-system simulation (see, e.g., [1]). Full-system simulators are typically based on analytical models tied to low-level architecture events. As a result, they suffer from drawbacks in simulation speed and portability [2]. It then becomes difficult to run a full-system simulation on long workloads or large data sets, especially in the context of energy optimization at the cluster level.

Alternatively, it is possible to model the power consumption of a server based on its high-level hardware resource utilization, in which the resource utilization metrics are retrieved through OS performance counters with low overhead. A power model takes the resource utilization data as the input and predicts the power consumption of a server. Power models provide on-the-fly full-system power characterization in a non-intrusive manner [2]. The models can be used in many applications. For example, low-cost servers are typically not instrumented with power sensors. One may connect such a server to a power meter, collecting the power data to build a power model. Once the model is built, it can be used to estimate the power consumption of the low-cost server without power sensor instrumentation or power meters. One may also use a set of power models to simulate the power consumption of a cluster of servers, performing what-if analysis on probable server power given hypothesized resource utilization. In an environment where the power consumption behaviors of different servers are heterogeneous, the power models for those servers can be used to minimize the energy consumption of the servers by scheduling workloads on the right ones.

Typically a power model is trained in a calibration phase. In calibration, a set of historical utilization-power data is measured and used to construct a power model. The model is then used at runtime to perform power estimation or prediction. The traditional linear power model provides a coarse-grained generalization even with a minimum amount of training data [2], but it is not accurate enough over a variety of workloads because it cannot capture the subtle nonlinearity in the resource-power mapping. The approach of k-nearest neighbor regression captures this subtle nonlinearity [3], but does not perform well when the training data are sparse or not well distributed over the utilization input space.

✉ Xiaoming Du: xiaoming.du@intel.com · Cong Li: cong.li@intel.com
1 No. 880, Zixing Road, Shanghai, People's Republic of China


In this paper we propose a novel method in power modeling that combines a global linear model with a local approximation model. In the proposed method, traditional linear regression is employed in the global linear model and nonlinear spatial interpolation is used for local approximation. By compensating the global linear model with local approximation, the new method enjoys high accuracy. Meanwhile, it exhibits robustness with the generalization capability of the global regression model. Empirical evaluation demonstrates that the new approach outperforms the existing approaches.
Our contributions are two-fold. First, we provide an empirical demonstration of the limitations of the two traditional approaches, that is, the linear power model and k-nearest neighbor regression. To the best of our knowledge, such an empirical demonstration has not been explicitly presented in the literature. Second, we propose a new approach to server power modeling that overcomes the limitations of the traditional approaches.

2 Related work

A power model, $P = f(u)$, takes the resource utilization data $u$ as the input and predicts the power consumption $P$ of a server. Figure 1 shows the system diagram of server power modeling. The power model is trained in a calibration (or training) phase. In calibration, a set of historical utilization-power data, $(u_1, P_1), \ldots, (u_n, P_n)$, is measured on the target server and used to construct a power model with a certain machine learning algorithm. The model is then used at runtime to perform power estimation or prediction given a new (and typically unseen) input $u$. Power models serve many usages, including, e.g., monitoring power for servers without power sensor instrumentation, what-if power analysis given hypothesized resource utilization, and power-aware scheduling.

[Fig. 1 System diagram of server power modeling]

2.1 Linear power model

The traditional power model of linear regression, $P = v \cdot u + b$ (where $v$ and $b$ are the model parameters), assumes that the power consumption of the system changes linearly with respect to each of the resource utilization values (see, e.g., [2,4]). Figure 2 visualizes a one-dimensional example (though in practice multi-dimensional inputs are used). The linear model is trained with linear regression using ordinary least squares and provides a coarse-grained generalization from a minimum amount of training data.

[Fig. 2 A one-dimensional example of the linear power model]

However, the model is not comprehensive enough to capture the subtle nonlinearity in the resource-power mapping, and therefore does not achieve accurate results on a mix of heterogeneous workloads (see, e.g., [5]).

As described later in Sect. 4, though in our experiments the linear power model performs reasonably well on individual workloads (where its error rates are lower than 10%), when we mix the workloads together the error rate grows drastically to more than 25%. This implies that the linear power model cannot be properly used in a complicated environment.
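For concreteness, here is a minimal sketch of the ordinary-least-squares fit described above, written in Python/NumPy; the function and array names (U for utilization samples, P for measured powers) are our own illustration, not from the paper.

```python
import numpy as np

def fit_linear_power_model(U, P):
    """Fit P = v . u + b by ordinary least squares.

    U: (n, d) array of utilization samples; P: (n,) array of measured powers.
    """
    X = np.hstack([U, np.ones((U.shape[0], 1))])   # append a column of ones for the intercept b
    theta, *_ = np.linalg.lstsq(X, P, rcond=None)  # least-squares solution of X @ theta ~ P
    return theta[:-1], theta[-1]                   # v (weights), b (intercept)
```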
2.2 k-Nearest neighbor regression

Given a utilization input $u$, the alternative model, k-nearest neighbor regression, searches for the set of $k$ nearest neighbors, $N(u) = \{u_{N_1}, \ldots, u_{N_k}\}$, among the training samples. It then calculates the average power of those neighbors, weighted by their proximities to the input [3], as

$$P = \frac{\sum_{i=1}^{k} P_{N_i} / \|u_{N_i} - u\|}{\sum_{i=1}^{k} 1 / \|u_{N_i} - u\|}. \qquad (1)$$

Figure 3 visualizes a two-dimensional example of 3-nearest neighbor regression (though, again, in practice one may choose a model of higher dimensions).

[Fig. 3 A two-dimensional example of 3-nearest neighbor regression]

Different from the linear power model, k-nearest neighbor regression employs a lazy learning process: in training it only memorizes the training samples without performing any generalization. The generalization beyond the training data is deferred to the prediction stage, after a new utilization input is received [6].



Such a non-parametric model approximates well the local nonlinearity of the utilization-power mapping when near neighbors are available. However, the method fails for lack of near neighbors when the data is distributed sparsely or in an unbalanced manner. For example, if the training data all have relatively low utilization, the weighted average power of those low-utilization samples obviously cannot generalize well to a high-utilization input. Similarly, if the training data concentrate in one type of workload and are imbalanced within the entire utilization input space, even the nearest neighbors of a new input from a different workload may still be relatively far from the input, and therefore the model cannot generalize well through a weighted average.

As described later in Sect. 4, in our experiments the k-nearest neighbor regression model performs accurately on different workloads and also on the mix of different workloads, but fails with even larger error rates of 40-50% in the cases of sparse and unbalanced distribution of training data.
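For reference, here is a minimal sketch of the inverse-distance-weighted prediction of Eq. (1) in Python/NumPy; the paper does not prescribe an implementation, and the small eps guard against a zero distance is our addition.

```python
import numpy as np

def knn_power(u, U_train, P_train, k=3, eps=1e-12):
    """Predict power by k-nearest neighbor regression, Eq. (1)."""
    d = np.linalg.norm(U_train - u, axis=1)   # Euclidean distance to every training sample
    idx = np.argsort(d)[:k]                   # indices of the k nearest neighbors
    w = 1.0 / (d[idx] + eps)                  # inverse-distance (proximity) weights
    return float(np.sum(w * P_train[idx]) / np.sum(w))
```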
3 The proposed approach: combining global regression and local approximation

In this section, we first describe the proposed model and then outline the method to estimate its parameters. After that, we provide a mathematical analysis to demonstrate some interesting characteristics of the proposed model.

3.1 The model

We propose a novel method in power modeling. The new method combines global regression and local approximation as

$$P = v \cdot u + b + \sum_{i=1}^{n} w_i \|u_i - u\|. \qquad (2)$$

The first part of the model, $v \cdot u + b$, is a global regression model. Without loss of generality, given a utilization vector which contains the three components of CPU utilization, memory bus activities, and disk I/O transfers (that is, $u = (u_{\mathrm{CPU}}, u_{\mathrm{Memory}}, u_{\mathrm{IO}})$), the global linear regression model is a traditional linear model:

$$v \cdot u + b = v_{\mathrm{CPU}} u_{\mathrm{CPU}} + v_{\mathrm{Memory}} u_{\mathrm{Memory}} + v_{\mathrm{IO}} u_{\mathrm{IO}} + b. \qquad (3)$$

Identical to the traditional linear model, the sub-model of global linear regression provides a coarse-grained generalization describing how the system power changes linearly with respect to the individual resource utilization values. The linear model captures the global trend of power consumption with respect to the resource utilization and can be estimated even from a small set of unbalanced power-utilization samples distributed sparsely.

The second part of the model,

$$\sum_{i=1}^{n} w_i \|u_i - u\|, \qquad (4)$$

is a local approximation model based on spatial interpolation to compensate for the error of the coarse-grained global regression model. Similar to k-nearest neighbor regression, it provides a nonlinear local approximation. Different from k-nearest neighbor regression, it uses spatial interpolation based upon the Euclidean distances between the utilization input and the individual training utilization samples. The local approximation model is parameterized, and the parameters $w = (w_1, \ldots, w_n)$ are estimated in the training phase to achieve the generalization. The local approximation model captures the subtle nonlinearity of the utilization-power mapping in addition to the global linear model.

Combining global regression and local approximation, the new power model enjoys high accuracy by compensating the global linear model with local approximation. It is also robust in different scenarios with the generalization capability of the global regression model.
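As an illustration, here is a minimal sketch of the prediction rule of Eq. (2) in Python/NumPy, assuming the parameters v, b, and w have already been estimated (the function name is ours):

```python
import numpy as np

def grla_power(u, U_train, v, b, w):
    """Predict power with the combined model of Eq. (2):
    global linear term plus distance-based local approximation."""
    dist = np.linalg.norm(U_train - u, axis=1)  # ||u_i - u|| for every training sample
    return float(v @ u + b + w @ dist)
```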

3.2 Parameter estimation

The model parameters, $v$, $b$, and $w$, are trained by minimizing the joint objective of the model error on the training data and the model complexity:

$$\min_{v,b,w} \; e(v, b, w) + r(v, b, w). \qquad (5)$$


The first term,

$$e(v, b, w) = \sum_{i=1}^{n} \Bigl( v \cdot u_i + b + \sum_{j=1}^{n} w_j \|u_j - u_i\| - P_i \Bigr)^2, \qquad (6)$$

describes the squared error of the model on the training data.

As the number of model parameters grows with the number of training samples, minimizing the squared error alone would overfit the training data by arbitrarily tweaking the parameters, especially $w$. We therefore introduce a second term of standard $L_2$ regularization,

$$r(v, b, w) = \beta \|(v, b)\|^2 + \gamma \|w\|^2, \qquad (7)$$

to penalize model complexity (cf. [7]), in which $\beta$ and $\gamma$ are two constants that control the trade-off between the model error and the model complexity. With the $L_2$ regularization, the optimization prefers a solution with small parameters. It can be interpreted as an instance of Bayesian estimation in which the priors of the parameters follow normal distributions with zero means. Our preliminary experimental results show that the model accuracy is not sensitive to the choices of the two constants, and we therefore assign two ad hoc values to $\beta$ and $\gamma$.

The complete optimization problem for the model parameters is

$$\min_{v,b,w} \; \sum_{i=1}^{n} \Bigl( v \cdot u_i + b + \sum_{j=1}^{n} w_j \|u_j - u_i\| - P_i \Bigr)^2 + \beta \|(v, b)\|^2 + \gamma \|w\|^2. \qquad (8)$$

The minimization problem is solved with standard quadratic programming. Taking the partial derivatives of the objective function with respect to the model parameters and equating the derivatives to zero, we have

$$\sum_{i=1}^{n} \Bigl( (v \cdot u_i)\, u_i + b\, u_i + \sum_{j=1}^{n} w_j \|u_j - u_i\|\, u_i - P_i\, u_i \Bigr) + \beta v = 0, \qquad (9)$$

$$\sum_{i=1}^{n} \Bigl( v \cdot u_i + b + \sum_{j=1}^{n} w_j \|u_j - u_i\| - P_i \Bigr) + \beta b = 0, \qquad (10)$$

and

$$\sum_{i=1}^{n} \|u_i - u_k\| \Bigl( v \cdot u_i + b + \sum_{j=1}^{n} w_j \|u_j - u_i\| - P_i \Bigr) + \gamma w_k = 0 \qquad (11)$$

for $k = 1, \ldots, n$. Solving the linear Eqs. (9), (10), and (11) gives the parameters of the power model.
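Because the objective (8) is quadratic, the stationarity conditions (9)-(11) form a single linear system. Equivalently, the fit can be written as a ridge regression over the feature matrix [U | 1 | D] with block-wise penalties β and γ; this matrix formulation and the placeholder defaults for beta and gamma below are ours, not the paper's.

```python
import numpy as np

def fit_grla(U, P, beta=1.0, gamma=1.0):
    """Estimate v, b, w by minimizing Eq. (8) in closed form."""
    n, d = U.shape
    D = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)  # D[i, j] = ||u_i - u_j||
    X = np.hstack([U, np.ones((n, 1)), D])                     # features for (v, b, w)
    R = np.diag([beta] * (d + 1) + [gamma] * n)                # block-wise L2 penalty
    theta = np.linalg.solve(X.T @ X + R, X.T @ P)              # (X'X + R) theta = X'P
    return theta[:d], theta[d], theta[d + 1:]                  # v, b, w
```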
3.3 Model characteristics

3.3.1 Impact of close neighbors

One of the attractive characteristics of k-nearest neighbor regression is that, given an input utilization $u$ close to one of the training samples $u_i$, the predicted power $P$ of the input will be close to the close neighbor's power $P_i$. This is because the large proximity between the input and the close neighbor contributes an extremely large weight to the weighted average in the regression formula (1). We now show that the proposed model enjoys the same characteristic.

Let

$$e_i = v \cdot u_i + b + \sum_{j=1}^{n} w_j \|u_j - u_i\| - P_i \qquad (12)$$

denote the training error of the model on the $i$th training sample. The difference between the power predicted by the new model and the power of the $i$th training sample is

$$\begin{aligned}
\Bigl| v \cdot u + b + \sum_{j=1}^{n} w_j \|u_j - u\| - P_i \Bigr|
&= \Bigl| v \cdot u + b + \sum_{j=1}^{n} w_j \|u_j - u\| - \Bigl( v \cdot u_i + b + \sum_{j=1}^{n} w_j \|u_j - u_i\| - e_i \Bigr) \Bigr| \\
&= \Bigl| v \cdot (u - u_i) + \sum_{j=1}^{n} w_j \bigl( \|u_j - u\| - \|u_j - u_i\| \bigr) + e_i \Bigr| \\
&\le \|v\| \, \|u - u_i\| + \sum_{j=1}^{n} |w_j| \, \|u_i - u\| + |e_i| \\
&\le \|u_i - u\| \Bigl( \|v\| + \sum_{j=1}^{n} |w_j| \Bigr) + |e_i|, \qquad (13)
\end{aligned}$$

where the first inequality uses the Cauchy-Schwarz inequality and the triangle inequality $\bigl|\, \|u_j - u\| - \|u_j - u_i\| \,\bigr| \le \|u - u_i\|$.

As the norms of the model parameters $v$ and $w$ and the training error $e_i$ are minimized in (8) during the model training stage, the right-hand side of the inequality (13) is small given an input utilization $u$ close to the training sample $u_i$. Therefore the difference between the power predicted by the new model and the power of the close neighbor is bounded, and the predicted power $P$ of the input will be close to the neighbor's power $P_i$.


3.3.2 Unbalanced training data

The approach of k-nearest neighbor regression suffers in cases where the training data is distributed in an unbalanced manner in the utilization input space. Assume that we have a balanced training set $T = \{(u_1, P_1), \ldots, (u_n, P_n)\}$ with $n$ samples. We artificially create a second training set with $n + m$ samples, $T' = T \cup \{(u_{n+1} = u_n, P_{n+1} = P_n), \ldots, (u_{n+m} = u_n, P_{n+m} = P_n)\}$, by duplicating the last training sample $m$ times. The augmented samples result in an unbalanced distribution of the training samples around $u_n$. If the training sample $u_n$ is within the $k$ nearest neighbors of a new input $u$, but is not its nearest neighbor, the power predicted with k-nearest neighbor regression will be biased towards $P_n$. This is because multiple instances of $(u_n, P_n)$ are involved in the regression formula (1). Such a bias is far from the ideal scenario: in the ideal scenario of prediction using the original balanced training set $T$, only one instance of $(u_n, P_n)$ is involved in regression. We now show how the new model alleviates the problem.

Given the balanced training set $T$, we rewrite the optimization problem (8) using the training error defined in (12) as

$$\min_{v,b,w} \; \sum_{i=1}^{n-1} e_i^2 + e_n^2 + \beta \|(v, b)\|^2 + \gamma \|(w_1, \ldots, w_{n-1}, w_n)\|^2. \qquad (14)$$

Now for the unbalanced training set $T'$, we rewrite the optimization problem as

$$\min_{v',b',w'} \; \sum_{i=1}^{n-1} e_i'^2 + (m + 1)\, e_n'^2 + \beta \|(v', b')\|^2 + \gamma \|(w_1', \ldots, w_{n-1}', w_n', \ldots, w_{n+m}')\|^2, \qquad (15)$$

in which we define the training errors for the specific training set $T'$ as

$$e_i' = v' \cdot u_i + b' + \sum_{j=1}^{n-1} w_j' \|u_i - u_j\| + \sum_{j=n}^{n+m} w_j' \|u_i - u_n\| - P_i. \qquad (16)$$

Let $v^*$, $b^*$, and $w^* = (w_1^*, \ldots, w_n^*)$ denote the parameters which form the optimal solution to the optimization problem (14) in parameter estimation. We also assume that the training error on the $n$th sample, $e_n$, is close to 0 as the result of minimizing the training errors. Comparing (12) with (16), we see that if we set $v' = v^*$, $b' = b^*$, and $w' = \bigl( w_1^*, \ldots, w_{n-1}^*, \frac{w_n^*}{m+1}, \ldots, \frac{w_n^*}{m+1} \bigr)$, we have $e_i' = e_i$ for $i = 1, \ldots, n$. Comparing the optimization problem (14) with the optimization problem (15), we then reach a near-optimum solution of the optimization problem (15) with this set of parameters. This is because a very small $e_n$ contributes little to (15) in the form of $m \cdot e_n^2$, which is the only difference in the squared error term. The power model learned from the unbalanced training set then becomes

$$\begin{aligned}
v' \cdot u + b' + \sum_{j=1}^{n+m} w_j' \|u - u_j\|
&= v^* \cdot u + b^* + \sum_{j=1}^{n-1} w_j^* \|u - u_j\| + \sum_{j=n}^{n+m} \frac{w_n^*}{m+1} \|u - u_n\| \\
&= v^* \cdot u + b^* + \sum_{j=1}^{n} w_j^* \|u - u_j\|, \qquad (17)
\end{aligned}$$

which is identical to the model learned from the balanced training set. The example shows that the new model is not sensitive to an unbalanced distribution of the training data in the utilization input space.
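This argument can be checked numerically. The sketch below uses synthetic numbers of our own (reusing the fit_grla, grla_power, and knn_power helpers from the earlier sketches) and duplicates one training sample several times; one would expect the two GR&LA predictions to stay close to each other, while the k-NN prediction can shift whenever the duplicated sample falls among the query's k nearest neighbors without being the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.uniform(0.0, 1.0, size=(50, 3))                  # balanced training utilizations
P = 80 + 80 * U.sum(axis=1) + rng.normal(0, 2, size=50)  # hypothetical powers in the 80-320 W range

U2 = np.vstack([U, np.tile(U[-1], (10, 1))])             # duplicate the last sample 10 times
P2 = np.concatenate([P, np.full(10, P[-1])])

u = rng.uniform(0.0, 1.0, size=3)                        # a new query input
for U_t, P_t in [(U, P), (U2, P2)]:                      # balanced, then unbalanced
    v, b, w = fit_grla(U_t, P_t)
    print(grla_power(u, U_t, v, b, w), knn_power(u, U_t, P_t, k=3))
```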
4 Experimental results

4.1 Server power modeling on benchmarks

We perform experiments to compare the accuracy of our model with that of the existing approaches on seven data sets obtained from running benchmarks on an Intel Romley generation server (SC2600CP). The server has 2 Intel CPUs, each with 8 physical cores running at 2.7 GHz and with hyper-threading enabled (that is, 32 threads in total), 32 GB memory, and a 500 GB hard drive (WDC WD5003ABYX-01WERA1). In almost all the cases, the operating system is Windows Server 2008 R2 Enterprise 64-bit. There is one exception, which is described later.

To collect the resource utilization data, we use Windows Performance Monitor to collect CPU utilization, memory activities (last level cache misses) and disk I/O (disk I/O transfers). In order to monitor the memory activities, we use the plug-in of Intel Performance Counter Monitor integrated into Windows Performance Monitor.1 For the dimensions of memory activities and disk I/O, the resource utilization values are normalized to the range of [0, 1]. With the normalization, the scale of the memory or I/O utilization is similar to that of the CPU utilization, which is important to the distance or proximity calculation.

1 See https://software.intel.com/en-us/articles/intel-performance-counter-monitor for details.


We collect power data through Intel Node Manager,2 a firmware component residing in the Intel server chipset. Intel Node Manager examines power readings from the instrumented platform power sensor. It is configured to aggregate the input power to the power supply within a moving time window. The size of the window is defined by management software.

Four benchmarks, Prime953 (with varying intensity), Linpack,4 SPECpower5 and IOMeter,6 are run on the server. Every 10 s, we collect the 10-s average values of the resource utilization and power data. By running the four individual workloads, we create four data sets using the utilization and power data collected. We create the fifth data set with mixed workloads as follows. In addition to the four individual workloads described above, we add some data points for idle power and some data points for peak power. To obtain the power and utilization data when peak power is reached, we run FIRESTARTER,7 a tool designed specifically to reach the maximum power consumption levels of Intel processors. As the tool is only available for Linux, the operating system is temporarily switched to Ubuntu Server 14.03. To monitor CPU utilization, memory activities, and disk I/O in Linux, we use the command 'mpstat', Intel Performance Counter Monitor, and the command 'iostat'. We mix all the data collected as the fifth data set. In each of the five data sets described above, the data is randomly split into two mutually exclusive parts for training and testing. We also create two unbalanced data sets. The first takes a chronological split of the SPECpower data, in which the workload intensity increases from low to high: the training set contains utilization inputs with relatively low values while the test set contains utilization inputs with relatively high values. The second is a cross-workload data set that uses the SPECpower data for training and the Linpack data for testing. The data sets are used to evaluate our approach and the two existing approaches, the linear power model and k-nearest neighbor regression.

Table 1 Error rates of the approaches on the seven data sets of server power modeling (lowest error rates in bold)

Data set                     Linear (%)   k-NN (%)   GR&LA (%)
Prime95                      4.33         4.69       1.70
Linpack                      0.68         0.81       0.72
SPECpower                    8.13         1.09       1.02
IOMeter                      1.13         0.13       0.11
Mix of workloads             25.51        1.89       1.62
Workload from low to high    23.81        44.08      3.86
Cross workload               15.14        50.83      7.32

Table 1 shows the error rates of the approaches. As we can see from the experimental results, the linear model works reasonably well on individual workloads, achieving error rates lower than 10%. However, when we mix the workloads together (denoted as 'Mix of workloads'), its error rate grows drastically to around 25%. While the approach of k-nearest neighbor regression (denoted as 'k-NN') works accurately on the mix of the workloads, reducing the error rate to 1.89%, it performs even worse than the linear model (error rates around 40-50%) when the training data is unbalanced in the last two data sets. Our new model (denoted as 'GR&LA') is accurate on the individual workloads as well as their mix, and is also robust to unbalanced training data, outperforming the existing methods.

4.2 Server power modeling on real world workloads

We next perform experiments on modeling server power when real world workloads are running. To do so, we place the same server with the same operating system in a datacenter for engineering computing in Intel. Multiple users are able to access the server to run the different workloads described as follows.

1. The server acts as the demo environment for Intel Datacenter Manager, a server management software which manages power and cooling in a datacenter.8 In the demo setting, the software connects to more than 500 stub server simulators located in the same collector process on the physical server, stores their power and thermal data, and performs basic analytics, including data aggregation and histogram analysis, on a hosted web UI. Though the workload runs consistently in the background, including a web server, two Windows services, and a back-end database, its resource utilization is low in general.

2. Some users occasionally run machine learning workloads on the server. Most of the workloads use only one thread and access a small data set. The running time, however, varies depending on the learning algorithm.

To create the data set for the experiment, we choose a time period of more than half an hour which involves a relatively large range of utilization. In the specific time

2 See http://www.intel.com/content/www/us/en/data-center/data-center-management/node-manager-general.html for details.
3 Available at http://www.mersenne.org/download/.
4 Available at http://www.netlib.org/linpack/.
5 Available at https://www.spec.org/power_ssj2008/.
6 Available at http://www.iometer.org/.
7 Available at http://tu-dresden.de/zih/firestarter/.
8 See http://www.intel.com/content/www/us/en/software/intel-dcm-product-detail.html for details.


period, not only is the background workload, Intel Datacenter Manager, persistent, but multiple machine learning workloads run as well. Those machine learning workloads target a categorization problem which classifies a set of tasks in individual cloud computing jobs into fast tasks and slow tasks. During the period, a comparative study is performed on different machine learning algorithms with different parameter settings. The learning algorithms vary, including decision stumps [8], logistic regression [7], support vector machines [9], random forests [10], etc. The implementations of the algorithms include publicly available tools9 and self-developed applications. The application types include Java-based managed code and native applications compiled from code written in C++. In many cases, short-running jobs are submitted first and long-lasting jobs are submitted later.

Similar to the previous experiment of server power modeling with benchmarks, we monitor the CPU utilization, memory activities and disk I/O using Windows Performance Monitor. Power data is collected through Intel Node Manager. The granularity for both the power data and the utilization data is 10 s.

Figure 4 plots the CPU utilization, last level cache misses, and power consumption of the real world workload during the period. As we can see from the figure, the workload intensity ramps up in general with respect to time, but with small bursts from time to time.

[Fig. 4 Characteristics of the real world workload]

We split the data into a training set and a test set in two different ways. The first split is a random one, so that the training samples and the test samples are drawn from the same probability distribution. Alternatively, we split the data chronologically, using the first half for training and the second half for testing. In the chronological split, the data is unbalanced: the utilization for training is relatively lower while that for testing is relatively higher.

Table 2 Error rates of the approaches on the real world workload

Split                  Linear (%)   k-NN (%)   GR&LA (%)
Random split           5.53         8.17       4.76
Chronological split    8.62         46.43      6.55

Table 2 shows the experimental results, in which our proposed method consistently outperforms the two baselines on the real world workload. We notice that the method of k-nearest neighbor regression does not perform well even on the data set resulting from the random split, probably due to the noise resulting from the small bursts of those short-running machine learning jobs.

4.3 Analysis on modeling synthetic functions

To further explore the effectiveness of the new model, we validate the method combining global regression and local approximation on synthetic functions with different levels of noise.

We create four one-dimensional piecewise functions defined on [0, 1]:

$$f(x) = \begin{cases} 160x + 80 & 0 \le x \le 0.25 \\ 105x^2 + 130x + 80 & 0.25 \le x \le 0.75 \\ 110x^2 + 130x + 80 & 0.75 \le x \le 1 \end{cases} \qquad (18)$$

$$g(x) = \begin{cases} 360e^x - 280 & 0 \le x \le 0.25 \\ -210x^2 + 450x + 80 & 0.25 \le x \le 0.75 \\ 80x + 240 & 0.75 \le x \le 1 \end{cases} \qquad (19)$$

$$h(x) = \begin{cases} 360e^x - 280 & 0 \le x \le 0.25 \\ 120x + 150 & 0.25 \le x \le 0.75 \\ 110x^2 + 130x + 80 & 0.75 \le x \le 1 \end{cases} \qquad (20)$$

$$s(x) = \begin{cases} 160x + 80 & 0 \le x \le 0.25 \\ 210x^2 + 100x + 80 & 0.25 \le x \le 0.75 \\ 160x + 160 & 0.75 \le x \le 1 \end{cases} \qquad (21)$$

If we regard $x$ as the utilization input with $0 \le x \le 1$, the four functions simulate four hypothetical one-dimensional power models where the server power peaks at 320 W and idles at 80 W. All the functions are close to linear in each of their piecewise segments, but incorporate different levels of nonlinearity in different regions. For example, the curve of the first function, $f(x)$, rises gently at the beginning and then its rate of increase becomes somewhat steeper, while the curve of the third function, $h(x)$, rises steeply both at the beginning and approaching the end, but climbs relatively slowly in the mid-range.

1000 random samples are used as the training data. To simulate the noisy nature of the power behavior, we inject

9 The tool LibSVM (available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/) is used for support vector machines. The tool Weka (available at http://www.cs.waikato.ac.nz/ml/weka) is used for logistic regression and random forests.


Table 3 Error rates of modeling the synthetic piecewise functions under different noise levels in training (all values in %)

Method   Noise: N(0, 0.05²)                 Noise: N(0, 0.1²)                  Noise: N(0, 0.2²)
         f(x)    g(x)    h(x)    s(x)       f(x)    g(x)    h(x)    s(x)       f(x)    g(x)    h(x)    s(x)
Linear   5.12    7.84    6.23    5.86       5.35    8.12    6.13    5.80       4.99    7.74    6.36    5.60
k-NN     4.01    4.04    4.29    4.03       8.12    8.19    8.29    7.69       16.10   15.65   15.79   16.29
GR&LA    0.91    0.76    0.77    0.88       1.30    1.57    1.51    1.61       2.97    2.79    3.36    2.84

different ratios of noise, $N(0, \sigma^2)$ with $\sigma$ varied, into the training samples. Here we use $N(\mu, \sigma^2)$ to denote a normal distribution with mean $\mu$ and variance $\sigma^2$. We use another 10,000 random samples without noise in testing, in order to see how well the different methods are able to recover the ground truth in a noisy environment.
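To illustrate this protocol, the sketch below generates noisy training data from f(x) in Eq. (18), fits the combined model with the fit_grla helper sketched in Sect. 3.2, and scores it on noise-free test points. Two details are our assumptions rather than the paper's: we read "ratios of noise" as relative (multiplicative) noise, and we use the mean absolute percentage error as the error rate.

```python
import numpy as np

def f(x):
    """The first synthetic piecewise function, Eq. (18)."""
    return np.where(x <= 0.25, 160 * x + 80,
           np.where(x <= 0.75, 105 * x**2 + 130 * x + 80,
                               110 * x**2 + 130 * x + 80))

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 1000)                        # 1000 random training inputs
y_train = f(x_train) * (1 + rng.normal(0, 0.1, 1000))    # inject N(0, 0.1^2) relative noise
x_test = rng.uniform(0, 1, 10000)                        # 10,000 noise-free test inputs

v, b, w = fit_grla(x_train[:, None], y_train)            # fit the combined model
dist = np.abs(x_test[:, None] - x_train[None, :])        # |x_test_i - x_train_j|
pred = v[0] * x_test + b + dist @ w                      # Eq. (2) in one dimension
err = np.mean(np.abs(pred - f(x_test)) / f(x_test))      # mean absolute percentage error
print("error rate: %.2f%%" % (100 * err))
```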
Table 3 shows the error rates of the linear model, k-nearest neighbor regression, and the model combining global regression and local approximation in modeling the four synthetic piecewise functions under different noise levels. As we can see from the experimental results, our method consistently outperforms the two baselines. The k-nearest neighbor regression method does not perform well when a higher level of noise is involved; this is because it learns the noise and then overfits to it. The linear model is robust against the noise, but fails to capture the different extents of nonlinearity. By contrast, our method achieves a low error rate when the noise level is low, and also performs robustly given an increasing noise level.

As system power consumption is typically composed of close-to-linear functions over different intervals with a certain extent of nonlinearity (see, e.g., [4]), the results demonstrate that our method is well-suited to the task of server power modeling.

5 Conclusions

This paper has presented a new approach to modeling server system power using global regression and local approximation which overcomes the limitations of the traditional linear power model and k-nearest neighbor regression. The proposed approach compensates the global regression model with local approximation to capture the subtleness of the nonlinear utilization-power mapping, and retains robustness in cases of unbalanced training data distribution through global regression. Empirical evaluation shows that the proposed model is accurate and robust, outperforming the existing approaches.

There are several issues remaining for future work. First, it is possible to involve more specialized performance counters on resource utilization and build a hierarchical system power model composed of sub-models predicting the power of certain sub-systems given the relevant counters. Even if the power consumption of a sub-system cannot be monitored, the hierarchical system power model and the sub-models can be built with the system power measurements only. Second, it is possible to extend the system power model from the server domain to mobile devices, in which power consumption is an extremely important and sensitive topic. Finally, the server power model can be incorporated in cluster infrastructure management systems for power capacity planning and in cluster orchestration software for power-aware scheduling.

Acknowledgements We thank Rahul Khanna, Honesty Young, and Shilin Wang for their comments on an early draft of the paper.

References

1. Gurumurthi S, Sivasubramaniam A, Irwin MJ, Vijaykrishnan N, Kandemir M, Li T, John LK (2002) Using complete machine simulation for software power estimation: the SoftWatt approach. In: Proceedings of the eighth international symposium on high-performance computer architecture (HPCA-2002), Washington, p 141
2. Economou D, Rivoire S, Kozyrakis C, Ranganathan P (2006) Full-system power analysis and modeling for server environments. In: Proceedings of the workshop on modeling, benchmarking and simulation (MoBS-2006), Boston
3. Dalton D, Vadher A, Laoghaire D, McCarthy A, Steger C (2012) Power profiling and auditing consumption systems and methods. United States Patent Application Publication, Pub. No.: US 2012/0011378
4. Fan X, Weber W, Barroso L (2007) Power provisioning for a warehouse-sized computer. In: Proceedings of the thirty-fourth international symposium on computer architecture (ISCA-2007), San Diego, pp 13-23
5. Kansal A, Zhao F, Liu J, Kothari N, Bhattacharya A (2010) Virtual machine power metering and provisioning. In: Proceedings of the first ACM symposium on cloud computing (SoCC-2010), Indianapolis, pp 39-50
6. Mitchell T (1997) Machine learning. McGraw-Hill, New York
7. Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the twenty-first international conference on machine learning (ICML-2004), Banff
8. Iba W, Langley P (1992) Induction of one-level decision trees. In: Proceedings of the ninth international conference on machine learning (ICML-1992), San Francisco, pp 233-240
9. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273-297
10. Breiman L (2001) Random forests. Mach Learn 45(1):5-32


Xiaoming Du joined Intel in 2010 and now works as a senior engineer in the Software and Services Group. His major research area is power modeling and control of different devices and facilities in the data center. With accumulated experience in machine learning and system-level technology, he is currently involved in system modeling and optimization using machine learning methods.

Cong Li is a senior staff engineer as well as an engineering manager at Intel, leading the development of Intel(R) Data Center Manager, with current interests in machine learning and data-driven approaches to system modeling, management, and optimization. He has co-authored five papers and two book chapters on his work in the area. Before joining Intel in Sept. 2003, Cong worked for more than 2 years as an assistant researcher focusing on natural language processing and machine learning at Microsoft Research Asia, where he published several papers in top-level international conferences and journals. He is also a part-time director of the Children's Computer Center, Children's Palace of the Chinese Welfare Institute, where he used to study computer technologies during his childhood and now pays back by delivering various courses at weekends.
