Metamodeling by Using Multiple Regression Integrated K-Means Clustering Algorithm
All content following this page was uploaded by Murat M. Gunal on 04 June 2014.
Keywords: simulation optimization, K-means clustering, metamodel, multiple regression

Abstract

A metamodel in simulation modeling, also known as a response surface, emulator, or auxiliary model, relates a simulation model’s outputs to its inputs without the need for further experimentation. A metamodel is essentially a regression model and is often described as “the model of a simulation model”. A metamodel may be used for validation and verification, sensitivity or what-if analysis, and optimization of a simulation model. In this study, we propose a new metamodeling approach that integrates multiple regression with the K-means clustering algorithm, intended especially for simulation optimization. Our aim is to evaluate the feasibility of a metamodeling approach in which we create multiple metamodels by clustering the input-output variables of a simulation model according to their similarities. In this approach, first, we run the simulation model of a system; second, using the K-means clustering algorithm, we create a metamodel for each cluster; and third, we seek the minima (or maxima) of each metamodel. We tested our approach on a fictitious call center. We observed that this approach increases the accuracy of a metamodel and decreases the sum of squared errors. These observations give us some insight into the usefulness of clustering in metamodeling for simulation optimization.

1. INTRODUCTION

Coupling the speed of optimization techniques with the flexibility of simulation has given rise to a research area called Simulation Optimization (SimOpt), which has also affected practice [1-3]. In the history of Operational Research, SimOpt methods started to appear in the 1990s, with the basic idea of merging the advantages of simulation modeling with those of optimization. Simulation methods are known for their flexibility in tackling the complexity of systems. Although simulation models require an extensive amount of data, they help decision makers make better decisions. Optimization methods, on the other hand, are not as flexible in modeling complexity, but once they are built, they generate more accurate and faster results than simulation does. Many methods for SimOpt have been developed, mainly in four categories: gradient-based and random search algorithms, evolutionary algorithms and metaheuristics, mathematical programming-based approaches, and statistical search techniques.

In this study, we suggest a four-phase approach to improve the metamodeling process for SimOpt. Our approach includes simulation experimentation, clustering, metamodeling, and optimization. In the first phase, conventional simulation experimentation techniques are used. Note that we assume we have a simulation model of a typical call center system, and we aim to optimize some objective function. In the second phase, we apply a clustering algorithm (K-means) to the simulation inputs. In the third phase, a metamodel is developed for each cluster. Finally, we apply optimization techniques to each metamodel. In contrast to classical metamodeling in SimOpt, we integrate clustering before the “multiple regression” metamodel and generate one metamodel for each cluster instead of one metamodel for all data.

First, we review some SimOpt methods from the literature in Section 2. In Sections 3 and 4, we give brief information about metamodels and clustering. In Section 5, we present our proposed approach and compare it with the classical approach. To show an application of the proposed approach, we experimented with a call center simulation model and showed that clustered metamodels outperform the classical approach.

2. REVIEW OF SIMULATION OPTIMIZATION TECHNIQUES

In SimOpt, a simulation model is used to estimate the performance of a system, and based on this estimate, an optimization algorithm is run to find new input values that will maximize or minimize the estimated system performance. As in conventional optimization models, the input values, or decision variables, are constrained. The iterative nature of this approach generally
makes the simulation model a bottleneck, and therefore the performance of the model is significant.

We review some of the well-known simulation optimization techniques as follows:

Gradient-based and random search algorithms (e.g. stochastic approximation): Gradient-based search methods are optimization techniques that use the gradient of the objective function to find an optimal solution [4]. In each iteration of the algorithm, the values of the decision variables are adjusted so that the simulation produces a lower objective function value. Gradient-based methods work well in high-dimensional spaces provided that these spaces do not have local minima. The drawback is that global minima are likely to remain unfound.

Evolutionary Algorithms and Metaheuristics (e.g. Genetic Algorithms, Tabu Search and Simulated Annealing): Heuristic-based methods strike a balance between exploration and exploitation. This balance permits the identification of local minima, but encourages the discovery of a globally optimal solution [5]. Heuristic techniques generate good candidate solutions when the search space is large and nonlinear.

Mathematical Programming-Based Approaches (e.g. the Sample Path Method): Sample path optimization (also known as stochastic counterpart or sample average approximation; see [6]) runs many simulations first, and then tries to optimize the resulting estimates by using conventional mathematical programming solution algorithms.

Statistical Search Techniques (e.g. Sequential Response Surface Methodology): Response surface methodology (RSM) is a statistical method for fitting a series of regression models to the output of a simulation model [5]. The goal of RSM is to construct a functional relationship between the decision variables and the output to demonstrate how changes in the values of the decision variables affect the output. Relationships constructed from RSM are often called metamodels [7]. RSM usually includes a screening phase that eliminates unimportant variables in the simulation [8]. After the screening phase, linear models are used to build a surface and find the region of optimality. Then, second or higher order models are run to find the optimal values for the decision variables.

The eventual objective of RSM is to determine the optimum operating conditions for the system or to determine a region of the factor space in which operating requirements are satisfied [9]. In the formal application of RSM for optimization and for design of experiments in general, one of the most important steps is factor screening, the initial identification of the “important” parameters, those factors that have the greatest influence on the response. However, in our discussion of optimization of discrete-event simulation models, we assume that this has already been determined. In most discrete-event system applications, this is usually the case, since there are underlying analytic models which can give a rough idea as to the influence of various parameters. For example, in manufacturing systems and telecommunications networks, the analyst knows from queuing network models which routing probabilities and service times have an effect on the performance measures of interest. RSM procedures usually presuppose a more “black box” approach to the problem as stated above, so it is unclear a priori which factors are of importance at all [10]. Additionally, Fu [10] classifies the application of RSM into two main categories: metamodels and sequential procedures.

Metamodels are special cases of RSM representation, and therefore the remainder of this paper uses the term “metamodel” rather than RSM.

3. METAMODEL

A metamodel is a polynomial model that relates the input-output behavior of a simulation model. A metamodel is often a least squares regression model of the form given in Eq. (1):

$$
E(y) = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \dots + \sum_{i=1}^{k} \sum_{j=1}^{k} \beta_{ij} x_i x_j \qquad (1)
$$

where β0, βi, βii, and βij represent regression coefficients, xi (i = 1, …, k) are design variables, and y is the response. The simple form of a metamodel can reveal the general characteristics of behavior in complex simulation models. The objective of a metamodel is to “effectively” relate the output data of a simulation model to the model’s input to aid in the purpose for which the simulation model was developed [11].

Since our aim in this study is to form a metamodel by using clustering algorithms, we review the related literature in the following section. Note that we aim at classifying the input variables according to the similarities between each other, and after clustering the data, there will be n grouped (clustered) data sets and hence n metamodels. We discuss the details of this approach after stating the clustering algorithms.

4. CLUSTERING

Clustering is a way to examine similarities and dissimilarities of observations or objects. Data often fall naturally into groups, or clusters, of observations, where the
characteristics of objects in the same cluster are similar and the characteristics of objects in different clusters are dissimilar. Both the similarity and the dissimilarity should be examinable in a clear and meaningful way. Measures of similarity depend on the application.

Clustering is widespread, and a wealth of clustering algorithms has been developed to solve different problems in specific fields. However, there is no clustering algorithm that can be universally used to solve all problems [12].

K-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible. An example of clustered data points is shown in Figure-1.
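
The K-means iteration described in this section can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this study: it assumes squared Euclidean distance, uses the first k points as initial centroids (a simplification; practical tools use random restarts or smarter seeding), and the function names and the toy data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100):
    """Lloyd-style K-means sketch; returns (labels, centroids)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the sum cannot decrease further
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data (hypothetical): two well-separated groups, listed alternately.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centroids = k_means(points, k=2)
print(labels)
```

On this toy data the points near (1, 1) end up in one cluster and the points near (8, 8) in the other. In the proposed approach the points would be the simulation input vectors rather than two-dimensional coordinates.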
In the second phase, we cluster the simulation inputs. This is an iterative process: in each iteration we check a performance criterion, and if the criterion for clustering is below the acceptable level, we increase the number of clusters. For example, in the K-means clustering method, the performance criterion is the silhouette value.
6. APPLICATION
until a support person is available. All technical support call durations are triangularly distributed with parameters 3, 6, and 18 minutes. After being served, the caller exits the system.

The second type of call is sales. These calls are routed to the sales staff. A sales call duration is triangularly distributed with parameters 4, 15, and 45 minutes. As in technical support, the caller leaves the system after completion of the call. The third type of call, order status, is handled by computers. However, some customers may ask to talk to a real operator. This happens in 15% of calls of this type. Order status call durations are also triangularly distributed, with parameters 2, 3, and 4 minutes. Note that when these calls are placed in a queue for a real operator, they have lower priority than sales calls. An operator can handle these calls with triangularly distributed times (3, 5, 10 minutes). These callers then exit the system.

In our base experimentation, there are 11 technical support employees to answer the technical support calls. Two are only qualified to handle calls for product Type 1, three are only qualified to handle calls for product Type 2, three are only qualified to handle calls for product Type 3, two are only qualified to handle calls for product Types 1 and 3, and one is qualified to handle calls for all three product types. There are four employees to answer the sales calls and those order-status calls that want to speak to a real person.

Our main output variable is the total cost, which includes three types of costs: (1) staffing and resource costs, (2) costs due to poor customer service, and (3) costs of rejected calls. A sales staff member costs $20/hour and a tech-support staff member costs $18-$20/hour, depending on their level of training and flexibility. The second type of cost is the cost incurred by making customers wait on hold. In a call center, at some point callers will start getting angry and the system will start incurring a cost. Although it is difficult to measure this cost, we assumed that for tech calls this point is 3 minutes; for sales calls, it is 1 minute; and for order status calls, it is 2 minutes. Beyond this tolerance point for each call type, the system will incur a cost of 36.8 cents/minute for tech calls, 81.8 cents/minute for sales calls, and 34.6 cents/minute for order status calls. For rejected calls, it is assumed that no more than 5% of incoming calls get a busy signal; any model configuration not meeting this requirement will be regarded as unacceptable. Related to rejected calls, each trunk line incurs a cost of $98/week, so changing the number of trunk lines changes the total cost.

In the optimization part, we used this call center simulation model to find the minimum total cost while holding the percentage of rejected calls to 5 or less. The decision variables and their lower/upper bound values are shown in Table 1.

There are two constraints in the problem definition. First, the number of trunk lines must be between 26 and 50. Second, the call center can accommodate 15 operators at most.

6.2. Steps of the Methodology

Step-1 Specify the decision variables: We choose the six decision variables shown in Table-1 that affect our performance criteria (e.g. the total cost).

Table 1. Decision variables and their lower and upper bounds

Decision Variables     Lower Bound   Upper Bound
New Sales (X1)         0             15
New Tech 1 (X2)        0             15
New Tech 2 (X3)        0             15
New Tech 3 (X4)        0             15
New Tech All (X5)      0             15
Trunk Line (X6)        26            50

Step-2 Simulation Experimentation: For this stage, instead of designing our own experiments, we chose the experiments that are already specified by Arena’s OptQuest. To ease the process, we first ran OptQuest for 500 experiments to find the optimum. As a result, OptQuest found the values in Table 2, with an objective function value of $21,017. The run length for the model is 1000 hours, and we made 10 replications in each experiment.

Table 2. Minimum total cost and values of decision variables via OptQuest

Obj.Func.   X1   X2   X3   X4   X5   X6
$21017      3    0    0    0    3    29

Step-3 Evaluate the Simulation Output: 16 of the 500 experimental results were removed since they were in the infeasible region.

Step-4 Determine the Number of Clusters: In this step, we cluster the inputs of the simulation model by examining the silhouette values. The silhouette plot displays a measure of how close each data point in one cluster is to the points in the neighboring clusters. The silhouette value ranges from +1 to -1. “+1” indicates points that are very distant from the neighboring clusters. “0” indicates points that are not distinctly in one cluster or another. “-1” indicates points that are probably assigned to the wrong cluster. The value is defined as

$$
S(i) = \frac{\min_k b(i,k) - a(i)}{\max\left(a(i),\ \min_k b(i,k)\right)}
$$

where a(i) is the average distance from the ith point to the other points in its cluster, and b(i,k) is the average distance from the ith point to points in another cluster k.
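
To make the silhouette definition in Step-4 concrete, the sketch below computes S(i) directly from a(i) and b(i,k). It is an illustration only: it assumes Euclidean distance, assigns 0 to points in singleton clusters (a convention chosen here, not taken from the paper), and uses hypothetical function names and toy data.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_values(points, labels):
    """Return S(i) for every point, following the a(i) / b(i,k) definition."""
    clusters = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        # Average distance from point i to the members of each cluster,
        # leaving point i itself out of its own cluster.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(euclidean(p, q) for q in members) / len(members)
        own = labels[i]
        others = [d for c, d in mean_dist.items() if c != own]
        if own not in mean_dist or not others:
            values.append(0.0)  # assumed convention: singleton or single cluster
            continue
        a = mean_dist[own]   # a(i): cohesion within the point's own cluster
        b = min(others)      # min over k of b(i,k): nearest neighboring cluster
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data (hypothetical): two pairs of points, four units apart.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
mean_s = sum(s) / len(s)
print(round(mean_s, 3))
```

For the first point, a(i) = 1 and the only neighboring cluster gives b = (4 + √17)/2 ≈ 4.06, so S(i) ≈ 0.754. By symmetry the other three points have the same value. A mean silhouette computed this way is the quantity compared across different numbers of clusters.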
Step-5 Cluster Simulation Inputs: We clustered the simulation inputs using the Euclidean distance between the inputs. Here, we tried up to 8 clusters in order to compare the silhouette plots.

Step-6 Cluster Validation: To validate the clusters, we analyzed the silhouette plots and means. Here, the best plot belongs to the 5-cluster case (mean 0.55), as shown in Figure-3. Therefore we end up with 5 metamodels.

$$
\begin{aligned}
f_3 = {} & 47090.5 - 3869 X_1 + 15649 X_2 - 3294.5 X_3 - 3404.3 X_4 - 2199.4 X_5 \\
& + 474.6 X_1^2 + 90 X_1 X_4 - 1367.5 X_2^2 - 1238.2 X_2 X_3 - 821.7 X_2 X_4 - 1084.9 X_2 X_5 \\
& + 193.8 X_3^2 + 295.9 X_3 X_4 + 250.5 X_3 X_5 + 170.7 X_4^2 + 251 X_4 X_5 + 104.9 X_5^2
\end{aligned}
\qquad (4)
$$

$$
\begin{aligned}
f_4 = {} & 34439.5 - 1970.06 X_1 - 450.6 X_2 - 1767.6 X_3 - 1197.2 X_4 - 2515 X_5 \\
& + 130.49 X_1^2 + 81.6 X_1 X_2 + 156.9 X_1 X_5 + 23.7 X_2^2 + 75.6 X_2 X_3 + 81.7 X_2 X_4 + 67.87 X_2 X_5 \\
& + 171.8 X_3^2 + 146.3 X_3 X_4 + 353.6 X_3 X_5 + 118 X_4^2 + 190.4 X_4 X_5 + 211.9 X_5^2
\end{aligned}
\qquad (5)
$$

$$
\begin{aligned}
f_5 = {} & -4170.44 + 2002.79 X_1 - 1395.23 X_3 + 6725.7 X_4 + 323.3 X_5 + 1151.2 X_6 \\
& + 567.3 X_1^2 + 111.7 X_1 X_3 - 838.7 X_1 X_5 - 86.4 X_1 X_6 + 19.2 X_2^2 - 156.9 X_2 X_4 + 102.7 X_2 X_5 \\
& + 80.9 X_3^2 + 54.8 X_3 X_4 + 315.4 X_3 X_5 - 144.5 X_4^2 - 193.1 X_4 X_6 + 269 X_5^2 - 8.9 X_6^2
\end{aligned}
\qquad (6)
$$

Step-9 Find the Optimum of Each Metamodel: To optimize the objective functions of the five metamodels, we used the Optimization Tool of Matlab [19]. Table-4 shows the minimum total costs and the values of the decision variables.
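
As a quick check on how a metamodel is used in Step-9, the sketch below evaluates the Cluster-5 metamodel of Eq. (6) at the decision variable values reported for Cluster-5 in Table 4, [4;0;0;0;5;29]. The coefficients are transcribed from Eq. (6), reading the flattened superscripts as squares (an assumption about the extracted text); the function name is hypothetical, and this only evaluates the polynomial rather than reproducing the Matlab optimization.

```python
def f5(x1, x2, x3, x4, x5, x6):
    """Cluster-5 metamodel, coefficients transcribed from Eq. (6)."""
    return (
        -4170.44 + 2002.79 * x1 - 1395.23 * x3 + 6725.7 * x4
        + 323.3 * x5 + 1151.2 * x6
        + 567.3 * x1 ** 2 + 111.7 * x1 * x3 - 838.7 * x1 * x5 - 86.4 * x1 * x6
        + 19.2 * x2 ** 2 - 156.9 * x2 * x4 + 102.7 * x2 * x5
        + 80.9 * x3 ** 2 + 54.8 * x3 * x4 + 315.4 * x3 * x5
        - 144.5 * x4 ** 2 - 193.1 * x4 * x6 + 269 * x5 ** 2 - 8.9 * x6 ** 2
    )


# Decision variable values reported for Cluster-5 in Table 4.
cluster5_optimum = (4, 0, 0, 0, 5, 29)
predicted_cost = f5(*cluster5_optimum)
print(round(predicted_cost, 2))  # about 20362.5
```

The result, about $20,362, is within roughly $20 of the $20345 listed for Cluster-5 in Table 4. The small gap is plausibly due to rounding of the printed coefficients.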
Table 4. Objective functions and decision variables’ values

Method      Obj.Func Value   Decision Variables [X1;X2;X3;X4;X5;X6]   Tested Obj. Func.
OptQuest    $21017           [3;0;0;0;3;29]                           -
Cluster-1   $21394           [3.76;0;6.17;4;8;50]                     $28570
Cluster-2   $24842           [3.4;0.9;1.4;1.95;6.83;50]               $26343
Cluster-3   $23994           [3.5;0;3.9;6;0;50]                       $26246
Cluster-4   $21888           [7.5;0;4;2.6;0;41]                       $25171
Cluster-5   $20345           [4;0;0;0;5;29]                           $21986

Step-10 Test the Optimum by Using Simulation Model: We tested the optimum of each cluster obtained in Step-9 by using the Arena simulation model. Note that the minimum total cost belongs to Cluster-5’s metamodel, as shown in Table-4. After running those decision variable values in our call center simulation model, the result is $21646 (“Tested Objective Function” column), which is close to the minimum total cost of $21017 that OptQuest finds.

7. CONCLUSION

Simulation optimization techniques have developed significantly in the last two decades. In this study, we aim to contribute to the literature by proposing a new approach in which the K-means clustering algorithm is integrated into metamodeling. We tested the proposed approach by using a call center simulation model. In this example we used 500 scenarios created by the Arena OptQuest optimization tool, and then clustered the inputs into five groups. The clusters helped to create plausible metamodels with satisfactory and near-optimal R-Square and MSE values. This gives us an indication of the advantage of the proposed approach.

When the solution space is large and searching is costly, the proposed approach can be used as an alternative to heuristic search algorithms. However, to generalize the usefulness of this approach, we aim to study more cases in the future.

8. ACKNOWLEDGMENTS

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of any affiliated organization or government.

9. REFERENCES

[1] Tekin, E. and Sabuncuoglu, I., 2004. “Simulation Optimization: A Comprehensive Review on Theory and Applications”. IIE Transactions, 36:1067-1081.
[2] Law, A. M. and Kelton, W. D., 2001. Simulation Modeling and Analysis, McGraw-Hill, Second Edition, United States.
[3] Fu, M., 2002. “Optimization for Simulation: Theory vs. Practice”, INFORMS Journal on Computing 14(3):192-215.
[4] Waziruddin, S., Brogan, D. C., Reynolds, P. F.: “Coercion through Optimization: A Classification of Optimization Techniques” Proceedings of the 2004 Fall Simulation Interoperability Workshop, Orlando, FL, September 2004.
[5] Carson, Y. and A. Maria: “Simulation Optimization: Methods and Applications” Proceedings of the 1997 Winter Simulation Conference, 1997.
[6] Rubinstein, R. Y. and A. Shapiro. 1993. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. New York: John Wiley & Sons.
[7] Fu, M.: “Simulation Optimization” Proceedings of the 2001 Winter Simulation Conference, 2001.
[8] R. H. Myers and D. C. Montgomery: Response Surface Methodology: Process and Product Optimization Using Designed Experiments, Wiley-Interscience, 2002.
[9] Montgomery, D. C. (1991) Design and Analysis of Experiments, John Wiley & Sons, New York, NY.
[10] Fu, M. C. (1994) Optimization via simulation: A review. Annals of Operations Research, 53, 199–247.
[11] Sargent, R. G.: “Research Issues in Metamodeling” Proceedings of the 1991 Winter Simulation Conference, 1991.
[12] Xu, R.: “Survey of Clustering Algorithms” IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp. 645–678, May 2005.
[13] B. Everitt, S. Landau, and M. Leese, Cluster Analysis. London: Arnold, 2001.
[14] J. Hartigan, Clustering Algorithms. New York: Wiley, 1975.
[15] A. Jain, M. Murty, and P. Flynn, “Data clustering: A review,” ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[16] Kelton, W. D., Sadowski, R. P. and Sturrock, D. T. 2007. Simulation with Arena, McGraw-Hill, Fourth Edition, United States. pp 195-285.
[17] Minitab, http://www.minitab.com, [accessed Jan.2013]
[18] Arena Simulation Software, http://www.arenasimulation.com/, [accessed Jan.2013]
[19] Matlab, http://www.mathworks.com/products/matlab/, [accessed Jan.2013]
Biography
Emre İrfanoglu is pursuing his MSc in Naval Operations Research at the Institute of Naval Science and Engineering. He holds a BSc degree in Industrial Engineering, which he received in 2005 from the Turkish Naval Academy.