Network Failure-Aware Redundant Virtual Machine Placement in A Cloud Data Center
Abstract—Cloud has become a very popular infrastructure for many smart city applications, and a growing number of smart city applications from all over the world are deployed on clouds. However, node failure events in a cloud data center have a negative impact on the performance of smart city applications. Survivable virtual machine placement has been proposed by researchers to enhance service reliability. Because they ignore switch failures, current survivable virtual machine placement approaches cannot achieve the best effect. In this paper, we enhance service reliability by designing a novel network failure-aware redundant virtual machine placement approach for a cloud data center. First, we formulate the network failure-aware redundant virtual machine placement problem as an integer non-linear programming problem and prove that the problem is NP-hard. Second, we propose a heuristic algorithm to solve the problem. Finally, extensive simulation results show the effectiveness of our algorithm.
Index Terms—cloud computing, virtual machine placement, reliability, smart city application
1 INTRODUCTION
machine. In addition, there is no inter-traffic between a backup virtual machine and other active virtual machines before a failure event occurs. However, the backup virtual machine must interact with the other active virtual machines after a failure event occurs. Without considering this condition, existing virtual machine placement approaches consume too much upper-layer network resource. The core layer, which is the bottleneck of a fat-tree data center, would be over-utilized.
In this paper, we investigate the network failure-aware redundant virtual machine placement problem in a cloud data center. Our contributions are summarized as follows:
1) By mining the characteristics of the data center network, we specifically investigate how to apply the replication technique to enhance service reliability. We formulate the network failure-aware redundant virtual machine placement problem as an Integer Non-Linear Programming (INLP) problem.
2) We prove that the network failure-aware redundant virtual machine placement problem is NP-hard. An efficient heuristic algorithm, EFAP (Edge switch Failure-Aware Placement), is presented for the problem.
3) Extensive simulation results show that our algorithm can ensure service reliability and avoid adding burden to the bottleneck of the data center.
The rest of this paper is organized as follows. Section 2 describes related work. Section 3 presents the system model and the challenges in network failure-aware redundant virtual machine placement. The problem formulation is given in Section 4. Section 5 presents the details of the proposed algorithm. Section 6 shows experiment results, and Section 7 concludes the paper.
2 RELATED WORK
Many notable cloud service reliability enhancement approaches have been proposed by researchers.
The checkpoint mechanism achieves fault tolerance by saving the virtual machine state as a checkpoint image periodically during failure-free execution [16]. A checkpoint image contains the whole recovery information. Because of the dynamic nature of Infrastructure as a Service clouds, it is hard to design an efficient checkpoint mechanism. To address the problem, [17] proposed an optimal checkpoint mechanism aiming at minimizing the performance overhead and storage resource consumption. In order to efficiently save the complete running state of the application, the proposed checkpoint mechanism leverages new functions, such as disk-image multi-snapshotting and inside checkpoint protocols. [18] presented an incremental checkpoint mechanism for the cloud data center. To reduce the network resource consumption and the time needed to take a checkpoint image, only modifications relative to the latest stored checkpoint image are checkpointed. [19, 20] presented a distributed checkpoint image storage system for fat-tree data centers to reduce the upper-layer data center network resource consumption. In the distributed checkpoint image storage system, the checkpoint images are stored on nearby host servers.
The checkpoint mechanism is suitable for large-scale computation-intensive services. To address the problem, replication is another mechanism that can be employed.
Taking the quality of service requirements of applications into consideration, [21] presented two optimal data replication approaches for cloud computing environments. [22] proposed a k-fault-tolerant virtual machine placement approach. The proposed approach can minimize the number of running servers while satisfying the quality of service requirements under any k physical server failures. Only atomic services are considered in these approaches. [14, 15, 23] proposed reliability enhancement approaches for complex applications. [14] proposed a reliable virtual data center mapping algorithm. Considering both the failure characteristics of hardware and the impact of individual failures on service reliability, the mapping algorithm tries to achieve high reliability and low cost. To improve the reliability and performance of applications deployed in a cloud data center, [23] presented a structural constraint-aware virtual machine placement algorithm. The service availability is formulated as a combination of collocation/anti-collocation constraints. In a cloud data center, the Infrastructure as a Service provider intends to place a group of virtual machines with high inter-traffic in the same subnet to reduce intra-network traffic. However, the service reliability cannot be ensured. A novel virtual machine placement algorithm is proposed to minimize the network traffic under reliability constraints. [15] proposed an availability-aware virtual machine mapping algorithm to improve the network resource utilization and service reliability for multi-tier applications. These approaches do not employ the replication mechanism.
In [11], a backup virtual machine is created for each critical virtual machine to enhance survivability. When a server fails or a virtual machine failure occurs, the affected service is switched over to the backup virtual machine. The group of correlated virtual machines and their backups composes a survivable virtual machine set. An efficient algorithm is proposed to map the survivable virtual machine set to a cloud data center. [24] identified the significant components of a complex application and determined the most suitable reliability enhancement strategy for each identified component. The significance of each component is calculated based on the invocation frequencies and invocation structures. However, the traditional redundant virtual machine placement approaches may become useless when a network failure occurs. We will address the above-mentioned problems in this paper.
on host servers. Several binary variables are defined for virtual machine placement:

$$X_{vm}^{k}=\begin{cases}1, & \text{if virtual machine } vm \text{ is placed on } hs_k\\ 0, & \text{otherwise}\end{cases} \quad (2)$$

$$Y_{vm}^{k}=\begin{cases}1, & \text{if virtual machine } vm \text{ is placed on } sub_k\\ 0, & \text{otherwise}\end{cases} \quad (3)$$

$X_{vm}^{k}$ is equal to 1 when virtual machine $vm$ is placed on host server $hs_k$; otherwise, $X_{vm}^{k}$ is equal to 0. $Y_{vm}^{k}$ is equal to 1 when virtual machine $vm$ is placed on a host server in subnet $sub_k$. The following should be satisfied for any $vm_{v_p,i}$:

$$\sum_{j} X_{vm_{v_p,i}^{B}}^{j} = 1 \quad (5)$$

A host server should have enough available computing resource to allocate to the virtual machines placed on it. These constraints can be expressed by the following:

$$r_{vm_{v_p,i}}^{cpu} X_{vm_{v_p,i}}^{j} \le c_{j}^{cpu} \quad (6)$$

$$r_{vm_{v_p,i}^{B}}^{cpu} X_{vm_{v_p,i}^{B}}^{j} \le c_{j}^{cpu} \quad (9)$$

$$r_{vm_{v_p,i}^{B}}^{mem} X_{vm_{v_p,i}^{B}}^{j} \le c_{j}^{mem} \quad (10)$$

$$r_{vm_{v_p,i}^{B}}^{disk} X_{vm_{v_p,i}^{B}}^{j} \le c_{j}^{disk} \quad (11)$$

All backup virtual machines allocated for a specific service should be placed on host servers in different subnets. When a primary virtual machine becomes unavailable, the corresponding sub-service is quickly switched to the backup virtual machine, which then becomes a primary virtual machine. Otherwise, two primary virtual machines could end up on host servers in the same subnet after failure events occur. This constraint can be expressed by the following:

$$Y_{vm_{v_p,i}^{B}}^{k} \cdot Y_{vm_{v_p,j}^{B}}^{k} = 0 \quad (14)$$

4.2 Virtual Machine Interaction Cost
The virtual machines interact with each other to provide a complex service, and a large amount of data is transferred between the virtual machines over the data center network. Different virtual machine placement strategies result in different network resource consumption, which we aim to minimize. Three types of communication consume data center network resource.

$$C_p(vm_{v_p,i}, vm_{v_p,j}) = \sum_{m}\sum_{n} X_{vm_{v_p,i}}^{m} X_{vm_{v_p,j}}^{n}\, dt(vm_{v_p,i}, vm_{v_p,j})\, ds(hs_m, hs_n) \quad (16)$$

$$C_P = \sum_{i}\sum_{j} C_p(vm_{v_p,i}, vm_{v_p,j}) \quad (17)$$
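Under a fixed placement, Eqs. (16)-(17) collapse to summing $dt \times ds$ over VM pairs, because each placement variable $X$ selects exactly one host per virtual machine. The following is a minimal sketch of that computation, not the paper's implementation; `placement`, `dt`, and `ds` are illustrative names, and hosts are labeled by assumed (pod, subnet) tuples:

```python
# Sketch of the total pairwise interaction cost C_P of Eqs. (16)-(17).
# Since each VM is placed on exactly one host, the double sum over hosts
# m, n in Eq. (16) reduces to a single distance lookup per VM pair.

def primary_interaction_cost(placement, dt, ds):
    """C_P: sum over VM pairs of dt(vm_i, vm_j) * ds(host(vm_i), host(vm_j)).

    placement: {vm_id: host_label}; dt: {(vm_a, vm_b): data_rate};
    ds: distance function between two host labels (Eq. (22))."""
    vms = list(placement)
    total = 0.0
    for a in range(len(vms)):
        for b in range(a + 1, len(vms)):
            i, j = vms[a], vms[b]
            rate = dt.get((i, j), dt.get((j, i), 0.0))
            total += rate * ds(placement[i], placement[j])
    return total

# Toy distance per Eq. (22), with hosts labeled (pod, subnet) for brevity.
def toy_ds(m, n):
    if m == n:
        return 0
    if m[0] == n[0] and m[1] == n[1]:
        return 2
    if m[0] == n[0]:
        return 4
    return 6

# Example: vm1-vm2 share a pod (distance 4), vm2-vm3 span pods (distance 6),
# so C_P = 4 * 1.2 + 6 * 0.8 = 9.6.
placement = {"vm1": (0, 0), "vm2": (0, 1), "vm3": (1, 0)}
dt = {("vm1", "vm2"): 1.2, ("vm2", "vm3"): 0.8}
print(primary_interaction_cost(placement, dt, toy_ds))
```

The backup costs $C_B$ and $C_V$ of the following equations have the same shape, with the backup VM's host substituted for one side of each pair.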
The network resource consumption for synchronizing a primary virtual machine with its backup is calculated by the following:

$$C_b(vm_{v_p,i}) = \sum_{m}\sum_{n} X_{vm_{v_p,i}}^{m} X_{vm_{v_p,i}^{B}}^{n}\, dt(vm_{v_p,i}, vm_{v_p,i}^{B})\, ds(hs_m, hs_n) \quad (18)$$

$$C_B = \sum_{i} C_b(vm_{v_p,i}) \quad (19)$$

When a failure event occurs, the interrupted sub-application is switched to the backup virtual machine. $C_V$ denotes the network resource consumption between a backup virtual machine and other virtual machines after a failure event occurs.

$$C_v(vm_{v_p,i}^{B}, vm) = \sum_{m}\sum_{n} X_{vm_{v_p,i}^{B}}^{m} X_{vm}^{n}\, dt(vm_{v_p,i}, vm)\, ds(hs_m, hs_n) \quad (20)$$

$$C_V = \sum_{i} \sum_{vm \in VM(vm_{v_p,i}^{B})} C_v(vm_{v_p,i}^{B}, vm) \quad (21)$$

where $VM(vm_{v_p,i}^{B})$ denotes the virtual machines that need to interact with $vm_{v_p,i}^{B}$ after failure events occur.

In a fat-tree cloud data center, no data needs to be transferred over the data center network when two virtual machines are placed on the same host server. When two host servers are in the same subnet, the data is transferred by one edge switch; the distance is 2. When the two host servers are in different subnets of the same pod, the data is transferred by two edge switches and an aggregation switch; the distance is 4. When the two host servers are in different pods, the data is transferred by two edge switches, two aggregation switches, and a core switch; the distance is 6. Therefore, the distance between two host servers is defined as follows:

$$ds(hs_m, hs_n)=\begin{cases}0, & \text{if } m=n\\ 2, & \text{if } hs_m \text{ and } hs_n \text{ are in the same subnet}\\ 4, & \text{if } hs_m \text{ and } hs_n \text{ are in the same pod}\\ 6, & \text{otherwise}\end{cases} \quad (22)$$

If the virtual machine placement problem involving only primary virtual machines can be solved, then it can be used to solve the multidimensional packing problem [25, 26]. Consider a special case of the network failure-aware redundant virtual machine placement problem: there are m host servers in the data center and n primary virtual machines. Suppose there is no critical sub-application; therefore, the number of backup virtual machines is 0. The problem is to place all virtual machines on host servers with the goal of minimizing the network resource consumption. It is easy to see that an algorithm for solving this virtual machine placement problem can be used to solve the multidimensional packing problem. The multidimensional packing problem is NP-hard [27, 28]. The proof outlined above shows that even the simpler case involving only primary virtual machines makes the problem NP-hard. Hence the network failure-aware redundant virtual machine placement problem is NP-hard. End of proof.
There are tens of thousands of host servers in a data center. Therefore, it is impractical to iterate over all primary and backup virtual machine placement strategies. We propose a heuristic algorithm to solve the INLP in this paper.
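The distance function of Eq. (22) follows directly from a host's position in the fat-tree. A minimal Python sketch, assuming (for illustration only; the paper indexes hosts simply as $hs_m$) that each host is labeled by a (pod, subnet, slot) tuple:

```python
def ds(host_m, host_n):
    """Hop-count distance of Eq. (22) between two fat-tree hosts.

    Hosts are (pod, subnet, slot) tuples -- a labeling assumed here for
    illustration. Same host: 0 (no network transfer). Same subnet: the
    data crosses 1 edge switch, distance 2. Same pod, different subnets:
    2 edge switches plus 1 aggregation switch, distance 4. Different
    pods: 2 edge, 2 aggregation, and 1 core switch, distance 6."""
    if host_m == host_n:
        return 0
    if host_m[:2] == host_n[:2]:   # same pod and same subnet
        return 2
    if host_m[0] == host_n[0]:     # same pod, different subnets
        return 4
    return 6

# Example: two hosts in pod 0 but in different subnets.
print(ds((0, 1, 3), (0, 2, 5)))  # 4
```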
5 PROPOSED ALGORITHM
Our goal is to calculate the optimal primary and backup virtual machine placement strategy while minimizing the total network resource consumption. Based on the topology of the data center network, a heuristic algorithm is proposed to solve the problem. The communication distance is shorter when two physical servers are in the same pod. Therefore, we attempt to place the virtual machines with high inter-traffic in the same pod. The algorithms are illustrated in Algorithm 1 and Algorithm 2. As shown in Algorithm 1, we first sort all links by inter-traffic size and add all pods to a candidate pod set. Second, we iterate over all pods and determine the optimal pod, i.e., the one that can host the largest number of virtual machines. Third, the virtual machines that can be placed on the current optimal pod are removed from the virtual machine set, and the current optimal pod is removed from the candidate pod set. These steps are repeated until all virtual machines have been placed. As shown in Algorithm 2, we iterate over the links in the link list. The links in the list have already been sorted in descending order of data rate (in Algorithm 1). A link is in the list if at least one node (denoting a virtual machine) of the link has not been placed. Then, we select available subnet(s) for the unplaced node(s). The function find is used to find an available host server. These steps are repeated until there is no available host server in the current pod. "Available" denotes that constraints (6)-(15) are satisfied.
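The outer loop of Algorithm 1 can be sketched as follows. This is a simplified illustration of the steps described above, not the authors' implementation: capacity checking is reduced to a per-pod VM slot count, standing in for constraints (6)-(15), and all names are hypothetical.

```python
# Simplified sketch of Algorithm 1's pod-selection loop: repeatedly pick
# the candidate pod that can host the most still-unplaced virtual
# machines, filling it along the traffic-sorted link list so VM pairs
# with high inter-traffic land in the same pod.

def greedy_pod_placement(vms, links, pod_slots):
    """vms: iterable of VM ids; links: {(vm_a, vm_b): data_rate};
    pod_slots: {pod_id: free VM slots}. Returns {vm_id: pod_id}."""
    # Sort links in descending order of inter-traffic, as in Algorithm 1.
    ordered = sorted(links.items(), key=lambda kv: kv[1], reverse=True)
    unplaced = set(vms)
    candidates = dict(pod_slots)        # candidate pod set
    assignment = {}
    while unplaced and candidates:
        # Optimal pod: the one able to host the most unplaced VMs.
        best_pod = max(candidates,
                       key=lambda p: min(candidates[p], len(unplaced)))
        capacity = candidates.pop(best_pod)   # remove from candidates
        # Place VMs following the traffic-sorted link list.
        for (a, b), _rate in ordered:
            for vm in (a, b):
                if vm in unplaced and capacity > 0:
                    assignment[vm] = best_pod
                    unplaced.discard(vm)
                    capacity -= 1
        # VMs that appear on no link fill any leftover capacity.
        while unplaced and capacity > 0:
            assignment[unplaced.pop()] = best_pod
            capacity -= 1
    return assignment

# Example: "a" and "b" share the heaviest link, so they land in one pod.
result = greedy_pod_placement({"a", "b", "c"},
                              {("a", "b"): 2.0, ("b", "c"): 1.0},
                              {0: 2, 1: 2})
print(result)
```

Algorithm 2's subnet-level placement inside the chosen pod would follow the same pattern one level down, which is omitted here.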
6 PERFORMANCE EVALUATION
6.1 Experiment Setup
In our experiment, the network topology is a 32-port fat-tree. The fat-tree data center network consists of 32 pods, with 16 subnets in each pod and 16 host servers in each subnet. We generate 100 applications to be deployed in the cloud data center. The number of sub-applications is uniformly distributed in [7, 8] for each application, and the sub-applications are serially connected. We randomly add 1 backup virtual machine for each application. The execution time of each sub-stage is 5 min. The data interaction rate falls in [0.8 MB/min, 1.2 MB/min]. 6000 tasks are generated for the applications.
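A workload with the distributions stated above can be generated in a few lines. The sketch below follows the described setup (serial chains of 7 or 8 sub-applications, one backup per application, link rates uniform in [0.8, 1.2] MB/min); the data structure and field names are illustrative, not the authors' simulator.

```python
import random

# Sketch of the experiment's workload generator: 100 applications, each
# a serial chain of 7 or 8 sub-applications (one VM per sub-application).
# One randomly chosen sub-application receives a backup VM, and each
# link in the chain carries a data rate drawn uniformly from
# [0.8, 1.2] MB/min.

def generate_applications(n_apps=100, seed=0):
    rng = random.Random(seed)
    apps = []
    for app_id in range(n_apps):
        n_sub = rng.randint(7, 8)                  # inclusive bounds
        vms = [f"app{app_id}-vm{i}" for i in range(n_sub)]
        # Serial chain: each VM talks only to its successor.
        links = {(vms[i], vms[i + 1]): rng.uniform(0.8, 1.2)
                 for i in range(n_sub - 1)}
        backup_of = rng.choice(vms)                # 1 backup VM per app
        apps.append({"vms": vms, "links": links, "backup_of": backup_of})
    return apps

apps = generate_applications()
print(len(apps))  # 100
```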
We compare our proposed algorithm with two other algorithms:
(1) RANDOM. The target host server is randomly selected. A first-fit strategy is employed to place the primary and backup virtual machines.
(2) HFAP. Host server failure-aware redundant virtual machine placement, proposed in [11]. HFAP only considers host server failures in virtual machine placement.
All algorithms are evaluated using the following metrics: (1) total lost time caused by failures; (2) total delayed tasks, where a task is delayed if a primary virtual machine and its backup virtual machine fail at the same time during the task execution; (3) total data transferred by the edge switches; (4) total data transferred by the aggregation switches; (5) total data transferred by the core switches.
Fig. 3 The number of delayed tasks under different failure rate settings. X-axis denotes the failure rate, and Y-axis denotes the number of delayed tasks.
Fig. 4 The total lost time under different failure rate settings. X-axis denotes the failure rate, and Y-axis denotes the total lost time.
Fig. 6 The aggregation layer network resource consumption under different failure rate settings. X-axis denotes the failure rate, and Y-axis denotes the aggregation layer network resource consumption.
Fig. 7 The edge layer network resource consumption under different failure rate settings. X-axis denotes the failure rate, and Y-axis denotes the edge layer network resource consumption.
As shown in Fig. 3 and Fig. 4, the total delayed task number and the total lost time of all three algorithms grow as the edge switch failure rate increases from 0 to 0.01. EFAP consistently outperforms RANDOM and HFAP under all edge switch failure rate settings. HFAP performs badly in all settings. The reason is that only host server failure is considered by HFAP. There is frequent interaction between a primary virtual machine and its backup virtual machine, because the backup virtual machine is synchronized with the active virtual machine periodically. Therefore, HFAP attempts to place a primary virtual machine and its backup virtual machine in the same subnet but on different host servers. Many tasks may then be interrupted at a critical stage when an edge switch fails. A task fails when it is interrupted at a critical stage: instead of being restarted from the last sub-stage, the execution time of the task is lost.
As illustrated in Figs. 5-7, RANDOM consumes the most network resource, and HFAP consumes the least. That is because HFAP attempts to place the virtual machines with high inter-traffic on host servers in the same subnet; the data transfer therefore does not consume many core layer and aggregation layer network resources. However, the reliability cannot be ensured by HFAP. By spreading virtual machines among host servers in the same pod but in different subnets, EFAP consumes very little core layer network resource. Therefore, EFAP can avoid adding burden to the bottleneck of the data center while still enhancing service reliability.
7 CONCLUSION
In this paper, we studied the network failure-aware redundant virtual machine placement problem with consideration of data center network resource consumption. We formulated the network failure-aware redundant virtual machine placement problem as an integer non-linear programming problem and proved that the problem is NP-hard. An efficient heuristic algorithm is proposed to solve the problem. Extensive simulations show the effectiveness of our algorithm. We will experiment with real-world workloads in our future work.
ACKNOWLEDGMENT
The work presented in this study is supported by NSFC (61602054), Beijing Natural Science Foundation (4174100), and NSFC (61571066).
REFERENCES
[1] M. D. Dikaiakos, D. Katsaros, P. Mehra, G. Pallis, and A. Vakali, "Cloud computing: distributed internet computing for IT and scientific research," IEEE Internet Computing, vol. 13, pp. 10-13, 2009.
[2] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, et al., "A view of cloud computing," Communications of the ACM, vol. 53, pp. 50-58, 2010.
[3] J. Chen, C. Wang, B. B. Zhou, L. Sun, Y. C. Lee, and A. Y. Zomaya, "Tradeoffs between profit and customer satisfaction for service provisioning in the cloud," in Proceedings of the 20th International Symposium on High Performance Distributed Computing, 2011.
[4] S. Yi, A. Andrzejak, and D. Kondo, "Monetary cost-aware checkpointing and migration on Amazon cloud spot instances," IEEE Transactions on Services Computing, vol. 5, pp. 512-524, 2012.
[5] A. Zhou, Q. Sun, L. Sun, J. Li, and F. Yang, "Maximizing the profits of cloud service providers via dynamic virtual resource renting approach," EURASIP Journal on Wireless Communications and Networking, vol. 2015, p. 71, 2015.
[6] W. Guo, K. Chen, Y. Wu, and W. Zheng, "Bidding for highly available services with low price in spot instance market," in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015.
[7] X. He, P. Shenoy, R. Sitaraman, and D. Irwin, "Cutting the cost of hosting online services using cloud spot markets," in Proceedings of the 25th International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2015.
[8] J. Liu, S. Wang, A. Zhou, and F. Yang, "PFT-CCKP: A proactive fault tolerance mechanism for data center network," in 2015 IEEE 23rd International Symposium on Quality of Service (IWQoS), 2015, pp. 79-80.
[9] A. Zhou, S. Wang, B. Cheng, Z. Zheng, F. Yang, R. Chang, et al., "Cloud service reliability enhancement via virtual machine placement optimization," IEEE Transactions on Services Computing, vol. PP, pp. 1-1, 2016.
[10] S. Wang, Z. Liu, Q. Sun, H. Zou, and F. Yang, "Towards an accurate evaluation of quality of cloud service in service-oriented cloud computing," Journal of Intelligent Manufacturing, vol. 25, pp. 283-291, 2014.
[11] J. Xu, J. Tang, K. Kwiat, W. Zhang, and G. Xue, "Survivable virtual infrastructure mapping in virtualized data centers," in 2012 IEEE 5th International Conference on Cloud Computing (CLOUD), 2012, pp. 196-203.
[12] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," ACM SIGCOMM Computer Communication Review, vol. 38, pp. 63-74, 2008.
[13] S. Kandula, J. Padhye, and P. Bahl, "Flyways to de-congest data center networks," 2009.
[14] M. Shen, X. Ke, F. Li, F. Li, L. Zhu, and L. Guan, "Availability-aware virtual network embedding for multi-tier applications in cloud networks," in Proceedings of the 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, 2015.
[15] X. Li and C. Qian, "Traffic and failure aware VM placement for multi-tenant cloud computing," in 2015 IEEE 23rd International Symposium on Quality of Service (IWQoS), 2015, pp. 41-50.
[16] T. Knauth and C. Fetzer, "VeCycle: Recycling VM checkpoints