
CosMig: Modeling the Impact of Reconfiguration in a Cloud

Akshat Verma (IBM Research-India), Gautam Kumar (IIT Kharagpur), Ricardo Koller (Florida International University), Aritra Sen (IBM Research-India)

Abstract—Clouds allow enterprises to increase or decrease their resource allocation on demand in response to changes in workload intensity. Virtualization is one of the building blocks for cloud computing and provides the mechanisms to implement the dynamic allocation of resources. These dynamic reconfiguration actions lead to a performance impact for the duration of the reconfiguration. In this paper, we model the cost of reconfiguring a cloud-based IT infrastructure in response to workload variations. We show that maintaining a cloud requires frequent reconfigurations necessitating both VM resizing and VM live migration, with live migration dominating reconfiguration costs. We design the CosMig model to predict the duration of live migration and its impact on application performance. Our model is based on parameters that are typically monitored in enterprise data centers. Further, the model faithfully captures the impact of shared resources in a virtualized environment. We experimentally validate the accuracy and effectiveness of CosMig using micro-benchmarks and representative applications.

I. INTRODUCTION
The ability of clouds to provide resources in an online manner has led to their successful adoption as a compute platform for the future [5], [7], [10]. Clouds have a distinct advantage over traditional data centers in providing elasticity and achieving higher resource utilization. A customer uses the elasticity in a cloud to increase or decrease the amount of resources it wants to reserve for itself. For a cloud provider, elasticity is the ability to seamlessly move resources from one customer to another in response to variation in demand, thus allowing the cloud to operate at high resource utilization. The provider of a cloud can provision resources that are substantially lower than the sum of the peaks of the individual customer workloads. Virtualization is the key technology that enables both elasticity and high resource utilization in clouds. To achieve these twin objectives, clouds host diverse applications as virtual machines on shared physical servers to achieve higher resource utilization. Further, the resources assigned to any set of applications hosted on the cloud, or even to the complete cloud, can shrink or expand based on workload intensity. The virtualization layer provides the required isolation between applications running on the same physical server. Frequent reconfiguration, or dynamic resource allocation from a shared pool, is a defining attribute of a cloud. Reconfiguration in a cloud may lead to performance issues for the hosted applications. Reconfiguration actions consume resources and may lead to resource contention for the applications. Hence, it is important to understand (i) the frequency of reconfiguration in a typical cloud environment and (ii) the impact of such reconfiguration on the hosted applications.

In this work, we study the impact of reconfiguration in a cloud setting and present a model to characterize it. In this paper, we use the terms virtualized data center and cloud interchangeably to denote data centers that support dynamic resource allocation for virtual machines. Reconfiguration of resources in a shared cloud is achieved using either (i) dynamic Virtual Machine (VM) resizing and VM live migration or (ii) creation of new VM instances. Dynamic VM resizing allows the resources assigned to a VM to be changed on the fly, and live migration allows a VM to move to a different physical server due to a resource bottleneck on its server or for server consolidation. Increasing or decreasing the number of VMs also provides the same functionality, albeit at a coarser granularity. Such horizontal scaling is only applicable to clustered applications with a gateway that distributes requests to the nodes in the cluster. Further, horizontal scaling incurs additional cost due to an increase in the number of software licenses. Finally, data center management cost is typically directly proportional to the number of VMs, and horizontal scaling leads to increased labour cost. As a consequence, VM resizing and live migration have been preferred as tools for dynamic consolidation in virtualized data centers [19], [18], [3], [12], [8]. Hence, in this study, we focus on characterizing the impact of live VM resizing and live migration only, which provide a more flexible and cost-effective alternative for dynamic resource allocation in virtualized data centers or clouds.
A. Motivation
[Fig. 1 plots the migration time (seconds) of the bt, is, ua, sp, lu, and daxpy benchmarks.]
Fig. 1. Variation in Duration of Live Migration for 6 benchmark applications at different operating points

Dynamic resource reconfiguration using VM resizing and live migration comes with associated costs. It has been observed that the cost of live migration is significant and needs to be factored in during dynamic resource allocation [18], [9], [3], [8].

Existing approaches for dynamic consolidation assume the cost to be a function of active memory [18], other low-level VM parameters like dirty rate [1], or application type [9]. However, all these models are oblivious to other co-located VMs and server utilization. We first investigated whether such simple models can accurately estimate the impact of live migration. We observed that the duration of migration varies based on the nature of the application (different applications have different migration durations in Fig. 1). Further, even for an application running an identical workload, we varied the workload in a co-located virtual machine running on the shared physical server and observed that the duration of live migration varies by a factor of 2 (e.g., the bt application in Fig. 1). Hence, a usable model for live migration needs to be application-aware and take into account other co-located VMs and physical server utilization. A practical and accurate model of live migration is needed to complement existing dynamic consolidation approaches and provide an estimate of the cost of reconfiguration in both traditional shared data centers and emerging cloud-based data centers.
B. Contribution
In this work, we address the following questions: (i) How frequent is a reconfiguration action in a typical cloud? (ii) How can we model the cost of a reconfiguration? Using a trace analysis on a large data center, we show that frequent reconfigurations will be very common in enterprise clouds. We also establish that avoiding frequent reconfigurations by allocating resources infrequently may lead to significant resource wastage, one of the key problems in traditional data centers that clouds promise to correct. In order to answer the second question, we complement insights drawn from our earlier study [21] with new observations to design the CosMig model, which predicts (i) the time taken to complete a VM migration, (ii) the performance impact of the migration on the VM being migrated, and (iii) the performance impact of the migration on other hosted VMs. Our model is based solely on CPU utilization and active memory, two parameters that are monitored in most large data centers. Using a carefully selected set of micro-benchmarks and representative applications, we show that CosMig is able to accurately estimate the impact of live migration in a cloud environment.
II. BACKGROUND
We now present a background of the reconfiguration mechanisms and the modeling challenges.
A. Reconfiguration Mechanisms in a Cloud
The reconfiguration mechanisms considered in this paper are (i) dynamic VM resizing or Dynamic Logical Partition (DLPAR) resizing and (ii) live VM migration. Modern hypervisors allow fairly low overhead DLPAR resizing, which allows the resource entitlements of a running VM to be changed. Our experiments on DLPAR resizing on both the IBM Power6 platform with the pHyp hypervisor and an IBM HS21 BladeServer with VMWare ESX 3 showed that the duration of VM resizing was less than 1 second with no perceptible performance impact. Hence, we focus only on the impact of live migration in this study.

The most important aspect in terms of the performance impact of a live migration activity is the copying of in-memory state from the source hypervisor to the target hypervisor [13], [4], [11]. The copying of in-memory state consists of the following phases:
1) Pre-Copy Phase: The applications keep running in the Pre-Copy phase, which works in rounds. In the first round, all the active pages in memory are copied to the target server. In any subsequent round, all pages made dirty in the previous round are copied. The phase typically terminates when either the number of dirty pages is small (less than some constant C_small) or the decrease in the number of dirty pages between two subsequent iterations is small, i.e., no progress (less than C_noProg).
2) Stop and Copy Phase: The application is stopped in this phase and all the remaining dirty pages are copied to the target server to complete the migration.
The Stop and Copy Phase is small for typical applications, usually less than 1 second. The Pre-Copy Phase is much longer [13], [4] and increases with the size of the memory being copied. Live migration techniques use a technique called ballooning [22] to reclaim all idle memory before migration [15], [4]. This minimizes the amount of memory copied from the source server to the target server during Pre-Copy.
B. Model Parameters
We use the following parameters to determine the performance impact of migrating a VM VM_i.
1) Duration (T(i)): Time taken for the migration.
2) Self-Impact (Π_s(i)): Ratio between the drop in throughput during migration and the throughput without migration of the application on VM_i.
3) Co-Impact (Π_c(j)): Ratio between the drop in throughput for a co-located VM VM_j during migration and the throughput of VM_j without migration.
The first parameter determines the duration for which the performance impact of the reconfiguration would be observed. The second and third parameters capture the quantitative impact on application performance during the migration.
C. Baseline Model
We first present a baseline model based on the design of live migration. The duration of migration for a VM VM_i depends on the number of pages that need to be copied (or active memory AM_i) and the peak network bandwidth (MBw) between the source and target servers. Further, in order to do the Pre-Copy, all pages of VM_i are marked as read-only to allow the hypervisor to update the list of dirty pages. Hence, the performance impact may depend on the write rate WR_i of an application. Finally, the number of Pre-Copy rounds R required by an application depends on the number of dirty (or unique) pages DR_{i,r} it writes in the previous round r. We use the above parameters to define the following baseline model to capture the impact of migration.

The time T(i) taken to migrate VM_i equals the sum of the time T_r(i) spent in each round r of copying and can be captured as

T(i) = Σ_{r=0}^{R} T_r(i),   T_0(i) = AM_i / MBw,   T_{r+1}(i) = DR_{i,r} / MBw      (1)

The last round R is the first instance when the number of dirty pages is either small or very close to the number of dirty pages in the previous round, i.e.,

DR_{i,R} < C_small   or   DR_{i,R-1} − DR_{i,R} ≤ C_noProg      (2)

Similarly, the performance impact Π_s(i) of migration on the VM being migrated should be governed primarily by the penalty incurred in handling faults caused by writes to pages marked as read-only. Further, there should be no performance impact (Π_c(j)) on any other VM VM_j that is hosted on the same server:

Π_s(i) = f(WR_i),   Π_c(j) = 0      (3)

Barring minor implementation details specific to individual hypervisors (e.g., the simulation model in [1] is more fine-grained and has different stop conditions), the baseline model captures the essence of all the earlier models [18], [9], [1] proposed to capture the impact of live migration.
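To make the baseline model concrete, the following is a minimal Python sketch that evaluates Eq. (1) under the stop conditions of Eq. (2). The dirty-page trace, page size, and the constants C_small and C_noProg are illustrative assumptions rather than values taken from this paper.

# Sketch of the baseline model (Eqs. 1-2): migration time is the sum of
# per-round copy times, where round 0 copies the active memory and every
# later round copies the pages dirtied in the previous round. The dirty-page
# counts are assumed inputs; as Section II-D notes, they are typically not
# monitored in practice.
def baseline_migration_time(active_mem_pages, dirty_per_round, page_size, mbw,
                            c_small=1000, c_noprog=100):
    """Return the estimated migration duration in seconds.

    active_mem_pages : pages copied in round 0 (active memory AM_i)
    dirty_per_round  : list of dirty-page counts DR_{i,r} produced in round r
    page_size        : bytes per page
    mbw              : peak copy bandwidth in bytes per second
    """
    total = active_mem_pages * page_size / mbw           # T_0(i) = AM_i / MBw
    prev_dirty = active_mem_pages
    for dirty in dirty_per_round:
        total += dirty * page_size / mbw                 # T_{r+1}(i) = DR_{i,r} / MBw
        # Eq. (2): stop when few pages remain dirty or progress stalls.
        if dirty < c_small or (prev_dirty - dirty) < c_noprog:
            break
        prev_dirty = dirty
    return total

# Illustrative run: a 512 MB active set with a shrinking dirty set over ~1 Gbps.
print(baseline_migration_time(active_mem_pages=131072,
                              dirty_per_round=[40000, 12000, 3000, 800],
                              page_size=4096, mbw=125e6))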

D. Modeling Challenges in Live Data Centers
Data center administrators are often reluctant to use non-standard tools, which severely restricts the parameters that can be monitored in a data center. Manageability is a key requirement in a data center, forcing administrators to use standard hardware, operating systems, and tools. We conducted a survey of more than 100 enterprise data centers and enumerate a superset of the parameters that are monitored. The complete list of system parameters includes <% Total Processor Time, % Priv Time, % User Time, Proc Queue Length, Context Switches/Sec, Swap Page Ins/Sec, Swap Page Outs/Sec, Memory Committed Bytes (MB), Memory Average % Used, DASD % Free, IOPS, Disk Read (Bytes/second), Disk Write (Bytes/second), # Log Vol Red, TCP/IP Conn, TCP/IP Bytes/Sec, TCP/IP Packets Sent, TCP/IP Packets Received>. Additionally, outsourced data centers also monitor application parameters for SLA enforcement. The baseline model is based on parameters like WR_i and DR_{i,r} that are not present in the list above. Even though there is work on obtaining memory traces dynamically [17], these techniques have performance overheads and are feasible only on mid-range servers. Since typical clouds use commodity software and servers in order to cut down costs, it may not be feasible to obtain these parameters. Further, the performance impact of each page fault due to writes (f(WR_i)) needs to be estimated, which is an open research problem in itself. Hence, a practical tool for modeling the impact of reconfiguration should restrict itself to high-level system parameters that are easily available in data centers. Finally, the baseline model is oblivious to non-memory parameters (CPU utilization, cache hit rates, etc.) and application characteristics, and it is important to ascertain whether the impact of these parameters is small enough to be ignored. We conducted a detailed experimental study to answer some of these questions.
III. MOTIVATING OBSERVATIONS
A. Understanding Live Migration
In a preliminary version of this work [21], we conducted a study of live migration on the VMWare ESX and IBM pHyp hypervisors. In this work, we use insights drawn from these observations and additional new observations to design a model to estimate the impact of live migration. For the sake of completeness, we summarize the observations made in [21]. For more details on the reasoning behind the observations, the reader is referred to [21].
Observation 1: If there are no resource constraints, the duration of migration for an application varies linearly with the active memory of the VM. The migration duration varies across applications with the same memory footprint.
Observation 2: Live migration requires spare CPU resources on the source server but not on the target server. If spare CPU is not available, it impacts the duration of migration and the performance of the VM being migrated.
Observation 3: The amount of CPU required for live migration increases with an increase in the number of active pages of the VM being migrated.
Observation 4: A co-located VM impacts a VM being migrated by taking away resources from the physical server. The co-located VM does not suffer from CPU contention but may suffer from cache contention.

[Fig. 2 plots normalized job duration against active memory (KB) for Cluster1 and Cluster2, with and without migration.]
Fig. 2. Performance impact for migration from high utilization servers (Cluster1) to low utilization servers (Cluster2) and vice versa.
[Fig. 3 plots the benchmark time (sec) of three trials under No Migration and Migration on Clusters 1 and 2.]
Fig. 3. Performance enhancement for BT benchmark post migration with (Cluster1) and without (Cluster2) resource contention.

1) New Observations: We had observed in [21] that migrating virtual machines from high CPU utilization servers leads to a negative performance impact during migration. We conducted experiments with clusters running at low utilization and observe that there can also be a positive performance impact for some applications at low CPU utilization during migration (Fig. 2). In our testbed, we created a cluster with fewer cores (Cluster1) and a cluster with a larger number of cores (Cluster2), and we migrated VMs between the clusters.

We observe that migrating the daxpy micro-benchmark from Cluster1 to Cluster2 led to an increase in its running time, whereas migrating it from Cluster2 to Cluster1 led to a decrease in running time. The details of the testbed and benchmarks are available in Section V. Migration itself cannot improve the performance of an application. However, since the benchmarks are long running (typically twice the duration of migration), we conjecture that the performance enhancement happens after migration is completed. To validate our conjecture, we experiment with the BT benchmark from the NAS suite, which runs in roughly half the duration of migration. We run the benchmark 3 times, synchronizing the first run with the start of live migration. We observe that without migration (No Migration in Fig. 3), all 3 trials of the benchmark take roughly the same time. However, with migration, the third trial of the benchmark takes 20% less time. This performance enhancement can be explained by the fact that typical live migration techniques use ballooning to reclaim idle pages before migration [4], [15]. A positive side effect of ballooning is defragmentation of memory on the target server, where all the active pages are allocated together. This greatly improves the sequentiality of memory accesses, leading to better cache usage and prefetching, which exhibits itself as improved application performance post migration [6]. It is also interesting to note that the performance improvement is greater for applications with a small, very active set (small number of active pages in Fig. 2). This surprising result leads to our first new observation.
Observation 5: An application may see performance enhancement post migration due to memory defragmentation.
Our new observation in combination with Observation 1 brings out a very important point. The duration and the impact of live migration on an application depend on fairly low-level system characteristics such as the way it uses the cache, allocates memory, and dirties pages. This leads to our most important observation on the need for a model to be application-aware.
Observation 6: The impact of migration is not determined solely by high-level system parameters but also by fairly low-level application-specific metrics. Hence, a model for migration needs to be application-aware.
B. Frequency of Reconfiguration
The cost of reconfiguration in a cloud depends on the frequency of reconfiguration and the cost of each reconfiguration action. We next study the frequency of VM live migration in a shared infrastructure.
1) Testbed setup: In this study, we used production traces collected from a large data center. The data center runs key enterprise applications of one of the world's 10 largest international airlines. We focused on a cluster of 112 virtualized servers that were hosted on 26 mid-range servers with the pHyp hypervisor. This shared server cluster is treated as a private cloud that can move virtual machines (or LPARs) across various physical servers in response to changes in application demand.

We used the Emerald Dynamic Consolidation tool, which has been used and validated in earlier studies [18], [19], [20], to simulate dynamic consolidation in the server cluster. We use a 48-hour period in July 2009 to perform this study. We break the 48-hour period into intervals of 4-hour duration and dynamically consolidate the LPARs once at the start of each consolidation interval. The dynamic consolidation algorithm used in the tool is mPPH, which aims to minimize the total number of migrations due to consolidation [18]. We then log the number of LPAR migrations and the server utilization of each physical server that participates in any migration activity. Further, the tool estimates the power consumed by the private cloud in each interval, which is used as an indicator of operational cost. Dynamic consolidation is usually performed either when an application needs more resources (scale-up) or when an application needs fewer resources than allocated (scale-down). It is important to note that typical data centers are still fairly static and reconfigure the data center only for scale-up. Hence, data centers usually operate at low resource utilization in order to minimize reconfiguration. Our study captures the frequency of migration if data centers aim to aggressively achieve the true elasticity and high utilization promised by the paradigm of cloud computing. Current clouds would usually experience a smaller number of migrations than suggested by our study.
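As an illustration of the bookkeeping behind this study, the sketch below compares successive placements produced by a consolidation algorithm and counts, for each window, the VMs whose host changed, together with the utilization of the servers involved. The place and server_util callables are hypothetical stand-ins for the mPPH algorithm and the monitoring data; they are not part of the Emerald tool.

# Sketch of per-window migration accounting for a trace-driven consolidation
# study. place(snapshot) returns a dict mapping each VM to its chosen server;
# server_util(snapshot, server) returns that server's CPU utilization.
def count_migrations(windows, place, server_util):
    prev = None
    stats = []
    for snapshot in windows:
        curr = place(snapshot)
        if prev is not None:
            moved = [vm for vm, host in curr.items() if prev.get(vm) != host]
            src = [server_util(snapshot, prev[vm]) for vm in moved if vm in prev]
            dst = [server_util(snapshot, curr[vm]) for vm in moved]
            stats.append({
                "migrations": len(moved),
                "avg_source_util": sum(src) / len(src) if src else 0.0,
                "avg_target_util": sum(dst) / len(dst) if dst else 0.0,
                "active_servers": len(set(curr.values())),
            })
        prev = curr
    return stats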
[Fig. 4(a) plots, per consolidation window, the number of migrations, the average source and target server CPU utilization, and the number of active servers. Fig. 4(b) plots the number of migrations and normalized power against the consolidation interval (hr), along with the maximum number of migrations.]
Fig. 4. (a) Reconfiguration activity with time. (b) Impact of the consolidation interval on the number of migrations.

2) Live Migration Activity due to Dynamic Consolidation: We study the reconfiguration activity due to dynamic consolidation in Fig. 4(a). For each consolidation interval, we plot the total number of VMs being migrated, the average CPU utilization of the servers from which VMs are migrated (source server utilization), the average CPU utilization of the servers to which VMs are migrated (target server utilization), and the number of active servers post consolidation. We observe that the number of VM migrations due to consolidation varies greatly and can be as high as 60 VMs, or up to 50% of the total number of VMs in the cluster. There are also periods with two or fewer migrations. The number of active servers does not change during low migration periods. We looked at the traces and observed that the aggregate load was steady during these intervals.

Since the history-aware placement algorithm mPPH migrates VMs only if the migrations lead to power savings, there are very few migrations during these intervals. However, most intervals exhibit changes in aggregate load, leading to a large number of migrations.
Observation 7: An average of 25% of VMs in a cloud are migrated due to dynamic resource allocation. In some consolidation intervals, up to 50% of all VMs may be migrated for dynamic resource allocation.
Our second observation is that the CPU utilization of servers that participate in a migration is usually high. This implies that the impact of live migration is going to be visible on highly loaded servers, amplifying the impact. One may also note that an increase in the number of servers is correlated with high source server utilization, whereas a decrease in the number of servers is correlated with high target server utilization. This is a direct consequence of the fact that more servers are added when the currently active servers exhibit high utilization. In a similar manner, when consolidation triggers servers to be switched off, we move the VMs of lightly loaded servers to highly loaded servers. It is also pertinent to note that high server utilization and a large number of migrations are often triggered when the number of active servers is increased (i.e., during scale-up).
Observation 8: Scale-up is often associated with high source server and low target server CPU utilization. Scale-down is often associated with low source server and high target server utilization.
An important parameter in dynamic consolidation is the consolidation interval or period, defined as the period of time after which a data center is reconfigured. A small consolidation period allows a data center to take advantage of short periods of low intensity to switch off some servers. However, it may also lead to frequent migrations. We next investigate how the consolidation period affects migration activity and potential power savings. Fig. 4(b) studies the number of migrations over a 24-hour period as the consolidation interval changes. We observe that the number of migrations is small for consolidation intervals greater than 8 hours. However, the reduction in the number of migrations comes at the cost of higher power, showing an increase of 100% as the consolidation interval is set to 24 hours. Hence, migration impact can be reduced by reconfiguring the cloud less frequently. However, this would prevent resources from being moved around flexibly in response to workload variations. The ability to allocate resources in response to demand is one of the defining features of cloud computing, and long consolidation windows impact this ability. We capture the above insight in the following observation.
Observation 9: Migration actions can be reduced by reducing the elasticity of the cloud. However, this comes at the expense of increased operational cost.
In order to validate our observations, we also conducted a parallel study using 600 virtual machines running Windows workloads on commodity Intel servers. All our observations hold for the larger commodity testbed as well. Interestingly, we found an even higher number of migrations for this testbed.

A closer look at our results showed that the virtual machines in our Intel testbed had more diverse resource requirements. There were a large number of VMs with very small resource requirements. Hence, any change in the resource requirement of the larger VMs led to a large number of small-sized VMs being migrated. We therefore conclude that our core observations would hold for the majority of clouds that employ dynamic consolidation aggressively.
IV. COSMIG METHODOLOGY
Our experimental study highlights the importance of incorporating the CPU utilization of the source server in any model for live migration. Further, our study indicates that the impact of the write rate WR_i and dirty rate DR_i on migration is fairly small and can be ignored. We also confirm the observation made earlier that the model should take the active memory of the application into account. Our most important observation is that the impact of migration varies based on the application. If a benchmark has more instructions per memory access and lower CPI, it leads to higher processor activity and higher impact. The baseline model ignores these parameters, but we observe that the duration of migration and the performance impact for two benchmarks differ for the same active memory, CPU utilization, and write/dirty rate [21]. We use the above insights to design CosMig, an application-aware migration cost model.
A. Model Requirements and Overview
We have identified that an accurate model to estimate the impact of reconfiguration needs to be (i) application-specific, (ii) sensitive to the operating point (e.g., CPU utilization, active memory), and (iii) aware of co-located VMs on the physical server. Hence, a simple modeling strategy would calibrate the impact for each application at every operating point for all combinations of co-located applications. Clearly, such an approach does not scale with the number of applications. Further, a practical model needs to use only those parameters that are usually monitored in data centers. We solve this scaling challenge by capturing the impact of application type and the impact of system-level parameters and co-located VMs separately. CosMig consists of a two-step Calibrate and Estimate methodology. The first step in CosMig consists of building calibration models for each individual application at a pre-defined baseline operating point BOP. Since we have observed that (i) write rate and dirty rate are usually not monitored in data centers and (ii) the impact of write and dirty rate can be ignored with application-awareness, an operating point is captured solely by the application type and the CPU and memory utilization of the VM being migrated. We term these parameters for each application as Fixed parameters. The calibration step also characterizes parameters that capture the sensitivity of the application performance to changes in the operating parameters. Further, since the impact of a co-located VM is similar to reduced resources, we model the impact of a co-located VM as a change in operating point. We term these parameters as Rate parameters. In the Estimate step, we use the Fixed parameters to create an initial estimate of the impact and then refine this estimate using the Rate parameters. The Fixed parameters are captured for each application at a pre-defined active memory usage AM with no resource contention.

Parameter | Type | Description
BOP | Fixed | Baseline Operating Point
AM | Fixed | Active Memory
Π_s | Fixed | Self-Impact
Π_c | Fixed | Co-Impact
T_AM | Fixed | Migration Duration
CPU^Mig | Fixed | CPU Required for Migration
ΔCPU^Mig | Rate | Increase in CPU^Mig per unit increase in Active Memory
MBw | Rate | Peak Memory Copy Bandwidth
TABLE I. FIXED AND RATE PARAMETERS IN COSMIG

Begin Estimate
  T(i) = T_AM(i) + (AM_i − AM) / MBw
  CPU_i^Mig = CPU_i^Mig + ΔCPU^Mig × (AM_i − AM)
  If CPU_i + CPU_j + CPU_i^Mig ≤ CPU_tot
    return T(i), Π_s(i) and Π_c(j)
  Else
    ρ_i = (CPU_i + CPU_j + CPU_i^Mig) / CPU_tot
    return T(i) × ρ_i, Π_s(i) × ρ_i and Π_c(j)
End Estimate
Fig. 6. CosMig model


For this active memory usage, we compute the self-impact Π_s, the co-impact Π_c, the baseline migration duration T_AM, and the CPU required by the migration process CPU^Mig. Our Rate parameters include (i) ΔCPU^Mig, which captures the additional CPU required to complete the migration for every unit increase in the active memory of the virtual machine being migrated, and (ii) MBw, which captures the peak rate at which memory pages can be copied. We list all the CosMig parameters in Table I.
B. CosMig Detailed Methodology
The CosMig methodology consists of a calibration step to estimate the Fixed and Rate parameters at a standard operating point for each application. Post calibration, we use these parameters to estimate the impact of migration at any arbitrary operating point. We now describe the calibration step in detail.
Calibration
We use a micro-benchmark B that can use a specified amount of CPU CPU_B and memory AM_B. We install the benchmark on a dedicated VM on a server with CPU capacity CPU_tot and perform per-VM calibration runs for all the applications VM_i in the cloud. We use AM to denote the active memory of VM_i for these runs and monitor T_AM(i) (the time taken to migrate VM_i) and the self-impact Π_s(i) at active memory AM and no CPU contention. We also estimate the amount of CPU required for migrating VM_i at active memory AM as CPU_i^Mig, and the co-impact Π_c(i). The calibration steps are described in Fig. 5.
1) Migrate VM_i with B set at low utilization.
2) Instrument the application performance and the migration to obtain the self-impact Π_s(i) and the migration duration T_AM(i) for a fixed active memory.
3) Increase the CPU utilization of B and iterate till the migration duration exceeds 1.05 × T_AM(i). Estimate CPU_i^Mig as CPU_tot − CPU_B − CPU_i.
4) Migrate B and use the application performance to estimate the co-location impact Π_c(i).
Fig. 5. Calibration runs for each VM

In addition to the per-VM calibration, we migrate the micro-benchmark B at two different levels of active memory to compute the peak copy bandwidth MBw. For the two active memory levels, we identify the CPU required by the hypervisor for migration. The ratio between the difference in the CPU requirement for migration and the difference in active memory is used to estimate ΔCPU^Mig.
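To make the calibration phase concrete, the sketch below drives the per-VM runs of Fig. 5 and the two-point estimation of the Rate parameters. The measurement callables are hypothetical stand-ins for testbed instrumentation (a live-migration trigger plus application throughput monitoring); only the control flow follows the description above, and the 100 MHz step is an arbitrary choice.

# Sketch of the CosMig calibration phase. migrate_and_measure(target) triggers a
# live migration of `target` ("vm" or "benchmark") and returns (duration in s,
# throughput drop observed on the application VM); set_benchmark_cpu(mhz) sets
# the CPU consumed by the co-located micro-benchmark B.
def calibrate_fixed_params(vm_cpu_mhz, cpu_total_mhz,
                           migrate_and_measure, set_benchmark_cpu,
                           step_mhz=100.0):
    set_benchmark_cpu(0.0)                               # Step 1: B at low utilization
    t_am, pi_s = migrate_and_measure("vm")               # Step 2: duration and self-impact
    cpu_b = 0.0
    while cpu_b < cpu_total_mhz:                         # Step 3: raise B's CPU until the
        cpu_b += step_mhz                                # migration slows by more than 5%
        set_benchmark_cpu(cpu_b)
        duration, _ = migrate_and_measure("vm")
        if duration > 1.05 * t_am:
            break
    cpu_mig = cpu_total_mhz - cpu_b - vm_cpu_mhz         # spare CPU the migration consumed
    _, pi_c = migrate_and_measure("benchmark")           # Step 4: co-impact while B migrates
    return t_am, pi_s, pi_c, cpu_mig

def calibrate_rate_params(migrate_benchmark_at, am1_mb, am2_mb):
    """Estimate MBw and dCPU_Mig from two migrations of B at different active memory."""
    t1, cpu1 = migrate_benchmark_at(am1_mb)              # (duration in s, migration CPU in MHz)
    t2, cpu2 = migrate_benchmark_at(am2_mb)
    mbw = (am2_mb - am1_mb) / (t2 - t1)                  # MB/s, assuming duration grows linearly
    d_cpu_mig = (cpu2 - cpu1) / (am2_mb - am1_mb)        # MHz per MB of active memory
    return mbw, d_cpu_mig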

Estimating the Impact of Migration
We now present the CosMig model to estimate the impact of migrating VM_i co-located with one or more VMs on a server with CPU capacity CPU_tot. Assume that the active memory of VM_i is AM_i and its CPU usage is CPU_i. For clarity, we assume only one co-located VM VM_j, while noting that the method extends to multiple co-located VMs. We first use linear extrapolation to estimate the expected duration of migration T(i) at the new active memory AM_i using the benchmarked value T_AM(i) and the copy bandwidth MBw, assuming no resource contention. We then estimate the amount of CPU required for live migration CPU_i^Mig using the calibrated CPU requirement and the parameter ΔCPU^Mig, which captures the rate of change of the CPU requirement with change in active memory. We use the total CPU requirement for live migration CPU_i^Mig and the CPU usage of each VM (CPU_i, CPU_j) to estimate the extent of resource contention ρ_i, and scale up the duration of migration and the self-impact using it. The details of the CosMig model are provided in Fig. 6. We would also like to note that we had a dedicated subnet for live migration, shielding the applications from any network bottleneck. However, if future clouds share the same network for migration and application traffic, the resource contention term would be extended to the maximum of the CPU and network contention. Hence, CosMig can easily capture network bottlenecks as well.
CosMig Running Time
In normal operation, CosMig predicts the cost of migration using a table lookup of the Fixed parameters and performs some simple calculations. Hence, the time taken to compute the migration cost is negligible. However, CosMig has a calibration phase, and we characterize the time taken in this phase. Our methodology requires us to only estimate the Fixed parameters for an application. If the hypervisor supports per-process utilization monitoring, the Fixed parameters for an application type can be estimated using a single live migration of the application. Further, the Fixed parameters can be estimated during normal data center operation as well, by monitoring the relevant parameters whenever a VM is live migrated. Though the Rate parameters require multiple experimental runs, these parameters are application-independent and need to be estimated only once, independent of the number of applications. Hence, CosMig can be directly applied in a live data center without any perceptible overheads.
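For concreteness, the following is a minimal executable sketch of the Estimate step in Fig. 6. Parameter names mirror Table I; the unit conventions (MB, MHz, MB/s) and the example values in the call are our own illustrative choices, loosely based on the calibrated numbers reported in Section V.

# Sketch of the CosMig Estimate step (Fig. 6): extrapolate duration and
# migration CPU from the baseline operating point, then scale duration and
# self-impact by the contention factor rho if the server is overcommitted.
def cosmig_estimate(am_i_mb, cpu_i_mhz, cpu_j_mhz, cpu_tot_mhz,
                    t_am_s, pi_s, pi_c, cpu_mig_mhz,
                    am_baseline_mb, d_cpu_mig_mhz_per_mb, mbw_mb_per_s):
    t_i = t_am_s + (am_i_mb - am_baseline_mb) / mbw_mb_per_s
    cpu_mig = cpu_mig_mhz + d_cpu_mig_mhz_per_mb * (am_i_mb - am_baseline_mb)
    demand = cpu_i_mhz + cpu_j_mhz + cpu_mig
    if demand <= cpu_tot_mhz:                  # enough spare CPU: no contention
        return t_i, pi_s, pi_c
    rho = demand / cpu_tot_mhz                 # contention scales duration and self-impact
    return t_i * rho, pi_s * rho, pi_c

# Illustrative call: daxpy (baseline AM = 1 MB, T_AM = 44 s, CPU_Mig = 1050 MHz)
# migrated at 500 MB active memory alongside bt (Pi_c = 0.025) running at 600 MHz.
print(cosmig_estimate(am_i_mb=500, cpu_i_mhz=1200, cpu_j_mhz=600, cpu_tot_mhz=3200,
                      t_am_s=44, pi_s=0.0, pi_c=0.025, cpu_mig_mhz=1050,
                      am_baseline_mb=1, d_cpu_mig_mhz_per_mb=0.5, mbw_mb_per_s=33.3))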

V. EXPERIMENTAL VALIDATION
A. Experimental Setup
Our experimental testbed consisted of a small virtualized server farm and a SAN environment. Our server farm consisted of 7 IBM HS21 BladeCenter servers hosted on an IBM BladeCenter-H chassis. To simulate a cloud infrastructure, all the servers ran VMWare ESX Server Enterprise 3.5 with VMotion enabled. One network port of each server was dedicated as a VMkernel port for VMotion. The other port on the server was used for communicating with a client that was used to drive workloads on our testbed. Every server had a dedicated 2GBps Fibre Channel port, which was connected to an IBM DS4800 Storage Controller via a Cisco MDS Fibre Channel switch. The server used to drive our workloads had a Xeon 2.33GHz processor with 2GB RAM. We divided our data center into two clusters. The first cluster (Cluster 1) had servers running one 3.2GHz Xeon processor with 2MB cache. The second cluster (Cluster 2) had servers with two 3.2GHz Xeon processors. Servers from both clusters had 8GB RAM each. We also ensured that our virtual machines always required less than 3.2GHz to run their applications. Hence, the second cluster captures server farms where there is spare CPU capacity available for any migration-related overheads, whereas the first cluster captures server farms running at high utilization. All the virtualization management actions (VM resizing, VM migration) were handled via VMWare Virtual Center 2.5 hosted on a separate dedicated server. We used two suites of benchmarks for our validation experiments. The first suite consisted of the daxpy and dcopy benchmarks from the BLAS-1 library [2]. daxpy is a compute-intensive benchmark that uses most of the compute components, whereas dcopy is a pure memory copy benchmark. The motivating observations used to design CosMig were made using the BLAS-1 suite [21]. We also used these benchmarks in our validation study as we could vary the size of the vector used during execution and the number of iterations, and control the throughput by sleeping between iterations. Hence, we could design experiments to capture diverse operating conditions. In order to validate CosMig, we also introduced complex and real-life applications in our testbed. We used the serial version of the NAS Parallel Benchmark suite (NPB) [14] to validate our model. We picked benchmark classes from the suite that completed in time on the order of the migration duration. Hence, we picked class W for all benchmarks other than is, for which we picked class A. The NAS benchmarks mimic a wide spectrum of representative application behaviors, from CPU intensive to memory intensive. The suite also includes benchmarks that represent computational science applications (e.g., bt captures a basic calculation of Computational Fluid Dynamics). We believe that using a combination of micro- and macro-benchmarks allows us to cover varied settings, while keeping our study relevant to real applications.

Live migration is a memory-intensive task, and hence both the duration and the impact of live migration vary across different runs. We could solve this problem by rebooting each server before every experimental action. However, in an actual deployment, it would not be possible to pursue such a strategy. Hence, we decided to deal with this experimental noise by running each experiment multiple times and taking the mean of all the runs. All the experiments were repeated 8 times and the means are reported. Hypervisors from different vendors have their own strengths and weaknesses based on their current implementations. Our goal is to come up with a model that captures the inherent design of live migration and is oblivious to weaknesses of a current implementation that may easily be sorted out in the future. Hence, we also experimented with a parallel data center using the pHyp hypervisor on IBM Power6 JS-22 blades and validated our observations.

Benchmark | T_AM | Π_s | Π_c
is | 100 | -0.05 | 0
lu | 107 | -0.1 | 0
sp | 138 | -0.2 | 0
ua | 96 | -0.27 | 0
bt | 43 | 0.028 | 0.025
daxpy | 44 | 0 | 0
Fig. 7. Model Parameters for daxpy and NAS applications

We validated the CosMig model against a baseline model based on the active memory of each application (Section II-C). Since we could vary the operating parameters of the daxpy application, we used it as the foreground application on VM VM_1, which is migrated. A second VM, VM_2, was used to host a background application and ran one of the benchmarks from the NAS suite. We performed the calibration runs for all these applications and list the derived model parameters in Fig. 7. We used 1MB as the baseline operating point for daxpy and observed the CPU required to migrate daxpy at the baseline operating point to be 1.05GHz. The rate parameters for the data center, ΔCPU^Mig and MBw, were 0.5 Hz/B and 33.3 MBps respectively. The foreground application was migrated from Cluster1 to Cluster2, whereas the background application was running on Cluster1. The background VM running the NAS benchmark was entitled to a capacity of 600MHz. We varied the CPU allocation for VM_1 (from 1000MHz to 1800MHz) and observed the impact on the foreground daxpy application (Π_s), the duration of migration T, and the co-impact (Π_c) on the NAS benchmarks. We compare the observed impact with the impact predicted by CosMig and the application-unaware Baseline model.
B. Evaluating Self-Impact and Migration Duration
Our experiments with different CPU entitlements for the foreground daxpy application can be divided into two distinct scenarios. In the first scenario, there was no resource contention, as the migration process in the hypervisor got the required CPU (Fig. 8(a)). In the second scenario, there was not enough spare CPU and resources had to be taken away from the applications by the hypervisor to complete the migration (Fig. 8(b)). We note that the self-impact predicted by CosMig without resource contention is 0 (Π_s(1) = 0), which is validated by our observations. Further, CosMig predicts the migration duration accurately.

[Fig. 8 plots the throughput drop (Π), the benchmark running time, and the migration duration (real, predicted, and Baseline prediction) against active memory (KB) in panels (a) and (b).]
Fig. 8. Performance Impact on Foreground VM and Migration thread (a) without resource contention (daxpy = 1200MHz. bt = 600MHz) and (b) with resource contention (daxpy = 1800MHz. bt = 600MHz). The plots with other NAS applications as background application were similar (within noise limits) and are omitted for lack of space.
[Fig. 9 plots the benchmark time (Baseline, Real, Predicted) against the memory footprint of the migrated VM (KB) in panels (a) and (b).]
Fig. 9. Performance Impact on Background VM (a) without resource contention (daxpy = 1200MHz. bt = 600MHz) and (b) with resource contention (daxpy = 1800MHz. bt = 600MHz). Other NAS applications did not show any performance impact and are omitted.

In contrast, the application-unaware Baseline method has an error of up to 25%. The only aberration from CosMig happens at a memory footprint of 800MB. We also note that the duration of migration exceeds the duration predicted by our model at this point. It is clear that even though we predict that there is no resource contention at this last operating point (800MB), there is resource contention for CPU. Using the model parameters in Fig. 7, we predict the total CPU requirement (foreground CPU + background CPU + CPU for migration) to avoid CPU contention as 3200MHz on a 3200MHz server. However, the actual total CPU requirement is slightly higher, leading to the resource contention. For a memory footprint of 500MB, on the other hand, we predict the resource requirement as 3050MHz, and we observe that there was no performance impact at 500MB; hence the resource requirement was indeed less than 3200MHz. Therefore, even though our model is not 100% accurate in predicting the resource requirement for migration, it is able to predict it within 150MHz (less than 5% of the server). Such a small error can be dealt with by a cloud provider using 5% resource over-provisioning. Fig. 8(b) presents the expected performance and migration duration for migrations that lead to CPU contention. The amount of contention varies from 8% at a low memory footprint to 21% at a memory footprint of 800MB. We observe that both the performance impact and the migration duration follow the model reasonably well for small to medium footprint applications. However, when the resource contention approaches 20%, the linear model exhibits errors of up to 10%. We also note that the duration of migration is lower than the predicted value, whereas the benchmark running time is higher than expected.

We believe this is a consequence of the fact that the migration process is not throttled beyond a limit, and very high CPU utilization impacts the foreground application more significantly. Further, the results at high load show much higher variance across the multiple experimental runs. Hence, we conclude that our model can predict self-impact and migration duration for moderate levels of resource contention. The high variance at very high load makes prediction of self-impact and migration duration with high accuracy infeasible.
C. Validating Co-Impact Π_c
Our model predicts that the performance impact on co-located VMs is a constant that depends solely on the application. Once an estimate of the co-location impact Π_c of an application is built, it can be used independent of the characteristics of the VM being migrated or the utilization of the server. The only application that showed a non-zero performance impact was the bt benchmark (Fig. 7). The other benchmarks in the NAS suite showed no performance impact due to migration. Hence, we only present the results of our experiments with the bt benchmark as the background application. Fig. 9(a) plots the running time of the bt benchmark under low server utilization as the memory footprint of the daxpy benchmark is varied. Further, based on the Π_c value (= 0.025) of the bt benchmark obtained in the pre-calibration runs, we estimate an expected running time for the benchmark during migration and plot it for comparison. We observe that the predicted performance impact matches the real performance impact very closely over the complete memory footprint range of the migrated VM. This validates our model for co-location impact (Π_c) without resource contention. Our next experiment was designed to validate our model for co-location impact with resource contention.

Hence, we increased the CPU allocation of the daxpy benchmark and co-located one of the NAS applications on the secondary VM. We observed again that all benchmarks other than bt showed no performance impact. Fig. 9(b) plots the performance impact on the bt application of migrating the VM running daxpy. We again observe that the performance impact closely follows the impact predicted by our model. Hence, we conclude that our model is very accurate in predicting co-impact even when there is significant resource contention. Our experimental study establishes the ability of our model to accurately predict the impact of live migration. We also show that an application- and CPU-utilization-oblivious Baseline methodology has significant errors, underlining the need for an intelligent model for live migration.
VI. RELATED WORK AND DISCUSSION
Cloud computing has emerged as an exciting new paradigm for enterprises to achieve high resource utilization. This paradigm is based on dynamic allocation of resources between applications, facilitated by the underlying virtualization layer. Hence, dynamic consolidation of a virtualized server farm has attracted a lot of attention in the recent past. Dynamic resource consolidation techniques use either only VM resizing [16] or VM resizing in combination with live migration [19], [18], [3], [12], [8]. Anecdotal evidence as well as recent findings [9] identify live migration as the reconfiguration mechanism with significant performance impact. Dynamic consolidation techniques, following this conventional wisdom, aim to minimize the number of migrations [3], [8], [18]. Live migration technology is available on most popular virtualization platforms including VMWare ESX [13], Xen [4], and IBM pHyp [11]. Other than dynamic consolidation, live migration is also used for server maintenance and to eliminate performance hotspots [23]. Designers of the technology do provide empirical evidence to suggest that the performance impact of live migration is manageable. However, to the best of our knowledge, there is no systematic study of the parameters that capture the performance impact of live migration under different operating conditions. Further, there is no study on the frequency of reconfiguration in a cloud that employs dynamic consolidation to maximize resource utilization. Existing work in this area takes a simplistic view of live migration and attaches a constant cost to migrating any VM [3]. Verma et al. [18] take this model a step further by linking migration cost with the active memory of the VM. In [9], Jung et al. note that live migration can significantly impact both foreground and background applications. Akoush et al. [1] provide the first study to estimate the duration of live migration and present two simulation models. However, none of the existing work presents a practical model to estimate the impact on applications during live migration in the presence of competing workloads. Our work presents this missing piece for the elaborate body of work on dynamic consolidation by providing estimates of the frequency, the duration, and the performance impact of reconfiguration actions. In an earlier version of this work [21], we presented some of the motivating observations that led to the design of CosMig.

In this work, we extend the preliminary work in multiple ways. We present the first real study of enterprise workloads in terms of the number of reconfiguration actions and their correlation with server utilization. We also present additional observations that help us understand the impact of live migration on applications. We present an elaborate methodology, CosMig, to estimate the impact of live migration in a live data center and validate it using representative workloads. We show that CosMig can predict the duration and impact of migration within a 5% error range. CosMig can be directly employed by virtualized data center administrators as well as cloud providers to minimize the impact of reconfiguration during dynamic resource allocation.
REFERENCES
[1] S. Akoush, R. Sohan, A. Rice, A. Moore, and A. Hopper. Predicting the performance of virtual machine migration. In MASCOTS, 2010.
[2] Basic Linear Algebra Subprograms. http://www.netlib.org/blas.
[3] H. W. Choi, H. Kwak, A. Sohn, and K. Chung. Autonomous learning for efficient resource utilization of dynamic VM migration. In Proc. ICS, 2008.
[4] C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In NSDI, 2005.
[5] EC2. Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2/.
[6] J. Fu, J. Patel, and B. Janssens. Stride directed prefetching in scalar processors. In IEEE MICRO, 1992.
[7] Google App Engine. http://code.google.com/appengine/.
[8] D. Gmach, J. Rolia, L. Cherkasova, G. Belrose, T. Turicchi, and A. Kemper. An integrated approach to resource pool management: Policies, efficiency and quality metrics. In Proc. DSN, 2008.
[9] G. Jung, K. Joshi, M. Hiltunen, R. Schlichting, and C. Pu. A cost-sensitive adaptation engine for server consolidation of multitier applications. In Proc. Middleware, 2009.
[10] LotusLive. https://www.lotuslive.com/.
[11] G. McLaughlin, L. Liu, D. DeGroff, and K. Fleck. IBM Power Systems platform: Advancements in the state of the art in IT availability. In IBM Systems Journal, 2008.
[12] R. Nathuji and K. Schwan. VirtualPower: coordinated power management in virtualized enterprise systems. In Proc. ACM SOSP, 2007.
[13] M. Nelson, B.-H. Lim, and G. Hutchins. Fast transparent migration for virtual machines. In Usenix ATC, 2005.
[14] NAS Parallel Benchmarks. http://www.nas.nasa.gov/Resources/Software/npb.html.
[15] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. In Usenix OSDI, 2002.
[16] J. Stoess, C. Lang, and F. Bellosa. Energy management for hypervisor-based virtual machines. In Proc. Usenix ATC, 2007.
[17] D. Tam, R. Azimi, L. Soares, and M. Stumm. RapidMRC: Approximating L2 miss rate curves on commodity systems for online optimizations. In ASPLOS, 2009.
[18] A. Verma, P. Ahuja, and A. Neogi. pMapper: Power and migration cost aware application placement in virtualized systems. In Proc. Middleware, 2008.
[19] A. Verma, P. Ahuja, and A. Neogi. Power-aware dynamic placement of HPC applications. In ACM ICS, 2008.
[20] A. Verma, G. Dasgupta, T. Nayak, P. De, and R. Kothari. Server workload analysis for power minimization using consolidation. In Proc. Usenix ATC, 2009.
[21] A. Verma, G. Kumar, and R. Koller. The cost of reconfiguration in a cloud. In Proc. Middleware (Industrial Track), 2010.
[22] C. A. Waldspurger. Memory resource management in VMware ESX Server. In Proc. Usenix OSDI, 2002.
[23] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif. Black-box and gray-box strategies for virtual machine migration. In Proc. NSDI, 2007.
