
Benchmarking Servers using Virtual Machines

Samantha S. Foley (ssfoley@cs.indiana.edu)


Vinay Pandey (vkpandey@cs.indiana.edu)
Minh Tang (mhtang@cs.indiana.edu)
Felix Terkhorn (terkhon@gmail.com)
Aparna Venkatraman (apvenkat@indiana.edu)
April 27, 2007

Abstract
Virtualization software is often used in enterprise settings to run virtual machines
(VMs) on server hardware, with each VM running a distinct server. This has many
benefits, including flexibility and lower hardware cost. This paper presents the
results of a performance evaluation of two comparable dual quad-core server
machines, one from HP and one from Dell, both running VMware ESX Server 3. The
performance is measured using a series of independent benchmarks testing the
performance of a file server, database server, web server, Java server, and email
server. This multi-workload benchmark is used to evaluate the two server machines
and how the different server workloads interact to better utilize the hardware
resources.

1 Introduction
Benchmarking is a useful tool in determining the performance of a particular piece
of technology. In this paper we describe a benchmarking experience to deter-
mine server performance using virtual machines to run several different servers
per physical machine. We were asked to perform this study by Indiana University's
University Information Systems (UIS), a division of University Information
Technology Services (UITS). UIS is responsible for developing, implementing, and
managing information systems that are part of the university's core business actions
[16]. They require a number of servers performing different tasks. UIS prefers to
use VMware's ESX Server product [9] to create virtual machines to run many different
servers on the same physical machine. Our task is to determine which server machine
performs best, and what workload best utilizes the machines.

Virtualization can provide the user a flexible environment for running applica-
tions. On one physical machine, there can be multiple operating systems running
independently of each other sharing the resources of the physical machine in har-
mony. Virtual machine software typically runs on the bare hardware and then one
can create virtual machines (VMs) with different operating systems and software
on each one. The number of VMs per physical machine depends on the virtual-
ization software and the hardware capabilities. This is an advantage because when
the need for another server arises, one can just add another VM to a machine, in-
stall the appropriate software and the system is ready, as opposed to purchasing a
new machine to handle each new server. Virtualization allows the user to change
the allocation of resources through various tools provided by the software vendor.
This is very important for finding the optimum configuration of applications and VMs.

For our experiment we compare the performance of HP and Dell servers running
five VMs. Each VM runs a benchmark to test a database, email, file, Java, or
web server. The virtualization software, ESX Server, is provided by VMware, a
commercial vendor with a long history of enterprise virtualization products. We
based our server and benchmark choices on the needs of our client (UIS) and
previous virtual machine benchmarks [6-8, 18].

Section 2 describes each of the benchmarks we chose and why. Section 3 describes
how our work relates to previous work on benchmarking VMs. Section 4 provides a
thorough description of the experimental platform. Section 5 explains the workloads
we benchmarked and discusses the results. Section 6 describes the future research
directions some of the authors will pursue. The paper closes with conclusions and
acknowledgments.

2 Benchmarks
In this section we describe the five benchmarks we use to test the virtual machine
systems.

2.1 Database Server Benchmark: SwingBench
Swingbench is a free load generator (and set of benchmarks) designed to stress
test an Oracle 10g database. It models users repeatedly executing a predefined mix
of transactions. Swingbench includes two benchmarks: OrderEntry and CallingCircle.
OrderEntry is based on the oe schema that ships with Oracle 10g. It can be run
continuously (that is, until you run out of space). It introduces heavy contention
on a small number of tables and is designed to stress interconnects and memory.

Our client, UIS, uses Oracle for all of their database needs, so we chose a
benchmark designed for Oracle. Swingbench was chosen after surveying the
literature and exploring the documentation. It is specifically designed for stress
testing an Oracle database and the hardware it runs on. Swingbench has two modes
that generate the load: a GUI mode called Swingbench and a command-line version
called CharBench. All of the parameters are easily configurable via a single
XML file. It is also scalable, as it can be extended to stress test using multiple
load generators, with a coordinator component for controlling them.

The benchmark setup consisted of an Oracle 10g (10.2.0.1) Database Server installed
on Red Hat Enterprise Linux 4 and a load generator client. The client requires a
Java 1.5 JVM and SwingBench, with the Oracle Client also installed on Red Hat
Enterprise Linux 4. The database instance used was approximately 8 GB. Application
data files consumed approximately 7 GB, with log files and dictionary views
accounting for the remaining space. The load generator's average load was maintained
below 70%, and a ratio of one load generator CPU to two database CPUs was also
maintained to give reliable results. The number of users for the benchmark was set
to over 100 standard database connections. The user think time between transactions
was 250 milliseconds. Oracle's SGA (in-memory data cache) was configured to be
512 MB.

Figure 1 shows the total number of transactions and the distribution of transaction
types that were run on the Oracle database. We chose to make Process Orders
transactions 5.5% of the total transaction load and New Customer Registration
transactions 11%. The Browse Products, Browse Orders, and Order Products
transactions each contributed around 27.8% of the total. These parameters were
chosen to reflect contemporary database read and write trends and are easily
configurable; they may be changed in future work depending upon the requirements
of UIS (see Section 6).

Figure 1: Transactions Load for the Oracle Database

One of the key observations concerns the total number of transactions. The HP
virtual machine readily reaches a high of about 7500 transactions and is consistent
across individual runs, as shown in the graph above. The Dell machine, however,
reaches a high of between 5500 and 6000 transactions.
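
To make the configured mix concrete, the sketch below translates the percentages
above into approximate per-type transaction counts at a total of 7500 transactions
(the high observed on the HP virtual machine). It is a back-of-the-envelope
illustration, not part of the benchmark itself.

    # Approximate per-type transaction counts implied by the configured mix,
    # using the ~7500-transaction total observed on the HP virtual machine.
    mix = {
        "Process Orders": 5.5,
        "New Customer Registration": 11.0,
        "Browse Products": 27.8,
        "Browse Orders": 27.8,
        "Order Products": 27.8,
    }
    total_transactions = 7500

    for name, pct in mix.items():
        print(f"{name:26s} ~{total_transactions * pct / 100:5.0f} transactions")
    print(f"Mix covers {sum(mix.values()):.1f}% of the total load.")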

2.2 Email Server Benchmark: LoadSim


LoadSim is a freely distributed Microsoft Exchange Server benchmark offered by
Microsoft Corporation. It is intended for use in a lab or non-production setting to
test possible email, calendaring, and appointment workloads. LoadSim is intended
to be run on client machines which deliver message requests to the target Microsoft
Exchange server. Using Perfmon (Windows performance monitor) on the target
server provides a method for tracking many different performance metrics, includ-
ing CPU load, memory pages per second, and disk accesses per second. LoadSim
provides the ability to simulate several different types of mail users operating dur-
ing the timeframe of the benchmark. Each share of users can be designated as heavy
or light mail users, and can be tweaked to simulate clients using locally cached
mailboxes. Workloads can be altered to simulate various sizes
of attachments for a given percentage of messages in the overall workload. Several
different Outlook/Exchange tasks can be tested for performance impact, including
message filtering, journalling, reading and writing to public folders, meeting re-
quests, folder synchronization between client and server, appointment scheduling,
contact creation, and contact browsing.

LoadSim is free to download from Microsoft. It seemed relatively easy to configure
given the documentation available online. It seems to be the most prevalent
method of benchmarking Microsoft Exchange Server installations. The main al-
ternative for benchmarking Exchange Server appears to be Mailstorm, which is
itself a mail service that employs an Oracle backend. Although the main purpose
of Mailstorm is not as a benchmarking utility, it can be used to mass-mail a large
amount of data to an Exchange server. LoadSim seemed the logical choice since it
is configured to specifically test various Outlook functionalities.

Our intention was to benchmark several differing workflows that would mimic
messaging and appointment loads that commonly occur in generalized business
and education environments. Several things went wrong. After managing to par-
tially initialize the benchmark testing routine, we encountered the ubiquitous MAPI
E FAILONPROVIDER error. We eventually managed to delete several hundred
spuriously-created Microsoft Exchange server users that had arbitrary passwords
generated during the first failed initialization. This was an extremely time-consuming
error that turned out to be somewhat simple to solve. We also encountered a shar-
ing error stemming from an improperly configured public mail folder. We strug-
gled during the entire time given for benchmark setup to understand why we could
not select a non-local Exchange Server when we initialized LoadSim from the client
machine. By the time the first batch of multi-service benchmarks was being exe-
cuted on the entire system, we had determined that there was no Windows Domain
Controller configured on the Exchange server.

2.3 File Server Benchmark: dbench


File serving is a typical server function. dbench is an open source file server bench-
mark that mimics the behavior of Netbench, the industry standard [15]. It was cre-
ated by Andrew Tridgell, a developer for the open source project Samba, an SMB
file and printer server for Unix. Netbench is a commercial product created by Ziff
Davis Media that measures the performance of a file server serving data to Windows
clients [1]. The problem with Netbench is that it requires many physical machines
to act as clients. dbench has a very different philosophy. Tridgell believes
that software should be accessible to all users. Thus the clients are simulated,
reducing the hardware demand from many physical client machines to a single
machine that simulates all of the clients. dbench is open source, allowing
anyone to use it for free and, more importantly, to know exactly what the code does
to simulate the file server load. dbench is the obvious choice for time, space, and
money reasons.

The dbench benchmark is a set of three components: dbench, tbench, and
smbtorture. All of these components simulate the Common Internet File System
(CIFS) calls that would be created to perform file serving. dbench measures the file
system on the server side by measuring the throughput of the disk I/O operations
in MB/s. tbench measures the throughput between the client and server over the
network in MB/s. smbtorture is a utility that stress tests the Samba suite. Together
they mimic Netbench.

I was unable to find the smbtorture component of the benchmark as described
in the documentation. After weeks of searching the internet, I finally found a newer
version of smbtorture in the Samba 4.0 development tree. Unfortunately, there was
not enough time to install and use this component as part of the benchmark. Since
we are measuring how well VMs perform as single servers, we decided it is
sufficient to measure the file system operations using dbench and network traffic
using tbench. The benchmark was run simulating from 1 to 195 clients.
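
As an illustration of how such a client sweep might be scripted, the following
minimal sketch loops over client counts, runs dbench, and extracts the reported
throughput. It is not the exact harness we used; it assumes the dbench binary is on
the PATH, that it accepts the number of simulated clients as its argument, and that
its summary output contains a line of the form "Throughput <n> MB/sec".

    # Hypothetical sweep driver for dbench; the output format assumed here
    # ("Throughput <n> MB/sec") is the conventional dbench summary line.
    import re
    import subprocess

    CLIENT_COUNTS = [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]

    def run_dbench(nclients):
        """Run dbench with nclients simulated clients and return MB/s throughput."""
        result = subprocess.run(["dbench", str(nclients)],
                                capture_output=True, text=True, check=True)
        match = re.search(r"Throughput\s+([\d.]+)\s+MB/sec", result.stdout)
        if match is None:
            raise RuntimeError("could not parse dbench output")
        return float(match.group(1))

    if __name__ == "__main__":
        for n in CLIENT_COUNTS:
            print(f"{n:4d} clients: {run_dbench(n):7.2f} MB/s")

A similar loop works for tbench, which additionally needs its server-side component
running on the machine under test.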

The results for the benchmark are shown in Figures 2 and 3. Figure 2 shows
the results of the dbench component. As more clients are added, the file system
performance drops off quite fast. File I/O is very CPU intensive, so as more
clients demand files and the other VMs use the CPU, the performance decreases.
Figure 3 shows the results of the tbench component: as more clients are added,
the throughput improves, peaking at approximately 80 clients. It is clear that
the CPU is a bottleneck in the file server case.

Figure 2: dbench results (throughput in MB/s versus the number of simulated clients for the HP and Dell machines)

Figure 3: tbench results (throughput in MB/s versus the number of simulated clients for the HP and Dell machines)

2.4 Java Server Benchmark: SPECjbb


SPECjbb2005 is a software benchmark from the Standard Performance Evaluation
Corporation (SPEC) [4] for evaluating the performance of server-side Java. It
evaluates performance by emulating a three-tier client/server system (with emphasis
on the middle tier). The benchmark exercises the implementations of the JVM (Java
Virtual Machine), JIT (Just-In-Time) compiler, garbage collection, threads, and some
aspects of the operating system. It also measures the performance of CPUs, caches,
the memory hierarchy, and the scalability of shared memory processors (SMPs).
SPECjbb2005 provides a new enhanced workload, implemented in a more object-oriented
manner to reflect how real-world applications are designed, and introduces new
features such as XML processing and BigDecimal computations to make the benchmark
a more realistic reflection of today's applications. Its main characteristics are:

- Totally self-contained and self-driving (generates its own data, generates its
own multi-threaded operations, and does not depend on any package beyond the JRE).

- Memory resident, performs no I/O to disks, has only local network I/O, and
has no think times.

- Clients are replaced by driver threads, database storage by binary trees of
objects, and increasing amounts of workload are applied, providing a graphical
view of scalability.

Benchmarks like RUBiS and VolanoMark have been used in the past to benchmark
Java servers. While RUBiS is an auction site prototype usually used to evaluate
application server performance and scalability, VolanoMark is a pure Java server
benchmark characterized by long-lasting network connections and high thread counts.
It creates client connections in groups of 20 and measures the time required by the
clients to take turns broadcasting a set of messages to the group. The fact that
SPECjbb2005 emulates a three-tier system, the most common type of server-side Java
application, is the reason for using this benchmark in our study.

Terms specific to SPECjbb2005: A warehouse is a unit of stored data. It contains
roughly 25 MB of data stored in many objects in several Collections (HashMaps,
TreeMaps). A thread represents an active user posting transaction requests within
a warehouse. There is a one-to-one mapping between warehouses and threads, plus
a few threads for SPECjbb2005 main and various JVM functions. As the number
of warehouses increases during the full benchmark run, so does the number of
threads.

The user can configure the number of application instances to run. When more
than one instance is selected, several instances will be run concurrently, with the
final measurement being the sum of those for the individual instances. The multiple
application instances are synchronized using local socket communication and a
controller.

Output is produced in terms of business operations per second (bops) and also
bops/JVM.

Results:

The throughput metrics are calculated as follows:

1. The value of the expected peak warehouse (N) is set to the result of the
runtime call to obtain the maximum number of processors in the system. In our
case, N = 2.

2. For all points from N to 2*N warehouses, the scores for the individual JVM
instances are added. (The other points do not contribute to the calculation
of the throughput metrics.)

3. The summed throughputs for all the points from N warehouses to 2*N warehouses
(inclusive of both) are averaged. This average is the SPECjbb2005 bops metric.
The SPECjbb2005 bops/JVM is obtained by dividing the SPECjbb2005 bops metric
by the number of JVM instances. (A small sketch of this calculation follows.)
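
To make the calculation concrete, here is a minimal sketch of the bops computation
for a hypothetical two-instance run with N = 2. The per-warehouse throughput values
below are invented for illustration (the reports give only the final scores); they
are chosen so that the result lands near the HP machine's two-instance figures
reported later in this section.

    # SPECjbb2005 bops: average, over warehouse counts N..2N inclusive, of the
    # throughput summed across JVM instances at each warehouse count.
    # The per-warehouse numbers below are illustrative, not measured values.

    def specjbb_bops(scores_by_instance, n):
        """scores_by_instance[i][w] = instance i's throughput at w warehouses."""
        points = range(n, 2 * n + 1)                          # N..2N inclusive
        summed = [sum(inst[w] for inst in scores_by_instance) for w in points]
        bops = sum(summed) / len(summed)                      # SPECjbb2005 bops
        return bops, bops / len(scores_by_instance)           # ..., bops/JVM

    instance1 = {2: 6000.0, 3: 6100.0, 4: 6180.0}     # hypothetical JVM 1 scores
    instance2 = {2: 12800.0, 3: 13000.0, 4: 13140.0}  # hypothetical JVM 2 scores
    bops, bops_per_jvm = specjbb_bops([instance1, instance2], n=2)
    print(f"SPECjbb2005 bops = {bops:.0f}, bops/JVM = {bops_per_jvm:.0f}")

Note that bops/JVM is simply bops divided by the number of instances, which is how
the 17795/8898 (Dell) and 19074/9537 (HP) pairs reported below are related.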

Figure 4: Dell machine with one JVM instance

The benchmarking results for the Dell machine are as follows:

1. With one JVM instance, both the SPECjbb2005 bops and the SPECjbb2005
bops/JVM are 16431. See Figure 4.
2. With two JVM instances, JVM 1 has a score of 8964 while JVM 2 has a score
of 8831. The SPECjbb2005 bops is 17795 and the SPECjbb2005 bops/JVM is 8898.
See Figure 5.

Figure 5: Dell machine with two JVM instances

Figure 6: HP machine with one JVM instance

Figure 7: HP machine with two JVM instances

The benchmarking results for the HP machine are as follows:

1. With one JVM instance, both the SPECjbb2005 bops and the SPECjbb2005
bops/JVM are 17453. See Figure 6.
2. With two JVM instances, JVM 1 has a score of 6094 while JVM 2 has a score
of 12980. The SPECjbb2005 bops is 19074 and the SPECjbb2005 bops/JVM is 9537.
See Figure 7.

2.5 Web Server Benchmark: SPECweb2005


SPECweb2005 is a software benchmark from the Standard Performance Evaluation
Corporation (SPEC) [5] designed to measure a system's capability as a web server.
SPECweb2005 measures the unencrypted and SSL-encrypted request/response performance
of the web server by employing the following three workloads:

- SPECweb2005 Banking is a purely SSL encrypted workload.

- SPECweb2005 Ecommerce is a partly SSL encrypted workload.

- SPECweb2005 Support is a purely unencrypted workload. Most of the requests
and traffic in SPECweb2005 Support are for normal HTTP downloading of files of
various sizes.

All of the workloads' page requests are dynamic, so the size of the page requests
can be customized. In this benchmarking run, however, we have chosen to stick with
the default settings.

The architecture of SPECweb2005 consists of the following components:

1. Clients: The benchmark clients run a program that sends HTTP requests to
the server and receives HTTP responses from the server.
2. Prime Client: The prime client initializes the clients and the back-end
simulator, then collects and stores the results of the benchmarking run.
3. Web Server: The web server is the system to which the clients send their
HTTP requests.
4. Back-end Simulator: The back-end simulator emulates a back-end server with
which the web server must communicate to get dynamic data.
Each of the workloads in SPECweb2005 has a metric, which is the number of
simultaneous user sessions that the web server can service while meeting the
quality-of-service (QoS) requirements of the benchmark. The metric for SPECweb2005
as a whole is then the geometric mean of the ratios of each workload's submetric
score to its respective workload reference score. So a system with a SPECweb2005
score of 100 is considered equivalent to the reference system, while one that has a
score of 200 is considered to be twice as fast as the reference system.
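
A minimal sketch of this composite calculation follows. The submetric and reference
values are placeholders for illustration only, not the official SPEC reference
scores.

    # SPECweb2005 composite: geometric mean of each workload's simultaneous-session
    # score divided by its reference score, scaled so the reference system scores 100.
    from math import exp, log

    def specweb_score(submetrics, references):
        ratios = [submetrics[w] / references[w] for w in submetrics]
        return 100.0 * exp(sum(log(r) for r in ratios) / len(ratios))

    # Placeholder numbers: sessions sustained per workload vs. reference scores.
    submetrics = {"Banking": 2200.0, "Ecommerce": 3100.0, "Support": 2000.0}
    references = {"Banking": 2000.0, "Ecommerce": 2900.0, "Support": 1900.0}
    print(f"SPECweb2005 score ~ {specweb_score(submetrics, references):.0f}")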

3 Related Work
Modern computers are now powerful enough to run hundreds of processes at the
same time. Therefore, it is now wasteful to acquire a new machine for each server
process. Virtualization is an important technique for subdividing the resources of
a modern computer. The benefits of virtualization include [14]:

- Improving the utilization of machine resources.

- Virtual machines can provide secure, isolated sandboxes for running untrusted
applications.

- Providing resource limit constraints and, in some cases, resource guarantees.
This helps in the creation of QoS-enabled operating systems.

- Easy migration of systems. This allows for flexible and robust error handling.
VMware [9] is an example of a popular full system virtualization tool for the x86
architecture [13]. Virtualization benchmarking using VMware has been performed by
various hardware vendors, such as Dell [11, 12] and HP [3]. Xen is another popular
virtualization tool. The difference between Xen and VMware is that while VMware
employs a full virtualization approach, where the exposed functionality of the
virtual hardware is identical to that of the underlying machine, Xen employs an
approach known as paravirtualization, where the virtual machine abstraction is
similar but not identical to the underlying hardware [8, 13]. Virtualization
benchmarking using Xen is the topic of the papers by Clark et al. [6] and Barham
et al. [8]. Also relevant to the topic of virtualization is the usage of Disco, a
full virtualization technique, in running commodity operating systems on ccNUMA
architectures [7], and Denali, a paravirtualization technique for hosting vast
numbers of virtualized OS instances [18]. Finally, VMware has also published a
performance comparison of several virtualization frameworks [17], including Xen
and its own product.

4 Experiment Description
The experimental setup consists of an HP ProLiant DL585 G2 with eight 2.6 GHz
dual-core AMD Opteron processors and 32 GB of memory, and a Dell PowerEdge 6950
with eight 2.6 GHz dual-core AMD Opteron processors and 24 GB of memory. Each of
these machines runs VMware's ESX Server 3.0.1 as the platform for the virtual
machines. The 5 disks connected to the HP are configured in RAID level 5; however,
the 5 disks connected to the Dell are not configured in any RAID level. We created
5 virtual machines (3 running Red Hat Enterprise Linux 4 and 2 running Windows
Server 2003), each with 2 virtual CPUs and 4 GB of memory. For each benchmark run
against the HP virtual machines, the Dell virtual machines act as the clients, and
vice versa.

5 Workloads and Results


The workload consists of the overall set of processes being run by each benchmark,
and is characterized by the amount of system resources used during this overall
process. It is further constrained by the amount of resources available to each
virtual machine on the target physical host.

The specific workload executed during the system benchmarks included dbench
with tbench for the file server, SwingBench for Oracle, SPECjbb2005 for the JVM,
and SPECweb2005 for Apache. Each virtual machine was configured with two virtual
CPUs and 4 GB of system memory. The Exchange Server benchmark did not participate
in the benchmarks, and its VM was consequently idle.

Across four runs of the benchmark, two per system, our results indicate a slight
difference in CPU load between the two systems (see, for example, Figure 8), while
the amount of memory consumed is almost identical: free memory on the Dell machine
hovers around the 13 GB mark while that on the HP machine hovers around the 21 GB
mark. Given that the Dell machine has only 24 GB of physical memory versus 32 GB
for the HP machine, both systems are using roughly 11 GB, which indicates that the
benchmarks were not able to saturate system memory. This can be attributed partly
to the constraint of running the clients on the Dell machine while benchmarking the
HP machine and vice versa, which creates an artificial barrier to stress testing
the systems.

Figure 8: Comparison of CPU load between the Dell and HP machines

6 Future Work
Due to time constraints and technical difficulties, we were unable to test the ma-
chines as thoroughly as we had hoped. Our future work will include running ad-
ditional workload configurations to fully utilize the hardware resources.

In addition, a subset of the authors plan to continue this work as a research
project that will carry on from the preliminary work done this semester. The
project will focus on one or two of the applications described in this paper, such
as the Oracle 10g database application, and will focus the experimentation,
benchmarks, and workloads on exposing aspects of configuration, organization, and
load balancing.

The tests will be run on both 32-bit and 64-bit versions of the Linux OS. We will
explore generating workloads based on what was learned from previous runs of the
workload. As UIS is underwriting use of the machines, we will refine the
workloads through close interaction with UIS. Moreover, since UIS handles heavy
database transaction loads, we will study the advantages of running small databases
in a virtual environment so that better response times may be obtained despite the
processing of a large number of queries.

7 Conclusions
The two physical machines did not seem to have a significant difference in
performance. The unfortunate and unintended differences between the machines, in
memory size and RAID configuration, most likely obscure any true comparison in this
case. Further work needs to be done in order to ascertain the true performance
differences and an ideal workload configuration. It is clear from this experiment
that virtualization software is an excellent tool for large and small businesses to
host servers, and to expand and adapt to ever-changing markets.

References
[1] Answers.com. Computer Desktop Encyclopedia. Computer Language Company Inc.,
2007. Answers.com, April 2007. URL http://www.answers.com/topic/pc-magazine-benchmarks.

[2] P. Chen and D. Patterson. A new approach to I/O performance evaluation. ACM
Transactions on Computer Systems, 12:308-339, 1994.

[3] Hewlett-Packard Development Company. Smartpeak WLM with an HP BladeSystem and
VMware ESX environment, June 2006.

[4] Standard Performance Evaluation Corporation. SPECjbb2005. URL
http://www.spec.org/jbb2005/.

[5] Standard Performance Evaluation Corporation. SPECweb2005. URL
http://www.spec.org/web2005/.

[6] B. Clark et al. Xen and the art of repeated research. In USENIX Annual
Technical Conference, pages 135-144, 2004.

[7] K. Govil et al. Cellular Disco: resource management using virtual clusters on
shared-memory multiprocessors. ACM Transactions on Computer Systems, 18:229-262,
2000.

[8] P. Barham et al. Xen and the art of virtualization. In Proceedings of the
Nineteenth ACM Symposium on Operating Systems Principles, pages 164-177, 2003.

[9] VMware Inc. VMware. URL http://www.vmware.com.

[10] S. King, G. Dunlap, and P. Chen. Operating system support for virtual
machines. In 2003 USENIX Annual Technical Conference, pages 71-84, 2003.

[11] D. Morse. Improved virtualization performance with 9th generation servers.
Technical report, Dell Enterprise Systems, August 2006.

[12] T. Muirhead and D. Jaffe. Advantages of Dell PowerEdge 2950 two socket servers
over Hewlett-Packard ProLiant DL585 G2 four socket servers for virtualization.
Technical report, Dell Enterprise Systems, December 2006.

[13] R. Rose. Survey of system virtualization techniques. 2004.

[14] Amit Singh. URL http://www.kernelthread.com/publications/virtualization/.

[15] Andrew Tridgell. URL http://samba.org/ftp/tridge/dbench/README.

[16] Indiana University. University Information Systems (UIS). URL
http://uits.iu.edu/scripts/ose.cgi?apti.ose.help.

[17] VMware. A performance comparison of hypervisors, 2006.

[18] A. Whitaker, M. Shaw, and S. Gribble. Denali: Lightweight virtual machines for
distributed and networked applications. Technical report, University of Washington,
2001.
