Proceedings of the IEEE

International Conference on Automation and Logistics

Zhengzhou, China, August 2012

Deploying and Researching Hadoop in Virtual Machines

Guanghui Xu, Feng Xu*, Hongxu Ma
College of Computer and Information
Hohai University
Hohai, Nanjing 211100, China
jerryxgh@gmail.com {njxufeng & mdahagg }@163.com
programming model Hadoop but also all the advantages of
virtual machines, such as fully utilizing the system resources,
easing the management of the systems, improving the
reliability and saving the power.

Abstract: The emergence of Hadoop and the maturity of
virtualization make it feasible to combine them to process
immense data sets. Research on Hadoop in a virtual
environment requires an experimental environment. This
paper first introduces the technologies used, namely
CloudStack, MapReduce and Hadoop. Based on that, a method
to deploy CloudStack is given. We then discuss how to deploy
Hadoop in virtual machines obtained from CloudStack, and
present an algorithm that solves the problem that all virtual
machines created by CloudStack from the same template have
the same hostname. After that we run some Hadoop programs
on the virtual cluster, which shows that deploying Hadoop in
this way is feasible. Then some methods to optimize Hadoop
in virtual machines are discussed. Readers can follow this
paper to set up their own Hadoop experimental environment
and to learn the current status and trends of optimizing
Hadoop in virtual machines.
Index Terms





In this section we discuss CloudStack, the MapReduce
programming model and its open source implementation
Hadoop. Two widely used virtualization technologies are
also described. Finally, we discuss the advantages and
disadvantages of deploying Hadoop in a virtual environment.
A. CloudStack
CloudStack is an open source software platform that
pools computing resources to build public, private, and
hybrid Infrastructure as a Service (IaaS) clouds. CloudStack
manages the network, storage, and compute nodes that make
up a cloud infrastructure. CloudStack can be used to deploy,
manage, and configure cloud computing environments [12].
With CloudStack, you can do the following:
Set up an on-demand, elastic cloud computing
service. Service providers can sell self-service
virtual machine instances, storage volumes, and
networking configurations over the Internet [12].
Set up an on-premise private cloud for use by
employees. Rather than managing virtual machines
in the same way as physical machines, with
CloudStack an enterprise can offer self-service
virtual machines to users without involving IT
departments [12].



When the computer was first invented, data and compute
resources were centralized, and users accessed them through
terminals. With the development of hardware, the personal
computer came into our lives. Now the trend is for data and
compute resources to be centralized again, which is called
cloud computing.
Nowadays, the most frequently used programs are
Internet-based services, such as search engines, social
network services and electronic businesses, which have
millions of users. Every moment those services emit large
amounts of data, which raises a problem: how to deal with
such immense data sets. Search engine leader Google uses a
programming model called MapReduce, with which it can
process 20 PB of data per day [5]. Hadoop is an open source
implementation of MapReduce, which is sponsored by Yahoo.
As free and open source software, Hadoop is developing fast;
its first stable version was released recently. Many research
results have been integrated into it. Not only researchers but
also enterprises are using Hadoop.
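Meanwhile, with the maturity of virtual machine
technology, VM-based computing infrastructure has emerged,
such as Amazon EC2 (Elastic Compute Cloud).
This raises a question: can we use Hadoop in virtual clusters
instead of physical clusters? If we can, we obtain not only
the data processing ability provided by the parallel
programming model Hadoop but also all the advantages of
virtual machines, such as fully utilizing the system resources,
easing the management of the systems, improving
reliability and saving power.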
B. Virtualization Technology
Virtualization is a class of technologies that let
computing elements run on virtual machines rather than
on physical ones. There are many virtualization
technologies, but we focus on two that are free and open
source software and have been widely used.
1) Xen
Xen is a virtual-machine monitor providing services that
allow multiple computer operating systems to execute on the
same computer hardware concurrently [10]. It was originally
developed by the University of Cambridge Computer Laboratory.
Xen is free software licensed under the GNU General
Public License (GPLv2). At the time of writing, the
latest release version is 4.1. Amazon EC2 (Elastic Compute
Cloud) uses Xen.
2) KVM
Kernel-based Virtual Machine (KVM) is a virtualization
infrastructure for the Linux kernel. KVM supports native
virtualization on processors with hardware virtualization
extensions [11]. KVM is also free and open source software.

of the master node of MapReduce. Besides that,
virtualization can help to fully utilize the system resources.
By using EC2-like services, customers can process vast
amounts of data easily and cost-effectively.
2) Disadvantages
The only disadvantage is the potential for poor
performance under heavy load, which is the problem to
be solved.

C. MapReduce and Hadoop

The name MapReduce comes from two kinds of
operations in functional programming languages: Map and
Reduce. In a functional programming language, functions have
no side effects, which means that programs written in such
a language are easier to optimize for parallel execution.
In functional programming languages, Map and Reduce take
functions as parameters, and MapReduce makes full use of
this property.
The MapReduce programming model divides the problem to be
solved into two kinds of operations, Map and Reduce.
When it receives a request, its processing flow is as shown
in figure 1.
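The flow described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) and the word-count task are assumptions chosen for illustration, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def run_job(splits):
    """Run map over every split, then reduce each group of values."""
    intermediate = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = run_job(["hadoop in virtual machines", "hadoop in cloudstack"])
print(counts)
```

Because map_phase and reduce_phase have no side effects, each call depends only on its arguments, which is what allows a framework to distribute the calls across nodes.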





A. CloudStack Deployment
A CloudStack installation consists of two parts: the
Management Server and the cloud infrastructure that it
manages. When you set up and manage a CloudStack cloud,
you provision resources such as hosts, storage devices, and
IP addresses into the Management Server, and the
Management Server manages those resources. Figure 2
gives an overview.



Fig 2. CloudStack overview





1) Management Server
CloudStack uses the Management Server to manage
resources. Users can manage their cloud infrastructure
through the Management Server UI or API.
2) Cloud Infrastructure
The Management Server manages one or more zones
(typically, datacenters) containing host computers where
guest virtual machines will run. The cloud infrastructure is
organized as follows [12]:
Zone: Typically, a zone is equivalent to a single
datacenter. A zone consists of one or more pods and
secondary storage.
Pod: A pod is usually one rack of hardware that
includes a layer-2 switch and one or more clusters.
Cluster: A cluster consists of one or more hosts and
primary storage.
Host: A single compute node within a cluster. The
hosts are where the actual cloud services run in the
form of guest virtual machines.
Primary storage is associated with a cluster, and it
stores the disk volumes for all the VMs running on
hosts in that cluster.
Secondary storage is associated with a zone, and it
stores templates, ISO images, and disk volume snapshots.
Figure 3 shows how the cloud infrastructure is organized
within a zone.
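The containment hierarchy listed above can be modelled as nested records. The following Python sketch is only an illustration of the zone, pod, cluster and host relationship; the class and field names are assumptions made here and are not part of the CloudStack API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    name: str  # a single compute node that runs guest VMs


@dataclass
class Cluster:
    name: str
    primary_storage: str  # holds disk volumes of VMs in this cluster
    hosts: List[Host] = field(default_factory=list)


@dataclass
class Pod:
    name: str  # usually one rack with a layer-2 switch
    clusters: List[Cluster] = field(default_factory=list)


@dataclass
class Zone:
    name: str  # typically one datacenter
    secondary_storage: str  # holds templates and ISO images
    pods: List[Pod] = field(default_factory=list)

    def all_hosts(self):
        """Return every host in the zone, walking pods and clusters."""
        return [h for p in self.pods for c in p.clusters for h in c.hosts]


# Hypothetical example values, for illustration only.
zone = Zone("zone1", "nfs-secondary", [
    Pod("pod1", [Cluster("cluster1", "nfs-primary",
                         [Host("host1"), Host("host2")])]),
])
print([h.name for h in zone.all_hosts()])
```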


Fig 1. MapReduce programming model

MapReduce is only a programming model; at Google it
runs on GFS (Google File System) [6]. Hadoop is the open
source implementation of MapReduce, and it has its own
distributed file system, called HDFS (Hadoop Distributed
File System). At the time of writing, its latest release
consists of three parts: Common, HDFS and Hadoop MapReduce.
Common is the set of common utilities that support the other
Hadoop subprojects; HDFS is the Hadoop Distributed File
System; Hadoop MapReduce, as its name says, implements the
MapReduce model introduced above.
D. Advantages and Disadvantages of Hadoop in Virtual Environment
1) Advantages
MapReduce was designed for clusters of commodity PCs, and
managing thousands of commodity PCs is a big job. The
reliability of commodity PCs is also a concern. Perhaps the
biggest problem is power consumption. So anyone who wants to
build their own compute center will pay quite a lot. This is
where EC2-like services are useful. Deploying Hadoop
applications on virtual machines brings all the advantages
of virtualization: it makes the cluster easier to manage and
improves reliability, because virtual machines can be
recovered from a crash more easily than physical ones.
Thus, it can improve the reliability


$ sudo update-alternatives --install /usr/bin/java java /usr/lib/java/jdk1.6.0_20/bin/java 300
$ sudo update-alternatives --install /usr/bin/javac javac /usr/lib/java/jdk1.6.0_20/bin/javac 300
$ sudo update-alternatives --config java



4) Run Hadoop
With the Sun JDK installed, Hadoop is easy to run, but a
problem arises: CloudStack uses a template to create virtual
machines, so all the virtual machines have the same
hostname, which causes conflicts in Hadoop. To solve this, we
introduce the Auto Change Hostname Service (ACHS). When a
virtual machine starts, it first runs a program that we call
the Auto Change Hostname Client (ACHC). ACHC asks ACHS
whether this machine is registered. If not, ACHC registers the
machine and requests a hostname, then changes the hostname,
writes it into the OS configuration and runs the Hadoop
services. If ACHC finds that the machine has already been
registered, it runs the Hadoop services immediately. Figure 4
shows the procedure of the algorithm.
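The registration procedure just described can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the implementation used in our experiments: the machine identifier (for example a MAC address), the hostname prefix, and the set_hostname and start_hadoop callbacks are all assumptions introduced here, and a real ACHS would be reached over the network.

```python
class AutoChangeHostnameService:
    """In-memory stand-in for ACHS: maps a machine id to a unique hostname."""

    def __init__(self, prefix="hadoop-vm"):  # prefix is a hypothetical choice
        self.prefix = prefix
        self.registry = {}

    def is_registered(self, machine_id):
        return machine_id in self.registry

    def register(self, machine_id):
        """Register the machine if needed and return its hostname."""
        if machine_id not in self.registry:
            self.registry[machine_id] = f"{self.prefix}-{len(self.registry) + 1}"
        return self.registry[machine_id]


def achc_boot(service, machine_id, set_hostname, start_hadoop):
    """ACHC logic run at VM start-up.

    set_hostname and start_hadoop are caller-supplied callbacks standing in
    for writing the OS configuration and launching the Hadoop services.
    """
    if not service.is_registered(machine_id):
        set_hostname(service.register(machine_id))
    start_hadoop()


# Simulate two VMs cloned from one template, then a reboot of the first.
service = AutoChangeHostnameService()
log = []
for vm in ["mac-1", "mac-2", "mac-1"]:
    achc_boot(service, vm, log.append, lambda: log.append("start"))
print(log)
```

In the simulated run, the rebooted machine skips the hostname change and starts the Hadoop services directly, which matches the second branch of the procedure.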




Fig 3. Organization of a zone in CloudStack

3) CloudStack installation
a) Prepare
The operating system should be one of RHEL 5.4-5.x
64-bit or 6.2+ 64-bit, CentOS 5.4-5.x 64-bit or 6.2+
64-bit, or Ubuntu 10.04 LTS.
64-bit x86 CPU (more cores result in better performance)
4 GB of memory
250 GB of local disk (more results in better
capability; 500 GB recommended)
At least 1 NIC
Statically allocated IP address
Fully qualified domain name as returned by the
hostname command
XenServer 6.0 (for CloudStack 3.0.0) or XenServer
b) Management Server Installation
Download the CloudStack Management Server. You
should have a file in the form of
CloudStack-VERSION-N-OSVERSION.tar.gz. Untar the file and
then run the install.sh script inside it:





Fig 4. Procedure of Auto Change Hostname Algorithm


# tar xzf CloudStack-VERSION-N-OSVERSION.tar.gz
# cd CloudStack-VERSION-N-OSVERSION
# ./install.sh


A. Task Scheduling
Hadoop's performance is closely tied to its task scheduler,
which implicitly assumes that cluster nodes are
homogeneous and that tasks make progress linearly, and uses
these assumptions to decide when to speculatively re-execute
tasks that appear to be stragglers [1]. These are the implicit
assumptions of Hadoop's scheduler [1]:
Nodes can perform work at roughly the same rate.
Tasks progress at a constant rate throughout time.
There is no cost to launching a speculative task on a
node that would otherwise have an idle slot.
A task's progress score is roughly equal to the
fraction of its total work that it has done. Specifically,
in a reduce task, the copy, sort and reduce phases
each take 1/3 of the total time.
Tasks tend to finish in waves, so a task with a low
progress score is likely a slow task.
Different tasks of the same category (map or reduce)
require roughly the same amount of work.
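The progress-score assumption can be made concrete with a short Python sketch. It is a simplified illustration rather than Hadoop's scheduler code: the function names are assumptions, and measuring a map task by the fraction of input read is our reading of [1].

```python
def map_progress(bytes_read, bytes_total):
    """Progress of a map task: fraction of its input already read."""
    if bytes_total <= 0:
        return 1.0  # an empty input is treated as already complete
    return min(1.0, bytes_read / bytes_total)


def reduce_progress(phases_done, fraction_of_current_phase):
    """Progress of a reduce task with three phases weighted 1/3 each.

    phases_done is the number of finished phases (0 to 3);
    fraction_of_current_phase is in [0, 1].
    """
    return min(1.0, (phases_done + fraction_of_current_phase) / 3.0)


# A reduce task halfway through its second phase.
score = reduce_progress(1, 0.5)
print(score)
```

The equal 1/3 weights are exactly the kind of assumption that can break down when virtual machines share a disk or a network card, because one phase may then take far longer than the others.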

Then choose M to install the Management Server.

For more details, see [12].
B. Hadoop deployment
Hadoop is written in Java. We deploy Hadoop on
Ubuntu 12.04, but the Sun JDK has been removed from the
official repository. To obtain a JDK to run Hadoop, follow
these steps:
1) Download the latest version of the JDK for Ubuntu from
jdk-7u4-downloads-1591156.html; we chose jdk-7u4-linux-i586.tar.gz.
2) Set environment variables
Untar the file, set environment variable JAVA_HOME to
the path of JDK, add JAVA_HOME/bin to PATH and
3) Make the Sun JDK the default JDK


But if Hadoop is running on virtual machines and knows
whether any two virtual machines are on the same physical
host, this will help Hadoop decide which virtual machine
should run which map or reduce task.
In a physical cluster, the nodes may be homogeneous because
the machines may all have the same hardware, but in a virtual
environment the situation is more complicated: even if the
virtual machines have the same virtual hardware, some of
them may run on the same physical host and others on
different physical hosts. Although the virtual machine
monitor can isolate CPU and memory, virtual machines have
to compete for network bandwidth and disk, which may cause
Hadoop's implicit assumption of a homogeneous cluster to
fail. If the homogeneity assumption fails, the efficiency of
Hadoop's scheduler will be seriously affected.
Some scheduling algorithms have been proposed, for example
LATE (Longest Approximate Time to End) [1], but LATE is
designed to help Hadoop cope with heterogeneous environments
in general, not virtual environments specifically. We need a
scheduling algorithm designed specifically for virtual
clusters, which can improve Hadoop's efficiency further.
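To show what LATE estimates, the following Python sketch computes an approximate time to end from a task's progress score and elapsed running time, and picks the running task expected to finish last. It is a simplified illustration based on our reading of [1]: the function names and the task table are assumptions, and the thresholds and caps that [1] applies to speculative execution are omitted.

```python
def time_to_end(progress_score, elapsed_seconds):
    """Estimate remaining seconds, assuming the task keeps its average rate.

    rate = progress_score / elapsed_seconds, so
    remaining = (1 - progress_score) / rate.
    """
    if progress_score <= 0:
        return float("inf")  # no progress yet, so no finite estimate
    return elapsed_seconds * (1.0 - progress_score) / progress_score


def pick_speculation_candidate(tasks):
    """Return the unfinished task with the longest estimated time to end.

    tasks maps a task name to (progress_score, elapsed_seconds).
    """
    running = {name: t for name, t in tasks.items() if t[0] < 1.0}
    if not running:
        return None
    return max(running, key=lambda name: time_to_end(*running[name]))


# Hypothetical snapshot of three tasks, for illustration only.
tasks = {"t1": (0.9, 90), "t2": (0.3, 90), "t3": (1.0, 50)}
candidate = pick_speculation_candidate(tasks)
print(candidate)
```

The estimate still assumes that each task progresses at a constant rate, so contention between virtual machines on one host can make it inaccurate as well.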

B. I/O Scheduling
Efficiency in virtual machines may be much lower than
in physical machines; the reasons include task scheduling and
I/O scheduling. MapReduce was designed to run on physical
machines. When a MapReduce task is running, a lot of data
is transferred between machines, so the efficiency of I/O
scheduling is very important for shortening the response
time.


There is no doubt that a virtual environment differs
from a physical environment. The key is which differences
affect efficiency. The biggest difference is the I/O
environment. For example, suppose machine A and machine B
are running map and reduce jobs, A needs some data on B
and B needs some data on A. In a virtual environment there
are two cases. In the first, A and B are on different host
machines, and they transfer data as in figure 5.






We have discussed CloudStack, the MapReduce programming
model and Hadoop. CloudStack can be used to create a virtual
cluster; MapReduce uses two operations from functional
programming languages, map and reduce, which allow
distributed parallel execution. We then discussed how to
deploy Hadoop in virtual machines obtained from CloudStack,
and presented an algorithm that solves the problem that all
virtual machines created by CloudStack from the same
template have the same hostname. After that we ran some
Hadoop programs on the virtual cluster, which shows that
deploying Hadoop in this way is feasible. We answered the
question of why it is feasible to deploy Hadoop in a
virtualized data center by discussing the advantages and
disadvantages of Hadoop in a virtual environment. The
advantages are that it can ease management, fully utilize
the computing resources, make Hadoop more reliable, save
power and so on. But before enjoying these advantages, we
have to face the lower performance of virtual machines.
Then some methods to optimize Hadoop in virtual machines
were discussed.
Finally, we discussed the differences between Hadoop in
virtual and physical machines, and from those differences
we pointed out two ways to optimize Hadoop in a virtual
environment. Our future work is to follow these two
directions and design algorithms accordingly.

Fig 5. VMs are in same host

But if A and B are on the same host machine, they transfer
data as in figure 6.



Fig 6. VMs are in different hosts

The two cases differ greatly in efficiency. In case 1,
VM A and VM B use different hard disks and different
network cards; but in case 2, VM A and VM B use the same
hard disk and the same network card, namely those of the
physical host. The data transfer efficiency in case 2 is
half that of case 1, which makes the response time much
longer. A long response time cannot be tolerated in short
jobs, which are the main kind of job that MapReduce
processes.
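The factor of two can be reproduced with a very simple model, sketched below in Python. The model is an assumption made for illustration: it supposes that a disk or network card divides its bandwidth equally among the virtual machines using it and ignores all other overheads. The function name and the figures of 1000 MB and 100 MB/s are hypothetical.

```python
def transfer_seconds(megabytes, device_mb_per_s, vms_sharing_device=1):
    """Time to move data when a device's bandwidth is split equally.

    vms_sharing_device is the number of VMs using the same physical
    disk or network card at the same time (1 means exclusive use).
    """
    if vms_sharing_device < 1:
        raise ValueError("at least one VM must use the device")
    effective_rate = device_mb_per_s / vms_sharing_device
    return megabytes / effective_rate


# Case 1: each VM has its own device. Case 2: both VMs share one device.
case1 = transfer_seconds(1000, 100, vms_sharing_device=1)
case2 = transfer_seconds(1000, 100, vms_sharing_device=2)
print(case1, case2)
```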




