Accepted Manuscript

A New Data-Grouping-Aware Dynamic Data Placement Method that Takes into Account Job Execution Frequency for Hadoop

Jia-xuan Wu , Chang-sheng Zhang , Bin Zhang , Peng Wang

PII: S0141-9331(16)30094-1
DOI: 10.1016/j.micpro.2016.07.011
Reference: MICPRO 2432

To appear in: Microprocessors and Microsystems

Received date: 20 January 2016


Revised date: 2 July 2016
Accepted date: 11 July 2016

Please cite this article as: Jia-xuan Wu, Chang-sheng Zhang, Bin Zhang, Peng Wang, A New Data-Grouping-Aware Dynamic Data Placement Method that Takes into Account Job Execution Frequency for Hadoop, Microprocessors and Microsystems (2016), doi: 10.1016/j.micpro.2016.07.011

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
A New Data-Grouping-Aware Dynamic Data Placement Method that Takes into Account Job Execution Frequency for Hadoop

Jia-xuan WU1, Chang-sheng ZHANG1, Bin ZHANG†1, Peng WANG1
(1 School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China)

E-mail: zhangbin@ise.neu.edu.cn

Abstract: Recent years have seen an increasing number of scientists employing data parallel computing frameworks, such as Hadoop, in order to run data-intensive applications. Research on data-grouping-aware data placement for Hadoop has become increasingly popular. However, we observe that many data-grouping-aware data placement schemes are static and do not take MapReduce job execution frequency into consideration. Such data placement schemes lead to performance far below the potential efficiency of an optimal data distribution when frequently executed MapReduce jobs are run. In this paper, we propose a new data-grouping-aware dynamic (DGAD) data placement method based on job execution frequency. First, we build a job access correlation model among the data blocks according to the relationships found in the records of historical data block accesses. Then we use a clustering algorithm to divide the data blocks into clusters according to this model, and we propose a data placement algorithm based on the data block clusters that puts the correlated data blocks within a cluster on different nodes. Finally, a series of experiments is carried out to verify the proposed method. Experimental results show that the proposed method can effectively deal with massive data and clearly improve the execution efficiency of MapReduce.

Key words: Hadoop; data placement; data-grouping-aware; job execution frequency; access correlation

1 Introduction
In recent years, with the popularity of cloud-computing-based [1] services on the Internet, enormous volumes of data, so-called "big data" [3], are being generated by these services around us at all times. Extracting the maximum value from big data requires a new architectural approach.
The MapReduce [5] model is very attractive for parallel applications that process arbitrarily large data sets such as those mentioned above [4]. MapReduce breaks a computation into small tasks that run in parallel on multiple machines, and scales easily to very large clusters of inexpensive commodity computers. Its popular open-source implementation, Hadoop, runs jobs that produce hundreds of terabytes of data on large numbers of cores [6]. Hadoop provides a distributed file system (HDFS) [7] and a framework for the analysis and transformation of very large data sets using MapReduce.
Many data-intensive applications derive considerable benefits from the Hadoop system [8], for example in the fields of bioinformatics [9], astronomy [10], industrial systems [11], and high-energy physics [12]. Often, only a subset of the entire data set is the focus of such analyses. For example, in bioinformatics research, the X and Y chromosomes have the greatest correlation with the gender of human offspring; in genetics, compared to other chromosomes, the X and Y chromosomes are most often analyzed at the same time [13]. Another example is that in the climate modeling and forecasting domain, some scientists are only interested in meteorological information about particular periods or regions [14]. Therefore, data with high correlations have a greater likelihood of being used and processed together. When using Hadoop to handle these applications, we can exploit the fact that different jobs focus on different data to different extents in order to organize the layout of the data blocks and improve the performance of MapReduce.
The default data placement method of Hadoop is the simplest random placement algorithm [15], which may place associated data blocks on the same node [16]. As shown in Fig. 1(a), when several map tasks need to handle associated data blocks that are randomly placed on the same node, because the number of slots on each node is limited, even if some map tasks are placed on the node holding the data they need to access, following the principle of data localization, they must wait for access, resulting in time delays. If they are instead arranged on other nodes far away from the data they need to access, then although there is no need to wait in line, this goes against the principle of data localization and triggers data block migration. Both cases degrade the execution performance of MapReduce.
[Fig. 1 contrasts two placements across node1-node4: (a) the Hadoop default random data placement, where correlated data blocks land on the same node and produce waiting or non-local map tasks; (b) a data-grouping-aware data placement, where correlated blocks are spread out so map tasks run locally. Legend: correlated data blocks; other data blocks/empty; local, non-local, and waiting map tasks.]
Fig. 1 Simple Case of Two Data Block Placements
In current research, in order to solve the problem of Hadoop's random data placement algorithm, some studies consider the correlation between data blocks and have proposed a series of data-grouping-aware data placement methods that organize the data layout in specific ways to support high-performance data access, similar to what is shown in Fig. 1(b).
Jun Wang et al. [17] considered the features of and relationships among data, building a data grouping matrix from a historical data access graph and applying a clustering algorithm to it; from this, the best placement of the related data blocks is worked out. Amer et al. [18] exploited the file-grouping relation to better manage distributed file caches; the rationale is that files accessed as a group are likely to be accessed as a group again. Yuan et al. [19] suggested a data placement method that can efficiently reduce data movement during workflow execution for scientific cloud workflows. Lin et al. [20] suggested a value-based selection strategy for HDFS data block placement that addresses the imbalance problem caused by the random strategy; copies are placed in combination with node distance and data load. Eltabakh et al. [21] proposed a lightweight solution for collocating related files in HDFS. Xie et al. [22] showed that ignoring data locality can obviously reduce MapReduce performance because of data transfers from slow to fast nodes; the authors computed each node's processing ability and placed the copies of data blocks accordingly. Zhang et al. [23] take data locality into account when launching speculative MapReduce tasks in heterogeneous environments. MRAP [24] develops a set of MapReduce APIs for data providers who may be aware of the subsequent access patterns of the data after it is uploaded. Jin et al. [25] proposed ADAPT, an availability-aware MapReduce data placement strategy that optimizes the performance of MapReduce applications by increasing data locality and reducing network traffic.


However, most of the above studies use a static data placement method aimed at balancing the performance of all MapReduce jobs, because they do not take MapReduce job execution frequency into consideration. For example, for the same input file, some MapReduce jobs need to be executed frequently (hot jobs) for a period of time, while other jobs will only be executed a few times or not at all (cold jobs). In such a situation, the above-mentioned data-grouping-aware data placement methods organize the data layout to improve the data access performance of all MapReduce jobs, both hot and cold, which is unfair to the hot jobs; these methods therefore cannot maximize the performance of hot MapReduce jobs. We propose a new data-grouping-aware dynamic data placement method (DGAD) that takes MapReduce job execution frequency into account in order to significantly improve the performance of data-intensive applications. In summary, this study makes the following contributions:
• We built a job access correlation model for the data blocks in order to capture the correlations among the data blocks that jobs access.
• We proposed a clustering-based placement algorithm for correlated data blocks: a clustering algorithm divides the job access correlation model into many data block clusters, and the clusters are placed on the Hadoop nodes in a specific way that maximizes the parallel distribution of the data blocks.
• We realized our work on the Apache Hadoop-2.4.0 release and evaluated it with three real data-intensive applications. The experimental results show that DGAD is able to improve MapReduce performance.
The rest of this paper is organized as follows: Section 2 describes the design of DGAD; Section 3 presents the experimental methodology, our results, and analysis; and Section 4 concludes the paper.

2 Data-Grouping-Aware Dynamic Data Placement Method that Takes Job Execution Frequency into Account
2.1 Job Access Correlation Model among the Data Blocks

Before establishing the job access correlation model among the data blocks, we must first define the correlations among data blocks. MapReduce jobs always require access to multiple data blocks during execution. Data blocks that are used together at the same time are defined as being positively correlated. Data blocks that are accessed together by more MapReduce jobs, or accessed together more frequently, are more strongly positively correlated than others.
Fig. 2 shows the relationships between MapReduce jobs J1, J2, and J3 and data blocks B1 through B10. When J1 is executed, it needs to access data blocks B1, B2, B3, B6, B7, and B8, so these blocks are positively correlated. When J2 is executed, it needs to access data blocks B2, B3, B4, B7, and B9, so these blocks are positively correlated. When J3 is executed, it needs to access data blocks B1, B2, B5, B6, B7, and B10, so these blocks are positively correlated.
[Fig. 2 depicts the access diagram: Job1 spawns Task11 (B1, B3), Task12 (B2, B6), and Task13 (B7, B8); Job2 spawns Task21 (B2, B3), Task22 (B4), and Task23 (B7, B9); Job3 spawns Task31 (B1, B2), Task32 (B5, B6), and Task33 (B7, B10). Legend: Bi = data block, Taskij = map task, Jobi = MapReduce job; arrows = data flow.]
Fig. 2 An Illustration of the Access Diagram for Data Blocks and Jobs


• Definition 1 (Job access correlation among data blocks). Let Jk be a job that accesses block Bi; blocks Bi and Bj may each be accessed by multiple jobs, and each job Jk may execute many times. The correlation of data blocks Bi and Bj is then characterized by the frequency with which Bi and Bj are simultaneously used by jobs.
According to Definition 1, based on the access diagram of the data blocks, this section is intended to establish
a job access correlation model for the data blocks.
Because needs differ, the execution frequency of jobs changes over time, and the frequency with which two data blocks are accessed together may also change, resulting in a dynamically changing co-access frequency across different historical periods. In order to calculate the access frequency accurately, we introduce a weight that captures this dynamic effect on the job access correlation between data blocks; we call it the timeliness factor.
• Definition 2 (Timeliness factor). α is the timeliness factor: a weight applied to the number of times two data blocks are accessed simultaneously in a given historical period. The timeliness factor α is set empirically to a custom value in the range between 0 and 1.

Assume that the total number of job executions is N, and that the number of times two data blocks Bi and Bj are accessed simultaneously by a job during historical period k is n_k(Bi, Bj), with α_k the timeliness factor of period k. For example, suppose two data blocks were frequently accessed together more than one month ago, the number of simultaneous accesses declined within the past week, and it changed again in the latest three days. To calculate the access frequency more accurately, we multiply the number of simultaneous accesses recorded in each period by a weight: the count recorded more than one month ago, n_1(Bi, Bj)/N, is relatively old and its timeliness is poor, so it is multiplied by a small timeliness factor, α_1 = 0.2; the count recorded in the latest three days, n_3(Bi, Bj)/N, is the most timely, so it is multiplied by a larger factor, α_3 = 0.9; and the count recorded within the past week, n_2(Bi, Bj)/N, lies between them and is multiplied by α_2 = 0.6. Thus, the frequency with which data blocks Bi and Bj are accessed simultaneously by MapReduce jobs is given by the following equation:

$$freq(B_i, B_j) = \sum_k \alpha_k \frac{n_k(B_i, B_j)}{N} \qquad (1)$$

By examining the historical data access records in the NameNode log file, one can obtain the correlations between data blocks that are accessed simultaneously by jobs and the number of times each MapReduce job is executed. Using the timeliness factors and the correlation degrees among data blocks calculated by Equation 1, we build the job access correlation model for the data blocks, G = <V, E, W>.
• Definition 3 (Job access correlation model among the data blocks, G = <V, E, W>). This model shows the correlation between jobs' accesses and simultaneously accessed data blocks on the basis of the access diagram of the data blocks. Here, V is the vertex set of the diagram: v∈V represents a data block of the input file. E is the set of edges of the diagram: e∈E represents a job access correlation between two data blocks. W is the set of edge weights: freq∈W represents the frequency with which the data blocks represented by an edge's two vertices are accessed simultaneously.
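To make the model concrete, here is a minimal Python sketch (our own illustrative representation; the paper does not prescribe a data structure) that stores G = <V, E, W> as a weight map over unordered block pairs:

    # A minimal sketch of the job access correlation model G = <V, E, W>.
    # V holds data block ids; W maps an unordered block pair {Bi, Bj} to
    # freq(Bi, Bj) from Equation 1. The edge set E is implicit in W's keys.
    V = {"B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8", "B9", "B10"}
    W = {}

    def add_edge(bi, bj, freq):
        """Record that Bi and Bj are accessed simultaneously with frequency freq."""
        W[frozenset((bi, bj))] = freq

    def correlation(bi, bj):
        """Job access correlation degree of two blocks (0.0 if never co-accessed)."""
        return W.get(frozenset((bi, bj)), 0.0)

    add_edge("B1", "B2", 4.4 / 12)   # example weight; see the calculation below
    print(correlation("B2", "B1"))   # 0.3666..., argument order is irrelevant

Using frozenset keys makes the graph undirected by construction, which matches the symmetry of simultaneous access.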

Assume there are ten data blocks (B1, B2, B3, B4, B5, B6, B7, B8, B9, and B10) in a Hadoop input file and that three MapReduce jobs (J1, J2, and J3) were executed in the recent period. V represents the data block set {B1, ..., B10}; E represents the pairs of data blocks that are accessed by the jobs (J1, J2, and J3) simultaneously; and the weight on each edge adopts freq(Bi, Bj) to represent the frequency with which data blocks Bi and Bj are accessed by jobs simultaneously, where i, j ∈ {1, 2, ..., 10} and i ≠ j. J1 accesses data blocks B1, B2, B3, B6, B7, and B8, and was executed 5 times: 1 time longer than one month ago, 2 times within the previous week, and 2 times within the past three days. J2 accessed data blocks B2, B3, B4, B7, and B9, and was executed 3 times: 0 times longer than one month ago, 1 time within the previous week, and 2 times within the past three days. J3 accessed data blocks B1, B2, B5, B6, B7, and B10, and was executed 4 times: 3 times longer than one month ago, 1 time within the previous week, and 0 times within the past three days. The following table shows the access relationship between jobs and blocks.
Table 1 Example Showing Access Correlation between Jobs and Blocks
Jobs (Execute times) Access correlation block set
J1(5) {B1, B2, B3, B6 ,B7, B8}
J2(3) {B2, B3, B4, B7, B9}
J3(4) {B1, B2, B5, B6, B7, B10}
Based on Table 1, we can infer the correlation between each data block and the jobs that access it, as shown in Table 2. For example, B1 was accessed 5 times by J1 and 4 times by J3, while B2 was also accessed by J1 and J3; that is, B1 and B2 were accessed simultaneously 5 times by J1 and 4 times by J3, so B1 and B2 were simultaneously accessed a total of 9 times.
Table 2 Access Correlation for Blocks

Blocks   Jobs (Execute times)
B1       J1(5), J3(4)
B2       J1(5), J2(3), J3(4)
B3       J1(5), J2(3)
B4       J2(3)
B5       J3(4)
B6       J1(5), J3(4)
B7       J1(5), J2(3), J3(4)
B8       J1(5)
B9       J2(3)
B10      J3(4)
From Table 1 and Table 2, we can see that B1 and B2 were accessed simultaneously 9 times, namely n(B1,B2) = 9, and B1 and B3 were accessed simultaneously 5 times, namely n(B1,B3) = 5. The co-access counts of the other data block pairs, n(Bi,Bj), can be calculated in the same way. The total number of executions of the 3 jobs is N = 12. Using the relation between the timeliness factors and the frequency of job execution (shown in Table 3), we can calculate the job access frequency between data blocks with Equation 1 and thereby obtain the degree of job access correlation among the data blocks.
Table 3 Relation between Timeliness Factors and Frequency of Job Execution

                   Within 1 month   Within 1 week   Within 3 days
α (0 < α < 1)      0.2              0.6             0.9
J1(5)              1                2               2
J2(3)              0                1               2
J3(4)              3                1               0
For example, B1 and B2 were accessed simultaneously 9 times: 5 times by J1 and 4 times by J3. According to Tables 2 and 3 and Equation 1, the frequency with which B1 and B2 are accessed simultaneously by jobs is calculated as:

$$freq(B_1, B_2) = 0.2 \times \frac{1+3}{12} + 0.6 \times \frac{2+1}{12} + 0.9 \times \frac{2+0}{12} = \frac{4.4}{12} \approx 0.37$$

The same calculation applies to the other data block pairs, and the detailed results are shown in Fig. 3.
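The worked example can be reproduced with a short Python sketch (a minimal illustration using the job records of Table 1 and the period counts of Table 3; the variable names are ours, not the authors'):

    # Reproduce Equation 1 for the example of Section 2.1.
    # For each job: the set of blocks it accesses, and its execution counts per
    # period (within 1 month, within 1 week, within 3 days).
    jobs = {
        "J1": ({"B1", "B2", "B3", "B6", "B7", "B8"},  [1, 2, 2]),
        "J2": ({"B2", "B3", "B4", "B7", "B9"},        [0, 1, 2]),
        "J3": ({"B1", "B2", "B5", "B6", "B7", "B10"}, [3, 1, 0]),
    }
    alpha = [0.2, 0.6, 0.9]                               # timeliness factors
    N = sum(sum(execs) for _, execs in jobs.values())     # total executions = 12

    def freq(bi, bj):
        """freq(Bi, Bj) = sum_k alpha_k * n_k(Bi, Bj) / N  (Equation 1)."""
        weighted = 0.0
        for blocks, execs in jobs.values():
            if bi in blocks and bj in blocks:             # job co-accesses Bi, Bj
                weighted += sum(a * n for a, n in zip(alpha, execs))
        return weighted / N

    print(freq("B1", "B2"))   # (0.2*4 + 0.6*3 + 0.9*2) / 12 = 4.4/12 ≈ 0.3667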
Fig. 3 Frequency Diagram of Job Access Correlation among Data Blocks


Some observations can be drawn from these results; for example, the correlation between B2 and B7 is the largest, while the correlation between B1 and B4 is the smallest. From the data in Fig. 3, we can determine the job access correlations among the data blocks, as shown in Fig. 4.
[Fig. 4 shows a partial model: a weighted graph over blocks B1-B5 whose edges are labeled freq(B1,B2), freq(B1,B3), freq(B1,B5), freq(B2,B3), freq(B2,B4), freq(B2,B5), and freq(B3,B4).]
Fig. 4 Job Access Correlation Model among Partial Data Blocks

2.2 Clustering-Based Placement Algorithm of Correlated Data Blocks
In the previous section, the data blocks that are simultaneously processed by the same MapReduce job were identified as correlated data blocks. MapReduce executes map tasks on data blocks in parallel and prefers local data access. Therefore, in order to raise the degree of parallel access to the correlated data in the Hadoop cluster, increase the number of local map tasks, and shorten the execution time of the map phase, it is necessary to cluster the data blocks by the strength of their access correlation degrees. Data blocks with strong job access correlation should be placed on different nodes so that these correlated blocks are distributed as evenly as possible, allowing MapReduce jobs to avoid both the loss of data locality and the performance penalties caused by waiting delays.

Based on the job access correlation model constructed in the previous section, an adjacency matrix can be used as a correlation matrix to express the job access correlation degree among the data blocks. Each value in the matrix is an integer expressing the correlation degree of simultaneous accesses between two blocks, derived by simplifying the data in Fig. 3. The correlation degree matrix of the data blocks (CDM) is shown in Equation 2:

$$R(B_i, B_j) = \begin{bmatrix}
 0 & 11 &  8 & 0 & 3 & 11 & 11 & 8 & 0 & 3 \\
11 &  0 & 14 & 6 & 3 & 11 & 17 & 8 & 6 & 3 \\
 8 & 14 &  0 & 6 & 0 &  8 & 14 & 8 & 6 & 0 \\
 0 &  6 &  6 & 0 & 0 &  0 &  6 & 0 & 6 & 0 \\
 3 &  3 &  0 & 0 & 0 &  3 &  3 & 0 & 0 & 3 \\
11 & 11 &  8 & 0 & 3 &  0 & 11 & 8 & 0 & 3 \\
11 & 17 & 14 & 6 & 3 & 11 &  0 & 8 & 6 & 3 \\
 8 &  8 &  8 & 0 & 0 &  8 &  8 & 0 & 0 & 0 \\
 0 &  6 &  6 & 6 & 0 &  0 &  6 & 0 & 0 & 0 \\
 3 &  3 &  0 & 0 & 3 &  3 &  3 & 0 & 0 & 0
\end{bmatrix} \qquad (2)$$

where the rows and columns are both ordered B1 through B10.
After determining the CDM of Equation 2, we used a matrix clustering algorithm to group highly related data. Specifically, the bond energy algorithm (BEA) was used to transform the CDM into the clustered correlation degree matrix (CCDM). The BEA algorithm clusters the highly correlated data blocks together, indicating which data blocks should be evenly distributed. In this section, it is assumed that the number of data blocks in each cluster does not exceed the number of computing nodes.
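The paper does not list its BEA implementation, but the classic bond energy algorithm is a greedy column-insertion procedure; the sketch below is our own rendering of that textbook formulation, applied to the CDM of Equation 2 (tie-breaking details, and hence the exact CCDM, may differ from the authors'):

    import numpy as np

    def bond(M, x, y):
        """Bond energy between columns x and y of M (0 against the matrix border)."""
        if x < 0 or y < 0:
            return 0.0
        return float(M[:, x] @ M[:, y])

    def bea_order(M):
        """Greedily order columns so adjacent columns have high bond energy."""
        n = M.shape[1]
        order = [0, 1]                                  # seed with two columns
        for k in range(2, n):
            gains = []
            for pos in range(len(order) + 1):           # try every insertion slot
                left = order[pos - 1] if pos > 0 else -1
                right = order[pos] if pos < len(order) else -1
                gains.append(2 * bond(M, left, k) + 2 * bond(M, k, right)
                             - 2 * bond(M, left, right))
            order.insert(int(np.argmax(gains)), k)
        return order

    # The CDM R(Bi, Bj) of Equation 2, rows/columns ordered B1..B10.
    CDM = np.array([
        [ 0, 11,  8, 0, 3, 11, 11, 8, 0, 3],
        [11,  0, 14, 6, 3, 11, 17, 8, 6, 3],
        [ 8, 14,  0, 6, 0,  8, 14, 8, 6, 0],
        [ 0,  6,  6, 0, 0,  0,  6, 0, 6, 0],
        [ 3,  3,  0, 0, 0,  3,  3, 0, 0, 3],
        [11, 11,  8, 0, 3,  0, 11, 8, 0, 3],
        [11, 17, 14, 6, 3, 11,  0, 8, 6, 3],
        [ 8,  8,  8, 0, 0,  8,  8, 0, 0, 0],
        [ 0,  6,  6, 6, 0,  0,  6, 0, 0, 0],
        [ 3,  3,  0, 0, 3,  3,  3, 0, 0, 0],
    ])
    perm = bea_order(CDM)
    CCDM = CDM[np.ix_(perm, perm)]   # same permutation on rows and columns
    print("clustered block order:", ["B%d" % (i + 1) for i in perm])

Contiguous runs in the clustered order are then cut into clusters of at most m blocks, m being the number of computing nodes.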
Determining the data block clusters alone is not enough to achieve optimal data placement. For example, assume all data blocks are divided into two clusters: cluster 1 (B1, B2, B3, B6, and B7) and cluster 2 (B4, B5, B8, B9, and B10). If cluster 1 and cluster 2 are placed on the nodes randomly, as shown in Fig. 5(a), MapReduce jobs J2 and J3 can only run on 4 nodes rather than 5, which is not optimal. The reason is that the above data block clustering only considers the horizontal relationships among the data in the CDM; it is also necessary to make sure that the blocks on the same node have minimal chance of belonging to the same cluster (vertical relationships) [26]. To achieve this, we propose a data block placement algorithm with which the blocks are placed as shown in Fig. 5(b). Our algorithm is inspired by the method proposed in [17], with some modifications to make it more suitable for our situation. The details are described in Algorithm 1.
(a) Data block clusters placed on the nodes randomly:
    Node1: B1, B5; Node2: B2, B10; Node3: B3, B8; Node4: B6, B4; Node5: B7, B9
    J1 requires B1, B2, B3, B6, B7, B8 and is spread over all five nodes;
    J2 requires B2, B3, B4, B7, B9 and J3 requires B1, B2, B5, B6, B7, B10, and each is spread over only four nodes. Not optimal.
(b) Data block clusters placed with our algorithm:
    Node1: B1, B4; Node2: B2, B10; Node3: B3, B5; Node4: B6, B9; Node5: B7, B8
    J1, J2, and J3 each have their required blocks spread over all five nodes. Optimal.
Fig. 5 Data Block Placement Method Contrast
Algorithm 1 Clustering-based placement algorithm of correlated data blocks
Input: the CCDM M[n][n] built by the BEA algorithm, where n is the number of data blocks; the number of clusters c; the number of data nodes m
Output: a matrix DG[c][m] indicating the optimal data placement

CA[c] = the data block clusters Cluster1, ..., Clusterc obtained from M[n][n]; // the number of elements of each cluster is equal to m
Temp_DG[2][m] = a 2 x m matrix;
Sub_M[m][m] = an m x m matrix;
for j = 1; j < c; j++ do
    Sub_M = the sub-matrix of M whose columns are indexed by Cluster1 and whose rows are indexed by Cluster(j+1);
    t = 0;
    for each row of Sub_M do
        R = the data block index of the current row;
        find the minimum value V in row R, considering only columns not yet marked;
        MinSet = {(C1,V1), (C2,V2), ...}; // the columns holding that minimum value; there may be more than one
        if there is only one (C1,V1) in MinSet then
            Temp_DG[0][t] = R; Temp_DG[1][t] = C1;
            mark C1 so it cannot be assigned again; t++;
            continue;
        end if
        for each column Ci in MinSet do
            Sum[i] = sum(Sub_M[*][Ci]); // all the items in column Ci
        end for
        C = the column index with the largest Sum value;
        Temp_DG[0][t] = R; Temp_DG[1][t] = C;
        mark C so it cannot be assigned again; t++;
    end for
    DG[0][*] = Temp_DG[0][*];
    DG[j][*] = Temp_DG[1][*];
end for
return DG;
The use of Algorithm 1 not only maximizes the parallel distribution of the associated data blocks but also
realizes load balance on the entire node set.
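As a companion to Algorithm 1, the following simplified greedy sketch (our own stand-in, not the authors' exact procedure) captures the core idea: every block of a cluster goes to a distinct node, and among the free nodes we prefer the one whose already-placed blocks correlate least with the incoming block, which addresses the vertical relationships discussed above:

    def place_clusters(clusters, cdm, m):
        """Greedy anti-collocation placement (a simplified stand-in for Algorithm 1).
        clusters: list of block-index lists (each of size <= m, per Section 2.2);
        cdm: the n x n correlation degree matrix; m: the number of data nodes."""
        nodes = [[] for _ in range(m)]               # blocks assigned to each node
        for cluster in clusters:
            used = set()                             # nodes taken by this cluster
            for b in cluster:
                candidates = [i for i in range(m) if i not in used]
                # prefer the node whose resident blocks correlate least with b
                best = min(candidates, key=lambda i: sum(cdm[b][x] for x in nodes[i]))
                nodes[best].append(b)
                used.add(best)
        return nodes

    # Example: the two clusters of Fig. 5, with blocks B1..B10 as indices 0..9.
    # placement = place_clusters([[0, 1, 2, 5, 6], [3, 4, 7, 8, 9]], CDM, 5)

With the clusters of the Fig. 5 example, this sketch also spreads each of J1, J2, and J3 across all five nodes, matching the behavior that Fig. 5(b) illustrates.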

3 Simulation Experiment and Performance Analysis


In this section, we evaluate the performance of our DGAD and relevant algorithms from two perspectives
with some real data-intensive applications. Before discussing the experimental results, we will first describe the
experimental setup.

3.1 Experiment Setup


A Hadoop cluster can be deployed on both virtual machines [2] and physical machines. To better evaluate the correctness and efficiency of our proposed method, the experiments were conducted on a 20-DataNode Hadoop cluster on a single rack, deployed on physical machines located in a data center of our laboratory. The detailed configurations are shown in Table 4.
Table 4 Hadoop Cluster Configuration

5 Data Nodes and 1 Master Node
  Model: HP Z820 workstation
  CPU: 2 Intel Xeon E5-2620, Dual Core, 2.10 GHz
  RAM: 64 GB DDR3, 1333 MHz
  Internal HD: 2 SATA 500 GB (7200 RPM)
  Network Interface: Intel(R) 82574L Gigabit Network Connection
  Operating System: Ubuntu 12.04

15 Data Nodes
  Model: DELL OptiPlex 755
  CPU: Intel Core 2 T7300, 2.0 GHz
  RAM: 2 GB DDR2, 667 MHz
  Internal HD: 1 SATA 160 GB (7200 RPM)
  Network Interface: Intel(R) 82579LM Gigabit Network Connection
  Operating System: Ubuntu 12.04

Cluster Network
  Switch model: TP-LINK TL-SG1024T Gigabit Switch
We implemented our work on Hadoop-2.4.0 and evaluated it with MaxTo1949 jobs (computing the maximum monthly average temperature within 1949) and MaxTo1988 jobs (computing the maximum monthly average temperature within 1988) [27] on the Hadoop cluster. In the experiments, we developed a program that performs data reorganization according to the clustering-based placement algorithm for correlated data blocks described above, triggered at a preconfigured frequency or launched manually if necessary.
The data used in the experiments were meteorological data set samples downloaded from the US National Meteorological Centre. The capacity of the data sets was about 80 GB, in text format, with content for each year from 1901 to 2002. Each line in the data sets is one record that includes the ID of the meteorological station, the observation date (year, month, and day), the observation time (24-hour), the temperature, the latitude and longitude, and other weather-related information. Each record has an average length of 175 bytes.

In the following experiments, we ensured that there was only one copy of each data block in the rack. This assumption is derived from practical Hadoop configurations (e.g., Hadoop with a single replica for each piece of data) [28].
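For reference, here is a local Python sketch of the MaxTo1988 job logic (the maximum monthly average temperature within 1988). The exact record layout is not specified in the paper, so the field positions below are assumptions; in the actual MapReduce jobs, the map phase would emit (month, temperature) pairs and reducers would aggregate them:

    # A local sketch of the MaxTo1988 computation (field positions are assumed).
    from collections import defaultdict

    def max_monthly_avg(lines, year="1988"):
        """Return (month, average) for the hottest month of the given year."""
        sums, counts = defaultdict(float), defaultdict(int)
        for line in lines:
            fields = line.split()
            if len(fields) < 4:
                continue                          # skip malformed records
            date, temp = fields[1], fields[3]     # assumed: YYYYMMDD, temperature
            if date.startswith(year):
                month = date[4:6]
                sums[month] += float(temp)
                counts[month] += 1
        averages = {m: sums[m] / counts[m] for m in counts}
        return max(averages.items(), key=lambda kv: kv[1])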
3.2 Performance Improvement of MapReduce Programs

1) MapReduce jobs execution frequency
The MapReduce job execution frequency is significant to our method. In the experiment, we obtained the frequencies from the MapReduce job execution logs for the whole of September, which can be found on the NameNode. As shown in Fig. 6, we chose MaxTo1949 and MaxTo1988, two MapReduce jobs executed in September, as benchmarks. As time went on, the MaxTo1988 job became hotter and hotter while the MaxTo1949 job became colder and colder. We used these frequencies in the subsequent experiments.
[Fig. 6 plots the daily execution counts (0-20) of the MaxTo1949 and MaxTo1988 jobs from Sept 1 to Sept 30.]
Fig. 6 MapReduce Jobs Execution Frequency
2) Jobs completion time
The data locality determined by HDFS' data placement strategy is an important factor for MapReduce performance. Especially in a heterogeneous environment, the current default data placement policy of HDFS can cause severe performance degradation. We ran the MapReduce job MaxTo1988 in order to verify the effectiveness and efficiency of our proposed algorithm and the related algorithms.
During the evaluation, we compared three data placement mechanisms for Hadoop: 1) default strategy: the native data placement strategy of the current Hadoop release; 2) DRAW: the interest-locality placement mechanism presented in [17]; 3) DGAD: the job-access-correlation-aware placement mechanism proposed in this paper. We ran the two MapReduce jobs mentioned above on Hadoop clusters deploying the three data placement mechanisms separately. The default data block size was configured as 64 MB. Fig. 7 shows the final experimental results.
[Fig. 7 plots map and reduce completion percentage (0-100%) against execution time (0-500 seconds) for the Default, DRAW, and DGAD placements.]
Fig. 7 Comparison Chart of Total Jobs Completion Time


From Fig. 7, we can see that DGAD matters for MaxTo1988. The map phase running on DGAD-placed data finished 12.7% sooner than on DRAW-placed data and 41.7% sooner than on randomly placed data, and the job's overall execution time improved by 9.1% and 36.4%, respectively. The reason is that this kind of job focuses on the data blocks containing the meteorological information whose observation year is 1988, which means these jobs' data access patterns are skewed and may generate hot data blocks. In addition, although DRAW takes data grouping into account, it does not take MapReduce job execution frequency into account. As shown in Fig. 6, the MaxTo1988 job executed frequently, implying that it is likely to execute frequently in the future as well. Therefore, DRAW cannot maximize MaxTo1988's performance, but DGAD can.
3) Number of local maps
We ran the MaxTo1988 MapReduce job in order to compare the number of local map tasks. The default data block size was configured as 64 MB. The comparison is shown in Fig. 8, and the details of this evaluation are shown in Table 5. The MapReduce job running on DGAD's data placement had 79.7% local maps, which benefit from data locality, compared with 68.2% on the DRAW-placed data and 45.7% on the randomly placed data. The reason for this result is similar to that of the previous experiment.

[Fig. 8 shows the percentage of local and non-local maps under the DGAD, DRAW, and Random data placement mechanisms.]
Fig. 8 Comparison of the Number of Local Maps
Table 5 Detail of Local Map Numbers

          Total maps   Local maps   Ratio
DGAD      412          328          79.7%
DRAW      412          281          68.2%
Random    412          188          45.7%

Note that 20.3% of maps still work without data locality, even after DGAD's data placement. There are two reasons for this. First, the data grouping information used by the BEA algorithm is generated from all previous MapReduce jobs rather than any specific one, and our data placement algorithm follows a high-weight-first-placed strategy, which means that data with higher grouping weights are granted higher priority for even distribution. In other words, the distribution of the hottest data is optimized but may not be 100% perfect for the corresponding MapReduce jobs. Second, matrix clustering is an NP-hard problem; hence, the clustered grouping matrix generated by the BEA algorithm is a pseudo-optimal solution. The adoption of the BEA algorithm is a reasonable tradeoff between efficiency and accuracy.
4) Performance of MapReduce jobs
In order to quantify the effect of the dynamic feature of our method, we executed the two MapReduce jobs, MaxTo1949 and MaxTo1988, respectively, on Hadoop clusters deploying the two data placement methods (DGAD and DRAW) separately. The number of jobs was set from 10 to 60 in increments of 10. The default data block size was configured as 64 MB. Fig. 9 shows the experimental results.
From Fig. 9(a), we can observe that DGAD performs better in all cases. In addition, from Fig. 9(b), we can observe that DRAW performs better when the number of MaxTo1949 jobs is 10; however, as the number of MaxTo1949 jobs increases, DGAD performs better. The reason is that MaxTo1949 had been executed only a few times, as shown in Fig. 6, so DGAD initially had little effect on it. However, the execution frequency of MaxTo1949 constantly increased as the experiment continued, and DGAD reorganizes data in order to maximize MapReduce performance for frequently executed jobs.
[Fig. 9 plots execution time in minutes (0-400) against the number of jobs (10-60, in steps of 10) for DGAD and DRAW: (a) MaxTo1988 jobs; (b) MaxTo1949 jobs.]
Fig. 9 Performance of Two MapReduce Jobs based on Two Different Data Placement Methods

4 Conclusions

Research on data-grouping-aware data placement for Hadoop has become increasingly popular. However, we observed that many data-grouping-aware data placement schemes are static and do not take MapReduce job execution frequency into consideration. This can lead to severe performance degradation, far from the efficiency of an optimal data distribution, for frequently executed MapReduce jobs. In order to solve this problem, a new DGAD method was developed. DGAD captures runtime data grouping patterns and distributes the grouped data as evenly as possible. There are three phases in DGAD: building a job access frequency correlation model from the system logs, clustering the data grouping matrix based on the access frequency correlation degree, and reorganizing the clustered data. Our experimental results show that, for two representative MapReduce jobs, MaxTo1949 and MaxTo1988, DGAD can significantly improve the execution efficiency of MapReduce.
A number of possible future studies are apparent. First, we plan to propose an effective scientific method to obtain the exact value of the timeliness factor α rather than setting it by experience, which could further improve effectiveness. Moreover, we will conduct more comprehensive experiments to validate our method, for example, reconfiguring the data block size to 32 MB or 128 MB rather than the default 64 MB.
Acknowledgement

This work was supported by the National Natural Science Foundation Program of China (61572116, 61572117, 61502089), the National Key Technology R&D Program of the Ministry of Science and Technology (2015BAH09F02), the Provincial Scientific and Technological Project (2015302002), and the Special Fund for Fundamental Research of Central Universities of Northeastern University (N150408001, N150404009).

References
[1] M. Armbrust, A. Fox, R. Griffith, and A. D. Joseph, et al, A view of cloud computing, Commun. ACM. 53 (2010) 50-58.
[2] Zhang, Weizhe, H. Xie, and R. Hsu, Automatic Memory Control of Multiple Virtual Machines on a Consolidated Server,
IEEE Transactions on Cloud Computing. (2015) 1-14.
[3] Manyika J, Chui M, Brown B, et al. Big data: The next frontier for innovation, competition, and productivity. (2011).
[4] Zhang, Weizhe, A. Cheng, and J. Subhlok, DwarfCode: A Performance Prediction Tool for Parallel Applications, IEEE
Transactions on Computers. 65 (2015) 495-507.
[5] Dean J, Ghemawat S, MapReduce: simplified data processing on large clusters, Communications of the ACM. 51 (2008)
107-113.
[6] Kachris C, Sirakoulis G C, Soudris D, A MapReduce scratchpad memory for multi-core cloud computing applications,
Microprocessors and Microsystems. 39 (2015) 599-608.
[7] Shvachko K, Kuang H, Radia S, et al. The Hadoop distributed file system, MSST, 2010 IEEE 26th Symposium on. IEEE. (2010) 1-10.
[8] Xiong R, Luo J, Dong F, SLDP: A Novel Data Placement Strategy for Large-Scale Heterogeneous Hadoop Cluster Advanced
Cloud and Big Data (CBD), 2014 Second International Conference on. IEEE. (2014) 9-17.
[9] Schatz, M. C., Cloudburst: highly sensitive read mapping with mapreduce, Bioinformatics. 25 (2009) 1363–1369.
[10] Rodriguez-Martinez, M., Seguel, J., & Greer, M., Open Source Cloud Computing Tools: A Case Study with a Weather
Application. IEEE International Conference on Cloud Computing, Cloud 2010, Miami, Fl, Usa, 2010, pp. 443-449.
[11] Yue X, Cai H, Yan H, Cloud-assisted industrial cyber-physical systems: An insight, Microprocessors and Microsystems. 39
(2015) 1262-1270.
[12] Liu J G, Ghanem M, Curcin V, et al. Achievements and Experiences from a Grid-Based Earthquake Analysis and Modelling
Study, IEEE International Conference on E-Science & Grid Computing. IEEE Computer Society. (2006) 35-35.
[13] Dumitriu, A., X and Y (number 5), ACM SIGGRAPH 2004 Art gallery, ACM. (2004) 28-28.

[14] Tripathi S, Rao S G, Change detection in rainfall and temperature patterns over India, SensorKDD '09: Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data. (2009) 133-141.
[15] Borthakur D. HDFS architecture guide. 2008.
[16] Guo, Zhenhua, G. Fox, and M. Zhou, Investigation of Data Locality in MapReduce, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. (2012) 419-426.
[17] Wang, Jun, P. Shang, J. Yin, DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications
with Interest Locality, Cloud Computing for Data-Intensive Applications. Springer New York, 2014, pp. 2514-2520.
[18] Amer A, Long D D E, Burns R C, Group-based management of distributed file caches, Distributed Computing Systems,
2002. Proceedings. 22nd International Conference on. IEEE. (2002) 525-534.

[19] Yuan D, Yang Y, Liu X, A data placement strategy in scientific cloud workflows, Future Generation Computer Systems. 26
(2010) 1200-1214.
[20] Lin Wei-wei, An Improved Data Placement Strategy for Hadoop, Journal of South China University of Technology (Natural
Science Edition). 40 (2012) 152-158.
[21] Eltabakh M Y, Tian Y, Özcan F, CoHadoop: flexible data placement and its exploitation in Hadoop, Proceedings of the VLDB Endowment. 4 (2011) 575-585.
[22] Xie J, Yin S, Ruan X, Improving mapreduce performance through data placement in heterogeneous hadoop clusters, Parallel
& Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE. (2010) 1-9.
[23] Zhang X, Feng Y, Feng S, An effective data locality aware task scheduling method for MapReduce framework in heterogeneous environments, Cloud and Service Computing (CSC), 2011 International Conference on. IEEE. (2011) 235-242.
[24] S. Sehrish, G. Mackey, J. Wang, and J. Bent, Mrap: A novel mapreduce-based framework to support HPC analytics
applications with access patterns, in Proc. 19th ACM Int. Symp. High Perform. Distrib. Comput. HPDC’10, New York, NY,
USA, 2010, pp. 107–118.
[25] Jin H, Yang X, Sun X H, Adapt: Availability-aware mapreduce data placement for non-dedicated distributed computing,
Distributed Computing Systems (ICDCS), 2012 IEEE 32nd International Conference on. IEEE. (2012) 516-525.
[26] Oprime P C, Martins Tristão H, Lopes Pimenta M. Relationships, cooperation and development in a Brazilian industrial cluster, International Journal of Productivity and Performance Management, 60 (2011) 115-131.
[27] Cook E R, Seager R, Cane M A, et al. North American drought: reconstructions, causes, and consequences, Earth-Science
Reviews, 81 (2007) 93-134.
[28] B. Zhang, N. Zhang, H. Li, F. Liu, and K. Miao, An efficient cloud computing-based architecture for freight system
application in china railway, in Proc. 1st Int. Conf. Cloud Comput. CloudCom’09, Berlin, Germany, 2009, pp. 359–368.
Jiaxuan Wu was born in Shenyang city, Liaoning province, in 1990. He received the M.S. degree in computer application technology from Northeastern University in 2014. He is currently a Ph.D. student in computer application technology at Northeastern University. His research interests are evolutionary computation and service computing.

Bin Zhang was born in 1964. He is a professor in the College of Information Science and Technology at Northeastern University, Shenyang, China. He is a senior member of the China Computer Federation (CCF). He received his Ph.D. degree from Northeastern University in 1997. His current research interests include service-oriented computing and information retrieval.

Changsheng Zhang was born in Changchun city, Jilin province, in 1980. He received the B.S., M.S., and Ph.D. degrees in computer science and technology from Jilin University in 2009. Since 2009, he has been an Assistant Professor with the Department of Information Science and Engineering, Northeastern University. He is the author of one book and more than 50 articles. His research interests include evolutionary computation, service computing, and distributed constraint programming. He is an Associate Editor of the journal Frontiers of Computer Science.

Peng Wang was born in Yantai city, Shandong province, in 1987. He received the M.S. degree in computer application technology from Northeastern University in 2014. He is currently a Ph.D. student in computer application technology at Northeastern University. His research interests are evolutionary computation and service computing.