Join Algorithms Using MapReduce: A Survey
Vikas Jadhav, Jagannath Aghav, Sunil Dorwani
Abstract— The MapReduce framework is widely used in large-scale data analysis. It is highly scalable and offers a convenient programming model. However, the performance of MapReduce is a concern when it is applied to complex data-analytical tasks involving multiple joins of datasets for certain aggregates. The join operation is frequent in data processing, yet the original MapReduce programming model does not directly support processing multiple related heterogeneous datasets, so MapReduce has been modified to support heterogeneous datasets. Big Data analytics uncovers hidden patterns and other useful information, and a popular data processing engine for big data is Hadoop MapReduce. Hadoop MapReduce is used heavily for data-analytical tasks such as data mining and SQL queries. Hive is a data warehouse that resides on top of Hadoop and provides a SQL-like interface to HDFS files. In this paper, we survey various methods of joining datasets using Hadoop MapReduce.
International Conference on Electrical Engineering and Computer Science, 21st April-2013, Coimbatore, ISBN: 978-93-83060-02-3

I. INTRODUCTION

With the exponential growth in data size, distributed processing of data has become important. Parallel Database Systems [1] are based on shared-nothing nodes (separate CPU, memory, disk) connected through a high-speed interconnect. Every Parallel Database System uses horizontal data partitioning along with partitioned execution of SQL queries. Horizontal partitioning involves distributing the rows of a relational table across the nodes of a cluster so that rows can be processed in parallel. MapReduce [2] is a revolutionary platform for large-scale data processing in a distributed manner. MapReduce is composed of a master and multiple workers for conducting MapReduce jobs. It can process huge amounts of data in reasonable time using a large number of commodity machines, so valuable information hidden in big data can be extracted at much lower cost. MapReduce programs are automatically parallelized and executed across the cluster. Users do not need to handle parallelism and failures; the MapReduce framework automatically parallelizes the user program and reschedules failed tasks. Yahoo! has deployed the largest Hadoop cluster, with 4000 nodes [3]. According to [1], MapReduce and Parallel DBMSs are complementary to each other. Hadoop MapReduce [4] is an open-source implementation of Google's MapReduce [2]. Hadoop was developed by Yahoo! and is used at Facebook, Amazon, Last.fm and many other companies and institutes [5].

Hadoop MapReduce is a big data processing framework that rapidly became the de facto standard in both academia and industry [6]–[9]. To enable experts in the database field, Facebook provided access to Hadoop data through a SQL-like interface called Hive [10]. Hive converts a query into Hadoop MapReduce jobs and executes them automatically on the Hadoop cluster. Most data-analytical queries involve joining of multiple datasets [11]–[14].

The rest of the paper is organized as follows: section II describes Big Data, section III covers Hadoop, section IV the Bloom Filter, section V discusses various methods for joins using Hadoop MapReduce, and finally section VII covers the conclusion.

II. BIG DATA

Big Data [15] refers to the practice of collecting and processing very large datasets and the associated systems and algorithms used to analyse these massive datasets. Big Data, which is getting global attention, can be described with the help of three V's: Volume, Velocity and Variety [16]. The main issues related to big data are capturing, storing, searching, sharing, analysing and visualization.
Big data has the following properties:
1) Volume: Big Data has a very large volume of data. It may grow from hundreds of gigabytes up to petabytes, and may have a large number of records.
2) Velocity: The rate at which data arrives is very high, and this requires tools that respond quickly.
3) Variety: Big data may have a range of data types and data sources.
The architecture of Big Data generally spans multiple machines and clusters. Hadoop and HDFS have become the dominant platform for Big Data Analytics [17].

III. HADOOP

The Apache Hadoop [4] is an open-source project which develops a software framework for distributed
processing of data in a fault-tolerant manner. The Hadoop framework is designed to scale from a single machine to thousands of machines.

A. MapReduce
MapReduce [2] is a framework for processing and generating large data sets. The MapReduce framework is effective for data analysis when files are very large and are rarely updated. Programmers of MapReduce need to specify two functions, map() and reduce(). Both the map() and reduce() functions take pairs of key and value as input. The map() function takes a (key,value) pair as input and produces a set of intermediate (key,value) pairs. The reduce() function then takes these intermediate (key,value) pairs as input and merges all the values with the same intermediate key.
Programs written using the MapReduce programming model are automatically parallelized and executed across a cluster of commodity hardware. The MapReduce framework takes care of partitioning the input data, scheduling the user program on the cluster and handling failures. The programmer need not worry about these issues.
MapReduce Programming Model: A MapReduce program takes a set of (key,value) pairs as input and produces a set of (key,value) pairs as output. The user expresses the computation as two functions, map() and reduce(). The signatures of map and reduce are given below.
map(k1,v1) → list(k2,v2)
reduce(k2,list(v2)) → list(k3,v3)

B. Hadoop Distributed File System (HDFS)
Hadoop uses a file system called the Hadoop Distributed File System (HDFS) [18]. HDFS stores file system metadata and application data separately. HDFS has a separate server to store file system metadata, called the NameNode, while application data is stored on servers called DataNodes. Unlike other file systems, HDFS does not use RAID [19] for data protection; it uses a replication technique instead. The contents of a file are replicated on multiple DataNodes for reliability. This strategy has the added advantages that data transfer bandwidth is multiplied and there is more opportunity for locating computation near the data.

IV. BLOOM FILTER

All bits of the Bloom Filter array are initially set to zero. To insert an element into the Bloom Filter, the element is hashed k times with k hash functions and the positions in the Bloom Filter array corresponding to the hash values are set to 1. To test whether an element is present, check that the bits at all k hash positions are 1; if yes, the element is present in the set, otherwise it is not. A Bloom Filter may yield false positives, but false negatives are never generated. The advantage of the Bloom Filter is space efficiency: Bloom filters allow membership queries without the need for the original collection. We can conclude that an element is not present in the original collection if at least one of the positions computed by the hash functions of the bloom filter points to a bit which is set to 0. The size of a bloom filter is fixed, but there is a trade-off between the Bloom Filter vector size m and the false positive probability p. The probability of a false positive after inserting n elements is given by the following equation (1).

p = (1 − (1 − 1/m)^kn)^k ≈ (1 − e^(−kn/m))^k    (1)

V. JOIN USING MAPREDUCE

In this section we discuss the various methods available for joining datasets using MapReduce.

A. Reduce-Side Join
Reduce-Side Join [22]–[24] is a simple join technique where we map over both datasets and emit the join key as the intermediate key and the tuple as the intermediate value. To know which dataset a record came from, we tag the intermediate (key,value) pairs with the dataset name to produce tagged (key,value) pairs. The outputs of the mappers are partitioned, sorted and merged by the framework. All the records with the same join key are grouped together and fed to a reducer. The reducer separates and buffers the input records into two sets by using the table tag and then performs the cross product of these records.
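The tagged repartition scheme of the reduce-side join can be simulated outside Hadoop. In the minimal Python sketch below, the datasets R and S, their tuple layouts, and all function names are illustrative assumptions, not details from the paper:

```python
from collections import defaultdict
from itertools import product

def map_phase(records, tag, key_index):
    """Emit (join_key, (tag, record)) pairs, tagging each record
    with the dataset it came from."""
    for rec in records:
        yield rec[key_index], (tag, rec)

def shuffle(pairs):
    """Group intermediate pairs by join key, as the framework's
    partition/sort/merge step would."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups

def reduce_phase(groups):
    """Separate each group's records by tag, then emit the cross product."""
    for key, tagged in groups.items():
        r_side = [rec for tag, rec in tagged if tag == 'R']
        s_side = [rec for tag, rec in tagged if tag == 'S']
        for r, s in product(r_side, s_side):
            yield key, r, s

R = [(1, 'alice'), (2, 'bob')]                   # (id, name)
S = [(1, 'pune'), (1, 'mumbai'), (3, 'delhi')]   # (id, city)

pairs = list(map_phase(R, 'R', 0)) + list(map_phase(S, 'S', 0))
joined = sorted(reduce_phase(shuffle(pairs)))
# Only key 1 appears in both datasets, so the join yields two rows.
```

Note that, as in the real technique, every record of both datasets passes through the shuffle step, which is exactly the inefficiency discussed next.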
The basic idea behind the Reduce-Side join is to repartition both datasets by the join key. The Reduce-Side join approach is not efficient, since it involves shuffling both datasets across the network.

B. Map-Side Join
The Reduce-Side join method is a natural way of joining datasets; it uses the built-in capacity of the Hadoop framework to sort intermediate (key,value) pairs before they reach the reducer. Hadoop offers another way of joining datasets, the Map-Side join [23]–[26]. If both datasets are partitioned and sorted on the join key attribute, then joining the datasets can be accomplished in parallel using the map phase of a MapReduce job. We invoke map over one of the (larger) datasets; in that mapper we read the corresponding part of the other (smaller) dataset and perform a merge join. Reducers are not required unless the programmer wants to repartition the dataset or perform further processing. Fig. 3 shows the data flow for the Map-Side Join technique.
Map-side joins are more efficient than reduce-side joins since we avoid the sorting and shuffling phases, but the map-side join has the very stringent condition that both datasets must be sorted and partitioned on the join key.

C. Memory-Backed Join
Memory-Backed Joins [26] are a family of join techniques. The simplest version of a memory-backed join is the broadcast join technique, where one of the two datasets fits completely into the memory of each node in the Hadoop cluster. In this case we can load the smaller dataset into memory in every mapper so that it can be accessed by join key. The mappers are then applied to the larger dataset, and for each (key,value) pair the mapper probes the in-memory dataset to check whether there is a tuple with the same join key. If there is, the join is performed. This method is known as a hash join [27] in the database community. Broadcast join is a map-only join technique.

Fig. 3 Data Flow for Map-Side Join.

Pre-processing step for broadcast join: Even though there is no way to control the physical placement of replicas in DFS, we can increase the replication factor for the smaller dataset so that most nodes in the cluster will have a local copy of it and do not need to retrieve it from another node.
If neither dataset fits into memory, then we divide the smaller dataset into a number of partitions. We choose the partition size such that a partition fits into the memory of a node, and then run a memory-backed hash join. We need to go through the other dataset n times, where n is the number of partitions.
Use of a distributed key-value store can be useful if neither dataset fits into memory. A distributed key-value store can be used to hold one dataset in the memory of multiple machines. Mappers can iterate over the other dataset and query the distributed key-value store in parallel, performing the join if the join key matches. The open source system memcached can be used for this purpose, thus this technique is called the memcached join technique.

Fig. 4 Data Flow for Memory-Backed Join.

D. MapReduce Join using Bloom Filter
The method of [28] discusses join processing using a Bloom Filter and Hadoop MapReduce. According to [28], MapReduce is not very efficient at performing the join operation because it always processes all records, even if only a small fraction of the datasets is relevant to the join. We can improve performance by using a Bloom Filter in the join operation. The Bloom filter can be constructed in a distributed manner to filter out redundant records. Section IV discusses the Bloom Filter in detail. For an R ⋈ S operation, let R be the dataset on which the bloom filter is constructed and S the other dataset. The method of [28] can process only two-way joins and involves the following steps.
1) Job submission: The job is submitted with m1 map tasks for R and m2 map tasks for S, and r reduce tasks are created.
2) First map phase: The jobtracker assigns the m1 map tasks or reduce tasks to idle tasktrackers. A map tasktracker reads the input split for its task and converts it into (key,value) pairs.
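The filtering idea behind this method can be simulated compactly in Python. The Bloom filter class below, the datasets R and S, and the parameter choices (m = 256 bits, k = 3 hash functions derived by salting SHA-256) are illustrative assumptions rather than details from [28]:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions,
    as described in Section IV."""
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, [0] * m

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash;
        # this particular scheme is an illustrative choice.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False positives are possible; false negatives never occur.
        return all(self.bits[p] for p in self._positions(item))

# Build the filter over R's join keys, then filter S in its "map phase".
R = [(1, 'alice'), (2, 'bob')]
S = [(1, 'pune'), (3, 'delhi'), (4, 'goa')]

bf = BloomFilter()
for key, _ in R:
    bf.add(key)

# Only S records whose key *might* appear in R survive to the shuffle,
# so only a subset of S crosses the network to the reducers.
surviving = [rec for rec in S if bf.might_contain(rec[0])]
```

Records that survive the filter would then be joined by any of the earlier techniques, e.g. a reduce-side join over the much smaller filtered dataset.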
Map-Reduce-Merge [29] is an extension to standard MapReduce. By adding a Merge phase to perform relational operations on datasets, it can merge datasets naturally. MapReduce itself has a performance problem in that the pull operation of reduce from map may cause a large number of disk seeks and slow down the effective disk transfer rate. Map-Reduce-Merge does not help to solve this problem; instead it may aggravate this phenomenon, because the amount of data transferred becomes larger [30].

The map-side join technique is more efficient than the reduce-side join technique if the input is sorted and partitioned, because there is no need to shuffle the datasets over the network.
If only a small fraction of records is relevant to the join operation, then applying a Bloom Filter to filter out records which will not satisfy the join condition helps improve the performance of the join operation. The Bloom filter filters out records in the map phase, so only a subset of the records is transferred to the reducers over the network.
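To make the first claim concrete, the sorted-merge step that underlies the map-side join of Section V-B can be sketched as follows; the datasets and tuple layouts are illustrative assumptions:

```python
def merge_join(R, S, key=lambda rec: rec[0]):
    """Sorted-merge join: both inputs must already be sorted on the
    join key, which is exactly the map-side join's precondition."""
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        kr, ks = key(R[i]), key(S[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Emit all S records sharing this key (handles duplicates).
            j2 = j
            while j2 < len(S) and key(S[j2]) == kr:
                out.append((R[i], S[j2]))
                j2 += 1
            i += 1
    return out

R = [(1, 'alice'), (2, 'bob'), (4, 'carol')]
S = [(1, 'pune'), (2, 'delhi'), (2, 'goa'), (5, 'agra')]
joined = merge_join(R, S)
# Keys 1 and 2 match; key 2 in R pairs with two S records.
```

Because each mapper only streams its own sorted partition of R against the corresponding partition of S, no sort or shuffle phase is needed, which is the source of the efficiency advantage noted above.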