
JOIN ALGORITHMS USING MAPREDUCE: A SURVEY

VIKAS JADHAV¹, JAGANNATH AGHAV¹, SUNIL DORWANI²

¹Department of Computer Engineering and Information Technology, College of Engineering Pune, India.
²SAS Research and Development (India) Pvt. Ltd.

Abstract- The MapReduce framework is widely used in large-scale data analysis. It is highly scalable and offers a convenient programming model. However, the performance of MapReduce is a concern when it is applied to complex data-analytical tasks that involve multiple joins of datasets for certain aggregates. The join operation is frequent in data processing. The original MapReduce programming model does not directly support processing multiple related heterogeneous datasets, so MapReduce has been modified to support heterogeneous datasets. Big Data analytics uncovers hidden patterns and other useful information, and a popular data processing engine for big data is Hadoop MapReduce. Hadoop MapReduce is used heavily for data-analytical tasks such as data mining and SQL queries. Hive is a data warehouse that resides on top of Hadoop and provides an SQL-like interface to HDFS files. In this paper, we survey various methods of joining datasets using Hadoop MapReduce.

Keywords- Analytics, Big Data, Hive, Join, MapReduce, SQL.

I. INTRODUCTION

With the exponential growth in data size, distributed processing of data has become important. Parallel database systems [1] are based on shared-nothing nodes (each with its own CPU, memory, and disk) connected through a high-speed interconnect. Every parallel database system uses horizontal data partitioning along with partitioned execution of SQL queries. Horizontal partitioning distributes the rows of a relational table across the nodes of a cluster so that the rows can be processed in parallel. MapReduce [2] is a revolutionary platform for large-scale data processing in a distributed manner. MapReduce is composed of a master and multiple workers for conducting MapReduce jobs. It can process huge amounts of data in reasonable time using a large number of commodity machines, so the valuable information hidden in big data can be extracted at much lower cost. MapReduce programs are automatically parallelized and executed across the cluster. Users do not need to handle parallelism and failures: the MapReduce framework automatically parallelizes the user program and reschedules failed tasks. Yahoo! has deployed the largest Hadoop cluster, with 4000 nodes [3]. According to [1], MapReduce and parallel DBMSs are complementary to each other. Hadoop MapReduce [4] is an open-source implementation of Google's MapReduce [2]. Hadoop was developed by Yahoo! and is used at Facebook, Amazon, Last.fm, and many other companies and institutes [5].

Hadoop MapReduce is a big-data processing framework that rapidly became the de facto standard in both academia and industry [6]–[9]. To enable experts in the database field, Facebook provided an SQL-like interface to Hadoop data called Hive [10]. Hive converts a query into Hadoop MapReduce jobs and executes them automatically on the Hadoop cluster. Most data-analytical queries involve joining multiple datasets [11]–[14].

The rest of the paper is organized as follows: Section II describes Big Data, Section III covers Hadoop, Section IV the Bloom filter, Section V discusses various methods for joins using Hadoop MapReduce, Section VI surveys other methods, and finally Section VII concludes.

II. BIG DATA

Big Data [15] refers to the practice of collecting and processing very large datasets, together with the associated systems and algorithms used to analyse these massive datasets. Big Data, which is getting global attention, can be described with the help of the three V's: Volume, Velocity, and Variety [16]. The main issues related to big data are capturing, storing, searching, sharing, analysing, and visualizing it.
Big data has the following properties:
1) Volume: Big Data has a very large volume of data. It may grow to hundreds of gigabytes or petabytes and may have a very large number of records.
2) Velocity: The rate at which data arrives is very high, and this requires tools that respond quickly.
3) Variety: Big data may span a wide range of data types and data sources.
The architecture of a Big Data system generally spans multiple machines and clusters. Hadoop and HDFS have become the dominant platform for Big Data analytics [17].

III. HADOOP


The Apache Hadoop [4] is an open-source project which develops a software framework for the distributed processing of data in a fault-tolerant manner. The Hadoop framework is designed to scale from a single machine to thousands of machines.

A. MapReduce
MapReduce [2] is a framework for processing and generating large data sets. The MapReduce framework is effective for data analysis when the files are very large and rarely updated. Programmers using MapReduce need to specify two functions, map() and reduce(). Both the map() and reduce() functions take pairs of keys and values as input. The map() function takes a (key,value) pair as input and produces a set of intermediate (key,value) pairs. The reduce() function then takes these intermediate (key,value) pairs as input and merges all the values with the same intermediate key.
Programs written using MapReduce are automatically parallelized and executed across the cluster on commodity hardware. The MapReduce framework takes care of partitioning the input data, scheduling the user program on the cluster, and handling failures; the programmer need not worry about these issues.
MapReduce programming model: a MapReduce program takes a set of (key,value) pairs as input and produces a set of (key,value) pairs as output. The user expresses the computation as the two functions map() and reduce(), whose signatures are given below.

map(k1,v1) → list(k2,v2)
reduce(k2,list(v2)) → list(k3,v3)
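To make the programming model concrete, the following minimal sketch shows the classic word-count program written against the standard Hadoop Java API; the class names and the whitespace tokenization are our own illustrative choices, not something prescribed by [2] or [4].

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map(k1,v1) -> list(k2,v2): emit (word, 1) for every word in the line.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);       // intermediate (key,value) pair
            }
        }
    }

    // reduce(k2,list(v2)) -> list(k3,v3): merge all values sharing a key.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));  // final (key,value) pair
        }
    }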
B. Hadoop Distributed File System (HDFS)
Hadoop uses a file system called the Hadoop Distributed File System (HDFS) [18]. HDFS stores file system metadata and application data separately. HDFS has a separate server to store file system metadata, called the NameNode, while application data is stored on servers called DataNodes. Unlike other file systems, HDFS does not use RAID [19] for data protection; instead, it uses replication. The contents of a file are replicated on multiple DataNodes for reliability. This strategy has the added advantages that the data transfer bandwidth is multiplied and there are more opportunities for locating computation near the data.

IV. BLOOM FILTER

The Bloom filter [20], [21] is a probabilistic data structure used to check whether an element is a member of a set. A Bloom filter consists of an array of m bits and k independent hash functions. All the bits in the array are initially set to zero. To insert an element into the Bloom filter, the element is hashed k times with the k hash functions, and the positions in the bit array corresponding to the hash values are set to 1. To test whether an element is present, check that the bits at all k hash positions are 1: if yes, the element is in the set; otherwise it is not in the set. A Bloom filter may yield false positives, but false negatives are never generated. The advantage of the Bloom filter is space efficiency: Bloom filters allow membership queries without the need for the original collection. We can conclude that an element is not present in the original collection if at least one of the positions computed by the hash functions of the Bloom filter points to a bit which is set to 0. The size of a Bloom filter is fixed, but there is a trade-off between the Bloom filter vector size m and the false positive probability p. The probability of a false positive after inserting n elements is given by equation (1):

p = (1 − (1 − 1/m)^(kn))^k ≈ (1 − e^(−kn/m))^k    (1)
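The insert and membership-test operations above can be sketched in a few lines of Java. The double-hashing scheme used below to derive the k hash positions from two base hashes is a common simplification and our own choice; it is not the construction prescribed by [20], [21].

    import java.util.BitSet;

    // Minimal Bloom filter sketch: an m-bit array probed by k hash functions.
    public class BloomFilter {
        private final BitSet bits;
        private final int m;    // bit-vector size
        private final int k;    // number of hash functions

        public BloomFilter(int m, int k) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
        }

        // Derive the i-th hash position via double hashing on hashCode().
        private int position(Object element, int i) {
            int h1 = element.hashCode();
            int h2 = (h1 >>> 16) | 1;                 // force the step to be odd
            return Math.floorMod(h1 + i * h2, m);
        }

        // Insert: set the k bit positions given by the k hash values.
        public void add(Object element) {
            for (int i = 0; i < k; i++) bits.set(position(element, i));
        }

        // Test: the element may be in the set only if all k positions are 1.
        // A 0 at any position proves absence, so there are no false negatives.
        public boolean mightContain(Object element) {
            for (int i = 0; i < k; i++)
                if (!bits.get(position(element, i))) return false;
            return true;                              // possibly a false positive
        }
    }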


V. JOIN USING MAPREDUCE

In this section we discuss the various methods available for joining datasets using MapReduce.

A. Reduce-Side Join
The reduce-side join [22]–[24] is a simple join technique in which we map over both datasets and emit the join key as the intermediate key and the tuple as the intermediate value. To know which dataset a record came from, we tag the intermediate (key,value) pairs with the dataset name, producing tagged (key,value) pairs. The outputs of the mappers are partitioned, sorted, and merged by the framework. All records with the same join key are grouped together and fed to a reducer. The reducer separates and buffers the input records into two sets using the table tag and then performs a cross product of these records.

The basic idea behind the reduce-side join is thus to repartition both datasets by the join key. The approach is not efficient, since it involves shuffling both datasets across the network.
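A minimal sketch of the technique against the standard Hadoop Java API follows. The class names and the assumption of tab-separated records whose first field is the join key are ours; an analogous mapper tagging records with "S" would be registered for the second dataset.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper for dataset R: emit the join key and the tuple tagged with "R".
    public class RTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);
            if (fields.length < 2) return;            // skip malformed records
            ctx.write(new Text(fields[0]), new Text("R\t" + fields[1]));
        }
    }

    // The framework groups the tagged tuples by join key; the reducer
    // separates them into two buffers by tag and emits their cross product.
    class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text joinKey, Iterable<Text> tagged, Context ctx)
                throws IOException, InterruptedException {
            List<String> rTuples = new ArrayList<>();
            List<String> sTuples = new ArrayList<>();
            for (Text value : tagged) {
                String[] parts = value.toString().split("\t", 2);
                (parts[0].equals("R") ? rTuples : sTuples).add(parts[1]);
            }
            for (String r : rTuples)                  // cross product of the two
                for (String s : sTuples)              // buffered record sets
                    ctx.write(joinKey, new Text(r + "\t" + s));
        }
    }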

B. Map-Side Join
The reduce-side join is the natural way of joining datasets; it uses the built-in capability of the Hadoop framework to sort intermediate (key,value) pairs before they reach the reducer. Hadoop offers another way of joining datasets, the map-side join [23]–[26]. If both datasets are partitioned and sorted on the join-key attribute, then the join can be accomplished in parallel in the map phase of a MapReduce job. We invoke map over one of the (larger) datasets; in that mapper we read the corresponding partition of the other (smaller) dataset and perform a merge join. Reducers are not required unless the programmer wants to repartition the dataset or perform further processing. Fig. 3 shows the data flow for the map-side join technique.

[Fig. 3: Data flow for the map-side join.]

Map-side joins are more efficient than reduce-side joins, since the sorting and shuffling phases are avoided, but the map-side join has the very stringent precondition that both datasets be sorted and partitioned on the join key.
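The merge step at the heart of the map-side join can be sketched as follows. This toy version, entirely our own, represents each input as an in-memory list of {key, payload} pairs sorted on the key and assumes keys are unique within each input; real inputs with duplicate keys need an extra buffering step.

    import java.util.ArrayList;
    import java.util.List;

    // Merge join over two partitions sorted on the join key.
    class MergeJoin {
        static List<String> join(List<String[]> r, List<String[]> s) {
            List<String> out = new ArrayList<>();
            int i = 0, j = 0;
            while (i < r.size() && j < s.size()) {
                int cmp = r.get(i)[0].compareTo(s.get(j)[0]);  // compare join keys
                if (cmp == 0) {                                // match: emit joined tuple
                    out.add(r.get(i)[0] + "\t" + r.get(i)[1] + "\t" + s.get(j)[1]);
                    i++; j++;
                } else if (cmp < 0) {
                    i++;                                       // advance the smaller side
                } else {
                    j++;
                }
            }
            return out;
        }
    }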
C. Memory-Backed Join
Memory-backed joins [26] are a family of join techniques. The simplest version is the broadcast join, where one of the two datasets fits completely into the memory of each node in the Hadoop cluster. In this case we can load the smaller dataset into memory in every mapper so that it can be accessed by join key. The mappers are then applied to the larger dataset, and for each (key,value) pair the mapper probes the in-memory dataset to check whether there is a tuple with the same join key; if there is, the join is performed. This method is known as a hash join [27] in the database community. The broadcast join is a map-only join technique.

Pre-processing step for the broadcast join: even though there is no way to control the physical placement of replicas in the distributed file system, we can increase the replication factor for the smaller dataset so that most nodes in the cluster have a local copy of it and do not need to retrieve it from another node.

If neither dataset fits into memory, we divide the smaller dataset into a number of partitions, choosing the partition size so that a partition fits into the memory of a node, and then run the memory-backed hash join once per partition. We need to go through the other dataset n times, where n is the number of partitions.

A distributed key-value store can also be used if neither dataset fits into memory: one dataset is stored in the memory of multiple machines, and the mappers iterate over the other dataset, query the distributed key-value store in parallel, and perform the join when the join key matches. The open-source system memcached can be used for this purpose; hence this technique is called the memcached join.

[Fig. 4: Data flow for the memory-backed join.]
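A sketch of the broadcast join mapper is given below. In practice the small dataset would be shipped to every node with Hadoop's distributed cache; to keep the sketch self-contained we simply read a local file whose path is passed through the job configuration under a property name of our own invention.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only broadcast (hash) join: each mapper builds an in-memory hash
    // table over the small dataset and probes it for every tuple of the
    // large dataset. No reduce phase is required.
    public class BroadcastJoinMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> small = new HashMap<>();

        @Override
        protected void setup(Context ctx) throws IOException {
            String path = ctx.getConfiguration().get("join.small.table");
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t", 2);
                    if (f.length == 2) small.put(f[0], f[1]);  // join key -> tuple
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t", 2);
            if (f.length < 2) return;
            String match = small.get(f[0]);           // probe by join key
            if (match != null)                        // tuple with same key found
                ctx.write(new Text(f[0]), new Text(f[1] + "\t" + match));
        }
    }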
D. MapReduce Join Using a Bloom Filter
The method of [28] discusses join processing using a Bloom filter and Hadoop MapReduce. According to [28], MapReduce is not very efficient at performing the join operation because it always processes all records, even if only a small fraction of the datasets is relevant to the join. We can improve performance by using a Bloom filter in the join operation; the Bloom filter can be constructed in a distributed manner to filter out redundant records. Section IV discusses the Bloom filter in detail. For an R ⋈ S operation, let R be the dataset on which the Bloom filter is constructed and S the other dataset. The method of [28] can process only two-way joins and involves the following steps (a sketch of the filtering step follows the list).
1) Job submission: the job is submitted with m1 map tasks for R and m2 map tasks for S, and r reduce tasks are created.
2) First map phase: the JobTracker assigns the m1 map tasks or reduce tasks to idle TaskTrackers. A map TaskTracker reads the input split for its task and converts it into (key,value) pairs.


3) Local filter construction: the intermediate (key,value) pairs produced by the maps are sent to the r TaskTrackers, and r Bloom filters are constructed on the keys in each partition. These are local filters because each is built only on the intermediate results of a single TaskTracker.
4) Global filter merging: when all m1 TaskTrackers complete their map tasks, the TaskTrackers send their Bloom filters to the JobTracker. The JobTracker constructs a global filter from all the local filters for dataset R and then sends the global filter to all TaskTrackers. Until now, the map tasks for S have not been assigned.
5) Second map phase: the JobTracker assigns the m2 map tasks for dataset S to TaskTrackers. Each TaskTracker runs its assigned task using the received global filter. Records whose keys are not in the global filter are filtered out.
6) Reduce phase: this step is the same as the reduce phase in Hadoop [4]. A reduce TaskTracker reads the corresponding intermediate pairs from all map TaskTrackers, sorts all the intermediate pairs received, and runs the reduce function. The final results are written to the output path in HDFS [18].
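The heart of the method is the second map phase, sketched below under our own assumptions about how the global filter reaches the mappers: its parameters and set bit positions are passed through the job configuration (the property names are hypothetical; [28] does not prescribe this transport). Tuples of S whose join key cannot occur in R are dropped before the shuffle, so only a subset of S reaches the reducers.

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Second map phase of the Bloom-filter join: probe the received global
    // filter and emit only those tuples of S that may have a partner in R.
    public class FilteringMapper extends Mapper<LongWritable, Text, Text, Text> {
        private BitSet filter;
        private int m, k;

        @Override
        protected void setup(Context ctx) {
            Configuration conf = ctx.getConfiguration();
            m = conf.getInt("join.bloom.m", 1 << 20);
            k = conf.getInt("join.bloom.k", 5);
            filter = new BitSet(m);
            for (String pos : conf.get("join.bloom.setbits").split(","))
                filter.set(Integer.parseInt(pos));    // rebuild the global filter
        }

        // Same double-hashing scheme as the Bloom filter sketch in Section IV.
        private boolean mightContain(String key) {
            int h1 = key.hashCode();
            int h2 = (h1 >>> 16) | 1;
            for (int i = 0; i < k; i++)
                if (!filter.get(Math.floorMod(h1 + i * h2, m))) return false;
            return true;
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t", 2);
            if (f.length == 2 && mightContain(f[0]))  // key may occur in R
                ctx.write(new Text(f[0]), new Text("S\t" + f[1]));
        }
    }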
E. Map-Reduce-Merge
The Map-Reduce-Merge [29] model enables processing of multiple heterogeneous datasets. Map-Reduce-Merge [29] has the following signatures, where the symbols α, β, γ represent dataset lineages and the symbols k and v represent keys and values respectively.

map:    (k1, v1)α → [(k2, v2)]α
reduce: (k2, [v2])α → (k2, [v3])α
merge:  ((k2, [v3])α, (k3, [v4])β) → [(k4, v5)]γ

The map function transforms an input key and value pair (k1,v1) into a list of intermediate key and value pairs [(k2,v2)]. The reduce function aggregates the list of values [v2] associated with k2 and produces a list of values [v3], which is also associated with k2. The input and output belong to the same dataset lineage α. Similarly, another pair of inputs belonging to dataset lineage β produces a list of values [v4] associated with k3. The merge function combines the outputs of two reduce functions belonging to the two different dataset lineages α and β and produces the final output with dataset lineage γ. If α = β, then the merge function performs a self-merge, which is similar to a self-join.
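As a purely illustrative sketch of the merge step (Map-Reduce-Merge is a research model, not a shipped Hadoop API, so the types and names below are our own), an equi-join merger receives the reduce outputs of the two lineages for a pair of keys and emits the cross product of the value lists when the keys match:

    import java.util.ArrayList;
    import java.util.List;

    // merge: ((k2,[v3])α, (k3,[v4])β) -> [(k4,v5)]γ, specialized to an equi-join.
    class EquiJoinMerger {
        static List<String[]> merge(String k2, List<String> alphaValues,
                                    String k3, List<String> betaValues) {
            List<String[]> out = new ArrayList<>();
            if (!k2.equals(k3)) return out;           // only matching keys join
            for (String a : alphaValues)
                for (String b : betaValues)
                    out.add(new String[] { k2, a + "\t" + b });  // (k4, v5)
            return out;
        }
    }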

Map-Reduce-Merge [29] is an extension of standard MapReduce: by adding a merge phase to perform relational operations on datasets, it can merge datasets naturally. MapReduce itself has the performance problem that the pull of data from map to reduce may cause a large number of disk seeks and slow down the effective disk transfer rate. Map-Reduce-Merge does not help to solve this problem; instead, it may aggravate it, because the amount of data transferred becomes larger [30].

VI. OTHER METHODS

There has been a lot of research on optimizing MapReduce for analytical join queries. The method of [13] discusses the optimization of analytic queries that involve a very large fact table joined with smaller dimension tables, as well as queries that involve paths through graphs with high out-degree. According to [31], data-analytical queries generally involve multi-way join operations, and the operators involved in join queries go beyond the equi-join operator. That method calculates the cost associated with a MapReduce job, provides efficient execution of chained theta-join queries, and compares its results with Pig [32] and Hive [8]. Map-Join-Reduce [12] is another attempt to optimize the join operation using the Hadoop MapReduce framework, but this approach focuses only on equi-join queries.

There is a wide range of applications that require the top-k similar records from a database. The work in [33] describes how Hadoop MapReduce can be exploited for top-k similarity join algorithms. The Hadoop MapReduce implementation of the join operation cannot handle skew in the input data, which leads to poor load balancing and can swamp the benefits of parallelization. Atta, Viglas, and Niazi introduced a new technique called SAND Join [34], which employs range partitioning instead of hash partitioning for load distribution.

VII. CONCLUSION

The MapReduce framework enables us to process very large datasets distributed among the nodes of a cluster. For a large dataset, parallel processing reduces response time, since subsets of the dataset are processed independently on multiple nodes. As almost every kind of task can benefit from parallel processing, the join operation can benefit as well.

A simple technique for executing a join using MapReduce is the in-memory, or memory-backed, join. However, the memory-backed technique can be used only when one of the two datasets fits completely into memory. If both datasets are too large to fit into memory, then we must use a map-side or reduce-side join.

The map-side join technique is more efficient than the reduce-side join technique if the input is sorted and partitioned, because there is no need to shuffle the datasets over the network.

If only a small fraction of the records is relevant to the join operation, then applying a Bloom filter to remove records that will not satisfy the join condition helps to improve the performance of the join: the Bloom filter discards records in the map phase, so only a subset of the records is transferred to the reducers over the network.


REFERENCES

[1] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and parallel DBMSs: friends or foes?" Communications of the ACM, vol. 53, no. 1, pp. 64–71, Jan. 2010.

[2] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[3] http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/

[4] "Apache Hadoop." http://hadoop.apache.org/

[5] http://wiki.apache.org/hadoop/PoweredBy

[6] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)," Proc. VLDB Endow., vol. 3, no. 1-2, pp. 515–529, Sep. 2010.

[7] A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich, "Trojan data layouts: right shoes for a running elephant," in Proceedings of the 2nd ACM Symposium on Cloud Computing, ser. SOCC '11. New York, NY, USA: ACM, 2011, pp. 21:1–21:14.

[8] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in ICDE, F. Li, M. M. Moro, S. Ghandeharizadeh, J. R. Haritsa, G. Weikum, M. J. Carey, F. Casati, E. Y. Chang, I. Manolescu, S. Mehrotra, U. Dayal, and V. J. Tsotras, Eds. IEEE, 2010, pp. 996–1005.

[9] R. A. Brown, "Hadoop at home: large-scale computing at a small college," in Proceedings of the 40th ACM Technical Symposium on Computer Science Education, ser. SIGCSE '09. New York, NY, USA: ACM, 2009, pp. 106–110.

[10] Hive. http://hive.apache.org/

[11] A. Aji and F. Wang, "High performance spatial query processing for large scale scientific data," in SIGMOD/PODS PhD Symposium, X. L. Dong and M. T. Özsu, Eds. ACM, 2012, pp. 9–14.

[12] D. Jiang, A. K. H. Tung, and G. Chen, "MAP-JOIN-REDUCE: Toward scalable and efficient data analysis on large clusters," IEEE Trans. Knowl. Data Eng., vol. 23, no. 9, pp. 1299–1311, 2011.

[13] F. N. Afrati and J. D. Ullman, "Optimizing joins in a map-reduce environment," in EDBT, ser. ACM International Conference Proceeding Series, I. Manolescu, S. Spaccapietra, J. Teubner, M. Kitsuregawa, A. Léger, F. Naumann, A. Ailamaki, and F. Özcan, Eds., vol. 426. ACM, 2010, pp. 99–110.

[14] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, "Building a high-level dataflow system on top of map-reduce: the Pig experience," Proc. VLDB Endow., vol. 2, no. 2, pp. 1414–1425, Aug. 2009.

[15] E. Begoli and J. Horey, "Design principles for effective knowledge discovery from big data," in Software Architecture (WICSA) and European Conference on Software Architecture (ECSA), 2012 Joint Working IEEE/IFIP Conference on, Aug. 2012, pp. 215–218.

[16] S. Madden, "From databases to big data," IEEE Internet Computing, vol. 16, no. 3, pp. 4–6, 2012.

[17] V. R. Borkar, M. J. Carey, and C. Li, "Big data platforms: What's next?" XRDS, vol. 19, no. 1, pp. 44–49, Sep. 2012.

[18] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–10.

[19] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson, "RAID: high-performance, reliable secondary storage," ACM Comput. Surv., vol. 26, no. 2, pp. 145–185, Jun. 1994.

[20] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Commun. ACM, vol. 13, no. 7, pp. 422–426, Jul. 1970.

[21] L. Michael, W. Nejdl, O. Papapetrou, and W. Siberski, "Improving distributed join efficiency with extended bloom filter operations," in Advanced Information Networking and Applications (AINA '07), 21st International Conference on, May 2007, pp. 187–194.

[22] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian, "A comparison of join algorithms for log processing in MapReduce," in SIGMOD Conference, A. K. Elmagarmid and D. Agrawal, Eds. ACM, 2010, pp. 975–986.

[23] T. White, Hadoop - The Definitive Guide: MapReduce for the Cloud. O'Reilly, 2009.

[24] K. Lee, Y. Lee, H. Choi, Y. Chung, and B. Moon, "Parallel data processing with MapReduce: a survey," ACM SIGMOD Record, vol. 40, no. 4, pp. 11–20, 2012.

[25] J. Venner, Pro Hadoop, 1st ed. Berkeley, CA, USA: Apress, 2009.

[26] J. Lin and C. Dyer, Data-Intensive Text Processing with MapReduce, ser. Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers, 2010.

[27] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. R. Stonebraker, and D. A. Wood, "Implementation techniques for main memory database systems," SIGMOD Rec., vol. 14, no. 2, pp. 1–8, Jun. 1984.

[28] T. Lee, K. Kim, and H.-J. Kim, "Join processing using Bloom filter in MapReduce," in Proceedings of the 2012 ACM Research in Applied Computation Symposium, ser. RACS '12. New York, NY, USA: ACM, 2012, pp. 100–105.

[29] H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, "Map-Reduce-Merge: simplified relational data processing on large clusters," in Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '07. New York, NY, USA: ACM, 2007, pp. 1029–1040.

[30] W. Hu, L. Ma, X. Liu, H. Qi, L. Zha, H. Liao, and Y. Zhang, "A hybrid join algorithm on top of map reduce," in Semantics Knowledge and Grid (SKG), 2011 Seventh International Conference on, Oct. 2011, pp. 44–50.

[31] X. Zhang, L. Chen, and M. Wang, "Efficient multi-way theta-join processing using MapReduce," CoRR, vol. abs/1208.0081, 2012.

[32] Pig. http://pig.apache.org/

[33] Y. Kim and K. Shim, "Parallel top-k similarity join algorithms using MapReduce," in Data Engineering (ICDE), 2012 IEEE 28th International Conference on, Apr. 2012, pp. 510–521.

[34] F. Atta, S. Viglas, and S. Niazi, "SAND Join - a skew handling join algorithm for Google's MapReduce framework," in Multitopic Conference (INMIC), 2011 IEEE 14th International, Dec. 2011, pp. 170–175.
