Big Data Frameworks: A Comparative Study. Wissem Inoubli, Sabeur Aridhi.
2. Map phase: for each chunk having the key-value structure, the corresponding Map function is triggered and produces a set of intermediate key-value pairs.

3. Combine phase: this step aims to group together all intermediate key-value pairs associated with the same intermediate key.

4. Partitioning phase: following their combination, the results are distributed across the different Reduce functions.

5. Reduce phase: the Reduce function merges key-value pairs having the same key and computes a final result.

Scheduling. Scheduling is an important aspect of MapReduce [6]. During the execution of a MapReduce job, several scheduling decisions need to be taken. These decisions need to consider several pieces of information such as data location, resource availability and others. For example, during the execution of a MapReduce job, the scheduler tries to overcome problems such as Map-Skew, which refers to imbalanced computational load among map tasks [7].

Figure 2: HDFS architecture. Figure 3: Yarn architecture.

3. BIG DATA FRAMEWORKS
In this section, we survey some popular Big Data frameworks and categorize them according to their key features. These key features are (1) the programming model, (2) the supported programming languages, (3) the type of data sources and (4) the capability to allow for iterative data processing and thus to cope with streaming data.
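The MapReduce phases listed above can be illustrated with a minimal word-count sketch in plain Python. This is a conceptual illustration only, with no Hadoop APIs involved; all function and variable names are ours, and the hash-based routing merely stands in for Hadoop's default partitioner.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit an intermediate (word, 1) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def combine_phase(pairs):
    # Combine: group intermediate pairs sharing the same intermediate key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def partition_phase(groups, n_reducers):
    # Partitioning: route each key to one of the Reduce tasks (here by hash).
    partitions = [dict() for _ in range(n_reducers)]
    for key, values in groups.items():
        partitions[hash(key) % n_reducers][key] = values
    return partitions

def reduce_phase(partition):
    # Reduce: merge all values of a key into a final result (a count here).
    return {key: sum(values) for key, values in partition.items()}

chunks = ["big data big frameworks", "data data streams"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
partitions = partition_phase(combine_phase(intermediate), n_reducers=2)
result = {}
for part in partitions:
    result.update(reduce_phase(part))
# result holds the word counts: big=2, data=3, frameworks=1, streams=1
```

Real MapReduce runs the Map and Reduce functions on different nodes and shuffles the partitions over the network; the sequential loop above only shows the data flow between the phases.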
(e.g., messages coming from sensors). This explains the fact that Flink is largely adopted in several applications including smart cities, financial markets and surveillance systems [19]. Unlike Flink and Storm, Spark collects events' data every second and performs the processing task after that. Hence, more than one message is processed at a time, which explains the high CPU usage of Spark. Because of Flink's pipeline nature, each message is associated with a thread and consumed at each window time. Consequently, this low volume of processed data does not affect the CPU resource usage.

RAM consumption
Fig. 17 shows the cost of event stream processing in terms of RAM consumption. Spark reached 6 GB (75% of the available resources) due to its in-memory behaviour and its ability to perform micro-batching (processing a group of messages at a time). Flink and Storm did not exceed 5 GB (around 62.5% of the available RAM), as their stream mode consists in processing single messages only. Regarding Spark, the number of processed messages is small. Hence, the communication frequency with the cluster manager is low. In contrast, the number of processed events is high for Flink and Storm, which explains the high communication frequency between these frameworks and their daemons (i.e. between Storm and Zookeeper, and between Flink and Yarn). Indeed, the communication topology in Flink is predefined, whereas the communication topology in Storm is dynamic because Nimbus (the master component of Storm) periodically searches for available nodes to perform processing tasks.

Disk I/O usage
Fig. 18 depicts the amount of disk usage by the studied frameworks. Red slopes denote the amount of write operations, whereas green slopes denote the amount of read operations. The amounts of write operations in Flink and Storm are almost equal. Flink and Storm frequently access the disk and are faster than Spark in terms of the number of processed messages. As discussed in the above sections, Spark is an in-memory framework, which explains its lower disk usage.

4.5 Summary of evaluation
From the runtime experiments it is clear that Spark can deal with complex and large datasets better than Hadoop and Flink. The experiments carried out in this work also indicate that Hadoop performs well on the whole. However, it has some limitations regarding the time spent in communicating data between nodes and requires considerable processing time when the size of data increases. Flink's resource consumption is the lowest compared to Spark and Hadoop. This is explained by the greedy nature of Spark and Hadoop.

5. BEST PRACTICES
In the previous section, two major processing approaches (batch and stream) were studied and compared in terms of speed and resource usage. Choosing the right processing model is a challenging problem given the growing number of frameworks with similar and varied services. This section aims to shed light on the strengths of the above discussed frameworks when exploited in specific fields including stream processing, batch processing, machine learning and graph processing. We also discuss the use of the studied frameworks in several real-world applications including healthcare applications, recommender systems, social network analysis and smart cities.

5.1 Stream processing
As the world becomes more connected and influenced by mobile devices and sensors, stream computing has emerged as a basic capability of real-time applications in several domains, including monitoring systems, smart cities, financial markets and manufacturing [22]. However, this flood of data that comes from various sources at high speed always needs to be processed in a short time interval. In this case, Storm and Flink may be considered, as they allow pure stream processing. The design of in-stream applications needs to take into account the frequency and the size of incoming event data. For stream processing, Apache Storm is well known to be the best choice for high-rate stream-oriented applications (billions of events per second/core).
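The contrast drawn above between pure per-event streaming (Storm, Flink) and Spark's micro-batch behaviour can be sketched as a toy simulation in plain Python. This is a sketch of the two processing models only, not real framework code; the batch size and event counts are illustrative assumptions.

```python
def process_per_event(events, handle):
    # Pure streaming (Storm/Flink style): each incoming event is
    # handled individually as soon as it arrives.
    for event in events:
        handle([event])

def process_micro_batch(events, handle, batch_size):
    # Micro-batch (Spark Streaming style): events are buffered and
    # handled as a group once per interval (modelled here by size).
    buffer = []
    for event in events:
        buffer.append(event)
        if len(buffer) == batch_size:
            handle(buffer)
            buffer = []
    if buffer:          # flush the last partial batch
        handle(buffer)

calls = []
events = list(range(10))
process_per_event(events, calls.append)
assert len(calls) == 10            # one handler invocation per event

calls.clear()
process_micro_batch(events, calls.append, batch_size=4)
assert len(calls) == 3             # 4 + 4 + 2 events per invocation
```

The per-event model invokes the handler once per message (many small invocations, constant memory), while the micro-batch model amortizes the invocation cost over a buffered group, which mirrors the higher per-second throughput and RAM usage observed for Spark in the experiments.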
As shown in the conducted experiments, Storm performs well and allows resource saving even if the stream of events becomes important.

5.2 Micro-batch processing
In the case of batch processing, Spark may be a suitable framework to deal with periodic processing tasks such as Web usage mining, fraud detection, etc. In some situations, there is a need for a programming model that combines both batch and stream behaviour over the huge volume/frequency of data, in a lambda architecture. In this architecture, periodic analysis tasks are performed in a larger time window. Such behaviour is called micro-batch. For instance, data produced by healthcare and IoT applications often require combining batch and stream processing. In this case, frameworks like Flink and Spark may be good candidates [26]. Spark's micro-batch behaviour allows it to process datasets in larger time windows. Spark consists of a set of tools, such as Spark MLlib and Spark Streaming, that provide rich analysis functionalities in micro-batch. Such behaviour requires regrouping the processed data periodically before performing the analysis task.

5.3 Machine learning algorithms
Machine learning algorithms are iterative in nature [26]. Most of the above discussed frameworks support machine learning capabilities through a set of libraries and APIs. The Flink-ML library includes implementations of the k-Means clustering algorithm, logistic regression, and Alternating Least Squares (ALS) for recommendation [27]. Spark offers a more efficient set of machine learning tools, such as Spark MLlib and MLI [28]. Spark MLlib is a scalable and fast library that is suitable for general needs and most areas of machine learning. Regarding the Hadoop framework, Apache Mahout aims to build scalable and performant machine learning applications on top of Hadoop.

5.4 Big Graph processing
The field of processing large graphs has attracted considerable attention because of its huge number of applications, such as the analysis of social networks [29], Web graphs [30] and bioinformatics [31]. It is important to mention that Hadoop is not the optimal programming model for graph processing [32]. This can be explained by the fact that Hadoop uses coarse-grained tasks to do its work, which are too heavyweight for graph processing and iterative algorithms [26]. In addition, Hadoop cannot cache intermediate data in memory for faster performance. We also notice that most Big Data frameworks provide graph-related libraries (e.g., GraphX [15] with Spark and Gelly [33] with Flink). Moreover, many graph processing systems have been proposed [34]. Such frameworks include Pregel [16], GraphLab [35], BLADYG [36] and Trinity [37].

5.5 Real-world applications
5.5.1 Healthcare applications
Healthcare scientific applications, such as body area networks, provide monitoring capabilities to decide on the health status of a host. This requires deploying hundreds of interconnected sensors over the human body to collect various data including breath, cardiovascular, insulin, blood, glucose and body temperature [38]. However, sending and processing such a stream of health data iteratively is not supported by the original MapReduce model. Hadoop was initially designed to process big data already available in the distributed file system. In the literature, many extensions have been applied to the original MapReduce model in order to allow iterative computing, such as the HaLoop system [39] and Twister [40]. Nevertheless, the two caching functionalities of HaLoop, which allow reusing processed data in later iterations and checking for a fixpoint, lack efficiency. Also, since processed data may partially remain unchanged through the different iterations, they have to be reloaded and reprocessed at each iteration. This may lead to resource wastage, especially of network bandwidth and processor resources. Unlike HaLoop and existing MapReduce extensions, Spark provides support for interactive queries and iterative computing. RDD caching makes Spark efficient and performant in iterative use cases that require multiple treatments of large in-memory datasets [22].

5.5.2 Recommendation systems
Recommender systems are another field that has begun to attract more attention, especially with the continuous changes and the growing streams of users' ratings [41]. Unlike traditional recommendation approaches that only deal with static item and user data, new emerging recommender systems must adapt to the high volume of item information and the big stream of user ratings and tastes. In this case, recommender systems must be able to process the big stream of data. For instance, news items are characterized by a high degree of change, and user interests vary over time, which requires a continuous adjustment of the recommender system. In this case, frameworks like Hadoop are not able to deal with the fast stream of data (e.g. user ratings and comments), which may affect the real evaluation of available items (e.g. products or news). In such a situation, the adoption of effective stream processing frameworks is encouraged in order to avoid overrating when incorporating user/item related data into the recommender system. Tools like Mahout, Flink-ML and Spark MLlib include collaborative filtering algorithms that may be used for e-commerce purposes and in some social network services to suggest suitable items to users [42].

5.5.3 Social media
Social media is another representative data source for big data that requires real-time processing and results. It is generated from a wide range of Internet applications and Web sites including social and business-oriented networks (e.g. LinkedIn, Facebook), online mobile photo and video sharing services (e.g. Instagram, YouTube, Flickr), etc. This huge volume of social data requires a set of methods and algorithms related to text analysis, information diffusion, information fusion, community detection and network analytics, which may be exploited to analyse and process information from social-based sources [43]. This also requires iterative processing and learning capabilities and necessitates the adoption of in-stream frameworks such as Storm and Flink, along with their rich libraries.

5.5.4 Smart cities
Smart city is a broad concept that encompasses economy, governance, mobility, people, environment and living [44].
It refers to the use of information technology to enhance the quality, performance and interactivity of urban services in a city. It also aims to connect several geographically distant cities [45]. Within a smart city, data is collected from sensors installed on utility poles, water lines, buses, trains and traffic lights. The networking of hardware equipment and sensors is referred to as the Internet of Things (IoT) and represents a significant source of Big Data. Big Data technologies are used for several purposes in a smart city, including traffic statistics, smart agriculture, healthcare, transport and many others [45]. For example, the transporters of the logistics company UPS are equipped with operating sensors and GPS devices reporting the states of their engines and their positions respectively. This data is used to predict failures and track the positions of the vehicles. Urban traffic also provides large quantities of data that come from various sensors (e.g., GPS devices, public transportation smart cards, weather condition devices and traffic cameras). To understand this traffic behaviour, it is important to reveal hidden and valuable information from the big stream/storage of data. Finding the right programming model is still a challenge because of the diversity and the growing number of services [46]. Indeed, some use cases, such as urban planning and traffic control issues, are not time-critical; for these, the adoption of a batch-oriented framework like Hadoop is sufficient. Processing urban data in a micro-batch fashion is possible, for example, in the case of eGovernment and public administration services. Other use cases, like healthcare services (e.g. remote assistance of patients), need decision making and results within a few milliseconds. In this case, real-time processing frameworks like Storm are encouraged. Combining the strengths of the above discussed frameworks may also be useful to deal with cross-domain smart ecosystems, also called big services [47].

6. CONCLUSIONS
In this work, we surveyed popular frameworks for large-scale data processing. After a brief description of the main paradigms related to Big Data problems, we presented an overview of the Big Data frameworks Hadoop, Spark, Storm and Flink. We presented a categorization of these frameworks according to some main features such as the used programming model, the type of data sources, the supported programming languages and whether the framework allows iterative processing or not. We also conducted an extensive comparative study of the above presented frameworks on a cluster of machines and highlighted best practices for using the studied Big Data frameworks.

7. REFERENCES
[1] A. Gandomi, M. Haider, Beyond the hype: Big data concepts, methods, and analytics, International Journal of Information Management 35 (2) (2015) 137–144.
[2] A. Oguntimilehin, E. Ademola, A review of big data management, benefits and challenges, A Review of Big Data Management, Benefits and Challenges 5 (6) (2014) 433–438.
[3] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107–113.
[4] J. R. Lourenço, B. Cabral, P. Carreiro, M. Vieira, J. Bernardino, Choosing the right nosql database for the job: a quality attribute evaluation, Journal of Big Data 2 (1) (2015) 1–26. doi:10.1186/s40537-015-0025-0.
[5] S. Ghemawat, H. Gobioff, S.-T. Leung, The google file system, SIGOPS Oper. Syst. Rev. 37 (5) (2003) 29–43.
[6] R. Vernica, A. Balmin, K. S. Beyer, V. Ercegovac, Adaptive mapreduce using situation-aware mappers, in: Proceedings of the 15th International Conference on Extending Database Technology, EDBT '12, ACM, New York, NY, USA, 2012, pp. 420–431.
[7] Y. Kwon, M. Balazinska, B. Howe, J. Rolia, Skewtune: Mitigating skew in mapreduce applications, in: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, ACM, New York, NY, USA, 2012, pp. 25–36.
[8] I. Polato, R. Ré, A. Goldman, F. Kon, A comprehensive view of hadoop research: a systematic literature review, Journal of Network and Computer Applications 46 (2014) 1–25.
[9] T. White, Hadoop: The definitive guide, O'Reilly Media, Inc., 2012.
[10] R. Li, H. Hu, H. Li, Y. Wu, J. Yang, Mapreduce parallel programming model: A state-of-the-art survey, International Journal of Parallel Programming (2015) 1–35.
[11] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica, Spark: Cluster computing with working sets, in: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud'10, USENIX Association, Berkeley, CA, USA, 2010, pp. 10–10.
[12] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, N. Weizenbaum, Flumejava: Easy, efficient data-parallel pipelines, in: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '10, ACM, New York, NY, USA, 2010, pp. 363–375.
[13] N. Garg, Apache Kafka, Packt Publishing, 2013.
[14] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, M. Zaharia, Spark sql: Relational data processing in spark, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, ACM, New York, NY, USA, 2015, pp. 1383–1394.
[15] R. S. Xin, J. E. Gonzalez, M. J. Franklin, I. Stoica, Graphx: A resilient distributed graph system on spark, in: First International Workshop on Graph Data Management Experiences and Systems, GRADES '13, ACM, New York, NY, USA, 2013, pp. 2:1–2:6.
[16] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, G. Czajkowski, Pregel: a system for large-scale graph processing, in: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010, pp. 135–146.
[17] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, D. Ryaboy, Storm@twitter, in: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, ACM, New York, NY, USA, 2014, pp. 147–156.
[18] P. Hunt, M. Konar, F. P. Junqueira, B. Reed, Zookeeper: Wait-free coordination for internet-scale systems, in: Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC'10, USENIX Association, Berkeley, CA, USA, 2010, pp. 11–11.
[19] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, S. Schelter, M. Höger, K. Tzoumas, D. Warneke, The stratosphere platform for big data analytics, The VLDB Journal 23 (6) (2014) 939–964.
[20] Y. Yao, J. Wang, B. Sheng, J. Lin, N. Mi, Haste: Hadoop yarn scheduling based on task-dependency and resource-demand, in: Proceedings of the 2014 IEEE International Conference on Cloud Computing, CLOUD '14, IEEE Computer Society, Washington, DC, USA, 2014, pp. 184–191. doi:10.1109/CLOUD.2014.34.
[21] J. Lin, C. Dyer, Data-Intensive Text Processing with MapReduce, Morgan and Claypool Publishers, 2010.
[22] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, S. Sakr, Big data 2.0 processing systems: Taxonomy and open challenges, Journal of Grid Computing 14 (3) (2016) 379–405.
[23] Y. Gupta, Kibana Essentials, Packt Publishing, 2015.
[24] S. Wadkar, M. Siddalingaiah, Apache Ambari, Apress, Berkeley, CA, 2014, pp. 399–401.
[25] D. Eadline, Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem, 1st Edition, Addison-Wesley Professional, 2015.
[26] S. Landset, T. M. Khoshgoftaar, A. N. Richter, T. Hasanin, A survey of open source tools for machine learning with big data in the hadoop ecosystem, Journal of Big Data 2 (1) (2015) 1–36. doi:10.1186/s40537-015-0032-1.
[27] S. Chakrabarti, E. Cox, E. Frank, R. H. Güting, J. Han, X. Jiang, M. Kamber, S. S. Lightstone, T. P. Nadeau, R. E. Neapolitan, D. Pyle, M. Refaat, M. Schneider, T. J. Teorey, I. H. Witten, Data Mining: Know It All, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.
[28] E. R. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. E. Gonzalez, M. J. Franklin, M. I. Jordan, T. Kraska, MLI: an API for distributed machine learning, in: 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013, 2013, pp. 1187–1192.
[29] C. Giatsidis, D. M. Thilikos, M. Vazirgiannis, Evaluating cooperation in communities with the k-core structure, in: Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining, ASONAM '11, IEEE Computer Society, Washington, DC, USA, 2011, pp. 87–93.
[30] J. I. Alvarez-Hamelin, L. Dall'Asta, A. Barrat, A. Vespignani, K-core decomposition of internet graphs: hierarchies, self-similarity and measurement biases, NHM 3 (2) (2008) 371–393.
[31] C. Huttenhower, S. O. Mehmood, O. G. Troyanskaya, Graphle: Interactive exploration of large, dense graphs, BMC Bioinformatics 10 (2009) 417.
[32] B. Elser, A. Montresor, An evaluation study of bigdata frameworks for graph processing, in: IEEE International Conference on Big Data, 2013, pp. 60–67.
[33] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas, Apache Flink: Stream and batch processing in a single engine, IEEE Data Eng. Bull. 38 (4) (2015) 28–38.
[34] S. Aridhi, E. M. Nguifo, Big graph mining: Frameworks and techniques, Big Data Research (2016), in press.
[35] Y. Low, D. Bickson, J. Gonzalez, et al., Distributed graphlab: A framework for machine learning and data mining in the cloud, Proc. VLDB Endow. 5 (8) (2012) 716–727.
[36] S. Aridhi, A. Montresor, Y. Velegrakis, Bladyg: A novel block-centric framework for the analysis of large dynamic graphs, in: Proceedings of the ACM Workshop on High Performance Graph Processing, HPGP '16, ACM, New York, NY, USA, 2016, pp. 39–42.
[37] B. Shao, H. Wang, Y. Li, Trinity: A distributed graph engine on a memory cloud, in: Proc. of the Int. Conf. on Management of Data, ACM, 2013.
[38] F. Zhang, J. Cao, S. U. Khan, K. Li, K. Hwang, A task-level adaptive mapreduce framework for real-time streaming data in healthcare applications, Future Gener. Comput. Syst. 43 (C) (2015) 149–160.
[39] Y. Bu, B. Howe, M. Balazinska, M. D. Ernst, The haloop approach to large-scale iterative data analysis, The VLDB Journal 21 (2) (2012) 169–190.
[40] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, G. Fox, Twister: a runtime for iterative mapreduce, in: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, 2010, pp. 810–818.
[41] P. Resnick, H. R. Varian, Recommender systems, Commun. ACM 40 (3) (1997) 56–58.
[42] J. Domann, J. Meiners, L. Helmers, A. Lommatzsch, Real-time news recommendations using apache spark, in: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016, 2016, pp. 628–641.
[43] G. Bello-Orgaz, J. J. Jung, D. Camacho, Social big data: Recent achievements and new challenges, Information Fusion 28 (2016) 45–59.
[44] C. Yin, Z. Xiong, H. Chen, J. Wang, D. Cooper, B. David, A literature survey on smart cities, Science China Information Sciences 58 (10) (2015) 1–18.
[45] C. L. Stimmel, Building Smart Cities: Analytics, ICT, and Design Thinking, Auerbach Publications, Boston, MA, USA, 2015.
[46] G. Piro, I. Cianci, L. Grieco, G. Boggia, P. Camarda, Information centric services in smart cities, Journal of Systems and Software 88 (2014) 169–188.
[47] X. Xu, Q. Z. Sheng, L. J. Zhang, Y. Fan, S. Dustdar, From big data to big service, Computer 48 (7) (2015) 80–83.