2018 HISISE BigDatavf
net/publication/323964028
3 authors, including:
Filipe Portela
University of Minho
All content following this page was uploaded by Filipe Portela on 03 May 2018.
1 Introduction
Since the invention of computers, large volumes of data have been generated at a
surprising rate [1]. Up to 2003, humankind had created 5 exabytes of data; currently,
this amount is created in 2 days [2]. Against this background, Big Data began to reveal
its true potential in dealing with large volumes of data that come from various sources
and are generated at high speeds. The health industry generates huge amounts of data,
though most of it is stored in non-digital format. Nowadays, the trend is to digitize most
of this information [3]. According to Feldman et al. (2012), the increase in the volume of
data in the health industry comes not only from the creation of new forms of data
(three-dimensional images, biometric sensor readings, and others), but also from the
conversion of existing data, such as radiology images, DNA sequence data and others, to
digital format. Given the noticeable delay in the adoption of Big Data technologies by the
health industry, it is necessary to identify the challenges and potential uses of Big Data
in this industry and to identify cases of adoption of Big Data technologies in hospitals
and healthcare clinics. To ease the adoption of Big Data technologies in the health
industry, this study aims to identify, analyze, filter and compare such solutions.
This paper is divided into six sections: Introduction; Background; Methods and Tools;
Case Study; Discussion; Conclusion and Future Work. The second section presents the
challenges and potential of Big Data in the healthcare industry. Section 3 presents and
describes the methods and tools used in this project. Section 4 presents the case study:
the various solutions found and the comparison between the filtered solutions. Section 5
analyzes the results in the project context. Section 6 summarizes the paper, describes
the main conclusions and gives a short description of the future work.
2 Background
“Big data refers to datasets whose size is beyond the ability of typical database software
tools to capture, store, manage, and analyze.” Manyika et al. (2011).
According to Zikopoulos et al. (2012), the three characteristics that define Big Data are
Volume, Variety and Velocity. Hurwitz et al. (2013) state that the three V's mentioned
above are too simplistic and propose a fourth V, Veracity. Some authors in the literature
add a fifth V, Value [7]. Therefore, Big Data can be characterized by:
a) Volume - the large amounts of data generated, which grow exponentially and come
from a variety of sources;
b) Variety - according to Zikopoulos et al. (2012), society spends a large part of its
time on structured data (which represents 20% of the total volume of data generated),
but the challenge lies in the remaining 80%, which is semi-structured or unstructured;
c) Velocity - the speed at which the data is generated. Today, much of the data that is
generated has an "expiration date", that is, it is only relevant to organizations if it
is analyzed almost in real time [5];
d) Value - the economic value of the data;
e) Veracity - the quality of the data.
According to Feldman et al. (2012), the increase in the volume of data in the health
industry comes, not only from the creation of new forms of data (three-dimensional
images, biometric sensor readings, among others), but also from the digitization of
existing data, such as radiology images, DNA sequence data, among others.
The McKinsey Global Institute conducted a study to understand the potential of Big
Data in five areas, one of which was the health area in the United States. Despite the
importance of this sector in the country's economy (representing 17% of GDP), it is
still possible to notice a delay in the adoption of Big Data in relation to other industries
[8].
INTCare is a research project developed at the Intensive Care Unit (ICU) of the
Centro Hospitalar do Porto which, at an early stage, was designed to develop an
intelligent system to predict organ failure and its effects on patients (F. Portela et al.,
2014). The INTCare project collects data continuously and in real time from several
sensors and monitoring devices, which generates a volume of data from 50 to
500 Terabytes [9]. According to Portela et al. (2014), in 2009 the excessive amount of
medical records kept on paper or entered manually into the database became apparent.
After a set of studies aimed at identifying the gaps in the ICU information system, it was
possible to develop a new solution based on intelligent systems capable of performing
tasks such as data acquisition and processing automatically [10]. Nowadays, INTCare is a
Pervasive Intelligent Decision Support System (PIDSS) that acts automatically and in
real time to provide new information to the decision makers in the ICU, i.e., physicians
and nurses [10]–[12]. The Data Management subsystem of this architecture relies on
Apache Kafka and Apache Storm for streaming data processing, and on Apache Phoenix
and Apache HBase for operational data processing. The processing of analytical data is
ensured by Apache Hive. Security, administration and operations are ensured by the
following Hadoop ecosystem projects: Apache Sqoop; Apache Knox; Apache Ranger;
Apache Flume; and Apache Oozie.
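The streaming part of such an architecture can be illustrated with a minimal sketch. It does not use Apache Kafka or Apache Storm: a plain Python generator stands in for the stream, and a single function stands in for a processing step that keeps a rolling mean per sensor. The function name `rolling_means`, the sensor names and the readings are all hypothetical and were invented for illustration only; they are not taken from the INTCare project.

```python
from collections import defaultdict, deque


def rolling_means(readings, window=3):
    """Yield (sensor, mean of the last `window` values) for each incoming reading.

    `readings` is any iterable of (sensor_name, value) pairs, consumed one
    item at a time, as a stream processor would consume messages.
    """
    buffers = defaultdict(lambda: deque(maxlen=window))
    for sensor, value in readings:
        buffer = buffers[sensor]
        buffer.append(value)  # the oldest value drops out once the window is full
        yield sensor, sum(buffer) / len(buffer)


# Hypothetical readings, invented purely for illustration.
stream = [
    ("heart_rate", 80),
    ("heart_rate", 90),
    ("spo2", 97),
    ("heart_rate", 100),
    ("heart_rate", 110),
]

results = list(rolling_means(stream, window=3))
for sensor, mean in results:
    print(sensor, mean)
```

Each reading is processed as soon as it arrives and only the last few values per sensor are kept in memory, which is the property that makes this style of processing suitable for data that loses relevance quickly.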
Liu et al. (2015) proposed an architecture for a Big Data processing tool, designed
for the health area, that includes Apache Spark.
The "Data Storage" layer of this architecture comprises the Hadoop Distributed File
System (HDFS) and Apache HBase. The "Data Processing" layer contains Apache Spark and
Spark Streaming. Apache Hive and Spark SQL are part of the "Access to data" layer.
Finally, the "Analytics and Business Intelligence" layer contains the MLlib, GraphX and
SparkR tools.
3.3 Big Data Architecture designed for the Maharaja Yeshwantrao Hospital
Ojha & Mathur (2016) proposed a Big Data architecture to address the needs of the
Maharaja Yeshwantrao Hospital located in Indore, India, which the authors consider
the largest public hospital in central India. Ojha & Mathur (2016) state that
the hospital generates large volumes of data because of the number of citizens who attend
it daily. With the implementation of Big Data technologies, the authors intend to
improve the patients' quality of life, especially for those in need: long waiting times
have a negative impact on the poorest patients because they force citizens to lose working
days and, consequently, a portion of their salary.
Using Big Data and data analysis tools, it will be possible to store the data that the
Maharaja Yeshwantrao Hospital generates. Health professionals will therefore be
able to find new knowledge, hidden patterns and trends, which may result in improved
treatments, fewer readmissions and lower expenses, among other benefits (Ojha & Mathur,
2016).
The "Data Storage" layer is composed of HDFS and Apache HBase. The
"Data Processing" layer contains Hadoop MapReduce. Apache Hive, Apache Pig
and Apache Avro are part of the "Data Access" layer. The "Management" layer consists
of Apache Zookeeper and Apache Chukwa.
3.4 IBM PureData Solution For Healthcare Analytics
4 Benchmarking
In this chapter, the three selected architectures are compared, more specifically
their components (from the "Data Processing" layer). The architectures were
selected based on their license type: only open source solutions are compared.
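Table 1 - Comparative table of the selected architectures
Layer                   Big Data Architecture     Big Data Architecture     Big Data Architecture
                        with Apache Spark         designed for the          for the INTCare Project
                                                  Maharaja Yeshwantrao
                                                  Hospital
Data Storage            HDFS                      HDFS                      HDFS
                        Apache HBase              Apache HBase              Apache HBase
Data Processing         Apache Spark              Hadoop MapReduce          Apache Kafka
                        Spark Streaming                                     Apache Storm
                                                                            Apache Phoenix
Management              -                         Apache Zookeeper          Apache Oozie
                                                  Apache Chukwa             Apache Flume
Data Access             Apache Hive               Apache Hive               Apache Hive
                                                  Apache Pig                Apache Sqoop
                                                  Apache Avro
Analytics and Business  Spark SQL                 -                         INTCare knowledge
Intelligence            MLlib                                               management subsystem
                        GraphX
                        SparkR
Security                -                         -                         Apache Knox
                                                                            Apache Ranger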
As can be seen from Table 1, in the "Data Storage" layer all solutions consist of
HDFS and Apache HBase.
In the "Data Processing" layer, the Big Data Architecture with Apache Spark
consists of Apache Spark and Spark Streaming, the Big Data Architecture designed for
the Maharaja Yeshwantrao Hospital uses Hadoop MapReduce, and the Big Data
Architecture for the INTCare Project consists of Apache Kafka, Apache Storm and
Apache Phoenix.
In the "Management" layer, the Big Data Architecture with Apache Spark does not
present any tool, the Big Data Architecture designed for the Maharaja Yeshwantrao
Hospital has Apache Zookeeper and Apache Chukwa, and the Big Data Architecture for the
INTCare Project has Apache Oozie and Apache Flume.
Apache Hive is present in the "Data Access" layer of all architectures, but the Big
Data Architecture designed for the Maharaja Yeshwantrao Hospital also includes
Apache Pig and Apache Avro, and the Big Data Architecture for the INTCare Project
also includes Apache Sqoop.
The Big Data Architecture with Apache Spark presents the Spark SQL, MLlib, GraphX and
SparkR tools for the "Analytics and Business Intelligence" layer, while the Big Data
Architecture for the INTCare Project has the knowledge management subsystem
developed for the INTCare project, and the Big Data Architecture designed for the
Maharaja Yeshwantrao Hospital does not present any tool.
Finally, in the "Security" layer, only the Big Data Architecture for the INTCare
Project presents tools, namely Apache Knox and Apache Ranger.
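The layer-by-layer comparison above can also be written down as a small data structure, which makes observations such as "only one architecture covers the Security layer" checkable. The sketch below is one possible encoding, not part of the original study: the dictionary keys and the helper names `shared_tools` and `architectures_covering` are assumptions, the tool lists are transcribed from the comparison text, and Spark SQL is placed in the analytics layer as that text does.

```python
# Tools per layer, transcribed from the comparison of the three architectures.
ARCHITECTURES = {
    "Apache Spark": {
        "Data Storage": {"HDFS", "Apache HBase"},
        "Data Processing": {"Apache Spark", "Spark Streaming"},
        "Management": set(),
        "Data Access": {"Apache Hive"},
        "Analytics and Business Intelligence": {"Spark SQL", "MLlib", "GraphX", "SparkR"},
        "Security": set(),
    },
    "Maharaja Yeshwantrao Hospital": {
        "Data Storage": {"HDFS", "Apache HBase"},
        "Data Processing": {"Hadoop MapReduce"},
        "Management": {"Apache Zookeeper", "Apache Chukwa"},
        "Data Access": {"Apache Hive", "Apache Pig", "Apache Avro"},
        "Analytics and Business Intelligence": set(),
        "Security": set(),
    },
    "INTCare": {
        "Data Storage": {"HDFS", "Apache HBase"},
        "Data Processing": {"Apache Kafka", "Apache Storm", "Apache Phoenix"},
        "Management": {"Apache Oozie", "Apache Flume"},
        "Data Access": {"Apache Hive", "Apache Sqoop"},
        "Analytics and Business Intelligence": {"INTCare knowledge management subsystem"},
        "Security": {"Apache Knox", "Apache Ranger"},
    },
}


def shared_tools(layer):
    """Return the tools that every architecture uses in the given layer."""
    return set.intersection(*(arch[layer] for arch in ARCHITECTURES.values()))


def architectures_covering(layer):
    """Return the names of the architectures with at least one tool in the layer."""
    return sorted(name for name, arch in ARCHITECTURES.items() if arch[layer])


print(shared_tools("Data Storage"))
print(shared_tools("Data Access"))
print(architectures_covering("Security"))
```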
4.1 Comparison between Hadoop MapReduce and Apache Spark
In this subchapter, the differences between the two data processing frameworks,
Hadoop MapReduce and Apache Spark, will be presented. Afterwards, two
experiments will be presented comparing the performance of the two frameworks in
several scenarios.
Table 2 shows the main differences between MapReduce and Apache Spark.
Hadoop MapReduce                                    Apache Spark
Computing based on disk memory, partial use of      Computing based on RAM memory,
RAM (Random Access Memory)                          partial use of disk memory
Difficult to process and analyze data in real time  Can be used to modify data in real time
Inefficient for applications that need to           Stores the dataset in RAM for efficient
constantly reuse the same dataset                   reuse
Shi et al. (2015) conducted an experiment to compare the performance of the
two frameworks. The experiment consisted of running several workloads
(WordCount, Sort, K-means) that simulated real-world use of these frameworks.
Apache Spark performed better in the execution of WordCount: for a 1 GB input
it was 34 s faster, for 40 GB it was 110 s faster and, for 200 GB, it was 398 s
faster.
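For readers unfamiliar with the WordCount workload, the following sketch shows its structure as separate map, shuffle and reduce phases. It is a single-process illustration in plain Python, not Hadoop or Spark code, and the function names and the two input lines are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data in healthcare", "big data tools"])
print(counts)
```

In a distributed framework the map and reduce functions run in parallel on many machines and the shuffle moves data across the network, which is where most of the cost measured in such benchmarks comes from.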
When executing Sort on a 1 GB input, Apache Spark completed the task 3 s faster
than Hadoop MapReduce. The same did not hold for inputs of 100 GB and 500 GB,
where MapReduce completed the task 1.5 min and 20 min faster, respectively.
For both the first and the subsequent iterations of K-means, Apache Spark
showed shorter execution times. The difference grows in the subsequent
iterations because of Apache Spark's caching mechanism.
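Why caching matters more with every additional iteration can be shown with a toy model. The sketch below is not a benchmark and does not use either framework: the class `CountingSource` is a hypothetical stand-in for a dataset on disk that simply counts how often it is read, and the two functions contrast re-reading the input on every pass with reading it once and reusing it from memory.

```python
class CountingSource:
    """Stands in for a dataset stored on disk and counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_without_cache(source, iterations):
    """Re-read the input on every pass, as a chain of independent jobs would."""
    total = 0.0
    for _ in range(iterations):
        data = source.load()
        total += sum(data)
    return total


def iterate_with_cache(source, iterations):
    """Read the input once and reuse the in-memory copy on every pass."""
    data = source.load()
    total = 0.0
    for _ in range(iterations):
        total += sum(data)
    return total


uncached = CountingSource([1, 2, 3])
cached = CountingSource([1, 2, 3])
uncached_total = iterate_without_cache(uncached, iterations=5)
cached_total = iterate_with_cache(cached, iterations=5)
print(uncached.reads, cached.reads)
```

Both variants compute the same result, but the number of reads grows with the iteration count only in the uncached variant, so the saving from caching accumulates over the iterations.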
Gu & Li (2014) conducted an experiment to compare the performance of Hadoop
MapReduce and Apache Spark in iterative tasks. PageRank was the
algorithm chosen for the experiment.
Runtimes varied depending on the size of the dataset. For small datasets (between
1 MB and 11 MB), Apache Spark completed the tasks 25 to 40 times faster. For datasets
between 40 MB and 89 MB, Apache Spark was about 10 to 15 times faster than
MapReduce. For datasets between 200 MB and 600 MB, Apache Spark was 3 to 5 times
faster than MapReduce. When the dataset size exceeded 1 GB, MapReduce performed
better than Apache Spark and, in some cases, Apache Spark failed during execution
while MapReduce completed the task.
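PageRank is iterative because every pass recomputes all ranks from the ranks of the previous pass, so the same link data is traversed repeatedly. A minimal single-machine sketch of the usual power-iteration formulation is given below. The damping factor of 0.85 and the fixed number of iterations are conventional choices assumed here, not parameters reported for the experiment, and the three-page graph is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Compute PageRank by power iteration.

    `links` maps each page to the list of pages it links to; every linked
    page must itself be a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each pass with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_ranks[target] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks


# Hypothetical three-page graph, invented purely for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```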
5 Discussion
Based on this study, it is possible to conclude that there is not much scientific
documentation about the implementation of Big Data technologies in hospitals/health
clinics. It is also possible to conclude that the approval of the scientific community can
help to overcome some of the challenges that are presented to the adoption of Big Data
technologies in the health area.
Given the results obtained in the analyzed experiments, it is possible to conclude that:
a) The Big Data Architecture for the INTCare Project is best suited to handle
streaming data. This solution combines Apache Kafka and Apache Storm to
handle data from bedside monitors (vital signs, ventilation and others) [9];
b) The Maharaja Yeshwantrao architecture is best suited to handle large volumes
of data. Although Hadoop MapReduce performs poorly against Apache Spark
in most of the tests presented in subchapter 4.1, it has been able to handle large
volumes of data;
c) The Big Data Architecture with Apache Spark is a hybrid solution, as it has
proven capable of handling both streaming and batch data. However, the
performance of Apache Spark is very dependent on the configuration of the
cluster.
Although it was not possible to make a direct performance comparison of all the
solutions chosen for benchmarking, it is possible to conclude that the most appropriate
architecture for a healthcare organization is the Big Data Architecture for the INTCare
Project. This solution presents all the components in detail, together with how they
interact with the system in which they are inserted and, more importantly, it is the only
solution that presents components in the "Security" layer (as can be seen in Table 1).
Since security is one of the challenges to implementing Big Data in healthcare, it is
considered necessary to integrate tools that ensure data and system security in general.
Acknowledgments
References
[1] I. Yaqoob et al., “Big data: From beginning to future,” Int. J. Inf. Manage., vol.
36, no. 6, pp. 1231–1247, 2016.
[2] S. Sagiroglu and D. Sinanc, “Big data: A review,” 2013 Int. Conf. Collab.
Technol. Syst., pp. 42–47, 2013.
[3] W. Raghupathi and V. Raghupathi, “Big data analytics in healthcare: promise
and potential,” Heal. Inf. Sci. Syst., vol. 2, p. 3, 2014.
[4] B. Feldman, E. M. Martin, and T. Skotnes, “Big Data in Healthcare - Hype and
Hope,” Dr.Bonnie 360 degree (bus. Dev. Digit. Heal., vol. 2013, no. 1, pp. 122–
125, 2012.
[5] P. Zikopoulos, C. Eaton, D. DeRoos, T. Deutsch, and G. Lapis, Understanding
Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. 2012.
[6] J. Hurwitz, A. Nugent, D. F. Halper, and M. Kaufman, Big Data For Dummies.
2013.
[7] C. Taurion, Big Data. 2013.
[8] J. Manyika et al., “Big data: The next frontier for innovation, competition, and
productivity,” McKinsey Glob. Inst., no. June, p. 156, 2011.
[9] A. Gonçalves, F. Portela, and M. F. Santos, “Towards of a Real-time Big Data
Architecture to Intensive Care,” Procedia Comput. Sci. (ICTH 2017 - Int. Conf.
Current and Future Trends of Information and Communication Technologies in
Healthcare), pp. 585–590, ISSN 1877-0509, Elsevier, 2017.
[10] F. Portela, M. F. Santos, J. Machado, A. Abelha, Á. Silva, and F. Rua,
Pervasive and Intelligent Decision Support in Intensive Medicine – The
Complete Picture. 2014.
[11] T. Guarda, M. F. Augusto, O. Barrionuevo, and F. M. Pinto, “Internet of Things
in Pervasive Healthcare Systems,” in Next-Generation Mobile and Pervasive
Healthcare Solutions, 2018, pp. 22–31.
[12] T. Guarda, W. Orozco, M. F. Augusto, G. Morillo, S. A. Navarrete, and F. M.
Pinto, “Penetration Testing on Virtual Environments,” Proc. 4th Int. Conf. Inf.
Netw. Secur. - ICINS ’16, no. Vmm, pp. 9–12, 2016.
[13] W. Liu, Q. Li, Y. Cai, Y. Li, and X. Li, “A Prototype of Healthcare Big Data
Processing System Based on Spark,” no. Bmei, pp. 516–520, 2015.
[14] M. Ojha and K. Mathur, “Proposed application of big data analytics in
healthcare at Maharaja Yeshwantrao Hospital,” 2016 3rd MEC Int. Conf. Big
Data Smart City, ICBDSC 2016, pp. 40–46, 2016.
[15] S. M. Krishnan, “Application of analytics to big data in healthcare,” Proc. -
32nd South. Biomed. Eng. Conf. SBEC 2016, pp. 156–157, 2016.
[16] IBM, “IBM PureData Solution for Healthcare Analytics,” 2013.
[17] R. Nambiar, A. Sethi, R. Bhardwaj, and R. Vargheeseh, “A Look at Challenges
and Opportunities of Big Data Analytics in Healthcare,” pp. 17–22, 2013.
[18] A. Verma, A. H. Mansuri, and N. Jain, “Big data management processing with
Hadoop MapReduce and spark technology: A comparison,” 2016 Symp.
Colossal Data Anal. Networking, CDAN 2016, 2016.
[19] J. Shi et al., “Clash of the Titans: MapReduce vs. Spark for Large Scale Data
Analytics,” no. 3, pp. 2110–2121, 2015.
[20] L. Gu and H. Li, “Memory or time: Performance evaluation for iterative
operation on hadoop and spark,” Proc. - 2013 IEEE Int. Conf. High Perform.
Comput. Commun. HPCC 2013 2013 IEEE Int. Conf. Embed. Ubiquitous
Comput. EUC 2013, pp. 721–727, 2014.
[21] R. Lu, G. Wu, B. Xie, and J. Hu, “Stream bench: Towards benchmarking
modern distributed stream computing frameworks,” Proc. - 2014 IEEE/ACM
7th Int. Conf. Util. Cloud Comput. UCC 2014, pp. 69–78, 2014.