An Overview of Big Data Architectures in Healthcare

Chapter · March 2018


DOI: 10.1007/978-3-319-77700-9_19

An overview of Big Data architectures in healthcare

Hugo Torres1, Filipe Portela1*, and Manuel Filipe Santos1


1 Algoritmi Research Center, University of Minho, Guimarães, Portugal
{cfp, mfs}@dsi.uminho.pt

Abstract. Big Data has been associated with gains in efficiency and
effectiveness in many areas. Although many studies have tried to demonstrate
the value of Big Data in healthcare and medicine, few practical advances have
been made. In this project, the existing Big Data technologies applied in
healthcare were analyzed and compared. Three open source solutions were
analyzed: the Big Data solution developed for the INTCare project, a
Hadoop-based solution proposed for the Maharaja Yeshwantrao Hospital in India,
and a solution that uses Apache Spark. Two proprietary solutions were also
analyzed: the IBM PureData Solution for Healthcare Analytics, used at Seattle
Children's Hospital, and Cisco Connected Health Solutions and Services.

Keywords: Big Data, INTCare, HealthCare

1 Introduction

Since the invention of computers, large volumes of data have been generated at a
surprising rate [1]. Up to 2003, humanity had created 5 exabytes of data; currently,
the same amount is created every 2 days [2]. Against this backdrop, Big Data began to
reveal its potential for dealing with large volumes of data that come from various
sources and are generated at high speed. The health industry generates huge amounts of
data, though most of it is stored in non-digital format. Nowadays, the trend is to
digitize most of this information [3]. According to Feldman et al. (2012), the increase
in the volume of data in the health industry comes not only from the creation of new
forms of data (three-dimensional images, biometric sensor readings, and others), but
also from the conversion of existing data, such as radiology images and DNA sequence
data, to digital format. Given the noticeable delay in the adoption of Big Data
technologies by the health industry, it is necessary to identify the challenges and
potential uses of Big Data in this industry and to identify cases of adoption of Big
Data technologies in hospitals and healthcare clinics. To ease the adoption of Big Data
technologies in the health industry, this study aims to identify, analyze, filter and
compare such solutions.
This paper is divided into six sections. Section 2 presents the Big Data concept and
its potential in the healthcare industry. Section 3 describes the Big Data
architectures found in healthcare. Section 4 compares the selected architectures and
reviews published experiments on their data processing components. Section 5 discusses
the results in the context of the project. Section 6 summarizes the main conclusions
and briefly describes future work.

2 Background

2.1 Big Data

“Big data refers to datasets whose size is beyond the ability of typical database software
tools to capture, store, manage, and analyze.” Manyika et al. (2011).
According to Zikopoulos et al. (2012), three characteristics define Big Data:
Volume, Variety and Velocity. Hurwitz et al. (2013) state that these three V's are
too simplistic and propose a fourth V, veracity. Some authors add a fifth V,
value [7]. Therefore, Big Data can be characterized by: a) Volume - the large amount
of data generated, which grows exponentially and comes from a variety of sources;
b) Variety - according to Zikopoulos et al. (2012), society spends a large part of its
time on structured data (representing 20% of the total volume of data generated), but
the challenge lies in the remaining 80%, which is semi-structured or unstructured;
c) Velocity - the speed at which the data is generated. Today, much of the data
generated has an "expiration date", that is, it is only relevant to organizations if
it is analyzed almost in real time [5]; d) Value - the economic value of the data;
e) Veracity - the quality of the data.
According to Feldman et al. (2012), the increase in the volume of data in the health
industry comes not only from the creation of new forms of data (three-dimensional
images, biometric sensor readings, among others), but also from the digitization of
existing data, such as radiology images and DNA sequence data.
The McKinsey Global Institute conducted a study to understand the potential of Big
Data in five areas, one of which was the health area in the United States. Despite the
importance of this sector in the country's economy (representing 17% of GDP), it is
still possible to notice a delay in the adoption of Big Data in relation to other industries
[8].

3 Big Data architectures used in healthcare

3.1 Big Data Architecture for the INTCare Project

INTCare is a research project developed at the Intensive Care Unit (ICU) of the
Centro Hospitalar do Porto, which, at an early stage, was designed to develop an
intelligent system to predict organ failure and its effects on users (Portela et al.,
2014). The INTCare project collects data continuously and in real time from several
sensors and monitoring devices, which generates a volume of data from 50 to
500 Terabytes [9]. According to Portela et al. (2014), in 2009 the excessive amount of
medical records kept on paper or manually entered into the database became apparent.
After a set of studies aimed at identifying the gaps in the ICU information system, it
was possible to develop a new solution based on intelligent systems capable of
performing automatic tasks such as data acquisition and processing [10]. Nowadays,
INTCare is a Pervasive Intelligent Decision Support System (PIDSS) that acts
automatically and in real time to provide new information to decision makers in the
ICU, i.e., physicians and nurses [10]–[12]. The Data Management subsystem of this
architecture relies on Apache Kafka and Apache Storm for streaming data processing.
For operational data processing it relies on Apache Phoenix and Apache HBase. The
processing of analytical data is ensured by Apache Hive. Security, administration and
operations are ensured by the following projects from the Hadoop ecosystem: Apache
Sqoop; Apache Knox; Apache Ranger; Apache Flume; and Apache Oozie.

3.2 Big Data Architecture with Apache Spark

Liu et al. (2015) proposed an architecture for a Big Data processing tool, designed
for the health area, that includes Apache Spark.
The "Data Storage" layer of this architecture contains the Hadoop Distributed File
System (HDFS) and Apache HBase. The "Data Processing" layer contains Apache Spark and
Spark Streaming. Apache Hive and Spark SQL are part of the "Access to data" layer.
Finally, the "Analytics and Business Intelligence" layer contains the MLlib, GraphX
and SparkR tools.

3.3 Big Data Architecture designed for the Maharaja Yeshwantrao Hospital

Ojha & Mathur (2016) proposed a Big Data architecture to address the needs of the
Maharaja Yeshwantrao Hospital in Indore, India, which the authors consider the
largest public hospital in central India. Ojha & Mathur (2016) state that the hospital
generates large volumes of data owing to the number of citizens who attend it daily.
With the implementation of Big Data technologies, the authors intend to improve
patients' quality of life, especially for those most in need: long waiting times have
a negative impact on the poorest patients, because they force citizens to miss working
days and consequently lose a portion of their salary.
Using Big Data and data analysis tools, it will be possible to store the data that the
hospital generates; health professionals will then be able to find new knowledge,
hidden patterns and trends, which may result in improved treatments, fewer
readmissions and lower expenses, among other benefits (Ojha & Mathur, 2016).
The "Data Storage" layer is composed of HDFS and Apache HBase. The "Data Processing"
layer contains Hadoop MapReduce. Apache Hive, Apache Pig and Apache Avro are part of
the "Data Access" layer. The "Management" layer consists of Apache Zookeeper and
Apache Chukwa.

3.4 IBM PureData Solution For Healthcare Analytics

This architecture is based on the IBM PureData Solution for Healthcare Analytics,
which is being used at Seattle Children's Hospital to improve diagnostic and patient
care capabilities [15].
IBM PureData Solution for Healthcare Analytics is a solution developed by IBM
that integrates various technologies to meet the Big Data needs of a healthcare
organization. This solution has the following components: IBM Cognos Business
Intelligence - a business intelligence suite; IBM PureData System - a highly scalable
system that combines servers, databases, storage, and others; IBM Healthcare Provider
Data Model - a set of data models and business solution models; IBM InfoSphere
Information Server for DataWarehouse - a data integration platform that supports the
capture, integration and transformation of large volumes of structured or unstructured
data [16].

3.5 Cisco Connected Health Solutions and Services

The infrastructure developed by Cisco Systems, Inc. integrates multiple services
into a single solution that is intended to meet "all" the needs of a healthcare
organization. The official Cisco Systems, Inc. website lists the various applications
and services offered. The services are divided into 6 categories: Personalized
service to the user; Remote assistance and collaboration; Simplify clinical workflows;
Increase efficiency in the workplace; Connect the research and development
department with the production; Enable security and compliance. This solution was
presented to the scientific community by Nambiar et al. (2013).

4 Benchmarking

In this section we compare the three selected architectures and, more specifically,
their components from the "Data Processing" layer. The architectures were selected
based on their license type: only open source solutions are compared.

Table 1 - Comparative table of the selected architectures

Layer | Big Data Architecture with Apache Spark | Big Data Architecture designed for the Maharaja Yeshwantrao Hospital | Big Data Architecture for the INTCare Project
Data Storage | HDFS; Apache HBase | HDFS; Apache HBase | HDFS; Apache HBase
Data Processing | Apache Spark; Spark Streaming | Hadoop MapReduce | Apache Kafka; Apache Storm; Apache Phoenix
Management | No information | Apache Zookeeper; Apache Chukwa | Apache Flume; Apache Oozie
Data Access | Apache Hive | Apache Hive; Apache Pig; Apache Avro | Apache Hive; Apache Sqoop
Analytical and Business Intelligence | Spark SQL; MLlib; GraphX; SparkR | No information | Knowledge Management subsystem of the INTCare project
Security | No information | No information | Apache Knox; Apache Ranger

As can be seen from Table 1, in the "Data Storage" layer all solutions rely on
HDFS and Apache HBase.
In the "Data Processing" layer, the Big Data Architecture with Apache Spark
consists of Apache Spark and Spark Streaming, the Big Data Architecture designed for
the Maharaja Yeshwantrao Hospital uses Hadoop MapReduce, and the Big Data
Architecture for the INTCare Project consists of Apache Kafka, Apache Storm and
Apache Phoenix.
In the "Management" layer, the Big Data Architecture with Apache Spark does not
present any tool, the Big Data Architecture designed for the Maharaja Yeshwantrao
Hospital has Apache Zookeeper and Apache Chukwa, and the Big Data Architecture for
the INTCare Project has Apache Oozie and Apache Flume.
In the "Data Access" layer, Apache Hive is present in all architectures; the Big
Data Architecture designed for the Maharaja Yeshwantrao Hospital also includes
Apache Pig and Apache Avro, and the Big Data Architecture for the INTCare Project
also includes Apache Sqoop.
For the "Analytical and Business Intelligence" layer, the Big Data Architecture with
Apache Spark presents the Spark SQL, MLlib, GraphX and SparkR tools, the Big Data
Architecture for the INTCare Project has the knowledge management subsystem
developed for the INTCare project, and the Big Data Architecture designed for the
Maharaja Yeshwantrao Hospital does not present any tool.
Finally, in the "Security" layer only the Big Data Architecture for the INTCare
Project presents tools, which are Apache Knox and Apache Ranger.
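
The comparison in Table 1 can also be expressed as a small data structure, which
makes it easy to check which tools are shared by all architectures and which layers
each one covers. The sketch below is only an illustration in plain Python: the short
architecture keys (`spark`, `maharaja`, `intcare`) and the helper names
`shared_tools` and `architectures_covering` are hypothetical, and empty sets stand in
for the "No information" cells.

```python
# Table 1 transcribed as: architecture -> layer -> set of tools.
# The keys and helper names are illustrative, not taken from the cited papers.
# An empty set stands for a "No information" cell.
ARCHITECTURES = {
    "spark": {
        "Data Storage": {"HDFS", "Apache HBase"},
        "Data Processing": {"Apache Spark", "Spark Streaming"},
        "Management": set(),
        "Data Access": {"Apache Hive"},
        "Analytical and Business Intelligence": {"Spark SQL", "MLlib", "GraphX", "SparkR"},
        "Security": set(),
    },
    "maharaja": {
        "Data Storage": {"HDFS", "Apache HBase"},
        "Data Processing": {"Hadoop MapReduce"},
        "Management": {"Apache Zookeeper", "Apache Chukwa"},
        "Data Access": {"Apache Hive", "Apache Pig", "Apache Avro"},
        "Analytical and Business Intelligence": set(),
        "Security": set(),
    },
    "intcare": {
        "Data Storage": {"HDFS", "Apache HBase"},
        "Data Processing": {"Apache Kafka", "Apache Storm", "Apache Phoenix"},
        "Management": {"Apache Flume", "Apache Oozie"},
        "Data Access": {"Apache Hive", "Apache Sqoop"},
        "Analytical and Business Intelligence": {"INTCare Knowledge Management subsystem"},
        "Security": {"Apache Knox", "Apache Ranger"},
    },
}


def shared_tools(layer):
    """Return the tools that every architecture lists for the given layer."""
    tool_sets = [layers[layer] for layers in ARCHITECTURES.values()]
    return set.intersection(*tool_sets)


def architectures_covering(layer):
    """Return the architectures that list at least one tool for the layer."""
    return sorted(name for name, layers in ARCHITECTURES.items() if layers[layer])


if __name__ == "__main__":
    print(shared_tools("Data Storage"))
    print(architectures_covering("Security"))
```

Under this encoding, the storage layer is the only one that is identical across the
three architectures, and only one architecture lists any security tool.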
4.1 Comparison between Hadoop MapReduce and Apache Spark

This subsection presents the differences between the two data processing frameworks,
Hadoop MapReduce and Apache Spark. It then reviews two experiments that compare the
performance of the two frameworks in several scenarios.
Table 2 shows the main differences between MapReduce and Apache Spark.

Table 2 – Main differences between Hadoop MapReduce and Apache Spark [18].

Hadoop MapReduce | Apache Spark
Stores data on disk | Stores the data in memory. The data is first stored in memory and then processed
Computing based on disk memory, partial use of RAM (Random Access Memory) | Computing based on RAM memory, partial use of disk memory
Fault tolerance is achieved through replication | Fault tolerance is achieved through RDDs
Difficult to process and analyze data in real time | Can be used to modify data in real time
Inefficient for applications that need to constantly reuse the same dataset | Stores the dataset in RAM for efficient reuse

Shi et al. (2015) conducted an experiment to compare the performance of the
two frameworks. The experiment consisted of the execution of several workloads
(WordCount, Sort, K-means) that simulate the real-world use of these frameworks.
Apache Spark performed better in the execution of WordCount: it was 34 s faster for
a 1 GB input, 110 s faster for a 40 GB input and 398 s faster for a 200 GB input.
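
WordCount, the first workload in this experiment, is the canonical example of the
map and reduce pattern that both frameworks implement. The following is a minimal
single-process sketch in plain Python, not Hadoop or Spark code; the function names
`map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and chosen
only for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values to obtain the count for each word."""
    return {word: sum(values) for word, values in groups.items()}


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big Data in healthcare"]))
```

In a real cluster the map and reduce steps run in parallel on different nodes; the
sketch only shows the logical data flow of the workload.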
When executing Sort on a 1 GB input, Apache Spark completed the task 3 s faster than
Hadoop MapReduce. For inputs of 100 GB and 500 GB, however, MapReduce was faster,
by 1.5 min and 20 min respectively.
For both the first and subsequent iterations of the K-means execution, Apache Spark
presents shorter execution times. Notably, the difference grows in the subsequent
iterations due to the caching mechanism of Apache Spark.
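
The effect of caching on iterative workloads can be illustrated with a toy example.
The sketch below is plain Python and does not use Spark; the `Dataset` class, its
`parse_calls` counter and the `iterate_mean` job are hypothetical stand-ins for a
cached dataset and an iterative algorithm such as K-means. It only shows that, with
caching, the input is parsed once instead of once per iteration.

```python
class Dataset:
    """Toy dataset that optionally keeps its parsed form in memory."""

    def __init__(self, raw_lines, cache=False):
        self.raw_lines = raw_lines
        self.cache = cache
        self.parse_calls = 0  # how many times the raw input was parsed
        self._cached = None

    def points(self):
        """Return the parsed values, reusing the cached copy when enabled."""
        if self.cache and self._cached is not None:
            return self._cached
        self.parse_calls += 1
        parsed = [float(line) for line in self.raw_lines]
        if self.cache:
            self._cached = parsed
        return parsed


def iterate_mean(dataset, iterations):
    """Toy iterative job: each pass reads the data and moves the estimate
    halfway towards the mean of the dataset."""
    estimate = 0.0
    for _ in range(iterations):
        values = dataset.points()
        mean = sum(values) / len(values)
        estimate += 0.5 * (mean - estimate)
    return estimate


if __name__ == "__main__":
    uncached = Dataset(["1", "2", "3"])
    cached = Dataset(["1", "2", "3"], cache=True)
    iterate_mean(uncached, 5)
    iterate_mean(cached, 5)
    print(uncached.parse_calls, cached.parse_calls)
```

Both runs produce the same estimate; only the number of times the input is parsed
differs, which is the cost that grows with the number of iterations.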
Gu & Li (2014) conducted an experiment to compare the performance of Hadoop
MapReduce and Apache Spark in performing iterative tasks. PageRank was the
algorithm chosen for the experiment.
Runtimes varied depending on the size of the dataset. For small datasets (between 1
and 11 MB), Apache Spark completed the tasks 25 to 40 times faster. For datasets
between 40 MB and 89 MB, Apache Spark was about 10 to 15 times faster than
MapReduce. For datasets between 200 and 600 MB, Apache Spark was 3 to 5 times faster
than MapReduce. When the dataset size exceeded 1 GB, MapReduce performed better than
Apache Spark and, in some cases, Apache Spark failed during the execution while
MapReduce completed the task.

4.2 Comparison between Apache Spark and Apache Storm

The experiment conducted by Lu et al. (2014) consisted of the execution of 7
workloads (Identity, Sample, Projection, Grep, WordCount, DistinctCount, Statistics)
that simulate various real-world uses of these frameworks. The results show that
Apache Spark Streaming achieves better throughput (average number of records
processed per second), whereas Apache Storm achieves better latency (average interval
from the arrival of each record until the end of its processing), except in the
WordCount workload. As for the ability to handle data, Apache Storm presents worse
results than Apache Spark Streaming.
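
The two metrics used in this comparison can be computed from per-record timestamps.
The sketch below assumes each record is a hypothetical `(arrival, done)` pair in
seconds and that throughput is measured over the window from the first arrival to
the last completion; the cited benchmark may define the measurement window
differently.

```python
def throughput(records):
    """Average number of records processed per second.

    Assumption: the window runs from the first arrival to the last completion.
    """
    start = min(arrival for arrival, _ in records)
    end = max(done for _, done in records)
    return len(records) / (end - start)


def mean_latency(records):
    """Average interval between a record's arrival and the end of its processing."""
    return sum(done - arrival for arrival, done in records) / len(records)


if __name__ == "__main__":
    # Three illustrative records: (arrival time, completion time) in seconds.
    sample = [(0.0, 0.5), (1.0, 1.5), (2.0, 4.0)]
    print(throughput(sample))
    print(mean_latency(sample))
```

The two metrics are independent: a framework that processes records in batches can
reach high throughput while each individual record waits longer, which is consistent
with the trade-off reported above.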

5 Discussion

Based on this study, it is possible to conclude that there is not much scientific
documentation about the implementation of Big Data technologies in hospitals/health
clinics. It is also possible to conclude that the approval of the scientific community can
help to overcome some of the challenges that are presented to the adoption of Big Data
technologies in the health area.
Given the results obtained in the analyzed experiments, it is possible to conclude that:
• The Big Data Architecture for the INTCare Project is best suited to handle
streaming data. This solution combines Apache Kafka and Apache Storm to
handle data from bedside monitors (vital signs, ventilation and others) [9];
• The Maharaja Yeshwantrao Hospital architecture is best suited to handle large
volumes of data. Although Hadoop MapReduce performs worse than Apache Spark
in most of the tests presented in section 4.1, it was able to handle the
largest datasets;
• The Big Data Architecture with Apache Spark is a hybrid solution, as it has
proven capable of handling both streaming and batch data. However, the
performance of Apache Spark is very dependent on the configuration of the
cluster.
Although it was not possible to make a direct performance comparison of all the
solutions chosen for benchmarking, it is possible to conclude that the most appropriate
architecture for a healthcare organization is the Big Data Architecture for the INTCare
Project. This solution presents in detail all the components and how they interact
with the system in which they are inserted and, more importantly, it is the only
solution that presents components in the "Security" layer (as can be seen in Table 1).
Since security is one of the challenges to implementing Big Data in healthcare, it is
considered necessary to integrate tools that ensure data and system security in general.

6 Conclusion and Future Work

This project made it possible to understand the state of implementation of Big Data
technologies in healthcare, its potential and its main challenges. Finding
applications used in, or designed for, hospitals and health clinics proved to be the
most challenging task of this project, due to the scarcity of literature on the
implementation of Big Data solutions in healthcare. Reviewing experiments carried out
on applications similar to those chosen for comparison made it possible to evaluate
their performance in several scenarios and, therefore, to perceive the strengths and
weaknesses of the chosen solutions. The survey of Big Data technologies used in
healthcare revealed the variety of solutions to be explored and showed that there is
no ideal solution that can satisfy all needs. Some areas remain to be explored in the
future, including the study of similar Big Data solutions that have not yet been
presented to the scientific community and the execution of practical tests, with real
data, on the tools presented.

Acknowledgments

This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and
FCT – Fundação para a Ciência e Tecnologia within the Project Scope:
UID/CEC/00319/2013

References

[1] I. Yaqoob et al., “Big data: From beginning to future,” Int. J. Inf. Manage., vol.
36, no. 6, pp. 1231–1247, 2016.
[2] S. Sagiroglu and D. Sinanc, “Big data: A review,” 2013 Int. Conf. Collab.
Technol. Syst., pp. 42–47, 2013.
[3] W. Raghupathi and V. Raghupathi, “Big data analytics in healthcare: promise
and potential,” Heal. Inf. Sci. Syst., vol. 2, p. 3, 2014.
[4] B. Feldman, E. M. Martin, and T. Skotnes, “Big Data in Healthcare - Hype and
Hope,” Dr.Bonnie 360 degree (bus. Dev. Digit. Heal., vol. 2013, no. 1, pp. 122–
125, 2012.
[5] P. Zikopoulos, C. Eaton, D. DeRoos, T. Deutsch, and G. Lapis, Understanding
Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. 2012.
[6] J. Hurwitz, A. Nugent, D. F. Halper, and M. Kaufman, Big Data For Dummies.
2013.
[7] C. Taurion, Big Data. 2013.
[8] J. Manyika et al., “Big data: The next frontier for innovation, competition, and
productivity,” McKinsey Glob. Inst., no. June, p. 156, 2011.
[9] A. Gonçalves, F. Portela and M.F. Santos. Towards of a Real-time Big Data
Architecture to Intensive Care. Procedia Computer Science - ICTH 2017 -
International Conference on Current and Future Trends of Information and
Communication Technologies in Healthcare. pp. 585-590. ISSN: 1877-0509 .
Elsevier. 2017.
[10] F. Portela, M. F. Santos, J. Machado, A. Abelha, Á. Silva, and F. Rua,
Pervasive and Intelligent Decision Support in Intensive Medicine – The
Complete Picture. 2014.
[11] T. Guarda, M. F. Augusto, O. Barrionuevo, and F. M. Pinto, “Internet of Things
in Pervasive Healthcare Systems,” in Next-Generation Mobile and Pervasive
Healthcare Solutions, 2018, pp. 22–31.
[12] T. Guarda, W. Orozco, M. F. Augusto, G. Morillo, S. A. Navarrete, and F. M.
Pinto, “Penetration Testing on Virtual Environments,” Proc. 4th Int. Conf. Inf.
Netw. Secur. - ICINS ’16, no. Vmm, pp. 9–12, 2016.
[13] W. Liu, Q. Li, Y. Cai, Y. Li, and X. Li, “A Prototype of Healthcare Big Data
Processing System Based on Spark,” no. Bmei, pp. 516–520, 2015.
[14] M. Ojha and K. Mathur, “Proposed application of big data analytics in
healthcare at Maharaja Yeshwantrao Hospital,” 2016 3rd MEC Int. Conf. Big
Data Smart City, ICBDSC 2016, pp. 40–46, 2016.
[15] S. M. Krishnan, “Application of analytics to big data in healthcare,” Proc. -
32nd South. Biomed. Eng. Conf. SBEC 2016, pp. 156–157, 2016.
[16] IBM, “IBM PureData Solution for Healthcare Analytics,” 2013.
[17] R. Nambiar, A. Sethi, R. Bhardwaj, and R. Vargheeseh, “A Look at Challenges
and Opportunities of Big Data Analytics in Healthcare,” pp. 17–22, 2013.
[18] A. Verma, A. H. Mansuri, and N. Jain, “Big data management processing with
Hadoop MapReduce and spark technology: A comparison,” 2016 Symp.
Colossal Data Anal. Networking, CDAN 2016, 2016.
[19] J. Shi et al., “Clash of the Titans: MapReduce vs. Spark for Large Scale Data
Analytics,” no. 3, pp. 2110–2121, 2015.
[20] L. Gu and H. Li, “Memory or time: Performance evaluation for iterative
operation on hadoop and spark,” Proc. - 2013 IEEE Int. Conf. High Perform.
Comput. Commun. HPCC 2013 2013 IEEE Int. Conf. Embed. Ubiquitous
Comput. EUC 2013, pp. 721–727, 2014.
[21] R. Lu, G. Wu, B. Xie, and J. Hu, “Stream bench: Towards benchmarking
modern distributed stream computing frameworks,” Proc. - 2014 IEEE/ACM
7th Int. Conf. Util. Cloud Comput. UCC 2014, pp. 69–78, 2014.
