

Big Data Analytics in Healthcare

M. Ambigavathi (1) and D. Sridharan (2)
(1) Research Scholar, Department of ECE, CEG Campus, Anna University, Chennai, India. ambigaindhu8@gmail.com
(2) Professor, Department of ECE, CEG Campus, Anna University, Chennai, India. sridhar@annauniv.edu

Abstract—The pace of both digital innovation and technology disruption is refining the healthcare industry at an exponential rate. The volume of healthcare data continues to mount every second, making it increasingly difficult to find any form of useful information. Recently, big data has been shifting the traditional way of data delivery toward valuable insights through big data analytics. Big data analytics provides many benefits in the healthcare sector: it helps to detect critical diseases at an early stage and to deliver better healthcare services to the right patient at the right time, thereby improving the quality of care. Big data analytics tools play an essential role in rapidly analyzing and integrating the large volumes of structured, semi-structured and unstructured vital data produced by clinics, hospitals, other social web sources and medical data lakes. However, several issues remain to be addressed in current health data analytics platforms, which offer technical mechanisms for data collection, aggregation, processing, analysis, visualization, and interpretation. Owing to the lack of a detailed study in the previous literature, this article inspects the promising field of big data analytics in healthcare. It examines the unique characteristics of big data, big data analytical tools, and the different phases followed by the healthcare industry from data collection to the data delivery stage. Further, it briefly summarizes the open research challenges with feasible findings, and then offers the conclusion.

Keywords—Big data; Big data analytics; Big data analytics tools; Healthcare applications

I. INTRODUCTION

In today's digital era, the increasing rates of chronic diseases, continued population growth, and the inability to process and obtain valuable information from diverse health-related data sets are some of the main reasons for adopting new technology in the healthcare sector to facilitate and provide evidence-based medicine (EBM) [1-3]. Big data is an emerging technology that can modernize the conventional healthcare system and move the current healthcare industry forward on several fronts [4]. Data from different sources such as mobile phones, body area sensors, patients, hospitals, researchers, healthcare providers, and organizations are currently generating an immense volume of medical data [5]. The data amassed from these sources come in different forms, including electronic health records (EHRs), medical imaging (MI), genomic sequencing (GS), clinical records (CR), pharmaceutical research (PR), and wearable and medical devices (MD). These big health data are typically stored in a medical server (MS), clinical database (CDB), and other clinical data repositories (CDR) for the next level of analysis [6]-[7]. The storage infrastructures are primarily used to store, process, analyze, manage and retrieve the huge amounts of data in order to make them easier for people to use. They therefore not only provide the information needed to understand symptoms, illnesses and treatments, but also help to raise alerts, predict outcomes at initial stages and make the right decisions.

Big data analytics is a new paradigm, mainly designed to analyze, manage and precisely extract, in a very short time, the useful information from large volumes of data sets that are closely related to a particular patient [8]. Common software tools used as part of this advanced analytics strategy include predictive analytics, data mining, text analytics and statistical analysis. Moreover, this modern technology-based analytics method transforms the healthcare industry so that the right decision can be taken for the right patient at the right time [9]. These factors motivated us to survey the distinct characteristics of big data, the life cycle of big data analytics, big health data analytical tools, and the recent open research challenges of big data analytics in healthcare. The main contributions of this paper are summarized as follows:

• To study the distinct characteristics of big data analytics in healthcare
• To study the various big data analytics tools and platforms used to improve the quality of patient care
• To present the different levels of big data analytics and their key components
• To describe the open research challenges faced by the healthcare industry and feasible solutions with future directions

The remainder of this paper is organized as follows: Section 2 presents the unique characteristics of big data analytics in healthcare. Section 3 describes the different phases of big data analytics and their key components. The big data analytics platforms and tools used in the healthcare sector are listed in Section 4. Open research challenges with possible solutions in healthcare are summarized in Section 5. Finally, this article is concluded with future directions in Section 6.


II. UNIQUE CHARACTERISTICS OF BIG HEALTH DATA

This section describes the evolution of the V's and the characteristics of big data in healthcare applications. The characteristics of big health data are mainly associated with issues such as capture, cleaning, curation, integration, storage, processing, indexing, search, sharing, transfer, mining, analysis, and visualization. Researchers have, at different times, recommended a wide range of V's, such as volume, variety, velocity, veracity and validity, for different applications [10-13]. Some researchers have also suggested that this list will grow to as many as 100 V's in the near future for efficient big data analytics. The other important characteristics of big data in healthcare applications are summarized as follows [14]:

A. Value
Value refers to extracting valuable information from the stored medical data by using analytics tools and techniques. For instance, this type of analytics helps medical researchers gain new insights and provides ever-increasing value for patients as more data becomes available. Moreover, the true value depends on the opinion of the patients, and sometimes the collective information carries different values, which may lead to more risk.

B. Variability
Health-related data collected from various sources change at different time intervals. If the data variety changes during processing and over the lifecycle, inconsistency will also increase in the context of providing unexpected, hidden and valuable information.

C. Venue
Generally, big health data is collected in a distributed manner. The heterogeneous data amassed from various sources, such as healthcare organizations, medical practitioners, individuals and medical databases, via different platforms may require different access methods, locations and formats to store or process the medical data.

D. Vocabulary
Different data validation and modelling approaches are required to describe the data structure, including schemas, data models, semantics, ontologies, taxonomies, and other framework-based metadata. The data scientist therefore needs to focus on different validation techniques to tackle the variety of problems in healthcare applications.

E. Vagueness
The gathered data is often very unclear regardless of how much data is available. This creates confusion over the meaning of big health data and over which analytical tools should be used, in spite of data availability.

F. Valor
Based on the incoming medical data, the data scientist must be willing to handle and tackle the big problems.

G. Vane
After processing the data, a data scientist must remove the health-related data that are least important or suitable, which helps in making the right decisions.

H. Vanilla
Even a simple analytical model, if constructed with rigor, can provide useful value.

I. Vantage
Big data allows patients or physicians a privileged view of complex medical systems.

J. Varifocal
Big data analytics allows patients to view a nearly infinite amount of insightful information, whether close, transitional, or otherwise inaccessible to the patients or physicians, in order to make the right suggestions.

K. Varmint
As the amount of medical data gets bigger, software bugs may gradually increase.

L. Vanish
This indicates how patients interact with the data scientist, doctors, and other analysts on the subject of the diagnosis.

M. Versed
To resolve the various problems faced during data analysis, analysts often need broad knowledge of mathematics, statistics, programming, databases, and related fields.

N. Vastness
The bigness of big health data is accelerating with the recent advancement of the Internet of Things (IoT) in the healthcare industry.

O. Vault
With many healthcare applications based on large and often sensitive data sets, data security is progressively more important.
III. VARIOUS PHASES OF BIG DATA ANALYTICS

This section discusses the key roles and the five phases [15-16] of big data analytics in the evolution of healthcare applications and research. Big data analytics is especially used for disease exploration, with the support of analytical tools and techniques, so as to collect, process, analyze, inspect and manage the large volumes of structured, unstructured and semi-structured data created by current healthcare systems and organizations. The various components involved in big health data analytics, from data mining to the knowledge discovery process, are summarized as follows:

A. Phase 1: Data collection, accumulation, and storage
The first phase acquires heterogeneous volumes of healthcare data from billions of sources, both internal and external. The accumulated data may be in different formats and of different types, as mentioned earlier. The data is then transferred to the system for analysis, or stored in databases or a data warehouse. In fact, there are many challenges in handling healthcare data, owing to the lack of data protocols and data standards, scalability, data privacy issues and so on. Besides, it is very difficult to find the right metadata to describe what kind of health data is stored and how it is measured. A metadata acquisition system that depicts and gives information about other types of medical data supports knowledge discovery and identification, but it can also increase the burden of analyzing the metadata. Another important issue is data cleaning: if stored information that is not useful is passed through the entire data analysis phase, it increases the processing error.

B. Phase 2: Data cleaning, extraction, and classification
The second phase extracts the medical information and stores it in a single database. Data cleaning is the process of detecting and removing inaccurate health-related records. Often, the information collected from sensors, physicians' prescriptions, medical image data, and social networking data will not be in the correct format, whereas the data needs to be in a structured format for suitable analysis. A continuous challenge during this phase is removing and adding missing values. For instance, the received data may also include medical images (i.e. MRI, CT, PET/CT, and ultrasound); in such cases data retrieval is often highly application dependent and very difficult to filter based on structure. These data should be classified as structured, semi-structured and unstructured data to perform meaningful analysis.

C. Phase 3: Data integration, aggregation, and representation
The source information for the data aggregation process may originate from internal and external health databases. This phase aggregates the amassed voluminous and varied medical data so that it can be used effectively for data analysis. The main purpose is to get detailed information about particular health records based on specific variables such as the date when a record was created, its similarity to other records, critical status, the patient's name, the history of past readings, etc. Finally, the aggregated information is transferred to hospitals, data scientists and researchers, as well as to local, state, other remote and government healthcare agencies. Health-related records are very sensitive in nature, and integrating dynamic medical information with already existing static information is not an easy task in a real-time environment. At the same time, healthcare professionals want the patient's information to be accurate and up-to-date in order to treat diseases. Furthermore, the representation of meaningful data should remain manageable in size, and even after data reduction and noise removal the important characteristics of the original health information should not be changed. Therefore, careful selection of a data representation model may result in more meaningful information.

D. Phase 4: Data modeling, analysis and query processing
The data modeling phase processes complex medical information into an easily understandable form using diagrams, text, and symbols. It is mostly used to view identical health-related data and to ensure that all processes, entities, relationships, and data flows have been identified. Several data modeling approaches are used, including the conceptual, physical and logical levels. Modeling ensures data consistency in the original values, semantics, and security, while ensuring the quality of the data. The analysis of the data aims to find useful information in the healthcare data sets using various analytical methods and technologies, such as data mining algorithms. One of the crucial problems with current big data analysis is the inability to use different database systems together efficiently. Once the medical information is integrated and analyzed, the next step is to query the valuable data. Query processing is the way of responding to user-level queries. These may be simple, or may range from high-level to low-level queries issued by physicians, families or even individuals regarding their health status. Based on the complexity of the query, the big data analyst must select a suitable platform and analysis tools.

E. Phase 5: Data interpretation, delivery, and feedback
The final phase of big data analytics is data interpretation, data delivery, and feedback. Data interpretation is very important after all the above processing steps have been completed on the data sets. The interpretation of the healthcare data results should be very clear; if it is not, patients and other healthcare professionals cannot understand the interpreted results provided by the data analyst, the decision maker or even computer systems. Moreover, this interpretation involves investigating all the assumptions made and retracing the data analysis, so a decision-maker has to inspect the difficulties in the many assumptions made at the various phases of analysis. The data delivery step helps to generate a health report based on the previous data model. This model will assist caregivers or medical doctors in taking the necessary treatment to avoid any further complications. In the last step, feedback is obtained from the patients and decision makers in order to improve the quality of patient care.
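For illustration only, the following minimal Python (pandas) sketch walks through the five phases above on a hypothetical vital-signs extract; the file name, column names and alert thresholds are assumptions made for this example and are not taken from any specific healthcare platform.

```python
import pandas as pd

# Phase 1: collection - load accumulated records (hypothetical CSV export of vital signs)
records = pd.read_csv("vital_signs.csv", parse_dates=["timestamp"])

# Phase 2: cleaning - drop exact duplicates and impute missing heart-rate values
records = records.drop_duplicates()
records["heart_rate"] = records["heart_rate"].fillna(records["heart_rate"].median())

# Phase 3: integration/aggregation - one summary row per patient per day
records["date"] = records["timestamp"].dt.date
daily = (records.groupby(["patient_id", "date"])
                .agg(avg_hr=("heart_rate", "mean"), max_temp=("temperature", "max"))
                .reset_index())

# Phase 4: analysis/query - flag days whose summaries exceed illustrative thresholds
alerts = daily[(daily["avg_hr"] > 100) | (daily["max_temp"] > 38.0)]

# Phase 5: interpretation/delivery - a simple report handed to the caregiver
print(alerts.to_string(index=False))
```

In a production setting each phase would run on the distributed platforms surveyed in the next section rather than on a single machine.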
IV. BIG DATA ANALYTICS TOOLS

Processing vast quantities of healthcare information with traditional data processing techniques and tools is very difficult in real time [17-18]. The various big data analytical tools and their functionalities are summarized in Table 1; hundreds of analytical tools are being introduced for data analysis today. This analysis process helps to inspect, clean, transform, and model the data in order to extract useful information, suggest the best possible solutions, and support the right decision making in critical situations [19-20].

TABLE I. DIFFERENT BIG DATA ANALYTICS TOOLS

1. Hadoop Distributed File System (HDFS): The primary data storage file system that manages data processing and storage for big health data applications running on clustered medical systems. It divides a large healthcare data set into smaller ones and scatters them across different medical servers.
2. MapReduce: A high-level parallel programming model which divides a large task into smaller tasks and combines their results, keeping track of the progress of each server or node at the same time.
3. Complex Event Processing: Recently introduced in hospitals to detect new patterns for a patient from the currently changing events and to predict the associated events in real time.
4. Text Mining: Used in different ways to analyze the clinical records of the patient and physician responses from the hospitals in order to devise the necessary treatment plan for the patient.
5. Pig: Mostly used to analyze large big health data sets (i.e. structured or unstructured) from various data sources and to store the results in HDFS.
6. Hive: Primarily used to manage, process and organize huge healthcare data sets in HDFS. It sits on top of Hadoop and is used to review, query and analyze large sets of immutable data with the Hive query language (QL).
7. Jaql: A functional query language especially used to perform conversion-based analysis (i.e. transformation, grouping and aggregation) on large big health data sets. To simplify parallel processing, it translates high-level queries into low-level ones.
8. ZooKeeper: Coordinates the synchronization of multiple nodes in a cluster, sends configuration attributes to a particular node or to all nodes, elects a leader node among the multiple nodes, and provides reliable communication between and among the nodes in the cluster.
9. HBase: A typical non-relational (i.e. NoSQL) database model placed on top of HDFS. It distributes huge healthcare data sets across Hadoop frameworks and provides random healthcare data access and querying capabilities to end users.
10. Cassandra: A distributed NoSQL database especially designed to handle large volumes of big health data across many Hadoop servers. It guarantees reliable data service with no single point of failure.
11. Oozie: A scalable and reliable workflow scheduler system for managing Hadoop jobs. These jobs are triggered at pre-determined intervals or by healthcare data availability and are specified as directed acyclic graphs (DAGs) of actions to execute.
12. Apache Solr: A NoSQL search platform that provides centralized configuration, high reliability, fault tolerance, distributed indexing, load-balanced querying, automatic recovery from failure, and more.
13. Lucene: Primarily used for indexing, analyzing and searching. It is an open-source, cross-platform library that supports high-performance multi-index search and information retrieval.
14. Avro: A popular data serialization and deserialization system in Hadoop systems. It simplifies the exchange of big health data between programs written in any language.
15. Mahout: A declarative machine learning library for distributed big health data flow systems. Recently, Samsara was added to the Apache Mahout library for real-time analytics.
16. Apache YARN: Mainly used for monitoring the execution of MapReduce jobs, allocating system resources and scheduling tasks to be executed on different cluster nodes.
17. Advanced Data Visualization: Offers additional analytics capabilities and allows patients or doctors to access the analyzed data and understand its significance in eye-catching and understandable formats.
18. Presto: A distributed structured query engine for fast, interactive analytic queries against data of any size (from gigabytes to petabytes). It can perform data analytics across different medical servers and thus combine health-related data from multiple sources.
19. Vertica: A distributed analytical database created to work on Hadoop clusters. It is a cost-effective, column-oriented storage platform designed to handle large volumes of health-related data, and it enables very fast patient-related query performance in critical scenarios.
20. Key Performance Indicators: General metrics used to take crucial, data-driven decisions on medical records. They also help medical practitioners track and refer to other doctors through social recommendations in order to find well-qualified doctors to monitor the patient.
21. HCatalog: A metadata and table management tool for the Hadoop framework. Hadoop can process both structured and unstructured medical data, and information about the structure of medical data can be stored and shared in HCatalog in any format, regardless of structure.
22. Chukwa: A flexible, powerful real-time data collection system for monitoring, displaying and analyzing large distributed systems. It is located on top of the HDFS and MapReduce framework.
23. Flume: A data ingestion tool for collecting, aggregating and moving large amounts of medical records from various data producers to a Hadoop system.
24. Falcon: A data management tool for handling medical data pipelines in Hadoop clusters and performing complex processing jobs.
25. Kafka: A data streaming platform run as a cluster on one or more medical servers; it stores streams of medical records, where each record consists of a key, a value, and a timestamp.
26. Accumulo: A fine-grained data access tool that provides extremely fast access to medical records stored in a large table with billions of rows and millions of columns, down to the individual cell.
27. Storm: Primarily designed to process huge amounts of data, similarly to Hadoop, but as a real-time distributed streaming data framework capable of very high data ingestion rates.
28. Atlas: A data governance tool which helps to gather, process, maintain and exchange metadata with other tools within and outside the Hadoop stack. It offers a scalable and extensible set of core services to form a series of data properties.
29. Tez: A framework used for developing high-performance batch and interactive big data processing applications in real time. Tez uses a complex directed acyclic graph (DAG) for processing data.
30. Sqoop: A popular tool designed to import data between Hadoop and RDBMS servers, as well as to export big health data from HDFS to an RDBMS so that it can be easily accessed and used, and then upload it again to HDFS.
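To make the MapReduce model listed above concrete, the following self-contained Python sketch mimics the map, shuffle and reduce steps for computing an average heart rate per patient. It illustrates the programming model only, with hypothetical records; it is not Hadoop's actual API.

```python
from collections import defaultdict

# Hypothetical input split into blocks, as HDFS would split a large file
blocks = [
    [("p1", 72), ("p2", 95)],
    [("p1", 80), ("p2", 101), ("p3", 64)],
]

# Map step: emit (key, value) pairs from each block independently (parallelizable)
def map_block(block):
    return [(patient_id, heart_rate) for patient_id, heart_rate in block]

# Shuffle step: group all emitted values by key
grouped = defaultdict(list)
for block in blocks:
    for key, value in map_block(block):
        grouped[key].append(value)

# Reduce step: combine the values for each key into the final result
averages = {patient: sum(rates) / len(rates) for patient, rates in grouped.items()}
print(averages)  # {'p1': 76.0, 'p2': 98.0, 'p3': 64.0}
```

The same split-apply-combine structure underlies several of the higher-level tools in the table; Pig and Hive, for example, compile their queries into jobs of this shape.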
V. OPEN RESEARCH CHALLENGES AND SOLUTIONS

Big health data faces many issues at each and every stage of the analytical process, even though it has great potential to accelerate the traditional procedures of healthcare data delivery services.

A. Capturing and cleaning big health data from diverse sources
Real-time medical data comes from both internal and external healthcare providers for data analysis, and the captured data sets will be in different formats and of different sizes. Such dirty data can often derail big data analytics, so the data sets should be clean, complete, accurate, and in the correct format before being forwarded to the various medical systems or physicians [21]. Further, the structured, unstructured and semi-structured data must be integrated by eliminating disparities and inconsistencies. In this way, the data scientist can resolve new errors, eliminate data inconsistencies, and consistently prevent data loss so that big data can be exploited properly. Data cleaning tools (e.g. RapidMiner, WinPure, OpenRefine, Drake, and DataWrangler) must work with messy data and adapt to current technologies in order to make the actual data sets faultless, reduce the time spent processing raw data sets, and ensure high levels of accuracy and integrity in healthcare data warehouses.

B. Maintaining big health data storage and data quality
The massive quantity of data in the healthcare sector is growing exponentially at a very fast pace, and the storage and quality of medical data are becoming a real challenge for doctors, patients and clinical trials. Some healthcare providers are no longer able to manage the costs of, and control over, their medical data centers. Data lakes and data warehouses are commonly used to collect, store, and analyze large volumes of data in various formats. Coarse data from diverse sources increase data errors, data duplication and incorrect data linking, which demand a dedicated data cleansing process [22]. Consequently, physicians have to make decisions based on the most accurate data about the patient; if the analysed data is not accurate, it will result in ill-advised decisions that would ultimately be detrimental to the patients. The high reliance on data quality makes testing a high-priority issue, and it requires substantial resources to guarantee the accuracy of the stored information. The process of detecting and correcting inaccurate or incomplete medical records in a medical database and then replacing, modifying, or deleting the coarse data is very time consuming, and it therefore requires relevant analytical tools to provide more accuracy over the collected datasets.

C. Converting incomplete big health data into valuable insights
A large volume of data is rapidly amassed from an enormous number of sources [23]. If the data is in an incomplete format, computer systems cannot work effectively, because they store multiple items that are not identical in size and format; moreover, physicians cannot take the right decisions on the basis of incomplete data sets. The data needs to be analysed and classified by other relevant information, and attributes such as occupation or designation must take into account patients for whom that information is not known. For instance, the occupation of a patient is considered during the analysis only if that information is available for the particular patient; otherwise it is left out and the known parameters of the patient are used in the analysis process. Health-related information is often collected by, and scattered across, many payers, hospitals, administrative offices, government agencies, servers and file cabinets. Even after data cleansing and error correction, some incompleteness and errors are likely to remain, and these must be eliminated during the data analysis. It is necessary for healthcare organizations to gain important insights from big health data analysis, so a proper analysis mechanism is needed to find useful or valuable insights. Further, the bigger and more diverse the incomplete healthcare data sets are, the more difficult it becomes to incorporate them into an analytical platform.
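A minimal pandas sketch of the attribute-handling rule described above (use an optional attribute such as occupation only for patients where it is known, and fall back to the attributes that are always present otherwise); the column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical patient records; occupation is only known for some patients
patients = pd.DataFrame({
    "patient_id": ["p1", "p2", "p3"],
    "age":        [54, 61, 47],
    "occupation": ["miner", None, "teacher"],
})

core_features = ["age"]             # parameters known for every patient
optional_features = ["occupation"]  # considered only where available

# Patients with a known occupation are analysed with the richer feature set ...
with_occupation = patients.dropna(subset=optional_features)[
    ["patient_id"] + core_features + optional_features]

# ... while the remaining patients fall back to the core parameters only
without_occupation = patients[patients["occupation"].isna()][
    ["patient_id"] + core_features]

print(with_occupation)
print(without_occupation)
```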
D. Scaling up and down of big health data according to current demand
Unlike conventional data collections, big health data grows and evolves constantly, yet healthcare organizations often ignore the fact that the data volume and workload grow rapidly. They must create an infrastructure that simplifies processing of the fresh datasets that arrive regularly. Recently, many hospitals have selected cloud platforms to store and manage big health data efficiently by using computing resources on demand; however, some big data solutions will not perform optimally on a cloud server. Therefore, the healthcare industry must address the challenge of scaling the medical data sets up and down according to the doctor's or patient's demand. In fact, the introduction of new processing tools and storage capacities will affect the actual analytical process; as a result, the complexity and performance of the systems will directly limit the scaling up of the big data sets [24]. The next crucial issue is that high-velocity, high-volume data sets require big data algorithms designed around the data growth and any modifications in the actual data sets. For instance, the integration of data streams coming from all healthcare sport services and other relevant data flows can easily reach millions of tuples per second, and a centralized server cannot process flows of this scale in real time. Thus, the main challenge is to build a distributed medical server, where each server stores and views the local data flow. These local views of the health-related data sets must then be aggregated and transmitted in order to build a global view of the data through off-line or online analysis.

E. Making faster and better decisions on timely big health data
The timeliness of data is one of the most important challenges in healthcare applications such as clinical decision support, hospitals and caregiving. Decisions should be simpler, faster and ultimately more accurate, because medical practitioners take decisions based on higher volumes of cleansed data that are more current and relevant. In some cases, doctors may need only a very limited report or analytic query. Data scientists and physicians must pay attention to the data and query structure to ensure the best possible results for the patients. Sometimes the healthcare datasets contain complex and varied events; in this case the data set has to be tuned without an overarching structure and converted into meaningful measures in real time for rapid analysis. Delay in processing complex medical data sets leads to lower-quality patient care [25]. Big data analysts must also obtain the right medical support for interpreting the results to the patients after the clinical data is analyzed, as the right clarification of the medical report is essential for making better decisions. Further, the computer system must be able to predict or suggest valuable and suitable doctors and other specialists to the user. Moreover, evaluating multiple spatial proximity queries and locating an exact result requires a novel indexing mechanism to support such medical queries. Finally, if the data volume grows rapidly, the queries issued by medical practitioners will have tight response-time limits.

F. Securing large voluminous big health data
Once healthcare organizations have the right to use big health data, it provides a wide range of opportunities to take the right decisions and choose the right therapies. However, it also involves big risks when it comes to data security and privacy, because the analytical tools analyze, extract, and utilize data from a wide variety of health providers. Further, the growing volume of healthcare data eventually leads to a risk of exposure of the data, making it highly vulnerable. Accordingly, it is very essential for analysts and data scientists to consider these security issues and deal with the data in such a way that privacy is not disrupted [26-27]. People generate and share personal healthcare data that are highly sensitive and not always protected by government regulation. Data security is one of the primary issues for healthcare organizations, especially in critical situations, including a series of high-profile breaches, hackings, and ransomware events.
The privacy of data is another important concern in the context of big health data. The management of privacy is both a technical and a sociological problem. For instance, an attacker can infer the identity of the query source from its location information, and a patient's location can even be tracked through cell towers. Besides, sensitive information is vulnerable to security threats such as the improper disclosure, unauthorized use and unauthorized destruction of patient data. There will be an endless array of vulnerabilities caused by fraudulent attempts to obtain sensitive information (i.e. phishing attacks). Consequently, healthcare organizations must protect sensitive information by using transmission security, authentication protocols, and data controls over access, integrity, and data auditing. They must also check security procedures such as using up-to-date anti-virus software, setting up firewalls, encrypting sensitive data, and using multi-factor authentication, and must frequently remind their clinical staff members of the critical nature of data security protocols.
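As one hedged illustration of the data controls discussed above, the short Python sketch below pseudonymizes a patient identifier with a keyed hash and encrypts a sensitive field before transmission. It relies on the standard library plus the widely used cryptography package; the field names and the in-line key handling are simplifying assumptions, not a recommended production design.

```python
import hmac
import hashlib
from cryptography.fernet import Fernet

SECRET_KEY = b"replace-with-a-managed-secret"  # assumption: real keys come from a key manager
fernet = Fernet(Fernet.generate_key())          # symmetric key for encrypting record fields

def pseudonymize(patient_id: str) -> str:
    """Replace a direct identifier with a keyed, non-reversible token."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()

def protect_record(record: dict) -> dict:
    """Prepare a record for transmission: pseudonymized ID, encrypted diagnosis."""
    return {
        "patient_token": pseudonymize(record["patient_id"]),
        "diagnosis": fernet.encrypt(record["diagnosis"].encode()),  # decrypt with the same key
    }

outgoing = protect_record({"patient_id": "MRN-00042", "diagnosis": "type 2 diabetes"})
print(outgoing["patient_token"][:16], "...")
```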
G. Collaborating and interacting with the medical practice
A big data analysis system collects input from multiple patient records, analyzes it and shares the explored results. The ability to query data is a fundamental task for reporting outcomes and analytics. Firstly, the interoperability problems that prevent query tools from accessing the healthcare industry's entire repository of information must be overcome. Many healthcare administrations use SQL to dig into large datasets and relational databases, but this is only an effective mechanism when a user can first trust the accuracy and completeness of the results. After healthcare providers have completed the query process, they must generate a report for the patient that is very clear, accurate and concise. Poor data will produce suspicious reports at the end of the analytical process, which can be harmful for a physician who wants to use the report to treat the patient [28].
At the point of quality care, clean and attractive data visualization (i.e. charts, figures, flowcharts, and callouts) is required to present the reported results so that doctors can easily absorb the medical information and apply it appropriately. The healthcare sector must therefore also consider good visualization tools, including heat maps, bar charts, pie charts, scatterplots, and histograms, to minimize potential confusion over the medical reports. The generated graphics should not overlap with text or be marred by low quality, which would lead patients to ignore or misinterpret healthcare data [29]. A big health dataset is a dynamic set that requires relatively frequent updates in order to remain current and relevant. For instance, updates to a patient's vital signs may occur every few seconds, whereas other updates, such as home address or marital status, might occur only a few times during a patient's entire lifetime.
Only a small number of patients receive all of their reports at a single location. This means that sharing sensitive data with external systems is essential, but the healthcare sector currently faces many issues in improving the sharing of medical data across technical and organizational boundaries. The healthcare industry must consider advanced tools, strategies and partnerships to make it easier for physicians and patients to share data easily and securely.

H. Reducing the cost and managing the policy and process
Storing and managing big health data sets requires additional cost according to the scale and the different resources involved. The real challenge starts with acquiring new hardware, paying a cloud provider, hiring developers and so on. The healthcare industry also needs to spend heavily on configuration, maintenance, the setup of new software, electricity and other overheads. Therefore, data storage and data processing in the medical database must also be cost-effective. Resorting to medical data lakes can provide cheap storage opportunities for the data sets, and optimization algorithms can decrease the computing energy consumption by a factor of 5 to 100. Further, once the medical data is validated and aggregated, various process- and policy-related issues need to be addressed. These policies and procedures protect health information and provide access control, authentication, and security during data transmission.

I. Lack of data scientists, standards and techniques
With the exponential rise of data, a huge demand for big data scientists and analysts has been created in the healthcare industry. It is important for healthcare organizations to appoint data scientists with the relevant skills and a sufficient amount of domain knowledge to identify valuable insights. The next challenging issue faced by the healthcare sector is the shortage of professionals who understand big data analysis; the healthcare industry therefore needs skilled professionals to manage and analyze the massive amount of real-time data being collected from various internal and external sources in multiple formats. Currently, big data scientists and analysts are few in number, yet they are in high demand in the healthcare industry for managing large datasets efficiently.
Still, there are no fixed standards followed by the healthcare industry for sharing vital information across different agents. The vast amount of vital information collected and amassed from various devices may create interoperability challenges, so systems have to be flexible enough to cope not only with additional sources but also with the evolution of the schemas and structures used for transporting and storing data. Moreover, healthcare organizations do not use or follow a single standard to deliver services such as case reports, drugs and diseases, and requirements may vary between different hospitals. The lack of a common standard among the different healthcare systems will increase the interoperability problems. In addition, hospitals cannot manage large volumes of structured and unstructured data efficiently using traditional database management systems; they will have to shift from relational databases to NoSQL, or non-relational, databases to analyze, process, and access large datasets rapidly and efficiently [30]. They will also have to select from various analysis tools that differ from each other in several aspects. Each non-relational database or NoSQL tool has its own advantages and disadvantages during the analysis stages, so data scientists and healthcare industries must pick the right non-relational database and the best data management tool to accelerate the quality of care.

VI. CONCLUSION
This article explores how big data analytics offers a great boon to the healthcare industry, as it helps to make better decisions. Accordingly, it investigates the unique characteristics, different phases, and analytical tools of big health data. Some of the open research challenges and feasible solutions are highlighted in order to reduce healthcare costs, enhance treatment, and improve the quality of patient care. With the help of analytics tools, data scientists can integrate health-related information from both internal and external sources, and physicians can ultimately be alerted to adjust their treatment and reach out to patients in an efficient way. The massive volume of healthcare data eventually leads to a risk of exposure of the data, making it highly vulnerable; thus, it is very essential for analysts and data scientists to consider security issues and deal with the data sets in such a way that privacy is not disrupted. However, big data analytics is still a challenging and time-demanding task in healthcare that needs expensive software, hardware, storage, computational infrastructure, and skilled data scientists and professionals in the healthcare sector. Furthermore, it has to enable data lakes to be scaled up and down rapidly by adapting the system to the actual demand. The combination of disruptive technologies, including machine learning, augmented reality and artificial intelligence, with big data is already assisting and multiplying caregivers' ability to improve the quality of patient care. A comprehensive review of the several big data analytical techniques available for healthcare applications will be discussed in future work.

REFERENCES
[1] Shankar Krishnan, "Application of Analytics to Big Data in Healthcare", IEEE 32nd Southern Biomedical Engineering Conference, 2016.
[2] Yuxuan Jiang, Zhe Huang and Danny H. K. Tsang, "Towards Max-Min Fair Resource Allocation for Stream Big Data Analytics in Shared Clouds", IEEE Transactions on Big Data, Vol. 4, no. 1, pp. 130-137, 2018.
[3] Mohammad-Parsa Hosseini, Hamid Soltanian-Zadeh, Kost Elisevich, and Dario Pompili, "Cloud-based Deep Learning of Big EEG Data for Epileptic Seizure Prediction", IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 1-5, 2016.
[4] Marco Viceconti, Peter Hunter, and Rod Hose, "Big Data, Big Knowledge: Big Data for Personalized Healthcare", IEEE Journal of Biomedical and Health Informatics, Vol. 19, no. 4, pp. 1209-1215, 2015.
[5] Raghunath Nambiar, Adhiraaj Sethi, Ruchie Bhardwaj, and Rajesh Vargheese, "A Look at Challenges and Opportunities of Big Data Analytics in Healthcare", IEEE International Conference on Big Data, pp. 1-6, 2013.
[6] Carmen C. Y. Poon, Benny P. L. Lo, Mehmet Rasit Yuce, Akram Alomainy, and Yang Hao, "Body Sensor Networks: In the Era of Big Data and Beyond", IEEE Reviews in Biomedical Engineering, 2015.
[7] Yichuan Wang, LeeAnn Kung, and Terry Anthony Byrd, "Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations", Technological Forecasting & Social Change, Vol. 126, pp. 3-13, 2018.
[8] Lloyd Minor, "Harnessing the Power of Data in Health", Stanford Medicine, pp. 1-18, 2017.
[9] Naoual El aboudi and Laila Benhlima, "Big Data Management for Healthcare Systems: Architecture, Requirements, and Implementation", Advances in Bioinformatics, pp. 1-10, 2018.
[10] Maryam Panahiazar, Vahid Taslimitehrani, Ashutosh Jadhav and Jyotishman Pathak, "Empowering Personalized Medicine with Big Data and Semantic Web Technology: Promises, Challenges, and Use Cases", IEEE International Conference on Big Data, pp. 790-795, 2014.
[11] Naoual El aboudi and Laila Benhlima, "Big Data Management for Healthcare Systems: Architecture, Requirements, and Implementation", Advances in Bioinformatics, pp. 1-10, 2018.
[12] St. Kliment Ohridski, "Big Data Analytics in Medicine and Healthcare", Journal of Integrative Bioinformatics, pp. 1-5, 2018.
[13] Hiba Asri, Hajar Mousannif, Hassan Al Moatassime, and Thomas Noel, "Big Data in healthcare: Challenges and Opportunities", International Conference on Cloud Technologies and Applications (CloudTech), pp. 1-7, 2015.
[14] https://www.elderresearch.com/blog/42-v-of-big-data.
[15] Yichuan Wang, LeeAnn Kung, William Yu Chung Wang, and Casey G. Cegielski, "An Integrated Big Data Analytics-Enabled Transformation Model: Application to Health Care", Information & Management, Vol. 55, pp. 64-79, 2018.
[16] Mohammad Ahmad Alkhatib, Amir Talaei-Khoei, and Amir Hossein Ghapanchi, "Analysis of Research in Healthcare Data Analytics", Australasian Conference on Information Systems, pp. 1-16, 2015.
[17] Prashant Johri, Tanya Singh, Sanjoy Das, and Shipra Anand, "Vitality of Big Data Analytics in Healthcare Department", International Conference on Information and Communication Technologies and Unmanned Systems, 2017.
[18] Chintan Zaveri, "Use of Big-Data in healthcare and Life Science using Hadoop Technologies", Second International Conference on Electrical, Computer and Communication Technologies, pp. 1-5, 2017.
[19] Iroju Olaronke and Ojerinde Oluwaseun, "Big Data in Healthcare: Prospects, Challenges and Resolutions", IEEE Conference on Future Technologies, pp. 1152-1157, 2016.
[20] Ahmed Eldawy, Mohamed F. Mokbel, and Christopher Jonathan, "HadoopViz: A MapReduce Framework for Extensible Visualization of Big Spatial Data", IEEE 32nd International Conference on Data Engineering (ICDE), pp. 601-612, 2016.
[21] Nivedita Das, Leena Das, Siddharth Swarup Rautaray and Manjusha Pandey, "Big Data Analytics for Medical Applications", International Journal of Modern Education and Computer Science, Vol. 2, pp. 35-42, 2018.
[22] Md Ileas Pramanik, Raymond Y. K. Lau, Haluk Demirkan, and Md. Abul Kalam Azad, "Smart health: Big data enabled health paradigm within smart cities", Expert Systems with Applications, Vol. 87, pp. 370-383, 2017.
[23] Godfrey, V. Hetherington, H. Shum, P. Bonato, N. H. Lovell, and S. Stuart, "From A to Z: Wearable technology explained", Maturitas, Vol. 113, pp. 40-47, 2018.
[24] Marcos D. Assuncao, Rodrigo N. Calheiros, Silvia Bianchi, Marco A. S. Netto, and Rajkumar Buyya, "Big Data Computing and Clouds: Trends and Future Directions", Journal of Parallel and Distributed Computing, pp. 3-15, 2015.
[25] Nishita Mehta and Anil Pandit, "Concurrence of Big Data Analytics and Healthcare: A Systematic Review", International Journal of Medical Informatics, Vol. 114, pp. 57-65, 2018.
[26] P. Ram Mohan Rao, S. Murali Krishna, and A. P. Siva Kumar, "Privacy preservation techniques in big data analytics: a survey", Journal of Big Data, pp. 1-12, 2018.
[27] Matthew Herland, Taghi M. Khoshgoftaar, and Richard A. Bauder, "Big Data fraud detection using multiple medicare data sources", Journal of Big Data, pp. 1-12, 2018.
[28] Shujaat Hussain and Sungyoung Lee, "Visualization and descriptive analytics of wellness data through Big Data", The Tenth International Conference on Digital Information Management, pp. 69-71, 2015.
[29] Anish Jindal, Amit Dua, Neeraj Kumar, Ashok Kumar Das, A. V. Vasilakos and Joel J. P. C. Rodrigues, "Providing Healthcare-as-a-Service Using Fuzzy Rule-Based Big Data Analytics in Cloud Computing", IEEE Journal of Biomedical and Health Informatics, pp. 1-14, 2018.
[30] Ejaz Ahmed, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, Junaid Shuja, Muhammad Imran, Nadra Guizani, and Sheikh Tahir Bakhsh, "Recent Advances and Challenges in Mobile Big Data", IEEE Communications Magazine, pp. 102-107, 2018.