Big Data, Big Challenges: A Healthcare Perspective - Background, Issues, Solutions and Research Directions
Mowafa Househ
Andre W. Kushniruk
Elizabeth M. Borycki
Editors

Editors

Mowafa Househ
Division of Information and Computing Technology
College of Science and Engineering
Hamad Bin Khalifa University, Qatar Foundation
Doha, Qatar

Andre W. Kushniruk
School of Health Information Sciences
University of Victoria
Victoria, BC, Canada

Elizabeth M. Borycki
School of Health Information Sciences
University of Victoria
Victoria, BC, Canada
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Much has been written about the utilization of big data analytics methods, tools and
technologies to collect, process, visualize and make use of high volume structured
and unstructured data in a number of fields such as finance, insurance, sports,
agriculture, and health. With the fast and ever-increasing growth of user-generated
data from the Internet, such as social media content, data from wireless medical
devices and mobile apps, big data analytical methods, tools and technologies have
become recognized as the “go-to” solutions able to make
sense of such voluminous, disorganized, fluid and free-flowing data. Within health
care, there is a growing knowledge base of big data related studies and implementations in public health, clinical decision making, disease prevention, and
healthcare cost reduction. As with any new field, much of the research and discussion centers upon the added value and opportunities that new technologies, such
as big data analytics methods, tools and technologies can provide. However, as the
domain area begins to mature through increased implementation, evaluation studies,
and user experiences, the problems, and challenges relating to the methods, tools,
and technologies used for big data analytics begin to emerge. For the past five
years, much of the literature on big data analytics has focused on the benefits of big
data in improving all areas of health care. A new wave of research is beginning to emerge that challenges some of the assumptions behind the positive assertions about big data analytics in health care. That is the motivation behind this book, which is not only about sharing success stories and opportunities for big data in health care, but also about addressing the emerging challenges that many researchers have overlooked.
What makes this book unique is that it examines both the opportunities and, to a greater extent, the challenges of applying big data analytics methods, tools and technologies within health care from a number of perspectives. The book is divided
into three parts and eleven chapters. The first part of the book examines the
healthcare professional perspective on the challenges and opportunities of big data analytics from nursing, medical, public health, and health administrator perspectives. Most of the chapters are included in the first part of the book. The second
part of the book focuses on human factors and ethical challenges and opportunities
related to big data analytics in health care. There are three chapters in part two
of the book that address topics related to patient safety, user-centered design, and
ethical issues. Part three of the book includes two chapters that examine the
technical challenges in the utilization of big data analytics in health care. The first
chapter examines the challenges and opportunities of big data analytics from a data
scientist’s perspective. The second chapter examines the integrative exposome/expotype perspective related to big data analytics in health care.
The book provides health data scientists, healthcare professionals, and healthcare managers and policymakers with the first comprehensive insight into the challenges
and opportunities of big data analytics in health care. The book will question some of the preconceptions that students and professionals of big data analytics in health care currently hold, and challenge them to devise new solutions and ideas in response to the challenges raised within the book.
1 Introduction
S. Bakken (corresponding author)
School of Nursing, Department of Biomedical Informatics, and Data Science Institute,
Columbia University, 630 W. 168th Street, New York, NY 10032, USA
e-mail: sbh22@cumc.columbia.edu
T. A. Koleck
School of Nursing, Columbia University, New York, NY, USA
In characterizing the data used according to the criteria for big data [6, 7], all studies
met the criterion of volume, most met the criterion of variety, and a minority met
the criterion of velocity. Veracity and value were not explicitly analyzed. Electronic
health records (EHRs) were the primary data source for 14 studies although several
studies integrated EHR data with other data sources. The study purposes were
categorized as knowledge discovery, prediction, and evaluation. Since the time of
this review, additional nursing studies have been conducted that reflect data sources
beyond EHRs and structured data sources including omics [8], social media [9], and
sensors [10]. Moreover, health policy considerations for data science have been
delineated from a nursing science perspective [11].
The purpose of this chapter is to summarize the benefits and key challenges
related to big data streams and data science from the perspective of nursing. The
benefits and challenges are considered from the perspective of data governance as
well as data science infrastructure and pipeline and illustrated through six case
examples. In addition, two cross-cutting issues (ethical conduct of research and data
science competencies) are addressed.
A number of authors have published data science pipelines. From the perspective of nursing, however, data science starts with a question and, because the data are often protected health information under the U.S. Health Insurance Portability and Accountability Act (HIPAA), requires careful consideration of data governance (Fig. 1). In addition, the infrastructure required
for data science is often significantly different from the data management and
analytic pipelines typically available to nurse scientists and clinicians due to the
volume of data and the processing power needed to ingest, wrangle
(i.e., pre-process using semi-automated tools), compute and analyze, model and
validate, and interpret (visualize and report) the data. Moreover, data science requires platforms beyond SAS, STATA, and R, such as Apache Hadoop MapReduce, Apache Mahout (machine learning algorithms), Spark’s Machine Learning Library (MLlib), and RHadoop, to support reduction and analysis of multi-dimensional data through methods such as K-means clustering, random forest classifiers, neural network backpropagation, support vector machines, and Gaussian discriminant analysis. A data science infrastructure must also support visualization of the
data for analysis, interpretation, and reporting through general tools such as Tableau
and tools for special purposes (e.g., Sentiment Viz for visualization of Tweet
contents, ORA for visualization of network structures).
Table 1 displays a summary of challenges related to aspects of data governance,
data science infrastructure, and data science pipeline in a set of case examples that
are described in more detail in the following section.
Big Data Challenges from a Nursing Perspective 5
Fig. 1 Data governance, data science infrastructure, and data science pipeline. Adapted from
Tesla Institute [12]
3 Case Examples
Three case examples from the authors’ experience, focused on knowledge discovery from electronic health records (EHRs), omics, and social media and reflecting multiple challenges, are described first. This is followed by briefer descriptions of three more case examples from the literature, each highlighting a specific challenge.
3.2 Omics
Social media are an important data stream for capturing perceptions as well as
behaviors in the daily lives of participants [9]. In addition to content mining, data
science methods support the analysis of network structures which are important to
address questions regarding social support and other types of relatedness. Yoon and
colleagues mined Twitter to gain an understanding of the caregiving experience of
Latinos caring for a person living with dementia [16]. Although very limited in
character length, Tweets have associated metadata which results in more than 20
data elements per Tweet including explicit and extractable characteristics of the user
and the Tweet [39]. Through the methods of topic modeling, sentiment analysis,
and network analysis (macro, meso, micro), they found that (a) frequently occurring
dementia topics were related to mental health and caregiving, (b) the sentiments
expressed in the Tweets were more negative than positive, and (c) network patterns
demonstrated a lack of social connectedness [15, 16]. In terms of challenges, data governance was not an issue because a sample of Tweets is publicly available on a daily basis and research use is supported by the Twitter terms of agreement.
However, there were key challenges across analyses related to data science
infrastructure and pipeline. Regarding infrastructure, the institution lacked graphical user interfaces to its existing high-performance computing resources, and its policies limited data storage. For pipeline, a key challenge to extraction was cost. Twitter
charges for extraction of retrospective datasets and the federal grant supporting the
research did not have sufficient budget for this purpose. To address these issues,
relevant Tweets were downloaded on a daily basis, pre-processed, and then combined to form the analytic Tweet corpus. A second challenge related to extraction
was defining the lexicon for the extraction to capture the Tweets of populations of
interest. This requires application of a set of cultural analytic techniques that begins
with a corpus of text that is labeled (e.g., song lyrics by a Black lyricist, a Latino
poem) and results in an algorithm suitable for text retrieval for that population. Such
techniques were applied to create a Latino Tweet corpus. In addition, a variety of
existing tools were combined to create a pipeline: extraction/ingestion (NodeXL, NCapture), wrangling (Notepad++, Tableau), structural analyses including visualization (ORA, Pajek), and content analysis including visualization (Weka, Sentiment Viz).
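As a deliberately simplified illustration of the sentiment-analysis step in such a pipeline, a lexicon-based scorer can be sketched as follows. The lexicon and sample tweets are invented for illustration; the study itself used richer tools such as Weka and Sentiment Viz:

```python
# Toy lexicon-based sentiment scoring of tweet text. Real studies use
# validated lexicons and trained models; this lexicon and corpus are
# invented examples of the general technique.
import re

POSITIVE = {"love", "hope", "support", "grateful"}
NEGATIVE = {"exhausted", "alone", "struggling", "sad"}

def sentiment(tweet):
    """Return (positive_hits, negative_hits, label) for one tweet."""
    words = re.findall(r"[a-z']+", tweet.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    label = "positive" if pos > neg else "negative" if neg > pos else "neutral"
    return pos, neg, label

corpus = [
    "Caring for mom with dementia, feeling exhausted and alone",
    "So grateful for the support group, it gives me hope",
    "Another long night of caregiving",
]
labels = [sentiment(t)[2] for t in corpus]
print(labels)  # ['negative', 'positive', 'neutral']
```

Even this caricature shows why lexicon definition was a key challenge in the case example: the quality of the scoring depends entirely on how well the word lists reflect the language of the population of interest.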
Pruinelli et al. [17] used EHR data to examine the effect of delay within the 3-h Surviving Sepsis Campaign guideline on patients with severe sepsis and septic shock.
Applying sequential propensity score matching, they found that the statistically
significant time in minutes after which a delay increased the risk of death was:
lactate—20 min, blood culture—50 min, crystalloids—100 min, and antibiotic
therapy—125 min. They identified one challenge related to data wrangling.
Typically, crystalloid volume is documented in unstructured nursing flowsheets.
Consequently, actual volume cannot be precisely determined from orders alone. To
address this issue, the authors suggested the need to standardize flowsheet data. In
another report, some of the authors described the creation and validation of
flowsheet information models for five nursing-sensitive quality indicators, five
physiological systems, and expanded vital signs and anthropometric measures [40].
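Pruinelli et al.'s sequential propensity score matching is considerably more involved, but its core step, pairing each treated record with the closest-scoring unused control within a caliper, can be sketched as follows. The record IDs and scores below are invented; in practice the scores come from a model fitted on confounders (e.g., logistic regression):

```python
# Greedy 1:1 nearest-neighbor matching on propensity scores with a caliper.
# IDs and scores are invented placeholders for illustration only.

def match(treated, controls, caliper=0.1):
    """Pair each treated (id, score) with the closest unused control."""
    pairs = []
    available = dict(controls)                       # control id -> score
    for t_id, t_score in sorted(treated, key=lambda t: t[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]                      # match without replacement
    return pairs

treated = [("T1", 0.62), ("T2", 0.35), ("T3", 0.90)]
controls = [("C1", 0.60), ("C2", 0.33), ("C3", 0.50), ("C4", 0.88)]
print(match(treated, controls))  # [('T2', 'C2'), ('T1', 'C1'), ('T3', 'C4')]
```

Matching like this only works if the underlying variables, such as crystalloid volume in the case above, are reliably captured, which is exactly the data-wrangling challenge the authors identified.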
Dashboards are increasingly being integrated into clinical practice and used by
executives and managers for overviews of their organizations or units in terms of
processes as well as cost and quality indicators. There is currently less direct use of
dashboards by clinicians at the point of care to inform their decision making for
individual or groups of patients. A systematic review on the use of clinical dashboards revealed a positive impact of clinical dashboards on care processes and outcomes in some contexts [43]. However, the authors noted that it is unclear what dashboard characteristics are associated with improved outcomes and how dashboards are integrated into care and decision making. To address the first knowledge
gap, Dowding and colleagues assessed the relationship between home care nurses’
numeracy and literacy and their comprehension of visual display information in a
dashboard project focused on providing feedback on quality metrics to home care
nurses at the point of care for patients with congestive heart failure [19]. Home care
nurses (n = 196) best understood information displayed as bar graphs (88%), followed by tables (81%), line graphs (77%), and spider graphs (41%). Twenty-five percent of the nurses had low numeracy and/or low graph literacy. Those with low numeracy and graph literacy had poorer comprehension across formats (63% and 65%, respectively). Such findings suggest that the data science competencies of clinicians
related to interpretation of visual displays must be considered along with
methodological and infrastructure aspects for optimal use of dashboards to inform
patient care decision making.
4 Cross-Cutting Issues
Ethical conduct of research and data science competencies are two major
cross-cutting issues for data science from the nursing perspective.
The historic Belmont Report articulated three principles for ethical conduct of
research that must be considered for use of big data streams and data science
methods: respect for persons (i.e., autonomy), beneficence, and justice [44].
Respect for persons includes two separate moral requirements: acknowledgment of
autonomy and protection of those with diminished autonomy. Informed consent is
the primary mechanism for protection of autonomy. Some big data streams have
explicit opt-in or opt-out consent processes and use of protected health information
(PHI) from EHRs and other electronic clinical data resources for research has
ethical and regulatory oversight from institutional review boards and national
regulations such as HIPAA in the U.S. In contrast, social network sites and other
quantified-self technologies include terms of agreement for data use that may not be
read or fully comprehended by users. This can result in use of an individual’s data
in the absence of informed consent.
Beneficence involves optimizing benefits while minimizing risks to ensure that scarce resources are used wisely. Poor methodological rigor and loss of confidentiality through commodification of data pose threats to beneficence. To ensure
appropriate decision making based on study findings, methodological rigor is
needed in terms of selection of appropriate data streams as well as at each stage of
the data science pipeline. Loss of confidentiality and commodification of patient/consumer-generated data can occur through prosumption, as digital content is produced and consumed by individuals as they access websites, use mobile health applications, and post and respond to social network messages. Individuals may
vary in their willingness to have their data used for public health versus commercial
purposes because they do not typically reap financial benefits from commodification of their data [45, 46].
The principle of justice requires fair procedures and equitable outcomes in the
selection of research participants. For data science, this means consideration of
characteristics of the individuals or populations comprising the data streams that
will be used to address the research question. For example, (a) the severity of illness
and sociodemographic composition of patients represented in EHR data vary by
type and location of the healthcare organization, (b) Latinos are less likely than Whites or Blacks to use an app for health tracking [47], and (c) racial and ethnic minorities are less likely to participate in biobanks [48, 49]. Such biases in the data
streams may limit the relevance of discoveries and predictions to those at greatest
risk for health disparities. Consequently, researchers must carefully match their
selection of data streams to their research questions.
The required data science competencies for nurses will vary by role, distinguishing general competencies for all nurses from those needed by specialists, including nursing informatics specialists, chief nursing informatics officers, and nurse scientists conducting data science research. As with nursing informatics
competencies in the past, the manner in which these competencies will be acquired
through education at the undergraduate, master’s, and doctoral levels will be
defined over time by bodies that provide oversight for nursing education with input
from the nursing community. To date, most consideration has been given to competencies for nurse scientists, given the increasingly prominent role of data science in discovery. Expertise is typically conceptualized in three broad areas: computational (e.g., cloud computing, workflow automation, visual analytics), mathematical and statistical (e.g., research design, traditional and machine learning analytic techniques), and domain (e.g., nursing, genomics, public health) [50].
Published Venn diagrams of these three areas emphasize the interdisciplinary team
science aspects of data science by naming the intersection of all the competencies
“the unicorn”. Educational pathways for nurse scientists should reflect their primary
areas of knowledge development [3, 11]. For example:
• Create computational methods and tools—doctoral or post-doctoral training in a
computational field such as computer science, data science, or biomedical
informatics. The nursing perspective will inform the types of computational
methods and tools developed.
• Apply data science as major method of inquiry in nursing research—doctoral
training in nursing with interdisciplinary data science specialization integrated
into nursing PhD or post-doctoral program. For example, trainees in the
Reducing Health Disparities Through Informatics Pre- and Post-doctoral
Training program at Columbia University have course work and applied
research opportunities in data science primarily related to data mining and
information visualization.
• Awareness of data science as an approach in nursing research—doctoral training
in nursing and generalist training in data science. Every nurse scientist should
have a general understanding of data science similar to their familiarity with
qualitative inquiry, experimental and quasi-experimental designs, and health
services research. In the U.S., the National Institute for Nursing Research has
made significant efforts to meet this need for existing nurse scientists through
the provision of week-long Boot Camps in Data Science and Precision Health
[51].
5 Conclusion
Acknowledgements Manuscript preparation was supported by grants from the National Institutes
of Health: Precision in Symptom Self-Management (PriSSM) Center, New York City Hispanic
Dementia Caregiver Research Program, and Reducing Health Disparities Through Informatics
(RHeaDI) Pre- and Post-doctoral Training Program.
References
4. Bakken S, Reame N (2016) The promise and potential perils of big data for advancing
symptom management research in populations at risk for health disparities. Annu Rev Nurs
Res 34(1):247–260. https://doi.org/10.1891/0739-6686.34.247
5. Westra BL, Sylvia M, Weinfurter EF, Pruinelli L, Park JI, Dodd D et al (2017) Big data
science: a literature review of nursing research exemplars. Nurs Outlook 65(5):549–561.
https://doi.org/10.1016/j.outlook.2016.11.021
6. IBM. IBM big data & analytics hub 2015. Available from: http://www.ibmbigdatahub.com/
infographic/four-vs-big-data
7. Marr B. Big data: the 5 Vs 2015 [cited 1 Feb 2015]. Available from: http://www.slideshare.
net/BernardMarr/140228-big-data-volume-velocity-variety-varacity-value
8. Koleck TA, Conley YP (2015) Identification and prioritization of candidate genes for
symptom variability in breast cancer survivors based on disease characteristics at the cellular
level. Breast Cancer (Dove Med Press) 8:29–37. https://doi.org/10.2147/BCTT.S88434
9. Yoon S, Elhadad N, Bakken S (2013) A practical approach for content mining of Tweets.
Am J Prev Med 45(1):122–129. https://doi.org/10.1016/j.amepre.2013.02.025
10. Rantz MJ, Skubic M, Popescu M, Galambos C, Koopman RJ, Alexander GL et al (2015) A
new paradigm of technology-enabled ‘Vital Signs’ for early detection of health change for
older adults. Gerontology 61(3):281–290. https://doi.org/10.1159/000366518
11. Bakken S (2017) Data science. In: Hinshaw AS, Grady PA (eds) Shaping health policy
through nursing research. Springer
12. Tesla Institute. Understanding the data science pipeline [cited 14 Feb 2018]. Available from:
http://www.tesla-institute.com/index.php/using-joomla/extensions/languages/278-
understanding-the-data-science-pipeline
13. Koleck T, Bakken S, Kim M, Wesmiller S, Tatonetti N (in preparation) Use of electronic
health records to examine demographic and clinical predictors of postoperative nausea and
vomiting in women following gynecologic surgical procedures. J Perianesthesia Nurs
14. Arockiaraj AI, Shaffer JR, Koleck TA, Weeks DE, Conley YP (in preparation) Methylomic
data processing protocol shows difference in sample quality and methylation profiles between
blood and cerebral spinal fluid following acute subarachnoid hemorrhage. Genet Epigenetics
15. Yoon S (2016) What can we learn about mental health needs from Tweets mentioning
dementia on World Alzheimer’s Day? J Am Psychiatr Nurses Assoc 22(6):498–503. https://
doi.org/10.1177/1078390316663690
16. Yoon S, Co MC Jr, Bakken S (2016) Network visualization of dementia tweets. Stud Health
Technol Inform 225:925
17. Pruinelli L, Yadav P, Hoff A, Steinbach M, Kumar V, Delaney CW et al (2018) Delay within the 3-hour Surviving Sepsis Campaign guideline on mortality for patients with severe sepsis and septic shock. Crit Care Med. https://doi.org/10.1097/ccm.0000000000002949. [Epub ahead of print]
18. Rantz M, Phillips LJ, Galambos C, Lane K, Alexander GL, Despins L et al (2017)
Randomized trial of intelligent sensor system for early illness alerts in senior housing. J Am
Med Dir Assoc 18(10):860–870. https://doi.org/10.1016/j.jamda.2017.05.012
19. Dowding D, Merrill JA, Onorato N, Barron Y, Rosati RJ, Russell D (2018) The impact of
home care nurses’ numeracy and graph literacy on comprehension of visual display
information: implications for dashboard design. J Am Med Inform Assoc 25(2):175–182.
https://doi.org/10.1093/jamia/ocx042
20. Lee KA, Meek P, Grady PA (2014) Advancing symptom science: nurse researchers lead the
way. Nurs Outlook 62(5):301–302. https://doi.org/10.1016/j.outlook.2014.05.010
21. Miaskowski C, Barsevick A, Berger A, Casagrande R, Grady PA, Jacobsen P et al (2017) Advancing symptom science through symptom cluster research: expert panel proceedings and recommendations. J Natl Cancer Inst 109(4). https://doi.org/10.1093/jnci/djw253
22. Cohen B, Vawdrey DK, Liu J, Caplan D, Furuya EY, Mis FW et al (2015) Challenges
associated with using large data sets for quality assessment and research in clinical settings.
Policy Polit Nurs Pract 16(3–4):117–124. https://doi.org/10.1177/1527154415603358
23. Cowie MR, Blomster JI, Curtis LH, Duclaux S, Ford I, Fritz F et al (2017) Electronic health
records to facilitate clinical research. Clin Res Cardiol 106(1):1–9. https://doi.org/10.1007/
s00392-016-1025-6
24. Kreimeyer K, Foster M, Pandey A, Arya N, Halford G, Jones SF et al (2017) Natural
language processing systems for capturing and standardizing unstructured clinical informa-
tion: a systematic review. J Biomed Inform 73:14–29. https://doi.org/10.1016/j.jbi.2017.07.
012
25. Pereira L, Rijo R, Silva C, Martinho R (2015) Text mining applied to electronic medical
records: a literature review. Int J E-Health Med Commun (IJEHMC) 6(3):1–18. https://doi.
org/10.4018/IJEHMC.2015070101
26. Weiskopf NG, Weng C (2013) Methods and dimensions of electronic health record data
quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc 20(1):144–
151. https://doi.org/10.1136/amiajnl-2011-000681
27. Coughlin SS (2014) Toward a road map for global -omics: a primer on -omic technologies.
Am J Epidemiol 180(12):1188–1195. https://doi.org/10.1093/aje/kwu262
28. McCall MK, Stanfill AG, Skrovanek E, Pforr JR, Wesmiller SW, Conley YP (2018)
Symptom science: omics supports common biological underpinnings across symptoms. Biol
Res Nurs 20(2):183–191. https://doi.org/10.1177/1099800417751069
29. Birney E, Smith GD, Greally JM (2016) Epigenome-wide association studies and the
interpretation of disease-omics. PLoS Genet 12(6):e1006105. https://doi.org/10.1371/journal.
pgen.1006105
30. Riancho J, Del Real A, Riancho JA (2016) How to interpret epigenetic association studies: a
guide for clinicians. Bonekey Rep 5:797. https://doi.org/10.1038/bonekey.2016.24
31. Baumgartel K, Zelazny J, Timcheck T, Snyder C, Bell M, Conley YP (2011) Molecular
genomic research designs. Annu Rev Nurs Res 29:1–26
32. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD et al (2014)
Minfi: a flexible and comprehensive bioconductor package for the analysis of Infinium DNA
methylation microarrays. Bioinformatics 30(10):1363–1369. https://doi.org/10.1093/
bioinformatics/btu049
33. Chen J, Just AC, Schwartz J, Hou L, Jafari N, Sun Z et al (2016) CpGFilter: model-based
CpG probe filtering with replicates for epigenome-wide association studies. Bioinformatics 32
(3):469–471. https://doi.org/10.1093/bioinformatics/btv577
34. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD (2012) The sva package for removing
batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28
(6):882–883. https://doi.org/10.1093/bioinformatics/bts034
35. Xu X, Gammon MD, Hernandez-Vargas H, Herceg Z, Wetmur JG, Teitelbaum SL et al
(2012) DNA methylation in peripheral blood measured by LUMA is associated with breast
cancer in a population-based study. FASEB J 26(6):2657–2666. https://doi.org/10.1096/fj.11-
197251
36. Xu Z, Niu L, Li L, Taylor JA (2016) ENmix: a novel background correction method for
Illumina HumanMethylation450 BeadChip. Nucleic Acids Res 44(3):e20. https://doi.org/10.
1093/nar/gkv907
37. Phipson B, Maksimovic J, Oshlack A (2016) missMethyl: an R package for analyzing data
from Illumina’s HumanMethylation450 platform. Bioinformatics 32(2):286–288. https://doi.
org/10.1093/bioinformatics/btv560
38. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic
Acids Res 28(1):27–30. KEGG accessible at: http://www.genome.jp/kegg/kegg1.html
39. Sinnenberg L, Buttenheim AM, Padrez K, Mancheno C, Ungar L, Merchant RM (2017)
Twitter as a tool for health research: a systematic review. Am J Public Health 107(1):143-e8
40. Westra BL, Christie B, Johnson SG, Pruinelli L, LaFlamme A, Sherman SG et al (2017)
Modeling flowsheet data to support secondary use. Comput Inform Nurs 35(9):452–458.
https://doi.org/10.1097/CIN.0000000000000350
41. Rantz M, Lane K, Phillips LJ, Despins LA, Galambos C, Alexander GL et al (2015)
Enhanced registered nurse care coordination with sensor technology: impact on length of stay
Michael Bainbridge
1 Introduction
It’s hard to avoid hearing mention of ‘Big Data’ in 2018. News headlines,
periodicals and books are awash with it. The prospect of taking the wisdom of
crowds, in particular their data, and distilling it into a formal source of high quality
data is most appealing [1]. Uses hoped for include decision support, decision
making, aggregation for public health planning, and epidemiological research. The
list is long, frequently visited and lengthened [2]. The audience is wide, from you and me as consumers to multinational pharmaceutical companies [3, 4]. This
chapter will address the issue of big data largely from the perspective of data
derived at the point of care being used with other data sources, to support inferences
that would be unlikely to be made via the conventional routes such as analysis of a
single organisation’s data.
Companies ranging from small start-ups to the established blue chip giants are
investing significantly. The prevailing belief seems to be that ‘if only the might of
Big Data and Deep Learning were applied to Health and Medicine, then the benefits
would flow in abundance’. This chapter will examine these beliefs, propose some
definitions and investigate some of the pitfalls and bear traps for the unwary. We
also examine medicine’s readiness to embrace these challenges in its leadership, its
architecture and maturity of thought. Other chapters will focus on the more technical aspects, the ‘how’ and the ‘where’. This chapter will explore the clinical
aspects, the potential benefits and the attention needed in the data collection process
which will be required to achieve these benefits. If, through implementation, big data cannot answer, for the individual practitioner, the question “does this help me deliver 21st Century, safe, state of the art, evidence-based, personalised care?” then
it will fail to meet expectations. Equally, a consumer faced with the sharing of their data could ask the same question, replacing ‘deliver’ with ‘receive’ and adding “reducing my risk and ensuring affordability”.

M. Bainbridge (corresponding author)
University of Victoria, Victoria, BC, Canada
e-mail: bainbrid@uvic.ca
Put simply, we are examining the aggregation of data from clinical and other
sources and applying it to questions relating to all areas of medicine. It is worth
differentiating between time-sensitive applications such as:
• the direct delivery of care
• decision support—including the incorporation of genomics and other ‘omic data
into decisions
And applications which are less time-sensitive:
• the planning of care
• the commissioning of care
• the audit of the quality and safety of care delivery process
• multiple areas of clinical and pharmacological research
• defining and measuring outcomes
• multiple potential commercial applications
The audience is potentially wide for both types of application but they put
different ‘stresses’ on the systems providing this service. In all, however, there is an
inherent risk to trust, privacy and security which we will discuss in more detail later
in the chapter. It must be emphasised that access to these data is not something to
consider only for clinicians; utility and availability must be considered for the
clinician and the populations and individuals that they serve, as well as their
non-clinical carers. This sharing process, of course, brings its own challenges to society, as issues around security, availability, and privacy are accentuated. Marked differences in beliefs and expectations also exist between the baby boom generation and millennials, who have never lived without the internet or social media [5]. Successful and rapid consultation and agreement on these issues must be achieved.
It is important that this is done early and well before disclosures and decisions are
made which become irrevocable. There is already a rising tide of data breaches,
both inadvertent and malicious [6]. Big data magnifies this worry, raising the
spectre of disclosures on an unprecedented scale.
Big data is closely connected with other areas in health that are all, at the time of
writing, still close to the peak of the Gartner Hype Cycle, although descent into the
“trough of disillusionment” is only a matter of time. Alongside Big Data, the issues
of Decision support systems (DSS), Knowledge Support Systems (KSS), Artificial
Big Data Challenges for Clinical and Precision Medicine 19
Intelligence (AI), Precision Medicine¹ and the application of both genomics and
multiple sub-species of ‘omics all consume multiple column inches. All are related
and form an enticing set of potential benefits for the application of computation to
clinical care. Despite these concepts having been in use in healthcare for over
30 years [7], definitions and expected outcomes from their application still differ
wildly. In this chapter, we have the expectation that:
• A Decision Support System will present clinicians, consumers and carers,
singly and in combination, with different options for the delivery of care
(or indeed its withdrawal).
• Artificial Intelligence systems will make decisions on behalf of the same actors
and may then deliver care autonomously or pause for approval.
• Decisions made by both DSS and AI may be improved by improving the quality
of the information presented to them.
• Precision Medicine focuses on identifying which interventions will be effective
for an individual, based on genetic, environmental, and lifestyle factors.
• The much heralded availability of a person’s genome and the application of
these data to their phenotype (previous medical history plus environmental and
lifestyle factors) will be a major driver (Fig. 1).
¹ The older term ‘Personalised Medicine’, which is often used interchangeably with ‘Precision
Medicine’, will not be used in this chapter. Personalised Medicine, which implies specific
manufacture or synthesis for the individual, is a valid but much more specific concept.
20 M. Bainbridge
Much is made of the benefits of big data in the health arena and of its analysis and
presentation. The quality and structure of the source data are also important. This is
well recognised in the literature [8, 9] but less well delivered in the real-world of
clinical computing and care [10]. The problems are magnified when the
time-sensitive tasks above are examined and even basic infrastructure issues such as
connectivity and reliability of bandwidth become significant blockers.
Along with data access and data quality, trust and privacy are vital to the initial
acceptability of incorporating data, and even to the permission to use it and to
continue using it. Inadvertent identification of data subjects due to poor design is
just one aspect. Harm and disadvantage caused to people so identified is another.
These issues are complex, especially when examining an insurance-based and
actuarially driven sector.
Disclosure of genomic data from any time after embryonic implantation could, for
certain diagnoses, preclude that person ever getting a loan or a mortgage [11].
Huntington’s Chorea may, for example, manifest at any age from 4 to 85 years [12].
In 2016, Gartner defined eight sources of big data relevant to health [13]. Ernst and
Young and others have characterised it as the ‘four V’s’—Volume, Variety,
Velocity and Veracity [14, 15]. Likewise McKinsey [16] have examined Big Data
and characterised the five ‘rights’ that its use could deliver: right living, right care,
right provider, right value and right innovation. All suggest that the flow of data
through the health ecosystem would improve both the five ‘rights’ [17] and actual
clinical outcomes. As you will hear elsewhere in this book, the benefits from access
to and the use of big data are undeniable. However, there are multiple issues which
could potentially derail, devalue and undermine uptake, use and acceptance of the
concept.
Let’s examine the Gartner sources of health related big data in more detail and
address some of their potential value as well as problems:
Physician’s Free-Text Notes—Without doubt, this resource exists in volume
but, despite public perception otherwise, it is highly variable and unstructured. Free
text is prone to semantic error and is often created without significant contextual
cues. Family history, for example is not well recorded [18, 19] in clinical records.
Simple issues such as negation may totally change the meaning and be treated
differently according to the algorithm reading the text (this is still only a partially
solved problem) [20, 21]. Unfortunately, making the transition from free text to
structured and coded records remains the same significant barrier it has been for
decades [22, 23].
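To make the negation problem concrete, here is a toy, NegEx-style rule check in Python. The trigger list, the five-word window and the whitespace tokenisation are all illustrative simplifications of the real algorithms discussed in the literature cited above; production systems must also handle punctuation, scope and sentence structure.

```python
# Toy NegEx-style negation check (a sketch, not the published algorithm).
# A concept is treated as negated if a trigger phrase appears within a
# short window of words immediately before it.
NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of"]

def is_negated(text, concept, window=5):
    """True if `concept` occurs in `text` with a negation trigger in the
    preceding `window` words (case-insensitive, whitespace tokens only)."""
    words = text.lower().split()
    concept_words = concept.lower().split()
    n = len(concept_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == concept_words:
            preceding = " " + " ".join(words[max(0, i - window):i]) + " "
            if any(" " + t + " " in preceding for t in NEGATION_TRIGGERS):
                return True
    return False

print(is_negated("Patient denies chest pain on exertion", "chest pain"))   # True
print(is_negated("Patient reports chest pain on exertion", "chest pain"))  # False
```

The two sentences differ by a single word, yet carry opposite clinical meanings; an aggregation pipeline that misses the trigger would silently count the first patient as having chest pain.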
Patient Generated Health Data—This data source is beginning to grow
exponentially but in a very unstructured way as health and wellness data is captured
in a wide variety of applications [24, 25]. This data source largely consists of free
text with fewer data items than data created by clinicians. New sources of data from
wearables [26, 27] are also contributing. However, just because you can measure a
data point regularly doesn’t mean that it is of value. Conversely, there is a
possibility that we have yet to recognise the value of high-volume information
such as heart rate. More importantly, many clinicians paternalistically reject these
data because they were captured by unskilled observers. There is also professional and
legal anxiety that they may be the source of a new and unfunded duty of care
[28, 29].
Genomic Data—Within the next 2–5 years we will be confronted by large
numbers of the global population being offered affordable full genetic sequencing
[30, 31] possibly at birth [32]. Genomic sequencing of IVF embryos has taken place
since 2014 [33] and comprehensive sequencing forms part of much pre-conceptual
counseling [34]. These data, estimated at between 100 and 150 GB per human,
together with the standardisation of their representation [35] in the genomics
community, will offer substantial opportunity for precision and perhaps also
personalisation of
medicine. However, much work is needed on the interface between genomic and
phenotype information. This has been a source of discussion for many years and is
still the source of much argument in a crowded landscape. Without doubt, current
clinical systems will need significant redesign in order to benefit from and include
genomics data [36, 37].
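The 100–150 GB per-genome figure above already implies a storage problem at population scale. A back-of-envelope sketch, with an illustrative population size and replication factor, makes the point:

```python
# Back-of-envelope storage estimate using the 100-150 GB per-genome figure
# quoted above. Population size and the 3x replication factor (for
# resilience) are illustrative assumptions, not measured values.
def genome_storage_pb(population, gb_per_genome=125.0, replication=3):
    """Total storage in petabytes for `population` genomes kept in
    `replication` copies. 1 PB = 1,000,000 GB (decimal units)."""
    total_gb = population * gb_per_genome * replication
    return total_gb / 1_000_000

# Sequencing 1 million newborns at the midpoint estimate:
print(round(genome_storage_pb(1_000_000), 1))  # 375.0 (petabytes)
```

Even before analysis begins, a single national birth cohort at this size would demand hundreds of petabytes, which is why compressed and standardised representations [35] matter so much.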
Physiological Monitoring Data—There is an increasing overlap between these
data and data traditionally captured in intensive and unscheduled care environ-
ments. The same restrictions and issues apply.
Medical Imaging Data—The size of these data is measured in petabytes and
despite advances in the Natural Language Processing of the reporting process, the
actual image data remains largely unstructured [38–40].
The last three of Gartner’s sources, Publicly Available Data, Credit Card
and Purchasing Data and Social Media Data, are out of the scope of this chapter
but obviously a potential source of much information previously thought inacces-
sible to clinical care. Volume, diversity and questionable veracity will all prove to
be limiting in their utility. The potential for privacy breaches through inference
attacks [41] linking to health data is also great.
High quality data is a precursor to the delivery of clinical care. However, despite
decades of evidence to support it, a major item that would deliver high quality data
is still not in place. We are, of course, talking about structured and coded clinical
data.
The opportunity offered by the coding and structuring of clinical data is not
promoted or implemented at anything like the scale necessary. Since Larry Weed’s
seminal paper in 1968 [42], medicine has known what has been required to deliver
interoperable care: to capture coded, defined and structured information capable of
being shared between clinicians without the sharing process compromising its
meaning. For reasons outside the scope of this book, this has
not been addressed at scale until very recently. Work is now starting with global
collaborations through the Systematized Nomenclature of Medicine (SNOMED)
[43–46] and FHIR [47] which will start to address these issues and greatly con-
tribute to the quality and granular structuring of clinical data available for analysis.
It is hoped that this work will also see an end to 30 years of coding wars, resolving
the confusion between a nomenclature (e.g. SNOMED) and a classification such as
the International Classification of Diseases (ICD) [48], when it is realised once and
for all that both can exist to perform different but related (and sometimes mapped)
tasks. For the first time since the inception of digital clinical records, there is an
alignment of the technical aspects surrounding them.
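The nomenclature/classification distinction can be made concrete with a small sketch: a single clinical statement carrying both a point-of-care SNOMED CT code and a mapped ICD-10 code, shaped as a FHIR R4 Condition fragment. The resource shape and system URIs follow the FHIR specification; the codes shown are the standard ones for type 2 diabetes, and the fragment is hand-built for illustration rather than produced by any particular system.

```python
# A minimal, hand-built FHIR R4 Condition fragment. One clinical statement
# carries two codings: a SNOMED CT code (nomenclature, used at the point
# of care) and a mapped ICD-10 code (classification, used for reporting).
condition = {
    "resourceType": "Condition",
    "code": {
        "coding": [
            {"system": "http://snomed.info/sct",
             "code": "44054006",
             "display": "Diabetes mellitus type 2"},
            {"system": "http://hl7.org/fhir/sid/icd-10",
             "code": "E11",
             "display": "Type 2 diabetes mellitus"},
        ],
        "text": "Type 2 diabetes",
    },
}

# Both terminologies coexist in one record, performing different tasks.
systems = [c["system"] for c in condition["code"]["coding"]]
print(systems)
```

Analytics can then aggregate by ICD class while clinical decision support reasons over the finer-grained SNOMED concept, with no loss of meaning in the sharing process.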
Along with the issues above, basic provision of appropriate hardware, connec-
tivity and service availability should not be underestimated even in countries
thought to be at an advanced stage in their development. For example, in a shared
record, it is vital that all parties are using the same (and hopefully most recent)
release of the terminology so that gaps and inadvertent changes in meaning do not
occur. If you are disconnected from the record while seeing the patient, what
happens to the data when you reconnect? Who is responsible and accountable for
the orchestration of care to the best possible standard? How are the data maintained
accurately when there may be multiple authors?
Some countries acknowledge this issue and are addressing it with syndication
and ontology services [49, 50]. Clinical systems around the world are still largely
proprietary in their coding systems for the capture of data and in the data models
they use to capture and reproduce these data items. However,
this is changing in some countries. The UK mandated SNOMED CT implemen-
tation in Primary care by April 2018 and has plans for Secondary care to follow by
2020 [51]. New Zealand has started its migration from the obsolete READ
standard to standardise on SNOMED CT [52]. This use of clinical terminology at
the point of care is catalysing an increased understanding that this approach,
through the uptake of Professional Record Standards, can, for the first time, start to
deliver fit-for-purpose interoperable records.
The UK has recently become the first to take a professional standards approach
with the inception of the Professional Records Standards Body [53]. Without this
offered in these new delivery paradigms; “quis custodiet ipsos custodes?” It will
also be vital to address whether the workforce and public are ready to accept the end
of paternalism and place their trust in shared decision-making and interpretation.
4 Trust/Privacy/Governance
The availability of data and its use and reuse depends upon a level of trust afforded
to the custodians of the data. Indeed, good medical care is often equated with this
trust-based relationship. Sadly, several early implementations of Big Data have very
publicly abused this trust relationship through the naivety of their approach. In
some cases it seems commercial pressures may have also clouded judgement.
Notable recent examples are not difficult to find.
The UK NHS Royal Free Hospital and Google Deep-Mind collaboration was
reported as “Royal Free breached UK data law in 1.6 m patient deal after only
7 months into the contract” [63–67]. Also in the UK, the care.data project, which
was supposed to be the flagship of NHS Information technology was widely
reported as a ‘debacle’ when it was summarily closed down in 2016 [68–70]. This
followed years of professional anxiety about trust and covert agendas [71–74].
Large-scale errors and naivety are not solely confined to the UK Government. In
2016 the Australian Department of Human Services (DHS) publicly released a
‘de-identified’ dataset containing 3 million patients’ data stretching back 30 years
[75]. It was taken down a few months later when local researchers successfully
re-identified some of the people whose data was published [76]. This issue
highlights what will be a continuing problem: large datasets, however well
‘anonymised’, are always vulnerable to re-identification if they are of large enough scale.
Just as encryption protocols are always at risk of being broken (whether through
quantum computing [77, 78] or some other technique), the issue of re-identification
of large datasets is one which will not go away. We can only hope to avoid
disadvantage through ‘controlled’ and transparent processes, and by ensuring that
data subjects’ permission is both sought and obtained [79, 80].
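One simple way to reason about re-identification risk, sketched here with entirely illustrative records and quasi-identifiers, is a k-anonymity check: a release is k-anonymous if every combination of quasi-identifier values appears at least k times, and a k of 1 means some record is unique and therefore linkable back to a person.

```python
from collections import Counter

# Minimal k-anonymity check over chosen quasi-identifiers. The records
# below are invented for illustration; real releases involve millions of
# rows, where rare combinations are easy to overlook.
def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest equivalence class formed by the
    quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"postcode": "2000", "birth_year": 1960, "sex": "F", "diagnosis": "HD"},
    {"postcode": "2000", "birth_year": 1960, "sex": "F", "diagnosis": "T2D"},
    {"postcode": "2010", "birth_year": 1985, "sex": "M", "diagnosis": "asthma"},
]

# k = 1: the single 2010/1985/M record is unique, hence re-identifiable
# by linkage to any public source carrying the same three attributes.
print(k_anonymity(records, ["postcode", "birth_year", "sex"]))  # 1
```

Checks like this mitigate but do not eliminate the risk: as the Australian DHS case shows, attackers can bring auxiliary data the releasing agency never considered.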
Another instance of recent issues in the use of big data relates again to the UK
where an application using AI algorithms to interview patients was put to use as an
NHS-branded resource. Videos published online show the system, live and in
public use, making potentially fatal errors of ‘diagnosis’ [81]. Currently, there is
no published evidence available that the AI was validated through anything other
than internal and offline testing before going live [82–85].
The trust issue is even more important with the recent announcements in the
USA where large corporate interests such as Warren Buffett, JP Morgan, Amazon
[86] and also CVS and Aetna [87] are to merge and become healthcare providers
covering, and in all likelihood dominating, large populations in the USA. Each has
large databases with information gained in their commercial activities. Aggregated,
these data sources could be a great source for good or a massive risk to privacy for a
significant proportion of the United States population that have used a pharmacy, a
credit card or done online shopping [88].
Each of these examples shows how easy it is to misuse data and abuse consumer
trust through activities with large datasets, which although well intentioned, are just
not well thought through. Once out of the bottle, the data genie, and the disadvantage
it brings, will not easily go back in. This problem is especially acute with genomic
information. With the potential for a genome to be known not long after conception
and certainly from birth onwards, the potential to disadvantage a person’s entire
life becomes a distinct possibility.
Finally in this section, we should examine clinical leadership in this space and
the failure of the professions to fully engage with the information agenda. We have
seen and examined the significant but still insufficient investment in infrastructure
for an industry where the IT is both mission and safety critical [89–91]. A similar
investment gap exists with clinical informatics. It is only recently that this became a
valid career choice in the UK [92]. There are only a few countries that support fully
accredited structures for clinicians to pursue a career in clinical informatics without
compromising their registration (where revalidation also exists).
This brief exploration of some of the clinical aspects of big data has examined the
hype, the real potential benefits and also the potential pitfalls. If we can address the
challenging issues globally then there is no doubt that the benefits will be
significant. The future, in which published evidence can be immediately tested
against a global-sized database and changes to care pathways and plans are
suggested by ever more sophisticated AI backed by deep learning, may be a little
way off, but the first steps have been taken [67, 93, 94]. An approach ensuring high
quality structured and coded data [52, 95, 96] is one which should be taken.
Uniquely identifying data subjects is essential for precision. The UK [97, 98],
Australia [99, 100], New Zealand [101, 102] and Nordic [103] countries have all
mandated this approach, and the USA and others may follow shortly [104,
105]. What needs to occur is global in nature. It has far-reaching implications in the
digital capture of all personal data whether this is for clinical care, illness prevention
or wellness promotion. This vision cannot be achieved at small scale. Global level
coordination and leadership is needed now if we are going to meet the challenge of
big data [106]. In this way we may be ready to address the well documented
challenges of aged care, increased expectation of care, safety of care and budgetary
restriction coupled with a reduction in the availability of a skilled workforce at the
same time [107]. Coupled with this global approach will need to be a sustained and
appropriate level of investment in people and in workflow-sensitive, interoperable,
precision systems to capture and report on clinical data captured at the point of care
and need.
6 The Future
References
1. Big data, big hype? [Internet] (2014) [cited 24 Feb 2018]. Available from: https://www.
wired.com/insights/2014/04/big-data-big-hype/
2. Hurwitz J, Nugent A, Halper F, Kaufman M (2013) Big data for dummies, 1st edn
3. Adamson D (2015) Big data in healthcare made simple [Internet]. Health Catalyst [cited 24 Feb
2018]. Available from: https://www.healthcatalyst.com/big-data-in-healthcare-made-simple
4. Bate A, Reynolds RF, Caubel P (2018) The hope, hype and reality of big data for
pharmacovigilance. Ther Adv Drug Saf 9(1):5–11
5. Anonymous (2008) Chapter 67: children, young people and attitudes to privacy [Internet].
Australian Privacy Law and Practice (ALRC report 108) [cited 25 Feb 2018]. Available from:
https://www.alrc.gov.au/publications/For%20Your%20Information%3A%20Australian%20
Privacy%20Law%20and%20Practice%20%28ALRC%20Report%20108%29%20/67-childre
6. Collier R (2012) Medical privacy breaches rising. CMAJ 184(4):E215–E216
7. Keen PGW (1980) Decision support systems: a research perspective [Internet]. [cited 24 Feb
2018]. Available from: https://dspace.mit.edu/handle/1721.1/47172
8. Jugulum R (2016) Importance of data quality for analytics. In: Quality in the 21st century.
Springer, Cham, pp 23–31
9. Cai L, Zhu Y (2015) The challenges of data quality and data quality assessment in the big
data era. Data Sci J 14:2
10. Middleton B, Bloomrosen M, Dente MA, Hashmat B, Koppel R, Overhage JM et al (2013)
Enhancing patient safety and quality of care by improving the usability of electronic health
record systems: recommendations from AMIA. J Am Med Inform Assoc 20(e1):e2–e8
11. Novas C, Rose N (2000) Genetic risk and the birth of the somatic individual. Econ Soc
29(4):485–513
12. Sermon K, Goossens V, Seneca S, Lissens W, De Vos A, Vandervorst M et al (1998)
Preimplantation diagnosis for Huntington’s disease (HD): clinical application and analysis of
the HD expansion in affected embryos. Prenat Diagn 18(13):1427–1436
13. Sini E (2016) How big data is changing healthcare.pdf [Internet]. Humanitas Hospital Italy.
Available from: https://www.eiseverywhere.com/file_uploads/9b7793c3ad732c28787b2a8
bc0892c31_Elena-Sini_How-Big-Data-is-Changing-Healthcare.pdf
14. Big opportunities, big challenges [Internet]. [cited 25 Feb 2018]. Available from: http://
www.ey.com/gl/en/services/advisory/ey-big-data-big-opportunities-big-challenges
15. Bellazzi R (2014) Big data and biomedical informatics: a challenging opportunity. Yearb
Med Inform 22(9):8–13
16. The big-data revolution in US health care: accelerating value and innovation [Internet].
[cited 18 Dec 2017]. Available from: https://www.mckinsey.com/industries/healthcare-
systems-and-services/our-insights/the-big-data-revolution-in-us-health-care
17. Grissinger M (2010) The five rights: a destination without a map. Pharm Ther 35(10):542
18. Polubriaginof F, Tatonetti NP, Vawdrey DK (2015) An assessment of family history
information captured in an electronic health record. AMIA Annu Symp Proc 5(2015):2035–
2042
19. Nathan PA, Johnson O, Clamp S, Wyatt JC (2016) Time to rethink the capture and use of
family history in primary care. Br J Gen Pract 66(653):627–628
20. Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J et al (2015) DEEPEN: a
negation detection system for clinical text incorporating dependency relation into NegEx.
J Biomed Inform 1(54):213–219
21. Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D et al (2014) Negation’s not solved:
generalizability versus optimizability in clinical natural language processing. PLoS One
9(11):e112774
22. Ford EW, Menachemi N, Phillips MT (2006) Predicting the adoption of electronic health
records by physicians: when will health care be paperless? J Am Med Inform Assoc 13(1):
106–112
23. Warner JL, Jain SK, Levy MA (2016) Integrating cancer genomic data into electronic health
records. Genome Med 8(1):113
24. Richard Lilford AM (2012) Looking back, moving forward [Internet]. University of
Birmingham [cited 17 Oct 2017]. Available from: https://www.birmingham.ac.uk/
Documents/college-mds/haps/projects/cfhep/news/HSJ.pdf
25. Wood WA, Bennett AV, Basch E (2015) Emerging uses of patient generated health data in
clinical research. Mol Oncol 9(5):1018–1024
26. Haghi M, Thurow K, Stoll R (2017) Wearable devices in medical internet of things:
scientific research and commercially available devices. Healthc Inform Res 23(1):4–15
27. Montgomery K, Chester J (2017) Health wearable devices in the big data era: ensuring
privacy, security, and consumer protection. American University, Washington
28. Zhu H, Colgan J, Reddy M, Choe EK (2016) Sharing patient-generated data in clinical
practices: an interview study. AMIA Annu Symp Proc 2016:1303–1312
29. Cohen DJ, Keller SR, Hayes GR, Dorr DA, Ash JS, Sittig DF (2016) Integrating
patient-generated health data into clinical care settings or clinical decision-making: lessons
learned from project healthdesign. JMIR Hum Factors 3(2):e26
30. Burn J (2013) Should we sequence everyone’s genome? Yes. BMJ 21(346):f3133
31. Herper M (2017) Illumina promises to sequence human genome for $100—but not quite yet.
Forbes Magazine [Internet]. [cited 25 Feb 2018]. Available from: https://www.forbes.com/
sites/matthewherper/2017/01/09/illumina-promises-to-sequence-human-genome-for-100-but-
not-quite-yet/
32. Rochman B (2017) Full genome sequencing for newborns raises questions. Scientific
American [Internet]. [cited 25 Feb 2018]. Available from: https://www.scientificamerican.
com/article/full-genome-sequencing-for-newborns-raises-questions/
33. Rojahn SY (2014) DNA sequencing of IVF embryos. MIT Technology Review [Internet].
[cited 25 June 2018]. Available from: https://www.technologyreview.com/s/524396/dna-
sequencing-of-ivf-embryos/
57. Database normalization and design techniques [Internet] (2008) Barry Wise NJ SEO [cited
25 June 2018]. Available from: http://www.barrywise.com/2008/01/database-normalization-
and-design-techniques/
58. McDonald K (2018) MSIA questions need for minimum functionality requirements
project [Internet]. Pulse+IT [cited 26 Feb 2018]. Available from: https://www.
pulseitmagazine.com.au:443/news/australian-ehealth/4171-msia-questions-need-for-minimum-
functionality-requirements-project
59. GP2GP [Internet]. [cited 15 Sep 2017]. Available from: https://digital.nhs.uk/gp2gp
60. DSCN 09/2010 initial standard—ISB—patient banner [Internet]. [cited 27 Feb 2018].
Available from: http://webarchive.nationalarchives.gov.uk/+http://www.isb.nhs.uk/documents/
isb-1505/dscn-09-2010/index_html
61. Common User Interface (CUI) [Internet]. [cited 07 Dec 2018]. Available from: https://
webarchive.nationalarchives.gov.uk/20160921150545/http://systems.digital.nhs.uk/data/cui/
uig
62. National guidelines for on-screen display of medicines information | Safety and Quality
[Internet]. [cited 26 Feb 2018]. Available from: https://www.safetyandquality.gov.au/our-
work/medication-safety/electronic-medication-management/national-guidelines-for-on-screen-
display-of-medicines-information/
63. DeepMind-Royal Free deal is “cautionary tale” for healthcare in the algorithmic age
[Internet] (2017) University of Cambridge [cited 23 Feb 2018]. Available from: http://www.
cam.ac.uk/research/news/deepmind-royal-free-deal-is-cautionary-tale-for-healthcare-in-the-
algorithmic-age
64. Hodson H (2016) Revealed: Google AI has access to huge haul of NHS patient data. New
Scientist [Internet]. [cited 23 Feb 2018]. Available from: https://www.newscientist.com/
article/2086454-revealed-google-ai-has-access-to-huge-haul-of-nhs-patient-data/
65. Basu S. Should the NHS share patient data with Google’s DeepMind? [Internet].
WIRED UK [cited 19 Feb 2018]. Available from: http://www.wired.co.uk/article/nhs-
deepmind-google-data-sharing
66. Vincent J (2017) Google’s DeepMind made “inexcusable” errors handling UK health data,
says report [Internet]. The Verge [cited 15 Nov 2017]. Available from: https://www.
theverge.com/2017/3/16/14932764/deepmind-google-uk-nhs-health-data-analysis
67. Powles J, Hodson H (2017) Google DeepMind and healthcare in an age of algorithms.
Health Technol 7(4):351–367
68. How the NHS got it so wrong with care.data [Internet] (2016) [cited 19 Feb 2018]. Available
from: http://www.telegraph.co.uk/science/2016/07/07/how-the-nhs-got-it-so-wrong-with-caredata/
69. Temperton J. NHS care.data scheme closed after years of controversy [Internet].
WIRED UK [cited 15 Sep 2017]. Available from: http://www.wired.co.uk/article/care-
data-nhs-england-closed
70. NHS (2013) NHS England sets out the next steps of public awareness about care.data
[Internet]. [cited 15 Sep 2017]. Available from: https://www.england.nhs.uk/2013/10/care-
data/
71. van Staa T-P, Goldacre B, Buchan I, Smeeth L (2016) Big health data: the need to earn
public trust. BMJ 14(354):i3636
72. McCartney M (2014) Care.data doesn’t care enough about consent. BMJ 348:g2831
73. Godlee F (2016) What can we salvage from care.data? BMJ 354:i3907
74. Mann N (2016) Learn from the mistakes of care.data. BMJ 354:i4289
75. Cowan P. Govt releases billion-line “de-identified” health dataset [Internet]. iTnews [cited
18 Feb 2018]. Available from: http://www.itnews.com.au/news/govt-releases-billion-line-
de-identified-health-dataset-433814
76. Lubarsky B (2017) Re-identification of “anonymized” data. Georgetown Law Technol Rev
12:202–212
77. Why quantum computers might not break cryptography | Quanta Magazine [Internet].
Quanta Magazine [cited 25 Feb 2018]. Available from: https://www.quantamagazine.org/
why-quantum-computers-might-not-break-cryptography-20170515/
78. Bernstein DJ, Heninger N, Lou P, Valenta L (2017) Post-quantum RSA. In: Post-quantum
cryptography. Lecture notes in computer science. Springer, Cham, pp 311–329
79. Wan Z, Vorobeychik Y, Xia W, Clayton EW, Kantarcioglu M, Malin B (2017) Expanding
access to large-scale genomic data while promoting privacy: a game theoretic approach.
Am J Hum Genet 100(2):316–322
80. Malin B, Sweeney L (2004) How (not) to protect genomic data privacy in a distributed
network: using trail re-identification to evaluate and design anonymity protection systems.
J Biomed Inform 37(3):179–192
81. Murphy D (2017) @CareQualityComm—this is one of the triages relating to the 48yr old
30/day smoker woken from sleep with chest pain. It is now updated. pic.twitter.com/
BJG27sft4J [Internet]. @DrMurphy11 [cited 27 Feb 2018]. Available from: https://twitter.
com/DrMurphy11/status/848110663054622721
82. Middleton K, Butt M, Hammerla N, Hamblin S, Mehta K, Parsa A (2016) Sorting out
symptoms: design and evaluation of the “babylon check” automated triage system [Internet].
arXiv [cs.AI]. Available from: http://arxiv.org/abs/1606.02041
83. Crouch H (2017) Babylon health services says it has “duty” to point out CQC
“shortcomings” [Internet]. Digital Health [cited 18 Feb 2018]. Available from: https://
www.digitalhealth.net/2017/12/babylon-health-services-says-duty-point-cqc-shortcomings/
84. McCartney M (2017) Margaret McCartney: innovation without sufficient evidence is a
disservice to all. BMJ 5(358):j3980
85. Ogden J (2016) CQC and BMA set out their positions on GP inspections. Prescriber 27
(6):44–48
86. Dent S (2018) Amazon gets into healthcare with Warren Buffet and JP Morgan [Internet].
Engadget [cited 25 Feb 2018]. Available from: https://www.engadget.com/2018/01/30/
amazon-healthcare-warren-buffet-jpmorgan-chase/
87. Terlep S (2017) The real reason CVS wants to buy Aetna? Amazon.com. WSJ Online
[Internet]. [cited 25 Feb 2018]; Available from: https://www.wsj.com/articles/the-real-
reason-cvs-wants-to-buy-aetna-amazon-com-1509057307
88. Blumenthal D (2017) Realizing the value (and profitability) of digital health data. Ann Intern
Med 166(11):842–843
89. How much should small businesses spend on IT annually? [Internet] (2015) Optimal
Networks [cited 26 Feb 2018]. Available from: https://www.optimalnetworks.com/2015/03/
06/small-business-spend-it-annually/
90. Atasoy H, Chen P-Y, Ganju K (2017) The spillover effects of health IT investments on
regional healthcare costs. Manage Sci [Internet]. Available from: https://doi.org/10.1287/
mnsc.2017.2750
91. Appleby J, Gershlick B (2017) Keeping up with the Johanssons: how does UK health
spending compare internationally? BMJ 3(358):j3568
92. Williams J, Bullman D (2018) The faculty of clinical informatics [Internet]. FCI [cited 26
Feb 2018]. Available from: https://www.facultyofclinicalinformatics.org.uk/
93. Klasko SK (2017) Interview with Deborah DiSanzo of IBM Watson health. Healthc
Transform 2(2):60–70
94. Fogel AL, Kvedar JC (2018) Artificial intelligence powers digital medicine. NPJ Digit Med
1(1):5
95. Personalised health and care 2020 [Internet]. GOV.UK [cited 25 June 2018]. Available from:
https://www.gov.uk/government/publications/personalised-health-and-care-2020
96. Spencer SA (2016) Future of clinical coding. BMJ 26(353):i2875
97. McBeth R (2015) NHS number use becomes law | Digital Health [Internet]. Digital Health.
[cited 15 Nov 2017]. Available from: https://www.digitalhealth.net/2015/10/nhs-number-
use-becomes-law/
98. NHS number [Internet]. [cited 15 Sep 2017]. Available from: https://digital.nhs.uk/NHS-
Number
Aude Motulsky
What if we had access to real-life data about how medications are prescribed,
dispensed, administered and taken? What if we had the ability to capture the
consequences associated with medication use, both intended (relieve symptoms or
treat and prevent diseases) and unintended (side effects, adverse events) not only
from clinical trials and anecdotal experiences, but from large cohorts of patients
with various characteristics (age, gender, ethnic origins, socioeconomic character-
istics, etc.)? We would then be able to assess effectiveness and safety of medica-
tions from a population perspective, better understand drivers of prescribing
practices and consumption behaviors (e.g. adherence), and inform the
decision-making processes of policy makers, clinicians and patients by providing
them with the risk-benefit ratio of each drug (the added value), driven by real-life
data, and adapted to their local or individual characteristics [1, 2]. These are the
promises of Big Data from a pharmacy perspective: to close the gap between
science and practice surrounding medications and provide a personalized answer to
the question: “Should I take (or prescribe, or cover) this medication?”
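The population-level safety question imagined above can be sketched, under heavily simplified assumptions and with entirely hypothetical records, as a crude adverse-event rate computed from linked dispensing and outcome data:

```python
# Sketch of a population-level safety signal: a crude adverse-event rate
# for a drug, computed from hypothetical linked dispensing/outcome
# records. Real pharmacovigilance requires confounder adjustment,
# stratification and causality assessment; this is the arithmetic core only.
def adverse_event_rate(records, drug):
    """Proportion of patients dispensed `drug` who recorded an adverse event."""
    exposed = [r for r in records if r["drug"] == drug]
    if not exposed:
        return 0.0
    events = sum(1 for r in exposed if r["adverse_event"])
    return events / len(exposed)

cohort = [
    {"drug": "atorvastatin", "adverse_event": False},
    {"drug": "atorvastatin", "adverse_event": True},
    {"drug": "atorvastatin", "adverse_event": False},
    {"drug": "ramipril",     "adverse_event": False},
]

# One event among three exposed patients: a crude rate of 1/3.
print(adverse_event_rate(cohort, "atorvastatin"))
```

The promise of big data in pharmacy is that this denominator becomes an entire population rather than a trial cohort, so the rate can be re-estimated for any age group, ethnicity or comorbidity profile.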
A. Motulsky (&)
Department of Management, Evaluation and Health Policy,
School of Public Health, Academic Health Center of the
Université de Montréal, Université de Montréal, Montréal, Canada
e-mail: aude.motulsky@umontreal.ca
Medications are highly complicated because they change so quickly. New medi-
cations enter and others are withdrawn from the market monthly, with different
trends in different jurisdictions. It is difficult to find another health-related concept
that is so volatile and locally situated. The first entry point for the approval of any
medication is the regulatory agency in a given jurisdiction, such as Health Canada,
the Food and Drug Administration (FDA) in the USA and the European Medicines
Agency (EMA) in Europe. These agencies maintain lists of the medications approved
in their jurisdictions, with related numeric codes and non-standardized descriptors, called a drug
catalogue (Table 1). These codes are always at the brand level, i.e. describing a
product on the market that may contain more than one active molecule. Each new
brand, whether from a generic or an innovator company, will have to go through the
¹ Between different jurisdictions, there is no standard terminology to describe the electronic record
applications that are used by clinicians to replace paper charts. In this chapter, we will use the term
electronic health record (EHR) to describe the computerized system that is replacing the paper
chart in health care organizations, with features such as electronic clinical documentation and
prescribing (including primary care and acute care settings). It is used as a synonym for electronic
medical record (EMR).
Big Data Challenges from a Pharmacy Perspective 35
Table 1 An example of the Canadian drug catalogue: drug identification numbers (DINs) for
selected atorvastatin oral tablets

DIN      | Product name     | Company | Active ingredient | Strength (mg) | Pharmaceutical form | Route
02230711 | Lipitor          | Pfizer  | Atorvastatin      | 10            | Tablet              | Oral
02295261 | Apo-atorvastatin | Apotex  | Atorvastatin      | 10            | Tablet              | Oral
02295288 | Apo-atorvastatin | Apotex  | Atorvastatin      | 20            | Tablet              | Oral
02348713 | Atorvastatin     | Sanis   | Atorvastatin      | 20            | Tablet              | Oral
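As an illustration only (the field names are ours, not the catalogue's actual schema), the rows of Table 1 can be held as simple records, which makes the brand-level nature of DINs explicit: one active molecule maps to several codes.

```python
# Rows of Table 1 as plain records (a sketch, not an official schema).
catalogue = [
    {"din": "02230711", "product": "Lipitor", "company": "Pfizer",
     "ingredient": "Atorvastatin", "strength_mg": 10, "form": "Tablet", "route": "Oral"},
    {"din": "02295261", "product": "Apo-atorvastatin", "company": "Apotex",
     "ingredient": "Atorvastatin", "strength_mg": 10, "form": "Tablet", "route": "Oral"},
    {"din": "02295288", "product": "Apo-atorvastatin", "company": "Apotex",
     "ingredient": "Atorvastatin", "strength_mg": 20, "form": "Tablet", "route": "Oral"},
    {"din": "02348713", "product": "Atorvastatin", "company": "Sanis",
     "ingredient": "Atorvastatin", "strength_mg": 20, "form": "Tablet", "route": "Oral"},
]

# DINs are assigned at the brand level, so a single molecule has many DINs.
dins = [r["din"] for r in catalogue if r["ingredient"] == "Atorvastatin"]
print(len(dins))  # -> 4
```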
[Figure (partial): the patient: intake, expected effects (indication), actual and
perceived effects (signs and symptoms, side effects, adverse events)]
The concepts related to the medication include information about the product’s
name, the molecule or molecules found within the product, and the formulation. In
pharmaceutical terms, formulation refers to the way a medication is prepared to be
administered. Hence, the form refers to what is held in the hands, such as tablets,
capsules, solutions, powders for inhalation, etc. In some cases, it may also include
information about the containers that are utilized to administer the product: inhalers,
syringes, cartridges, transdermal patches, rings, etc. And, most of the time, it is
strongly linked to the route of administration of the product, because the excipients
that are used to ensure the molecule is going to be absorbed, without being painful
or uncomfortable, are adapted to the way the medication will be administered (e.g.
orally, topically, in the eye, injected). But the route is ultimately related to the
prescription and what is administered to the patient. Hence, the route is not only
determined by the formulation and many scenarios are possible for a given form.
For example, pills which are normally taken orally can be administered intrav-
aginally (e.g. misoprostol), and eye drops can be administered orally (e.g. atropine).
Finally, the strength represents the amount of the molecule found in a given
quantity of the product, expressed in defined units. Tables 2 and 3 present different
medication-related concepts and examples of their labels in different terminologies.
The border is blurry between these concepts, and they are usually grouped
together in a way that makes sense for the purposes of their utilization in different
drug-related terminologies. For example, in RxNorm, the drug terminology
developed and maintained by the National Library of Medicine in the USA, the
drug name is always linked to the route, and the strength is always linked to the
form to support the electronic prescribing process.
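A rough sketch of this grouping, assuming nothing beyond what is described above (the class and field names are invented for illustration and are not RxNorm's actual schema): strength sits with the ingredient-level component, and the named drug carries a dose form that implies a usual route.

```python
from dataclasses import dataclass

# Hypothetical record types illustrating how drug-related concepts can be
# grouped for e-prescribing; this is NOT the actual RxNorm schema.

@dataclass(frozen=True)
class DrugComponent:
    ingredient: str      # the active molecule
    strength_mg: float   # strength is tied to the component/form pairing

@dataclass(frozen=True)
class ClinicalDrug:
    components: tuple    # one or more (ingredient, strength) components
    dose_form: str       # e.g. "Oral Tablet"; the form implies the usual route

atorvastatin_10 = ClinicalDrug(
    components=(DrugComponent("Atorvastatin", 10.0),),
    dose_form="Oral Tablet",
)
print(atorvastatin_10.dose_form)  # -> Oral Tablet
```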
Classification systems for medications have been developed to be able to group
similar medications (e.g. based on their chemical structure or pharmacological
action), or medications that are used similarly (e.g. based on their therapeutic
action). Table 3 presents different characteristics that are used to classify medica-
tions. The World Health Organization (WHO) maintains a classification system for
all medications approved around the world, the Anatomical Therapeutic Chemical
(ATC) system, which is preferred for comparative purposes between jurisdictions
(https://www.whocc.no/atc_ddd_index/). Many other classification systems are
available, such as the American Hospital Formulary Service (AHFS) and the
British National Formulary (BNF) systems, all based on their own logic.
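ATC codes encode their hierarchy positionally: one letter for the anatomical main group, two digits for the therapeutic subgroup, one letter each for the pharmacological and chemical subgroups, and two final digits for the substance. The grouping levels can therefore be recovered by simple prefix slicing; a minimal sketch (the helper function is ours):

```python
def atc_levels(code):
    """Split an ATC code into its five hierarchical levels by prefix length.
    Example: C10AA05 (atorvastatin)
      C       anatomical main group (cardiovascular system)
      C10     therapeutic subgroup (lipid modifying agents)
      C10A    pharmacological subgroup
      C10AA   chemical subgroup (HMG-CoA reductase inhibitors)
      C10AA05 chemical substance (atorvastatin)
    """
    return [code[:n] for n in (1, 3, 4, 5, 7) if len(code) >= n]

print(atc_levels("C10AA05"))  # -> ['C', 'C10', 'C10A', 'C10AA', 'C10AA05']
```

Grouping prescriptions by a shorter prefix (e.g. C10) is what enables the between-jurisdiction comparisons mentioned above.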
Prescription-related data are even more complicated. Here, medication data are
contextualized for a given patient at a given point in time. According to the ISO
standard on medication management concepts, a prescription represents: (1) an
instruction by a health care provider; (2) a request to dispense; and (3) advice to
patients about their treatment [5]. It may include different information related to
taking or administering the medication (the regimen), and to the duration of the
treatment (Table 4). Again, this occurs without a standard method of referring to
these concepts or of structuring them in an electronic format. Variables related to
the regimen are necessary in order to calculate the daily dose that a patient receives,
while variables related to the duration are important to estimate the exposure of a
patient to a given medication over time (and also to estimate the daily dose when
the instructions are not available). Sources for prescription-related data are diverse,
with their own specificities that are important to highlight.
Prescription-related data may come from what was given to a health care provider
using the “professional” way of writing instructions (e.g. 1 CO TID), but could also
come from what was given to a patient, where instructions are translated into
patient-friendly language (e.g. take one tablet three times a day). At this time, there is
no standard in North America regarding the instruction field structure, and wide
Fig. 2 Prescription related databases, per step of the medication management process, and their
associated risk of errors when estimating medication exposure. EHR = electronic health record;
eMAR = electronic medication administration record
Retail pharmacy was one of the first health care sectors to computerize its
activities, beginning in the mid-1980s. Developed primarily for billing purposes,
pharmacy management systems have allowed for the creation of large databases of
The WHO has established defined daily doses (DDDs): the assumed average daily
dose for a given indication and a given route of administration, for each molecule.
For example, the DDD for oral hydromorphone when used for pain is 20 mg, while
the DDD for rectal and injectable hydromorphone is 4 mg, because the bioavailability
of the drug is higher when administered intravenously or rectally (i.e. to achieve the
same concentration in the blood, 20 mg is needed orally while only 4 mg is needed
through the other routes). This is because the absorption of the drug through the gut
is never 100%, and because the drug usually passes through the liver before reaching
the systemic circulation, leading to what is called the first-pass effect, which is
avoided when the drug is taken rectally or injected directly into a vein. Reporting
daily doses as a proportion of the DDD is a
standard way to estimate the magnitude of exposure to a medication. In an ideal
world, it would need to be combined with BMI, renal and liver functions, and
genotypes of a given patient to be able to better estimate this exposure in relation to
the pharmacokinetics of the drug in a given individual (and thus the blood con-
centration of this drug).
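A minimal sketch of this normalization, using the hydromorphone DDDs quoted above (the lookup table and function name are ours; a real analysis would draw DDD values from the WHO ATC/DDD index rather than hard-code them):

```python
# Hypothetical lookup: (molecule, route) -> WHO defined daily dose in mg.
# The two entries are the hydromorphone examples quoted in the text.
DDD_MG = {
    ("hydromorphone", "oral"): 20.0,
    ("hydromorphone", "parenteral"): 4.0,
}

def ddd_ratio(molecule, route, daily_dose_mg):
    """Express a patient's daily dose as a proportion of the WHO DDD."""
    return daily_dose_mg / DDD_MG[(molecule, route)]

# 10 mg/day oral and 2 mg/day injected are both 0.5 DDD,
# despite the very different raw doses.
print(ddd_ratio("hydromorphone", "oral", 10.0))        # -> 0.5
print(ddd_ratio("hydromorphone", "parenteral", 2.0))   # -> 0.5
```

This is what makes exposure comparable across routes and across jurisdictions.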
Calculating the daily dose is complicated. It can be estimated from the
instructions (1 mg twice a day will result in 2 mg as a daily dose), or the duration
for a given quantity of a given product (30 pills of 1 mg dispensed for a duration of
15 days will give a daily dose of 2 mg). However, instructions are rarely available
in a standard and structured format that would make this calculation straightforward
[13]. Quantity is usually available, but needs to be mapped to the duration to make
sense, especially in some countries such as France where the quantity dispensed is
rarely aligned with what is needed by the patient for a given treatment as it is
restricted by available packaging. Typically, a French pharmacist will dispense the
smallest format available (e.g. a box of 28 pills) to a patient, even if the prescription
is written for 1 pill per day for 5 days. Using the quantity might lead to an incorrect
analysis of prescribing/dispensing patterns if the duration is not taken into
consideration.
However, the duration might be difficult to assess when the treatment is as
needed, or with a changing dose over time. This is frequent with medication for
pain (e.g. pregabalin), for diabetes (e.g. insulin), or warfarin, where patients will
adjust their daily dose depending on their condition. Thus, estimating the daily dose
would be greatly facilitated by standard and structured instructions, including an
assessment of the chronicity status of the medication (chronic or acute, as needed or
regular), and the stability of the dose over time (e.g. successive dose = take 10 mg
for 10 days and then 20 mg; or alternate dose = take 2 mg Monday Wednesday
Friday and 3 mg other days).
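The two estimation routes described above can be sketched as follows (the function names are ours; both reproduce the worked examples in the text):

```python
def daily_dose_from_sig(dose_mg, times_per_day):
    """Daily dose from structured instructions: 1 mg twice a day -> 2 mg/day."""
    return dose_mg * times_per_day

def daily_dose_from_dispensing(quantity, strength_mg, duration_days):
    """Fallback when instructions are unavailable: daily dose from the quantity
    dispensed over the treatment duration (30 pills of 1 mg over 15 days
    -> 2 mg/day). As the French packaging example shows, this is misleading
    when the dispensed quantity reflects package size, not the prescription."""
    return quantity * strength_mg / duration_days

print(daily_dose_from_sig(1.0, 2))              # -> 2.0
print(daily_dose_from_dispensing(30, 1.0, 15))  # -> 2.0
```

Neither estimator copes with as-needed use or doses that change over time, which is exactly why the chronicity and dose-stability fields discussed above matter.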
Ultimately, the core of the potential of Big Data, and also of its challenges, rests with
the patient. This is where the potential for Big Data is disruptive, but will only be
actualized if data is captured relating to the reason the patient is prescribed the
medication (the indication), and what the impact of taking the medication is for a
given patient over time (both expected and unexpected). This is where potential
adverse drug events can be captured prospectively, as well as where real-life drug
effectiveness can be aligned with the practices of prescribers and patients.
Observational studies through Big Data may be the key in assessing the safety and
effectiveness of all types of prescribing practices, as well as fostering our ability to
understand pharmacogenetic drivers of different responses to drugs based on
individual genotypes. It may revolutionize the way medications are tested, approved,
and continuously evaluated after their approval. It is thus not surprising that major
pharmaceutical companies are investing massively in data analytics departments,
and trying to buy, or create business relationships with EHR and other health-data
owners [14]. It will be important to ensure that academic researchers and public
agencies have the same analytic capabilities as private companies, in terms
of data access, merging, and analysis.
At the moment, the approval process of medications is based on clinical trials,
and only certain indications are evaluated, and thus approved. These indications are
called on-label indications. But what prescribers and patients do after an approval
may be far from what was evaluated in clinical trials [15, 16], and little is known about
the true added value of medications in this context. Similarly, pharmacosurveillance
programs are based on voluntary reporting of adverse events, by patients and health
care providers, and would benefit from proactive surveillance of the actual out-
comes associated with exposure to medication, flagging potential patterns. But the
missing link is exactly there: we need to find a way to identify outcomes associated
with medication use, both intended and unintended. To do that, we need to find a
way to map signs and symptoms of patients to medication usage in both directions:
from the indication—or health concern—that the prescriber is trying to address with
a medication, to the actual consequences for a given patient over time. For example,
nausea can be a health concern for which a medication is prescribed, but it can
also be a side effect of medications, and this needs to be captured
electronically. However, the indication is rarely documented with the prescription,
and no standard terminology is available to document medication-related indica-
tions [17]. Many pilot projects are ongoing, primarily in the USA, to incorporate
indication as a mandatory field when prescribing medications [18], and even to start
the prescribing process with selecting the indication rather than the medication [19].
However, this is far from being the norm in the prescribing process. Similarly,
diagnosis, health problems or health concerns may be documented in electronic
records (e.g. using ICD or SNOMED CT as the standards), but signs and symptoms
following medication usage are rarely documented (e.g. when a medication is
stopped because of a side effect reported by the patient). Capturing signs and
symptoms in a standard way, using a common terminology that can be mapped to
medication-related concepts such as indication and side effect, is a priority for
enabling our analytic capacity from a pharmacy perspective.
8 In Conclusion
References
1. McMahon AW, Dal Pan G (2018) Assessing drug safety in children—the role of real-world
data. N Engl J Med 378(23):2155–2157
2. Schneeweiss S (2014) Learning from big health care data. N Engl J Med 370(23):2161–2163
3. Dhavle AA, Ward-Charlerie S, Rupp MT, Amin VP, Ruiz J (2015) Analysis of national drug
code identifiers in ambulatory e-prescribing. J Manag Care Spec Pharm 21(11):1025–1031
4. Motulsky A, Sicotte C, Gagnon MP, Payne-Gagnon J, Langué-Dubé JA, Rochefort CM,
Tamblyn R (2015) Challenges to the implementation of a nationwide electronic prescribing
network in primary care: a qualitative study of users’ perceptions. J Am Med Inform Assoc 22
(4):838–848
5. ISO/TR 20831:2017 (2017) Health informatics—medication management concepts and
definitions
6. Dhavle AA, Rupp MT (2015) Towards creating the perfect electronic prescription. J Am Med
Inform Assoc 22(e1):e7–e12
7. Dhavle AA, Yang Y, Rupp MT, Singh H, Ward-Charlerie S, Ruiz J (2016) Analysis of
prescribers’ notes in electronic prescriptions in ambulatory practice. JAMA Intern Med 176
(4):463–470
8. Aabenhus R, Hansen MP, Siersma V, Bjerrum L (2017) Clinical indications for antibiotic use
in Danish general practice: results from a nationwide electronic prescription database. Scand J
Prim Health Care 35(2):162–169
9. Ekedahl A, Brosius H, Jönsson J, Karlsson H, Yngvesson M (2011) Discrepancies between
the electronic medical record, the prescriptions in the Swedish national prescription repository
and the current medication reported by patients. Pharmacoepidemiol Drug Saf 20(11):1177–
1183
10. Kivekas E, Enlund H, Borycki E, Saranto K (2016) General practitioners’ attitudes towards
electronic prescribing and the use of the national prescription centre. J Eval Clin Pract 22
(5):816–825
11. Fischer MA, Stedman MR, Lii J, Vogeli C, Shrank WH, Brookhart MA, Weissman JS (2010)
Primary medication non-adherence: analysis of 195,930 electronic prescriptions. J Gen Intern
Med 25(4):284–290
12. Tamblyn R, Eguale T, Huang A, Winslade N, Doran P (2014) The incidence and determinants
of primary nonadherence with prescribed medication in primary care: a cohort study. Ann
Intern Med 160(7):441–450
13. McTaggart S, Nangle C, Caldwell J, Alvarez-Madrazo S, Colhoun H, Bennie M (2018) Use
of text-mining methods to improve efficiency in the calculation of drug exposure to support
pharmacoepidemiology studies. Int J Epidemiol 47(2):617–624
14. Hirschler B (2018) Big pharma, big data: why drugmakers want your health records. Reuters,
1 Mar 2018. https://www.reuters.com/article/us-pharmaceuticals-data/big-pharma-big-data-
why-drugmakers-want-your-health-records-idUSKCN1GD4MM. Accessed on 18 Mar 2018
15. Eguale T, Buckeridge DL, Winslade NE, Benedetti A, Hanley JA, Tamblyn R (2012) Drug,
patient, and physician characteristics associated with off-label prescribing in primary care.
Arch Intern Med 172(10):781–788
16. Eguale T, Buckeridge DL, Verma A, et al (2016) Association of off-label drug use and
adverse drug events in an adult population. JAMA Intern Med 176 (1):55–63
17. Salmasian H, Tran TH, Chase HS, Friedman C (2015) Medication-indication knowledge
bases: a systematic review and critical appraisal. J Am Med Inform Assoc 22(6):1261–1270
18. Galanter WL, Bryson ML, Falck S, Rosenfield R, Laragh M, Shrestha N, Schiff GD,
Lambert BL (2014) Indication alerts intercept drug name confusion errors during comput-
erized entry of medication orders. PLOS ONE 9(7)
19. Schiff GD, Seoane-Vazquez E, Wright A (2016) Incorporating indications into medication
ordering-time to enter the age of reason. N Engl J Med 375(4):306–309
Big Data Challenges from a Public
Health Informatics Perspective
David Birnbaum
Whether the three core functions of public health are called assessment, policy
development and assurance … or assessment, promotion and protection … these
give rise to a wide-ranging set of recognized responsibilities. Specifically, the 10
Essential Public Health Services have been defined as: (1) monitor health status to
identify and solve community health problems; (2) diagnose and investigate health
problems and health hazards in the community; (3) inform, educate, and empower
persons about health issues; (4) mobilize community partnerships to identify and
solve health problems; (5) develop policies and plans that support individual
and community health efforts; (6) enforce laws and regulations that protect health
and ensure safety; (7) link persons to needed personal health services and assure the
provision of health care when otherwise unavailable; (8) assure a competent public
and personal health care workforce; (9) evaluate effectiveness, accessibility, and
quality of personal and population-based health services; and (10) conduct research
for new insights and innovative solutions to health problems [1]. Clearly, this
defines a data-driven mandate.
Public health’s vanguard has moved from an era of relying on receipt of data
through paper forms and telephone notifications, through an era of automated data
transmission into siloes unique to each public health program without interoper-
ability, to reach the point where interoperability between information systems and
expertise in informatics are of paramount importance. Any individual wanting to
D. Birnbaum (✉)
Applied Epidemiology, 609 Cromar Road, North Saanich V8L 5M5, BC, Canada
e-mail: david.birnbaum@ubc.ca
There are obvious benefits to reducing undesirable delays, but on the other hand big
data may be exacerbating what has been called by several authors the tyranny of the
moment. An unintended consequence of technological change over the past decade
has been the constant promise, and then the impatient expectation, of everything
always becoming faster, of “timely” becoming instant while still being accurate.
From the internet to e-mail, and now to the communication of findings from data
mining, what were intended to be time-saving advances can wind up consuming recipients’ lives
to the point of diminishing thoughtful reflection time, accelerating the spread of
confusion rather than enlightenment. When compounded by a 24 h a day delivery
of news by various media outlets and social media, it can seem that getting current
information can never be fast enough and getting credible information can never be
accurate enough. This can challenge the ability of public health communications to
influence public opinion on issues that spread rapidly through social media, all the
while protecting the credibility and trustworthiness of public health agencies
themselves.
One of the major challenges faced by American public health agencies under
their federal government’s Meaningful Use initiative has been inadequate
Important lessons also can be learned from the independent audit of Panorama [11],
a project to develop a seamless national public health information system for
Canada. This Auditor General’s report documents serious problems in all three
aspects audited—functionality, stability and usability, stemming from deficiencies
in project leadership, contract management, system development and accountabil-
ity. It contains quotations regarding benefits from core functionality in the system
produced, and responses from public health agencies and the Ministry of Health to
recommendations made in the audit report, but also notes that Panorama has not
become a national pan-Canadian or even a total provincial pan-British Columbian
information system as originally intended. Started in 2004 and implemented in
2011, Panorama is years late in delivery, significantly over-budget in costs, and for
reasons explained in detail in the report the Auditor General states that “The
ministry’s failure to meet established budgets and deliver the full scope of both
projects indicates that Panorama did not achieve value for money.”
As public health departments acquire the capacity to collect large volumes of
detailed data about individuals, and as database linkage capabilities grow across
the internet, the challenge of balancing legitimate access to information of public
importance against the expectation of patient privacy protection has also
overwhelmed the adequacy of traditional approaches under existing legal authority
[12, 13]. Changes recommended by Information and Privacy Commissioners as
well as public health leaders must be addressed within their respective national and
state or provincial jurisdictions; however, a harmonized international framework is
also needed to ensure compatibility and interoperability between jurisdictions.
Thus, the realm of national politics and international trade agreements is also
germane to the future of public health informatics. Past experience with such
agreements is cause for caution within the public health community [14, 15].
Intrinsic in this aspect is the question of data ownership—whether healthcare
providers and corporate entities own patient care data or simply are stewards of
patient-owned records of care. Also at issue is the question of when, if not whether,
the succession of electronic patient record systems developed to archive these
records for entire populations will satisfy the working needs and expectations of
all stakeholders [16].
Beyond the challenges of collecting big data rests the challenges of analysis and
visualization. The Institute for Health Metrics and Evaluation has been at the
forefront of studying the Global Burden of Disease and exploring ways to visualize
its complex data sets (http://www.healthdata.org/results/data-visualizations). Others
have developed platforms like HealthMap (http://www.healthmap.org/en/) simply
to improve real-time accessibility of “a unified and comprehensive view of the
current global state of infectious diseases and their effect on human and animal”.
Limitations and pitfalls of familiar graphs and charts have been identified by authors
like William Cleveland [17], who in his 1993 book presents tools for visually
encoding and decoding the “hypervariate” and “multiway” data that are more
complex than the more familiar univariate, bivariate and trivariate types of data
often seen. As Cleveland says, “Visualization is critical to data analysis. It provides
a front line of attack, revealing intricate structure in data that cannot be absorbed in
any other way. We discover unimagined effects, and we challenge imagined ones
… When a graph is made, quantitative and categorical information is encoded by a
display method. Then the information is visually decoded. This visual perception is
a vital link. No matter how clever the choice of the information, and no matter how
technologically impressive the encoding, a visualization fails if the decoding fails.
Some display methods lead to efficient, accurate decoding, and others lead to
inefficient, inaccurate decoding.” Modeling is another approach to using data to
inform decisions. Of course not all public health problems need big data to discover
useful answers, but richer data sets may be able to support the creation, refinement
and validation of more meaningful models. As the statistician George
Box cautioned, “All models are wrong but some models are useful” [18]. Modeling
complex feedback-driven health systems spans expertise from healthcare profes-
sions, systems analysts, statisticians, engineers and others. Consider, for example,
how public health and systems science methods were combined to model
the structure and behavior of an entire country’s immunization system [19].
Several countries maintain big data resources available for health service
research. For example, the Canadian Institutes of Health Research (http://www.
cihr-irsc.gc.ca/e/49941.html), the U.S. National Institutes of Health (https://
commonfund.nih.gov/bd2k), the U.S. Department of Health and Human Services
(https://www.healthdata.gov/), the European Union (http://data.europa.eu/euodp/
en/home), the UK Government (https://data.gov.uk/data/search?theme-primary=
Health), etc. Philanthropic foundations also have committed to sharing high quality
data (e.g. https://www.gatesfoundation.org/How-We-Work/General-Information/
7 Conclusion
replaced by one involving analysis at the push of a button on data sets too large to
examine, or use of algorithms that had not been tested for validity to control
automated equipment, errors occurred and harm as well as near-misses resulted
when such error was not immediately recognized [28]. William Vaughan and Paul
Ehrlich have variously been quoted as saying that “To err is human, to really foul
things up requires a computer” (https://quoteinvestigator.com/2010/12/07/foul-
computer/). What, then, should we say about amplifying the power of computers
with big data? Perhaps “The combination of a strong epidemiologic foundation,
robust knowledge integration, principles of evidence-based medicine, and an
expanded translation research agenda can put Big Data on the right course” [29].
There is no denying the potential in big data to improve our understanding of
complex systems, to advance personalized medicine that can improve the safety and
effectiveness of medical therapy, and to improve public health’s ability to inform
decisions that can safeguard population health. However, the path to those benefits
must be navigated with due discipline and caution.
References
Big Data Challenges from a Healthcare Administration Perspective

Donald W. M. Juzwishin
1 Introduction
D. W. M. Juzwishin (✉)
University of Victoria, British Columbia, Canada
e-mail: djuzwishin@uvic.ca
3.1 Definitions
Table 1 describes the role and responsibility of administration and leadership in the
health care system.
The role and responsibility of governance bodies and leadership is to work
together to execute the legal requirements of health care delivery. We will
review the HSO framework to describe and analyze the eight standards in Table 2
by which health care is expected to be delivered to the population. These standards
are generally reflective of the expectations of other accreditation bodies in other
modern democratic and open societies.
In addition to the eight dimensions of excellence in delivering care, four values
are identified as key to express the aspirational relationship between citizens,
patients, health care providers and the governance bodies and administration. The
four values are summarized in Table 3.
Big data can either be a facilitator or a threat to achieving these values. In this
chapter we take a critical look at the hope and hype of big data with a view to
preparing ways that administrators can engage with big data effectively.
The promise of big data for administrators and leaders appears, on the surface,
enormous. Many of the promises are theoretical: they appear conceptually sound,
but apart from some very early results from high-quality studies, few have delivered
on the promise. In this section we take a critical view, to ensure that the
unintended consequences of big data have been thought through carefully.
5 Population Focus
Health care leaders are expected to bring a social determinants and population wide
perspective to their roles. Social and financial status, genetic predisposition and
environmental factors are all seen as influencing the health status of a community.
Coupling a social determinants of health perspective with an all-government
approach suggests that personal health data could be linked to other data from
social, education, geographic and economic sources to identify what social, eco-
nomic and public policy gaps exist and what interventions might be appropriate to
improve a community’s health status. Leaders are motivated to pursue an egalitarian
distribution of health status in a community, focusing on marginalized populations
whose health status is lower than that of the general population. Big data could be
utilized to identify the gaps in health status to mobilize policy and interventions to
address the gaps.
6 Accessibility
Timely access to the right health care service in a convenient location for the patient
is important. Institutional health care delivery within four walls has traditionally
expected that patients will come to the location to receive the service. This may be
Big Data Challenges from a Healthcare Administration Perspective 59
One of the challenges for big data to improve the accessibility of services for the
citizen is the fragmentation and lack of linkage among data repositories. Many
jurisdictions have yet to provide easy access and ownership of all personal health
data to citizens.
There is also a need to differentiate between the primary uses of the health data
versus secondary use. Primary use is for the purpose of delivering care. The con-
fidentiality and privacy of this information is generally protected by law and
restricted for use between the patient and the care giver(s). Secondary use would be
for policy development, quality improvement, research or innovation. The
de-identification of data for purposes of research and policy planning will be
essential for big data to make the promised contribution.
7 Safety
7.1 Opportunities
Health care leaders are committed to delivering safe and effective health care.
Big data could be mobilized to assess whether the outcomes of health care
interventions such as diagnostic tests, rehabilitation, surgical procedures, and other
therapies are being delivered in a safe manner. Are the benefits of the interventions
greater than the harms?
Today’s clinical and administrative leaders are encouraged to communicate and
share complete and unbiased information with their patients and families in ways that
are affirming and constructive. This is why organizations have only recently come to
publicly apologize for negative consequences of a patient’s interaction with the health
care system. There is an opportunity for big data to significantly improve the
industry’s safety record if health care adopts the same no-blame approach of openly
disclosing accidents that the airlines have taken. This type of no-fault approach
has been demonstrated to continually improve the safety record of
the airlines. Openly publishing an organization’s adverse event rates could
help build trust with the community. Big data could provide an opportunity for open,
explicit and transparent reporting of health providers’ safety records.
One approach to this has been the establishment of registries that monitor and report
on the trajectory of interventions and the outcomes associated with them. This
raises the need to handle audits, be transparent with the results and report to the
public. Big data can help to track a number of different variables to help understand
adverse events and how they can be avoided. Big data will require that health care
ontologies, nomenclatures, catalogs, terms and databases be developed and agreed
to. The challenge is that not all professions and leaders will be open to the challenging dialogues necessary to arrive at a standardized and systematic approach. On the cautionary side, Kuziemsky has identified a number of unintended negative consequences of applying big data approaches at the individual, organizational and social levels [3]. Leaders and administrators will need to be sensitive
to context and to include the care providers and patients early in the conversation to
ensure that they are part of the solution.
8 Worklife
Organizations are encouraged to learn how they can effectively improve the climate
of dignity and respect that is generated between and among staff as well as with
patients and leadership. New communication platforms and the crowdsourcing of data could broaden the dialogue between health care leaders and staff and help identify opportunities to improve the work setting. Finding new ways for health care providers to work alongside data scientists who seek to improve health care delivery and to understand the care delivery process could yield important dividends for patients.
Big data may provide an opportunity for organizations to improve worklife by documenting the incidence of workplace injuries and associating this with other variables such as location, time, exposure to infectious agents and other environmental factors. This could formalize and improve organizational performance and staff satisfaction.
Big Data Challenges from a Healthcare Administration Perspective 61
One of the challenges with big data and worklife is that very little is understood about the relationship between the two. Usability studies and systematic reviews of the barriers and critical success factors in the implementation of clinical information systems in health care provider organizations demonstrate that the challenges are rarely technical; more often they are socio-cultural and not well understood. For big data to contribute to the effective improvement of the worklife of health care providers, much more research is needed into the barriers to effective use of information systems. Health leaders should consider adopting usability approaches and methods to determine what would best suit their need for big data to inform their policy and decision making.
9 Client-Centered Services
Health care leaders are looking for ways to engage with their community members
so that their values and perspectives can be understood and used to inform ways to
improve health care delivery to them. One approach that big data provides is using
crowdsourcing as a means of gauging public opinion.
A rapidly growing area of big data is consumer health informatics: the use of patient-generated health data or mobile health to monitor health status. As sensor and machine learning technologies advance and device prices decrease, these devices are becoming more widespread. Research is being undertaken to understand how they might be useful for the maintenance of the patient’s health and the effective delivery of health care. There is also a cautionary note being voiced by Redmond [4] about the need for policy and regulation to be in place to ensure that detailed wearable sensor data is not abused, invading the privacy of individuals or prejudicing them. Other issues arise as to how these data might be usefully integrated into the personal health record.
There is a significant movement toward patient-centered care and coordinating care much more effectively around the patient. A big challenge for big data is that there are currently a large number of ontologies, nomenclatures and database structures in use, which will make it very challenging for these systems to talk to one another.
In a systematic review, Kruse [5] identified nine challenges facing big data: “data structure, security, data standardization, data storage and transfers, managerial issues such as governance and ownership, lack of skill of data analysts, inaccuracies in data, regulatory compliance, and real-time analytics” [5]. Organizations like
ISQUAL, ISO, Accreditation Canada, HL7 and IMIA will need to be encouraged to
collaborate with governing bodies and administrators to arrive at a consensus on
standardized approaches. It will be next to impossible to make sense of big data
unless these foundational blocks are put into place.
10 Continuity
Health care leaders are interested in developing streams of data from patients that
will enable the citizen and patient to better self-manage their health. This encourages shared decision making with their providers and makes virtual care accessible
to them. Health leaders will need to be attentive to “changes for reimbursement for
health care services, increased adoption of relevant technologies, patient engage-
ment, and calls for data transparency raise the importance of patient-generated
11 Appropriateness
Big data could make a significant contribution to providing answers to many of the
vexatious diseases such as type II diabetes, obesity and other chronic diseases;
however, the time from discovery of new knowledge in basic science and its
clinical application can take decades to benefit the patient. The current approach of
using hypothesis-based clinical research, which requires complex and expensive randomized controlled trials, is both resource intensive and time consuming. Big data offers the promise of real-world, data-driven research in which the rigor and internal validity of clinical trials are maintained and confounding variables are accommodated [7]. The objective in many clinical research projects is to reduce the uncertainty about which intervention(s) will result in the most clinically effective outcome. Building longitudinal data sets linking interventions to patient outcomes, and monitoring these over time, would provide a foundation for a continually learning health care system to improve its performance.
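As a sketch of the idea, a longitudinal data set can be modelled as time-ordered events per (pseudonymous) patient, from which the change in outcome following an intervention can be derived. The event fields, interventions and outcome scores below are hypothetical:

```python
from collections import defaultdict

# Illustrative event tuples: (patient_id, date, intervention, outcome_score).
# All values are invented for the example.
events = [
    ("p1", "2020-01-05", "hip replacement", 0.61),
    ("p1", "2021-01-05", "hip replacement follow-up", 0.78),
    ("p2", "2020-03-12", "hip replacement", 0.55),
    ("p2", "2021-03-12", "hip replacement follow-up", 0.70),
]

# Longitudinal view: each patient's events ordered in time.
timeline = defaultdict(list)
for pid, date, intervention, score in events:
    timeline[pid].append((date, intervention, score))
for pid in timeline:
    timeline[pid].sort()  # ISO dates sort chronologically as strings

def outcome_change(patient_events):
    """Change in outcome score from first to last recorded event."""
    return patient_events[-1][2] - patient_events[0][2]

changes = {pid: round(outcome_change(ev), 2) for pid, ev in timeline.items()}
```

Aggregating such per-patient trajectories across an intervention is what would let a learning health system monitor whether an intervention is delivering the outcomes it promises.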
Big data promises to be a significant support to this effort by putting a repository of the world’s medical knowledge at the physician’s fingertips through artificial intelligence and machine learning programs such as Watson [8]. Prompts and reminders linking the patient’s condition through the personal health record to the literature can potentially improve safety and outcomes for patients. The mapping of the human genome and the targeting of interventions based on patients’ risk factors promise improved outcomes.
The promises identified above are powerful and engaging, but our current infrastructure does not permit progress unless significant challenges are addressed. To begin with, appropriateness of care is not only a technical question; it is also a social, political and moral question. Health care leaders will need to be sensitive to and respectful of citizen and patient preferences.
To be successful in moving toward real-world trials, there will need to be a strong linkage and integration of data and practice between the delivery of health care and the research enterprise. There are currently significant challenges in linking data between funders, providers and the institutions that deliver care. A major challenge for governments and their agencies will be to identify how they can provide citizens and patients the safeguards they require without inhibiting their opportunity to enroll in clinical trials of their choosing. One strategy for leaders to address this challenge is to open the door for research institutions such as universities, and for clinical trialists, to work alongside their provider colleagues.
Big data may be able to traverse the gaps between the data points in a health record and the information a patient or health care provider holds, but the final leap is to link the specific patient’s condition to an evidence base of clinical interventions: documenting the patient’s condition in real time, recording the therapeutic interventions, and monitoring the outcomes, so that the trajectory of the patient’s clinical course contributes to a continually learning system of care delivery. The individual patient would benefit from the cumulative experience and, in turn, their documented experience and results would enter the database and help inform future clinical decisions.
Big data cannot deliver on this promise unless there is complete consensus on standards and a commitment by patients, health care provider organizations, and the professions to share this information among themselves. Murphy holds out hope, stating: “A new architecture for EMRS is evolving which could unite Big Data, machine learning, and clinical care through a microservice-based architecture which can host applications focused on quite specific aspects of clinical care, such as managing cancer immunotherapy … informatics innovation, medical research and clinical care go hand in hand as we look to infuse science-based practice into healthcare. Innovative methods will lead to a new ecosystem of applications (apps) interacting with healthcare providers to fill a promise that is still to be determined” [9]. Watson is an attempt to build a machine learning capacity to bring this promise of big data to life, but the ingredients are far from being able to deliver in the real-world setting. Big data will rely on health leaders coming to a strong consensus on information sharing, in partnership and collaboration, for this to become a reality.
The cost of health care is a major concern for leaders. Big data holds the promise to
support more effective and efficient management of resources. Unmet needs could
be identified, access and quality of interventions could be improved, and the con-
nection between interventions and outcomes could be determined and acted upon.
Tradeoffs between programs to achieve the optimal outputs and outcomes from the
financial investment could be made.
Reduction of waste by identifying and removing ineffective, unsafe or harmful
interventions, technologies or services is another promise big data can make.
The continual learning system approach could stimulate the shift of resources
among programs and financial silos to test various hypotheses for care delivery to
improve efficiency of the health system.
Big data could be useful in identifying ways to incentivize behavior within programs or reimbursement systems to achieve the best patient and population health outcomes. Introducing disincentives could help eliminate poor practices and behaviors. Experiments in which citizens and patients are provided with the financial means to pursue their optimal health-seeking behavior could be assessed through more effective linkage between interventions and outcomes. Contracting and procurement decisions of health care systems could be reoriented toward health authorities paying for services based on the value received rather than the products delivered. This would refocus our thinking from being input and output oriented toward linking outputs to promised patient outcomes. The emergence of blockchain technology as a means to track transactional elements from acquisition to patient impact could be facilitated through big data.
Efficiency is formulaic: it addresses the relationship between the cost of inputs and the processes that result in program and patient outcomes. Health care leaders are accountable for the services delivered and the outcomes achieved, through budgeting and planning processes and reporting on the results. New funding to address opportunities for innovation is constrained by the attempts of governments to bend the cost curve downward. Leaders are driving into the future while looking into the rear-view mirror. The rapidity with which technological and clinical innovation is accelerating into the care environment renders the current approach ineffective.
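One standard way to make the input-to-outcome relationship explicit is the incremental cost-effectiveness ratio (ICER): the extra cost of a new program divided by the extra health outcome it yields. A minimal sketch, using hypothetical program figures:

```python
def icer(cost_new, effect_new, cost_old, effect_old):
    """Incremental cost-effectiveness ratio: extra cost per extra
    unit of outcome (e.g. dollars per QALY gained)."""
    return (cost_new - cost_old) / (effect_new - effect_old)

# Hypothetical programs: cost in dollars, effect in QALYs per patient.
# The new program costs $4,000 more and yields 0.3 more QALYs,
# so it costs roughly $13,333 per QALY gained.
ratio = icer(cost_new=12_000, effect_new=1.4, cost_old=8_000, effect_old=1.1)
```

Comparing such ratios against a willingness-to-pay threshold is one way tradeoffs between competing programs can be made explicit rather than implicit.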
Big data offers a solution to this conundrum, but it comes with significant risks.
Although health care leaders may recognize that there are services and programs that should be phased out, there will be political forces with a desire to maintain the status quo because their employment, income stream and/or security depends on them.
Big data can provide leaders with the contemporaneous data required to address these issues explicitly, through open information sharing, partnership and collaboration with their health care providers and patients, to ensure that change management strategies are developed and implemented so that inefficient forms of program delivery are smoothly replaced with more efficient ones.
13 Concluding Remarks
Health care leaders will need to consult widely and exercise a strong will to work
collaboratively with their partners to use big data effectively to improve health care
delivery. The promise of big data is enormous; however, the risks associated with the uncritical deployment and application of big data are not to be ignored. Leaders must become proactive in putting in place the infrastructure, standards, and capacity to effectively harness the power of big data to benefit the health of citizens.
Bellazzi reminds us: “The way forward with the big data opportunity will require
properly applied engineering principles to design studies and applications, to avoid
preconceptions or over-enthusiasms, to fully exploit the available technologies, and
to improve data processing and data management regulations” [10]. Leaders will
need to be very vigilant to ensure that their approaches and uses of big data are
accurate and true. Nothing will erode the confidence of citizens more quickly than
data that is false and untrustworthy.
References
7. Martin-Sanchez F, Verspoor K (2014) Big data in medicine is driving big changes. IMIA
Yearb Med Inform
8. Kohn MS, Sun J, Knoop S, Shabo A, Carmeli B, Sow D, Syeda-Mahmood T, Rapp W (2014)
IBM’s health analytics and clinical decision support. IMIA Yearb Med Inform 154–162
9. Murphy S, Castro V, Mandl K (2017) Grappling with the future use of big data for
translational medicine and clinical care. IMIA Yearb Med Inform 96–102
10. Bellazzi R (2014) Big data and biomedical informatics: a challenging opportunity. IMIA
Yearb Med Inform 8–13
Big Data Challenges from a Healthcare
Governance Perspective
Donald W. M. Juzwishin
1 Introduction
“Water, water everywhere, Nor any drop to drink” [1]. In The Rime of the Ancient Mariner, Samuel Taylor Coleridge describes a sailor, stranded on a ship, surrounded by water that he cannot drink to quench his thirst. His survival depends on water being in a form that can sustain life; in its current form it would hasten his death.
The sea of data, information and evidence we are swirling in recalls the plight of the sailor. At one level the citizen is awash in health data, information and knowledge about health and health care delivery; however, access to health care and the distribution of health outcomes are suboptimal. At the moment, comprehensive personal health care data is rarely readily accessible to the citizen because it is institutionally owned. The contemporary patient or citizen is analogous to the sailor: awash in data, but not in a form that sustains personal health.
The global repositories of health data, information and knowledge are growing
exponentially and differentiating between truth and myth is becoming increasingly
challenging. The blurring of lines among inaccurate data, misinformation and pseudo-knowledge in policy and decision-making can lead to significant negative consequences for citizens and society. Governance bodies must be prepared to review and critically assess the veracity and merits of the data, information and evidence
emerging. The growth of new forms of data contributing to big data, from social media, sensor and surveillance technologies, financial transactions, localization and movement data, the human genome and the Internet of Everything, will further exacerbate the challenges for governance.
Having anticipated the rise in the prominence of big data, the yearbook of the
International Medical Informatics Association in 2014 dedicated the entire volume to
the theme, “big data—smart health strategies” [2]. The contributors examined a
wide range of topics identifying opportunities and challenges associated with big
data in healthcare delivery. To date it serves as the most comprehensive and
high-quality examination of the subject.

D. W. M. Juzwishin (&)
University of Victoria, Victoria, BC, Canada
e-mail: djuzwishin@uvic.ca
Absent from that work, however, was a description and analysis of the impact
that the emergence of big data has and will have for the governance of health care
systems. Is big data a hope for the future of governance? Is it big hype? Can it
provide a platform for an effective use of health data to improve the outcomes of
citizens and the effective delivery of services? What are the opportunities and the
challenges? This chapter will attempt to redress the gap in the literature and provide
a way forward.
This chapter is not about best practices for healthcare data governance. Our attention is directed toward how best practices in the governance of healthcare systems can successfully address the challenges and risks of the indiscriminate use of big data. Having established that big data is a new and promising concept, we must also recognize that it threatens several fundamental values of society: Who owns the personal health record? Who has access to it? How is access to be controlled? How do governing bodies use it to achieve their objectives in the interests of citizens and patients?
In this chapter we will:
• Define and identify the legal and regulatory frameworks, as well as the values, that provide opportunities but are also threatened by big data;
• Identify the standards and best practices that governance bodies and adminis-
trators aspire to;
• Identify the opportunities of using big data;
• Identify the challenges to the effective use of big data;
• Provide guidance on how big data can be exploited by governance bodies for the
benefit of the citizens and the health care system.
3.1 Definitions
For the purpose of this chapter we define big data as the total accumulation of all
past, current and emerging health data, information and knowledge that can be
usefully applied to govern and manage the health care delivery system for the
citizens of society.
Table 1 Functions, definitions and mechanisms (excerpted from HSO, pp. 3–4)

Function/entity: Governance
Guideline: The governing body is accountable for the quality of services/care, and supports the organization to achieve its goals, consistent with its mandated objectives and its accountability to stakeholders
Mechanisms: Acts, regulations, licenses, privileges, scope of practice

Function/entity: Governing body
Guideline: The body that holds authority, ultimate decision-making ability, and accountability for an organization and its services. This may be a board of directors, a Health Advisory Committee, a Chief and Council, or other body
Mechanisms: Bylaws; health profession legislation; medical staff bylaws
The promise of big data for governance bodies appears, on the surface, enormous. Many of the promises are theoretical: they appear conceptually sound, but beyond some very early results from high-quality studies few have delivered on the promise. In this section we want to ensure that we have thought carefully through the unintended negative consequences of big data.
Big data will not have its potential realized for the health care system unless significant changes are made, in a thoughtful and systematic way, to accommodate the requirements of big data. Big data could become the greatest nightmare for governance bodies if they are not able to come to terms with how to harness its potential in service to the community. Breaching the confidentiality of patients and citizens is a significant risk that governing bodies and administrators must be prepared to address.
It would be wise to heed the words of Niccolo Machiavelli:
It ought to be remembered that there is nothing more difficult to take in hand, more perilous
to conduct, or more uncertain in its success, than to take the lead in the introduction of a
new order of things. Because the innovator has for enemies all those who have done well
under the old conditions, and lukewarm defenders in those who may do well under the new.
This coolness arises partly from fear of the opponents, who have the laws on their side, and
partly from the incredulity of men, who do not readily believe in new things until they have
had a long experience of them. [4]
5 Population Focus
Big data promises several opportunities for governance entities to work effectively to
identify and anticipate the health care needs of the community. Big data could help
healthcare providers comply with the standards of health care delivery through
public monitoring and reporting on their performance. Public health surveillance is
an approach that facilitates the government, governance bodies, and health care
providers gaining a good understanding of what health needs are and how they could
be met. Health authorities and government departments of health could prepare their
planning, programming and funding based on health surveillance data. Health care
provider organizations could also survey their community members through social
media platforms and crowdsourcing to understand what their health needs are.
Big data could be useful to deal with disasters such as tornados, tsunamis,
earthquakes, fires and floods that arise unexpectedly and require government and
health care organizations to respond. Databases identifying the location of citizens,
particularly those who are in danger and vulnerable to the threat, would be useful.
Part of the opportunity and difficulty that big data will face in helping with the
transformation of the system is that governing bodies do not regularly collect
outcomes data for the citizens or patients they provide service to. They count the
number of emergency visits, the number of surgeries or the number of patient days.
They rarely have data on the short-term or long-term consequences of the interventions and their impact on the health status of those patients. Big data could begin to identify ways to link identified needs, interventions, outputs and outcomes, but this will require a new set of metrics: patient-oriented outcome measures such as the EQ-5D. These will need to be introduced as a regular follow-up to all health care interventions. Some research and innovation activity is beginning to recognize the importance of using outcome measures to assess the clinical and cost effectiveness of newly introduced health care interventions; in fact, it has become a condition of funding.
6 Accessibility
Accessibility for the citizen means getting the health care that they need when they
need it. In publicly funded health care systems citizens expect timely and equitable
access to health care services. The citizen’s ability to pay for service is never to be a
barrier to access medically necessary services. In reality, because of the limited
resources available to fund healthcare services there is very little slack in the
system. Throughput is optimized by differentiating among levels and types of care (emergent, urgent and elective), with the view that, in the public interest, queuing provides a way to maximize resource utilization by smoothing out a stochastic production function. This leads to some of our contemporary issues with waiting lists, for example, lists for surgical procedures and long-term care facilities, as well as queues in emergency departments. Big data may provide a means to improve access.
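The triage-based queuing described above can be sketched as a priority queue: emergent cases are always served before urgent, and urgent before elective, while arrival order breaks ties within a level. The class and acuity labels below are illustrative:

```python
import heapq

# Lower number = higher priority; the labels are illustrative acuity levels.
PRIORITY = {"emergent": 0, "urgent": 1, "elective": 2}

class TriageQueue:
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker preserving arrival order within a level

    def arrive(self, patient, acuity):
        heapq.heappush(self._heap, (PRIORITY[acuity], self._count, patient))
        self._count += 1

    def next_patient(self):
        return heapq.heappop(self._heap)[2]

q = TriageQueue()
q.arrive("A", "elective")
q.arrive("B", "emergent")
q.arrive("C", "urgent")
q.arrive("D", "emergent")
order = [q.next_patient() for _ in range(4)]   # B, D, C, A
```

The elective patient waits longest, which is exactly the smoothing behavior, and the source of the waiting-list issues, that the paragraph describes.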
Big data cannot be successful in addressing the sharing of data unless legislation,
regulations and policies are revised to encourage integration without compromising
the security, privacy and confidentiality of health data and information [5]. The
responsibility for the personal health record and electronic health record must be
turned over to the citizen and patient. They, in consultation with their family and
health care provider, decide who should have access to data for the primary use of
that data. Until this is done, inter-institutional interoperability will be a challenge.
7 Safety
Governing boards cannot claim to do no harm to patients unless they can publicly
declare adverse events with openness and transparency to the public. Big data
cannot make an inroad in advancing our societal understanding of how adverse
events occur and how they are remedied unless the governing bodies are prepared
to share the information. In the past, fear of litigation has prevented governing bodies from making this information public. However, with appropriate safeguards to ensure anonymity and a positive approach to quality improvement, it has been demonstrated that adverse events can be addressed through a non-accusatory approach to the health care providers involved, one that contributes to continual quality improvement supported by continual learning for the organization and for the health care providers.
8 Worklife
Big data may help to create a life-long learning capability for health care organizations and health care professionals, but this will require a critical approach to identifying which best practices are legitimate and should be adopted. Structures and processes will have to be developed and implemented in organizations to identify, assess, and apply best practices. Information and decision support systems will need to be developed to ensure there is a continual iterative loop between the experience gained from interventions and the lessons learned, so that health care providers improve their practice. Public transparency of these experiences will be necessary to ensure that public trust is maintained.
9 Client-Centered Services
Health care provider organizations and professionals are expected to identify ways
to partner with the citizens and patients and their families in their care. Big data
offers the promise of allowing patients to choose how they use their own health care
data for self-managed care. Currently data is situated in isolated repositories with
little opportunity for interoperable linkage. Health care organizations and providers
are required to establish and populate the electronic health record for the purpose of
providing health care to the patient. Governing bodies are responsible for ensuring
the security and confidentiality of that information. There is, however, no legal framework that would encourage governing boards to share the data with either the patient or with other organizations that could use it effectively for the benefit of the patient. When the patient is admitted to hospital, they consent to treatment and to the collection of information for their care, but that information is not to be shared with anyone else without their permission. This creates an untenable situation for the ubiquity of health data, where expectations do not match interoperable capability. One proposal for addressing this issue is to give citizens and patients ownership and access, and to allow them to determine who gets access and when.
There is a significant movement toward patient centered care and coordinating care
much more effectively around that patient. Coordinated care is also being mobilized
toward integrated care delivery. Governing bodies must be aware of the challenges
of attempting to provide integrated care when the organization structures and
processes do not easily accommodate it. Rigby and colleagues point out “new
interactive patient portals will be needed to enable peer communication by all
stakeholders including patients and professionals. Few portals capable of this exist
to date. The evaluation of these portals as enablers of system change, rather than as
simple windows into electronic records, is at an early stage and novel evaluation
approaches are needed” [6].
It will be very difficult to link these disparate data sources together in a meaningful way; some degree of standardization will be necessary. Governing bodies will need to facilitate the development of standardized ontologies, catalogues and nomenclatures for databases so that information about the individual patient can be linked to other databases where other forms of information reside.
10 Continuity
Governance bodies are responsible for coordinating the care of citizens and patients across the continuum of care, which ranges from cradle to grave.
This involves the delivery of services spanning health promotion, disease prevention, emergency and acute care, rehabilitation services, long-term care, community care, public health and palliative services. Historically these services were delivered by
independent agencies and organizations with their own governance bodies. In
current times the health reform movement is consolidating governance and administrative responsibilities in order to more effectively integrate, coordinate and collaborate on the health care delivery enterprise. The regulations and rules around access to and use of health data have not kept pace with the structural and functional reforms underway. The result is that public expectations of continuity of health data across health care providers are not being met. Big data could serve to close this gap.
11 Appropriateness
“Do the right thing to provide me with the best results” is the dictum driving appropriateness in the health care system. The ascension of big data intimates that knowing what is appropriate may be well established. Science and medicine have provided answers to many of the diseases that face humanity; however, there remain many diseases and conditions for which the “right thing to do” is an open question. Many health care interventions have a degree of uncertainty associated with their outcomes. Big data may help reduce that uncertainty through rigorous probabilistic analysis.
The public expects that health care funds will be spent to achieve the greatest
health benefits and value for society. Governing bodies are held to account for
making optimal decisions about the use of the resources at their disposal, both by
the public and by the government that provides them with funding. Opportunity
cost dictates that spending money on one thing in health care means that those
funds are not available for other health expenditures; spending on one health
benefit forecloses a competing benefit, which may be greater. Interests within the
health care system will compete for resources, sometimes losing sight of what is
best for citizens or patients. Governance and administration must make decisions
that balance these competing interests.
Big data does appear to be a powerful approach and tool for governing bodies
and administrators to extract efficiencies from the health care system. It offers a
solution to this conundrum, but it comes with significant risks. Although
governance bodies may recognize that there are services and programs that should
be taken out of service, there will be political forces with a desire to maintain the
status quo because their employment, income stream and/or security depend on
them.
To address these issues, governance bodies will be required to engage in open
and explicit information sharing, partnership and collaboration with their health
care providers and patients, ensuring that change management strategies are
developed and implemented so that inefficient forms of program delivery are
smoothly replaced with more efficient ones.
13 Concluding Remarks
government and other interests toward a recognition that the values and expecta-
tions of society are changing. Recalling Machiavelli’s dictum, there are, however,
serious perils for those leading the changes necessary. This chapter has highlighted
many of the pitfalls that citizens, patients, politicians, policy makers and health care
providers may succumb to through an indiscriminate and uncritical approach to big
data. The best strategy for maximizing the promises of big data is to be aware of the
pitfalls and to plan accordingly.
Governance bodies must avail themselves of trusted data, information and
knowledge, as these are the best vaccine for speaking truth to power and avoiding
policies and decisions based on incompetence, confusion or malicious intent. The
public interest must be safeguarded from these threats. Governments at all levels
(national, state/provincial, municipal/local) must be prepared to establish the political
institutions and instruments that protect the public interest in the storage, linkage
and application of big data. Principled standards of best practice should be
encouraged and developed at the global level so that countries with less capacity
and capability can benefit from those with more. Governance must be prepared to
collaborate in an all-of-government approach, putting in place enabling legal,
regulatory, policy, standards and guidance frameworks that weigh health data,
information and evidence in order to balance competing interests through reasoned
deliberation. These deliberations must be held in open, explicit and transparent
public settings, as recommended in the Accreditation Canada standard below:
Communicating and sharing complete and unbiased information with clients and families in
ways that are affirming and useful. Clients and families receive timely, complete, and
accurate information in order to effectively participate in care and decision-making.
(Accreditation Canada 2018) [3]
Governing bodies are the entrusted stewards of the public’s health. Our
responsibility is to provide them with the means to harness the promise of big data
and to avoid its negative consequences.
T. S. Eliot’s words may best express the challenge we face:
Where is the Life we have lost in living? Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information? [7]
I would add: where are the information, knowledge and wisdom we have lost in
big data?
References
1 Introduction
Big data promises to increase patient safety if vast amounts of relevant health
related data can be brought to bear in aiding decision making, reasoning and
promotion of health. For example, the advent of clinical decision support systems
that can apply best practice guidelines, alerts and reminders through continual
analysis of large repositories of patient data (e.g. running behind the scenes
checking patient records for adverse combinations of medications and flagging
problems) has been shown to increase patient safety [1]. As patient data increases
(as contained in patient record systems, data warehouses and genomic databases),
automated methods for scanning and checking health data for anomalies, issues and
health problems have proven to be an important advantage of digitizing health
information [2]. Improving personal health through the integration of various forms
of personal health data will require applications that can process large amounts of
adverse event data; such applications hold considerable promise for improving
patient safety [2, 3]. Indeed, the advantages of the coming personalized medicine trend will
require big data coupled with new ways of automatically analyzing data. Such
advances promise to increase the effectiveness of treatment, management and
ultimately patient safety [4]. However, as the size of this data increases, the quality
and correctness of data collected using these new methods will become an
increasing concern [5–8]. In addition, big data can be
collected for the purposes of checking and improving data quality and reducing the
chance of technology-induced error—error that may be inadvertently introduced by
information technology itself [5, 6]. New ways of documenting and responding to
such error will be needed as the era of big data dawns [5]. One approach to
achieving this is by developing error reporting systems that can report on errors and
2 Motivation
There has emerged a need to collect data about the safety of health information
technology (HIT) with the objective of improving the quality and safety of the
technologies patients and health professionals use in the process of providing and
receiving health care. With increased technological advances, the potential for
inadvertent introduction of error due to technology and in the data stored in large
databases will increase [3, 5, 6]. Technology-induced errors are errors that result
from the complex interaction between humans and machines [5]. Such errors may
manifest themselves as incorrect use of technology, errors in decision making as a
result of using technology, and resultant errors in data stored and accessed in electronic
repositories. To address this growing concern, some researchers have repurposed
existing databases which were created to document medical and medication error to
also include documenting of technology-induced errors [3, 5, 6]. Other researchers
have begun collecting data about technology-induced errors either as an adjunct to
existing data collection approaches or in developing new methods for collecting data
created by the HIT themselves as they are used by patients and health professionals
during the process of patient care [2–8]. Much of this work parallels research that has
been done in areas such as aerospace, where data about aircraft failures and issues are
entered and accessed globally in an effort to increase air travel safety.
In this book chapter, the authors will discuss how technology-induced errors are
being managed and analyzed using existing sources of data (i.e. large data repos-
itories that collect data about patient safety incidents in healthcare) and also how
data being collected by HIT can be used to improve the quality and safety of
healthcare technologies and healthcare itself.
state or provincial or national level. Such incident reporting systems collect data
across facilities and regions and are available for fine grained analysis of errors
involving technology (i.e. technology-induced errors). These data repositories have
been used to provide valuable insights into how errors can emerge and can prop-
agate throughout a healthcare system [3].
Researchers in Australia [9], Finland [3], China [10] and the United States of
America [11] have effectively used data from incident reporting systems to learn
about how technology-induced errors occur so that future events can be avoided.
Their work has involved reviewing individual incident reports for the presence or
absence of a technology-induced error, coding the data using taxonomies specific to
technology and errors, and analyzing the data for patterns of technology-induced
error occurrence to inform technology-specific strategies aimed at preventing errors
and to examine data for patterns that inform organizational learning at a broad level
(e.g. regional health authority, national and international level) [3, 9].
Horsky and colleagues [12] used incident reporting data to conduct fine grained
analyses of technology-induced errors to develop a more comprehensive insight
into the events that led to the error. Here, Horsky reviewed the initial incident report
and developed a comprehensive strategy for understanding how the technology,
organizational environment and the people who were involved in the incident
interacted, and how this led to patient harm. In this work the researchers were able to
provide a report outlining recommendations for their institution aimed at preventing
future errors such as modifying the interface of the electronic health record system,
providing training for physicians to deal with unusual situations, and developing
new organizational policies and procedures [10].
Magrabi [11] and Palojoki and colleagues [3] analyzed data about
technology-induced errors found in incident reporting systems. After reviewing
incident reports and coding their data, the researchers analyzed the reports to
provide information about overall trends in the types of errors that are occurring and
the types of technologies that were involved. For example, Magrabi and colleagues
analyzed reported events that were stored in the US Food and Drug Administrative
Manufacturer and User Facility Device Experience (MAUDE) database [11]. Some
of this work also involved in-depth analysis of the data to understand where these
types of errors occur most often (e.g. in an emergency department or the intensive
care unit) (Palojoki et al.) [12]. Palojoki et al. [13] extended this work by collecting
additional data in the form of health care professional surveys. The researchers
developed a survey tool that asks health professionals about their experiences
involving technology-induced errors. Here, the survey data helped to inform
incident report analyses. The results of their work indicated that almost half of the
respondents to their survey reported a high level of risk related to a specific
error type they termed “extended electronic health record unavailability”. Other
risks included problems such as a tendency to select incorrectly from a list of items
(e.g. in selecting from a list of medications). In related work, Palojoki and col-
leagues found human-computer interaction problems were the most frequently
reported [12].
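The coding-and-tallying workflow described above can be sketched in a few lines of code; the error-type labels, taxonomy categories and reports below are invented for illustration and are not drawn from the cited studies or their actual taxonomies.

```python
from collections import Counter

# Hypothetical mapping from error-type labels to taxonomy categories.
TAXONOMY = {
    "wrong_item_from_list": "Human-computer interaction",
    "interface_confusion": "Human-computer interaction",
    "ehr_unavailable": "System availability",
}

def code_reports(reports):
    """Assign each incident report a taxonomy category based on its
    error-type label, then tally how often each category occurs."""
    tally = Counter()
    for report in reports:
        category = TAXONOMY.get(report["error_type"], "Uncoded")
        tally[category] += 1
    return tally

# Illustrative incident reports (not real data).
reports = [
    {"id": 1, "error_type": "wrong_item_from_list"},
    {"id": 2, "error_type": "ehr_unavailable"},
    {"id": 3, "error_type": "interface_confusion"},
    {"id": 4, "error_type": "hardware_failure"},  # not in taxonomy
]

counts = code_reports(reports)
print(counts.most_common(1))  # → [('Human-computer interaction', 2)]
```

In this toy tally, as in the Finnish findings reported above, human-computer interaction problems come out as the most frequently coded category; real analyses apply far richer taxonomies to free-text reports.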
88 E. M. Borycki and A. W. Kushniruk
Kaipio et al. [14] employed large scale surveys (deployed online to thousands of
physicians) to learn about safety issues involving electronic medical records in
Finland. Kaipio added several strategic questions on health information technology
safety to an existing national survey about electronic medical record usability and
workflow, providing some preliminary insights into this area. The survey was
deployed in Finland with an invitation to all physicians.
The results of the survey indicated that physicians were very critical of the usability
of the electronic health record systems they were using. The survey also provided
detailed information about what usability problems were being encountered by
users of the main vendor based system available in Finland. In a follow-up study
also conducted at the national level in Finland two years later, it was found that
users’ impressions of the systems they were using had not substantially improved.
This pioneering work will ultimately lead to the collection of large amounts
of data on the usability and safety of healthcare systems, as other countries
begin to deploy similar online questionnaires [14]. It will be used to provide
feedback at multiple levels, including to vendors, national organizations and policy
makers. Other approaches that involve collection of usability and use data of sys-
tems such as electronic health records will also lead to collection of big data on
usage information. This information can be used by health regions and authorities
to identify how electronic resources are being used, potential bottlenecks and areas
where further analysis is needed [15].
4 Challenges
There are a number of challenges when dealing with big data related to improving
the safety and quality of healthcare processes and information technologies. Much
of the current collection of large databases of error information is based on
voluntary incident reporting by end users of systems (e.g. doctors, nurses,
pharmacists, etc.) [3, 9, 11]. This will need to be augmented by systems that allow patients and
citizens to enter information about errors [8]. In addition, many technology-induced
errors go undetected by the end user committing the error, and thus are not reported
[5]. This has required use of laboratory studies (i.e. clinical simulations) to analyze
when such error might occur, along with use of computer simulations to extrapolate
how frequently they would occur in the larger healthcare context. This work also
moves the focus from reporting on errors to preventing them. Along these lines,
automated methods for detecting error such as medication errors and
technology-induced errors will be needed [16]. Data mining and application of
predictive analytics using a growing database of patient data and information
contained in electronic health records will be needed to detect patterns that indicate
error and safety issues. For example, with the advent of wireless devices in hos-
pitals, methods for ensuring the data transmitted from one device to another is
correct and error free will become essential (which could involve approaches from
applied artificial intelligence). In addition, given that many of the information
Big Data and Patient Safety 89
systems in use today are deployed across multiple countries, there will
be a need for cross-border collection and sharing (interoperability) of data on
technology-induced errors. Finally, “big data” does not necessarily mean “good” or
“correct” or “useful” data. “Garbage in—garbage out” is an old computer science
adage expressing the fact that merely having data is not enough: if the data entering a
health information system is incorrect or spurious, then the decisions coming out
will be flawed and will reduce patient safety. Therefore, as our
health databases grow and become more complex, greater emphasis will need to be
placed on data integrity and the safety of our healthcare systems and big data will
play a major role in this trend.
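The kind of behind-the-scenes scanning described above, and in the introduction (checking patient records for adverse combinations of medications and flagging problems), can be sketched as a simple rule-based check. The interaction pairs and patient record below are assumptions made for the sketch, not clinical rules.

```python
# Illustrative adverse medication pairs (assumed for this sketch only;
# not clinical advice and not an actual interaction knowledge base).
ADVERSE_PAIRS = {
    frozenset({"warfarin", "aspirin"}),
    frozenset({"ssri", "maoi"}),
}

def flag_adverse_combinations(record):
    """Scan one patient's medication list and return every pair of
    medications that matches a known adverse combination."""
    meds = [m.lower() for m in record["medications"]]
    flags = []
    for i, first in enumerate(meds):
        for second in meds[i + 1:]:
            if frozenset({first, second}) in ADVERSE_PAIRS:
                flags.append((first, second))
    return flags

# A hypothetical patient record.
record = {"patient_id": "p-001",
          "medications": ["Warfarin", "Aspirin", "Metformin"]}

print(flag_adverse_combinations(record))  # → [('warfarin', 'aspirin')]
```

A production decision support system would of course draw its rules from a curated drug-interaction knowledge base and run continuously over large repositories, but the flag-and-report structure is the same.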
Researchers have suggested that big data will lead to improvements in patient
safety. One area of concern involving health information technologies is the ability
of some of these technologies to introduce new types of errors. Errors that arise,
when health professionals use systems in the process of providing patient care, are
referred to as technology-induced errors. Currently, technology-induced error data
is being collected in incident reporting systems that reside in national, provincial,
regional and hospital specific databases, and by researchers who are developing and
deploying national surveys aimed at improving the quality and safety of health
information technology. There are many challenges associated with analyzing data
captured by incident reporting systems. The quality of these datasets has been
critiqued. Many incident reporting systems rely on voluntary reports by health
professionals, and only a subset of the incidents documented in these systems
involves technology-induced errors. Future work involving big data will
need to focus on patient reported patient safety incidents and detecting patterns of
errors and safety issues from collected data.
References
1 Introduction
The collection and analysis of ever increasing amounts of healthcare data promises
to revolutionize and transform healthcare. Voluminous personal health data, fitness
data, genomic data, epidemiological data and other forms of health data are being
generated at an unprecedented rate and this trend will continue [1]. While advances
are being made in the automated collection and analysis of big data to keep up with
the generation of data, using machine learning, data mining and artificial intelli-
gence techniques, the issue of the human factor in all these developments still
remains central to the question of whether such large and complex collections of
data are useful and effective in helping to improve healthcare decision making and
processes. The impact of big data ultimately depends on human factors related to
effective access, use and application of such large data repositories to solve com-
plex and real healthcare problems and meet the information needs of health pro-
fessionals, healthcare management and ultimately patients. Indeed, the potential for
voluminous collection of data can easily lead to the phenomenon known as cognitive
overload, whereby the limited cognitive processing capability of humans is
overwhelmed by the amount or complexity of data. Health data needs to be
collected, accessed and utilized by health professionals, patients and lay people in a
way that is understandable, effective and meets underlying information needs.
Collecting large amounts of data without considering the human factors
involved in its use and its interaction with human end users is unlikely to lead to
improved healthcare; this must be taken into account by those designing,
implementing and deploying large data sets, interfaces to big data, and decision
support systems that use big data with the objective of improving healthcare.
Along these lines the issue of the usability of healthcare information systems has
come to the fore in health informatics more generally. Usability can be considered a
measure of ease of use of a system, user interface, data or technology in terms of its
effectiveness, efficiency, enjoyability, safety and learnability [2]. The principles that
have emerged from the field of usability engineering argue for the introduction of
technology that is both usable and useful to end users (e.g. physicians, nurses,
pharmacists, patients, lay people etc.) in helping to solve some real problem, make a
decision or reason about health issues. Nowhere is the concept of usability and the
need for consideration of human factors more germane than in the area of big data.
Indeed, failures of big data to achieve its promise have in many cases been directly
attributed to a lack of consideration of human factors, and more specifically,
usability of the systems, data or support provided to end users in the attempt to help
them. Therefore, considering the human factors of big data is an important and
essential topic that will not go away, but rather will become more and more critical
as the amount and complexity of data in healthcare continues to exponentially
increase over time.
contexts, including the development of data warehouses and data marts, Kushniruk
and Turner have proposed a framework to characterize user needs known as the
User-Task-Context matrix [7]. This framework has been used for helping to design
interfaces to a variety of big data applications, including personal health applica-
tions and interfaces to large organizational data warehouses.
The three dimensions of the model are the following: (a) the User, (b) the Task,
and (c) the Context of Use. For example, along the user dimension of an envisaged data
warehouse the categories of users corresponding to clinicians, statisticians, and
healthcare organization management might be identified from initial system
requirements. Each of these user types or classes could be further delineated in
terms of their information needs and requirements, creating a user profile for each
class of user. The task dimension refers to the different type of user interactions that
a system might support. For example, in the case of a data warehouse this might
include providing information and specific reports to support management rea-
soning about resource allocation in a health region, or identification of disease
concentrations. Finally, the third dimension is context that refers to the setting or
context of use of the data warehouse, for example, in the clinical setting, or in the
context of hospital managers making organizational decisions (Fig. 1).
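A minimal data-structure sketch of such a matrix might look like the following; the user classes, contexts and tasks are hypothetical examples chosen for illustration, not the framework's canonical values.

```python
# A hypothetical User-Task-Context matrix for an envisaged data warehouse:
# for each (user class, context of use) pair, the tasks to be supported.
utc_matrix = {
    ("clinician", "clinical setting"): [
        "look up patient history", "review alerts"],
    ("statistician", "analysis office"): [
        "export cohort data", "run epidemiological reports"],
    ("manager", "organizational decisions"): [
        "resource allocation reports", "identify disease concentrations"],
}

def tasks_for(user, context):
    """Return the tasks identified for a given user class in a given
    context of use (empty if the cell has not been filled in)."""
    return utc_matrix.get((user, context), [])

print(tasks_for("clinician", "clinical setting"))
# → ['look up patient history', 'review alerts']
```

Filling out each cell of such a structure during requirements gathering makes explicit which reports, displays and interactions each user group needs, which is the role the matrix played in the data warehouse example that follows.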
In one example of application of the User-Task-Context matrix, a group of
potential end users of a data warehouse project for a regional health authority met to
arrive at an architecture for the warehouse. The framework from the
User-Task-Context Matrix was used to drive the requirements gathering through
delineation of: (a) the different user groups who would be using the data warehouse
(b) the type of tasks and information needs of each of the different user groups
(including types of reports and displays required) and (c) the different context of use
of the data warehouse (e.g. for optimizing local clinical decision making, for
making large-scale organizational decisions etc.). The design and organization of
both the back-end of the data warehouse as well as the user interface and user
interactions were designed based on the results of filling out details in the matrix
(regarding its 3 dimensions), to maximize the impact and usefulness of the big data
that ultimately were contained in this large regional data warehouse.
Fig. 2 Knowledge translation and the bioinformatics pipeline—from knowledge synthesis to use
in personalised medicine
96 A. W. Kushniruk and E. M. Borycki
decisions about treatment and planning for patients. The focus group discussions
were recorded, transcribed and analyzed for themes and requirements for design
that then formed the basis for development of new user interface prototypes. In
reflecting the varied needs of different types of users in dealing with large and
complex data sets related to patient genetic data, a number of clear preferences
emerged (that were used to base the design of the prototype user interfaces that
were developed). For example, it was found that bioinformatics researchers pre-
ferred command line user interfaces over graphical user interfaces for better com-
patibility with the existing base of bioinformatics software tools and for
customization flexibility when analyzing and examining large data sets.
Furthermore, clinical geneticists noted the limitations in the usability of current
software and their inability to participate in specific stages of the health informatics
pipeline. Both clinical geneticists and genetic counselors wanted an overarching
interactive graphical interface that would be used to simplify the large data sets by
using a tiered approach where only functionalities relevant to the user domain were
accessible (and with the system being flexibly connected to a range of relevant
databases). In general, users wanted interfaces that would summarize key clinical
findings from the large array of possible details to aid in their application of the
genomic patient information, mitigate against cognitive overload and help in
focusing attention on key elements of the data presented.
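The tiered approach described above, in which only the functionalities relevant to a user's domain are accessible, can be sketched as a simple role-based filter; the roles and functionality names here are invented for illustration and do not come from the cited study's software.

```python
# Hypothetical functionality tiers: which features each user domain sees.
FUNCTIONALITY_TIERS = {
    "bioinformatician": {"raw_variant_access", "pipeline_config",
                         "batch_analysis", "clinical_summary"},
    "clinical_geneticist": {"variant_review", "clinical_summary"},
    "genetic_counselor": {"clinical_summary"},
}

def visible_functionalities(role, all_features):
    """Filter the system's full feature set down to what a given
    user domain is permitted to see in its tier."""
    allowed = FUNCTIONALITY_TIERS.get(role, set())
    return sorted(f for f in all_features if f in allowed)

features = {"raw_variant_access", "pipeline_config", "variant_review",
            "clinical_summary", "batch_analysis"}

print(visible_functionalities("genetic_counselor", features))
# → ['clinical_summary']
```

The design intent is the one reported in the focus groups: each user class is shown a simplified view of the large data set, mitigating cognitive overload while leaving the full pipeline accessible to those who need it.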
Further work is being conducted in this area and has focused on how to best
integrate genomic information (e.g. about gene mutations and risks associated with
them) with patient data contained in electronic health record systems. Indeed, to
effectively support applications like automated alerts or reminders that provide
information about patients related to genomic information, research will need to be
conducted that examines the user interface and human-computer interaction at the
level of the clinician, genetic counsellor or in the case of patient facing systems, the
patient themselves. Indeed, in order to take advantage of the rapid advances in
research in the area of personalized medicine, research will need to also include work
on arriving at systems and tools that are both useful and usable, that embed into work
activities for day-to-day application of knowledge (as in the use of electronic health
records) and support workflow, decision making and reasoning by humans.
There are a number of challenges for Big Data from a human factors perspective
and some prominent ones include the following:
– Electronic health record data is growing exponentially—electronic health record
systems are becoming widely used worldwide and are becoming ubiquitous.
These systems allow for storage and access to patient data that can be ever
Big Data Challenges from a Human Factors Perspective 97
There are a number of future directions for research into the human factors of Big Data.
The following are some of the directions the authors of this chapter have been and
are currently involved with:
– Usability analyses and analysis of use of big data to iteratively feedback into
design and redesign into health information systems, such as data warehouses,
electronic health records, public health information systems and clinical deci-
sion support systems. This work includes developing principled methods for
coding and analysing usage and usability data [12].
– Automated tracking and analysis of human interactions with such data as a way
to lead to improved use and application. For example, in previous work the
authors have been involved in creating what they called a “Virtual Usability
Laboratory”—the VUL. The VUL was designed to collect and collate data from
various sources (e.g. online questionnaires, user tracking logs, error logs and
various forms of qualitative data) to provide detailed but large amounts of data
about users of healthcare information systems [13].
– Large scale usability analyses in healthcare to complement smaller scale qual-
itative studies and usability tests. Some of this work we have referred to as
“usability in the large” where data collected on use and usability of health
information systems may span not only health regions but also across entire
nations [14].
– Further work into creation of personalized health information systems that
populations of lay people, patients and healthcare professionals can interact with
(i.e. in collaboration with their healthcare organizations).
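The multi-source collation performed by a tool like the VUL can be sketched as merging per-user records from different instruments; the field names and data sources below are assumptions for illustration, not the VUL's actual schema.

```python
def collate_usability_data(questionnaires, error_logs):
    """Merge online questionnaire scores and error-log events into one
    per-user summary, in the spirit of a virtual usability laboratory."""
    summary = {}
    for q in questionnaires:
        entry = summary.setdefault(q["user"], {"satisfaction": None, "errors": 0})
        entry["satisfaction"] = q["satisfaction"]
    for e in error_logs:
        entry = summary.setdefault(e["user"], {"satisfaction": None, "errors": 0})
        entry["errors"] += 1
    return summary

# Hypothetical data from two of the VUL's collection channels.
questionnaires = [{"user": "u1", "satisfaction": 4},
                  {"user": "u2", "satisfaction": 2}]
error_logs = [{"user": "u2", "event": "wrong menu item"},
              {"user": "u2", "event": "session timeout"}]

summary = collate_usability_data(questionnaires, error_logs)
print(summary)
```

Even this toy merge shows the value of collation: the user with the lowest satisfaction score is also the one generating the most error events, a pattern neither data source reveals on its own.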
6 Conclusion
Big data is here to stay. Furthermore, over time big data will only become even
“bigger”, with new ways to collect and store huge and ever-increasing amounts of
health information electronically. However, to be useful and effective, ultimately
such large repositories of data need to be synthesized, processed and used by
humans. In this chapter we have touched on a number of areas where human factors
research and application touch on big data initiatives and endeavors. To ensure
success of these projects and to really harness all this potential data for real
application in healthcare, greater and increasing attention will undoubtedly need to
be paid to the human factors of big data. There are a number of challenges that exist
that may currently limit the effectiveness and usefulness of big data and although
some of these are currently being addressed, the ever increasing amount of health
data will continually require new approaches and methods for improving human
interaction with big data.
References
1. Marconi K, Lehmann H (eds) (2014) Big data and health analytics. CRC Press, Boca Raton, FL
2. Kushniruk AW, Patel VL (2004) Cognitive and usability engineering methods for the
evaluation of clinical information systems. J Biomed Inform 37(1):56–76
3. Patel VL, Arocha JF, Kaufman DR (2001) A primer on aspects of cognition for medical
informatics. J Am Med Inform Assoc 8(4):324–343
4. Kushniruk AW (2001) Analysis of complex decision-making processes in health care:
cognitive approaches to health informatics. J Biomed Inform 34(5):365–376
5. Kortum P (2008) HCI beyond the GUI: design for haptic, speech, olfactory, and other
nontraditional interfaces. Elsevier, Amsterdam
6. Jacko JA, Sears A (2012) Human computer interaction handbook. CRC Press, Boca Raton, FL
7. Kushniruk A, Turner P (2012) A framework for user involvement and context in the design
and development of safe e-health systems. Stud Health Technol Inform 180:353–357
8. Cullis P (2015) The personalized medicine revolution: how diagnosing and treating disease
are about to change forever. Greystone Books
9. Shyr C, Kushniruk A, Wasserman WW (2014) Usability study of clinical exome analysis
software: top lessons learned and recommendations. J Biomed Inform 51:129–136
10. Shyr C, Kushniruk A, van Karnebeek CD, Wasserman WW (2015) Dynamic software design
for clinical exome and genome analyses: insights from bioinformaticians, clinical geneticists,
and genetic counselors. J Am Med Inform Assoc 23(2):257–268
11. Murdoch TB, Detsky AS (2013) The inevitable application of big data to health care. JAMA
309(13):1351–1352
12. Kushniruk AW, Borycki EM (2015) Development of a video coding scheme for analyzing the
usability and usefulness of health information systems. In: CSHI, 14 Aug 2015, pp 68–73
13. Kushniruk A, Kaipio J, Nieminen M, Hyppönen H, Lääveri T, Nohr C, Kanstrup AM,
Christiansen MB, Kuo MH, Borycki E (2014) Human factors in the large: experiences from
Denmark, Finland and Canada in moving towards regional and national evaluations of health
information system usability: contribution of the IMIA Human Factors Working
Group. Yearb Med Inform 9(1):67
14. Kaipio J, Lääveri T, Hyppönen H, Vainiomäki S, Reponen J, Kushniruk A, Borycki E,
Vänskä J (2017) Usability problems do not heal by themselves: national survey on physicians’
experiences with EHRs in Finland. Int J Med Inform 97:266–281
Big Data Privacy and Ethical Challenges
Paulette Lacroix
1 Introduction
P. Lacroix (&)
PC Lacroix Consulting Inc., North Vancouver, Canada
e-mail: placroix@placroix.ca
The advancement of technology that led to the possibility of big data occurred over
a short time frame, outdistancing the development of legislative privacy protections.
To allow for big data-type practices in general, new or modified widespread privacy
frameworks for both public and private-sector entities must be implemented to
protect the privacy of individuals and ensure fair and ethical use of their personal
information.
Big data analytics is distinctive in collecting significant amounts of data,
repurposing that data, using anonymization in analysis, generating new data from
these analyses, and being opaque in its data processing.
The Information Accountability Foundation [3] has distinguished four types of
new data produced by big data analytics:
1. Provided data consciously given by individuals, e.g. when filling in an online
form.
2. Observed data that is recorded automatically, e.g. by online cookies or sensors
or closed-circuit television (CCTV) linked to facial recognition.
3. Derived data that is produced from other data in a relatively simple and
straightforward fashion, e.g. calculating customer profitability from the number
of visits to a store and items bought.
4. Inferred data that is based on probabilities and produced by using a more
complex method of analytics to find correlations between datasets and using
these to categorize or profile individuals and populations, e.g. calculating credit
scores or predicting future health outcomes.
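The distinction between derived and inferred data in the list above can be made concrete with a toy calculation; the formulas, margins and coefficients below are purely illustrative assumptions, not real business or clinical models.

```python
import math

def derived_profitability(visits, items_bought,
                          margin_per_item=2.0, cost_per_visit=0.5):
    """Derived data: a simple, direct computation from observed values
    (visits and items bought), using assumed illustrative constants."""
    return items_bought * margin_per_item - visits * cost_per_visit

def inferred_risk_score(age, visits_last_year):
    """Inferred data: a probability-like score produced by a model.
    Here a toy logistic-style formula with made-up coefficients,
    standing in for the complex analytics described in the text."""
    x = 0.04 * age + 0.1 * visits_last_year - 3.0
    return 1 / (1 + math.exp(-x))

print(derived_profitability(visits=5, items_bought=12))            # → 21.5
print(round(inferred_risk_score(age=60, visits_last_year=4), 2))   # → 0.45
```

The derived value follows from the inputs in a simple, auditable way; the inferred score is a model's probabilistic output, which is why inferred data raises the sharper questions about profiling and accuracy discussed below.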
Thus, the privacy principle of direct collection from an individual for a specified
purpose is challenged by big data, affecting an individual’s personal autonomy
based on the right to control one’s personal data and the processing of such
data. Control requires awareness of the use of personal data and real freedom of
choice. These conditions, which are essential to the protection of fundamental
rights, and in particular the right to the protection of personal data, can be met
through different legal solutions tailored according to the given social and technological context [4].
The issue of meaningful informed consent also arises because big data analytics
may involve continuous collection of data over time, where the intended
consequences are not known or fully understood at the time of collection. Further,
each data set will likely contain different data points or values about the individuals
whose personal information is being collected. The principle of data accuracy
requires data to be complete and up to date. The information should be representative
of the target population, should not include discriminatory proxies such as race,
ethnicity or religion, and its results should be understood as correlations, not
causation [5]. Linking data from various sources may increase the likelihood that
decisions from those data will be based on inaccurate information, or on an
individual’s historical record rather than current circumstances or more recent
patterns of conduct.
Big Data Privacy and Ethical Challenges 103
Bias in large data sets may be unknown due to a lack of sampling, intrinsic
collection bias, or poor research design. If a data set
contains a variable that is not protected by law but by proxy is discriminatory, such
as a geographic region that contains a high percentage of individuals with the same
racial or ethnic background, decisions made from the analysis may be based on race
and ethnicity. There is increasing concern that the use of such data may constitute a
form of data surveillance operating against the legitimate interests of the individual.
The development of advanced algorithms has enabled big data to detect the
presence of increasingly complex relationships among significantly large numbers
of variables, and this ability brings with it a critical risk of re-identifying
individuals. De-identification, anonymization and pseudonymization
of data are recommended practices to mitigate risk of privacy breach in large, linked
data sets. Generally, a dataset is said to be de-identified if elements that might
immediately identify a person or organization have been removed or masked. Data
protection legislation defines different treatment for identifiable and non-identifiable
data; however, it is sometimes difficult to make this distinction, especially with
derived data from big data analytics [2]. Identifiability of an individual is
increasingly being seen as a continuum, not binary, and disclosure risks increase
with dimensionality (i.e. number of variables), linkage of multiple data sources, and
the power of data analytics.
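A minimal sketch can make these two ideas concrete: pseudonymization of a direct identifier, and a k-anonymity-style group-size check showing how disclosure risk grows with dimensionality. The record layout, salt and field names are illustrative assumptions, not from the source:

```python
# Sketch: pseudonymization plus a crude disclosure-risk check (illustrative data).
import hashlib
from collections import Counter

SECRET_SALT = b"keep-this-out-of-the-dataset"  # held separately by the data custodian

def pseudonymize(patient_id: str) -> str:
    """Replace a direct identifier with a keyed hash. This is pseudonymization,
    not anonymization: the custodian can still re-link records via the salt."""
    return hashlib.sha256(SECRET_SALT + patient_id.encode()).hexdigest()[:16]

def smallest_group_size(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifiers
    (k-anonymity style); a group of size 1 means a unique, re-identifiable row."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(combos.values())

records = [
    {"id": pseudonymize("P001"), "age_band": "40-49", "postcode": "V8W"},
    {"id": pseudonymize("P002"), "age_band": "40-49", "postcode": "V8W"},
    {"id": pseudonymize("P003"), "age_band": "50-59", "postcode": "V8W"},
]
# Adding variables shrinks the groups and raises re-identification risk:
k1 = smallest_group_size(records, ["postcode"])              # 3
k2 = smallest_group_size(records, ["postcode", "age_band"])  # 1 -> unique record
```

The drop from k1 to k2 as one more variable is considered is the continuum of identifiability in miniature: each added dimension, or linked data source, moves records toward uniqueness.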
Big data profiling is a type of automated processing of personal information that
inputs an individual’s personal information into a predictive model, which then
processes the information according to the set of rules established by the model to
produce an evaluation or prediction concerning one or more attributes of the
individual. For example, it may be used to evaluate or predict an individual’s
eligibility for programs or services. Profiling not only processes personal
information but generates it as well, creating a new element of personal information that
will be associated with the individual. While profiling pre-defines individuals into
types or categories in a reductive approach to understanding human behavior, the
prediction is set at a point in time and some degree of error is expected in the
outcome. It is important for organizations that profile to promote transparency of
the logic used by the predictive model and the potential consequences of the results.
Organizations should verify the results of decisions based solely on profiling and
ensure that individuals may exercise their privacy right to challenge or respond to
such decisions. By its very nature, profiling treats individuals as fixed, transparent objects
rather than as dynamic, emergent subjects [5]. In addition to a loss of dignity or
respect, profiling may have larger effects on society and individuals.
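A profiling model that exposes its own logic, as the transparency obligation above requires, might look like the following sketch. The attributes, weights and threshold are invented for illustration and do not represent any real scoring system:

```python
# Sketch of a rule-based profiling model that reports per-attribute
# contributions, so its logic can be shown to, and contested by, the data
# subject. Weights and attribute names are illustrative.

WEIGHTS = {"late_payments": -0.8, "years_employed": 0.3, "prior_claims": -0.5}
THRESHOLD = 0.0

def profile(individual: dict) -> dict:
    """Score an individual and return the contribution of each attribute,
    making 'the logic used by the predictive model' visible."""
    contributions = {k: WEIGHTS[k] * individual.get(k, 0) for k in WEIGHTS}
    score = sum(contributions.values())
    return {"eligible": score >= THRESHOLD,
            "score": round(score, 2),
            "explanation": contributions}

result = profile({"late_payments": 2, "years_employed": 10, "prior_claims": 1})
# score = -1.6 + 3.0 - 0.5 = 0.9 -> eligible, with every factor's effect visible
```

Returning the explanation alongside the decision is one concrete way to support an individual's right to challenge or respond to a decision based solely on profiling.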
A recommended best practice for an organization that profiles people is to first
consult with public and civil society organizations regarding the impact of the
proposed profiling and to conduct a privacy impact assessment.
The European Union (EU) General Data Protection Regulation (GDPR), fully
applicable in May 2018, supersedes the 1995 Data Protection Directive and
strengthens and harmonizes the protection of personal data for EU citizens.
The GDPR considers not only the location of the data processing but also whether
personal data relating to individuals located in the EU are being processed,
regardless of where the data controller is established in the world. This legislation
has a global reach and has effectively influenced legislative changes in privacy
protection in other countries [6]. The GDPR has expanded data protection princi-
ples to require organizations to demonstrate accountability in the collection, use and
disclosure of personal information. The emerging importance of accountability is in
direct response to the implications of the processing of personal data in a big data
world.
More specifically, the GDPR requires a data protection impact assessment be
completed for initiatives that involve “a systematic and extensive evaluation of
personal aspects relating to natural persons which is based on automated process-
ing, including profiling, and on which decisions are based that produce legal effects
concerning the natural person or similarly significantly affect the natural person”
[2]. Other provisions in the Regulation include data protection by design and by
default (e.g. Privacy by Design [7]) and certification (e.g. the establishment of
certification mechanisms and data protection seals and marks, giving the public
quick access to the level of data protection of relevant products and services).
A prevailing view is that any potential harms arising from big data analytics stem
from how the data are used, not necessarily how the data were collected. The GDPR
accountability principle now focuses attention on the use of personal information
through mechanisms such as scrutinizing the technical design of algorithms,
auditing the analytics process and applying software-defined regulations.
Accountability has been championed over transparency, which to date is known to
have many limitations in protecting an individual’s right to privacy.
In a recent report, the Information Commissioner for the United Kingdom proposed
the following six recommendations for organizations conducting big data
analytics:
1. Carefully consider whether the big data analytics requires the processing of
personal data and use appropriate techniques to anonymize personal data in the
dataset(s) before analysis.
2. Be transparent about the processing of personal data by using a combination of
approaches to provide meaningful privacy notices at appropriate stages
throughout a big data project. This may include the use of icons, just-in-time
notifications and layered privacy notices.
3. Embed a privacy impact assessment framework into big data processing
activities to help identify privacy risks and assess the necessity and propor-
tionality of a given project. The privacy impact assessment should involve input
from all relevant parties including data analysts, compliance officers, board
members and the public.
4. Adopt a privacy by design approach in the development and application of big
data analytics. This should include implementing technical and organizational
measures to address data security, data minimization and data segregation.
5. Develop ethical principles to help reinforce key data protection principles.
Employees in smaller organizations should use these principles as a reference
point when working on big data projects. Larger organizations should create
ethics boards to help scrutinize projects and assess complex issues arising from
big data analytics.
6. Implement innovative techniques to develop auditable machine learning algo-
rithms. Internal and external audits should be undertaken with a view to
explaining the rationale behind algorithmic decisions and checking for bias,
discrimination and errors [2].
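Recommendation 6's call for audits that check for bias can be illustrated with one simple audit check, the "four-fifths" disparate impact heuristic. The decision log and the 0.8 threshold are illustrative assumptions for this sketch and are not taken from the ICO report:

```python
# Sketch of one bias-audit check: compare an algorithm's positive-decision
# rates across groups and flag large disparities (illustrative data).

def selection_rates(decisions):
    """decisions: list of (group, approved) pairs -> approval rate per group."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}

def disparate_impact_ratio(decisions):
    """Ratio of the lowest to the highest group approval rate."""
    rates = selection_rates(decisions)
    return min(rates.values()) / max(rates.values())

audit_log = [("A", True)] * 8 + [("A", False)] * 2 + \
            [("B", True)] * 4 + [("B", False)] * 6
ratio = disparate_impact_ratio(audit_log)  # 0.4 / 0.8 = 0.5
flagged = ratio < 0.8                      # below the four-fifths heuristic
```

A flagged result does not prove discrimination, but it gives internal and external auditors a concrete starting point for explaining the rationale behind algorithmic decisions.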
Personal data protection regimes, like the GDPR, are instruments for the
governance of data flows and data processing, and they remain valuable for the
protection of personal data in line with classical data processing. Yet they may be
inadequate to address the unprecedented challenges raised by big data, in particular
the frequent incompatibility between big data and privacy principles. The purposes of
algorithm-driven big data analytics are often to discover otherwise invisible patterns
in the data, rather than to apply previous insights, test hypotheses, or develop
explanations. Add to this the technical complexities of machine learning and AI,
and the effect can be the distancing of supervisory authorities and undertakings
from the meaning of the right to data protection. Ethics allows a return to the
spirit of the law and offers other insights for conducting an analysis of a new digital
society, such as its collective ethos, its claims to social justice, democracy and
personal freedom [4].
The adoption of an ethical approach to big data processing is being driven by two
main factors. In the public sector, evidence of a lack of public awareness about the
use of data and the extent of data sharing has led to calls for ethical policies to be
made explicit. The commercial imperative in the private sector is to mitigate risk of
reputational harm due to public distrust and brand devaluation [8]. While it is now
recognized that adherence to privacy legislation is not enough, ethical frameworks
for big data analytics and research are highly contested and in flux.
At the heart of the ethics debate are the consequences of the speed, capacity and
continuous generation of big data, as well as changes in the relationality,
flexibility, repurposing and de-contextualization of data. Of particular concern is
the intensification of algorithmic profiling and ‘personalization’, in which
individuals are treated not as persons but as temporary aggregates of data
processed at an industrial scale. But
human beings are not identical to their data. Human values must be understood and
implemented within a social, cultural, political, economic and technological context
in which personal data and personal experience is made. Therefore, digital ethics
should take into account the widely changing relationship between digital and
human realities. Big data generates new ethical questions about what it means to be
human in relation to data, about human knowledge and about the nature of human
experience. It obliges us to re-examine how we live and work, how we socialize and
participate in communities, and our relations with others and, perhaps most
importantly, with ourselves. It invites ethical evaluation and a new interpretation of
fundamental notions in ethics, such as dignity, freedom, autonomy, solidarity,
equality, justice, and trust [4].
Trust, as a concept related to the perception of risk and uncertainty, has grown in
importance in the evolution of information technologies as a bridge between
technical and moral aspects of technically assisted communication systems.
Crucially, trust has a double meaning in data protection. One is a
technologically-oriented, functional or knowledge concept: trust in a technology
refers to the confidence that it will not fail in its pure functionality, that its design
and engineered properties will carry out their expected function. The second
meaning is that trust is a moral concept referring to belief and reliance in a person or
organization that they will honour explicit or implicit promises and commitments
[4]. In this context data protection faces three interrelated crises of trust:
1. Individual trust in people, institutions and organizations that deal with personal data;
2. Institutional trust, transparency and accountability as a condition for keeping
track of the reputations of individuals and organizations and trust-building in a
society that requires access to personal data; and
3. Social trust in other members of social groups anchored in personal proximity
and physical interaction, which are being increasingly replaced by digital
connections.
Trust builds on shared assumptions about material and immaterial values, about
what is important and what is expendable. It stems from shared social practice,
shared habits, ways of life, common norms, convictions and attitudes. Trust is based
on shared experiences, on a shared past, shared traditions and shared memories.
It is concerning that big data science sidesteps many of the informal modes of
ethics regulation found in other science and technology communities. The precursor
disciplines of data science (computer science, physics, and applied mathematics)
have not historically fallen under the purview of ethics review at universities. The
reason is that their work and contributions have historically been about systems
rather than people, placing them outside human-subjects ethics concerns. As a result,
the content of the datasets is considered irrelevant to the substantive questions of
human-related research including the privacy rights of research subjects. The result
is a disjunction between the familiar infrastructures and conceptual frameworks of
research ethics and the emerging epistemic conditions of big data. Data scientists
are often able to gain access to highly sensitive data about human subjects without
ever intervening in the lives of those subjects to obtain it. They may predict, or
infer, or gather data from disconnected public data sets. It is important to note that
big data research which re-uses de-identified or publicly available data will largely
be excused from ethics oversight as long as it meets unspecified privacy safeguards
such as anonymization or de-identification. Given the accepted definition of
human-subjects research, nearly all non-biomedical research would receive at most
perfunctory oversight due to the assumption that there is little or no risk of harm [8].
The consensus view of the European Advisory Group [4] is that a digital ethics
framework will provide new terms for identifying, analyzing and communicating
new human realities, in order to displace traditional value-based questions and
identify new challenges in view of values at stake and existing and foreseeable
technological changes. The purpose of digital ethics is not only to account for the
present, but also to perform a foresight function. The shift is twofold. First, the
object of legal regulation (i.e. the individual) can become less interesting as a
phenomenon in the here-and-now and more an object of reasoned speculation
about its future role, based on the predictive powers of big data and algorithmic
processing. Second, while the analysis of legal issues is being pushed into
the future, what is understood as existing in the future becomes drawn into the
assessments of the present. For example, estimates of what the future will hold,
generated through the patterns gathered in big data analysis, are continuously
gaining in importance for the way criminal justice operates today and is purported
to operate tomorrow.
The focus of digital ethics is primarily meta-ethical: it considers general and
fundamental questions about what it means to make claims about ethics and human
conduct in the digital age, when the baseline conditions of ‘human-ness’ are under
pressure from interconnectivity, algorithmic decision-making, machine learning,
digital surveillance and the enormous collection of personal data, and about what
can and should be retained, or adapted, from traditional normative ethics. The
following examples
provide insight into the need for a digital ethics framework [4].
1. From the individual to the digital subject: Data exhausts neither personal
identity nor the qualities of the communities to which individuals belong; data
protection is not only about the protection of data, but primarily about the
protection of the persons behind the data. The question is whether the digital
representation of persons may expose them to new forms of vulnerability and
harm.
2. From analogue to digital life: The governing and the governed are distinct but
linked by mutually recognized principles of legal obligation and accountability.
Digital technologies have changed this. The use of algorithms and large data sets
can shape and direct the lives of individuals, who are therefore increasingly
governed on the basis of the data generated from their own behaviours and
interactions. The distinction between the forces that govern everyday life and the
persons who are governed within it thus becomes more difficult to discern.
Behaviour may be
governed by ‘nudging’, that is by minute, barely noticeable suggestions, which
can take a variety of forms and which may modify the scope of choices indi-
viduals have or believe they have.
3. From a risk society to a scored society: Risk assessment is carried out using
techniques of probability calculation, allowing individuals to be pooled and
situations with the same level of risks to be identified with each other for the
purposes of understanding the value of loss and the cost of compensation. In the
digital age, algorithms supported by big data can provide a far more detailed and
granular understanding of individual behaviours and propensities, allowing for
more individualized risk assessments and the apportioning of actual costs to
each individual; such assessment of risk threatens contractual or general prin-
ciples and widely shared ideas of solidarity. In this scored society, individuals
can be hyper-indexed and hyper-quantified. Beliefs and judgments about them
can be made through opaque credit or social scoring algorithms that must be
open to negotiation or contestation.
4. From human autonomy to the convergence of humans and machines: An
increasing number of technological artefacts, from prostheses like eyeglasses
and hearing aids, to smartphones, GPS, augmented reality glasses and more, can
be experienced in a symbiotic relationship with the human body. These artefacts
are experienced less as objects of the environment than as a means through
which the environment is experienced and acted upon. As such, they may tend
toward a seamless framing of our perception of reality. They may shape our
experience of the world in ways that can be difficult to assess critically. This
phenomenon of incorporation or even embodiment of technologies is even more
intense whenever the devices are implanted in the body. A parallel frontier of
convergence between human and machines is on the verge of being crossed by
intelligent, or rather ‘autonomous’, machines that are able to adapt their
behaviours and, rather than merely executing human commands, collaborate with
or even replace human agents, helping them identify problems to be solved or the
optimal paths to solving them.
5. From individual responsibility to distributed responsibility: The problems of
many hands and problems of collective action and collective inaction can lead to
tragedies of the commons and problematic moral assessments of complex
human endeavours, both low and high tech, where a number of people act
jointly via distant causal chains, while being separated in time and space from
each other and from the aggregated outcomes of their individual agency. The
problems of allocation and attribution of responsibilities are exacerbated by the
networked configuration of the digitized world.
6. From criminal justice to pre-emptive justice: In legal practice, the detection and
investigation of crime is no longer only a science of criminal acts, of identifying
and adjudicating events authored by identifiable, accountable individual actors
under precise conditions and in terms of moral and legal responsibility, but also a
statistically supported calculation of the likelihood of future crime, a structuring of
the governance of crime around the science of possible transgression and possible
guilt, removing moral character from the equation. The aim of criminal justice
remains the same: to provide security within society while at the same time
adhering to high standards of human rights and the rule of law. However, the shift
that marks one of the main backdrops of the digital age and calls for a new digital
ethics is that of trying to predict criminal behaviour in advance, using the output
of big data-driven analysis and smart algorithms to look into the future.
A new digital geopolitics has been created by differences in data protection rules,
as national borders no longer represent the limits of data flows. The consequences
for global governance are significant. These digital geopolitics will impact national
cultures to the extent that national sovereignty will be increasingly strained between
national pressures and the shifting norms of the international system. There is
significance and urgency to developing a digital ethics framework, as evidenced by
digital ethics being the core topic of the 2018 International Conference of Data
Protection and Privacy Commissioners.
learning can lead to individual bias and erosion of human rights. Human
oversight and accountability are necessary in profiling.
4. While technology will continue to converge in a symbiotic relationship with
humans, and generally to the benefit of human health and wellness, this con-
vergence has the potential to shape our perception of humanity and human
values over time. Humans are not identical to their data and should not be
temporary aggregates of data processing. Big data will generate new ethical
questions about what it means to be human in relation to data, about human
knowledge and about the nature of human experience.
5. Trust, a moral concept referring to belief and reliance in a person or organization
that they will honour explicit or implicit promises and commitments, stems from
shared social practice, shared habits, ways of life, common norms, convictions
and attitudes. Big data science researchers are often able to gain access to highly
sensitive data about human subjects without intervening in the lives of the
subjects to obtain it. The use of privacy impact assessments prior to release of
sensitive data provides a means for healthcare providers to determine and
mitigate risk, thus acknowledging the value of an individual’s trust of the
healthcare system while supporting the benefits of big data analytics.
6 Conclusion
References
1. Denham E (2017) Big data, artificial intelligence, machine learning and data protection.
Version 2.2. Information Commissioner’s Office. Available from https://ico.org.uk/media/for-
organisations/documents/2013559/big-data-ai-ml-and-data-protection.pdf. Accessed on 19
June 2018
2. European Union. General data protection regulation. Available from https://gdpr-info.eu and
https://ec.europa.eu/commission/priorities/justice-and-fundamental-rights/data-protection/
2018-reform-eu-data-protection-rules_en#abouttheregulationanddataprotection. Accessed on
19 June 2018
3. Abrams M (2014) The origins of personal data and its implications for governance. OECD.
Available from http://informationaccountability.org/wp-content/uploads/Data-Origins-Abrams.
pdf. Accessed on 19 June 2018
4. European Data Protection Supervisor Ethics Advisory Group (2018) Towards a digital ethics.
Available from https://edps.europa.eu/sites/edp/files/publication/18-01-25_eag_report_en.pdf.
Accessed on 19 June 2018
5. Office of the Information and Privacy Commissioner of Ontario (2017) Big data guidelines.
Available from https://www.ipc.on.ca/wp-content/uploads/2017/05/bigdata-guidelines.pdf.
Accessed on 19 June 2018
6. The Council for Big Data, Ethics and Society (2016) Perspectives on big data, ethics, and
society. Available from https://bdes.datasociety.net/wp-content/uploads/2016/05/Perspectives-
on-Big-Data.pdf. Accessed on 19 June 2018
7. Cavoukian A (2011) Privacy by design. The 7 foundational principles. Information and Privacy
Commissioner of Canada. Available from https://www.ipc.on.ca/wp-content/uploads/
Resources/7foundationalprinciples.pdf. Accessed on 19 June 2018
8. Metcalf J, Crawford K (2016) Where are human subjects in big data research? The emerging
ethics divide. Big Data Soc (Jan–June):1–14. Available from http://journals.sagepub.com/doi/
pdf/10.1177/2053951716650211. Accessed on 19 June 2018
9. World Medical Association (2016) WMA declaration of Taipei on ethical considerations
regarding health databases and biobanks. Available from https://www.wma.net/policies-post/
wma-declaration-of-taipei-on-ethical-considerations-regarding-health-databases-and-biobanks/.
Accessed on 19 June 2018
Part III
Technological Perspectives
Health Lifestyle Data-Driven Applications Using Pervasive Computing
1 Introduction
The use of mobile technology and wearables for health has become a mass
phenomenon. Millions of people are using wearable devices (e.g. the Apple Watch)
and mobile apps for health reasons. Data from mobile and wearable devices are
captured to quantify patient-reported outcomes, to support both clinical trials and
clinical practice. The combination of mobile and wearable technology with other
connected health devices is often referred to as pervasive or ubiquitous computing,
which describes the tendency to embed computing elements into everyday
objects (e.g. wearable devices, the Internet of Things) [1]. Pervasive computing has
several potential applications in the health domain, but in particular, it can be very
useful to monitor lifestyle using wearables, patient-reported outcomes via mobile
phones and patient behaviours relying on the Internet of Things [2]. Since lifestyle
plays a major role in the prevention and management of multiple health conditions,
pervasive technologies can also be used to foster new applications for precision
medicine [3].
Transforming data from wearables and mobile devices into actionable knowledge
that can support the decision making of professionals and patients is not a trivial
task. It is a complex process involving multiple steps, as shown in
Table 1. Furthermore, the selection of data sources also impacts the potential
applications. In terms of data-driven analytics, most of the discussion has been
focused on what has been called Big Data, which is widely covered in this textbook
and related surveys [4]. However, dealing with lifestyle, mobile and wearable
technologies brings additional challenges which are covered in this chapter. We
also need to consider that when collecting personal health data, we might have
scenarios where small data about one individual’s behavior has more value than a
huge dataset from a large population. Small data does not necessarily mean worse
data, and the boundary between big and small is not always clear [5].
Table 1 Steps required for big data value extraction from mobile and wearable health devices.
Adapted from Curry [6]

Data acquisition: With regards to lifestyle, the use of mobile and wearable
technologies has the capacity to capture the context of patients at the right time and
place. Captured data can come directly from user interfaces (e.g. psychological
patient-reported outcomes [7]) or a wide variety of sensors [8]. These sensors are
not limited to wearables, as there are implantable and semi-implantable sensors
such as continuous glucose monitoring devices.

Data curation and storage: Extracting insights from health data also requires the
capacity to curate the data and assess its quality [9]. In addition, it is important to
ensure its interoperability so it can be integrated with larger datasets. This is of
special importance if we foresee the need to integrate lifestyle data into Electronic
Health Records (EHRs) for various applications (e.g. integration of sensor data in
EHRs using HL7 FHIR [10]). As information about individuals becomes
increasingly integrated, we might find use cases where lifestyle data gets integrated
with other biomedical data sources (e.g. clinical data, genotype). In the long run,
storing such connected data might become a serious challenge. Ethical aspects,
such as privacy and consent, are also relevant when storing and sharing health data.

Data analysis: The numerous types of applications that can be built around
data-driven lifestyle require the use of different machine learning techniques [11].
The data analytics techniques are heavily dependent on the application. For
example, a health coaching solution might require real-time pattern recognition in
order to provide recommendations to patients. However, aggregating public health
data about lifestyle for policy makers might not require such real-time analysis, but
rather clustering of the population by predicted health risks.

Data applications: The range of applications includes visualization dashboards,
clinical safety and logistics, decision support systems for professionals, coaching
systems for patients, etc. At the application level, one of the biggest challenges is
user engagement and usability; consequently, this area of work includes the use of
human-computer interaction techniques.
Table 1 summarizes the main steps involved in the creation of data-driven
applications, from the acquisition of data to the creation of applications.
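The EHR-integration use case mentioned in the curation step (sensor data via HL7 FHIR [10]) can be sketched by packaging a daily step count as a FHIR Observation resource. The patient reference and date are placeholders, and the LOINC code shown is a commonly used step-count code that should be checked against the profile actually in use:

```python
# Sketch: a daily step count from a wearable packaged as an HL7 FHIR
# Observation (JSON). Field choices follow the FHIR R4 Observation
# structure; patient reference and codes are illustrative placeholders.
import json

def steps_to_fhir(patient_ref: str, steps: int, day: str) -> dict:
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{
            "system": "http://loinc.org",
            "code": "55423-8",          # LOINC: number of steps (verify for your profile)
            "display": "Number of steps"}]},
        "subject": {"reference": patient_ref},
        "effectivePeriod": {"start": f"{day}T00:00:00Z", "end": f"{day}T23:59:59Z"},
        "valueQuantity": {"value": steps, "unit": "steps"},
    }

obs = steps_to_fhir("Patient/example", 8542, "2018-06-19")
payload = json.dumps(obs)  # suitable for POSTing to a FHIR server's Observation endpoint
```

Expressing wearable output in a standard resource like this is what makes the interoperability goal of the curation step achievable in practice.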
This chapter is structured as follows. In the next section, we provide an overview
of data-driven applications relying on pervasive health technologies, including
general aspects and examples from diabetes management. Finally, in the discussion
section we provide a summary of the main non-technical challenges, in particular
socio-ethical aspects, that could be barriers to the development of healthy lifestyle
data-driven applications.
The aggregation of data from the wearable and mobile devices of large populations
allows the tracking of lifestyle patterns, such as physical activity, in near real time,
which in turn supports the understanding of modifiable risk factors to inform public
health officials. Tim Althoff, in a recent review, highlighted potential applications and
technical challenges [12]. The same author, in a related study, reported on the use of
mobile data to study physical activity patterns on a global scale, using data from
over seven hundred thousand users [13]. For many of these approaches,
challenges include access to such datasets by public health officials, the
representativeness of the data, and the transformation of data into decision support
tools. Microsoft
Research has also studied large datasets from wearable devices to better understand
sleep patterns at the population level, integrating search logs and data from health
apps as well [14]. Another approach for public health studies involving pervasive
health solutions is the use of sensors and mobile technologies (e.g. mobile sleep
labs) within observational studies. For example, the website www.sleepdata.org
incorporates data from actigraphy devices (clinical wearable devices for studying
sleep and physical activity) from thousands of patients across several years [15].
These datasets have been used in applications such as developing techniques for the
detection of sleep apnea [16], which is an example of how epidemiological
pervasive data can be used to create new diagnostic applications.
visual metaphors exist, such as node-link networks in the Health Infoscape [24], trees
and maps [25], scatterplots of dimension-reduced data, and parallel coordinate
plots [26], but they can be difficult for the users targeted by these data
visualizations to understand. Based on user needs and feedback, visualization experts design
the most efficient visual metaphors and interactions for specific data, tasks
and users. Still, data visualization literacy [27] remains a key challenge: users must be
educated to understand these powerful graphics before their use can be enabled at large
scale for pervasive health data visualization. In Qatar, we are working on visual
analytics of wearable data to better understand behavioral patterns of children
with obesity (see Fig. 1).
Scalability to large data sets is another challenge. A standard way to address this
issue is to pre-process the data before the rendering stage, where the visual
metaphors encoding the data are displayed as pixel images. Data mining and
machine learning techniques [29] are used to summarize the data with simple
statistics, such as counts and averages, or by selecting prototypical examples with vector
quantization approaches [30]. Other techniques, such as dimension reduction [31] and
feature selection, are employed to reduce the number of features. The
scalability issue with visualizing big data therefore lies not in rendering but in computing
minimal summaries that are still meaningful and useful to the user.
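As a minimal illustration of this pre-processing step (hypothetical data and function names, not from the chapter), per-minute step counts from a wearable can be collapsed into hourly summaries so that the renderer draws a handful of points instead of thousands:

```python
from statistics import mean

def summarize_by_hour(minute_steps):
    """Reduce per-minute step counts to one {total, mean} summary per hour,
    so a chart draws 24 points for a day instead of 1440."""
    hourly = []
    for start in range(0, len(minute_steps), 60):
        chunk = minute_steps[start:start + 60]
        hourly.append({"total": sum(chunk), "mean": mean(chunk)})
    return hourly

# One day of per-minute readings (1440 values) collapses to 24 summaries.
day = [5] * 1440
summary = summarize_by_hour(day)
assert len(summary) == 24
assert summary[0]["total"] == 300  # 5 steps/min * 60 min
```

The same idea generalizes to the other techniques mentioned above: the averaging step could be swapped for vector quantization prototypes or a dimension-reduction projection before rendering.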
The best option to ensure visualizations are meaningful and easy to understand
is to adopt a user-centric design approach that progressively introduces more
advanced graphics, finely tuned to the user's needs. Any graphic can be
adopted as long as it is usable and useful and the user has been trained to use it. We can expect
that, in the future, digital health literacy will also incorporate elements such as the
capacity to understand key concepts of visual analytics and machine learning.
Fig. 1 Visualization dashboard for sleep and physical activity of children with obesity [28]
120 L. Fernandez-Luque et al.
Diabetes is a chronic condition in which many different lifestyle factors play a role in
the control of the disease. Physical activity, nutrition, sleep and stress interact with
biological factors that influence how we metabolize glucose and even our appetite
[32]. Furthermore, many complications of the disease, such as fatigue, are often the
result of lifestyle factors. These physiological factors that influence insulin sensitivity
(and consequently diabetes control) are not yet incorporated into closed-loop
artificial pancreas systems [33] (see Fig. 2 for an example), in which insulin pumps
are adjusted automatically using data from continuous glucose monitoring devices.
Lifestyle data related to diabetes management have been collected using mobile and
wearable technologies for many years to support the decision making of healthcare
professionals, patients and relatives [34, 35]. There are also examples of how
physical activity data from wearables can be used to create data-driven coaching
solutions for diabetes [20].
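The closed-loop idea can be illustrated with a deliberately simplified toy: a proportional rule that nudges a basal insulin rate according to the deviation of a CGM reading from a target. Every name and parameter below is made up for illustration; this is not a clinical algorithm, and real artificial pancreas systems use validated, far more sophisticated control strategies:

```python
def toy_basal_adjustment(cgm_mg_dl, target=110.0, basal=1.0, gain=0.01):
    """Toy proportional rule: raise the basal insulin rate when CGM glucose
    is above target, lower it when below, never going negative.
    Purely pedagogical -- NOT a clinical algorithm."""
    adjusted = basal + gain * (cgm_mg_dl - target)
    return max(0.0, adjusted)

assert toy_basal_adjustment(110.0) == 1.0   # at target: rate unchanged
assert toy_basal_adjustment(160.0) == 1.5   # high glucose: more insulin
assert toy_basal_adjustment(60.0) == 0.5    # low glucose: less insulin
```

The point the chapter makes is that inputs such as exercise or sleep, which change insulin sensitivity, do not yet feed into this loop; in the sketch that would mean the fixed `gain` should itself depend on lifestyle data.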
regulation, another approach to fostering the privacy and security of health apps and
wearables is to improve users' skills by increasing their digital health literacy
[57].
Any public health intervention should aim to reduce health disparities and ensure
equity among the population. For data-driven personal health applications,
the representativeness of the data presents a major challenge. Early adopters of
technology tend to be those with higher education and better socio-economic
conditions. Consequently, data-driven models may inadvertently acquire biases,
which can lead to the models not performing equally well for individuals
underrepresented in the datasets used to train them. When designing
data-driven health applications, it is imperative that the training data be representative of
the population to be served, to avoid unethical and biased outcomes [54]; for
example, by ensuring the enrollment of minorities and underserved communities.
Such biases are likely to be pronounced in lifestyle datasets, as cultural factors are
well known to shape our lifestyles and routines. Approaches are emerging that use
machine learning to reduce biases [55].
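One simple, hypothetical way to operationalize such a representativeness check is to compare each demographic group's share in the training data against its share in the target population and flag large gaps. The group labels and the 50% threshold below are illustrative assumptions, not a published method:

```python
def underrepresented_groups(training_counts, population_shares, ratio=0.5):
    """Flag demographic groups whose share in the training data is less than
    `ratio` times their share in the target population."""
    n = sum(training_counts.values())
    flagged = []
    for group, pop_share in population_shares.items():
        train_share = training_counts.get(group, 0) / n
        if train_share < ratio * pop_share:
            flagged.append(group)
    return flagged

# Hypothetical wearable study: group B is 30% of the population
# but only 5% of the training data, so it gets flagged.
train = {"A": 95, "B": 5}
pop = {"A": 0.70, "B": 0.30}
assert underrepresented_groups(train, pop) == ["B"]
```

Running such a check before model training is one concrete way to surface the enrollment gaps discussed above before they become model biases.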
Other socio-economic factors can become barriers or enablers to data-driven
personal health applications. The increased availability of data can improve the
quality of machine learning models, but it also increases the value of data for many
organizations. Consequently, there are serious concerns regarding the "privatization"
of health data [56]. A good example of such "privatization" is fitness sensors
that provide neither an open API for accessing raw data nor integration
capabilities with third-party applications. Further, in many countries, healthcare
providers cannot use devices such as the Fitbit because its cloud is located outside
their country.
4 Conclusions
The increasing penetration of mobile and wearable technologies has been paving
the way for the development of innovative data-driven personal health applications.
These new applications build upon decades of experience in using mobile and
wearable technologies in the health domain, but they are being launched at an
unprecedented scale. We must look past the buzz and hype and acknowledge the new
socio-ethical challenges, which require a strong multidisciplinary partnership with
deep engagement of clinicians and patients, to ensure that these technological
developments really improve public health and do not further increase
health disparities.
References
1. Orji R, Moffatt K (2018) Persuasive technology for health and wellness: state-of-the-art and
emerging trends. Health Inform J 24:66–91
2. Riazul Islam SM, Kwak D, Humaun Kabir M, Hossain M, Kwak KS (2015) The internet of
things for health care: a comprehensive survey. IEEE Access 3:678–708
3. Intille S (2016) The precision medicine initiative and pervasive health research. IEEE
Pervasive Comput 15:88–91
4. Fang R, Pouyanfar S, Yang Y, Chen S-C, Iyengar SS (2016) Computational health
informatics in the big data age. ACM Comput Surv 49:1–36
5. Faraway JJ, Augustin NH (2018) When small data beats big data. Stat Probab Lett 136:142–145
6. Curry E (2016) The big data value chain: definitions, concepts, and theoretical approaches. In:
New horizons for a data-driven economy, pp 29–37
7. Heron KE, Smyth JM (2010) Ecological momentary interventions: incorporating mobile
technology into psychosocial and health behaviour treatments. Br J Health Psychol 15:1–39
8. Rodgers MM, Pai VM, Conroy RS (2015) Recent advances in wearable sensors for health
monitoring. IEEE Sens J 15:3119–3126
9. Bialke M, Rau H, Schwaneberg T, Walk R, Bahls T, Hoffmann W (2017) MosaicQA—a
general approach to facilitate basic data quality assurance for epidemiological research.
Methods Inf Med 56:e67–e73
10. Walinjkar A, Woods J (2017) Personalized wearable systems for real-time ECG classification
and healthcare interoperability: real-time ECG classification and FHIR interoperability. In:
Internet technologies and applications (ITA). https://doi.org/10.1109/itecha.2017.8101902
11. Habib ur Rehman M, Liew CS, Wah TY, Shuja J, Daghighi B (2015) Mining personal data
using smartphones and wearable devices: a survey. Sensors 15:4430–4469
12. Althoff T (2017) Population-scale pervasive health. IEEE Pervasive Comput 16:75–79
13. Althoff T, Sosič R, Hicks JL, King AC, Delp SL, Leskovec J (2017) Large-scale physical
activity data reveal worldwide activity inequality. Nature 547:336–339
14. Althoff T, Horvitz E, White RW, Zeitzer J (2017) Harnessing the web for population-scale
physiological sensing. In: Proceedings of the 26th international conference on world wide
web—WWW ’17. https://doi.org/10.1145/3038912.3052637
15. Dean DA 2nd, Goldberger AL, Mueller R, Kim M, Rueschman M, Mobley D et al (2016)
Scaling up scientific discovery in sleep medicine: the national sleep research resource. Sleep
39:1151–1164
16. Haidar R, Koprinska I, Jeffries B (2017) Sleep apnea event detection from nasal airflow using
convolutional neural networks. Lecture Notes in Computer Science, pp 819–827
17. Jaimes LG, Llofriu M, Raij A (2016) Preventer, a selection mechanism for just-in-time
preventive interventions. IEEE Transact Affect Comput 7:243–257
18. Schäfer H, Hors-Fraile S, Karumur RP, Valdez AC, Said A, Torkamaan H, et al (2017)
Towards health (aware) recommender systems. In: Proceedings of the 2017 international
conference on digital health—DH ’17. https://doi.org/10.1145/3079452.3079499
19. Dias Pereira dos Santos A, Yacef K, Martinez-Maldonado R (2017) Let’s dance: how to build
a user model for dance students using wearable technology. In: Proceedings of the 25th
conference on user modeling, adaptation and personalization—UMAP ’17, ACM Press, New
York, USA, pp 183–191
20. Hochberg I, Feraru G, Kozdoba M, Mannor S, Tennenholtz M, Yom-Tov E (2016)
Encouraging physical activity in patients with diabetes through automatic personalized
feedback via reinforcement learning improves glycemic control. Diabetes Care 39:e59–e60
21. Hu X, Hsueh P-YS, Chen C-H, Diaz KM, Cheung Y-KK, Qian M (2017) A first step towards
behavioral coaching for managing stress: a case study on optimal policy estimation with
multi-stage threshold Q-learning. In: AMIA annual symposium proceedings, pp 930–939
Health Lifestyle Data-Driven Applications Using Pervasive … 125
22. Badgeley MA, Shameer K, Glicksberg BS, Tomlinson MS, Levin MA, McCormick PJ et al
(2016) EHDViz: clinical dashboard development using open-source technologies. BMJ Open
6:e010579
23. Wanderer JP, Nelson SE, Ehrenfeld JM, Monahan S, Park S (2016) Clinical data
visualization: the current state and future needs. J Med Syst 40:275
24. MIT health infoscape [Internet]. Available http://senseable.mit.edu/healthinfoscape/
25. Araujo MLD, Mejova Y, Aupetit M, Weber I (2017) Visualizing health awareness in the
Middle East. In: AAAI conference on web and social media ICWSM, p 726
26. The data visualisation catalogue [Internet] Available https://datavizcatalogue.com/index.html
27. Börner K, Maltese A, Balliet RN, Heimlich J (2016) Investigating aspects of data
visualization literacy using 20 information visualizations and 273 science museum visitors.
Inf Vis 15:198–213
28. Aupetit M, Fernandez-Luque L, Singh M, Srivastava J (2017) Visualization of wearable data
and biometrics for analysis and recommendations in childhood obesity. In: IEEE 30th
international symposium on computer-based medical systems (CBMS). https://doi.org/10.
1109/cbms.2017.120
29. Bishop CM (2016) Pattern recognition and machine learning. Springer
30. Aupetit M, Couturier P, Massotte P (2002) Gamma-observable neighbours for vector
quantization. Neural Netw 15:1017–1027
31. Lespinats S, Aupetit M, Meyer-Baese A (2015) ClassiMap: a new dimension reduction
technique for exploratory data analysis of labeled data. Int J Pattern Recognit Artif Intell
29:1551008
32. Arora T, Choudhury S, Taheri S (2015) The relationships among sleep, nutrition, and obesity.
Curr Sleep Med Rep 1:218–225
33. Kudva YC, Carter RE, Cobelli C, Basu R, Basu A (2014) Closed-loop artificial pancreas
systems: physiological input to enhance next-generation devices. Diabetes Care 37:1184–1190
34. Heintzman ND (2015) A digital ecosystem of diabetes data and technology: services, systems,
and tools enabled by wearables, sensors, and apps. J Diabetes Sci Technol 10:35–41
35. Dadlani V, Levine JA, McCrady-Spitzer SK, Dassau E, Kudva YC (2015) Physical activity
capture technology with potential for incorporation into closed-loop control for type 1
diabetes. J Diabetes Sci Technol 9:1208–1216
36. Ghafar-Zadeh E (2015) Wireless integrated biosensors for point-of-care diagnostic applica-
tions. Sensors 15:3236–3261
37. Ratjen I, Schafmayer C, di Giuseppe R, Waniek S, Plachta-Danielzik S, Koch M et al (2017)
Postdiagnostic physical activity, sleep duration, and TV watching and all-cause mortality
among long-term colorectal cancer survivors: a prospective cohort study. BMC Cancer
17:701
38. Gell NM, Grover KW, Humble M, Sexton M, Dittus K (2017) Efficacy, feasibility, and
acceptability of a novel technology-based intervention to support physical activity in cancer
survivors. Support Care Cancer 25:1291–1300
39. Gresham G, Schrack J, Gresham LM, Shinde AM, Hendifar AE, Tuli R et al (2018) Wearable
activity monitors in oncology trials: Current use of an emerging technology. Contemp Clin
Trials 64:13–21
40. Smith MT, McCrae CS, Cheung J, Martin JL, Harrod CG, Heald JL et al (2018) Use of
actigraphy for the evaluation of sleep disorders and circadian rhythm sleep-wake disorders: an
American Academy of Sleep Medicine clinical practice guideline. J Clin Sleep Med 14:1231–1237
41. Nahum-Shani I, Smith SN, Spring BJ, Collins LM, Witkiewitz K, Tewari A et al (2018)
Just-in-time adaptive interventions (JITAIS) in mobile health: key components and design
principles for ongoing health behavior support. Ann Behav Med 52:446–462
42. Weber GM, Mandl KD, Kohane IS (2014) Finding the missing link for big biomedical data.
JAMA 311:2479–2480
43. Martin Sanchez F, Sanchez FM, Gray K, Bellazzi R, Lopez-Campos G (2014) Exposome
informatics: considerations for the design of future biomedical research information systems.
J Am Med Inform Assoc 21:386–390
44. Alterovitz G, Warner J, Zhang P, Chen Y, Ullman-Cullere M, Kreda D et al (2015) SMART
on FHIR Genomics: facilitating standardized clinico-genomic apps. J Am Med Inform Assoc
22:1173–1178
45. Sáez C, Zurriaga O, Pérez-Panadés J, Melchor I, Robles M, García-Gómez JM (2016)
Applying probabilistic temporal and multisite data quality control methods to a public health
mortality registry in Spain: a systematic approach to quality control of repositories. J Am Med
Inform Assoc 23:1085–1095
46. ITU and WHO launch new initiative to leverage power of Artificial Intelligence for health. In:
International telecommunication union [Internet]. Available https://www.itu.int/en/
mediacentre/Pages/2018-pr18.aspx
47. Fernandez-Luque L, Singh M, Ofli F, Mejova YA, Weber I, Aupetit M et al (2017)
Implementing 360° quantified self for childhood obesity: feasibility study and experiences
from a weight loss camp in Qatar. BMC Med Inform Decis Mak 17:37
48. Kushniruk AW, Triola MM, Borycki EM, Stein B, Kannry JL (2005) Technology induced
error and usability: the relationship between usability problems and prescription errors when
using a handheld application. Int J Med Inform 74:519–526
49. Borycki EM, Kushniruk AW (2008) Where do technology-induced errors come from?
Towards a model for conceptualizing and diagnosing errors caused by technology. In:
Human, social, and organizational aspects of health information systems, pp 148–166
50. Chakraborty S, Tomsett R, Raghavendra R, Harborne D, Alzantot M, Cerutti F, et al (2017)
Interpretability of deep learning models: a survey of results. In: Smart world, ubiquitous
intelligence & computing, advanced & trusted computed, scalable computing & communi-
cations, cloud & big data computing, internet of people and smart city innovation
(SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). https://doi.org/10.1109/uic-atc.
2017.8397411
51. Sly L (2018) US soldiers are revealing sensitive and dangerous information by jogging. In:
The Washington post [Internet]. Available https://www.washingtonpost.com/world/the-us-
military-reviews-its-rules-as-new-details-of-us-soldiers-and-bases-emerge/2018/01/29/
6310d518-050f-11e8-aa61-f3391373867e_story.html?utm_term=.91cdbf6f3e38
52. Froomkin AM, Michael Froomkin A, Kerr IR, Pineau J (2018) When AIs outperform doctors:
the dangers of a tort-induced over-reliance on machine learning and what (not) to do about it.
SSRN Electron J. https://doi.org/10.2139/ssrn.3114347
53. Huckvale K, Prieto JT, Tilney M, Benghozi P-J, Car J (2015) Unaddressed privacy risks in
accredited health and wellness apps: a cross-sectional systematic assessment. BMC Med 13.
https://doi.org/10.1186/s12916-015-0444-y
54. Yapo A, Weiss J (2018) Ethical implications of bias in machine learning. In: Proceedings of
the 51st Hawaii international conference on system sciences. https://doi.org/10.24251/hicss.
2018.668
55. Hajian S, Bonchi F, Castillo C (2016) Algorithmic bias: from discrimination discovery to
fairness-aware data mining. In: Proceedings of the 22nd ACM SIGKDD international
conference on knowledge discovery and data mining—KDD ’16, ACM Press, New York,
USA, pp 2125–2126
56. Wilbanks JT, Topol EJ (2016) Stop the privatization of health data. Nature 535:345–348
57. Norman CD, Skinner HA (2006) eHealth literacy: essential skills for consumer health in a
networked world. J Med Internet Res 8(2):e9
58. Hu X, Hsueh P-YS, Chen C-H, Diaz KM, Parsons FE, Ensari I, Qian M, Cheung Y-KK An
interpretable health behavioral intervention policy for mobile device users. IBM J Res Dev
62(1):4:1–4:6
Big Data Challenges from an Integrative
Exposome/Expotype Perspective
Fernando Martin-Sanchez
1 Introduction
F. Martin-Sanchez
Instituto de Salud Carlos III, Madrid, Spain
e-mail: fmartin@isciii.es
equivalent to what has been done to characterize the human genome (and also the
human phenome) [6, 7]. Defining the concept of expotype, analogous to genotype
and phenotype, could represent an opportunity to make progress in the
characterization of individual human exposome data.
The use of digital health technologies, coupled with advances in the characterization
of individual exposomes and the development of participatory medicine, converges in
projects such as the US Precision Medicine Initiative (PMI), which has high potential to
support truly integrative research approaches (gene, environment, phenotype) [8].
It has been estimated that the attributable risk from the genome for chronic
disease development is only somewhere between 10 and 30%. Even in the area of
rare diseases, it has been estimated that only 80% have a genetic cause; in the rest,
infectious and environmental causes are responsible for their development.
We also know that the environment dominates over host genetics in shaping the human
gut microbiota [9] and that the local environment directly affects disease risk [10].
The necessary connection between genotype and phenotype must be carried out
in any case through the environment, since it modulates different modes of
expression of genetic information, leading to different phenotypic manifestations.
Gene-environment interaction studies are becoming very prominent, but informatics
is only starting to grasp the complexity of big data processing in those truly inte-
grative approaches where genetics, clinical and environmental data need to be
jointly processed for a better understanding of disease mechanisms. For instance,
while genomic data consist of stable linear sequences, the exposome data are
non-linear heterogeneous variables that change in time and space.
The assessment of the exposome can now take advantage of the emergence of
innovative digital technologies—including wearable devices and personal sensors,
mobile apps, global positioning systems, and geographic information systems—
which enable new and more detailed exposure measurement at the individual level.
Research on the exposome is lagging behind research efforts on the
genome and other -omics. One of the reasons for this is the fragmentation of
the landscape of disciplines interested in characterizing the exposome from
different perspectives:
– Environmental health—Exposure, toxicology (Enviroexposome or expososome)
– Health services research (Access to healthcare exposome)
– Urbanism—“Built environment” (Urban exposome)
– Occupational health (Occupational exposome)
– Epidemiology (Public Health Exposome)
– Sociology (Socioexposome)
– Nanomedicine (Nanoexposome)
– Infections (Infectoexposome)
– Medical procedures exposome
– Medications (Drugexposome)
– Psychology (Psychoexposome)
– Digital Technology (Digital component of the exposome)
Big Data Challenges from an Integrative Exposome/Expotype … 129
The exposome concept therefore tries to provide a unified vision for the pro-
cessing of exposure data that are relevant for human health. The following sections
describe eight challenges in terms of processing individual exposome (expotype)
big data and integrating them with genomic and clinical data for biomedical
research and clinical practice. These challenges are summarized in Table 1.
In recent years, important advances have been made in standardizing the repre-
sentation of genotype and phenotype data. For both domains, there already exist
terminologies and controlled vocabularies, ontologies and classification systems
that allow the exchange and integration of data for further analysis. However, in the
case of environmental factors and exposures that affect human health, we still
face a very fragmented field in which different scientific disciplines hold different
views of the exposome (toxicology, environmental science, public health,
health services research, urbanism). They use different taxonomies to catalog
environmental factors, and these taxonomies are not interconnected [17].
Although the elaboration of an individual's complete exposome is still beyond
the reach of research laboratories, because of its enormous complexity [18]
and relatively recent definition, it is now possible to carry out studies of partial
exposomes, as summarized in Fig. 1, focused, for example, on a disease [19], health
condition [20], organ [21], geographical location [22] or employment status [23].
Several efforts are in place to reconcile the different views of the exposome into
a single ontology: Exposure Ontology (ExO; https://www.ebi.ac.uk/ols/ontologies/
exo), Children’s Health Exposure Analysis Resource (CHEAR; http://purl.
bioontology.org/ontology/CHEAR). There also exist tools such as PhenX (https://
www.phenxtoolkit.org/) that can enable better data exchange and integration with
other sources of data (genomic, phenomic).
Several years ago, the author of this chapter, along with Dr. Guillermo Lopez
Campos, now at Queen's University Belfast, developed the new concept of
expotype, which has been presented at various scientific events (e.g. a keynote at
the MIE 2015 conference in Madrid). The concept of expotype/expotyping was also
explained in our article [24] published in 2016. Expotype was our suggested word
for partial views of an individual exposome. It can be defined as "a specific
set of exposome elements of an individual accumulated during a certain time/
space"; for instance, the number of steps walked by an individual during a specific
time/space window (as illustrated in Table 2). A mixture of expotypes, in combination
with an individual genotype, is responsible for a mixture of phenotypes
over time.
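This definition can be sketched as a small data structure together with a hypothetical extractor that selects the exposome elements of an individual falling within a given time/space window. All names, fields and values below are illustrative simplifications, not a published schema:

```python
from dataclasses import dataclass

@dataclass
class ExposureRecord:
    """One exposome element: what was measured, when, where, and its value."""
    element: str      # e.g. "steps", "pm2_5"
    day: int          # simplified time stamp (study day)
    location: str     # simplified space label
    value: float

def expotype(records, element, day_range, location):
    """Extract an expotype: the set of exposome elements of an individual
    accumulated during a given time/space window."""
    lo, hi = day_range
    return [r for r in records
            if r.element == element and lo <= r.day <= hi
            and r.location == location]

data = [ExposureRecord("steps", 1, "Doha", 8000.0),
        ExposureRecord("steps", 2, "Doha", 6500.0),
        ExposureRecord("steps", 9, "Madrid", 4000.0)]
window = expotype(data, "steps", (1, 7), "Doha")
assert sum(r.value for r in window) == 14500.0  # steps accumulated in the window
```

In this toy form, the step-count example from Table 2 is simply the expotype for the element "steps" over one time/space window.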
Dr. Sarigiannis [25] mentioned the term expotype in the abstract of an article
published in 2017, defining it as "the vector of exposures an individual is exposed
over time". Although this appears to be the first time the term was used in an
abstract, and therefore appears in PubMed searches, the author did not develop
the concept further.
In their article [26], Fan and collaborators mention our 2016 article. Based on our
proposed concept of expotype, they concurred that it is important to extract all the
individual exposome information available in electronic health records (a process
we christened expotyping) [24], and developed a template-driven approach to
identifying exposome concepts from the Unified Medical Language System
(UMLS). They used selected ontological relations, and the derived concepts
were evaluated in terms of literature coverage and their ability to assist in annotating
clinical text.
Finally, the paper by Rattray et al. [27] introduces the concept of “Exposotype”
with a more restricted meaning (“the metabolomic profile of an individual that
reflects an event of exposure”). From our perspective, an exposotype would be a
particular case of expotype.
Several types of data from electronic health records can be used to generate
expotypes, such as demographic data (e.g. residence, education level), health
behaviors (e.g. tobacco, alcohol, and injection drug use), medication history (type,
dose, frequency, duration), infection history (agent, duration), or medical
procedures and imaging (e.g. magnetic resonance imaging, CT scan, X-ray, …).
In November 2013, the Institute of Medicine (IOM) released the report
“Capturing Social and Behavioral Domains and Measures in Electronic Health
Records: Phase 2” [28], which recommends a “concrete approach to including
social and behavioral determinants in the clinical context to increase clinical
awareness of the patient’s state, broadly considered, and to connect clinical, public
health, and community resources for work in concert”.
Until socioeconomic information and other individual exposure factors
(expotypes) are stored properly and regularly in electronic health records,
efforts will have to be made to extract these data from current EHRs, from both
structured and unstructured (text) fields. The following articles illustrate
various approaches and perspectives already pursued in this field.
Casey et al. [29] reviewed how EHR studies have been used to evaluate
exposures to risks and resources in the physical environment (e.g. air pollution,
green space) and health outcomes (e.g. hypertension, diabetes, migraines). EHR
data sets have allowed environmental and social epidemiologists to leverage data on
patients distributed across a wide range of physical, built, and social environments.
By linking geocoded addresses to location-specific data and using geographic
information systems (GIS), it is possible to study an individual's proximity to hazards
(e.g. air pollution) related to disease.
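A minimal sketch of this GIS-style proximity analysis, assuming geocoded patient addresses and hazard sites are available as latitude/longitude pairs. The haversine formula gives the great-circle distance; the coordinates and the 2 km radius below are purely illustrative:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

def near_hazard(patient_latlon, hazard_latlon, radius_km=2.0):
    """Flag a geocoded patient address within `radius_km` of a hazard site."""
    return haversine_km(*patient_latlon, *hazard_latlon) <= radius_km

# Hypothetical coordinates: an address about 1 km from a monitored source
# is flagged; an address in another city is not.
assert near_hazard((40.4168, -3.7038), (40.4258, -3.7038))
assert not near_hazard((40.4168, -3.7038), (41.3874, 2.1686))
```

In practice such a check would run over geocoded EHR addresses joined against an environmental monitoring dataset, which is the linkage the review describes.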
Biro et al. [30] showed the utility of linking primary care electronic medical
records with census data to study the determinants of chronic disease. They used
postal codes to link patient data from EMRs with additional information on
environmental determinants of health, demonstrating an association between obesity
and area-level deprivation.
Wang et al. [31] investigated tobacco use data from structured (social history)
and unstructured (clinical notes) sources in the EHR. They implemented a natural
language processing pipeline and showed that structured fields alone may not
provide a complete view of tobacco use information.
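In the spirit of such pipelines, a toy rule-based extractor might classify smoking status from note text with a few regular expressions. This is a hypothetical sketch, not Wang et al.'s pipeline; production systems handle negation, section context and note templates far more robustly:

```python
import re

# Minimal, illustrative rule set: real pipelines are far more extensive.
NEVER = re.compile(r"\b(never smok|non-?smoker|denies (tobacco|smoking))", re.I)
FORMER = re.compile(r"\b(former smoker|quit smoking|ex-?smoker)", re.I)
CURRENT = re.compile(r"\b(current(ly)? smok|smokes|\d+ pack[- ]?years?)", re.I)

def smoking_status(note):
    """Classify a clinical note snippet as never/former/current/unknown."""
    if NEVER.search(note):
        return "never"
    if FORMER.search(note):
        return "former"
    if CURRENT.search(note):
        return "current"
    return "unknown"

assert smoking_status("Patient denies tobacco use.") == "never"
assert smoking_status("Former smoker, quit smoking in 2005.") == "former"
assert smoking_status("Currently smokes 1 ppd, 20 pack-years.") == "current"
assert smoking_status("No social history recorded.") == "unknown"
```

Even this toy version shows why free text matters: none of these statuses would be recoverable from a structured field left blank.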
In 2016, Gottlieb et al. published an article in the journal Health Affairs [15]
describing current opportunities and barriers to integrating social and clinical data.
They discussed the process of extracting data about social determinants of health
from EHRs and noted that ICD-10 provides an expanded set of codes
reflecting patient social characteristics in the form of Z-codes (e.g. Z56: Problems
related to employment and unemployment, such as Z56.0).
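As an illustration, social determinant Z-codes fall in the ICD-10 block Z55–Z65 ("Persons with potential health hazards related to socioeconomic and psychosocial circumstances"), so a simple filter can pull them out of a patient's coded entries. This is a hypothetical sketch that ignores code-system validation:

```python
def social_determinant_codes(icd10_codes):
    """Select ICD-10 codes in the Z55-Z65 block, which covers potential
    health hazards related to socioeconomic and psychosocial circumstances."""
    selected = []
    for code in icd10_codes:
        if code.upper().startswith("Z"):
            try:
                category = int(code[1:3])  # numeric category, e.g. "56" in Z56.0
            except ValueError:
                continue
            if 55 <= category <= 65:
                selected.append(code)
    return selected

# E11.9 (type 2 diabetes) is clinical; Z56.0 (unemployment) and Z59.0
# (homelessness) describe the patient's social circumstances.
codes = ["E11.9", "Z56.0", "Z59.0", "Z99.2"]
assert social_determinant_codes(codes) == ["Z56.0", "Z59.0"]
```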
Maranhao et al. [32] worked with nutrigenomic (personalized nutrition)
information in the openEHR data set. In a bibliographic review (26 articles), they
identified 117 clinical statements as well as 27 archetype-friendly concepts. The group
also modeled four new archetypes (waist-to-height ratio, genetic test results, genetic
summary, and diet plan) and created a specific nutrigenomic template for nutrition
care. The archetypes and the specific openEHR template developed in this study
give dieticians and other health professionals an important tool for their
nutrigenomic clinical practice, as well as a set of nutrigenomic data for clinical research.
Lastly, Boland et al. recently published the study "Uncovering exposures
responsible for birth season—disease effects: a global study" [33], in which the team
demonstrated, using EHR data from more than 6 clinical sites, 10 million patients, 3
countries, 2 continents, and 5 climates, that seasonality and climate play an
important role in human health and disease. Geography and climate modulate
disease risk and/or severity while also altering our exposure to diverse environmental
factors. Building on the previously published SeaWAS (Season-Wide
Association Study) method, they examined correlations between each of 12 exposures
and 133 diseases during 5 different developmental stages (i.e. 3 trimesters,
pregnancy-wide, and perinatal). For their work with EHR data they used the OHDSI
CDM at three sites and a mapping of ICD-9 to SNOMED at the other three.
factors and phenotypes, such as correlation globes [39]. INDIV 3-D is a theoretical
model that could serve to represent this complex set of multi-level health data as
well [40].
Information about individuals' exposomes represents a key aspect of future
biomedical research projects. Individuals generate data in their contacts with health
systems, which are normally stored in their electronic health records. They also
generate data themselves using new technologies and digital health services. When
they participate in authorized research projects, their biological samples are stored
in biobanks and then processed in laboratories to obtain their molecular data
(genome, proteome, …). All this information must be processed to feed the data
needed in biomedical research [5, 42]. The adequate extraction of the data
provided by participants, clinical systems and laboratory systems should lead to the
generation of genotypes, expotypes and phenotypes annotated with standards that
allow their integration and joint analysis, as described in Fig. 2.
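The integration step described above can be sketched as a join on participant ID across the three annotated data types. This is a simplified illustration with made-up identifiers and values; real pipelines rely on standardized formats and handle missing data explicitly:

```python
def integrate(genotypes, expotypes, phenotypes):
    """Join genotype, expotype and phenotype records on participant ID into
    one annotated record per participant, keeping only complete cases."""
    merged = {}
    for pid in genotypes.keys() & expotypes.keys() & phenotypes.keys():
        merged[pid] = {"genotype": genotypes[pid],
                       "expotype": expotypes[pid],
                       "phenotype": phenotypes[pid]}
    return merged

g = {"p1": {"APOE": "e3/e4"}, "p2": {"APOE": "e3/e3"}}
e = {"p1": {"pm2_5_annual": 14.2}}          # p2 lacks exposure data
p = {"p1": {"asthma": True}, "p2": {"asthma": False}}
joint = integrate(g, e, p)
assert list(joint) == ["p1"]                 # only p1 has all three data types
assert joint["p1"]["expotype"]["pm2_5_annual"] == 14.2
```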
Fig. 3 New sources of individual exposome data complementing existing phenome and genome
data
Some of the most important health institutions, including the US NIH (through its
NIEHS and NIOSH institutes), the CDC, and the US EPA, already have programs in
place around the exposome.
The main research funding agencies at the international level have supported the
creation of consortia and networks in this space. For example, the European
Commission financed the HELIX, EXPOSOMICS and HEALS projects in the
previous R&D Framework Program, which dealt with specific aspects of the exposome.
The NIH has funded research centers such as Hercules (https://emoryhercules.com/) and
the Children's Health Exposure Analysis Resource (CHEAR;
https://www.niehs.nih.gov/research/supported/exposure/chear/).
Japan has supported the JECS program (www.env.go.jp/chemi/ceh/en/) to study the
effects of the environment on children. We are also witnessing the creation of
monographic research centers on the exposome, such as the TNO—Utrecht
Exposome Hub (https://www.uu.nl/en/research/life-sciences/research/hubs/utrecht-exposome-hub),
the Institute for Exposomic Research (http://icahn.mssm.edu/research/exposomic)
at the Icahn School of Medicine in New York, and the I3CARE
International Exposome Center (http://exposome.iras.uu.nl/), a global collaboration
between the University of Utrecht, the University of Toronto, and the Chinese
University of Hong Kong.
The International Medical Informatics Association (IMIA) has recently created a
working group on informatics aspects of the exposome to help
researchers, clinicians and consumers navigate the entire "data to
knowledge" life cycle: data collection, knowledge representation, annotation,
integration with genomic and phenomic data, analytics, and visualization
(https://exposomeinformatics.wordpress.com).
138 F. Martin-Sanchez
10 Conclusion
The objective of this chapter is to raise awareness among readers of the
importance of advancing those aspects related to the processing of exposome big
data. Although this is a relatively recent and rapidly progressing area, it is beyond the
scope of this contribution to offer an exhaustive catalog of all the resources,
methods and experiences that have already been reported in the literature. Instead,
based on our own experience and a literature review, we have chosen to identify
eight challenges that can introduce the reader to this field and motivate them to search
for more information. It is our desire that the biomedical informatics and data
science community recognize exposome informatics as a new area of activity, key
to precision medicine and biomedical research, with clear potential to be
useful in clinical practice in the coming years.
References
1. Martin-Sanchez F, Verspoor K (2014) Big data in medicine is driving big changes. Yearb
Med Inform 15(9):14–20
2. Wild CP (2005) Complementing the genome with an “exposome”: the outstanding challenge
of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol
Biomarkers 14(8):1847–1850
3. Patel CJ, Ioannidis JP (2014) Studying the elusive environment in large scale. JAMA 311(21):
2173–2174
4. Wild CP (2012) The exposome: from concept to utility. Int J Epidemiol 41(1):24–32
5. Martin Sanchez F, Gray K, Bellazzi R, Lopez-Campos G (2014) Exposome informatics:
considerations for the design of future biomedical research information systems. J Am Med
Inform Assoc 21(3):386–390
6. Thomas DC, Lewinger JP, Murcray CE, et al (2012) Invited commentary: GE-Whiz!
Ratcheting gene-environment studies up to the whole genome and the whole exposome. Am J
Epidemiol 175:203–207; discussion 208–209
7. Manrai AK, Cui Y, Bushel PR, Hall M, Karakitsios S, Mattingly CJ, Ritchie M, Schmitt C,
Sarigiannis DA, Thomas DC, Wishart D, Balshaw DM, Patel CJ (2017) Informatics and data
analytics to support exposome-based discovery for public health. Annu Rev Public Health
38:279–294. https://doi.org/10.1146/annurev-publhealth-082516-012737
8. Collins FS, Varmus H (2015) A new initiative on precision medicine. NEJM 372(9):793–795
9. Rothschild D, Weissbrod O, Barkan E, Kurilshikov A, Korem T, Zeevi D, Costea PI,
Godneva A, Kalka IN, Bar N, Shilo S, Lador D, Vila AV, Zmora N, Pevsner-Fischer M,
Israeli D, Kosower N, Malka G, Wolf BC, Avnit-Sagi T, Lotan-Pompan M, Weinberger A,
Halpern Z, Carmi S, Fu J, Wijmenga C, Zhernakova A, Elinav E, Segal E (2018)
Environment dominates over host genetics in shaping human gut microbiota. Nature
555(7695):210–215
10. Favé MJ, Lamaze FC, Soave D, Hodgkinson A, Gauvin H, Bruat V, Grenier JC, Gbeha E,
Skead K, Smargiassi A, Johnson M, Idaghdour Y, Awadalla P (2018) Gene-by-environment
interactions in urban populations modulate risk phenotypes. Nat Commun 9(1):827
11. Dennis KK, Marder E, Balshaw DM, Cui Y, Lynes MA, Patti GJ, Rappaport SM,
Shaughnessy DT, Vrijheid M, Barr DB (2017) Biomonitoring in the era of the exposome.
Environ Health Perspect 125(4):502–510
12. Ding YP, Ladeiro Y, Morilla I, Bouhnik Y, Marah A, Zaag H, Cazals-Hatem D, Seksik P,
Daniel F, Hugot JP, Wainrib G, Tréton X, Ogier-Denis E (2017) Integrative network-based
analysis of colonic detoxification gene expression in ulcerative colitis according to smoking
status. J Crohns Colitis 11(4):474–484
13. Jacquez GM, Sabel CE, Shi C (2015) Genetic GIScience: toward a place-based synthesis of
the genome, exposome, and behavome. Ann Assoc Am Geogr 105(3):454–472
14. Centers for Disease Control and Prevention (CDC), National Center for Health Statistics
(NCHS) (2016) National health and nutrition examination survey data. U.S. Department of
Health and Human Services, Centers for Disease Control and Prevention, Hyattsville, MD
[last visited 2017-04-03]. Available from https://www.cdc.gov/nchs/nhanes/
15. Gottlieb L, Tobey R, Cantor J, Hessler D, Adler NE (2016) Integrating social and medical
data to improve population health: opportunities and barriers. Health Aff (Millwood) 35(11):
2116–2123
16. Swan M (2012) Health 2050: the realization of personalized medicine through crowdsourcing,
the quantified self, and the participatory biocitizen. J Pers Med 2(3):93–118
17. Kiossoglou P, Borda A, Gray K, Martin-Sanchez F, Verspoor K, Lopez-Campos G (2017)
Characterising the scope of exposome research: a generalisable approach. Stud Health
Technol Inform 245:457–461
18. Cui Y, Balshaw DM, Kwok RK, Thompson CL, Collman GW, Birnbaum LS (2016) The
exposome: embracing the complexity for discovery in environmental health. Environ Health
Perspect 124(8):A137–A140
19. Smith MT, Zhang L, McHale CM, Skibola CF, Rappaport SM (2011) Benzene: the exposome
and future investigations of leukemia etiology. Chem Biol Interact 192(1–2):155–159
20. Goldfarb DS (2016) The exposome for kidney stones. Urolithiasis 44(1):3–7
21. Rappaport SM, Barupal DK, Wishart D, Vineis P, Scalbert A (2014) The blood exposome and
its role in discovering causes of disease. Environ Health Perspect 122(8):769–774
22. Donald CE, Scott RP, Blaustein KL, Halbleib ML, Sarr M, Jepson PC et al (2016) Silicone
wristbands detect individuals’ pesticide exposures in West Africa. R Soc Open Sci 3(8):
160433
23. Faisandier L, Bonneterre V, De Gaudemaris R, Bicout DJ (2011) Occupational exposome: a
network-based approach for characterizing occupational health problems. J Biomed Inform
44(4):545–552
24. Martin-Sanchez FJ, Lopez-Campos GH (2016) The new role of biomedical informatics in the
age of digital medicine. Methods Inf Med 55(5):392–402
25. Sarigiannis DA (2017) Assessing the impact of hazardous waste on children’s health: the
exposome paradigm. Environ Res 158:531–541
26. Fan JW, Li J, Lussier YA (2017) Semantic modeling for exposomics with exploratory
evaluation in clinical context. J Healthc Eng 2017:3818302
27. Rattray NJW, Deziel NC, Wallach JD, Khan SA, Vasiliou V, Ioannidis JPA, Johnson CH
(2018) Beyond genomics: understanding exposotypes through metabolomics. Hum Genomics
12(1):4
28. Institute of Medicine (2014) Capturing social and behavioral domains and measures in
electronic health records: phase 2. The National Academies Press, Washington, DC. https://
doi.org/10.17226/18951
29. Casey JA, Schwartz BS, Stewart WF, Adler NE (2016) Using electronic health records for
population health research: a review of methods and applications. Annu Rev Public Health
37:61–81
30. Biro S, Williamson T, Leggett JA, Barber D, Morkem R, Moore K, Belanger P, Mosley B,
Janssen I (2016) Utility of linking primary care electronic medical records with Canadian
census data to study the determinants of chronic disease: an example based on socioeconomic
status and obesity. BMC Med Inform Decis Mak 16:32
31. Wang Y, Chen ES, Pakhomov S, Lindemann E, Melton GB (2016) Investigating longitudinal
tobacco use information from social history and clinical notes in the electronic health record.
In: AMIA annual symposium proceedings, pp 1209–1218
32. Maranhão PA, Bacelar-Silva GM, Ferreira DNG, Calhau C, Vieira-Marques P, Cruz-Correia
RJ (2018) Nutrigenomic information in the openEHR data set. Appl Clin Inform 9(1):
221–231
33. Boland MR, Parhi P, Li L, Miotto R, Carroll R, Iqbal U, Nguyen PA, Schuemie M, You SC,
Smith D, Mooney S, Ryan P, Li YJ, Park RW, Denny J, Dudley JT, Hripcsak G, Gentine P,
Tatonetti NP (2017) Uncovering exposures responsible for birth season—disease effects: a
global study. J Am Med Inform Assoc. https://doi.org/10.1093/jamia/ocx105. [Epub ahead of
print]
34. Agier L, Portengen L, Chadeau-Hyam M, Basagaña X, Giorgis-Allemand L, Siroux V,
Robinson O, Vlaanderen J, González JR, Nieuwenhuijsen MJ, Vineis P, Vrijheid M, Slama R,
Vermeulen R (2016) A systematic comparison of linear regression-based statistical methods
to assess exposome-health associations. Environ Health Perspect 124(12):1848–1856
35. Barrera-Gómez J, Agier L, Portengen L, Chadeau-Hyam M, Giorgis-Allemand L, Siroux V,
Robinson O, Vlaanderen J, González JR, Nieuwenhuijsen M, Vineis P, Vrijheid M,
Vermeulen R, Slama R, Basagaña X (2017) A systematic comparison of statistical methods to
detect interactions in exposome-health associations. Environ Health 16(1):74
36. Patel CJ, Chen R, Kodama K et al (2013) Systematic identification of interaction effects
between genome- and environment-wide associations in type 2 diabetes mellitus. Hum Genet
132:495–508
37. McGinnis DP, Brownstein JS, Patel CJ (2016) Environment-wide association study of blood
pressure in the national health and nutrition examination survey (1999–2012). Sci Rep
6:30373
38. Patel CJ (2017) Analytic complexity and challenges in identifying mixtures of exposures
associated with phenotypes in the exposome era. Curr Epidemiol Rep 4(1):22–30
39. Patel CJ, Manrai AK (2015) Development of exposome correlation globes to map out
environment-wide associations. Pac Symp Biocomput 231–242
40. Lopez-Campos G, Bellazzi R, Martin-Sanchez F (2013) INDIV-3D. A new model for
individual data integration and visualisation using spatial coordinates. Stud Health Technol
Inform 190:172–174
41. National Academies of Sciences, Engineering, and Medicine (2017) Measuring personal
environmental exposures. In: Proceedings of a workshop—in brief. The National Academies
Press, Washington, DC. https://doi.org/10.17226/24711
42. Dagliati A, Marinoni A, Cerra C, Decata P, Chiovato L, Gamba P, Bellazzi R (2015)
Integration of administrative, clinical, and environmental data to support the management of
type 2 diabetes mellitus: from satellites to clinical care. J Diabetes Sci Technol 10(1):19–26
43. Antman EM, Loscalzo J (2016) Precision medicine in cardiology. Nat Rev Cardiol 13(10):
591–602
44. Rappaport SM (2016) Genetic factors are not the major causes of chronic diseases. PLoS One
11(4):e0154387
45. Galli SJ (2016) Toward precision medicine and health: opportunities and challenges in
allergic diseases. J Allergy Clin Immunol 137(5):1289–1300
46. Agustí A, Bafadhel M, Beasley R, Bel EH, Faner R, Gibson PG, Louis R, McDonald VM,
Sterk PJ, Thomas M, Vogelmeier C, Pavord ID (2017) On behalf of all participants in the
seminar. Precision medicine in airway diseases: moving to clinical practice. Eur Respir J 50(4)
47. Lopez-Campos G, Merolli M, Martin-Sanchez F (2017) Biomedical informatics and the
digital component of the exposome. Stud Health Technol Inform 245:496–500
48. Office for National Statistics (2015) Measuring national well-being: insights into children’s
mental health and well-being. Accessed 23 Mar 2018. https://www.ons.gov.uk/peoplepopulationandcommunity/
wellbeing/articles/measuringnationalwellbeing/2015-10-20
49. Cantor MN, Thorpe L (2018) Integrating data on social determinants of health into electronic
health records. Health Aff (Millwood) 37(4):585–590
50. Dennis KK, Jones DP (2016) The exposome: a new frontier for education. Am Biol Teach
78(7):542–548
51. Niedzwiecki MM, Miller GW (2017) The exposome paradigm in human health: lessons from
the Emory Exposome Summer Course. Environ Health Perspect 125(6):064502
52. Johnson CH, Athersuch TJ, Collman GW, Dhungana S, Grant DF, Jones DP, Patel CJ,
Vasiliou V (2017) Yale school of public health symposium on lifetime exposures and human
health: the exposome; summary and future reflections. Hum Genomics 11(1):32
Glossary