Big Data, Big Challenges: A Healthcare Perspective - Background, Issues, Solutions and Research Directions
Mowafa Househ
Andre W. Kushniruk
Elizabeth M. Borycki
Editors

Editors

Mowafa Househ
Division of Information and Computing Technology
College of Science and Engineering
Hamad Bin Khalifa University, Qatar Foundation
Doha, Qatar

Andre W. Kushniruk
School of Health Information Sciences
University of Victoria
Victoria, BC, Canada

Elizabeth M. Borycki
School of Health Information Sciences
University of Victoria
Victoria, BC, Canada
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Much has been written about the utilization of big data analytics methods, tools and
technologies to collect, process, visualize and make use of high volume structured
and unstructured data in a number of fields such as finance, insurance, sports,
agriculture, and health. With the fast and ever-increasing growth of user-generated
data from the Internet, such as social media content, data from wireless medical
devices and mobile apps, big data analytical methods, tools and technologies have
become recognized as the “go-to” solutions able to make
sense of such voluminous, disorganized, fluid and free-flowing data. Within health
care, there is a growing knowledge base of big data related studies and implementations in public health, clinical decision making, disease prevention, and
healthcare cost reduction. As with any new field, much of the research and discussion centers upon the added value and opportunities that new technologies, such
as big data analytics methods, tools and technologies can provide. However, as the
domain area begins to mature through increased implementation, evaluation studies,
and user experiences, the problems, and challenges relating to the methods, tools,
and technologies used for big data analytics begin to emerge. For the past five
years, much of the literature on big data analytics has focused on the benefits of big
data in improving all areas of health care. A new wave of research is beginning to emerge that challenges some of the assumptions behind the positive assertions about big data analytics in health care. That is the motivation behind this book, which is not only about sharing success stories and opportunities for big data in health care, but also about addressing the emerging challenges that many researchers have overlooked.
What makes this book unique is that it examines both the opportunities and, to a greater extent, the challenges of applying big data analytics methods, tools and technologies within health care from a number of perspectives. The book is divided
into three parts and eleven chapters. The first part of the book examines the
healthcare professional perspective on the challenges and opportunities of big data analytics from nursing, medical, public health, and health administrator perspectives. Most of the chapters are included in the first part of the book. The second
part of the book focuses on human factors and ethical challenges and opportunities
related to big data analytics in health care. There are three chapters in part two
of the book that address topics related to patient safety, user-centered design, and
ethical issues. Part three of the book includes two chapters that examine the
technical challenges in the utilization of big data analytics in health care. The first
chapter examines the challenges and opportunities of big data analytics from a data
scientist’s perspective. The second chapter examines the integrative exposome/expotype perspective related to big data analytics in health care.
The book provides health data scientists, healthcare professionals, and healthcare managers and policymakers with the first comprehensive insight into the challenges
and opportunities of big data analytics in health care. The book will question some of the preconceptions that students and professionals of big data analytics in health care currently hold, and challenge them to devise new solutions and ideas in response to the challenges raised within the book.
1 Introduction
S. Bakken (corresponding author)
School of Nursing, Department of Biomedical Informatics, and Data Science Institute,
Columbia University, 630 W. 168th Street, New York, NY 10032, USA
e-mail: sbh22@cumc.columbia.edu
T. A. Koleck
School of Nursing, Columbia University, New York, NY, USA
In characterizing the data used according to the criteria for big data [6, 7], all studies
met the criterion of volume, most met the criterion of variety, and a minority met
the criterion of velocity. Veracity and value were not explicitly analyzed. Electronic
health records (EHRs) were the primary data source for 14 studies although several
studies integrated EHR data with other data sources. The study purposes were
categorized as knowledge discovery, prediction, and evaluation. Since the time of
this review, additional nursing studies have been conducted that reflect data sources
beyond EHRs and structured data sources including omics [8], social media [9], and
sensors [10]. Moreover, health policy considerations for data science have been
delineated from a nursing science perspective [11].
The purpose of this chapter is to summarize the benefits and key challenges
related to big data streams and data science from the perspective of nursing. The
benefits and challenges are considered from the perspective of data governance as
well as data science infrastructure and pipeline and illustrated through six case
examples. In addition, two cross-cutting issues (ethical conduct of research and data
science competencies) are addressed.
A number of authors have published data science pipelines. From the perspective of nursing, however, data science starts with a question and, because the data are often protected health information under the U.S. Health Insurance Portability and Accountability Act (HIPAA), requires careful consideration of data governance (Fig. 1). In addition, the infrastructure required
for data science is often significantly different from the data management and
analytic pipelines typically available to nurse scientists and clinicians due to the
volume of data and the processing power needed to ingest, wrangle
(i.e., pre-process using semi-automated tools), compute and analyze, model and
validate, and interpret (visualize and report) the data. Moreover, data science requires platforms beyond SAS, STATA, and R, such as Apache Hadoop MapReduce, Apache Mahout (machine learning algorithms), Spark’s Machine Learning Library (MLlib), and RHadoop, to support reduction and analysis of multi-dimensional data through methods such as K-means clustering, random forest classifiers, neural network backpropagation, support vector machines, and Gaussian discriminant analysis. A data science infrastructure must also support visualization of the
data for analysis, interpretation, and reporting through general tools such as Tableau
and tools for special purposes (e.g., Sentiment Viz for visualization of Tweet
contents, ORA for visualization of network structures).
Table 1 displays a summary of challenges related to aspects of data governance,
data science infrastructure, and data science pipeline in a set of case examples that
are described in more detail in the following section.
Big Data Challenges from a Nursing Perspective 5
Fig. 1 Data governance, data science infrastructure, and data science pipeline. Adapted from
Tesla Institute [12]
3 Case Examples
Three case examples from the authors’ experience, focused on knowledge discovery from electronic health records (EHRs), omics, and social media and reflecting multiple challenges, are described first. This is followed by briefer descriptions of three more case examples from the literature, each highlighting a specific challenge.
3.2 Omics
Social media are an important data stream for capturing perceptions as well as
behaviors in the daily lives of participants [9]. In addition to content mining, data
science methods support the analysis of network structures which are important to
address questions regarding social support and other types of relatedness. Yoon and
colleagues mined Twitter to gain an understanding of the caregiving experience of
Latinos caring for a person living with dementia [16]. Although very limited in
character length, Tweets have associated metadata which results in more than 20
data elements per Tweet including explicit and extractable characteristics of the user
and the Tweet [39]. Through the methods of topic modeling, sentiment analysis,
and network analysis (macro, meso, micro), they found that (a) frequently occurring
dementia topics were related to mental health and caregiving, (b) the sentiments
expressed in the Tweets were more negative than positive, and (c) network patterns
demonstrated a lack of social connectedness [15, 16]. In terms of challenges, data governance was not an issue because a sample of Tweets is publicly available on a daily basis and research use is supported by the Twitter terms of agreement.
However, there were key challenges across analyses related to data science
infrastructure and pipeline. Regarding infrastructure, the institution lacked graphical user interfaces to its existing high-performance computing resources, and its policies limited data storage. For pipeline, a key challenge to extraction was cost. Twitter
charges for extraction of retrospective datasets and the federal grant supporting the
research did not have sufficient budget for this purpose. To address these issues,
relevant Tweets were downloaded on a daily basis, pre-processed, and then combined to form the analytic Tweet corpus. A second challenge related to extraction
was defining the lexicon for the extraction to capture the Tweets of populations of
interest. This requires application of a set of cultural analytic techniques that begins
with a corpus of text that is labeled (e.g., song lyrics by a Black lyricist, a Latino
poem) and results in an algorithm suitable for text retrieval for that population. Such
techniques were applied to create a Latino Tweet corpus. In addition, a variety of
existing tools were combined to create a pipeline: extraction/ingestion (NodeXL, NCapture), wrangling (Notepad++, Tableau), structural analyses including visualization (ORA, Pajek), and content analysis including visualization (Weka, Sentiment Viz).
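As a deliberately simplified illustration of the sentiment-analysis step in such a pipeline, a lexicon-based scorer can be sketched as follows. The lexicon and sample tweets are invented for illustration; the study itself used richer tools such as Weka and Sentiment Viz:

```python
# Toy lexicon-based sentiment scoring of tweet text. Real studies use
# validated lexicons and trained models; this lexicon and corpus are
# invented examples of the general technique.
import re

POSITIVE = {"love", "hope", "support", "grateful"}
NEGATIVE = {"exhausted", "alone", "struggling", "sad"}

def sentiment(tweet):
    """Return (positive_hits, negative_hits, label) for one tweet."""
    words = re.findall(r"[a-z']+", tweet.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    label = "positive" if pos > neg else "negative" if neg > pos else "neutral"
    return pos, neg, label

corpus = [
    "Caring for mom with dementia, feeling exhausted and alone",
    "So grateful for the support group, it gives me hope",
    "Another long night of caregiving",
]
labels = [sentiment(t)[2] for t in corpus]
print(labels)  # ['negative', 'positive', 'neutral']
```

Even this caricature shows why lexicon definition was a key challenge in the case example: the quality of the scoring depends entirely on how well the word lists reflect the language of the population of interest.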
Pruinelli et al. [17] used EHR data to examine the effect of delay within the 3-h Surviving Sepsis Campaign guideline on patients with severe sepsis and septic shock.
Applying sequential propensity score matching, they found that the statistically
significant time in minutes after which a delay increased the risk of death was:
lactate—20 min, blood culture—50 min, crystalloids—100 min, and antibiotic
therapy—125 min. They identified one challenge related to data wrangling.
Typically, crystalloid volume is documented in unstructured nursing flowsheets.
Consequently, actual volume cannot be precisely determined from orders alone. To
address this issue, the authors suggested the need to standardize flowsheet data. In
another report, some of the authors described the creation and validation of
flowsheet information models for five nursing-sensitive quality indicators, five
physiological systems, and expanded vital signs and anthropometric measures [40].
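Pruinelli et al.'s sequential propensity score matching is considerably more involved, but its core step, pairing each treated record with the closest-scoring unused control within a caliper, can be sketched as follows. The record IDs and scores below are invented; in practice the scores come from a model fitted on confounders (e.g., logistic regression):

```python
# Greedy 1:1 nearest-neighbor matching on propensity scores with a caliper.
# IDs and scores are invented placeholders for illustration only.

def match(treated, controls, caliper=0.1):
    """Pair each treated (id, score) with the closest unused control."""
    pairs = []
    available = dict(controls)                       # control id -> score
    for t_id, t_score in sorted(treated, key=lambda t: t[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]                      # match without replacement
    return pairs

treated = [("T1", 0.62), ("T2", 0.35), ("T3", 0.90)]
controls = [("C1", 0.60), ("C2", 0.33), ("C3", 0.50), ("C4", 0.88)]
print(match(treated, controls))  # [('T2', 'C2'), ('T1', 'C1'), ('T3', 'C4')]
```

Matching like this only works if the underlying variables, such as crystalloid volume in the case above, are reliably captured, which is exactly the data-wrangling challenge the authors identified.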
Dashboards are increasingly being integrated into clinical practice and used by
executives and managers for overviews of their organizations or units in terms of
processes as well as cost and quality indicators. There is currently less direct use of
dashboards by clinicians at the point of care to inform their decision making for
individual or groups of patients. A systematic review on the use of clinical dashboards revealed a positive impact of clinical dashboards on care processes and outcomes in some contexts [43]. However, the authors noted that it is unclear what dashboard characteristics are associated with improved outcomes and how dashboards are integrated into care and decision making. To address the first knowledge
gap, Dowding and colleagues assessed the relationship between home care nurses’
numeracy and literacy and their comprehension of visual display information in a
dashboard project focused on providing feedback on quality metrics to home care
nurses at the point of care for patients with congestive heart failure [19]. Home care
nurses (n = 196) best understood information displayed as bar graphs (88%), followed by tables (81%), line graphs (77%), and spider graphs (41%). Twenty-five percent of the nurses had low numeracy and/or low graph literacy. Those with low numeracy and graph literacy had poorer comprehension across formats (63% and 65%, respectively). Such findings suggest that the data science competencies of clinicians
related to interpretation of visual displays must be considered along with
methodological and infrastructure aspects for optimal use of dashboards to inform
patient care decision making.
4 Cross-Cutting Issues
Ethical conduct of research and data science competencies are two major
cross-cutting issues for data science from the nursing perspective.
The historic Belmont Report articulated three principles for ethical conduct of
research that must be considered for use of big data streams and data science
methods: respect for persons (i.e., autonomy), beneficence, and justice [44].
Respect for persons includes two separate moral requirements: acknowledgment of
autonomy and protection of those with diminished autonomy. Informed consent is
the primary mechanism for protection of autonomy. Some big data streams have
explicit opt-in or opt-out consent processes and use of protected health information
(PHI) from EHRs and other electronic clinical data resources for research has
ethical and regulatory oversight from institutional review boards and national
regulations such as HIPAA in the U.S. In contrast, social network sites and other
quantified-self technologies include terms of agreement for data use that may not be
read or fully comprehended by users. This can result in use of an individual’s data
in the absence of informed consent.
Beneficence involves optimizing benefits while minimizing risks to ensure that scarce resources are used wisely. Poor methodological rigor and loss of confidentiality through commodification of data pose threats to beneficence. To ensure
appropriate decision making based on study findings, methodological rigor is
needed in terms of selection of appropriate data streams as well as at each stage of
the data science pipeline. Loss of confidentiality and commodification of patient/consumer-generated data can occur through prosumption, as digital content is produced and consumed by individuals as they access websites, use mobile health applications, and post and respond to social network messages. Individuals may
vary in their willingness to have their data used for public health versus commercial
purposes because they do not typically reap financial benefits from commodification of their data [45, 46].
The principle of justice requires fair procedures and equitable outcomes in the
selection of research participants. For data science, this means consideration of
characteristics of the individuals or populations comprising the data streams that
will be used to address the research question. For example, (a) the severity of illness
and sociodemographic composition of patients represented in EHR data vary by
type and location of the healthcare organization, (b) Latinos are less likely than Whites or Blacks to use an app for health tracking [47], and (c) racial and ethnic minorities are less likely to participate in biobanks [48, 49]. Such biases in the data
streams may limit the relevance of discoveries and predictions to those at greatest
risk for health disparities. Consequently, researchers must carefully match their
selection of data streams to their research questions.
The required data science competencies for nurses will vary by role, distinguishing general competencies for all nurses from those needed by specialists, including nursing informatics specialists, chief nursing informatics officers, and nurse scientists conducting data science research. As with nursing informatics
competencies in the past, the manner in which these competencies will be acquired
through education at the undergraduate, master’s, and doctoral levels will be
defined over time by bodies that provide oversight for nursing education with input
from the nursing community. To date, most consideration has been given to competencies for nurse scientists, given the increasingly prominent role of data science in discovery. Expertise is typically conceptualized in three broad areas: computational (e.g., cloud computing, workflow automation, visual analytics), mathematical and statistical (e.g., research design, traditional and machine learning analytic techniques), and domain (e.g., nursing, genomics, public health) [50].
Published Venn diagrams of these three areas emphasize the interdisciplinary team
science aspects of data science by naming the intersection of all the competencies
“the unicorn”. Educational pathways for nurse scientists should reflect their primary
areas of knowledge development [3, 11]. For example:
• Create computational methods and tools—doctoral or post-doctoral training in a
computational field such as computer science, data science, or biomedical
informatics. The nursing perspective will inform the types of computational
methods and tools developed.
• Apply data science as major method of inquiry in nursing research—doctoral
training in nursing with interdisciplinary data science specialization integrated
into nursing PhD or post-doctoral program. For example, trainees in the
Reducing Health Disparities Through Informatics Pre- and Post-doctoral
Training program at Columbia University have course work and applied
research opportunities in data science primarily related to data mining and
information visualization.
• Awareness of data science as an approach in nursing research—doctoral training
in nursing and generalist training in data science. Every nurse scientist should
have a general understanding of data science similar to their familiarity with
qualitative inquiry, experimental and quasi-experimental designs, and health
services research. In the U.S., the National Institute for Nursing Research has
made significant efforts to meet this need for existing nurse scientists through
the provision of week-long Boot Camps in Data Science and Precision Health
[51].
5 Conclusion
Acknowledgements Manuscript preparation was supported by grants from the National Institutes
of Health: Precision in Symptom Self-Management (PriSSM) Center, New York City Hispanic
Dementia Caregiver Research Program, and Reducing Health Disparities Through Informatics
(RHeaDI) Pre- and Post-doctoral Training Program.
References
4. Bakken S, Reame N (2016) The promise and potential perils of big data for advancing
symptom management research in populations at risk for health disparities. Annu Rev Nurs
Res 34(1):247–260. https://doi.org/10.1891/0739-6686.34.247
5. Westra BL, Sylvia M, Weinfurter EF, Pruinelli L, Park JI, Dodd D et al (2017) Big data
science: a literature review of nursing research exemplars. Nurs Outlook 65(5):549–561.
https://doi.org/10.1016/j.outlook.2016.11.021
6. IBM. IBM big data & analytics hub 2015. Available from: http://www.ibmbigdatahub.com/
infographic/four-vs-big-data
7. Marr B. Big data: the 5 Vs 2015 [cited 1 Feb 2015]. Available from: http://www.slideshare.
net/BernardMarr/140228-big-data-volume-velocity-variety-varacity-value
8. Koleck TA, Conley YP (2015) Identification and prioritization of candidate genes for
symptom variability in breast cancer survivors based on disease characteristics at the cellular
level. Breast Cancer (Dove Med Press) 8:29–37. https://doi.org/10.2147/BCTT.S88434
9. Yoon S, Elhadad N, Bakken S (2013) A practical approach for content mining of Tweets.
Am J Prev Med 45(1):122–129. https://doi.org/10.1016/j.amepre.2013.02.025
10. Rantz MJ, Skubic M, Popescu M, Galambos C, Koopman RJ, Alexander GL et al (2015) A
new paradigm of technology-enabled ‘Vital Signs’ for early detection of health change for
older adults. Gerontology 61(3):281–290. https://doi.org/10.1159/000366518
11. Bakken S (2017) Data science. In: Hinshaw AS, Grady PA (eds) Shaping health policy
through nursing research. Springer
12. Tesla Institute. Understanding the data science pipeline [cited 14 Feb 2018]. Available from:
http://www.tesla-institute.com/index.php/using-joomla/extensions/languages/278-
understanding-the-data-science-pipeline
13. Koleck T, Bakken S, Kim M, Wesmiller S, Tatonetti N (in preparation) Use of electronic
health records to examine demographic and clinical predictors of postoperative nausea and
vomiting in women following gynecologic surgical procedures. J Perianesthesia Nurs
14. Arockiaraj AI, Shaffer JR, Koleck TA, Weeks DE, Conley YP (in preparation) Methylomic
data processing protocol shows difference in sample quality and methylation profiles between
blood and cerebral spinal fluid following acute subarachnoid hemorrhage. Genet Epigenetics
15. Yoon S (2016) What can we learn about mental health needs from Tweets mentioning
dementia on World Alzheimer’s Day? J Am Psychiatr Nurses Assoc 22(6):498–503. https://
doi.org/10.1177/1078390316663690
16. Yoon S, Co MC Jr, Bakken S (2016) Network visualization of dementia tweets. Stud Health
Technol Inform 225:925
17. Pruinelli L, Yadav P, Hoff A, Steinbach M, Kumar V, Delaney CW et al (2018) Delay within the 3-hour Surviving Sepsis Campaign guideline on mortality for patients with severe sepsis and septic shock. Crit Care Med. https://doi.org/10.1097/ccm.0000000000002949. [Epub ahead of print]
18. Rantz M, Phillips LJ, Galambos C, Lane K, Alexander GL, Despins L et al (2017)
Randomized trial of intelligent sensor system for early illness alerts in senior housing. J Am
Med Dir Assoc 18(10):860–870. https://doi.org/10.1016/j.jamda.2017.05.012
19. Dowding D, Merrill JA, Onorato N, Barron Y, Rosati RJ, Russell D (2018) The impact of
home care nurses’ numeracy and graph literacy on comprehension of visual display
information: implications for dashboard design. J Am Med Inform Assoc 25(2):175–182.
https://doi.org/10.1093/jamia/ocx042
20. Lee KA, Meek P, Grady PA (2014) Advancing symptom science: nurse researchers lead the
way. Nurs Outlook 62(5):301–302. https://doi.org/10.1016/j.outlook.2014.05.010
21. Miaskowski C, Barsevick A, Berger A, Casagrande R, Grady PA, Jacobsen P et al (2017) Advancing symptom science through symptom cluster research: expert panel proceedings and recommendations. J Natl Cancer Inst 109(4). https://doi.org/10.1093/jnci/djw253
22. Cohen B, Vawdrey DK, Liu J, Caplan D, Furuya EY, Mis FW et al (2015) Challenges
associated with using large data sets for quality assessment and research in clinical settings.
Policy Polit Nurs Pract 16(3–4):117–124. https://doi.org/10.1177/1527154415603358
23. Cowie MR, Blomster JI, Curtis LH, Duclaux S, Ford I, Fritz F et al (2017) Electronic health
records to facilitate clinical research. Clin Res Cardiol 106(1):1–9. https://doi.org/10.1007/
s00392-016-1025-6
24. Kreimeyer K, Foster M, Pandey A, Arya N, Halford G, Jones SF et al (2017) Natural
language processing systems for capturing and standardizing unstructured clinical informa-
tion: a systematic review. J Biomed Inform 73:14–29. https://doi.org/10.1016/j.jbi.2017.07.
012
25. Pereira L, Rijo R, Silva C, Martinho R (2015) Text mining applied to electronic medical
records: a literature review. Int J E-Health Med Commun (IJEHMC) 6(3):1–18. https://doi.
org/10.4018/IJEHMC.2015070101
26. Weiskopf NG, Weng C (2013) Methods and dimensions of electronic health record data
quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc 20(1):144–
151. https://doi.org/10.1136/amiajnl-2011-000681
27. Coughlin SS (2014) Toward a road map for global -omics: a primer on -omic technologies.
Am J Epidemiol 180(12):1188–1195. https://doi.org/10.1093/aje/kwu262
28. McCall MK, Stanfill AG, Skrovanek E, Pforr JR, Wesmiller SW, Conley YP (2018)
Symptom science: omics supports common biological underpinnings across symptoms. Biol
Res Nurs 20(2):183–191. https://doi.org/10.1177/1099800417751069
29. Birney E, Smith GD, Greally JM (2016) Epigenome-wide association studies and the
interpretation of disease-omics. PLoS Genet 12(6):e1006105. https://doi.org/10.1371/journal.
pgen.1006105
30. Riancho J, Del Real A, Riancho JA (2016) How to interpret epigenetic association studies: a
guide for clinicians. Bonekey Rep 5:797. https://doi.org/10.1038/bonekey.2016.24
31. Baumgartel K, Zelazny J, Timcheck T, Snyder C, Bell M, Conley YP (2011) Molecular
genomic research designs. Annu Rev Nurs Res 29:1–26
32. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD et al (2014)
Minfi: a flexible and comprehensive bioconductor package for the analysis of Infinium DNA
methylation microarrays. Bioinformatics 30(10):1363–1369. https://doi.org/10.1093/
bioinformatics/btu049
33. Chen J, Just AC, Schwartz J, Hou L, Jafari N, Sun Z et al (2016) CpGFilter: model-based
CpG probe filtering with replicates for epigenome-wide association studies. Bioinformatics 32
(3):469–471. https://doi.org/10.1093/bioinformatics/btv577
34. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD (2012) The sva package for removing
batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28
(6):882–883. https://doi.org/10.1093/bioinformatics/bts034
35. Xu X, Gammon MD, Hernandez-Vargas H, Herceg Z, Wetmur JG, Teitelbaum SL et al
(2012) DNA methylation in peripheral blood measured by LUMA is associated with breast
cancer in a population-based study. FASEB J 26(6):2657–2666. https://doi.org/10.1096/fj.11-
197251
36. Xu Z, Niu L, Li L, Taylor JA (2016) ENmix: a novel background correction method for
Illumina HumanMethylation450 BeadChip. Nucleic Acids Res 44(3):e20. https://doi.org/10.
1093/nar/gkv907
37. Phipson B, Maksimovic J, Oshlack A (2016) missMethyl: an R package for analyzing data
from Illumina’s HumanMethylation450 platform. Bioinformatics 32(2):286–288. https://doi.
org/10.1093/bioinformatics/btv560
38. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic
Acids Res 28(1):27–30. KEGG accessible at: http://www.genome.jp/kegg/kegg1.html
39. Sinnenberg L, Buttenheim AM, Padrez K, Mancheno C, Ungar L, Merchant RM (2017)
Twitter as a tool for health research: a systematic review. Am J Public Health 107(1):143-e8
40. Westra BL, Christie B, Johnson SG, Pruinelli L, LaFlamme A, Sherman SG et al (2017)
Modeling flowsheet data to support secondary use. Comput Inform Nurs 35(9):452–458.
https://doi.org/10.1097/CIN.0000000000000350
41. Rantz M, Lane K, Phillips LJ, Despins LA, Galambos C, Alexander GL et al (2015)
Enhanced registered nurse care coordination with sensor technology: impact on length of stay
Michael Bainbridge
1 Introduction
It’s hard to avoid hearing mention of ‘Big Data’ in 2018. News headlines,
periodicals and books are awash with it. The prospect of taking the wisdom of
crowds, in particular their data, and distilling it into a formal source of high quality
data is most appealing [1]. Uses hoped for include decision support, decision
making, aggregation for public health planning, and epidemiological research. The
list is long, frequently visited and lengthened [2]. The audience is wide, from you and me as consumers to multinational pharmaceutical companies [3, 4]. This
chapter will address the issue of big data largely from the perspective of data
derived at the point of care being used with other data sources, to support inferences
that would be unlikely to be made via the conventional routes such as analysis of a
single organisation’s data.
Companies ranging from small start-ups to the established blue chip giants are
investing significantly. The prevailing belief seems to be that ‘if only the might of
Big Data and Deep Learning were applied to Health and Medicine, then the benefits
would flow in abundance’. This chapter will examine these beliefs, propose some
definitions and investigate some of the pitfalls and bear traps for the unwary. We
also examine medicine’s readiness to embrace these challenges in its leadership, its
architecture and maturity of thought. Other chapters will focus on the more technical aspects, the ‘how’ and the ‘where’. This chapter will explore the clinical
aspects, the potential benefits and the attention needed in the data collection process
which will be required to achieve these benefits. If, through implementation, big data cannot answer, for the individual practitioner, the question “does this help me deliver 21st Century, safe, state of the art, evidence-based, personalised care?” then
it will fail to meet expectations. Equally, a consumer faced with the sharing of their data could ask the same question, replacing ‘deliver’ with ‘receive’ and adding “reducing my risk and ensuring affordability”.

M. Bainbridge (corresponding author)
University of Victoria, Victoria, BC, Canada
e-mail: bainbrid@uvic.ca
Put simply, we are examining the aggregation of data from clinical and other
sources and applying it to questions relating to all areas of medicine. It is worth
differentiating between time-sensitive applications such as:
• the direct delivery of care
• decision support—including the incorporation of genomics and other ‘omic data
into decisions
And applications which are less time-sensitive:
• the planning of care
• the commissioning of care
• the audit of the quality and safety of care delivery process
• multiple areas of clinical and pharmacological research
• defining and measuring outcomes
• multiple potential commercial applications
The audience is potentially wide for both types of application but they put
different ‘stresses’ on the systems providing this service. In all, however, there is an
inherent risk to trust, privacy and security which we will discuss in more detail later
in the chapter. It must be emphasised that access to these data is not something to
consider only for clinicians; utility and availability must be considered for the
clinician and the populations and individuals that they serve, as well as their
non-clinical carers. This sharing process, of course, brings its own challenges to society, as issues around security, availability, and privacy are accentuated. Marked differences in beliefs and expectations also exist between the baby boom generation and millennials, who have never lived without the internet or social media [5]. Successful and rapid consultation and agreement on these issues must be achieved.
It is important that this is done early and well before disclosures and decisions are
made which become irrevocable. There is already a rising tide of data breaches,
both inadvertent and malicious [6]. Big data magnifies this worry, raising the
spectre of disclosures on an unprecedented scale.
Big data is closely connected with other areas in health that are all, at the time of
writing, still close to the peak of the Gartner Hype Cycle, although descent into the
“trough of disillusionment” is only a matter of time. Alongside Big Data, the issues
of Decision support systems (DSS), Knowledge Support Systems (KSS), Artificial
Big Data Challenges for Clinical and Precision Medicine 19
Intelligence (AI), Precision Medicine¹ and the application of both genomics and
multiple sub-species of ‘omics all consume multiple column inches. All are related
and form an enticing set of potential benefits for the application of computation to
clinical care. Despite these concepts having been in use in healthcare for over
30 years [7], definitions and expected outcomes from their application still differ
wildly. In this chapter, we have the expectation that:
• A Decision Support System will present clinicians, consumers and carers,
singly and in combination, with different options for the delivery of care
(or indeed its withdrawal).
• Artificial Intelligence systems will make decisions on behalf of the same actors
and may then deliver care autonomously or pause for approval.
• Decisions made by both DSS and AI may be improved by improving the quality
of the information presented to them.
• Precision Medicine focuses on identifying which interventions will be effective
for an individual, based on genetic, environmental, and lifestyle factors.
• The much heralded availability of a person’s genome and the application of
these data to their phenotype (previous medical history plus environmental and
lifestyle factors) will be a major driver (Fig. 1).
¹ The older term ‘Personalised Medicine’, which is often used interchangeably with ‘Precision
Medicine’, will not be used in this chapter. Personalised Medicine, which implies specific
manufacture or synthesis for the individual, is a valid but much more specific concept.
20 M. Bainbridge
Much is made of the benefits of big data in the health arena and of its analysis and
presentation. The quality and structure of the source data are also important. This is
well recognised in the literature [8, 9] but less well delivered in the real-world of
clinical computing and care [10]. The problems are magnified when the
time-sensitive tasks above are examined and even basic infrastructure issues such as
connectivity and reliability of bandwidth become significant blockers.
Along with data access and data quality, trust and privacy are vital to the initial
acceptability of incorporating data, and even to the permission to use it and to
continue using it. Inadvertent identification of data subjects due to poor design is
just one aspect. Harm and disadvantage caused to people so identified is another.
These issues are complex, especially when examining an insurance-based and
actuarially driven sector.
Disclosure of genomic data from any time after embryonic implantation could, for
certain diagnoses, preclude that person ever getting a loan or a mortgage [11].
Huntington’s Chorea may, for example, manifest at any age from 4 to 85 years [12].
In 2016, Gartner defined eight sources of big data relevant to health [13]. Ernst and
Young and others have characterised it as the ‘four V’s’—Volume, Variety,
Velocity and Veracity [14, 15]. Likewise McKinsey [16] have examined Big Data
and characterised the five ‘rights’ that its use could deliver: right living, right care,
right provider, right value and right innovation. All suggest that the flow of data
through the health ecosystem would improve both the five ‘rights’ [17] and actual
clinical outcomes. As you will hear elsewhere in this book, the benefits from access
to and the use of big data are undeniable. However, there are multiple issues which
could potentially derail, devalue and undermine uptake, use and acceptance of the
concept.
Let’s examine the Gartner sources of health related big data in more detail and
address some of their potential value as well as problems:
Physician’s Free-Text Notes—Without doubt, this resource exists in volume
but, despite public perception otherwise, it is highly variable and unstructured. Free
text is prone to semantic error and is often created without significant contextual
cues. Family history, for example is not well recorded [18, 19] in clinical records.
Simple issues such as negation may totally change the meaning and be treated
differently according to the algorithm reading the text (this is still only a partially
solved problem) [20, 21]. Unfortunately, making the transition from free text to
structured and coded records remains the same significant barrier it has been for
decades [22, 23].
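To make the negation problem concrete, here is a toy, NegEx-style rule check in Python. The trigger list, the five-word window and the whitespace tokenisation are all illustrative simplifications of the real algorithms discussed in the literature cited above; production systems must also handle punctuation, scope and sentence structure.

```python
# Toy NegEx-style negation check (a sketch, not the published algorithm).
# A concept is treated as negated if a trigger phrase appears within a
# short window of words immediately before it.
NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of"]

def is_negated(text, concept, window=5):
    """True if `concept` occurs in `text` with a negation trigger in the
    preceding `window` words (case-insensitive, whitespace tokens only)."""
    words = text.lower().split()
    concept_words = concept.lower().split()
    n = len(concept_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == concept_words:
            preceding = " " + " ".join(words[max(0, i - window):i]) + " "
            if any(" " + t + " " in preceding for t in NEGATION_TRIGGERS):
                return True
    return False

print(is_negated("Patient denies chest pain on exertion", "chest pain"))   # True
print(is_negated("Patient reports chest pain on exertion", "chest pain"))  # False
```

The two sentences differ by a single word, yet carry opposite clinical meanings; an aggregation pipeline that misses the trigger would silently count the first patient as having chest pain.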
Patient Generated Health Data—This data source is beginning to grow
exponentially but in a very unstructured way as health and wellness data is captured
in a wide variety of applications [24, 25]. This data source largely consists of free
text with fewer data items than data created by clinicians. New sources of data from
wearables [26, 27] are also contributing. However, just because you can measure a
data point regularly doesn’t mean that it is of value. Conversely, there is a
possibility that we have yet to recognise the value of high-volume information
such as heart rate. More importantly, many clinicians paternalistically reject these
data because they were captured by unskilled observers. There is also professional and
legal anxiety that they may be the source of a new and unfunded duty of care
[28, 29].
Genomic Data—Within the next 2–5 years we will be confronted by large
numbers of the global population being offered affordable full genetic sequencing
[30, 31] possibly at birth [32]. Genomic sequencing of IVF embryos has taken place
since 2014 [33] and comprehensive sequencing forms part of much pre-conceptual
counseling [34]. These data, estimated at between 100 and 150 GB per human,
together with the standardisation of their representation [35] in the genomics
community, will offer substantial opportunity for precision and perhaps also
personalisation of
medicine. However, much work is needed on the interface between genomic and
phenotype information. This has been a source of discussion for many years and is
still the source of much argument in a crowded landscape. Without doubt, current
clinical systems will need significant redesign in order to benefit from and include
genomics data [36, 37].
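The 100–150 GB per-genome figure above already implies a storage problem at population scale. A back-of-envelope sketch, with an illustrative population size and replication factor, makes the point:

```python
# Back-of-envelope storage estimate using the 100-150 GB per-genome figure
# quoted above. Population size and the 3x replication factor (for
# resilience) are illustrative assumptions, not measured values.
def genome_storage_pb(population, gb_per_genome=125.0, replication=3):
    """Total storage in petabytes for `population` genomes kept in
    `replication` copies. 1 PB = 1,000,000 GB (decimal units)."""
    total_gb = population * gb_per_genome * replication
    return total_gb / 1_000_000

# Sequencing 1 million newborns at the midpoint estimate:
print(round(genome_storage_pb(1_000_000), 1))  # 375.0 (petabytes)
```

Even before analysis begins, a single national birth cohort at this size would demand hundreds of petabytes, which is why compressed and standardised representations [35] matter so much.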
Physiological Monitoring Data—There is an increasing overlap between these
data and data traditionally captured in intensive and unscheduled care environ-
ments. The same restrictions and issues apply.
Medical Imaging Data—The size of these data is measured in petabytes and
despite advances in the Natural Language Processing of the reporting process, the
actual image data remains largely unstructured [38–40].
The last three of Gartner’s sources, Publicly Available Data, Credit Card
and Purchasing Data and Social Media Data, are out of the scope of this chapter
but obviously a potential source of much information previously thought inacces-
sible to clinical care. Volume, diversity and questionable veracity will all prove to
be limiting in their utility. The potential for privacy breaches through inference
attacks [41] linking to health data is also great.
High quality data is a precursor to the delivery of clinical care. However, despite
decades of evidence to support it, a major item that would deliver high quality data
is still not in place. We are, of course, talking about structured and coded clinical
data.
The opportunity offered by the coding and structuring of clinical data is not
promoted or implemented at anything like the scale necessary. Since Larry Weed’s
seminal paper in 1968 [42], medicine has known what has been required to deliver
interoperable care: to capture coded, defined and structured information capable of
being shared between clinicians without the sharing process compromising its
meaning. For reasons outside the scope of this book, this has
not been addressed at scale until very recently. Work is now starting with global
collaborations through the Systematized Nomenclature of Medicine (SNOMED)
[43–46] and FHIR [47] which will start to address these issues and greatly con-
tribute to the quality and granular structuring of clinical data available for analysis.
It is hoped that this work will also see an end to 30 years of coding wars, resolving
the confusion between a nomenclature (e.g. SNOMED) and a classification such as
the International Classification of Diseases (ICD) [48], when it is realised once and
for all that both can exist to perform different but related (and sometimes mapped)
tasks. For the first time since the inception of digital clinical records, there is an
alignment of the technical aspects surrounding them.
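The nomenclature/classification distinction can be made concrete with a small sketch: a single clinical statement carrying both a point-of-care SNOMED CT code and a mapped ICD-10 code, shaped as a FHIR R4 Condition fragment. The resource shape and system URIs follow the FHIR specification; the codes shown are the standard ones for type 2 diabetes, and the fragment is hand-built for illustration rather than produced by any particular system.

```python
# A minimal, hand-built FHIR R4 Condition fragment. One clinical statement
# carries two codings: a SNOMED CT code (nomenclature, used at the point
# of care) and a mapped ICD-10 code (classification, used for reporting).
condition = {
    "resourceType": "Condition",
    "code": {
        "coding": [
            {"system": "http://snomed.info/sct",
             "code": "44054006",
             "display": "Diabetes mellitus type 2"},
            {"system": "http://hl7.org/fhir/sid/icd-10",
             "code": "E11",
             "display": "Type 2 diabetes mellitus"},
        ],
        "text": "Type 2 diabetes",
    },
}

# Both terminologies coexist in one record, performing different tasks.
systems = [c["system"] for c in condition["code"]["coding"]]
print(systems)
```

Analytics can then aggregate by ICD class while clinical decision support reasons over the finer-grained SNOMED concept, with no loss of meaning in the sharing process.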
Along with the issues above, basic provision of appropriate hardware, connec-
tivity and service availability should not be underestimated even in countries
thought to be at an advanced stage in their development. For example, in a shared
record, it is vital that all parties are using the same (and hopefully most recent)
release of the terminology so that gaps and inadvertent changes in meaning do not
occur. If you are disconnected from the record while seeing the patient, what
happens to the data when you reconnect? Who is responsible and accountable for
the orchestration of care to the best possible standard? How are the data maintained
accurately when there may be multiple authors?
Some countries acknowledge this issue and are addressing it with syndication
and ontology services [49, 50]. Clinical systems around the world are still largely
proprietary in their coding systems for the capture of data and in the data models
they use to capture and reproduce these data items. However,
this is changing in some countries. The UK mandated SNOMED CT implemen-
tation in Primary care by April 2018 and has plans for Secondary care to follow by
2020 [51]. New Zealand has started its migration from the obsolete READ
standard to standardise on SNOMED CT [52]. This use of clinical terminology at
the point of care is catalysing an increased understanding that this approach,
through the uptake of Professional Record Standards, can, for the first time, start to
deliver fit-for-purpose interoperable records.
The UK has recently become the first to take a professional standards approach
with the inception of the Professional Records Standards Body [53]. Without this
offered in these new delivery paradigms; “quis custodiet ipsos custodes?” It will
also be vital to address whether the workforce and public are ready to accept the end
of paternalism and place their trust in shared decision-making and interpretation.
4 Trust/Privacy/Governance
The availability of data and its use and reuse depends upon a level of trust afforded
to the custodians of the data. Indeed, good medical care is often equated with this
trust-based relationship. Sadly, several early implementations of Big Data have very
publicly abused this trust relationship through the naivety of their approach. In
some cases it seems commercial pressures may have also clouded judgement.
Notable recent examples are not difficult to find.
The UK NHS Royal Free Hospital and Google Deep-Mind collaboration was
reported as “Royal Free breached UK data law in 1.6 m patient deal after only
7 months into the contract” [63–67]. Also in the UK, the care.data project, which
was supposed to be the flagship of NHS Information technology was widely
reported as a ‘debacle’ when it was summarily closed down in 2016 [68–70]. This
followed years of professional anxiety about trust and covert agendas [71–74].
Large-scale errors and naivety are not solely confined to the UK Government. In
2016 the Australian Department of Human Services (DHS) publicly released a
‘de-identified’ dataset containing 3 million patients’ data stretching back 30 years
[75]. It was taken down a few months later when local researchers successfully
re-identified some of the people whose data was published [76]. This issue
highlights what will be a continuing problem: large datasets, however well
‘anonymised’, are always vulnerable to re-identification if they are of large enough scale.
Just as encryption protocols are always at risk of being broken (whether through
quantum computing [77, 78] or some other technique), the issue of re-identification
of large datasets is one which will not go away. We can only hope to avoid
disadvantage through ‘controlled’ and transparent processes, and by ensuring that
data subjects’ permission is both sought and obtained [79, 80].
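One simple way to reason about re-identification risk, sketched here with entirely illustrative records and quasi-identifiers, is a k-anonymity check: a release is k-anonymous if every combination of quasi-identifier values appears at least k times, and a k of 1 means some record is unique and therefore linkable back to a person.

```python
from collections import Counter

# Minimal k-anonymity check over chosen quasi-identifiers. The records
# below are invented for illustration; real releases involve millions of
# rows, where rare combinations are easy to overlook.
def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest equivalence class formed by the
    quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"postcode": "2000", "birth_year": 1960, "sex": "F", "diagnosis": "HD"},
    {"postcode": "2000", "birth_year": 1960, "sex": "F", "diagnosis": "T2D"},
    {"postcode": "2010", "birth_year": 1985, "sex": "M", "diagnosis": "asthma"},
]

# k = 1: the single 2010/1985/M record is unique, hence re-identifiable
# by linkage to any public source carrying the same three attributes.
print(k_anonymity(records, ["postcode", "birth_year", "sex"]))  # 1
```

Checks like this mitigate but do not eliminate the risk: as the Australian DHS case shows, attackers can bring auxiliary data the releasing agency never considered.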
Another instance of recent issues in the use of big data relates again to the UK
where an application using AI algorithms to interview patients was put to use as an
NHS-branded resource. Videos published online show the system, live and in
public use, making potentially fatal errors of ‘diagnosis’ [81]. Currently, there is
no published evidence available that the AI was validated through anything other
than internal and offline testing before going live [82–85].
The trust issue is even more important with the recent announcements in the
USA where large corporate interests such as Warren Buffett, JP Morgan, Amazon
[86] and also CVS and Aetna [87] are to merge and become healthcare providers
covering, and in all likelihood dominating, large populations in the USA. Each has
large databases with information gained in their commercial activities. Aggregated,
these data sources could be a great source for good or a massive risk to privacy for a
significant proportion of the United States population that have used a pharmacy, a
credit card or done online shopping [88].
Each of these examples shows how easy it is to misuse data and abuse consumer
trust through activities with large datasets, which although well intentioned, are just
not well thought through. Once out of the bottle, the data genie, and the disadvantage
it brings, will not easily go back in. This problem is especially acute with genomic
information. With the potential for a genome to be known not long after conception
and certainly from birth onwards, the potential to disadvantage a person’s entire
life becomes a distinct possibility.
Finally in this section, we should examine clinical leadership in this space and
the failure of the professions to fully engage with the information agenda. We have
seen and examined the significant but still insufficient investment in infrastructure
for an industry where the IT is both mission and safety critical [89–91]. A similar
investment gap exists with clinical informatics. It is only recently that this became a
valid career choice in the UK [92]. There are only a few countries that support fully
accredited structures for clinicians to pursue a career in clinical informatics without
compromising their registration (where revalidation also exists).
This brief exploration of some of the clinical aspects of big data has examined the
hype, the real potential benefits and also the potential pitfalls. If we can address the
challenging issues globally then there is no doubt that the benefits will be
significant. The future, in which published evidence can be immediately tested
against a global-sized database and changes to care pathways and plans are
suggested by ever more sophisticated AI backed by deep learning, may be a little
way off, but the first steps have been taken [67, 93, 94]. An approach ensuring high
quality structured and coded data [52, 95, 96] is one which should be taken.
Uniquely identifying data subjects is essential for precision. The UK [97, 98],
Australia [99, 100], New Zealand [101, 102] and Nordic [103] countries have all
mandated this approach, and the USA and others may follow shortly [104,
105]. What needs to occur is global in nature. It has far-reaching implications in the
digital capture of all personal data whether this is for clinical care, illness prevention
or wellness promotion. This vision cannot be achieved at small scale. Global level
coordination and leadership is needed now if we are going to meet the challenge of
big data [106]. In this way we may be ready to address the well documented
challenges of aged care, increased expectation of care, safety of care and budgetary
restriction coupled with a reduction in the availability of a skilled workforce at the
same time [107]. Coupled with this global approach will need to be a sustained and
appropriate level of investment in people and in workflow-sensitive, interoperable,
precision systems to capture and report on clinical data captured at the point of care
and need.
6 The Future
References
1. Big data, big hype? [Internet] (2014) [cited 24 Feb 2018]. Available from: https://www.
wired.com/insights/2014/04/big-data-big-hype/
2. Hurwitz J, Nugent A, Halper F, Kaufman M (2013) Big data for dummies, 1st edn
3. Adamson D (2015) Big data in healthcare made simple [Internet]. Health Catalyst [cited 24 Feb
2018]. Available from: https://www.healthcatalyst.com/big-data-in-healthcare-made-simple
4. Bate A, Reynolds RF, Caubel P (2018) The hope, hype and reality of big data for
pharmacovigilance. Ther Adv Drug Saf 9(1):5–11
5. Anonymous (2008) Chapter 67: children, young people and attitudes to privacy [Internet].
Australian Privacy Law and Practice (ALRC report 108) [cited 25 Feb 2018]. Available from:
https://www.alrc.gov.au/publications/For%20Your%20Information%3A%20Australian%20
Privacy%20Law%20and%20Practice%20%28ALRC%20Report%20108%29%20/67-childre
6. Collier R (2012) Medical privacy breaches rising. CMAJ 184(4):E215–E216
7. Keen PGW (1980) Decision support systems: a research perspective [Internet]. [cited 24 Feb
2018]. Available from: https://dspace.mit.edu/handle/1721.1/47172
8. Jugulum R (2016) Importance of data quality for analytics. In: Quality in the 21st century.
Springer, Cham, pp 23–31
9. Cai L, Zhu Y (2015) The challenges of data quality and data quality assessment in the big
data era. Data Sci J 14:2
10. Middleton B, Bloomrosen M, Dente MA, Hashmat B, Koppel R, Overhage JM et al (2013)
Enhancing patient safety and quality of care by improving the usability of electronic health
record systems: recommendations from AMIA. J Am Med Inform Assoc 20(e1):e2–e8
11. Novas C, Rose N (2000) Genetic risk and the birth of the somatic individual. Econ Soc
29(4):485–513
12. Sermon K, Goossens V, Seneca S, Lissens W, De Vos A, Vandervorst M et al (1998)
Preimplantation diagnosis for Huntington’s disease (HD): clinical application and analysis of
the HD expansion in affected embryos. Prenat Diagn 18(13):1427–1436
13. Sini E (2016) How big data is changing healthcare.pdf [Internet]. Humanitas Hospital Italy.
Available from: https://www.eiseverywhere.com/file_uploads/9b7793c3ad732c28787b2a8
bc0892c31_Elena-Sini_How-Big-Data-is-Changing-Healthcare.pdf
14. Big opportunities, big challenges [Internet]. [cited 25 Feb 2018]. Available from: http://
www.ey.com/gl/en/services/advisory/ey-big-data-big-opportunities-big-challenges
15. Bellazzi R (2014) Big data and biomedical informatics: a challenging opportunity. Yearb
Med Inform 22(9):8–13
16. The big-data revolution in US health care: accelerating value and innovation [Internet].
[cited 18 Dec 2017]. Available from: https://www.mckinsey.com/industries/healthcare-
systems-and-services/our-insights/the-big-data-revolution-in-us-health-care
17. Grissinger M (2010) The five rights: a destination without a map. Pharm Ther 35(10):542
18. Polubriaginof F, Tatonetti NP, Vawdrey DK (2015) An assessment of family history
information captured in an electronic health record. AMIA Annu Symp Proc 5(2015):2035–
2042
19. Nathan PA, Johnson O, Clamp S, Wyatt JC (2016) Time to rethink the capture and use of
family history in primary care. Br J Gen Pract 66(653):627–628
20. Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J et al (2015) DEEPEN: a
negation detection system for clinical text incorporating dependency relation into NegEx.
J Biomed Inform 1(54):213–219
21. Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D et al (2014) Negation’s not solved:
generalizability versus optimizability in clinical natural language processing. PLoS One
9(11):e112774
22. Ford EW, Menachemi N, Phillips MT (2006) Predicting the adoption of electronic health
records by physicians: when will health care be paperless? J Am Med Inform Assoc 13(1):
106–112
23. Warner JL, Jain SK, Levy MA (2016) Integrating cancer genomic data into electronic health
records. Genome Med 8(1):113
24. Richard Lilford AM (2012) Looking back, moving forward [Internet]. University of
Birmingham [cited 17 Oct 2017]. Available from: https://www.birmingham.ac.uk/
Documents/college-mds/haps/projects/cfhep/news/HSJ.pdf
25. Wood WA, Bennett AV, Basch E (2015) Emerging uses of patient generated health data in
clinical research. Mol Oncol 9(5):1018–1024
26. Haghi M, Thurow K, Stoll R (2017) Wearable devices in medical internet of things:
scientific research and commercially available devices. Healthc Inform Res 23(1):4–15
27. Montgomery K, Chester J (2017) Health wearable devices in the big data era: ensuring
privacy, security, and consumer protection. American University, Washington
28. Zhu H, Colgan J, Reddy M, Choe EK (2016) Sharing patient-generated data in clinical
practices: an interview study. AMIA Annu Symp Proc 2016:1303–1312
29. Cohen DJ, Keller SR, Hayes GR, Dorr DA, Ash JS, Sittig DF (2016) Integrating
patient-generated health data into clinical care settings or clinical decision-making: lessons
learned from project healthdesign. JMIR Hum Factors 3(2):e26
30. Burn J (2013) Should we sequence everyone’s genome? Yes. BMJ 21(346):f3133
31. Herper M (2017) Illumina promises to sequence human genome for $100—but not quite yet.
Forbes Magazine [Internet]. [cited 25 Feb 2018]. Available from: https://www.forbes.com/
sites/matthewherper/2017/01/09/illumina-promises-to-sequence-human-genome-for-100-but-
not-quite-yet/
32. Rochman B (2017) Full genome sequencing for newborns raises questions. Scientific
American [Internet]. [cited 25 Feb 2018]. Available from: https://www.scientificamerican.
com/article/full-genome-sequencing-for-newborns-raises-questions/
33. Rojahn SY (2014) DNA sequencing of IVF embryos. MIT Technology Review [Internet].
[cited 25 June 2018]. Available from: https://www.technologyreview.com/s/524396/dna-
sequencing-of-ivf-embryos/
57. Database normalization and design techniques [Internet] (2008) Barry Wise NJ SEO [cited
25 June 2018]. Available from: http://www.barrywise.com/2008/01/database-normalization-
and-design-techniques/
58. McDonald K (2018) MSIA questions need for minimum functionality requirements
project [Internet]. Pulse+IT [cited 26 Feb 2018]. Available from: https://www.
pulseitmagazine.com.au:443/news/australian-ehealth/4171-msia-questions-need-for-minimum-
functionality-requirements-project
59. GP2GP [Internet]. [cited 15 Sep 2017]. Available from: https://digital.nhs.uk/gp2gp
60. DSCN 09/2010 initial standard—ISB—patient banner [Internet]. [cited 27 Feb 2018].
Available from: http://webarchive.nationalarchives.gov.uk/+http://www.isb.nhs.uk/documents/
isb-1505/dscn-09-2010/index_html
61. Common User Interface (CUI) [Internet]. [cited 07 Dec 2018]. Available from: https://
webarchive.nationalarchives.gov.uk/20160921150545/http://systems.digital.nhs.uk/data/cui/
uig
62. National guidelines for on-screen display of medicines information | Safety and Quality
[Internet]. [cited 26 Feb 2018]. Available from: https://www.safetyandquality.gov.au/our-
work/medication-safety/electronic-medication-management/national-guidelines-for-on-screen-
display-of-medicines-information/
63. DeepMind-Royal Free deal is “cautionary tale” for healthcare in the algorithmic age
[Internet] (2017) University of Cambridge [cited 23 Feb 2018]. Available from: http://www.
cam.ac.uk/research/news/deepmind-royal-free-deal-is-cautionary-tale-for-healthcare-in-the-
algorithmic-age
64. Hodson H (2016) Revealed: Google AI has access to huge haul of NHS patient data. New
Scientist [Internet]. [cited 23 Feb 2018]. Available from: https://www.newscientist.com/
article/2086454-revealed-google-ai-has-access-to-huge-haul-of-nhs-patient-data/
65. Basu S. Should the NHS share patient data with Google’s DeepMind? [Internet].
WIRED UK [cited 19 Feb 2018]. Available from: http://www.wired.co.uk/article/nhs-
deepmind-google-data-sharing
66. Vincent J (2017) Google’s DeepMind made “inexcusable” errors handling UK health data,
says report [Internet]. The Verge [cited 15 Nov 2017]. Available from: https://www.
theverge.com/2017/3/16/14932764/deepmind-google-uk-nhs-health-data-analysis
67. Powles J, Hodson H (2017) Google DeepMind and healthcare in an age of algorithms.
Health Technol 7(4):351–367
68. How the NHS got it so wrong with care.data [Internet] (2016) [cited 19 Feb 2018]. Available
from: http://www.telegraph.co.uk/science/2016/07/07/how-the-nhs-got-it-so-wrong-with-caredata/
69. Temperton J. NHS care.data scheme closed after years of controversy [Internet].
WIRED UK [cited 15 Sep 2017]. Available from: http://www.wired.co.uk/article/care-
data-nhs-england-closed
70. NHS (2013) NHS England sets out the next steps of public awareness about care.data
[Internet]. [cited 15 Sep 2017]. Available from: https://www.england.nhs.uk/2013/10/care-
data/
71. van Staa T-P, Goldacre B, Buchan I, Smeeth L (2016) Big health data: the need to earn
public trust. BMJ 14(354):i3636
72. McCartney M (2014) Care.data doesn’t care enough about consent. BMJ 348:g2831
73. Godlee F (2016) What can we salvage from care.data? BMJ 354:i3907
74. Mann N (2016) Learn from the mistakes of care.data. BMJ 354:i4289
75. Cowan P. Govt releases billion-line “de-identified” health dataset [Internet]. iTnews [cited
18 Feb 2018]. Available from: http://www.itnews.com.au/news/govt-releases-billion-line-
de-identified-health-dataset-433814
76. Lubarsky B (2017) Re-identification of “anonymized” data. Georgetown Law Technol Rev
12:202–212
77. Why quantum computers might not break cryptography | Quanta Magazine [Internet].
Quanta Magazine [cited 25 Feb 2018]. Available from: https://www.quantamagazine.org/
why-quantum-computers-might-not-break-cryptography-20170515/
78. Bernstein DJ, Heninger N, Lou P, Valenta L (2017) Post-quantum RSA. In: Post-quantum
cryptography. Lecture notes in computer science. Springer, Cham, pp 311–329
79. Wan Z, Vorobeychik Y, Xia W, Clayton EW, Kantarcioglu M, Malin B (2017) Expanding
access to large-scale genomic data while promoting privacy: a game theoretic approach.
Am J Hum Genet 100(2):316–322
80. Malin B, Sweeney L (2004) How (not) to protect genomic data privacy in a distributed
network: using trail re-identification to evaluate and design anonymity protection systems.
J Biomed Inform 37(3):179–192
81. Murphy D (2017) @CareQualityComm—this is one of the triages relating to the 48yr old
30/day smoker woken from sleep with chest pain. It is now updated. pic.twitter.com/
BJG27sft4J [Internet]. @DrMurphy11 [cited 27 Feb 2018]. Available from: https://twitter.
com/DrMurphy11/status/848110663054622721
82. Middleton K, Butt M, Hammerla N, Hamblin S, Mehta K, Parsa A (2016) Sorting out
symptoms: design and evaluation of the “babylon check” automated triage system [Internet].
arXiv [cs.AI]. Available from: http://arxiv.org/abs/1606.02041
83. Crouch H (2017) Babylon health services says it has “duty” to point out CQC
“shortcomings” [Internet]. Digital Health [cited 18 Feb 2018]. Available from: https://
www.digitalhealth.net/2017/12/babylon-health-services-says-duty-point-cqc-shortcomings/
84. McCartney M (2017) Margaret McCartney: innovation without sufficient evidence is a
disservice to all. BMJ 5(358):j3980
85. Ogden J (2016) CQC and BMA set out their positions on GP inspections. Prescriber 27
(6):44–48
86. Dent S (2018) Amazon gets into healthcare with Warren Buffet and JP Morgan [Internet].
Engadget [cited 25 Feb 2018]. Available from: https://www.engadget.com/2018/01/30/
amazon-healthcare-warren-buffet-jpmorgan-chase/
87. Terlep S (2017) The real reason CVS wants to buy Aetna? Amazon.com. WSJ Online
[Internet]. [cited 25 Feb 2018]; Available from: https://www.wsj.com/articles/the-real-
reason-cvs-wants-to-buy-aetna-amazon-com-1509057307
88. Blumenthal D (2017) Realizing the value (and profitability) of digital health data. Ann Intern
Med 166(11):842–843
89. How much should small businesses spend on IT annually? [Internet] (2015) Optimal
Networks [cited 26 Feb 2018]. Available from: https://www.optimalnetworks.com/2015/03/
06/small-business-spend-it-annually/
90. Atasoy H, Chen P-Y, Ganju K (2017) The spillover effects of health IT investments on
regional healthcare costs. Manage Sci [Internet]. Available from: https://doi.org/10.1287/
mnsc.2017.2750
91. Appleby J, Gershlick B (2017) Keeping up with the Johanssons: how does UK health
spending compare internationally? BMJ 3(358):j3568
92. Williams J, Bullman D (2018) The faculty of clinical informatics [Internet]. FCI [cited 26
Feb 2018]. Available from: https://www.facultyofclinicalinformatics.org.uk/
93. Klasko SK (2017) Interview with Deborah DiSanzo of IBM Watson health. Healthc
Transform 2(2):60–70
94. Fogel AL, Kvedar JC (2018) Artificial intelligence powers digital medicine. NPJ Digit Med
1(1):5
95. Personalised health and care 2020 [Internet]. GOV.UK [cited 25 June 2018]. Available from:
https://www.gov.uk/government/publications/personalised-health-and-care-2020
96. Spencer SA (2016) Future of clinical coding. BMJ 26(353):i2875
97. McBeth R (2015) NHS number use becomes law | Digital Health [Internet]. Digital Health.
[cited 15 Nov 2017]. Available from: https://www.digitalhealth.net/2015/10/nhs-number-
use-becomes-law/
98. NHS number [Internet]. [cited 15 Sep 2017]. Available from: https://digital.nhs.uk/NHS-
Number
Aude Motulsky
What if we had access to real-life data about how medications are prescribed,
dispensed, administered and taken? What if we had the ability to capture the
consequences associated with medication use, both intended (relieve symptoms or
treat and prevent diseases) and unintended (side effects, adverse events) not only
from clinical trials and anecdotal experiences, but from large cohorts of patients
with various characteristics (age, gender, ethnic origins, socioeconomic character-
istics, etc.)? We would then be able to assess effectiveness and safety of medica-
tions from a population perspective, better understand drivers of prescribing
practices and consumption behaviors (e.g. adherence), and inform the
decision-making processes of policy makers, clinicians and patients by providing
them with the risk-benefit ratio of each drug (the added value), driven by real-life
data, and adapted to their local or individual characteristics [1, 2]. These are the
promises of Big Data from a pharmacy perspective: to close the gap between
science and practice surrounding medications and provide a personalized answer to
the question: “Should I take (or prescribe, or cover) this medication?”
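The population-level safety question imagined above can be sketched, under heavily simplified assumptions and with entirely hypothetical records, as a crude adverse-event rate computed from linked dispensing and outcome data:

```python
# Sketch of a population-level safety signal: a crude adverse-event rate
# for a drug, computed from hypothetical linked dispensing/outcome
# records. Real pharmacovigilance requires confounder adjustment,
# stratification and causality assessment; this is the arithmetic core only.
def adverse_event_rate(records, drug):
    """Proportion of patients dispensed `drug` who recorded an adverse event."""
    exposed = [r for r in records if r["drug"] == drug]
    if not exposed:
        return 0.0
    events = sum(1 for r in exposed if r["adverse_event"])
    return events / len(exposed)

cohort = [
    {"drug": "atorvastatin", "adverse_event": False},
    {"drug": "atorvastatin", "adverse_event": True},
    {"drug": "atorvastatin", "adverse_event": False},
    {"drug": "ramipril",     "adverse_event": False},
]

# One event among three exposed patients: a crude rate of 1/3.
print(adverse_event_rate(cohort, "atorvastatin"))
```

The promise of big data in pharmacy is that this denominator becomes an entire population rather than a trial cohort, so the rate can be re-estimated for any age group, ethnicity or comorbidity profile.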
A. Motulsky (&)
Department of Management, Evaluation and Health Policy,
School of Public Health, Academic Health Center of the
Université de Montréal, Université de Montréal, Montréal, Canada
e-mail: aude.motulsky@umontreal.ca
Medications are highly complicated because they change so quickly. New medi-
cations enter and others are withdrawn from the market monthly, with different
trends in different jurisdictions. It is difficult to find another health-related concept
that is so volatile and locally situated. The first entry point for the approval of any
medication is the regulatory agency in a given jurisdiction, such as Health Canada,
the Food and Drug Administration (FDA) in the USA and the European Medicines
Agency (EMA) in Europe. These agencies maintain lists of the medications approved
in their jurisdictions, with related numeric codes and non-standardized descriptors, called a drug
catalogue (Table 1). These codes are always at the brand level, i.e. describing a
product on the market that may contain more than one active molecule. Each new
brand, whether from a generic or an innovator company, will have to go through the
¹ Between different jurisdictions, there is no standard terminology to describe the electronic record
applications that are used by clinicians to replace paper charts. In this chapter, we will use the term
electronic health record (EHR) to describe the computerized system that is replacing the paper
chart in health care organizations, with features such as electronic clinical documentation and
prescribing (including primary care and acute care settings). It is used as a synonym for electronic
medical record (EMR).
Big Data Challenges from a Pharmacy Perspective 35
Table 1 An example of the Canadian drug catalogue: drug identification numbers (DINs) for
selected atorvastatin oral tablets

DIN      | Product name     | Company | Active ingredient | Strength (mg) | Pharmaceutical form | Route
02230711 | Lipitor          | Pfizer  | Atorvastatin      | 10            | Tablet              | Oral
02295261 | Apo-atorvastatin | Apotex  | Atorvastatin      | 10            | Tablet              | Oral
02295288 | Apo-atorvastatin | Apotex  | Atorvastatin      | 20            | Tablet              | Oral
02348713 | Atorvastatin     | Sanis   | Atorvastatin      | 20            | Tablet              | Oral
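As an illustration only (the field names are ours, not the catalogue's actual schema), the rows of Table 1 can be held as simple records, which makes the brand-level nature of DINs explicit: one active molecule maps to several codes.

```python
# Rows of Table 1 as plain records (a sketch, not an official schema).
catalogue = [
    {"din": "02230711", "product": "Lipitor", "company": "Pfizer",
     "ingredient": "Atorvastatin", "strength_mg": 10, "form": "Tablet", "route": "Oral"},
    {"din": "02295261", "product": "Apo-atorvastatin", "company": "Apotex",
     "ingredient": "Atorvastatin", "strength_mg": 10, "form": "Tablet", "route": "Oral"},
    {"din": "02295288", "product": "Apo-atorvastatin", "company": "Apotex",
     "ingredient": "Atorvastatin", "strength_mg": 20, "form": "Tablet", "route": "Oral"},
    {"din": "02348713", "product": "Atorvastatin", "company": "Sanis",
     "ingredient": "Atorvastatin", "strength_mg": 20, "form": "Tablet", "route": "Oral"},
]

# DINs are assigned at the brand level, so a single molecule has many DINs.
dins = [r["din"] for r in catalogue if r["ingredient"] == "Atorvastatin"]
print(len(dins))  # -> 4
```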
[Figure (partial): the patient: intake, expected effects (indication), actual and
perceived effects (signs and symptoms, side effects, adverse events)]
The concepts related to the medication include information about the product’s
name, the molecule or molecules found within the product, and the formulation. In
pharmaceutical terms, formulation refers to the way a medication is prepared to be
administered. Hence, the form refers to what is held in the hands, such as tablets,
capsules, solutions, powders for inhalation, etc. In some cases, it may also include
information about the containers that are utilized to administer the product: inhalers,
syringes, cartridges, transdermal patches, rings, etc. And, most of the time, it is
strongly linked to the route of administration of the product, because the excipients
that are used to ensure the molecule is going to be absorbed, without being painful
or uncomfortable, are adapted to the way the medication will be administered (e.g.
orally, topically, in the eye, injected). But the route is ultimately related to the
prescription and what is administered to the patient. Hence, the route is not only
determined by the formulation and many scenarios are possible for a given form.
For example, pills which are normally taken orally can be administered intrav-
aginally (e.g. misoprostol), and eye drops can be administered orally (e.g. atropine).
Finally, the strength represents the amount of the molecule found in a given
quantity of the product, expressed in defined units. Tables 2 and 3 present different
medication-related concepts and examples of their labels in different terminologies.
The border is blurry between these concepts, and they are usually grouped
together in a way that makes sense for the purposes of their utilization in different
drug-related terminologies. For example, in RxNorm, the drug terminology
developed and maintained by the National Library of Medicine in the USA, the
drug name is always linked to the route, and the strength is always linked to the
form to support the electronic prescribing process.
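A rough sketch of this grouping, assuming nothing beyond what is described above (the class and field names are invented for illustration and are not RxNorm's actual schema): strength sits with the ingredient-level component, and the named drug carries a dose form that implies a usual route.

```python
from dataclasses import dataclass

# Hypothetical record types illustrating how drug-related concepts can be
# grouped for e-prescribing; this is NOT the actual RxNorm schema.

@dataclass(frozen=True)
class DrugComponent:
    ingredient: str      # the active molecule
    strength_mg: float   # strength is tied to the component/form pairing

@dataclass(frozen=True)
class ClinicalDrug:
    components: tuple    # one or more (ingredient, strength) components
    dose_form: str       # e.g. "Oral Tablet"; the form implies the usual route

atorvastatin_10 = ClinicalDrug(
    components=(DrugComponent("Atorvastatin", 10.0),),
    dose_form="Oral Tablet",
)
print(atorvastatin_10.dose_form)  # -> Oral Tablet
```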
Classification systems for medications have been developed to be able to group
similar medications (e.g. based on their chemical structure or pharmacological
action), or medications that are used similarly (e.g. based on their therapeutic
action). Table 3 presents different characteristics that are used to classify medica-
tions. The World Health Organization (WHO) maintains a classification system for
all medications approved around the world, the Anatomical Therapeutic Chemical
(ATC) system, which is preferred for comparative purposes between jurisdictions
(https://www.whocc.no/atc_ddd_index/). Many other classification systems are
available, such as the American Hospital Formulary Service (AHFS) and the
British National Formulary (BNF) systems, all based on their own logic.
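ATC codes encode their hierarchy positionally: one letter for the anatomical main group, two digits for the therapeutic subgroup, one letter each for the pharmacological and chemical subgroups, and two final digits for the substance. The grouping levels can therefore be recovered by simple prefix slicing; a minimal sketch (the helper function is ours):

```python
def atc_levels(code):
    """Split an ATC code into its five hierarchical levels by prefix length.
    Example: C10AA05 (atorvastatin)
      C       anatomical main group (cardiovascular system)
      C10     therapeutic subgroup (lipid modifying agents)
      C10A    pharmacological subgroup
      C10AA   chemical subgroup (HMG-CoA reductase inhibitors)
      C10AA05 chemical substance (atorvastatin)
    """
    return [code[:n] for n in (1, 3, 4, 5, 7) if len(code) >= n]

print(atc_levels("C10AA05"))  # -> ['C', 'C10', 'C10A', 'C10AA', 'C10AA05']
```

Grouping prescriptions by a shorter prefix (e.g. C10) is what enables the between-jurisdiction comparisons mentioned above.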
Prescription-related data are even more complicated. Here, medication data are
contextualized for a given patient at a given point in time. According to the ISO
standard on medication management concepts, a prescription represents: (1) an
instruction by a health care provider; (2) a request to dispense; and (3) advice to
patients about their treatment [5]. It may include different information related to
taking or administering the medication (the regimen), and to the duration of the
treatment (Table 4). Again, this occurs without a standard method of referring to
these concepts or of structuring them in an electronic format. Variables related to
the regimen are necessary in order to calculate the daily dose that a patient receives,
while variables related to the duration are important to estimate the exposure of a
patient to a given medication over time (and also to estimate the daily dose when
the instructions are not available). Sources for prescription-related data are diverse,
with their own specificities that are important to highlight.
Prescription-related data may come from what was given to a health care provider
using the “professional” way of writing instructions (e.g. 1 CO TID), but could also
come from what was given to a patient, where instructions are translated into
patient-friendly language (e.g. take one tablet three times a day). At this time, there is
no standard in North America regarding the instruction field structure, and wide
Fig. 2 Prescription related databases, per step of the medication management process, and their
associated risk of errors when estimating medication exposure. EHR = electronic health record;
eMAR = electronic medication administration record
Retail pharmacy was one of the first health care sectors to computerize its
activities, beginning in the mid-1980s. Developed primarily for billing purposes,
pharmacy management systems have allowed for the creation of large databases of
The WHO has established defined daily doses (DDDs): the assumed average daily
dose for a given indication and a given route of administration, for each molecule.
For example, the DDD for oral hydromorphone when used for pain is 20 mg, while
the DDD for rectal and injectable hydromorphone is 4 mg, because the bioavailability
of the drug is higher when administered intravenously or rectally (i.e. to achieve the
same concentration in the blood, 20 mg is needed orally while only 4 mg is needed
through the other routes). This is because the absorption of the drug through the gut
is never 100%, and because the drug usually passes through the liver before reaching
the systemic circulation, leading to what is called the first-pass effect, which is
avoided when the drug is taken rectally or injected directly into a vein. Reporting
daily doses as a proportion of the DDD is a
standard way to estimate the magnitude of exposure to a medication. In an ideal
world, it would need to be combined with BMI, renal and liver functions, and
genotypes of a given patient to be able to better estimate this exposure in relation to
the pharmacokinetics of the drug in a given individual (and thus the blood con-
centration of this drug).
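A minimal sketch of this normalization, using the hydromorphone DDDs quoted above (the lookup table and function name are ours; a real analysis would draw DDD values from the WHO ATC/DDD index rather than hard-code them):

```python
# Hypothetical lookup: (molecule, route) -> WHO defined daily dose in mg.
# The two entries are the hydromorphone examples quoted in the text.
DDD_MG = {
    ("hydromorphone", "oral"): 20.0,
    ("hydromorphone", "parenteral"): 4.0,
}

def ddd_ratio(molecule, route, daily_dose_mg):
    """Express a patient's daily dose as a proportion of the WHO DDD."""
    return daily_dose_mg / DDD_MG[(molecule, route)]

# 10 mg/day oral and 2 mg/day injected are both 0.5 DDD,
# despite the very different raw doses.
print(ddd_ratio("hydromorphone", "oral", 10.0))        # -> 0.5
print(ddd_ratio("hydromorphone", "parenteral", 2.0))   # -> 0.5
```

This is what makes exposure comparable across routes and across jurisdictions.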
Calculating the daily dose is complicated. It can be estimated from the
instructions (1 mg twice a day will result in 2 mg as a daily dose), or the duration
for a given quantity of a given product (30 pills of 1 mg dispensed for a duration of
15 days will give a daily dose of 2 mg). However, instructions are rarely available
in a standard and structured format that would make this calculation straightforward
[13]. Quantity is usually available, but needs to be mapped to the duration to make
sense, especially in some countries such as France where the quantity dispensed is
rarely aligned with what is needed by the patient for a given treatment as it is
restricted by available packaging. Typically, a French pharmacist will dispense the
smallest format available (e.g. a box of 28 pills) to a patient, even if the prescription
is written for 1 pill per day for 5 days. Using the quantity might lead to an incorrect
analysis of prescribing/dispensing patterns if the duration is not taken into
consideration.
However, the duration might be difficult to assess when the treatment is as
needed, or with a changing dose over time. This is frequent with medication for
pain (e.g. pregabalin), for diabetes (e.g. insulin), or warfarin, where patients will
adjust their daily dose depending on their condition. Thus, estimating the daily dose
would be greatly facilitated by standard and structured instructions, including an
assessment of the chronicity status of the medication (chronic or acute, as needed or
regular), and the stability of the dose over time (e.g. successive dose = take 10 mg
for 10 days and then 20 mg; or alternate dose = take 2 mg Monday Wednesday
Friday and 3 mg other days).
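The two estimation routes described above can be sketched as follows (the function names are ours; both reproduce the worked examples in the text):

```python
def daily_dose_from_sig(dose_mg, times_per_day):
    """Daily dose from structured instructions: 1 mg twice a day -> 2 mg/day."""
    return dose_mg * times_per_day

def daily_dose_from_dispensing(quantity, strength_mg, duration_days):
    """Fallback when instructions are unavailable: daily dose from the quantity
    dispensed over the treatment duration (30 pills of 1 mg over 15 days
    -> 2 mg/day). As the French packaging example shows, this is misleading
    when the dispensed quantity reflects package size, not the prescription."""
    return quantity * strength_mg / duration_days

print(daily_dose_from_sig(1.0, 2))              # -> 2.0
print(daily_dose_from_dispensing(30, 1.0, 15))  # -> 2.0
```

Neither estimator copes with as-needed use or doses that change over time, which is exactly why the chronicity and dose-stability fields discussed above matter.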
Ultimately, the core of the potential of Big Data, and also of its challenges, rests with
the patient. This is where the potential for Big Data is disruptive, but will only be
actualized if data is captured relating to the reason the patient is prescribed the
medication (the indication), and what the impact of taking the medication is for a
given patient over time (both expected and unexpected). This is where potential
adverse drug events can be captured prospectively, as well as where real-life drug
effectiveness can be aligned with the practices of prescribers and patients.
Observational studies through Big Data may be the key in assessing the safety and
effectiveness of all types of prescribing practices, as well as fostering our ability to
understand pharmacogenetic drivers of different responses to drugs based on
individual genotypes. It may revolutionize the way medications are tested, approved,
and continuously evaluated after their approval. It is thus not surprising that major
pharmaceutical companies are investing massively in data analytics departments,
and trying to buy, or create business relationships with EHR and other health-data
owners [14]. It will be important to ensure that academic researchers and public
agencies have the same analytic capabilities as private companies, in terms
of data access, merging, and analysis.
At the moment, the approval process of medications is based on clinical trials,
and only certain indications are evaluated, and thus approved. These indications are
called on-label indications. But what prescribers and patients do after an approval
may be far from what was evaluated in clinical trials [15, 16], and little is known about
the true added value of medications in this context. Similarly, pharmacosurveillance
programs are based on voluntary reporting of adverse events, by patients and health
care providers, and would benefit from proactive surveillance of the actual out-
comes associated with exposure to medication, flagging potential patterns. But the
missing link is exactly there: we need to find a way to identify outcomes associated
with medication use, both intended and unintended. To do that, we need to find a
way to map signs and symptoms of patients to medication usage in both directions:
from the indication—or health concern—that the prescriber is trying to address with
a medication, to the actual consequences for a given patient over time. For example,
nausea can be a health concern for which a medication is prescribed, but it can
also be a side effect of medications, and this needs to be captured
electronically. However, the indication is rarely documented with the prescription,
and no standard terminology is available to document medication-related indica-
tions [17]. Many pilot projects are ongoing, primarily in the USA, to incorporate
indication as a mandatory field when prescribing medications [18], and even to start
the prescribing process with selecting the indication rather than the medication [19].
However, this is far from being the norm in the prescribing process. Similarly,
diagnosis, health problems or health concerns may be documented in electronic
records (e.g. using ICD or SNOMED CT as the standards), but signs and symptoms
following medication usage are rarely documented (e.g. when a medication is
stopped because of a side effect reported by the patient). Capturing signs and
symptoms in a standard way, using a common terminology that can be mapped to
medication-related concepts such as indication and side effect, is a priority for
enabling our analytic capacity from a pharmacy perspective.
8 In Conclusion
References
1. McMahon AW, Dal Pan G (2018) Assessing drug safety in children—the role of real-world
data. N Engl J Med 378(23):2155–2157
2. Schneeweiss S (2014) Learning from big health care data. N Engl J Med 370(23):2161–2163
3. Dhavle AA, Ward-Charlerie S, Rupp MT, Amin VP, Ruiz J (2015) Analysis of national drug
code identifiers in ambulatory e-prescribing. J Manag Care Spec Pharm 21(11):1025–1031
4. Motulsky A, Sicotte C, Gagnon MP, Payne-Gagnon J, Langué-Dubé JA, Rochefort CM,
Tamblyn R (2015) Challenges to the implementation of a nationwide electronic prescribing
network in primary care: a qualitative study of users’ perceptions. J Am Med Inform Assoc 22
(4):838–848
5. ISO/TR 20831:2017 (2017) Health informatics—medication management concepts and
definitions
6. Dhavle AA, Rupp MT (2015) Towards creating the perfect electronic prescription. J Am Med
Inform Assoc 22(e1):e7–e12
7. Dhavle AA, Yang Y, Rupp MT, Singh H, Ward-Charlerie S, Ruiz J (2016) Analysis of
prescribers’ notes in electronic prescriptions in ambulatory practice. JAMA Intern Med 176
(4):463–470
8. Aabenhus R, Hansen MP, Siersma V, Bjerrum L (2017) Clinical indications for antibiotic use
in Danish general practice: results from a nationwide electronic prescription database. Scand J
Prim Health Care 35(2):162–169
9. Ekedahl A, Brosius H, Jönsson J, Karlsson H, Yngvesson M (2011) Discrepancies between
the electronic medical record, the prescriptions in the Swedish national prescription repository
and the current medication reported by patients. Pharmacoepidemiol Drug Saf 20(11):1177–
1183
10. Kivekas E, Enlund H, Borycki E, Saranto K (2016) General practitioners’ attitudes towards
electronic prescribing and the use of the national prescription centre. J Eval Clin Pract 22
(5):816–825
11. Fischer MA, Stedman MR, Lii J, Vogeli C, Shrank WH, Brookhart MA, Weissman JS (2010)
Primary medication non-adherence: analysis of 195,930 electronic prescriptions. J Gen Intern
Med 25(4):284–290
12. Tamblyn R, Eguale T, Huang A, Winslade N, Doran P (2014) The incidence and determinants
of primary nonadherence with prescribed medication in primary care: a cohort study. Ann
Intern Med 160(7):441–450
13. McTaggart S, Nangle C, Caldwell J, Alvarez-Madrazo S, Colhoun H, Bennie M (2018) Use
of text-mining methods to improve efficiency in the calculation of drug exposure to support
pharmacoepidemiology studies. Int J Epidemiol 47(2):617–624
14. Hirschler B (2018) Big pharma, big data: why drugmakers want your health records. Reuters,
1 Mar 2018. https://www.reuters.com/article/us-pharmaceuticals-data/big-pharma-big-data-
why-drugmakers-want-your-health-records-idUSKCN1GD4MM. Accessed on 18 Mar 2018
15. Eguale T, Buckeridge DL, Winslade NE, Benedetti A, Hanley JA, Tamblyn R (2012) Drug,
patient, and physician characteristics associated with off-label prescribing in primary care.
Arch Intern Med 172(10):781–788
16. Eguale T, Buckeridge DL, Verma A, et al (2016) Association of off-label drug use and
adverse drug events in an adult population. JAMA Intern Med 176 (1):55–63
17. Salmasian H, Tran TH, Chase HS, Friedman C (2015) Medication-indication knowledge
bases: a systematic review and critical appraisal. J Am Med Inform Assoc 22(6):1261–1270
18. Galanter WL, Bryson ML, Falck S, Rosenfield R, Laragh M, Shrestha N, Schiff GD,
Lambert BL (2014) Indication alerts intercept drug name confusion errors during comput-
erized entry of medication orders. PLOS ONE 9(7)
19. Schiff GD, Seoane-Vazquez E, Wright A (2016) Incorporating indications into medication
ordering-time to enter the age of reason. N Engl J Med 375(4):306–309
Big Data Challenges from a Public
Health Informatics Perspective
David Birnbaum
Whether the three core functions of public health are called assessment, policy
development and assurance … or assessment, promotion and protection … these
give rise to a wide-ranging set of recognized responsibilities. Specifically, the 10
Essential Public Health Services have been defined as: (1) monitor health status to
identify and solve community health problems; (2) diagnose and investigate health
problems and health hazards in the community; (3) inform, educate, and empower
persons about health issues; (4) mobilize community partnerships to identify and
solve health problems; (5) develop policies and plans that support individual
and community health efforts; (6) enforce laws and regulations that protect health
and ensure safety; (7) link persons to needed personal health services and assure the
provision of health care when otherwise unavailable; (8) assure a competent public
and personal health care workforce; (9) evaluate effectiveness, accessibility, and
quality of personal and population-based health services; and (10) conduct research
for new insights and innovative solutions to health problems [1]. Clearly, this
defines a data-driven mandate.
Public health’s vanguard has moved from an era of relying on receipt of data
through paper forms and telephone notifications, through an era of automated data
transmission into siloes unique to each public health program without interoper-
ability, to reach the point where interoperability between information systems and
expertise in informatics are of paramount importance. Any individual wanting to
D. Birnbaum (✉)
Applied Epidemiology, 609 Cromar Road, North Saanich V8L 5M5, BC, Canada
e-mail: david.birnbaum@ubc.ca
There are obvious benefits to reducing undesirable delays, but on the other hand big
data may be exacerbating what has been called by several authors the tyranny of the
moment. An unintended consequence of technological change over the past decade
has been the constant promise, and then the impatient expectation, of everything
always becoming faster, of “timely” becoming instant while still being accurate.
From the internet to e-mail, and now to the communication of findings from data
mining, what were intended to be time-saving advances can wind up consuming recipients’ lives
to the point of diminishing thoughtful reflection time, accelerating the spread of
confusion rather than enlightenment. When compounded by a 24 h a day delivery
of news by various media outlets and social media, it can seem that getting current
information can never be fast enough and getting credible information can never be
accurate enough. This can challenge the ability of public health communications to
influence public opinion on issues that spread rapidly through social media, all the
while protecting the credibility and trustworthiness of public health agencies
themselves.
One of the major challenges faced by American public health agencies under
their federal government’s Meaningful Use initiative has been inadequate
Important lessons also can be learned from the independent audit of Panorama [11],
a project to develop a seamless national public health information system for
Canada. This Auditor General’s report documents serious problems in all three
aspects audited—functionality, stability and usability, stemming from deficiencies
in project leadership, contract management, system development and accountabil-
ity. It contains quotations regarding benefits from core functionality in the system
produced, and responses from public health agencies and the Ministry of Health to
recommendations made in the audit report, but also notes that Panorama has not
become a national pan-Canadian or even a total provincial pan-British Columbian
information system as originally intended. Started in 2004 and implemented in
2011, Panorama is years late in delivery, significantly over-budget in costs, and for
reasons explained in detail in the report the Auditor General states that “The
ministry’s failure to meet established budgets and deliver the full scope of both
projects indicates that Panorama did not achieve value for money.”
As public health departments acquire the capacity to collect large volumes of
detailed data about individuals, and as database linkage capabilities grow across
the internet, the challenge of balancing legitimate access to information of public
importance against the expectation of patient privacy protection has also
overwhelmed the adequacy of traditional approaches under existing legal authority
[12, 13]. Changes recommended by Information and Privacy Commissioners as
well as public health leaders must be addressed within their respective national and
state or provincial jurisdictions; however, a harmonized international framework is
also needed to ensure compatibility and interoperability between jurisdictions.
Thus, the realm of national politics and international trade agreements is also
germane to the future of public health informatics. Past experience with such
agreements is cause for caution within the public health community [14, 15].
Intrinsic in this aspect is the question of data ownership—whether healthcare
providers and corporate entities own patient care data or simply are stewards of
patient-owned records of care. Also at issue is the question of when, if not whether,
the succession of electronic patient record systems developed to archive these
records for entire populations will satisfy the working needs and expectations of
all stakeholders [16].
Beyond the challenges of collecting big data rests the challenges of analysis and
visualization. The Institute for Health Metrics and Evaluation has been at the
forefront of studying the Global Burden of Disease and exploring ways to visualize
its complex data sets (http://www.healthdata.org/results/data-visualizations). Others
have developed platforms like HealthMap (http://www.healthmap.org/en/) simply
to improve real-time accessibility of “a unified and comprehensive view of the
current global state of infectious diseases and their effect on human and animal”.
Limitations and pitfalls of familiar graphs and charts have been identified by authors
like William Cleveland [17], who in his 1993 book presents tools for visually
encoding and decoding the “hypervariate” and “multiway” data that are more
complex than the more familiar univariate, bivariate and trivariate types of data
often seen. As Cleveland says, “Visualization is critical to data analysis. It provides
a front line of attack, revealing intricate structure in data that cannot be absorbed in
any other way. We discover unimagined effects, and we challenge imagined ones
… When a graph is made, quantitative and categorical information is encoded by a
display method. Then the information is visually decoded. This visual perception is
a vital link. No matter how clever the choice of the information, and no matter how
technologically impressive the encoding, a visualization fails if the decoding fails.
Some display methods lead to efficient, accurate decoding, and others lead to
inefficient, inaccurate decoding.” Modeling is another approach to using data to
inform decisions. Of course not all public health problems need big data to discover
useful answers, but richer data sets may be able to support the creation, refinement
and validation of more meaningful models. As the statistician George
Box cautioned, “All models are wrong but some models are useful” [18]. Modeling
complex feedback-driven health systems spans expertise from healthcare profes-
sions, systems analysts, statisticians, engineers and others. Consider, for example,
how public health and systems science methods were combined to model
the structure and behavior of an entire country’s immunization system [19].
Several countries maintain big data resources available for health service
research. For example, the Canadian Institutes of Health Research (http://www.
cihr-irsc.gc.ca/e/49941.html), the U.S. National Institutes of Health (https://
commonfund.nih.gov/bd2k), the U.S. Department of Health and Human Services
(https://www.healthdata.gov/), the European Union (http://data.europa.eu/euodp/
en/home), the UK Government (https://data.gov.uk/data/search?theme-primary=
Health), etc. Philanthropic foundations also have committed to sharing high quality
data (e.g. https://www.gatesfoundation.org/How-We-Work/General-Information/
7 Conclusion
replaced by one involving analysis at the push of a button on data sets too large to
examine, or use of algorithms that had not been tested for validity to control
automated equipment, errors occurred and harm as well as near-misses resulted
when such error was not immediately recognized [28]. William Vaughan and Paul
Ehrlich have variously been quoted as saying that “To err is human, to really foul
things up requires a computer” (https://quoteinvestigator.com/2010/12/07/foul-
computer/). What, then, should we say about amplifying the power of computers
with big data? Perhaps “The combination of a strong epidemiologic foundation,
robust knowledge integration, principles of evidence-based medicine, and an
expanded translation research agenda can put Big Data on the right course” [29].
There is no denying the potential in big data to improve our understanding of
complex systems, to advance personalized medicine that can improve the safety and
effectiveness of medical therapy, and to improve public health’s ability to inform
decisions that can safeguard population health. However, the path to those benefits
must be navigated with due discipline and caution.
References
Big Data Challenges from a Healthcare Administration Perspective

Donald W. M. Juzwishin
1 Introduction
D. W. M. Juzwishin (✉)
University of Victoria, British Columbia, Canada
e-mail: djuzwishin@uvic.ca
3.1 Definitions
Table 1 describes the role and responsibility of administration and leadership in the
health care system.
The role and responsibility of governance bodies and leadership is to work
together to execute the legal requirements of health care delivery. We will
review the HSO framework to describe and analyze the eight standards in Table 2
by which health care is expected to be delivered to the population. These standards
are generally reflective of the expectations of other accreditation bodies in other
modern democratic and open societies.
In addition to the eight dimensions of excellence in delivering care, four values
are identified as key to express the aspirational relationship between citizens,
patients, health care providers and the governance bodies and administration. The
four values are summarized in Table 3.
Big data can either be a facilitator or a threat to achieving these values. In this
chapter we take a critical look at the hope and hype of big data with a view to
preparing ways that administrators can engage with big data effectively.
The promise of big data for administrators and leaders appears, on the surface,
enormous. Many of the promises are theoretical: they appear conceptually sound,
but apart from some very early results from high-quality studies, few have delivered
on the promise. In this section we take a critical view, to ensure that the
unintended consequences of big data have been thought through carefully.
5 Population Focus
Health care leaders are expected to bring a social determinants and population wide
perspective to their roles. Social and financial status, genetic predisposition and
environmental factors are all seen as influencing the health status of a community.
Coupling a social determinants of health perspective with an all-government
approach suggests that personal health data could be linked to other data from
social, education, geographic and economic sources to identify what social, eco-
nomic and public policy gaps exist and what interventions might be appropriate to
improve a community’s health status. Leaders are motivated to pursue an egalitarian
distribution of health status in a community, focusing on marginalized populations
whose health status is lower than that of the general population. Big data could be
utilized to identify the gaps in health status to mobilize policy and interventions to
address the gaps.
6 Accessibility
Timely access to the right health care service in a convenient location for the patient
is important. Institutional health care delivery within four walls has traditionally
expected that patients will come to the location to receive the service. This may be
Big Data Challenges from a Healthcare Administration Perspective 59
One of the challenges for big data to improve the accessibility of services for the
citizen is the fragmentation and lack of linkage among data repositories. Many
jurisdictions have yet to provide easy access and ownership of all personal health
data to citizens.
There is also a need to differentiate between the primary uses of the health data
versus secondary use. Primary use is for the purpose of delivering care. The con-
fidentiality and privacy of this information is generally protected by law and
restricted for use between the patient and the care giver(s). Secondary use would be
for policy development, quality improvement, research or innovation. The
de-identification of data for purposes of research and policy planning will be
essential for big data to make the promised contribution.
7 Safety
7.1 Opportunities
Health care leaders are committed to delivering safe and effective health care.
Big data could be mobilized to assess whether the outcomes of health care
interventions such as diagnostic tests, rehabilitation, surgical procedures, and other
therapies are being delivered in a safe manner. Are the benefits of the interventions
greater than the harms?
Today’s clinical and administrative leaders are encouraged to communicate and
share complete and unbiased information with their patients and families in ways that
are affirming and constructive. This is why organizations have only recently come to
publicly apologize for negative consequences of a patient’s interaction with the health
care system. There is an opportunity for big data to significantly improve the
industry’s safety record if health care adopts the same no-blame approach of openly
disclosing accidents that the airlines have taken. This type of no-fault approach
has been demonstrated to continually improve the safety record of
the airlines. Openly publishing an organization’s adverse event rates could
help build trust with the community. Big data could provide an opportunity for open,
explicit and transparent reporting of health providers’ safety records.
One approach to this has been the establishment of registries that monitor and report
on the trajectory of interventions and the outcomes associated with them. This
raises the need to handle audits, be transparent with the results and report to the
public. Big data can help to track a number of different variables to help understand
adverse events and how they can be avoided. Big data will require that health care
ontologies, nomenclatures, catalogs, terms and databases be developed and agreed
to. The challenge is that not all professions and leaders will be open to the challenging dialogues necessary to arrive at a standardized and systematic approach. On the cautionary side, Kuziemsky has identified a number of unintended negative consequences of applying big data approaches at the individual, organizational and social levels [3]. Leaders and administrators will need to be sensitive
to context and to include the care providers and patients early in the conversation to
ensure that they are part of the solution.
8 Worklife
Organizations are encouraged to learn how they can effectively improve the climate
of dignity and respect that is generated between and among staff as well as with
patients and leadership. New communication platforms and the crowdsourcing of data could broaden the dialogue between health care leaders and staff and help identify opportunities to improve the work setting. Finding new ways for health care providers to work alongside data scientists who seek to improve health care delivery and to understand the care delivery process could yield important dividends for patients.
Big data may provide an opportunity for organizations to improve worklife by documenting the incidence of workplace injuries and associating this with other variables such as location, time, exposure to infectious agents and other environmental factors. This could formalize and improve organizational performance and staff satisfaction.
Big Data Challenges from a Healthcare Administration Perspective 61
One of the challenges with big data and worklife is that very little is understood about the relationship between the two. Usability studies and systematic reviews of the barriers and critical success factors in the implementation of clinical information systems in health care provider organizations demonstrate that the challenges are rarely technical; more often they are socio-cultural and not well understood. For big data to contribute to the effective improvement of the worklife of health care providers, much more research is needed into the barriers to effective use of information systems. Health leaders should consider adopting usability approaches and methods to determine what would best suit their need for big data to inform their policy and decision making.
9 Client-Centered Services
Health care leaders are looking for ways to engage with their community members
so that their values and perspectives can be understood and used to inform ways to
improve health care delivery to them. One approach that big data provides is using
crowdsourcing as a means of gauging public opinion.
A rapidly growing area of big data is consumer health informatics: the use of patient-generated health data or mobile health to monitor health status. As sensor and machine learning technologies advance and device prices decrease, these devices are becoming more widespread. Research is being undertaken to understand how they might be useful for the maintenance of the patient’s health and the effective delivery of health care. There is also a cautionary note being voiced by Redmond [4] about the need for policy and regulation to be in place to ensure that detailed wearable sensor data is not abused, invading the privacy of individuals or prejudicing them. Other issues arise as to how these data might be usefully integrated into the personal health record.
There is a significant movement toward patient-centered care and coordinating care much more effectively around the patient. A big challenge for big data is that there are currently a large number of ontologies, nomenclatures and database structures in use, which will make it very challenging for these systems to talk to one another.
In a systematic review, Kruse [5] identified nine challenges facing big data: “data structure, security, data standardization, data storage and transfers, managerial issues such as governance and ownership, lack of skill of data analysts, inaccuracies in data, regulatory compliance, and real-time analytics” [5]. Organizations like
ISQUAL, ISO, Accreditation Canada, HL7 and IMIA will need to be encouraged to
collaborate with governing bodies and administrators to arrive at a consensus on
standardized approaches. It will be next to impossible to make sense of big data
unless these foundational blocks are put into place.
10 Continuity
Health care leaders are interested in developing streams of data from patients that
will enable the citizen and patient to better self-manage their health. This encourages shared decision making with their providers and makes virtual care accessible
to them. Health leaders will need to be attentive to “changes for reimbursement for
health care services, increased adoption of relevant technologies, patient engage-
ment, and calls for data transparency raise the importance of patient-generated
11 Appropriateness
Big data could make a significant contribution to providing answers to many of the
vexatious diseases such as type II diabetes, obesity and other chronic diseases;
however, the time from discovery of new knowledge in basic science and its
clinical application can take decades to benefit the patient. The current approach of
using hypothesis-based clinical research, which requires complex and expensive randomized controlled trials, is both resource intensive and time consuming. Big data offers the promise of real-world, data-driven research in which the rigor and internal validity of clinical trials are maintained and confounding variables are accommodated [7]. The objective in many clinical research projects is to reduce the uncertainty about which intervention(s) will result in the most clinically effective outcome. Building longitudinal data sets linking interventions to patient outcomes, and monitoring these over time, would provide a foundation for a continually learning health care system to improve its performance.
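As a sketch of the idea, a longitudinal data set can be modelled as time-ordered events per (pseudonymous) patient, from which the change in outcome following an intervention can be derived. The event fields, interventions and outcome scores below are hypothetical:

```python
from collections import defaultdict

# Illustrative event tuples: (patient_id, date, intervention, outcome_score).
# All values are invented for the example.
events = [
    ("p1", "2020-01-05", "hip replacement", 0.61),
    ("p1", "2021-01-05", "hip replacement follow-up", 0.78),
    ("p2", "2020-03-12", "hip replacement", 0.55),
    ("p2", "2021-03-12", "hip replacement follow-up", 0.70),
]

# Longitudinal view: each patient's events ordered in time.
timeline = defaultdict(list)
for pid, date, intervention, score in events:
    timeline[pid].append((date, intervention, score))
for pid in timeline:
    timeline[pid].sort()  # ISO dates sort chronologically as strings

def outcome_change(patient_events):
    """Change in outcome score from first to last recorded event."""
    return patient_events[-1][2] - patient_events[0][2]

changes = {pid: round(outcome_change(ev), 2) for pid, ev in timeline.items()}
```

Aggregating such per-patient trajectories across an intervention is what would let a learning health system monitor whether an intervention is delivering the outcomes it promises.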
Big data promises to be a significant support to this effort by putting a repository of the world’s medical knowledge at the physician’s fingertips through artificial intelligence and machine learning programs such as Watson [8]. Prompts and reminders linking the patient’s condition through the personal health record to the literature can potentially improve safety and outcomes for patients. The mapping of the human genome and the targeting of interventions based on patients’ risk factors promise improved outcomes.
The promises identified above are powerful and engaging, but our current infrastructure does not permit progress unless significant challenges are addressed. To begin with, appropriateness of care is not only a technical question; it is also a social, political and moral question. Health care leaders will need to be sensitive to and respectful of citizen and patient preferences.
To be successful in moving toward real-world trials, there will need to be a strong linkage and integration of data and practice between the delivery of health care and the research enterprise. There are currently significant challenges in linking data between funders, providers and the institutions that deliver care. A major challenge for governments and their agencies will be to identify how they can provide citizens and patients the safeguards they require without inhibiting their opportunity to enroll in clinical trials of their choosing. One strategy for leaders to address this challenge is to open the door for research institutions such as universities, and for clinical trialists, to work alongside their provider colleagues.
Big data may be able to traverse the gaps between the data points in a health record and the information a patient or health care provider holds, but the final leap is to link the specific patient’s condition to an evidence base of clinical interventions: documenting the patient’s condition in real time, recording the therapeutic interventions, and monitoring the outcomes, so that the trajectory of the patient’s clinical course contributes to a continually learning system of care delivery. The individual patient would benefit from the cumulative experience and, in turn, their documented experience and results would enter the database and help inform future clinical decisions.
Big data cannot deliver on this promise unless there is complete consensus on standards and a commitment by patients, health care provider organizations, and the professions to share this information among themselves. Murphy holds out hope, stating: “A new architecture for EMRS is evolving which could unite Big Data, machine learning, and clinical care through a microservice-based architecture which can host applications focused on quite specific aspects of clinical care, such as managing cancer immunotherapy … informatics innovation, medical research and clinical care go hand in hand as we look to infuse science-based practice into healthcare. Innovative methods will lead to a new ecosystem of applications (apps) interacting with healthcare providers to fill a promise that is still to be determined” [9]. Watson is an attempt to build a machine learning capacity to bring this promise of big data to life, but the ingredients are far from being able to deliver in the real-world setting. Big data will rely on health leaders coming to a strong consensus on information sharing, in partnership and collaboration, for this to become a reality.
The cost of health care is a major concern for leaders. Big data holds the promise to
support more effective and efficient management of resources. Unmet needs could
be identified, access and quality of interventions could be improved, and the con-
nection between interventions and outcomes could be determined and acted upon.
Tradeoffs between programs to achieve the optimal outputs and outcomes from the
financial investment could be made.
Reduction of waste by identifying and removing ineffective, unsafe or harmful
interventions, technologies or services is another promise big data can make.
The continual learning system approach could stimulate the shift of resources
among programs and financial silos to test various hypotheses for care delivery to
improve efficiency of the health system.
Big data could be useful in identifying ways to incentivize behavior within programs or reimbursement systems to achieve the best patient and population health outcomes. Introducing disincentives could help eliminate poor practices and behaviors. Experiments in which citizens and patients are provided with the financial means to pursue their optimal health-seeking behavior could be assessed through more effective linkage between interventions and outcomes. Contracting and procurement decisions of health care systems could be reoriented toward health authorities paying for services based on the value received rather than the products delivered. This would refocus our thinking from being input and output oriented toward linking outputs to promised patient outcomes. The emergence of blockchain technology as a means to track transactional elements from acquisition to patient impact could be facilitated through big data.
Efficiency is formulaic: it addresses the relationship between the cost of inputs and the processes that result in program and patient outcomes. Health care leaders are accountable for the services delivered and the outcomes achieved, through budgeting and planning processes and reporting on the results. New funding to address opportunities for innovation is constrained by the attempts of governments to bend the cost curve downward. Leaders are driving into the future while looking into the rear-view mirror. The rapidity with which technological and clinical innovation is accelerating into the care environment renders the current approach ineffective.
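One standard way to make the input-to-outcome relationship explicit is the incremental cost-effectiveness ratio (ICER): the extra cost of a new program divided by the extra health outcome it yields. A minimal sketch, using hypothetical program figures:

```python
def icer(cost_new, effect_new, cost_old, effect_old):
    """Incremental cost-effectiveness ratio: extra cost per extra
    unit of outcome (e.g. dollars per QALY gained)."""
    return (cost_new - cost_old) / (effect_new - effect_old)

# Hypothetical programs: cost in dollars, effect in QALYs per patient.
# The new program costs $4,000 more and yields 0.3 more QALYs,
# so it costs roughly $13,333 per QALY gained.
ratio = icer(cost_new=12_000, effect_new=1.4, cost_old=8_000, effect_old=1.1)
```

Comparing such ratios against a willingness-to-pay threshold is one way tradeoffs between competing programs can be made explicit rather than implicit.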
Big data offers a solution to this conundrum, but it comes with significant risks.
Although health care leaders may recognize that there are services and programs that should be phased out, there will be political forces with a desire to maintain the status quo because their employment, income stream and/or security depends on them.
Big data can provide leaders with the contemporaneous data required to address these issues explicitly, through open information sharing, partnership and collaboration with their health care providers and patients, to ensure that change management strategies are developed and implemented so that inefficient forms of program delivery are smoothly replaced with more efficient ones.
13 Concluding Remarks
Health care leaders will need to consult widely and exercise a strong will to work
collaboratively with their partners to use big data effectively to improve health care
delivery. The promise of big data is enormous; however, the risks associated with the uncritical deployment and application of big data are not to be ignored. Leaders must become proactive in putting in place the infrastructure, standards, and capacity to effectively harness the power of big data to benefit the health of citizens.
Bellazzi reminds us: “The way forward with the big data opportunity will require
properly applied engineering principles to design studies and applications, to avoid
preconceptions or over-enthusiasms, to fully exploit the available technologies, and
to improve data processing and data management regulations” [10]. Leaders will
need to be very vigilant to ensure that their approaches and uses of big data are
accurate and true. Nothing will erode the confidence of citizens more quickly than
data that is false and untrustworthy.
References
7. Martin-Sanchez F, Verspoor K (2014) Big data in medicine is driving big changes. IMIA
Yearb Med Inform
8. Kohn MS, Sun J, Knoop S, Shabo A, Carmeli B, Sow D, Syeda-Mahmood T, Rapp W (2014)
IBM’s health analytics and clinical decision support. IMIA Yearb Med Inform 154–162
9. Murphy S, Castro V, Mandl K (2017) Grappling with the future use of big data for
translational medicine and clinical care. IMIA Yearb Med Inform 96–102
10. Bellazzi R (2014) Big data and biomedical informatics: a challenging opportunity. IMIA
Yearb Med Inform 8–13
Big Data Challenges from a Healthcare
Governance Perspective
Donald W. M. Juzwishin
1 Introduction
“Water, water everywhere, Nor any drop to drink” [1]. In The Rime of the Ancient Mariner, Samuel Taylor Coleridge describes a sailor, stranded on a ship, surrounded by water that he cannot drink to quench his thirst. His survival depends on water being in a form that can sustain life; in its current form it would hasten his death.
The sea of data, information and evidence we are swirling in recalls the plight of the sailor. At one level the citizen is awash in health data, information and knowledge about health and health care delivery; however, access to health care and the distribution of health outcomes are suboptimal. At the moment, comprehensive personal health care data is rarely readily accessible to the citizen because it is institutionally owned. The contemporary patient or citizen is analogous to the sailor: awash in data, but not in a form that sustains personal health.
The global repositories of health data, information and knowledge are growing
exponentially and differentiating between truth and myth is becoming increasingly
challenging. The blurring of lines among inaccurate data, misinformation and pseudo-knowledge in policy and decision-making can lead to significant negative consequences for citizens and society. Governance bodies must be prepared to review and critically assess the veracity and merits of the data, information and evidence
emerging. The growth of new forms of data contributing to big data, from social media, sensor and surveillance technologies, financial transactions, localization and movement data, the human genome and the Internet of Everything, will further exacerbate the challenges for governance.
Having anticipated the rise in the prominence of big data, the yearbook of the
International Medical Informatics Association in 2014 dedicated the entire volume to
the theme, “big data—smart health strategies” [2]. The contributors examined a
wide range of topics identifying opportunities and challenges associated with big
data in healthcare delivery. To date it serves as the most comprehensive and
high-quality examination of the subject.

D. W. M. Juzwishin (&)
University of Victoria, Victoria, BC, Canada
e-mail: djuzwishin@uvic.ca
Absent from that work, however, was a description and analysis of the impact
that the emergence of big data has and will have for the governance of health care
systems. Is big data a hope for the future of governance? Is it big hype? Can it
provide a platform for an effective use of health data to improve the outcomes of
citizens and the effective delivery of services? What are the opportunities and the
challenges? This chapter will attempt to redress the gap in the literature and provide
a way forward.
This chapter is not about best practices for healthcare data governance. Our attention is directed toward how best practices in the governance of healthcare systems can successfully address the challenges and risks of the indiscriminate use of big data. Having established that big data is a new and promising concept, we must also recognize that it threatens several fundamental values of society: Who owns the personal health record? Who has access to it? How is access to be controlled? How do governing bodies use it to achieve their objectives in the interests of citizens and patients?
In this chapter we will:
• Define and identify the legal and regulatory frameworks, as well as the values, that provide opportunities but are also threatened by big data;
• Identify the standards and best practices that governance bodies and adminis-
trators aspire to;
• Identify the opportunities of using big data;
• Identify the challenges to the effective use of big data;
• Provide guidance on how big data can be exploited by governance bodies for the
benefit of the citizens and the health care system.
3.1 Definitions
For the purpose of this chapter we define big data as the total accumulation of all
past, current and emerging health data, information and knowledge that can be
usefully applied to govern and manage the health care delivery system for the
citizens of society.
Table 1 Functions, definitions and mechanisms (excerpted from HSO, pp. 3–4)

Function/entity: Governance
Guideline: The governing body is accountable for the quality of services/care, and supports the organization to achieve its goals, consistent with its mandated objectives and its accountability to stakeholders
Mechanisms: Acts, regulations, licenses, privileges, scope of practice

Function/entity: Governing body
Guideline: The body that holds authority, ultimate decision-making ability, and accountability for an organization and its services. This may be a board of directors, a Health Advisory Committee, a Chief and Council, or other body
Mechanisms: Bylaws; health profession legislation; medical staff bylaws
The promise of big data for governance bodies appears, on the surface, enormous. Many of the promises are theoretical: they appear conceptually sound, but beyond some very early results from high-quality studies few have delivered on the promise. In this section we want to ensure that we have thought carefully through the unintended negative consequences of big data.
Big data will not have its potential realized for the health care system unless significant changes are made, in a thoughtful and systematic way, to accommodate the requirements of big data. Big data could become the greatest nightmare for governance bodies if they are not able to come to terms with how to harness its potential in service to the community. Breaching the confidentiality of patients and citizens is a significant risk that governing bodies and administrators must be prepared to address.
It would be wise to heed the words of Niccolo Machiavelli:
It ought to be remembered that there is nothing more difficult to take in hand, more perilous
to conduct, or more uncertain in its success, than to take the lead in the introduction of a
new order of things. Because the innovator has for enemies all those who have done well
under the old conditions, and lukewarm defenders in those who may do well under the new.
This coolness arises partly from fear of the opponents, who have the laws on their side, and
partly from the incredulity of men, who do not readily believe in new things until they have
had a long experience of them. [4]
5 Population Focus
Big data promises several opportunities for governance entities to work effectively to
identify and anticipate the health care needs of the community. Big data could help
healthcare providers comply with the standards of health care delivery through
public monitoring and reporting on their performance. Public health surveillance is
an approach that facilitates the government, governance bodies, and health care
providers gaining a good understanding of what health needs are and how they could
be met. Health authorities and government departments of health could prepare their
planning, programming and funding based on health surveillance data. Health care
provider organizations could also survey their community members through social
media platforms and crowdsourcing to understand what their health needs are.
Big data could be useful to deal with disasters such as tornados, tsunamis,
earthquakes, fires and floods that arise unexpectedly and require government and
health care organizations to respond. Databases identifying the location of citizens,
particularly those who are in danger and vulnerable to the threat, would be useful.
Part of the opportunity and difficulty that big data will face in helping with the
transformation of the system is that governing bodies do not regularly collect
outcomes data for the citizens or patients they provide service to. They count the
number of emergency visits, the number of surgeries or the number of patient days.
They rarely have data on the short-term or long-term consequences of the interventions and their impact on the health status of those patients. Big data could begin to identify ways to link identified needs, interventions, outputs and outcomes, but this will require a new set of metrics: patient-oriented outcome measures such as the EQ-5D. These will need to be introduced as a regular follow-up to all health care interventions. Some research and innovation activity is beginning to recognize the importance of using outcome measures to assess the clinical and cost effectiveness of newly introduced health care interventions; in fact, it has become a condition of funding.
6 Accessibility
Accessibility for the citizen means getting the health care that they need when they
need it. In publicly funded health care systems citizens expect timely and equitable
access to health care services. The citizen’s ability to pay for service is never to be a
barrier to access medically necessary services. In reality, because of the limited
resources available to fund healthcare services there is very little slack in the
system. Throughput is optimized by differentiating among levels and types of care (emergent, urgent and elective), with the view that, in the public interest, queuing provides a way to maximize resource utilization by smoothing out a stochastic production function. This leads to some of our contemporary issues with waiting lists, for example, lists for surgical procedures and long-term care facilities, as well as queues in emergency departments. Big data may provide a means to improve access.
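The triage-based queuing described above can be sketched as a priority queue: emergent cases are always served before urgent, and urgent before elective, while arrival order breaks ties within a level. The class and acuity labels below are illustrative:

```python
import heapq

# Lower number = higher priority; the labels are illustrative acuity levels.
PRIORITY = {"emergent": 0, "urgent": 1, "elective": 2}

class TriageQueue:
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker preserving arrival order within a level

    def arrive(self, patient, acuity):
        heapq.heappush(self._heap, (PRIORITY[acuity], self._count, patient))
        self._count += 1

    def next_patient(self):
        return heapq.heappop(self._heap)[2]

q = TriageQueue()
q.arrive("A", "elective")
q.arrive("B", "emergent")
q.arrive("C", "urgent")
q.arrive("D", "emergent")
order = [q.next_patient() for _ in range(4)]   # B, D, C, A
```

The elective patient waits longest, which is exactly the smoothing behavior, and the source of the waiting-list issues, that the paragraph describes.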
Big data cannot be successful in addressing the sharing of data unless legislation,
regulations and policies are revised to encourage integration without compromising
the security, privacy and confidentiality of health data and information [5]. The
responsibility for the personal health record and electronic health record must be
turned over to the citizen and patient. They, in consultation with their family and
health care provider, decide who should have access to data for the primary use of
that data. Until this is done, inter-institutional interoperability will be a challenge.
7 Safety
Governing boards cannot claim to do no harm to patients unless they can publicly
declare adverse events with openness and transparency to the public. Big data
cannot make an inroad in advancing our societal understanding of how adverse
events occur and how they are remedied unless the governing bodies are prepared
to share the information. In the past, fear of litigation has prevented governing bodies from making this information public. However, with appropriate safeguards to ensure anonymity and a positive approach to quality improvement, it has been demonstrated that adverse events can be addressed through a non-accusatory approach to the health care providers involved, one that contributes to continual quality improvement supported by continual learning for the organization and for the health care providers.
8 Worklife
Big data may help to create a life-long learning capability for health care organizations and health care professionals, but this will require a critical approach to identifying which best practices are legitimate and should be adopted. Structures and processes will have to be developed and implemented in organizations to identify, assess, and apply best practices. Information and decision support systems will need to be developed to ensure there is a continual iterative loop between the experience gained from interventions and the lessons learned, so that health care providers improve their practice. Public transparency of these experiences will be necessary to ensure that public trust is maintained.
9 Client-Centered Services
Health care provider organizations and professionals are expected to identify ways
to partner with the citizens and patients and their families in their care. Big data
offers the promise of allowing patients to choose how they use their own health care
data for self-managed care. Currently data is situated in isolated repositories with
little opportunity for interoperable linkage. Health care organizations and providers
are required to establish and populate the electronic health record for the purpose of
providing health care to the patient. Governing bodies are responsible for ensuring
the security and confidentiality of that information. There is, however, no legal framework that would encourage governing boards to share the data with either the patient or with other organizations that could use it effectively for the benefit of the patient. When the patient is admitted to hospital, they consent to treatment and to the collection of information for their care, but that information is not to be shared with anyone else without their permission. This creates an untenable situation for the ubiquity of health data, where expectations do not match interoperable capability. One proposal for addressing this issue is to give citizens and patients ownership and access, and to allow them to determine who gets access and when.
There is a significant movement toward patient centered care and coordinating care
much more effectively around that patient. Coordinated care is also being mobilized
toward integrated care delivery. Governing bodies must be aware of the challenges
of attempting to provide integrated care when the organization structures and
processes do not easily accommodate it. Rigby and colleagues point out “new
interactive patient portals will be needed to enable peer communication by all
stakeholders including patients and professionals. Few portals capable of this exist
to date. The evaluation of these portals as enablers of system change, rather than as
simple windows into electronic records, is at an early stage and novel evaluation
approaches are needed” [6].
It will be very difficult to link these disparate data sources together in a meaningful way; some degree of standardization will be necessary. Governing bodies will need to facilitate the development of standardized ontologies, catalogues and nomenclatures for databases so that information about the individual patient can be linked to other databases where other forms of information reside.
10 Continuity
Governance bodies are responsible for coordinating the care of citizens and patients across the continuum of care, which ranges from cradle to grave.
This involves the delivery of services spanning health promotion, disease prevention, emergency and acute care, rehabilitation services, long-term care, community care, public health and palliative services. Historically these services were delivered by
independent agencies and organizations with their own governance bodies. In
current times the health reform movement is consolidating governance and administrative responsibilities in order to more effectively integrate, coordinate and collaborate on the health care delivery enterprise. The regulations and rules around access to and use of health data have not kept pace with the structural and functional reforms underway. The result is that public expectations of continuity of health data across health care providers are not being met. Big data could serve to close this gap.
11 Appropriateness
“Do the right thing to provide me with the best results” is the dictum driving appropriateness in the health care system. The ascension of big data intimates that knowing what is appropriate may be well established. Science and medicine have provided answers to many of the diseases that face humanity; however, there remain many diseases and conditions for which the “right thing to do” is an open question. Many health care interventions have a degree of uncertainty associated with their outcomes. Big data may help reduce that uncertainty through rigorous probabilistic analysis.
The public expects that health care funds will be spent to achieve the greatest
health benefits and value for society. Governing bodies are held to account for
making optimal decisions about the use of the resources at their disposal, both by
the public and by the government that provides them with funding. Opportunity
cost dictates that spending money on one thing in health care means that those
funds are not available for other health expenditures; spending on one health
benefit forecloses a competing benefit, which may be greater. Interests within the
health care system will compete for resources, sometimes losing sight of what is
best for citizens or patients. Governance and administration must make decisions
that balance these competing interests.
Big data does appear to be a powerful approach and tool for governing bodies
and administrators to extract efficiencies from the health care system. It offers a
solution to this conundrum, but it comes with significant risks. Although
governance bodies may recognize that there are services and programs that should
be taken out of service, there will be political forces with a desire to maintain the
status quo because their employment, income stream and/or security depend on
them.
To address these issues, governance bodies will be required to engage in open
and explicit information sharing, partnership and collaboration with their health
care providers and patients, ensuring that change management strategies are
developed and implemented so that inefficient forms of program delivery are
smoothly replaced with more efficient ones.
13 Concluding Remarks
government and other interests toward a recognition that the values and expecta-
tions of society are changing. Recalling Machiavelli’s dictum, there are, however,
serious perils for those leading the changes necessary. This chapter has highlighted
many of the pitfalls that citizens, patients, politicians, policy makers and health care
providers may succumb to through an indiscriminate and uncritical approach to big
data. The best strategy for maximizing the promises of big data is to be aware of the
pitfalls and to plan accordingly.
Governance bodies must avail themselves of trusted data, information and
knowledge, as these are the best vaccine for speaking truth to power and avoiding
policies and decisions based on incompetence, confusion or malicious intent. The
public interest must be safeguarded from these threats. Governments at all levels
(national, state/provincial, municipal/local) must be prepared to establish the political
institutions and instruments that protect the public interest in the storage, linkage
and application of big data. Principled standards of best practice should be
encouraged and developed at the global level so that countries with less capacity
and capability can benefit from those with more. Governance must be prepared to
collaborate in an all-of-government approach, putting in place enabling legal,
regulatory, policy, standards and guidance frameworks that weigh health data,
information and evidence in order to balance competing interests through reasoned
deliberation. These deliberations must be held in open, explicit and transparent
public settings, as recommended in the Accreditation Canada standard below:
Communicating and sharing complete and unbiased information with clients and families in
ways that are affirming and useful. Clients and families receive timely, complete, and
accurate information in order to effectively participate in care and decision-making.
(Accreditation Canada 2018) [3]
Governing bodies are the entrusted stewards of the public’s health. Our
responsibility is to provide them with the means to harness the promise of big data
and to avoid its negative consequences.
T. S. Eliot’s words may best express the challenge we face:
Where is the Life we have lost in living? Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information? [7]
I would add: where are the information, knowledge and wisdom we have lost in
big data?
References
1 Introduction
Big data promises to increase patient safety if vast amounts of relevant health
related data can be brought to bear in aiding decision making, reasoning and
promotion of health. For example, the advent of clinical decision support systems
that can apply best practice guidelines, alerts and reminders through continual
analysis of large repositories of patient data (e.g. running behind the scenes
checking patient records for adverse combinations of medications and flagging
problems) has been shown to increase patient safety [1]. As patient data increases
(as contained in patient record systems, data warehouses and genomic databases),
automated methods for scanning and checking health data for anomalies, issues and
health problems have proven to be an important advantage of digitizing health
information [2]. Improving personal health through the integration of various forms
of personal health data will require applications that can process large amounts of
adverse event data; such applications hold considerable promise for improving
patient safety [2, 3]. Indeed, the advantages of the coming personalized medicine trend will
require big data coupled with new ways of automatically analyzing data. Such
advances promise to increase the effectiveness of treatment, management and
ultimately patient safety [4]. However, as the size of this data increases, the quality
and correctness of data collected using these new methods will become an
increasing concern [5–8]. In addition, big data can be
collected for the purposes of checking and improving data quality and reducing the
chance of technology-induced error—error that may be inadvertently introduced by
information technology itself [5, 6]. New ways of documenting and responding to
such error will be needed as the era of big data dawns [5]. One approach to
achieving this is by developing error reporting systems that can report on errors and
2 Motivation
There has emerged a need to collect data about the safety of health information
technology (HIT) with the objective of improving the quality and safety of the
technologies patients and health professionals use in the process of providing and
receiving health care. With increased technological advances, the potential for
inadvertent introduction of error due to technology and in the data stored in large
databases will increase [3, 5, 6]. Technology-induced errors are errors that result
from the complex interaction between humans and machines [5]. Such errors may
manifest themselves as incorrect use of technology, errors in decision making as a
result of using technology, and resultant errors in data stored and accessed in electronic
repositories. To address this growing concern, some researchers have repurposed
existing databases which were created to document medical and medication error to
also include documenting of technology-induced errors [3, 5, 6]. Other researchers
have begun collecting data about technology-induced errors either as an adjunct to
existing data collection approaches or in developing new methods for collecting data
created by the HIT themselves as they are used by patients and health professionals
during the process of patient care [2–8]. Much of this work parallels research that has
been done in areas such as aerospace, where data about aircraft failures and issues are
entered and accessed globally in an effort to increase air travel safety.
In this book chapter, the authors will discuss how technology-induced errors are
being managed and analyzed using existing sources of data (i.e. large data repos-
itories that collect data about patient safety incidents in healthcare) and also how
data being collected by HIT can be used to improve the quality and safety of
healthcare technologies and healthcare itself.
state or provincial or national level. Such incident reporting systems collect data
across facilities and regions and are available for fine grained analysis of errors
involving technology (i.e. technology-induced errors). These data repositories have
been used to provide valuable insights into how errors can emerge and can prop-
agate throughout a healthcare system [3].
Researchers in Australia [9], Finland [3], China [10] and the United States of
America [11] have effectively used data from incident reporting systems to learn
about how technology-induced errors occur so that future events can be avoided.
Their work has involved reviewing individual incident reports for the presence or
absence of a technology-induced error, coding the data using taxonomies specific to
technology and errors, and analyzing the data for patterns of technology-induced
error occurrence to inform technology-specific strategies aimed at preventing errors
and to examine data for patterns that inform organizational learning at a broad level
(e.g. regional health authority, national and international level) [3, 9].
Horsky and colleagues [12] used incident reporting data to conduct fine grained
analyses of technology-induced errors to develop a more comprehensive insight
into the events that led to the error. Here, Horsky reviewed the initial incident report
and developed a comprehensive strategy for understanding how the technology,
organizational environment and the people who were involved in the incident
interacted, and how this led to patient harm. In this work the researchers were able to
provide a report outlining recommendations for their institution aimed at preventing
future errors such as modifying the interface of the electronic health record system,
providing training for physicians to deal with unusual situations, and developing
new organizational policies and procedures [10].
Magrabi [11] and Palojoki and colleagues [3] analyzed data about
technology-induced errors found in incident reporting systems. After reviewing
incident reports and coding their data, the researchers analyzed the reports to
provide information about overall trends in the types of errors that are occurring and
the types of technologies that were involved. For example, Magrabi and colleagues
analyzed reported events that were stored in the US Food and Drug Administrative
Manufacturer and User Facility Device Experience (MAUDE) database [11]. Some
of this work also involved in-depth analysis of the data to understand where these
types of errors occur most often (e.g. in an emergency department or the intensive
care unit) (Palojoki et al.) [12]. Palojoki et al. [13] extended this work by collecting
additional data in the form of health care professional surveys. The researchers
developed a survey tool that asks health professionals about their experiences
involving technology-induced errors. Here, the survey data helped to inform
incident report analyses. The results of their work indicated that almost half of the
respondents to their survey reported a high level of risk related to a specific
error type they termed “extended electronic health record unavailability”. Other
risks included problems such as a tendency to select incorrectly from a list of items
(e.g. in selecting from a list of medications). In related work, Palojoki and col-
leagues found human-computer interaction problems were the most frequently
reported [12].
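The coding-and-tallying workflow described above can be sketched in a few lines of code; the error-type labels, taxonomy categories and reports below are invented for illustration and are not drawn from the cited studies or their actual taxonomies.

```python
from collections import Counter

# Hypothetical mapping from error-type labels to taxonomy categories.
TAXONOMY = {
    "wrong_item_from_list": "Human-computer interaction",
    "interface_confusion": "Human-computer interaction",
    "ehr_unavailable": "System availability",
}

def code_reports(reports):
    """Assign each incident report a taxonomy category based on its
    error-type label, then tally how often each category occurs."""
    tally = Counter()
    for report in reports:
        category = TAXONOMY.get(report["error_type"], "Uncoded")
        tally[category] += 1
    return tally

# Illustrative incident reports (not real data).
reports = [
    {"id": 1, "error_type": "wrong_item_from_list"},
    {"id": 2, "error_type": "ehr_unavailable"},
    {"id": 3, "error_type": "interface_confusion"},
    {"id": 4, "error_type": "hardware_failure"},  # not in taxonomy
]

counts = code_reports(reports)
print(counts.most_common(1))  # → [('Human-computer interaction', 2)]
```

In this toy tally, as in the Finnish findings reported above, human-computer interaction problems come out as the most frequently coded category; real analyses apply far richer taxonomies to free-text reports.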
88 E. M. Borycki and A. W. Kushniruk
Kaipio et al. [14] employed large scale surveys (deployed online to thousands of
physicians) to learn about safety issues involving electronic medical records in
Finland. Kaipio added several strategic questions on health information technology
safety to an existing national survey about electronic medical record usability and
workflow, providing some preliminary insights into this area. The survey was
deployed in Finland with an invitation to all physicians.
The results of the survey indicated that physicians were very critical of the usability
of the electronic health record systems they were using. The survey also provided
detailed information about what usability problems were being encountered by
users of the main vendor based system available in Finland. In a follow-up study
also conducted at the national level in Finland two years later, it was found that
users’ impressions of the systems they were using had not substantially improved.
This pioneering work will ultimately lead to the collection of large amounts
of data on the usability and safety of healthcare systems, as other countries
begin to deploy similar online questionnaires [14]. It will be used to provide
feedback at multiple levels, including to vendors, national organizations and policy
makers. Other approaches that involve collection of usability and use data of sys-
tems such as electronic health records will also lead to collection of big data on
usage information. This information can be used by health regions and authorities
to identify how electronic resources are being used, potential bottlenecks and areas
where further analysis is needed [15].
4 Challenges
There are a number of challenges when dealing with big data related to improving
the safety and quality of healthcare processes and information technologies. Much
of the current collection of large databases of error information is based on
voluntary incident reporting by end users of systems (e.g. doctors, nurses,
pharmacists, etc.) [3, 9, 11]. This will need to be augmented by systems that allow patients and
citizens to enter information about errors [8]. In addition, many technology-induced
errors go undetected by the end user committing the error, and thus are not reported
[5]. This has required use of laboratory studies (i.e. clinical simulations) to analyze
when such error might occur, along with use of computer simulations to extrapolate
how frequently they would occur in the larger healthcare context. This work also
moves the focus from reporting on errors to preventing them. Along these lines,
automated methods for detecting error such as medication errors and
technology-induced errors will be needed [16]. Data mining and application of
predictive analytics using a growing database of patient data and information
contained in electronic health records will be needed to detect patterns that indicate
error and safety issues. For example, with the advent of wireless devices in hos-
pitals, methods for ensuring the data transmitted from one device to another is
correct and error free will become essential (which could involve approaches from
applied artificial intelligence). In addition, given that many of the information
Big Data and Patient Safety 89
systems in use today are deployed across multiple countries, there will
be a need for cross-border collection and sharing (interoperability) of data on
technology-induced errors. Finally, “big data” does not necessarily mean “good” or
“correct” or “useful” data. “Garbage in—garbage out” is an old computer science
adage expressing the fact that merely having data is not enough: if the data entering a
health information system is incorrect or spurious, then the decisions coming out
will be flawed and will reduce patient safety. Therefore, as our
health databases grow and become more complex, greater emphasis will need to be
placed on data integrity and the safety of our healthcare systems and big data will
play a major role in this trend.
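The kind of behind-the-scenes scanning described above, and in the introduction (checking patient records for adverse combinations of medications and flagging problems), can be sketched as a simple rule-based check. The interaction pairs and patient record below are assumptions made for the sketch, not clinical rules.

```python
# Illustrative adverse medication pairs (assumed for this sketch only;
# not clinical advice and not an actual interaction knowledge base).
ADVERSE_PAIRS = {
    frozenset({"warfarin", "aspirin"}),
    frozenset({"ssri", "maoi"}),
}

def flag_adverse_combinations(record):
    """Scan one patient's medication list and return every pair of
    medications that matches a known adverse combination."""
    meds = [m.lower() for m in record["medications"]]
    flags = []
    for i, first in enumerate(meds):
        for second in meds[i + 1:]:
            if frozenset({first, second}) in ADVERSE_PAIRS:
                flags.append((first, second))
    return flags

# A hypothetical patient record.
record = {"patient_id": "p-001",
          "medications": ["Warfarin", "Aspirin", "Metformin"]}

print(flag_adverse_combinations(record))  # → [('warfarin', 'aspirin')]
```

A production decision support system would of course draw its rules from a curated drug-interaction knowledge base and run continuously over large repositories, but the flag-and-report structure is the same.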
Researchers have suggested that big data will lead to improvements in patient
safety. One area of concern involving health information technologies is the ability
of some of these technologies to introduce new types of errors. Errors that arise,
when health professionals use systems in the process of providing patient care, are
referred to as technology-induced errors. Currently, technology-induced error data
is being collected in incident reporting systems that reside in national, provincial,
regional and hospital specific databases, and by researchers who are developing and
deploying national surveys aimed at improving the quality and safety of health
information technology. There are many challenges associated with analyzing data
captured by incident reporting systems. The quality of these datasets has been
critiqued. Many incident reporting systems rely on voluntary reports by health
professionals, and only a subset of the incidents documented in these systems
involves technology-induced errors. Future work involving big data will
need to focus on patient reported patient safety incidents and detecting patterns of
errors and safety issues from collected data.
References
1 Introduction
The collection and analysis of ever increasing amounts of healthcare data promises
to revolutionize and transform healthcare. Voluminous personal health data, fitness
data, genomic data, epidemiological data and other forms of health data are being
generated at an unprecedented rate and this trend will continue [1]. While advances
are being made in the automated collection and analysis of big data to keep up with
the generation of data, using machine learning, data mining and artificial intelli-
gence techniques, the issue of the human factor in all these developments still
remains central to the question of whether such large and complex collections of
data are useful and effective in helping to improve healthcare decision making and
processes. The impact of big data ultimately depends on human factors related to
effective access, use and application of such large data repositories to solve com-
plex and real healthcare problems and meet the information needs of health pro-
fessionals, healthcare management and ultimately patients. Indeed, the potential for
voluminous collection of data can easily lead to the phenomenon known as cognitive
overload, whereby the limited cognitive processing capability of humans is
overwhelmed by the amount or complexity of data. Health data needs to be
collected, accessed and utilized by health professionals, patients and lay people in a
way that is understandable, effective and meets underlying information needs.
Collecting large amounts of data without considering the human factors
involved in its use and its interaction with human end users is unlikely to lead to
improved healthcare; this must be taken into account by those designing,
implementing and deploying large data sets, interfaces to big data, and decision
support systems that use big data with the objective of improving healthcare.
Along these lines the issue of the usability of healthcare information systems has
come to the fore in health informatics more generally. Usability can be considered a
measure of ease of use of a system, user interface, data or technology in terms of its
effectiveness, efficiency, enjoyability, safety and learnability [2]. The principles that
have emerged from the field of usability engineering argue for the introduction of
technology that is both usable and useful to end users (e.g. physicians, nurses,
pharmacists, patients, lay people etc.) in helping to solve some real problem, make a
decision or reason about health issues. Nowhere is the concept of usability and the
need for consideration of human factors more germane than in the area of big data.
Indeed, failures of big data to achieve its promise have in many cases been directly
attributed to a lack of consideration of human factors, and more specifically,
usability of the systems, data or support provided to end users in the attempt to help
them. Therefore, considering the human factors of big data is an important and
essential topic that will not go away, but rather will become more and more critical
as the amount and complexity of data in healthcare continues to exponentially
increase over time.
contexts, including the development of data warehouses and data marts, Kushniruk
and Turner have proposed a framework to characterize user needs known as the
User-Task-Context matrix [7]. This framework has been used for helping to design
interfaces to a variety of big data applications, including personal health applica-
tions and interfaces to large organizational data warehouses.
The three dimensions of the model are the following: (a) the User, (b) the Task,
and (c) the Context of Use. For example, along the user dimension of an envisaged data
warehouse the categories of users corresponding to clinicians, statisticians, and
healthcare organization management might be identified from initial system
requirements. Each of these user types or classes could be further delineated in
terms of their information needs and requirements, creating a user profile for each
class of user. The task dimension refers to the different type of user interactions that
a system might support. For example, in the case of a data warehouse this might
include providing information and specific reports to support management rea-
soning about resource allocation in a health region, or identification of disease
concentrations. Finally, the third dimension is context that refers to the setting or
context of use of the data warehouse, for example, in the clinical setting, or in the
context of hospital managers making organizational decisions (Fig. 1).
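A minimal data-structure sketch of such a matrix might look like the following; the user classes, contexts and tasks are hypothetical examples chosen for illustration, not the framework's canonical values.

```python
# A hypothetical User-Task-Context matrix for an envisaged data warehouse:
# for each (user class, context of use) pair, the tasks to be supported.
utc_matrix = {
    ("clinician", "clinical setting"): [
        "look up patient history", "review alerts"],
    ("statistician", "analysis office"): [
        "export cohort data", "run epidemiological reports"],
    ("manager", "organizational decisions"): [
        "resource allocation reports", "identify disease concentrations"],
}

def tasks_for(user, context):
    """Return the tasks identified for a given user class in a given
    context of use (empty if the cell has not been filled in)."""
    return utc_matrix.get((user, context), [])

print(tasks_for("clinician", "clinical setting"))
# → ['look up patient history', 'review alerts']
```

Filling out each cell of such a structure during requirements gathering makes explicit which reports, displays and interactions each user group needs, which is the role the matrix played in the data warehouse example that follows.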
In one example of application of the User-Task-Context matrix, a group of
potential end users of a data warehouse project for a regional health authority met to
arrive at an architecture for the warehouse. The framework from the
User-Task-Context Matrix was used to drive the requirements gathering through
delineation of: (a) the different user groups who would be using the data warehouse
(b) the type of tasks and information needs of each of the different user groups
(including types of reports and displays required) and (c) the different context of use
of the data warehouse (e.g. for optimizing local clinical decision making, for
making large-scale organizational decisions etc.). The design and organization of
both the back-end of the data warehouse as well as the user interface and user
interactions were designed based on the results of filling out details in the matrix
(regarding its 3 dimensions), to maximize the impact and usefulness of the big data
that ultimately were contained in this large regional data warehouse.
Fig. 2 Knowledge translation and the bioinformatics pipeline—from knowledge synthesis to use
in personalised medicine
96 A. W. Kushniruk and E. M. Borycki
decisions about treatment and planning for patients. The focus group discussions
were recorded, transcribed and analyzed for themes and requirements for design
that then formed the basis for development of new user interface prototypes. In
reflecting the varied needs of different types of users in dealing with large and
complex data sets related to patient genetic data, a number of clear preferences
emerged (that were used to base the design of the prototype user interfaces that
were developed). For example, it was found that bioinformatics researchers pre-
ferred command line user interfaces over graphical user interfaces for better com-
patibility with the existing base of bioinformatics software tools and for
customization flexibility when analyzing and examining large data sets.
Furthermore, clinical geneticists noted the limitations in the usability of current
software and their inability to participate in specific stages of the health informatics
pipeline. Both clinical geneticists and genetic counselors wanted an overarching
interactive graphical interface that would be used to simplify the large data sets by
using a tiered approach where only functionalities relevant to the user domain were
accessible (and with the system being flexibly connected to a range of relevant
databases). In general, users wanted interfaces that would summarize key clinical
findings from the large array of possible details to aid in their application of the
genomic patient information, mitigate against cognitive overload and help in
focusing attention on key elements of the data presented.
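The tiered approach described above, in which only the functionalities relevant to a user's domain are accessible, can be sketched as a simple role-based filter; the roles and functionality names here are invented for illustration and do not come from the cited study's software.

```python
# Hypothetical functionality tiers: which features each user domain sees.
FUNCTIONALITY_TIERS = {
    "bioinformatician": {"raw_variant_access", "pipeline_config",
                         "batch_analysis", "clinical_summary"},
    "clinical_geneticist": {"variant_review", "clinical_summary"},
    "genetic_counselor": {"clinical_summary"},
}

def visible_functionalities(role, all_features):
    """Filter the system's full feature set down to what a given
    user domain is permitted to see in its tier."""
    allowed = FUNCTIONALITY_TIERS.get(role, set())
    return sorted(f for f in all_features if f in allowed)

features = {"raw_variant_access", "pipeline_config", "variant_review",
            "clinical_summary", "batch_analysis"}

print(visible_functionalities("genetic_counselor", features))
# → ['clinical_summary']
```

The design intent is the one reported in the focus groups: each user class is shown a simplified view of the large data set, mitigating cognitive overload while leaving the full pipeline accessible to those who need it.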
Further work is being conducted in this area and has focused on how to best
integrate genomic information (e.g. about gene mutations and risks associated with
them) with patient data contained in electronic health record systems. Indeed, to
effectively support applications like automated alerts or reminders that provide
information about patients related to genomic information, research will need to be
conducted that examines the user interface and human-computer interaction at the
level of the clinician, genetic counsellor or in the case of patient facing systems, the
patient themselves. Indeed, in order to take advantage of the rapid advances in
research in the area of personalized medicine, research will need to also include work
on arriving at systems and tools that are both useful and usable, that embed into work
activities for day-to-day application of knowledge (as in the use of electronic health
records) and support workflow, decision making and reasoning by humans.
There are a number of challenges for Big Data from a human factors perspective
and some prominent ones include the following:
– Electronic health record data is growing exponentially—electronic health record
systems are becoming widely used worldwide and are becoming ubiquitous.
These systems allow for storage and access to patient data that can be ever
Big Data Challenges from a Human Factors Perspective 97
There are a number of future directions for research into the human factors of Big Data.
The following are some of the directions the authors of this chapter have been and
are currently involved with:
– Usability analyses and analysis of use of big data to iteratively feedback into
design and redesign into health information systems, such as data warehouses,
electronic health records, public health information systems and clinical deci-
sion support systems. This work includes developing principled methods for
coding and analysing usage and usability data [12].
– Automated tracking and analysis of human interactions with such data as a way
to lead to improved use and application. For example, in previous work the
authors have been involved in creating what they called a “Virtual Usability
Laboratory”—the VUL. The VUL was designed to collect and collate data from
various sources (e.g. online questionnaires, user tracking logs, error logs and
various forms of qualitative data) to provide detailed but large amounts of data
about users of healthcare information systems [13].
– Large scale usability analyses in healthcare to complement smaller scale qual-
itative studies and usability tests. Some of this work we have referred to as
“usability in the large” where data collected on use and usability of health
information systems may span not only health regions but also across entire
nations [14].
– Further work into creation of personalized health information systems that
populations of lay people, patients and healthcare professionals can interact with
(i.e. in collaboration with their healthcare organizations).
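The multi-source collation performed by a tool like the VUL can be sketched as merging per-user records from different instruments; the field names and data sources below are assumptions for illustration, not the VUL's actual schema.

```python
def collate_usability_data(questionnaires, error_logs):
    """Merge online questionnaire scores and error-log events into one
    per-user summary, in the spirit of a virtual usability laboratory."""
    summary = {}
    for q in questionnaires:
        entry = summary.setdefault(q["user"], {"satisfaction": None, "errors": 0})
        entry["satisfaction"] = q["satisfaction"]
    for e in error_logs:
        entry = summary.setdefault(e["user"], {"satisfaction": None, "errors": 0})
        entry["errors"] += 1
    return summary

# Hypothetical data from two of the VUL's collection channels.
questionnaires = [{"user": "u1", "satisfaction": 4},
                  {"user": "u2", "satisfaction": 2}]
error_logs = [{"user": "u2", "event": "wrong menu item"},
              {"user": "u2", "event": "session timeout"}]

summary = collate_usability_data(questionnaires, error_logs)
print(summary)
```

Even this toy merge shows the value of collation: the user with the lowest satisfaction score is also the one generating the most error events, a pattern neither data source reveals on its own.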
6 Conclusion
Big data is here to stay. Furthermore, over time big data will only become even
“bigger”, with new ways to collect and store huge and ever-increasing amounts of
health information electronically. However, to be useful and effective, ultimately
such large repositories of data need to be synthesized, processed and used by
humans. In this chapter we have touched on a number of areas where human factors
research and application touch on big data initiatives and endeavors. To ensure
success of these projects and to really harness all this potential data for real
application in healthcare, greater and increasing attention will undoubtedly need to
be paid to the human factors of big data. There are a number of challenges that exist
that may currently limit the effectiveness and usefulness of big data and although
some of these are currently being addressed, the ever increasing amount of health
data will continually require new approaches and methods for improving human
interaction with big data.
References
1. Marconi K, Lehmann H (eds) (2014) Big data and health analytics. CRC Press, Boca Raton, FL
2. Kushniruk AW, Patel VL (2004) Cognitive and usability engineering methods for the
evaluation of clinical information systems. J Biomed Inform 37(1):56–76
3. Patel VL, Arocha JF, Kaufman DR (2001) A primer on aspects of cognition for medical
informatics. J Am Med Inform Assoc 8(4):324–343
4. Kushniruk AW (2001) Analysis of complex decision-making processes in health care:
cognitive approaches to health informatics. J Biomed Inform 34(5):365–376
5. Kortum P (2008) HCI beyond the GUI: design for haptic, speech, olfactory, and other
nontraditional interfaces. Elsevier, Amsterdam
6. Jacko JA, Sears A (2012) Human computer interaction handbook. CRC Press, Boca Raton, FL
7. Kushniruk A, Turner P (2012) A framework for user involvement and context in the design
and development of safe e-health systems. Stud Health Technol Inform 180:353–357
8. Cullis P (2015) The personalized medicine revolution: how diagnosing and treating disease
are about to change forever. Greystone Books
9. Shyr C, Kushniruk A, Wasserman WW (2014) Usability study of clinical exome analysis
software: top lessons learned and recommendations. J Biomed Inform 51:129–136
10. Shyr C, Kushniruk A, van Karnebeek CD, Wasserman WW (2015) Dynamic software design
for clinical exome and genome analyses: insights from bioinformaticians, clinical geneticists,
and genetic counselors. J Am Med Inform Assoc 23(2):257–268
11. Murdoch TB, Detsky AS (2013) The inevitable application of big data to health care. JAMA
309(13):1351–1352
12. Kushniruk AW, Borycki EM (2015) Development of a video coding scheme for analyzing the
usability and usefulness of health information systems. In: CSHI, 14 Aug 2015, pp 68–73
13. Kushniruk A, Kaipio J, Nieminen M, Hyppönen H, Lääveri T, Nohr C, Kanstrup AM,
Christiansen MB, Kuo MH, Borycki E (2014) Human factors in the large: experiences from
Denmark, Finland and Canada in moving towards regional and national evaluations of health
information system usability: contribution of the IMIA Human Factors Working
Group. Yearb Med Inform 9(1):67
14. Kaipio J, Lääveri T, Hyppönen H, Vainiomäki S, Reponen J, Kushniruk A, Borycki E,
Vänskä J (2017) Usability problems do not heal by themselves: national survey on physicians’
experiences with EHRs in Finland. Int J Med Inform 97:266–281
Big Data Privacy and Ethical Challenges
Paulette Lacroix
1 Introduction
P. Lacroix (&)
PC Lacroix Consulting Inc., North Vancouver, Canada
e-mail: placroix@placroix.ca
The advancement of technology that led to the possibility of big data occurred over
a short time frame, outdistancing the development of legislative privacy protections.
To allow for big data-type practices in general, new or modified widespread privacy
frameworks for both public and private-sector entities must be implemented to
protect the privacy of individuals and ensure fair and ethical use of their personal
information.
Big data analytics is distinctive in collecting significant amounts of data,
repurposing that data, using anonymization in analysis, generating new data from
these analyses, and being opaque in its data processing.
The Information Accountability Foundation [3] has distinguished four types of
new data produced by big data analytics:
1. Provided data consciously given by individuals, e.g. when filling in an online
form.
2. Observed data that is recorded automatically, e.g. by online cookies or sensors
or closed-circuit television (CCTV) linked to facial recognition.
3. Derived data that is produced from other data in a relatively simple and
straightforward fashion, e.g. calculating customer profitability from the number
of visits to a store and items bought.
4. Inferred data that is based on probabilities and produced by using a more
complex method of analytics to find correlations between datasets and using
these to categorize or profile individuals and populations, e.g. calculating credit
scores or predicting future health outcomes.
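The distinction between derived and inferred data in the list above can be made concrete with a toy calculation; the formulas, margins and coefficients below are purely illustrative assumptions, not real business or clinical models.

```python
import math

def derived_profitability(visits, items_bought,
                          margin_per_item=2.0, cost_per_visit=0.5):
    """Derived data: a simple, direct computation from observed values
    (visits and items bought), using assumed illustrative constants."""
    return items_bought * margin_per_item - visits * cost_per_visit

def inferred_risk_score(age, visits_last_year):
    """Inferred data: a probability-like score produced by a model.
    Here a toy logistic-style formula with made-up coefficients,
    standing in for the complex analytics described in the text."""
    x = 0.04 * age + 0.1 * visits_last_year - 3.0
    return 1 / (1 + math.exp(-x))

print(derived_profitability(visits=5, items_bought=12))            # → 21.5
print(round(inferred_risk_score(age=60, visits_last_year=4), 2))   # → 0.45
```

The derived value follows from the inputs in a simple, auditable way; the inferred score is a model's probabilistic output, which is why inferred data raises the sharper questions about profiling and accuracy discussed below.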
Thus, the privacy principle of direct collection from an individual for a specified
purpose is challenged by big data, affecting an individual’s personal autonomy
based on the right to control one’s personal data and the processing of such
data. Control requires awareness of the use of personal data and real freedom of
choice. These conditions, which are essential to the protection of fundamental
rights, and in particular the right to the protection of personal data, can be met
through different legal solutions tailored according to the given social and technological context [4].
The issue of meaningful informed consent also arises because big data analytics
may involve continuous collection of data over time, where the intended
consequences are not known or fully understood at the time of collection. Further,
each data set will likely contain different data points or values about the individuals
whose personal information is being collected. The principle of data accuracy
requires data to be complete and up to date. The information should be representative
of the target population, should not include discriminatory proxies such as race,
ethnicity or religion, and its results should be understood as correlations, not
causation [5]. Linking data from various sources may increase the likelihood that
decisions from those data will be based on inaccurate information, or on an
individual’s historical record rather than current circumstances or more recent
patterns of conduct.
Big Data Privacy and Ethical Challenges 103
Bias in large data sets may be unknown due to a lack of sampling, intrinsic
collection bias, or poor research design. If a data set
contains a variable that is not protected by law but by proxy is discriminatory, such
as a geographic region that contains a high percentage of individuals with the same
racial or ethnic background, decisions made from the analysis may be based on race
and ethnicity. There is increasing concern that the use of such data may constitute a
form of data surveillance operating against the legitimate interests of the individual.
The development of advanced algorithms has enabled big data to detect the
presence of increasingly complex relationships among significantly large numbers
of variables, and this ability brings with it a critical risk of re-identifying
individuals. De-identification, anonymization and pseudonymization
of data are recommended practices to mitigate risk of privacy breach in large, linked
data sets. Generally, a dataset is said to be de-identified if elements that might
immediately identify a person or organization have been removed or masked. Data
protection legislation defines different treatment for identifiable and non-identifiable
data; however, it is sometimes difficult to make this distinction, especially with
derived data from big data analytics [2]. Identifiability of an individual is
increasingly being seen as a continuum, not binary, and disclosure risks increase
with dimensionality (i.e. number of variables), linkage of multiple data sources, and
the power of data analytics.
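A minimal sketch can make these two ideas concrete: pseudonymization of a direct identifier, and a k-anonymity-style group-size check showing how disclosure risk grows with dimensionality. The record layout, salt and field names are illustrative assumptions, not from the source:

```python
# Sketch: pseudonymization plus a crude disclosure-risk check (illustrative data).
import hashlib
from collections import Counter

SECRET_SALT = b"keep-this-out-of-the-dataset"  # held separately by the data custodian

def pseudonymize(patient_id: str) -> str:
    """Replace a direct identifier with a keyed hash. This is pseudonymization,
    not anonymization: the custodian can still re-link records via the salt."""
    return hashlib.sha256(SECRET_SALT + patient_id.encode()).hexdigest()[:16]

def smallest_group_size(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifiers
    (k-anonymity style); a group of size 1 means a unique, re-identifiable row."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(combos.values())

records = [
    {"id": pseudonymize("P001"), "age_band": "40-49", "postcode": "V8W"},
    {"id": pseudonymize("P002"), "age_band": "40-49", "postcode": "V8W"},
    {"id": pseudonymize("P003"), "age_band": "50-59", "postcode": "V8W"},
]
# Adding variables shrinks the groups and raises re-identification risk:
k1 = smallest_group_size(records, ["postcode"])              # 3
k2 = smallest_group_size(records, ["postcode", "age_band"])  # 1 -> unique record
```

The drop from k1 to k2 as one more variable is considered is the continuum of identifiability in miniature: each added dimension, or linked data source, moves records toward uniqueness.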
Big data profiling is a type of automated processing of personal information that
inputs an individual’s personal information into a predictive model, which then
processes the information according to the set of rules established by the model to
produce an evaluation or prediction concerning one or more attributes of the
individual. For example, it may be used to evaluate or predict an individual’s
eligibility for programs or services. Profiling not only processes personal
information but generates it as well, creating a new element of personal information that
will be associated with the individual. While profiling pre-defines individuals into
types or categories in a reductive approach to understanding human behavior, the
prediction is set at a point in time and some degree of error is expected in the
outcome. It is important for organizations that profile to promote transparency of
the logic used by the predictive model and the potential consequences of the results.
Organizations should verify the results of decisions based solely on profiling and
ensure that individuals may exercise their privacy right to challenge or respond to
such decisions. By its very nature, profiling treats individuals as fixed, transparent objects
rather than as dynamic, emergent subjects [5]. In addition to a loss of dignity or
respect, profiling may have larger effects on society and individuals.
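A profiling model that exposes its own logic, as the transparency obligation above requires, might look like the following sketch. The attributes, weights and threshold are invented for illustration and do not represent any real scoring system:

```python
# Sketch of a rule-based profiling model that reports per-attribute
# contributions, so its logic can be shown to, and contested by, the data
# subject. Weights and attribute names are illustrative.

WEIGHTS = {"late_payments": -0.8, "years_employed": 0.3, "prior_claims": -0.5}
THRESHOLD = 0.0

def profile(individual: dict) -> dict:
    """Score an individual and return the contribution of each attribute,
    making 'the logic used by the predictive model' visible."""
    contributions = {k: WEIGHTS[k] * individual.get(k, 0) for k in WEIGHTS}
    score = sum(contributions.values())
    return {"eligible": score >= THRESHOLD,
            "score": round(score, 2),
            "explanation": contributions}

result = profile({"late_payments": 2, "years_employed": 10, "prior_claims": 1})
# score = -1.6 + 3.0 - 0.5 = 0.9 -> eligible, with every factor's effect visible
```

Returning the explanation alongside the decision is one concrete way to support an individual's right to challenge or respond to a decision based solely on profiling.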
A recommended best practice for an organization that profiles people is to first
consult with public and civil society organizations regarding the impact of the
proposed profiling and to conduct a privacy impact assessment.
The European Union (EU) General Data Protection Regulation (GDPR), fully
applicable in May 2018, supersedes the 1995 Data Protection Directive and
strengthens and harmonizes the protection of personal data for EU citizens.
The GDPR considers not only the location of the data processing but also whether
personal data relating to individuals located in the EU are being processed,
regardless of where the data controller is established in the world. This legislation
has a global reach and has effectively influenced legislative changes in privacy
protection in other countries [6]. The GDPR has expanded data protection princi-
ples to require organizations to demonstrate accountability in the collection, use and
disclosure of personal information. The emerging importance of accountability is in
direct response to the implications of the processing of personal data in a big data
world.
More specifically, the GDPR requires a data protection impact assessment be
completed for initiatives that involve “a systematic and extensive evaluation of
personal aspects relating to natural persons which is based on automated process-
ing, including profiling, and on which decisions are based that produce legal effects
concerning the natural person or similarly significantly affect the natural person”
[2]. Other provisions in the Regulation include data protection by design and by
default (e.g. Privacy by Design [7]) and certification (e.g. the establishment of
certification mechanisms and data protection seals and marks, giving the public
quick access to the level of data protection of relevant products and services).
A prevailing view is that any potential harms arising from big data analytics stem
from how the data are used, not necessarily how the data were collected. The GDPR
accountability principle now focuses attention on the use of personal information
through mechanisms such as scrutinizing the technical design of algorithms,
auditing the analytics process and applying software-defined regulations.
Accountability has been championed over transparency, which to date is known to
have many limitations in protecting an individual’s right to privacy.
In a recent report, the Information Commissioner for the United Kingdom proposed
the following six recommendations for organizations conducting big data
analytics:
1. Carefully consider whether the big data analytics requires the processing of
personal data and use appropriate techniques to anonymize personal data in the
dataset(s) before analysis.
2. Be transparent about the processing of personal data by using a combination of
approaches to provide meaningful privacy notices at appropriate stages
throughout a big data project. This may include the use of icons, just-in-time
notifications and layered privacy notices.
3. Embed a privacy impact assessment framework into big data processing
activities to help identify privacy risks and assess the necessity and propor-
tionality of a given project. The privacy impact assessment should involve input
from all relevant parties including data analysts, compliance officers, board
members and the public.
4. Adopt a privacy by design approach in the development and application of big
data analytics. This should include implementing technical and organizational
measures to address data security, data minimization and data segregation.
5. Develop ethical principles to help reinforce key data protection principles.
Employees in smaller organizations should use these principles as a reference
point when working on big data projects. Larger organizations should create
ethics boards to help scrutinize projects and assess complex issues arising from
big data analytics.
6. Implement innovative techniques to develop auditable machine learning algo-
rithms. Internal and external audits should be undertaken with a view to
explaining the rationale behind algorithmic decisions and checking for bias,
discrimination and errors [2].
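Recommendation 6's call for audits that check for bias can be illustrated with one simple audit check, the "four-fifths" disparate impact heuristic. The decision log and the 0.8 threshold are illustrative assumptions for this sketch and are not taken from the ICO report:

```python
# Sketch of one bias-audit check: compare an algorithm's positive-decision
# rates across groups and flag large disparities (illustrative data).

def selection_rates(decisions):
    """decisions: list of (group, approved) pairs -> approval rate per group."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}

def disparate_impact_ratio(decisions):
    """Ratio of the lowest to the highest group approval rate."""
    rates = selection_rates(decisions)
    return min(rates.values()) / max(rates.values())

audit_log = [("A", True)] * 8 + [("A", False)] * 2 + \
            [("B", True)] * 4 + [("B", False)] * 6
ratio = disparate_impact_ratio(audit_log)  # 0.4 / 0.8 = 0.5
flagged = ratio < 0.8                      # below the four-fifths heuristic
```

A flagged result does not prove discrimination, but it gives internal and external auditors a concrete starting point for explaining the rationale behind algorithmic decisions.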
Personal data protection regimes, like the GDPR, are instruments for the
governance of data flows and data processing, and they remain valuable for the
protection of personal data in line with classical data processing. Yet they may be
inadequate to address the unprecedented challenges raised by big data, in particular
the frequent incompatibility between big data and privacy principles. The purposes of
algorithm-driven big data analytics are often to discover otherwise invisible patterns
in the data, rather than to apply previous insights, test hypotheses, or develop
explanations. Add to this the technical complexities of machine learning and AI,
and the effect can be the distancing of supervisory authorities and undertakings
from the meaning of the right to data protection. Ethics allows a return to the
spirit of the law and offers other insights for conducting an analysis of a new digital
society, such as its collective ethos, its claims to social justice, democracy and
personal freedom [4].
The adoption of an ethical approach to big data processing is being driven by two
main factors. In the public sector, evidence of a lack of public awareness about the
use of data and the extent of data sharing has led to calls for ethical policies to be
made explicit. The commercial imperative in the private sector is to mitigate risk of
reputational harm due to public distrust and brand devaluation [8]. While it is now
recognized that adherence to privacy legislation is not enough, ethical frameworks
for big data analytics and research are highly contested and in flux.
At the heart of the ethics debate are the consequences of the speed, capacity and
continuous generation of big data, as well as changes in the relationality,
flexibility, repurposing and de-contextualization of data. Of particular concern is
the intensification of algorithmic profiling and ‘personalization’, in which
individuals are treated not as persons but as temporary aggregates of data
processed at an industrial scale. But
human beings are not identical to their data. Human values must be understood and
implemented within a social, cultural, political, economic and technological context
in which personal data and personal experience is made. Therefore, digital ethics
should take into account the widely changing relationship between digital and
human realities. Big data generates new ethical questions about what it means to be
human in relation to data, about human knowledge and about the nature of human
experience. It obliges us to re-examine how we live and work, how we socialize and
participate in communities, and our relations with others and, perhaps most
importantly, with ourselves. It invites ethical evaluation and a new interpretation of
fundamental notions in ethics, such as dignity, freedom, autonomy, solidarity,
equality, justice, and trust [4].
Trust, as a concept related to the perception of risk and uncertainty, has grown in
importance in the evolution of information technologies as a bridge between
technical and moral aspects of technically assisted communication systems.
Crucially, trust has a double meaning in data protection. One is a
technologically-oriented, functional or knowledge concept: trust in a technology
refers to the confidence that it will not fail in its pure functionality, that its design
and engineered properties will carry out their expected function. The second
meaning is that trust is a moral concept referring to belief and reliance in a person or
organization that they will honour explicit or implicit promises and commitments
[4]. In this context data protection faces three interrelated crises of trust:
1. Individual trust in people, institutions and organizations that deal with personal data;
2. Institutional trust, transparency and accountability as a condition for keeping
track of the reputations of individuals and organizations and trust-building in a
society that requires access to personal data; and
3. Social trust in other members of social groups anchored in personal proximity
and physical interaction, which are being increasingly replaced by digital
connections.
Trust builds on shared assumptions about material and immaterial values, about
what is important and what is expendable. It stems from shared social practice,
shared habits, ways of life, common norms, convictions and attitudes. Trust is based
on shared experiences, on a shared past, shared traditions and shared memories.
It is concerning that big data science sidesteps many of the informal modes of
ethics regulation found in other science and technology communities. The precursor
disciplines of data science (computer science, physics, and applied mathematics)
have not historically fallen under the purview of ethics review at universities. The
reason is that their work and contributions have historically been about systems
rather than people, placing them outside human-subjects ethics concerns. As a result,
the content of the datasets is considered irrelevant to the substantive questions of
human-related research including the privacy rights of research subjects. The result
is a disjunction between the familiar infrastructures and conceptual frameworks of
research ethics and the emerging epistemic conditions of big data. Data scientists
are often able to gain access to highly sensitive data about human subjects without
ever intervening in the lives of those subjects to obtain it. They may predict, or
infer, or gather data from disconnected public data sets. It is important to note that
big data research which re-uses de-identified or publicly available data will largely
be excused from ethics oversight as long as it meets unspecified privacy safeguards
such as anonymization or de-identification. Given the accepted definition of
human-subjects research, nearly all non-biomedical research would receive at most
perfunctory oversight due to the assumption that there is little or no risk of harm [8].
The consensus view of the European Advisory Group [4] is that a digital ethics
framework will provide new terms for identifying, analyzing and communicating
new human realities, in order to displace traditional value-based questions and
identify new challenges in view of values at stake and existing and foreseeable
technological changes. The purpose of digital ethics is not only to account for the
present, but also to perform a foresight function. The shift is twofold. First, the
object of legal regulation (i.e. the individual) can become less interesting as a
phenomenon in the here-and-now and more an object of reasoned speculation
about its future role, based on the predictive powers of big data and algorithmic
processing. Second, while the analysis of legal issues is being pushed into
the future, what is understood as existing in the future becomes drawn into the
assessments of the present. For example, estimates of what the future will hold,
generated through the patterns gathered in big data analysis, are continuously
gaining in importance for the way criminal justice operates today and is purported
to operate tomorrow.
The focus of digital ethics is primarily meta-ethical: it considers general and
fundamental questions about what it means to make claims about ethics and human
conduct in the digital age, when the baseline conditions of ‘human-ness’ are under
pressure from interconnectivity, algorithmic decision-making, machine learning,
digital surveillance and the enormous collection of personal data, and about what
can and should be retained, or adapted, from traditional normative ethics. The
following examples
provide insight into the need for a digital ethics framework [4].
1. From the individual to the digital subject: Data exhausts neither personal
identity nor the qualities of the communities to which individuals belong; data
protection is not only about the protection of data, but primarily about the
protection of the persons behind the data. The question is whether the digital
representation of persons may expose them to new forms of vulnerability and
harm.
2. From analogue to digital life: The governing and the governed are distinct but
linked by mutually recognized principles of legal obligation and accountability.
Digital technologies have changed this. The use of algorithms and large data sets
can shape and direct the lives of individuals, who are therefore increasingly
governed on the basis of the data generated from their own behaviours and
interactions. The distinction between the forces that govern everyday life and the
persons who are governed within it thus becomes more difficult to discern.
Behaviour may be
governed by ‘nudging’, that is by minute, barely noticeable suggestions, which
can take a variety of forms and which may modify the scope of choices indi-
viduals have or believe they have.
3. From a risk society to a scored society: Risk assessment is carried out using
techniques of probability calculation, allowing individuals to be pooled and
situations with the same level of risks to be identified with each other for the
purposes of understanding the value of loss and the cost of compensation. In the
digital age, algorithms supported by big data can provide a far more detailed and
granular understanding of individual behaviours and propensities, allowing for
more individualized risk assessments and the apportioning of actual costs to
each individual; such assessment of risk threatens contractual or general prin-
ciples and widely shared ideas of solidarity. In this scored society, individuals
can be hyper-indexed and hyper-quantified. Beliefs and judgments about them
can be made through opaque credit or social scoring algorithms that must be
open to negotiation or contestation.
4. From human autonomy to the convergence of humans and machines: An
increasing number of technological artefacts, from prostheses like eyeglasses
and hearing aids, to smartphones, GPS, augmented reality glasses and more, can
be experienced in a symbiotic relationship with the human body. These artefacts
are experienced less as objects of the environment than as a means through
which the environment is experienced and acted upon. As such, they may tend
toward a seamless framing of our perception of reality. They may shape our
experience of the world in ways that can be difficult to assess critically. This
phenomenon of incorporation or even embodiment of technologies is even more
intense whenever the devices are implanted in the body. A parallel frontier of
convergence between human and machines is on the verge of being crossed by
intelligent, or rather ‘autonomous’, machines that are able to adapt their
behaviours and, rather than merely executing human commands, collaborate with
or even replace human agents, helping them identify problems to be solved or the
optimal paths to solving them.
5. From individual responsibility to distributed responsibility: The problems of
many hands and problems of collective action and collective inaction can lead to
tragedies of the commons and problematic moral assessments of complex
human endeavours, both low and high tech, where a number of people act
jointly via distant causal chains, while being separated in time and space from
each other and from the aggregated outcomes of their individual agency. The
problems of allocation and attribution of responsibilities are exacerbated by the
networked configuration of the digitized world.
6. From criminal justice to pre-emptive justice: In legal practice, the detection and
investigation of crime is no longer only a science of criminal acts, of identifying
and adjudicating events authored by identifiable, accountable individual actors
under precise conditions and in terms of moral and legal responsibility, but also a
statistically supported calculation of the likelihood of future crime, a structuring of
the governance of crime around the science of possible transgression and possible
guilt, removing moral character from the equation. The aim of criminal justice
remains the same: to provide security within society while at the same time
adhering to high standards of human rights and the rule of law. However, the shift
that marks one of the main backdrops of the digital age and calls for a new digital
ethics is that of trying to predict criminal behaviour in advance, using the output
of big data-driven analysis and smart algorithms to look into the future.
A new digital geopolitics has been created by differences in data protection rules,
as national borders no longer represent the limits of data flows. The consequences
for global governance are significant. These digital geopolitics will impact national
cultures to the extent that national sovereignty will be increasingly strained between
national pressures and the shifting norms of the international system. There is
significance and urgency to developing a digital ethics framework, as evidenced by
digital ethics being the core topic of the 2018 International Conference of Data
Protection and Privacy Commissioners.
learning can lead to individual bias and erosion of human rights. Human
oversight and accountability are necessary in profiling.
4. While technology will continue to converge in a symbiotic relationship with
humans, and generally to the benefit of human health and wellness, this con-
vergence has the potential to shape our perception of humanity and human
values over time. Humans are not identical to their data and should not be
temporary aggregates of data processing. Big data will generate new ethical
questions about what it means to be human in relation to data, about human
knowledge and about the nature of human experience.
5. Trust, a moral concept referring to belief and reliance in a person or organization
that they will honour explicit or implicit promises and commitments, stems from
shared social practice, shared habits, ways of life, common norms, convictions
and attitudes. Big data science researchers are often able to gain access to highly
sensitive data about human subjects without intervening in the lives of the
subjects to obtain it. The use of privacy impact assessments prior to release of
sensitive data provides a means for healthcare providers to determine and
mitigate risk, thus acknowledging the value of an individual’s trust of the
healthcare system while supporting the benefits of big data analytics.
6 Conclusion
References
1. Denham E (2017) Big data, artificial intelligence, machine learning and data protection.
Version 2.2. Information Commissioner’s Office. Available from https://ico.org.uk/media/for-
organisations/documents/2013559/big-data-ai-ml-and-data-protection.pdf. Accessed on 19
June 2018
2. European Union. General data protection regulation. Available from https://gdpr-info.eu and
https://ec.europa.eu/commission/priorities/justice-and-fundamental-rights/data-protection/
2018-reform-eu-data-protection-rules_en#abouttheregulationanddataprotection. Accessed on
19 June 2018
3. Abrams M (2014) The origins of personal data and its implications for governance. OECD.
Available from http://informationaccountability.org/wp-content/uploads/Data-Origins-Abrams.
pdf. Accessed on 19 June 2018
4. European Data Protection Supervisor Ethics Advisory Group (2018) Towards a digital ethics.
Available from https://edps.europa.eu/sites/edp/files/publication/18-01-25_eag_report_en.pdf.
Accessed on 19 June 2018
5. Office of the Information and Privacy Commissioner of Ontario (2017) Big data guidelines.
Available from https://www.ipc.on.ca/wp-content/uploads/2017/05/bigdata-guidelines.pdf.
Accessed on 19 June 2018
6. The Council for Big Data, Ethics and Society (2016) Perspectives on big data, ethics, and
society. Available from https://bdes.datasociety.net/wp-content/uploads/2016/05/Perspectives-
on-Big-Data.pdf. Accessed on 19 June 2018
7. Cavoukian A (2011) Privacy by design. The 7 foundational principles. Information and Privacy
Commissioner of Canada. Available from https://www.ipc.on.ca/wp-content/uploads/
Resources/7foundationalprinciples.pdf. Accessed on 19 June 2018
8. Metcalf J, Crawford K (2016) Where are human subjects in big data research? The emerging
ethics divide. Big Data Soc (Jan–June):1–14. Available from http://journals.sagepub.com/doi/
pdf/10.1177/2053951716650211. Accessed on 19 June 2018
9. World Medical Association (2016) WMA declaration of Taipei on ethical considerations
regarding health databases and biobanks. Available from https://www.wma.net/policies-post/
wma-declaration-of-taipei-on-ethical-considerations-regarding-health-databases-and-biobanks/.
Accessed on 19 June 2018
Part III
Technological Perspectives
Health Lifestyle Data-Driven Applications Using Pervasive Computing
1 Introduction
The use of mobile technology and wearables for health has become a mass
phenomenon. Millions of people are using wearable devices (e.g. the Apple Watch)
and mobile apps for health reasons. Data from mobile and wearable devices are
captured to quantify patient-reported outcomes, to support both clinical trials and
clinical practice. The combination of mobile and wearable technology with other
connected health devices is often referred to as pervasive or ubiquitous computing,
which describes the tendency to embed computing elements into everyday
objects (e.g. wearable devices, the Internet of Things) [1]. Pervasive computing has
several potential applications in the health domain, but in particular, it can be very
useful to monitor lifestyle using wearables, patient-reported outcomes via mobile
phones and patient behaviours relying on the Internet of Things [2]. Since lifestyle
plays a major role in the prevention and management of multiple health conditions,
pervasive technologies can also be used to foster new applications for precision
medicine [3].
Transforming data from wearables and mobile devices into actionable knowledge
that can support the decision making of professionals and patients is not a trivial
task. It is a complex process involving multiple steps, as shown in
Table 1. Furthermore, the selection of data sources also impacts the potential
applications. In terms of data-driven analytics, most of the discussion has been
focused on what has been called Big Data, which is widely covered in this textbook
and related surveys [4]. However, dealing with lifestyle, mobile and wearable
technologies brings additional challenges which are covered in this chapter. We
also need to consider that when collecting personal health data, we might have
scenarios where small data about one individual’s behavior has more value than a
huge dataset from a large population. Small data does not necessarily mean worse
data, and the boundary between big and small is not always clear [5].
Table 1 Steps required for big data value extraction from mobile and wearable health devices.
Adapted from Curry [6]

Data acquisition: With regards to lifestyle, the use of mobile and wearable
technologies has the capacity to capture the context of patients at the right time and
place. Captured data can come directly from user interfaces (e.g. psychological
patient-reported outcomes [7]) or a wide variety of sensors [8]. These sensors are
not limited to wearables, as there are implantable and semi-implantable sensors
such as continuous glucose monitoring devices.

Data curation and storage: Extracting insights from health data also requires the
capacity to curate the data and assess its quality [9]. In addition, it is important to
ensure its interoperability so it can be integrated with larger datasets. This is of
special importance if we foresee the need to integrate lifestyle data into Electronic
Health Records (EHRs) for various applications (e.g. integration of sensor data in
EHRs using HL7 FHIR [10]). As information about individuals becomes
increasingly integrated, we might find use cases where lifestyle data gets integrated
with other biomedical data sources (e.g. clinical data, genotype). In the long run,
storing such connected data might become a serious challenge. Ethical aspects,
such as privacy and consent, are also relevant when storing and sharing health data.

Data analysis: The numerous types of applications that can be built around
data-driven lifestyle require the use of different machine learning techniques [11].
The data analytics techniques are heavily dependent on the application. For
example, a health coaching solution might require real-time pattern recognition in
order to provide recommendations to patients. However, aggregating public health
data about lifestyle for policy makers might not require such real-time analysis, but
rather clustering of the population by predicted health risks.

Data applications: The range of applications includes visualization dashboards,
clinical safety and logistics, decision support systems for professionals, coaching
systems for patients, etc. At the application level, one of the biggest challenges is
user engagement and usability; consequently, this area of work includes the use of
human-computer interaction techniques.
Table 1 summarizes the main steps involved in the creation of data-driven
applications, from the acquisition of data to the creation of applications.
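The EHR-integration use case mentioned in the curation step (sensor data via HL7 FHIR [10]) can be sketched by packaging a daily step count as a FHIR Observation resource. The patient reference and date are placeholders, and the LOINC code shown is a commonly used step-count code that should be checked against the profile actually in use:

```python
# Sketch: a daily step count from a wearable packaged as an HL7 FHIR
# Observation (JSON). Field choices follow the FHIR R4 Observation
# structure; patient reference and codes are illustrative placeholders.
import json

def steps_to_fhir(patient_ref: str, steps: int, day: str) -> dict:
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{
            "system": "http://loinc.org",
            "code": "55423-8",          # LOINC: number of steps (verify for your profile)
            "display": "Number of steps"}]},
        "subject": {"reference": patient_ref},
        "effectivePeriod": {"start": f"{day}T00:00:00Z", "end": f"{day}T23:59:59Z"},
        "valueQuantity": {"value": steps, "unit": "steps"},
    }

obs = steps_to_fhir("Patient/example", 8542, "2018-06-19")
payload = json.dumps(obs)  # suitable for POSTing to a FHIR server's Observation endpoint
```

Expressing wearable output in a standard resource like this is what makes the interoperability goal of the curation step achievable in practice.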
This chapter is structured as follows. In the next section, we provide an overview
of data-driven applications relying on pervasive health technologies, including
general aspects and examples from diabetes management. Finally, in the discussion
section we provide a summary of the main non-technical challenges, in particular
socio-ethical aspects, that could be barriers to the development of healthy lifestyle
data-driven applications.
The aggregation of data from the wearable and mobile devices of large populations
allows the tracking of lifestyle patterns, such as physical activity, in near real time,
which in turn supports the understanding of modifiable risk factors to inform public
health officials. Tim Althoff, in a recent review, highlighted potential applications and
technical challenges [12]. The same author, in a related study, reported on the use of
mobile data to study physical activity patterns on a global scale, using data from
over seven hundred thousand users [13]. For many of these approaches,
challenges include access to such datasets by public health officials, the
representativeness of the data, and the transformation of data into decision support
tools. Microsoft
Research has also studied large datasets from wearable devices to better understand
sleep patterns at the population level, integrating search logs and data from health
apps as well [14]. Another approach for public health studies involving pervasive
health solutions is the use of sensors and mobile technologies (e.g. mobile sleep
labs) within observational studies. For example, the website www.sleepdata.org
incorporates data from actigraphy devices (clinical wearable devices for studying
sleep and physical activity) from thousands of patients across several years [15].
These datasets have been used in applications such as developing techniques for the
detection of sleep apnea [16], which is an example of how epidemiological
pervasive data can be used to create new diagnostic applications.
visual metaphors exist, such as node-link networks in the Health Infoscape [24], trees
and maps [25], scatterplots of dimension-reduced data, and parallel coordinate
plots [26], but they can be difficult for the users targeted by these data
visualizations to understand. Based on user needs and feedback, visualization experts design
the most efficient visual metaphors and interactions for specific data, tasks
and users. Still, data visualization literacy [27] remains a key challenge: users must be
educated to understand these powerful graphics before their use can be enabled at large
scale for pervasive health data visualization. In Qatar, we are working on visual
analytics of wearable data to better understand behavioral patterns of children
with obesity (see Fig. 1).
Scalability to large data sets is another challenge. A standard way to address this
issue is to pre-process the data before the rendering stage, where the visual
metaphors encoding the data are displayed as pixel images. Data mining and
machine learning techniques [29] are used to summarize the data with simple
statistics, such as counts and averages, or by selecting prototypical examples with vector
quantization approaches [30]. Other techniques, such as dimension reduction [31] and
feature selection, are employed to reduce the number of features. The
scalability issue with visualizing big data therefore lies not in rendering but in computing
minimal summaries that are still meaningful and useful to the user.
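As a minimal illustration of this pre-processing step (hypothetical data and function names, not from the chapter), per-minute step counts from a wearable can be collapsed into hourly summaries so that the renderer draws a handful of points instead of thousands:

```python
from statistics import mean

def summarize_by_hour(minute_steps):
    """Reduce per-minute step counts to one {total, mean} summary per hour,
    so a chart draws 24 points for a day instead of 1440."""
    hourly = []
    for start in range(0, len(minute_steps), 60):
        chunk = minute_steps[start:start + 60]
        hourly.append({"total": sum(chunk), "mean": mean(chunk)})
    return hourly

# One day of per-minute readings (1440 values) collapses to 24 summaries.
day = [5] * 1440
summary = summarize_by_hour(day)
assert len(summary) == 24
assert summary[0]["total"] == 300  # 5 steps/min * 60 min
```

The same idea generalizes to the other techniques mentioned above: the averaging step could be swapped for vector quantization prototypes or a dimension-reduction projection before rendering.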
The best option to ensure visualizations are meaningful and easy to understand
is to adopt a user-centric design approach that progressively introduces more
advanced graphics, finely tuned to the user's needs. Any graphic can be
adopted as long as it is usable and useful and the user has been trained to use it. We can expect
that, in the future, digital health literacy will also incorporate elements such as the
capacity to understand key concepts of visual analytics and machine learning.
Fig. 1 Visualization dashboard for sleep and physical activity of children with obesity [28]
120 L. Fernandez-Luque et al.
Diabetes is a chronic condition in which many different lifestyle factors play a role in
the control of the disease. Physical activity, nutrition, sleep and stress interact with
biological factors that influence how we metabolize glucose and even our appetite
[32]. Furthermore, many complications of the disease, such as fatigue, are often the
result of lifestyle factors. These physiological factors that influence insulin sensitivity
(and consequently diabetes control) are not yet incorporated into closed-loop
artificial pancreas systems [33] (see Fig. 2 for an example), in which insulin pumps
are adjusted automatically using data from continuous glucose monitoring devices.
Lifestyle data related to diabetes management have been collected using mobile and
wearable technologies for many years to support the decision making of healthcare
professionals, patients and relatives [34, 35]. There are also examples of how
physical activity data from wearables can be used to create data-driven coaching
solutions for diabetes [20].
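The closed-loop idea can be illustrated with a deliberately simplified toy: a proportional rule that nudges a basal insulin rate according to the deviation of a CGM reading from a target. Every name and parameter below is made up for illustration; this is not a clinical algorithm, and real artificial pancreas systems use validated, far more sophisticated control strategies:

```python
def toy_basal_adjustment(cgm_mg_dl, target=110.0, basal=1.0, gain=0.01):
    """Toy proportional rule: raise the basal insulin rate when CGM glucose
    is above target, lower it when below, never going negative.
    Purely pedagogical -- NOT a clinical algorithm."""
    adjusted = basal + gain * (cgm_mg_dl - target)
    return max(0.0, adjusted)

assert toy_basal_adjustment(110.0) == 1.0   # at target: rate unchanged
assert toy_basal_adjustment(160.0) == 1.5   # high glucose: more insulin
assert toy_basal_adjustment(60.0) == 0.5    # low glucose: less insulin
```

The point the chapter makes is that inputs such as exercise or sleep, which change insulin sensitivity, do not yet feed into this loop; in the sketch that would mean the fixed `gain` should itself depend on lifestyle data.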
regulation, another approach to fostering the privacy and security of health apps and
wearables is to improve users' skills by increasing their digital health literacy
[57].
Any public health intervention should aim to reduce health disparities and ensure
equity among the population. For data-driven personal health applications,
the representativeness of the data presents a major challenge. Early adopters of
technology tend to be those with higher education and better socio-economic
conditions. Consequently, data-driven models may inadvertently acquire biases,
which can lead to the models not performing equally well for individuals
underrepresented in the datasets used to train them. When designing
data-driven health applications, it is imperative that the training data be representative of
the population to be served, to avoid unethical and biased outcomes [54]; for
example, by ensuring the enrollment of minorities and underserved communities.
Such biases are likely to be pronounced in lifestyle datasets, as cultural factors are
well known to shape our lifestyles and routines. Approaches are emerging that use
machine learning to reduce biases [55].
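One simple, hypothetical way to operationalize such a representativeness check is to compare each demographic group's share in the training data against its share in the target population and flag large gaps. The group labels and the 50% threshold below are illustrative assumptions, not a published method:

```python
def underrepresented_groups(training_counts, population_shares, ratio=0.5):
    """Flag demographic groups whose share in the training data is less than
    `ratio` times their share in the target population."""
    n = sum(training_counts.values())
    flagged = []
    for group, pop_share in population_shares.items():
        train_share = training_counts.get(group, 0) / n
        if train_share < ratio * pop_share:
            flagged.append(group)
    return flagged

# Hypothetical wearable study: group B is 30% of the population
# but only 5% of the training data, so it gets flagged.
train = {"A": 95, "B": 5}
pop = {"A": 0.70, "B": 0.30}
assert underrepresented_groups(train, pop) == ["B"]
```

Running such a check before model training is one concrete way to surface the enrollment gaps discussed above before they become model biases.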
Other socio-economic factors can become barriers or enablers to data-driven
personal health applications. The increased availability of data can improve the
quality of machine learning models, but it also increases the value of data for many
organizations. Consequently, there are serious concerns regarding the "privatization"
of health data [56]. A good example of such "privatization" is fitness sensors
that provide neither an open API for accessing raw data nor integration
capabilities with third-party applications. Further, in many countries, healthcare
providers cannot use devices such as the Fitbit because its cloud is located outside
their country.
4 Conclusions
The increasing penetration of mobile and wearable technologies has been paving
the way for the development of innovative data-driven personal health applications.
These new applications build upon decades of experience in using mobile and
wearable technologies in the health domain, but they are being launched at an
unprecedented scale. We must look past the buzz and hype and acknowledge the new
socio-ethical challenges, which require a strong multidisciplinary partnership with
deep engagement of clinicians and patients, to ensure that these technological
developments really improve public health and do not further increase
health disparities.
References
1. Orji R, Moffatt K (2018) Persuasive technology for health and wellness: state-of-the-art and
emerging trends. Health Inform J 24:66–91
2. Riazul Islam SM, Kwak D, Humaun Kabir M, Hossain M, Kwak KS (2015) The internet of
things for health care: a comprehensive survey. IEEE Access 3:678–708
3. Intille S (2016) The precision medicine initiative and pervasive health research. IEEE
Pervasive Comput 15:88–91
4. Fang R, Pouyanfar S, Yang Y, Chen S-C, Iyengar SS (2016) Computational health
informatics in the big data age. ACM Comput Surv 49:1–36
5. Faraway JJ, Augustin NH (2018) When small data beats big data. Stat Probab Lett 136:142–145
6. Curry E (2016) The big data value chain: definitions, concepts, and theoretical approaches. In:
New horizons for a data-driven economy, pp 29–37
7. Heron KE, Smyth JM (2010) Ecological momentary interventions: incorporating mobile
technology into psychosocial and health behaviour treatments. Br J Health Psychol 15:1–39
8. Rodgers MM, Pai VM, Conroy RS (2015) Recent advances in wearable sensors for health
monitoring. IEEE Sens J 15:3119–3126
9. Bialke M, Rau H, Schwaneberg T, Walk R, Bahls T, Hoffmann W (2017) MosaicQA—a
general approach to facilitate basic data quality assurance for epidemiological research.
Methods Inf Med 56:e67–e73
10. Walinjkar A, Woods J (2017) Personalized wearable systems for real-time ECG classification
and healthcare interoperability: real-time ECG classification and FHIR interoperability. In:
Internet technologies and applications (ITA). https://doi.org/10.1109/itecha.2017.8101902
11. Habib ur Rehman M, Liew CS, Wah TY, Shuja J, Daghighi B (2015) Mining personal data
using smartphones and wearable devices: a survey. Sensors 15:4430–4469
12. Althoff T (2017) Population-scale pervasive health. IEEE Pervasive Comput 16:75–79
13. Althoff T, Sosič R, Hicks JL, King AC, Delp SL, Leskovec J (2017) Large-scale physical
activity data reveal worldwide activity inequality. Nature 547:336–339
14. Althoff T, Horvitz E, White RW, Zeitzer J (2017) Harnessing the web for population-scale
physiological sensing. In: Proceedings of the 26th international conference on world wide
web—WWW ’17. https://doi.org/10.1145/3038912.3052637
15. Dean DA 2nd, Goldberger AL, Mueller R, Kim M, Rueschman M, Mobley D et al (2016)
Scaling up scientific discovery in sleep medicine: the national sleep research resource. Sleep
39:1151–1164
16. Haidar R, Koprinska I, Jeffries B (2017) Sleep apnea event detection from nasal airflow using
convolutional neural networks. Lecture Notes in Computer Science, pp 819–827
17. Jaimes LG, Llofriu M, Raij A (2016) Preventer, a selection mechanism for just-in-time
preventive interventions. IEEE Transact Affect Comput 7:243–257
18. Schäfer H, Hors-Fraile S, Karumur RP, Valdez AC, Said A, Torkamaan H, et al (2017)
Towards health (aware) recommender systems. In: Proceedings of the 2017 international
conference on digital health—DH ’17. https://doi.org/10.1145/3079452.3079499
19. Dias Pereira dos Santos A, Yacef K, Martinez-Maldonado R (2017) Let’s dance: how to build
a user model for dance students using wearable technology. In: Proceedings of the 25th
conference on user modeling, adaptation and personalization—UMAP ’17, ACM Press, New
York, USA, pp 183–191
20. Hochberg I, Feraru G, Kozdoba M, Mannor S, Tennenholtz M, Yom-Tov E (2016)
Encouraging physical activity in patients with diabetes through automatic personalized
feedback via reinforcement learning improves glycemic control. Diabetes Care 39:e59–e60
21. Hu X, Hsueh P-YS, Chen C-H, Diaz KM, Cheung Y-KK, Qian M (2017) A first step towards
behavioral coaching for managing stress: a case study on optimal policy estimation with
multi-stage threshold Q-learning. In: AMIA annual symposium proceedings, pp 930–939
Health Lifestyle Data-Driven Applications Using Pervasive … 125
22. Badgeley MA, Shameer K, Glicksberg BS, Tomlinson MS, Levin MA, McCormick PJ et al
(2016) EHDViz: clinical dashboard development using open-source technologies. BMJ Open
6:e010579
23. Wanderer JP, Nelson SE, Ehrenfeld JM, Monahan S, Park S (2016) Clinical data
visualization: the current state and future needs. J Med Syst 40:275
24. MIT health infoscape [Internet]. Available http://senseable.mit.edu/healthinfoscape/
25. Araujo MLD, Mejova Y, Aupetit M, Weber I (2017) Visualizing health awareness in the
Middle East. In: AAAI conference on web and social media ICWSM, p 726
26. The data visualisation catalogue [Internet] Available https://datavizcatalogue.com/index.html
27. Börner K, Maltese A, Balliet RN, Heimlich J (2016) Investigating aspects of data
visualization literacy using 20 information visualizations and 273 science museum visitors.
Inf Vis 15:198–213
28. Aupetit M, Fernandez-Luque L, Singh M, Srivastava J (2017) Visualization of wearable data
and biometrics for analysis and recommendations in childhood obesity. In: IEEE 30th
international symposium on computer-based medical systems (CBMS). https://doi.org/10.
1109/cbms.2017.120
29. Bishop CM (2016) Pattern recognition and machine learning. Springer
30. Aupetit M, Couturier P, Massotte P (2002) Gamma-observable neighbours for vector
quantization. Neural Netw 15:1017–1027
31. Lespinats S, Aupetit M, Meyer-Baese A (2015) ClassiMap: a new dimension reduction
technique for exploratory data analysis of labeled data. Int J Pattern Recognit Artif Intell
29:1551008
32. Arora T, Choudhury S, Taheri S (2015) The relationships among sleep, nutrition, and obesity.
Curr Sleep Med Rep 1:218–225
33. Kudva YC, Carter RE, Cobelli C, Basu R, Basu A (2014) Closed-loop artificial pancreas
systems: physiological input to enhance next-generation devices. Diabetes Care 37:1184–1190
34. Heintzman ND (2015) A digital ecosystem of diabetes data and technology: services, systems,
and tools enabled by wearables, sensors, and apps. J Diabetes Sci Technol 10:35–41
35. Dadlani V, Levine JA, McCrady-Spitzer SK, Dassau E, Kudva YC (2015) Physical activity
capture technology with potential for incorporation into closed-loop control for type 1
diabetes. J Diabetes Sci Technol 9:1208–1216
36. Ghafar-Zadeh E (2015) Wireless integrated biosensors for point-of-care diagnostic applica-
tions. Sensors 15:3236–3261
37. Ratjen I, Schafmayer C, di Giuseppe R, Waniek S, Plachta-Danielzik S, Koch M et al (2017)
Postdiagnostic physical activity, sleep duration, and TV watching and all-cause mortality
among long-term colorectal cancer survivors: a prospective cohort study. BMC Cancer
17:701
38. Gell NM, Grover KW, Humble M, Sexton M, Dittus K (2017) Efficacy, feasibility, and
acceptability of a novel technology-based intervention to support physical activity in cancer
survivors. Support Care Cancer 25:1291–1300
39. Gresham G, Schrack J, Gresham LM, Shinde AM, Hendifar AE, Tuli R et al (2018) Wearable
activity monitors in oncology trials: Current use of an emerging technology. Contemp Clin
Trials 64:13–21
40. Smith MT, McCrae CS, Cheung J, Martin JL, Harrod CG, Heald JL et al (2018) Use of
actigraphy for the evaluation of sleep disorders and circadian rhythm sleep-wake disorders: an
American Academy of Sleep Medicine clinical practice guideline. J Clin Sleep Med 14:1231–1237
41. Nahum-Shani I, Smith SN, Spring BJ, Collins LM, Witkiewitz K, Tewari A et al (2018)
Just-in-time adaptive interventions (JITAIS) in mobile health: key components and design
principles for ongoing health behavior support. Ann Behav Med 52:446–462
42. Weber GM, Mandl KD, Kohane IS (2014) Finding the missing link for big biomedical data.
JAMA 311:2479–2480
43. Martin Sanchez F, Sanchez FM, Gray K, Bellazzi R, Lopez-Campos G (2014) Exposome
informatics: considerations for the design of future biomedical research information systems.
J Am Med Inform Assoc 21:386–390
44. Alterovitz G, Warner J, Zhang P, Chen Y, Ullman-Cullere M, Kreda D et al (2015) SMART
on FHIR Genomics: facilitating standardized clinico-genomic apps. J Am Med Inform Assoc
22:1173–1178
45. Sáez C, Zurriaga O, Pérez-Panadés J, Melchor I, Robles M, García-Gómez JM (2016)
Applying probabilistic temporal and multisite data quality control methods to a public health
mortality registry in Spain: a systematic approach to quality control of repositories. J Am Med
Inform Assoc 23:1085–1095
46. ITU and WHO launch new initiative to leverage power of Artificial Intelligence for health. In:
International telecommunication union [Internet]. Available https://www.itu.int/en/
mediacentre/Pages/2018-pr18.aspx
47. Fernandez-Luque L, Singh M, Ofli F, Mejova YA, Weber I, Aupetit M et al (2017)
Implementing 360° quantified self for childhood obesity: feasibility study and experiences
from a weight loss camp in Qatar. BMC Med Inform Decis Mak 17:37
48. Kushniruk AW, Triola MM, Borycki EM, Stein B, Kannry JL (2005) Technology induced
error and usability: the relationship between usability problems and prescription errors when
using a handheld application. Int J Med Inform 74:519–526
49. Borycki EM, Kushniruk AW (2008) Where do technology-induced errors come from?
Towards a model for conceptualizing and diagnosing errors caused by technology. In:
Human, social, and organizational aspects of health information systems, pp 148–166
50. Chakraborty S, Tomsett R, Raghavendra R, Harborne D, Alzantot M, Cerutti F, et al (2017)
Interpretability of deep learning models: a survey of results. In: Smart world, ubiquitous
intelligence & computing, advanced & trusted computed, scalable computing & communi-
cations, cloud & big data computing, internet of people and smart city innovation
(SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). https://doi.org/10.1109/uic-atc.
2017.8397411
51. Sly L (2018) US soldiers are revealing sensitive and dangerous information by jogging. In:
The Washington post [Internet]. Available https://www.washingtonpost.com/world/the-us-
military-reviews-its-rules-as-new-details-of-us-soldiers-and-bases-emerge/2018/01/29/
6310d518-050f-11e8-aa61-f3391373867e_story.html?utm_term=.91cdbf6f3e38
52. Froomkin AM, Michael Froomkin A, Kerr IR, Pineau J (2018) When AIs outperform doctors:
the dangers of a tort-induced over-reliance on machine learning and what (not) to do about it.
SSRN Electron J. https://doi.org/10.2139/ssrn.3114347
53. Huckvale K, Prieto JT, Tilney M, Benghozi P-J, Car J (2015) Unaddressed privacy risks in
accredited health and wellness apps: a cross-sectional systematic assessment. BMC Med 13.
https://doi.org/10.1186/s12916-015-0444-y
54. Yapo A, Weiss J (2018) Ethical implications of bias in machine learning. In: Proceedings of
the 51st Hawaii international conference on system sciences. https://doi.org/10.24251/hicss.
2018.668
55. Hajian S, Bonchi F, Castillo C (2016) Algorithmic bias: from discrimination discovery to
fairness-aware data mining. In: Proceedings of the 22nd ACM SIGKDD international
conference on knowledge discovery and data mining—KDD ’16, ACM Press, New York,
USA, pp 2125–2126
56. Wilbanks JT, Topol EJ (2016) Stop the privatization of health data. Nature 535:345–348
57. Norman CD, Skinner HA (2006) eHealth literacy: essential skills for consumer health in a
networked world. J Med Internet Res 8(2):e9
58. Hu X, Hsueh P-YS, Chen C-H, Diaz KM, Parsons FE, Ensari I, Qian M, Cheung Y-KK An
interpretable health behavioral intervention policy for mobile device users. IBM J Res Dev
62(1):4:1–4:6
Big Data Challenges from an Integrative
Exposome/Expotype Perspective
Fernando Martin-Sanchez
1 Introduction
F. Martin-Sanchez
Instituto de Salud Carlos III, Madrid, Spain
e-mail: fmartin@isciii.es
equivalent to what has been done to characterize the human genome (and also the
human phenome) [6, 7]. Defining the concept of expotype, analogous to genotype
and phenotype, could represent an opportunity to make progress in the
characterization of individual human exposome data.
The use of digital health technologies, coupled with advances in the characterization
of individual exposomes and the development of participatory medicine, converges in
projects such as the US Precision Medicine Initiative (PMI), which has high potential to
support truly integrative research approaches (gene, environment, phenotype) [8].
It has been estimated that the attributable risk from the genome for chronic
disease development is only somewhere between 10 and 30%. Even in the area of
rare diseases, it has been estimated that only 80% have a genetic cause; in the rest,
infectious and environmental causes are responsible for their development.
We also know that the environment dominates over host genetics in shaping the human
gut microbiota [9] and that the local environment directly affects disease risk [10].
The necessary connection between genotype and phenotype must be carried out
in any case through the environment, since it modulates different modes of
expression of genetic information, leading to different phenotypic manifestations.
Gene-environment interaction studies are becoming very prominent, but informatics
is only starting to grasp the complexity of big data processing in those truly inte-
grative approaches where genetics, clinical and environmental data need to be
jointly processed for a better understanding of disease mechanisms. For instance,
while genomic data consist of stable linear sequences, the exposome data are
non-linear heterogeneous variables that change in time and space.
The assessment of the exposome can now take advantage of the emergence of
innovative digital technologies—including wearable devices and personal sensors,
mobile apps, global positioning systems, and geographic information systems—
which enable new and more detailed exposure measurement at the individual level.
Research on the exposome is lagging behind research efforts on the
genome and other -omics. One of the reasons for this is the fragmentation of
the landscape of disciplines interested in characterizing the exposome from
different perspectives:
– Environmental health—Exposure, toxicology (Enviroexposome or expososome)
– Health services research (Access to healthcare exposome)
– Urbanism—“Built environment” (Urban exposome)
– Occupational health (Occupational exposome)
– Epidemiology (Public Health Exposome)
– Sociology (Socioexposome)
– Nanomedicine (Nanoexposome)
– Infections (Infectoexposome)
– Medical procedures exposome
– Medications (Drugexposome)
– Psychology (Psychoexposome)
– Digital Technology (Digital component of the exposome)
Big Data Challenges from an Integrative Exposome/Expotype … 129
The exposome concept therefore tries to provide a unified vision for the pro-
cessing of exposure data that are relevant for human health. The following sections
describe eight challenges in terms of processing individual exposome (expotype)
big data and integrating them with genomic and clinical data for biomedical
research and clinical practice. These challenges are summarized in Table 1.
In recent years, important advances have been made in standardizing the repre-
sentation of genotype and phenotype data. For both domains, there already exist
terminologies and controlled vocabularies, ontologies and classification systems
that allow the exchange and integration of data for further analysis. However, in the
case of environmental factors and exposures that affect human health, we still
face a very fragmented field in which different scientific disciplines hold different
views of the exposome (toxicology, environmental science, public health,
health services research, urbanism). They use different taxonomies to catalog
environmental factors, and these taxonomies are not interconnected [17].
Although the elaboration of an individual's complete exposome is still beyond
the reach of research laboratories, because of its enormous complexity [18]
and relatively recent definition, it is now possible to carry out studies of partial
exposomes, as summarized in Fig. 1, focused, for example, on a disease [19], health
condition [20], organ [21], geographical location [22] or employment status [23].
Several efforts are in place to reconcile the different views of the exposome into
a single ontology: Exposure Ontology (ExO; https://www.ebi.ac.uk/ols/ontologies/
exo), Children’s Health Exposure Analysis Resource (CHEAR; http://purl.
bioontology.org/ontology/CHEAR). There also exist tools such as PhenX (https://
www.phenxtoolkit.org/) that can enable better data exchange and integration with
other sources of data (genomic, phenomic).
Several years ago, the author of this chapter, along with Dr. Guillermo Lopez
Campos, now at Queen's University Belfast, developed the new concept of
expotype, which has been presented at various scientific events (e.g. a keynote at
the MIE 2015 conference in Madrid). The concept of expotype/expotyping was also
explained in our article [24] published in 2016. Expotype was our suggested word
for partial views of an individual exposome. It can be defined as "a specific
set of exposome elements of an individual accumulated during a certain time/
space"; for instance, the number of steps walked by an individual during a specific
time/space window (as illustrated in Table 2). A mixture of expotypes, in combination
with an individual genotype, is responsible for a mixture of phenotypes
over time.
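This definition can be sketched as a small data structure together with a hypothetical extractor that selects the exposome elements of an individual falling within a given time/space window. All names, fields and values below are illustrative simplifications, not a published schema:

```python
from dataclasses import dataclass

@dataclass
class ExposureRecord:
    """One exposome element: what was measured, when, where, and its value."""
    element: str      # e.g. "steps", "pm2_5"
    day: int          # simplified time stamp (study day)
    location: str     # simplified space label
    value: float

def expotype(records, element, day_range, location):
    """Extract an expotype: the set of exposome elements of an individual
    accumulated during a given time/space window."""
    lo, hi = day_range
    return [r for r in records
            if r.element == element and lo <= r.day <= hi
            and r.location == location]

data = [ExposureRecord("steps", 1, "Doha", 8000.0),
        ExposureRecord("steps", 2, "Doha", 6500.0),
        ExposureRecord("steps", 9, "Madrid", 4000.0)]
window = expotype(data, "steps", (1, 7), "Doha")
assert sum(r.value for r in window) == 14500.0  # steps accumulated in the window
```

In this toy form, the step-count example from Table 2 is simply the expotype for the element "steps" over one time/space window.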
Dr. Sarigiannis [25] mentioned the term expotype in the abstract of an article
published in 2017, defining it as "the vector of exposures an individual is exposed
over time". Although this appears to be the first time the term was used in an
abstract, and therefore appears in PubMed searches, the author did not develop
the concept further.
In their article [26], Fan and collaborators mention our 2016 article. Based on our
proposed concept of expotype, they concurred that it is important to extract all the
individual exposome information available in electronic health records (a process
we christened expotyping) [24], and developed a template-driven approach to
identifying exposome concepts from the Unified Medical Language System
(UMLS). They used selected ontological relations, and the derived concepts
were evaluated in terms of literature coverage and their ability to assist in annotating
clinical text.
Finally, the paper by Rattray et al. [27] introduces the concept of “Exposotype”
with a more restricted meaning (“the metabolomic profile of an individual that
reflects an event of exposure”). From our perspective, an exposotype would be a
particular case of expotype.
Several types of data from electronic health records can be used to generate
expotypes, such as demographic data (e.g. residence, education level), health
behaviors (e.g. tobacco, alcohol, and injection drug use), medication history (type,
dose, frequency, duration), infection history (agent, duration), or medical
procedures and imaging (e.g. magnetic resonance imaging, CT scan, X-ray, …).
In November 2013, the Institute of Medicine (IOM) released the report
“Capturing Social and Behavioral Domains and Measures in Electronic Health
Records: Phase 2” [28], which recommends a “concrete approach to including
social and behavioral determinants in the clinical context to increase clinical
awareness of the patient’s state, broadly considered, and to connect clinical, public
health, and community resources for work in concert”.
Until socioeconomic information and other individual exposure factors
(expotypes) are stored properly and regularly in electronic health records,
efforts will have to be made to extract these data from current EHRs, from both
structured and unstructured (text) fields. The following articles illustrate
various approaches and perspectives already pursued in this field.
Casey et al. [29] reviewed how EHR studies have been used to evaluate
exposures to risks and resources in the physical environment (e.g. air pollution,
green space) and health outcomes (e.g. hypertension, diabetes, migraines). EHR
data sets have allowed environmental and social epidemiologists to leverage data on
patients distributed across a wide range of physical, built, and social environments.
By linking geocoded addresses to location-specific data and using geographic
information systems (GIS), it is possible to study an individual's proximity to hazards
(e.g. air pollution) related to disease.
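A minimal sketch of this GIS-style proximity analysis, assuming geocoded patient addresses and hazard sites are available as latitude/longitude pairs. The haversine formula gives the great-circle distance; the coordinates and the 2 km radius below are purely illustrative:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

def near_hazard(patient_latlon, hazard_latlon, radius_km=2.0):
    """Flag a geocoded patient address within `radius_km` of a hazard site."""
    return haversine_km(*patient_latlon, *hazard_latlon) <= radius_km

# Hypothetical coordinates: an address about 1 km from a monitored source
# is flagged; an address in another city is not.
assert near_hazard((40.4168, -3.7038), (40.4258, -3.7038))
assert not near_hazard((40.4168, -3.7038), (41.3874, 2.1686))
```

In practice such a check would run over geocoded EHR addresses joined against an environmental monitoring dataset, which is the linkage the review describes.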
Biro et al. [30] showed the utility of linking primary care electronic medical
records with census data to study the determinants of chronic disease. They used
postal codes to link patient data from EMRs with additional information on
environmental determinants of health, demonstrating an association between obesity
and area-level deprivation.
Wang et al. [31] investigated tobacco use data from structured (social history)
and unstructured (clinical notes) sources in the EHR. They implemented a natural
language processing pipeline and showed that structured fields alone may not
provide a complete view of tobacco use information.
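In the spirit of such pipelines, a toy rule-based extractor might classify smoking status from note text with a few regular expressions. This is a hypothetical sketch, not Wang et al.'s pipeline; production systems handle negation, section context and note templates far more robustly:

```python
import re

# Minimal, illustrative rule set: real pipelines are far more extensive.
NEVER = re.compile(r"\b(never smok|non-?smoker|denies (tobacco|smoking))", re.I)
FORMER = re.compile(r"\b(former smoker|quit smoking|ex-?smoker)", re.I)
CURRENT = re.compile(r"\b(current(ly)? smok|smokes|\d+ pack[- ]?years?)", re.I)

def smoking_status(note):
    """Classify a clinical note snippet as never/former/current/unknown."""
    if NEVER.search(note):
        return "never"
    if FORMER.search(note):
        return "former"
    if CURRENT.search(note):
        return "current"
    return "unknown"

assert smoking_status("Patient denies tobacco use.") == "never"
assert smoking_status("Former smoker, quit smoking in 2005.") == "former"
assert smoking_status("Currently smokes 1 ppd, 20 pack-years.") == "current"
assert smoking_status("No social history recorded.") == "unknown"
```

Even this toy version shows why free text matters: none of these statuses would be recoverable from a structured field left blank.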
In 2016, Gottlieb et al. published an article in the journal Health Affairs [15]
describing current opportunities and barriers to integrating social and clinical data.
They discussed the process of extracting data about social determinants of health
from EHRs and noted that ICD-10 provides an expanded set of codes
reflecting patient social characteristics in the form of Z-codes (e.g. Z56: Problems
related to employment and unemployment, such as Z56.0).
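As an illustration, social determinant Z-codes fall in the ICD-10 block Z55–Z65 ("Persons with potential health hazards related to socioeconomic and psychosocial circumstances"), so a simple filter can pull them out of a patient's coded entries. This is a hypothetical sketch that ignores code-system validation:

```python
def social_determinant_codes(icd10_codes):
    """Select ICD-10 codes in the Z55-Z65 block, which covers potential
    health hazards related to socioeconomic and psychosocial circumstances."""
    selected = []
    for code in icd10_codes:
        if code.upper().startswith("Z"):
            try:
                category = int(code[1:3])  # numeric category, e.g. "56" in Z56.0
            except ValueError:
                continue
            if 55 <= category <= 65:
                selected.append(code)
    return selected

# E11.9 (type 2 diabetes) is clinical; Z56.0 (unemployment) and Z59.0
# (homelessness) describe the patient's social circumstances.
codes = ["E11.9", "Z56.0", "Z59.0", "Z99.2"]
assert social_determinant_codes(codes) == ["Z56.0", "Z59.0"]
```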
Maranhao et al. [32] worked with nutrigenomic (personalized nutrition)
information in the openEHR data set. In a bibliographic review (26 articles), they
identified 117 clinical statements as well as 27 archetype-friendly concepts. The group
also modeled four new archetypes (waist-to-height ratio, genetic test results, genetic
summary, and diet plan) and created a specific nutrigenomic template for nutrition
care. The archetypes and the specific openEHR template developed in this study
give dieticians and other health professionals an important tool for their
nutrigenomic clinical practice, as well as a set of nutrigenomic data for clinical research.
Lastly, Boland et al. recently published the study "Uncovering exposures
responsible for birth season—disease effects: a global study" [33], in which the team
demonstrated, using EHR data from more than 6 clinical sites, 10 million patients, 3
countries, 2 continents, and 5 climates, that seasonality and climate play an
important role in human health and disease. Geography and climate modulate
disease risk and/or severity while also altering our exposure to diverse environmental
factors. Building on the previously published SeaWAS (Season-Wide
Association Study) method, they examined correlations between each of 12 exposures
and 133 diseases during 5 different developmental stages (i.e. 3 trimesters,
pregnancy-wide, and perinatal). For their work with EHR data they used the OHDSI
CDM at three sites and a mapping of ICD-9 to SNOMED at the other three.
factors and phenotypes, such as correlation globes [39]. INDIV 3-D is a theoretical
model that could serve to represent this complex set of multi-level health data as
well [40].
Information about individuals' exposomes represents a key aspect of future
biomedical research projects. Individuals generate data in their contacts with health
systems, which are normally stored in their electronic health records. They also
generate data themselves using new technologies and digital health services. When
they participate in authorized research projects, their biological samples are stored
in biobanks and then processed in laboratories to obtain their molecular data
(genome, proteome, …). All this information must be processed to feed the data
needed in biomedical research [5, 42]. The adequate extraction of the data
provided by participants, clinical systems and laboratory systems should lead to the
generation of genotypes, expotypes and phenotypes annotated with standards that
allow their integration and joint analysis, as described in Fig. 2.
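The integration step described above can be sketched as a join on participant ID across the three annotated data types. This is a simplified illustration with made-up identifiers and values; real pipelines rely on standardized formats and handle missing data explicitly:

```python
def integrate(genotypes, expotypes, phenotypes):
    """Join genotype, expotype and phenotype records on participant ID into
    one annotated record per participant, keeping only complete cases."""
    merged = {}
    for pid in genotypes.keys() & expotypes.keys() & phenotypes.keys():
        merged[pid] = {"genotype": genotypes[pid],
                       "expotype": expotypes[pid],
                       "phenotype": phenotypes[pid]}
    return merged

g = {"p1": {"APOE": "e3/e4"}, "p2": {"APOE": "e3/e3"}}
e = {"p1": {"pm2_5_annual": 14.2}}          # p2 lacks exposure data
p = {"p1": {"asthma": True}, "p2": {"asthma": False}}
joint = integrate(g, e, p)
assert list(joint) == ["p1"]                 # only p1 has all three data types
assert joint["p1"]["expotype"]["pm2_5_annual"] == 14.2
```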
Fig. 3 New sources of individual exposome data complementing existing phenome and genome
data
Some of the most important health institutions, including the US NIH (through its
NIEHS and NIOSH institutes), the CDC, and the US EPA, already have programs in
place around the exposome.
The main research funding agencies at the international level have supported the
creation of consortia and networks in this space. For example, the European
Commission financed the HELIX, EXPOSOMICS and HEALS projects in the
previous R&D Framework Program, which dealt with specific aspects of the exposome.
The NIH has funded research centers such as Hercules (https://emoryhercules.com/) and
the Children's Health Exposure Analysis Resource (CHEAR;
https://www.niehs.nih.gov/research/supported/exposure/chear/).
Japan has supported the JECS program (www.env.go.jp/chemi/ceh/en/) to study the
effects of the environment on children. We are also witnessing the creation of
monographic research centers on the exposome, such as the TNO—Utrecht
Exposome Hub (https://www.uu.nl/en/research/life-sciences/research/hubs/utrecht-exposome-hub),
the Institute for Exposomic Research (http://icahn.mssm.edu/research/exposomic)
at the Icahn School of Medicine in New York, and the I3CARE
International Exposome Center (http://exposome.iras.uu.nl/), a global collaboration
between the University of Utrecht, the University of Toronto, and the Chinese
University of Hong Kong.
The International Medical Informatics Association (IMIA) has recently created a
working group on informatics aspects of the exposome to help
researchers, clinicians and consumers navigate the entire "data to
knowledge" life cycle: data collection, knowledge representation, annotation,
integration with genomic and phenomic data, analytics, and visualization
(https://exposomeinformatics.wordpress.com).
138 F. Martin-Sanchez
10 Conclusion
The objective of this chapter is to raise awareness among readers of the
importance of advancing those aspects related to the processing of exposome big
data. Although this is a relatively recent and rapidly progressing area, it is beyond the
scope of this contribution to offer an exhaustive catalog of all the resources,
methods and experiences that have already been reported in the literature. Instead,
based on our own experience and a literature review, we have chosen to identify
eight challenges that can introduce the reader to this field and motivate them to search
for more information. It is our desire that the biomedical informatics and data
science community recognize exposome informatics as a new area of activity, key
to precision medicine and biomedical research, with clear potential to be
useful in clinical practice in the coming years.
References
1. Martin-Sanchez F, Verspoor K (2014) Big data in medicine is driving big changes. Yearb
Med Inform 15(9):14–20
2. Wild CP (2005) Complementing the genome with an “exposome”: the outstanding challenge
of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol
Biomarkers 14(8):1847–1850
3. Patel CJ, Ioannidis JP (2014) Studying the elusive environment in large scale. JAMA 311(21):
2173–2174
4. Wild CP (2012) The exposome: from concept to utility. Int J Epidemiol 41(1):24–32
5. Martin Sanchez F, Gray K, Bellazzi R, Lopez-Campos G (2014) Exposome informatics:
considerations for the design of future biomedical research information systems. J Am Med
Inform Assoc 21(3):386–390
6. Thomas DC, Lewinger JP, Murcray CE, et al (2012) Invited commentary: GE-Whiz!
Ratcheting gene-environment studies up to the whole genome and the whole exposome. Am J
Epidemiol 175:203–207; discussion 208–209
7. Manrai AK, Cui Y, Bushel PR, Hall M, Karakitsios S, Mattingly CJ, Ritchie M, Schmitt C,
Sarigiannis DA, Thomas DC, Wishart D, Balshaw DM, Patel CJ (2017) Informatics and data
analytics to support exposome-based discovery for public health. Annu Rev Public Health
38:279–294. https://doi.org/10.1146/annurev-publhealth-082516-012737
8. Collins FS, Varmus H (2015) A new initiative on precision medicine. NEJM 372(9):793–795
9. Rothschild D, Weissbrod O, Barkan E, Kurilshikov A, Korem T, Zeevi D, Costea PI,
Godneva A, Kalka IN, Bar N, Shilo S, Lador D, Vila AV, Zmora N, Pevsner-Fischer M,
Israeli D, Kosower N, Malka G, Wolf BC, Avnit-Sagi T, Lotan-Pompan M, Weinberger A,
Halpern Z, Carmi S, Fu J, Wijmenga C, Zhernakova A, Elinav E, Segal E (2018)
Environment dominates over host genetics in shaping human gut microbiota. Nature
555(7695):210–215
10. Favé MJ, Lamaze FC, Soave D, Hodgkinson A, Gauvin H, Bruat V, Grenier JC, Gbeha E,
Skead K, Smargiassi A, Johnson M, Idaghdour Y, Awadalla P (2018) Gene-by-environment
interactions in urban populations modulate risk phenotypes. Nat Commun 9(1):827
11. Dennis KK, Marder E, Balshaw DM, Cui Y, Lynes MA, Patti GJ, Rappaport SM,
Shaughnessy DT, Vrijheid M, Barr DB (2017) Biomonitoring in the era of the exposome.
Environ Health Perspect 125(4):502–510
12. Ding YP, Ladeiro Y, Morilla I, Bouhnik Y, Marah A, Zaag H, Cazals-Hatem D, Seksik P,
Daniel F, Hugot JP, Wainrib G, Tréton X, Ogier-Denis E (2017) Integrative network-based
analysis of colonic detoxification gene expression in ulcerative colitis according to smoking
status. J Crohns Colitis 11(4):474–484
13. Jacquez GM, Sabel CE, Shi C (2015) Genetic GIScience: toward a place-based synthesis of
the genome, exposome, and behavome. Ann Assoc Am Geogr 105(3):454–472
14. Centers for Disease Control and Prevention (CDC), National Center for Health Statistics
(NCHS) (2016) National health and nutrition examination survey data. U.S. Department of
Health and Human Services, Centers for Disease Control and Prevention, Hyattsville, MD
[last visited 2017-04-03]. Available from https://www.cdc.gov/nchs/nhanes/
15. Gottlieb L, Tobey R, Cantor J, Hessler D, Adler NE (2016) Integrating social and medical
data to improve population health: opportunities and barriers. Health Aff (Millwood) 35(11):
2116–2123
16. Swan M (2012) Health 2050: the realization of personalized medicine through crowdsourcing,
the quantified self, and the participatory biocitizen. J Pers Med 2(3):93–118
17. Kiossoglou P, Borda A, Gray K, Martin-Sanchez F, Verspoor K, Lopez-Campos G (2017)
Characterising the scope of exposome research: a generalisable approach. Stud Health
Technol Inform 245:457–461
18. Cui Y, Balshaw DM, Kwok RK, Thompson CL, Collman GW, Birnbaum LS (2016) The
exposome: embracing the complexity for discovery in environmental health. Environ Health
Perspect 124(8):A137–A140
19. Smith MT, Zhang L, McHale CM, Skibola CF, Rappaport SM (2011) Benzene: the exposome
and future investigations of leukemia etiology. Chem Biol Interact 192(1–2):155–159
20. Goldfarb DS (2016) The exposome for kidney stones. Urolithiasis 44(1):3–7
21. Rappaport SM, Barupal DK, Wishart D, Vineis P, Scalbert A (2014) The blood exposome and
its role in discovering causes of disease. Environ Health Perspect 122(8):769–774
22. Donald CE, Scott RP, Blaustein KL, Halbleib ML, Sarr M, Jepson PC et al (2016) Silicone
wristbands detect individuals’ pesticide exposures in West Africa. R Soc Open Sci 3(8):
160433
23. Faisandier L, Bonneterre V, De Gaudemaris R, Bicout DJ (2011) Occupational exposome: a
network-based approach for characterizing occupational health problems. J Biomed Inform
44(4):545–552
24. Martin-Sanchez FJ, Lopez-Campos GH (2016) The new role of biomedical informatics in the
age of digital medicine. Methods Inf Med 55(5):392–402
25. Sarigiannis DA (2017) Assessing the impact of hazardous waste on children’s health: the
exposome paradigm. Environ Res 158:531–541
26. Fan JW, Li J, Lussier YA (2017) Semantic modeling for exposomics with exploratory
evaluation in clinical context. J Healthc Eng 2017:3818302
27. Rattray NJW, Deziel NC, Wallach JD, Khan SA, Vasiliou V, Ioannidis JPA, Johnson CH
(2018) Beyond genomics: understanding exposotypes through metabolomics. Hum Genomics
12(1):4
28. Institute of Medicine (2014) Capturing social and behavioral domains and measures in
electronic health records: phase 2. The National Academies Press, Washington, DC. https://
doi.org/10.17226/18951
29. Casey JA, Schwartz BS, Stewart WF, Adler NE (2016) Using electronic health records for
population health research: a review of methods and applications. Annu Rev Public Health
37:61–81
30. Biro S, Williamson T, Leggett JA, Barber D, Morkem R, Moore K, Belanger P, Mosley B,
Janssen I (2016) Utility of linking primary care electronic medical records with Canadian
census data to study the determinants of chronic disease: an example based on socioeconomic
status and obesity. BMC Med Inform Decis Mak 16:32
31. Wang Y, Chen ES, Pakhomov S, Lindemann E, Melton GB (2016) Investigating longitudinal
tobacco use information from social history and clinical notes in the electronic health record.
In: AMIA annual symposium proceedings, pp 1209–1218
32. Maranhão PA, Bacelar-Silva GM, Ferreira DNG, Calhau C, Vieira-Marques P, Cruz-Correia
RJ (2018) Nutrigenomic information in the openEHR data set. Appl Clin Inform 9(1):
221–231
33. Boland MR, Parhi P, Li L, Miotto R, Carroll R, Iqbal U, Nguyen PA, Schuemie M, You SC,
Smith D, Mooney S, Ryan P, Li YJ, Park RW, Denny J, Dudley JT, Hripcsak G, Gentine P,
Tatonetti NP (2017) Uncovering exposures responsible for birth season—disease effects: a
global study. J Am Med Inform Assoc. https://doi.org/10.1093/jamia/ocx105. [Epub ahead of
print]
34. Agier L, Portengen L, Chadeau-Hyam M, Basagaña X, Giorgis-Allemand L, Siroux V,
Robinson O, Vlaanderen J, González JR, Nieuwenhuijsen MJ, Vineis P, Vrijheid M, Slama R,
Vermeulen R (2016) A systematic comparison of linear regression-based statistical methods
to assess exposome-health associations. Environ Health Perspect 124(12):1848–1856
35. Barrera-Gómez J, Agier L, Portengen L, Chadeau-Hyam M, Giorgis-Allemand L, Siroux V,
Robinson O, Vlaanderen J, González JR, Nieuwenhuijsen M, Vineis P, Vrijheid M,
Vermeulen R, Slama R, Basagaña X (2017) A systematic comparison of statistical methods to
detect interactions in exposome-health associations. Environ Health 16(1):74
36. Patel CJ, Chen R, Kodama K et al (2013) Systematic identification of interaction effects
between genome- and environment-wide associations in type 2 diabetes mellitus. Hum Genet
132:495–508
37. McGinnis DP, Brownstein JS, Patel CJ (2016) Environment-wide association study of blood
pressure in the national health and nutrition examination survey (1999–2012). Sci Rep
6:30373
38. Patel CJ (2017) Analytic complexity and challenges in identifying mixtures of exposures
associated with phenotypes in the exposome era. Curr Epidemiol Rep 4(1):22–30
39. Patel CJ, Manrai AK (2015) Development of exposome correlation globes to map out
environment-wide associations. Pac Symp Biocomput 231–242
40. Lopez-Campos G, Bellazzi R, Martin-Sanchez F (2013) INDIV-3D. A new model for
individual data integration and visualisation using spatial coordinates. Stud Health Technol
Inform 190:172–174
41. National Academies of Sciences, Engineering, and Medicine (2017) Measuring personal
environmental exposures. In: Proceedings of a workshop—in brief. The National Academies
Press, Washington, DC. https://doi.org/10.17226/24711
42. Dagliati A, Marinoni A, Cerra C, Decata P, Chiovato L, Gamba P, Bellazzi R (2015)
Integration of administrative, clinical, and environmental data to support the management of
type 2 diabetes mellitus: from satellites to clinical care. J Diabetes Sci Technol 10(1):19–26
43. Antman EM, Loscalzo J (2016) Precision medicine in cardiology. Nat Rev Cardiol 13(10):
591–602
44. Rappaport SM (2016) Genetic factors are not the major causes of chronic diseases. PLoS One
11(4):e0154387
45. Galli SJ (2016) Toward precision medicine and health: opportunities and challenges in
allergic diseases. J Allergy Clin Immunol 137(5):1289–1300
46. Agustí A, Bafadhel M, Beasley R, Bel EH, Faner R, Gibson PG, Louis R, McDonald VM,
Sterk PJ, Thomas M, Vogelmeier C, Pavord ID (2017) On behalf of all participants in the
seminar. Precision medicine in airway diseases: moving to clinical practice. Eur Respir J 50(4)
47. Lopez-Campos G, Merolli M, Martin-Sanchez F (2017) Biomedical informatics and the
digital component of the exposome. Stud Health Technol Inform 245:496–500
48. Office for National Statistics (2015) Measuring national well-being: insights into children’s
mental health and well-being. Accessed 23 Mar 2018. https://www.ons.gov.uk/peoplepopulationandcommunity/
wellbeing/articles/measuringnationalwellbeing/2015-10-20
49. Cantor MN, Thorpe L (2018) Integrating data on social determinants of health into electronic
health records. Health Aff (Millwood) 37(4):585–590
50. Dennis KK, Jones DP (2016) The exposome: a new frontier for education. Am Biol Teach
78(7):542–548
51. Niedzwiecki MM, Miller GW (2017) The exposome paradigm in human health: lessons from
the Emory Exposome Summer Course. Environ Health Perspect 125(6):064502
52. Johnson CH, Athersuch TJ, Collman GW, Dhungana S, Grant DF, Jones DP, Patel CJ,
Vasiliou V (2017) Yale school of public health symposium on lifetime exposures and human
health: the exposome; summary and future reflections. Hum Genomics 11(1):32
Glossary