Kondylakis 2018

2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI)
4-7 March 2018

Las Vegas, Nevada, USA

Haridimos Kondylakis, Lefteris Koumakis, Manolis Tsiknakis, Kostas Marias
*Research supported by the iManageCancer EU project under grant agreement No 643529.
Haridimos Kondylakis, Lefteris Koumakis, Kostas Marias and Manolis Tsiknakis are with the Computational BioMedicine Laboratory, FORTH-ICS, N. Plastira 100, Greece, GR 70013, phone: 0030-2810-391449, e-mail: {kondylak, koumaki, kmarias, tsiknaki}@ics.forth.gr.
Kostas Marias and Manolis Tsiknakis are also with the Department of Informatics Engineering of the Technological Educational Institute of Crete.
978-1-5386-2405-0/18/$31.00 ©2018 IEEE

Abstract: The advancements in healthcare have brought to the fore the need for flexible access to health-related information and created an ever-growing demand for efficient data management infrastructures. To this end, in this paper we present an effective and efficient data management infrastructure implemented for the iManageCancer EU project. The architecture focuses on enabling data access to multiple, heterogeneous and diverse data sources that are initially available in a data lake. Parts of these data are integrated and semantically uplifted using a modular ontology. This integration can be performed either at run-time or through an ETL process, ensuring efficient access to the integrated information. A unique feature of our platform is that it allows the uninterrupted, continuous evolution of ontologies/terminologies. Finally, summarization tools enable the quick understanding of the available information, whereas APIs and anonymization services ensure secure access to the requested information.

I. INTRODUCTION

Recent reports by the eHealth Task Force [1] and by the European Alliance for Personalized Medicine [2] focus on redesigning health in Europe by achieving a vision of affordable, more personalized and less intrusive care, ultimately increasing the quality of life as well as lowering mortality. Such a vision depends on the application of information technology and the effective use of data [3], and requires a radical redesign of e-health to meet these challenges. Among others, the reports identify two important levers for change. Fully exposing, integrating, linking and exploring health data will have a tremendous impact on improving the integrated diagnosis, treatment and prevention of disease in individuals, enabling the secondary use of healthcare data for research, eventually transforming the way in which care is provided.

To this end, this paper describes the approach adopted for managing large and heterogeneous health datasets within the iManageCancer EU project [4]. The project has the objective to provide a cancer-specific self-management platform designed according to the needs of patient groups, while in parallel focusing on the wellbeing of the cancer patient, with special emphasis on avoiding, early detecting and managing adverse events of cancer therapy but also, importantly, on the psycho-emotional evaluation and self-motivated goals.

To achieve the aforementioned goals a unique data management architecture has been designed and implemented. It focuses on managing diverse types of big data, embracing the continuous evolution of related standards and terminologies that are employed for semantically uplifting and integrating data. In addition, it provides tools for effective and efficient data access and exploration. More specifically, our contributions are the following.

We present a unique data management architecture that does not limit diversity but embraces it.

To store all internal and external data we employ the notion of a data lake, allowing various types of databases to co-exist, each one of them employed for storing specific types of data according to the individual needs of the corresponding applications.

To enable a common model over all these data, we create the iMC Semantic Core ontology, a modular ontology embedding all relevant biomedical state-of-the-art ontologies.

To integrate all available information, we use an advanced data integration tool, exelixis, able to integrate in real time selected information out of the available sources and to publish offline a semantic repository.

A unique feature of this tool is that it enables the uninterrupted evolution of the employed ontologies/terminologies. We argue that evolution should not be treated as a side-effect but as a "first-class citizen" in a modern data management infrastructure.

In addition, we exploit a novel graphical summarization tool to enable quick understanding and exploration of the available information, either integrated or proprietary.

Finally, to enable access to both the integrated and the proprietary information, various data access services are implemented and used, able to properly anonymize data according to the policy/security constraints adopted each time.

To the best of our knowledge, no other data management implementation possesses all those features: accepting a variety of data sources, enabling access to both integrated and proprietary data, allowing the evolution of the ontologies/terminologies and the exploration through summaries of the available data.

The rest of this paper is structured as follows: In Section 2 we present the high-level data management architecture. Then we analyze in detail each one of the individual components.

Then in Section 3 we compare our approach with approaches from relevant research projects and we conclude this paper.

II. THE DATA MANAGEMENT LAYER

An overview of the data management architecture is shown in Figure 1. Existing ICT systems and tools store and push data to the iManageCancer data lake. Selected subsets out of the data lake are semantically uplifted, integrated and explored through novel data integration tools. Both the proprietary and the integrated information are then served through various data access APIs. The appropriate anonymization of the data is enforced whenever needed, and appropriate security mechanisms guarantee the regulated access to the information. Below we describe in more detail each one of the aforementioned modules of the platform.

Figure 1. The data management architecture

A. The Data Lake

The bottom layer of the data management architecture is an instantiation of the data lake concept. There are multiple, heterogeneous databases using different languages for retrieving data (e.g., SQL, CQL, APIs), different technologies for storing and serving those data, and different security requirements. We argue that this heterogeneity is inherent in the medical domain and that data management architectures should not limit but embrace diversity.

Internal, live databases: The iManageCancer platform is centered around a Personal Health Record (PHR) which regularly monitors the psycho-emotional status of the patient and periodically records the everyday life experiences of the cancer patient in terms of side-effects of the therapy. At the same time, several apps monitor medications, pain, side-effects of cancer treatment, lifestyle and diet choices, whereas serious games try to educate and encourage patients. All those different applications have their own databases, which are mostly relational databases.

Ingested databases: On the other hand, external data from hospitals are ingested, staged and further cleaned in order to train decision support tools. Furthermore, streaming data from wearables and sensors continuously feed the data lake. For storing those types of data, high-throughput NoSQL Cassandra databases are used. Cassandra has built-in support for Map-Reduce based data analysis with advanced replication functions, providing input to applications that require big-data real-time analysis.

External live databases: Besides data that are extracted, transformed and loaded (ETL) to our internal databases, a variety of external databases are also used to enable access to drug interactions, health-related educational information and dietary information (e.g., calories).

B. The iMC Semantic Core Ontology

For enabling a common representation of knowledge across the continuum of care and across the different information sources, we developed the iMC Semantic Core ontology. It is used as the virtual schema of all data stored within the platform, and is able to semantically describe the different types of data required and processed by the platform.

Figure 2. The modules of MHA Semantic Core Ontology¹

¹ ACGT: ACGT Master Ontology, BFO: Basic Formal Ontology, CHEBI: Chemical Entities of Biological Interest, CIDOC-CRM: CIDOC Conceptual Reference Model, CTO: Clinical Trial Ontology, DO: Human Disease Ontology, DTO: Disease Treatment Ontology, FHHO: Family Health History Ontology, FMA: Foundational Model of Anatomy, FOAF: Friend of a Friend Ontology, GALEN: Galen Ontology, GO: Gene Ontology, GRO: Gene Regulation Ontology, HDOT: Health Data Ontology Trunk, IAO: Information Artifact Ontology, ICD: International Classification of Diseases,

The development of the iMC Semantic Core Ontology was based on the following principles: a) Reuse: exploit already established high-quality ontologies; b) Granularity: a single ontological resource is not adequate to model the multi-faceted ecosystem of eHealth, so multiple ontologies should be used; c) Modularity: create a framework where different ontologies are able to integrate many modules through mappings and equivalences between ontology terms; d) Multilinguality: since the data management layer is going to be used to store medical data in three European countries (United Kingdom, Italy and Germany), we would like to be able to identify the concepts used in those countries as well.
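The modularity and multilinguality principles can be illustrated with a small sketch. It is not the project's implementation: it assumes a toy in-memory model in which each ontology module keeps its own terms and language-tagged labels, and modules are linked only through symmetric equivalence relations. The names `Term` and `SemanticCore`, the module names prefixed with `EX-`, the term identifiers and the labels are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A term inside one ontology module (identifiers here are made up)."""
    module: str
    local_id: str


class SemanticCore:
    """Toy modular ontology: modules stay separate, linked by equivalences."""

    def __init__(self):
        self.labels = defaultdict(dict)     # Term -> {language: label}
        self.equivalent = defaultdict(set)  # Term -> directly equivalent Terms

    def add_label(self, term, lang, label):
        self.labels[term][lang] = label

    def add_equivalence(self, a, b):
        # Equivalence is symmetric, so the link is stored in both directions.
        self.equivalent[a].add(b)
        self.equivalent[b].add(a)

    def equivalence_class(self, term):
        """All terms reachable from `term` by following equivalence links."""
        seen, stack = {term}, [term]
        while stack:
            for other in self.equivalent[stack.pop()]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        return seen

    def labels_for(self, term):
        """Collect one label per language across all equivalent terms."""
        merged = {}
        ordered = sorted(self.equivalence_class(term),
                         key=lambda t: (t.module, t.local_id))
        for t in ordered:
            for lang, label in self.labels[t].items():
                merged.setdefault(lang, label)
        return merged


# Hypothetical example: one upper-level concept linked to three
# language-specific modules (English, Italian, German).
core = SemanticCore()
upper = Term("EX-CORE", "BreastCancer")
en = Term("EX-ICD10-EN", "0001")
it = Term("EX-ICD10-IT", "0001")
de = Term("EX-ICD10-DE", "0001")
core.add_label(en, "en", "breast cancer")
core.add_label(it, "it", "carcinoma mammario")
core.add_label(de, "de", "Brustkrebs")
for module_term in (en, it, de):
    core.add_equivalence(upper, module_term)

# Starting from the Italian term, labels in all three languages are found.
labels = core.labels_for(it)
print(labels)
```

In this sketch each module can be replaced or extended without touching the others, since only the equivalence links connect them. Subsumption relations, which a real ontology framework would also need, are deliberately left out to keep the example short.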
The starting point for the development of the iMC Semantic Core Ontology was the MHA Semantic Core Ontology [5]. As such, it reuses 34 sub-ontologies from the MHA Semantic Core ontology, extending it with 14 additional ones. All those sub-ontologies are integrated using an extension of the Translational Medicine Ontology (TMO) [6], referred to as eTMO, used as an upper layer ontology. Among the added ontologies are HDOT, a cancer-specific ontology developed within the p-Medicine project, and multilingual versions of the ICD-10, LOINC, MeSH and MedDRA ontologies. An overview of the different modules of the iMC Semantic Core Ontology is shown in Figure 2. The integration is achieved by introducing terms from these sub-ontologies to the TMO ontology and via relations of equivalence and subsumption from eTMO to the various ontology modules. These relations (~400) were manually identified and verified using the NCBO BioPortal².

¹ (continued) ICO: Informed Consent Ontology, LOINC: Logical Observation Identifiers Names and Codes, MESH: Medical Subject Headings, NCI-T: NCI Thesaurus, NIFSTD: Neuroscience Information Framework Standardized ontology, NNEW: New Weather Ontology, OBI: Ontology for Biomedical Investigation, OCRE: Ontology for Clinical Research, OMRSE: Ontology of Medically Related Social Entities, PATO: Phenotypic Quality Ontology, PLACE: Place Ontology, PRO: Protein Ontology, RO: Relation Ontology, SBO: Systems Biology Ontology, SNOMED-CT: SNOMED clinical terms, SO: Sequence Ontology, SYMP: Symptom Ontology, TIME: Time Ontology, UMLS: Unified Medical Language System.
² http://bioportal.bioontology.org/

C. The exelixis Data Integration Engine

Mappings. Having a way to model all available information, the next task in order to provide unified data access is to map the available sources to this global model. The mapping language adopted permits mapping specific ontology subgraphs to source queries. Those queries can be SQL, CQL, SPARQL queries or API calls. Defining such mappings is error-prone and time-consuming work. As such, we only define mappings for the selected information that needs to be integrated and semantically uplifted.

Real-time & Offline Integration. exelixis [7], [8], [9], [10] allows both real-time access over the integrated information and the offline ETL of the semantically transformed and integrated information to a triple store. The benefit of the first approach is that the latest information is always accessed, however at a cost in execution time. On the other hand, accessing the already transformed information is faster, but the information accessed may be outdated. Another benefit of the second approach is that we can recreate the resulting triples from scratch at any time. For reasons of efficiency, exelixis periodically transforms only the newly inserted information by checking the timestamps of the data. For the iMC project, we employ both approaches: the first for apps that require access to the latest information of the patient, and the second for the analytics over patient cohorts. In this way, we benefit from the advantages of both solutions.

Uninterrupted evolution of ontologies/terminologies. Besides enabling interoperability and integrating selected available data, a key aspect usually neglected by data management architectures is the continuous evolution of the ontologies/terminologies used. We argue that evolution should not be treated as a side-effect but as a "first-class citizen" in a modern data management infrastructure, and this requires the redesign and the restructure of the available solutions and frameworks to reflect this requirement.

As such, a unique feature of the exelixis data integration engine is that it enables the uninterrupted evolution of the modules of the iMC Semantic Core Ontology. For example, consider that the term SNOMEDCD/50834005 is replaced by the term SNOMEDCD/50834XXX. Although change is common in biomedical ontologies, in all data integration approaches so far the mappings are either recreated from scratch or adapted/changed after each change. However, creating new mappings from scratch or adapting existing ones is a really time-consuming and error-prone process. On the other hand, many biomedical ontologies change too often: for example, the Gene Ontology releases a new version every single day and NCI-T releases a new version every month [11]. As such, adapting or re-creating mappings is not a viable solution. exelixis, given multiple ontology versions, automatically identifies the changes, reuses past mappings and automatically rewrites input queries among ontology versions. Using this mechanism, mappings to a previous ontology version can co-exist with mappings to a recent ontology version and work uninterrupted. A screenshot of the exelixis platform is shown in Figure 3.

Figure 3. A screenshot of the exelixis system

D. The RDF Digest

Having a vast amount of information available, effective and efficient methods are required for the quick understanding
of the data sources. To this end, within the iManageCancer project a novel ontology summarization tool has been developed. Ontology summarization aspires to produce an abridged version of the original ontology that highlights its most representative concepts. Central questions here are how to identify the most important nodes and then how to link them in order to produce a valid sub-schema graph. RDFDigest [12], [13], [14] tries to answer the first question by adapting eight centrality measures from graph theory, and then, in order to link those nodes, exploits approximations of the Graph-Steiner tree algorithm. The result is a graph including the most important nodes of the schema, whereas the users can explore the provided summaries, retrieving statistical information for the data as well. A screenshot of the tool is shown in Figure 4, whereas the system is also available online³.

Figure 4. A screenshot of the RDFDigest system

E. Data Anonymization & Data Access APIs

All data available through the aforementioned data management architecture can be accessed using data access APIs. Using those APIs, application requests are transformed to queries that target either individual data sources from the data lake or the integrated information, which is available using SPARQL queries. Specific attention is given to the security mechanisms implemented on top of the data management layer. One of the key aspects in accessing individual health data is the capability to provide anonymized data according to the ethical/security requirements of the corresponding data usage scenario. To this end, the data management architecture incorporates a data anonymization service provided by the ARX data anonymization tool [15].

III. RELATED WORK & CONCLUSIONS

Projects with similar goals for collecting, storing and accessing multiple, heterogeneous health data were eHealthMonitor⁴, INTEGRATE⁵, p-Medicine⁶ and EURECA⁷. All projects identified the need for a common model of representation in terms of ontological resources and tried to either perform a real-time integration of the underlying sources (the first two) or ETL to a central repository (the last two). However, in our approach we allow access both to the proprietary data through the data lake and to selected integrated information. In addition, our approach allows both the real-time integration and the offline ETL, offering the advantages of both solutions. For modelling available data, the INTEGRATE and the EURECA projects relied on SNOMED-CT, LOINC and MedDRA, whereas in p-Medicine a cancer ontology was generated. In our approach, we reuse all those ontologies, adding more than 40 other sub-ontologies, providing as such an almost complete representation of the eHealth domain. Furthermore, a unique selling point of our solution is that it enables the uninterrupted evolution of the ontological submodules, combining them with novel data exploration tools.

To the best of our knowledge there is no other data management architecture embracing such a diversity, offering the benefits of individual, state-of-the-art data management systems for persistence in a data lake and the semantic integration of selected data in the upper layer, incorporating the "evolution-by-design" principle and offering multiple data exploration possibilities.

REFERENCES
[1] eHealth Task Force, "Redesigning Health in Europe for 2020". Available online: http://ec.europa.eu/digital-agenda/en/news/eu-task-force-ehealth-redesigning-health-europe-2020.
[2] D. Horgan, M. Jansen, L. Leyens, et al., "An index of barriers for the implementation of personalised medicine and pharmacogenomics in Europe", Public Health Genomics 2014;17(5-6):287-98.
[3] S. Kartakis, V. Sakkalis, P. Tourlakis, et al., "Enhancing health care delivery through ambient intelligence applications", Sensors 12 (9), 11435-11450.
[4] H. Kondylakis, A. Bucur, F. Dong, et al., "iManageCancer: Developing a platform for Empowering patients and strengthening self-management in cancer diseases".
[5] H. Kondylakis, E. Spanakis, S. Sfakianakis, et al., "Digital Patient: Personalized and Translational Data Management through the MyHealthAvatar EU Project".
[6] J. Luciano, et al., "The Translational Medicine Ontology and Knowledge Base: Driving Personalized Medicine by Bridging the Gap between Bench and Bedside", Journal of Biomedical Semantics 2.Suppl 2 (2011): S1.
[7] H. Kondylakis, D. Plexousakis, "Ontology evolution without tears", Journal of Web Semantics 2013;19:42-58.
[8] H. Kondylakis, D. Plexousakis, "Exelixis: Evolving Ontology-Based Data Integration System", -1286.
[9] H. Kondylakis, D. Plexousakis, "Ontology evolution in data integration: query rewriting to the rescue", ER 2011.
[10] H. Kondylakis, D. Plexousakis, "Ontology Evolution: Assisting Query Migration", ER 2012.
[11] A. Groß, C. Pruski, E. Rahm, "Evolution of biomedical ontologies and mappings: Overview of recent approaches", Computational and Structural Biotechnology Journal, 2016;14:333-340.
[12] A. Pappas, G. Troullinou, G. Roussakis, H. Kondylakis, D. Plexousakis, "Exploring Importance Measures for Summarizing RDF/S KBs", ESWC, 2017:387-403.
[13] G. Troullinou, H. Kondylakis, E. Daskalaki, et al., "Ontology Understanding Without Tears: The Summarization Approach", Semantic Web 2017;8(6):797-815.
[14] G. Troullinou, H. Kondylakis, E. Daskalaki, et al., "RDF Digest: Ontology Exploration Using Summaries", ISWC.
[15] F. Prasser and F. Kohlmayer, "Putting statistical disclosure control into practice: The ARX data anonymization tool", in Medical Data Privacy Handbook, 2015:111-148.

³ http://www.ics.forth.gr/isl/rdf-digest/
⁴ http://ehealthmonitor.eu/
⁵ http://www.fp7-integrate.eu/
⁶ http://www.p-medicine.eu/
⁷ http://eurecaproject.eu/
